# Machine Learning
## Programming Assignment 6: Naive Bayes

Instructions:
The aim of this assignment is to give you hands-on experience with a real-life machine learning application.
You will be analyzing the sentiment of reviews using Naive Bayes classification.
You can only use the Python programming language and Jupyter Notebooks.
Please use procedural programming style and comment your code thoroughly.
This assignment has two parts. In part 1, you can use NumPy, Pandas, Matplotlib, and any other standard Python libraries. You are not allowed to use NLTK, scikit-learn, or any other machine learning toolkit. You can only use scikit-learn in part 2.

### Part 1: Implementing Naive Bayes classifier from scratch (60 points)

You are not allowed to use scikit-learn or any other machine learning toolkit for this part. You have to implement your own Naive Bayes classifier from scratch. You may use Pandas, NumPy, Matplotlib, and other standard Python libraries.

#### Problem:
The purpose of this assignment is to familiarize you with Naive Bayes classification. The core dataset contains 50,000 IMDb reviews split evenly into a 25k training set and a 25k test set. The overall distribution of labels is balanced (25k pos and 25k neg). There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt], where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] contains the text of a positive-labeled test-set example with unique id 200 and an IMDb star rating of 8/10.
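
The file naming convention described above can be decoded with a short regular expression. The following is a minimal sketch, assuming file names always match `[id]_[rating].txt`; the helper name `parse_review_filename` is hypothetical and not part of the assignment template.

```python
import os
import re

# Hypothetical helper (an assumption, not part of the assignment template):
# splits a file name such as "200_8.txt" into its id and star rating.
FILENAME_PATTERN = re.compile(r'^(\d+)_(\d+)\.txt$')

def parse_review_filename(path):
    """Return (review_id, rating) parsed from a path like 'test/pos/200_8.txt'."""
    name = os.path.basename(path)
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected file name: {name}")
    return int(match.group(1)), int(match.group(2))

review_id, rating = parse_review_filename('test/pos/200_8.txt')
print(review_id, rating)  # 200 8
```

The rating is not needed for the binary classifier, since the label comes from the pos/neg directory, but it can be useful for inspecting borderline reviews.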


In [None]:
## Here are the libraries you will need for this part
import pandas as pd
import numpy as np
import scipy.spatial as sc
import matplotlib.pyplot as plt
import re
import random
%matplotlib inline
## Here we have added the standard libraries
import zipfile
import os

#### Task 1.1: Dataset (5 points)
Your task is to read the dataset and stopwords file into a useful data structure. Print out a few reviews and a few items from the stop word list. Successfully doing this will earn you 5 points.

In [None]:
zip_path = 'Naive Bayes Data.zip'
extract_to = './naive_bayes_data'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)
stopwords_path = os.path.join(extract_to, 'stop_words.txt') ## to load stop words from the stop_words.txt file
with open(stopwords_path, 'r') as f:
    stop_words = f.read().split()
print("Stop Words:", stop_words[:5])
def load_reviews_from_folder(folder_path, label): ## to load all reviews and assign a label to them
    reviews = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
                reviews.append((content, label))
    return reviews
train_pos = load_reviews_from_folder(os.path.join(extract_to, 'train', 'pos'), label=1) ## to load positive and negative training data
train_neg = load_reviews_from_folder(os.path.join(extract_to, 'train', 'neg'), label=0)
test_pos = load_reviews_from_folder(os.path.join(extract_to, 'test', 'pos'), label=1) ## to load positive and negative testing data
test_neg = load_reviews_from_folder(os.path.join(extract_to, 'test', 'neg'), label=0)
train_data = train_pos + train_neg ## combine positive and negative reviews
test_data = test_pos + test_neg
random.shuffle(train_data)
random.shuffle(test_data)
##OUTPUT (the rubric requires a minimum of two; five stop words and five reviews are printed here)
for i in range(5):
    print(f"{i+1}. Label: {train_data[i][1]} | Review: {train_data[i][0]}")

Stop Words: ['i', "i'm", 'me', 'my', 'myself']
1. Label: 1 | Review: I never thought an old cartoon would bring tears to my eyes! When I first purchased Casper & Friends: Spooking About Africa, I so much wanted to see the very first Casper cartoon entitled The Friendly Ghost (1945), But when I saw the next cartoon, There's Good Boos To-Night (1948), It made me break down! I couldn't believe how sad and tragic it was after seeing Casper's fox get killed! I never saw anything like that in the other Casper cartoons! This is the saddest one of all! It was so depressing, I just couldn't watch it again. It's just like seeing Lassie die at the end of a movie. I know it's a classic,But it's too much for us old cartoon fans to handle like me! If I wanted to watch something old and classic, I rather watch something happy and funny! But when I think about this Casper cartoon, I think about my cats!
2. Label: 0 | Review: If the writer/director is reading this (and I imagine you are since you shoul

#### Task 1.2: Data Preprocessing (10 points)

In the preprocessing step, you are required to remove the stop words, punctuation marks, numbers, unwanted symbols, hyperlinks, and usernames from the reviews and convert them to lower case. You may find the string and re modules useful for this purpose. Use the stop word list provided with the assignment.

Print out a few random reviews from your dataset. If they conform to the rules mentioned above, you will gain 10 points.

In [None]:
def preprocess_text(text, stop_words):
    text = text.lower() ## to convert to lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text) ## to remove hyperlinks
    text = re.sub(r'@\w+', '', text) ## to remove usernames
    text = re.sub(r'[^a-z\s]', '', text) # to remove all non-alphabetic characters
    words = text.split()
    words = [word for word in words if word not in stop_words] ## to remove stop words
    return ' '.join(words)
processed_reviews = [(preprocess_text(review, stop_words), label) for review, label in train_data]
for i in range(5):
    print(f"{i+1}.", processed_reviews[i][0])
print("\nWords from the tweets shown in quotes as per PA6 tutorial:")
for i in range(5):
    words = processed_reviews[i][0].split()
    quoted_words = ', '.join([f"'{word}'" for word in words])
    print(f"{i+1}. {quoted_words}")

1. never thought old cartoon would bring tears eyes first purchased casper friends spooking africa much wanted see first casper cartoon entitled friendly ghost saw next cartoon theres good boos tonight made break couldnt believe sad tragic seeing caspers fox get killed never saw anything like casper cartoons saddest one depressing couldnt watch like seeing lassie die end movie know classicbut much us old cartoon fans handle like wanted watch something old classic rather watch something happy funny think casper cartoon think cats
2. writerdirector reading imagine since work must tell seen bad movies time one gets distinction worst premise ive ever heardbr br spoilers nothing happens br br total waste time laughed loud end br br side note whole movie coma scene sleeps guy mean someone raped knocked outbr br utter rubbish
3. know sounds odd coming someone born almost years show stopped airing love show dont know enjoy watching love adam best disappointing thing place found buy seasons dvd
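
The sample output above contains stray `br` tokens (for example `heardbr br`). These come from `<br />` tags in the raw reviews: the character filter in `preprocess_text` deletes `<`, `/`, and `>` but keeps the letters. One possible remedy, sketched below under the assumption that the only markup in the reviews is simple HTML tags, is to replace tags with a space before any other cleaning. The helper name `strip_html_breaks` is hypothetical, and applying it would change the vocabulary and therefore the numbers reported later in this notebook.

```python
import re

def strip_html_breaks(text):
    """Replace HTML tags such as <br /> with a space (hypothetical helper)."""
    return re.sub(r'<[^>]+>', ' ', text)

raw = "Worst premise I've ever heard.<br /><br />SPOILERS: nothing happens."
cleaned = strip_html_breaks(raw).lower()
cleaned = re.sub(r'[^a-z\s]', '', cleaned)  # same character filter as preprocess_text
print(cleaned.split())
# ['worst', 'premise', 'ive', 'ever', 'heard', 'spoilers', 'nothing', 'happens']
```

Replacing the tag with a space rather than an empty string keeps the neighbouring words separate, so `heard.<br />` no longer collapses into `heardbr`.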

#### Task 1.3: Splitting the dataset (5 points)

In this part, divide the given dataset into training and testing sets based on an 80-20 split using Python.
Print out the sizes of the training and test datasets. The training data should contain 40000 reviews and the test data should contain 10000 reviews. If your sizes are correct, you get full points.

In [None]:
all_data = train_data + test_data ## combine both training and testing data
processed_reviews = [(preprocess_text(review, stop_words), label) for review, label in all_data]
split_index = int(0.8 * len(processed_reviews)) ## split the data 80-20 split
train_split = processed_reviews[:split_index]
test_split = processed_reviews[split_index:]
## OUTPUT to display the sizes
print("Training set size:", len(train_split), "reviews")
print("Test set size:", len(test_split), "reviews")
print("Total dataset size (after combining train and test folders):", len(processed_reviews))

Training set size: 40000 reviews
Test set size: 10000 reviews
Total dataset size (after combining train and test folders): 50000
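
The split above depends on the unseeded `random.shuffle` calls from Task 1.1, so the exact reviews in each partition, and hence the later accuracy figures, can differ between runs. A minimal sketch of a reproducible split follows. The helper name `split_dataset` and the seed value are assumptions for illustration, not part of the assignment.

```python
import random

def split_dataset(examples, train_fraction=0.8, seed=42):
    """Shuffle a copy of `examples` with a fixed seed, then split it."""
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, so global state is unaffected
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]

toy = [(f"review {i}", i % 2) for i in range(10)]
train_part, test_part = split_dataset(toy)
print(len(train_part), len(test_part))  # 8 2
```

Calling `split_dataset` twice with the same seed returns the same partitions, which makes results comparable across notebook restarts.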


#### Task 1.4: Create Naive Bayes classifier (30 points)

You will create your own Naive Bayes classifier function by implementing the following algorithm.

In [None]:
##from IPython.display import Image, display
##display(Image(filename='NBAlgo.png'))
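
The rules implemented by `train_naive_bayes` and `predict` in the next cell can be summarised as follows. This is the standard multinomial Naive Bayes formulation with add-one (Laplace) smoothing. The symbol names are notational assumptions: $N_c$ is the number of training reviews with label $c$, $N_{doc}$ the total number of training reviews, $V$ the vocabulary, and $\mathrm{count}(w, c)$ the number of times word $w$ occurs in reviews of class $c$.

```latex
\log \hat{P}(c) = \log \frac{N_c}{N_{doc}}

\log \hat{P}(w \mid c) = \log \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}

\hat{c} = \arg\max_{c} \Big[ \log \hat{P}(c) + \sum_{i \,:\, w_i \in V} \log \hat{P}(w_i \mid c) \Big]
```

Words in a test review that are not in $V$ are skipped, which matches the `if word in V` check in `predict`. Working in log space avoids numerical underflow when many small probabilities are multiplied.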

In [None]:
def train_naive_bayes(data): ## to train a Naive Bayes classifier
    bigdoc = {}
    label_counts = {}
    vocab = set()
    for text, label in data: ## to loop through each review and aggregate words by class
        words = text.split()
        if label not in label_counts:
            label_counts[label] = 0
            bigdoc[label] = []
        label_counts[label] += 1
        bigdoc[label].extend(words)
        vocab.update(words)
    total_docs = len(data)
    V = vocab ## vocabulary set
    logprior = {} ## to store log prior
    loglikelihood = {} ## to store log likelihood
    for c in label_counts:
        logprior[c] = np.log(label_counts[c] / total_docs) ## to compute logprior and loglikelihood
    for c in label_counts:
        word_counts = {}
        for word in bigdoc[c]:
            word_counts[word] = word_counts.get(word, 0) + 1
        total_wc = sum(word_counts.values())
        for word in V:
            count = word_counts.get(word, 0)
            loglikelihood[(word, c)] = np.log((count + 1) / (total_wc + len(V)))
    return logprior, loglikelihood, V
def predict(text, logprior, loglikelihood, V): ## to predict the class of a given text based on trained model
    words = text.split()
    scores = {}
    for c in logprior:
        scores[c] = logprior[c]
        for word in words:
            if word in V:
                scores[c] += loglikelihood.get((word, c), 0)
    return max(scores, key=scores.get) ## to return the class with the highest score
train_data = train_split ## already preprocessed in Task 1.3
test_data = test_split
logprior, loglikelihood, V = train_naive_bayes(train_data)
print(f"Vocabulary created with {len(V)} unique words.") ## to display vocabulary size
sample_text = test_data[0][0] ## to confirm that the classifier is running and returning predictions correctly
sample_label = test_data[0][1]
sample_pred = predict(sample_text, logprior, loglikelihood, V)
print(f"Classifier test run: predicted {sample_pred}, actual {sample_label}")
print("The vocabulary has been made and the classifier is running perfectly by returning the argmax of the likelihood.\n")
print("Sample Predictions:") ## to display sample predictions
shown_pos = shown_neg = 0
for text, label in test_data:
    pred = predict(text, logprior, loglikelihood, V)
    if label == 1 and shown_pos == 0:
        print("Tweet (actual: positive):")
        print(f"'{text}'")
        print(f"Predicted: {'positive' if pred == 1 else 'negative'}")
        shown_pos = 1
    if label == 0 and shown_neg == 0:
        print("Tweet (actual: negative):")
        print(f"'{text}'")
        print(f"Predicted: {'positive' if pred == 1 else 'negative'}")
        shown_neg = 1
    if shown_pos and shown_neg:
        break
print("\nLogprior:") # to display the prior log probabilities for each class
for c in logprior:
    print(f"  Class {c}: {logprior[c]}")
print("\nSample Loglikelihoods:") ## to display a few loglikelihood values
i = 0
for key in loglikelihood:
    print(f"  Word '{key[0]}' | Class {key[1]}: {loglikelihood[key]}")
    i += 1
    if i == 5:
        break
correct = 0 ## to count correct predictions on the test set
for text, label in test_data:
    pred = predict(text, logprior, loglikelihood, V)
    if pred == label:
        correct += 1
accuracy = correct / len(test_data) ## to calculate accuracy
print("\nAccuracy on test set:", round(accuracy * 100, 2), "%")

Vocabulary created with 155005 unique words.
Classifier test run: predicted 0, actual 0
The vocabulary has been made and the classifier is running perfectly by returning the argmax of the likelihood.

Sample Predictions:
Tweet (actual: negative):
'makes movie damn bad lame subpar juvenile humor could horrid trendy suck ass music perhaps uninspired go nowhere story maybe even fact traci lords gives worst acting performance ever add insult injury keeps clothes throughout length steaming turd sandwich regardless matter reason film sucks fact remains really really never wished could watching movie dean cameron instead watching life ski school masterpiece comic genius compared travestybr br grade f br br eye candy nikol nesbitt buffy tyler suzanne stokes unleash tupperware titsbr br saw starz demand'
Predicted: negative
Tweet (actual: positive):
'recently saw movie first time enjoyed much went right bought dvd movie pure genius gets funnier viewing anyone write jokes funny dialog actors mem

#### Task 1.5: Implement evaluation functions (10 points)

Implement evaluation functions that calculate the following for your classifier on the test set:
- classification accuracy,
- F1 score,
- and the confusion matrix.


In [None]:
def evaluate(predictions, true_labels): ## to evaluate prediction results
    tp = sum(1 for p, t in zip(predictions, true_labels) if p == 1 and t == 1) ## count true positives
    tn = sum(1 for p, t in zip(predictions, true_labels) if p == 0 and t == 0) ## count true negatives
    fp = sum(1 for p, t in zip(predictions, true_labels) if p == 1 and t == 0) ## count false positives
    fn = sum(1 for p, t in zip(predictions, true_labels) if p == 0 and t == 1) ## count false negatives
    accuracy = (tp + tn) / len(predictions) ## to calculate accuracy
    precision = tp / (tp + fp) if (tp + fp) != 0 else 0
    recall    = tp / (tp + fn) if (tp + fn) != 0 else 0
    f1        = (2 * precision * recall) / (precision + recall) if (precision + recall) != 0 else 0
    return accuracy * 100, f1 * 100, [[tp, fn], [fp, tn]]
preds = [predict(text, logprior, loglikelihood, V) for text, _ in test_data] ## to predict on test data
true = [label for _, label in test_data]
acc, f1, cm = evaluate(preds, true)
##OUTPUT to print accuracy, F1 score, and confusion matrix
print("Accuracy:", round(acc, 2), "%")
print("F1 Score:", round(f1, 2), "%")
print("Confusion Matrix:")
print("{:<12} {:<15} {:<15}".format("", "Predicted Pos", "Predicted Neg"))
print("{:<12} {:<15} {:<15}".format("Actual Pos", cm[0][0], cm[0][1]))
print("{:<12} {:<15} {:<15}".format("Actual Neg", cm[1][0], cm[1][1]))

Accuracy: 85.01 %
F1 Score: 84.48 %
Confusion Matrix:
             Predicted Pos   Predicted Neg  
Actual Pos   4079            915            
Actual Neg   584             4422           
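
As a sanity check, the accuracy and F1 score can be recomputed by hand from the four counts in the confusion matrix printed above. The sketch below only copies those counts and does not re-run the classifier.

```python
# Counts copied from the confusion matrix printed above.
tp, fn = 4079, 915   # actual positive: predicted positive / predicted negative
fp, tn = 584, 4422   # actual negative: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy * 100, 2))  # 85.01
print(round(f1 * 100, 2))        # 84.48
```

Both values agree with the reported output. Recall is lower than precision here because the classifier misses more positive reviews (915 false negatives) than it mislabels negative ones (584 false positives).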


### Part 2: Naive Bayes classifier using scikit-learn (40 points)

In this part, use scikit-learn’s CountVectorizer to transform your train and test sets into a bag-of-words representation, and its Naïve Bayes implementation (MultinomialNB) to train and test a Naïve Bayes classifier on the provided dataset. Use scikit-learn’s accuracy_score function to calculate the accuracy and its confusion_matrix function to calculate the confusion matrix on the test set.
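
To make the bag-of-words representation concrete, the sketch below builds word-count vectors in plain Python. It is an illustration only and not scikit-learn's implementation: the helper names `build_vocabulary` and `to_count_vector` are hypothetical, and tokens are split on whitespace, whereas CountVectorizer applies its own tokenization rules.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct whitespace-separated token to a column index (sorted order)."""
    tokens = sorted({token for text in texts for token in text.split()})
    return {token: index for index, token in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    """Count known tokens in `text`; tokens outside the vocabulary are ignored."""
    counts = Counter(text.split())
    return [counts.get(token, 0) for token in vocabulary]

toy_texts = ["good movie good plot", "bad movie"]
vocabulary = build_vocabulary(toy_texts)
print(vocabulary)  # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(to_count_vector("good good acting", vocabulary))  # [0, 2, 0, 0]
```

The vocabulary is built from the training texts only, and unseen words such as `acting` are dropped at test time. This mirrors the fit-on-train, transform-on-test pattern of calling `fit_transform` on the training texts and `transform` on the test texts.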

In [None]:
# Here are the libraries and specific functions you will be needing for this part

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
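all_data = train_data + test_data ## note: train_data and test_data were rebound in Task 1.4 to the already preprocessed 80-20 splits
processed_reviews = [(preprocess_text(review, stop_words), label) for review, label in all_data] ## so this pass re-cleans text that has already been preprocessed
split_index = int(0.8 * len(processed_reviews)) ## same split index as in Task 1.3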
train_split = processed_reviews[:split_index]
test_split = processed_reviews[split_index:]
train_texts = [text for text, label in train_split] # to separate the text data and corresponding labels for training and testing
train_labels = [label for text, label in train_split]
test_texts = [text for text, label in test_split]
test_labels = [label for text, label in test_split]
vectorizer = CountVectorizer(stop_words=stop_words) ## to transform the text data into a bag-of-words representation
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
clf = MultinomialNB() ## to train a Naive Bayes classifier
clf.fit(X_train, train_labels)
predictions = clf.predict(X_test)
acc = accuracy_score(test_labels, predictions)
cm = confusion_matrix(test_labels, predictions)
report = classification_report(test_labels, predictions)
##OUTPUT to print accuracy, confusion matrix and classification report
print("Accuracy:", round(acc * 100, 2), "%")
print("Confusion Matrix:")
print("{:<12} {:<15} {:<15}".format("", "Predicted Pos", "Predicted Neg"))
print("{:<12} {:<15} {:<15}".format("Actual Pos", cm[1][1], cm[1][0]))
print("{:<12} {:<15} {:<15}".format("Actual Neg", cm[0][1], cm[0][0]))
print("\nClassification Report:")
print(report)

Accuracy: 85.03 %
Confusion Matrix:
             Predicted Pos   Predicted Neg  
Actual Pos   4079            915            
Actual Neg   582             4424           

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.88      0.86      5006
           1       0.88      0.82      0.84      4994

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000

