# Machine Learning
## Programming Assignment 6: Naive Bayes

Instructions:
The aim of this assignment is to give you hands-on experience with a real-life machine learning application.
You will be analyzing the sentiment of reviews using Naive Bayes classification.
Use only the Python programming language and Jupyter Notebooks.
Please use a procedural programming style and comment your code thoroughly.
This assignment has two parts. In Part 1, you may use NumPy, Pandas, Matplotlib, and any other standard Python libraries, but not NLTK, scikit-learn, or any other machine learning toolkit. scikit-learn is allowed only in Part 2.

### Part 1: Implementing Naive Bayes classifier from scratch (60 points)

You are not allowed to use scikit-learn or any other machine learning toolkit for this part. You have to implement your own Naive Bayes classifier from scratch. You may use Pandas, NumPy, Matplotlib, and other standard Python libraries.

#### Problem:
The purpose of this assignment is to familiarize you with Naive Bayes classification. The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt], where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test-set example with unique id 200 and star rating 8/10 from IMDb.
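
The naming convention described above can be unpacked with a short helper. This is only a sketch: the function name `parse_review_filename` and the pattern constant are illustrative assumptions, not part of the assignment's starter code.

```python
import os
import re

# Assumed pattern for names such as "200_8.txt": digits, underscore, digits, ".txt"
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")

def parse_review_filename(path):
    """Return (review_id, star_rating) extracted from a review file path."""
    name = os.path.basename(path)
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected review file name: {name}")
    return int(match.group(1)), int(match.group(2))

review_id, rating = parse_review_filename("test/pos/200_8.txt")
print(review_id, rating)  # 200 8
```

The star rating is not needed for the binary classifier, but parsing it makes it easy to check that the files under pos/ and neg/ carry the ratings you expect.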


In [1]:
# Libraries you will need for this part
import pandas as pd
import numpy as np
import scipy.spatial as sc
import matplotlib.pyplot as plt
import re
import random
%matplotlib inline

#### Task 1.1: Dataset (5 points)
Your task is to read the dataset and the stop words file into a useful data structure. Print out a few reviews and a few items from the stop word list. Doing this successfully earns you 5 points.

In [3]:
import zipfile
import os

# Unzip the uploaded file
with zipfile.ZipFile("/content/Naive Bayes Data.zip", 'r') as zip_ref:
    zip_ref.extractall(".")





In [6]:
import random

# Define a function to read random reviews from a directory
def read_sample_reviews(directory, count=3):
    files = os.listdir(directory)
    chosen_files = random.sample(files, count)
    for fname in chosen_files:
        with open(os.path.join(directory, fname), 'r', encoding='utf-8') as f:
            print(f"\n--- {fname} ---\n{f.read()[:500]}")  # show first 500 chars

# Read some stopwords
stopwords = []
with open("/content/stop_words.txt", 'r') as f:
    stopwords = f.read().splitlines()

print("\nSample stopwords:", stopwords[:10])  # Print first 10 stopwords

# Read 3 random reviews from each category
print("\n📂 train/pos")
read_sample_reviews("/content/train/pos")

print("\n📂 train/neg")
read_sample_reviews("/content/train/neg")



Sample stopwords: ['i', "i'm", 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you']

📂 train/pos

--- 10859_7.txt ---
Not to be confused with the Resse Witherspoon high school film of the same name, this is a stylised look at Hong Kong's triad gangs. Called election because a new leader or 'chairman' is elected by ancient traditions every two years. Two candidates are up for the position and through ego, bribes and past track record the race is tense to say the least. Expertly directed to introduce you to an expansive cast without ever being confusing the story twists and turns before revealing itself in all it

--- 10181_8.txt ---
I initially bought this DVD because it had SRK and Aishwarya Rai on the cover and I thought, hey! another film starring Aishu and Shah Rukh, little did I know that Aishwarya would only appear in an item number in the last quarter of the film in a song which she shares with SRK and helps introduce his character who is in the film for about just 15 
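
The cell above prints samples but does not keep the reviews in memory. One possible data structure, sketched here under the assumption that each split directory contains `pos/` and `neg/` subfolders as described in the problem statement, is a list of `(text, label)` tuples. The name `load_reviews` is hypothetical.

```python
import os

def load_reviews(split_dir):
    """Read every .txt review under split_dir/pos and split_dir/neg.

    Returns a list of (review_text, label) tuples, label being 'pos' or 'neg'.
    """
    samples = []
    for label in ("pos", "neg"):
        folder = os.path.join(split_dir, label)
        # Sorting makes the order reproducible across runs and machines
        for fname in sorted(os.listdir(folder)):
            if not fname.endswith(".txt"):
                continue  # skip anything that is not a review file
            with open(os.path.join(folder, fname), "r", encoding="utf-8") as f:
                samples.append((f.read(), label))
    return samples
```

With the paths used in this notebook, a call such as `load_reviews("/content/train")` would return the training reviews ready for preprocessing.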

#### Task 1.2: Data Preprocessing (10 points)

In the preprocessing step, you are required to remove the stop words, punctuation marks, numbers, unwanted symbols, hyperlinks, and usernames from the reviews and convert them to lower case. You may find the string and regex modules useful for this purpose. Use the stop word list provided with the assignment.

Print out a few random reviews from your dataset. If they conform to the rules mentioned above, you will gain 10 points.

In [7]:
import re
import string

def preprocess_text(text, stopwords):
    # Lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)

    # Remove user mentions and the '#' symbol of hashtags
    text = re.sub(r'\@\w+|\#', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize and remove stopwords
    tokens = text.split()
    cleaned_tokens = [word for word in tokens if word not in stopwords]

    # Reconstruct cleaned review
    return ' '.join(cleaned_tokens)


In [8]:
# Pick 3 random reviews from train/pos, clean and print
sample_dir = "/content/train/pos"
sample_files = random.sample(os.listdir(sample_dir), 3)

print("\n🔍 Preprocessed Reviews:")
for fname in sample_files:
    with open(os.path.join(sample_dir, fname), 'r', encoding='utf-8') as f:
        raw = f.read()
        cleaned = preprocess_text(raw, stopwords)
        print(f"\n📝 {fname}\nOriginal:\n{raw[:200]}\n\nCleaned:\n{cleaned[:200]}")



🔍 Preprocessed Reviews:

📝 5410_7.txt
Original:
Another Pokemon movie has hit the theaters, and again, I'm hearing the same old, "Pokemon is dead, blah blah blah." The franchise's detractors couldn't be more wrong. Kids are still playing the tradin

Cleaned:
another pokemon movie hit theaters im hearing old pokemon dead blah blah blah franchises detractors couldnt wrong kids still playing trading card game theyre still watching tv series theyre waiting ga

📝 4681_7.txt
Original:
Overall, I enjoyed this film and would recommend it to indie film lovers.<br /><br />However, I really want to note the similarities between parts of this film and Nichols' Closer. One scene especiall

Cleaned:
overall enjoyed film would recommend indie film loversbr br however really want note similarities parts film nichols closer one scene especially adrian greniers character questioning rosario dawsons s

📝 1588_8.txt
Original:
*What I Like About SPOILERS* Teenager Holly Tyler (Amanda Bynes) goes to live w
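
The cleaned samples above still contain fragments such as `loversbr br` (left over from `<br />` tags) and `im` or `theyre` (contractions whose stop word entries, such as `i'm`, no longer match once the apostrophe is stripped). A possible variant that handles both cases is sketched below. The name `preprocess_text_v2` is hypothetical, and the tag pattern assumes the only angle-bracket content in the reviews is HTML markup.

```python
import re
import string

def preprocess_text_v2(text, stopwords):
    """Variant of preprocess_text that also drops HTML tags and contraction stop words."""
    table = str.maketrans("", "", string.punctuation)
    # Strip punctuation from the stop words too, so "i'm" also matches "im"
    stop_set = {word.lower().translate(table) for word in stopwords}

    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags such as <br />
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"@\w+", " ", text)                   # user mentions
    text = text.translate(table)                        # punctuation
    text = re.sub(r"[^a-z\s]", " ", text)               # digits and other symbols
    return " ".join(word for word in text.split() if word not in stop_set)

sample = "Overall, I'm a fan of indie film lovers.<br /><br />However, see http://x.com/a 123!"
print(preprocess_text_v2(sample, ["i'm", "a", "of"]))
# overall fan indie film lovers however see
```

Building the stop word set once per call is wasteful for 50,000 reviews; in practice you would normalize the list a single time and pass the resulting set in.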

#### Task 1.3: Splitting the dataset (5 points)

In this part, divide the given dataset into training and testing sets based on an 80-20 split using Python.
Print out the sizes of the training and test datasets. The training data should contain 40000 reviews and the test data should contain 10000 reviews. If your sizes are correct, you get full points.
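
One way to obtain the 80-20 split is to pool all reviews into a single list, shuffle it with a fixed seed, and slice it. The sketch below works on any list of samples; the function name `split_dataset`, the default seed, and the toy data are assumptions made for illustration.

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of samples and split it into (train, test) lists."""
    shuffled = list(samples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for the pooled 50,000 reviews
toy_samples = [(f"review {i}", "pos" if i % 2 == 0 else "neg") for i in range(50)]
train_set, test_set = split_dataset(toy_samples)
print(len(train_set), len(test_set))  # 40 10
```

This simple shuffle does not guarantee an exactly balanced label ratio in each part. If that matters, split the positive and negative reviews separately and concatenate the results.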

In [9]:
from collections import defaultdict

def train_naive_bayes(train_dir, stopwords):
    class_word_counts = {"pos": defaultdict(int), "neg": defaultdict(int)}
    class_doc_counts = {"pos": 0, "neg": 0}
    class_total_words = {"pos": 0, "neg": 0}
    vocab = set()

    for label in ['pos', 'neg']:
        folder = os.path.join(train_dir, label)
        files = os.listdir(folder)
        class_doc_counts[label] = len(files)

        for file in files:
            with open(os.path.join(folder, file), 'r', encoding='utf-8') as f:
                text = f.read()
                cleaned = preprocess_text(text, stopwords)
                words = cleaned.split()

                class_total_words[label] += len(words)
                for word in words:
                    vocab.add(word)
                    class_word_counts[label][word] += 1

    total_docs = class_doc_counts['pos'] + class_doc_counts['neg']
    class_priors = {
        'pos': class_doc_counts['pos'] / total_docs,
        'neg': class_doc_counts['neg'] / total_docs
    }

    print("Training completed ✅")
    print("Vocabulary size:", len(vocab))
    print("Documents - Pos:", class_doc_counts['pos'], "| Neg:", class_doc_counts['neg'])

    return class_word_counts, class_total_words, class_priors, vocab


#### Task 1.4: Create Naive Bayes classifier (30 points)

You will create your own Naive Bayes classifier function by implementing the following algorithm:

In [11]:
from IPython.display import Image, display
display(Image(filename='NBAlgo.png'))

FileNotFoundError: [Errno 2] No such file or directory: 'NBAlgo.png'
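
Since the figure could not be loaded, its exact contents are not reproduced here. As a point of reference, the decision rule that the `predict` cell in this notebook applies is the usual multinomial Naive Bayes rule with add-one (Laplace) smoothing, evaluated in log space:

$$
\hat{c} = \underset{c \in \{\text{pos},\,\text{neg}\}}{\arg\max} \left[ \log P(c) + \sum_{i=1}^{n} \log \frac{\operatorname{count}(w_i, c) + 1}{N_c + |V|} \right]
$$

Here $P(c)$ is the fraction of training documents with label $c$ (`class_priors`), $\operatorname{count}(w_i, c)$ is the number of times word $w_i$ occurs in class $c$ (`class_word_counts`), $N_c$ is the total number of word tokens in class $c$ (`class_total_words`), $|V|$ is the vocabulary size (`len(vocab)`), and $w_1, \dots, w_n$ are the tokens of the preprocessed review.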

In [12]:
import math

def predict(review, stopwords, class_word_counts, class_total_words, class_priors, vocab):
    cleaned = preprocess_text(review, stopwords)
    words = cleaned.split()

    scores = {}
    for label in ['pos', 'neg']:
        log_prob = math.log(class_priors[label])
        total_words = class_total_words[label]
        word_counts = class_word_counts[label]

        for word in words:
            # Laplace smoothing
            word_freq = word_counts.get(word, 0)
            prob = (word_freq + 1) / (total_words + len(vocab))
            log_prob += math.log(prob)

        scores[label] = log_prob

    return max(scores, key=scores.get)  # label with higher score
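
To see the smoothing arithmetic on numbers small enough to check by hand, here is a self-contained toy version of the same computation. The four-document corpus and the helper names `train_toy` and `log_scores` are made up for illustration and are independent of the functions above.

```python
import math
from collections import Counter

def train_toy(docs):
    """docs: list of (tokens, label) pairs. Returns counts, priors, vocabulary."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(doc_counts.values())
    priors = {label: doc_counts[label] / total_docs for label in word_counts}
    return word_counts, priors, vocab

def log_scores(tokens, word_counts, priors, vocab):
    """Log prior plus the sum of Laplace-smoothed log likelihoods, per class."""
    scores = {}
    for label in word_counts:
        n_label = sum(word_counts[label].values())  # tokens seen in this class
        score = math.log(priors[label])
        for word in tokens:
            score += math.log((word_counts[label][word] + 1) / (n_label + len(vocab)))
        scores[label] = score
    return scores

toy_docs = [
    (["great", "film"], "pos"),
    (["great", "acting"], "pos"),
    (["boring", "film"], "neg"),
    (["terrible", "boring"], "neg"),
]
word_counts, priors, vocab = train_toy(toy_docs)
scores = log_scores(["great", "film"], word_counts, priors, vocab)
prediction = max(scores, key=scores.get)
print(prediction)  # pos
```

With five vocabulary words and four tokens per class, "great" gets probability 3/9 under pos and 1/9 under neg, while "film" gets 2/9 under both, so the positive class wins. Summing logarithms instead of multiplying probabilities also avoids numeric underflow on long reviews.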


#### Task 1.5: Implement evaluation functions (10 points)

Implement evaluation functions that calculate the following for your classifier on the test set:
- classification accuracy
- F1 score
- confusion matrix


In [14]:
def evaluate_model(test_dir, stopwords, class_word_counts, class_total_words, class_priors, vocab):
    # Part 1 does not allow scikit-learn, so every metric is computed by hand.
    y_true = []
    y_pred = []

    for label in ['pos', 'neg']:
        folder = os.path.join(test_dir, label)
        files = os.listdir(folder)

        for file in files:
            path = os.path.join(folder, file)
            if os.path.isdir(path) or not file.endswith('.txt'):
                continue

            with open(path, 'r', encoding='utf-8') as f:
                text = f.read()
                pred_label = predict(text, stopwords, class_word_counts, class_total_words, class_priors, vocab)

                y_true.append(label)
                y_pred.append(pred_label)

    # Confusion-matrix cells, with 'pos' as the positive class
    pairs = list(zip(y_true, y_pred))
    tp = pairs.count(('pos', 'pos'))
    tn = pairs.count(('neg', 'neg'))
    fp = pairs.count(('neg', 'pos'))  # actual neg, predicted pos
    fn = pairs.count(('pos', 'neg'))  # actual pos, predicted neg

    # Guard each ratio against an empty denominator
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    print("📊 Evaluation Results:")
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)
    print("Confusion matrix (rows: actual neg, pos; columns: predicted neg, pos):")
    print(np.array([[tn, fp], [fn, tp]]))



### Part 2:  Naive Bayes classifier using scikit-learn (40 points)

In this part, use scikit-learn’s CountVectorizer to transform your train and test sets into a bag-of-words representation, and scikit-learn’s Naive Bayes implementation to train and test a classifier on the provided dataset. Use scikit-learn’s accuracy_score function to calculate the accuracy and its confusion_matrix function to calculate the confusion matrix on the test set.

In [19]:
# Here are the libraries and specific functions you will be needing for this part

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [15]:
def load_reviews_to_dataframe(base_dir):
    reviews = []
    labels = []

    for label in ['pos', 'neg']:
        folder = os.path.join(base_dir, label)
        files = os.listdir(folder)

        for file in files:
            path = os.path.join(folder, file)
            if os.path.isdir(path) or not file.endswith(".txt"):
                continue

            with open(path, 'r', encoding='utf-8') as f:
                reviews.append(f.read())
                labels.append(1 if label == 'pos' else 0)

    return pd.DataFrame({'review': reviews, 'label': labels})


In [16]:
train_df = load_reviews_to_dataframe("/content/train")
test_df = load_reviews_to_dataframe("/content/test")


In [17]:
import re

def simple_preprocess(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

train_df['review'] = train_df['review'].apply(simple_preprocess)
test_df['review'] = test_df['review'].apply(simple_preprocess)


In [20]:
vectorizer = CountVectorizer(stop_words='english')  # scikit-learn built-in stopwords
X_train = vectorizer.fit_transform(train_df['review'])
X_test = vectorizer.transform(test_df['review'])
y_train = train_df['label']
y_test = test_df['label']


In [21]:
model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("📊 Scikit-learn Naive Bayes Evaluation:\n")
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))


📊 Scikit-learn Naive Bayes Evaluation:

              precision    recall  f1-score   support

    Negative       0.79      0.88      0.83     12500
    Positive       0.86      0.77      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.83      0.82      0.82     25000
weighted avg       0.83      0.82      0.82     25000
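
The task statement for this part also asks for `accuracy_score` and `confusion_matrix`, which the report above does not print. The sketch below shows both calls on a tiny invented corpus so that it runs on its own. All `toy_` names are illustrative, and the numbers it produces say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented mini corpus: 1 = positive, 0 = negative
toy_train_texts = ["great film", "great acting", "boring film", "terrible boring plot"]
toy_train_labels = [1, 1, 0, 0]
toy_test_texts = ["great plot", "boring acting"]
toy_test_labels = [1, 0]

# Bag-of-words features: fit on the training texts only, then reuse for the test texts
toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_train_texts), toy_train_labels)
toy_pred = toy_model.predict(toy_vectorizer.transform(toy_test_texts))

toy_accuracy = accuracy_score(toy_test_labels, toy_pred)
# labels=[0, 1] fixes the order: rows are actual neg, pos; columns are predicted neg, pos
toy_cm = confusion_matrix(toy_test_labels, toy_pred, labels=[0, 1])
print("Accuracy:", toy_accuracy)
print("Confusion matrix:\n", toy_cm)
```

On the variables already defined in this notebook, the equivalent calls would be `accuracy_score(y_test, y_pred)` and `confusion_matrix(y_test, y_pred)`.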

