# Ahmet Emre Usta
# 2200765036

# Part 1

1.  To determine the priority of each case, I will evaluate the probability of dangerous fires as a cause of smoke in each neighborhood, considering both the probability of fire and the probability of smoke given a fire. 
 
    1. **Neighborhood 1**:    
        - **Probability of dangerous fire (P(Fire))** = 1%    
        - **Probability of barbecue smoke (alternative cause)** = 20%    
        - **Probability of smoke given fire (P(Smoke | Fire))** = 80%    
        - Using Bayes' theorem, calculate \( P( Fire | Smoke ) \), the probability that observed smoke is caused by a dangerous fire. The higher the result, the higher the priority.
          \[ P( Fire | Smoke ) = \frac{P( Smoke | Fire ) \times P( Fire )}{P( Smoke )} \]
        
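        We treat the alternative-cause probability as the probability of smoke when there is no fire, so by the law of total probability \( P( Smoke ) = P( Smoke | Fire ) \times P( Fire ) + P( Barbecue Smoke ) \times ( 1 - P( Fire ) ) \).  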
        
    2. **Neighborhood 2**:    
        - **Probability of dangerous fire (P(Fire))** = 35%    
        - **Probability of factory smoke (alternative cause)** = 10%    
        - **Probability of smoke given fire (P(Smoke | Fire))** = 1%        
        
        Calculate \( P( Fire | Smoke  ) \) similarly, adjusting for alternative causes of smoke (factory smoke).  
        
    3. **Neighborhood 3**:    
        - **Probability of dangerous fire (P(Fire))** = 10%    
        - **Probability of coal smoke (alternative cause)** = 80%    
        - **Probability of smoke given fire (P(Smoke | Fire))** = 30%     
        
        Use Bayes' Theorem to estimate the probability of fire given the smoke in this neighborhood, considering the high alternative cause of coal use.  

    ### Comparison and Priority Ranking 
    
    Using Bayes' calculations, we rank neighborhoods by \( P( Fire | Smoke  ) \).  
    
    - **Neighborhood 2 likely has the highest priority:** The high initial probability of fire (35%) strongly suggests a higher likelihood of fire despite a low smoke conditional probability. 
    
    - **Neighborhood 3 may rank second** with a moderate probability of fire (10%) and moderate smoke likelihood (30%) suggesting some risk. 
    
    - **Neighborhood 1 likely ranks lowest**, as its probability of fire is very low (1%) with significant alternative smoke sources.   
    

    1. Neighborhood 2 
    2. Neighborhood 3 
    3. Neighborhood 1
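
As a numerical check on this ranking, the posteriors can be computed directly. This is a minimal sketch that assumes the alternative-cause probability acts as \( P( Smoke | no Fire ) \), so that \( P( Smoke ) \) follows from the law of total probability. The function name `posterior_fire` and the dictionary layout are illustrative only.

```python
def posterior_fire(p_fire, p_smoke_given_fire, p_smoke_alt):
    """Return P(Fire | Smoke) via Bayes' theorem.

    Assumption: p_smoke_alt is the probability of smoke when there is no fire.
    """
    joint = p_smoke_given_fire * p_fire
    p_smoke = joint + p_smoke_alt * (1 - p_fire)
    return joint / p_smoke


# (P(Fire), P(Smoke | Fire), alternative-cause probability) per neighborhood
neighborhoods = {
    "Neighborhood 1": (0.01, 0.80, 0.20),
    "Neighborhood 2": (0.35, 0.01, 0.10),
    "Neighborhood 3": (0.10, 0.30, 0.80),
}

posteriors = {name: posterior_fire(*params) for name, params in neighborhoods.items()}
ranking = sorted(posteriors, key=posteriors.get, reverse=True)

for name in ranking:
    print(f"{name}: P(Fire | Smoke) = {posteriors[name]:.4f}")
```

Under this assumption the posteriors are roughly 0.051, 0.040, and 0.039 for Neighborhoods 2, 3, and 1, which agrees with the ranking above. The values are close to each other, so a different model of \( P( Smoke ) \) could change the order.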

**2.**

- Let \( B1 \) represent selecting the first box, with \( P(B1) = 0.4 \).
- Let \( B2 \) represent selecting the second box, with \( P(B2) = 0.6 \).
- Let \( R1 \) and \( R2 \) represent red balls in box 1 and box 2 respectively.
- Let \( Blue \) be the event that a blue ball is drawn.

- **For Box 1:** There are 5 red and 3 blue balls, so the probability of drawing a blue ball from Box 1 is:
  
  P(Blue|B1) = (3)/(5+3) = 3/8


- **For Box 2:** There are 7 red and 4 blue balls, so the probability of drawing a blue ball from Box 2 is:

  P(Blue|B2) = 4/(7+4) = 4/11


- Using the law of total probability:

  P(Blue) = P(Blue|B1) x P(B1) + P(Blue|B2) x P(B2)

  P(Blue) = (3/8) x 0.4 + (4/11) x 0.6

  Calculating each part:

  P(Blue) = 0.15 + 0.2182 = 0.3682

  The probability of drawing a blue ball is approximately **0.3682**.


Using Bayes' Theorem:

  P(B2| Blue) = P(Blue|B2) x P(B2) / P(Blue)

  Plugging in values:

  P(B2| Blue) = ((4/11) x 0.6) / 0.3682 = 0.2182 / 0.3682 = 0.593

  The probability that the second box was selected given that a blue ball was drawn is approximately **0.593**.
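
The same arithmetic can be reproduced with exact fractions, which avoids rounding in the intermediate steps. This is only a sketch; the variable names are chosen for illustration.

```python
from fractions import Fraction

# Box selection probabilities: P(B1) = 0.4, P(B2) = 0.6
p_b1, p_b2 = Fraction(2, 5), Fraction(3, 5)

# Probability of drawing a blue ball from each box
p_blue_b1 = Fraction(3, 5 + 3)
p_blue_b2 = Fraction(4, 7 + 4)

# Law of total probability, then Bayes' theorem
p_blue = p_blue_b1 * p_b1 + p_blue_b2 * p_b2
p_b2_blue = p_blue_b2 * p_b2 / p_blue

print(p_blue, float(p_blue))
print(p_b2_blue, float(p_b2_blue))
```

The exact values are 81/220 for P(Blue) and 16/27 for P(B2|Blue).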


3. 
**Text classification is the primary application for Naïve Bayes classifier methods.**
(**T**)

**Explanation**
Naïve Bayes is widely used for text classification tasks such as spam filtering, sentiment analysis, and document categorization, owing to its effectiveness and simplicity. It assumes that features are conditionally independent given the class; this rarely holds exactly for text, but the classifier often performs well regardless.

**When an attribute value in the testing record has no example in the training set, the total posterior probability in a Naïve Bayes algorithm will be zero.**
(**T**)

**Explanation**
This is known as the "zero-frequency problem," where a probability of zero for any feature results in a zero probability for the entire class. To address this, smoothing techniques, such as Laplace smoothing, are typically applied to ensure nonzero probabilities.
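
The zero-frequency problem can be seen on a toy corpus. The sketch below is a hypothetical example (the tokens, the equal class priors, and the helper names are made up for illustration): with `alpha=0` a single unseen token drives both class scores to zero, while add-one smoothing keeps them positive and comparable.

```python
from collections import Counter

# Toy training tokens per class (illustrative data only)
train_tokens = {
    "pos": ["great", "great", "fun"],
    "neg": ["boring", "bad", "bad"],
}
vocab = sorted({tok for toks in train_tokens.values() for tok in toks})


def likelihood(token, label, alpha):
    """P(token | label) with additive smoothing; alpha=0 disables smoothing."""
    counts = Counter(train_tokens[label])
    total = len(train_tokens[label])
    return (counts[token] + alpha) / (total + alpha * len(vocab))


def class_score(tokens, label, alpha):
    """Unnormalized posterior: class prior times the product of likelihoods."""
    score = 0.5  # equal class priors assumed
    for tok in tokens:
        score *= likelihood(tok, label, alpha)
    return score


review = ["great", "boring"]
unsmoothed = {label: class_score(review, label, alpha=0) for label in train_tokens}
smoothed = {label: class_score(review, label, alpha=1) for label in train_tokens}

print(unsmoothed)
print(smoothed)
```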

# Part 2

## Install Necessary Libraries

In [38]:
!pip install pandas numpy scikit-learn plotly statsmodels gensim nbformat >> /dev/null

## Import Libraries

In [39]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter, defaultdict
import re
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression


# Suppress warnings
import warnings

warnings.filterwarnings("ignore")

## Set the Paths and Read the Dataset

In [40]:
DATASET_PATH = "/Users/emre/GitHub/HU-AI/AIN313/2024/Assignment 2/dataset/"
TRAIN_DATASET_PATH = os.path.join(DATASET_PATH, "raw", "aclImdb", "train")
TEST_DATASET_PATH = os.path.join(DATASET_PATH, "raw", "aclImdb", "test")

## Introduction

This project aims to classify the sentiment of movie reviews as either positive or negative using a dataset containing labeled reviews with binary sentiment labels. The main task is to build a custom Naïve Bayes classifier to predict sentiment by analyzing the frequency and co-occurrence of words (unigrams, bigrams, etc.) using the Bag of Words (BoW) approach. The classifier will be implemented from scratch, without relying on pre-built libraries for the main assignment, although we will later explore additional libraries for comparative analysis in a bonus section.

The dataset is split into 25,000 training and 25,000 testing samples, with balanced positive and negative reviews in each set. The training and testing sets contain unique movie reviews, ensuring that the model cannot rely on memorizing terms specific to a movie across datasets. This approach helps benchmark sentiment classification in a supervised learning context and also offers an unsupervised learning component with an additional 50,000 unlabeled reviews, although this component will not be used directly.

### Methods

1. **Data Preprocessing**  
   - Clean and prepare the text data, removing stopwords, punctuation, and other non-essential elements.
   - Tokenize text into different n-grams (unigrams, bigrams, trigrams) to capture varying levels of word context.

2. **Bag of Words (BoW) Representation**  
   - Build a custom BoW dictionary for unigrams, bigrams, and possibly trigrams.
   - Use this dictionary to create feature vectors for each review, encoding word occurrences.
   - Apply Laplace smoothing to handle words not seen in the training set and treat rare words as unknown.

3. **Naïve Bayes Classifier**  
   - Implement a Naïve Bayes model from scratch using the BoW vectors.
   - Use logarithmic probabilities to prevent underflow and simplify multiplication calculations.
   - Implement model prediction based on word probabilities.

4. **Bonus: Word Embedding and Logistic Regression**  
   - For comparative analysis, build a model using word embeddings (Word2Vec or GloVe) with Logistic Regression.
   - This allows for the comparison of traditional n-gram BoW and embedding-based representations.
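
To illustrate why step 3 works with logarithmic probabilities, the sketch below multiplies many small likelihoods directly and compares that with summing their logarithms. The per-word probability and the review length are arbitrary illustrative values, not figures from the dataset.

```python
import math

p_word = 1e-4   # illustrative per-word likelihood
n_words = 200   # illustrative review length

# Direct multiplication underflows to zero in double precision
product = 1.0
for _ in range(n_words):
    product *= p_word

# Summing logs keeps the score finite and comparable across classes
log_sum = n_words * math.log(p_word)

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing summed log probabilities selects the same class as comparing the raw products would, without the underflow.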

### Performance Metrics

We will assess the model's performance using the following metrics:

1. **Accuracy**  
   Measures the proportion of correctly classified reviews:
    Accuracy =  (TP + TN) / (TP + TN + FP + FN)

2. **Precision**  
   The fraction of true positive reviews out of all predicted positives:
    Precision =  TP / (TP + FP)

3. **Recall**  
   The fraction of true positive reviews out of all actual positives:
    Recall =  TP / (TP + FN)

4. **F1-Score**  
   The harmonic mean of precision and recall, useful for imbalanced data:
    F1-Score = 2 x (Precision x Recall) / (Precision +  Recall)

These metrics will be computed and compared for each n-gram setting (unigram, bigram, etc.) and for the bonus method. The goal is to optimize both the precision and recall, which are critical for sentiment analysis tasks where false positives and false negatives can both lead to significant interpretative errors.
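
The four formulas above can be checked on a small hand-made example. This is a minimal sketch with made-up labels; `classification_metrics` is a hypothetical helper, not part of the assignment code, and it returns 0.0 when a denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels: 3 TP, 3 TN, 1 FP, 1 FN
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```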

## Dataset Exploration

In [41]:
def load_dataset(path):
    data = []
    labels = []
    for label in ["pos", "neg"]:
        labeled_path = os.path.join(path, label)
        for file_name in os.listdir(labeled_path):
            with open(
                os.path.join(labeled_path, file_name), "r", encoding="utf-8"
            ) as file:
                data.append(file.read())
                labels.append(1 if label == "pos" else 0)
    return pd.DataFrame({"review": data, "sentiment": labels})

In [42]:
train_data = load_dataset(TRAIN_DATASET_PATH)
test_data = load_dataset(TEST_DATASET_PATH)

# Take a random sample of 1,000 reviews from each split
train_data = train_data.sample(1000)
test_data = test_data.sample(1000)

In [43]:
print("Training Data Shape:", train_data.shape)
print("Testing Data Shape:", test_data.shape)
print(
    "Positive Reviews in Training Set:",
    train_data[train_data["sentiment"] == 1].shape[0],
)
print(
    "Negative Reviews in Training Set:",
    train_data[train_data["sentiment"] == 0].shape[0],
)

Training Data Shape: (1000, 2)
Testing Data Shape: (1000, 2)
Positive Reviews in Training Set: 496
Negative Reviews in Training Set: 504


In [44]:
# Display Example Reviews
print(
    "\nExample Positive Review:\n",
    train_data[train_data["sentiment"] == 1]["review"].iloc[0],
)
print(
    "\nExample Negative Review:\n",
    train_data[train_data["sentiment"] == 0]["review"].iloc[0],
)


Example Positive Review:
 Probably Jackie Chan's best film in the 1980s, and the one that put him on the map. The scale of this self-directed police drama is evident from the opening and closing scenes, during which a squatters' village and shopping mall are demolished. There are, clearly, differences between the original Chinese and dubbed English versions, with many of the jokes failing to make their way into the latter. The latter is also hampered by stars who sound nothing like their Chinese originals. In fact, the only thing the dubbing has corrected is the court trialat the time, trials in colonial Hong Kong were conducted in English, while the original has this scene in Cantonese!<br /><br />Nonetheless, Chan's fighting style and the martial arts choreography inject humour where possible, so non-Cantonese audiences don't miss much. It's not, after all, the dialogue that makes a Chan flick, but the action and the painful out-takes. The story is easy to follow: Chan plays an inc

In [45]:
# Visualize Data Characteristics
# 1. Word Count Distribution per Review
train_data["word_count"] = train_data["review"].apply(lambda x: len(x.split()))

# Plotting with Plotly
fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=train_data["word_count"], nbinsx=50, marker_color="blue", opacity=0.7
    )
)

# Setting titles and labels
fig.update_layout(
    title="Word Count Distribution per Review",
    xaxis_title="Word Count",
    yaxis_title="Frequency",
    bargap=0.1,
)

fig.show()

In [46]:
# 2. Average word count per sentiment class
avg_word_count_pos = train_data[train_data["sentiment"] == 1]["word_count"].mean()
avg_word_count_neg = train_data[train_data["sentiment"] == 0]["word_count"].mean()
print(f"Average Word Count in Positive Reviews: {avg_word_count_pos}")
print(f"Average Word Count in Negative Reviews: {avg_word_count_neg}")

Average Word Count in Positive Reviews: 226.33064516129033
Average Word Count in Negative Reviews: 226.15674603174602


In [47]:
# Define custom stopwords
custom_stopwords = set(
    [
        "i",
        "me",
        "my",
        "myself",
        "we",
        "our",
        "ours",
        "ourselves",
        "you",
        "your",
        "yours",
        "yourself",
        "yourselves",
        "he",
        "him",
        "his",
        "himself",
        "she",
        "her",
        "hers",
        "herself",
        "it",
        "its",
        "itself",
        "they",
        "them",
        "their",
        "theirs",
        "themselves",
        "what",
        "which",
        "who",
        "whom",
        "this",
        "that",
        "these",
        "those",
        "am",
        "is",
        "are",
        "was",
        "were",
        "be",
        "been",
        "being",
        "have",
        "has",
        "had",
        "having",
        "do",
        "does",
        "did",
        "doing",
        "a",
        "an",
        "the",
        "and",
        "but",
        "if",
        "or",
        "because",
        "as",
        "until",
        "while",
        "of",
        "at",
        "by",
        "for",
        "with",
        "about",
        "against",
        "between",
        "into",
        "through",
        "during",
        "before",
        "after",
        "above",
        "below",
        "to",
        "from",
        "up",
        "down",
        "in",
        "out",
        "on",
        "off",
        "over",
        "under",
        "again",
        "further",
        "then",
        "once",
        "here",
        "there",
        "when",
        "where",
        "why",
        "how",
        "all",
        "any",
        "both",
        "each",
        "few",
        "more",
        "most",
        "other",
        "some",
        "such",
        "no",
        "nor",
        "not",
        "only",
        "own",
        "same",
        "so",
        "than",
        "too",
        "very",
        "s",
        "t",
        "can",
        "will",
        "just",
        "don",
        "should",
        "now",
    ]
)

In [48]:
# 3. Frequency Distribution of Words (Unigrams and Bigrams)
# Preprocess text
def preprocess_text(text):
    text = re.sub(r"[^\w\s]", "", text.lower())
    words = text.split()
    return [word for word in words if word not in custom_stopwords]

In [49]:
# Extract Unigrams and Bigrams
all_words = []
all_bigrams = []
for review in train_data["review"]:
    words = preprocess_text(review)
    all_words.extend(words)
    all_bigrams.extend([(words[i], words[i + 1]) for i in range(len(words) - 1)])

In [50]:
# Frequency Counts
unigram_counts = Counter(all_words)
bigram_counts = Counter(all_bigrams)

# Display top 10 most common unigrams and bigrams
print("Top 10 Unigrams:", unigram_counts.most_common(10))
print("Top 10 Bigrams:", bigram_counts.most_common(10))

Top 10 Unigrams: [('br', 2183), ('movie', 1685), ('film', 1418), ('one', 1074), ('like', 752), ('good', 564), ('would', 525), ('even', 481), ('time', 472), ('story', 452)]
Top 10 Bigrams: [(('br', 'br'), 239), (('special', 'effects'), 52), (('br', 'movie'), 46), (('even', 'though'), 45), (('ever', 'seen'), 45), (('br', 'film'), 45), (('moviebr', 'br'), 44), (('one', 'best'), 43), (('ive', 'seen'), 40), (('waste', 'time'), 35)]
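
The `br` and `moviebr` tokens in the output above come from `<br />` tags in the raw reviews: the punctuation regex removes `<`, `/`, and `>` but leaves the letters behind. One possible remedy, sketched here with hypothetical helper names, is to replace the tags with a space before stripping punctuation. The notebook's own `preprocess_text` is left unchanged, so the outputs shown in this report still reflect the original preprocessing.

```python
import re


def strip_html_breaks(text):
    # Replace <br>, <br/>, and <br /> with a space so adjacent words stay separate
    return re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)


def preprocess_without_breaks(text, stopwords=frozenset()):
    # Same steps as the notebook's preprocessing, with tag removal added first
    text = strip_html_breaks(text)
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [word for word in text.split() if word not in stopwords]


sample = "A great movie.<br /><br />Nonetheless, it drags."
tokens = preprocess_without_breaks(sample, stopwords={"a", "it"})
print(tokens)
```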


In [51]:
# Unigram Frequency Distribution
unigram_counts = Counter(all_words)
unigram_df = pd.DataFrame(
    unigram_counts.most_common(20), columns=["Unigram", "Frequency"]
)

# Plot Unigram Frequency Distribution with Plotly
fig_unigram = px.bar(unigram_df, x="Unigram", y="Frequency", title="Top 20 Unigrams")
fig_unigram.update_xaxes(tickangle=45)
fig_unigram.show()

In [52]:
# Bigram Frequency Distribution
bigram_counts = Counter(all_bigrams)
bigram_df = pd.DataFrame(bigram_counts.most_common(20), columns=["Bigram", "Frequency"])
bigram_df["Bigram"] = bigram_df["Bigram"].apply(
    lambda x: " ".join(x)
)  # Convert tuples to strings

# Plot Bigram Frequency Distribution with Plotly
fig_bigram = px.bar(bigram_df, x="Bigram", y="Frequency", title="Top 20 Bigrams")
fig_bigram.update_xaxes(tickangle=45)
fig_bigram.show()

## Preprocessing

In [53]:
# Define text preprocessing function
def preprocess_text(text, remove_stopwords=True):
    # Remove punctuation and convert text to lowercase
    text = re.sub(r"[^\w\s]", "", text.lower())

    # Split text into words
    words = text.split()

    # Remove stopwords if specified
    if remove_stopwords:
        words = [word for word in words if word not in custom_stopwords]

    return words


# Apply preprocessing to the dataset
train_data["processed_text"] = train_data["review"].apply(lambda x: preprocess_text(x))

In [54]:
# Generate Unigrams, Bigrams, and Trigrams without using nltk
def generate_ngrams(words, n):
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]


train_data["unigrams"] = train_data["processed_text"]
train_data["bigrams"] = train_data["processed_text"].apply(
    lambda x: generate_ngrams(x, 2)
)
train_data["trigrams"] = train_data["processed_text"].apply(
    lambda x: generate_ngrams(x, 3)
)

In [55]:
# Display example processed text structure
print("Example Processed Review (Unigrams):", train_data["unigrams"].iloc[0])
print("Example Processed Review (Bigrams):", train_data["bigrams"].iloc[0])
print("Example Processed Review (Trigrams):", train_data["trigrams"].iloc[0])

Example Processed Review (Unigrams): ['sure', 'like', 'short', 'cartoons', 'didnt', 'like', 'one', 'naturally', 'kids', 'would', 'love', 'im', 'kid', 'anymore', 'although', 'still', 'consider', 'youngbr', 'br', 'tell', 'anything', 'story', 'simple', 'reason', 'story', 'possible', 'dragon', 'cartoon', 'nominated', 'oscar', 'well', 'guess', 'people', '30s', 'happy', 'much', 'present', 'live', 'everything', 'must', 'happen', 'fast', 'look', 'movies', 'nowadays', 'come', 'conclusion', 'live', 'society', 'doesnt', 'allow', 'men', 'slow', 'thats', 'really', 'shame', 'wish', 'lived', '30s', 'seems', 'peaceful', 'every', 'time', 'got', 'ups', 'downs', 'guessbr', 'br', 'conclude', 'like', 'music', 'frogs', 'youll', 'see', 'cartoon', 'otherwise', 'dont', 'spill', 'time']
Example Processed Review (Bigrams): [('sure', 'like'), ('like', 'short'), ('short', 'cartoons'), ('cartoons', 'didnt'), ('didnt', 'like'), ('like', 'one'), ('one', 'naturally'), ('naturally', 'kids'), ('kids', 'would'), ('would'

In [56]:
# Visualizing Processed Text Structure
# 1. Unigram Frequency Distribution
all_unigrams = [word for review in train_data["unigrams"] for word in review]
unigram_counts = Counter(all_unigrams)
unigram_df = pd.DataFrame(
    unigram_counts.most_common(20), columns=["Unigram", "Frequency"]
)

In [57]:
# Plotting Unigram Frequency Distribution with Plotly
fig = px.bar(
    unigram_df,
    x="Unigram",
    y="Frequency",
    title="Top Unigrams after Preprocessing",
    color_discrete_sequence=["purple"],
)
fig.update_xaxes(tickangle=45)
fig.update_layout(xaxis_title="Unigram", yaxis_title="Frequency", bargap=0.1)
fig.show()

In [58]:
# 2. Bigram Frequency Distribution
all_bigrams = [bigram for review in train_data["bigrams"] for bigram in review]
bigram_counts = Counter(all_bigrams)
bigram_df = pd.DataFrame(bigram_counts.most_common(20), columns=["Bigram", "Frequency"])
bigram_df["Bigram"] = bigram_df["Bigram"].apply(" ".join)  # Convert tuples to strings


# Plotting Bigram Frequency Distribution with Plotly
fig = px.bar(
    bigram_df,
    x="Bigram",
    y="Frequency",
    title="Top Bigrams after Preprocessing",
    color_discrete_sequence=["green"],
)
fig.update_xaxes(tickangle=45)
fig.update_layout(xaxis_title="Bigram", yaxis_title="Frequency", bargap=0.1)
fig.show()

In [59]:
all_trigrams = [trigram for review in train_data["trigrams"] for trigram in review]
trigram_counts = Counter(all_trigrams)
trigram_df = pd.DataFrame(
    trigram_counts.most_common(20), columns=["Trigram", "Frequency"]
)
trigram_df["Trigram"] = trigram_df["Trigram"].apply(" ".join)  # Convert tuples to strings

In [60]:
# Plotting Trigram Frequency Distribution with Plotly
fig = px.bar(
    trigram_df,
    x="Trigram",
    y="Frequency",
    title="Top Trigrams after Preprocessing",
    color_discrete_sequence=["blue"],
)
fig.update_xaxes(tickangle=45)
fig.update_layout(xaxis_title="Trigram", yaxis_title="Frequency", bargap=0.1)
fig.show()

## Bag of Words Implementation

In [61]:
MIN_FREQ = 5


class CustomBoW:
    def __init__(self):
        self.vocab = defaultdict(int)  # Unigram vocabulary
        self.bigram_vocab = defaultdict(int)  # Bigram vocabulary
        self.trigram_vocab = defaultdict(int)  # Trigram vocabulary
        self.unknown_token = "<UNK>"
        self.vocab_size = 0
        self.bigram_vocab_size = 0
        self.trigram_vocab_size = 0

    def build_vocab(self, data):
        # Count unigrams
        for review in data["unigrams"]:
            for word in review:
                self.vocab[word] += 1

        # Count bigrams
        for review in data["bigrams"]:
            for bigram in review:
                self.bigram_vocab[bigram] += 1

        # Count trigrams
        for review in data["trigrams"]:
            for trigram in review:
                self.trigram_vocab[trigram] += 1

        # Handle rare words as unknowns
        self.vocab = {
            word: count for word, count in self.vocab.items() if count >= MIN_FREQ
        }
        self.vocab[self.unknown_token] = 1
        self.vocab_size = len(self.vocab)

        self.bigram_vocab = {
            bigram: count
            for bigram, count in self.bigram_vocab.items()
            if count >= MIN_FREQ
        }
        self.bigram_vocab[(self.unknown_token, self.unknown_token)] = 1
        self.bigram_vocab_size = len(self.bigram_vocab)

        self.trigram_vocab = {
            trigram: count
            for trigram, count in self.trigram_vocab.items()
            if count >= MIN_FREQ
        }
        self.trigram_vocab[
            (self.unknown_token, self.unknown_token, self.unknown_token)
        ] = 1
        self.trigram_vocab_size = len(self.trigram_vocab)

    def _count_vector(self, ngrams, vocab, unknown_key):
        # Map each vocabulary entry to its column once per review instead of
        # rebuilding the key list for every n-gram
        index_of = {key: i for i, key in enumerate(vocab)}
        vector = np.zeros(len(vocab))
        for ngram in ngrams:
            # Out-of-vocabulary n-grams are counted in the unknown column
            vector[index_of.get(ngram, index_of[unknown_key])] += 1
        return vector

    def get_unigram_vector(self, review):
        return self._count_vector(review, self.vocab, self.unknown_token)

    def get_bigram_vector(self, review):
        return self._count_vector(
            review, self.bigram_vocab, (self.unknown_token,) * 2
        )

    def get_trigram_vector(self, review):
        return self._count_vector(
            review, self.trigram_vocab, (self.unknown_token,) * 3
        )

    def get_laplace_smoothed_prob(self, word, ngram_type="unigram"):
        if ngram_type == "unigram":
            count = self.vocab.get(word, self.vocab[self.unknown_token])
            return (count + 1) / (sum(self.vocab.values()) + self.vocab_size)
        elif ngram_type == "bigram":
            count = self.bigram_vocab.get(
                word, self.bigram_vocab[(self.unknown_token, self.unknown_token)]
            )
            return (count + 1) / (
                sum(self.bigram_vocab.values()) + self.bigram_vocab_size
            )
        elif ngram_type == "trigram":
            count = self.trigram_vocab.get(
                word,
                self.trigram_vocab[
                    (self.unknown_token, self.unknown_token, self.unknown_token)
                ],
            )
            return (count + 1) / (
                sum(self.trigram_vocab.values()) + self.trigram_vocab_size
            )

In [62]:
# Instantiate and build the BoW dictionary
bow_model = CustomBoW()
bow_model.build_vocab(train_data)
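
The rare-word handling in `build_vocab` can be illustrated on a toy corpus. The sketch below is a simplified stand-in rather than the `CustomBoW` class itself: the helper names and the three toy reviews are hypothetical, and only unigrams are shown.

```python
from collections import Counter

UNK = "<UNK>"


def build_toy_vocab(token_lists, min_freq):
    # Keep tokens seen at least min_freq times; everything else becomes UNK
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok for tok, count in counts.items() if count >= min_freq}
    vocab.add(UNK)
    return vocab


def map_to_vocab(tokens, vocab):
    # Replace out-of-vocabulary tokens with the unknown token
    return [tok if tok in vocab else UNK for tok in tokens]


reviews = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]
vocab = build_toy_vocab(reviews, min_freq=2)
mapped = map_to_vocab(["good", "acting", "movie"], vocab)

print(sorted(vocab))
print(mapped)
```

A higher threshold shrinks the vocabulary and sends more tokens to the unknown bucket, which is the trade-off that `MIN_FREQ` controls.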

In [63]:
# Example: Generate unigram, bigram, and trigram vectors for the first review
sample_review_unigram = train_data["unigrams"].iloc[0]
sample_review_bigram = train_data["bigrams"].iloc[0]
sample_review_trigram = train_data["trigrams"].iloc[0]

unigram_vector = bow_model.get_unigram_vector(sample_review_unigram)
bigram_vector = bow_model.get_bigram_vector(sample_review_bigram)
trigram_vector = bow_model.get_trigram_vector(sample_review_trigram)

print("Unigram Vector for Sample Review:", unigram_vector)
print("Bigram Vector for Sample Review:", bigram_vector)
print("Trigram Vector for Sample Review:", trigram_vector)

Unigram Vector for Sample Review: [1. 3. 1. ... 0. 0. 7.]
Bigram Vector for Sample Review: [ 1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0. 

In [64]:
# Example: Calculate Laplace-smoothed probability for a word/unigram and a bigram
word_prob = bow_model.get_laplace_smoothed_prob("excellent", ngram_type="unigram")
bigram_prob = bow_model.get_laplace_smoothed_prob(("very", "good"), ngram_type="bigram")
tri_prob = bow_model.get_laplace_smoothed_prob(
    ("very", "good", "movie"), ngram_type="trigram"
)

print(f"Laplace-Smoothed Probability of 'excellent' (Unigram): {word_prob}")
print(f"Laplace-Smoothed Probability of ('very', 'good') (Bigram): {bigram_prob}")
print(
    f"Laplace-Smoothed Probability of ('very', 'good', 'movie') (Trigram): {tri_prob}"
)

Laplace-Smoothed Probability of 'excellent' (Unigram): 0.0008810126582278481
Laplace-Smoothed Probability of ('very', 'good') (Bigram): 0.00022568269013766644
Laplace-Smoothed Probability of ('very', 'good', 'movie') (Trigram): 0.015037593984962405


## Naive Bayes Classifier Implementation

In [65]:
class NaiveBayesClassifier:
    # n-gram type -> (training column, n-gram length, CustomBoW vocabulary attribute)
    NGRAM_SETTINGS = {
        "unigram": ("unigrams", 1, "vocab"),
        "bigram": ("bigrams", 2, "bigram_vocab"),
        "trigram": ("trigrams", 3, "trigram_vocab"),
    }

    def __init__(self, bow_model, ngram_type="unigram"):
        if ngram_type not in self.NGRAM_SETTINGS:
            raise ValueError("ngram_type must be 'unigram', 'bigram', or 'trigram'")
        self.bow_model = bow_model  # Custom BoW model
        self.ngram_type = ngram_type
        self.column, self.n, self.vocab_attr = self.NGRAM_SETTINGS[ngram_type]
        # Unknown key matches CustomBoW: "<UNK>" for unigrams, a tuple otherwise
        unk = bow_model.unknown_token
        self.unknown = unk if self.n == 1 else (unk,) * self.n
        self.class_probs = {}  # Log prior of each class
        self.word_probs = {}  # Log likelihood of each n-gram given the class

    def _vocab(self):
        # Look up the vocabulary lazily so it reflects the built BoW model
        return getattr(self.bow_model, self.vocab_attr)

    def _to_vocab(self, ngrams):
        # Map n-grams outside the vocabulary to the unknown key
        vocab = self._vocab()
        return [ngram if ngram in vocab else self.unknown for ngram in ngrams]

    def train(self, data):
        # Separate positive and negative reviews
        pos_reviews = data[data["sentiment"] == 1]
        neg_reviews = data[data["sentiment"] == 0]

        # Compute log prior of each class
        total_reviews = len(data)
        self.class_probs["pos"] = np.log(len(pos_reviews) / total_reviews)
        self.class_probs["neg"] = np.log(len(neg_reviews) / total_reviews)

        # Calculate n-gram log likelihoods with Laplace smoothing for each class
        self.word_probs["pos"] = self.calculate_word_probs(pos_reviews)
        self.word_probs["neg"] = self.calculate_word_probs(neg_reviews)

    def calculate_word_probs(self, reviews):
        # Count n-gram occurrences in this class, with rare n-grams counted as unknown
        counts = Counter(
            ngram
            for review in reviews[self.column]
            for ngram in self._to_vocab(review)
        )
        total = sum(counts.values())
        vocab = self._vocab()
        vocab_size = len(vocab)

        # Laplace smoothing: add one to every count, add vocab_size to the total
        return {
            ngram: np.log((counts[ngram] + 1) / (total + vocab_size))
            for ngram in vocab
        }

    def predict(self, review):
        # Accept raw text (as passed by evaluate_model) or a list of tokens
        words = preprocess_text(review) if isinstance(review, str) else list(review)
        ngrams = words if self.n == 1 else generate_ngrams(words, self.n)

        # Start from the log priors and add the log likelihood of each n-gram
        log_prob_pos = self.class_probs["pos"]
        log_prob_neg = self.class_probs["neg"]
        for ngram in self._to_vocab(ngrams):
            log_prob_pos += self.word_probs["pos"][ngram]
            log_prob_neg += self.word_probs["neg"][ngram]

        # Return prediction (1 for positive, 0 for negative)
        return 1 if log_prob_pos > log_prob_neg else 0

    def evaluate(self, data):
        # Evaluate the classifier on raw reviews and return the accuracy
        predictions = data["review"].apply(self.predict)
        return np.mean(predictions == data["sentiment"])

## Model Evaluation on Test Data

In [66]:
def evaluate_model(classifier, test_data):
    # Predict sentiment for each review in the test set
    predictions = test_data["review"].apply(classifier.predict)

    # Calculate evaluation metrics
    accuracy = accuracy_score(test_data["sentiment"], predictions)
    precision = precision_score(test_data["sentiment"], predictions)
    recall = recall_score(test_data["sentiment"], predictions)
    f1 = f1_score(test_data["sentiment"], predictions)

    return accuracy, precision, recall, f1

In [None]:
# Evaluate models with unigram and bigram settings
results = []

# Unigram Model Evaluation
nb_classifier_unigram = NaiveBayesClassifier(bow_model, ngram_type="unigram")
nb_classifier_unigram.train(train_data)
accuracy, precision, recall, f1 = evaluate_model(nb_classifier_unigram, test_data)
results.append(
    {
        "N-gram": "Unigram",
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
    }
)

In [None]:
# Bigram Model Evaluation
nb_classifier_bigram = NaiveBayesClassifier(bow_model, ngram_type="bigram")
nb_classifier_bigram.train(train_data)
accuracy, precision, recall, f1 = evaluate_model(nb_classifier_bigram, test_data)
results.append(
    {
        "N-gram": "Bigram",
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
    }
)

In [None]:
# Optional: Trigram Model Evaluation if implemented
nb_classifier_trigram = NaiveBayesClassifier(bow_model, ngram_type="trigram")
nb_classifier_trigram.train(train_data)
accuracy, precision, recall, f1 = evaluate_model(nb_classifier_trigram, test_data)
results.append(
    {
        "N-gram": "Trigram",
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
    }
)

In [None]:
# Summarize results in a DataFrame
results_df = pd.DataFrame(results)
print("\nModel Evaluation Results:")
print(results_df)

## Bonus: Comparison with Word Embedding and Logistic Regression

In [None]:
# Train Word2Vec model on the training dataset
def train_word2vec(data, vector_size=100, window=5, min_count=2):
    sentences = data[
        "processed_text"
    ].tolist()  # Assuming text is tokenized as a list of words in 'processed_text'
    w2v_model = Word2Vec(
        sentences, vector_size=vector_size, window=window, min_count=min_count, sg=1
    )
    return w2v_model


w2v_model = train_word2vec(train_data)

In [None]:
# Helper function to average Word2Vec embeddings for a document
def get_review_embedding(review, model, vector_size=100):
    embedding = np.zeros(vector_size)
    count = 0
    for word in review:
        if word in model.wv:
            embedding += model.wv[word]
            count += 1
    return embedding / count if count != 0 else embedding

In [None]:
# Generate averaged embeddings for each review in the training and test sets.
# The test reviews are tokenized here because only the training set has a
# "processed_text" column.
train_embeddings = np.array(
    [get_review_embedding(review, w2v_model) for review in train_data["processed_text"]]
)
test_embeddings = np.array(
    [
        get_review_embedding(preprocess_text(review), w2v_model)
        for review in test_data["review"]
    ]
)

# Extract labels
train_labels = train_data["sentiment"]
test_labels = test_data["sentiment"]

In [None]:
# Initialize and train Logistic Regression model
log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(train_embeddings, train_labels)

In [None]:
# Predict sentiments on the test set
test_predictions = log_reg_model.predict(test_embeddings)

# Compute performance metrics
accuracy = accuracy_score(test_labels, test_predictions)
precision = precision_score(test_labels, test_predictions)
recall = recall_score(test_labels, test_predictions)
f1 = f1_score(test_labels, test_predictions)

# Display results
print("Logistic Regression with Word2Vec Embeddings - Performance Metrics")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

In [None]:
# Add Word2Vec + Logistic Regression results to the comparison table
word2vec_results = {
    "N-gram": "Word2Vec Embedding + Logistic Regression",
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1,
}

In [None]:
# Append to existing results (DataFrame.append was removed in pandas 2.0)
results_df = pd.concat(
    [results_df, pd.DataFrame([word2vec_results])], ignore_index=True
)

# Display comparison table
print("\nComparison of Model Performance:")
print(results_df)

## Conclusion

In [None]:
# Melt the DataFrame to make it suitable for Plotly
results_melted = results_df.melt(
    id_vars="N-gram", var_name="Metric", value_name="Score"
)

In [None]:
# Create a bar chart with Plotly
fig = px.bar(
    results_melted,
    x="N-gram",
    y="Score",
    color="Metric",
    barmode="group",
    title="Comparison of Model Performance Metrics",
    labels={"Score": "Performance Metric Score", "N-gram": "Model Type"},
)

# Customize the layout for readability
fig.update_layout(
    xaxis_title="Model Type",
    yaxis_title="Score",
    legend_title="Metric",
    template="plotly_white",
)

fig.show()