<p style="text-align: center; font-size: 28px;"><b>Naive Bayes Classifier</b></p>

## Notebook Overview: Bayes Refresher & Naive Bayes (Text)

This notebook moves from Bayes’ rule to a working Naive Bayes sentiment classifier for short product reviews.

### Bayes refresher
We compare posterior probabilities of two classes \(c \in \{\text{positive}, \text{negative}\}\) for a given review:
\[
P(c \mid \text{review}) \;=\; \frac{P(\text{review}\mid c)\,P(c)}{P(\text{review})}.
\]
For deciding between classes, the common denominator cancels:
\[
\arg\max_{c} \; P(\text{review}\mid c)\,P(c).
\]

### Naive Bayes classifier (theory)
- Naive assumption (conditional independence): given the class \(c\), words in a review are independent.
- Bag-of-words likelihood:
\[
P(\text{review}\mid c) \;=\; \prod_{w \in \text{review}} P(w \mid c).
\]
- Laplace smoothing (avoid zeros for unseen words):
\[
P(w \mid c) \;=\; \frac{\#(w \text{ in class } c)+1}{\text{total words in } c + N},
\]
where \(N\) is the number of unique tokens in the vocabulary.
- Numerical stability: compare in log-space
\[
\log P(\text{review}\mid c) + \log P(c) \;=\; \sum_{w}\log P(w\mid c) + \log P(c).
\]
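
To see why the log-space comparison matters, here is a minimal sketch. The per-word probability of `1e-3` and the review length of 150 words are made-up illustrative values, not figures from the dataset.

```python
import math

# Hypothetical setup: a 150-word review where every word has P(w | c) = 1e-3.
word_probs = [1e-3] * 150

# Direct product: the true value (1e-450) is smaller than any positive float,
# so the running product underflows to 0.0.
product = 1.0
for p in word_probs:
    product *= p

# Log-space: the same quantity expressed as a sum of logs stays finite.
log_likelihood = sum(math.log(p) for p in word_probs)

print("direct product:", product)
print("log-likelihood:", log_likelihood)
```

Once both classes underflow to zero they can no longer be compared, whereas the two log-likelihoods remain distinct finite numbers.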

### From theory to practice
- Text \(\rightarrow\) features via CountVectorizer (bag-of-words counts).
- Train MultinomialNB with \(X=\) counts and \(y=\) labels.
- Predict with `.predict()`; inspect confidence with `.predict_proba()`.


# Bayes’ Theorem Refresher

## Independent Events

Two events are **independent** when the occurrence of one **does not influence** the probability of the other. In such cases, knowing that one event has occurred gives **no information** about whether the other will occur.

**Examples of independent events:**
- I wear a blue shirt; my coworker wears a blue shirt.  
- I take the subway to work; I eat sushi for lunch.  
- The NY Giants win their football game; the NY Rangers win their hockey game.  

Mathematically, independence between two events \( A \) and \( B \) is expressed as:

$$
P(A \cap B) = P(A) \times P(B)
$$

If this equality does **not** hold, the events are **dependent** — meaning the occurrence of one event affects the probability of the other.

**Examples of dependent events:**
- It rains on Tuesday; I carry an umbrella on Tuesday.  
- I eat spaghetti; I have a red stain on my shirt.  
- I wear sunglasses; I go to the beach.  


## Conditional Probability

Before defining conditional probability, we need the **joint probability** of two events: the likelihood that both happen together. It is most straightforward to calculate when the events are **independent**.

In probability notation, we denote the probability of an event as \( P(\text{event}) \).

If the probability of event \( A \) is \( P(A) \) and the probability of event \( B \) is \( P(B) \), and the two events are **independent**, then the probability that both occur is given by:

$$
P(A \cap B) = P(A) \times P(B)
$$

Here, the symbol \( \cap \) means “and”, so \( P(A \cap B) \) represents the probability that **both** \( A \) and \( B \) happen.

---

### Example: Rolling Two Sixes

Suppose we roll a pair of dice and want to know the probability of getting two sixes.

Each die has six sides, so the probability of rolling a six is \( \frac{1}{6} \).  
Because the dice rolls are **independent** (rolling one six does not affect the other), the joint probability is:

$$
P(6 \cap 6) = P(6) \times P(6) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}
$$

Thus, the probability of rolling two sixes is **1/36**.


## Conditional Probability and Independence

### 1. General Definition of Conditional Probability

The **conditional probability** of event \( A \) given event \( B \) is defined as:

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$

This means:

> The probability that \( A \) occurs **given that** \( B \) has already occurred equals the probability that both events happen divided by the probability that \( B \) happens.

---

### 2. Independence — What It Means

Two events \( A \) and \( B \) are **independent** if knowing that one occurred gives **no information** about the other.

Mathematically, this is expressed as:

$$
P(A \mid B) = P(A)
$$

If the occurrence of \( B \) does not change the probability of \( A \), then \( A \) and \( B \) are independent.

---

### 3. Combining the Two Equations

From the **definition of conditional probability**:

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$

and from **independence**:

$$
P(A \mid B) = P(A)
$$

Setting these equal gives:

$$
P(A) = \frac{P(A \cap B)}{P(B)}
$$

Multiplying both sides by \( P(B) \):

$$
P(A \cap B) = P(A) \times P(B)
$$

---

### 4. Interpretation

For **independent events**, the probability that both occur equals the **product of their individual probabilities**, because independence means the occurrence of one event has **no effect** on the likelihood of the other.
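
The relationships above can be checked exactly by enumerating the 36 outcomes of two fair dice. This is a small sketch; the event names (`first_is_six`, `second_is_six`, `sum_is_twelve`) and the helper `prob` are illustrative choices, not part of the original notebook.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event, given as a predicate over one outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_is_six(o):
    return o[0] == 6

def second_is_six(o):
    return o[1] == 6

def sum_is_twelve(o):
    return o[0] + o[1] == 12

def both(e1, e2):
    """Event that e1 and e2 occur together."""
    return lambda o: e1(o) and e2(o)

# Independent pair: conditioning on B leaves P(A) unchanged.
p_a = prob(first_is_six)
p_b = prob(second_is_six)
p_a_and_b = prob(both(first_is_six, second_is_six))
print("P(A and B) =", p_a_and_b, "| P(A) * P(B) =", p_a * p_b)
print("P(A | B)   =", p_a_and_b / p_b, "| P(A) =", p_a)

# Dependent pair: knowing the first die is a six changes P(sum is 12).
p_c = prob(sum_is_twelve)
p_c_given_a = prob(both(sum_is_twelve, first_is_six)) / p_a
print("P(C)       =", p_c, "| P(C | A) =", p_c_given_a)
```

Using `Fraction` keeps every probability exact, so the equalities hold without floating-point tolerance.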


In [46]:
# ## Conditional Probability and Independence
#
# This code demonstrates the relationship between two independent events using probability.
#
# Step 1: Define the concept
# If two events A and B are independent, the probability that both occur equals
# the product of their individual probabilities:
#
#     P(A ∩ B) = P(A) × P(B)
#
# Step 2: Define the probabilities of the two events
# Let's assume:
# - Event A: Probability of rain = 3/5
# - Event B: Probability of going to the gym = 0.3
#
# Step 3: Since they are independent, compute the probability of both events happening

import numpy as np

# Probability of A (rain)
p_rain = 3 / 5

# Probability of B (going to the gym)
p_gym = 0.3

# Joint probability for independent events
p_rain_and_gym = p_rain * p_gym

# Step 4: Print the result with an explanatory message
print(f"The probability of both raining and going to the gym is: {p_rain_and_gym:.2f}")


The probability of both raining and going to the gym is: 0.18


## Testing for a Rare Disease

Imagine you are a doctor testing a patient for a **rare disease**. The test is highly accurate — it gives the correct result **99% of the time**. However, the disease itself is extremely rare, occurring in only **1 out of 100,000** patients.

You run the test and the result is **positive**. At first glance, it may seem that the patient almost certainly has the disease, since the test is only wrong 1% of the time.  
But this intuition is misleading — because the **rarity of the disease** drastically affects the true probability.

When the test result is positive, there are two possible scenarios:

1. The patient **has** the disease, and the test correctly identifies it.  
2. The patient **does not have** the disease, but the test incorrectly reports a positive result (a **false positive**).

This problem illustrates the importance of considering **base rates** (the overall rarity of the condition) when interpreting test results — a concept at the heart of **Bayesian reasoning**.


In [47]:
# ## Testing for a Rare Disease
#
# Step 1: Define the known probabilities
# - Probability of being sick (disease prevalence): 1 in 100,000
# - Probability of being healthy: 1 - sick_rate
# - Probability that the test is correct: 99%
# - Probability that the test is incorrect (false result): 1%
#
# Step 2: Calculate the probabilities of two key scenarios:
#   a) The patient has the disease AND the test correctly identifies it.
#   b) The patient does not have the disease AND the test incorrectly identifies it.
#
# Step 3: Save the results into variables and print them as both raw values and percentages.

import numpy as np

# Define base probabilities
sick_rate = 1 / 100000           # Probability of having the disease
health_rate = 1 - sick_rate      # Probability of not having the disease
test_correct = 0.99              # Test correctly identifies the condition
test_false = 1 - test_correct    # Test gives a false result

# Probability of having the disease and the test being correct
p_disease_and_correct = sick_rate * test_correct

# Probability of not having the disease and the test being incorrect
p_no_disease_and_incorrect = health_rate * test_false

# Print results
print(f"Probability (disease and correctly diagnosed): {p_disease_and_correct:.8f} ({p_disease_and_correct * 100:.6f}%)")
print(f"Probability (no disease and incorrectly diagnosed): {p_no_disease_and_incorrect:.8f} ({p_no_disease_and_incorrect * 100:.6f}%)")


Probability (disease and correctly diagnosed): 0.00000990 (0.000990%)
Probability (no disease and incorrectly diagnosed): 0.00999990 (0.999990%)


## Bayes’ Theorem

In the previous example, we found two key probabilities:

- The patient **had the disease**, and the test correctly diagnosed it: ≈ 0.00001  
- The patient **did not have the disease**, but the test incorrectly diagnosed it: ≈ 0.01  

Although both events are rare, the false positive was about **1,000 times more likely** than a true positive.  
This illustrates the importance of considering **disease prevalence** — not just test accuracy — when interpreting diagnostic results.

---

### Conditional Probability and Bayes’ Theorem

In statistics, we represent the probability of event \( A \) occurring **given that** event \( B \) has occurred as \( P(A \mid B) \).

In our medical test scenario, we want to find:

$$
P(\text{rare disease} \mid \text{positive result})
$$

That is, the probability that the patient **has the disease given** that the test result was **positive**.

Bayes’ Theorem provides a way to calculate this:

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

Here:
- \( P(A) \) = prior probability of the event (e.g., having the disease)  
- \( P(B \mid A) \) = probability of a positive result **given** the disease is present (test sensitivity)  
- \( P(B) \) = overall probability of a positive result (true positives + false positives)

---

### Applying It to the Example

In this context:

$$
P(\text{rare disease} \mid \text{positive result}) = \frac{P(\text{positive result} \mid \text{rare disease}) \cdot P(\text{rare disease})}{P(\text{positive result})}
$$

This formula helps us **update our belief** about how likely it is that the patient truly has the disease **after seeing the test result**.

---

### Key Insight

The terms \( P(A \mid B) \) and \( P(B \mid A) \) are **not the same** — the order matters:

- \( P(A \mid B) \): probability the patient **has** the disease given the test was positive  
- \( P(B \mid A) \): probability the test is **positive given** the patient has the disease  

Bayes’ Theorem allows us to correctly relate these two conditional probabilities.


In [48]:
import numpy as np

# Step 1: Define the test characteristics
# Sensitivity measures how well the test identifies true positives.
# Specificity measures how well the test identifies true negatives.
# Here, the test is said to be "99% accurate," which we interpret as:
#   - 99% sensitivity (correctly detects sick people)
#   - 99% specificity (correctly clears healthy people)
# In this case, the *values* are equal (both 99%), but conceptually, they refer to
# different directions of correctness.

p_positive_given_disease = 0.99     # Sensitivity
p_negative_given_no_disease = 0.99  # Specificity
p_positive_given_no_disease = 1 - p_negative_given_no_disease  # False positive rate

print(f"Sensitivity (P(positive | disease)): {p_positive_given_disease*100:.2f}%")
print(f"Specificity (P(negative | no disease)): {p_negative_given_no_disease*100:.2f}%")
print(f"False Positive Rate (P(positive | no disease)): {p_positive_given_no_disease*100:.2f}%")

# Step 2: Define disease prevalence (extremely rare)
p_disease = 1 / 100000
p_no_disease = 1 - p_disease

print(f"\nDisease prevalence (P(disease)): {p_disease*100:.5f}%")
print(f"No disease (P(no disease)): {p_no_disease*100:.5f}%")

# Step 3: Compute the total probability of testing positive (Law of Total Probability)
# This combines true positives and false positives.
p_positive = (p_disease * p_positive_given_disease) + (p_no_disease * p_positive_given_no_disease)
print(f"\nOverall probability of a positive test (P(positive)): {p_positive*100:.4f}%")

# Step 4: Apply Bayes’ Theorem
# P(disease | positive) = [P(positive | disease) * P(disease)] / P(positive)
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print(f"Probability of disease given a positive test (P(disease | positive)): {p_disease_given_positive*100:.4f}%")

# Step 5: Interpretation
print("\nInterpretation:")
print(f"Even though the test is 99% sensitive and 99% specific, the disease is so rare "
      f"that a positive test result only implies about {p_disease_given_positive*100:.4f}% "
      f"chance of actually having the disease (roughly 1 in {int(1/p_disease_given_positive):,}).")


Sensitivity (P(positive | disease)): 99.00%
Specificity (P(negative | no disease)): 99.00%
False Positive Rate (P(positive | no disease)): 1.00%

Disease prevalence (P(disease)): 0.00100%
No disease (P(no disease)): 99.99900%

Overall probability of a positive test (P(positive)): 1.0010%
Probability of disease given a positive test (P(disease | positive)): 0.0989%

Interpretation:
Even though the test is 99% sensitive and 99% specific, the disease is so rare that a positive test result only implies about 0.0989% chance of actually having the disease (roughly 1 in 1,011).
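
The same posterior can be checked by counting people instead of multiplying probabilities. The sketch below assumes an imaginary population of 10,000,000 patients, chosen only so that every count is a whole number; the prevalence and test rates are the ones used above.

```python
# Imaginary population, large enough that every count is a whole number.
population = 10_000_000

sick = population // 100_000            # prevalence of 1 in 100,000
healthy = population - sick

true_positives = sick * 99 // 100       # 99% sensitivity
false_positives = healthy * 1 // 100    # 1% false positive rate

all_positives = true_positives + false_positives
share_sick = true_positives / all_positives

print(f"Positive tests: {all_positives:,} "
      f"({true_positives} true, {false_positives:,} false)")
print(f"Share of positive tests that are truly sick: {share_sick:.4%}")
```

Out of roughly one hundred thousand positive tests, only 99 belong to sick patients, which is why the posterior is so small.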


## Spam Filters and Bayes’ Theorem

Email **spam filters** apply Bayes’ Theorem to estimate how likely a message is spam, given the words it contains.  
For example, consider the word **“enhancement”**, which tends to appear more frequently in spam than in regular emails.

We know the following facts:

- “enhancement” appears in **0.1% of non-spam emails**
- “enhancement” appears in **5% of spam emails**
- Spam emails make up **20% of all emails**

We want to find:

\[
P(\text{spam} \mid \text{enhancement})
\]
— the probability that an email is **spam**, given that it contains the word “enhancement”.

---

### Step 1: Define the probabilities

\[
P(\text{enhancement} \mid \text{spam}) = 0.05
\]  
\[
P(\text{enhancement} \mid \text{not spam}) = 0.001
\]  
\[
P(\text{spam}) = 0.2
\]  
\[
P(\text{not spam}) = 0.8
\]

---

### Step 2: Find the total probability that an email contains “enhancement”

By the **law of total probability**:

\[
P(\text{enhancement}) =
P(\text{enhancement} \mid \text{spam}) P(\text{spam}) +
P(\text{enhancement} \mid \text{not spam}) P(\text{not spam})
\]

\[
= (0.05 \times 0.2) + (0.001 \times 0.8)
= 0.0108
\]

So, about **1.08% of all emails** contain the word “enhancement.”

---

### Step 3: Apply Bayes’ Theorem

\[
P(\text{spam} \mid \text{enhancement}) =
\frac{P(\text{enhancement} \mid \text{spam}) \times P(\text{spam})}{P(\text{enhancement})}
\]

\[
= \frac{0.05 \times 0.2}{0.0108}
= 0.9259
\]

---

### Step 4: Interpretation

Even though spam is only **20% of all emails**, the presence of “enhancement” makes it **92.6% likely** that the email is spam.  
This is the essence of **Bayesian filtering** — updating our belief based on evidence.

A spam filter uses this principle for many words simultaneously, continuously learning which ones best predict spam-like behavior.


In [49]:
import numpy as np

# Step 1: Define the events
# A = "spam"
# B = "enhancement"
# We want to find: P(spam | enhancement)

a = 'spam'
b = 'enhancement'

# Step 2: Base probabilities
# 20% of emails are spam
p_spam = 0.2

# 80% are not spam
p_no_spam = 1 - p_spam

# "enhancement" appears in 5% of spam emails
p_enhancement_given_spam = 0.05

# "enhancement" appears in 0.1% of non-spam emails
p_enhancement_given_no_spam = 0.001

# Step 3: Compute total probability of "enhancement"
# Law of total probability:
# P(enhancement) = P(enhancement | spam)*P(spam) + P(enhancement | no spam)*P(no spam)
p_enhancement = (p_enhancement_given_spam * p_spam) + (p_enhancement_given_no_spam * p_no_spam)

# Step 4: Apply Bayes' Theorem
# P(spam | enhancement) = [P(enhancement | spam) * P(spam)] / P(enhancement)
p_spam_given_enhancement = (p_enhancement_given_spam * p_spam) / p_enhancement

# Step 5: Print results in percentage form
print(f"P(spam) = {p_spam*100:.1f}%")
print(f"P(enhancement | spam) = {p_enhancement_given_spam*100:.1f}%")
print(f"P(enhancement | not spam) = {p_enhancement_given_no_spam*100:.2f}%")
print(f"P(enhancement) = {p_enhancement*100:.2f}%")
print(f"\nP(spam | enhancement) = {p_spam_given_enhancement*100:.2f}%")

# Step 6: Interpretation
print("\nInterpretation:")
print(f"If an email contains '{b}', there is a {p_spam_given_enhancement*100:.2f}% chance it is spam.")
print(f"However, about {p_enhancement_given_no_spam*100:.2f}% of legitimate emails also contain '{b}',")
print("so a spam filter should not rely on a single word but combine multiple indicators for accuracy.")


P(spam) = 20.0%
P(enhancement | spam) = 5.0%
P(enhancement | not spam) = 0.10%
P(enhancement) = 1.08%

P(spam | enhancement) = 92.59%

Interpretation:
If an email contains 'enhancement', there is a 92.59% chance it is spam.
However, about 0.10% of legitimate emails also contain 'enhancement',
so a spam filter should not rely on a single word but combine multiple indicators for accuracy.
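
As a rough sketch of combining several indicators, the function below multiplies per-word rates under the naive independence assumption. The rates for "enhancement" come from the example above; the entry for "meeting" is a made-up illustrative value, and `spam_posterior` is a hypothetical helper name.

```python
# (P(word | spam), P(word | not spam)) for each word.
# "enhancement" uses the rates from the example; "meeting" is a made-up entry.
word_rates = {
    "enhancement": (0.05, 0.001),
    "meeting": (0.002, 0.03),
}
p_spam = 0.2

def spam_posterior(words):
    """P(spam | words), treating words as conditionally independent given the class."""
    spam_score = p_spam
    ham_score = 1 - p_spam
    for word in words:
        p_word_spam, p_word_ham = word_rates[word]
        spam_score *= p_word_spam
        ham_score *= p_word_ham
    # Normalize the two numerators so they sum to 1.
    return spam_score / (spam_score + ham_score)

print(f"enhancement only:      {spam_posterior(['enhancement']):.4f}")
print(f"enhancement + meeting: {spam_posterior(['enhancement', 'meeting']):.4f}")
```

With these assumed rates, a second word that is more typical of legitimate mail pulls the posterior back below 50%, which is the sense in which multiple indicators are combined.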


# Naive Bayes Classifier

## Introduction

A **Naive Bayes classifier** is a type of **supervised machine learning algorithm** that applies **Bayes’ Theorem** to perform predictions and classifications.

Recall Bayes’ Theorem:

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

This formula calculates the probability of event \( A \) given that event \( B \) has occurred.

---

### From Bayes’ Theorem to Classification

In classification tasks, we can think of:
- \( A \) → a **class label** (e.g., "spam" or "not spam")
- \( B \) → the **observed data** (e.g., the contents of an email)

The model computes:
- \( P(\text{spam} \mid \text{email}) \)
- \( P(\text{not spam} \mid \text{email}) \)

The classifier then predicts the class with the **higher posterior probability**.

---

### Why It’s Called “Naive”

The algorithm assumes that all features (e.g., words in an email) are **conditionally independent** given the class.  
This simplification is rarely true in real-world data — hence the term *“naive”* — but it works surprisingly well for many problems like spam filtering, sentiment analysis, and text categorization.

---

### Why It’s Supervised

Naive Bayes is **supervised** because it learns from **labeled training data**.  
To estimate probabilities such as \( P(\text{spam}) \) or \( P(\text{word} \mid \text{spam}) \), we use previously tagged examples — for instance, a dataset where each email is labeled as spam or not spam.

---

In short, a **Naive Bayes classifier** uses Bayes’ Theorem with independence assumptions to predict which class a new data point most likely belongs to, based on learned probabilities from historical data.


## Investigate the Data

In this lesson, we build a **Naive Bayes classifier** to predict whether a product review is **positive** or **negative**.  
Such a model allows companies to quickly gauge public sentiment about a new product without manually reading thousands of reviews or social media posts.

The dataset consists of **Amazon baby product reviews**.  
Originally, it included multiple features such as the reviewer’s name, review date, and product rating.  
For this task, only two key features are used:

- The **text** of the review  
- The **sentiment label** (“positive” or “negative”)

Reviews with a score **below 4** are labeled as **negative**, while those with a score of **4 or higher** are labeled as **positive**.

To keep training efficient, only a **small subset** of the dataset is loaded in the next lessons.  
The **full dataset** will be used later when assembling the complete model.
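
The labeling rule and the per-class word counts can be sketched as follows. The reviews in `toy_reviews` are invented, and splitting on whitespace is an assumption about how counters such as `pos_counter` might be built; the notebook itself loads them precomputed.

```python
from collections import Counter

# Invented (score, text) pairs standing in for the real reviews.
toy_reviews = [
    (5, "This crib was great"),
    (2, "The crib broke after a week"),
    (4, "Great stroller and easy to fold"),
]

def label_for(score):
    """Scores of 4 or higher are positive; anything below 4 is negative."""
    return "positive" if score >= 4 else "negative"

# One word counter per class, filled by splitting each review on whitespace.
counters = {"positive": Counter(), "negative": Counter()}
for score, text in toy_reviews:
    counters[label_for(score)].update(text.split())

print("Positive 'crib' count:", counters["positive"]["crib"])
print("Negative 'crib' count:", counters["negative"]["crib"])
```

Note that a plain `split()` is case-sensitive, so "great" and "Great" are counted as different words in this sketch.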


In [68]:
import pickle
from collections import Counter
from pathlib import Path

data_dir = Path("data_folder")

# --- Load counters from pickle files ---
with open(data_dir / "pos_counter.pkl", "rb") as f:
    pos_counter: Counter = pickle.load(f)

with open(data_dir / "neg_counter.pkl", "rb") as f:
    neg_counter: Counter = pickle.load(f)

# --- Example usage ---
print("Positive 'crib' count:", pos_counter["crib"])
print("Negative 'crib' count:", neg_counter["crib"])
print("Top 5 positive words:", pos_counter.most_common(5))
print("Top 5 negative words:", neg_counter.most_common(5))

Positive 'crib' count: 1
Negative 'crib' count: 12
Top 5 positive words: [('to', 147), ('and', 130), ('the', 126), ('I', 119), ('a', 102)]
Top 5 negative words: [('the', 309), ('to', 170), ('I', 157), ('and', 128), ('a', 110)]


## Bayes Theorem I

In this lesson, we aim to build a classifier that predicts whether the review *"This crib was amazing"* is **positive** or **negative**.  
To do so, we calculate the probabilities:

$$
P(\text{positive} \mid \text{review}) \quad \text{and} \quad P(\text{negative} \mid \text{review})
$$

and determine which one is greater.

We’ll use **Bayes’ Theorem** to compute these probabilities.  
For example, for the positive case:

$$
P(\text{positive} \mid \text{review}) = \frac{P(\text{review} \mid \text{positive}) \cdot P(\text{positive})}{P(\text{review})}
$$

The first term we’ll focus on is **$P(\text{positive})$**, which represents the probability that any given review in our dataset is positive.  
To compute this, we count the total number of reviews labeled as positive and divide by the total number of reviews (both positive and negative).

At this stage, our goal is to estimate this prior probability:

$$
P(\text{positive})
$$

which forms part of the numerator in Bayes’ Theorem.


In [69]:
import pickle
with open("data_folder/reviews_list_2.p", "rb") as f:
    neg_list = pickle.load(f)  # negative reviews
with open("data_folder/pos_list.p", "rb") as f:
    pos_list = pickle.load(f)  # positive reviews

In [70]:
# Step 1: Find the total number of positive and negative reviews
# len(pos_list) gives the total count of positive reviews
# len(neg_list) gives the total count of negative reviews
# Add them together to get the total number of reviews in the dataset

total_reviews = len(pos_list) + len(neg_list)

# Step 2: Calculate the proportion of positive and negative reviews
# percent_pos = number of positive reviews divided by total reviews
# percent_neg = number of negative reviews divided by total reviews
# These represent P(positive) and P(negative) respectively
percent_pos = len(pos_list) / total_reviews
percent_neg = len(neg_list) / total_reviews

# Step 3: Print the results to verify they sum to 1 (or very close to 1 due to rounding)
print("Percent positive:", percent_pos)
print("Percent negative:", percent_neg)
print("Sum (should be 1):", percent_pos + percent_neg)

Percent positive: 0.5
Percent negative: 0.5
Sum (should be 1): 1.0


## Bayes Theorem II

We continue classifying the review *“This crib was amazing”*.  
In this step, we compute the second component of Bayes’ Theorem:

$$
P(\text{positive} \mid \text{review}) = \frac{P(\text{review} \mid \text{positive}) \cdot P(\text{positive})}{P(\text{review})}
$$

Here, we focus on the likelihood term:

$$
P(\text{review} \mid \text{positive})
$$

This represents the probability of observing the exact review *given that it is positive*.  
In other words, assuming the review is positive, how likely is it to contain exactly the words “This”, “crib”, “was”, and “amazing”?

To simplify this computation, we make the **conditional independence assumption**: given the class, the appearance of one word does not depend on the presence of another.  
Although this is a strong assumption, it is fundamental to the **Naive Bayes** approach.

Under this assumption, we can expand the equation as follows:

$$
P(\text{“This crib was amazing”} \mid \text{positive}) = P(\text{“This”} \mid \text{positive}) \cdot P(\text{“crib”} \mid \text{positive}) \cdot P(\text{“was”} \mid \text{positive}) \cdot P(\text{“amazing”} \mid \text{positive})
$$

Each term, such as $P(\text{“crib”} \mid \text{positive})$, represents the probability that a word drawn at random from the positive reviews is “crib”.  
This can be calculated as:

$$
P(\text{“crib”} \mid \text{positive}) = \frac{\text{# of times “crib” appears in positive reviews}}{\text{total # of words in positive reviews}}
$$

By calculating this probability for every word in the review and multiplying them together, we obtain:

$$
P(\text{review} \mid \text{positive})
$$

which forms the **likelihood** part of Bayes’ Theorem.


In [71]:
review = "This crib was amazing"

# Assuming equal prior probabilities for positive and negative reviews
percent_pos = 0.5
percent_neg = 0.5

total_pos = sum(pos_counter.values())  # Total words in all positive reviews
total_neg = sum(neg_counter.values())  # Total words in all negative reviews

# Step 2: Initialize probabilities for positive and negative classes
# These will accumulate the product of conditional probabilities for each word.
pos_probability = 1
neg_probability = 1

# Step 3: Split the review text into individual words
# Example: "This crib was amazing" → ["This", "crib", "was", "amazing"]
review_words = review.split()

# Step 4: Loop through each word in the review
# For each word, we retrieve its count in the positive and negative review datasets.
# Step 5: Multiply by the conditional probability P(word | positive)
# Step 6: Repeat for the negative case
for word in review_words:
    word_in_pos = pos_counter[word]   # Number of times 'word' appears in positive reviews
    word_in_neg = neg_counter[word]   # Number of times 'word' appears in negative reviews

    pos_probability *= word_in_pos / total_pos  # Update positive probability
    neg_probability *= word_in_neg / total_neg  # Update negative probability
    
print("")
print("P(review | positive):", pos_probability)
print("P(review | negative):", neg_probability)


P(review | positive): 0.0
P(review | negative): 0.0


In [72]:
neg_counter["crib"]

12

## Smoothing

In the previous step, we computed probabilities such as:

$$
P(\text{“crib”} \mid \text{positive}) = \frac{\text{# of “crib” in positive reviews}}{\text{# of words in positive reviews}}
$$

However, if the word *“crib”* never appears in any positive review, the numerator becomes 0.  
Since the Naive Bayes classifier multiplies probabilities for all words, this would make the entire product:

$$
P(\text{review} \mid \text{positive}) = 0
$$

This problem is especially common when the review contains **typos** or **unseen words** — words that do not exist in the training dataset.

To address this, we apply a technique called **smoothing** (specifically, *Laplace smoothing*).  
Smoothing prevents zero probabilities by adding a small constant (usually 1) to every word count and adjusting the denominator accordingly.

The smoothed probability becomes:

$$
P(\text{“crib”} \mid \text{positive}) = \frac{\text{# of “crib” in positive reviews} + 1}{\text{# of words in positive reviews} + N}
$$

where **N** is the total number of **unique words** in the dataset.

This ensures that no word has zero probability, improving the robustness of the model when classifying unseen or misspelled words.


In [73]:
review = "This cribb was amazing"

# Assuming equal priors for simplicity
percent_pos = 0.5
percent_neg = 0.5

# Step 2: Calculate total word counts for both positive and negative datasets
total_pos = sum(pos_counter.values())
total_neg = sum(neg_counter.values())

# Step 3: Initialize probabilities
# We'll compute both versions: before and after smoothing
pos_probability_no_smoothing = 1
neg_probability_no_smoothing = 1

pos_probability_smoothing = 1
neg_probability_smoothing = 1

# Step 4: Split the review into individual words
review_words = review.split()

# Step 5: Loop through each word to calculate probabilities
for word in review_words:
    word_in_pos = pos_counter[word]
    word_in_neg = neg_counter[word]

    # --- Without Smoothing ---
    # If a word doesn't exist in the dataset, its count is 0 → probability becomes 0
    pos_probability_no_smoothing *= word_in_pos / total_pos
    neg_probability_no_smoothing *= word_in_neg / total_neg

    # --- With Laplace Smoothing ---
    # Add 1 to the numerator and the number of unique words to the denominator.
    # Note: this cell uses each class's own unique-word count (len of its counter) as N.
    pos_probability_smoothing *= (word_in_pos + 1) / (total_pos + len(pos_counter))
    neg_probability_smoothing *= (word_in_neg + 1) / (total_neg + len(neg_counter))

# Step 6: Print results for comparison
print("=== WITHOUT SMOOTHING ===")
print("P(review | positive):", pos_probability_no_smoothing)
print("P(review | negative):", neg_probability_no_smoothing)

print("\n=== WITH LAPLACE SMOOTHING ===")
print("P(review | positive):", pos_probability_smoothing)
print("P(review | negative):", neg_probability_smoothing)

# Step 7: Observe how smoothing prevents zero probabilities
# Even though "cribb" does not appear in the dataset, smoothing ensures
# the overall product does not collapse to zero.


=== WITHOUT SMOOTHING ===
P(review | positive): 0.0
P(review | negative): 0.0

=== WITH LAPLACE SMOOTHING ===
P(review | positive): 1.0906857688451484e-12
P(review | negative): 1.8834508880130966e-13


## Classify

We have now computed both components of the numerator in Bayes’ Theorem:

$$
P(\text{positive} \mid \text{review}) = \frac{P(\text{review} \mid \text{positive}) \cdot P(\text{positive})}{P(\text{review})}
$$

Similarly, for the negative case:

$$
P(\text{negative} \mid \text{review}) = \frac{P(\text{review} \mid \text{negative}) \cdot P(\text{negative})}{P(\text{review})}
$$

Notice that the denominator, $P(\text{review})$, is the same in both equations.  
Since we only need to determine **which probability is greater**, we can ignore the denominator entirely.

Therefore, the classification rule becomes:

$$
\text{Predict} =
\begin{cases}
\text{Positive}, & \text{if } P(\text{review} \mid \text{positive}) \cdot P(\text{positive}) > P(\text{review} \mid \text{negative}) \cdot P(\text{negative}) \\
\text{Negative}, & \text{otherwise.}
\end{cases}
$$

This means we simply compare the **numerators**.


In [74]:
review = "This crib was amazing"

# Prior probabilities (assuming equal likelihood for both classes)
percent_pos = 0.5
percent_neg = 0.5

# Step 2: Calculate total word counts for each dataset
total_pos = sum(pos_counter.values())
total_neg = sum(neg_counter.values())

# Step 3: Initialize likelihood probabilities
pos_probability = 1
neg_probability = 1

# Step 4: Split the review into individual words
review_words = review.split()

# Step 5: Loop through each word in the review and compute smoothed conditional probabilities
for word in review_words:
    word_in_pos = pos_counter[word]  # Number of times word appears in positive reviews
    word_in_neg = neg_counter[word]  # Number of times word appears in negative reviews

    # Apply Laplace smoothing
    pos_probability *= (word_in_pos + 1) / (total_pos + len(pos_counter))
    neg_probability *= (word_in_neg + 1) / (total_neg + len(neg_counter))

# Step 6: Multiply by prior probabilities to get final class probabilities
final_pos = pos_probability * percent_pos
final_neg = neg_probability * percent_neg

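# Step 7: Print the computed scores. These are the numerators of Bayes' Theorem
# (not divided by P(review)), so they rank the classes but are not normalized posteriors.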
print("Final P(positive | review):", final_pos)
print("Final P(negative | review):", final_neg)

# Step 8: Compare and classify the review
# The class with the higher probability is the predicted label
if final_pos > final_neg:
    print("The review is positive")
else:
    print("The review is negative")


Final P(positive | review): 1.0906857688451484e-12
Final P(negative | review): 1.2242430772085127e-12
The review is negative
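
The two printed values are the numerators only. Since $P(\text{review})$ is the sum of the two numerators, dividing by that sum recovers posteriors that add up to 1. A minimal sketch, with the scores copied from the output above:

```python
# Scores copied from the cell output above.
final_pos = 1.0906857688451484e-12
final_neg = 1.2242430772085127e-12

# P(review) is the sum of the two numerators (law of total probability),
# so dividing by it turns the scores into posteriors that sum to 1.
evidence = final_pos + final_neg
posterior_pos = final_pos / evidence
posterior_neg = final_neg / evidence

print(f"P(positive | review) = {posterior_pos:.4f}")
print(f"P(negative | review) = {posterior_neg:.4f}")
```

The ranking is unchanged by this step, which is why the classifier can skip it, but the normalized values are easier to read as confidence.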


In [75]:
review = "This crib was terrible"

# Step 2: Define equal prior probabilities for both classes
percent_pos = 0.5
percent_neg = 0.5

# Step 3: Calculate total word counts for each sentiment dataset
total_pos = sum(pos_counter.values())
total_neg = sum(neg_counter.values())

# Step 4: Initialize likelihood probabilities
pos_probability = 1
neg_probability = 1

# Step 5: Split the review into individual words
review_words = review.split()

# Step 6: Loop through each word and apply Laplace smoothing
# For each word, we calculate P(word | positive) and P(word | negative)
for word in review_words:
    word_in_pos = pos_counter[word]  # Word frequency in positive reviews
    word_in_neg = neg_counter[word]  # Word frequency in negative reviews

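    # Apply Laplace (add-one) smoothing to avoid zero probabilities.
    # Note: each denominator adds that class's own distinct-word count; the formula
    # given earlier adds N, the size of the vocabulary shared by both classes.
    pos_probability *= (word_in_pos + 1) / (total_pos + len(pos_counter))
    neg_probability *= (word_in_neg + 1) / (total_neg + len(neg_counter))

# Step 7: Multiply by the priors to get unnormalized class scores
# (proportional to the posteriors, because P(review) is dropped)
final_pos = pos_probability * percent_pos
final_neg = neg_probability * percent_neg

# Step 8: Display the scores (labelled as posteriors, but not normalized to sum to 1)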
print("Final P(positive | review):", final_pos)
print("Final P(negative | review):", final_neg)

# Step 9: Classify the review based on which probability is greater
if final_pos > final_neg:
    print("The review is positive")
else:
    print("The review is negative")

Final P(positive | review): 1.0906857688451484e-12
Final P(negative | review): 1.2242430772085127e-12
The review is negative
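
The scores above are on the order of 1e-12 for a four-word review, and longer reviews push the product toward floating-point underflow. The sketch below redoes the same comparison in log-space, as suggested in the overview. It is a minimal illustration under stated assumptions: the two `Counter` objects are hypothetical toy counts (not the notebook's data), `log_scores` is a helper name introduced here, the review is lowercased before splitting, and smoothing uses the shared vocabulary size N from the overview formula.

```python
import math
from collections import Counter

# Hypothetical toy counts standing in for the notebook's pos_counter / neg_counter.
pos_counter = Counter({"this": 3, "crib": 2, "was": 2, "great": 4, "amazing": 3})
neg_counter = Counter({"this": 3, "crib": 2, "was": 3, "terrible": 4, "broken": 2})


def log_scores(review, pos_counter, neg_counter, prior_pos=0.5, prior_neg=0.5):
    """Return (log score for positive, log score for negative) for a review."""
    total_pos = sum(pos_counter.values())
    total_neg = sum(neg_counter.values())
    # N: number of distinct tokens across both classes (shared vocabulary).
    vocab_size = len(set(pos_counter) | set(neg_counter))

    # Start from the log priors and add one log-likelihood term per word.
    log_pos = math.log(prior_pos)
    log_neg = math.log(prior_neg)
    for word in review.lower().split():
        # Add-one smoothing; a Counter returns 0 for unseen words.
        log_pos += math.log((pos_counter[word] + 1) / (total_pos + vocab_size))
        log_neg += math.log((neg_counter[word] + 1) / (total_neg + vocab_size))
    return log_pos, log_neg


log_pos, log_neg = log_scores("This crib was terrible", pos_counter, neg_counter)
label = "positive" if log_pos > log_neg else "negative"
print(f"log score positive: {log_pos:.4f}")
print(f"log score negative: {log_neg:.4f}")
print("The review is", label)
```

Because the logarithm is monotonic, comparing the two sums gives the same decision as comparing the products, while the sums stay finite even for reviews with hundreds of words.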


## Formatting the Data for scikit-learn

Congratulations! You’ve built your own Naive Bayes text classifier.
If you have a dataset of text labeled with different classes, your classifier can now predict which class a new document belongs to.

We’ll now explore how Python’s scikit-learn library can handle this process for us automatically.

### Transforming Text into Numerical Data

To use scikit-learn’s Naive Bayes classifier, we must first transform our text data into a numerical format.
This is done with `CountVectorizer`, which converts text into a matrix of word counts, often referred to as the bag-of-words model.

### Step 1: Create and Fit the Vectorizer

We begin by creating a `CountVectorizer` and fitting it to our training data.
The `.fit()` method learns the vocabulary of the training set.
After fitting, the vectorizer holds a mapping from each vocabulary word to a column index, for example:

{'training': 3, 'review': 1, 'one': 0, 'second': 2}


### Step 2: Transform Text into Word Counts

Once the vectorizer is fitted, `.transform()` converts new text into count vectors.
This produces an array such as:

[[1 2 0 0]]

Interpretation, reading the columns in index order (`one`, `review`, `second`, `training`):

- “one” appears once
- “review” appears twice
- “second” and “training” do not appear


### Step 3: Understanding the Vocabulary

You can print `vectorizer.vocabulary_` to view the mapping between each word and its index.
Each index identifies the word’s column in the count matrix.

If a word (like “two”) was not present in the data passed to `.fit()`, it is not in the vocabulary and is silently ignored by `.transform()`.
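
Steps 1 to 3 can be reproduced with a few lines. This is a small sketch, assuming a hypothetical two-document training set and input sentence chosen so that the vocabulary and counts match the examples above; the variable names `training_docs`, `vectorizer`, and `counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document training set chosen to yield the vocabulary shown above.
training_docs = ["Training review one", "Second review"]

vectorizer = CountVectorizer()  # lowercases text by default
vectorizer.fit(training_docs)

# Word -> column index mapping (indices follow alphabetical order of the tokens).
print(vectorizer.vocabulary_)

# "two" was never seen during fit, so it has no column and is dropped.
counts = vectorizer.transform(["one review two review"])
print(counts.toarray())  # [[1 2 0 0]]
```

`.transform()` returns a sparse matrix, which is why `.toarray()` is called before printing.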


### Step 4: Using the Count Data in Naive Bayes

The count matrix is now ready to be used as input for a Naive Bayes classifier such as `MultinomialNB`.
The typical workflow is as follows:

1. Fit the `CountVectorizer` on the training text.
2. Transform the text into word count vectors.
3. Train the Naive Bayes classifier using those vectors.
4. Transform new text data and make predictions.
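
The four workflow steps can also be bundled into one object. The sketch below uses scikit-learn's `make_pipeline` as one possible way to do that; it is not the approach this notebook takes in its own cells, and the four training sentences and their labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset (0 = negative, 1 = positive).
train_texts = [
    "great crib love it",
    "wonderful sturdy crib",
    "terrible crib broke",
    "awful waste of money",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together, and reuses
# the same fitted vocabulary whenever new text is passed in.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_texts = ["great wonderful", "terrible awful"]
predictions = model.predict(new_texts)
probabilities = model.predict_proba(new_texts)
print(predictions)
print(probabilities.round(3))
```

Keeping the vectorizer inside the pipeline removes the risk of transforming new text with a different vocabulary than the one the classifier was trained on.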

In [76]:
# Step 1: Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Step 2: Define the new review
review = "This crib was amazing"

# Step 3: Create a CountVectorizer and name it 'counter'
counter = CountVectorizer()

# Step 4: Fit the vectorizer on the combined list of negative and positive reviews
# This teaches the vectorizer the full vocabulary across all training data.
counter.fit(neg_list + pos_list)

# Step 5: Print the learned vocabulary
# The output is a dictionary mapping each word to an index.
print("Vocabulary learned by CountVectorizer:")
print(counter.vocabulary_)

# Step 6: Transform the new review into a count vector
# .transform() converts text into an array of word frequencies.
# We wrap 'review' in a list because transform expects an iterable of strings.
review_counts = counter.transform([review])

# Step 7: Convert the sparse matrix to a readable dense array
# Columns for review words that are in the vocabulary hold their counts (1 each here);
# all other columns are 0, and words outside the vocabulary are dropped.
print("\nWord counts for the review:")
print(review_counts.toarray())

# Step 8: Transform the full training dataset
# This converts all training reviews into count vectors that will be used for model training.
training_counts = counter.transform(neg_list + pos_list)


Vocabulary learned by CountVectorizer:
{'wanted': 1521, 'to': 1429, 'love': 805, 'this': 1408, 'but': 182, 'it': 712, 'was': 1525, 'pretty': 1056, 'expensive': 467, 'for': 525, 'only': 951, 'few': 495, 'months': 871, 'worth': 1584, 'of': 937, 'calendar': 187, 'pages': 981, 'ended': 434, 'up': 1486, 'buying': 185, 'regular': 1130, 'weekly': 1541, 'planner': 1024, '55': 11, 'off': 938, 'the': 1393, 'that': 1392, 'is': 709, '11': 2, 'and': 63, 'has': 618, 'all': 47, 'seven': 1219, 'days': 339, 'on': 947, 'right': 1163, 'page': 980, 'left': 765, 'room': 1166, 'write': 1588, 'do': 380, 'list': 785, 'goals': 577, 'found': 539, 'be': 120, 'more': 873, 'helpful': 633, 'because': 123, 'could': 306, 'mark': 823, 'each': 409, 'day': 337, 'eating': 417, 'sleeping': 1252, 'blocks': 149, 'then': 1397, 'also': 55, 'see': 1207, 'them': 1395, 'side': 1235, 'by': 186, 'her': 636, 'patterns': 993, 'easily': 413, 'with': 1568, 'view': 1511, 'cute': 328, 'just': 724, 'not': 919, 'what': 1550, 'like': 778, 

## Using scikit-learn

Now that our data is formatted correctly, we can train a **Naive Bayes classifier** using **scikit-learn’s `MultinomialNB`** model.

This classifier is designed for discrete data such as word counts, making it ideal for **text classification**.

---

### Step 1: Training the Model

The model is trained using the `.fit()` method.  
This method takes two parameters:
1. The **array of data points** — in this case, the matrix of word counts we created (`training_counts`).
2. The **array of labels** — a list indicating whether each review is positive or negative.

During training, the classifier learns the statistical relationship between word occurrences and their associated class labels.

---

### Step 2: Making Predictions

Once trained, we can use the `.predict()` method to classify new data.  
It takes a list of data points (such as the transformed vector of a new review) and returns the **predicted labels**.

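For example:
- A review containing positive words like “great” or “amazing” would be expected to come out as **positive**.
- A review containing negative words like “terrible” or “broken” would be expected to come out as **negative**.

With a toy dataset like this one, the actual predictions can differ from these expectations, as the run below shows.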

---

### Step 3: Checking Probabilities

If we want more than just the final prediction, we can use the `.predict_proba()` method.  
This method returns the **probability** of each possible label, allowing us to see how confident the model is in its prediction.

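Example output for a single, strongly positive review might look like `[[0.049777 0.950223]]`: one column per class, in the order given by `classifier.classes_`.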

In [81]:
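import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# ------------------------------------------------------------
#  Step 1: Training the Model
# ------------------------------------------------------------

# Step 1.1: Build labels aligned with the order used to stack texts
#   - 0 for each negative review, 1 for each positive review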
training_texts = neg_list + pos_list
training_labels = np.array([0] * len(neg_list) + [1] * len(pos_list))

# Step 1.2: Vectorize texts into count features
counter = CountVectorizer()
training_counts = counter.fit_transform(training_texts)

# Step 1.3: Initialize and fit the Multinomial Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(training_counts, training_labels)

# ------------------------------------------------------------
#  Step 2: Making Predictions
# ------------------------------------------------------------

# Step 2.1: Create some new reviews to classify (feel free to edit these)
new_reviews = [
    "this crib was great amazing and wonderful",   # likely positive
    "this crib was terrible and broken",           # likely negative
    "okay quality but not impressive"              # ambiguous/neutral-ish
]

# Step 2.2: Transform new reviews into count vectors using the SAME fitted vectorizer
new_counts = counter.transform(new_reviews)

# Step 2.3: Predict the class labels (0=negative, 1=positive)
pred_labels = classifier.predict(new_counts)

# ------------------------------------------------------------
#  Step 3: Checking Probabilities
# ------------------------------------------------------------

# Step 3.1: Get class probabilities for each review
#   - columns correspond to classes in classifier.classes_ (sorted)
pred_proba = classifier.predict_proba(new_counts)

# Step 3.2: Pretty-print results
label_names = {0: "negative", 1: "positive"}
print("Classes order:", classifier.classes_)  # e.g., [0 1]
print()

for text, label, proba in zip(new_reviews, pred_labels, pred_proba):
    neg_p, pos_p = proba[0], proba[1]  # assuming classes_ = [0, 1]
    print(f"Review: {text!r}")
    print(f" Predicted label: {label} ({label_names[label]})")
    print(f" Probabilities -> negative: {neg_p:.6f}, positive: {pos_p:.6f}")
    print("-" * 60)

# For a strongly positive review, the raw predict_proba output might resemble:
# [[0.049777 0.950223]]  (values will differ with this toy dataset)

Classes order: [0 1]

Review: 'this crib was great amazing and wonderful'
 Predicted label: 0 (negative)
 Probabilities -> negative: 0.699059, positive: 0.300941
------------------------------------------------------------
Review: 'this crib was terrible and broken'
 Predicted label: 0 (negative)
 Probabilities -> negative: 0.831156, positive: 0.168844
------------------------------------------------------------
Review: 'okay quality but not impressive'
 Predicted label: 0 (negative)
 Probabilities -> negative: 0.974334, positive: 0.025666
------------------------------------------------------------


# Final Remarks — What You Built, Limits, and Next Steps

You implemented the full Naive Bayes workflow for text: token counts, priors, smoothed likelihoods, and posterior comparison. The model is fast, simple, and effective on many sparse text tasks.

### Practical tips
- Ensure the number of rows in `X` equals the number of labels (`X.shape[0] == len(y)`) and that label order matches how texts were stacked.
- Prefer log-probabilities for numerical stability.
- Use proper train/validation/test splits; evaluate with accuracy, precision/recall, F1, and confusion matrices.
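
The shape check and the held-out evaluation from the tips above might look like the following. This is a sketch under assumptions: the twelve labeled sentences are hypothetical placeholders for a real dataset, and the split size and `random_state` are arbitrary choices. With so little data the printed metrics are not meaningful in themselves; the point is the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled reviews (0 = negative, 1 = positive); replace with real data.
texts = [
    "great crib", "love this crib", "wonderful and sturdy", "amazing quality",
    "works great", "really love it", "terrible crib", "broke after a week",
    "awful quality", "waste of money", "really terrible", "broken on arrival",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out a stratified test set so both classes appear in each split.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Sanity check from the tips: one label per row.
assert X_train.shape[0] == len(y_train)
assert X_test.shape[0] == len(y_test)

classifier = MultinomialNB().fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(classification_report(y_test, y_pred, labels=[0, 1], zero_division=0))
```

Fitting the vectorizer only on the training split keeps test-set vocabulary from leaking into the model.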

### Limitations
- The bag-of-words representation discards word order, and the independence assumption ignores interactions between words and their context.
- Performance can drop with domain shift or heavy misspellings; smoothing helps but is limited.
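
The loss of word order is easy to see directly. In this small sketch, two hypothetical reviews with opposite meanings use exactly the same words, so their count vectors are identical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews: same words, opposite meaning.
docs = ["the crib was good, not bad", "the crib was bad, not good"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

print(sorted(vectorizer.vocabulary_))  # column order of the count matrix
print(counts)

# Both rows are identical, so any model on these features must score them the same.
identical = bool((counts[0] == counts[1]).all())
print("Identical vectors:", identical)
```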

### Useful extensions
- Replace raw counts with TF-IDF features.
- Add n-grams (bigrams/trigrams) to capture short phrases.
- Try alternative linear models (e.g., Logistic Regression, Linear SVM) on the same features.
- Apply light preprocessing: lowercasing, punctuation/number handling, and task-dependent stopword treatment.
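
Several of these extensions can be combined in a few lines. The sketch below is one possible arrangement, not a recommended configuration: the six sentences are hypothetical, `make_features` is a helper name introduced here, and the `ngram_range` and `max_iter` values are arbitrary starting points.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reviews (0 = negative, 1 = positive).
texts = [
    "really good crib",
    "good quality and sturdy",
    "love it, works great",
    "not good, broke quickly",
    "not great, poor quality",
    "terrible, not sturdy",
]
labels = [1, 1, 1, 0, 0, 0]


def make_features():
    # TF-IDF weights over unigrams and bigrams; lowercasing is the default.
    return TfidfVectorizer(ngram_range=(1, 2), lowercase=True)


models = {
    "MultinomialNB + TF-IDF": make_pipeline(make_features(), MultinomialNB()),
    "LogisticRegression + TF-IDF": make_pipeline(
        make_features(), LogisticRegression(max_iter=1000)
    ),
}

new_texts = ["not good at all", "really good crib"]
results = {}
for name, model in models.items():
    model.fit(texts, labels)
    results[name] = model.predict(new_texts)
    print(name, "->", results[name])

# Bigrams such as "not good" are now features in their own right.
nb_vocab = models["MultinomialNB + TF-IDF"].named_steps["tfidfvectorizer"].vocabulary_
print("'not good' is a feature:", "not good" in nb_vocab)
```

Swapping the final pipeline step is enough to compare classifiers on identical features, which keeps such comparisons fair.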

These steps give you a strong, reproducible baseline for many text-classification problems and a clear path for iterative improvement.
