In [33]:
from collections import defaultdict, Counter
import math
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Naive Bayes

Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem. It is called "naïve" because it assumes that the features are independent of each other given the class, which is often not true in real-world data. Despite this simplification, it works well for many tasks, especially text classification.

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

$P(A)$: The prior probability of $A$, i.e., the probability of $A$ before observing any evidence.

$P(B)$: The probability of $B$, often calculated as:

$$
P(B) = \sum_{i} P(B \mid A_i) \cdot P(A_i)
$$

$P(A \mid B)$: The probability of event $A$ (the hypothesis) given that $B$ (the evidence) has occurred. This is the posterior probability.

$P(B \mid A)$: The probability of observing $B$ given that $A$ is true. This is the likelihood.

Because this approach is so simple, it's quite fast to train and to run, and it is often a good fit for simple text classification tasks.
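
To make the formula concrete, here is a small numeric sketch. The numbers (a 30% spam prior and the two likelihoods for the word "free") are made up for illustration and do not come from any dataset in this notebook.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_spam = 0.3               # P(A): prior probability that a message is spam
p_free_given_spam = 0.6    # P(B | A): likelihood of seeing "free" in spam
p_free_given_ham = 0.1     # P(B | not A): likelihood of seeing "free" in non-spam

# P(B): total probability of the evidence, summed over both hypotheses
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# P(A | B): posterior probability of spam after observing "free"
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))
```

Seeing the word "free" raises the probability of spam from the prior of 0.3 to a posterior of 0.72 under these assumed numbers.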

In [34]:
docs = [
    "I love coding",     
    "coding is fun", 
    "I hate bugs",       
    "debugging is hard", 
]
labels = ["Positive", "Positive", "Negative", "Negative"]

In [35]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs)

In [36]:
X_train

<4x8 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [37]:
model = MultinomialNB()
model.fit(X_train, labels)

In [38]:
test_sentence = ["I hate bugs"]
X_test = vectorizer.transform(test_sentence)
predicted_class = model.predict(X_test)[0]

print(f"The sentence '{test_sentence[0]}' is predicted to be: {predicted_class}")

The sentence 'I hate bugs' is predicted to be: Negative


## Count Vectorizer

In [6]:
texts = ["love coding",
         "love Python",
         "Python is great"]
vectorizer = CountVectorizer()
X_vectors = vectorizer.fit_transform(texts)

In [7]:
print("\nFeature names (vocabulary):")
print(vectorizer.get_feature_names_out())


Feature names (vocabulary):
['coding' 'great' 'is' 'love' 'python']


In [8]:
print("Dense matrix:")
print(X_vectors.toarray())

Dense matrix:
[[1 0 0 1 0]
 [0 0 0 1 1]
 [0 1 1 0 1]]


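Feature names are the words in the vocabulary. Each word has a column index.

Note that a single-character word such as "I" (as in "I love coding" from the earlier example) never makes it into the vocabulary. This is not stop-word removal, which is off by default (`stop_words=None`). The default `token_pattern` only keeps tokens of two or more characters, so "I" is dropped. Words are also lowercased, which is why "Python" appears as `python`.

Each row of the dense matrix holds the number of times each vocabulary word occurs in the corresponding sentence. For example, the first sentence is "love coding". These are the indexes:
- love is index 3
- coding is index 0

Therefore, in the first row I expect to see 1 at indexes 0 and 3, and zeros everywhere else. This **IS** what I get.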
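
The missing "I" can be reproduced without scikit-learn. The sketch below lowercases one sentence and applies the regular expression that `CountVectorizer` uses as its default `token_pattern`. The helper name `default_tokens` is made up for this illustration.

```python
import re

# Default token_pattern of CountVectorizer: tokens of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_tokens(text):
    """Hypothetical helper mimicking CountVectorizer's default lowercasing and tokenizing."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

tokens = default_tokens("I love coding")
print(tokens)

# A pattern that also accepts single-character tokens keeps "i".
tokens_with_single = re.findall(r"(?u)\b\w+\b", "I love coding".lower())
print(tokens_with_single)
```

If single-character words matter for a task, passing a pattern like the second one as `token_pattern` to `CountVectorizer` should keep them.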

In [14]:
data = [
    ("Discover the best hiking trails", "Not Spam"),
    ("Win a trip to the Amazon jungle", "Spam"),
    ("Experience the beauty of forests", "Not Spam"),
    ("Exclusive safari deal for you", "Spam"),
    ("Save the whales, donate today", "Not Spam"),
    ("Free camping gear with purchase", "Spam")
]

messages, labels = zip(*data)

mdl = make_pipeline(CountVectorizer(), MultinomialNB())

mdl.fit(messages, labels)

In [15]:
test_message = ["Free hiking gear for you"]
predicted_class = mdl.predict(test_message)
predicted_proba = mdl.predict_proba(test_message)

print(f"Message: {test_message[0]}")
print(f"Predicted Class: {predicted_class[0]}")
print("Predicted Probabilities:", dict(zip(mdl.classes_, predicted_proba[0])))

Message: Free hiking gear for you
Predicted Class: Spam
Predicted Probabilities: {'Not Spam': 0.1229815214288261, 'Spam': 0.8770184785711737}


In [16]:
test_message = ["donate for whales"]
predicted_class = mdl.predict(test_message)
predicted_proba = mdl.predict_proba(test_message)

print(f"Message: {test_message[0]}")
print(f"Predicted Class: {predicted_class[0]}")
print("Predicted Probabilities:", dict(zip(mdl.classes_, predicted_proba[0])))

Message: donate for whales
Predicted Class: Not Spam
Predicted Probabilities: {'Not Spam': 0.6818129064532267, 'Spam': 0.31818709354677377}



# Naive Bayes Text Classification with a Nature Theme

### Step 1: Data Preparation

#### Training Data
| Message                             | Class    |
|-------------------------------------|----------|
| "Discover the best hiking trails"   | Not Spam |
| "Win a trip to the Amazon jungle"   | Spam     |
| "Experience the beauty of forests"  | Not Spam |
| "Exclusive safari deal for you"     | Spam     |
| "Save the whales, donate today"     | Not Spam |
| "Free camping gear with purchase"   | Spam     |


### Vocabulary
The unique words across all messages form the vocabulary:

In [21]:
# The single-character word 'a' is left out, matching CountVectorizer's default tokenization.
vocab = ['discover', 'the', 'best', 'hiking', 'trails', 'win', 'trip', 'to',
                'amazon', 'jungle', 'experience', 'beauty', 'of', 'forests', 'exclusive', 'safari',
                'deal', 'for', 'you', 'save', 'whales', 'donate', 'today', 'free', 'camping', 'gear',
                'with', 'purchase']

We use this vocabulary to calculate word probabilities for each class.

### Step 2: Word Frequency Calculation

In [22]:
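# 'a' is dropped (single character) and 'the' from "to the Amazon jungle" is included,
# matching CountVectorizer's default tokenization.
spam_words = ['win', 'trip', 'to', 'the', 'amazon', 'jungle', 'exclusive', 'safari',
        'deal', 'for', 'you', 'free', 'camping', 'gear', 'with', 'purchase']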

In [23]:
len(spam_words)

16

In [24]:
non_spam_words = ['discover', 'the', 'best', 'hiking', 'trails', 'experience', 'the', 'beauty', 'of', 'forests',
                'save', 'the', 'whales', 'donate', 'today']

In [25]:
len(non_spam_words)

15


### Step 3: Laplace Smoothing

To handle words that may not appear in a specific class, we apply Laplace smoothing.  
The formula is:

$$ P(\text{Word} \mid \text{Class}) = \frac{\text{Word Count in Class} + 1}{\text{Total Words in Class} + \text{Vocabulary Size}} $$

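**Vocabulary Size = 28** (total unique words across all messages, not counting the single-character word "a", which `CountVectorizer` drops by default).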

#### Example Calculations:
- For the word "win" in Spam:
$$ P(\text{win} \mid \text{Spam}) = \frac{1 + 1}{16 + 28} = \frac{2}{44} \approx 0.0455 $$

- For the word "forests" in Spam (not present):
$$ P(\text{forests} \mid \text{Spam}) = \frac{0 + 1}{16 + 28} = \frac{1}{44} \approx 0.0227 $$
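
The two calculations above can be checked with a few lines of plain Python. This is a minimal sketch, assuming the Spam messages are tokenized the way `CountVectorizer` does by default (lowercased, single-character "a" dropped). The function name `smoothed_prob` and its `alpha` parameter are names chosen for this illustration.

```python
from collections import Counter

# Spam tokens as CountVectorizer would produce them by default.
spam_tokens = ("win trip to the amazon jungle "
               "exclusive safari deal for you "
               "free camping gear with purchase").split()
VOCAB_SIZE = 28  # unique words across all six training messages

def smoothed_prob(word, class_counts, vocab_size, alpha=1):
    """Laplace-smoothed P(word | class); alpha=1 is add-one smoothing."""
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocab_size)

spam_counts = Counter(spam_tokens)
p_win = smoothed_prob("win", spam_counts, VOCAB_SIZE)
p_forests = smoothed_prob("forests", spam_counts, VOCAB_SIZE)
print(round(p_win, 4), round(p_forests, 4))
```

A `Counter` returns 0 for words it has never seen, so "forests" needs no special handling: smoothing alone keeps its probability above zero.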



### Step 4: Prior Probabilities

The prior probabilities are based on the class distribution in the dataset.

- **Spam**:
$$ P(\text{Spam}) = \frac{\text{Spam Messages}}{\text{Total Messages}} = \frac{3}{6} = 0.5 $$

- **Not Spam**:
$$ P(\text{Not Spam}) = \frac{\text{Not Spam Messages}}{\text{Total Messages}} = \frac{3}{6} = 0.5 $$
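
The same priors can be computed directly from the label list. A short sketch, where `train_labels` is a name introduced here for the six labels of the training data:

```python
from collections import Counter

train_labels = ["Not Spam", "Spam", "Not Spam", "Spam", "Not Spam", "Spam"]

# Prior of each class = share of training messages with that label
label_counts = Counter(train_labels)
priors = {label: count / len(train_labels) for label, count in label_counts.items()}
print(priors)
```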



### Step 5: Classifying a New Message

Let’s classify the message: `"free hiking gear for you"`.

We calculate the posterior probabilities for both Spam and Not Spam using Bayes’ theorem:

$$ P(\text{Spam} \mid \text{Message}) \propto P(\text{Spam}) \prod_{\text{Word} \in \text{Message}} P(\text{Word} \mid \text{Spam}) $$

$$ P(\text{Not Spam} \mid \text{Message}) \propto P(\text{Not Spam}) \prod_{\text{Word} \in \text{Message}} P(\text{Word} \mid \text{Not Spam}) $$

Next, we compute these probabilities step-by-step.
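
As a worked sketch, assuming the same tokenization as `CountVectorizer` (16 words in Spam, 15 words in Not Spam, vocabulary size 28), the smoothed probabilities of "free", "hiking", "gear", "for" and "you" give these unnormalized scores:

$$ P(\text{Spam}) \prod_{\text{Word} \in \text{Message}} P(\text{Word} \mid \text{Spam}) = 0.5 \cdot \frac{2}{44} \cdot \frac{1}{44} \cdot \frac{2}{44} \cdot \frac{2}{44} \cdot \frac{2}{44} = \frac{8}{44^5} \approx 4.85 \times 10^{-8} $$

$$ P(\text{Not Spam}) \prod_{\text{Word} \in \text{Message}} P(\text{Word} \mid \text{Not Spam}) = 0.5 \cdot \frac{1}{43} \cdot \frac{2}{43} \cdot \frac{1}{43} \cdot \frac{1}{43} \cdot \frac{1}{43} = \frac{1}{43^5} \approx 6.80 \times 10^{-9} $$

Dividing each score by their sum turns them into probabilities:

$$ P(\text{Spam} \mid \text{Message}) \approx \frac{4.85 \times 10^{-8}}{4.85 \times 10^{-8} + 6.80 \times 10^{-9}} \approx 0.877 $$

This agrees with the `predict_proba` output of the scikit-learn pipeline shown earlier for the same message.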


Notation:

$P(\text{Spam} \mid \text{Message})$: The probability the message is spam given its content.

$P(\text{Spam})$: The prior probability that any message is spam.

$P(\text{Word} \mid \text{Spam})$: The likelihood of each word in the message appearing in spam.

The formula calculates a score proportional to the probability of a message being spam by combining the prior probability of spam with the probabilities of each word in the message appearing in spam, assuming the words are independent given the class.

Basically, you start with the chance of a message being spam or not spam regardless of the message itself. Let's say the chance of being spam is 0.3.

Then, for each word in the message, you take the probability of that word appearing in spam, and you multiply all of these word probabilities together with 0.3. The result is a score for spam. You do the same for not spam, and the class with the higher score is the prediction. Dividing each score by the sum of both scores turns them into probabilities.
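
The procedure can be sketched in plain Python for the toy dataset. This is a minimal illustration under a few assumptions, not a reference implementation. The tokenizer imitates `CountVectorizer`'s defaults, words outside the training vocabulary are skipped, and logarithms are summed instead of multiplying raw probabilities to avoid underflow on long messages. The function names (`tokenize`, `log_scores`, `posteriors`) are made up for this sketch.

```python
import math
import re
from collections import Counter

data = [
    ("Discover the best hiking trails", "Not Spam"),
    ("Win a trip to the Amazon jungle", "Spam"),
    ("Experience the beauty of forests", "Not Spam"),
    ("Exclusive safari deal for you", "Spam"),
    ("Save the whales, donate today", "Not Spam"),
    ("Free camping gear with purchase", "Spam"),
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

word_counts = {}        # class -> Counter of word occurrences
doc_counts = Counter()  # class -> number of training messages
for message, label in data:
    word_counts.setdefault(label, Counter()).update(tokenize(message))
    doc_counts[label] += 1

vocab = {word for counts in word_counts.values() for word in counts}

def log_scores(message):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / len(data))
        for word in tokenize(message):
            if word in vocab:  # skip words never seen in training
                score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores

def posteriors(message):
    """Normalize the log scores into probabilities that sum to 1."""
    scores = log_scores(message)
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(exps.values())
    return {label: value / norm for label, value in exps.items()}

result = posteriors("Free hiking gear for you")
print(max(result, key=result.get), result)
```

For "Free hiking gear for you" this yields roughly 0.877 for Spam, in line with the `predict_proba` output of the scikit-learn pipeline shown earlier.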

In [None]:
data = [
    ("Discover the best hiking trails", "Not Spam"),
    ("Win a trip to the Amazon jungle", "Spam"),
    ("Experience the beauty of forests", "Not Spam"),
    ("Exclusive safari deal for you", "Spam"),
    ("Save the whales, donate today", "Not Spam"),
    ("Free camping gear with purchase", "Spam")
]

## Exercise

Build your own Naive Bayes model from scratch

Instructions:

1. Tokenize the data in whatever way you see fit
2. Create a data structure that for each class, holds the probability of each word in that class. For example:
- in the class spam: free: 0.3, commit: 0.03 (made up numbers)
3. Given a sentence, calculate its probability for each class and choose the higher probability as the prediction

Once it's ready, use it to classify the IMDB dataset.
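
For step 1, here is one possible tokenizer as a starting point. IMDB reviews contain HTML line breaks (`<br />`), so this sketch strips tag-like text before splitting into words. The function name `tokenize_review` and the choice to keep only lowercase alphabetic words (with an optional inner apostrophe) are assumptions. Other tokenization choices are equally valid.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")              # HTML-like tags such as <br />
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # lowercase words, optional inner apostrophe

def tokenize_review(text):
    """Hypothetical tokenizer: remove tags, lowercase, extract words."""
    cleaned = TAG_RE.sub(" ", text).lower()
    return WORD_RE.findall(cleaned)

sample_tokens = tokenize_review("You'll be hooked.<br /><br />The first thing")
print(sample_tokens)
```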

In [27]:
df = pd.read_csv('../datasets/IMDB Dataset.csv')

In [28]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [40]:
df.iloc[0]['review']

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [None]:
# Made-up numbers: per-class word probabilities, keyed by class name.
word_probs = {'spam': {'free': 0.3, 'commit': 0.03},
              'non_spam': {'free': 0.1, 'commit': 0.75}}
