1.	Performing it manually. To build a Naïve Bayes model by hand, write code that does the following:

a.	Generate a Bag of Words (for word frequency)

b.	Calculate the priors for the classes HAM and SPAM

c.	Calculate the likelihood of the tokens in the vocabulary with respect to the class.


In [6]:
from collections import Counter
import math
import re

In [7]:
# Dataset
documents = [
    ("Free money now!!!", "SPAM"),
    ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"),
    ("Are we still on for dinner?", "HAM"),
    ("Win a free iPhone today", "SPAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"),
    ("Get 50% off, limited time!", "SPAM"),
    ("Team meeting in the office", "HAM"),
    ("Click here for prizes!", "SPAM"),
    ("Can you send the report?", "HAM")
]

print("Dataset loaded:")
for i, (doc, label) in enumerate(documents, 1):
    print(f"{i}. [{label}] {doc}")

Dataset loaded:
1. [SPAM] Free money now!!!
2. [HAM] Hi mom, how are you?
3. [SPAM] Lowest price for your meds
4. [HAM] Are we still on for dinner?
5. [SPAM] Win a free iPhone today
6. [HAM] Let's catch up tomorrow at the office
7. [HAM] Meeting at 3 PM tomorrow
8. [SPAM] Get 50% off, limited time!
9. [HAM] Team meeting in the office
10. [SPAM] Click here for prizes!
11. [HAM] Can you send the report?


In [8]:
# Tokenizer: lowercase the text and extract runs of word characters
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())
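A quick check of what this tokenizer returns can be useful before counting words. The sketch below repeats the same `tokenize` function so it runs on its own, and applies it to two sentences from the dataset; the `samples` list is only an illustrative name.

```python
import re

def tokenize(text):
    # Same rule as above: lowercase, then keep runs of word characters
    return re.findall(r'\b\w+\b', text.lower())

samples = [
    "Let's catch up tomorrow at the office",
    "Get 50% off, limited time!",
]

for sample in samples:
    print(tokenize(sample))
# ['let', 's', 'catch', 'up', 'tomorrow', 'at', 'the', 'office']
# ['get', '50', 'off', 'limited', 'time']
```

Note that the apostrophe splits "Let's" into `let` and `s`, and that punctuation such as `%` and `!` is discarded, which explains the `s` and `50` entries in the vocabulary.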

a. Generate Bag of Words

In [9]:

# 1. Build the per-class bag of words, document counts, and vocabulary
bow = {'SPAM': Counter(), 'HAM': Counter()}
class_counts = {'SPAM': 0, 'HAM': 0}
vocab = set()

for doc, label in documents:
    tokens = tokenize(doc)
    bow[label].update(tokens)
    class_counts[label] += 1
    vocab.update(tokens)

print("\nBag of Words (word frequencies):")
for label in bow:
    print(f"{label}: {dict(bow[label])}")

print("\nVocabulary:")
print(sorted(vocab))


Bag of Words (word frequencies):
SPAM: {'free': 2, 'money': 1, 'now': 1, 'lowest': 1, 'price': 1, 'for': 2, 'your': 1, 'meds': 1, 'win': 1, 'a': 1, 'iphone': 1, 'today': 1, 'get': 1, '50': 1, 'off': 1, 'limited': 1, 'time': 1, 'click': 1, 'here': 1, 'prizes': 1}
HAM: {'hi': 1, 'mom': 1, 'how': 1, 'are': 2, 'you': 2, 'we': 1, 'still': 1, 'on': 1, 'for': 1, 'dinner': 1, 'let': 1, 's': 1, 'catch': 1, 'up': 1, 'tomorrow': 2, 'at': 2, 'the': 3, 'office': 2, 'meeting': 2, '3': 1, 'pm': 1, 'team': 1, 'in': 1, 'can': 1, 'send': 1, 'report': 1}

Vocabulary:
['3', '50', 'a', 'are', 'at', 'can', 'catch', 'click', 'dinner', 'for', 'free', 'get', 'here', 'hi', 'how', 'in', 'iphone', 'let', 'limited', 'lowest', 'meds', 'meeting', 'mom', 'money', 'now', 'off', 'office', 'on', 'pm', 'price', 'prizes', 'report', 's', 'send', 'still', 'team', 'the', 'time', 'today', 'tomorrow', 'up', 'we', 'win', 'you', 'your']


b.	Calculate the priors for the classes HAM and SPAM

In [10]:
# 2. Calculate priors
total_docs = sum(class_counts.values())
priors = {label: class_counts[label] / total_docs for label in class_counts}
print("\nPriors:")
for label in priors:
    print(f"P({label}) = {priors[label]:.3f}")


Priors:
P(SPAM) = 0.455
P(HAM) = 0.545
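The priors are simply document counts divided by the total: 5 of the 11 documents are SPAM and 6 are HAM. As a rough cross-check, the sketch below recomputes them as exact fractions and also takes their logarithms, since log priors are what the classification step adds up. The names `exact_priors` and `log_priors` are illustrative, and the counts are hard-coded from the dataset listing.

```python
import math
from fractions import Fraction

# Document counts per class, read off the dataset listing above
class_counts = {'SPAM': 5, 'HAM': 6}
total_docs = sum(class_counts.values())

# Exact priors: P(class) = documents in class / all documents
exact_priors = {label: Fraction(n, total_docs) for label, n in class_counts.items()}

# Log priors, as used when summing log probabilities at prediction time
log_priors = {label: math.log(float(p)) for label, p in exact_priors.items()}

for label, p in exact_priors.items():
    print(f"P({label}) = {p} = {float(p):.3f}, log prior = {log_priors[label]:.3f}")
```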


c. Calculate the likelihood of the tokens in the vocabulary with respect to the class.

In [11]:
# 3. Calculate likelihoods with Laplace smoothing
likelihoods = {label: {} for label in bow}
for label in bow:
    total_words = sum(bow[label].values())
    for word in vocab:
        # Laplace smoothing
        likelihoods[label][word] = (bow[label][word] + 1) / (total_words + len(vocab))

print("\nLikelihoods (P(word|class)):")
for label in likelihoods:
    print(f"\nClass: {label}")
    for word in sorted(vocab):
        print(f"P({word}|{label}) = {likelihoods[label][word]:.4f}")


Likelihoods (P(word|class)):

Class: SPAM
P(3|SPAM) = 0.0149
P(50|SPAM) = 0.0299
P(a|SPAM) = 0.0299
P(are|SPAM) = 0.0149
P(at|SPAM) = 0.0149
P(can|SPAM) = 0.0149
P(catch|SPAM) = 0.0149
P(click|SPAM) = 0.0299
P(dinner|SPAM) = 0.0149
P(for|SPAM) = 0.0448
P(free|SPAM) = 0.0448
P(get|SPAM) = 0.0299
P(here|SPAM) = 0.0299
P(hi|SPAM) = 0.0149
P(how|SPAM) = 0.0149
P(in|SPAM) = 0.0149
P(iphone|SPAM) = 0.0299
P(let|SPAM) = 0.0149
P(limited|SPAM) = 0.0299
P(lowest|SPAM) = 0.0299
P(meds|SPAM) = 0.0299
P(meeting|SPAM) = 0.0149
P(mom|SPAM) = 0.0149
P(money|SPAM) = 0.0299
P(now|SPAM) = 0.0299
P(off|SPAM) = 0.0299
P(office|SPAM) = 0.0149
P(on|SPAM) = 0.0149
P(pm|SPAM) = 0.0149
P(price|SPAM) = 0.0299
P(prizes|SPAM) = 0.0299
P(report|SPAM) = 0.0149
P(s|SPAM) = 0.0149
P(send|SPAM) = 0.0149
P(still|SPAM) = 0.0149
P(team|SPAM) = 0.0149
P(the|SPAM) = 0.0149
P(time|SPAM) = 0.0299
P(today|SPAM) = 0.0299
P(tomorrow|SPAM) = 0.0149
P(up|SPAM) = 0.0149
P(we|SPAM) = 0.0149
P(win|SPAM) = 0.0299
P(you|SPAM) = 0.0149
P(your|SPAM) = 0.0299
...
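The printed values follow directly from the counts: SPAM has 22 tokens and the vocabulary has 45 words, so every SPAM likelihood has denominator 22 + 45 = 67 (for example, `free` occurs twice, giving 3/67 ≈ 0.0448). The sketch below is one way to sanity-check this, assuming the same dataset and tokenizer; it verifies that the smoothed likelihoods of each class sum to 1 over the vocabulary. The helper `smoothed_likelihood` and its `alpha` parameter are illustrative names, not part of the cells above.

```python
import math
import re
from collections import Counter

# Same toy dataset as above, repeated so this check runs on its own
documents = [
    ("Free money now!!!", "SPAM"),
    ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"),
    ("Are we still on for dinner?", "HAM"),
    ("Win a free iPhone today", "SPAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"),
    ("Get 50% off, limited time!", "SPAM"),
    ("Team meeting in the office", "HAM"),
    ("Click here for prizes!", "SPAM"),
    ("Can you send the report?", "HAM"),
]

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

bow = {'SPAM': Counter(), 'HAM': Counter()}
vocab = set()
for doc, label in documents:
    tokens = tokenize(doc)
    bow[label].update(tokens)
    vocab.update(tokens)

def smoothed_likelihood(word, label, alpha=1):
    """P(word | label) with additive smoothing (alpha=1 is Laplace smoothing)."""
    total_words = sum(bow[label].values())
    return (bow[label][word] + alpha) / (total_words + alpha * len(vocab))

for label in bow:
    total = sum(smoothed_likelihood(word, label) for word in vocab)
    print(f"{label}: {sum(bow[label].values())} tokens, likelihoods sum to {total:.4f}")
```

Because each class distributes its probability mass over the same 45 words, a sum other than 1 would point to a mismatch between the counts and the denominator.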

d. Determine the class of each of the following test sentences:

i.	Limited offer, click here!

ii.	Meeting at 2 PM with the manager.


In [12]:


test_sentences = [
    "Limited offer, click here!",
    "Meeting at 2 PM with the manager."
]

for test in test_sentences:
    tokens = tokenize(test)
    scores = {}
    for label in ['SPAM', 'HAM']:
        # Start with the log prior
        log_prob = math.log(priors[label])
        total_words = sum(bow[label].values())
        for token in tokens:
            if token in vocab:
                log_prob += math.log(likelihoods[label][token])
            else:
                # Unseen token: its count is 0, so Laplace smoothing gives 1 / (total_words + |V|)
                log_prob += math.log(1 / (total_words + len(vocab)))
        scores[label] = log_prob
    predicted = max(scores, key=scores.get)
    print(f"Sentence: \"{test}\"")
    print(f"Predicted class: {predicted}\n")

Sentence: "Limited offer, click here!"
Predicted class: SPAM

Sentence: "Meeting at 2 PM with the manager."
Predicted class: HAM
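The loop above only compares unnormalized log scores, which is enough to pick a class. If class probabilities are wanted as well, the scores can be normalized with the log-sum-exp trick. The sketch below is a minimal illustration for the first test sentence, with the log scores rebuilt from the counts shown earlier (22 SPAM tokens, 34 HAM tokens, 45 vocabulary words); `normalize_log_scores` is a hypothetical helper name.

```python
import math

def normalize_log_scores(log_scores):
    """Convert unnormalized log scores into probabilities that sum to 1."""
    peak = max(log_scores.values())
    # Subtract the largest score before exponentiating to avoid underflow
    shifted = {label: math.exp(score - peak) for label, score in log_scores.items()}
    total = sum(shifted.values())
    return {label: value / total for label, value in shifted.items()}

# "Limited offer, click here!" -> limited, offer, click, here
# SPAM: three tokens seen once (2/67 each) plus the unseen "offer" (1/67)
# HAM: none of the four tokens occur in HAM documents (1/79 each)
log_scores = {
    'SPAM': math.log(5 / 11) + 3 * math.log(2 / 67) + math.log(1 / 67),
    'HAM': math.log(6 / 11) + 4 * math.log(1 / 79),
}

posteriors = normalize_log_scores(log_scores)
for label, p in posteriors.items():
    print(f"P({label} | sentence) = {p:.3f}")
```

On a dataset this small the resulting probabilities should be read as rough indications rather than calibrated estimates.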



2.	Using Scikit-Learn. Use the scikit-learn package to train and test a Multinomial Naïve Bayes classifier.

In [13]:


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Prepare data
texts = [doc for doc, label in documents]
labels = [label for doc, label in documents]

# Vectorize the text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train the classifier
clf = MultinomialNB()
clf.fit(X, labels)

# Test sentences
test_sentences = [
    "Limited offer, click here!",
    "Meeting at 2 PM with the manager."
]

# Transform and predict
X_test = vectorizer.transform(test_sentences)
predictions = clf.predict(X_test)

for sent, pred in zip(test_sentences, predictions):
    print(f"Sentence: \"{sent}\"")
    print(f"Predicted class: {pred}\n")

Sentence: "Limited offer, click here!"
Predicted class: SPAM

Sentence: "Meeting at 2 PM with the manager."
Predicted class: HAM
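One caveat when comparing the two approaches: `CountVectorizer` lowercases text by default and uses the default `token_pattern` `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, whereas the manual tokenizer also keeps single-character tokens. The sketch below uses only the `re` module to show which tokens that default pattern would drop on a few sentences from the dataset; the constant and variable names are illustrative.

```python
import re

MANUAL_PATTERN = r'\b\w+\b'                 # pattern used by tokenize() above
SKLEARN_DEFAULT_PATTERN = r'(?u)\b\w\w+\b'  # CountVectorizer's default token_pattern

sentences = [
    "Win a free iPhone today",
    "Let's catch up tomorrow at the office",
    "Meeting at 3 PM tomorrow",
]

manual_tokens = set()
sklearn_style_tokens = set()
for sentence in sentences:
    lowered = sentence.lower()
    manual_tokens.update(re.findall(MANUAL_PATTERN, lowered))
    sklearn_style_tokens.update(re.findall(SKLEARN_DEFAULT_PATTERN, lowered))

# Tokens the manual tokenizer keeps but the default pattern drops
dropped = sorted(manual_tokens - sklearn_style_tokens)
print(dropped)
# ['3', 'a', 's']
```

Passing `token_pattern=r'\b\w+\b'` to `CountVectorizer` should bring the two vocabularies in line. `MultinomialNB` uses `alpha=1.0` by default, which corresponds to the Laplace smoothing used in the manual version.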

