# Week 07 Group 1 - Naive Bayes classifiers
### Daniel Centurion Barrionuevo, Giacomo Gardella, Muhammad Hilmi, Terje Haugland Jacobsson, Vitor Neves, Adrian Langmo Pavlak


### Naive Bayes — Short intro

**Definition.** A *generative* classifier that applies Bayes’ rule with the **naive** assumption that features are *conditionally independent* given the class.

**Decision rule** (one sample with features $x=(x_1,\dots,x_p)$, classes $c\in\{1,\dots,K\}$):
$$
\hat{y}=\arg\max_{c}\; P(y=c)\;\prod_{j=1}^{p} P(x_j \mid y=c)
$$

**Numerically stable (log-space) form:**
$$
\hat{y}=\arg\max_{c}\; \log P(y=c)\;+\;\sum_{j=1}^{p}\log P(x_j \mid y=c)
$$
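
To see why the log-space form matters, here is a minimal sketch with made-up likelihood values (2000 features, each with probability 0.01, chosen only for illustration). The raw product underflows in double precision, while the sum of logs stays finite.

```python
import math

# Hypothetical per-feature likelihoods P(x_j | y=c); values are illustrative only
probs = [0.01] * 2000

# Direct product: 0.01**2000 is far below the smallest positive float64
product = 1.0
for p in probs:
    product *= p

# Log-space: a sum of moderately sized negative numbers
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0 (underflow)
print(log_sum)   # roughly -9210.34
```

Because every class score underflows to the same value (zero), the product form cannot rank the classes; the log form can.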

**Pros**
- Trains/predicts **extremely fast**; tiny memory  
- Strong baseline for **high-dimensional sparse** data (e.g., text)  
- **Interpretable**: additive per-feature (log) contributions  
- Works with **small datasets** (with smoothing)

**Cons**
- Independence assumption often **violated** → overconfident probabilities  
- **Correlated** features are counted twice, and **continuous** features that do not match the assumed distribution can hurt accuracy/calibration  
- Effectively **linear** in chosen features; nonlinear structure needs feature engineering

## How are Naive Bayes models trained?

Estimate **priors** and **likelihood parameters** from labeled data, then classify with Bayes’ rule in log-space.

---

### 0) Common steps for all NB variants
- **Prior (class frequency)**  
  $$\hat P(y=c)=\frac{N_c}{N}$$
  where $N_c$ is the number of training samples in class $c$, $N$ the total.

- **Decision rule (used at prediction time)**  
  $$\hat y=\arg\max_c\ \log \hat P(y=c)+\log \hat P(x\mid y=c)$$
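
The two common steps can be sketched in a few lines of plain Python. The function names `log_priors` and `predict` and the log-likelihood values are hypothetical, introduced only for illustration; the likelihood term is supplied by whichever model is chosen.

```python
import math
from collections import Counter

def log_priors(labels):
    """Estimate log P(y=c) = log(N_c / N) from a list of training labels."""
    counts = Counter(labels)
    n = len(labels)
    return {c: math.log(k / n) for c, k in counts.items()}

def predict(log_prior, log_likelihood):
    """Return the class maximizing log P(y=c) + log P(x | y=c).

    log_likelihood maps each class to log P(x | y=c) for one sample.
    """
    return max(log_prior, key=lambda c: log_prior[c] + log_likelihood[c])

# Toy labels: 3 ham, 1 spam -> priors 0.75 and 0.25
lp = log_priors(["ham", "ham", "ham", "spam"])

# Made-up log-likelihoods for one sample
print(predict(lp, {"ham": -5.0, "spam": -3.0}))
```

In this toy case the likelihood favours spam strongly enough to outweigh the larger ham prior.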

This leaves one question: **how do we estimate** $\hat P(x\mid y=c)$?

The answer: choose a likelihood model.

---

## Common Likelihood Models

**Context.** In Naive Bayes we pick the class via  
$$\hat{y}=\arg\max_c\; P(y=c)\,P(x\mid y=c),$$  
so we **model** the class-conditional likelihood $P(x\mid y=c)$. The three most common choices:


- Multinomial Likelihood
- Bernoulli Likelihood (binary presence/absence)
- Gaussian Likelihood (continuous features)

---
#### Before starting
- Always compute decisions in **log-space** to avoid underflow.
- Use **smoothing** ($\alpha$) for discrete models; $\alpha\in[0.1,1]$ is a reasonable starting range.


### 1) Multinomial Likelihood (counts / text)
**Use when:** features are **counts or frequencies** (bag-of-words, TF/TF-IDF\*).  
**Model:** words/features occur as draws from a class-specific multinomial.

- Parameters: $\theta_{jc}=P(\text{feature }j\mid y=c)$, with $\sum_{j=1}^{V}\theta_{jc}=1$ for each class $c$.

- Log-likelihood (up to an additive constant that does not depend on the class):  
  $$
  \log P(x\mid y=c)=\sum_{j=1}^{V} x_j\,\log \theta_{jc} + \text{const}
  $$
- Additive (Laplace) smoothing: add $\alpha$ to every count to avoid zero probabilities:
  $$
  \hat\theta_{jc}=\frac{\text{count}(j,c)+\alpha}{\sum_{w}\text{count}(w,c)+\alpha V}
  $$
**Pros:** excellent for sparse, high-dimensional text; fast; robust with smoothing.  
**Cons:** assumes feature independence and count-valued inputs; \*fractional TF-IDF weights are not true counts, so they depart from the strict multinomial model even though they often work in practice.
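
As a minimal sketch of the formulas above (not a replacement for `MultinomialNB`), the smoothed estimate and the log-space score can be written with NumPy. The function names and the four-word toy count matrix are hypothetical and serve only as illustration.

```python
import numpy as np

def fit_multinomial(X, y, alpha=1.0):
    """Estimate log priors and smoothed log theta_jc from a count matrix X."""
    classes = np.unique(y)
    # count(j, c): total count of feature j over all samples of class c
    counts = np.array([X[y == c].sum(axis=0) for c in classes], dtype=float)
    V = X.shape[1]
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * V)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, log_prior, np.log(theta)

def predict_multinomial(X, classes, log_prior, log_theta):
    """Score = log prior + sum_j x_j * log theta_jc, then argmax over classes."""
    scores = X @ log_theta.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Toy counts over a made-up vocabulary: ["free", "win", "meeting", "lunch"]
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 2]])
y = np.array(["spam", "spam", "ham", "ham"])

classes, log_prior, log_theta = fit_multinomial(X, y, alpha=1.0)
print(np.exp(log_theta))  # each row sums to 1; no entry is zero
print(predict_multinomial(np.array([[1, 1, 0, 0]]), classes, log_prior, log_theta))
```

Without smoothing (`alpha=0`), the words never seen in a class would get probability zero and a log of minus infinity, which is what the additive term prevents.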

---


### 2) Bernoulli Likelihood (binary presence/absence)
**Use when:** features are **binary** ($x_j\in\{0,1\}$).  
**Model:** independent class-specific coin flips per feature.

- Parameters: $\theta_{jc}=P(x_j=1\mid y=c)$.
- Likelihood:
  $$
  P(x\mid y=c)=\prod_{j=1}^{p}\theta_{jc}^{\,x_j}\,(1-\theta_{jc})^{(1-x_j)}
  $$
  so
  $$
  \log P(x\mid y=c)=\sum_{j=1}^{p}\Big[ x_j\log\theta_{jc} + (1-x_j)\log(1-\theta_{jc}) \Big]
  $$
- Smoothing:
  $$
  \hat\theta_{jc}=\frac{\mathrm{count}(x_j=1\mid y=c)+\alpha}{N_c+2\alpha}
  $$


**Pros:** simple for presence/absence; penalizes missing features explicitly.  
**Cons:** ignores counts/intensity; can underperform multinomial on text.
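
The following sketch implements the Bernoulli formulas above with NumPy. The function names and the binary toy matrix are hypothetical. Note how the `(1 - X)` term makes every absent feature contribute to the score.

```python
import numpy as np

def fit_bernoulli(X, y, alpha=1.0):
    """Estimate log priors and smoothed theta_jc = P(x_j = 1 | y = c)."""
    classes = np.unique(y)
    theta = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, log_prior, theta

def predict_bernoulli(X, classes, log_prior, theta):
    """Sum x_j*log(theta) + (1 - x_j)*log(1 - theta) over features, add the prior."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[np.argmax(log_lik + log_prior, axis=1)]

# Toy presence/absence matrix (3 made-up features)
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array(["spam", "spam", "ham", "ham"])

classes, log_prior, theta = fit_bernoulli(X, y, alpha=1.0)
print(theta)  # row 0 = ham, row 1 = spam (np.unique sorts the labels)
print(predict_bernoulli(np.array([[1, 0, 0]]), classes, log_prior, theta))
```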

---

### 3) Gaussian Likelihood (continuous features)
**Use when:** features are **continuous** and roughly **unimodal** per class.  
**Model:** each feature is Gaussian and **independent** given the class.

- Parameters: per class and feature, $(\mu_{jc},\sigma^2_{jc})$.
- Likelihood (density):
  $$
  P(x\mid y=c)=\prod_{j=1}^{p}\frac{1}{\sqrt{2\pi\sigma_{jc}^2}}\exp\!\Big(-\frac{(x_j-\mu_{jc})^2}{2\sigma_{jc}^2}\Big)
  $$
  hence
  $$
  \log P(x\mid y=c)=\sum_{j=1}^{p}\Big[-\tfrac12\log(2\pi\sigma_{jc}^2)-\frac{(x_j-\mu_{jc})^2}{2\sigma_{jc}^2}\Big]
  $$
- Stabilization: add a small $\epsilon$ to the variances ($\sigma_{jc}^2\leftarrow\sigma_{jc}^2+\epsilon$).

**Pros:** fast, works with small $n$, simple closed forms.  
**Cons:** independence + Gaussian shape can be unrealistic; miscalibrated if features are correlated or multimodal.
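
A minimal NumPy sketch of the Gaussian case, under the assumption that a fixed `eps` is added to every variance as described above. The function names and the two-feature toy data are hypothetical; scikit-learn's `GaussianNB` scales its smoothing term by the largest feature variance instead, so its numbers will differ slightly.

```python
import numpy as np

def fit_gaussian(X, y, eps=1e-9):
    """Estimate per-class, per-feature means and variances, plus log priors."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0) for c in classes]) + eps  # stabilization
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, log_prior, mu, var

def predict_gaussian(X, classes, log_prior, mu, var):
    """Sum the per-feature Gaussian log-densities, add the prior, take argmax."""
    diff = X[:, None, :] - mu[None, :, :]                     # shape (n, K, p)
    log_lik = (-0.5 * np.log(2 * np.pi * var) - diff**2 / (2 * var)).sum(axis=2)
    return classes[np.argmax(log_lik + log_prior, axis=1)]

# Toy continuous data: two well-separated classes
X = np.array([[1.0, 2.0],
              [1.2, 1.8],
              [3.0, 4.0],
              [3.2, 4.2]])
y = np.array([0, 0, 1, 1])

classes, log_prior, mu, var = fit_gaussian(X, y)
print(mu)
print(predict_gaussian(np.array([[1.1, 1.9], [3.1, 4.1]]), classes, log_prior, mu, var))
```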

---

In [None]:
%pip install -q python-pptx nbformat


In [None]:
# If your notebook is in Drive, mount it first
from google.colab import drive
drive.mount('/content/drive')

# List likely places for .ipynb files
import glob, os, textwrap
candidates = glob.glob('/content/*.ipynb') + glob.glob('/content/drive/MyDrive/**/*.ipynb', recursive=True)
print("Found notebooks:")
for p in candidates:
    print("-", p)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Found notebooks:
- /content/drive/MyDrive/Week 07 Group 1.ipynb


In [None]:
from google.colab import files

# Dataset: SMS Spam Collection (UCI), https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [None]:
import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB, ComplementNB
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
import string
from collections import Counter
import warnings
warnings.filterwarnings('ignore')


In [None]:
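# Create DataFrame from the uploaded file (tab-separated: label, message)
if not uploaded:
    # An empty upload dict is what makes list(uploaded.keys())[0] raise IndexError
    raise RuntimeError("No file was uploaded; re-run the upload cell and select the dataset file.")
filename = next(iter(uploaded))
df = pd.read_csv(filename, sep='\t', names=['label', 'messages'])
print(f"df shape: {df.shape}")

df.head()
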

In [None]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Keep only 1 space
    text = ' '.join(text.split())
    return text

# Apply
df['cleaned_text'] = df['messages'].apply(preprocess_text)
df = df.drop_duplicates(subset='cleaned_text')
print(f"df shape: {df.shape}")
df.head(5)


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_text'],
    df['label'],
    test_size=0.2,
    random_state=42,
    stratify=df['label'])

# Train
vec = CountVectorizer()
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['ham', 'spam'])
disp.plot(cmap='Blues')
plt.show()

# Evaluation metrics
print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, pos_label='spam'):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred, pos_label='spam'):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred, pos_label='spam'):.3f}")
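
A single train/test split gives only one estimate of performance. One possible extension, sketched here on a made-up eight-message corpus (not the SMS dataset), wraps the vectorizer and the classifier in a `Pipeline` so that cross-validation refits the vocabulary on each training fold and the held-out fold never influences it. The corpus, step names, and `cv=2` are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, used only to keep the sketch self-contained
texts = [
    "win a free prize now", "free cash win win", "claim your free prize",
    "urgent free offer call now",
    "are we still meeting for lunch", "see you at the meeting",
    "call me after lunch", "can you send the meeting notes",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Vectorizer and model are fitted together inside each fold
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB(alpha=1.0)),
])

# Stratified 2-fold cross-validation; returns one accuracy per fold
scores = cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
print(scores, scores.mean())
```

On the real data, `texts` and `labels` would be `df['cleaned_text']` and `df['label']`, and a larger `cv` such as 5 would be more usual.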

In [None]:
feature_names = vec.get_feature_names_out()
# Look up each class row by label instead of assuming the row order
classes = list(model.classes_)
log_prob_spam = model.feature_log_prob_[classes.index('spam')]
log_prob_ham  = model.feature_log_prob_[classes.index('ham')]

# Top 5 spam-indicative words: largest log-probability ratio (spam vs. ham)
top_spam = feature_names[np.argsort(log_prob_spam - log_prob_ham)[-5:]]
print("Top spam words:", list(reversed(top_spam)))

In [None]:
# Tools
!apt-get -qq update && apt-get -qq install -y pandoc
!pip -q install "nbconvert>=7.10" jupyterlab_pygments

# Convert to Markdown WITHOUT executing
!jupyter nbconvert --to markdown "/content/drive/MyDrive/Week 07 Group 1.ipynb" \
  --output-dir=/content/nb_md --TemplateExporter.exclude_input=True

# Convert Markdown -> PPTX
!pandoc "/content/nb_md/Week 07 Group 1.md" \
  --resource-path="/content/nb_md:/content/drive/MyDrive" \
  --standalone -o "/content/Week 07 Group 1.pptx"

from google.colab import files
files.download("/content/Week 07 Group 1.pptx")



[NbConvertApp] Converting notebook /content/drive/MyDrive/Week 07 Group 1.ipynb to markdown
[NbConvertApp] Writing 11218 bytes to /content/nb_md/Week 07 Group 1.md

