## Prelab: Discriminative and Generative Models

By: Ethan Smadja, Tom Urban, Marine Belet
Professor: Jae Yun JUN KIM


## 2 Naive Bayes
Sources: Scikit-learn

## 2.1 Example 1: Bernoulli Naive Bayes

The provided script demonstrates the usage of the Bernoulli Naive Bayes classifier, which is typically used for binary feature classification problems.

Here’s what the script does at a high level:

Dataset Creation:

Randomly generates a small dataset (X) of 6 samples with 100 binary features each (every feature is either 0 or 1).
Assigns the labels y = [1, 2, 3, 4, 4, 5] to these samples, so two samples share label 4.

Training:

Fits a Bernoulli Naive Bayes classifier to this data, learning to associate patterns of binary features with each label.

Prediction:

Makes predictions on the exact same samples used for training.

In [4]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Generating a random binary dataset
X = np.random.randint(2, size=(6, 100))

# Labels for each of the 6 samples
y = np.array([1, 2, 3, 4, 4, 5])

# Creating the Bernoulli Naive Bayes classifier
clf = BernoulliNB()

# Training the classifier with the dataset
clf.fit(X, y)

# Predicting labels using the trained classifier
for i in range(6):
    prediction = clf.predict(X[i:i+1])
    print(prediction)


[1]
[2]
[3]
[4]
[4]
[5]


Interpretation of Results:

The classifier reproduces the labels it was trained on ([1, 2, 3, 4, 4, 5]) because the predictions are made on the training data itself. Since the features are random noise, this only shows that the model has memorized the six training examples, not that it has learned a pattern that generalizes.
In practice, predictions on unseen data may differ, which is why performance should be evaluated on a separate test set to gauge true predictive capability.
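
To make the point about unseen data concrete, here is a minimal sketch that trains on the same kind of random data and then predicts on fresh random samples. The fixed seed and the names `X_train`, `X_new` and `rng` are assumptions added for illustration; they are not part of the original script.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Fixed seed so the run is reproducible (an assumption, the original is unseeded)
rng = np.random.default_rng(0)

# Same shape of data as in the example: 6 samples, 100 binary features
X_train = rng.integers(0, 2, size=(6, 100))
y_train = np.array([1, 2, 3, 4, 4, 5])

clf = BernoulliNB()
clf.fit(X_train, y_train)

# Accuracy measured on the training samples themselves
train_accuracy = clf.score(X_train, y_train)

# Fresh random samples the model has never seen. The features carry no real
# signal, so these predictions are essentially arbitrary.
X_new = rng.integers(0, 2, size=(4, 100))
new_predictions = clf.predict(X_new)

print("Training accuracy:", train_accuracy)
print("Predictions on unseen samples:", new_predictions)
```

The training accuracy says nothing about the quality of the predictions on `X_new`; with real data, a labelled held-out set would be needed to measure that.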

## 2.2 Example 2: Multinomial Naive Bayes

This script uses a Multinomial Naive Bayes classifier suited for features representing discrete counts (e.g., text frequencies, occurrences of words).

It generates random data (X) containing integer counts between 0 and 4 for each feature.
Labels (y) from 1 to 6 are assigned to the six samples.
The model is trained and immediately tested on the training set itself.

In [6]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Generate a random dataset with integer features between 0 and 4
X = np.random.randint(5, size=(6, 100))

# Labels assigned to the 6 samples
y = np.array([1, 2, 3, 4, 5, 6])

# Create and train the Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X, y)

# Predict labels for each sample in the training data
for i in range(6):
    prediction = clf.predict(X[i:i+1])
    print(prediction)


[1]
[2]
[3]
[4]
[5]
[6]


Result Interpretation:

Predictions exactly match the true labels because the predictions are made directly on the training set.
Such results show that the classifier can perfectly memorize the small dataset, but they do not reflect its actual performance on unseen data.
In real scenarios, separate testing data is necessary to evaluate model accuracy and generalizability.
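
As a rough illustration of how the Multinomial model turns counts into a prediction, the sketch below recomputes the decision by hand from the fitted attributes `class_log_prior_` and `feature_log_prob_` and compares it with `clf.predict`. The seed and the variable names `joint_log_likelihood` and `manual_predictions` are assumptions made for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Seeded random counts between 0 and 4, same shape as in the example
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB()
clf.fit(X, y)

# For each sample and class:
#   log P(class) + sum_j count_j * log P(feature_j | class)
joint_log_likelihood = clf.class_log_prior_ + X @ clf.feature_log_prob_.T

# The predicted class is the one with the largest joint log-likelihood
manual_predictions = clf.classes_[np.argmax(joint_log_likelihood, axis=1)]

print("Manual: ", manual_predictions)
print("sklearn:", clf.predict(X))
```

Both lines should print the same labels, since the manual computation follows the same decision rule as the classifier.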

## 2.3 Example 3: Gaussian Naive Bayes

Data Preparation:

A small dataset (X) of 6 samples is created, split into two classes (y = [1, 1, 1, 2, 2, 2]).
The features of each class lie in clearly distinct regions, which lets the model separate them easily.

Model Training:

Two Gaussian Naive Bayes classifiers (clf and clf_pf) are created and trained:
clf.fit(X, y) trains the model on all data at once.
clf_pf.partial_fit(X, y, np.unique(y)) demonstrates incremental learning, which is useful when the dataset is large or arrives as a stream. The third argument lists all classes and is required on the first call.

Prediction:

Both models predict the class of a new sample [-1, -0.8].

In [10]:
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Creating a small dataset with continuous numerical features
X = np.array([[-1, -1], [ -1, -2], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Labels indicating two classes: class 1 and class 2
y = np.array([1, 1, 1, 2, 2, 2])

# Initialize and train Gaussian Naive Bayes classifier
clf = GaussianNB()
clf.fit(X, y)

# Predict class for a new sample [-1, -0.8]
print("Prediction (fit):", clf.predict([[-1, -0.8]]))

# Initialize GaussianNB classifier with incremental learning (partial_fit)
clf_pf = GaussianNB()
clf_pf.partial_fit(X, y, np.unique(y))

# Predict class for the same sample using partial_fit-trained classifier
print(clf_pf.predict([[-1,-0.8]]))


Prediction (fit): [1]
[1]
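
The script calls `partial_fit` once on the whole dataset. To show the incremental use case more directly, here is a small sketch that feeds the same data in two batches and compares the learned class means with those of a model trained in one go. The batch split and the names `clf_full`, `clf_stream` and `batches` are hypothetical choices for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-1, -2], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# Reference model trained on all data at once
clf_full = GaussianNB().fit(X, y)

# Hypothetical split of the six samples into two batches (row indices)
batches = [np.array([0, 1, 3, 4]), np.array([2, 5])]

clf_stream = GaussianNB()
for batch in batches:
    # `classes` is required on the first call; repeating the same list
    # on later calls is accepted
    clf_stream.partial_fit(X[batch], y[batch], classes=np.unique(y))

print("Means (fit):        ", clf_full.theta_.tolist())
print("Means (partial_fit):", clf_stream.theta_.tolist())
print("Prediction (partial_fit):", clf_stream.predict([[-1, -0.8]]))
```

The per-class means agree between the two models, which is what makes `partial_fit` suitable for data that does not fit in memory at once.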


The Gaussian Naive Bayes classifier models each feature within each class as a Gaussian with a learned mean and variance, and predicts the class with the highest posterior probability for the sample.
The test sample [-1, -0.8] lies among the class 1 training points, and both training methods assign it to class 1, as expected.

## 2.4 Example 4: Filtering spam emails

In [2]:
# -*- coding: utf-8 -*-
"""
Created on Fri Jan 27 22:53:50 2017
@author: Abhijeet Singh
"""

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Function to create a dictionary from email text data

def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail, encoding='latin1') as m:
            for i, line in enumerate(m):
                if i == 2:  # Usually, the 3rd line contains useful content
                    words = line.split()
                    all_words += words

    dictionary = Counter(all_words)

    # Removing non-alphabetic words and single-character words
    for item in list(dictionary):
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]

    # Keeping only the top 3000 most common words
    dictionary = dictionary.most_common(3000)
    return dictionary

# Function to extract features from emails based on the created dictionary

def extract_features(mail_dir, dictionary):
    # Sorted for a deterministic row order; the positional labels below depend on it
    files = sorted(os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir))
    features_matrix = np.zeros((len(files), 3000))

    for docID, fil in enumerate(files):
        with open(fil, encoding='latin1') as fi:
            for i, line in enumerate(fi):
                if i == 2:  # Same 3rd line that make_Dictionary reads
                    words = line.split()
                    for word in words:
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)

    return features_matrix


# Directories for training and testing data
train_dir = 'ling-spam/ling-spam/train-mails'
test_dir = 'ling-spam/ling-spam/test-mails'

# Create dictionary from training data
dictionary = make_Dictionary(train_dir)

# Create labels for training data (702 mails: first half non-spam, second half spam)
train_labels = np.zeros(702)
train_labels[351:702] = 1  # Marking spam mails (indices 351 to 701 inclusive)

# Extract features from training mails
train_matrix = extract_features(train_dir, dictionary)

# Train Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(train_matrix, train_labels)

# Prepare feature vectors for test data
test_matrix = extract_features(test_dir, dictionary)

# Labels for testing data (first 130 ham, next 130 spam)
test_labels = np.zeros(260)
test_labels[130:260] = 1

# Make predictions on test data and display confusion matrix
result = model.predict(test_matrix)

# Output confusion matrix to evaluate performance
print("Confusion Matrix:\n", confusion_matrix(test_labels, result))

Confusion Matrix:
 [[129   1]
 [  9 121]]
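
The feature extraction in the script scans the whole dictionary for every word. As a sketch of the same word-count idea with a direct lookup, the example below builds a word-to-column index and fills the count matrix from it. It runs on a few made-up strings instead of the ling-spam files; `train_bodies`, `build_vocabulary` and `count_features` are hypothetical names, and the matrix here has one column per vocabulary word rather than a fixed 3000.

```python
from collections import Counter

import numpy as np

# Made-up stand-ins for the body line of each email
train_bodies = [
    "free money offer click now",
    "meeting agenda for the linguistics workshop",
    "free offer free prize",
]


def build_vocabulary(bodies, max_words=3000):
    """Map the most common alphabetic, multi-character words to column indices."""
    counts = Counter(word for body in bodies for word in body.split())
    kept = [(w, c) for w, c in counts.most_common() if w.isalpha() and len(w) > 1]
    return {word: idx for idx, (word, _) in enumerate(kept[:max_words])}


def count_features(bodies, vocabulary):
    """Build a (documents x vocabulary) matrix of word counts."""
    matrix = np.zeros((len(bodies), len(vocabulary)))
    for doc_id, body in enumerate(bodies):
        for word, count in Counter(body.split()).items():
            idx = vocabulary.get(word)  # None for out-of-vocabulary words
            if idx is not None:
                matrix[doc_id, idx] = count
    return matrix


vocabulary = build_vocabulary(train_bodies)
features = count_features(train_bodies, vocabulary)
print(features.shape)
print("Count of 'free' in third mail:", features[2, vocabulary["free"]])
```

A matrix built this way can be passed to `MultinomialNB.fit` in the same way as `train_matrix` in the script.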


Interpretation:

True Negatives (129):
Correctly classified ham (non-spam) emails. This indicates the classifier is highly accurate at recognizing legitimate emails.

False Positives (1):
One ham email was incorrectly identified as spam. This type of error is minor here, but still undesirable since legitimate emails could be missed by the recipient.

False Negatives (9):
Nine actual spam emails were incorrectly classified as ham. These represent spam emails slipping through the filter.

True Positives (121):
Correctly classified spam emails. Indicates good spam identification.

Model Performance Insights:
The classifier is very effective overall.
Accuracy is high: (129 + 121) / 260 ≈ 96.15%.
The small number of false positives (1) is good, as users prefer to avoid losing legitimate emails to the spam filter.
The classifier is somewhat less effective at catching every spam email (9 false negatives, so 121 of 130 spam emails are detected), meaning some spam might reach inboxes, but performance is still strong overall.
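
The figures discussed above can be derived directly from the confusion matrix. The short sketch below does so with NumPy, treating spam as the positive class; precision and recall are added here as complementary metrics and are not computed in the original script.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
# (class 0 = ham, class 1 = spam)
cm = np.array([[129, 1],
               [9, 121]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all emails classified correctly
precision = tp / (tp + fp)        # share of predicted spam that really is spam
recall = tp / (tp + fn)           # share of actual spam that was caught

print(f"Accuracy:  {accuracy:.4f}")   # 0.9615
print(f"Precision: {precision:.4f}")  # 0.9918
print(f"Recall:    {recall:.4f}")     # 0.9308
```

The high precision reflects the single false positive, while the lower recall reflects the nine spam emails that slipped through.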
