### Task 1: Bernoulli Naive Bayes on Binary Text Data

* Task 1 Setup: SMS Spam Classification using Bernoulli Naive Bayes

* Step 1: Load the dataset from this URL:
* https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv

* Step 2: Convert the 'label' column into binary values: spam = 1, ham = 0

* Step 3: Use CountVectorizer with binary=True to transform the text into binary features

* Step 4: Split the dataset into training and test sets (e.g., 70/30 split)

* Step 5: Initialize and train a BernoulliNB model

* Step 6: Predict on the test set and evaluate using accuracy and confusion matrix
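
As a rough illustration of Step 3, the sketch below compares `CountVectorizer(binary=True)` with the default count mode. The three messages are hypothetical toy strings, not rows from the SMS dataset, and the variable names are chosen only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages (not taken from the SMS dataset)
messages = ["free free prize", "call me later", "free call"]

binary_vec = CountVectorizer(binary=True)  # records word presence (0 or 1)
count_vec = CountVectorizer()              # records word frequency

X_binary = binary_vec.fit_transform(messages).toarray()
X_counts = count_vec.fit_transform(messages).toarray()

# Column order follows the learned vocabulary indices
vocab = sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get)
print("Vocabulary:", vocab)
print("Binary features:\n", X_binary)
print("Count features:\n", X_counts)
```

In the first message the word "free" occurs twice, so the count matrix stores 2 in that column while the binary matrix stores 1. BernoulliNB models only presence or absence, which is why the binary form is used in this task.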

In [4]:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

base_url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'

df = pd.read_csv(base_url, sep='\t', header=None, names=['label', 'message'])

df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Convert text to binary word-presence features (CountVectorizer, not TF-IDF)
vectorizer = CountVectorizer(binary=True, stop_words='english')
X = vectorizer.fit_transform(df['message'])  # Feature matrix
y = df['label']  # Target variable

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Bernoulli Naive Bayes classifier
model = BernoulliNB()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Function to predict new messages
def predict_message(msg):
    # The raw message is vectorized as-is; no extra text cleaning is applied
    msg_vectorized = vectorizer.transform([msg])  # Binary word-presence vector
    prediction = model.predict(msg_vectorized)[0]
    return "Spam" if prediction == 1 else "Ham"

# Test with a new message
new_message = "Congra wotulations! Youn a free lottery. Click here to claim your prize!"
print(f"Message: {new_message} → Prediction: {predict_message(new_message)}")

Accuracy: 0.9749

Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99      1448
           1       0.97      0.84      0.90       224

    accuracy                           0.97      1672
   macro avg       0.97      0.92      0.94      1672
weighted avg       0.97      0.97      0.97      1672

Message: Congra wotulations! Youn a free lottery. Click here to claim your prize! → Prediction: Ham
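
Step 6 asks for a confusion matrix, which the cell above does not print. A minimal sketch of how one could be computed follows. The label lists are hypothetical values made up for illustration and are not results from the SMS data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (0 = ham, 1 = spam)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

In the notebook itself, passing `y_test` and `y_pred` in place of the demo lists would give the matrix for the trained model.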


### Task 2: Gaussian Naive Bayes on Real-Valued Data

* Task 2 Setup: Iris Dataset with Gaussian Naive Bayes

* Step 1: Load the Iris dataset using sklearn.datasets.load_iris()

* Step 2: Split the data into training and test sets (e.g., 70/30 split)

* Step 3: Initialize a GaussianNB model

* Step 4: Train the model and evaluate performance using accuracy and classification report


In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Gaussian Naive Bayes classifier
clf = GaussianNB()
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9778

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
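
To make the model behind `GaussianNB` a little more concrete, here is a simplified NumPy sketch of the same idea: estimate a prior, a mean, and a variance per class and feature, then pick the class with the highest log joint probability. It is an approximation for illustration, not scikit-learn's implementation; in particular it omits the small variance smoothing term that `GaussianNB` adds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

classes = np.unique(y_train)

# Per-class prior, feature means, and feature variances from the training data
priors = np.array([(y_train == c).mean() for c in classes])
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
variances = np.array([X_train[y_train == c].var(axis=0) for c in classes])

def log_joint(X_new):
    """Return log P(c) + sum_j log N(x_j | mean_cj, var_cj), shape (n_samples, n_classes)."""
    diff = X_new[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    return np.log(priors)[None, :] + log_lik.sum(axis=2)

y_pred_manual = classes[np.argmax(log_joint(X_test), axis=1)]
manual_accuracy = (y_pred_manual == y_test).mean()
print(f"Manual Gaussian NB accuracy: {manual_accuracy:.4f}")
```

The "naive" assumption shows up in the `sum(axis=2)`: the per-feature log densities are simply added, which treats the four Iris measurements as independent given the class.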



### Task 3: Multinomial Naive Bayes on Word Frequencies

* Task 3 Setup: SMS Spam Classification using Multinomial Naive Bayes

* Step 1: Reuse the same SMS Spam dataset from Task 1

* Step 2: Use CountVectorizer (without binary=True) to extract word frequency features

* Step 3: Split the data into training and test sets

* Step 4: Initialize and train a MultinomialNB model

* Step 5: Evaluate the model with appropriate metrics
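
The sketch below illustrates how word frequencies enter a multinomial model: each class gets a smoothed per-word probability of the form (count of word in class + alpha) / (total words in class + alpha × vocabulary size). The toy corpus and labels are hypothetical, and the comparison against `feature_log_prob_` is only a sanity check on this small example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (1 = spam, 0 = ham)
texts = ["free prize free", "win free prize now", "see you at lunch", "lunch at noon"]
labels = np.array([1, 1, 0, 0])

vec = CountVectorizer()  # default mode keeps word counts
X_toy = vec.fit_transform(texts).toarray()

alpha = 1.0  # Laplace smoothing strength
nb = MultinomialNB(alpha=alpha).fit(X_toy, labels)

# Recompute the smoothed log-probabilities by hand, one row per class
manual = []
for c in nb.classes_:
    word_counts = X_toy[labels == c].sum(axis=0)
    smoothed = (word_counts + alpha) / (word_counts.sum() + alpha * X_toy.shape[1])
    manual.append(np.log(smoothed))
manual = np.array(manual)

matches = np.allclose(manual, nb.feature_log_prob_)
print("Manual estimates match the fitted model:", matches)
```

Because "free" appears three times across the spam messages, its count contributes three times to the spam-class estimate, which is information a binary representation would discard.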


In [3]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

base_url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'

df = pd.read_csv(base_url, sep='\t', header=None, names=['label', 'message'])

df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Convert text to word-count features (CountVectorizer, not TF-IDF)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])  # Feature matrix
y = df['label']  # Target variable

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Function to predict new messages
def predict_message(msg):
    # The raw message is vectorized as-is; no extra text cleaning is applied
    msg_vectorized = vectorizer.transform([msg])  # Word-count vector
    prediction = model.predict(msg_vectorized)[0]
    return "Spam" if prediction == 1 else "Ham"

# Test with a new message
new_message = "Congra wotulations! Youn a free lottery. Click here to claim your prize!"
print(f"Message: {new_message} → Prediction: {predict_message(new_message)}")

Accuracy: 0.9850

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1448
           1       0.93      0.96      0.95       224

    accuracy                           0.99      1672
   macro avg       0.96      0.97      0.97      1672
weighted avg       0.99      0.99      0.99      1672

Message: Congra wotulations! Youn a free lottery. Click here to claim your prize! → Prediction: Spam
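
One possible variation, shown here as a sketch rather than as the approach used in the cells above, is to bundle the vectorizer and classifier in a scikit-learn pipeline. The vocabulary is then learned only from the data passed to `fit`, and new messages go through the same transformation automatically. The training texts and labels are hypothetical toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = spam, 0 = ham)
train_texts = [
    "free prize click now",
    "win a free lottery prize",
    "are we still meeting today",
    "call me when you get home",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())
spam_clf.fit(train_texts, train_labels)

new_messages = ["claim your free prize"]
pred = spam_clf.predict(new_messages)
proba = spam_clf.predict_proba(new_messages)

print("Prediction:", "Spam" if pred[0] == 1 else "Ham")
print("Class probabilities [ham, spam]:", proba[0])
```

Words the vectorizer never saw during fitting ("claim" and "your" here) are ignored at prediction time, so only the known words influence the result.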
