## Naive Bayes Algorithm: Real-World Practical Tasks


### Task 1: Text Classification using Naive Bayes

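**Objective:** Implement Naive Bayes for text classification using the Bag-of-Words model and Term Frequency-Inverse Document Frequency (TF-IDF). The task is framed around classifying movie reviews as positive or negative; the code in this section uses two categories of the 20 Newsgroups dataset (`rec.autos` and `rec.sport.baseball`) as a stand-in binary text classification problem.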

**Steps:**
1. Load a dataset of movie reviews (or any other text data) and preprocess it using tokenization and vectorization.
2. Apply both Bag-of-Words and TF-IDF models.
3. Train a Naive Bayes classifier on the training set.
4. Evaluate the model on the test set using accuracy, precision, recall, and F1-score.
    

In [1]:

# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_20newsgroups

# Load dataset (20 Newsgroups dataset for text classification)
data = fetch_20newsgroups(subset='all', categories=['rec.autos', 'rec.sport.baseball'], shuffle=True, random_state=42)
X = data.data
y = data.target

# Preprocessing: Bag-of-Words
vectorizer = CountVectorizer(stop_words='english')
X_bow = vectorizer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.3, random_state=42)

# Train Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = nb_classifier.predict(X_test)
print("Classification Report for Bag-of-Words Naive Bayes:")
print(classification_report(y_test, y_pred))

# Preprocessing: TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Train-test split
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42)

# Train Naive Bayes Classifier
nb_classifier_tfidf = MultinomialNB()
nb_classifier_tfidf.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred_tfidf = nb_classifier_tfidf.predict(X_test_tfidf)
print("Classification Report for TF-IDF Naive Bayes:")
print(classification_report(y_test, y_pred_tfidf))


Classification Report for Bag-of-Words Naive Bayes:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       305
           1       1.00      0.99      0.99       291

    accuracy                           0.99       596
   macro avg       1.00      0.99      0.99       596
weighted avg       1.00      0.99      0.99       596

Classification Report for TF-IDF Naive Bayes:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       305
           1       1.00      0.99      0.99       291

    accuracy                           0.99       596
   macro avg       1.00      0.99      0.99       596
weighted avg       0.99      0.99      0.99       596
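
Note that both vectorizers above are fit on the full corpus before the train-test split, so vocabulary and IDF statistics from the test documents influence training. One common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is fit on training text only. A minimal sketch on hand-written toy data (the texts and labels below are assumptions for illustration, not the 20 Newsgroups data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: 0 = autos, 1 = baseball (hypothetical examples)
train_texts = [
    "the engine and brakes need repair",
    "new tires improve the car handling",
    "the pitcher threw a fast ball",
    "home run in the ninth inning",
]
train_labels = [0, 0, 1, 1]

# Held-out texts; the vectorizer never sees these during fitting
test_texts = ["the car engine stalled", "the pitcher won the inning"]

# The pipeline fits the vectorizer and the classifier on training text only
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# At prediction time the pipeline only transforms the test text
test_pred = model.predict(test_texts)
learned_vocab = model.named_steps["tfidfvectorizer"].vocabulary_
print(test_pred)
print("stalled" in learned_vocab)  # word seen only in test text
```

With the real dataset, the same pattern would be `train_test_split(X, y, ...)` on the raw documents followed by `model.fit(X_train, y_train)`. The reported scores may differ slightly from those shown above.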




### Task 2: Naive Bayes for Image Classification

**Objective:** Implement Naive Bayes for image classification using pixel intensities as features. Apply the algorithm to a simplified image dataset (e.g., MNIST) to classify handwritten digits.

**Steps:**
1. Load the MNIST dataset with each image flattened into a 1D array of pixel intensities (the OpenML `mnist_784` version used here is already stored as 784 values per image).
2. Train a Naive Bayes classifier on the training data.
3. Evaluate the performance of the classifier on the test data using accuracy and a confusion matrix.
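
The flattening in step 1 can be sketched on scikit-learn's small bundled digits dataset, which ships as 8x8 images and needs no download. This is a stand-in chosen for illustration, not MNIST itself, so its accuracy is not comparable to the MNIST result in this section.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Bundled 8x8 digit images (a small stand-in for MNIST)
digits = load_digits()
images = digits.images                      # shape: (n_samples, 8, 8)

# Flatten each 2D image into a 1D feature vector of pixel intensities
X_flat = images.reshape(len(images), -1)    # shape: (n_samples, 64)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_flat, digits.target, test_size=0.3, random_state=42
)

# Gaussian Naive Bayes treats every pixel as an independent Gaussian feature
clf = GaussianNB().fit(X_tr, y_tr)
digits_accuracy = clf.score(X_te, y_te)
print(images.shape, X_flat.shape, digits_accuracy)
```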
    

In [2]:

# Import necessary libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X_mnist = mnist.data
y_mnist = mnist.target

# Convert labels to integer type
y_mnist = y_mnist.astype(int)

# Normalize pixel values to [0,1]
X_mnist = X_mnist / 255.0

# Train-test split
X_train_mnist, X_test_mnist, y_train_mnist, y_test_mnist = train_test_split(X_mnist, y_mnist, test_size=0.3, random_state=42)

# Train Naive Bayes Classifier for image classification
nb_classifier_image = GaussianNB()
nb_classifier_image.fit(X_train_mnist, y_train_mnist)

# Predict and evaluate
y_pred_image = nb_classifier_image.predict(X_test_mnist)
accuracy_image = accuracy_score(y_test_mnist, y_pred_image)
conf_matrix_image = confusion_matrix(y_test_mnist, y_pred_image)

print(f"Image Classification Accuracy: {accuracy_image}")
print("Confusion Matrix:")
print(conf_matrix_image)


Image Classification Accuracy: 0.550952380952381
Confusion Matrix:
[[1885    3   12    7    5    5   76    2   39   24]
 [   2 2249    4    9    1    6   28    1   51   13]
 [ 252   60  652  131    5    9  560    3  433   28]
 [ 253  119   18  734    2    7  155   15  621  252]
 [  93   14   24   10  257   10  240   15  362  911]
 [ 342   60   12   34    5   71  130    4 1055  202]
 [  26   36   11    0    2    6 1979    0   25    3]
 [  16   19    5   28   13    4    3  646   65 1449]
 [  54  270    8   16    4   11   59    6 1125  439]
 [  16   20   11    5    5    1    1   25   34 1972]]
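
The confusion matrix above is easier to interpret once it is reduced to per-class recall. The sketch below copies the matrix from the output above (rows are true digits, columns are predicted digits, following scikit-learn's convention) and computes recall per digit with NumPy. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
conf = np.array([
    [1885,    3,   12,    7,    5,    5,   76,    2,   39,   24],
    [   2, 2249,    4,    9,    1,    6,   28,    1,   51,   13],
    [ 252,   60,  652,  131,    5,    9,  560,    3,  433,   28],
    [ 253,  119,   18,  734,    2,    7,  155,   15,  621,  252],
    [  93,   14,   24,   10,  257,   10,  240,   15,  362,  911],
    [ 342,   60,   12,   34,    5,   71,  130,    4, 1055,  202],
    [  26,   36,   11,    0,    2,    6, 1979,    0,   25,    3],
    [  16,   19,    5,   28,   13,    4,    3,  646,   65, 1449],
    [  54,  270,    8,   16,    4,   11,   59,    6, 1125,  439],
    [  16,   20,   11,    5,    5,    1,    1,   25,   34, 1972],
])

# Recall per digit: correct predictions divided by the true count in each row
recall = np.diag(conf) / conf.sum(axis=1)
overall_accuracy = np.trace(conf) / conf.sum()

# Digit with the lowest recall, and the label it is most often mistaken for
worst_digit = int(np.argmin(recall))
errors = conf[worst_digit].copy()
errors[worst_digit] = -1                    # ignore the correct predictions
most_confused_with = int(np.argmax(errors))

for digit, r in enumerate(recall):
    print(f"digit {digit}: recall = {r:.3f}")
print(f"overall accuracy = {overall_accuracy:.4f}")
print(f"worst digit = {worst_digit}, mostly predicted as {most_confused_with}")
```

On this matrix, digit 5 has by far the lowest recall (71 of 1915 correct), and most misclassified 5s are predicted as 8. Digits 0, 1, 6, and 9 are recognized well, which suggests the low overall accuracy comes from a few heavily confused classes rather than uniformly poor predictions.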



### Task 3: Spam Detection using Naive Bayes

**Objective:** Build a Naive Bayes classifier to detect spam messages. The code in this section uses the SMS Spam Collection dataset; the same approach applies to email datasets. Evaluate the performance of the classifier using precision, recall, and F1-score.

**Steps:**
1. Load and preprocess the email data (or SMS data).
2. Convert the text data into numeric format using TF-IDF.
3. Train the Naive Bayes classifier to distinguish between spam and non-spam messages.
4. Evaluate the model's performance using standard classification metrics.
    

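In [5]:

# Import necessary libraries
import pandas as pd
import zipfile
import requests
from io import BytesIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report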

# URL for the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'

# Download and extract the ZIP file (fail early on HTTP errors)
response = requests.get(url, timeout=30)
response.raise_for_status()
with zipfile.ZipFile(BytesIO(response.content)) as z:
    # Extract the specific file 'SMSSpamCollection'
    with z.open('SMSSpamCollection') as f:
        df = pd.read_csv(f, sep='\t', header=None, names=['label', 'message'])

# Encode labels (spam = 1, ham = 0)
df['label'] = df['label'].map({'spam': 1, 'ham': 0})

# Separate features (message text) and labels
X_spam = df['message']
y_spam = df['label']

# Vectorization using TF-IDF
tfidf_vectorizer_spam = TfidfVectorizer(stop_words='english')
X_spam_tfidf = tfidf_vectorizer_spam.fit_transform(X_spam)

# Train-test split
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(X_spam_tfidf, y_spam, test_size=0.3, random_state=42)

# Train Naive Bayes Classifier
nb_classifier_spam = MultinomialNB()
nb_classifier_spam.fit(X_train_spam, y_train_spam)

# Predict and evaluate
y_pred_spam = nb_classifier_spam.predict(X_test_spam)
print("Spam Detection Classification Report:")
print(classification_report(y_test_spam, y_pred_spam))


Spam Detection Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1448
           1       0.99      0.80      0.89       224

    accuracy                           0.97      1672
   macro avg       0.98      0.90      0.94      1672
weighted avg       0.97      0.97      0.97      1672
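
The report above shows high precision but a recall of 0.80 for the spam class, meaning roughly one in five spam messages in the test set is classified as ham. One possible way to trade precision for recall is to lower the decision threshold applied to the predicted spam probability, instead of relying on the default used by `predict`. The sketch below illustrates the idea on invented toy messages; the threshold value of 0.3 is an arbitrary assumption and would need tuning on validation data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages, invented for illustration (1 = spam, 0 = ham)
messages = [
    "win a free prize now",
    "free cash offer claim now",
    "are we meeting for lunch today",
    "call me when you get home",
    "see you at the game tonight",
    "urgent claim your free ticket",
]
labels = [1, 1, 0, 0, 0, 1]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(messages, labels)

new_messages = [
    "free lunch today",
    "claim your prize",
    "see you at home",
]

# Column 1 of predict_proba is the probability of class 1 (spam)
spam_proba = model.predict_proba(new_messages)[:, 1]

# Default-style threshold versus a more lenient one that flags more messages
flagged_default = spam_proba >= 0.5
flagged_lenient = spam_proba >= 0.3   # assumed value, tune on validation data

for text, p in zip(new_messages, spam_proba):
    print(f"{p:.2f}  {text}")
print("flagged at 0.5:", int(flagged_default.sum()))
print("flagged at 0.3:", int(flagged_lenient.sum()))
```

A lower threshold can only flag the same or more messages as spam, so recall for the spam class cannot decrease, while precision typically drops. Because the test set above is imbalanced (1448 ham versus 224 spam), passing `stratify=y_spam` to `train_test_split` may also be worth considering so both splits keep the same class ratio.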

