## Naive Bayes Spam Detection: Corrected Version


### Task: Spam Detection using Naive Bayes (Corrected)

**Objective:** Build a Naive Bayes classifier that detects spam in the SMS Spam Collection dataset. Each message is converted into a numeric feature vector using TF-IDF, and the classifier is evaluated using precision, recall, and F1-score.

**Steps:**
1. Load and preprocess the SMS Spam Collection dataset.
2. Convert the text data into numeric format using TF-IDF.
3. Train the Naive Bayes classifier to distinguish between spam and non-spam messages.
4. Evaluate the model's performance using standard classification metrics.
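
To make step 2 concrete, here is a minimal sketch of what TF-IDF vectorization produces. The three messages and the `toy_*` names are hypothetical and are not taken from the SMS Spam Collection; only `TfidfVectorizer(stop_words='english')` mirrors the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages (assumption: illustrative only, not from the dataset)
toy_messages = [
    "Win a free prize today",
    "Are we meeting for lunch today",
    "Free entry to win cash",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix, one row per message

# Learned vocabulary; English stop words such as "we" and "for" are dropped
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# With the default norm='l2', each row is scaled to unit length
toy_row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(toy_matrix.shape, toy_row_norms)
```

Each column corresponds to one vocabulary term, and terms that appear in fewer messages receive a higher weight than terms shared across many messages.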
    

In [1]:

# Import necessary libraries
import pandas as pd
import zipfile
import requests
from io import BytesIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# URL for the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'

# Download and extract the ZIP file
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed
with zipfile.ZipFile(BytesIO(response.content)) as z:
    # Extract the specific file 'SMSSpamCollection'
    with z.open('SMSSpamCollection') as f:
        df = pd.read_csv(f, sep='\t', header=None, names=['label', 'message'])

# Encode labels (spam = 1, ham = 0)
df['label'] = df['label'].map({'spam': 1, 'ham': 0})

# Separate the features (message text) from the target (label)
X_spam = df['message']
y_spam = df['label']

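# Vectorization using TF-IDF
# Note: the vectorizer is fit on the full dataset before the split, so the
# vocabulary and IDF statistics of the test messages influence the training features.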
tfidf_vectorizer_spam = TfidfVectorizer(stop_words='english')
X_spam_tfidf = tfidf_vectorizer_spam.fit_transform(X_spam)

# Train-test split
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(
    X_spam_tfidf, y_spam, test_size=0.3, random_state=42
)

# Train Naive Bayes Classifier
nb_classifier_spam = MultinomialNB()
nb_classifier_spam.fit(X_train_spam, y_train_spam)

# Predict and evaluate
y_pred_spam = nb_classifier_spam.predict(X_test_spam)
print("Spam Detection Classification Report:")
print(classification_report(y_test_spam, y_pred_spam))


Spam Detection Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1448
           1       0.99      0.80      0.89       224

    accuracy                           0.97      1672
   macro avg       0.98      0.90      0.94      1672
weighted avg       0.97      0.97      0.97      1672
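
The `macro avg` and `weighted avg` rows can be checked by hand from the two per-class rows. The sketch below uses only the rounded values printed in the report, so results may differ from the report in the last digit (scikit-learn averages the unrounded scores). The names `report_rows`, `macro_avg`, and `weighted_avg` are hypothetical helpers, not part of scikit-learn.

```python
# Per-class rows copied from the report: (precision, recall, f1-score, support)
report_rows = {
    0: (0.97, 1.00, 0.98, 1448),  # ham
    1: (0.99, 0.80, 0.89, 224),   # spam
}

total_support = sum(row[3] for row in report_rows.values())

def macro_avg(col):
    """Unweighted mean of a metric over the classes."""
    return sum(row[col] for row in report_rows.values()) / len(report_rows)

def weighted_avg(col):
    """Mean of a metric over the classes, weighted by class support."""
    return sum(row[col] * row[3] for row in report_rows.values()) / total_support

for name, col in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:>9}  macro={macro_avg(col):.3f}  weighted={weighted_avg(col):.3f}")

# Rough count of spam messages the model missed, estimated from the rounded recall
missed_spam = round((1 - report_rows[1][1]) * report_rows[1][3])
print("approx. spam messages missed:", missed_spam)
```

Because ham makes up most of the test set, the weighted averages stay close to the ham scores, while the macro recall exposes the weaker spam recall of 0.80.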

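
The cell above calls `fit_transform` on all messages before `train_test_split`, so the TF-IDF vocabulary and IDF weights are learned partly from the test set. One possible alternative, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that both are fit on the training split only. The `demo_*` data is hypothetical stand-in text so the sketch runs without a download, and `stratify` is an addition that the original cell does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data; in the notebook these would be
# df['message'] and df['label'] (spam = 1, ham = 0).
demo_messages = [
    "Win a free prize now",
    "Free cash entry, text WIN",
    "Claim your free ringtone today",
    "Urgent! You won a cash prize",
    "Free tickets, call now to claim",
    "Are we meeting for lunch today",
    "Call me when you get home",
    "See you at the office tomorrow",
    "Can you pick up some milk",
    "Running late, be there soon",
]
demo_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text first so the test messages stay unseen by the vectorizer
raw_train, raw_test, label_train, label_test = train_test_split(
    demo_messages, demo_labels, test_size=0.3, random_state=42, stratify=demo_labels
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes
spam_pipeline = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
spam_pipeline.fit(raw_train, label_train)

# At prediction time the pipeline applies the already-fitted vectorizer
demo_predictions = spam_pipeline.predict(raw_test)
print(list(zip(raw_test, demo_predictions)))
```

On the real dataset the scores from this variant may differ slightly from the report above, since words that occur only in test messages no longer contribute to the vocabulary.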
