**Spam Email Classification (using Naive Bayes)**

In [3]:
# Importing required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load Spam dataset
# Make sure to update the file path with the location of the Spam dataset on your machine.
spam_file_path = 'spam_ham_dataset.csv'  # Update this path to your Spam CSV file
spam_df = pd.read_csv(spam_file_path)

# Step 1: Drop irrelevant columns (like 'Unnamed: 0')
# CSV files saved from pandas often carry a leftover index column such as 'Unnamed: 0'.
# errors='ignore' makes the drop a no-op when the column is absent, so the cell still runs.
spam_df = spam_df.drop(columns='Unnamed: 0', errors='ignore')

# Step 2: Text Preprocessing
# The 'text' column holds the raw email content.
# TfidfVectorizer (Term Frequency-Inverse Document Frequency) converts each email into a numeric feature vector.
# Terms that are frequent within an email but rare across the corpus get higher weights; ubiquitous terms (like 'the', 'and') get lower ones.
# Note: the vectorizer is fit on all emails before the split in Step 4, so the IDF weights also reflect the test emails.
tfidf = TfidfVectorizer(stop_words='english', max_df=0.95)  # Remove stop words and ignore words appearing in over 95% of emails
X_spam = tfidf.fit_transform(spam_df['text'])  # Convert text into a matrix of TF-IDF features

# Step 3: Define the target variable
# The 'label_num' column contains the labels: 1 for spam and 0 for ham (not spam).
y_spam = spam_df['label_num']  # Target variable (spam or ham)

# Step 4: Split the dataset into training and testing sets
# We'll use 80% of the data for training and 20% for testing; random_state=42 makes the split reproducible.
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(X_spam, y_spam, test_size=0.2, random_state=42)

# Step 5: Train a Naive Bayes classifier
# Naive Bayes is a probabilistic classifier that works well with text data.
# We'll use MultinomialNB, which models word counts; fractional TF-IDF values also tend to work in practice.
nb_model_spam = MultinomialNB()  # Create a Multinomial Naive Bayes model
nb_model_spam.fit(X_train_spam, y_train_spam)  # Train the model on the training data

# Step 6: Make predictions on the test set
# We'll now use the trained model to make predictions on the test data.
y_pred_spam = nb_model_spam.predict(X_test_spam)

# Step 7: Evaluate the model
# We'll evaluate the model's performance using accuracy, precision, recall, and F1-score.
accuracy_spam = accuracy_score(y_test_spam, y_pred_spam)  # Calculate the accuracy
classification_report_spam = classification_report(y_test_spam, y_pred_spam)  # Detailed classification report

# Step 8: Print the results
print(f"Accuracy: {accuracy_spam:.4f}")  # Print accuracy with 4 decimal places
print("Classification Report:\n", classification_report_spam)  # Print the detailed classification report


Accuracy: 0.9227
Classification Report:
               precision    recall  f1-score   support

           0       0.90      1.00      0.95       742
           1       1.00      0.73      0.84       293

    accuracy                           0.92      1035
   macro avg       0.95      0.86      0.90      1035
weighted avg       0.93      0.92      0.92      1035
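
The report above can be turned back into approximate counts with a little arithmetic. The sketch below uses only the numbers printed in the output; because the report rounds recall to two decimals, the spam counts are estimates rather than exact values.

```python
# Values copied from the printed output above
accuracy = 0.9227
n_test = 1035
n_ham, n_spam = 742, 293
spam_recall = 0.73  # rounded to two decimals in the report

# Accuracy is (correct predictions) / (test emails)
n_correct = round(accuracy * n_test)
n_errors = n_test - n_correct

# Recall for class 1 is (spam caught) / (all spam); approximate because of rounding
approx_spam_caught = round(spam_recall * n_spam)
approx_spam_missed = n_spam - approx_spam_caught

print(f"correct: {n_correct}, errors: {n_errors}")
print(f"spam caught (approx.): {approx_spam_caught}, spam missed (approx.): {approx_spam_missed}")
```

Under these assumptions, nearly all of the misclassified emails appear to be spam labelled as ham, which matches the high precision and lower recall reported for class 1.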

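
To make the TF-IDF weighting from Step 2 more concrete, here is a minimal sketch on a three-line toy corpus. The `toy_emails` list and the variable names are hypothetical and stand in for the real dataset; the vectorizer uses its default settings so that the effect of document frequency on the IDF value is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
toy_emails = [
    "free prize inside",
    "free meeting agenda",
    "free lunch agenda",
]

toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(toy_emails)

# Map each vocabulary term to its learned inverse document frequency
idf_by_term = {
    term: toy_tfidf.idf_[index]
    for term, index in toy_tfidf.vocabulary_.items()
}

# Print terms from most common (lowest IDF) to rarest (highest IDF)
for term in sorted(idf_by_term, key=idf_by_term.get):
    print(f"{term:10s} idf={idf_by_term[term]:.3f}")
```

Here "free" occurs in every email and receives the lowest IDF, while terms that occur in a single email receive the highest. With `max_df=0.95`, as in the cell above, a term present in every document would be dropped from the vocabulary altogether.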

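
The cell above fits the vectorizer on the full dataset before splitting. One possible alternative, sketched here on a hypothetical in-memory corpus rather than the real CSV, is to split the raw text first and wrap both steps in a scikit-learn pipeline, so that the TF-IDF vocabulary and IDF weights are learned from the training emails only. The names `texts`, `labels`, and `spam_pipeline` are assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical, not taken from spam_ham_dataset.csv)
texts = [
    "win a free prize now", "free money click here",
    "claim your free reward", "cheap pills free offer",
    "meeting at noon tomorrow", "please review the attached report",
    "lunch with the team today", "project schedule update attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first, keeping the class ratio in both splits
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english', max_df=0.95),
    MultinomialNB(),
)
spam_pipeline.fit(train_texts, y_train)

# predict() vectorizes the test text with the already fitted vocabulary
predictions = spam_pipeline.predict(test_texts)
print(len(predictions))
```

Applied to the real dataset, this would mean passing `spam_df['text']` and `spam_df['label_num']` to `train_test_split` directly; the resulting scores may differ from those printed above.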