# Email Spam Detection Model


## Objective:
The primary objective of this project is to build a machine learning model that can accurately classify emails as either **spam** or **ham** (non-spam) based on their content. The classification model uses the **Naive Bayes algorithm**, a probabilistic classifier that works well for text classification tasks. It rests on the assumption that features (such as word frequencies) are conditionally independent given the class.

## Data Description:
The dataset contains **5172 emails**, with each email represented by a row. The columns (except the first and last) represent the counts of the **3000 most common words** found across all the emails, with each word count serving as a feature. 

- **First Column**: Email identifier
- **Last Column**: The target label:
  - `1` for **spam**
  - `0` for **ham** (non-spam)

The dataset is pre-processed such that each email's content is represented by the frequency of certain words, effectively reducing the raw text data into a more structured form suitable for machine learning.
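
To make the word-count representation concrete, here is a minimal sketch of how raw text could be turned into such rows. The toy emails and the helper names `build_vocabulary` and `to_count_vector` are hypothetical illustrations; the actual pre-processing used to produce `emails.csv` is not described in this notebook.

```python
from collections import Counter

def build_vocabulary(emails, vocab_size):
    """Return the vocab_size most frequent words across all emails."""
    totals = Counter()
    for text in emails:
        totals.update(text.lower().split())
    return [word for word, _ in totals.most_common(vocab_size)]

def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one email."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy emails, not taken from the dataset
toy_emails = [
    "free money free prize",
    "meeting at noon",
    "free meeting",
]

vocabulary = build_vocabulary(toy_emails, vocab_size=2)
rows = [to_count_vector(text, vocabulary) for text in toy_emails]

print(vocabulary)
print(rows)
```

Each row plays the role of one email in the dataset, with one column per vocabulary word. The real dataset uses 3000 such columns.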

## Naive Bayes Classification:
Naive Bayes is based on **Bayes' Theorem**, which it uses to calculate the probability of an email being spam or ham given its word counts. It assumes that the occurrence of each word in an email is **conditionally independent** of the others given the class. This assumption rarely holds exactly for natural language, but it keeps training and prediction cheap on high-dimensional data like text.

For this project, the **Multinomial Naive Bayes** variant is a natural fit since the features (word counts) are discrete counts of occurrences. This classifier scores an email for each class (spam or ham) by multiplying the class prior by the probabilities of the individual words occurring in emails of that class, then picks the class with the higher score.
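
The scoring rule described above can be sketched in plain Python. This is a simplified illustration, not the scikit-learn implementation: the function names `fit_multinomial_nb` and `predict` and the toy data are assumptions made for this example. It works in log space, so the product of probabilities becomes a sum, and it applies additive smoothing with a parameter `alpha` so that unseen words do not produce zero probabilities.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word in class c, plus additive smoothing
        word_totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(word_totals)
        log_likelihood[c] = [math.log(t / denom) for t in word_totals]
    return log_prior, log_likelihood

def predict(x, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: columns are counts of ["free", "meeting"]
X_toy = [[2, 0], [3, 1], [0, 2], [0, 1]]
y_toy = [1, 1, 0, 0]  # 1 = spam, 0 = ham

log_prior, log_likelihood = fit_multinomial_nb(X_toy, y_toy)
print(predict([2, 0], log_prior, log_likelihood))  # email dominated by "free"
print(predict([0, 3], log_prior, log_likelihood))  # email dominated by "meeting"
```

With `alpha=1.0` this corresponds to Laplace smoothing; smaller values smooth less.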


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.utils import resample

In [2]:
# 1. Load and Explore the Data
emails_df = pd.read_csv('emails.csv')

# Explore the dataset
print("First few rows of the dataset:")
print(emails_df.head())

# Check for missing values
print("\nMissing values in the dataset:")
print(emails_df.isnull().sum())

# Check class distribution of the 'Prediction' column
print("\nClass distribution of 'Prediction' column:")
print(emails_df['Prediction'].value_counts())

First few rows of the dataset:
  Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  \
0   Email 1    0   0    1    0    0   0    2    0    0  ...         0    0   
1   Email 2    8  13   24    6    6   2  102    1   27  ...         0    0   
2   Email 3    0   0    1    0    0   0    8    0    0  ...         0    0   
3   Email 4    0   5   22    0    5   1   51    2   10  ...         0    0   
4   Email 5    7   6   17    1    5   2   57    0    9  ...         0    0   

   valued  lay  infrastructure  military  allowing  ff  dry  Prediction  
0       0    0               0         0         0   0    0           0  
1       0    0               0         0         0   1    0           0  
2       0    0               0         0         0   0    0           0  
3       0    0               0         0         0   0    0           0  
4       0    0               0         0         0   1    0           0  

[5 rows x 3002 columns]

Missing values in the dataset:

In [3]:
# 2. Preprocessing
# Drop irrelevant columns (e.g., Email No.)
emails_df = emails_df.drop(columns=['Email No.'])

# Separate features (X) and target (y)
X = emails_df.drop(columns=['Prediction'])
y = emails_df['Prediction']

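# Resample the dataset with replacement (bootstrap sample of the same size).
# Note: this draws from all rows uniformly, so it does not rebalance the classes.
# Because it runs before the train/test split, duplicated rows can end up in
# both sets, which may inflate the test metrics reported below.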
X_resampled, y_resampled = resample(X, y, 
                                    replace=True, 
                                    n_samples=len(emails_df), 
                                    random_state=42)
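
If the goal is to oversample the minority class, one common pattern is to split first and then resample only the minority rows of the training set. The sketch below shows that pattern on hypothetical stand-in data (`X_demo`, `y_demo` are made up for illustration); it is an alternative to the cell above, not what produced the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical stand-in data: 100 rows, 5 count features, 30 positive labels
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 5, size=(100, 5))
y_demo = np.array([1] * 30 + [0] * 70)

# Split first so that no duplicated row can land in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Oversample only the minority class, and only within the training set
minority_mask = y_tr == 1
n_majority = int((~minority_mask).sum())
X_min_up, y_min_up = resample(
    X_tr[minority_mask], y_tr[minority_mask],
    replace=True, n_samples=n_majority, random_state=42,
)

# Combine the untouched majority rows with the upsampled minority rows
X_tr_bal = np.vstack([X_tr[~minority_mask], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority_mask], y_min_up])

print("Balanced training class counts:", np.bincount(y_tr_bal))
print("Test set size (left untouched):", len(y_te))
```

The test set keeps its original class distribution, so the metrics computed on it reflect what the model would see in practice.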

In [4]:
# 3. Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, 
                                                    test_size=0.2, random_state=42)

In [5]:
# 4. Train a Naive Bayes Model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

In [6]:
# 5. Optimize the Model using Grid Search (for hyperparameter tuning)
from sklearn.model_selection import GridSearchCV

# Set up a parameter grid for optimization (optional)
param_grid = {'alpha': [0.1, 0.5, 1, 2, 5]}  # Additive (Laplace/Lidstone) smoothing parameter

# Perform Grid Search
grid_search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print best hyperparameters
print("\nBest hyperparameters from Grid Search:")
print(grid_search.best_params_)


Best hyperparameters from Grid Search:
{'alpha': 0.1}


In [7]:
# Use the best estimator found by GridSearchCV
best_model = grid_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model using various metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

# Print the evaluation metrics
print("\nModel Evaluation Metrics After Hyperparameter Tuning:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Model Evaluation Metrics After Hyperparameter Tuning:
Accuracy: 0.9498
Precision: 0.9003
Recall: 0.9249
F1-Score: 0.9125
ROC-AUC: 0.9846

Confusion Matrix:
[[712  30]
 [ 22 271]]
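
As a sanity check, the accuracy, precision, recall, and F1 values can be recomputed by hand from the confusion matrix printed above. scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for labels `0` and `1` the layout is `[[TN, FP], [FN, TP]]`. The `_cm` variable names are chosen here only to avoid overwriting the notebook's own variables.

```python
# Confusion matrix as printed above: rows are true labels, columns are predictions
tn, fp = 712, 30
fn, tp = 22, 271

accuracy_cm = (tp + tn) / (tp + tn + fp + fn)
precision_cm = tp / (tp + fp)   # share of predicted spam that is really spam
recall_cm = tp / (tp + fn)      # share of real spam that was caught
f1_cm = 2 * precision_cm * recall_cm / (precision_cm + recall_cm)

print(f"Accuracy: {accuracy_cm:.4f}")
print(f"Precision: {precision_cm:.4f}")
print(f"Recall: {recall_cm:.4f}")
print(f"F1-Score: {f1_cm:.4f}")
```

In words: 30 ham emails were flagged as spam and 22 spam emails were missed, out of 1035 test emails. ROC-AUC cannot be recovered this way because it depends on the predicted probabilities, not only on the final labels.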
