# Spam Detection Project Documentation

## Introduction:
This project develops a machine learning model for spam detection in emails. Using techniques from natural language processing (NLP) and supervised learning, it classifies each message as spam or ham (non-spam), with the goal of improving email security and user experience.

## Project Overview:

### 1. Data Loading and Overview:
- Imported the pandas library for data manipulation and analysis.
- Loaded the dataset from a CSV file using pandas, specifying 'latin1' encoding because the default UTF-8 decoding failed.
- Displayed the structure and initial rows of the dataset to understand its format.

### 2. Data Cleaning and Preparation:
- Dropped unnecessary columns ('Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4') from the dataset.
- Renamed columns for clarity ('v1' to 'label', 'v2' to 'message').
- Checked for missing values and found none.

### 3. Text Preprocessing:
- Imported necessary libraries for text preprocessing, including NLTK.
- Lowercased all text messages for consistency.
- Removed punctuation and tokenized messages into individual words.
- Removed stopwords and applied stemming to reduce words to their root forms.

### 4. Feature Extraction:
- Used TF-IDF vectorization to convert text data into numerical features.
- Converted tokenized messages back to string format for vectorization.
- Created TF-IDF matrix representing messages and unique words.
- Transformed the target variable 'label' into binary format (spam: 1, ham: 0).

### 5. Model Training:
- Split the dataset into training and testing sets (80% train, 20% test).
- Initialized and trained a Multinomial Naive Bayes classifier on the training data.
- Evaluated the classifier's performance on the testing set, achieving an accuracy of 96.14%.

### 6. Model Evaluation:
- Generated a classification report to assess precision, recall, and F1-score.
- Ham (class 0) reached precision 0.96 and recall 1.00; spam (class 1) reached precision 1.00 but recall of only 0.71, so about 29% of spam messages in the test set were classified as ham.

### 7. Parameter Tuning:
- Utilized GridSearchCV to tune the hyperparameter 'alpha' for the Naive Bayes classifier.
- Found the best alpha value to be 0.1 through cross-validation.

### 8. Final Model Evaluation:
- Trained a new classifier with the best alpha value obtained.
- Evaluated the tuned classifier's performance, achieving an accuracy of 98.48%.
- Spam recall rose from 0.71 to 0.92 with spam precision at 0.97; ham precision and recall were both 0.99.

## Conclusion:
Through data preprocessing, TF-IDF feature extraction, model training, and hyperparameter tuning, we developed a Multinomial Naive Bayes spam detector that reaches 98.48% accuracy on the held-out test set, with spam recall of 0.92. Future work may involve exploring additional feature engineering techniques and more advanced machine learning models to further improve performance.

## Scores:

- **Accuracy (Before Tuning):** 96.14%
- **Accuracy (After Tuning):** 98.48%


In [3]:
# Importing the pandas library for data manipulation and analysis.
import pandas as pd

# Define the path to the dataset and load it into a pandas DataFrame.
# We use 'latin1' encoding because reading with the default 'utf-8' encoding failed.
# This is common when a file contains bytes that are not valid UTF-8 sequences.
spam_data_path = 'spam.csv'
spam_data = pd.read_csv(spam_data_path, encoding='latin1')

# Displaying the first few rows of the DataFrame to understand its structure.
# We expect to see a label column ('v1') that indicates if the message is 'spam' or 'ham' (non-spam),
# and a text column ('v2') which contains the email message itself.
spam_data.head()


|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
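
The encoding comment in the cell above can be made concrete with a small standard-library sketch. The byte string here is a made-up example, not taken from `spam.csv`: byte `0xA3` is the pound sign in Latin-1 but is not a valid UTF-8 sequence on its own, so UTF-8 decoding raises an error while Latin-1 decoding, which maps every byte to a character, succeeds.

```python
# Hypothetical raw bytes, for illustration only: 0xA3 is the pound sign in Latin-1.
raw = b"Win \xa3100 cash now"

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # 0xA3 is a UTF-8 continuation byte with no lead byte, so UTF-8 decoding fails.
    text = raw.decode("latin1")

print(text)
```

Because Latin-1 never raises, it can also silently produce wrong characters when a file is really in another encoding, so it is worth spot-checking a few decoded messages.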


In [4]:
# Data Cleaning and Preparation

# Dropping unnecessary columns that are not needed for spam detection.
# These are the columns named 'Unnamed: 2', 'Unnamed: 3', and 'Unnamed: 4'.
spam_data = spam_data.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

# Renaming the columns for better readability and accessibility.
# 'v1' will be renamed to 'label' and 'v2' to 'message'.
spam_data.columns = ['label', 'message']

# Checking for any missing values in our dataset.
# Missing values need to be handled appropriately before training a machine learning model.
missing_values = spam_data.isnull().sum()

# Display the new structure of the DataFrame and the count of missing values.
spam_data.head(), missing_values


(  label                                            message
 0   ham  Go until jurong point, crazy.. Available only ...
 1   ham                      Ok lar... Joking wif u oni...
 2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
 3   ham  U dun say so early hor... U c already then say...
 4   ham  Nah I don't think he goes to usf, he lives aro...,
 label      0
 message    0
 dtype: int64)

In [5]:
# Text Preprocessing

# Importing the necessary libraries for text preprocessing
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import string

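# Downloading the NLTK resources used below: the stopword list and the 'punkt'
# tokenizer models that nltk.word_tokenize relies on
# (newer NLTK releases may ask for 'punkt_tab' instead).
import nltk
nltk.download('stopwords')
nltk.download('punkt')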

# Convert all messages to lower case
spam_data['message'] = spam_data['message'].str.lower()

# Remove punctuation from messages by deleting every character in string.punctuation
punctuation_table = str.maketrans('', '', string.punctuation)
spam_data['message'] = spam_data['message'].str.translate(punctuation_table)

# Tokenizing the messages into individual words
spam_data['message'] = spam_data['message'].apply(nltk.word_tokenize)

# Removing stopwords
stop_words = set(stopwords.words('english'))
spam_data['message'] = spam_data['message'].apply(lambda x: [word for word in x if word not in stop_words])

# Applying stemming
stemmer = SnowballStemmer('english')
spam_data['message'] = spam_data['message'].apply(lambda x: [stemmer.stem(word) for word in x])

# Display the processed messages
spam_data.head()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


|   | label | message |
|---|-------|---------|
| 0 | ham | [go, jurong, point, crazi, avail, bugi, n, gre... |
| 1 | ham | [ok, lar, joke, wif, u, oni] |
| 2 | spam | [free, entri, 2, wkli, comp, win, fa, cup, fin... |
| 3 | ham | [u, dun, say, earli, hor, u, c, alreadi, say] |
| 4 | ham | [nah, dont, think, goe, usf, live, around, tho... |
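
Row 4 of the output keeps the token `dont`, which hints at an ordering effect: punctuation is stripped before stopwords are removed, so a contraction like "don't" becomes "dont" and no longer matches a stopword entry spelled with an apostrophe. The sketch below reproduces that effect with the standard library only. It is a simplified stand-in, not the notebook's pipeline: `STOP_WORDS` is a tiny hypothetical set, whitespace splitting replaces `nltk.word_tokenize`, and stemming is omitted.

```python
import string

# Tiny illustrative stopword set (an assumption, not NLTK's actual list).
STOP_WORDS = {"i", "he", "to", "don't"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(message, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = message.lower().translate(PUNCT_TABLE)
    return [token for token in text.split() if token not in stop_words]


tokens = preprocess("Nah I don't think he goes to usf")
print(tokens)
```

If contractions should be filtered, one option is to remove stopwords before stripping punctuation, or to add apostrophe-free spellings to the stopword set.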


In [6]:
# Feature Extraction

# Importing TfidfVectorizer from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import TfidfVectorizer

# We need to convert the list of words back into string format to use TfidfVectorizer
spam_data['message'] = spam_data['message'].apply(lambda x: ' '.join(x))

# Creating a TfidfVectorizer object
tfidf_vect = TfidfVectorizer()

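# Fitting the vectorizer to the messages and transforming the messages into a numeric format.
# Note: the vectorizer is fitted on the full dataset before the train/test split,
# so the vocabulary and IDF weights also reflect the test messages.
X = tfidf_vect.fit_transform(spam_data['message'])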

# Getting the target variable 'label' where spam is 1 and ham is 0
y = (spam_data['label'] == 'spam').astype(int)

# Display the shape of the features matrix X and target vector y
X.shape, y.shape


((5572, 8033), (5572,))
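
To make the numbers inside the 5572 x 8033 matrix less opaque, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count, followed by L2 normalisation of each row. The three token lists are invented for illustration. Note also that the vectorizer's default token pattern only keeps tokens of two or more characters, so single-character tokens such as `u` or `2` from the preprocessing step would not become features.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed documents (not taken from the dataset).
docs = [["free", "entri", "win"], ["ok", "lar"], ["free", "free", "call"]]


def tfidf(documents):
    """Return (idf, rows) using smoothed IDF and L2-normalised rows."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    rows = []
    for doc in documents:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows


idf, rows = tfidf(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, `free` receives a lower weight than `entri` or `win` because it appears in two of the three documents and is therefore less distinctive.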

In [7]:
# Model Training

# Importing the Multinomial Naive Bayes model from scikit-learn
from sklearn.naive_bayes import MultinomialNB

# Importing train_test_split to split the data into training and testing sets
from sklearn.model_selection import train_test_split

# Splitting the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Training the classifier on the training set
nb_classifier.fit(X_train, y_train)

# Evaluating the classifier on the testing set
accuracy = nb_classifier.score(X_test, y_test)
accuracy


0.9614349775784753

In [8]:
# Model Evaluation

# Importing classification_report from scikit-learn to compute evaluation metrics
from sklearn.metrics import classification_report

# Making predictions on the testing set
y_pred = nb_classifier.predict(X_test)

# Computing and printing the classification report
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.71      0.83       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115
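
The report shows why accuracy alone is misleading on this split. The arithmetic below uses confusion-matrix counts that I inferred from the report and the accuracy value (965 ham and 150 spam in the test set, 1072 of 1115 predictions correct, spam precision of 1.00); they are a reconstruction consistent with those figures, not output copied from the notebook.

```python
# Counts inferred from the classification report above (a reconstruction):
# true negatives, false positives, false negatives, true positives for spam.
tn, fp, fn, tp = 965, 0, 43, 107

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
ham_precision = tn / (tn + fn)

# Accuracy of a trivial classifier that labels every message as ham.
always_ham_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f}")
print(f"spam precision={spam_precision:.2f} recall={spam_recall:.2f} f1={spam_f1:.2f}")
print(f"ham precision={ham_precision:.2f}")
print(f"always-ham baseline accuracy={always_ham_accuracy:.4f}")
```

Since spam makes up only 150 of the 1115 test messages, a classifier that never predicts spam would already score well on accuracy, which is why the spam recall of 0.71 is the more informative number here.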



In [9]:
# Model Training with Parameter Tuning

# Importing GridSearchCV for parameter tuning
from sklearn.model_selection import GridSearchCV

# Defining a range of alpha values to try
alpha_values = {'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]}

# Creating a Multinomial Naive Bayes classifier
nb_classifier_tuned = MultinomialNB()

# GridSearchCV for parameter tuning
grid_search = GridSearchCV(nb_classifier_tuned, param_grid=alpha_values, cv=5, scoring='accuracy')

# Fitting the grid search to the training data
grid_search.fit(X_train, y_train)

# Extracting the best alpha value from the grid search results
best_alpha = grid_search.best_params_['alpha']

# Creating a new Naive Bayes classifier with the best alpha value
nb_classifier_best = MultinomialNB(alpha=best_alpha)

# Training the classifier on the training set
nb_classifier_best.fit(X_train, y_train)

# Evaluating the classifier on the testing set
accuracy_best = nb_classifier_best.score(X_test, y_test)
accuracy_best


0.9847533632286996
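
The `alpha` parameter tuned above controls additive smoothing in Multinomial Naive Bayes: each per-class word likelihood is estimated as `(count + alpha) / (total + alpha * V)`, where `V` is the vocabulary size. The sketch below applies that formula to invented word counts to show how `alpha` shifts probability mass toward words the class has never seen. With TF-IDF features the "counts" are fractional weights, but the formula is the same.

```python
def smoothed_likelihoods(counts, alpha):
    """Return P(word | class) = (count + alpha) / (total + alpha * V)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}


# Hypothetical word counts for the spam class; "meet" never occurs in spam.
spam_counts = {"free": 8, "win": 2, "meet": 0}

results = {}
for alpha in (0.1, 1.0, 10.0):
    results[alpha] = smoothed_likelihoods(spam_counts, alpha)
    print(alpha, {w: round(p, 3) for w, p in results[alpha].items()})
```

A larger `alpha` flattens the distribution, so words that are strong spam indicators lose some of their weight. One plausible reading of the tuning result is that the default `alpha=1.0` smoothed too aggressively for this vocabulary, though the notebook does not investigate the cause.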

In [10]:
# Model Evaluation with Best Parameters

# Making predictions on the testing set using the classifier with the best alpha value
y_pred_best = nb_classifier_best.predict(X_test)

# Computing and printing the classification report for the classifier with the best alpha value
report_best = classification_report(y_test, y_pred_best)
print(report_best)


              precision    recall  f1-score   support

           0       0.99      0.99      0.99       965
           1       0.97      0.92      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.98      0.98      0.98      1115
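
One possible refinement: in the notebook, `TfidfVectorizer` is fitted on all messages before the split, so both the test set and every cross-validation fold share vocabulary and IDF statistics with the training data. A scikit-learn `Pipeline` refits the vectorizer inside each fold instead. The sketch below shows the pattern on a toy dataset; the messages, labels, and alpha grid are hypothetical, and the source does not show whether this would change the reported scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); in the notebook these would be the
# preprocessed 'message' strings and the binary labels.
messages = [
    "free entri win cash prize", "win free ticket call now", "claim free prize today",
    "urgent call win cash", "free cash offer call", "prize winner claim now",
    "ok see you later", "are you come home", "meet for lunch tomorrow",
    "call me when you get home", "see you at lunch", "ok talk later today",
]
labels = [1] * 6 + [0] * 6

# Split the raw text first, so the vectorizer never sees the test messages.
msg_train, msg_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# The 'nb__' prefix routes the parameter to the pipeline step named 'nb'.
grid = GridSearchCV(pipeline, param_grid={"nb__alpha": [0.1, 0.5, 1.0]},
                    cv=3, scoring="accuracy")
grid.fit(msg_train, y_train)

best_alpha = grid.best_params_["nb__alpha"]
test_accuracy = grid.score(msg_test, y_test)
print(best_alpha, test_accuracy)
```

The fitted `grid` object can then be used directly on raw preprocessed strings with `grid.predict([...])`, since vectorization happens inside the pipeline.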

