# **Spam Email Classifier Using Machine Learning**

## **Project Description**:

This project builds a spam email classifier that uses machine learning to label incoming emails as spam or non-spam (ham). The dataset consists of 33,715 email samples, each labeled as spam or not.

To evaluate the model's performance, the dataset is split into training and test sets. In this project, 70% of the data (23,600 samples) is used for training the model, and the remaining 30% (10,115 samples) is held out to measure how well the model generalizes to unseen data.
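
The split sizes can be checked with a little arithmetic. This is a minimal sketch, assuming the test count is rounded up and the remainder goes to training, which is how scikit-learn's `train_test_split` appears to handle a fractional `test_size`; the resulting counts agree with the support totals in the classification reports further down.

```python
import math

n_samples = 33715   # total emails in the dataset
test_size = 0.3     # fraction reserved for testing

# Assumption: the test count is rounded up and the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 23600 10115
```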

### The project follows these steps:

**Data Preprocessing**: Cleaning and preparing the dataset, including text normalization (removal of stop words, punctuation, etc.).

**Feature Extraction**: Converting email text data into numerical features, such as word frequency (TF-IDF), to feed into machine learning algorithms.

**Model Selection**: Testing various classification algorithms (**e.g.**, **Logistic Regression**, **Naive Bayes**, **SVM**) to determine the best-performing model.

**Model Evaluation**: Using standard metrics (**accuracy**, **precision**, **recall**, **F1-score**) to assess model performance on the test set.

**Optimization**: Fine-tuning the model to improve accuracy and reduce overfitting.

By the end of the project, a reliable spam filter will be developed that can automatically classify emails, contributing to more efficient email management.
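
To make the feature-extraction step concrete, here is a small pure-Python sketch of TF-IDF weighting. The toy corpus and the `tfidf` helper are hypothetical; the smoothed IDF formula and L2 normalization are meant to mirror the `TfidfVectorizer` defaults, but this is an illustration rather than the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and split on whitespace.
docs = [
    "win free money now",
    "meeting schedule for monday",
    "free meeting room",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
# "free" occurs in two documents, so it is weighted lower than "win".
print(vec["free"] < vec["win"])  # True
```

Terms that appear in many emails carry less weight than rare ones, which is why TF-IDF tends to separate spam vocabulary better than raw counts.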



In [1]:
import os
import zipfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

import re
import joblib
from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [2]:
email_df = pd.read_csv('enron-spam.csv')

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
class EmailPreProcessor(BaseEstimator, TransformerMixin):

    def __init__(self):
        """
        Constructor for EmailPreProcessor, takes in no arguments.

        Initializes the necessary components for email preprocessing:
        - stopwords: A set of common English stopwords.
        - stemmer: The Porter Stemmer for stemming words.
        - vectorizer: A TF-IDF vectorizer to convert text to numerical features.
        """
        self.stop_words = set(stopwords.words('english'))
        self.stemmer = PorterStemmer()
        self.vectorizer = TfidfVectorizer(stop_words='english')

    def fit(self, X, y=None):
        """
        The fit method is responsible for learning the vocabulary and the TF-IDF scores
        from the email dataset. It also performs basic text preprocessing steps like:
        - Cleaning email content (removes HTML tags, special characters, and punctuation).
        - Removing stopwords.
        - Applying stemming.

        Parameters:
        - X: pandas DataFrame containing email content in the 'content' column.
        - y: Optional; target labels (not used here).

        Returns:
        - self: The fitted EmailPreProcessor object.
        """

        email_df = X.copy(deep=True)

        email_df['content'] = self.clean_email_content(email_df)
        email_df['tags'] = self.stopwords_removal(email_df)
        email_df['tags'] = self.stem_conversion(email_df)

        self.vectorizer.fit(email_df['tags'].values)

        return self

    def transform(self, X: pd.DataFrame, y=None) -> np.ndarray:
        """
        The transform method preprocesses the email data by applying the learned
        transformations (e.g., TF-IDF) on the raw email content.

        Preprocessing steps:
        - Removing HTML tags, special characters, punctuation.
        - Lowercasing the text.
        - Removing stopwords.
        - Stemming the tokens.
        - Applying TF-IDF vectorization.

        Parameters:
        - X: pandas DataFrame containing the email content in the 'content' column.
        - y: Optional; target labels (not used here).

        Returns:
        - A matrix with TF-IDF values representing the transformed email data.
        """

        email_df = X.copy(deep=True)

        email_df['content'] = self.clean_email_content(email_df)
        email_df['tags'] = self.stopwords_removal(email_df)
        email_df['tags'] = self.stem_conversion(email_df)

        return self.vectorizer.transform(email_df['tags'].values).toarray()


    def clean_email_content(self, email_df: pd.DataFrame) -> pd.Series:
        """
        Removes HTML tags, non-ASCII characters, punctuation, and extra whitespace
        from the email content, then lowercases it.

        Parameters:
        - email_df: pandas DataFrame containing email content in the 'content' column.

        Returns:
        - A pandas Series with the cleaned text.
        """
        def clean_text(text: str) -> str:
            # remove HTML tags using BeautifulSoup
            text = BeautifulSoup(text, "html.parser").get_text()

            # remove non-ASCII characters
            text = re.sub(r'[^\x00-\x7F]+', '', text)

            # remove punctuation and other symbols, i.e. anything that is not a
            # word character or whitespace (this also drops control characters such as "\x01")
            text = re.sub(r'[^\w\s]', '', text)

            # normalize multiple spaces or newlines to a single space
            text = re.sub(r'\s+', ' ', text).strip()

            # convert the text to lower case for normalization
            text = text.lower()

            return text

        return email_df['content'].apply(clean_text)


    def stopwords_removal(self, email_df: pd.DataFrame) -> pd.Series:
        """
        Splits the cleaned email text into tokens and removes stopwords.

        Parameters:
        - email_df: pandas DataFrame containing the cleaned text in the 'content' column.

        Returns:
        - A pandas Series of token lists with stopwords removed.
        """
        def stopwords_removal_helper(text):
            words = text.split()
            return [word for word in words if word not in self.stop_words]

        return email_df['content'].apply(stopwords_removal_helper)


    def stem_conversion(self, email_df: pd.DataFrame) -> pd.Series:
        """
        Applies stemming to the tokens in the email content.

        Parameters:
        - email_df: pandas DataFrame containing token lists in the 'tags' column.

        Returns:
        - A pandas Series of strings, each the space-joined stemmed tokens of one email.
        """
        def stem_conversion_helper(words):
            # join the list values as a string to pass it to TF-IDF Vectorizer
            return " ".join([self.stemmer.stem(word) for word in words])

        return email_df['tags'].apply(stem_conversion_helper)
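
The regex part of the cleaning step can be tried in isolation. The sketch below leaves out the BeautifulSoup call so that it runs with the standard library only; `clean_text_sketch` and the sample string are hypothetical and exist only for illustration.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Regex-only version of the cleaning steps (HTML stripping is left out)."""
    text = re.sub(r'[^\x00-\x7F]+', '', text)   # drop non-ASCII characters
    text = re.sub(r'[^\w\s]', '', text)         # drop punctuation and symbols
    text = re.sub(r'\s+', ' ', text).strip()    # collapse runs of whitespace
    return text.lower()

# Hypothetical sample with punctuation, extra spaces, and one non-ASCII character.
sample = "Subject: WIN a FREE prize!!!  Click   here, caf\u00e9"
print(clean_text_sketch(sample))  # subject win a free prize click here caf
```

Because punctuation is deleted rather than replaced with a space, a token such as "e-mail" becomes "email" and "don't" becomes "dont".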

## Model Training

### Train-Test Split
We train on 70% of the data and hold out the remaining 30% for testing. Evaluating on held-out data shows how well the model generalizes to unseen emails and makes overfitting visible.


In [None]:
# defining the input and target columns
input_cols = ['content']
target_col = ['spam']

X = email_df[input_cols]
y = email_df[target_col]

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

### Multinomial Naive Bayes

In [None]:
# Pipeline for naive bayes model.
naive_bayes_pipeline = Pipeline(
    steps=[
        ('preprocessor', EmailPreProcessor()),
        ('multinomial_nb_clf', MultinomialNB(alpha=1, fit_prior=False))
    ]
)

naive_bayes_pipeline.fit(X_train, y_train.values.ravel())

y_train_pred = naive_bayes_pipeline.predict(X_train)
y_test_pred = naive_bayes_pipeline.predict(X_test)
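
To show what `alpha=1` and `fit_prior=False` mean, here is a pure-Python sketch of multinomial Naive Bayes. The toy data and helper names are hypothetical, and the sketch uses raw token counts where the pipeline above passes TF-IDF weights, so treat it as an illustration of the idea rather than a reimplementation of `MultinomialNB`.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, label) with 1 = spam, 0 = ham.
train = [
    ("win free money".split(), 1),
    ("free prize now".split(), 1),
    ("meeting at noon".split(), 0),
    ("project meeting notes".split(), 0),
]

alpha = 1.0  # Laplace smoothing, as in MultinomialNB(alpha=1)
vocab = {t for tokens, _ in train for t in tokens}

# Per-class token counts.
counts = {0: Counter(), 1: Counter()}
for tokens, label in train:
    counts[label].update(tokens)

def log_likelihood(tokens, label):
    """Sum of smoothed log P(token | label); out-of-vocabulary tokens are skipped."""
    total = sum(counts[label].values()) + alpha * len(vocab)
    return sum(
        math.log((counts[label][t] + alpha) / total)
        for t in tokens if t in vocab
    )

def predict(tokens):
    # fit_prior=False means a uniform class prior, so the prior cancels out
    # and only the likelihoods are compared.
    return max((0, 1), key=lambda label: log_likelihood(tokens, label))

print(predict("free money".split()), predict("meeting notes".split()))  # 1 0
```

Smoothing keeps a token that never occurred in one class from driving that class's probability to zero.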



### Evaluation

In [None]:
print(classification_report(y_train.values.ravel(), y_train_pred))
print(classification_report(y_test.values.ravel(), y_test_pred))


              precision    recall  f1-score   support

           0       0.99      0.99      0.99     11572
           1       0.99      0.99      0.99     12028

    accuracy                           0.99     23600
   macro avg       0.99      0.99      0.99     23600
weighted avg       0.99      0.99      0.99     23600

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      4973
           1       0.99      0.98      0.98      5142

    accuracy                           0.98     10115
   macro avg       0.98      0.98      0.98     10115
weighted avg       0.98      0.98      0.98     10115
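
The precision, recall, and F1 columns in these reports all derive from confusion-matrix counts. A minimal sketch with hypothetical counts (not taken from the run above):

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of the actual spam emails, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the spam class, chosen only for illustration.
p, r, f1 = prf(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.9 0.75 0.82
```

For a spam filter, low precision on the spam class means legitimate mail is being flagged, which is usually the costlier error.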



## Exporting the Model

In [None]:
joblib.dump(naive_bayes_pipeline, 'spam_email_clf_nb.joblib')

['spam_email_clf_nb.joblib']

## For Testing Purposes

In [None]:
preprocessor = EmailPreProcessor()
preprocessor.fit(email_df)
X = preprocessor.transform(email_df)



In [None]:
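# Fit an RBF-kernel SVM on the full, preprocessed dataset.
svc_clf = SVC(C=1.5, kernel='rbf')
svc_clf.fit(X, y.values.ravel())

# Note: this scores the model on the same data it was fitted on, so it
# reports training accuracy, not generalization to unseen emails.
svc_clf.score(X, y.values.ravel())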

(33715, 132422)