## Email Spam Detection Model using Machine Learning

In [45]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib


## Data Loading and Preprocessing

In [32]:
# Load the dataset
spamdata = pd.read_csv('spam.csv', encoding='latin-1')

# Display basic information about the dataset
# info() prints its summary and returns None, so it is called directly
spamdata.info()
spamdata.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  

The dataset contains 5572 entries and 5 columns. Only two of the columns are relevant:

v1: The label indicating whether an email is 'spam' or 'ham' (not spam).

v2: The actual text of the email.

The other columns (Unnamed: 2, Unnamed: 3, Unnamed: 4) are almost entirely empty, with only 50, 12, and 6 non-null entries out of 5572, so they are unlikely to help with spam detection.
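
The sparsity argument above can be checked programmatically. The following is a minimal sketch on a toy frame, not the notebook's own code; the toy data and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above: two full columns, one nearly empty.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["see you soon", "win a prize now", "ok", "call me later"],
    "Unnamed: 2": [np.nan, "stray text", np.nan, np.nan],
})

# Share of non-null values in each column.
non_null_share = toy.notna().mean()

# Keep only columns that are at least half populated (threshold is an assumption).
keep = non_null_share[non_null_share >= 0.5].index
cleaned = toy[keep]

print(non_null_share)
print(list(cleaned.columns))
```

On the real dataset the three unnamed columns are each under 1% populated, so any reasonable threshold would drop them.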

In [33]:
spamdata.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [34]:
# Removing the unnecessary columns 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'
spamdata = spamdata.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)

# Display the first few rows of the cleaned dataset
spamdata.head()


     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

In [35]:
spamdata.columns

Index(['v1', 'v2'], dtype='object')

In [39]:
# Renaming columns for clarity
spamdata.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)

# Converting labels to a binary format (ham = 0, spam = 1)
spamdata['label'] = spamdata['label'].map({'ham': 0, 'spam': 1})

spamdata.head()

   label                                               text
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
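
One caveat with mapping labels through a dictionary is that `Series.map` turns any value missing from the dictionary into NaN without raising an error. A small sanity check could look like the sketch below; the toy labels, including the deliberately miscased 'Spam', are assumptions for illustration.

```python
import pandas as pd

# Toy labels; the last one has unexpected casing and is not in the mapping.
labels = pd.Series(["ham", "spam", "ham", "Spam"])

mapped = labels.map({"ham": 0, "spam": 1})

# Any label that was not in the dictionary shows up as NaN after map().
unmapped = labels[mapped.isna()]
print(unmapped.tolist())
```

An empty `unmapped` list on the real data would confirm that every row was labelled either 'ham' or 'spam'.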



## Feature Engineering

In [41]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(spamdata['text'], spamdata['label'], test_size=0.2, random_state=42)

# Feature Engineering: Text vectorization (TF-IDF)
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
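
To illustrate what the TF-IDF step produces, here is a minimal sketch on a three-sentence toy corpus (the corpus is an assumption, not taken from the dataset). It shows that English stop words are dropped and that a word appearing in fewer documents receives a higher weight than a word appearing in several.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "free" appears in two documents, "prize" in only one.
corpus = [
    "win a free prize now",
    "are you free for lunch",
    "lunch at noon",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_  # maps each kept term to its column index
print(sorted(vocab))
print(matrix.shape)

# Within the first document, the rarer word gets the larger weight.
prize_weight = matrix[0, vocab["prize"]]
free_weight = matrix[0, vocab["free"]]
print(prize_weight > free_weight)
```

In the notebook the vectorizer is not fitted on its own; it is passed into the pipeline so that it learns its vocabulary from the training split only.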


## Model Selection and Training

In [43]:
# Model Selection and Training: Using Naive Bayes Classifier
model = make_pipeline(tfidf_vectorizer, MultinomialNB())

# Train the model
model.fit(X_train, y_train)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(stop_words='english')),
                ('multinomialnb', MultinomialNB())])

## Evaluation

In [44]:
# Model Evaluation
predictions = model.predict(X_test)
report = classification_report(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

print(report)
print(conf_matrix)

              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115

[[965   0]
 [ 37 113]]

The spam detector model has been trained and evaluated. Here's a summary of its performance on the test set:

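Accuracy: The model achieves an overall accuracy of 97% on the 1115 test emails. Because 965 of them (about 87%) are 'ham', a classifier that always predicted 'ham' would already reach about 87%, so accuracy alone overstates how well spam is caught.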

Precision and Recall:
For 'ham' emails (label 0): The precision is 96%, and the recall is 100%. Every legitimate email was classified correctly, but about 4% of the emails the model labelled 'ham' were actually spam.

For 'spam' emails (label 1): The precision is 100%, and the recall is 75%. Every email the model flagged as spam really was spam, but it missed about 25% of the spam emails.

Confusion Matrix:
Out of 1115 emails in the test set, 965 were 'ham', and all of them were correctly identified (no false positives).
The remaining 150 were 'spam'; the model correctly identified 113 of them and missed 37 (false negatives).
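
The figures in this summary follow directly from the four cells of the confusion matrix. As a worked check, the sketch below recomputes them in plain Python; the variable names are illustrative, and the row/column layout assumes scikit-learn's convention of true labels in rows and predictions in columns.

```python
# Confusion matrix from the evaluation: rows are true labels, columns are predictions.
tn, fp = 965, 0    # true 'ham':  predicted ham, predicted spam
fn, tp = 37, 113   # true 'spam': predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
spam_precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of actual spam, how much was caught
ham_precision = tn / (tn + fn)    # of emails labelled ham, how many were ham
ham_recall = tn / (tn + fp)       # of actual ham, how much was kept
always_ham_accuracy = (tn + fp) / total  # baseline: predict 'ham' for everything

print(f"accuracy={accuracy:.2f}")
print(f"spam precision={spam_precision:.2f} recall={spam_recall:.2f}")
print(f"ham precision={ham_precision:.2f} recall={ham_recall:.2f}")
print(f"always-ham baseline={always_ham_accuracy:.2f}")
```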

## Deployment

The first step is to save the trained model to a file so that it can be loaded and used in an application.

In [49]:
# Save the model to a file
model_filename = 'spam_detector_model.joblib'
joblib.dump(model, model_filename)

model_filename


'spam_detector_model.joblib'
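
Loading the file back is the mirror image of saving it. The sketch below is self-contained, so it trains a tiny stand-in pipeline on four made-up messages and writes it to a temporary directory; `toy_model` and its training data are assumptions and are not the notebook's trained model.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in for the trained pipeline so the example runs on its own.
texts = ["win a free prize now", "free cash claim now", "see you at lunch", "call me later"]
labels = [1, 1, 0, 0]
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spam_detector_model.joblib")
    joblib.dump(toy_model, path)   # save
    loaded = joblib.load(path)     # load it back

# The pipeline vectorizes raw strings itself, so predict() takes plain text.
messages = ["claim your free prize", "lunch later?"]
predictions = loaded.predict(messages)
print(predictions)
```

Because the vectorizer is stored inside the pipeline, the application only needs the one file. A joblib file should be loaded only from a trusted source, since loading it can execute arbitrary code.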

The saved model is served as an API endpoint by a simple Flask-based application.

Visit **https://github.com/amadasunese/spam_detector** to run the application and test the model.
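
The repository's code is not reproduced here, so the following is only a sketch of what such an endpoint might look like. The `/predict` route, the `create_app` factory, the JSON field names, and the `KeywordModel` stand-in are all hypothetical and are not taken from the linked application.

```python
from flask import Flask, jsonify, request


def create_app(model):
    """Build a Flask app around any object exposing predict(list_of_str)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])  # hypothetical route name
    def predict():
        payload = request.get_json(silent=True) or {}
        text = payload.get("text")
        if not isinstance(text, str) or not text.strip():
            return jsonify({"error": "expected JSON with a non-empty 'text' field"}), 400
        label = int(model.predict([text])[0])
        return jsonify({"label": label, "is_spam": bool(label)})

    return app


class KeywordModel:
    """Stand-in for the saved pipeline; flags any text containing 'free'."""

    def predict(self, texts):
        return [1 if "free" in t.lower() else 0 for t in texts]


# Exercise the endpoint with Flask's built-in test client.
app = create_app(KeywordModel())
client = app.test_client()
response = client.post("/predict", json={"text": "Free entry to win"})
print(response.status_code, response.get_json())
```

In a real deployment the stand-in would be replaced by the saved pipeline, for example `create_app(joblib.load('spam_detector_model.joblib'))`.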