# Email Spam Detection with Machine Learning

Spam emails, also known as junk mail, pose a significant problem for email users worldwide. These unsolicited messages often contain fraudulent content, phishing attempts, or unwanted advertisements, leading to a cluttered inbox and potential security risks. In this project, we'll explore how to build a machine learning model for email spam detection using Python.

## About the Dataset

The dataset used in this project is the SMS Spam Collection, which consists of English SMS messages tagged as either "ham" (legitimate) or "spam". The collection is described as containing 5,574 messages gathered from various sources on the internet; the `spam.csv` file loaded below yields 5,572 rows. Each message is labeled with its class (ham or spam), which lets us train a classifier to distinguish legitimate messages from spam. Although the data are SMS messages rather than emails, they serve here as a stand-in for short spam text.

## Goal of the Project

Our goal is to develop a machine learning model capable of accurately classifying emails as either spam or ham. We'll start by preprocessing the text data, including tokenization, removal of punctuation, and elimination of stopwords. Next, we'll create a Bag-of-Words representation of the text data and split the dataset into training and testing sets. Finally, we'll train a Multinomial Naive Bayes classifier on the training data and evaluate its performance on the testing set.

By the end of this project, we'll have a trained model that can effectively identify spam emails, helping users filter out unwanted content and enhance their email security.
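
As a compact preview of the steps just described, the sketch below chains a Bag-of-Words vectorizer and a Multinomial Naive Bayes classifier with scikit-learn's `make_pipeline`. The six messages and their labels are invented for illustration and are not taken from `spam.csv`; the notebook itself performs each step separately.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages invented for this example (not from spam.csv).
texts = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "call me when you get home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag-of-Words counts feed directly into the Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Classify two unseen messages; the first should come out as spam,
# the second as ham.
predictions = model.predict(["free cash prize", "see you at lunch"])
print(list(predictions))
```

Wrapping both stages in one pipeline means the vocabulary is learned only from the data passed to `fit`, which avoids leaking test-set words into training.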

Let's dive into the implementation!


In [47]:
import pandas as pd

# Load the dataset
df = pd.read_csv('spam.csv', encoding='latin-1')

# Drop the unnecessary columns
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

# Display the first few rows to confirm the changes
print(df.head())


     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [49]:
import nltk
from nltk.tokenize import word_tokenize

# Tokenize the text data
df['v2_tokens'] = df['v2'].apply(word_tokenize)

# Concatenate the tokens back into a single string for each document
df['v2_concatenated'] = df['v2_tokens'].apply(lambda x: ' '.join(x))

# Display the first few rows to verify
print(df.head())


     v1                                                 v2  \
0   ham  Go until jurong point, crazy.. Available only ...   
1   ham                      Ok lar... Joking wif u oni...   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...   
3   ham  U dun say so early hor... U c already then say...   
4   ham  Nah I don't think he goes to usf, he lives aro...   

                                           v2_tokens  \
0  [Go, until, jurong, point, ,, crazy, .., Avail...   
1           [Ok, lar, ..., Joking, wif, u, oni, ...]   
2  [Free, entry, in, 2, a, wkly, comp, to, win, F...   
3  [U, dun, say, so, early, hor, ..., U, c, alrea...   
4  [Nah, I, do, n't, think, he, goes, to, usf, ,,...   

                                     v2_concatenated  
0  Go until jurong point , crazy .. Available onl...  
1                    Ok lar ... Joking wif u oni ...  
2  Free entry in 2 a wkly comp to win FA Cup fina...  
3  U dun say so early hor ... U c already then sa...  
4  Nah I do n't

In [98]:
!pip install nltk

import nltk
from nltk.tokenize import word_tokenize

# Tokenize the text data
df['v2_tokens'] = df['v2'].apply(word_tokenize)

# Display the first few rows to verify
print(df[['v2', 'v2_tokens']].head())


                                                  v2  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                           v2_tokens  
0  [Go, until, jurong, point, ,, crazy, .., Avail...  
1           [Ok, lar, ..., Joking, wif, u, oni, ...]  
2  [Free, entry, in, 2, a, wkly, comp, to, win, F...  
3  [U, dun, say, so, early, hor, ..., U, c, alrea...  
4  [Nah, I, do, n't, think, he, goes, to, usf, ,,...  


In [53]:
import re

def remove_punctuation(tokens):
    # Match any character that is neither a word character nor whitespace
    punctuation_pattern = re.compile(r'[^\w\s]')

    # Strip punctuation from each token
    stripped = [punctuation_pattern.sub('', token) for token in tokens]

    # Drop tokens that were pure punctuation and are now empty strings
    return [token for token in stripped if token]

# Apply the function to each tokenized message
df['v2_cleaned'] = df['v2_tokens'].apply(remove_punctuation)

# Display the first few rows to verify the changes
print(df.head())


     v1                                                 v2  \
0   ham  Go until jurong point, crazy.. Available only ...   
1   ham                      Ok lar... Joking wif u oni...   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...   
3   ham  U dun say so early hor... U c already then say...   
4   ham  Nah I don't think he goes to usf, he lives aro...   

                                           v2_tokens  \
0  [Go, until, jurong, point, ,, crazy, .., Avail...   
1           [Ok, lar, ..., Joking, wif, u, oni, ...]   
2  [Free, entry, in, 2, a, wkly, comp, to, win, F...   
3  [U, dun, say, so, early, hor, ..., U, c, alrea...   
4  [Nah, I, do, n't, think, he, goes, to, usf, ,,...   

                                     v2_concatenated  \
0  Go until jurong point , crazy .. Available onl...   
1                    Ok lar ... Joking wif u oni ...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor ... U c already then sa...   
4  Nah I d
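
One detail of the punctuation step is easy to miss: a token made up only of punctuation, such as `...`, becomes an empty string once its characters are stripped. The sketch below uses tokens from the output above and a hypothetical helper name, `strip_punctuation`, to show the effect and one way to drop the emptied tokens.

```python
import re

PUNCTUATION = re.compile(r"[^\w\s]")

def strip_punctuation(tokens):
    """Remove punctuation characters and drop tokens that become empty."""
    stripped = (PUNCTUATION.sub("", token) for token in tokens)
    return [token for token in stripped if token]

tokens = ["Ok", "lar", "...", "Joking", "wif", "u", "oni", "..."]

# Without the final filter, the two "..." tokens would survive as "".
naive = [PUNCTUATION.sub("", token) for token in tokens]
print(naive)
# → ['Ok', 'lar', '', 'Joking', 'wif', 'u', 'oni', '']

print(strip_punctuation(tokens))
# → ['Ok', 'lar', 'Joking', 'wif', 'u', 'oni']
```

Note that contractions split by the tokenizer are also affected: `n't` becomes `nt` after stripping.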

In [55]:
import nltk
from nltk.corpus import stopwords

# Download the stopword list (skipped if it is already up to date)
nltk.download('stopwords')

# Load stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]

# Apply remove_stopwords function to each row in the v2_cleaned column
df['v2_cleaned'] = df['v2_cleaned'].apply(remove_stopwords)

# Display the first few rows to verify the changes
print(df.head())


     v1                                                 v2  \
0   ham  Go until jurong point, crazy.. Available only ...   
1   ham                      Ok lar... Joking wif u oni...   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...   
3   ham  U dun say so early hor... U c already then say...   
4   ham  Nah I don't think he goes to usf, he lives aro...   

                                           v2_tokens  \
0  [Go, until, jurong, point, ,, crazy, .., Avail...   
1           [Ok, lar, ..., Joking, wif, u, oni, ...]   
2  [Free, entry, in, 2, a, wkly, comp, to, win, F...   
3  [U, dun, say, so, early, hor, ..., U, c, alrea...   
4  [Nah, I, do, n't, think, he, goes, to, usf, ,,...   

                                     v2_concatenated  \
0  Go until jurong point , crazy .. Available onl...   
1                    Ok lar ... Joking wif u oni ...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor ... U c already then sa...   
4  Nah I d

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/95b32ad5-31de-4480-bb89-
[nltk_data]     b070e00f8fb3/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
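
The stopword filter compares the lowercase form of each token against the stop list, so capitalized tokens are removed too. A minimal sketch of that behavior follows; the five-word stop set and the name `drop_stopwords` are made up for the example, whereas the notebook uses NLTK's English list.

```python
# A tiny stand-in stop-word set, invented for this example; the notebook
# uses NLTK's English stop-word list instead.
STOP_WORDS = {"until", "only", "in", "a", "to"}

def drop_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["Free", "entry", "In", "2", "a", "wkly", "comp", "TO", "win"]
print(drop_stopwords(tokens))
# → ['Free', 'entry', '2', 'wkly', 'comp', 'win']
```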


In [57]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer object
count_vectorizer = CountVectorizer()

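# Fit and transform the tokenized, re-joined text.
# Note: this uses v2_concatenated, so the punctuation and stopword removal
# stored in v2_cleaned is not reflected in this matrix.
bow_matrix = count_vectorizer.fit_transform(df['v2_concatenated'])

# Convert the sparse BoW matrix to a dense array (optional and memory-heavy)
bow_array = bow_matrix.toarray()

# Print the shape of the BoW matrix: (number of messages, vocabulary size)
print("BoW matrix shape:", bow_matrix.shape)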


BoW matrix shape: (5572, 8660)
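
To make the Bag-of-Words step concrete, here is a small sketch on two message fragments from the dataset preview. With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, so single-character tokens such as `U`, `c`, `a`, and `2` do not enter the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Free entry in 2 a wkly comp",
    "U c already then say",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# Column order follows the indices stored in vocabulary_.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
# → ['already', 'comp', 'entry', 'free', 'in', 'say', 'then', 'wkly']

# One row per document, one column per vocabulary entry.
print(matrix.toarray())
print(matrix.shape)  # → (2, 8)
```

Each row counts how often every vocabulary word occurs in one message, which is why the full matrix has one row per message and one column per distinct token.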


In [59]:
from sklearn.model_selection import train_test_split

# Split the dataset into features (X) and target variable (y)
# X holds the cleaned token lists, not yet a numeric matrix
X = df['v2_cleaned']
y = df['v1']

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (4457,)
X_test shape: (1115,)
y_train shape: (4457,)
y_test shape: (1115,)
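
The split sizes above follow from simple arithmetic: scikit-learn rounds the test share up, so 20% of 5,572 rows gives 1,115 test samples and 4,457 training samples. The sketch below reproduces that calculation and then shows, on made-up labels, how the optional `stratify` argument keeps the class ratio identical in both parts. Stratification is not used in the notebook; it is shown here only as an option for imbalanced data.

```python
import math
from sklearn.model_selection import train_test_split

# Reproduce the split sizes reported above.
n_samples = 5572
n_test = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # → 4457 1115

# Made-up, imbalanced labels: 80 ham and 20 spam.
X_toy = list(range(100))
y_toy = ["ham"] * 80 + ["spam"] * 20

# stratify=y_toy preserves the 80/20 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_te.count("ham"), y_te.count("spam"))  # → 16 4
```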


In [90]:
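from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# X_train_bow and X_test_bow are not defined in the cells shown here.
# Assumed reconstruction: join each cleaned token list into a string,
# fit the vocabulary on the training set only, and reuse it for the test set.
# Scores from this reconstruction may differ slightly from those printed below.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train.apply(' '.join))
X_test_bow = vectorizer.transform(X_test.apply(' '.join))

# Initialize the Multinomial Naive Bayes classifier
naive_bayes_classifier = MultinomialNB()

# Train the classifier
naive_bayes_classifier.fit(X_train_bow, y_train)

# Predict on the testing set
y_pred = naive_bayes_classifier.predict(X_test_bow)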

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(classification_report(y_test, y_pred))


Accuracy: 0.9838565022421525
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
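
The figures in this report can be put in context with a little arithmetic. The sketch below uses only numbers printed above: it converts the accuracy into a count of misclassified messages and compares it with a baseline that labels every message as ham.

```python
# Values copied from the output above.
accuracy = 0.9838565022421525
n_test = 1115
n_ham = 965

# Number of correctly and incorrectly classified test messages.
correct = round(accuracy * n_test)
errors = n_test - correct
print(correct, errors)  # → 1097 18

# Accuracy of a trivial classifier that always predicts "ham".
baseline = n_ham / n_test
print(round(baseline, 4))  # → 0.8655
```

Because ham makes up most of the test set, the always-ham baseline already scores about 86.5%, so the per-class recall in the report is a more informative measure than accuracy alone.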



## Conclusion

In this project, we developed a machine learning model for email spam detection using the SMS Spam Collection dataset. We started by preprocessing the text data, which involved tokenization, removal of punctuation, and elimination of stopwords. We then created a Bag-of-Words representation of the text data and split the dataset into training and testing sets.

The Bag-of-Words matrix had a shape of (5572, 8660): one row for each of the 5,572 messages and one column for each of the 8,660 distinct tokens in the vocabulary. The training set consisted of 4,457 samples, while the testing set comprised 1,115 samples.

We trained a Multinomial Naive Bayes classifier on the training data and evaluated its performance on the testing set. The classifier achieved an accuracy of 98.39% on the testing set. Precision was high for both classes (0.98 for "ham" and 0.99 for "spam"). Recall was 1.00 for ham but 0.89 for spam, so roughly 11% of the 150 spam messages in the testing set were classified as legitimate.

Overall, the results suggest that a simple Bag-of-Words representation with Multinomial Naive Bayes separates spam from ham well on this dataset, with missed spam as the main source of error. Because the data are SMS messages, the same pipeline would need to be retrained and evaluated on email data before it could serve as an email filter.

