In this Email Spam Detection project, I built a machine learning pipeline that classifies messages as spam or legitimate ("ham"). The text is lowercased, stripped of punctuation, tokenized, and converted to TF-IDF features. Four classifiers (Naive Bayes, SVM, Decision Tree, and Random Forest) are then trained and compared on a held-out test set.

In [1]:
import pandas as pd
import string
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


In [2]:
# A forward slash avoids the invalid '\s' escape sequence and also works on Windows
df = pd.read_csv('dataset/spam.csv', encoding='ISO-8859-1')

In [3]:
df.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
df.columns.tolist()

['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']

In [5]:
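# Keep only the label and message columns; .copy() gives an independent
# DataFrame, so later assignments do not trigger SettingWithCopyWarning
data = df[['v1', 'v2']].copy()

# Rename columns for clarity
data.columns = ['label', 'text']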

In [6]:
# Data preprocessing: convert all text to lowercase
data.loc[:, 'text'] = data['text'].str.lower()

data.head(10)



,label,text
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."
5,spam,freemsg hey there darling it's been 3 week's n...
6,ham,even my brother is not like to speak with me. ...
7,ham,as per your request 'melle melle (oru minnamin...
8,spam,winner!! as a valued network customer you have...
9,spam,had your mobile 11 months or more? u r entitle...


In [7]:
# Remove every character listed in string.punctuation from a string
def remove_punctuation(text):
    return ''.join(char for char in text if char not in string.punctuation)

# Apply the function to every message
data.loc[:, 'text'] = data['text'].apply(remove_punctuation)

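As an alternative to the character-by-character join above, `str.translate` with a deletion table built by `str.maketrans` removes the same characters in a single pass. This is a minimal sketch; the name `remove_punctuation_fast` is hypothetical and not part of the original notebook.

```python
import string

# Translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(text):
    """Hypothetical helper: strip punctuation using str.translate."""
    return text.translate(_PUNCT_TABLE)

sample = "ok lar... joking wif u oni..."
print(remove_punctuation_fast(sample))  # ok lar joking wif u oni
```

It could be applied in the same way as the original function, for example with `data['text'].apply(remove_punctuation_fast)`.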


In [8]:
# Tokenize the text (word_tokenize requires NLTK's Punkt tokenizer data to be downloaded)
data['text'] = data['text'].apply(word_tokenize)

# Display the first few rows of the preprocessed dataset
print(data.head())

  label                                               text
0   ham  [go, until, jurong, point, crazy, available, o...
1   ham                     [ok, lar, joking, wif, u, oni]
2  spam  [free, entry, in, 2, a, wkly, comp, to, win, f...
3   ham  [u, dun, say, so, early, hor, u, c, already, t...
4   ham  [nah, i, dont, think, he, goes, to, usf, he, l...




In [9]:
# Convert the list of tokens back to a single string for each document
data['text'] = data['text'].apply(lambda tokens: ' '.join(tokens))



In [10]:
# Initialize the TF-IDF vectorizer, keeping at most the 5000 most frequent terms
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

In [11]:
# Fit the vectorizer and transform the text into a sparse TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(data['text'])


In [12]:
# Create features and target
X = tfidf_matrix  # Features
y = data['label']  # Target variable (labels: 'ham' or 'spam')


In [13]:
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
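Note that the vectorizer above is fitted on the full dataset before the split, so term statistics from the test messages influence the training features. One way to avoid this, sketched here on made-up messages, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The toy data, the `stratify` argument, and the variable names are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for data['text'] and data['label']
texts = [
    "free entry win a prize now",
    "winner claim your free cash prize",
    "urgent call now to claim your award",
    "free ringtone text win to claim",
    "ok see you at lunch",
    "are you coming home later",
    "i will call you after class",
    "thanks for the lift see you tomorrow",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps the class ratio in both parts
texts_train, texts_test, y_train_demo, y_test_demo = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier
model = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
model.fit(texts_train, y_train_demo)

predictions = model.predict(texts_test)
print(list(predictions))
```

With this arrangement, `model.predict` applies the already-fitted vectorizer to new text, so the test set never contributes to the vocabulary or the IDF weights.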

In [14]:
# Create a dictionary of classifiers
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier()
}
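`DecisionTreeClassifier()` and `RandomForestClassifier()` are created here without a `random_state`, so their scores can change from run to run. In addition, when one class is much larger than the other, accuracy alone can hide weak recall on the smaller class. The sketch below shows one possible adjustment: seed the tree-based models and compare them by F1 score on the minority class. It uses synthetic stand-in data from `make_classification`, not the spam dataset, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (class 1 is the minority class)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.85, 0.15], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Seeding the models makes repeated runs give the same results
seeded_classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}

# Score each model by F1 on the minority class instead of accuracy
scores = {}
for name, clf in seeded_classifiers.items():
    clf.fit(Xd_train, yd_train)
    scores[name] = f1_score(yd_test, clf.predict(Xd_test), pos_label=1)

best_name = max(scores, key=scores.get)
print(scores, best_name)
```

For string labels such as `'ham'` and `'spam'`, the same call would need `pos_label='spam'`.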

In [15]:
best_accuracy = 0.0
best_model = None

# Loop through each classifier
for clf_name, clf in classifiers.items():
    # Train the classifier
    clf.fit(X_train, y_train)

    # Make predictions
    y_pred = clf.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)

    # Check if this is the best model so far
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = clf_name

    classification_rep = classification_report(y_test, y_pred)

    # Print results
    print(f"Classifier: {clf_name}")
    print(f"Accuracy: {accuracy:.2f}")
    print("Classification Report:\n", classification_rep)
    print("=" * 50)  # Separator between classifiers
    print()

# Print the best model
print(f"The best model is: {best_model} with accuracy: {best_accuracy:.2f}")

Classifier: Naive Bayes
Accuracy: 0.97
Classification Report:
               precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.76      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.97      1115


Classifier: SVM
Accuracy: 0.98
Classification Report:
               precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       1.00      0.85      0.92       150

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115


Classifier: Decision Tree
Accuracy: 0.97
Classification Report:
               precision    recall  f1-score   support

         ham       0.98      0.98      0.98       965
        spam       0.86      0.89      0.88       15