**Objective**: Separating Spam From Ham

This notebook walks through email classification on a text corpus dataset. It covers building the corpus through cleaning, tokenization, stop-word removal, and stemming, and then creating a document-term matrix from the corpus. Two models, CART and Random Forest, are trained on that matrix and compared using accuracy and AUC.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DS/Dataset/emails.csv')
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [None]:
#How many emails are in the dataset?
len(df)

5728

In [None]:
#How many of the emails are spam? 
len(df[df['spam'] == 1])

1368
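
As a side note, both class counts and the spam share can be read off in one step. The sketch below uses a small made-up frame (`toy`, an assumption standing in for `df`) rather than the real dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; only the `spam` column matters here.
toy = pd.DataFrame({"spam": [1, 0, 0, 1, 0]})

counts = toy["spam"].value_counts()   # number of rows per class label
share = toy["spam"].mean()            # fraction of rows labelled spam (label 1)
print(counts)
print(share)
```

On the real data the same two calls would be applied to `df["spam"]`.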

In [None]:
#Which word appears at the beginning of every email in the dataset? 
#Respond as a lower-case word with punctuation removed.
for i in range(len(df)):
  print(df['text'][i].split(' ')[0][:-1].lower())

Streaming output truncated to the last 5000 lines.
subject
subject
subject
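
Printing one line per email is hard to scan. A possible alternative, sketched here on a made-up three-row frame (`toy` is an assumption, not the real data), is to collect the distinct first words instead.

```python
import pandas as pd

# Toy stand-in for `df`; the texts are invented for illustration.
toy = pd.DataFrame({"text": [
    "Subject: hello there",
    "Subject: re : meeting",
    "Subject: fwd : offer",
]})

first_words = (
    toy["text"]
    .str.split().str[0]   # first whitespace-separated token
    .str.rstrip(":")      # drop the trailing colon
    .str.lower()
)
unique_first_words = first_words.unique().tolist()
print(unique_first_words)
```

If every email starts the same way, the resulting list has a single entry.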

In [None]:
df['text'][3][8:].strip()

'4 color printing special  request additional information now ! click here  click here for a printable version of our order form ( pdf format )  phone : ( 626 ) 338 - 8090 fax : ( 626 ) 338 - 8102 e - mail : ramsey @ goldengraphix . com  request additional information now ! click here  click here for a printable version of our order form ( pdf format )  golden graphix & printing 5110 azusa canyon rd . irwindale , ca 91706 this e - mail message is an advertisement and / or solicitation .'

**Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?**


Yes, potentially.

Every email contains the word “subject” at least once, but how often it appears can help us differentiate spam from ham.


For example, a long email chain quotes earlier messages, so the word “subject” appears several times, and this higher frequency might be indicative of a ham message.
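
A minimal sketch of how such a frequency could be computed per email is shown here. The helper name `count_subject` and both example strings are hypothetical; nothing here is taken from the dataset.

```python
import re

def count_subject(text):
    """Count whole-word occurrences of 'subject', ignoring case."""
    return len(re.findall(r"\bsubject\b", text.lower()))

# Invented examples: a forwarded chain versus a single short message.
chain = "Subject: re : budget  original message  subject : budget  subject : fwd"
single = "Subject: win money now"

chain_count = count_subject(chain)
single_count = count_subject(single)
print(chain_count, single_count)
```

In the document-term matrix built later, this count is simply the value in the `subject` column, so no separate feature is needed.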

In [None]:
# How many characters are in the longest email in the dataset
# (where longest is measured by the number of characters)?
lengths = df['text'].str.len()   # character count of every email
max_idx = lengths.idxmax()       # row index of the longest email
max_len = lengths.max()
max_len

43952

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
print(len(stopwords.words('english')))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
179


In [None]:
#Build a new corpus variable called corpus
corpus = df.copy()

In [None]:
#convert the text to lowercase.
# Assign the whole column at once; element-wise chained indexing
# (corpus['text'][i] = ...) triggers pandas' SettingWithCopyWarning.
corpus['text'] = corpus['text'].str.lower()
corpus.head()

Unnamed: 0,text,spam
0,subject: naturally irresistible your corporate...,1
1,subject: the stock trading gunslinger fanny i...,1
2,subject: unbelievable new homes made easy im ...,1
3,subject: 4 color printing special request add...,1
4,"subject: do not have money , get software cds ...",1


In [None]:
import re

In [None]:
#remove all punctuation from the corpus.
# Drop every character that is neither a word character nor whitespace.
corpus['text'] = corpus['text'].str.replace(r'[^\w\s]', '', regex=True)
corpus.head()

Unnamed: 0,text,spam
0,subject naturally irresistible your corporate ...,1
1,subject the stock trading gunslinger fanny is...,1
2,subject unbelievable new homes made easy im w...,1
3,subject 4 color printing special request addi...,1
4,subject do not have money get software cds fr...,1


In [None]:
#tokenize, remove all English stopwords from the corpus, and stem the remaining tokens.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
nltk.download('punkt')
stop_words = set(stopwords.words('english'))
# Each email becomes a list of stemmed, non-stopword tokens.
corpus['text'] = corpus['text'].apply(
    lambda text: [ps.stem(w) for w in word_tokenize(text) if w.lower() not in stop_words]
)
corpus.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Unnamed: 0,text,spam
0,"[subject, natur, irresist, corpor, ident, lt, ...",1
1,"[subject, stock, trade, gunsling, fanni, merri...",1
2,"[subject, unbeliev, new, home, made, easi, im,...",1
3,"[subject, 4, color, print, special, request, a...",1
4,"[subject, money, get, softwar, cd, softwar, co...",1


In [None]:
# Join each token list back into a single space-separated string.
corpus['text'] = corpus['text'].apply(' '.join)
corpus.head()

Unnamed: 0,text,spam
0,subject natur irresist corpor ident lt realli ...,1
1,subject stock trade gunsling fanni merril muzo...,1
2,subject unbeliev new home made easi im want sh...,1
3,subject 4 color print special request addit in...,1
4,subject money get softwar cd softwar compat gr...,1
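
The cleaning steps above can also be sketched as one function applied to a plain string. This version is an illustration only: `clean_text` and `TOY_STOP_WORDS` are hypothetical names, the stopword set is a tiny invented subset rather than NLTK's list, tokens are split on whitespace instead of with `word_tokenize`, and stemming is optional so that the sketch runs without NLTK.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "is", "a", "do", "not", "have", "your"}

def clean_text(text, stop_words=TOY_STOP_WORDS, stem=None):
    """Lowercase, strip punctuation, drop stopwords, optionally stem."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)                   # remove punctuation
    tokens = [t for t in text.split() if t not in stop_words]
    if stem is not None:                                  # e.g. PorterStemmer().stem
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)

cleaned = clean_text("Subject: do not have money , get software cds !")
print(cleaned)
```

Passing `stem=PorterStemmer().stem` would add the stemming step used in the notebook.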


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

In [None]:
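vectorizer = CountVectorizer(max_features=20000)
x = vectorizer.fit_transform(corpus["text"])
# Rows are emails and columns are terms, so no transpose is needed.
# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() is its replacement.
tdm = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())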



In [None]:
tdm.shape

(5728, 20000)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tdm,corpus['spam'], test_size=0.3)
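
The split above is random on every run, so the scores reported below will vary slightly between runs. A possible variant, sketched on synthetic arrays (the data and the seed `42` are assumptions), fixes the seed and stratifies on the label so that both parts keep the same spam share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, 25% labelled 1 (spam), 75% labelled 0 (ham).
X = np.arange(800).reshape(400, 2)
y = np.array([1] * 100 + [0] * 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # preserve the class ratio in both parts
    random_state=42,   # make the split repeatable
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```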

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
cart = DecisionTreeClassifier()
cart.fit(X_train, y_train)
y_pred = cart.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, y_pred)

0.9546247818499127

In [None]:
from sklearn.ensemble import RandomForestClassifier
spamRF = RandomForestClassifier()
spamRF.fit(X_train, y_train)
y_pred = spamRF.predict(X_test)
accuracy_score(y_test, y_pred)

0.9837114601512508

In [None]:
from sklearn.metrics import roc_auc_score


In [None]:
roc_auc_score(y_test, cart.predict(X_test))

0.9549215633754896

In [None]:
roc_auc_score(y_test, spamRF.predict(X_test))

0.9750989959243361
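
Note that the two AUC values above are computed from hard 0/1 predictions, which reduces the ROC curve to a single operating point. AUC is more commonly computed from predicted probabilities. The sketch below shows both variants on synthetic data from `make_classification` (an assumption; it does not reproduce the notebook's numbers).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data standing in for the email features.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# AUC from hard labels (as in the cells above) versus from class-1 probabilities.
auc_from_labels = roc_auc_score(y_te, rf.predict(X_te))
auc_from_scores = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(auc_from_labels, auc_from_scores)
```

For the notebook's models, the equivalent call would be `roc_auc_score(y_test, spamRF.predict_proba(X_test)[:, 1])`.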

The email dataset has an uneven class distribution: 1368 spam emails and 4360 ham emails. In this experiment, I used text analysis to separate spam from ham.
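
Because the classes are uneven, it helps to know what accuracy a trivial model would reach. The arithmetic below, using only the counts reported in this notebook, gives the accuracy of always predicting ham on the full dataset (roughly 76%).

```python
n_total = 5728            # emails in the dataset
n_spam = 1368             # emails labelled spam
n_ham = n_total - n_spam  # remaining emails are ham

# Accuracy of a classifier that labels every email as ham.
baseline_accuracy = n_ham / n_total
print(n_ham, round(baseline_accuracy, 4))
```

Both models' accuracies should be read against this baseline rather than against 50%.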


A spam classifier can potentially benefit from including the frequency of the word that appears in every email.


The Random Forest model outperformed the CART model in both accuracy (0.984 vs. 0.955) and AUC (0.975 vs. 0.955).


This is because a random forest combines many individual trees, each trained on a random sample of the training data. The ensemble is typically more accurate than a single decision tree.


A single decision tree like CART is often pruned, whereas each tree in a random forest is fully grown and unpruned, so it splits the feature space into more and smaller regions. (The `DecisionTreeClassifier()` used here keeps scikit-learn's defaults, which do not prune either.)


Each random forest tree is trained on a random (bootstrap) sample, and at each node only a random subset of the features is considered for splitting. Both mechanisms create diversity among the trees.
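
The two sources of randomness correspond to the `bootstrap` and `max_features` parameters of scikit-learn's `RandomForestClassifier`. The sketch below sets them explicitly on synthetic data (an assumption; the dataset, seeds, and tree count are illustrative) and measures how often two trees of the same forest disagree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the document-term matrix.
X, y = make_classification(n_samples=600, n_features=30,
                           n_informative=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
forest = RandomForestClassifier(
    n_estimators=25,
    bootstrap=True,        # each tree sees a bootstrap sample of the rows
    max_features="sqrt",   # each split considers a random subset of features
    random_state=1,
).fit(X_tr, y_tr)

tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)

# Fraction of test rows on which the first two trees of the forest disagree.
first, second = forest.estimators_[0], forest.estimators_[1]
disagreement = (first.predict(X_te) != second.predict(X_te)).mean()
print(tree_acc, forest_acc, disagreement)
```

A non-zero disagreement rate is what lets averaging over trees reduce variance compared with any single tree.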