# Spam Detector 
## Charles James

### In this project, a model is built to determine whether an email is spam or not. Included below is an implementation of Naive Bayes, a well-known classification method, trained on the training data and evaluated on both the training and testing data. 

3.	Build a model to predict whether an email is spam or not.

In this program, spam is labeled 1 and non-spam (ham) is labeled 0 in the label_num column. 

In [1]:
#imports
import numpy as np
import pandas as pd

#natural language toolkit
import nltk
from nltk.corpus import stopwords
import string

In [2]:
#read the CSV file
df = pd.read_csv('spam_ham_dataset.csv')

#print the first 5 rows
df.head(5)


,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [3]:
#print the shape(number of rows and columns)
df.shape

(5171, 4)

In [4]:
#Get the column names
df.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

In [5]:
#Show the number of missing values in each column (Examples: NAN, NaN, na)
#Here we check whether any email in the CSV has missing data. 
df.isnull().sum()

Unnamed: 0    0
label         0
text          0
label_num     0
dtype: int64

In [6]:
#Download the stopwords package for use in the function that will be created. 
#Stopwords are common English words that do not add much meaning to a sentence. 
#They can usually be ignored without losing the meaning of the sentence.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/cj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
#The goal of this function is to process the text in 3 steps. 

#The function process_text takes in the text parameter (email message)
def process_text(text):
    
    #1) Remove punctuation from the text and store the result in nopunc
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    
    #2) Split the message on whitespace to get tokens, then remove the stopwords (common words that carry little meaning). 
    # We specify that the stopwords are English, and build the set once
    # instead of reloading the list for every word. 
    stop_words = set(stopwords.words('english'))
    clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]
    
    #3) Return the list of clean words
    return clean_words

In [8]:
#Show the tokenization (a list of tokens for each message)
#These tokens are not lemmas: no lemmatization is performed, only punctuation and stopword removal
#Since the 'text' column contains the messages, we apply the function we just created to the 'text' column and print out the first 5 rows.
df['text'].head(5).apply(process_text)

0    [Subject, enron, methanol, meter, 988291, foll...
1    [Subject, hpl, nom, january, 9, 2001, see, att...
2    [Subject, neon, retreat, ho, ho, ho, around, w...
3    [Subject, photoshop, windows, office, cheap, m...
4    [Subject, indian, springs, deal, book, teco, p...
Name: text, dtype: object

In [9]:
#For comparison, show the original text before processing. 
df['text'].head(5)

0    Subject: enron methanol ; meter # : 988291\r\n...
1    Subject: hpl nom for january 9 , 2001\r\n( see...
2    Subject: neon retreat\r\nho ho ho , we ' re ar...
3    Subject: photoshop , windows , office . cheap ...
4    Subject: re : indian springs\r\nthis deal is t...
Name: text, dtype: object
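
The cleaning steps can also be tried on a single string without downloading the NLTK data. The following is a minimal sketch, not the function used above: `process_text_sketch` and `SMALL_STOPWORDS` are hypothetical names, the stop-word set is a tiny hand-picked stand-in for NLTK's much longer English list, and the sample string is made up.

```python
import string

# Hypothetical, hand-picked stop words; NLTK's English list is much longer.
SMALL_STOPWORDS = {"this", "is", "to", "re", "the", "a", "for", "we"}

def process_text_sketch(text, stop_words=SMALL_STOPWORDS):
    # 1) Drop every punctuation character.
    nopunc = "".join(ch for ch in text if ch not in string.punctuation)
    # 2) Split on whitespace and drop stop words (case-insensitive check).
    return [word for word in nopunc.split() if word.lower() not in stop_words]

# Made-up sample message in the style of the dataset.
sample = "Subject: re : indian springs\r\nthis deal is to book"
tokens = process_text_sketch(sample)
print(tokens)
```

Note that the lookup uses `word.lower()` but the returned token keeps its original case, which is why "Subject" stays capitalized in the output shown earlier.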

In [10]:
#This cell shows an example of how the processed text is fed into the model to predict whether a message is spam or not.

# We create two strings that stand in for messages. 
message0 = 'hello world hello hello world play'
message1 = 'test test test test one hello'
print(message0)
print(message1)
print()

#Convert the text to a matrix of token counts. 
#We use the CountVectorizer to both tokenize the collection of text and build a vocabulary of known words. 
from sklearn.feature_extraction.text import CountVectorizer

#bow is short for bag of words. 
#We use the fit_transform method on a list containing the two messages we created.
#The function created above is used as the analyzer, so each message is passed as a plain string.
bow4 = CountVectorizer(analyzer=process_text).fit_transform([message0, message1])

print(bow4)
print()
#Shows the number of messages(rows) and the number of unique words in the data set(columns)
print(bow4.shape)

hello world hello hello world play
test test test test one hello

  (0, 0)	3
  (0, 4)	2
  (0, 2)	1
  (1, 0)	1
  (1, 3)	4
  (1, 1)	1

(2, 5)


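Each line of the output has the form (row, column) count. The row is the message index: 0 for message0 and 1 for message1. The column is the index of a word in the vocabulary, which is sorted alphabetically (hello, one, play, test, world). The count is the number of times that word appears in that message. For example, the first line:
(0, 0)   3 
tells us that the word at vocabulary index 0, "hello", appears 3 times in message 0.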
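
To make the (row, column) count layout concrete, here is a minimal pure-Python sketch that rebuilds the same counts for the two example messages. It is an illustration of the bag-of-words idea, not how CountVectorizer is implemented; it assumes plain whitespace tokenization, which is enough here because these messages contain no punctuation and no word was removed in the output above.

```python
from collections import Counter

messages = ["hello world hello hello world play",
            "test test test test one hello"]

# Tokenize on whitespace.
tokenized = [message.split() for message in messages]

# Vocabulary: unique words in sorted order, mapped to column indices.
unique_words = sorted({word for doc in tokenized for word in doc})
vocab = {word: index for index, word in enumerate(unique_words)}

# One (row, column, count) triple per distinct word in each message.
triples = sorted(
    (row, vocab[word], count)
    for row, doc in enumerate(tokenized)
    for word, count in Counter(doc).items()
)

print(vocab)
for triple in triples:
    print(triple)
```

The triples match the printed matrix entries (in a different order), and the 2 messages by 5 vocabulary words correspond to the reported shape (2, 5).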

In [11]:
#Convert a collection of text (messages) to a matrix of token counts
#We now use the fit_transform method on the text in our dataset
messages_bow = CountVectorizer(analyzer = process_text).fit_transform(df['text'])

In [12]:
#Split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split

#We split the data on messages_bow and on the 'label_num' column
x_train, x_test, y_train, y_test = train_test_split(messages_bow, df['label_num'], test_size = 0.20, random_state=0)
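
As a quick sanity check on the split sizes, the arithmetic can be sketched by hand. This assumes the fractional test size is rounded up to a whole number of rows, which is consistent with the support totals (4136 and 1035) in the reports in this notebook.

```python
import math

n_samples = 5171   # number of rows, from df.shape
test_size = 0.20

# Assumption: the test count is the fraction of rows, rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```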

In [13]:
#Get the shape of messages_bow
messages_bow.shape

(5171, 50381)

In [14]:
#Create and train the Naive Bayes Model to make our prediction
#We use the multinomial Naive Bayes classifier (suitable for discrete features, Ex: word counts from a text)
from sklearn.naive_bayes import MultinomialNB
#fit method is used to train it
classifier = MultinomialNB().fit(x_train, y_train)
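
To show what a multinomial Naive Bayes classifier does with word counts, here is a minimal from-scratch sketch on made-up toy data. It is not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helper names, and the smoothing value alpha=1.0 is chosen to mirror the default used by MultinomialNB.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({word for doc in docs for word in doc})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Additive (Laplace) smoothing avoids zero probabilities.
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha)
                           / (total_words + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words never seen in training are ignored.
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc if word in log_likelihood)
    return max(scores, key=scores.get)

# Made-up toy messages, already tokenized; 1 = spam, 0 = not spam.
toy_docs = [["cheap", "pills", "cheap"], ["meeting", "schedule"],
            ["cheap", "offer"], ["schedule", "meeting", "notes"]]
toy_labels = [1, 0, 1, 0]

toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["cheap", "offer", "pills"]))
print(predict_nb(toy_model, ["meeting", "notes"]))
```

Working in log space turns the product of many small word probabilities into a sum, which avoids numerical underflow on long emails.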

In [15]:
#Print the predictions
print(classifier.predict(x_train))

#print the actual values
print(y_train.values)

[0 0 0 ... 1 0 0]
[0 0 0 ... 1 0 0]


In [16]:
#Evaluate the model on the training data set
#imports
#Classification report is used to measure the quality of predictions from a classification algorithm
from sklearn.metrics import classification_report
#A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

pred = classifier.predict(x_train)
print(classification_report(y_train, pred))
print()
print('Confusion Matrix: \n', confusion_matrix(y_train, pred))
#Rows are actual labels and columns are predicted labels, in the order 0 (not spam), 1 (spam)
print('True Negative | False Positive')
print('False Negative | True Positive')
print()
print('Accuracy: ', accuracy_score(y_train, pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      2940
           1       0.98      0.97      0.98      1196

    accuracy                           0.99      4136
   macro avg       0.99      0.98      0.98      4136
weighted avg       0.99      0.99      0.99      4136


Confusion Matrix: 
 [[2918   22]
 [  30 1166]]
True Negative | False Positive
False Negative | True Positive

Accuracy:  0.9874274661508704
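
The figures in the report can be recomputed by hand from the training confusion matrix. The sketch below treats spam (label 1) as the positive class and reads the matrix with rows as actual labels and columns as predicted labels, in the order 0, 1; the variable names are illustrative only.

```python
# Training confusion matrix, copied from the output above.
cm = [[2918, 22],
      [30, 1166]]

tn, fp = cm[0]   # actual 0: predicted 0, predicted 1
fn, tp = cm[1]   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)   # of emails flagged as spam, how many were spam
recall_spam = tp / (tp + fn)      # of actual spam, how many were flagged
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(round(accuracy, 4), round(precision_spam, 2),
      round(recall_spam, 2), round(f1_spam, 2))
```

These values agree with the accuracy line and with the row for class 1 in the classification report.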


# We now do the same thing for the testing data

In [20]:
#Print the predictions
print(classifier.predict(x_test))

#print the actual values
print(y_test.values)

[0 0 0 ... 0 1 0]
[0 0 0 ... 0 1 0]


In [18]:
#Evaluate the model on the test data set
pred = classifier.predict(x_test)
print(classification_report(y_test, pred))
print()
print('Confusion Matrix: \n', confusion_matrix(y_test, pred))
#Rows are actual labels and columns are predicted labels, in the order 0 (not spam), 1 (spam)
print('True Negative | False Positive')
print('False Negative | True Positive')
print()
print('Accuracy: ', accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       732
           1       0.95      0.96      0.96       303

    accuracy                           0.97      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.97      0.97      0.97      1035


Confusion Matrix: 
 [[718  14]
 [ 13 290]]
True Negative | False Positive
False Negative | True Positive

Accuracy:  0.9739130434782609
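
For a spam filter, the two kinds of error on unseen data are worth reading separately: a legitimate email flagged as spam, and a spam email that gets through. The sketch below derives both rates from the test confusion matrix, again reading rows as actual labels and columns as predicted labels with spam (label 1) as the positive class; the variable names are illustrative only.

```python
# Test confusion matrix, copied from the output above.
cm_test = [[718, 14],
           [13, 290]]

tn, fp = cm_test[0]   # actual not spam: kept, wrongly flagged
fn, tp = cm_test[1]   # actual spam: missed, caught

false_positive_rate = fp / (fp + tn)   # share of legitimate email flagged as spam
false_negative_rate = fn / (fn + tp)   # share of spam that was missed
test_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(false_positive_rate, 4),
      round(false_negative_rate, 4),
      round(test_accuracy, 4))
```

Test accuracy (about 0.974) is slightly below training accuracy (about 0.987), which is the expected direction for data the model has not seen.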

