---

### Importing Libraries

The imports include nltk, the Natural Language Toolkit, which supplies the list of stopwords used to preprocess the data.


In [None]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

### Importing the Dataset
This dataset contains <u>5728</u> mails, each labeled as 'spam' (1) or 'not spam' (0).

In [None]:
df = pd.read_csv('./emails.csv')
df.head(8)

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1


In [None]:
print("Shape of the Dataset :",df.shape)
df.isnull().sum()


Shape of the Dataset : (5728, 2)


text    0
spam    0
dtype: int64

The data is clean: neither column has any missing values.

#### Downloading NLTK Stopwords

Stopwords are common English words that add little meaning to a sentence, such as "the", "he", and "have". They can safely be ignored without sacrificing the meaning of the sentence. NLTK already collects these words in a corpus named stopwords.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True
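
To make the idea of stopword removal concrete, here is a minimal sketch that runs without any download. The set `SAMPLE_STOPWORDS` and the helper `drop_stopwords` are hypothetical stand-ins for illustration; the real NLTK list is much longer.

```python
# Hypothetical mini stopword set, a stand-in for stopwords.words('english').
SAMPLE_STOPWORDS = {"the", "he", "have", "to", "you", "is", "a"}

def drop_stopwords(sentence, stop_set):
    """Keep only the words whose lowercase form is not in stop_set."""
    return [word for word in sentence.split() if word.lower() not in stop_set]

kept = drop_stopwords("He wants to show you the offer", SAMPLE_STOPWORDS)
print(kept)  # ['wants', 'show', 'offer']
```

The comparison uses the lowercase form of each word, so "He" is dropped even though the set only contains "he".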

---

### Pre-Processing the Text

The mail text is processed by the function 'text_process'. The first part removes all punctuation characters and joins the remaining characters back into a single string.
The second part splits that string into words and checks each word against the stopwords list (explained in the markdown above). Stopwords are ignored; every other word is added to the clearWords list, which the function returns.

In [None]:
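def text_process(mail):
    # Part 1: drop punctuation characters, then rebuild the string
    removePunc = [char for char in mail if char not in string.punctuation]
    removePunc = ''.join(removePunc)

    # Part 2: split into words and keep only those that are not stopwords
    clearWords = [word for word in removePunc.split() if word.lower() not in stopwords.words('english')]

    return clearWords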

The function 'text_process' is applied to the 'text' column of the DataFrame, which contains the text of each mail. Only the first 10 rows are processed here as a preview.

In [None]:
df['text'].head(10).apply(text_process)

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
5    [Subject, great, nnews, hello, welcome, medzon...
6    [Subject, hot, play, motion, homeland, securit...
7    [Subject, save, money, buy, getting, thing, tr...
8    [Subject, undeliverable, home, based, business...
9    [Subject, save, money, buy, getting, thing, tr...
Name: text, dtype: object
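
As a side note on the punctuation step, the sketch below compares the character-by-character filter used in 'text_process' with `str.translate`, a standard-library alternative. This is not part of the original notebook; the sample sentence is made up for illustration.

```python
import string

sample = "Subject: do not have money , get software cds !"

# Approach used in text_process: filter character by character, then re-join.
by_comprehension = "".join(ch for ch in sample if ch not in string.punctuation)

# Alternative: a translation table that deletes every punctuation character.
table = str.maketrans("", "", string.punctuation)
by_translate = sample.translate(table)

print(by_comprehension == by_translate)  # True
print(by_translate.split())
# ['Subject', 'do', 'not', 'have', 'money', 'get', 'software', 'cds']
```

Both approaches give the same string; splitting on whitespace afterwards also absorbs the double spaces left behind where punctuation stood alone.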

### Initializing CountVectorizer from sklearn

CountVectorizer creates a matrix in which each unique word is a column and each text sample (here, each mail) is a row. The value of each cell is the number of times that word occurs in that text sample. Passing 'text_process' as the analyzer makes the vectorizer tokenize every mail with that function.

Tested Runtimes:

 - CPU - ~5m 50s
 - GPU - ~0m 11s


In [None]:
vectorizer = CountVectorizer(analyzer=text_process)
# Keep the document-term matrix in 'vect'; it is split into train and test sets below
vect = vectorizer.fit_transform(df['text'])
vect

<5728x37229 sparse matrix of type '<class 'numpy.int64'>'
	with 565908 stored elements in Compressed Sparse Row format>
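
The row/column layout described above can be sketched by hand in plain Python. This is only an illustration of the idea on two made-up token lists, not how scikit-learn implements it (the real result is a sparse matrix).

```python
from collections import Counter

# Two tiny, already-tokenised "mails" (made up for illustration).
docs = [
    ["save", "money", "buy", "software"],
    ["software", "cds", "software"],
]

# One column per unique token; sorted here for a stable column order.
vocabulary = {token: col for col, token in enumerate(sorted({t for d in docs for t in d}))}

# One row per document; each cell is the count of that token in the document.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts.get(token, 0) for token in vocabulary])

print(list(vocabulary))  # ['buy', 'cds', 'money', 'save', 'software']
print(matrix)            # [[1, 0, 1, 1, 1], [0, 1, 0, 0, 2]]
```

Most cells are zero even in this toy case, which is why the real matrix of 5728 rows and 37229 columns stores only its 565908 non-zero entries.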

Splitting the data into training and testing sets (80% training, 20% testing)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(vect, df['spam'], test_size=0.20, random_state=0)

In [None]:
y_train.head()

4518    0
4472    0
799     1
4809    0
1043    1
Name: spam, dtype: int64

In [None]:
print("Training Data: ",y_train.size)
print("Testing Data: ",y_test.size)

Training Data:  4582
Testing Data:  1146
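
The two sizes above follow from the 80/20 split of 5728 mails. The short check below assumes the test size is rounded up and the remainder goes to training, which is my understanding of how `train_test_split` handles a fractional `test_size`; treat the rounding rule as an assumption rather than a guarantee.

```python
import math

n_samples = 5728
test_size = 0.20

# Assumed rounding rule: test set rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4582 1146
```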


In [None]:
print(X_train)

  (0, 3638)	1
  (0, 21485)	1
  (0, 32099)	1
  (0, 33656)	1
  (0, 1821)	1
  (0, 9165)	1
  (0, 36899)	1
  (0, 24418)	1
  (0, 2126)	1
  (0, 916)	1
  (0, 19886)	1
  (0, 304)	1
  (0, 6391)	1
  (0, 25629)	1
  (0, 34400)	1
  (0, 21852)	2
  (0, 17693)	1
  (0, 28712)	1
  (0, 28707)	1
  (0, 1223)	1
  (0, 9325)	1
  (0, 35840)	1
  (0, 357)	1
  (0, 32338)	1
  (0, 927)	2
  :	:
  (4581, 26109)	1
  (4581, 18219)	1
  (4581, 10011)	1
  (4581, 20544)	1
  (4581, 14)	1
  (4581, 14669)	1
  (4581, 34784)	1
  (4581, 27016)	1
  (4581, 13342)	1
  (4581, 27303)	1
  (4581, 5629)	1
  (4581, 5218)	1
  (4581, 31966)	1
  (4581, 12780)	1
  (4581, 29073)	2
  (4581, 1588)	1
  (4581, 31940)	1
  (4581, 19651)	2
  (4581, 10597)	2
  (4581, 2796)	1
  (4581, 5539)	1
  (4581, 5990)	2
  (4581, 26460)	1
  (4581, 2246)	1
  (4581, 19507)	1


### Naive Bayes Classification

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
classifier = model.fit(X_train, y_train)
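
For intuition about what a multinomial Naive Bayes classifier computes, here is a minimal from-scratch sketch on a made-up toy corpus. The functions `train_nb` and `predict_nb` are hypothetical helpers written for this illustration, with add-one (Laplace) smoothing via `alpha=1.0`; they are not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    # Prior: fraction of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    # Likelihood: smoothed share of each token among the class's tokens.
    log_like = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_like[c][tok] for tok in doc if tok in log_like[c])
    return max(scores, key=scores.get)

# Toy corpus: label 1 = spam, label 0 = not spam.
toy_docs = [
    ["free", "money", "offer"],
    ["free", "software", "offer"],
    ["meeting", "schedule", "tomorrow"],
    ["project", "meeting", "notes"],
]
toy_labels = [1, 1, 0, 0]

log_prior, log_like = train_nb(toy_docs, toy_labels)
print(predict_nb(["free", "offer", "tomorrow"], log_prior, log_like))  # 1
print(predict_nb(["meeting", "notes"], log_prior, log_like))           # 0
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long mails.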

In [None]:
predicted_training = classifier.predict(X_train)
predicted_testing = classifier.predict(X_test)

print(predicted_training)
print(y_train.values,'\n')
print(predicted_testing)
print(y_test.values,'\n')

[0 0 1 ... 0 0 0]
[0 0 1 ... 0 0 0] 

[0 0 1 ... 1 0 1]
[0 0 1 ... 0 0 1] 



In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

predict = classifier.predict(X_train)
print(classification_report(y_train, predict))
print('Confusion Matrix: \n', confusion_matrix(y_train, predict),'\n')
print('Accuracy for training data: ', accuracy_score(y_train, predict))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3475
           1       0.99      1.00      1.00      1107

    accuracy                           1.00      4582
   macro avg       1.00      1.00      1.00      4582
weighted avg       1.00      1.00      1.00      4582

Confusion Matrix: 
 [[3466    9]
 [   2 1105]] 

Accuracy for training data:  0.9975993016150153


In [None]:
print(classifier.predict(X_test))
print(y_test.values)
predict = classifier.predict(X_test)
print(classification_report(y_test, predict))
print('Confusion Matrix: \n', confusion_matrix(y_test, predict),'\n')
print('Accuracy for testing data: ', accuracy_score(y_test, predict))

[0 0 1 ... 1 0 1]
[0 0 1 ... 0 0 1]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       885
           1       0.95      0.99      0.97       261

    accuracy                           0.99      1146
   macro avg       0.97      0.99      0.98      1146
weighted avg       0.99      0.99      0.99      1146

Confusion Matrix: 
 [[872  13]
 [  2 259]] 

Accuracy for testing data:  0.9869109947643979
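
The figures in the report for the spam class (label 1) can be re-derived from the test confusion matrix by hand. The sketch below assumes the usual scikit-learn layout, where rows are actual labels and columns are predicted labels.

```python
# Test confusion matrix from the output above: rows = actual, columns = predicted.
tn, fp = 872, 13   # actual 'not spam'
fn, tp = 2, 259    # actual 'spam'

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the mails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam mails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9869 precision=0.95 recall=0.99 f1=0.97
```

The 13 false positives (legitimate mails flagged as spam) are what pull precision for the spam class down to 0.95, while only 2 spam mails slip through.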


### Testing the model on custom data

In [None]:
test_mail = df['text'].values[2:3]

print("Test Mail :", test_mail)
print("Actual spam value :",df['spam'].values[2:3])
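# Vectorize the selected mail itself so the prediction matches the mail printed above
print("Predicted spam value :",classifier.predict(vectorizer.transform(test_mail)))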


Test Mail : ['Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved for a $ 454 , 169 home loan at a 3 . 72 fixed rate .  this offer is being extended to you unconditionally and your credit is in no way a factor .  to take advantage of this limited time opportunity  all we ask is that you visit our website and complete  the 1 minute post approval form  look foward to hearing from you ,  dorcas pittman']
Actual spam value : [1]
Predicted spam value : [1]


---

This project concludes with a prediction accuracy of 98.69% on the held-out test set of 1146 mails.