### Importing the basic libraries to read and handle the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



### Place the data file in the same folder as the notebook, then read the CSV file with pandas' read_csv function

In [2]:
email_data=pd.read_csv("spam_ham_dataset.csv")

email_data.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

### Removing the Unnamed: 0 and label columns

In [3]:
email_data = email_data.drop(['Unnamed: 0','label'],axis=1)
email_data.columns

Index(['text', 'label_num'], dtype='object')

### Counting Number of samples corresponding to Ham and Spam classes

In [4]:
print(email_data['label_num'].value_counts())

0    3672
1    1499
Name: label_num, dtype: int64


So, we have 3672 samples for the Ham class (label 0) and 1499 samples for the Spam class (label 1).
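To make the class balance explicit, the counts above can be turned into proportions. This is a minimal sketch that hard-codes the two counts from the `value_counts()` output; the dictionary keys `ham` and `spam` are names chosen here for readability.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 3672, "spam": 1499}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
# ham: 71.0%
# spam: 29.0%
```

Roughly seven in ten emails are ham, so accuracy alone can look good even for a weak classifier; precision and recall on the Spam class are worth reading alongside it.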

### Checking if there are any Null values

In [5]:
print(email_data.isnull().any())

text         False
label_num    False
dtype: bool


### Data Pre-processing for the prepared dataset

We will use the NLTK (Natural Language Toolkit) library to obtain a set of English stopwords. The cell below only needs to be run once; it downloads the stopword list from the internet, which may take a little while.

In [None]:
import nltk
nltk.download('stopwords')

### Importing the downloaded stopwords, which will be used to preprocess the email spam/ham dataset

In [7]:
from nltk.corpus import stopwords
stopset = set(stopwords.words("english"))

### Apply the Porter stemming algorithm to reduce the word tokens to their root form

In [11]:
import re
from nltk.stem.porter import PorterStemmer

# stopset was built in the previous cell; reuse it instead of rebuilding
# the stopword set for every email.
# Create the stemmer once, outside the loop.
ps = PorterStemmer()

corpus = []

for text in email_data['text']:
    # Replace every non-letter character with a space, then split into tokens.
    tokens = re.sub('[^a-zA-Z]', ' ', text).split()

    # Drop stopwords and stem the remaining tokens.
    tokens = [ps.stem(word) for word in tokens if word not in stopset]

    corpus.append(' '.join(tokens))
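To see what the cleaning step does to a single email, here is a small dependency-free sketch of the regex and stopword filter used in the loop above. The stemming step is left out so the example runs without NLTK, and `demo_stopset` and `clean` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses NLTK's full English list.
demo_stopset = {"the", "at", "is", "a"}

def clean(text, stopset):
    # Replace every non-letter character with a space, then split on whitespace.
    tokens = re.sub('[^a-zA-Z]', ' ', text).split()
    # The membership test is case-sensitive, so "The" is not matched by "the".
    return [tok for tok in tokens if tok not in stopset]

tokens = clean("Subject: The meeting is at 10am, room 4.", demo_stopset)
print(tokens)
# ['Subject', 'The', 'meeting', 'am', 'room']
```

Digits and punctuation disappear, and lowercase stopwords are dropped. Capitalized stopwords such as "The" survive the filter because the stopword list is lowercase; lowercasing the text before filtering would remove them as well, if that is the intended behavior.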


### Vectorize the prepared corpus to convert the data into a machine-readable format

We will use scikit-learn's CountVectorizer, which converts text into numerical form: it builds a vocabulary of the unique tokens in the corpus and then represents each document as a vector of token counts over that vocabulary.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

x = cv.fit_transform(corpus)
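The idea behind the count vectors can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it splits on whitespace only, whereas CountVectorizer also lowercases text and by default ignores single-character tokens. The names `docs`, `vocabulary` and `to_count_vector` are made up for this example.

```python
from collections import Counter

# Two toy "emails", already cleaned.
docs = ["free money free offer", "meeting schedule offer"]

# Vocabulary: every unique token, sorted alphabetically and mapped to a column index.
unique_tokens = sorted({tok for doc in docs for tok in doc.split()})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

def to_count_vector(doc):
    # Count how often each vocabulary token occurs in the document.
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocabulary]

matrix = [to_count_vector(doc) for doc in docs]

print(vocabulary)
# {'free': 0, 'meeting': 1, 'money': 2, 'offer': 3, 'schedule': 4}
print(matrix)
# [[2, 0, 1, 1, 0], [0, 1, 0, 1, 1]]
```

Each row is one document and each column is one vocabulary token, which is the same layout as the matrix `x` returned by `fit_transform`, except that `x` is stored as a sparse matrix.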

### Finalizing the labels as well

In [13]:
y = email_data['label_num']

### Splitting the dataset using the train_test_split function from scikit-learn library

In [15]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=5)

### Finally, building the KNN model to classify the emails into Spam and Ham classes

In [17]:
from sklearn.neighbors import KNeighborsClassifier

model=KNeighborsClassifier(n_neighbors=1)

model.fit(x_train,y_train)
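With `n_neighbors=1`, a prediction is simply the label of the single closest training vector. The sketch below shows that rule in plain Python using Euclidean distance, which is the default metric of KNeighborsClassifier. It is an illustration only, with toy vectors and hypothetical helper names (`euclidean`, `predict_1nn`), and does not reflect how scikit-learn searches for neighbors internally.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length count vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train_vectors, train_labels, query):
    # Find the index of the training vector closest to the query
    # and return its label.
    nearest = min(range(len(train_vectors)),
                  key=lambda i: euclidean(train_vectors[i], query))
    return train_labels[nearest]

# Toy training data: count vectors with labels (1 = spam, 0 = ham).
train_vectors = [[2, 0, 1, 1, 0], [0, 1, 0, 1, 1]]
train_labels = [1, 0]

query = [1, 0, 1, 0, 0]
prediction = predict_1nn(train_vectors, train_labels, query)
print(prediction)
# 1
```

The query is at distance sqrt(2) from the first vector and sqrt(5) from the second, so it takes the first vector's label.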

### Performing prediction using the trained model

In [18]:
y_pred = model.predict(x_test)

### Evaluating the performance of the trained KNN model on the test dataset

In [19]:
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

confusion_mat = confusion_matrix(y_test, y_pred)
print("confusion_mat = ", confusion_mat)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("precision score = ", precision_score(y_test, y_pred))
print("recall score = ", recall_score(y_test, y_pred))
print("F1 score = ", f1_score(y_test, y_pred))

confusion_mat =  [[681  76]
 [ 14 264]]
Accuracy Score: 0.9130434782608695
precision score =  0.7764705882352941
recall score =  0.9496402877697842
F1 score =  0.854368932038835
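The scores above follow directly from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with labels 0 (ham) and 1 (spam) the matrix reads `[[TN, FP], [FN, TP]]`. The following sketch recomputes each metric by hand from the four printed cells; the variable names are chosen here for illustration.

```python
# Cells copied from the confusion matrix printed above.
tn, fp = 681, 76   # true ham:  predicted ham, predicted spam
fn, tp = 14, 264   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all test emails classified correctly
precision = tp / (tp + fp)      # share of predicted spam that really is spam
recall = tp / (tp + fn)         # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(total)                 # 1035
print(round(accuracy, 4))    # 0.913
print(round(precision, 4))   # 0.7765
print(round(recall, 4))      # 0.9496
print(round(f1, 4))          # 0.8544
```

Read this way, the model catches about 95% of the spam in the test set (14 spam emails missed), while about 22% of the emails it flags as spam are actually ham (76 false positives).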
