## Spam Email Classifier with KNN using TF-IDF scores

1.   Assignment must be implemented in Python 3 only.
2.   You are allowed to use libraries for data preprocessing (numpy, pandas, nltk, etc.) and for evaluation metrics and data visualization (matplotlib, etc.).
3.   You will be evaluated not just on the overall performance of the model but also on your experimentation with hyperparameters, data preprocessing techniques, etc.
4.   The report file must be a well-documented Jupyter notebook, explaining the experiments you have performed, the evaluation metrics, and the corresponding code. The code must run and be able to reproduce the accuracies, figures/graphs, etc.
5.   For all the questions, you must create a train-validation data split and test the hyperparameter tuning on the validation set. Your jupyter notebook must reflect the same.
6.   Strict plagiarism checking will be done. An F will be awarded for plagiarism.

**Task: Given an email, classify it as spam or ham**

The input text file ("emails.txt") contains 5572 email messages, one per row, each with its label (spam/ham) attached.

This task also requires basic pre-processing of the text (removing stopwords, stemming/lemmatizing, replacing email addresses with 'email-tag', etc.).

You are required to find the tf-idf scores for the given data and use them to perform KNN using Cosine Similarity.
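
As a rough illustration of the TF-IDF weighting this task asks for, the sketch below scores a toy three-document corpus. It assumes the same definitions used later in this notebook (term frequency as a word's count divided by the document length, IDF as log base 2 of the number of documents over the document frequency). The corpus and the name `tfidf_scores` are made up for the example.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per document (illustrative helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        scores.append({
            word: (count / total) * math.log2(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Toy corpus (assumption): three very short "emails".
toy_corpus = ["free prize now", "meeting now", "free free offer"]
toy_scores = tfidf_scores(toy_corpus)
```

A word that occurs in only one document ("prize") gets a higher weight than a word shared by several documents ("free", "now"), and a word that occurs in every document gets a weight of zero.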

### Import necessary libraries

In [173]:
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import re
import math
import random
from sklearn.model_selection import train_test_split
from sortedcontainers import SortedDict
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

### Load dataset

In [174]:
# url="https://github.com/debashish05/SMAI/blob/main/Dataset/emails.txt"            # This will give the whole html page
# url="https://raw.githubusercontent.com/debashish05/SMAI/main/Dataset/emails.txt"  # serve unprocessed versions of files in GitHub.
# data=pd.read_csv(url,delimiter = '\n')

# Read the dataset line by line; the context manager closes the file automatically
with open("emails.txt","r") as file:
  data = file.readlines()       # all dataset


### Preprocess data
1. Remove duplicates
2. Convert all letters to lowercase
3. Convert email addresses to email tags
4. Remove punctuation
5. Remove stopwords
6. Lemmatize all words
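
The steps above can be sketched on a single message as follows. This is only an illustration: it uses a tiny hard-coded stopword set instead of NLTK's English list and skips lemmatization so that it runs without any downloads, and the names `preprocess_line` and `TOY_STOPWORDS` are hypothetical.

```python
import re

# Tiny stand-in stopword set (assumption); the notebook itself uses NLTK's list.
TOY_STOPWORDS = {"the", "a", "to", "is", "at", "me"}

# The text is lowercased first, so lowercase character classes are enough.
EMAIL_RE = re.compile(r"\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b")
PUNCT_RE = re.compile(r"[?!'\"#.,()|/:;]")

def preprocess_line(line):
    """Lowercase, tag email addresses, strip punctuation, drop stopwords."""
    line = line.lower()
    line = EMAIL_RE.sub("<email>", line)   # must run before punctuation removal
    line = PUNCT_RE.sub(" ", line)
    words = [w for w in line.split() if w not in TOY_STOPWORDS]
    return " ".join(words)

example = preprocess_line("Mail me at Bob.Smith@Example.com to WIN the prize!")
print(example)
```

The order matters: tagging email addresses has to happen before punctuation is removed, otherwise the dots and the address structure are destroyed and the regular expression no longer matches.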

In [None]:
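# Remove duplicates
def removeDupplicates(line):
  """ Removes duplicate elements from the input, keeping the first occurrence of each"""
  return list(dict.fromkeys(line))   # unlike set(), this preserves order, so runs are reproducible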


# Removing stopwords
def removeStopWord(lines, language="english"):
  """
      Removes the stopwords of the given language from each sentence in the list
  """
  nltk.download('stopwords')                               # Need to download stopwords first
  stopWord = set(nltk.corpus.stopwords.words(language))    # a set gives fast membership tests

  for i in range(len(lines)):
    line=""
    for word in lines[i].split():
      if word not in stopWord:
        line+=word+" "
    lines[i]=line
  return lines

# Lemmatization
def lemmatization(lines):
  """
      Lemmatize every word in the lines, using WordNet
  """
  
  nltk.download('wordnet')
  lemmatizer = nltk.stem.WordNetLemmatizer()
  
  for i in range(len(lines)):
    line=""
    for word in lines[i].split():
        line+=lemmatizer.lemmatize(word)+" "
    lines[i]=line
  
  return lines

# Convert everything to lowercase for easier evaluation
def lowerCase(lines):
  """ Convert all the characters in lines to lowercase"""
  return [line.lower() for line in lines]


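# Identify all email addresses and convert them to tags
def emailToTag(lines):
  """Replace every email address with the '<email>' tag"""

  # Inside a character class '|' is a literal pipe, so the top-level domain class is [A-Za-z]
  emailRegex=re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')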
  for i in range(len(lines)):
    lines[i]=re.sub(emailRegex,r"<email>",lines[i])
  return lines


# Remove punctuation
def removePuncuations(lines):
  """Replace punctuation characters with spaces"""

  # Inside a character class '|' is a literal, so it is listed once instead of acting as a separator
  punc=re.compile(r'[?!\'"#.,()|/:;]')
  for i in range(len(lines)):
    lines[i]=re.sub(punc,r" ",lines[i])
  return lines


data = removeDupplicates(data)                        
data = lowerCase(data)
data = emailToTag(data)
data = removePuncuations(data)
data = removeStopWord(data)
data = lemmatization(data)

**Term Frequency–Inverse Document Frequency**

In [None]:
email=[]      # Contains email of all dataset
output=[]     # represent ham or spam for corresponding email

#random.shuffle(data)      # shuffling the data. 

for dataInstance in data:
  parts = dataInstance.split(None,maxsplit = 1)
  # None splits on any whitespace (spaces, newlines \n and tabs \t),
  # and consecutive whitespace is treated as a single separator.
  if len(parts) < 2:
    # Only the label (ham or spam) survived preprocessing, so there is no email text
    continue

  result,mail = parts
  email.append(mail)
  output.append(result)



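idf={}        # IDF for every word present in the corpus
tf=[]         # List of dictionaries; the ith element holds the TF of the words in the ith document
N=len(email)  # Number of documents actually kept (label-only lines were skipped above)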


# Term Frequency - Inverse Document frequency
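def TFIDF(lines):
  """ 
      Computes the term frequency (TF) and the inverse document frequency (IDF).
      IDF(wi, corpus) = log2(# documents / # documents which contain wi)
      TF(wi, document) = count of wi in the document / # words in the document
      Each element of the list is treated as a document and each
      whitespace-separated token as a word of that document.
  """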
  
  for sentence in lines:
    linefreq={}
    linetf={}
    count=0

    for word in sentence.split():
      linefreq.setdefault(word,0)
      linefreq[word]+=1
      count+=1

    for key,value in linefreq.items():
      idf.setdefault(key,0)
      linetf.setdefault(key,0)
      idf[key]+=1
      linetf[key]+=value/count

    tf.append(linetf)

  for key,value in idf.items():
    idf[key]=math.log(N/value,2)


TFIDF(email)

**Making Dictionary from the dataset provided**

In [None]:
dictionary={} # word to corresponding index


def makeDictionary(text):
  """
    Build a hashmap that assigns a unique index to every distinct word
  """
  count=0
  for sentence in text:
    for word in sentence.split():
      if word not in dictionary:
        dictionary[word]=count
        count+=1 

makeDictionary(email)

**Vectorization**

In [None]:
# array of all email and corresponding tfidf value
dataMatrix = []

for i in range(len(email)):

  vector=np.zeros(len(dictionary))     # one slot per dictionary word
  for word in email[i].split():
    vector[dictionary[word]]=tf[i][word]*idf[word]
  dataMatrix.append(vector)

df ={'type':output,'tfidf':dataMatrix}
df = pd.DataFrame(df)
print(df)

### Split data

In [None]:
# 80% training data, 20% test data (the validation split below is commented out)

train, test = train_test_split(df, test_size=0.2, random_state=42)   # fixed seed so the split is reproducible
#test,validation = train_test_split(test, test_size=0.5)
print("Number of train Data points",len(train))
print("Number of test Data points",len(test))
#print("Number of validation Data points",len(validation))
print(test)
print(type(test))

testArr = np.array(test['tfidf'])
trainArr = np.array(train['tfidf'])
#validationArr = np.array(validation['tfidf'])

print(type(testArr))
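
The assignment asks for a train/validation/test split, while the cell above only creates train and test sets. One way to obtain an 80/10/10 split is sketched below using only the standard library; the helper name `three_way_split` and its parameters are hypothetical. In the notebook the same effect could be achieved by applying `train_test_split` a second time to the held-out 20%.

```python
import random

def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle with a fixed seed and cut into train/validation/test lists."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable

    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    train = [items[i] for i in train_idx]
    val = [items[i] for i in val_idx]
    test = [items[i] for i in test_idx]
    return train, val, test

# Toy data (assumption): 100 row ids standing in for the emails.
toy_rows = list(range(100))
train_rows, val_rows, test_rows = three_way_split(toy_rows)
```

Hyperparameters such as k and the distance measure would then be chosen on the validation part, with the test part used only once for the final score.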


### Train your KNN model (reuse previously implemented model built from scratch) and test on your data

***1. Experiment with different distance measures [Euclidean distance, Manhattan distance, Hamming Distance] and compare with the Cosine Similarity distance results.***

In [None]:
def euclideanDistance(a,b):
  return np.sqrt(np.sum(np.square(b-a)))

def cosineSimilarity(a,b):
  return np.dot(b,a)/(np.linalg.norm(b)*np.linalg.norm(a))
  
def manhattanDistance(a,b):
  return np.abs(a-b).sum()
  

# Precompute, for every test email, the distance to every training email (reused for all values of k)
VcosineSimilarity = [] # ith element is a dictionary {train index: cosine distance}, sorted nearest first
VmanhatanSimilarity = [] # ith element is a dictionary {train index: Manhattan distance}, sorted nearest first
VecludianSimilarity = [] # ith element is a dictionary {train index: Euclidean distance}, sorted nearest first

Csimilarity={}   
Esimilarity={}
Msimilarity={}

for i in range(len(testArr)):
  
  Csimilarity.clear()   
  Esimilarity.clear()
  Msimilarity.clear()

  for j in range(len(trainArr)):
    Csimilarity[j] = (1-cosineSimilarity(testArr[i], trainArr[j]))
    Msimilarity[j]= (manhattanDistance(testArr[i], trainArr[j]))
    Esimilarity[j]= (euclideanDistance(testArr[i], trainArr[j]))
  
  CSim = dict(sorted(Csimilarity.items(), key=lambda x: x[1]))
  ESim = dict(sorted(Esimilarity.items(), key=lambda x: x[1]))
  MSim = dict(sorted(Msimilarity.items(), key=lambda x: x[1]))

  VmanhatanSimilarity.append(MSim)
  VcosineSimilarity.append(CSim)
  VecludianSimilarity.append(ESim)
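
The question also lists Hamming distance, which the cell above does not define. One possible interpretation for TF-IDF vectors, shown here as an assumption rather than the required definition, is to binarize each vector (word present or absent) and count the positions where the two vectors disagree. The name `hammingDistance` follows the naming of the other distance functions but is not part of the original code.

```python
def hammingDistance(a, b):
    """Count positions where exactly one of the two vectors is non-zero."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # (x != 0) marks "word present"; the two flags differ when only one email has the word.
    return sum((x != 0) != (y != 0) for x, y in zip(a, b))

# Toy TF-IDF vectors (assumption) over a four-word vocabulary.
first = [0.2, 0.0, 0.5, 0.0]
second = [0.0, 0.0, 0.1, 0.3]
print(hammingDistance(first, second))
```

This pure-Python loop is slow for large vocabularies; it is meant to show the definition, not to be an efficient implementation.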
  
    


**Cosine Distance**

In [None]:
Cscore=[]

def KNNCosine(k):
  predicted=[]
  groundTruth=[]
  for i in range(len(testArr)):
    count=0
    num=0
    for j in VcosineSimilarity[i].keys():
      if train.iloc[j]['type'] == 'spam':
        count-=1
      else:
        count+=1
      num+=1
      if(num==k):
        break
    if count>0:
      predicted.append("ham")
    else:
      predicted.append("spam")
    
    groundTruth.append(test.iloc[i]['type'])
  
  print(predicted)
  print(groundTruth)
  print(metrics.confusion_matrix(groundTruth,predicted, labels=["ham","spam"]))
  print(metrics.classification_report(groundTruth,predicted,labels=["ham","spam"],zero_division=1))
  Cscore.append(metrics.f1_score(groundTruth,predicted,average="micro"))


**Manhattan Distance**

In [None]:
Mscore=[]

def KNNManhatan(k):
  predicted=[]
  groundTruth=[]
  for i in range(len(testArr)):
    count=0
    num=0
    for j in VmanhatanSimilarity[i].keys():
      if train.iloc[j]['type'] == 'spam':
        count-=1
      else:
        count+=1
      num+=1
      if(num==k):
        break
    if count>0:
      predicted.append("ham")
    else:
      predicted.append("spam")
    
    groundTruth.append(test.iloc[i]['type'])
  
  print(predicted)
  print(groundTruth)
  print(metrics.confusion_matrix(groundTruth,predicted, labels=["ham","spam"]))
  print(metrics.classification_report(groundTruth,predicted,labels=["ham","spam"],zero_division=1))
  Mscore.append(metrics.f1_score(groundTruth,predicted,average="micro"))

**Euclidean Distance**

In [None]:
Escore=[]

def KNNEuclid(k):
  predicted=[]
  groundTruth=[]
  for i in range(len(testArr)):
    count=0
    num=0
    for j in VecludianSimilarity[i].keys():
      if train.iloc[j]['type'] == 'spam':
        count-=1
      else:
        count+=1
      num+=1
      if(num==k):
        break
    if count>0:
      predicted.append("ham")
    else:
      predicted.append("spam")
    
    groundTruth.append(test.iloc[i]['type'])
  
  print(metrics.confusion_matrix(groundTruth,predicted, labels=["ham","spam"]))
  print(metrics.classification_report(groundTruth,predicted,labels=["ham","spam"],zero_division=1))
  Escore.append(metrics.f1_score(groundTruth,predicted,average="micro"))
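
The three KNN functions above share the same voting logic and differ only in which precomputed ordering they read. A shared helper along the following lines could remove the duplication. This is a sketch: the name `majority_vote` is hypothetical, and it assumes the neighbour labels are already sorted nearest first. Ties go to "spam", matching the `count>0` test used above.

```python
from collections import Counter

def majority_vote(neighbor_labels, k, tie_label="spam"):
    """Predict the more common label among the k nearest neighbours."""
    votes = Counter(neighbor_labels[:k])
    ham, spam = votes.get("ham", 0), votes.get("spam", 0)
    if ham == spam:
        return tie_label
    return "ham" if ham > spam else "spam"

# Toy neighbour labels (assumption), ordered from nearest to farthest.
toy_neighbors = ["ham", "spam", "ham", "spam", "spam"]
print(majority_vote(toy_neighbors, k=3))
```

Using odd values of k, as the experiments in this notebook mostly do, avoids ties in a two-class problem.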

***2. Explain which distance measure works best and why? Explore the distance measures and weigh their pros and cons in different application settings.***

Cosine distance gives the best results in my experiments. Cosine similarity is 0 when two vectors are perpendicular and 1 when they point in the same direction, regardless of their lengths. This makes it a good fit for sparse text vectors, collaborative filtering, and recommendation systems, where the direction of a vector matters more than its magnitude. Manhattan distance works best when movement is restricted to a grid, since it sums the absolute differences along each axis. Hamming distance is best suited to binary or categorical features (for example, whether a word is present in an email or not), because it counts the positions at which two vectors differ. Euclidean distance is the straight-line distance between two points. Because cosine similarity ignores magnitude, two points can lie on the same line through the origin and still be far apart; when that separation matters, Manhattan or Euclidean distance can perform better.
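
To make the point about direction versus magnitude concrete, the sketch below compares the measures on three made-up vectors. The function names and the toy vectors are illustrative only and are written in plain Python so the example is self-contained.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means perpendicular."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times the magnitude
other_doc = [0.0, 0.0, 3.0]    # shares no terms with short_doc

print(cosine_distance(short_doc, long_doc))     # close to 0
print(euclidean_distance(short_doc, long_doc))  # large
print(cosine_distance(short_doc, other_doc))
```

Here `short_doc` and `long_doc` are almost identical under cosine distance but far apart under Euclidean and Manhattan distance, which is the trade-off described above.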

***3. Report Cosine, Euclidean, Manhattan scores in a tabular form***



In [None]:
knn_val=[1,3,5,7,9,11,17,23,28]
for val in knn_val:
  KNNCosine(val)
  KNNManhatan(val)
  KNNEuclid(val)
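
The loop above prints one report per call rather than a single table. A minimal way to lay the collected scores out in tabular form is sketched below. The helper `score_table` is hypothetical, and the numbers in the demo call are placeholders, not results from this notebook.

```python
def score_table(k_values, scores_by_metric):
    """Format F1 scores as a plain-text table with one row per k."""
    names = list(scores_by_metric)
    header = "k".rjust(4) + "".join(name.rjust(12) for name in names)
    rows = [header]
    for i, k in enumerate(k_values):
        cells = "".join(f"{scores_by_metric[name][i]:12.4f}" for name in names)
        rows.append(f"{k:4d}" + cells)
    return "\n".join(rows)

# Placeholder numbers only, NOT results from this notebook.
demo = score_table([1, 3], {"Cosine": [0.5, 0.25], "Manhattan": [0.75, 1.0]})
print(demo)
```

In the notebook, the call would be something like `score_table(knn_val, {"Cosine": Cscore, "Manhattan": Mscore, "Euclidean": Escore})` once the loop has filled the three score lists.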

***4. Choose different K values (k=1,3,5,7,11,17,23,28) and experiment. Plot a graph showing F1 score vs k.***

In [None]:
plt.plot(knn_val,Cscore)
plt.xlabel("K values")
plt.ylabel("F1 score")
plt.title("Cosine Similarity")
plt.show()

plt.plot(knn_val,Mscore)
plt.xlabel("K values")
plt.ylabel("F1 score")
plt.title("Manhattan Distance")
plt.show()

plt.plot(knn_val,Escore)
plt.xlabel("K values")
plt.ylabel("F1 score")
plt.title("Euclidean Distance")
plt.show()

### Train and test Sklearn's KNN classifier model on your data (use metric which gave best results on your experimentation with built-from-scratch model.)

In [None]:
score=[]

for val in knn_val:
    knnModel = KNeighborsClassifier(val, metric='cosine')
    final = knnModel.fit(list(train['tfidf']), list(train['type']))
    predicted = final.predict(list(test['tfidf']))
    #score.append(metrics.f1_score(list(test['type']), list(predicted)))
    #score.append(metrics.f1_score(list(test['type']), list(predicted), average=None))
    score.append(metrics.f1_score(list(test['type']), list(predicted), average="micro"))
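
A note on the metric: when every email receives exactly one label and both labels are included, F1 with `average="micro"` is equal to plain accuracy, so it can hide weak performance on the minority spam class. The sketch below checks this on made-up labels; the helper names are hypothetical and the code does not use sklearn.

```python
def accuracy(truth, predicted):
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def f1_for_label(truth, predicted, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(truth, predicted))
    fp = sum(t != label and p == label for t, p in zip(truth, predicted))
    fn = sum(t == label and p != label for t, p in zip(truth, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def micro_f1(truth, predicted, labels):
    """Pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(truth, predicted):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels (assumption), not taken from the dataset.
toy_truth = ["ham", "ham", "ham", "spam", "spam"]
toy_pred = ["ham", "ham", "spam", "spam", "ham"]
```

On these toy labels the micro-averaged F1 and the accuracy coincide, while the F1 score for the spam class alone is lower.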

***Compare both the models result.***

The scikit-learn model scores slightly lower than the built-from-scratch model at k = 28; for the remaining values of k the two scores are the same.

In [None]:
plt.plot(knn_val, score, label="Sklearn's KNN")      # the keyword is lowercase 'label'
plt.plot(knn_val, Cscore, label="Cosine Similarity")
plt.xlabel('k - value')
plt.ylabel('F1 Score')
plt.title('Comparison of Sklearn KNN classifier and built-from-scratch model')
plt.legend()
plt.show()

***What is the time complexity of training using KNN classifier?***

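Training a brute-force KNN classifier takes O(1) time per training point, because the point is simply stored and no model parameters are fitted. Storing all n training points of dimension d therefore costs O(nd) time and memory.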

***What is the time complexity while testing? Is KNN a linear classifier or can it learn any boundary?***
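
As a hedged sketch of the usual answer: a brute-force KNN computes one distance per stored training point for every query, so a single prediction costs O(nd) for n training points of dimension d, plus the cost of selecting the k nearest. KNN is not a linear classifier; its decision boundary follows the training data. The toy class below (the name `BruteForceKNN` is hypothetical) shows both points on XOR-style labels, which no single straight line can separate.

```python
import math
from collections import Counter

class BruteForceKNN:
    def __init__(self, k):
        self.k = k

    def fit(self, points, labels):
        # "Training" only stores the data; no parameters are learned.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # One distance per stored point: O(n * d) work for every query.
        # Sorting adds O(n log n); a partial selection would be enough.
        order = sorted(range(len(self.points)),
                       key=lambda i: math.dist(query, self.points[i]))
        votes = Counter(self.labels[i] for i in order[:self.k])
        return votes.most_common(1)[0][0]

# XOR layout (toy data): opposite corners share a label.
xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = ["a", "b", "b", "a"]
model = BruteForceKNN(k=1).fit(xor_points, xor_labels)
```

With k = 1 every query near a corner takes that corner's label, so the model separates the XOR pattern even though it is not linearly separable.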