## Spam Email Classifier with KNN using TF-IDF scores

1.   Assignment must be implemented in Python 3 only.
2.   You are allowed to use libraries for data preprocessing (numpy, pandas, nltk, etc.) and for evaluation metrics and data visualization (matplotlib, etc.).
3.   You will be evaluated not only on the overall performance of the model but also on your experimentation with hyperparameters, data preprocessing techniques, etc.
4.   The report file must be a well-documented Jupyter notebook that explains the experiments you performed, the evaluation metrics, and the corresponding code. The code must run and reproduce the reported accuracies, figures, graphs, etc.
5.   For all the questions, you must create a train-validation data split and tune the hyperparameters on the validation set. Your Jupyter notebook must reflect this.
6.   Strict plagiarism checking will be done. An F will be awarded for plagiarism.

**Task: Given an email, classify it as spam or ham**

The input text file ("emails.txt") contains 5572 email messages, one per row, each with its label (spam/ham) attached.

This task also requires basic pre-processing of the text (removing stopwords, stemming/lemmatizing, replacing each email address with 'email-tag', etc.).
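As an illustration of the email-address step, here is a minimal sketch. The pattern `EMAIL_PATTERN` and the helper `tag_emails` are hypothetical names introduced for this example, and the regular expression is a deliberately loose approximation of an email address, not a full validator.

```python
import re

# Hypothetical, loose approximation of an email address (not RFC-compliant).
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def tag_emails(message, tag='email-tag'):
    """Replace every substring that looks like an email address with `tag`."""
    return EMAIL_PATTERN.sub(tag, message)

print(tag_emails('contact john.doe@example.com or sales@shop.co.uk today'))
```

If this is added to a pipeline that later strips punctuation, it has to run before that step, because '@' and '.' would otherwise be removed first. The hyphen in 'email-tag' would also be stripped by such a step, so a tag without punctuation may be more practical.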

You are required to find the tf-idf scores for the given data and use them to perform KNN using Cosine Similarity.
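The sketch below shows one common way to compute these quantities on a toy corpus. It assumes TF is the word count divided by the message length and IDF is `log(N / df)`; other variants exist. The helper names `tfidf_matrix` and `cosine_similarity` and the toy messages are invented for illustration.

```python
import numpy as np
from collections import Counter

def tfidf_matrix(docs, vocab, idf):
    """Return one TF-IDF row per tokenized document, columns ordered as `vocab`."""
    rows = np.zeros((len(docs), len(vocab)))
    index = {word: j for j, word in enumerate(vocab)}
    for i, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            if word in index:  # words outside the vocabulary are ignored
                rows[i, index[word]] = count / len(doc) * idf[word]
    return rows

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

# Toy corpus, invented for illustration only
docs = [['free', 'prize', 'call'], ['call', 'home', 'later'], ['free', 'prize', 'now']]
vocab = sorted({w for d in docs for w in d})
idf = {w: np.log(len(docs) / sum(w in d for d in docs)) for w in vocab}

X = tfidf_matrix(docs, vocab, idf)
print(cosine_similarity(X[0], X[2]))  # messages 0 and 2 share two words
print(cosine_similarity(X[0], X[1]))  # messages 0 and 1 share one word
```

Because cosine similarity depends only on the angle between vectors, it is insensitive to message length, which is one reason it is often paired with TF-IDF.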

### Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
import re
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from collections import Counter
# from contractions import contractions_dict

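# Needs NLTK data: nltk.download('stopwords') and nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
stopword = set(stopwords.words('english'))  # set gives fast membership tests
snowball_stemmer = SnowballStemmer('english')
# NLTK offers Snowball, Porter and Lancaster stemmers; Snowball (a revised Porter) is used here,
# and Lancaster is the most aggressive of the three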

### Load dataset

In [2]:
df = pd.read_csv('emails.txt', sep='\t', header=None, names=["type", "msg"])

df = df[0:3000]

# df
# display(df.to_string())


### Preprocess data

In [3]:
def preprocess(message):
    """preprocess the given mail"""
    
    # Lowercase
    message = message.lower()
    
    # Removing Tags
    message = re.sub('<[^<]+?>','', message)

    # Removing Links
    message = re.sub(r'http\S+', '', message)
    message = re.sub(r'www\.\S+', '', message)  # escape the dot so only a literal "www." matches
    
    # Contraction expansion (not implemented)

    # Removing Numbers and Punctuations
    message = ''.join(c for c in message if not (c.isdigit() or c in punctuation))

    # Word Tokenize
    message_tokenized = nltk.word_tokenize(message)

    # Removing Stop Words
    message_tokenized = [word for word in message_tokenized if not (word in stopword or len(word) < 3)]

    # Lemmatizing is skipped:
    # combining it with stemming gives little extra benefit,
    # and lemmatizing is slow

    # Stemming
    message_tokenized = [snowball_stemmer.stem(word) for word in message_tokenized]
    # print (stemmed_word)
    
    return message_tokenized
    
df["msg"] = df["msg"].apply(preprocess)
# df        
# display(df.to_string())

### Split data

In [4]:
# print(df.sample(frac=1, replace=False), [int(.6*len(df)), int(.8*len(df))])

# train, validate, test = np.split(df.sample(frac=1, replace=False), [int(.6*len(df)), int(.8*len(df))])
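# Sequential 60/20/20 split (rows are not shuffled); iloc slicing avoids calling np.split on a DataFrame
train_end, validate_end = int(.6*len(df)), int(.8*len(df))
train_data, validate_data, test_data = df.iloc[:train_end], df.iloc[train_end:validate_end], df.iloc[validate_end:]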
# print(train_data.shape)
# # print(train)
# print(validate_data.shape)
# # print(validate)
# print(test_data.shape)
# # print(test)

### TF-IDF Utility Functions

In [5]:
# unique_words = dict()

# def countUniqueWords(message):
#     """Function to count unique words"""
#     for word  in message:
#         if word not in unique_words:
#             unique_words[word] = 0
#             # unique_words_count[word] = 0
#         # unique_words_count[word] += 1

# def TF(message):
#     """Function to calculate TF of data"""
#     tf_message = unique_words.copy()
#     count_words = len(message)
#     for word in message:
#         tf_message[word] += 1
#         # count_words += 1
#     if count_words != 0:
#         tf_message.update((x, y/count_words) for x, y in tf_message.items())
#     return tf_message

# def IDF():
#     """Function to calculate IDF of words"""
#     idf_words = unique_words.copy()
#     no_of_sentence = len(train)
#     for word in idf_words:
#         word_in_sentence = 0
#         for i in range(0, no_of_sentence):
#             # print(train.iloc[i].loc["msg"][word])
#             if train.at[i, "msg"][word] > 0:
#                 word_in_sentence += 1
#         idf_words[word] = np.log(no_of_sentence / word_in_sentence)
#     return idf_words

In [6]:
# train["msg"].apply(countUniqueWords)
# # print(unique_words)
# # print(len(unique_words))
# # print(unique_words_count)

In [7]:
# lll= list()
# lll.append(train["msg"].apply(TF))
# # for d in train["msg"]:
# #     print(d)
# # print(train)
# # print(lll)

In [8]:
# idf_words = IDF()
# # idf_words = IDF()
# # print(idf_words)

# # for i in range(0,len(train)):
# #     print(train.iloc[i].loc["msg"])

In [9]:
unique_words = dict()
for message in train_data["msg"]:
    for word in message:
        if word not in unique_words:
            unique_words[word] = 0.0

# IDF of train
train_idf = unique_words.copy()
no_of_sentence = len(train_data)
for word in train_idf:
    word_in_sentence = 0
    for message in train_data["msg"]:
        if word in message:
            word_in_sentence += 1
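    # Standard IDF, log(N / df); the earlier "N / df - 1" form yields -inf or negative
    # weights for words that occur in all or most messages
    train_idf[word] = np.log(no_of_sentence / word_in_sentence)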
# print(train_idf)


In [10]:
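def cal_TF(data_frame):
    """Return one term-frequency dict per message, keyed by the training vocabulary."""
    tf = list()
    for message in data_frame["msg"]:
        len_message = len(message)
        counter_dict = Counter(message)
        message_tf = unique_words.copy()
        for word, word_count in counter_dict.items():
            # Skip words unseen in training so every vector has the same keys
            if word in unique_words:
                message_tf[word] = word_count / len_message
        tf.append(message_tf)
    return tf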

In [11]:
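def get_tfidf(data_frame):
    """Scale each message's TF values by the IDF learned from the training set."""
    tf = cal_TF(data_frame)
    # Iterating over train_idf keeps every vector on the training vocabulary
    return [{word: message_tf.get(word, 0.0) * idf for word, idf in train_idf.items()}
            for message_tf in tf]
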

In [12]:
train_tfidf = get_tfidf(train_data)
# validate_tfidf = get_tfidf(validate_data)
# test_tfidf = get_tfidf(test_data)

In [13]:
# print(train_tfidf[0])
pd.DataFrame(train_tfidf)

### Train your KNN model (reuse the previously implemented model built from scratch) and test it on your data

***1. Experiment with different distance measures [Euclidean distance, Manhattan distance, Hamming Distance] and compare with the Cosine Similarity distance results.***
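A minimal sketch of the four measures, written for dense numpy vectors of equal length. The function names are hypothetical. Hamming distance is taken here as the fraction of positions that differ, which is one common convention, and cosine distance is assumed to be one minus cosine similarity.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return float(np.sum(np.abs(a - b)))

def hamming(a, b):
    """Fraction of positions at which the two vectors differ."""
    return float(np.mean(a != b))

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(a @ b / norm) if norm > 0 else 1.0

# Toy vectors, invented for illustration only
a = np.array([0.0, 0.5, 0.5, 0.0])
b = np.array([0.0, 0.25, 0.25, 0.5])
for name, fn in [('euclidean', euclidean), ('manhattan', manhattan),
                 ('hamming', hamming), ('cosine', cosine_distance)]:
    print(name, fn(a, b))
```

On real-valued TF-IDF vectors, Hamming distance only registers whether two entries differ at all, so it may be more meaningful on binarized (word present/absent) vectors.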

***2. Explain which distance measure works best and why. Explore the distance measures and weigh their pros and cons in different application settings.***

***3. Report the Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) score in tabular form.***
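These metrics can be computed directly with numpy once the labels are encoded as numbers. The sketch below assumes spam is encoded as 1 and ham as 0; the helper name `regression_metrics` and the toy labels are invented for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R2 for numeric labels (assumed here: spam=1, ham=0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        'MSE': float(np.mean(errors ** 2)),
        'MAE': float(np.mean(np.abs(errors))),
        # R2 is undefined when every true label is identical
        'R2': 1.0 - ss_res / ss_tot if ss_tot > 0 else float('nan'),
    }

# Toy labels, invented for illustration only
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

With 0/1 labels, MSE and MAE both equal the misclassification rate. One such dictionary per distance measure can be collected into a pandas DataFrame to produce the table.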

***4. Choose different K values (k=1,3,5,7,11,17,23,28) and experiment. Plot a graph showing R2 score vs k.***
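One way to organize this experiment is sketched below: a small majority-vote KNN using cosine similarity, evaluated for several values of k. This is not the previously implemented model; `knn_predict`, `r2`, and the toy data are hypothetical stand-ins, and labels are assumed to be encoded as spam=1, ham=0.

```python
import numpy as np
from collections import Counter

def normalise(X):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote KNN; neighbours are ranked by cosine similarity."""
    # After row normalisation a dot product equals the cosine similarity.
    sims = normalise(X_query) @ normalise(X_train).T
    predictions = []
    for row in sims:
        nearest = np.argsort(-row, kind='stable')[:k]  # k most similar rows
        votes = Counter(int(y_train[i]) for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

def r2(y_true, y_pred):
    """R-squared; assumes y_true is not constant."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy data, invented for illustration only
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array([1, 1, 1, 0, 0, 0])
X_val = np.array([[0.7, 0.3], [0.3, 0.7]])
y_val = np.array([1, 0])

r2_by_k = {k: r2(y_val, knn_predict(X_train, y_train, X_val, k)) for k in (1, 3, 5)}
print(r2_by_k)
```

The resulting dictionary maps each k to its validation R2 score, so plotting its keys against its values with matplotlib gives the requested graph. Odd values of k avoid ties in a two-class vote.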

### Train and test sklearn's KNN classifier on your data (use the metric that gave the best results in your experiments with the built-from-scratch model)

***Compare the results of both models.***
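A minimal sketch of the sklearn side of the comparison. Cosine is assumed here purely for illustration; substitute whichever metric performed best in your own experiments. The toy arrays are invented and stand in for the TF-IDF matrices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for illustration only (rows stand in for TF-IDF vectors)
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array([1, 1, 1, 0, 0, 0])  # assumed encoding: spam=1, ham=0
X_val = np.array([[0.7, 0.3], [0.3, 0.7]])
y_val = np.array([1, 0])

# The cosine metric requires the brute-force neighbour search.
clf = KNeighborsClassifier(n_neighbors=3, metric='cosine', algorithm='brute')
clf.fit(X_train, y_train)

pred = clf.predict(X_val)
print(pred, clf.score(X_val, y_val))  # predictions and validation accuracy
```

Running both models on the same validation split with the same k and metric makes the comparison direct; small differences can come from tie-breaking among equally distant neighbours.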

***What is the time complexity of training using KNN classifier?***

***What is the time complexity while testing? Is KNN a linear classifier or can it learn any boundary?***