# Assignment: Classification

Classification refers to categorizing the given data into classes. For example,
- Given an image of a handwritten character, identifying the character (multi-class classification)
- Given an image, annotating it with all the objects present in the image (multi-label classification)
- Classifying an email as spam or non-spam (binary classification)
- Classifying a tumor as benign or malignant (binary classification)

In this assignment, we will be building a classifier to classify emails as spam or non-spam. We will be using the Kaggle dataset [Spam or Not Spam Dataset](https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset?resource=download) for this task. 

**Note**: You cannot load any libraries other than the mentioned ones.




### Data pre-processing
The first step in any machine learning pipeline is to convert the raw data into a meaningful representation. We will use the [Bag-of-Words](https://towardsdatascience.com/a-simple-explanation-of-the-bag-of-words-model-b88fc4f4971) representation to process the text. It comprises the following steps:

- Process emails line-by-line to extract all the words.
- Replace each extracted word with its stem (root) word. This is known as stemming (lemmatization is a closely related technique).
- Remove stop words like and, or, is, am, and so on.
- Assign a unique index to each word. This forms the vocabulary.
- Represent each email as a binary vector of length equal to the size of the vocabulary such that the $i^{th}$ element of the vector is 1 iff the $i^{th}$ word of the vocabulary is present in the email.
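
The last two steps can be sketched on a toy example. This is only an illustration, assuming the emails have already been tokenized, stemmed, and stripped of stop words; the helper names `build_vocab` and `to_binary_vector` and the two toy emails are hypothetical and are not part of the assignment's function signatures.

```python
# Hypothetical helpers illustrating the vocabulary and binary-vector steps.

def build_vocab(tokenized_emails):
    """Assign each distinct word a unique index, in order of first appearance."""
    index = {}
    for tokens in tokenized_emails:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(index)
    return index


def to_binary_vector(tokens, index):
    """Return a 0/1 vector whose i-th entry is 1 iff word i occurs in tokens."""
    vec = [0] * len(index)
    for tok in set(tokens):
        if tok in index:
            vec[index[tok]] = 1
    return vec


# Toy, already pre-processed emails (made up for this sketch).
emails = [["free", "offer", "free"], ["meet", "offer"]]

word_index = build_vocab(emails)
vectors = [to_binary_vector(e, word_index) for e in emails]

print(word_index)  # {'free': 0, 'offer': 1, 'meet': 2}
print(vectors)     # [[1, 1, 0], [0, 1, 1]]
```

Note that the repeated word "free" in the first email still produces a single 1: the binary representation records presence, not counts.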

Here we provide the function signatures along with their expected functionality. You are expected to complete them accordingly.

In [None]:
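import csv
import sys

import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# global vocabulary: a word's position in this list is its index
vocabulary=[]

# raise the csv field size limit so that long emails can be read
csv.field_size_limit(sys.maxsize)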

# takes an email as an argument
# read email line-by-line and extract all the words
# return list of extracted words
def read_email(email):
    word = email.split()
    return word

# takes a list of words as an argument
# replace each word by their stem word
# return list of stem words
ps = PorterStemmer()

def stemming(words):
    # build a new list instead of modifying the caller's list in place
    return [ps.stem(w) for w in words]

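# takes a list of stem-words as an argument
# remove stop words
# return list of stem words after removing stop words
# the stop-word set is built once; it needs the NLTK stop-word corpus
# (nltk.download('stopwords')). The input is already stemmed, so the stemmed
# form of each stop word is included too (stemming can alter a stop word).
stop_words = set(stopwords.words('english'))
stop_words |= {ps.stem(w) for w in stop_words}

def remove_stop_words(words):
    return [w for w in words if w not in stop_words]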

# takes a list of stem-words as an argument
# add new words to the vocabulary and assign a unique index to them
# returns new vocabulary
def build_vocabulary(stem_words):
  #define vocab
  for x in stem_words:
    if x not in vocabulary:
      vocabulary.append(x)
  return vocabulary


# takes a list of stem-words and vocabulary as an argument
# returns bow representation
def get_bow(words, vocabulary):
    # use a set so that each membership test takes constant time
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]


# read the entire dataset
# convert emails to bow and maintain their labels
# call function get_bow()
def read_data():
  s=[]
  ns =[]
  with open('spam_or_not_spam.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
      email = row[0]
      spam = row[1]
      words = remove_stop_words(stemming(read_email(email)))

      build_vocabulary(words)

      if spam == '1':
        s.append(words)
      else:
        ns.append(words)

  email_s=[]
  for mail in s:
    t=get_bow(mail, vocabulary)
    email_s.append(t)


  email_ns=[]
  for mail in ns:
    t=get_bow(mail, vocabulary)
    email_ns.append(t)

  return email_s,email_ns



### Data Visualization
Let's understand the data distribution:
- Visualize the frequency of word occurrence in all the emails (spam + non-spam)
- Visualize the frequency of word occurrence for spam and non-spam emails separately
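
The frequencies to plot can be obtained by summing the bag-of-words vectors column by column. The sketch below uses a made-up three-word vocabulary and toy matrices (all names and values are illustrative assumptions). Because the vectors are binary, each sum counts the number of emails that contain the word, not the total number of times the word occurs.

```python
import numpy as np

# Toy vocabulary and binary bag-of-words matrices (one row per email).
vocab = ["offer", "meet", "free"]
spam_bow = np.array([[1, 0, 1],
                     [1, 0, 1]])
ham_bow = np.array([[0, 1, 0],
                    [1, 1, 0],
                    [0, 1, 0]])

# Summing over axis 0 adds up each column, i.e. each vocabulary word.
spam_freq = spam_bow.sum(axis=0)
ham_freq = ham_bow.sum(axis=0)
total_freq = spam_freq + ham_freq

for word, n_spam, n_ham, n_all in zip(vocab, spam_freq, ham_freq, total_freq):
    print(f"{word}: spam={n_spam}, non-spam={n_ham}, all={n_all}")
```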

In [None]:
import matplotlib.pyplot as plt

# visualize data distribution
def data_vis(s,ns):

  total = [x + y for (x, y) in zip(s, ns)]
  x=np.array(range(len(vocabulary)))

  fig,ax = plt.subplots(3)
  fig.suptitle("Graph of words vs frequencies", fontsize=8)

  # plotting the points
  ax[0].plot(x, s)
  ax[0].set_title('spam emails')
  ax[1].plot(x, ns)
  ax[1].set_title('non- spam emails')
  ax[2].plot(x, total)
  ax[2].set_title('all emails')

  # function to show the plot
  plt.show()

  return

email_s,email_ns =read_data()
s = np.sum(email_s, 0)
ns = np.sum(email_ns, 0)

data_vis(s,ns)


### Learn a Classifier
Split the dataset randomly in the ratio 80:20 into training and test datasets. Use only the training dataset to learn the classifier. No test data should be used during training; test data will only be used during evaluation.

Now let us use ML algorithms to classify emails as spam or non-spam. You are supposed to apply the [SVM](https://scikit-learn.org/stable/modules/svm.html) and [K-Nearest Neighbour](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) algorithms available in scikit-learn, using the same training dataset for both.
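
One way to sketch the random 80:20 split is to shuffle the sample indices and cut them at the 80% mark. The helper name `split_indices` and the fixed seed are assumptions made for this illustration; fixing the seed only makes the split reproducible between runs.

```python
import numpy as np


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Return disjoint (train, test) index arrays covering range(n_samples)."""
    rng = np.random.default_rng(seed)      # seeded generator for reproducibility
    order = rng.permutation(n_samples)     # random ordering of 0..n_samples-1
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Indexing both the feature vectors and the labels with the same index arrays keeps each email paired with its label, and no test index ever appears in the training set.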

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

# split dataset
def split(data):
  np.random.shuffle(data)
  train_data=[]
  test_data=[]

  i=0
  for mail in data:
    if i < (0.8)*(len(data)):
      train_data.append(mail)
    else:
      test_data.append(mail)
    i+=1

  return train_data, test_data

# learn a SVM model
# use the model to make prediction
# return the model predictions on train and test dataset
def svm_classifier(X_train,Y_train,X_test):
  svc = svm.SVC(kernel='linear', C=1.0).fit(X_train, Y_train)
  predict_labels=(svc.predict(X_train),svc.predict(X_test))
  return predict_labels

# implement k-NN algorithm
# use the model to make prediction
# return the model predictions on train and test dataset
def knn_classifier(X_train,Y_train,X_test):
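  # note: k is set to half the training set size, which is unusually large;
  # with imbalanced classes such a large k pushes predictions toward the majority class
  knn = KNeighborsClassifier(n_neighbors=int((len(Y_train))*(1/2)))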
  knn.fit(X_train, Y_train)
  predict_labels=(knn.predict(X_train),knn.predict(X_test))
  return predict_labels

data=[]
email_s,email_ns =read_data()

for mail in email_s:
  x=(mail,1)
  data.append(x)
for mail in email_ns:
  x=(mail,0)
  data.append(x)

train_data, test_data = split(data)

X_train =[]
Y_train =[]
for mail_label in train_data:
  X_train.append(mail_label[0])
  Y_train.append(mail_label[1])

X_test =[]
Y_test =[]
for mail_label in test_data:
  X_test.append(mail_label[0])
  Y_test.append(mail_label[1])

svm_train_predictions, svm_test_predictions = svm_classifier(X_train,Y_train,X_test)
knn_train_predictions, knn_test_predictions = knn_classifier(X_train,Y_train,X_test)


### Model Evaluation
Compare the SVM and k-NN models using the following metrics:
- Accuracy
- [AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html)
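
To make the two metrics concrete, here is a small worked sketch in plain Python. Accuracy is the fraction of correct predictions. For AUC it uses the pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function names and the toy labels and scores are made up for illustration; for the assignment itself, use the scikit-learn metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def auc_pairwise(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.35, 0.8]                    # toy continuous scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # hard labels at threshold 0.5

print(accuracy(y_true, y_pred))       # 0.5
print(auc_pairwise(y_true, scores))   # 0.75
print(auc_pairwise(y_true, y_pred))   # 0.5
```

The last two lines differ because hard 0/1 predictions discard the ranking information that the continuous scores carry, so AUC computed from hard labels reflects only a single decision threshold.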


In [None]:
from sklearn import metrics

def compute_accuracy(true_labels, predicted_labels):
  acc= metrics.accuracy_score(true_labels, predicted_labels)
  return acc


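# compute AUC score
# note: roc_auc_score is normally given continuous scores (for example the
# output of decision_function or predict_proba); with hard 0/1 predictions
# it reflects a single decision threshold only
def compute_auc(true_labels, predicted_labels):
  auc = metrics.roc_auc_score(true_labels, predicted_labels)
  return auc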


# for svm
print("SVM, test data:")
print("  accuracy:", compute_accuracy(Y_test,svm_test_predictions))
print("  AUC:", compute_auc(Y_test,svm_test_predictions))
print("SVM, train data:")
print("  accuracy:", compute_accuracy(Y_train,svm_train_predictions))
print("  AUC:", compute_auc(Y_train,svm_train_predictions))

# for knn
print("k-NN, test data:")
print("  accuracy:", compute_accuracy(Y_test,knn_test_predictions))
print("  AUC:", compute_auc(Y_test,knn_test_predictions))
print("k-NN, train data:")
print("  accuracy:", compute_accuracy(Y_train,knn_train_predictions))
print("  AUC:", compute_auc(Y_train,knn_train_predictions))