---

# AERO 5 - Hands on Machine Learning for cybersecurity (2022/2023)


# 3– Machine Learning for email fraud and spam detection

by Leila GHARSALLI

---

In this lab session we will discuss how machine learning is used for spam detection. We will define our own vectorizer to clean the datasets. Then we will train logistic regression and naive Bayes classifiers.

The `scikit-learn` documentation is comprehensive and should be consulted whenever necessary. In particular, see:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/naive_bayes.html

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction


## Exercise 1: Logistic Regression

Logistic regression is a binary classification technique. A key difference from linear regression is that the modeled output is a binary class label rather than a continuous numeric value. In this exercise, we will apply a logistic regression model to detect SMS spam.
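To make the idea concrete, here is a minimal sketch on made-up one-dimensional data (not the SMS dataset). It assumes the default `LogisticRegression` settings and checks that the predicted probability is the sigmoid of a linear score; the toy arrays and the `sigmoid` helper are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration: small values are class 0, large values class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

def sigmoid(z):
    # Squashes any real-valued score into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Linear score w * x + b, then the sigmoid gives P(class = 1 | x).
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = sigmoid(scores)

# The same probabilities as reported by scikit-learn, and the hard labels.
sklearn_proba = clf.predict_proba(X)[:, 1]
labels = clf.predict(X)
print(labels)
```

The hard label is obtained by thresholding the probability at 0.5, which is what makes the output binary.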

The dataset includes a collection of 425 SMS spam messages from Grumbletext, a UK website where users manually report spam text messages. Legitimate (ham) messages randomly chosen from the National University of Singapore SMS Corpus (NSC) have also been added to the dataset. Another 450 benign SMS messages were collected from Caroline Tagg's PhD thesis.

### Question 1:
Start by importing the relevant packages: `pandas` provides data frame capabilities, and `scikit-learn` is used to split the data into training and testing datasets. We will also use the logistic regression implementation available in `scikit-learn`.

In [None]:
# EDIT THIS CELL

### Question 2:
Import the dataset `SMSSpamCollection.csv` and analyze it.

In [None]:
# EDIT THIS CELL

### Question 3:
TF-IDF is used for featurization. Transform the data accordingly so that it can be fitted by the logistic regression model.
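As a hint, the following sketch shows what TF-IDF featurization produces on three made-up messages (not the lab dataset). It assumes the default settings of `TfidfVectorizer`, which drops single-character tokens and L2-normalizes each row; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented messages, for illustration only.
messages = [
    "win a free prize now",
    "free entry win cash",
    "are you coming home now",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)  # sparse matrix: one row per message

# Map each learned term to its inverse document frequency (IDF) weight.
idf = {term: vectorizer.idf_[col] for term, col in vectorizer.vocabulary_.items()}

# Each row is scaled to unit Euclidean length by default.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()

print(X.shape)
print(sorted(idf, key=idf.get))  # terms seen in more messages come first
```

Terms that occur in many messages, such as "free" here, receive a lower IDF weight than terms that occur in only one, such as "prize".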

In [None]:
# EDIT THIS CELL

### Question 4:
Use the following test messages to check the predictions of the model: 'URGENT! Your Mobile No 1234 was awarded a Prize', 'Hey honey, what’s up?'. Conclude.

In [None]:
# EDIT THIS CELL

## Exercise 2: Naïve Bayes classifier

This exercise uses the naive Bayes classifier for spam filtering. For this purpose, we will consider the dataset `sms_spam_no_header.csv`.

### Question 1:
Start by importing the relevant packages.

In [None]:
# EDIT THIS CELL

### Question 2: 
Import the dataset `SMSSpam_no_header.csv` and split it into train and test datasets.

In [None]:
# EDIT THIS CELL

The messages in the dataset are unstructured and noisy, so it is important to preprocess the text before feature extraction and modelling. Tokenization converts a continuous stream of text into separate tokens, one for each word.

In [None]:
from textblob import TextBlob

def get_tokens(msg):
    # Split a message into a list of word tokens.
    return TextBlob(str(msg)).words

Lemmatization then groups together the inflected forms of a word so that they can be analyzed as a single item, identified by the word's lemma, or dictionary form. Hence words like 'moved' and 'moving' will be reduced to 'move'.

In [None]:
def get_lemmas(msg):
    # Replace each token of the message with its lemma (dictionary form).
    return [word.lemma for word in get_tokens(msg)]

### Question 3:
Extract the TF-IDF text features. Then transform the data so that it can be fitted by the multinomial naive Bayes model (`MultinomialNB`).
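As a hint, a custom tokenizing function can be passed to `TfidfVectorizer` through its `analyzer` parameter. The sketch below assumes a hypothetical `simple_analyzer` as a stand-in for `get_lemmas` (so that it runs without TextBlob) and uses six made-up messages rather than the lab dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def simple_analyzer(msg):
    # Hypothetical stand-in for get_lemmas: lowercase, then split on whitespace.
    return str(msg).lower().split()

# Invented messages and labels, for illustration only.
texts = [
    "win free prize now",
    "free cash prize call now",
    "urgent free entry win",
    "see you at home",
    "are you coming tonight",
    "call me when you are home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The vectorizer turns each message into TF-IDF features,
# which the multinomial naive Bayes classifier is then fitted on.
model = make_pipeline(TfidfVectorizer(analyzer=simple_analyzer), MultinomialNB())
model.fit(texts, labels)

spam_like = model.predict(["free prize win"])
ham_like = model.predict(["are you home"])
print(spam_like, ham_like)
```

Wrapping both steps in a pipeline ensures that new messages go through exactly the same preprocessing as the training data.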

In [None]:
# EDIT THIS CELL

### Question 4:
Now use the model to make a prediction on a sample text. Then compute the accuracy of the model. Conclude.

In [None]:
# EDIT THIS CELL

## Exercise 3: Naïve Bayes classifier and SVM

Consider the following code:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, naive_bayes, metrics

In [None]:
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head(n=10)

In [None]:
# Count the messages in each class (the top-level pd.value_counts is deprecated in recent pandas).
count_Class = data["v1"].value_counts(sort=True)
count_Class.plot(kind= 'bar', color= ["blue", "orange"])
plt.title('Bar chart')
plt.show()

In [None]:
count_Class.plot(kind = 'pie',  autopct='%1.0f%%')
plt.title('Pie chart')
plt.ylabel('')
plt.show()

In [None]:
count1 = Counter(" ".join(data[data['v1']=='ham']["v2"]).split()).most_common(20)
df1 = pd.DataFrame.from_dict(count1)
df1 = df1.rename(columns={0: "words in non-spam", 1 : "count"})
count2 = Counter(" ".join(data[data['v1']=='spam']["v2"]).split()).most_common(20)
df2 = pd.DataFrame.from_dict(count2)
df2 = df2.rename(columns={0: "words in spam", 1 : "count_"})

df1.plot.bar(legend = False)
y_pos = np.arange(len(df1["words in non-spam"]))
plt.xticks(y_pos, df1["words in non-spam"])
plt.title('Most frequent words in non-spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()

df2.plot.bar(legend = False, color = 'orange')
y_pos = np.arange(len(df2["words in spam"]))
plt.xticks(y_pos, df2["words in spam"])
plt.title('Most frequent words in spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()

In [None]:
f = feature_extraction.text.CountVectorizer(stop_words = 'english')
X = f.fit_transform(data["v2"])
np.shape(X)

In [None]:
data["v1"]=data["v1"].map({'spam':1,'ham':0})
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, data['v1'], test_size=0.33, random_state=42)
print([np.shape(X_train), np.shape(X_test)])

In [None]:
list_alpha = np.arange(1/100000, 20, 0.11)
score_train = np.zeros(len(list_alpha))
score_test = np.zeros(len(list_alpha))
recall_test = np.zeros(len(list_alpha))
precision_test= np.zeros(len(list_alpha))
count = 0
for alpha in list_alpha:
    bayes = naive_bayes.MultinomialNB(alpha=alpha)
    bayes.fit(X_train, y_train)
    score_train[count] = bayes.score(X_train, y_train)
    score_test[count]= bayes.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, bayes.predict(X_test))
    precision_test[count] = metrics.precision_score(y_test, bayes.predict(X_test))
    count = count + 1 

In [None]:
# Stack the results column-wise; a plain ndarray is enough (np.matrix is discouraged).
matrix = np.c_[list_alpha, score_train, score_test, recall_test, precision_test]
models = pd.DataFrame(data = matrix, columns = 
             ['alpha', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(n=10)

In [None]:
best_index = models['Test Precision'].idxmax()
models.iloc[best_index, :]

In [None]:
models[models['Test Precision']==1].head(n=5)

In [None]:
best_index = models[models['Test Precision']==1]['Test Accuracy'].idxmax()
bayes = naive_bayes.MultinomialNB(alpha=list_alpha[best_index])
bayes.fit(X_train, y_train)
models.iloc[best_index, :]

In [None]:
m_confusion_test = metrics.confusion_matrix(y_test, bayes.predict(X_test))
pd.DataFrame(data = m_confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

### Question 1:
Run the program and interpret the result obtained in each cell.
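When interpreting the last cells, it may help to recall how accuracy, precision and recall follow from the confusion matrix. The sketch below uses made-up labels (1 for spam, 0 for ham, as in the code above), not the results of the notebook.

```python
import numpy as np
from sklearn import metrics

# Invented ground truth and predictions, for illustration only (1 = spam, 0 = ham).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision, recall, accuracy)
```

A test precision of 1 means there are no false positives, that is, no legitimate message is flagged as spam. This is the criterion the code above uses to select `alpha` before maximizing the test accuracy.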

### Question 2:
Apply the same reasoning using an SVM model with a Gaussian (RBF) kernel, then compare its accuracy to that of naive Bayes.
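As a starting point, the sketch below fits an SVM with a Gaussian kernel on bag-of-words counts. It uses six made-up messages instead of `spam.csv`, and the values `C=1.0` and `gamma="scale"` are assumptions (the scikit-learn defaults) that the exercise would replace with a search similar to the one over `alpha`.

```python
from sklearn import feature_extraction, svm

# Invented messages and labels, for illustration only (1 = spam, 0 = ham).
texts = [
    "win free prize now",
    "free cash prize call now",
    "urgent free entry win",
    "see you at home",
    "are you coming tonight",
    "call me when you are home",
]
y = [1, 1, 1, 0, 0, 0]

# Same featurization as in the naive Bayes code: raw word counts.
f = feature_extraction.text.CountVectorizer()
X = f.fit_transform(texts)

# kernel="rbf" selects the Gaussian kernel; C and gamma are the parameters to tune.
clf = svm.SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

train_accuracy = clf.score(X, y)
pred = clf.predict(f.transform(["free prize"]))
print(train_accuracy, pred)
```

On the real data, the score should be computed on the held-out test set (`X_test`, `y_test`) so that it is comparable with the naive Bayes results.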

In [None]:
# EDIT THIS CELL