# AAS SPAM Detector

The objective is to develop a SPAM classifier that reaches at least 70% accuracy. You can, and should, use everything that was presented in the theoretical notebooks.

The dataset differs from the toy one used in class: the work is done on the Enron-Spam dataset. Enron-Spam is a fantastic resource collected by V. Metsis, I. Androutsopoulos and G. Paliouras and described in their publication ["Spam Filtering with Naive Bayes - Which Naive Bayes?"](https://nes.aueb.gr/ipl/nlp/pubs/ceas2006_paper.pdf). It contains 17,171 spam and 16,545 non-spam ("ham") e-mail messages (33,716 e-mails in total). The original dataset and documentation can be found [here](http://www2.aueb.gr/users/ion/data/enron-spam/readme.txt).
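
As a quick sanity check on the 70% target, the class counts quoted above imply that a trivial classifier that always answers "spam" would already be right about half of the time. The sketch below only redoes that arithmetic; the variable names are illustrative, and the counts come from the dataset description rather than from the loaded file.

```python
# Class counts as quoted in the dataset description.
n_spam = 17_171
n_ham = 16_545
n_total = n_spam + n_ham

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(n_spam, n_ham) / n_total

print(f'total e-mails     : {n_total}')
print(f'spam fraction     : {n_spam / n_total:.3f}')
print(f'majority baseline : {majority_baseline:.3f}')
```

Because the two classes are nearly balanced, accuracy is a reasonable headline metric here, and a 70% target sits clearly above the trivial baseline.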

In [28]:
import os
import nltk
import math
import tqdm
import joblib
import polars as pl
import numpy as np

from joblib import Parallel, delayed
from collections import Counter

from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef, accuracy_score, f1_score

import gensim
from gensim.models.fasttext import FastText

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')
# stop-word list used by text_pre_processing below
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/mantunes/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mantunes/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/mantunes/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/mantunes/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/mantunes/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [15]:
df = pl.read_csv('datasets/enron_spam_data.csv.gz')

In [16]:
df

shape: (42_619, 2)
┌─────────────────────────────────┬─────────┐
│ text                            ┆ is_spam │
│ ---                             ┆ ---     │
│ str                             ┆ i64     │
╞═════════════════════════════════╪═════════╡
│ Re: New Sequences Window        ┆ 0       │
│                                 ┆         │
│     …                           ┆         │
│ [zzzzteana] RE: Alexander       ┆ 0       │
│                                 ┆         │
│ Mar…                            ┆         │
│ [zzzzteana] Moscow bomber       ┆ 0       │
│                                 ┆         │
│ Man…                            ┆         │
│ [IRR] Klez: The Virus That  Wo… ┆ 0       │
│ Re: [zzzzteana] Nothing like m… ┆ 0       │
│ …                               ┆ …       │
│ Какой у тебя любимый цвет? Фан… ┆ 1       │
│ Что ты сегодня ел, Чудо-печень… ┆ 1       │
│ Что ты думаешь о спорте? Секре… ┆ 1       │
│ Какой твой любимый вид 

In [17]:
def _nltk_pos_tagger(nltk_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant (or None)."""
    if nltk_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif nltk_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return None


def _nltk_pos_lemmatizer(lemmatizer, token, tag):
    """Lemmatize a token, using its WordNet POS tag when one is available."""
    if tag is None:
        return lemmatizer.lemmatize(token)
    else:
        return lemmatizer.lemmatize(token, tag)


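def text_pre_processing(txt, m=2):
    """Tokenize, filter, lemmatize and lowercase txt; keep tokens longer than m characters."""
    if txt is not None:
        lemmatizer = nltk.WordNetLemmatizer()
        stop_words = set(nltk.corpus.stopwords.words('english'))

        tokens = nltk.word_tokenize(txt)
        # keep purely alphabetic tokens (drops numbers and punctuation)
        tokens = [w for w in tokens if w.isalpha()]
        # the stop-word list is lowercase, so capitalised tokens such as 'You' pass this filter
        tokens = [w for w in tokens if w not in stop_words]
        # POS-tag the tokens and map the tags to WordNet classes for the lemmatizer
        tokens = nltk.pos_tag(tokens)
        tokens = [(t[0], _nltk_pos_tagger(t[1])) for t in tokens]
        tokens = [_nltk_pos_lemmatizer(lemmatizer, w, t).lower() for w, t in tokens]
        # discard very short tokens
        tokens = [w for w in tokens if len(w) > m]
    else:
        tokens = []
    return tokens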

In [18]:
dataset = [(text, label) for (text, label) in df.rows()]

In [19]:
corpus = [text for text, _ in dataset]
y = [label for _, label in dataset]

In [20]:
#tokens_clean = [text_pre_processing(text) for text in corpus]
tokens_clean = Parallel(n_jobs=-1)(delayed(text_pre_processing)(text) for text in corpus)
tokens_merge = [' '.join(line) for line in tokens_clean]

In [49]:
X_train, X_test, y_train, y_test = train_test_split(tokens_merge, y, test_size=0.2, random_state=42, stratify=y)

In [50]:
X_train

['work greatt hello welcome medzonli moonbeam shop please introduce one ieading online pandemonium armaceuticai shop coronal disaster proportionality afflict rhomboid lag chamfer haycock furiosity andmanyother save coachhouse total confi appease dentiaiity worldwide phonetic shlpplng blackleg miilion customer country converter day',
 'transwestern deal fyi kim original message krishnarao pinnamaneni send wednesday june watson kimberly kiatsupaibul seksan subject transwestern deal let discuss result tomorrow also historical spot price information please see folks thanks krishna',
 'you hand select free info doctype html public html html head meta meta mshtml body center table tbody font http img http img http img http img http img http font center strong you hand select access exclusive information font http free table tbody center http img http img http noshade font arial you receive mail member subscribe unsubscribe http font arial click here http reply email remove subject line must 

In [59]:
# train Bag of Words model
cv = CountVectorizer(ngram_range = (1, 2), max_features=512)
X_train_cv = cv.fit_transform(X_train)
print(f'Training Data CV : {X_train_cv.shape}')

# transform X_test using CV
X_test_cv = cv.transform(X_test)
print(f'Test Data CV : {X_test_cv.shape}')

Training Data CV : (34095, 512)
Test Data CV : (8524, 512)
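
To make the `ngram_range = (1, 2)` and `max_features` settings concrete, here is a minimal standard-library sketch of the same idea: count unigrams and bigrams, keep the most frequent terms seen in the training documents, and reuse that fixed vocabulary for the test documents. It is a simplified illustration, not `CountVectorizer` itself (no lowercasing, token pattern or sparse output), and the helper names and toy documents are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield every contiguous n-gram (joined by spaces) for n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])


def fit_vocabulary(docs, max_features):
    """Keep the max_features n-grams with the highest total count over all docs."""
    totals = Counter()
    for doc in docs:
        totals.update(ngrams(doc.split()))
    return [term for term, _ in totals.most_common(max_features)]


def transform(docs, vocabulary):
    """Turn each document into a list of counts, one per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(ngrams(doc.split()))
        rows.append([counts[term] for term in vocabulary])
    return rows


# hypothetical toy documents, already pre-processed and space-joined
toy_train = ['free offer click here', 'meeting schedule for tomorrow', 'click here free']
toy_test = ['free meeting click here']

vocab = fit_vocabulary(toy_train, max_features=4)
train_counts = transform(toy_train, vocab)
test_counts = transform(toy_test, vocab)

print(vocab)
print(train_counts)
print(test_counts)
```

As with `cv.transform(X_test)`, terms that were not selected during fitting (here `meeting`) are simply ignored in the test documents, which is why the vocabulary must be learned on the training split only.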


In [60]:
# define the list of classifiers
clfs = [
    ('LR', LogisticRegression(random_state=42, max_iter=600)),
    ('KNN', KNeighborsClassifier(n_neighbors=1)),
    ('NB', GaussianNB()),
    ('RFC', RandomForestClassifier(random_state=42)),
    ('MLP', MLPClassifier(random_state=42, learning_rate='adaptive', max_iter=1000))
]

# whenever possible use joblib to speed up the training
with joblib.parallel_backend('loky', n_jobs=-1):
    for label, clf in clfs:
        # train the model
        clf.fit(X_train_cv.toarray(), y_train)

        # generate predictions
        predictions = clf.predict(X_test_cv.toarray())

        # compute the performance metrics
        mcc = matthews_corrcoef(y_test, predictions)
        acc = accuracy_score(y_test, predictions)
        f1 = f1_score(y_test, predictions, average='weighted')
        print(f'{label:3} {acc:.2f} {f1:.2f} {mcc:.2f}')

LR  0.91 0.91 0.81
KNN 0.89 0.89 0.77
NB  0.70 0.69 0.51
RFC 0.94 0.94 0.87
MLP 0.94 0.94 0.87
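
The three columns printed above are accuracy, weighted F1 and the Matthews correlation coefficient (MCC). As a rough illustration of why MCC is reported next to accuracy, the sketch below computes both from the binary confusion counts using only the standard library. The function names and toy labels are assumptions made for the example; the notebook itself relies on `sklearn.metrics`.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # by convention the coefficient is 0 when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom if denom else 0.0


# toy labels: 8 ham (0) and 2 spam (1); the "classifier" always predicts ham
y_true = [0] * 8 + [1] * 2
always_ham = [0] * 10

print(f'accuracy = {accuracy(y_true, always_ham):.2f}')  # 0.80
print(f'mcc      = {mcc(y_true, always_ham):.2f}')       # 0.00
```

A classifier that ignores its input can still score a high accuracy on skewed data, while its MCC stays at zero; MCC only approaches 1 when both classes are predicted well.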


In [29]:
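#text_model = gensim.models.fasttext.load_facebook_model('vec/cc.en.300.bin.gz')
text_model = gensim.models.fasttext.FastText(vector_size=300, window=7, min_count=3, workers=os.cpu_count(), seed=42)
# NOTE: build_vocab only initialises the vocabulary and weights; no call to
# text_model.train(...) follows, so the vectors used below are untrained
text_model.build_vocab(tokens_clean)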

In [30]:
text_model

<gensim.models.fasttext.FastText at 0x7f29d55c5b10>

In [31]:
def div_norm(x):
    # scale x to unit L2 norm; an all-zero vector is returned unchanged
    norm_value = np.linalg.norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    else:
        return x


def word_vector_to_sentence_vector(sentence: list, model):
    vectors = []
    # for all the tokens in the sentence
    for token in sentence:
        if token in model:
            vectors.append(model[token])
    # add the EOS token
    if '\n' in model:
        vectors.append(model['\n'])
    # normalize all the vectors
    vectors = [div_norm(x) for x in vectors]
    return np.mean(vectors, axis=0)
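
The pooling step above can be checked by hand on a tiny example. The following sketch reproduces the normalise-then-average logic with plain Python lists and a hypothetical two-dimensional embedding table; it leaves out the EOS (`'\n'`) vector that the notebook function appends, and all names in it are illustrative.

```python
import math


def unit(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def sentence_vector(tokens, embeddings):
    """Average the unit-normalised vectors of the tokens found in embeddings."""
    vectors = [unit(embeddings[t]) for t in tokens if t in embeddings]
    if not vectors:
        # no known token: there is nothing to average
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# hypothetical 2-d embeddings, chosen so the norms are easy to verify
toy_embeddings = {'free': [3.0, 4.0], 'offer': [0.0, 2.0]}

vec = sentence_vector(['free', 'offer', 'unknown'], toy_embeddings)
print(vec)
```

Normalising each word vector before averaging means every token contributes a direction of equal weight, so a few words with large norms cannot dominate the sentence representation.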

In [32]:
X = np.array([word_vector_to_sentence_vector(sentence, text_model.wv) for sentence in tqdm.tqdm(tokens_clean)])

100%|██████████| 42619/42619 [00:36<00:00, 1171.36it/s] 


In [33]:
X

array([[ 0.00279531, -0.00512118, -0.00034927, ..., -0.00136312,
         0.0035082 ,  0.0112902 ],
       [ 0.00109208, -0.00869433, -0.00416818, ..., -0.00940189,
         0.00230878,  0.00696706],
       [ 0.00596662, -0.01144918,  0.00143926, ..., -0.00331017,
         0.00716811,  0.01155127],
       ...,
       [ 0.00058237,  0.00910582, -0.01642736, ..., -0.03440075,
         0.02302997, -0.01838688],
       [-0.00693345,  0.02170273,  0.02433146, ..., -0.0011592 ,
         0.00071695,  0.01745962],
       [ 0.00632776, -0.00383493,  0.00124999, ..., -0.0118996 ,
        -0.03748241,  0.01994425]], dtype=float32)

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

In [35]:
# define the list of classifiers
clfs = [
    ('LR', LogisticRegression(random_state=42, max_iter=600)),
    ('KNN', KNeighborsClassifier(n_neighbors=1)),
    ('NB', GaussianNB()),
    ('RFC', RandomForestClassifier(random_state=42)),
    ('MLP', MLPClassifier(random_state=42, learning_rate='adaptive', max_iter=1000))
]

# whenever possible use joblib to speed up the training
with joblib.parallel_backend('loky', n_jobs=-1):
    for label, clf in clfs:
        # train the model
        clf.fit(X_train, y_train)

        # generate predictions
        predictions = clf.predict(X_test)

        # compute the performance metrics
        mcc = matthews_corrcoef(y_test, predictions)
        acc = accuracy_score(y_test, predictions)
        f1 = f1_score(y_test, predictions, average='weighted')
        print(f'{label:3} {acc:.2f} {f1:.2f} {mcc:.2f}')

LR  0.85 0.85 0.69
KNN 0.89 0.89 0.78
NB  0.57 0.56 0.23
RFC 0.86 0.86 0.73
MLP 0.93 0.93 0.86
