# AAS SPAM Detector

The objective is to develop a SPAM classifier capable of reaching at least 70% accuracy. You can, and should, use all that was presented in the theoretical notebooks.

The dataset is different from the toy one used in the class, instead the work will be done on the Enron SPAM dataset. The Enron-Spam dataset is a fantastic ressource collected by V. Metsis, I. Androutsopoulos and G. Paliouras and described in their publication ["Spam Filtering with Naive Bayes - Which Naive Bayes?"](https://nes.aueb.gr/ipl/nlp/pubs/ceas2006_paper.pdf). The dataset contains a total of 17.171 spam and 16.545 non-spam ("ham") e-mail messages (33.716 e-mails total). The original dataset and documentation can be found [here](http://www2.aueb.gr/users/ion/data/enron-spam/readme.txt).

In [37]:
import os
import math
import nltk
import tqdm
import joblib
import numpy as np
import pandas as pd
import polars as pl
import seaborn as sns
from nltk.stem import WordNetLemmatizer
import gensim
from gensim.models.fasttext import FastText
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef, accuracy_score, f1_score

from sklearn.manifold import TSNE
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [38]:
df = pl.read_csv('datasets/enron_spam_data.csv.gz')

In [39]:
df.describe

<bound method DataFrame.describe of shape: (42_619, 2)
┌─────────────────────────────────┬─────────┐
│ text                            ┆ is_spam │
│ ---                             ┆ ---     │
│ str                             ┆ i64     │
╞═════════════════════════════════╪═════════╡
│ Re: New Sequences Window        ┆ 0       │
│                                 ┆         │
│     …                           ┆         │
│ [zzzzteana] RE: Alexander       ┆ 0       │
│                                 ┆         │
│ Mar…                            ┆         │
│ [zzzzteana] Moscow bomber       ┆ 0       │
│                                 ┆         │
│ Man…                            ┆         │
│ [IRR] Klez: The Virus That  Wo… ┆ 0       │
│ Re: [zzzzteana] Nothing like m… ┆ 0       │
│ …                               ┆ …       │
│ Какой у тебя любимый цвет? Фан… ┆ 1       │
│ Что ты сегодня ел, Чудо-печень… ┆ 1       │
│ Что ты думаешь о спорте? Секре… ┆ 1       │
│ Какой твой любимый вид 

In [40]:
def div_norm(x):
   norm_value = np.linalg.norm(x)
   if norm_value > 0:
       return x * ( 1.0 / norm_value)
   else:
       return x


def word_vector_to_sentence_vector(sentence:list, model):
    vectors = []
    # for all the tokens in the setence
    for token in sentence:
        if token in model:
            vectors.append(model[token])
    # add the EOS token
    if '\n' in model:
        vectors.append(model['\n'])
    # normalize all the vectors
    vectors = [div_norm(x) for x in vectors]
    return np.mean(vectors, axis=0)


In [41]:
dataset = df.rows()
dataset = [(text, label) for (label, text) in dataset]
targets = [label for _, label in dataset]

In [42]:
lemmatizer = WordNetLemmatizer()
stop_words = set(nltk.corpus.stopwords.words('english'))
corpus = [text for text, text in dataset]  # Ensure all elements are strings
sentences_clean = [' '.join([lemmatizer.lemmatize(w).lower() for w in nltk.word_tokenize(sample) 
                if len(lemmatizer.lemmatize(w)) > 2 and w.isalpha() and w not in stop_words])
                for sample in corpus]

In [43]:
text_model = FastText(vector_size=256, window=7, min_count=3, workers=os.cpu_count(), seed=42)
text_model.build_vocab(sentences_clean)

In [44]:
X = np.array([word_vector_to_sentence_vector(sentence, text_model.wv) for sentence in tqdm.tqdm(sentences_clean)])

100%|██████████| 42619/42619 [04:47<00:00, 148.12it/s] 


In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, targets, stratify=targets, test_size=0.2, random_state=42)

MemoryError: Unable to allocate 47.5 GiB for an array with shape (42619,) and data type <U299456

In [None]:
print(f'Training Data : {len(X_train)}')
print(f'Testing Data  : {len(X_test)}')

# define the list of classifiers
clfs = [
    ('LR', LogisticRegression(random_state=42, multi_class='auto', max_iter=600)),
    ('KNN', KNeighborsClassifier(n_neighbors=1)),
    ('NB', GaussianNB()),
    ('RFC', RandomForestClassifier(random_state=42)),
    ('MLP', MLPClassifier(random_state=42, learning_rate='adaptive', max_iter=1000))
]

# whenever possible used joblib to speed-up the training
with joblib.parallel_backend('loky', n_jobs=-1):
    for label, clf in clfs:
        # train the model
        clf.fit(X_train, y_train)

        # generate predictions
        predictions = clf.predict(X_test)

        # compute the performance metrics
        mcc = matthews_corrcoef(y_test, predictions)
        acc = accuracy_score(y_test, predictions)
        f1 = f1_score(y_test, predictions, average='weighted')
        print(f'{label:3} {acc:.2f} {f1:.2f} {mcc:.2f}')