# Spam classifier

# Problem definition

We need to classify whether an email is spam or not.
This is a binary classification problem: the label `spam` takes the value 1 for spam and 0 for legitimate mail (ham).

## Precision vs Recall tradeoff

We also want to study the tradeoff between PRECISION and RECALL.

**Precision**: the fraction of mails classified as spam that really are spam. If, out of 100 mails, we classify 70 as spam and 50 of those are actually spam, the precision is 50/70. In other words: of all the spam predictions we made, how many are correct.

**Recall**: the fraction of actual spam that is detected. If, out of 100 mails, 80 are spam and we correctly classify 70 of them as spam, the recall is 70/80. In other words: of all the mails that really are spam, how many did we catch.

*Precision* = TP / (TP + FP)

*Recall* = TP / (TP + FN)
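As a quick check of the two formulas, here is a minimal sketch that computes both metrics from confusion-matrix counts and reproduces the two worked examples. The function names `precision` and `recall` are illustrative helpers, not taken from any library.

```python
def precision(tp, fp):
    """Fraction of mails flagged as spam that really are spam."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of real spam mails that were flagged as spam."""
    return tp / (tp + fn)

# Precision example: 70 mails flagged as spam, 50 of them really spam -> TP=50, FP=20
print(precision(tp=50, fp=20))
# Recall example: 80 real spam mails, 70 of them flagged -> TP=70, FN=10
print(recall(tp=70, fn=10))
```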

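So for this spam filter, we choose to maximize recall, i.e. to let as little spam through as possible, keeping in mind that pushing recall up tends to lower precision (more legitimate mails get flagged as spam).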
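The tradeoff can be illustrated with a small sketch: a classifier typically outputs a score, and the threshold at which a mail is flagged decides where we land between precision and recall. The scores and labels below are made-up toy values, and `precision_recall_at` is a hypothetical helper written for this illustration.

```python
# Hypothetical classifier scores (higher = more spam-like) and true labels (1 = spam)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

def precision_recall_at(threshold, scores, labels):
    """Flag every mail whose score is >= threshold and return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    p = tp / (tp + fp) if tp + fp else 1.0  # convention used here: nothing flagged -> precision 1
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Lowering the threshold flags more mails: recall goes up, precision goes down
for t in (0.85, 0.5, 0.1):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

With these toy values, a low threshold catches every spam mail at the cost of flagging some ham, which is the side of the tradeoff a recall-oriented filter accepts.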

# Data gathering

### Downloading

In [137]:
import os
import tarfile
import urllib.request

DOWNLOAD_PATH = "datasets"

SPAM_URL = 'https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2'
SPAM_2_URL = 'https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2'
HAM_URL = 'https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2'

def fetch_spam_data(spam_url=SPAM_URL, spam2_url=SPAM_2_URL, ham_url=HAM_URL, download_path=DOWNLOAD_PATH):
    # Create the download folder if it does not exist yet
    os.makedirs(download_path, exist_ok=True)
    for url in (spam_url, spam2_url, ham_url):
        filename = url.split('/')[-1]
        path = os.path.join(download_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        with tarfile.open(path) as tar:
            tar.extractall(path=download_path)

fetch_spam_data()

### Loading data

In [138]:
from os import listdir
import pandas as pd

SPAM_FOLDER = os.path.join(DOWNLOAD_PATH, 'spam')
SPAM2_FOLDER = os.path.join(DOWNLOAD_PATH, 'spam_2')
HAM_FOLDER = os.path.join(DOWNLOAD_PATH, 'easy_ham')

exclude = ['0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1']

def load_spam_data(spam_folder=SPAM_FOLDER,  spam2_folder=SPAM2_FOLDER, ham_folder=HAM_FOLDER):
    data = []
    for filename in listdir(spam_folder):
        if filename in exclude:
            continue
        with open(os.path.join(spam_folder, filename), 'rb') as f:
            text = f.read().decode('latin-1')
            data.append((text, 1))
    for filename in listdir(spam2_folder):
        with open(os.path.join(spam2_folder, filename), 'rb') as f:
            text = f.read().decode('latin-1')
            data.append((text, 1))
    for filename in listdir(ham_folder):
        with open(os.path.join(ham_folder, filename), 'rb') as f:
            text = f.read().decode('latin-1')
            data.append((text, 0))       

    spam_df = pd.DataFrame(data, columns=['text', 'spam'])
    return spam_df

spam_df = load_spam_data()
# save in csv
spam_df.to_csv(os.path.join(DOWNLOAD_PATH, 'spam.csv'), index=False)

In [139]:
spam_df.head()

                                               text  spam
0  From 12a1mailbot1@web.de Thu Aug 22 13:17:22 ...     1
1  From ilug-admin@linux.ie Thu Aug 22 13:27:39 ...     1
2  From sabrina@mx3.1premio.com Thu Aug 22 14:44...     1
3  From wsup@playful.com Thu Aug 22 16:17:00 200...     1
4  From social-admin@linux.ie Thu Aug 22 16:37:3...     1


In [140]:
print(spam_df['spam'].value_counts())

spam
0    2551
1    1897
Name: count, dtype: int64


# Data preparation

## Train test split

In [141]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(spam_df, spam_df['spam']):
    strat_train_set = spam_df.iloc[train_index]
    strat_test_set = spam_df.iloc[test_index]

print(f"Training set size: {len(strat_train_set)}")
print(f"Test set size: {len(strat_test_set)}")
print(f"Training set spam ratio: {strat_train_set['spam'].value_counts() / len(strat_train_set)}")
print(f"Test set spam ratio: {strat_test_set['spam'].value_counts() / len(strat_test_set)}")

Training set size: 3558
Test set size: 890
Training set spam ratio: spam
0    0.573637
1    0.426363
Name: count, dtype: float64
Test set spam ratio: spam
0    0.573034
1    0.426966
Name: count, dtype: float64
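To make explicit what "stratified" means here, the following is a minimal pure-Python sketch of the idea: sample the test fraction from each class separately so both sets keep roughly the same spam ratio. It is not scikit-learn's implementation, `stratified_split` is a hypothetical helper, and the exact set sizes may differ slightly from the ones printed above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the dataset: 2551 ham, 1897 spam
labels = [0] * 2551 + [1] * 1897
train_idx, test_idx = stratified_split(labels)
spam_ratio_train = sum(labels[i] for i in train_idx) / len(train_idx)
spam_ratio_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))
print(round(spam_ratio_train, 3), round(spam_ratio_test, 3))
```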


## Feature engineering

I am going to create two types of new features:
- stats about the email: proportion of upper/lower chars, num of exclamations, question marks, etc
- vector of processed words

### Mail stats

In [226]:
# Function to calc stats from sender
import re
import numpy as np

def sender_stats(text):
    # Use the first "From: ..." header line; fall back to '' if the mail has none
    match = re.search(r'^From: (.*)', text, flags=re.MULTILINE)
    sender = match.group(1) if match else ''
    n = max(len(sender), 1)  # avoid dividing by zero on an empty header
    sender_num_rate = len([c for c in sender if c.isdigit()]) / n
    sender_upper_rate = len([c for c in sender if c.isupper()]) / n
    sender_exclamation_rate = len([c for c in sender if c == '!']) / n

    return np.array([sender_num_rate, sender_upper_rate, sender_exclamation_rate])

# Function to calc stats from subject

CURRENCY_SYMBOLS = ['$', '£', '€', '¥', '₹', '₽', '₩', '₴', '₱', '₲', '₪', '₫', '₵', '₭', '₦', '₸', '₼', '₡', '₢', '₯', '₠', '₧', '₣', '₤', '₶', '₺', '₾', '₿']

def subject_stats(text):
    # Use the first "Subject: ..." header line; fall back to '' if the mail has none
    match = re.search(r'^Subject: (.*)', text, flags=re.MULTILINE)
    subject = match.group(1) if match else ''
    n = max(len(subject), 1)  # avoid dividing by zero on an empty subject
    subject_num_rate = len([c for c in subject if c.isdigit()]) / n
    subject_upper_rate = len([c for c in subject if c.isupper()]) / n
    subject_currency_rate = len([c for c in subject if c in CURRENCY_SYMBOLS]) / n
    subject_exclamation_rate = len([c for c in subject if c == '!']) / n

    return np.array([subject_num_rate, subject_upper_rate, subject_currency_rate, subject_exclamation_rate])
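As an alternative to regex lookups on the raw text, the headers can be read with Python's standard `email` package. The sketch below is only an illustration under assumptions: the message in `raw` is invented, and `char_rate` is a hypothetical helper that mirrors the rate features computed above.

```python
from email import message_from_string

# Invented sample message in the same mbox-like layout as the corpus files
raw = (
    "From sender@example.com  Tue Sep 17 23:29:48 2002\n"
    "Return-Path: <sender@example.com>\n"
    "From: \"Some Sender\" <sender@example.com>\n"
    "Subject: WIN $1000 NOW!!!\n"
    "\n"
    "Body text here.\n"
)

def char_rate(text, predicate):
    """Share of characters in text that satisfy predicate; 0.0 for empty text."""
    if not text:
        return 0.0
    return sum(1 for c in text if predicate(c)) / len(text)

msg = message_from_string(raw)
sender = msg.get("From", "")      # '' when the header is missing
subject = msg.get("Subject", "")  # '' when the header is missing

print(sender)
print(subject)
print(char_rate(subject, str.isupper))
print(char_rate(subject, lambda c: c == "!"))
```

Using the parser avoids index errors on mails without a `From:` or `Subject:` header, and the empty-string default keeps the rate computation safe from division by zero.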

In [227]:
# Create a custom Transformer that will be called from a ColumnTransformer
# This custom transformer will ingest a Pandas Df of dimension (n, 1) and return a numpy array of dimension (n, x), x = number of new features

from sklearn.base import BaseEstimator, TransformerMixin

class AttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # X is a Pandas Df of dimension (n, 1)
        # Each new feature is a np array with one value per row (n values)

        avg_word_len = X['text'].apply(lambda x: len(x) / len(x.split())).values.reshape(-1, 1)
        rate_upper = X['text'].apply(lambda x: len([c for c in x if c.isupper()]) / len(x)).values.reshape(-1, 1)
        rate_exclamation = X['text'].apply(lambda x: len([c for c in x if c == '!']) / len(x)).values.reshape(-1, 1)
        rate_question = X['text'].apply(lambda x: len([c for c in x if c == '?']) / len(x)).values.reshape(-1, 1)
        # Sender stats
        sender_num_rate, sender_upper_rate, sender_exclamation_rate = np.array(list(X['text'].apply(sender_stats))).T
        # Subject stats
        subject_num_rate, subject_upper_rate, subject_currency_rate, subject_exclamation_rate = np.array(list(X['text'].apply(subject_stats))).T


        return np.c_[
            avg_word_len, rate_upper, rate_exclamation, rate_question, 
            sender_num_rate, sender_upper_rate, sender_exclamation_rate,
            subject_num_rate, subject_upper_rate, subject_currency_rate, subject_exclamation_rate
        ]

In [228]:
# Create a pipeline that adds stats features and then uses StandardScaler to scale the data

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

stats_preprocessing_pipeline = Pipeline([
    ('add_stats', AttributesAdder()),
    ('std_scaler', StandardScaler())
])

# Create column transformer that will apply the stats_preprocessing_pipeline to the text column
stats_column_transformer = ColumnTransformer([
    ('stats', stats_preprocessing_pipeline, ['text'])
], remainder='passthrough')

In [229]:
# Test on sample data
sample = strat_train_set.sample(5)
print(sample)

sample_pr = stats_column_transformer.fit_transform(sample)
print(sample_pr)

                                                   text  spam
104   From hadley@cb.offermonkey.com  Mon Aug 26 20:...     1
347   Return-Path: ler@lerami.lerctr.org\nDelivery-D...     1
1305  From money@viplook.net  Mon Jul 22 18:09:50 20...     1
3829  From rssfeeds@jmason.org  Tue Sep 24 10:47:26 ...     0
2761  From fork-admin@xent.com  Mon Sep 30 13:53:28 ...     0
[[ 0.92059662  1.30098222 -0.5         1.57704399  0.          1.03190896
   0.          0.50822246  0.7387665  -0.5         0.          1.        ]
 [-0.47931231 -0.44061642 -0.5        -0.16764798  0.          0.85953535
   0.         -0.95156546 -0.37867197 -0.5         0.          1.        ]
 [-1.63988276 -0.25936013  2.         -1.0209632   0.          0.01522388
   0.          1.68686604 -1.36655236  2.          0.          1.        ]
 [ 1.10685563 -1.49746375 -0.5        -1.0209632   0.         -1.78352664
   0.         -0.95156546 -0.48086649 -0.5         0.          0.        ]
 [ 0.09174282  0.89645808 -0.5  
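The columns of zeros and the repeated -0.5 / 2.0 values in the output above are consistent with standardizing over only 5 rows: a constant feature maps to 0, and a feature that is non-zero in a single row maps to 2.0 for that row and -0.5 for the others. The sketch below reproduces this with a plain z-score using the population standard deviation, which is the behavior I assume for `StandardScaler` here; `standardize` is a hypothetical helper, not part of the notebook.

```python
def standardize(column):
    """Z-score a list of numbers using the population std (ddof=0)."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    if std == 0:
        std = 1.0  # constant column: avoid dividing by zero, result is all zeros
    return [(x - mean) / std for x in column]

# A feature that is 0 for every sampled mail (e.g. no '!' in any sender)
constant = standardize([0.0, 0.0, 0.0, 0.0, 0.0])
# A feature that is non-zero for exactly one of the five sampled mails
single = standardize([0.0, 0.0, 0.1, 0.0, 0.0])
print(constant)
print([round(v, 6) for v in single])
```

On the full training set these degenerate patterns should mostly disappear, since the features are no longer constant across rows.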

### Word Vectorizer

In [253]:
# Create a word vectorizer transformer that will be applied to the text column
# It replaces numbers, URLs, emails, currency symbols and IPs with the placeholders NUM, URL, EMAIL, CURRENCY, IP
# It takes the hyperparameters: tolower, stem, strip_header
# This vectorizer will be called from a ColumnTransformer

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re

class WordVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, tolower=True, stem=True, strip_header=True):
        self.tolower = tolower
        self.stem = stem
        self.strip_header = strip_header
        self.stemmer = PorterStemmer()
        self.stopwords = set(stopwords.words('english'))
        
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # X is a Pandas Df of dimension (n, 1)
        # Returns a 1-D np array with the n cleaned texts
        X = X.copy()
        X['text'] = X['text'].apply(self._clean_text)
        return X['text'].values

    def _clean_text(self, text):
        if self.strip_header:
            # Header is everything before the first blank line;
            # keep the whole text if there is no blank line
            header, sep, body = text.partition('\n\n')
            text = body if sep else text

        if self.tolower:
            text = text.lower()

        # Remove stop words
        text = ' '.join([w for w in text.split() if w not in self.stopwords])

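        # Replace the most specific patterns first, so the NUM substitution
        # does not consume the digits inside URLs, emails and IP addresses
        # URL
        text = re.sub(r'(?:https?|ftp)://[\w/\-?=%.]+\.[\w/\-?=%.]+', 'URL', text)
        # EMAIL
        text = re.sub(r'\S+@\S+', 'EMAIL', text)
        # IP
        text = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', 'IP', text)
        # CURRENCY: any run of chars in CURRENCY_SYMBOLS
        text = re.sub('[{}]+'.format(re.escape(''.join(CURRENCY_SYMBOLS))), 'CURRENCY', text)
        # NUM: integers, decimals and scientific notation
        text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUM', text)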

        # STEM
        if self.stem:
            text = ' '.join([self.stemmer.stem(w) for w in text.split()])
        

        return text
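The order of the placeholder substitutions matters, because a number pattern applied first removes the digits that the IP pattern needs. A small sketch of the effect, using the IP regex from the class and an illustrative number pattern `NUM_RE` (an assumption, not necessarily the exact pattern used in the class):

```python
import re

IP_RE = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
NUM_RE = r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?'  # illustrative number pattern

text = "connect to 206.191.151.2 on port 8080"

# Numbers first: the digits are gone before the IP pattern runs
num_first = re.sub(IP_RE, 'IP', re.sub(NUM_RE, 'NUM', text))
# IP first: the address is replaced as a whole, then the remaining numbers
ip_first = re.sub(NUM_RE, 'NUM', re.sub(IP_RE, 'IP', text))

print(num_first)
print(ip_first)
```

The same reasoning applies to URLs and email addresses that contain digits: applying the most specific patterns before the generic number pattern keeps them intact.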

In [257]:
corpus = strat_train_set['text'].sample(1).values
print(corpus)

['From johnhall@evergo.net  Tue Sep 17 23:29:48 2002\nReturn-Path: <johnhall@evergo.net>\nDelivered-To: yyyy@localhost.example.com\nReceived: from localhost (jalapeno [127.0.0.1])\n\tby jmason.org (Postfix) with ESMTP id CF5C316F03\n\tfor <jm@localhost>; Tue, 17 Sep 2002 23:29:47 +0100 (IST)\nReceived: from jalapeno [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor jm@localhost (single-drop); Tue, 17 Sep 2002 23:29:47 +0100 (IST)\nReceived: from mail.evergo.net ([206.191.151.2]) by dogma.slashnull.org\n    (8.11.6/8.11.6) with SMTP id g8HKUvC25665 for <jm@jmason.org>;\n    Tue, 17 Sep 2002 21:30:58 +0100\nReceived: (qmail 31515 invoked from network); 17 Sep 2002 20:31:20 -0000\nReceived: from dsl.206.191.151.102.evergo.net (HELO JMHALL)\n    (206.191.151.102) by mail.evergo.net with SMTP; 17 Sep 2002 20:31:20 -0000\nReply-To: <johnhall@evergo.net>\nFrom: "John Hall" <johnhall@evergo.net>\nTo: <yyyy@example.com>, "\'Gary Lawrence Murphy\'" <garym@canada.com>\nCc: "\'Stephen

In [258]:
vectorizer = WordVectorizer()
X_vec = vectorizer.fit_transform(pd.DataFrame(corpus, columns=['text']))
print(X_vec)

["> from: email email > sent: tuesday, septemb num, num num:num > > ... we'v fight war terror long > > there' commerce, think we'd /realize/ > > escal violenc solution. > > well said! > > --j. yeah. certainli solut carthaginian problem barbari pirates. wait ... ... actual ... rather perman solution."]


In [259]:
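# Count how many URL, EMAIL, CURRENCY, IP, NUM placeholders are in X_vec
# Note: `X_vec == 'URL'` compares each whole document with the token, so it only
# counts documents that consist of exactly that token, not tokens inside a document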
num_url = np.sum(X_vec == 'URL')
num_email = np.sum(X_vec == 'EMAIL')
num_currency = np.sum(X_vec == 'CURRENCY')
num_ip = np.sum(X_vec == 'IP')
num_num = np.sum(X_vec == 'NUM')

print(f"URL: {num_url}")
print(f"EMAIL: {num_email}")
print(f"CURRENCY: {num_currency}")
print(f"IP: {num_ip}")
print(f"NUM: {num_num}")


URL: 0
EMAIL: 0
CURRENCY: 0
IP: 0
NUM: 0
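The counts above are all zero even though the cleaned text visibly contains `email` and `num`: the comparison is made against whole documents, and the stemmed output shows the placeholders in lowercase. A minimal sketch of counting the placeholders as tokens inside each document follows; `count_placeholders` is a hypothetical helper and `docs` is a shortened excerpt of the cleaned sample printed earlier.

```python
import re

PLACEHOLDERS = ('url', 'email', 'currency', 'ip', 'num')

def count_placeholders(docs):
    """Count placeholder tokens across all documents, ignoring case."""
    counts = {p: 0 for p in PLACEHOLDERS}
    for doc in docs:
        # Split into alphabetic tokens so 'num,' and 'num:num' are counted correctly
        for token in re.findall(r'[a-z]+', doc.lower()):
            if token in counts:
                counts[token] += 1
    return counts

docs = ["> from: email email > sent: tuesday, septemb num, num num:num"]
counts = count_placeholders(docs)
for name, value in counts.items():
    print(f"{name.upper()}: {value}")
```

One caveat of this approach: once the placeholders are lowercased, an ordinary word such as "email" in the body is indistinguishable from the EMAIL placeholder, so these counts are an upper bound.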
