# Spam Filtering with Memory-Based and Naive Bayes models

This notebook was written as part of the project for the L101 module. It studies Memory-Based and Naive Bayes models for spam filtering and is composed of four parts:
- Data pre-processing: multiple steps of pre-processing on the train set
- Adaptation set pre-processing
- Test set pre-processing
- Naive Bayes models: apply Naive Bayes models to the data pre-processed previously  


Memory-Based models are run with the TiMBL software (Daelemans et al., 2000), using the following command line in a console: `timbl -f data_timbl/data_700.train -t data_timbl/data_700.test -wgr -dID -k1 +vcs`  
This command line is composed of the following arguments:
- `-f`: file with the train set in the C4.5 format (see Section 2.2 in the report)
- `-t`: file with the adaptation/test set in the C4.5 format
- `-w`: feature-weighting scheme (gr: Gain Ratio, 0: Equal Weights)
- `-d`: distance-weighting scheme (Z: Equal Distance, ID: Inverse Distance, IL: Inverse Linear, ED: Exponential Decay)
- `-k`: number of nearest neighbours
- `+v`: output format (cs: class statistics with precision, recall, f1-score and AUC metrics)  
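
The command line above can also be assembled programmatically, which helps when several weighting schemes are compared. The sketch below is only illustrative: `build_timbl_command` is a hypothetical helper (it is not part of TiMBL) and it only uses the flags that appear in the command above.

```python
import shlex

def build_timbl_command(train_file, test_file, feature_weighting="gr",
                        distance_weighting="ID", k=1, verbosity="cs"):
    """Assemble the TiMBL call described above as a list of arguments."""
    return [
        "timbl",
        "-f", train_file,            # train set
        "-t", test_file,             # adaptation or test set
        "-w" + feature_weighting,    # feature-weighting scheme
        "-d" + distance_weighting,   # distance-weighting scheme
        "-k" + str(k),               # number of nearest neighbours (as in -k1 above)
        "+v" + verbosity,            # output format
    ]

command = build_timbl_command("data_timbl/data_700.train", "data_timbl/data_700.test")
print(shlex.join(command))
```

Once TiMBL is installed, the resulting list could be passed to `subprocess.run(command)`.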


Model and parameter selection is carried out with the train and adaptation sets, while the methods are compared using the train and test sets.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup 
import nltk
import heapq
from info_gain import info_gain
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn import metrics
from sklearn import preprocessing

## Data pre-processing

### Data reading and pre-processing

Read the XML files containing the messages and transform them into dataframes

In [2]:
def xml_to_df(file_path):
    # parse the file and collect all <message> elements
    with open(file_path, "rb") as file:
        data = BeautifulSoup(file)
    messages = data.find_all('message')
    frames = []
    for msg in messages:
        tag_dict = {}
        for tag in msg.children:
            if tag.name is not None:
                if tag.text_normal is None:
                    # simple field (date, from, to, ...)
                    tag_dict[tag.name] = [tag.string]
                else:
                    # field made of one or several <text_normal> chunks
                    tmp = tag.find_all("text_normal")
                    text_normal = ""
                    for i in range(len(tmp)-1):
                        tag_dict[tag.name + "_normal_" + str(i)] = tmp[i].get_text().replace(tmp[i+1].get_text(),'')
                        text_normal += "\n" + tag_dict[tag.name + "_normal_" + str(i)]
                    tag_dict[tag.name + "_normal_" + str(len(tmp)-1)] = tmp[len(tmp)-1].get_text()
                    text_normal += "\n" + tag_dict[tag.name + "_normal_" + str(len(tmp)-1)]
                    tag_dict[tag.name + "_normal"] = text_normal
                if tag.text_embedded is not None:
                    # embedded (quoted) text chunks
                    tmp = tag.find_all("text_embedded")
                    for i in range(len(tmp)-1):
                        tag_dict[tag.name + "_embedded_" + str(i)] = tmp[i].get_text().replace(tmp[i+1].get_text(),'')
                    tag_dict[tag.name + "_embedded_" + str(len(tmp)-1)] = tmp[len(tmp)-1].get_text()
        frames.append(pd.DataFrame(tag_dict))
    # one row per message
    df = pd.concat(frames) if frames else pd.DataFrame()
    # drop the columns that are entirely empty
    df = df.drop(df.columns[df.notnull().sum() == 0], axis=1)
    return df

In [3]:
df_gen = xml_to_df("GenSpam/train_GEN.ems")
df_gen['spam'] = False
df_spam = xml_to_df("GenSpam/train_SPAM.ems")
df_spam['spam'] = True

In [4]:
df = pd.concat([df_gen, df_spam])
df = df.sample(frac=1)  # shuffle the messages

In [5]:
df = df.reset_index().drop(['index'], axis=1)
df.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_embedded_1,...,message_body_embedded_57,message_body_embedded_58,message_body_embedded_59,message_body_embedded_60,message_body_embedded_61,message_body_embedded_62,message_body_embedded_63,message_body_embedded_64,message_body_embedded_65,spam
0,"Fri, 04 Apr 2003 22:00:48 PST",\n,org,\n\n^ Q-tips ( &CHAR ) &NAME &NAME : A House ...,\n \n\n^ Q-tips ( &CHAR ) &NAME &NAME : A Hous...,"text/html; charset=""us-ascii""",\n\n^ &NAME &NAME &NAME - &NAME &NAME &NAME A...,\n \n\n^ &NAME &NAME &NAME - &NAME &NAME &NAME...,,,...,,,,,,,,,,True
1,"Wed, 26 Mar 03 05:18:06 GMT",com,org,"\n\n^ Re : online drug store , valium , viagr...","\n \n\n^ Re : online drug store , valium , via...",multipart/alternative,,,,,...,,,,,,,,,,True
2,"Fri, 2 Jun 2000 13:31:19 +0100 (BST)",ac.uk,ac.uk,\r\n\r\n^ Re : &NAME ! \r\n,\n \r\n\r\n^ Re : &NAME ! \r\n,TEXT/PLAIN; charset=US-ASCII,\r\n\r\n^ Watch for buses when crossing the r...,\n \r\n\r\n^ Watch for buses when crossing the...,\r\n\r\n^ &NAME ! ! ! ! ! \r\n^ It went reall...,,...,,,,,,,,,,False
3,"Wed, 30 Jan 2002 19:49:42 +0000",ac.uk,ac.uk,\r\n\r\n^ Re : Information \r\n,\n \r\n\r\n^ Re : Information \r\n,text/plain; charset=us-ascii,"\r\n\r\n^ PS - I like "" &NAME "" ' if it 's al...","\n \r\n\r\n^ PS - I like "" &NAME "" ' if it 's ...","\r\n\r\n^ Dear &NAME , \r\n^ &NAME , I forgot...",\r\n\r\n^ Oh ! ! ! \r\n^ Good thing I checked...,...,,,,,,,,,,False
4,"Wed, 9 Apr 2003 12:34:43 +0300",\n,org,\n\n^ STOP &NAME STARTING FROM TODAY ( &NAME ...,\n \n\n^ STOP &NAME STARTING FROM TODAY ( &NAM...,text/html; charset=ISO-8859-1,,,,,...,,,,,,,,,,True


Clean text chunks

In [6]:
filter_col = [col for col in df if col.startswith('message_body') or col.startswith('subject_normal')]
for col in filter_col:
    df[col] = df[col].apply(lambda x: str(x).replace('\n','').replace('\r','').replace('^',''))
df['to'] = df['to'].apply(lambda x: str(x).replace('\n',''))
df['from'] = df['from'].apply(lambda x: str(x).replace('\n',''))
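
Note that `str(x)` turns a missing value into the literal string `'nan'`, which is then processed like ordinary text; this may be why `'nan'` shows up among the top-ranked tokens further down. If that is not wanted, a cleaning helper can leave missing values untouched. The following is a minimal sketch; `clean_chunk` is a hypothetical name, not a function used in this notebook.

```python
import math

def clean_chunk(x):
    """Strip line breaks and '^' markers; keep missing values as None."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return None
    return str(x).replace('\n', '').replace('\r', '').replace('^', '')

print(repr(str(float('nan'))))                      # 'nan'
print(repr(clean_chunk(float('nan'))))              # None
print(repr(clean_chunk('\n\n^ Re : results \n')))   # ' Re : results '
```

It could be applied column by column with `df[col].apply(clean_chunk)`.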

Save pre-processed dataframes

In [None]:
df.to_csv("/content/drive/MyDrive/Cambridge/SpamFiltering/df_pre_processing.csv", sep=',', index=False)

In [None]:
df = pd.read_csv("df_pre_processing.csv", sep=',')

### Data formatting

In [7]:
# number of attributes retained to build models
nb_words = 700

Extract lemmas for tokens

In [8]:
wordnet_lemmatizer = WordNetLemmatizer()

def sentence_lemma(x):
    try:
        word_list = nltk.word_tokenize(x)
        lemmatized_output = ' '.join([wordnet_lemmatizer.lemmatize(w) for w in word_list])
        return lemmatized_output
    except Exception:
        # input that cannot be tokenized (e.g. a missing value) is mapped to None
        return None

In [9]:
df['subject_body'] = df['subject_normal_0'] + df['message_body_normal_0']
df['subject_body'] = df['subject_body'].str.lower()
df['subject_body'] = df['subject_body'].apply(lambda x: sentence_lemma(x))

Extract tokens from subject and body text chunks, compute the information gain of all tokens, and rank tokens according to their information gain

In [10]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df.loc[df['subject_body'].notna(),'subject_body'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0

In [11]:
q = []  # heap of (information gain, token) pairs

# our tokenization method discards punctuation from the token list
# as the punctuation distribution differs between genuine and spam messages, we manually add punctuation marks to the list of tokens to consider
punctuation = ['"',"'","(",")","-","+","[","]","{","}",";",":",",","\\","<",">",".","/","?","@","#","$","%","^","&","*","_","~"]
token_list = list(vectorizer.get_feature_names_out()) + punctuation  # get_feature_names() in scikit-learn < 1.0

text_all = ' '.join(df.loc[df['subject_body'].notna(),'subject_body'])

for token in token_list:
    # we only keep tokens of at most 15 characters that occur more than 4 times in the concatenated text
    if len(token) > 15 or text_all.count(token) <= 4:
        continue
    try:
        # we add tokens with their information gain to the priority queue
        # str.contains treats the token as a regular expression, so tokens that are
        # not valid patterns on their own (e.g. '(' or '+') raise an error and are skipped
        heapq.heappush(q, (info_gain.info_gain(df.spam, df.subject_body.str.contains(token)), token))
    except Exception:
        continue
heapq.nlargest(10, q)

[(0.0719170742000782, 'wrote'),
 (0.06851656953960494, 'nan'),
 (0.04473188593734856, 'click'),
 (0.04434977821089897, 'lick'),
 (0.04413880021512451, 'clic'),
 (0.043527897371007984, 'wr'),
 (0.039914334000032414, ','),
 (0.039635402286571364, 'hope'),
 (0.038302958857321656, 'rot'),
 (0.03604220837976613, 'cli')]
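
For reference, the information gain of a Boolean attribute A (token present or not) with respect to the class C is IG(C, A) = H(C) - Σ_a P(A = a) · H(C | A = a), where H is the Shannon entropy. The pure-Python sketch below computes this quantity on a toy example. It is written for illustration only and is not the code of the `info_gain` package, whose exact conventions (for instance the logarithm base) may differ.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, attribute):
    """Entropy of the labels minus their expected entropy within each attribute value."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [label for label, a in zip(labels, attribute) if a == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# toy data: 4 spam and 4 genuine messages, 3 of the spam messages contain the token
spam = [True, True, True, True, False, False, False, False]
has_token = [True, True, True, False, False, False, False, False]
print(round(information_gain(spam, has_token), 4))  # 0.5488
```

An attribute that is identical to the class gives the maximum gain (1 bit here), and an attribute that is constant gives 0.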

700 tokens with best information gain

In [12]:
vocab_best_ig = ['click',
 'lick',
 'clic',
 'cli',
 'lic',
 'wrote',
 'ick',
 'cl',
 'bsc',
 'subscrib',
 'ubscribe',
 'subscribe',
 'cribe',
 'ribe',
 'remov',
 'scribe',
 'unsub',
 'scr',
 'offer',
 'cr',
 'unsubs',
 'remove',
 'emove',
 'unsubscr',
 'unsubscri',
 'unsubscrib',
 'nsubscribe',
 'unsubscribe',
 'unsu',
 'cri',
 'mov',
 'receiv',
 'recei',
 'free',
 'offe',
 'ck',
 'site',
 'move',
 'hope',
 'wr',
 'fre',
 'rot',
 'receive',
 'eceive',
 'sub',
 'rib',
 'web',
 'eb',
 'ffer',
 'bsit',
 'website',
 'ebsite',
 'websi',
 'bsi',
 'hop',
 'your',
 'li',
 'edi',
 'our',
 'college',
 'rday',
 'rec',
 'iday',
 'think',
 'hink',
 'rem',
 'oved',
 'but',
 'mailing',
 'ub',
 'receiving',
 'guaran',
 'guarantee',
 'ite',
 'wel',
 'ailing',
 'oing',
 'fer',
 'product',
 'going',
 'goin',
 'morrow',
 'tomorro',
 'tomorrow',
 'orrow',
 'mark',
 'sit',
 'good',
 'valu',
 'cre',
 'well',
 'evenin',
 'sc',
 'universit',
 'evening',
 'university',
 'off',
 'emoved',
 'my',
 'goo',
 'sorry',
 'duct',
 'low',
 'prod',
 'love',
 'lin',
 'lov',
 'ov',
 'thanks',
 'anyway',
 'affiliate',
 'iz',
 'know',
 'kno',
 'sav',
 'coll',
 'here',
 'ib',
 'opt',
 'online',
 'va',
 'nive',
 'onli',
 'onlin',
 'stud',
 'ei',
 'removed',
 'oon',
 'credi',
 'rod',
 'moved',
 'ote',
 'ved',
 'nl',
 'quite',
 'dear',
 'fe',
 'fr',
 'thank',
 'rant',
 'mo',
 'ze',
 'though',
 'oin',
 'link',
 'christ',
 'saturday',
 'prove',
 'edit',
 'sd',
 'tues',
 'tuesday',
 'some',
 'nda',
 'thin',
 'sex',
 'som',
 'friday',
 'orry',
 'guaranteed',
 'riday',
 'ic',
 'shipp',
 'below',
 'ante',
 'credit',
 'rda',
 'monday',
 'onday',
 'dit',
 'save',
 'go',
 'acy',
 'lua',
 'list',
 'student',
 'iv',
 'riber',
 'dollar',
 'medic',
 'subscriber',
 'meeting',
 'usine',
 'busines',
 'usiness',
 'business',
 'siness',
 'cial',
 'there',
 'thur',
 'partner',
 'request',
 'market',
 'servic',
 'thurs',
 'ff',
 'service',
 'servi',
 'weekend',
 'rida',
 'opted',
 'pray',
 'sunday',
 'week',
 'pra',
 'special',
 'pecial',
 'privacy',
 'si',
 'doll',
 'hin',
 'req',
 'ur',
 'incr',
 'future',
 'futur',
 'yesterday',
 'altho',
 'money',
 'ive',
 'thursday',
 'tue',
 'limi',
 'ncr',
 'serv',
 'holiday',
 'wee',
 'althoug',
 'fut',
 'although',
 'oney',
 'shipping',
 'valuable',
 'day',
 'morn',
 'lecture',
 'bel',
 'limited',
 'million',
 'millio',
 'mited',
 'ere',
 'rtn',
 'limite',
 'morning',
 'lion',
 'noon',
 'name',
 'red',
 'fast',
 'xu',
 'hri',
 'rtg',
 'soo',
 'ser',
 'marketing',
 'lowest',
 'noo',
 'did',
 'so',
 'chee',
 'siz',
 'soon',
 'pt',
 'which',
 'smiley',
 'ley',
 'hich',
 'seem',
 'mortgage',
 ':',
 'thu',
 'mill',
 'owes',
 'hopeful',
 'tn',
 'ipp',
 'increase',
 'limit',
 'hall',
 'mg',
 'sol',
 'hav',
 'ot',
 'am',
 'hopefully',
 'rwa',
 'ney',
 'had',
 'this',
 'lio',
 'ia',
 'smile',
 'afternoon',
 'dol',
 'inc',
 'approved',
 'cheer',
 'hat',
 'pay',
 'prayer',
 'simply',
 'christmas',
 'tha',
 'proved',
 'eek',
 'ara',
 'uld',
 'lot',
 'vice',
 'imply',
 'after',
 'fter',
 'wednesday',
 'exual',
 'have',
 'sexual',
 'approve',
 'ning',
 'age',
 'cia',
 'quit',
 'thing',
 'upply',
 'west',
 'lu',
 'meet',
 'oval',
 'priv',
 'postal',
 'supply',
 'speak',
 'aft',
 'asy',
 'talk',
 'ther',
 'line',
 'him',
 'ray',
 'eq',
 'ip',
 'ec',
 'du',
 'ym',
 'ably',
 'ship',
 'tee',
 'gt',
 'lovely',
 'thousand',
 'ial',
 ',',
 'na',
 'erc',
 'doctor',
 'gua',
 'organis',
 'next',
 'sun',
 'octor',
 'lunch',
 'dinner',
 'shoul',
 'qual',
 'probabl',
 'purchase',
 'mile',
 'should',
 'her',
 'his',
 'spec',
 'bc',
 'rv',
 'probably',
 'nin',
 'ni',
 'forward',
 'month',
 'nice',
 'might',
 'tg',
 'perhaps',
 'onth',
 'iva',
 'went',
 'room',
 'ce',
 'sim',
 'hip',
 'size',
 'fte',
 'color',
 'oo',
 'aff',
 'church',
 'duat',
 'col',
 'ken',
 'hey',
 'wed',
 'easy',
 '00',
 'ase',
 'marked',
 'href',
 'agin',
 'vac',
 'ply',
 'posta',
 'ua',
 'say',
 'trade',
 'chase',
 'lender',
 'custo',
 'rom',
 'yet',
 'graduate',
 'customer',
 'removal',
 'oup',
 'su',
 'ail',
 'pg',
 'movie',
 'peak',
 'hee',
 'dv',
 'custom',
 'vi',
 'solicit',
 'value',
 'cust',
 'suppl',
 'yy',
 'uality',
 'that',
 'birthday',
 'saving',
 'sf',
 'oi',
 'trademark',
 'qualit',
 'rather',
 'arg',
 'rma',
 'quality',
 'scription',
 'price',
 'rade',
 'chas',
 'over',
 'pro',
 'retail',
 'nex',
 'linguist',
 'high',
 'hig',
 'safe',
 'ay',
 'hundred',
 'rid',
 'reall',
 'usa',
 'really',
 'hing',
 'ag',
 'more',
 'dieti',
 'eight',
 'faster',
 'dieting',
 'ger',
 'finish',
 'finis',
 'from',
 'bout',
 'eas',
 'net',
 'hear',
 'study',
 'nte',
 'rice',
 'about',
 'bz',
 'yme',
 'zp',
 'ease',
 'linguistic',
 'pri',
 'till',
 'hg',
 'mat',
 'sometime',
 'reserved',
 'video',
 'yon',
 'unsol',
 'solicite',
 'subscribed',
 'bscribed',
 'seein',
 'ici',
 'solicited',
 'sound',
 'seeing',
 'unsolicited',
 'weight',
 'urself',
 'seems',
 'gues',
 'redu',
 'sand',
 'ourself',
 'fro',
 'language',
 'nger',
 'zi',
 'gag',
 'gage',
 'mp',
 'got',
 'yourself',
 'me',
 'guess',
 'flat',
 'athe',
 'obligation',
 '30pm',
 'langu',
 'payment',
 'ant',
 'uess',
 'fit',
 'nso',
 'then',
 'earn',
 'rever',
 'fwd',
 'ideo',
 'umm',
 '100',
 'people',
 'motion',
 'she',
 'edici',
 'dicine',
 'they',
 'edicine',
 'uage',
 'deo',
 'medicine',
 'emark',
 'emar',
 'exclusive',
 'loan',
 'kt',
 'message',
 'vid',
 'messag',
 'doc',
 '30p',
 'revers',
 'formal',
 'ture',
 'instruction',
 'ont',
 'nth',
 'weigh',
 'also',
 'informatio',
 'information',
 'lend',
 'nformation',
 'ness',
 'muscle',
 'avenue',
 'reserve',
 'income',
 'erect',
 'tv',
 'ek',
 'discreet',
 'hl',
 'enter',
 'see',
 'yd',
 'ality',
 'hda',
 'instruct',
 'qr',
 'proven',
 'gation',
 'exclu',
 'christian',
 'wish',
 'instruc',
 'sage',
 'isf',
 'informat',
 'ende',
 'would',
 'formation',
 'ello',
 'guy',
 'yz',
 'through',
 'dad',
 'meal',
 'fb',
 'imp',
 'improve',
 'throug',
 'improv',
 'ich',
 'rmation',
 'liga',
 'git',
 'nigh',
 'mar',
 'far',
 'ngt',
 'hic',
 'pill',
 'night',
 'course',
 'informa',
 'investment',
 'cur',
 'ques',
 'instru',
 'nlarge',
 'enlarge',
 'ling',
 'deliver',
 'mati',
 '%',
 'ear',
 'scribed',
 'risk',
 'geo',
 'okay',
 'nlar',
 'sag',
 'bh',
 'linguistics',
 'mation',
 'uc',
 'satis',
 'instant',
 'recurring',
 'enlarg',
 'enlar',
 'lud',
 'prof',
 'wonder',
 'prob',
 'tio',
 'subscription',
 'uite',
 'clud',
 'practical',
 'quest',
 'tion',
 'liver',
 'exerci',
 'resting',
 'fri',
 'fini',
 'med',
 '34',
 'busy',
 'interesti',
 'icin',
 'ave',
 'vie',
 'stuff',
 'hello',
 'usc',
 'disco',
 'interestin',
 'coul',
 'sletter',
 'afe',
 'gi',
 'reduce',
 'interesting',
 'could',
 'consumer',
 'rough',
 'rof',
 'thro',
 'medical',
 'exercise',
 'thr',
 'ran',
 'agra',
 'diet',
 'usi',
 'iu']

List of the tokens retained

In [13]:
vocab_best_ig = pd.DataFrame(heapq.nlargest(nb_words,q))[1]

In [14]:
col_names_tf = [col + '_tf' for col in vocab_best_ig]
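
`heapq.nlargest` compares the `(information gain, token)` tuples element by element, so tokens are ranked by information gain first and ties are broken by the token string. A small self-contained illustration, with made-up scores:

```python
import heapq

q = []
for score, token in [(0.07, 'wrote'), (0.04, 'click'), (0.04, 'hope'), (0.01, 'the')]:
    heapq.heappush(q, (score, token))

top = heapq.nlargest(3, q)
print(top)                           # [(0.07, 'wrote'), (0.04, 'hope'), (0.04, 'click')]
print([token for _, token in top])   # ['wrote', 'hope', 'click']
```

Since `heapq.nlargest` accepts any iterable, a plain list of tuples would work equally well here.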

#### Dataframe with Term-Frequency attributes

Number of occurrences of the tokens retained

In [15]:
df_tf = df.copy()
for token in vocab_best_ig:
    df_tf[token + '_tf'] = df_tf.subject_body.str.count(token)
df_tf[col_names_tf] = df_tf[col_names_tf].fillna(0)
df_tf.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_embedded_1,...,mean_tf,valuable_tf,invit_tf,committee_tf,tin_tf,work_tf,sci_tf,cult_tf,link_tf,7th_tf
0,"Fri, 04 Apr 2003 22:00:48 PST",,org,Q-tips ( &CHAR ) &NAME &NAME : A House Full ...,Q-tips ( &CHAR ) &NAME &NAME : A House Full ...,"text/html; charset=""us-ascii""",&NAME &NAME &NAME - &NAME &NAME &NAME April ...,&NAME &NAME &NAME - &NAME &NAME &NAME April ...,,,...,0,0,0,0,2,0,0,0,1,0
1,"Wed, 26 Mar 03 05:18:06 GMT",com,org,"Re : online drug store , valium , viagra , z...","Re : online drug store , valium , viagra , z...",multipart/alternative,,,,,...,0,0,0,0,0,0,0,0,0,0
2,"Fri, 2 Jun 2000 13:31:19 +0100 (BST)",ac.uk,ac.uk,Re : &NAME !,Re : &NAME !,TEXT/PLAIN; charset=US-ASCII,Watch for buses when crossing the road . In...,Watch for buses when crossing the road . In...,&NAME ! ! ! ! ! It went really well ( compa...,,...,0,0,0,0,0,0,1,0,0,0
3,"Wed, 30 Jan 2002 19:49:42 +0000",ac.uk,ac.uk,Re : Information,Re : Information,text/plain; charset=us-ascii,"PS - I like "" &NAME "" ' if it 's all the sam...","PS - I like "" &NAME "" ' if it 's all the sam...","Dear &NAME , &NAME , I forgot to mention to...",Oh ! ! ! Good thing I checked my email just...,...,0,0,0,0,0,0,0,0,0,0
4,"Wed, 9 Apr 2003 12:34:43 +0300",,org,STOP &NAME STARTING FROM TODAY ( &NAME : &NA...,STOP &NAME STARTING FROM TODAY ( &NAME : &NA...,text/html; charset=ISO-8859-1,,,,,...,0,0,0,0,1,0,0,0,0,0
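
The counts above are substring counts, so the attribute for a short token such as `lick` also fires inside `click`. The sketch below contrasts this with a whole-token count; plain Python strings are used instead of a pandas Series, and the sentence is made up.

```python
text = "click here and click there , do not lick the screen"

substring_count = text.count("lick")             # also matches inside "click"
whole_token_count = text.split().count("lick")   # only matches the token "lick"

print(substring_count, whole_token_count)        # 3 1
```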


We store the counts of the best tokens in a file for later use

In [16]:
df_tf_train = df_tf[list(col_names_tf) + ['spam']].dropna()
df_tf_train = df_tf_train.astype(int)
df_tf_train.to_csv("data_tf/data_" + str(nb_words) + ".train", sep=',', index=False, header=False)

#### Dataframe with Boolean attributes

Boolean attributes indicating if a given message contains a given token

In [17]:
df_timbl = df.copy()
for token in vocab_best_ig:
    df_timbl[token] = df_timbl.subject_body.str.contains(token)
df_timbl.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_embedded_1,...,mean,valuable,invit,committee,tin,work,sci,cult,link,7th
0,"Fri, 04 Apr 2003 22:00:48 PST",,True,Q-tips ( &CHAR ) &NAME &NAME : A House Full ...,Q-tips ( &CHAR ) &NAME &NAME : A House Full ...,"text/html; charset=""us-ascii""",&NAME &NAME &NAME - &NAME &NAME &NAME April ...,&NAME &NAME &NAME - &NAME &NAME &NAME April ...,,,...,False,False,False,False,True,False,False,False,True,False
1,"Wed, 26 Mar 03 05:18:06 GMT",com,True,"Re : online drug store , valium , viagra , z...","Re : online drug store , valium , viagra , z...",multipart/alternative,,,,,...,False,False,False,False,False,False,False,False,False,False
2,"Fri, 2 Jun 2000 13:31:19 +0100 (BST)",ac.uk,False,Re : &NAME !,Re : &NAME !,TEXT/PLAIN; charset=US-ASCII,Watch for buses when crossing the road . In...,Watch for buses when crossing the road . In...,&NAME ! ! ! ! ! It went really well ( compa...,,...,False,False,False,False,False,False,True,False,False,False
3,"Wed, 30 Jan 2002 19:49:42 +0000",ac.uk,True,Re : Information,Re : Information,text/plain; charset=us-ascii,"PS - I like "" &NAME "" ' if it 's all the sam...","PS - I like "" &NAME "" ' if it 's all the sam...","Dear &NAME , &NAME , I forgot to mention to...",Oh ! ! ! Good thing I checked my email just...,...,False,False,False,False,False,False,False,False,False,False
4,"Wed, 9 Apr 2003 12:34:43 +0300",,True,STOP &NAME STARTING FROM TODAY ( &NAME : &NA...,STOP &NAME STARTING FROM TODAY ( &NAME : &NA...,text/html; charset=ISO-8859-1,,,,,...,False,False,False,False,True,False,False,False,False,False
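
By default `Series.str.contains` interprets its argument as a regular expression, which matters for the punctuation tokens: `.` matches any character and `(` is not a valid pattern on its own. The standard-library sketch below illustrates the difference with `re`. Escaping the token with `re.escape` (or passing `regex=False` to `str.contains`) gives a literal match; this is a suggestion, not what the cells above do.

```python
import re

text = "hello world"

print(bool(re.search(".", text)))              # True: '.' matches any character
print(bool(re.search(re.escape("."), text)))   # False: the text has no literal dot

try:
    re.compile("(")
    pattern_is_valid = True
except re.error as err:
    pattern_is_valid = False
    print("invalid pattern:", err)

print(bool(re.search(re.escape("("), "call me ( now )")))   # True
```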


We store the Boolean attributes of the best tokens in a file for later use

In [18]:
df_timbl_train = df_timbl[list(vocab_best_ig) + ['spam']].dropna()
df_timbl_train = df_timbl_train.astype(int)
df_timbl_train.to_csv("data_timbl/data_" + str(nb_words) + ".train", sep=',', index=False, header=False)
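
Each line of the saved file holds the attribute values of one message separated by commas, with the class label in the last position and no header row. A minimal sketch of that layout with the standard `csv` module, using made-up rows:

```python
import csv
import io

# made-up examples: three Boolean attributes followed by the spam label
rows = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue())
```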

## Adaptation set pre-processing

We load and extract data from the adaptation set

In [19]:
df_gen_adap = xml_to_df("GenSpam/adapt_GEN.ems")
df_gen_adap['spam'] = False
df_spam_adap = xml_to_df("GenSpam/adapt_SPAM.ems")
df_spam_adap['spam'] = True

In [20]:
df_adap = pd.concat([df_gen_adap, df_spam_adap])
df_adap = df_adap.sample(frac=1)  # shuffle the messages

In [21]:
df_adap = df_adap.reset_index().drop(['index'], axis=1)
df_adap.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_embedded_1,...,message_body_embedded_5,message_body_embedded_6,message_body_normal_4,message_body_normal_5,message_body_normal_6,message_body_normal_7,message_body_normal_8,message_body_embedded_7,spam,message_body
0,"Tue, 10 Sep 2002 13:24:26 -1700",net,edu,"\n\n^ &NAME &NAME Watch , ' &NAME ' &NAME \n","\n \n\n^ &NAME &NAME Watch , ' &NAME ' &NAME \n","text/html; charset=""iso-8859-1""",\n\n^ SPECIAL ALERT The &NAME &NAME &NAME : &...,\n \n\n^ SPECIAL ALERT The &NAME &NAME &NAME :...,,,...,,,,,,,,,True,
1,"Tue, 25 Mar 2003 02:19:55 -0800",com,ac.uk,\n\n^ ... Lock in on low rates - &NAME &NUM -...,\n \n\n^ ... Lock in on low rates - &NAME &NUM...,text/html;,\n\n^ &NAME If you wish to unsubscribe from &...,\n \n\n^ &NAME If you wish to unsubscribe from...,,,...,,,,,,,,,True,
2,"Wed, 4 Dec 2002 17:14:13 -0000",ac.uk,ac.uk,\n\n^ &NAME applications \n,\n \n\n^ &NAME applications \n,"text/plain; charset=""us-ascii""",\n\n^ The Lab 's Industrial Supporters ' &NAM...,\n \n\n^ The Lab 's Industrial Supporters ' &N...,,,...,,,,,,,,,False,
3,"Fri, 22 Nov 2002 09:44:23 -0800",com,ac.uk,\n\n^ COPY ANY &NAME TO A &NAME \n,\n \n\n^ COPY ANY &NAME TO A &NAME \n,"text/plain; charset=""iso-8859-1""",\n\n^ UNSUBSCRIBE AT THE BOTTOM \n^ Dear &NAM...,\n \n\n^ UNSUBSCRIBE AT THE BOTTOM \n^ Dear &N...,,,...,,,,,,,,,True,
4,"Mon, 3 Feb 2003 21:22:25 -0800",com,\n,\n\n^ NEW / / COPY ANY &NAME TO &NAME \n,\n \n\n^ NEW / / COPY ANY &NAME TO &NAME \n,"text/plain; charset=""iso-8859-1""",\n\n^ UNSUBSCRIBE AT THE BOTTOM \n^ Dear &NAM...,\n \n\n^ UNSUBSCRIBE AT THE BOTTOM \n^ Dear &N...,,,...,,,,,,,,,True,


We clean and format the data

In [22]:
filter_col = [col for col in df_adap if col.startswith('message_body') or col.startswith('subject_normal')]
for col in filter_col:
    df_adap[col] = df_adap[col].apply(lambda x: str(x).replace('\n','').replace('\r','').replace('^',''))

In [23]:
df_adap['subject_body'] = df_adap['subject_normal_0'] + df_adap['message_body_normal_0']
df_adap['subject_body'] = df_adap['subject_body'].str.lower()
df_adap['subject_body'] = df_adap['subject_body'].apply(lambda x: sentence_lemma(x))

We extract Term-Frequency attributes for the best tokens retained and save the corresponding dataframe

In [24]:
df_tf_adap = df_adap.copy()
for token in vocab_best_ig:
    df_tf_adap[token + '_tf'] = df_tf_adap.subject_body.str.count(token)
df_tf_adap[col_names_tf] = df_tf_adap[col_names_tf].fillna(0)
df_tf_adap.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_embedded_1,...,mean_tf,valuable_tf,invit_tf,committee_tf,tin_tf,work_tf,sci_tf,cult_tf,link_tf,7th_tf
0,"Tue, 10 Sep 2002 13:24:26 -1700",net,edu,"&NAME &NAME Watch , ' &NAME ' &NAME","&NAME &NAME Watch , ' &NAME ' &NAME","text/html; charset=""iso-8859-1""",SPECIAL ALERT The &NAME &NAME &NAME : &NAME ...,SPECIAL ALERT The &NAME &NAME &NAME : &NAME ...,,,...,0,0,0,0,1,0,0,0,0,0
1,"Tue, 25 Mar 2003 02:19:55 -0800",com,ac.uk,... Lock in on low rates - &NAME &NUM - more...,... Lock in on low rates - &NAME &NUM - more...,text/html;,&NAME If you wish to unsubscribe from &NAME ...,&NAME If you wish to unsubscribe from &NAME ...,,,...,0,0,0,0,0,0,0,0,0,0
2,"Wed, 4 Dec 2002 17:14:13 -0000",ac.uk,ac.uk,&NAME applications,&NAME applications,"text/plain; charset=""us-ascii""",The Lab 's Industrial Supporters ' &NAME &NA...,The Lab 's Industrial Supporters ' &NAME &NA...,,,...,0,0,0,2,0,0,0,0,0,0
3,"Fri, 22 Nov 2002 09:44:23 -0800",com,ac.uk,COPY ANY &NAME TO A &NAME,COPY ANY &NAME TO A &NAME,"text/plain; charset=""iso-8859-1""",UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,,,...,0,0,0,0,0,0,0,0,0,0
4,"Mon, 3 Feb 2003 21:22:25 -0800",com,\n,NEW / / COPY ANY &NAME TO &NAME,NEW / / COPY ANY &NAME TO &NAME,"text/plain; charset=""iso-8859-1""",UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,,,...,0,0,0,0,0,0,0,0,0,0


In [25]:
df_tf_adap = df_tf_adap[list(col_names_tf) + ['spam']].dropna()
df_tf_adap = df_tf_adap.astype(int)
df_tf_adap.to_csv("data_tf/data_" + str(nb_words) + ".adap", sep=',', index=False, header=False)

We extract boolean attributes for the best tokens retained and save the corresponding dataframe

In [26]:
df_timbl_adap = df_adap.copy()
for token in vocab_best_ig:
    df_timbl_adap[token] = df_timbl_adap.subject_body.str.contains(token)
df_timbl_adap.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_embedded_1,...,mean,valuable,invit,committee,tin,work,sci,cult,link,7th
0,"Tue, 10 Sep 2002 13:24:26 -1700",net,True,"&NAME &NAME Watch , ' &NAME ' &NAME","&NAME &NAME Watch , ' &NAME ' &NAME","text/html; charset=""iso-8859-1""",SPECIAL ALERT The &NAME &NAME &NAME : &NAME ...,SPECIAL ALERT The &NAME &NAME &NAME : &NAME ...,,,...,False,False,False,False,True,False,False,False,False,False
1,"Tue, 25 Mar 2003 02:19:55 -0800",com,True,... Lock in on low rates - &NAME &NUM - more...,... Lock in on low rates - &NAME &NUM - more...,text/html;,&NAME If you wish to unsubscribe from &NAME ...,&NAME If you wish to unsubscribe from &NAME ...,,,...,False,False,False,False,False,False,False,False,False,False
2,"Wed, 4 Dec 2002 17:14:13 -0000",ac.uk,True,&NAME applications,&NAME applications,"text/plain; charset=""us-ascii""",The Lab 's Industrial Supporters ' &NAME &NA...,The Lab 's Industrial Supporters ' &NAME &NA...,,,...,False,False,False,True,False,False,False,False,False,False
3,"Fri, 22 Nov 2002 09:44:23 -0800",com,True,COPY ANY &NAME TO A &NAME,COPY ANY &NAME TO A &NAME,"text/plain; charset=""iso-8859-1""",UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,,,...,False,False,False,False,False,False,False,False,False,False
4,"Mon, 3 Feb 2003 21:22:25 -0800",com,True,NEW / / COPY ANY &NAME TO &NAME,NEW / / COPY ANY &NAME TO &NAME,"text/plain; charset=""iso-8859-1""",UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,UNSUBSCRIBE AT THE BOTTOM Dear &NAME / Memb...,,,...,False,False,False,False,False,False,False,False,False,False


In [27]:
df_timbl_adap = df_timbl_adap[list(vocab_best_ig) + ['spam']].dropna()
df_timbl_adap = df_timbl_adap.astype(int)
df_timbl_adap.to_csv("data_timbl/data_" + str(nb_words) + ".adap", sep=',', index=False, header=False)
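
The feature extraction steps are the same for the train, adaptation and test sets, so they could be factored into one function. The sketch below shows one possible shape for it; `token_features` is a hypothetical helper that does not exist in this notebook, and the two example messages are made up. It keeps the same `str.count` and `str.contains` calls as the cells above.

```python
import pandas as pd

def token_features(texts, vocab, boolean=False):
    """Build one column per token from a Series of texts.

    boolean=False: number of matches of the token (columns named '<token>_tf').
    boolean=True: whether the token occurs at all (columns named '<token>').
    """
    columns = {}
    for token in vocab:
        if boolean:
            columns[token] = texts.str.contains(token)
        else:
            columns[token + '_tf'] = texts.str.count(token)
    return pd.DataFrame(columns, index=texts.index)

texts = pd.Series(["click here to unsubscribe", "see you at the lecture"])
tf_features = token_features(texts, ["click", "lecture"])
bool_features = token_features(texts, ["click", "lecture"], boolean=True)
print(tf_features)
print(bool_features)
```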

## Test set pre-processing

We load and extract data from the test set

In [28]:
df_gen_test = xml_to_df("GenSpam/test_GEN.ems")
df_gen_test['spam'] = False
df_spam_test = xml_to_df("GenSpam/test_SPAM.ems")
df_spam_test['spam'] = True

In [29]:
df_test = pd.concat([df_gen_test, df_spam_test])
df_test = df_test.sample(frac=1)  # shuffle the messages

In [30]:
df_test = df_test.reset_index().drop(['index'], axis=1)
df_test.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_normal_1,...,message_body_embedded_4,message_body_embedded_5,message_body_embedded_6,message_body_embedded_7,message_body_normal_2,message_body_normal_3,message_body_normal_4,message_body_normal_5,spam,message_body
0,"Tue, 10 Jun 2003 21:00:26 +0100 (BST)",ac.uk,ac.uk,\n\n^ Re : results \n,\n \n\n^ Re : results \n,TEXT/PLAIN; charset=US-ASCII,\n\n^ Is n't it always the case ... ! \n^ Che...,\n \n\n^ Is n't it always the case ... ! \n^ C...,\n\n^ too good to be true ... see you next we...,,...,,,,,,,,,False,
1,24 Jun 2003 20:20:40 -0000,email,ac.uk,"\n\n^ Easy Mortgage Shopping , Apply Free , A...","\n \n\n^ Easy Mortgage Shopping , Apply Free ,...","text/html; charset=""us-ascii""",\n\n^ The following message was sent to you a...,\n \n\n^ The following message was sent to you...,,,...,,,,,,,,,True,
2,"Tue, 18 Feb 2003 02:53:45 -0500",com,ac.uk,\n\n^ hranostajovu mutarent ted.briscoe Come ...,\n \n\n^ hranostajovu mutarent ted.briscoe Com...,"text/html; charset=""iso-8859-1""","\n\n^ ted.briscoe \n^ On January 1st &NUM , t...","\n \n\n^ ted.briscoe \n^ On January 1st &NUM ,...",,,...,,,,,,,,,True,
3,"Thu, 27 Feb 2003 19:53:11 +0100",com,ac.uk ac.uk,\n\n^ Opinion Polls by &NAME : Your opinion i...,\n \n\n^ Opinion Polls by &NAME : Your opinion...,text/plain; charset=iso-8859-1,"\n\n^ Dear Sir or Madam , \n^ Your opinion is...","\n \n\n^ Dear Sir or Madam , \n^ Your opinion ...",,,...,,,,,,,,,True,
4,"Sun, 25 May 2003 15:46:27 +0100 (BST)",ac.uk,ac.uk,\n\n^ Re : background music \n,\n \n\n^ Re : background music \n,TEXT/PLAIN; charset=US-ASCII,"\n\n^ &NAME , did you get returned all the em...","\n \n\n^ &NAME , did you get returned all the ...",,,...,,,,,,,,,False,


We clean and format the data

In [31]:
filter_col = [col for col in df_test if col.startswith('message_body') or col.startswith('subject_normal')]
for col in filter_col:
    df_test[col] = df_test[col].apply(lambda x: str(x).replace('\n','').replace('\r','').replace('^',''))

In [32]:
df_test['subject_body'] = df_test['subject_normal_0'] + df_test['message_body_normal_0']
df_test['subject_body'] = df_test['subject_body'].str.lower()
df_test['subject_body'] = df_test['subject_body'].apply(lambda x: sentence_lemma(x))

We extract Term-Frequency attributes for the best tokens retained and save the corresponding dataframe

In [33]:
df_tf_test = df_test.copy()
for token in vocab_best_ig:
    df_tf_test[token + '_tf'] = df_tf_test.subject_body.str.count(token)
df_tf_test[col_names_tf] = df_tf_test[col_names_tf].fillna(0)
df_tf_test.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_normal_1,...,mean_tf,valuable_tf,invit_tf,committee_tf,tin_tf,work_tf,sci_tf,cult_tf,link_tf,7th_tf
0,"Tue, 10 Jun 2003 21:00:26 +0100 (BST)",ac.uk,ac.uk,Re : results,Re : results,TEXT/PLAIN; charset=US-ASCII,"Is n't it always the case ... ! Cheers , &...","Is n't it always the case ... ! Cheers , &...","too good to be true ... see you next week , ...",,...,0,0,0,0,0,0,0,0,0,0
1,24 Jun 2003 20:20:40 -0000,email,ac.uk,"Easy Mortgage Shopping , Apply Free , Any Cr...","Easy Mortgage Shopping , Apply Free , Any Cr...","text/html; charset=""us-ascii""",The following message was sent to you as an ...,The following message was sent to you as an ...,,,...,0,1,0,0,1,0,0,0,0,0
2,"Tue, 18 Feb 2003 02:53:45 -0500",com,ac.uk,hranostajovu mutarent ted.briscoe Come get it,hranostajovu mutarent ted.briscoe Come get it,"text/html; charset=""iso-8859-1""","ted.briscoe On January 1st &NUM , the Europ...","ted.briscoe On January 1st &NUM , the Europ...",,,...,1,0,0,0,0,0,0,0,0,0
3,"Thu, 27 Feb 2003 19:53:11 +0100",com,ac.uk ac.uk,Opinion Polls by &NAME : Your opinion is in ...,Opinion Polls by &NAME : Your opinion is in ...,text/plain; charset=iso-8859-1,"Dear Sir or Madam , Your opinion is in dema...","Dear Sir or Madam , Your opinion is in dema...",,,...,0,1,1,0,1,0,0,0,1,0
4,"Sun, 25 May 2003 15:46:27 +0100 (BST)",ac.uk,ac.uk,Re : background music,Re : background music,TEXT/PLAIN; charset=US-ASCII,"&NAME , did you get returned all the emails ...","&NAME , did you get returned all the emails ...",,,...,0,0,0,0,0,0,0,0,0,0


In [34]:
df_tf_test = df_tf_test[list(col_names_tf) + ['spam']].dropna()
df_tf_test = df_tf_test.astype(int)
df_tf_test.to_csv("data_tf/data_" + str(nb_words) + ".test", sep=',', index=False, header=False)

We extract boolean attributes for the best tokens retained and save the corresponding dataframe

In [35]:
df_timbl_test = df_test.copy()
for token in vocab_best_ig:
    df_timbl_test[token] = df_timbl_test.subject_body.str.contains(token)
df_timbl_test.head()

Unnamed: 0,date,from,to,subject_normal_0,subject_normal,content-type,message_body_normal_0,message_body_normal,message_body_embedded_0,message_body_normal_1,...,mean,valuable,invit,committee,tin,work,sci,cult,link,7th
0,"Tue, 10 Jun 2003 21:00:26 +0100 (BST)",ac.uk,False,Re : results,Re : results,TEXT/PLAIN; charset=US-ASCII,"Is n't it always the case ... ! Cheers , &...","Is n't it always the case ... ! Cheers , &...","too good to be true ... see you next week , ...",,...,False,False,False,False,False,False,False,False,False,False
1,24 Jun 2003 20:20:40 -0000,email,True,"Easy Mortgage Shopping , Apply Free , Any Cr...","Easy Mortgage Shopping , Apply Free , Any Cr...","text/html; charset=""us-ascii""",The following message was sent to you as an ...,The following message was sent to you as an ...,,,...,False,True,False,False,True,False,False,False,False,False
2,"Tue, 18 Feb 2003 02:53:45 -0500",com,True,hranostajovu mutarent ted.briscoe Come get it,hranostajovu mutarent ted.briscoe Come get it,"text/html; charset=""iso-8859-1""","ted.briscoe On January 1st &NUM , the Europ...","ted.briscoe On January 1st &NUM , the Europ...",,,...,True,False,False,False,False,False,False,False,False,False
3,"Thu, 27 Feb 2003 19:53:11 +0100",com,True,Opinion Polls by &NAME : Your opinion is in ...,Opinion Polls by &NAME : Your opinion is in ...,text/plain; charset=iso-8859-1,"Dear Sir or Madam , Your opinion is in dema...","Dear Sir or Madam , Your opinion is in dema...",,,...,False,True,True,False,True,False,False,False,True,False
4,"Sun, 25 May 2003 15:46:27 +0100 (BST)",ac.uk,True,Re : background music,Re : background music,TEXT/PLAIN; charset=US-ASCII,"&NAME , did you get returned all the emails ...","&NAME , did you get returned all the emails ...",,,...,False,False,False,False,False,False,False,False,False,False

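The Boolean extraction above relies on `Series.str.contains`, which by default treats each token as a regular expression and returns `NaN` for missing text. The sketch below uses a made-up three-row frame (the texts and tokens are illustrative, not taken from the corpus) to show literal substring matching with `regex=False` and `na=False`. Whether literal matching is the intended behaviour here is an assumption.

```python
import pandas as pd

# Toy data: hypothetical texts and tokens, for illustration only
df_toy = pd.DataFrame(
    {"subject_body": ["win $100 free today", None, "meeting on the 7th"]}
)
tokens_toy = ["free", "$100", "7th"]

for token in tokens_toy:
    # regex=False: match the token literally, so metacharacters such as "$"
    # are not interpreted as pattern syntax
    # na=False: a missing text counts as "token absent" instead of NaN
    df_toy[token] = df_toy["subject_body"].str.contains(token, regex=False, na=False)

# Same 0/1 encoding as the astype(int) step used in the notebook
bool_attrs = df_toy[tokens_toy].astype(int)
print(bool_attrs)
```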

In [36]:
df_timbl_test = df_timbl_test[list(vocab_best_ig)[:nb_words] + ['spam']]
df_timbl_test = df_timbl_test.astype(int)
df_timbl_test.to_csv("data_timbl/data_" + str(nb_words) + ".test", sep=',', index=False, header=False)

## Naive Bayes models

Column names correspond to the tokens with the highest information gain

In [37]:
col_names = list(vocab_best_ig)

We load all dataframes with Term-Frequency attributes previously created

In [39]:
# the files were saved with header=False, so header=None keeps the first row as data
df_tf_train = pd.read_csv('data_tf/data_700.train', header=None)
df_tf_train.columns = col_names + ['spam']
df_tf_adap = pd.read_csv('data_tf/data_700.adap', header=None)
df_tf_adap.columns = col_names + ['spam']
df_tf_test = pd.read_csv('data_tf/data_700.test', header=None)
df_tf_test.columns = col_names + ['spam']

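The train, adaptation and test files were written with `header=False`, so how they are read back matters: by default `pd.read_csv` uses the first line of the file as column names. The sketch below uses an in-memory toy file and hypothetical column names (`tok_a`, `tok_b`) to compare the default behaviour with `header=None`.

```python
import io
import pandas as pd

# Toy stand-in for a file written with header=False:
# 3 rows, each with 2 attributes followed by the class label
csv_text = "1,0,1\n0,2,0\n3,1,1\n"
names_toy = ["tok_a", "tok_b", "spam"]  # hypothetical column names

# Default: the first data row is consumed as the header and lost
df_default = pd.read_csv(io.StringIO(csv_text))
df_default.columns = names_toy

# header=None keeps every row; names= assigns the columns in one step
df_full = pd.read_csv(io.StringIO(csv_text), header=None, names=names_toy)

print(len(df_default), len(df_full))
```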
We create a new dataframe with normalized Term-Frequency attributes

In [40]:
df_tf_train_val = df_tf_train.values
min_max_scaler = preprocessing.MinMaxScaler()
df_tf_train_norm = min_max_scaler.fit_transform(df_tf_train_val)
df_tf_train_norm = pd.DataFrame(df_tf_train_norm)
df_tf_train_norm.columns = col_names + ['spam']

df_tf_adap_norm = pd.DataFrame(min_max_scaler.transform(df_tf_adap.values))
df_tf_adap_norm.columns = col_names + ['spam']

df_tf_test_norm = pd.DataFrame(min_max_scaler.transform(df_tf_test.values))
df_tf_test_norm.columns = col_names + ['spam']

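Min-max scaling maps each column to $(x - \min) / (\max - \min)$, with the minimum and maximum learned on the train set only and then reused for the adaptation and test sets. As a consequence, unseen values can fall outside $[0, 1]$. The cell above also passes the `spam` column through the scaler, which leaves it at 0/1 as long as both classes occur in the train set. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy term-frequency matrices (hypothetical values)
X_train_toy = np.array([[0.0, 2.0], [4.0, 6.0], [2.0, 10.0]])
X_new_toy = np.array([[8.0, 1.0]])

scaler = MinMaxScaler()
# min and max are learned on the train set only
X_train_scaled = scaler.fit_transform(X_train_toy)
# the train min and max are reused, so results may leave [0, 1]
X_new_scaled = scaler.transform(X_new_toy)

print(X_train_scaled)
print(X_new_scaled)
```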
We load all dataframes with Boolean attributes previously created

In [42]:
# the files were saved with header=False, so header=None keeps the first row as data
df_timbl_train = pd.read_csv('data_timbl/data_700.train', header=None)
df_timbl_train.columns = col_names + ['spam']
df_timbl_adap = pd.read_csv('data_timbl/data_700.adap', header=None)
df_timbl_adap.columns = col_names + ['spam']
df_timbl_test = pd.read_csv('data_timbl/data_700.test', header=None)
df_timbl_test.columns = col_names + ['spam']

Loop to apply different Naive Bayes models to our data, and compute different metrics (precision, recall, F1-score, and AUC) with the train and **adaptation** sets

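The model names in the loop below map to three scikit-learn estimators that differ in their likelihood model: `MultinomialNB` works on token counts, `BernoulliNB` on binary presence/absence (an absent token also contributes to the score), and `GaussianNB` fits a normal distribution per attribute and class. The sketch below fits all three on a tiny made-up data set; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy counts for two hypothetical tokens; label 1 = spam, 0 = legitimate
X_counts = np.array([[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 4]])
y_toy = np.array([1, 1, 1, 0, 0, 0])
# Boolean view of the same data: 1 if the token occurs at least once
X_bool = (X_counts > 0).astype(int)

# New message in which only the first token occurs
x_new_counts = np.array([[5, 0]])
x_new_bool = (x_new_counts > 0).astype(int)

pred_multinomial = MultinomialNB().fit(X_counts, y_toy).predict(x_new_counts)
pred_bernoulli = BernoulliNB().fit(X_bool, y_toy).predict(x_new_bool)
pred_gaussian = GaussianNB().fit(X_counts, y_toy).predict(x_new_counts)

print(pred_multinomial, pred_bernoulli, pred_gaussian)
```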
In [43]:
model_names = ["Multinomial NB TF", "Multinomial NB Boolean", "Multivariate Bernoulli NB", "Gaussian NB Boolean", "Gaussian NB Normalized"]

for model_name in model_names:
    # we consider different numbers of attributes
    for nb_words in range(50,750,50):
        # model and data selection
        if model_name == "Multinomial NB TF":
            model = MultinomialNB()
            df_model_train = df_tf_train[col_names[:nb_words] + ['spam']]
            df_model_adap = df_tf_adap[col_names[:nb_words] + ['spam']]
        elif model_name == "Multinomial NB Boolean":
            model = MultinomialNB()
            df_model_train = df_timbl_train[col_names[:nb_words] + ['spam']]
            df_model_adap = df_timbl_adap[col_names[:nb_words] + ['spam']]
        elif model_name == "Multivariate Bernoulli NB":
            model = BernoulliNB()
            df_model_train = df_timbl_train[col_names[:nb_words] + ['spam']]
            df_model_adap = df_timbl_adap[col_names[:nb_words] + ['spam']]
        elif model_name == "Gaussian NB Boolean":
            model = GaussianNB()
            df_model_train = df_timbl_train[col_names[:nb_words] + ['spam']]
            df_model_adap = df_timbl_adap[col_names[:nb_words] + ['spam']]
        elif model_name == "Gaussian NB Normalized":
            model = GaussianNB()
            df_model_train = df_tf_train_norm[col_names[:nb_words] + ['spam']]
            df_model_adap = df_tf_adap_norm[col_names[:nb_words] + ['spam']]

        # we create the train and adaptation sets for the models
        X_train = df_model_train.drop('spam', axis=1)
        Y_train = df_model_train['spam']
        X_adap = df_model_adap.drop('spam', axis=1)
        Y_adap = df_model_adap['spam']

        # we fit models and make predictions
        model.fit(X_train, Y_train)
        Y_pred = model.predict(X_adap)

        # we compute different metrics about predictions
        precision = metrics.precision_score(Y_adap, Y_pred)
        recall = metrics.recall_score(Y_adap, Y_pred)
        f1_score = metrics.f1_score(Y_adap, Y_pred)
        fpr, tpr, thresholds = metrics.roc_curve(Y_adap, Y_pred)
        auc = metrics.auc(fpr, tpr)

        # we store results in a dataframe
        results = pd.DataFrame([[model_name, nb_words, precision, recall, f1_score, auc]], columns=['model_name', 'nb_words', 'precision', 'recall', 'f1_score', 'auc'])
        df_save = pd.read_csv("nb_results.csv")
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        df_save = pd.concat([df_save, results])
        df_save.to_csv("nb_results.csv", index=False)

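The loop above passes hard 0/1 predictions to `metrics.roc_curve`, so the curve has a single operating point and the resulting AUC equals the balanced accuracy, (TPR + TNR) / 2. A threshold-independent AUC can instead be computed from class scores, for instance `model.predict_proba(X)[:, 1]`, which the three Naive Bayes estimators provide. The sketch below contrasts the two on made-up labels and scores; it is an illustration, not the evaluation protocol used in this notebook.

```python
import numpy as np
from sklearn import metrics

# Toy labels and spam scores; the scores stand in for
# model.predict_proba(X)[:, 1] (values are made up)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90])
y_hard = (scores >= 0.5).astype(int)  # what model.predict would return

# AUC from hard labels: one operating point on the ROC curve
fpr, tpr, _ = metrics.roc_curve(y_true, y_hard)
auc_labels = metrics.auc(fpr, tpr)

# AUC from scores: uses the full ranking of the messages
auc_scores = metrics.roc_auc_score(y_true, scores)

print(auc_labels, auc_scores)
```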
Loop to apply different Naive Bayes models to our data, and compute different metrics (precision, recall, F1-score, and AUC) with the train and **test** sets

In [44]:
model_names = ["Multinomial NB TF", "Multinomial NB Boolean", "Multivariate Bernoulli NB", "Gaussian NB Boolean", "Gaussian NB Normalized"]

for model_name in model_names:
    # we consider different numbers of attributes
    for nb_words in range(50,750,50):
        # model and data selection
        if model_name == "Multinomial NB TF":
            model = MultinomialNB()
            df_model_train = df_tf_train[col_names[:nb_words] + ['spam']]
            df_model_test = df_tf_test[col_names[:nb_words] + ['spam']]
        elif model_name == "Multinomial NB Boolean":
            model = MultinomialNB()
            df_model_train = df_timbl_train[col_names[:nb_words] + ['spam']]
            df_model_test = df_timbl_test[col_names[:nb_words] + ['spam']]
        elif model_name == "Multivariate Bernoulli NB":
            model = BernoulliNB()
            df_model_train = df_timbl_train[col_names[:nb_words] + ['spam']]
            df_model_test = df_timbl_test[col_names[:nb_words] + ['spam']]
        elif model_name == "Gaussian NB Boolean":
            model = GaussianNB()
            df_model_train = df_timbl_train[col_names[:nb_words] + ['spam']]
            df_model_test = df_timbl_test[col_names[:nb_words] + ['spam']]
        elif model_name == "Gaussian NB Normalized":
            model = GaussianNB()
            df_model_train = df_tf_train_norm[col_names[:nb_words] + ['spam']]
            df_model_test = df_tf_test_norm[col_names[:nb_words] + ['spam']]

        # we create the train and test sets for the models
        X_train = df_model_train.drop('spam', axis=1)
        Y_train = df_model_train['spam']
        X_test = df_model_test.drop('spam', axis=1)
        Y_test = df_model_test['spam']

        # we fit models and make predictions
        model.fit(X_train, Y_train)
        Y_pred = model.predict(X_test)

        # we compute different metrics about predictions
        precision = metrics.precision_score(Y_test, Y_pred)
        recall = metrics.recall_score(Y_test, Y_pred)
        f1_score = metrics.f1_score(Y_test, Y_pred)
        fpr, tpr, thresholds = metrics.roc_curve(Y_test, Y_pred)
        auc = metrics.auc(fpr, tpr)

        # we store results in a dataframe
        results = pd.DataFrame([[model_name + " Test", nb_words, precision, recall, f1_score, auc]], columns=['model_name', 'nb_words', 'precision', 'recall', 'f1_score', 'auc'])
        df_save = pd.read_csv("nb_results.csv")
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        df_save = pd.concat([df_save, results])
        df_save.to_csv("nb_results.csv", index=False)