## Logistic Regression: Email Spam Detection

#### This exercise presents the fundamentals of Logistic Regression through one of the first problems solved with Machine Learning techniques: SPAM detection.

## Exercise statement

#### The goal is to build a machine learning system capable of predicting whether a given email is SPAM or not. The following data set will be used:
<b><a href='https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07'> 2007 TREC Public Spam Corpus </a></b>

#### The corpus trec07p contains 75,419 messages:
 - ham: 25220
 - spam: 50199

#### These messages constitute all the messages delivered to a particular server between these dates:
 - 8th April, 2007
 - 6th July, 2007


## 1. Complementary functions

#### The data set for this SPAM detection case consists of raw emails, including their headers and additional fields. They therefore need to be preprocessed before being fed to the Machine Learning algorithm.

In [2]:
# This class facilitates the preprocessing of emails that have HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise its own internal state
        super().__init__(convert_charrefs=True)
        # Text fragments found between the HTML tags
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

# This function is responsible for removing the HTML tags found in the text of the email
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

# Example of removing HTML tags from text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack World News</a></td>'
strip_tags(t)

'Phrack World News'

#### Besides removing HTML tags, other preprocessing steps are needed to keep unnecessary noise out of the messages: removing punctuation marks, discarding email fields that are not relevant, and reducing each word to its root by stripping its affixes (stemming). The class shown below performs these transformations.

In [3]:
import email
import string
import nltk

# The stopword list needs a one-time download: nltk.download('stopwords')

class Parser:

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                   msg.get_content_type())
        content_type = msg.get_content_type()
        # Returning the content of the email
        return {"subject": subject,
                "body": body,
                "content_type": content_type}

    def get_email_body(self, payload, content_type):
        """Extract the body of the email."""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body

    def tokenize(self, text):
        """Transform a text string into tokens. Performs two main actions:
        removes the punctuation symbols and stems the text."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        # Stemming of the tokens
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]
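
#### The recursion in `get_email_body` exists because a multipart message returns a list of sub-messages as its payload, while a simple message returns a string. The sketch below illustrates this with the standard `email` module only; the message text is a made-up example, not taken from the corpus, and `msg.walk()` is used here as an alternative way of visiting the parts.

```python
import email

# Hypothetical two-part message (not from the trec07p corpus)
raw = (
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--XYZ--\n"
)

msg = email.message_from_string(raw)

# For a multipart message the payload is a list of Message objects
print(type(msg.get_payload()))

# walk() visits the container and every nested part; keep only the leaves
parts = [(part.get_content_type(), part.get_payload().strip())
         for part in msg.walk() if not part.is_multipart()]
print(parts)
```

#### Each leaf part carries its own content type, which is what lets the parser decide whether HTML tags have to be stripped before tokenizing.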

### Reading an email in raw format

In [5]:
inmail = open("Dataset/trec07p/data/inmail.2").read()
print(inmail)


From bounce-debian-mirrors=ktwarwic=speedy.uwaterloo.ca@lists.debian.org  Sun Apr  8 13:09:29 2007
Return-Path: <bounce-debian-mirrors=ktwarwic=speedy.uwaterloo.ca@lists.debian.org>
Received: from murphy.debian.org (murphy.debian.org [70.103.162.31])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with ESMTP id l38H9S0I003031
	for <ktwarwic@speedy.uwaterloo.ca>; Sun, 8 Apr 2007 13:09:28 -0400
Received: from localhost (localhost [127.0.0.1])
	by murphy.debian.org (Postfix) with QMQP
	id 90C152E68E; Sun,  8 Apr 2007 12:09:05 -0500 (CDT)
Old-Return-Path: <yan.morin@savoirfairelinux.com>
X-Spam-Checker-Version: SpamAssassin 3.1.4 (2006-07-26) on murphy.debian.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.1 required=4.0 tests=BAYES_05 autolearn=no 
	version=3.1.4
X-Original-To: debian-mirrors@lists.debian.org
Received: from xenon.savoirfairelinux.net (savoirfairelinux.net [199.243.85.90])
	by murphy.debian.org (Postfix) with ESMTP id 827432E3E5
	for <debian-mirrors@lists.debian.org>; Sun,  8 Apr 2

### Email Parsing

In [6]:
p = Parser()
p.parse('Dataset/trec07p/data/inmail.2')

{'subject': ['typo', 'debianreadm'],
 'body': ['Hi',
  'ive',
  'updat',
  'gulu',
  'I',
  'check',
  'mirror',
  'It',
  'seem',
  'littl',
  'typo',
  'debianreadm',
  'file',
  'exampl',
  'httpgulususherbrookecadebianreadm',
  'ftpftpfrdebianorgdebianreadm',
  'test',
  'lenni',
  'access',
  'releas',
  'diststest',
  'the',
  'current',
  'test',
  'develop',
  'snapshot',
  'name',
  'etch',
  'packag',
  'test',
  'unstabl',
  'pass',
  'autom',
  'test',
  'propog',
  'releas',
  'etch',
  'replac',
  'lenni',
  'like',
  'readmehtml',
  'yan',
  'morin',
  'consult',
  'en',
  'logiciel',
  'libr',
  'yanmorinsavoirfairelinuxcom',
  '5149941556',
  'To',
  'unsubscrib',
  'email',
  'debianmirrorsrequestlistsdebianorg',
  'subject',
  'unsubscrib',
  'troubl',
  'contact',
  'listmasterlistsdebianorg'],
 'content_type': 'text/plain'}

### Index reading

#### These helper functions load into memory the path of each email and its corresponding label {spam, ham}.

In [8]:
index = open("Dataset/trec07p/full/index").readlines()
index 

../data/inmail.329\n',
 'spam ../data/inmail.330\n',
 'spam ../data/inmail.331\n',
 'ham ../data/inmail.332\n',
 'spam ../data/inmail.333\n',
 'spam ../data/inmail.334\n',
 'spam ../data/inmail.335\n',
 'spam ../data/inmail.336\n',
 'spam ../data/inmail.337\n',
 'spam ../data/inmail.338\n',
 'ham ../data/inmail.339\n',
 'spam ../data/inmail.340\n',
 'ham ../data/inmail.341\n',
 'spam ../data/inmail.342\n',
 'spam ../data/inmail.343\n',
 'spam ../data/inmail.344\n',
 'spam ../data/inmail.345\n',
 'ham ../data/inmail.346\n',
 'spam ../data/inmail.347\n',
 'spam ../data/inmail.348\n',
 'spam ../data/inmail.349\n',
 'spam ../data/inmail.350\n',
 'ham ../data/inmail.351\n',
 'ham ../data/inmail.352\n',
 'spam ../data/inmail.353\n',
 'spam ../data/inmail.354\n',
 'spam ../data/inmail.355\n',
 'ham ../data/inmail.356\n',
 'spam ../data/inmail.357\n',
 'spam ../data/inmail.358\n',
 'spam ../data/inmail.359\n',
 'spam ../data/inmail.360\n',
 'ham ../data/inmail.361\n',
 'spam ../data/inmail.362

In [12]:

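import os

DATASET_PATH = "Dataset/trec07p"

def parse_index(path_to_index, n_elements):
    """Return the label and path of the first n_elements emails of the index."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        # Each line looks like "spam ../data/inmail.1\n"
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1][:-1]  # drop the trailing newline
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes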

def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

indexes = parse_index("Dataset/trec07p/full/index", 10)
indexes

[{'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.1'},
 {'label': 'ham', 'email_path': 'Dataset/trec07p/data/inmail.2'},
 {'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.3'},
 {'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.4'},
 {'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.5'},
 {'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.6'},
 {'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.7'},
 {'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.8'},
 {'label': 'spam', 'email_path': 'Dataset/trec07p/data/inmail.9'},
 {'label': 'ham', 'email_path': 'Dataset/trec07p/data/inmail.10'}]
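
#### The index format can also be explored without the data set on disk. The following is a minimal sketch that assumes lines of the form shown above; `sample_index` and `split_index_line` are hypothetical names introduced only for this illustration.

```python
from collections import Counter

# Hypothetical sample mimicking the format of the index file
sample_index = [
    "spam ../data/inmail.1\n",
    "ham ../data/inmail.2\n",
    "spam ../data/inmail.3",   # a last line may lack the trailing newline
]

def split_index_line(line):
    """Split one index line into (label, path relative to the dataset root)."""
    label, rel_path = line.strip().split(" ../", 1)
    return label, rel_path

pairs = [split_index_line(line) for line in sample_index]
print(pairs)

# Class balance of the sample
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

#### Using `strip()` instead of slicing off the last character keeps the path intact when a line does not end with a newline.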

## 2. Preprocessing the data in the dataset 

#### With the functions presented above, emails can be read programmatically and processed to remove the components that are not useful for detecting SPAM. However, each email is still represented by a Python dictionary containing a series of words.


In [15]:
# We load the index and the labels in memory
index = parse_index("Dataset/trec07p/full/index", 2)
# We read the first email
import os
open(index[0]["email_path"]).read()

'From bounce-debian-mirrors=ktwarwic=speedy.uwaterloo.ca@lists.debian.org  Sun Apr  8 13:09:29 2007\nReturn-Path: <bounce-debian-mirrors=ktwarwic=speedy.uwaterloo.ca@lists.debian.org>\nReceived: from murphy.debian.org (murphy.debian.org [70.103.162.31])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with ESMTP id l38H9S0I003031\n\tfor <ktwarwic@speedy.uwaterloo.ca>; Sun, 8 Apr 2007 13:09:28 -0400\nReceived: from localhost (localhost [127.0.0.1])\n\tby murphy.debian.org (Postfix) with QMQP\n\tid 90C152E68E; Sun,  8 Apr 2007 12:09:05 -0500 (CDT)\nOld-Return-Path: <yan.morin@savoirfairelinux.com>\nX-Spam-Checker-Version: SpamAssassin 3.1.4 (2006-07-26) on murphy.debian.org\nX-Spam-Level: \nX-Spam-Status: No, score=-1.1 required=4.0 tests=BAYES_05 autolearn=no \n\tversion=3.1.4\nX-Original-To: debian-mirrors@lists.debian.org\nReceived: from xenon.savoirfairelinux.net (savoirfairelinux.net [199.243.85.90])\n\tby murphy.debian.org (Postfix) with ESMTP id 827432E3E5\n\tfor <debian-mirrors@lists.de

In [17]:
# We parse the first mail
mail, label = parse_email(index[0])
print("The mail is:", label)
print(mail)

The mail is: ham
{'subject': ['typo', 'debianreadm'], 'body': ['Hi', 'ive', 'updat', 'gulu', 'I', 'check', 'mirror', 'It', 'seem', 'littl', 'typo', 'debianreadm', 'file', 'exampl', 'httpgulususherbrookecadebianreadm', 'ftpftpfrdebianorgdebianreadm', 'test', 'lenni', 'access', 'releas', 'diststest', 'the', 'current', 'test', 'develop', 'snapshot', 'name', 'etch', 'packag', 'test', 'unstabl', 'pass', 'autom', 'test', 'propog', 'releas', 'etch', 'replac', 'lenni', 'like', 'readmehtml', 'yan', 'morin', 'consult', 'en', 'logiciel', 'libr', 'yanmorinsavoirfairelinuxcom', '5149941556', 'To', 'unsubscrib', 'email', 'debianmirrorsrequestlistsdebianorg', 'subject', 'unsubscrib', 'troubl', 'contact', 'listmasterlistsdebianorg'], 'content_type': 'text/plain'}


#### The Logistic Regression algorithm is not able to ingest text as part of the data set. Therefore, a number of additional functions must be applied that transform the text of parsed emails into a numeric representation.

### Applying CountVectorizer


In [18]:
from sklearn.feature_extraction.text import CountVectorizer

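# Prepare the email as a single text string
# Note: subject and body are joined without a separator, so the last subject
# token fuses with the first body token ('debianreadmHi' in the output below)
prep_email = [" ".join(mail['subject']) + " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)

print("Email:", prep_email, "\n")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Input characteristics:", vectorizer.get_feature_names_out().tolist())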


Email: ['typo debianreadmHi ive updat gulu I check mirror It seem littl typo debianreadm file exampl httpgulususherbrookecadebianreadm ftpftpfrdebianorgdebianreadm test lenni access releas diststest the current test develop snapshot name etch packag test unstabl pass autom test propog releas etch replac lenni like readmehtml yan morin consult en logiciel libr yanmorinsavoirfairelinuxcom 5149941556 To unsubscrib email debianmirrorsrequestlistsdebianorg subject unsubscrib troubl contact listmasterlistsdebianorg'] 

Input characteristics: ['5149941556', 'access', 'autom', 'check', 'consult', 'contact', 'current', 'debianmirrorsrequestlistsdebianorg', 'debianreadm', 'debianreadmhi', 'develop', 'diststest', 'email', 'en', 'etch', 'exampl', 'file', 'ftpftpfrdebianorgdebianreadm', 'gulu', 'httpgulususherbrookecadebianreadm', 'it', 'ive', 'lenni', 'libr', 'like', 'listmasterlistsdebianorg', 'littl', 'logiciel', 'mirror', 'morin', 'name', 'packag', 'pass', 'propog', 'readmehtml', 'releas', 'rep

In [19]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2
  1 1 1 1 4 1 1 1 2 1 2 1 1 1]]
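
#### The count vector above can be reproduced by hand. The sketch below mimics the default behaviour of `CountVectorizer` (lowercasing and keeping tokens of two or more word characters) with the standard library only; `count_vector` is a hypothetical helper, and the input is a shortened made-up string rather than the full email.

```python
import re
from collections import Counter

def count_vector(text):
    """Return (vocabulary, counts) for a single document."""
    # Lowercase, then keep runs of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    vocabulary = sorted(counts)
    return vocabulary, [counts[word] for word in vocabulary]

vocabulary, values = count_vector("typo debianreadm test lenni test I etch test")
print(vocabulary)
print(values)
```

#### The single-character token `I` is dropped, and each position of the vector holds the number of occurrences of the word at the same position in the sorted vocabulary.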


### Applying OneHotEncoding

In [20]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['subject'] + mail['body']]

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(prep_email)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Features:\n", enc.get_feature_names_out())
print("\nValues:\n", X.toarray())

Features:
 ['x0_5149941556' 'x0_Hi' 'x0_I' 'x0_It' 'x0_To' 'x0_access' 'x0_autom'
 'x0_check' 'x0_consult' 'x0_contact' 'x0_current'
 'x0_debianmirrorsrequestlistsdebianorg' 'x0_debianreadm' 'x0_develop'
 'x0_diststest' 'x0_email' 'x0_en' 'x0_etch' 'x0_exampl' 'x0_file'
 'x0_ftpftpfrdebianorgdebianreadm' 'x0_gulu'
 'x0_httpgulususherbrookecadebianreadm' 'x0_ive' 'x0_lenni' 'x0_libr'
 'x0_like' 'x0_listmasterlistsdebianorg' 'x0_littl' 'x0_logiciel'
 'x0_mirror' 'x0_morin' 'x0_name' 'x0_packag' 'x0_pass' 'x0_propog'
 'x0_readmehtml' 'x0_releas' 'x0_replac' 'x0_seem' 'x0_snapshot'
 'x0_subject' 'x0_test' 'x0_the' 'x0_troubl' 'x0_typo' 'x0_unstabl'
 'x0_unsubscrib' 'x0_updat' 'x0_yan' 'x0_yanmorinsavoirfairelinuxcom']

Values:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
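
#### Unlike the count vector, the one-hot encoding produces one row per word rather than one row per email. A minimal sketch of the relationship between both representations, using plain Python and a made-up token list (`one_hot` is a hypothetical helper):

```python
def one_hot(tokens):
    """Return (categories, matrix) with one row per token."""
    categories = sorted(set(tokens))
    column = {cat: i for i, cat in enumerate(categories)}
    matrix = []
    for tok in tokens:
        row = [0] * len(categories)
        row[column[tok]] = 1      # exactly one active column per row
        matrix.append(row)
    return categories, matrix

categories, matrix = one_hot(["test", "etch", "test", "lenni"])
print(categories)
print(matrix)

# Summing the rows column by column recovers the word counts of the document
column_sums = [sum(col) for col in zip(*matrix)]
print(column_sums)
```

#### Because a classifier needs one fixed-length vector per email, the count representation is the one used in the rest of the exercise.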


### Auxiliary functions for data set preprocessing

In [26]:
def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end='')
        mail, label = parse_email(indexes[i])
        X.append(" ".join(mail['subject']) + " ".join(mail['body']))
        y.append(label)
    return X, y

## 3. Training the algorithm 

In [28]:
# We only read a subset of 200 emails
X_train, y_train = create_prep_dataset("Dataset/trec07p/full/index", 200)
X_train


Parsing email: 200

ain mean signific percentag return We call pptl one watch friday highli anticip report field It move 13 friday news isnt even yet just wait till word hit street On second thought dont wait compani premium petroleum pptl current 00085 13 target 00450 five bagger At time pptl number survey drill project progress We heard major discoveri made recommend reader capit opportun right away headlin texan want hold Em game legal tax dobb american dream threat record smith drug prescrib 1 doctor ahmadinejad pardon UK sailor collin Im scientist I believ god bush appoint ambassador',
 'iso88591qtodaysweatherdirectforecastforwaterlooweatherdirect waterloo ontario canada subscrib chang profil contact us long term\xa0\xa0 14 day trend\xa0\xa0 weather map waterloo ON sunday april 8 2007 hourli forecast time temperatur condit 5 pm 1°c cloudi sunni break 6 pm 0°c cloudi sunni break 7 pm 1°c cloudi sunni break 8 pm 2°c cloudi clear break 9 pm 2°c variabl cloudi 10 pm 3°c cloudi period 11 pm 3°c cloudi per

#### We apply vectorization to the data

In [29]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [31]:
print(X_train.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("\nFeatures:", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features: 7124


In [32]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,000,0000,000000,000713,00085,002,003,00450,0089,009,...,ⲩⱥ,㶫иï26,䡢ҫƹĳʒҫµŀͻⱥ塢soho,澫21ҵƽ,绰02035907148,绰۹ϵͳctsƽ,饻jwk,뼰ʱϵ,쫷ƹư,쵼ã
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
196,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
197,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
y_train

['spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam'

#### Training the logistic regression algorithm with the preprocessed data set

In [34]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression()
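
#### To make the training step less of a black box, the sketch below fits a logistic regression by hand on a tiny made-up data set. It uses plain batch gradient descent on the log-loss, which is an assumption chosen for readability: scikit-learn's `LogisticRegression` uses a different solver and applies L2 regularization by default. The two features (counts of two hypothetical words) and the encoding 1 = spam, 0 = ham are also assumptions for this illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def score(x, w, b):
    """Linear score w . x + b for one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias with batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(xi, w, b)) - yi   # predicted prob. minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Label a sample 1 when its estimated probability is at least 0.5."""
    return [1 if sigmoid(score(xi, w, b)) >= 0.5 else 0 for xi in X]

# Toy data: columns are counts of two hypothetical words, e.g. "free" and "meeting"
X_toy = [[2, 0], [3, 0], [1, 0], [0, 2], [0, 1], [0, 3]]
y_toy = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X_toy, y_toy)
print("weights:", w, "bias:", b)
print("predictions:", predict(X_toy, w, b))
```

#### The learned weight is positive for the word that appears only in the spam samples and negative for the word that appears only in the ham samples, which is the same kind of evidence the model trained above extracts from the 7124 count features.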

## 4. Prediction 

### Reading a set of new emails

In [49]:
# We read 288 emails from our data set and keep only the last 88
# These 88 emails have not been used to train the algorithm
X, y = create_prep_dataset("Dataset/trec07p/full/index", 288)
X_test = X[200:]
y_test = y[200:]

Parsing email: 288

### Preprocessing emails with the vectorizer created earlier

In [50]:
X_test = vectorizer.transform(X_test)
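
#### The test emails are transformed with the vocabulary learned during training, so words that were never seen in the training set have no column and are ignored. A minimal plain-Python sketch of that behaviour, with made-up documents and hypothetical helper names (`fit_vocabulary`, `transform`):

```python
def fit_vocabulary(docs):
    """Map every word seen in the training documents to a column index."""
    words = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(words)}

def transform(doc, vocabulary):
    """Count the known words of a document; unknown words are skipped."""
    row = [0] * len(vocabulary)
    for tok in doc.split():
        col = vocabulary.get(tok)
        if col is not None:
            row[col] += 1
    return row

train_docs = ["cheap pill offer", "meet offer"]
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)

# "lottery" is not in the training vocabulary, so it does not appear in the row
row = transform("cheap cheap lottery offer", vocabulary)
print(row)
```

#### This is why `transform` is called on the test set instead of `fit_transform`: refitting would change the columns and they would no longer match the ones the classifier was trained on.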

### Mail type prediction

In [51]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'ham', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam'],
      dtype='<U4')

In [57]:
print("Predictions:\n", y_pred)
print("\nReal labels:\n", y_test)

Predictions:
 ['spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'ham' 'ham' 'spam' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham'
 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam']

Real labels:
 ['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 

### Evaluation of the results

In [58]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.932
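
#### Since roughly two thirds of the corpus is spam (50199 of 75419 messages), accuracy alone can look better than the model really is: always answering "spam" would already score about 0.67 on the full corpus. The sketch below computes precision and recall by hand on made-up labels; the label lists and the helper `binary_metrics` are hypothetical and are not results from this exercise.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels for illustration only
toy_true = ["spam", "spam", "ham", "spam", "ham", "spam"]
toy_pred = ["spam", "spam", "spam", "spam", "ham", "ham"]

accuracy, precision, recall = binary_metrics(toy_true, toy_pred)
print("Accuracy: {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))
```

#### Precision shows how many of the emails flagged as spam really are spam, and recall shows how many of the spam emails were caught. Both are worth reporting next to accuracy when the classes are imbalanced.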

