# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails (0).

🧹 First, you will apply cleaning techniques to this text data.

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation.

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [1]:
!pip install nltk



In [82]:
# The first time nltk is used, a few of its data packages (stopword lists, tokenizer models, WordNet) must be downloaded as well

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /home/cherif/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/cherif/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/cherif/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/cherif/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [94]:
import pandas as pd

df = pd.read_csv("emails.csv")
df

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


## (1) Cleaning the (text) dataset

The dataset is made up of emails labelled as ham (`0`) or spam (`1`). You need to clean the text before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [95]:
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    for p in string.punctuation:
        text = text.replace(p, "")
    return text

df["clean_text"] = df["text"].apply(remove_punctuation)
df["clean_text"]

0       Subject naturally irresistible your corporate ...
1       Subject the stock trading gunslinger  fanny is...
2       Subject unbelievable new homes made easy  im w...
3       Subject 4 color printing special  request addi...
4       Subject do not have money  get software cds fr...
                              ...                        
5723    Subject re  research and development charges t...
5724    Subject re  receipts from visit  jim   thanks ...
5725    Subject re  enron case study update  wow  all ...
5726    Subject re  interest  david   please  call shi...
5727    Subject news  aurora 5  2 update  aurora versi...
Name: clean_text, Length: 5728, dtype: object
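
Deleting punctuation one character at a time works, but it scans every email once per punctuation character. A possible alternative, sketched here under the assumption that only ASCII punctuation matters, is to build a translation table once and delete everything in a single pass. The names `PUNCT_TABLE` and `remove_punctuation_fast` are illustrative and not part of the challenge.

```python
import string

# Translation table that maps every ASCII punctuation character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text: str) -> str:
    """Delete all ASCII punctuation from text in a single pass."""
    return text.translate(PUNCT_TABLE)

sample = "Subject: do not have money , get software cds"
cleaned = remove_punctuation_fast(sample)
print(cleaned)
```

Because the characters are deleted rather than replaced by a space, the spaces that surrounded a comma remain, which is why the cleaned emails contain double spaces.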

### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [96]:
def lowercase(text):
    return text.lower()

df["clean_text"] = df["clean_text"].apply(lowercase)

### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [97]:
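digits = "0123456789"

def remove_numbers(text):
    # Delete every ASCII digit from the text
    for d in digits:
        text = text.replace(d, "")
    return text

df["clean_text"] = df["clean_text"].apply(remove_numbers)
df["clean_text"]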

0       subject naturally irresistible your corporate ...
1       subject the stock trading gunslinger  fanny is...
2       subject unbelievable new homes made easy  im w...
3       subject  color printing special  request addit...
4       subject do not have money  get software cds fr...
                              ...                        
5723    subject re  research and development charges t...
5724    subject re  receipts from visit  jim   thanks ...
5725    subject re  enron case study update  wow  all ...
5726    subject re  interest  david   please  call shi...
5727    subject news  aurora    update  aurora version...
Name: clean_text, Length: 5728, dtype: object
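
Removing digits leaves runs of spaces behind (see row 5727 above). The tokenizer used later ignores them, but if tidy strings are wanted at this stage, one option is a regular expression that drops the digits and then collapses whitespace. This is only a sketch; `remove_numbers_and_squeeze` is a hypothetical helper, not something the challenge asks for.

```python
import re

def remove_numbers_and_squeeze(text: str) -> str:
    """Drop ASCII digits, then collapse runs of whitespace into single spaces."""
    no_digits = re.sub(r"[0-9]", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()

example = "subject news  aurora 5  2 update"
print(remove_numbers_and_squeeze(example))
```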

### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [99]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# A set makes the membership test fast
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Tokenize the text and keep only the tokens that are not stopwords
    return [word for word in word_tokenize(text) if word not in stop_words]

df.clean_text = df.clean_text.apply(remove_stopwords)
df.clean_text.head()

0    [subject, naturally, irresistible, corporate, ...
1    [subject, stock, trading, gunslinger, fanny, m...
2    [subject, unbelievable, new, homes, made, easy...
3    [subject, color, printing, special, request, a...
4    [subject, money, get, software, cds, software,...
Name: clean_text, dtype: object
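
To make the filtering step easier to inspect without downloading anything, here is a minimal sketch on a single sentence. The stop-word set below is a tiny hypothetical subset chosen for illustration; NLTK's English list is much longer, and `str.split` stands in for `word_tokenize`.

```python
# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {"do", "not", "have", "the", "a", "is"}

def filter_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep the tokens that are not in the stop-word set, preserving their order."""
    return [token for token in tokens if token not in stop_words]

tokens = "subject do not have money get software".split()
filtered = filter_stopwords(tokens)
print(filtered)
```

Note that removing "not" discards negation. That is usually acceptable for spam detection with word counts, but it may matter for other tasks such as sentiment analysis.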

In [100]:
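def recompose(tokens):
    # Rebuild a single string from a list of tokens.
    # Each token is prefixed with a space, so the result starts with one.
    text = ""
    for token in tokens:
        text = text + " " + token
    return text

df.clean_text = df.clean_text.apply(recompose)
df.clean_text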

0        subject naturally irresistible corporate iden...
1        subject stock trading gunslinger fanny merril...
2        subject unbelievable new homes made easy im w...
3        subject color printing special request additi...
4        subject money get software cds software compa...
                              ...                        
5723     subject research development charges gpg forw...
5724     subject receipts visit jim thanks invitation ...
5725     subject enron case study update wow day super...
5726     subject interest david please call shirley cr...
5727     subject news aurora update aurora version fas...
Name: clean_text, Length: 5728, dtype: object
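
The `recompose` helper puts a space in front of every token, so each rebuilt string begins with a space (visible as the extra indentation in the output above). This is harmless for the vectorizer used later, but `str.join` is a shorter way to rebuild the text without the leading space. A sketch, with `recompose_joined` as an illustrative name:

```python
def recompose_joined(tokens):
    """Join tokens with single spaces; no leading or trailing space is added."""
    return " ".join(tokens)

rebuilt = recompose_joined(["subject", "money", "get", "software"])
print(repr(rebuilt))
```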

### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [102]:
df.clean_text = df.clean_text.apply(word_tokenize) 
df.clean_text

0       [subject, naturally, irresistible, corporate, ...
1       [subject, stock, trading, gunslinger, fanny, m...
2       [subject, unbelievable, new, homes, made, easy...
3       [subject, color, printing, special, request, a...
4       [subject, money, get, software, cds, software,...
                              ...                        
5723    [subject, research, development, charges, gpg,...
5724    [subject, receipts, visit, jim, thanks, invita...
5725    [subject, enron, case, study, update, wow, day...
5726    [subject, interest, david, please, call, shirl...
5727    [subject, news, aurora, update, aurora, versio...
Name: clean_text, Length: 5728, dtype: object

In [104]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

df.clean_text = df.clean_text.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
df.clean_text

0       [subject, naturally, irresistible, corporate, ...
1       [subject, stock, trading, gunslinger, fanny, m...
2       [subject, unbelievable, new, home, made, easy,...
3       [subject, color, printing, special, request, a...
4       [subject, money, get, software, cd, software, ...
                              ...                        
5723    [subject, research, development, charge, gpg, ...
5724    [subject, receipt, visit, jim, thanks, invitat...
5725    [subject, enron, case, study, update, wow, day...
5726    [subject, interest, david, please, call, shirl...
5727    [subject, news, aurora, update, aurora, versio...
Name: clean_text, Length: 5728, dtype: object

In [105]:
df.clean_text = df.clean_text.apply(recompose) 
df.clean_text

0        subject naturally irresistible corporate iden...
1        subject stock trading gunslinger fanny merril...
2        subject unbelievable new home made easy im wa...
3        subject color printing special request additi...
4        subject money get software cd software compat...
                              ...                        
5723     subject research development charge gpg forwa...
5724     subject receipt visit jim thanks invitation v...
5725     subject enron case study update wow day super...
5726     subject interest david please call shirley cr...
5727     subject news aurora update aurora version fas...
Name: clean_text, Length: 5728, dtype: object

## (2) Bag-of-words Modelling

### (2.1) Converting the text into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [107]:
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and count how often each word occurs in every email
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["clean_text"])

# fit_transform returns a sparse matrix; convert it to a dense array to display it
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [108]:
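# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()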



['aa',
 'aaa',
 'aaaenerfax',
 'aadedeji',
 'aagrawal',
 'aal',
 'aaldous',
 'aaliyah',
 'aall',
 'aanalysis',
 ...]

In [110]:
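# Dense DataFrame with one row per email and one column per vocabulary word
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
X_bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())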

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [113]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

y = df.spam

nb_model = MultinomialNB()

# 5-fold cross-validation; returns one accuracy score per fold
cross_val_score(nb_model, X_bow, y, cv=5, scoring="accuracy", n_jobs=-1)

array([0.98691099, 0.9895288 , 0.991274  , 0.98777293, 0.99213974])
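
In the cell above, the vocabulary was learned on all emails before cross-validating, so each validation fold contributed words to the vectorizer. A common way to avoid this is to put the vectorizer and the model in a single pipeline, so that the vocabulary is refit on the training folds only. The sketch below shows the idea on a hypothetical toy corpus that exists only to make the example runnable; with the real data, the two lists would be replaced by `df.clean_text` and `df.spam`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = spam, 0 = ham), for illustration only
emails = [
    "win free money now", "free prize claim now", "cheap loans free offer",
    "win cash prize today", "free money offer today", "claim cheap prize now",
    "meeting agenda for monday", "please review the report", "lunch on friday",
    "project update attached", "notes from the meeting", "report due monday",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is refit inside each fold, on the training part only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, emails, labels, cv=3, scoring="accuracy")

print(scores)
print(scores.mean())
```

The pipeline also keeps the counts in sparse form, which avoids building the large dense `X_bow` table. Reporting `scores.mean()` gives a single summary figure for the model.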

🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!