# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails (0).

🧹 First, you will apply cleaning techniques to this text data.

👩🏻‍🔬 Then, you will convert the cleaned text into a numerical representation.

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [None]:
# !pip install nltk

In [1]:
# When importing nltk for the first time, we also need to download a few of its data packages (corpora and tokenizer models)

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /Users/jinru/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jinru/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /Users/jinru/nltk_data...
[nltk_data] Downloading package omw-1.4 to /Users/jinru/nltk_data...


True

In [2]:
import pandas as pd

df = pd.read_csv("emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham (0) or spam (1). You need to clean the text before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [21]:
import string

def remove_punctuations(sentence):
    """Remove every ASCII punctuation character from `sentence`."""
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')
    return sentence

In [14]:
# Note: the result overwrites `text` in place; the `clean_text` column is created in step (1.4)
df['text'] = df['text'].apply(remove_punctuations)
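
As an alternative to calling `replace` once per punctuation character, the standard library's `str.translate` can strip all of them in a single pass. This is only a minimal sketch: the function name `remove_punctuation_fast` and the sample string are made up for illustration and are not part of the challenge.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_fast(sentence):
    """Return `sentence` with all ASCII punctuation characters removed."""
    return sentence.translate(PUNCTUATION_TABLE)


# Invented sample, loosely modelled on the subjects in the dataset preview.
sample = "Subject: do not have money, get software CDs from here!"
cleaned = remove_punctuation_fast(sample)
print(cleaned)
```

Both versions only handle the characters listed in `string.punctuation`, so non-ASCII punctuation such as curly quotes would be left untouched.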

### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [15]:
def lower_case(text):
    return text.lower()

df['text'] = df['text'].apply(lower_case)

### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [17]:
def remove_numbers(text):
    return ''.join(char for char in text if not char.isdigit())

In [19]:
df['text'] = df['text'].apply(remove_numbers)

### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [34]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english')) # you can also choose other languages

def remove_stopwords(text):
    """Tokenize `text` and return the list of tokens that are not stopwords."""
    tokenized_sentence = word_tokenize(text)
    return [word for word in tokenized_sentence if word not in stop_words]

In [37]:
df['clean_text'] = df['text'].apply(remove_stopwords)

In [45]:
df['clean_text']

0       [subject, naturally, irresistible, corporate, ...
1       [subject, stock, trading, gunslinger, fanny, m...
2       [subject, unbelievable, new, homes, made, easy...
3       [subject, color, printing, special, request, a...
4       [subject, money, get, software, cds, software,...
                              ...                        
5723    [subject, research, development, charges, gpg,...
5724    [subject, receipts, visit, jim, thanks, invita...
5725    [subject, enron, case, study, update, wow, day...
5726    [subject, interest, david, please, call, shirl...
5727    [subject, news, aurora, update, aurora, versio...
Name: clean_text, Length: 5728, dtype: object

### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [41]:
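from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()

def lemmatize_words(tokens):
    """Lemmatize a list of tokens as verbs and join them into a single string."""
    lemmatized = [lemmatizer.lemmatize(word, pos="v") for word in tokens]
    return ' '.join(lemmatized)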

In [60]:
df['clean_text'] = df['clean_text'].apply(lemmatize_words)
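
The cleaning steps above can also be chained in a single function, which makes it easy to try them on one string before touching the dataframe. The sketch below uses only the standard library, so it makes two simplifying assumptions: `TOY_STOP_WORDS` is a tiny hand-written set standing in for NLTK's English stopword list, and `str.split()` stands in for `word_tokenize`. Lemmatization is left out, and the name `basic_clean` is hypothetical.

```python
import string

# Assumption: a tiny hand-written stopword set, NOT NLTK's English list.
TOY_STOP_WORDS = {"the", "a", "is", "to", "of", "and", "your"}
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(text):
    """Strip punctuation, lowercase, drop digits, then drop stopwords."""
    text = text.translate(PUNCTUATION_TABLE)
    text = text.lower()
    text = "".join(char for char in text if not char.isdigit())
    # Assumption: whitespace splitting as a stand-in for nltk's word_tokenize.
    tokens = [word for word in text.split() if word not in TOY_STOP_WORDS]
    return " ".join(tokens)


# Invented sample email subject.
sample = "Subject: Claim your 2 FREE tickets to the show!!!"
cleaned_sample = basic_clean(sample)
print(cleaned_sample)
```

Lowercasing before the stopword filter matters here, because the stopword set only contains lowercase words.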

## (2) Bag-of-words Modelling

### (2.1) Converting the textual data into numbers

❓ Vectorize `clean_text` into a Bag-of-Words representation with a default `CountVectorizer`. Save the result as `X_bow`. ❓

In [61]:
from sklearn.feature_extraction.text import CountVectorizer

X = df['clean_text']

In [62]:
X

0       subject naturally irresistible corporate ident...
1       subject stock trade gunslinger fanny merrill m...
2       subject unbelievable new home make easy im wan...
3       subject color print special request additional...
4       subject money get software cds software compat...
                              ...                        
5723    subject research development charge gpg forwar...
5724    subject receipt visit jim thank invitation vis...
5725    subject enron case study update wow day super ...
5726    subject interest david please call shirley cre...
5727    subject news aurora update aurora version fast...
Name: clean_text, Length: 5728, dtype: object

In [63]:
count_vectorizer = CountVectorizer()
X_bow = count_vectorizer.fit_transform(X)
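
To make the bag-of-words idea concrete, here is a pure-Python sketch of the kind of count matrix a vectorizer produces, computed on a toy corpus. It splits on whitespace, which is a simplification of `CountVectorizer`'s default tokenization (for instance, the default ignores single-character tokens). The corpus and the helper name `bag_of_words` are invented for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [doc.split() for doc in documents]
    # Sorted list of every distinct token seen in the corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for terms that are absent from the document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Invented toy corpus, already cleaned.
corpus = ["free money now", "meeting now", "free free offer"]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, so word order is discarded and only the counts remain.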

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [72]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

# cross_validate uses 5-fold cross-validation by default
cv_nb = cross_validate(MultinomialNB(), X_bow, df['spam'], scoring="accuracy")

cv_nb['test_score'].mean()

0.9886525373998796
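
Fitting `CountVectorizer` on the whole dataset before cross-validating means the vocabulary is built with the validation folds included. One possible way to avoid this, sketched here on an invented toy corpus, is to wrap the vectorizer and the model in a scikit-learn pipeline so that the vectorizer is refit on each training fold. The texts, labels, and `cv=2` below are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 1 = spam, 0 = ham.
texts = [
    "win free money now",
    "free prize claim now",
    "cheap loan free offer",
    "claim free money prize",
    "meeting schedule tomorrow",
    "project report attached",
    "lunch meeting tomorrow",
    "please review report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is fit inside each training fold, never on the validation fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv_results = cross_validate(pipeline, texts, labels, cv=2, scoring="accuracy")

print(cv_results["test_score"].mean())
```

On the notebook's data, the analogous call would pass `df['clean_text']` and `df['spam']` instead of the toy lists.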

🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!