# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails (0).

🧹 First, you will apply cleaning techniques to this text data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [222]:
# !pip install nltk

In [223]:
import string
import nltk
import pandas as pd

In [224]:
# When using nltk for the first time, we also need to download a few resources (corpora and tokenizer data)

#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('omw-1.4')

In [225]:
df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.head()

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham (0) or spam (1). You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [226]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [227]:
def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text


remove_punctuation("Subject: the stock trading gunslinger")

'Subject the stock trading gunslinger'
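As a possible alternative to the character-by-character loop, the same result can be sketched with `str.translate`, which deletes every punctuation character in a single pass over the string. The names `PUNCTUATION_TABLE` and `remove_punctuation_translate` are hypothetical and not part of the original notebook.

```python
import string

# Translation table that maps every punctuation character to None (i.e. deletes it).
# It is built once so it is not recreated for every email.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_translate(text):
    """Return `text` with all characters from string.punctuation removed."""
    return text.translate(PUNCTUATION_TABLE)


print(remove_punctuation_translate("Subject: the stock trading gunslinger"))
```

It can be applied to the dataframe in the same way as any other cleaning function, with `df["text"].apply(remove_punctuation_translate)`.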

In [228]:
df["clean_text"] = df["text"].apply(remove_punctuation)

df["clean_text"].head(2)

0    Subject naturally irresistible your corporate ...
1    Subject the stock trading gunslinger  fanny is...
Name: clean_text, dtype: object

### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [229]:
def lower_case(text):
    return text.lower()

In [230]:
df["clean_text"] = df["clean_text"].apply(lower_case)


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [231]:
def remove_numbers_lc(text):
    return ''.join(char for char in text if not char.isdigit())

In [232]:
import re

In [1]:
def remove_numbers(text):
    pattern = r'[0-9]'
    return re.sub(pattern, '', text)

In [3]:
import re
remove_numbers("283dsklj2jgfk2")

'dskljjgfk'
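The two approaches above are not strictly equivalent: the pattern `[0-9]` only matches ASCII digits, whereas `str.isdigit()` is also true for some other Unicode characters such as the superscript `²`. The sketch below illustrates the difference on a made-up sentence; the function names are hypothetical.

```python
import re

# Compiled once and reused for every call.
ASCII_DIGITS = re.compile(r"[0-9]")


def remove_numbers_regex(text):
    """Remove ASCII digits 0-9 only."""
    return ASCII_DIGITS.sub("", text)


def remove_numbers_isdigit(text):
    """Remove every character for which str.isdigit() is True."""
    return "".join(char for char in text if not char.isdigit())


sample = "win 100 m² now"  # made-up example text
print(repr(remove_numbers_regex(sample)))    # keeps the superscript
print(repr(remove_numbers_isdigit(sample)))  # drops the superscript too
```

On plain ASCII text such as `"283dsklj2jgfk2"`, both functions return the same string.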

In [234]:
%%time
df["text"].apply(remove_numbers).head(5)

CPU times: user 168 ms, sys: 9.89 ms, total: 178 ms
Wall time: 178 ms


0    Subject: naturally irresistible your corporate...
1    Subject: the stock trading gunslinger  fanny i...
2    Subject: unbelievable new homes made easy  im ...
3    Subject:  color printing special  request addi...
4    Subject: do not have money , get software cds ...
Name: text, dtype: object

In [235]:
%%time
# The generator-based version is a bit slower than the regex one
df["text"].apply(remove_numbers_lc).head(1)

CPU times: user 200 ms, sys: 484 µs, total: 201 ms
Wall time: 199 ms


0    Subject: naturally irresistible your corporate...
Name: text, dtype: object

In [236]:
df["clean_text"] = df["clean_text"].apply(remove_numbers)

In [237]:
df["clean_text"].head(5)

0    subject naturally irresistible your corporate ...
1    subject the stock trading gunslinger  fanny is...
2    subject unbelievable new homes made easy  im w...
3    subject  color printing special  request addit...
4    subject do not have money  get software cds fr...
Name: clean_text, dtype: object
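Steps (1.1) to (1.3) can also be chained in one function, so that each email is processed in a single `apply` call. This is only a minimal sketch: `basic_clean` is a hypothetical name, and it uses `str.translate` for the punctuation step rather than a loop over `string.punctuation`.

```python
import re
import string

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)
DIGITS = re.compile(r"[0-9]")


def basic_clean(text):
    """Apply steps (1.1) to (1.3): strip punctuation, lowercase, drop digits."""
    text = text.translate(PUNCTUATION_TABLE)
    text = text.lower()
    return DIGITS.sub("", text)


print(repr(basic_clean("Subject: 4 Color Printing!")))
```

Removing a digit leaves the spaces around it in place, which is why double spaces show up in the cleaned text. With the dataframe loaded, the usage would be `df["text"].apply(basic_clean)`.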

### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [238]:
len(set(nltk.corpus.stopwords.words('english')))

179

In [239]:
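def remove_stopwords(text):
    # Returns a list of tokens (not a string); lemmatize() in (1.5) joins them back
    stopwords = set(nltk.corpus.stopwords.words('english'))
    tokens = text.split(" ")
    # `if word` drops the empty strings left by consecutive spaces
    return [word for word in tokens if word and word not in stopwords]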


In [240]:
# Not a good idea! Sets are unordered and drop duplicates, so this version
# scrambles the word order and loses the word counts (it was also slower here).
def remove_stopwords_set(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text_split = set(text.split(" "))
    text_without_sw = text_split - (stopwords & text_split)
    return " ".join(text_without_sw)
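To see why the set-based version is problematic for a bag-of-words model, the sketch below compares both strategies on a made-up sentence. The tiny `STOPWORDS` set is an assumption for illustration only and is not NLTK's English list; the function names are hypothetical as well.

```python
# Tiny illustrative stopword list (an assumption, NOT NLTK's English list).
STOPWORDS = frozenset({"the", "and", "a", "is"})


def remove_stopwords_list(text):
    """Keep non-stopword tokens in their original order, duplicates included."""
    return [word for word in text.split(" ") if word and word not in STOPWORDS]


def remove_stopwords_by_set_difference(text):
    """Set difference: loses word order and collapses repeated words."""
    return set(text.split(" ")) - STOPWORDS


sentence = "free money and free software is the offer"
print(remove_stopwords_list(sentence))               # "free" appears twice
print(remove_stopwords_by_set_difference(sentence))  # "free" appears once, order arbitrary
```

Since the classifier relies on how often each word occurs, collapsing repeated words discards exactly the information the model needs.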

In [241]:
%%time
df["clean_text"].apply(remove_stopwords).head(5)

CPU times: user 1.52 s, sys: 200 ms, total: 1.72 s
Wall time: 1.72 s


0    [subject, naturally, irresistible, corporate, ...
1    [subject, stock, trading, gunslinger, fanny, m...
2    [subject, unbelievable, new, homes, made, easy...
3    [subject, color, printing, special, request, a...
4    [subject, money, get, software, cds, software,...
Name: clean_text, dtype: object

In [242]:
%%time
df["clean_text"] = df["clean_text"].apply(remove_stopwords)

CPU times: user 1.45 s, sys: 290 ms, total: 1.74 s
Wall time: 1.74 s


In [243]:
df["clean_text"].head(5)

0    [subject, naturally, irresistible, corporate, ...
1    [subject, stock, trading, gunslinger, fanny, m...
2    [subject, unbelievable, new, homes, made, easy...
3    [subject, color, printing, special, request, a...
4    [subject, money, get, software, cds, software,...
Name: clean_text, dtype: object

### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [244]:
def lemmatize(text):
    # `text` is the list of tokens produced by remove_stopwords
    lemmatizer = nltk.stem.WordNetLemmatizer()
    # 1 - Lemmatizing the verbs
    v_lemmatized = [
        lemmatizer.lemmatize(word, pos = "v") # v --> verbs
        for word in text
    ]
    # 2 - Lemmatizing the nouns
    v_n_lemmatized = [
        lemmatizer.lemmatize(word, pos = "n") # n --> nouns
        for word in v_lemmatized
    ]
    # 3 - Lemmatizing the adverbs
    v_n_r_lemmatized = [
        lemmatizer.lemmatize(word, pos = "r") # r --> adverbs
        for word in v_n_lemmatized
    ]
    # Joining the tokens back into a single string
    return " ".join(v_n_r_lemmatized)

In [245]:
df["clean_text"] = df["clean_text"].apply(lemmatize)

In [246]:
df["clean_text"].head(5)

0    subject naturally irresistible corporate ident...
1    subject stock trade gunslinger fanny merrill m...
2    subject unbelievable new home make easy im wan...
3    subject color print special request additional...
4    subject money get software cd software compati...
Name: clean_text, dtype: object

## (2) Bag-of-words Modelling

### (2.1) Converting the textual data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [247]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X_bow = count_vectorizer.fit_transform(df["clean_text"])
X_bow

<5728x28083 sparse matrix of type '<class 'numpy.int64'>'
	with 477378 stored elements in Compressed Sparse Row format>
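The matrix above has one row per email (5728) and one column per vocabulary word (28083), and each cell holds the number of times that word occurs in that email. The sketch below rebuilds this idea by hand on two made-up documents, using only the standard library. It is a simplified approximation: `CountVectorizer` uses its own regex-based tokenizer and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Two made-up documents, already cleaned.
documents = [
    "free money free software",
    "meeting schedule money",
]

# Tokenize on whitespace (an approximation of CountVectorizer's tokenizer).
tokenized = [doc.split() for doc in documents]

# Vocabulary: sorted unique words; the position in the list is the column index.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word, cells hold counts.
bow_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Most cells are zero because each email uses only a small part of the vocabulary, which is why scikit-learn stores only the non-zero entries.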

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [256]:
import numpy as np

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Feature/Target
X = df["clean_text"]
y = df["spam"]

# Pipeline vectorizer + Naive Bayes
# Note: this solution vectorizes with TF-IDF inside the pipeline (refit on each
# training fold) rather than reusing X_bow
pipeline_naive_bayes = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB()
)

# Cross-validation, scored on recall (share of actual spam emails that are caught)
cv_results = cross_validate(pipeline_naive_bayes, X, y, cv = 5, scoring = ["recall"])
average_recall = cv_results["test_recall"].mean()
np.round(average_recall,2)

0.57
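The score above is a recall: the share of actual spam emails that the model flags as spam. A value of 0.57 therefore means that, averaged over the five folds, a little more than half of the spam emails were caught. The sketch below computes recall by hand on made-up labels; the helper `recall` is hypothetical and only mirrors the definition of the metric.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    false_negatives = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p != positive
    )
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no positive samples")
    return true_positives / actual_positives


# Toy labels (made up for illustration): 1 = spam, 0 = ham.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(recall(y_true, y_pred))  # 2 of the 4 spam emails are caught
```

Recall ignores ham emails wrongly flagged as spam (the last prediction in the toy example), so it is usually read alongside precision.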

In [257]:
from sklearn.model_selection import GridSearchCV

# Define the grid of parameters
parameters = {
    'tfidfvectorizer__ngram_range': ((1,1), (2,2)),
    'multinomialnb__alpha': (0.1,1)
}

# Perform Grid Search
grid_search = GridSearchCV(
    pipeline_naive_bayes,
    parameters,
    scoring = "recall",
    cv = 5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(df["clean_text"], df["spam"])

# Best score
print(f"Best Score = {grid_search.best_score_}")

# Best params
print(f"Best params = {grid_search.best_params_}")

Fitting 5 folds for each of 4 candidates, totalling 20 fits


Best Score = 0.9473703911660116
Best params = {'multinomialnb__alpha': 0.1, 'tfidfvectorizer__ngram_range': (1, 1)}
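The log line "4 candidates, totalling 20 fits" follows from the grid itself: every combination of the two `ngram_range` values and the two `alpha` values is tried, and each combination is fitted once per fold. The sketch below enumerates those combinations with `itertools.product`; it only illustrates the counting and does not reproduce how `GridSearchCV` works internally.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    "tfidfvectorizer__ngram_range": ((1, 1), (2, 2)),
    "multinomialnb__alpha": (0.1, 1),
}
cv_folds = 5

# Cartesian product of all parameter values: one dict per candidate.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv_folds} folds = {len(candidates) * cv_folds} fits")
```

The number of fits grows multiplicatively with every parameter added to the grid, so small grids like this one stay cheap while larger ones can become slow.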


🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!

In [272]:
# Sanity check: cross-validate the pipeline again with the best parameters
# found by the grid search (ngram_range=(1,1), alpha=0.1)
import numpy as np

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Feature/Target
X = df["clean_text"]
y = df["spam"]

# Pipeline vectorizer + Naive Bayes
pipeline_naive_bayes = make_pipeline(
    TfidfVectorizer(ngram_range=(1,1)),
    MultinomialNB(alpha=0.1)
)

# Cross-validation
cv_results = cross_validate(pipeline_naive_bayes, X, y, cv = 5, scoring = ["recall"])
average_recall = cv_results["test_recall"].mean()
np.round(average_recall,4)

0.9474
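The `alpha` parameter of `MultinomialNB` controls additive smoothing: the estimated probability of a word given a class is `(N_cw + alpha) / (N_c + alpha * |V|)`, where `N_cw` is the total weight of the word in that class, `N_c` the total weight of all words in that class, and `|V|` the vocabulary size. The sketch below evaluates this formula on made-up numbers to show how a smaller `alpha` changes the probability given to a word never seen in a class. It is only an illustration of the formula and does not explain on its own why `alpha=0.1` scored better on this dataset.

```python
def smoothed_word_probability(word_count, total_count, vocabulary_size, alpha):
    """Additive smoothing: (N_cw + alpha) / (N_c + alpha * |V|)."""
    return (word_count + alpha) / (total_count + alpha * vocabulary_size)


# Toy numbers (made up): a word never seen in spam, 1000 spam tokens,
# and a vocabulary of 500 distinct words.
for alpha in (1, 0.1):
    probability = smoothed_word_probability(
        word_count=0, total_count=1000, vocabulary_size=500, alpha=alpha
    )
    print(f"alpha={alpha}: P(unseen word | spam) = {probability:.6f}")
```

With a smaller `alpha`, unseen words receive a lower probability, so the observed word frequencies weigh more heavily in the prediction.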