# Ham or Spam?

In [20]:
import pandas as pd

df = pd.read_csv("emails.csv")

df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


The dataset is made up of emails that are classified as ham [0] or spam [1]. You need to clean the dataset before training a prediction model.

## Remove Punctuation

👇 Create a function to remove the punctuation. Apply it to the entire `text` column and add the output as a new column in the dataframe called `clean_text`.

In [21]:
import string

In [22]:
def remove_punct(text):
    # Strip every character listed in string.punctuation, one at a time
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

df['clean_text'] = df['text'].map(remove_punct)
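
The loop in `remove_punct` scans the text once per punctuation character. As a minimal sketch of an alternative, `str.translate` can drop all of them in a single pass. The name `remove_punct_translate` is hypothetical and not part of the exercise; it should produce the same result as `remove_punct`.

```python
import string

# Build the translation table once; it maps every punctuation character to None.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_translate(text):
    """Hypothetical variant of remove_punct that uses a single str.translate pass."""
    return text.translate(_PUNCT_TABLE)

print(remove_punct_translate("Hello, world!"))
```

It can be applied the same way, for example `df['text'].map(remove_punct_translate)`.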

## Lower Case

👇 Create a function to lowercase the text. Apply it to `clean_text`.

In [23]:
def lower(text):
    return text.lower()

df['clean_text'] = df['clean_text'].map(lower)

## Remove Numbers

👇 Create a function to remove numbers from the text. Apply it to `clean_text`.

In [24]:
def remove_digits(text):
    # Keep every character that is not a digit
    return ''.join(c for c in text if not c.isdigit())

df['clean_text'] = df['clean_text'].map(remove_digits)

## Remove StopWords

👇 Create a function to remove stopwords from the text. Apply it to `clean_text`.

In [11]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/florent/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/florent/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [25]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))

In [26]:
def remove_stopw(text):
    # Tokenize, then keep only the tokens that are not English stopwords
    word_tokens = word_tokenize(text)
    text = ''.join(w + ' ' for w in word_tokens if w not in stop_words)
    return text

df['clean_text'] = df['clean_text'].map(remove_stopw)
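
One thing to keep in mind is that punctuation was removed before this step. If a stopword list stores contractions with an apostrophe, the stripped form may no longer match it. The sketch below illustrates the effect with a tiny hypothetical stopword set (`demo_stop_words`, not NLTK's list) and `str.split()` standing in for `word_tokenize`; whether it matters for this dataset is something to check rather than assume.

```python
# Tiny hypothetical stopword set, for illustration only (not NLTK's list).
demo_stop_words = {"do", "not", "the", "don't"}

def remove_stopw_demo(text, stop_words):
    # str.split() is a simplification of word_tokenize for this demo
    return ' '.join(w for w in text.split() if w not in stop_words)

raw = "don't open the attachment"

# With the apostrophe intact, the contraction matches the stopword set.
print(remove_stopw_demo(raw, demo_stop_words))

# After stripping the apostrophe, "dont" is not in the set and survives.
print(remove_stopw_demo(raw.replace("'", ""), demo_stop_words))
```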

In [27]:
print(df['clean_text'][2])

subject unbelievable new homes made easy im wanting show homeowner pre approved home loan fixed rate offer extended unconditionally credit way factor take advantage limited time opportunity ask visit website complete minute post approval form look foward hearing dorcas pittman 


## Lemmatize

👇 Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`.

In [64]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/florent/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [28]:
text = 'beer beers hello men man better good'

In [29]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [30]:
def lem_text(text):
    # lemmatize() treats every token as a noun unless a `pos` argument is given
    word_tokens = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word) for word in word_tokens]
    text = ''.join(w + ' ' for w in lemmatized)
    return text

df['clean_text'] = df['clean_text'].map(lem_text)

In [31]:
df.head(5)

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new home made easy im wan...
3,Subject: 4 color printing special request add...,1,subject color printing special request additio...
4,"Subject: do not have money , get software cds ...",1,subject money get software cd software compati...


## Bag-of-words Modelling

👇 Vectorize the `clean_text` to a Bag-of-Words representation with a default `CountVectorizer`. Save as `X_bow`.

In [32]:
text

'beer beers hello men man better good'

In [33]:
df.shape

(5728, 3)

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X_bow = vectorizer.fit_transform(df['clean_text'])

👇 Cross-validate a MultinomialNB model with the Bag-of-words. Score the model's accuracy.

In [35]:
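from sklearn.naive_bayes import MultinomialNB

y = df['spam']

nb_model = MultinomialNB()

# Note: the model is fitted and scored on the same rows, so the value below is
# training accuracy rather than a cross-validated estimate.
nb_model.fit(X_bow, y)

nb_model.score(X_bow, y)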

0.9949371508379888
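
The score above comes from fitting and scoring on the same data, so it is likely optimistic. A minimal sketch of the cross-validation the exercise asks for could look like the following. The toy texts and labels are hypothetical stand-ins so the block runs on its own; in the notebook you would pass `df['clean_text']` and `df['spam']` instead. Putting the vectorizer inside a pipeline is one possible design choice: it ensures the vocabulary is learned only from the training folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 spam-like and 6 ham-like messages
toy_texts = [
    "free money offer now", "win free prize money", "cheap offer click now",
    "free prize click here", "win money now free", "cheap prize offer here",
    "meeting schedule for monday", "project report attached", "lunch meeting tomorrow",
    "please review the report", "schedule the project call", "monday lunch with team",
]
toy_labels = [1] * 6 + [0] * 6

# Vectorizer and classifier are refitted on each training fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# 3 folds because the toy data is tiny; 5 is a common choice on the full dataset
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3, scoring="accuracy")

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```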

⚠️ Please push the exercise once you are done 🙃

## 🏁 