# Naïve Bayes Spam classification

In this assignment, you will implement a Naïve Bayes Classifier (NBC) that, given the text of an SMS message, classifies it as either **spam** or **ham** (not spam).

Our NBC will be based on a [Multinomial Distribution](https://en.wikipedia.org/wiki/Multinomial_distribution) (so it's called a [Multinomial NBC](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naïve_Bayes)).

Given a string of text, we should classify it either as $C=1$ (spam) or $C=0$ (not spam).

We will do so by counting the **occurrences of words** in the text. Denote by $x_i$ the **frequency** (simply the number of times) with which the $i$-th word (from our dictionary of all words in the dataset) occurred in a given message.
We then **assume** that the counts $x_i$ follow a **multinomial** distribution:

$$ p(\mathbf{x} \mid C_k) = \frac{\left(\sum_{i=1}^n x_i\right)!}{\prod_{i=1}^n x_i!} \prod_{i=1}^n p_{ki}^{x_i} $$

where $n$ is the number of words in the dictionary and $p_{ki}$ is the probability that the $i$-th word occurs in a text of class $k$ (there are two classes: $k=1$ for spam and $k=0$ for not spam).
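To make the likelihood concrete, here is a minimal sketch that evaluates a multinomial log-likelihood (the multinomial coefficient times the product of $p_{ki}^{x_i}$) for a toy count vector. The three-word vocabulary, the probabilities `p_spam` and `p_ham`, and the helper name `multinomial_log_likelihood` are all made up for illustration; they are not part of the assignment data.

```python
import math

def multinomial_log_likelihood(counts, word_probs):
    """Return log p(x | C_k) for word counts x and per-class word probabilities."""
    total = sum(counts)
    # Log of the multinomial coefficient: total! / (x_1! * ... * x_n!)
    log_coef = math.lgamma(total + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Sum of x_i * log(p_ki); words with zero count contribute nothing
    log_prod = sum(x * math.log(p) for x, p in zip(counts, word_probs) if x > 0)
    return log_coef + log_prod

# Hypothetical vocabulary: ["free", "win", "hello"]
p_spam = [0.5, 0.4, 0.1]  # assumed word probabilities for class k=1
p_ham = [0.1, 0.1, 0.8]   # assumed word probabilities for class k=0
x = [2, 1, 0]             # "free" twice, "win" once, "hello" never

log_lik_spam = multinomial_log_likelihood(x, p_spam)
log_lik_ham = multinomial_log_likelihood(x, p_ham)
print(log_lik_spam > log_lik_ham)
```

Working in log space avoids numerical underflow when a message contains many words. Note that the multinomial coefficient is the same for both classes, so it does not affect which class wins.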

# 1) Import dataset 

Let me help you a bit:

In [1]:
import pandas as pd
df = pd.read_csv('NaiveBayes_HW_data.csv', header=None, sep='\t', names=['label', 'sms_message'])
# Show the first 5 rows
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Instead of 'ham' and 'spam', let us encode the labels as 0 and 1 respectively (we're detecting spam, after all).

In [2]:
df['label'] = df.label.map({'ham': 0, 'spam': 1})
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Have a look at the size of our dataset:

In [3]:
df.shape

(5572, 2)

# 2) Making a bag-of-words

Now, use the `CountVectorizer` class from `sklearn.feature_extraction.text` to turn the texts into bag-of-words vectors.

Everything should be clear from [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)!
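As a starting point, here is a minimal sketch of `CountVectorizer` on three made-up messages (the `messages` list is hypothetical, not taken from the dataset). For the assignment you would presumably pass the `sms_message` column instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, invented for illustration only
messages = [
    "Free entry to win a prize",
    "Are you free tonight",
    "Win win win a free prize now",
]

vectorizer = CountVectorizer()
# Learn the vocabulary and build the (messages x words) sparse count matrix
X = vectorizer.fit_transform(messages)

# vocabulary_ maps each word to its column index in X
win_column = vectorizer.vocabulary_["win"]
print(X.shape)
print(X[2, win_column])  # how often "win" occurs in the third message
```

With the default settings the text is lowercased and single-character tokens such as "a" are dropped, so they do not appear in the vocabulary.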

# 3) Train the NBC

First, split the data into 80% train and 20% test.

Then use the `MultinomialNB` class from `sklearn.naive_bayes` to build a model, and `.fit()` it to the training data!

Here's [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) with examples.
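The split-then-fit workflow could look roughly like the sketch below. It runs on ten invented messages so that it is self-contained; the `messages` and `labels` lists, the `random_state` value, and the use of `stratify` are assumptions made for this example, not requirements of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration: 1 = spam, 0 = ham
messages = [
    "win a free prize now", "free entry win cash", "claim your free cash prize",
    "urgent win cash now", "free prize claim now",
    "are you coming home", "see you at lunch", "call me when you are home",
    "lunch at home today", "see you today",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# 80% train / 20% test; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)

model = MultinomialNB()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

# Classify an unseen message using the same fitted vectorizer
new_prediction = model.predict(vectorizer.transform(["free cash prize"]))
print(predictions, new_prediction)
```

One design choice to be aware of: this sketch builds the vocabulary from all messages before splitting, following the order of the sections here. Fitting the vectorizer on the training messages only, and calling `.transform()` on the test messages, keeps test-set vocabulary out of training.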

# 4) Evaluating the model

Now let's evaluate our classifier!

Use the trained model to `.predict()` the classes of the 20% test messages, and measure:

_(below # stands for "number", "# of" = "number of")_

**Accuracy** = (# of correct predictions)/(total # of predictions)

**Precision** = (# of True Positives)/(# of True Positives + # of False Positives)

**Recall (sensitivity)** = (# of True Positives)/(# of True Positives + # of False Negatives)
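The three formulas above can be sketched directly in plain Python. The helper name `classification_metrics` and the label vectors `y_true` and `y_pred` are hypothetical and only serve to show the arithmetic, with spam (1) as the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall), treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labelled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels: 3 true positives, 2 false positives, 1 false negative, 2 true negatives
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

scikit-learn also provides `accuracy_score`, `precision_score`, and `recall_score` in `sklearn.metrics`, which can be used to cross-check a hand-written version like this one.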