## Naive Bayes Algorithms
**Bernoulli Naive Bayes**:<br>
Assumes every feature is binary (one of two values, 1 or 0), e.g., whether a word is present in a message.

**Multinomial Naive Bayes**:<br>
Used for discrete count data (e.g., movie ratings from 1 to 5, where each rating has a frequency). In text classification, the features are the counts of each word, which the model uses to predict the class (label).

**Gaussian Naive Bayes**:<br>
Assumes the values of each feature follow a normal (Gaussian) distribution within each class; used when all features are continuous.

**Discrete**: Takes distinct, countable values (e.g., integer word counts) rather than any value in a continuous range.
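
As a rough illustration of how the three variants line up with feature types, the sketch below fits each scikit-learn estimator on a tiny made-up dataset. The arrays and variable names are hypothetical and were chosen only to match each variant's assumption; they are not part of the spam dataset.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data: four samples, two classes.
y = np.array([0, 0, 1, 1])

X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])   # feature present / absent
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])   # feature frequencies
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9],
                         [3.9, 4.2], [4.1, 3.8]])       # real-valued measurements

# Each variant is paired with the kind of feature it assumes.
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[3, 0]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[1.1, 2.0]])

print(bernoulli_pred, multinomial_pred, gaussian_pred)
```

Each query point resembles the class 0 samples, so all three models should assign it to class 0.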

## Import libraries. Load spam email CSV.

In [1]:
import pandas as pd

df = pd.read_csv("spam.csv")
df.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


## Convert 'Category' feature to binary 'spam' column.

In [3]:
df["spam"] = df["Category"].apply(lambda x: 1 if x == "spam" else 0)
df.head()

,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


## Split the data into training and test sets.

In [16]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.25)
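
The dataset is imbalanced (4825 ham vs. 747 spam messages), and the split above is random on every run. One possible refinement, sketched below on hypothetical stand-in data, is to pass `stratify` so both splits keep the ham/spam ratio, and `random_state` so the split is repeatable. Both arguments are optional additions and are not used in the notebook's original call.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df.Message / df.spam: 8 ham (0) and 4 spam (1).
messages = [f"message {i}" for i in range(12)]
labels = [0] * 8 + [1] * 4

x_tr, x_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.25,    # same proportion as in the notebook
    stratify=labels,   # assumption: keep the ham/spam ratio in both splits
    random_state=42,   # assumption: fixed seed for repeatable results
)

print(len(x_tr), len(x_te))   # 9 3
print(sum(y_te))              # 1 (one spam message lands in the test split)
```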

## CountVectorizer: Builds a vocabulary of the unique words in a set of strings and uses those words as columns (features), with each message's word counts as the feature values.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()

x_train_count = v.fit_transform(x_train.values)
x_train_count.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
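
The output above is almost entirely zeros because each message contains only a handful of the vocabulary's words. To make the mapping from text to counts easier to see, here is a minimal sketch on two made-up messages (the strings and the name `vec` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Free prize now", "Call me now now"]   # hypothetical messages

vec = CountVectorizer()
counts = vec.fit_transform(docs)               # SciPy sparse matrix

# Vocabulary terms ordered by their column index (lowercased by default).
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(columns)            # ['call', 'free', 'me', 'now', 'prize']
print(counts.toarray())   # [[0 1 0 1 1]
                          #  [1 0 1 2 0]]

# Words not seen during fit (here "pizza") are dropped by transform.
print(vec.transform(["free pizza"]).toarray())   # [[0 1 0 0 0]]
```

Calling `toarray()` is convenient for inspection, but the classifier can consume the sparse matrix directly, which saves memory on a large vocabulary.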

## Train model and score accuracy.

In [18]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train_count, y_train)

MultinomialNB()
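
To give a sense of what `fit` learns, the sketch below recomputes the per-class word probabilities by hand on a hypothetical count matrix and compares them with the fitted model's `feature_log_prob_` and `class_log_prior_` attributes. It assumes the default smoothing `alpha=1.0` (add-one smoothing); the data are made up.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: rows are messages, columns are vocabulary words.
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([0, 0, 1, 1])

alpha = 1.0   # MultinomialNB's default smoothing
model = MultinomialNB(alpha=alpha).fit(X, y)

# Per-class word totals, smoothed, then normalised into probabilities.
word_totals = np.array([X[y == c].sum(axis=0) for c in model.classes_])
smoothed = word_totals + alpha
manual_log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

print(np.allclose(manual_log_prob, model.feature_log_prob_))     # True
# Class priors are the class frequencies in the training labels.
print(np.allclose(np.log([0.5, 0.5]), model.class_log_prior_))   # True
```

The smoothing term keeps a word that never appears in one class from forcing that class's probability to zero.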

In [20]:
x_test_count = v.transform(x_test)
model.score(x_test_count, y_test)

0.9820531227566404
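
`score` returns mean accuracy, i.e., the fraction of test messages labelled correctly. Since roughly 87% of the messages are ham (4825 of 5572), a model that always predicts ham would already score about 0.87, so it can help to also look at a confusion matrix. The sketch below shows both on a hypothetical miniature dataset; the texts and names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature dataset (1 = spam, 0 = ham).
train_texts = ["win a free prize now", "free entry to win cash",
               "are we meeting for lunch", "call me when you get home"]
train_labels = [1, 1, 0, 0]
test_texts = ["free cash prize", "lunch at home"]
test_labels = [1, 0]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

X_test = vec.transform(test_texts)
pred = nb.predict(X_test)

# Accuracy computed by hand matches what score() reports.
manual_accuracy = sum(p == t for p, t in zip(pred, test_labels)) / len(test_labels)
print(manual_accuracy == nb.score(X_test, test_labels))   # True

# Rows are true classes, columns are predicted classes (order: ham, spam).
cm = confusion_matrix(test_labels, pred, labels=[0, 1])
print(cm)
```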

## Use the model to predict labels for two example emails.

In [19]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)
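
`predict` returns only the most likely label. If the confidence behind each label is of interest, `predict_proba` gives one probability per class. The sketch below uses a hypothetical miniature dataset rather than the notebook's fitted model, and the probabilities from Naive Bayes are best read as rough scores rather than well-calibrated estimates.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature dataset (1 = spam, 0 = ham).
train_texts = ["win a free prize now", "free entry to win cash",
               "are we meeting for lunch", "call me when you get home"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

new_messages = ["free cash prize", "lunch at home"]
X_new = vec.transform(new_messages)
proba = nb.predict_proba(X_new)

print(nb.classes_)       # [0 1], so column 0 is ham and column 1 is spam
print(proba.round(3))    # one row per message, each row sums to 1

# Picking the most probable class reproduces predict().
labels_from_proba = nb.classes_[proba.argmax(axis=1)]
print(labels_from_proba)
```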

## You can use Pipeline to combine the CountVectorizer and MultinomialNB steps.

In [21]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [22]:
clf.fit(x_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [23]:
clf.score(x_test, y_test)

0.9820531227566404

In [24]:
clf.predict(emails)

array([0, 1], dtype=int64)
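
The pipeline reports the same score and predictions as the manual version because it runs the same two steps in order. A minimal sketch of that equivalence, using a hypothetical miniature dataset instead of `spam.csv`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature dataset (1 = spam, 0 = ham).
train_texts = ["win a free prize now", "free entry to win cash",
               "are we meeting for lunch", "call me when you get home"]
train_labels = [1, 1, 0, 0]
new_messages = ["free cash prize", "lunch at home"]

# Manual two-step version: vectorize, then classify.
vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
manual_pred = nb.predict(vec.transform(new_messages))

# Pipeline version: fit() fits and applies the vectorizer, then fits the classifier;
# predict() applies the fitted vectorizer before classifying.
pipe = Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_messages)

print((manual_pred == pipe_pred).all())   # True

# Fitted steps stay accessible by the names given in the pipeline.
vocab_size = len(pipe.named_steps["vectorizer"].vocabulary_)
print(vocab_size)
```

Besides saving a line or two, the pipeline takes raw strings as input, so there is no separate `transform` call to forget at prediction time.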