# Naive Bayes Classifier

- In this Python machine learning tutorial for beginners, we build an email spam classifier using the naive Bayes algorithm. As a warm-up, the notebook first fits sklearn's `GaussianNB` to the Titanic dataset. For the spam task, we use sklearn's `CountVectorizer` to convert email text into a matrix of word counts, then train an sklearn `MultinomialNB` classifier on that matrix.
- The model's accuracy with this approach comes out very high (about 99% on the test split in the run below). An sklearn `Pipeline` lets us chain the preprocessing step and the classifier behind one convenient API. At the end there is an exercise in which you classify the sklearn wine dataset using naive Bayes.
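
For orientation, the rule that a naive Bayes classifier applies can be written as below. The symbols $y$ (class) and $x_1, \dots, x_n$ (features) are notation assumed here, not taken from the notebook.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The "naive" part is the assumption that the features are conditionally independent given the class. `GaussianNB` models each $P(x_i \mid y)$ as a normal distribution, while `MultinomialNB` models word counts with a multinomial distribution.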

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  
df = pd.read_csv('titanic.csv')

In [64]:
df.drop(['PassengerId','Name','Ticket','Cabin', 'SibSp', 'Embarked', 'Parch'],axis=1,inplace=True)
df.head()

,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [65]:
X = df.drop('Survived',axis=1)
y = df['Survived']

In [66]:
# One-hot encode Sex, then keep only the `female` indicator (the `male` column is redundant)
dummies = pd.get_dummies(X.Sex)
X = pd.concat([X,dummies],axis=1)
X.drop(['Sex', 'male'], inplace=True, axis=1)
X.head()


,Pclass,Age,Fare,female
0,3,22.0,7.25,False
1,1,38.0,71.2833,True
2,3,26.0,7.925,True
3,1,35.0,53.1,True
4,3,35.0,8.05,False


In [67]:
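# Columns that contain missing values before imputation (only the last expression of a cell is displayed)
X.columns[X.isna().any()]
# Fill missing ages with the mean age. Assigning the result back avoids the chained
# in-place fillna, which newer pandas versions warn about and may not apply.
X['Age'] = X['Age'].fillna(X['Age'].mean())
X.columns[X.isna().any()]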

Index([], dtype='object')

In [68]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [69]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train,y_train)

In [70]:
model.score(X_test,y_test)

0.7597765363128491
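
To make the `GaussianNB` fit above less of a black box, here is a minimal sketch on hypothetical toy data (a single fare-like feature and two classes, not the Titanic rows). It shows that fitting amounts to estimating a mean per class for each feature, and that a new value is assigned to the class whose distribution explains it best.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: one numeric feature, two well-separated classes.
X_toy = np.array([[5.0], [7.0], [9.0], [50.0], [60.0], [70.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = GaussianNB()
toy_model.fit(X_toy, y_toy)

# theta_ holds the per-class feature means estimated during fit.
class_means = toy_model.theta_.ravel()
print(class_means)

# class_prior_ holds the class frequencies seen in the training data.
class_priors = toy_model.class_prior_
print(class_priors)

# Values close to a class mean are assigned to that class.
toy_pred = toy_model.predict(np.array([[8.0], [55.0]]))
print(toy_pred)
```

On the Titanic features the classes overlap far more than in this toy case, which is consistent with the modest score of about 0.76.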

In [71]:
df_spam = pd.read_csv('spam.csv')
df_spam.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [75]:
df_spam.groupby('Category').describe()

,Message,Message,Message,Message
Category,count,unique,top,freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [77]:
df_spam['spam'] = df_spam['Category'].apply(lambda x: 1 if x=='spam' else 0)
df_spam.drop(['Category'],axis=1,inplace=True)
df_spam.head()

,Message,spam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


In [78]:
X = df_spam['Message']
y = df_spam['spam']

In [79]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [94]:
from sklearn.feature_extraction.text import CountVectorizer
# Learn the vocabulary from the training messages only, then count word occurrences per message
v = CountVectorizer()
X_train_count = v.fit_transform(X_train)
# X_train_count.toarray()[:3]
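
To see what `CountVectorizer` produces, the sketch below runs it on a hypothetical three-message corpus (not rows from `spam.csv`). Each column corresponds to one vocabulary word and each row to one message; words never seen during `fit` are silently ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus used only for illustration.
toy_messages = ["free entry win prize", "call me later", "win free prize now"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; columns are in alphabetical order.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)

# One row per message, one column per vocabulary word.
dense_counts = toy_counts.toarray()
print(dense_counts)

# "unknown" and "word" are not in the vocabulary, so only "win" is counted.
unseen_counts = toy_vectorizer.transform(["unknown word win"]).toarray()
print(unseen_counts.sum())
```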

In [95]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)

In [96]:
# Reuse the vocabulary fitted on the training set; transform (not fit_transform) the test messages
X_test_count = v.transform(X_test)
model.score(X_test_count,y_test)

0.9919282511210762
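
Accuracy alone can look flattering here because the classes are imbalanced (4825 ham versus 747 spam in the `describe()` output above). The sketch below uses hypothetical hand-written labels, not real model output, to show how a confusion matrix, precision, and recall can reveal missed spam that a high accuracy hides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten messages (1 = spam, 0 = ham); not produced by the model above.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Nine of ten predictions are correct.
acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes (ham, spam); columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Precision: share of predicted spam that is really spam.
prec = precision_score(y_true_toy, y_pred_toy)

# Recall: share of real spam that was caught.
rec = recall_score(y_true_toy, y_pred_toy)

print(acc, prec, rec)
print(cm)
```

In the notebook itself, the same functions could be called with `y_test` and `model.predict(X_test_count)`.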

In [97]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [98]:
clf.fit(X_train,y_train)

In [99]:
clf.score(X_test,y_test)

0.9919282511210762
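
Because the pipeline bundles vectorizing and classification, it can score raw strings directly. The following is a minimal sketch on hypothetical toy messages standing in for `spam.csv`; the names `toy_clf`, `toy_messages`, and `new_messages` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training messages (1 = spam, 0 = ham).
toy_messages = [
    "win free prize now",
    "free entry win cash",
    "claim free prize",
    "call me later",
    "are we meeting later",
    "see you at lunch",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Same two steps as the pipeline above: count words, then fit multinomial naive Bayes.
toy_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])
toy_clf.fit(toy_messages, toy_labels)

# Raw text goes straight in; the pipeline applies the fitted vectorizer before predicting.
new_messages = ["free prize waiting", "see you later"]
toy_predictions = toy_clf.predict(new_messages)
print(toy_predictions)
```

With the notebook's fitted `clf`, the equivalent call would be `clf.predict([...])` on a list of message strings.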