# Naive Bayes

**Bernoulli Naive Bayes**: Assumes that every feature is binary, taking only two values. In text classification, for example, 0 can mean "the word does not occur in the document" and 1 can mean "the word occurs in the document".
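
As a rough illustration of the binary word-occurrence idea, the sketch below fits scikit-learn's `BernoulliNB` on a tiny corpus. The documents and labels are made up for this example and are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up toy corpus (assumption for illustration); 1 = spam, 0 = ham.
docs = ["win money now", "free money win", "meeting at noon", "lunch at noon today"]
labels = [1, 1, 0, 0]

# binary=True records only whether a word occurs (1) or not (0), not how often.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

model = BernoulliNB()
model.fit(X, labels)

# Classify an unseen message using the same vocabulary.
prediction = model.predict(vectorizer.transform(["win free money"]))
print(X.toarray())
print(prediction)
```

Every entry of the feature matrix is 0 or 1, which is the representation Bernoulli Naive Bayes expects.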

**Multinomial Naive Bayes**: Used for discrete data such as counts or frequencies (e.g. movie ratings from 1 to 5, where each rating occurs with a certain frequency). In text classification, the features are the counts of each word in a document, and these counts are used to predict the class or label.
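
To make the role of word counts concrete, here is a minimal from-scratch sketch of the multinomial scoring rule with add-one (Laplace) smoothing. The training sentences, the labels, and the helper names `log_posterior` and `classify` are assumptions made for this example. It is not how scikit-learn implements `MultinomialNB` internally, only the same idea written out by hand.

```python
import math
from collections import Counter

# Made-up toy training data (assumption for illustration).
train = [
    ("free prize free entry", "spam"),
    ("win free prize now", "spam"),
    ("see you at lunch", "ham"),
    ("lunch at noon then meeting", "ham"),
]

vocab = {word for text, _ in train for word in text.split()}
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text): log prior plus count-weighted log likelihoods."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word, n in Counter(text.split()).items():
        if word not in vocab:
            continue  # ignore words never seen in training
        p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += n * math.log(p)  # a word that occurs n times contributes n times
    return score

def classify(text):
    return max(word_counts, key=lambda label: log_posterior(text, label))

print(classify("free free prize"))
print(classify("lunch at noon"))
```

The factor `n` in the loop is what separates this from the Bernoulli variant: repeated words push the score further instead of being collapsed to a single 0/1 flag.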

**Gaussian Naive Bayes**: Assumes that each feature follows a normal distribution within each class, so it is used when the features are continuous. In the Iris dataset, for example, the features are sepal width, petal width, sepal length, and petal length. These measurements can take any value within a range, so they cannot be represented as occurrences or counts. The data is continuous, which is why Gaussian Naive Bayes is used here.
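
The Iris case mentioned above can be tried directly with the copy of the dataset bundled with scikit-learn. This is a minimal sketch; the 70/30 split and `random_state=0` are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Four continuous features: sepal length, sepal width, petal length, petal width.
X, y = load_iris(return_X_y=True)

# random_state=0 is an arbitrary choice so that the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = GaussianNB().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(X.shape, accuracy)
```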

# Naive Bayes (Predicting survival on the Titanic)

**Finding the probability of a particular event**

In [22]:
import pandas as pd

In [23]:
df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,,S


In [25]:
df.shape

(891, 10)

In [26]:
df.drop(['Name','SibSp','Parch','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [27]:
inputs = df.drop('Survived',axis='columns')
target = df.Survived

In [28]:
#inputs.Sex = inputs.Sex.map({'male': 1, 'female': 2})

dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0


In [29]:
inputs = pd.concat([inputs,dummies],axis='columns')   # join the two DataFrames column-wise
inputs.head(3)

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0


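Drop one of the two dummy columns to avoid the dummy variable trap: `female` and `male` always sum to 1, so one of them is redundant. The original `Sex` column is dropped as well, since it is now encoded by `female`.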

In [30]:
inputs.drop(['Sex','male'],axis='columns',inplace=True)
inputs.head(3)

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
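
For comparison, `pd.get_dummies` can drop the redundant column itself through its `drop_first` parameter. The sketch below uses a made-up four-row `Sex` column. Note that `drop_first=True` removes the alphabetically first category (`female`) and keeps `male`, the opposite of the choice made in the cell above; either column carries the same information.

```python
import pandas as pd

# Made-up sample column (assumption for illustration).
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Full one-hot encoding: the two columns always sum to 1 in every row.
full = pd.get_dummies(sex, dtype=int)

# drop_first=True keeps k-1 columns for k categories.
reduced = pd.get_dummies(sex, drop_first=True, dtype=int)

print(full)
print(reduced)
```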


In [31]:
inputs.columns[inputs.isna().any()]    # columns that contain at least one NA value

Index(['Age'], dtype='object')

In [32]:
inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [33]:
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
3,1,35.0,53.1,1
4,3,35.0,8.05,0
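
The effect of `fillna` with the column mean is easier to see on a tiny series. The values below are made up for this example; `Series.mean()` skips missing values by default, so the fill value is the mean of the known ages only. This is presumably why the value 29.699118 appears repeatedly in the `Age` column of later outputs.

```python
import pandas as pd

# Made-up ages with one missing entry (assumption for illustration).
age = pd.Series([22.0, 38.0, None, 26.0], name="Age")

mean_age = age.mean()          # NaN is skipped: (22 + 38 + 26) / 3
filled = age.fillna(mean_age)  # only the missing entry is replaced

print(mean_age)
print(filled.tolist())
```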


In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.3)
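
Because `train_test_split` shuffles randomly, the scores below will differ from run to run. A sketch of a repeatable split is shown here on made-up arrays; `random_state=42` and `stratify=y` are choices for this example, not settings used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features, 6 negatives and 4 positives.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 4)

def split():
    # random_state fixes the shuffle; stratify keeps class proportions roughly equal.
    return train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

X_train, X_test, y_train, y_test = split()
X_train2, X_test2, _, _ = split()

print(X_train.shape, X_test.shape)
print((X_test == X_test2).all())  # same rows on every call
```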

### GaussianNB

In [35]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [36]:
model.fit(X_train,y_train)

GaussianNB()

In [37]:
model.score(X_test,y_test)

0.7835820895522388

In [38]:
X_test[0:10]

Unnamed: 0,Pclass,Age,Fare,female
670,2,40.0,39.0,1
285,3,33.0,8.6625,0
421,3,21.0,7.7333,0
649,3,23.0,7.55,1
300,3,29.699118,7.75,1
158,3,29.699118,8.6625,0
885,3,39.0,29.125,1
533,3,29.699118,22.3583,1
722,2,34.0,13.0,0
477,3,29.0,7.0458,0


In [39]:
y_test[0:10]

670    1
285    0
421    0
649    1
300    1
158    0
885    0
533    1
722    0
477    0
Name: Survived, dtype: int64

In [40]:
model.predict(X_test[0:10])

array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0], dtype=int64)
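
Comparing these ten predictions with the ten true labels printed above shows where the model goes wrong. The sketch below copies both sequences from the notebook outputs and uses plain Python to find the mismatches.

```python
# Values copied from the y_test[0:10] and model.predict(X_test[0:10]) outputs above.
row_ids = [670, 285, 421, 649, 300, 158, 885, 533, 722, 477]
y_true = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]

# Rows where the prediction disagrees with the true label.
mismatches = [rid for rid, t, p in zip(row_ids, y_true, y_pred) if t != p]

# Accuracy on this small sample: fraction of matching labels.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(mismatches)  # [885]
print(accuracy)    # 0.9
```

Only row 885 is misclassified (predicted to survive, but did not), so accuracy on these ten rows is 0.9. `model.score` computes the same fraction over the whole test set.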

In [41]:
model.predict_proba(X_test[:10])     # predicted probability of not surviving (column 0) and surviving (column 1)

array([[0.20197138, 0.79802862],
       [0.96362492, 0.03637508],
       [0.95720855, 0.04279145],
       [0.39461774, 0.60538226],
       [0.41846687, 0.58153313],
       [0.96254347, 0.03745653],
       [0.40424059, 0.59575941],
       [0.40734085, 0.59265915],
       [0.92037948, 0.07962052],
       [0.9621664 , 0.0378336 ]])
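
To show roughly where such probabilities come from, the following from-scratch sketch applies Bayes' rule with a per-class normal density for each feature. The six (age, fare) rows and the helper names `fit`, `gaussian_pdf`, and `predict_proba` are made up for this example. scikit-learn's `GaussianNB` also adds a small smoothing term to the variances (`var_smoothing`), so its numbers would differ slightly.

```python
import math
from statistics import mean, pvariance

# Made-up toy data (assumption for illustration): (age, fare) and a 0/1 label.
X = [(22.0, 7.25), (35.0, 8.05), (40.0, 30.0), (38.0, 71.3), (26.0, 53.1), (30.0, 20.0)]
y = [0, 0, 0, 1, 1, 1]

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    params = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        params[label] = {
            "prior": len(rows) / len(X),
            "mean": [mean(c) for c in cols],
            "var": [pvariance(c) for c in cols],
        }
    return params

def predict_proba(params, row):
    """Posterior per class: prior times product of feature densities, normalized."""
    joint = {}
    for label, p in params.items():
        value = p["prior"]
        for x, mu, var in zip(row, p["mean"], p["var"]):
            value *= gaussian_pdf(x, mu, var)  # "naive" independence assumption
        joint[label] = value
    total = sum(joint.values())
    return [joint[label] / total for label in sorted(joint)]

params = fit(X, y)
proba = predict_proba(params, (28.0, 60.0))
print(proba)  # [P(class 0), P(class 1)], summing to 1
```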

**Calculate the score using cross-validation**

In [42]:
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train, y_train, cv=5)

array([0.768     , 0.816     , 0.792     , 0.7983871 , 0.70967742])
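
The five fold scores are usually summarized by their mean and spread. A small sketch using the values printed above:

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
scores = [0.768, 0.816, 0.792, 0.7983871, 0.70967742]

mean_score = mean(scores)
spread = stdev(scores)  # sample standard deviation across folds

print(f"{mean_score:.3f} +/- {spread:.3f}")
```

The mean is about 0.777, close to the single hold-out score obtained earlier, while the spread of a few percentage points shows how much the result depends on which rows land in each fold.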

# Naive Bayes (Spam Detection)

In [43]:
import pandas as pd

In [47]:
df = pd.read_csv("spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [48]:
df.groupby('Category').describe()            # summary (count, unique, top, freq) of Message for each category

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,640,Please call our customer service representativ...,4


In [99]:
df.Category.describe()

count     5572
unique       2
top        ham
freq      4825
Name: Category, dtype: object
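
These counts show that the classes are imbalanced, which matters when reading accuracy scores later. The arithmetic below uses only the numbers from the outputs above.

```python
# Counts taken from the describe() outputs above.
ham, spam = 4825, 747
total = ham + spam

# Accuracy of a trivial classifier that labels every message as ham.
baseline_accuracy = ham / total
spam_share = spam / total

print(total)
print(round(baseline_accuracy, 4), round(spam_share, 4))
```

A model that always answers "ham" would already be right about 87% of the time, so a useful spam classifier has to score well above that.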

In [49]:
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [50]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message,df.spam)

In [51]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
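
The full count matrix is mostly zeros and hard to read, so here is a sketch of what `CountVectorizer` produces on two made-up messages. Each column corresponds to one vocabulary word and each cell is the number of times that word occurs in the message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages (assumption for illustration).
messages = ["free free prize", "call me later"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # sparse matrix of word counts

# vocabulary_ maps each word to its column index; sort words by that index.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(words)             # ['call', 'free', 'later', 'me', 'prize']
print(counts.toarray())  # [[0 2 0 0 1], [1 0 1 1 0]]
```

`fit_transform` returns a sparse matrix. Calling `toarray()` is fine for a peek at a few rows, but converting the whole training set to a dense array can use a lot of memory.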

### MultinomialNB

In [52]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)

MultinomialNB()

In [53]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)

In [54]:
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9842067480258435

**Sklearn Pipeline**

In [55]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [56]:
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [57]:
clf.score(X_test,y_test)

0.9842067480258435

In [58]:
clf.predict(emails)

array([0, 1], dtype=int64)
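
The pipeline gives the same predictions as the manual two-step version because it runs the vectorizer's `transform` before the classifier on every call. The sketch below checks this on a made-up corpus; the documents and labels are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus; 1 = spam, 0 = ham.
docs = [
    "free prize free entry",
    "win free prize now",
    "see you at lunch",
    "lunch at noon then meeting",
]
labels = [1, 1, 0, 0]
new_docs = ["free prize", "meeting at lunch"]

# Manual version: vectorize, then fit and predict on the counts.
vectorizer = CountVectorizer()
manual_model = MultinomialNB().fit(vectorizer.fit_transform(docs), labels)
manual_pred = manual_model.predict(vectorizer.transform(new_docs))

# Pipeline version: raw text in, predictions out.
pipe = Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

print(manual_pred, pipe_pred)
```

Keeping both steps in one object also means the vocabulary is always learned from the training text only, which avoids accidentally fitting the vectorizer on test messages.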