# <span style="color:#0b486b">SIT 112 - Data Science Concepts</span>

---
Lecturer: Dinh Phung | dinh.phung@deakin.edu.au<br />
Assistant: Adham Beyki | abeyki@deakin.edu.au

School of Information Technology, <br />
Deakin University, VIC 3215, Australia.

---
## <span style="color:#0b486b">Practical Session 10: Naive Bayes Classifier</span> 

## <span style="color:#0b486b">Naive Bayes Classifier</span> 


Naive Bayes is one of the most practical machine learning algorithms for classification.

* fast
* good performance
* simple yet very effective
* robust to irrelevant features

So why is it called naive?

Because it ignores dependencies between features: it assumes that all features are independent of each other given the class, which is rarely true in practice. This is a naive assumption, hence the name.

Despite this naive assumption, the accuracy is often very good. A famous application of NB is spam filtering.

---
### Example 1

Suppose we have collected the data below over the past 5 days. Based on this data, can we predict whether our subject will play under conditions like:

    outlook  = overcast
    temp     = hot
    humidity = normal
    windy    = no

<img src="nb_data.png" width="800">
<br />

First we have to find a representation for our data. We can construct a dictionary to convert strings into numbers and then save them in a dataframe.

    outlook: sunny=0, overcast=1, rainy=2
    temp: hot=0, mild=1, cool=2
    humidity: normal=0, high=1
    wind: no=0, yes=1
    play: no=0, yes=1
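
The mapping above can be applied with ordinary Python dictionaries and `Series.map`. The sketch below is only an illustration: the string columns in `raw` are hypothetical, obtained by decoding the numeric `outlook` and `play` values used in the next cell, and the names `encoding`, `raw` and `encoded` are assumptions introduced here.

```python
import pandas as pd

# Encoding dictionaries copied from the mapping listed above (two columns only).
encoding = {
    'outlook': {'sunny': 0, 'overcast': 1, 'rainy': 2},
    'play': {'no': 0, 'yes': 1},
}

# Hypothetical string columns, decoded from the numeric data used in this session.
raw = pd.DataFrame({
    'outlook': ['sunny', 'overcast', 'rainy', 'sunny', 'overcast'],
    'play': ['yes', 'yes', 'no', 'no', 'no'],
})

# Series.map replaces each string with its integer code.
encoded = raw.copy()
for col, mapping in encoding.items():
    encoded[col] = raw[col].map(mapping)

print(encoded)
```

A string that is missing from the dictionary would be mapped to `NaN`, so it is worth checking the encoded columns for missing values on real data.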

In [1]:
from __future__ import division

import numpy as np
import pandas as pd

In [2]:
data = {
    'outlook': [0, 1, 2, 0, 1],
    'temp'   : [0, 1, 2, 1, 0],
    'humid'  : [0, 0, 1, 0, 1],
    'wind'   : [0, 0, 1, 1, 0],
    'play'   : [1, 1, 0, 0, 0]
}

df = pd.DataFrame(data)

In [3]:
df

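|   | humid | outlook | play | temp | wind |
|---|-------|---------|------|------|------|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 |
| 2 | 1 | 2 | 0 | 2 | 1 |
| 3 | 0 | 0 | 0 | 1 | 1 |
| 4 | 1 | 1 | 0 | 0 | 0 |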


Now we use Bayes' rule to construct a Naive Bayes classifier. Writing $p$ for play and $o$, $t$, $h$, $w$ for outlook, temp, humidity and wind, we have:

$$Pr\left(p|o,t,h,w\right)\propto Pr\left(p\right)Pr(o|p)Pr(t|p)Pr(h|p)Pr(w|p)$$
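
For reference, the proportionality comes from Bayes' rule combined with the naive independence assumption. The two steps below are the standard derivation, written in the same notation, rather than something spelled out in this session:

$$Pr(p|o,t,h,w) = \frac{Pr(p)\,Pr(o,t,h,w|p)}{Pr(o,t,h,w)}$$

$$Pr(o,t,h,w|p) = Pr(o|p)\,Pr(t|p)\,Pr(h|p)\,Pr(w|p) \quad \text{(naive assumption)}$$

The denominator $Pr(o,t,h,w)$ is the same for every value of $p$, so it can be dropped when comparing classes and recovered afterwards by normalizing the scores so that they sum to one.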

To calculate $Pr(p)$ we use the marginal probability.

In [4]:
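def marginal_prob(df, col):
    """Return {value: relative frequency} for column `col` of `df`."""
    # count how often each distinct value occurs in the column
    ll = [(ss, (df[col] == ss).sum()) for ss in set(df[col])]
    total_count = sum(b for a, b in ll)

    # divide each count by the total (true division via __future__)
    ll2 = [(a, b / total_count) for a, b in ll]
    return dict(ll2)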

In [6]:
marginal_prob(df, 'wind')

{0: 0.59999999999999998, 1: 0.40000000000000002}
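
As a cross-check, pandas can compute the same relative frequencies directly with `Series.value_counts(normalize=True)`. This is an alternative sketch, not the function used in this session; it rebuilds only the `wind` column so that it runs on its own, and the name `wind_prob` is introduced here.

```python
import pandas as pd

# The 'wind' column from the data above.
df = pd.DataFrame({'wind': [0, 0, 1, 1, 0]})

# normalize=True returns relative frequencies instead of raw counts.
wind_prob = df['wind'].value_counts(normalize=True).to_dict()
print(wind_prob)
```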

To calculate the probability of a feature given the class (play), we use the conditional probability.

In [7]:
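def conditional_prob(df, f, c, val):
    """Return {value: Pr(f = value | c = val)} estimated from `df`."""
    # every value the feature takes anywhere in the data
    states = set(df[f])
    # feature column restricted to the rows where the class equals `val`
    df2 = df[df[c] == val][f]
    # count each feature value within that class
    ll = [(ss, (df2 == ss).sum()) for ss in states]
    total_count = sum(b for a, b in ll)

    # normalize the counts to probabilities (true division via __future__)
    ll2 = [(a, b / total_count) for a, b in ll]
    return dict(ll2)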

In [9]:
conditional_prob(df, 'wind', 'play', 1)

{0: 1.0, 1: 0.0}
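
The full table of conditional probabilities for one feature can also be obtained in a single call with `pd.crosstab` and `normalize='index'`. This is an alternative sketch rather than the approach used in this session; it rebuilds the two relevant columns so that it runs on its own, and the name `cond` is introduced here.

```python
import pandas as pd

# The 'wind' and 'play' columns from the data above.
df = pd.DataFrame({
    'wind': [0, 0, 1, 1, 0],
    'play': [1, 1, 0, 0, 0],
})

# Rows are classes (play), columns are feature values (wind).
# normalize='index' divides each row by its row total, giving Pr(wind | play).
cond = pd.crosstab(df['play'], df['wind'], normalize='index')
print(cond)
```

The row for `play = 1` contains a zero: no windy day with play = yes was observed, so any query with `wind = 1` would receive a score of exactly zero for that class.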

Now we can apply Bayes' rule to the query (outlook = overcast, temp = hot, humidity = normal, wind = no):

In [10]:
# query: outlook = overcast, temp = hot, humidity = normal, wind = no
o = 1
t = 0
h = 0
w = 0

# unnormalized score for play = no
c = 0
p0 = (marginal_prob(df, 'play')[c]
      * conditional_prob(df, 'outlook', 'play', c)[o]
      * conditional_prob(df, 'temp', 'play', c)[t]
      * conditional_prob(df, 'humid', 'play', c)[h]
      * conditional_prob(df, 'wind', 'play', c)[w])

# unnormalized score for play = yes
c = 1
p1 = (marginal_prob(df, 'play')[c]
      * conditional_prob(df, 'outlook', 'play', c)[o]
      * conditional_prob(df, 'temp', 'play', c)[t]
      * conditional_prob(df, 'humid', 'play', c)[h]
      * conditional_prob(df, 'wind', 'play', c)[w])

In [12]:
p0, p1

(0.0074074074074074051, 0.10000000000000001)
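
These two values can be checked by hand from the five rows of the data (three with play = 0, two with play = 1). Each factor is the prior followed by the conditional probabilities of outlook, temp, humidity and wind for the query:

$$p_0 = \frac{3}{5}\cdot\frac{1}{3}\cdot\frac{1}{3}\cdot\frac{1}{3}\cdot\frac{1}{3} = \frac{1}{135} \approx 0.0074$$

$$p_1 = \frac{2}{5}\cdot\frac{1}{2}\cdot\frac{1}{2}\cdot 1\cdot 1 = \frac{1}{10}$$

Their sum is $\frac{29}{270}$, so dividing each score by it gives $\frac{2}{29} \approx 0.069$ for not playing and $\frac{27}{29} \approx 0.931$ for playing.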

In [13]:
# normalizing
p_sum = p0 + p1
p0 /= p_sum
p1 /= p_sum

print("probability of not playing: {}".format(p0))
print("probability of playing    : {}".format(p1))

probability of not playing: 0.0689655172414
probability of playing    : 0.931034482759
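
The prior, the per-feature conditionals and the normalization can be wrapped into one function that loops over the classes and the query features. The sketch below is a compact restatement under the same assumptions (no smoothing, categorical features); `naive_bayes_posterior`, `query` and `posterior` are names introduced here, not part of the session's code.

```python
import pandas as pd

# Same encoded data as in the cells above.
df = pd.DataFrame({
    'outlook': [0, 1, 2, 0, 1],
    'temp':    [0, 1, 2, 1, 0],
    'humid':   [0, 0, 1, 0, 1],
    'wind':    [0, 0, 1, 1, 0],
    'play':    [1, 1, 0, 0, 0],
})


def naive_bayes_posterior(df, query, target):
    """Return {class: Pr(class | query)} under the naive independence assumption."""
    scores = {}
    for cls in df[target].unique().tolist():
        rows = df[df[target] == cls]
        score = len(rows) / float(len(df))  # prior Pr(class)
        for feature, value in query.items():
            # fraction of this class's rows where the feature has the queried value
            score *= float((rows[feature] == value).mean())
        scores[cls] = score
    total = sum(scores.values())  # assumes at least one class has a nonzero score
    return {cls: s / total for cls, s in scores.items()}


query = {'outlook': 1, 'temp': 0, 'humid': 0, 'wind': 0}
posterior = naive_bayes_posterior(df, query, 'play')
print(posterior)
```

For this query the function reproduces the normalized probabilities printed above. If every class scores zero, the normalization divides by zero, which is one reason practical implementations usually add smoothing.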


---
### Example 2

Suppose we have the documents below as our training set.

    d1: Chinese Beijing Chinese , class = C
    d2: Chinese Chinese Shanghai, class = C
    d3: Chinese Macao           , class = C
    d4: Tokyo Japan Chinese     , class = J


Train an NB classifier and predict whether `d5` belongs to class C or J.

    d5: Chinese Chinese Chinese Tokyo Japan, class = ?
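
One possible way to approach this exercise is a multinomial Naive Bayes model over word counts. The sketch below is not the session's official solution: it assumes whitespace tokenization and add-one (Laplace) smoothing, neither of which is specified in the exercise, and the names `train`, `score`, `scores` and `prediction` are introduced here.

```python
from collections import Counter

# Training documents and the query document from the example above.
train = [
    ("Chinese Beijing Chinese", "C"),
    ("Chinese Chinese Shanghai", "C"),
    ("Chinese Macao", "C"),
    ("Tokyo Japan Chinese", "J"),
]
d5 = "Chinese Chinese Chinese Tokyo Japan"

# Count documents per class and word occurrences per class.
doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in doc_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = set(word for counts in word_counts.values() for word in counts)


def score(text, label, alpha=1.0):
    """Unnormalized Pr(label | text); alpha is the assumed smoothing constant."""
    total_words = sum(word_counts[label].values())
    p = doc_counts[label] / float(len(train))  # class prior
    for word in text.split():
        # add-alpha smoothed estimate of Pr(word | label)
        p *= (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
    return p


scores = {label: score(d5, label) for label in doc_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Under these assumptions the score for C is about 0.0003 and the score for J is about 0.0001, so `d5` is assigned to class C. Without smoothing (`alpha = 0`), `Tokyo` and `Japan` never occur in class C, so the C score collapses to zero and the prediction flips to J.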