## Naive Bayes Classifier

Naive Bayes:
* calculates P(C|X), i.e. the probability that a data point X belongs to class C
* Bayes' Rule gives P(C|X) = P(X|C)P(C) / P(X)
* simplification 1: drop P(X) because:
    * we don't need the actual probabilities, we just want to know the most likely class for a data point X
    * P(X) is hard to calculate and is the same for every class, so it only scales the scores and never changes which class wins
* simplifying assumption 2: assume the features are independent of each other given the class
    * in practice this means the likelihood factorises into a product: P(X|C) = P(x1|C) \* P(x2|C) \* ... \* P(xn|C), where x1, x2 .. xn are the different variables/features
* the final score for class c1 is therefore P(C=c1|X) ∝ P(x1|C=c1) \* P(x2|C=c1) \* ... \* P(xn|C=c1) \* P(C=c1)
* P(C=c1) is the **prior**, which can either be assumed to be uniform or estimated from the distribution of classes in the dataset
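
Written out step by step, Bayes' Rule plus the two simplifications lead to the decision rule below. This is only a sketch of the derivation; the symbol \hat{c} for the predicted class is notation introduced here and does not appear elsewhere in these notes.

```latex
\begin{align*}
P(C = c \mid X) &= \frac{P(X \mid C = c)\, P(C = c)}{P(X)} \\
                &\propto P(X \mid C = c)\, P(C = c)
                  && \text{(drop } P(X)\text{, which is the same for every class)} \\
                &= P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c)
                  && \text{(features independent given the class)} \\
\hat{c} &= \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c)
\end{align*}
```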

In [1]:
import pandas as pd
import numpy as np

#### Step 1 - calculate class probabilities
* figure out P(C=c1), P(C=c2) ... P(C=cn)
* or use a passed-in prior
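
A minimal sketch of Step 1, assuming the class labels arrive as a NumPy array. The function name `class_priors` and its optional `prior` argument are hypothetical names chosen for illustration.

```python
import numpy as np

def class_priors(targets, prior=None):
    """Return {class label: P(C=label)}.

    `prior` is a hypothetical optional dict of user-supplied weights;
    when omitted, priors are the relative class frequencies in `targets`.
    """
    if prior is not None:
        total = float(sum(prior.values()))
        # Normalise so the supplied prior sums to 1.
        return {label: p / total for label, p in prior.items()}
    labels, counts = np.unique(targets, return_counts=True)
    n = len(targets)
    # Convert NumPy scalars to plain Python types for readable keys/values.
    return {label.item(): float(count) / n for label, count in zip(labels, counts)}

classes = np.array([1, 1, 2, 2, 1])
priors = class_priors(classes)                        # estimated from the data
uniform = class_priors(classes, prior={1: 1, 2: 1})   # passed-in uniform prior
print(priors, uniform)
```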

#### Step 2 - calculate feature probabilities
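* for each class and each feature, create a probability distribution (a Gaussian) using the mean and std of that feature's values within the class
* OR for categorical variables create a discrete probability distribution (p.d.) from the value frequencies within the class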
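
One possible way to sketch Step 2 with pandas is shown below. The function name `fit_features`, the `categorical` argument, the tuple layout of the returned parameters, and the toy data are all assumptions made up for this illustration.

```python
import pandas as pd

def fit_features(df, classes, categorical=()):
    """Return {class: {feature: parameters}}.

    Numeric features map to ("gaussian", mean, std); features listed in
    `categorical` map to ("discrete", {value: probability}).
    """
    classes = pd.Series(list(classes), index=df.index)
    model = {}
    for label in classes.unique().tolist():
        rows = df[classes == label]  # rows belonging to this class
        params = {}
        for col in df.columns:
            if col in categorical:
                # Relative frequency of each value within this class.
                freqs = rows[col].value_counts(normalize=True).to_dict()
                params[col] = ("discrete", freqs)
            else:
                # ddof=0 is the population std; a std of 0 (a feature that is
                # constant within a class) would need special handling.
                params[col] = ("gaussian", float(rows[col].mean()),
                               float(rows[col].std(ddof=0)))
        model[label] = params
    return model

# Made-up toy data: one numeric and one categorical feature.
toy = pd.DataFrame({
    "x1": [100, 110, 200, 210, 120],
    "colour": ["red", "red", "blue", "blue", "blue"],
})
model = fit_features(toy, classes=[1, 1, 2, 2, 1], categorical=("colour",))
print(model[1]["x1"])
print(model[1]["colour"])
```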

#### Step 3 - ready to process values
* for each example to be classified:
    * for each class:
        * for each feature check whether categorical or not
            * if categorical, use the discrete p.d. to get a probability
            * otherwise use the Gaussian equation to get a probability
        * multiply together probabilities for all features
        * multiply by the class probability and store the value
    * choose the class with the highest value (if two or more classes tie, pick one of them at random)
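
The loop described in Step 3 could look roughly like the sketch below. The function `predict_one`, the `("gaussian", mean, std)` / `("discrete", {...})` parameter layout, and the hand-written priors and model are hypothetical, chosen only to make the example self-contained.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Normal density at x; assumes std > 0."""
    var = float(std) ** 2
    return math.exp(-(float(x) - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict_one(example, priors, model, rng=random):
    """Score every class for one example and return (chosen class, scores)."""
    scores = {}
    for label, prior in priors.items():
        score = prior  # start from the class probability
        for feature, params in model[label].items():
            if params[0] == "discrete":
                # Values never seen for this class get probability 0 here.
                score *= params[1].get(example[feature], 0.0)
            else:
                _, mean, std = params
                score *= gaussian_pdf(example[feature], mean, std)
        scores[label] = score
    best = max(scores.values())
    # Break ties between equally scored classes at random.
    winners = [label for label, s in scores.items() if s == best]
    return rng.choice(winners), scores

# Hand-written, made-up parameters for two classes.
priors = {1: 0.6, 2: 0.4}
model = {
    1: {"x1": ("gaussian", 110.0, 10.0),
        "colour": ("discrete", {"red": 0.67, "blue": 0.33})},
    2: {"x1": ("gaussian", 205.0, 10.0),
        "colour": ("discrete", {"blue": 1.0})},
}
label, scores = predict_one({"x1": 150, "colour": "red"}, priors, model)
print(label, scores)
```

With many features the product of small probabilities can underflow to 0, so summing log-probabilities instead of multiplying is a common variation on this loop.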

In [6]:
targets = np.array([2,3,1,1,2,3,2,3,74,75,2,100,1,2,3,2,2,2,1])
# np.unique returns the sorted distinct class labels
for t in np.unique(targets):
    print(t)

1
2
3
74
75
100


In [10]:
import scipy.stats
import math

_mean = 2
_std = 1
x = 0.5

var = float(_std)**2
# Gaussian density: exp(-(x - mean)^2 / (2*var)) / sqrt(2*pi*var)
denom = (2*math.pi*var)**.5
num = math.exp(-(float(x)-float(_mean))**2/(2*var))
print('Manual formula: {}'.format(num/denom))
print('Scipy: {}'.format(scipy.stats.norm(_mean,_std).pdf(x)))

Manual formula: 0.12951759566589174
Scipy: 0.12951759566589174
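
The same formula can be applied to all features of an example at once with NumPy. This is a sketch only: the function name `gaussian_pdf` and the per-feature means and stds are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Normal density evaluated elementwise; inputs may be scalars or arrays."""
    x, mean, std = (np.asarray(a, dtype=float) for a in (x, mean, std))
    var = std ** 2
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Same check as the cell above: x=0.5 under a Gaussian with mean 2, std 1.
single = gaussian_pdf(0.5, 2, 1)

# One example with three features, each with its own (made-up) mean and std
# for a given class.
likelihoods = gaussian_pdf([150, 3, 111], mean=[140, 5, 100], std=[20, 2, 15])
joint = likelihoods.prod()  # product over features, as in the naive assumption
print(single, likelihoods, joint)
```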


In [15]:
df = pd.DataFrame()
df['x1'] = np.random.randint(1,200,100)
df['x2'] = np.random.randint(1,200,100)
df['x3'] = np.random.randint(1,200,100)
df['y'] =  np.random.randint(1,5,100)

x = df[['x1', 'x2', 'x3']]

In [26]:
for c in x.columns:
    print(x[c].mean())

print(x.iloc[:,0].mean())

115.41
98.95
108.06
115.41


In [31]:
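# Collect one value per row with a row-wise apply;
# df['x1'].tolist() would give the same list more directly
l = []
df.apply(lambda x: l.append(x['x1']), axis=1)
print(l)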

[113, 164, 27, 160, 114, 119, 73, 86, 92, 75, 176, 101, 167, 57, 51, 108, 168, 187, 82, 96, 132, 136, 107, 91, 174, 61, 127, 124, 171, 181, 176, 137, 103, 158, 29, 123, 75, 87, 131, 110, 29, 152, 12, 164, 104, 195, 105, 97, 180, 167, 132, 71, 35, 172, 141, 153, 179, 46, 69, 49, 84, 138, 107, 66, 140, 143, 113, 90, 153, 163, 83, 128, 34, 183, 61, 151, 190, 10, 30, 191, 161, 81, 182, 99, 164, 174, 191, 194, 78, 143, 106, 77, 147, 109, 92, 180, 20, 21, 90, 73]


In [35]:
test = pd.DataFrame()
test['x1'] = [150]
test['x2'] = [3]
test['x3'] = [111]
print(test.head())

predictions = []
test.apply(lambda x: predictions.append(x['x1']), axis=1)
predictions

    x1  x2   x3
0  150   3  111


[150]

In [41]:
df = pd.DataFrame()
df['x1'] = [100,100,200,200,100]
df['x2'] = [5,5,10,10,5]
df['x3'] = [4,4,8,8,4]

classes = [1,1,2,2,1]

# Boolean mask selecting the rows that belong to class 2
idx = [x==2 for x in classes]
df.iloc[idx]

    x1  x2  x3
2  200  10   8
3  200  10   8
