# Naive Bayes Document Classification

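To classify a document we need to compute two kinds of values: the class priors and the conditional probability of each word given a class.

Priors:

$P(c)=\frac{N_c}{N}$
where $N_c$ is the number of documents in class $c$ and $N$ is the total number of documents.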
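
As a rough illustration of the prior formula, the sketch below counts class labels and divides by the number of documents. The `labels` list is hypothetical: it is chosen only so that the resulting priors agree with the values 1/4, 3/8 and 3/8 used later in this notebook, and `class_priors` is an illustrative helper name, not part of the original code.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical class labels, one per training document (assumed, not the
# original data): 2 vegetable, 3 flower and 3 fruit documents.
labels = ["vegetable"] * 2 + ["flower"] * 3 + ["fruit"] * 3


def class_priors(labels):
    """Return P(c) = N_c / N for every class c that occurs in labels."""
    counts = Counter(labels)   # N_c for each class
    total = len(labels)        # N
    return {c: Fraction(n, total) for c, n in counts.items()}


priors = class_priors(labels)
for c, prior in priors.items():
    print(c, prior)
```

Exact fractions are used here only to make the comparison with 1/4 and 3/8 obvious; plain floats would work equally well.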

Next we calculate the conditional probability of each word given each class. For this exercise we do not compute it for every word in the vocabulary, only for the words that appear in D1 and D2.

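We use additive smoothing:
$$P(w|c)=\frac{count(w,c)+\lambda}{count(c)+|V|\cdot \lambda}$$
with $\lambda = 0.1$, where $count(w,c)$ is the number of times word $w$ occurs in documents of class $c$, $count(c)$ is the total number of words in class $c$, and $|V|$ is the vocabulary size.

Example calculation:

$P(rose|vegetable)=\frac{0+0.1}{8+7\cdot 0.1}=\frac{0.1}{8.7}\approx 0.0115$

The remaining calculations are carried out in the code below.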
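
The following sketch is a sanity check on the smoothing formula rather than part of the exercise. It uses made-up word counts for a single class (not the training data of this notebook) and the hypothetical helper name `smoothed` to show two properties: an unseen word still gets a small non-zero probability, and the smoothed estimates over the whole vocabulary sum to 1.

```python
import math


def smoothed(count_wc, count_c, vocab_size, lam=0.1):
    """(count(w,c) + lambda) / (count(c) + |V| * lambda)."""
    return (count_wc + lam) / (count_c + vocab_size * lam)


# Made-up counts for one class; the vocabulary is assumed to be these 4 words.
counts = {"rose": 3, "lily": 0, "apple": 2, "carrot": 0}
count_c = sum(counts.values())   # total number of words in the class
vocab_size = len(counts)         # |V|

probs = {w: smoothed(n, count_c, vocab_size) for w, n in counts.items()}

# Words never seen in the class keep a small non-zero probability.
print(probs["lily"])
# The estimates still form a probability distribution over the vocabulary.
print(math.isclose(sum(probs.values()), 1.0))
```

The sum is 1 only when $count(c)$ is the sum of the counts over the same vocabulary that defines $|V|$, which is how the made-up example is set up.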

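We then find the most probable class for a document, where $c$ is a class, $d$ is a document and $d_i$ is the $i$-th of its $n$ words, using
$P(c|d) \propto P(c) \cdot \prod_{i=1}^{n}{P(d_i|c)}$

The normalizing constant $P(d)$ is the same for every class, so it can be dropped when comparing classes.

Example calculation:
$P(flower|D1) \propto P(flower) \cdot P(rose|flower) \cdot P(lily|flower) \cdot P(apple|flower) \cdot P(carrot|flower)$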
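
Products of many small probabilities can underflow for longer documents, so a common variant sums logarithms instead. The sketch below is one way to do that, not the method used in this notebook: `log_score` is a hypothetical helper, while `p` and the counts for D1 under the flower class are copied from the code cell that follows.

```python
import math


def p(wc, c, v, l=0.1):
    """Smoothed P(w|c), same definition as in the notebook cell."""
    return (wc + l) / (c + v * l)


def log_score(prior, word_probs):
    """Return log P(c) + sum_i log P(d_i|c)."""
    return math.log(prior) + sum(math.log(q) for q in word_probs)


# D1 under the flower class: rose, lily, apple, carrot.
d1_flower_log = log_score(
    3 / 8, [p(6, 13, 7), p(1, 13, 2), p(0, 13, 2), p(0, 13, 2)]
)

# Exponentiating recovers the plain product up to floating-point rounding.
d1_flower = math.exp(d1_flower_log)
print("log score:", d1_flower_log)
print("product:  ", d1_flower)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product.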



In [84]:
def p(wc, c, v, l=0.1):
    """Smoothed P(w|c): wc = count(w,c), c = count(c), v = |V|, l = lambda."""
    return (wc + l)/(c + v * l)

P={}

P[('rose', 'vegetable')] = p(0, 8, 7)
P[('lily', 'vegetable')] = p(0, 8, 2)
P[('apple', 'vegetable')] = p(0, 8, 2)
P[('carrot', 'vegetable')] = p(1, 8, 2)

P[('rose', 'flower')] = p(6, 13, 7)
P[('lily', 'flower')] = p(1, 13, 2)
P[('apple', 'flower')] = p(0, 13, 2)
P[('carrot', 'flower')] = p(0, 13, 2)

P[('rose', 'fruit')] = p(1, 14, 7)
P[('lily', 'fruit')] = p(1, 14, 2)
P[('apple', 'fruit')] = p(2, 14, 2)
P[('carrot', 'fruit')] = p(1, 14, 2)

# Priors P(c)
P['vegetable'] = 1/4
P['flower'] = 3/8
P['fruit'] = 3/8

D1_flower = P['flower']*P[('rose', 'flower')]*P[('lily', 'flower')]*P[('apple', 'flower')]*P[('carrot', 'flower')]
print("D1_flower", D1_flower)
D1_fruit = P['fruit']*P[('rose', 'fruit')]*P[('lily', 'fruit')]*P[('apple', 'fruit')]*P[('carrot', 'fruit')]
print("D1_fruit", D1_fruit)
D1_vegetable = P['vegetable']*P[('rose', 'vegetable')]*P[('lily', 'vegetable')]*P[('apple', 'vegetable')]*P[('carrot', 'vegetable')]
print("D1_vegetable", D1_vegetable)

D1_flower 7.985671244629444e-07
D1_fruit 2.490268929586247e-05
D1_vegetable 5.732867232465228e-08


We take the argmax of these values and find that fruit is the most probable class, so D1 is classified as fruit.
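
The argmax step can be sketched in a couple of lines. The scores are the three values printed above; the dictionary name `scores_d1` and the optional normalization into posteriors are additions for illustration only.

```python
# Unnormalized scores for D1, copied from the output above.
scores_d1 = {
    "flower": 7.985671244629444e-07,
    "fruit": 2.490268929586247e-05,
    "vegetable": 5.732867232465228e-08,
}

# argmax over classes: the key whose score is largest.
best_class = max(scores_d1, key=scores_d1.get)
print(best_class)

# Optional: dividing by the sum turns the scores into posteriors P(c|D1).
total = sum(scores_d1.values())
posteriors = {c: s / total for c, s in scores_d1.items()}
print(posteriors)
```

Normalizing does not change which class wins; it only makes the scores easier to read as probabilities.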

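We repeat the calculation for D2. The word pea is counted twice in D2, hence the squared term in the products below.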

In [86]:
P[('pea', 'vegetable')] = p(2, 8, 3)
P[('lotus', 'vegetable')] = p(1, 8, 2)
P[('grape', 'vegetable')] = p(0, 8, 2)

P[('pea', 'flower')] = p(1, 13, 3)
P[('lotus', 'flower')] = p(0, 13, 2)
P[('grape', 'flower')] = p(0, 13, 2)

P[('pea', 'fruit')] = p(0, 14, 3)
P[('lotus', 'fruit')] = p(1, 14, 2)
P[('grape', 'fruit')] = p(2, 14, 2)

D2_flower = P['flower']*(P[('pea', 'flower')]**2)*P[('lotus', 'flower')]*P[('grape', 'flower')]
print("D2_flower", D2_flower)
D2_fruit = P['fruit']*(P[('pea', 'fruit')]**2)*P[('lotus', 'fruit')]*P[('grape', 'fruit')]
print("D2_fruit", D2_fruit)
D2_vegetable = P['vegetable']*(P[('pea', 'vegetable')]**2)*P[('lotus', 'vegetable')]*P[('grape', 'vegetable')]
print("D2_vegetable", D2_vegetable)

D2_flower 1.47219552641001e-07
D2_fruit 2.1008472857159783e-07
D2_vegetable 2.618107011591733e-05


Taking the argmax again, we find that D2 is classified as vegetable.
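
The hand-written products can be replaced by a loop over the words of a document. The sketch below is an illustrative rewrite, not the original code: `classify`, `priors`, `cond` and `d2_tokens` are hypothetical names, and the token list for D2 is reconstructed from the products in the cell above (pea twice, lotus and grape once each). The counts passed to `p` are copied unchanged from that cell.

```python
def p(wc, c, v, l=0.1):
    """Smoothed P(w|c), same definition as in the notebook cell."""
    return (wc + l) / (c + v * l)


priors = {"vegetable": 1 / 4, "flower": 3 / 8, "fruit": 3 / 8}

# Conditional probabilities for the words of D2, keyed by (word, class).
cond = {
    ("pea", "vegetable"): p(2, 8, 3),
    ("lotus", "vegetable"): p(1, 8, 2),
    ("grape", "vegetable"): p(0, 8, 2),
    ("pea", "flower"): p(1, 13, 3),
    ("lotus", "flower"): p(0, 13, 2),
    ("grape", "flower"): p(0, 13, 2),
    ("pea", "fruit"): p(0, 14, 3),
    ("lotus", "fruit"): p(1, 14, 2),
    ("grape", "fruit"): p(2, 14, 2),
}

# Assumed token list for D2; a repeated word simply appears twice.
d2_tokens = ["pea", "pea", "lotus", "grape"]


def classify(tokens, priors, cond):
    """Return (best class, scores) with score = P(c) * prod_i P(d_i|c)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in tokens:
            score *= cond[(w, c)]
        scores[c] = score
    return max(scores, key=scores.get), scores


label, scores = classify(d2_tokens, priors, cond)
print(label)
print(scores)
```

Looping over tokens handles repeated words without writing explicit powers, and the same function could be reused for D1 given its token list and conditional probabilities.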