# Naive Bayesian Network 

## Introduction
Let's classify some spam emails!




## Definitions

$$
P(SPAM=1) := \frac{\texttt{# of emails that are spam}}{\texttt{# of emails in the data set}}
$$

$$
P(SPAM=1\mid OFFER=1) := \frac{\texttt{# of emails that are spam and contain the word "offer"}}{\texttt{# of emails that contain the word "offer"}}
$$
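
As a minimal sketch of the two definitions above, the snippet below plugs in the counts printed in the notebook output at the end of this document. The helper names `prior_spam` and `posterior_spam_given_word` are made up for illustration and are not part of the `naive_bayes` package. The sketch also assumes the per-word counts are numbers of emails containing the word, not numbers of occurrences.

```python
# Counts copied from the notebook output at the end of this document.
SPAM_COUNT = 1500        # emails labeled spam
HAM_COUNT = 3672         # emails labeled ham (not spam)
OFFER_SPAM_COUNT = 141   # spam emails containing "offer" (assumed per-email count)
OFFER_HAM_COUNT = 61     # ham emails containing "offer" (assumed per-email count)


def prior_spam(spam_count, ham_count):
    """P(SPAM=1): fraction of all emails that are spam."""
    return float(spam_count) / (spam_count + ham_count)


def posterior_spam_given_word(word_spam_count, word_ham_count):
    """P(SPAM=1 | WORD=1): fraction of emails containing the word that are spam."""
    return float(word_spam_count) / (word_spam_count + word_ham_count)


p_spam = prior_spam(SPAM_COUNT, HAM_COUNT)
p_spam_given_offer = posterior_spam_given_word(OFFER_SPAM_COUNT, OFFER_HAM_COUNT)

print("P(SPAM=1)           = %.4f" % p_spam)
print("P(SPAM=1 | OFFER=1) = %.4f" % p_spam_given_offer)
```

With these counts the prior is roughly 0.29 and the conditional probability roughly 0.70, so seeing the word *offer* makes spam noticeably more likely in this data set.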

## Hypothesis

$$
P(OFFER =1 \mid SPAM=1)  > P(OFFER = 1 \mid SPAM=0)
$$

That is, the word *offer* is more likely to appear in a spam email than in a non-spam (ham) email. If this holds, then seeing the word *offer* raises the probability that the email is spam above its prior probability:

$$
P(SPAM=1 \mid OFFER=1) > P(SPAM = 1)
$$
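
One way to see why this inference holds is sketched below, under the added assumptions that $0 < P(SPAM=1) < 1$ and $P(OFFER=1) > 0$. By the law of total probability and the hypothesis,

$$ \begin{align}
P(OFFER=1) &= P(SPAM=1)P(OFFER=1 \mid SPAM=1) + P(SPAM=0)P(OFFER=1 \mid SPAM=0) \\
&< P(SPAM=1)P(OFFER=1 \mid SPAM=1) + P(SPAM=0)P(OFFER=1 \mid SPAM=1) \\
&= P(OFFER=1 \mid SPAM=1)
\end{align}
$$

The last step uses $P(SPAM=1) + P(SPAM=0) = 1$. Dividing by $P(OFFER=1)$ shows that the ratio of the likelihood to the evidence exceeds 1, so Bayes' rule gives

$$
P(SPAM=1 \mid OFFER=1) = P(SPAM=1)\,\frac{P(OFFER=1 \mid SPAM=1)}{P(OFFER=1)} > P(SPAM=1)
$$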

## Proofs

$$ \begin{align}
P(SPAM=1 \mid OFFER=1) &= \frac{P(SPAM=1) P(OFFER=1 \mid SPAM=1)}{P(OFFER=1)} \\
&= \frac{\frac{\texttt{# of SPAM emails}}{\texttt{# of total emails}}\frac{\texttt{# of SPAM emails with the word OFFER}}{\texttt{# of SPAM emails}}}{\frac{\texttt{# of emails with the word OFFER}}{\texttt{# of total emails}}} \\
&= \frac{\texttt{# of SPAM emails with the word OFFER}}{\texttt{# of emails with the word OFFER}} \\
&= P(SPAM=1 \mid OFFER=1)
\end{align}
$$


## Bayes' Rule

$$
P(SPAM=0 \mid OFFER=1) = \frac{P(SPAM=0)P(OFFER=1 \mid SPAM=0)}{P(OFFER=1)}
$$
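
The sketch below applies Bayes' rule to both classes using exact rational arithmetic. It checks that the spam posterior matches the direct count ratio and that the two posteriors sum to 1. The counts come from the notebook output at the end of this document. The function `bayes_posterior` is a hypothetical helper written for this illustration, not part of the `naive_bayes` package.

```python
from fractions import Fraction

# Counts copied from the notebook output at the end of this document.
SPAM, HAM = 1500, 3672
OFFER_SPAM, OFFER_HAM = 141, 61
TOTAL = SPAM + HAM


def bayes_posterior(prior, likelihood, evidence):
    """Bayes' rule: P(class | word) = P(class) * P(word | class) / P(word)."""
    return prior * likelihood / evidence


# P(OFFER=1): fraction of all emails that contain the word.
p_offer = Fraction(OFFER_SPAM + OFFER_HAM, TOTAL)

post_spam = bayes_posterior(Fraction(SPAM, TOTAL), Fraction(OFFER_SPAM, SPAM), p_offer)
post_ham = bayes_posterior(Fraction(HAM, TOTAL), Fraction(OFFER_HAM, HAM), p_offer)

# The Bayes' rule result agrees with counting directly among "offer" emails.
assert post_spam == Fraction(OFFER_SPAM, OFFER_SPAM + OFFER_HAM)
# The two classes are exhaustive, so the posteriors sum to 1.
assert post_spam + post_ham == 1

print("P(SPAM=1 | OFFER=1) = %s" % post_spam)
print("P(SPAM=0 | OFFER=1) = %s" % post_ham)
```

`Fraction` is used here so that the equality checks are exact and not subject to floating-point rounding.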



In [2]:
from naive_bayes.email_set import EmailSet
from naive_bayes.email_set import build_and_save_email_set
from naive_bayes.feature_prob import FeatureProbability

# Run this if you haven't built and pickled the email set yet.
build_and_save_email_set()

es = EmailSet.get()
fp = FeatureProbability.from_email_set(es)

# Look up the integer code that encodes the word "offer".
code = es.word_encoding_dictionary.word_to_code("offer")
print("Code: %s" % code)
print("Ham count: %s" % fp.class_count.ham_count)
print("Spam count: %s" % fp.class_count.spam_count)
# Per-class counts for the word, and the ratio derived from them.
print("Code count: %s" % fp.code_count[code])
print("Prob ratio: %s" % fp.code_prob_ratio(code))

Dataset already processed!
Code: 3751
Ham count: 3672
Spam count: 1500
Code count: {'spam_count': 141, 'ham_count': 61}
Prob ratio: 5.65849180328
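
The printed ratio is consistent with the likelihood ratio $P(OFFER=1 \mid SPAM=1) / P(OFFER=1 \mid SPAM=0)$ computed from the counts above. The implementation of `code_prob_ratio` is not shown here, so treat this as an assumption about what it computes. The function `likelihood_ratio` below is a hypothetical stand-in that reproduces the number from the raw counts.

```python
def likelihood_ratio(word_spam, spam_total, word_ham, ham_total):
    """Ratio P(word | spam) / P(word | ham) estimated from raw counts."""
    p_word_given_spam = float(word_spam) / spam_total
    p_word_given_ham = float(word_ham) / ham_total
    return p_word_given_spam / p_word_given_ham


# Counts copied from the output above: 141 of 1500 spam emails and
# 61 of 3672 ham emails contain the word "offer".
ratio = likelihood_ratio(141, 1500, 61, 3672)
print("Likelihood ratio: %.6f" % ratio)
```

A ratio above 1 is exactly the hypothesis stated earlier: the word *offer* is more likely to appear in spam than in ham.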
