# Naive Bayes - Multinomial

In [1]:
import numpy as np
import matplotlib.pyplot as plt

$$
P(y|w) = \frac{P(w|y)P(y)}{P(w)}
$$

$$ P(w_i \in train \mid y=k) = \frac{count(w_i, k)}{\sum_{j=1}^{n} count(w_j, k)} $$
    
Example:

| | docID  | words in doc    | China?   |    
|---:|:-------------|:-----------|:------|
| Training set | 1  | Chinese Beijing Chinese       | Yes   |
|  | 2  | Chinese Chinese Shanghai    | Yes  |
|  | 3  | Chinese Macao    | Yes   |
|  | 4  | Tokyo Japan Chinese   | No   |
| Test set | 5  | Chinese Chinese Chinese Tokyo Japan    | ?   |

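To make the table concrete, here is a minimal sketch that encodes the four training documents as word-count vectors and evaluates the unsmoothed likelihood formula above. The names `vocab`, `train_docs`, `X_toy`, and `y_toy`, the column order of the vocabulary, and the encoding 1 = "Yes", 0 = "No" are assumptions made for illustration.

```python
import numpy as np

# Vocabulary of the toy example, in a fixed (arbitrary) column order
vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]

train_docs = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
]
y_toy = np.array([1, 1, 1, 0])  # assumed encoding: 1 = "Yes", 0 = "No"

# Document-term count matrix: one row per document, one column per word
X_toy = np.array([[doc.split().count(w) for w in vocab] for doc in train_docs])

# count(w_i, k): total occurrences of each word within class k
counts_yes = X_toy[y_toy == 1].sum(axis=0)
counts_no = X_toy[y_toy == 0].sum(axis=0)

# Unsmoothed likelihood: word count divided by the total word count of the class
lik_yes = counts_yes / counts_yes.sum()
lik_no = counts_no / counts_no.sum()

print(dict(zip(vocab, lik_yes)))
print(dict(zip(vocab, lik_no)))
```

"Tokyo" and "Japan" never occur in the "Yes" documents, so their unsmoothed likelihood for that class is exactly zero, and any test document containing them (such as doc 5) would get a zero product for "Yes".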

## Laplace smoothing

$$ P(w_i \in train \mid y=k) = \frac{count(w_i, k) + 1}{\sum_{j=1}^{n} count(w_j, k) + n} $$
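
As a small sketch of the smoothed formula, the per-class word counts of the China example can be plugged in directly. The array name `counts`, the row order ("No", then "Yes"), and the column order of the words are assumptions chosen for this illustration.

```python
import numpy as np

# Per-class word counts from the toy example, in the (assumed) column order
# Chinese, Beijing, Shanghai, Macao, Tokyo, Japan
counts = np.array([
    [1, 0, 0, 0, 1, 1],  # class "No"  (doc 4)
    [5, 1, 1, 1, 0, 0],  # class "Yes" (docs 1-3)
])
n_words = counts.shape[1]  # vocabulary size n = 6

# Add-one smoothing: +1 on every count, +n on every class total
smoothed = (counts + 1) / (counts.sum(axis=1, keepdims=True) + n_words)
print(smoothed)
```

With smoothing, P(Tokyo | Yes) becomes 1/14 instead of 0, while each row still sums to 1.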

## Priors

$$P(y = k) = \frac{\sum_{i=1}^{m} 1(y^{(i)}=k)}{m} $$
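
For the China example the prior is simply the fraction of training documents in each class. A minimal sketch, assuming the labels are encoded as 1 = "Yes" and 0 = "No" (`y_toy` is an illustrative name):

```python
import numpy as np

y_toy = np.array([1, 1, 1, 0])  # labels of docs 1-4; assumed 1 = "Yes", 0 = "No"
m = len(y_toy)

# np.bincount counts how many samples fall into each class index 0..K-1
priors = np.bincount(y_toy) / m
print(priors)  # [0.25 0.75]
```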

## Total likelihood

$$ P(w \in test \mid y=k) = \prod_{i=1}^{n} P(w_i \mid y=k)^{\text{freq of }w_i}$$
    
## Probability

$$P (y = k \mid w \in test) \propto P(y=k)\prod_{i=1}^{n} P(w_i \mid y=k)^{\text{freq of }w_i}$$

## Log probability
   
$$\log P (y = k \mid w \in test) = \log P(y=k) + \sum_{i=1}^{n} (\text{freq of }w_i) \cdot \log P(w_i \mid y=k) - \log P(w \in test)$$
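
The following sketch scores test doc 5 of the China example in log space and compares it with the plain product. The smoothed likelihoods and priors are written out by hand from the toy table; the names `likelihood`, `prior`, and `x_test`, and the row order ("No", then "Yes"), are assumptions for illustration. The term that does not depend on the class is dropped, since it does not change which class scores highest.

```python
import numpy as np

# Laplace-smoothed likelihoods of the toy example
# rows: "No", "Yes"; columns: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan
likelihood = np.array([
    [2, 1, 1, 1, 2, 2],
    [6, 2, 2, 2, 1, 1],
]) / np.array([[9], [14]])
prior = np.array([0.25, 0.75])

# Test doc 5: "Chinese Chinese Chinese Tokyo Japan" as a count vector
x_test = np.array([3, 0, 0, 0, 1, 1])

# log prior + sum over words of (frequency * log likelihood)
log_score = np.log(prior) + x_test @ np.log(likelihood).T

# Same quantity without logs, for comparison on this tiny example
score = prior * np.prod(likelihood ** x_test, axis=1)

print(np.exp(log_score), score)
print("predicted class:", log_score.argmax())
```

Both versions agree here and pick "Yes" (index 1). On long documents the product of many small probabilities can underflow to zero in floating point, whereas the sum of logs stays finite, which is the practical reason for working in log space.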

## Let's implement

### 1. Prepare some data

In [17]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

data = fetch_20newsgroups()
data.target_names

categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

print(train.data[0])  # full text of the first training document
print("Target: ", train.target[0])  # integer class index into train.target_names

# transform the raw text into word-count (frequency) features
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)
X_test = X_test.toarray()  # the vectorizer returns a sparse matrix; convert it to a dense array

y_train = train.target
y_test = test.target

print("X_train: ", X_train[0])
print("y_train: ", y_train[0])

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### 2. Calculating likelihood and prior
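
One possible way to compute both quantities from the formulas above is sketched below. `fit_multinomial_nb` is a hypothetical helper, not part of scikit-learn; it assumes a dense count matrix, integer labels `0..n_classes-1`, and that every class appears at least once in `y`. The sparse `X_train` from step 1 would first need `X_train.toarray()` or an adapted implementation.

```python
import numpy as np

def fit_multinomial_nb(X, y, n_classes, alpha=1.0):
    """Return (log_prior, log_likelihood) for a dense count matrix X.

    Hypothetical helper for illustration; alpha=1 gives Laplace smoothing.
    """
    m, n = X.shape
    log_prior = np.zeros(n_classes)
    log_likelihood = np.zeros((n_classes, n))
    for k in range(n_classes):
        X_k = X[y == k]                        # documents of class k
        log_prior[k] = np.log(len(X_k) / m)    # fraction of documents in class k
        word_counts = X_k.sum(axis=0) + alpha  # smoothed count(w_i, k)
        log_likelihood[k] = np.log(word_counts / word_counts.sum())
    return log_prior, log_likelihood

# Usage on the China example (columns: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan)
X_toy = np.array([
    [2, 1, 0, 0, 0, 0],
    [2, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1],
])
y_toy = np.array([1, 1, 1, 0])  # assumed encoding: 1 = "Yes", 0 = "No"
log_prior, log_likelihood = fit_multinomial_nb(X_toy, y_toy, n_classes=2)
print(np.exp(log_prior))
print(np.exp(log_likelihood))
```

Returning logs rather than raw probabilities means the prediction step can add them directly.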

### 3. Predict
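
Prediction reduces to one matrix product in log space followed by an argmax over classes. The sketch below uses a hypothetical `predict_multinomial_nb` helper and, to stay self-contained, the hand-written parameters of the China example (rows "No", "Yes"); the second test row is an invented document ("Tokyo Tokyo Japan") added purely for illustration.

```python
import numpy as np

def predict_multinomial_nb(X, log_prior, log_likelihood):
    """Return the highest-scoring class index for each row of the count matrix X."""
    # (m, n) @ (n, K) -> (m, K) matrix of unnormalized log posteriors
    log_posterior = X @ log_likelihood.T + log_prior
    return log_posterior.argmax(axis=1)

# Smoothed parameters of the toy example
# rows: "No", "Yes"; columns: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan
log_prior = np.log([0.25, 0.75])
log_likelihood = np.log(
    np.array([[2, 1, 1, 1, 2, 2],
              [6, 2, 2, 2, 1, 1]]) / np.array([[9], [14]])
)

X_new = np.array([
    [3, 0, 0, 0, 1, 1],  # doc 5: Chinese Chinese Chinese Tokyo Japan
    [0, 0, 0, 0, 2, 1],  # invented example: Tokyo Tokyo Japan
])
y_pred = predict_multinomial_nb(X_new, log_prior, log_likelihood)
print(y_pred)
```

Doc 5 is assigned to "Yes" (index 1) and the invented document to "No" (index 0).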

### 4. Let's use them