# Agenda:
### Probabilistic reasoning
- Bayes's theorem
    - Cancer test
    - Scientific method: evolution by natural selection
- Naive Bayes
    - Hidden features: get them from observed features
    - Process: get the frequencies of the observable features for each class of hidden features
        - Requires a training set where the hidden features are known
    - To make a prediction for a new example, given its observable features, see which hidden class is most probable, given the frequencies observed for these features

### Probability of a hypothesis:

$$ Pr(h|D) $$

where $h$ is the hypothesis and $D$ is the data.

Often $h$ is a class assignment to an example.

Better:

$$argmax_{h\in H} Pr(h|D)$$

where $H$ is the set of hypotheses: we want the $h$ with the highest probability given our data.

### Bayes's Theorem

Sometimes it's easier to know the probability of the data given a hypothesis (rather than the other way around). That is, sometimes we know this:

$$Pr(D|h)$$

Example: We don't know whether Feona or Ginger wrote a comment on YouTube, but we know that Feona curses a lot, and Ginger is much more polite. Suppose our data is this comment:

<span>"This is so wack! You all need to ##$$@ yourself, #$@@$##!" </span>

Who do you think wrote this? That is, which is higher: $Pr(Feona|D)$ or $Pr(Ginger|D)$?


One thing is sure: we'd expect Feona to say such things, but not Ginger. That is, we know:
$Pr(D|Feona) > Pr(D|Ginger)$


Thus it would be nice to have something that relates the probabilities $Pr(Feona|D)$, $Pr(Ginger|D)$ to $Pr(D|Feona)$, $Pr(D|Ginger)$. Or, more generally:

$$Pr(h|D) \sim Pr(D|h)$$

And we do: Bayes's Theorem. We get it from these rules:

$$Pr(A . B) = Pr(A|B)Pr(B)$$
and
$$Pr(B . A) = Pr(B|A)Pr(A)$$
and 
$$Pr(A . B) = Pr(B . A)$$

Doing some algebra we get Bayes's Theorem:

$$Pr(A|B) = \frac{Pr(B|A)Pr(A)}{Pr(B)}$$
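
The step in between is short. By the third rule, the right-hand sides of the first two rules are equal:

$$Pr(A|B)Pr(B) = Pr(B|A)Pr(A)$$

Dividing both sides by $Pr(B)$ (which assumes $Pr(B) > 0$) gives the theorem.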

Or, using $h$ and $D$:

$$ Pr(h|D) = \frac{Pr(D|h)Pr(h)}{Pr(D)}$$

Let's try to use this in an example:

![](imgs/Bayes1.png)
![](imgs/Bayes2.png)
![](imgs/Bayes3.png)
![](imgs/Bayes4.png)
![](imgs/Bayes5.png)
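
Returning to the Feona/Ginger question, here is a minimal sketch of the same calculation in code. The priors and likelihoods are made-up numbers for illustration (the notes give none), and `posterior` is a hypothetical helper name. $Pr(D)$ is obtained by summing $Pr(D|h)Pr(h)$ over the hypotheses, which assumes Feona and Ginger are the only possible authors.

```python
def posterior(priors, likelihoods):
    """Return Pr(h|D) for every hypothesis h via Bayes's theorem."""
    # Numerator of Bayes's theorem for each hypothesis: Pr(D|h) * Pr(h)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Pr(D) by total probability: sum of the numerators over all hypotheses
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Assumed numbers, for illustration only
priors = {"Feona": 0.5, "Ginger": 0.5}          # Pr(h): both equally likely a priori
likelihoods = {"Feona": 0.20, "Ginger": 0.01}   # Pr(D|h): Feona curses far more often

post = posterior(priors, likelihoods)
for h, p in post.items():
    print(f"Pr({h}|D) = {p:.3f}")
```

With equal priors the posterior is driven entirely by the likelihoods, so Feona comes out as the far more probable author. Changing the priors (say, if Ginger comments much more often) shifts the result accordingly.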



### Bayesian Learning

Algorithm:
For each $h \in H$ calculate 
   $$ Pr(h|D) = \frac{Pr(D|h)Pr(h)}{Pr(D)}$$
    
and pick the $h$ with the highest $Pr(h|D)$.

A good thing: we don't need $Pr(D)$, since we are comparing hypotheses using the same $D$, so $Pr(D)$ is the same constant for every $h$. We won't get a true probability, but we'll know which $h$ is best.


So we have: 

## $$argmax_{h\in H} Pr(h)Pr(D|h)$$

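where $Pr(D|h)$ equals the product of the probabilities of the individual observations $d$ given $h$. This assumes the observations are conditionally independent given $h$, which is the "naive" part of Naive Bayes. That is:

$$Pr(D|h) = \prod_{d \in D}Pr(d|h)$$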
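
To make the product concrete, here is a minimal from-scratch sketch of the procedure in the agenda: count how often each observable feature occurs in each class on a labelled training set, then score a new example with $Pr(h)\prod_d Pr(d|h)$. The toy comments, the author labels, and the helper names `train` and `predict` are all made up for illustration. Two choices go beyond the notes: add-one (Laplace) smoothing, so that an unseen word does not zero out the whole product, and summing logs instead of multiplying probabilities, which avoids underflow and leaves the argmax unchanged because the logarithm is monotonic.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (list_of_words, label). Returns the counts needed to predict."""
    class_counts = Counter()            # how many training examples per class
    word_counts = defaultdict(Counter)  # word frequencies within each class
    vocab = set()
    for words, label in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(words, class_counts, word_counts, vocab):
    """Return the class maximizing log Pr(h) + sum of log Pr(d|h), plus all scores."""
    n_examples = sum(class_counts.values())
    scores = {}
    for h in class_counts:
        score = math.log(class_counts[h] / n_examples)  # log Pr(h)
        total = sum(word_counts[h].values())
        for d in words:
            # log Pr(d|h), estimated from frequencies with add-one smoothing
            score += math.log((word_counts[h][d] + 1) / (total + len(vocab)))
        scores[h] = score
    return max(scores, key=scores.get), scores


# Made-up training set where the hidden feature (the author) is known
training = [
    ("this is so wack".split(), "Feona"),
    ("you all are so wack".split(), "Feona"),
    ("what a lovely video".split(), "Ginger"),
    ("thank you this is lovely".split(), "Ginger"),
]
model = train(training)
best, scores = predict("so wack".split(), *model)
print(best, scores)
```

The scores are log values of unnormalized probabilities, so they are only meaningful relative to each other, which is all the argmax needs.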


In [9]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load the iris data: 150 flowers, 4 numeric features, 3 classes
data = load_iris()
X = data.data
y = data.target
print(X.shape)
print(y.shape)

# Hold out a third of the examples for testing (the split is random,
# so the accuracies vary from run to run)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Fit Gaussian naive Bayes and compare training and test accuracy
gnb = GaussianNB()
gnb.fit(X_train, y_train)
acc_train = gnb.score(X_train, y_train)
acc_test = gnb.score(X_test, y_test)

print("Accuracy on training data:", acc_train)
print("Accuracy on test data:", acc_test)

(150, 4)
(150,)
Accuracy on training data: 0.94
Accuracy on test data: 0.98


Let's try some text classification. The code below is taken from the scikit-learn documentation tutorial, Working with Text Data: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset

The data:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In [17]:
# Limit ourselves to 4 categories to try to predict
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

In [18]:
# Download the training data
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


The next few cells preprocess the data into vectors of word frequencies. For the details, see the discussion in the tutorial linked above.

In [19]:
# Show the first three lines of the first document, then its category
print("\n".join(twenty_train.data[0].split("\n")[:3]))
print(twenty_train.target_names[twenty_train.target[0]])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
comp.graphics


In [20]:
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2], dtype=int32)

In [21]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


In [22]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [23]:
count_vect.vocabulary_.get(u'algorithm')

4690

In [24]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

In [25]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)
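
To see roughly what these vectors contain, here is a pure-Python sketch of the same idea on three made-up documents: count words, then down-weight words that occur in many documents. The helper name `tfidf` is hypothetical, and the weighting is one common textbook variant (raw count times $\ln(N/df)$). scikit-learn's `TfidfTransformer` uses a smoothed idf and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up toy corpus, for illustration only
docs = [
    "god is love",
    "opengl on the gpu is fast",
    "the gpu is fast",
]

tokenized = [d.split() for d in docs]
# One vector position per distinct word, in sorted order
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))


def tfidf(doc_words):
    """Return one weight per vocabulary word: count * ln(N / df)."""
    counts = Counter(doc_words)
    n_docs = len(tokenized)
    return [counts[w] * math.log(n_docs / df[w]) for w in vocab]


vectors = [tfidf(doc) for doc in tokenized]
print(vocab)
for vec in vectors:
    print([round(x, 2) for x in vec])
```

A word such as "is", which appears in every document, gets weight zero under this variant because it carries no information about which document it came from.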

In [26]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [27]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


In [28]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

In [29]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [31]:
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.83488681757656458

In [32]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

metrics.confusion_matrix(twenty_test.target, predicted)

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

           avg / total       0.88      0.83      0.84      1502



array([[192,   2,   6, 119],
       [  2, 347,   4,  36],
       [  2,  11, 322,  61],
       [  2,   2,   1, 393]], dtype=int64)
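
The report can be reproduced by hand from the confusion matrix, which helps in reading it. Rows are true classes and columns are predicted classes (scikit-learn's convention), so recall divides a diagonal entry by its row sum and precision divides it by its column sum. A minimal sketch using the numbers printed above:

```python
labels = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]
# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = [
    [192,   2,   6, 119],
    [  2, 347,   4,  36],
    [  2,  11, 322,  61],
    [  2,   2,   1, 393],
]

n = len(cm)
correct = sum(cm[i][i] for i in range(n))   # diagonal = correct predictions
total = sum(sum(row) for row in cm)         # all test documents
accuracy = correct / total

for i, name in enumerate(labels):
    row_sum = sum(cm[i])                       # documents truly in class i
    col_sum = sum(cm[r][i] for r in range(n))  # documents predicted as class i
    recall = cm[i][i] / row_sum
    precision = cm[i][i] / col_sum
    print(f"{name:>22}  precision={precision:.2f}  recall={recall:.2f}")

print(f"accuracy={accuracy:.4f}")
```

The 119 in the top-right corner is the main source of error: 119 `alt.atheism` posts were labelled `soc.religion.christian`, which is why `alt.atheism` has low recall and `soc.religion.christian` has low precision.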