<a href="https://colab.research.google.com/github/fbeilstein/machine_learning/blob/master/seminar_5_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Estimating $P(C_k)$

Suppose in a training set there are 15 documents labeled $C_1="Physics"$, 25 documents labeled $C_2="Economics"$ and $C_3="Religion"$.

Estimate $P(C_k)$ using MLE (empirical frequencies).
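
The MLE of a class prior is the fraction of training documents carrying that label, $\hat{P}(C_k)=N_k/N$. A minimal sketch of that computation follows; the function name `class_priors` and the toy labels are illustrative assumptions, not the counts from the exercise.

```python
from collections import Counter


def class_priors(labels):
    """MLE of P(C_k): the fraction of training documents carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy labels, chosen only to illustrate the call.
toy_labels = ["A"] * 1 + ["B"] * 3
priors = class_priors(toy_labels)
print(priors)  # {'A': 0.25, 'B': 0.75}
```

The priors always sum to 1 because every document contributes to exactly one class count.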

#Multinomial model parameters estimation

Suppose your dictionary contains words $\{"position", "velocity", "stocks"\}$.
Suppose you have $3$ documents labeled as $"Physics"$.

Your 1st document contains 

* word "position" $x_1=0$ times
* word "velocity" $x_2=1$ times
* word "stocks" $x_3=0$ times.

Your 2nd document contains 

* word "position" $x_1=0$ times
* word "velocity" $x_2=1$ times
* word "stocks" $x_3=0$ times.

Your 3rd document contains 

* word "position" $x_1=3$ times
* word "velocity" $x_2=0$ times
* word "stocks" $x_3=0$ times.

Suppose you adopt a multinomial model for word occurrence:

$$
P(\{x_1,x_2,x_3\}|"Physics")=\frac{(x_1+x_2+x_3)!}{x_1!x_2!x_3!} p_{1}^{x_1}p_{2}^{x_2}p_{3}^{x_3}
$$
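
To make the formula concrete, here is a small sketch that evaluates the multinomial probability for a count vector. The helper name `multinomial_pmf` and the parameter values in `illustrative_probs` are assumptions chosen for illustration; they are not estimates from the data.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` under a multinomial with parameters `probs`."""
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! x_2! ... x_K!), kept as an exact integer.
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)
    return coefficient * prod(p ** x for p, x in zip(probs, counts))


# Hypothetical parameters for ("position", "velocity", "stocks").
illustrative_probs = (0.5, 0.3, 0.2)

# Third document of the exercise: "position" occurs 3 times, nothing else.
likelihood = multinomial_pmf((3, 0, 0), illustrative_probs)
print(likelihood)  # 0.125
```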

With the given data, use empirical frequencies (with smoothing $\alpha=0.1$) to estimate the model parameters $p_1,p_2,p_3$ and verify that $p_1+p_2+p_3=1$.

Note: you can aggregate the counts over the three documents to get

* word "position" $x_1=3$ times
* word "velocity" $x_2=2$ times
* word "stocks" $x_3=0$ times.




##Solution:

We have $K=3$ features ($3$ words in the vocabulary) and $n=5$ trials (word occurrences in total). With $n_i$ the aggregated count of word $i$, the smoothed estimate is

$$
p_i=\frac{n_i+ \alpha}{n+K \alpha}.
$$

So

$$
\begin{aligned}
p_1&=\frac{n_1+ \alpha}{n+K \alpha}=\frac{3+ 0.1}{5+3 \times 0.1}\approx 0.58, \\
p_2&=\frac{n_2+ \alpha}{n+K \alpha}=\frac{2+ 0.1}{5+3 \times 0.1}\approx 0.40, \\
p_3&=\frac{n_3+ \alpha}{n+K \alpha}=\frac{0+ 0.1}{5+3 \times 0.1}\approx 0.02.
\end{aligned}
$$

We see that

$$
p_1+p_2+p_3=\frac{3.1+2.1+0.1}{5.3}=1.
$$
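
The same arithmetic can be checked numerically. This is a minimal sketch, with `smoothed_multinomial` as a hypothetical helper name; the document counts are the ones given in the exercise.

```python
def smoothed_multinomial(counts, alpha):
    """Additively smoothed estimates p_i = (n_i + alpha) / (n + K * alpha)."""
    k = len(counts)
    n = sum(counts)
    return [(c + alpha) / (n + k * alpha) for c in counts]


# Counts of ("position", "velocity", "stocks") in the three Physics documents.
documents = [(0, 1, 0), (0, 1, 0), (3, 0, 0)]

# Sum each column to get the aggregated counts [3, 2, 0].
aggregated = [sum(column) for column in zip(*documents)]

p = smoothed_multinomial(aggregated, alpha=0.1)
print([round(v, 4) for v in p])  # [0.5849, 0.3962, 0.0189]
print(sum(p))                    # 1.0 up to floating-point rounding
```

Smoothing keeps $p_3$ strictly positive even though "stocks" never occurs, so a Physics document containing that word does not receive zero likelihood.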



#Multivariate Bernoulli model parameters estimation

In the previous setup, adopt the multivariate Bernoulli model. 
Now we count not the total number of occurrences across all documents but the number of documents in which each word occurs.

* word "position" is present in $1$ document
* word "velocity" is present in $2$ documents
* word "stocks" is present in $0$ documents.

The model is

$$
P(\{x_1,x_2,x_3\})= p_{1}^{x_1}(1-p_1)^{1-x_1} \times p_{2}^{x_2}(1-p_2)^{1-x_2} \times p_{3}^{x_3}(1-p_3)^{1-x_3}
$$

where $x_i$ is either $0$ or $1$ (i.e. the word is either present in the document or not).

With the given data, use empirical frequencies (with smoothing $\alpha=0.1$) to estimate the model parameters $p_1,p_2,p_3$. Each $p_i$ is a separate Bernoulli parameter, so, unlike in the multinomial model, $p_1+p_2+p_3$ need not equal $1$.



##Solution

Each word is a binary feature with $2$ outcomes (present or absent) observed in $n=3$ documents, so the smoothing term in the denominator is $2\alpha$. With $n_i$ the number of documents containing word $i$:

$$
\begin{aligned}
p_1&=\frac{n_1+ \alpha}{n+2 \alpha}=\frac{1+ 0.1}{3+2 \times 0.1}\approx 0.34, \\
p_2&=\frac{n_2+ \alpha}{n+2 \alpha}=\frac{2+ 0.1}{3+2 \times 0.1}\approx 0.66, \\
p_3&=\frac{n_3+ \alpha}{n+2 \alpha}=\frac{0+ 0.1}{3+2 \times 0.1}\approx 0.03.
\end{aligned}
$$

Here $p_1+p_2+p_3\approx 1.03\neq 1$.
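
As a numerical cross-check, here is a minimal sketch that derives document frequencies from the three count vectors of the exercise. It assumes the standard Bernoulli smoothing, in which each word has two outcomes (present or absent), so the denominator is the number of documents plus $2\alpha$. The helper name `smoothed_bernoulli` is hypothetical.

```python
def smoothed_bernoulli(documents, alpha):
    """Smoothed per-word presence probabilities (df_i + alpha) / (n_docs + 2 * alpha)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears at least once.
    doc_freq = [sum(1 for x in column if x > 0) for column in zip(*documents)]
    return [(df + alpha) / (n_docs + 2 * alpha) for df in doc_freq]


# Counts of ("position", "velocity", "stocks") in the three Physics documents.
documents = [(0, 1, 0), (0, 1, 0), (3, 0, 0)]

p = smoothed_bernoulli(documents, alpha=0.1)
print([round(v, 3) for v in p])  # [0.344, 0.656, 0.031]
```

Because each word gets its own Bernoulli parameter, these values are not constrained to sum to 1.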


In [0]:
categories=["Physics", "Economy", "Religion"]

P=["In Newtonian mechanics, linear momentum, translational momentum, or simply momentum (pl. momenta) is the product of the mass and velocity of an object.",
   "A neutrino (denoted by the Greek letter ν) is a fermion (an elementary particle with half-integer spin) that interacts only via the weak subatomic force and gravity.",
   "In physics, the center of mass of a distribution of mass in space is the unique point where the weighted relative position of the distributed mass sums to zero. This is the point to which a force may be applied to cause a linear acceleration without an angular acceleration."]

E=["Money is any item or verifiable record that is generally accepted as payment for goods and services and repayment of debts, such as taxes, in a particular country or socio-economic context.",
   "Originally money was a form of receipt, representing grain stored in temple granaries in Sumer in ancient Mesopotamia and later in Ancient Egypt. In this first stage of currency, metals were used as symbols to represent value stored in the form of commodities. ",
   "In economics, inflation is a increase in the general price level of goods and services in an economy over a period of time. When the general price level rises, each unit of currency buys fewer goods and services; consequently, inflation reflects a reduction in the purchasing power per unit of money – a loss of real value in the medium of exchange and unit of account within the economy."]

R=["Christianity is an Abrahamic monotheistic religion based on the life and teachings of Jesus of Nazareth. Its adherents, known as Christians, believe that Jesus is the Christ, the Son of God, and the savior of all people, whose coming as the Messiah was prophesied in the Hebrew Bible, called the Old Testament in Christianity, and chronicled in the New Testament.",
   "Traditionalist Catholicism is a set of religious beliefs made up of the customs, traditions, liturgical forms, public, private and group devotions, and presentations of the teaching of the Catholic Church before the Second Vatican Council",
   "Most modern scholars believe that John the Baptist performed a baptism on Jesus, and view it as a historical event to which a high degree of certainty can be assigned."]

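from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Build the training set from the three lists above; labels index into `categories`.
texts = P + E + R
labels = [0] * len(P) + [1] * len(E) + [2] * len(R)

# Turn each text into TF-IDF features, then fit a multinomial naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)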
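
To connect the classifier back to the hand calculation, the sketch below fits `MultinomialNB` with `alpha=0.1` on the raw word counts from the exercise and reads off the fitted per-word probabilities for the Physics class. Two assumptions: it uses raw counts rather than the TF-IDF weights of the pipeline, and the single Economics row is a hypothetical document added only so that there is a second class.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of "position", "velocity", "stocks".
X = np.array([
    [0, 1, 0],  # Physics documents from the exercise
    [0, 1, 0],
    [3, 0, 0],
    [0, 0, 2],  # hypothetical Economics document (assumption, not from the exercise)
])
y = ["Physics", "Physics", "Physics", "Economics"]

clf = MultinomialNB(alpha=0.1).fit(X, y)

# feature_log_prob_ stores log p_i per class; exponentiate to recover p_i.
physics_row = list(clf.classes_).index("Physics")
p_physics = np.exp(clf.feature_log_prob_[physics_row])
print(np.round(p_physics, 4))
```

The printed values should agree with $(n_i+\alpha)/(n+K\alpha)$ computed from the aggregated Physics counts $3, 2, 0$.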
