<a href="https://colab.research.google.com/github/fbeilstein/machine_learning/blob/master/seminar_5_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Road map for Naive Bayes text classifier

![](https://raw.githubusercontent.com/fbeilstein/machine_learning/master/lecture_5_naive_bayes/text.png)


#Estimating $P(C_k)$

Suppose in a training set there are 15 documents labeled $C_1="Physics"$, 25 documents labeled $C_2="Economics"$ and $C_3="Religion"$.

Estimate $P(C_k)$ using MLE (empirical frequencies).

You are given a list of documents $corpus$.
You are also given a list of tags $tags$.

Write a function that calculates $P(C_k)$.

In [0]:
from collections import Counter

class tagsEstimator:
  def calculatePriors(self, p):
    counter = Counter(p)
    out=[]
    for i in counter:
      out.append(counter[i]/len(p))
    return out

TE = tagsEstimator()
TE.calculatePriors(["Physics", "Lyrics", "Lyrics", "Lyrics"])

[0.25, 0.75]
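
The list returned above does not say which label each prior belongs to. As a minimal sketch (the function name `estimate_priors` is an assumption, not part of the seminar code), the same empirical frequencies $\hat{P}(C_k)=N_k/N$ can be returned as a dictionary keyed by label:

```python
from collections import Counter

def estimate_priors(tags):
    """Return {label: empirical frequency} for a list of labels."""
    counts = Counter(tags)
    total = len(tags)
    return {label: count / total for label, count in counts.items()}

priors = estimate_priors(["Physics", "Lyrics", "Lyrics", "Lyrics"])
print(priors)  # {'Physics': 0.25, 'Lyrics': 0.75}
```

Keeping the label next to its probability avoids relying on the order in which labels first appear in the input.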

#Custom Count Vectorizer

Write your own simple vectorizer that takes a list $corpus$ of strings, builds a vocabulary $V$ and returns a $|corpus|\times|V|$ table with the counts of each vocabulary word in each document.
Your class will be an analogue of ```sklearn.feature_extraction.text.CountVectorizer```.
The interface of our class should be as follows:
* ```fit_transform(self, corpus)``` - feeds the vectorizer a list of strings.
* ```get_feature_names(self)``` - returns a list of lexicographically sorted lowercase words in the vocabulary.
* ```toarray(self)``` - returns the $|corpus|\times|V|$ array of counts.

Note: you should remove punctuation marks and convert all documents to lowercase.

Now modify your code so that the counts are boolean ($1$ if the word is present in the document and $0$ otherwise).
This corresponds to ```CountVectorizer(binary = True)```.
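
The preprocessing described in the note, and the switch from counts to boolean indicators, can be sketched with the standard library alone. The helper names `tokenize` and `count_row` are hypothetical and only illustrate one possible approach; `string.punctuation` covers ASCII punctuation only.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def tokenize(document):
    """Strip punctuation, lowercase, and split on whitespace."""
    return document.translate(_PUNCT_TABLE).lower().split()

def count_row(document, vocabulary, binary=False):
    """Counts (or 0/1 indicators if binary=True) of each vocabulary word."""
    counts = Counter(tokenize(document))
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]

vocabulary = ["document", "second", "this"]
doc = "This document is the second document."
print(count_row(doc, vocabulary))               # [2, 1, 1]
print(count_row(doc, vocabulary, binary=True))  # [1, 1, 1]
```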







##Standard Count Vectorizer

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This document is the second document.',
'And this is the third one.', 'Is this the first document?']

vectorizer_C = CountVectorizer()
C = vectorizer_C.fit_transform(corpus)
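# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("Vocabulary : ", vectorizer_C.get_feature_names_out().tolist())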
print("arr :", C.toarray())

Vocabulary :  ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
arr : [[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


##Solution

In [0]:
from collections import Counter
import string

class SimpleCountVectorizer:
  """Vectorizer that counts number of word occurrences in documents"""
  def __init__(self, binary=False):
    self.binary = binary
    
  def fit_transform(self, corpus):
    self.vocabulary = set()
    counters = []
    self.counts = []
    
    # Translation table that deletes all punctuation characters
    table = str.maketrans('', '', string.punctuation)
    for d in corpus:
      words = d.translate(table).lower().split()
      self.vocabulary.update(words)
      counters.append(Counter(words))
    self.vocabulary = sorted(self.vocabulary)  
    
    for d_index in range(0, len(corpus)):
      counts = []
      for word in self.vocabulary:
        count = int(counters[d_index][word] > 0) if self.binary else counters[d_index][word]
        counts.append(count)
      self.counts.append(counts)
 
  def get_feature_names(self):
    return self.vocabulary
  
  def toarray(self):
    return  self.counts
      
  
corpus = ['This is the first document.', 'This document is the second document.',
'And this is the third one.', 'Is this the first document?']
SCV = SimpleCountVectorizer(binary=True)
SCV.fit_transform(corpus)
print("Vocabulary : ", SCV.get_feature_names())
print("arr :", SCV.toarray())

    

Vocabulary :  ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
arr : [[0, 1, 1, 1, 0, 0, 1, 0, 1], [0, 1, 0, 1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0, 1, 1, 1], [0, 1, 1, 1, 0, 0, 1, 0, 1]]


#Multinomial model parameters estimation

Suppose your dictionary contains words $\{"position", "velocity", "stocks"\}$.
Suppose you have $3$ documents labeled as $"Physics"$.

Your 1st document contains 

* word "position" $x_1=0$ times
* word "velocity" $x_2=1$ times
* word "stocks" $x_3=0$ times.

Your 2nd document contains 

* word "position" $x_1=0$ times
* word "velocity" $x_2=1$ times
* word "stocks" $x_3=0$ times.

Your 3rd document contains 

* word "position" $x_1=3$ times
* word "velocity" $x_2=0$ times
* word "stocks" $x_3=0$ times.

Suppose you adopt the multinomial model for word occurrence

$$
P(\{x_1,x_2,x_3\}|"Physics")=\frac{(x_1+x_2+x_3)!}{x_1!x_2!x_3!} p_{1}^{x_1}p_{2}^{x_2}p_{3}^{x_3}
$$

With the given data, use empirical frequencies (with smoothing $\alpha=0.1$) to estimate the model parameters $p_1,p_2,p_3$ and verify that $p_1+p_2+p_3=1$.

Note: You can aggregate counts in our three documents to get 

* word "position" $x_1=3$ times
* word "velocity" $x_2=2$ times
* word "stocks" $x_3=0$ times.




##Solution:

We have $K=3$ features ($3$ words in the vocabulary) and $n=3+2+0=5$ "trials" in which we pick words from the "bag" with replacement. With $n_i$ denoting the aggregated count of word $i$, the smoothed estimate is

$$
p_i=\frac{n_i+ \alpha}{n+K \alpha}.
$$

So

$$
\begin{aligned}
p_1&=\frac{n_1+ \alpha}{n+K \alpha}=\frac{3+ 0.1}{5+3 \times 0.1}\approx 0.58, \\
p_2&=\frac{n_2+ \alpha}{n+K \alpha}=\frac{2+ 0.1}{5+3 \times 0.1}\approx 0.40, \\
p_3&=\frac{n_3+ \alpha}{n+K \alpha}=\frac{0+ 0.1}{5+3 \times 0.1}\approx 0.02.
\end{aligned}
$$

We see that

$$
p_1+p_2+p_3=\frac{3.1+2.1+0.1}{5.3}=1.
$$
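
The arithmetic can be checked numerically. This is a minimal sketch of the smoothing formula from this solution; the function name `smoothed_frequencies` is an assumption made for illustration.

```python
def smoothed_frequencies(counts, alpha):
    """Additive smoothing: p_i = (n_i + alpha) / (n + K * alpha)."""
    n = sum(counts)   # total number of word occurrences
    K = len(counts)   # vocabulary size
    return [(n_i + alpha) / (n + K * alpha) for n_i in counts]

p = smoothed_frequencies([3, 2, 0], alpha=0.1)
print([round(p_i, 2) for p_i in p])  # [0.58, 0.4, 0.02]
print(abs(sum(p) - 1.0) < 1e-12)     # True
```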



#Multivariate Bernoulli model parameters estimation

In the previous setup, adopt the multivariate Bernoulli model.
Now we count not the total number of occurrences across all documents but the number of documents in which each word occurs.

In your 1st document

* word "position" is absent (bool frequency is $0$)
* word "velocity" is present (bool frequency is $1$)
* word "stocks" is absent (bool frequency is $0$)

In your 2nd document

* word "position" is absent (bool frequency is $0$)
* word "velocity" is present (bool frequency is $1$)
* word "stocks" is absent (bool frequency is $0$)

In your 3rd document

* word "position" is present (bool frequency is $1$)
* word "velocity" is absent (bool frequency is $0$)
* word "stocks" is absent (bool frequency is $0$)

Adding up the boolean frequencies, we get
* word "position" is present in $1$ document
* word "velocity" is present in $2$ documents
* word "stocks" is present in $0$ documents.

The model 

$$
P(\{x_1,x_2,x_3\})= p_{1}^{x_1}(1-p_1)^{1-x_1} \times p_{2}^{x_2}(1-p_2)^{1-x_2} \times p_{3}^{x_3}(1-p_3)^{1-x_3}
$$

where $x_i$ is either $0$ or $1$ (i.e. the word is either present in the document or not).
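
To make the formula concrete, here is a small sketch that evaluates it for one document. The function name `bernoulli_likelihood` and the parameter values $p_i=0.5$ are illustrative assumptions, not estimates from the data above.

```python
def bernoulli_likelihood(x, p):
    """P(x | p) = prod_i p_i^x_i * (1 - p_i)^(1 - x_i), with each x_i in {0, 1}."""
    likelihood = 1.0
    for x_i, p_i in zip(x, p):
        # A present word contributes p_i, an absent word contributes 1 - p_i
        likelihood *= p_i if x_i else (1.0 - p_i)
    return likelihood

# Document containing only the second vocabulary word, hypothetical p_i = 0.5
print(bernoulli_likelihood([0, 1, 0], [0.5, 0.5, 0.5]))  # 0.125
```

Unlike the multinomial model, absent words also contribute a factor, so every vocabulary word affects the result.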

With the given data, use empirical frequencies (with smoothing $\alpha=0.1$) to estimate the model parameters $p_1,p_2,p_3$ and verify that $p_1+p_2+p_3=1$.



##Solution

$K=3$ (three words in the dictionary) and $n=1+2+0=3$. Therefore
$$
\begin{aligned}
p_1&=\frac{n_1+ \alpha}{n+K \alpha}=\frac{1+ 0.1}{3+3 \times 0.1}\approx 0.33, \\
p_2&=\frac{n_2+ \alpha}{n+K \alpha}=\frac{2+ 0.1}{3+3 \times 0.1}\approx 0.64, \\
p_3&=\frac{n_3+ \alpha}{n+K \alpha}=\frac{0+ 0.1}{3+3 \times 0.1}\approx 0.03.
\end{aligned}
$$
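
The step from raw counts to document frequencies can be sketched as follows. The snippet uses the same normalisation as this solution (dividing by $n+K\alpha$); the variable names are assumptions made for illustration.

```python
docs = [[0, 1, 0], [0, 1, 0], [3, 0, 0]]  # raw counts per document

# Boolean frequencies: 1 if the word occurs in the document, else 0
bool_docs = [[int(x > 0) for x in doc] for doc in docs]

# Number of documents containing each word (column sums)
doc_freq = [sum(column) for column in zip(*bool_docs)]
print(doc_freq)  # [1, 2, 0]

alpha = 0.1
n, K = sum(doc_freq), len(doc_freq)
p = [(n_i + alpha) / (n + K * alpha) for n_i in doc_freq]
print([round(p_i, 2) for p_i in p])  # [0.33, 0.64, 0.03]
```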



#Automatic parameters estimation

Write a class that performs multinomial and multivariate Bernoulli parameter estimation. The class receives the array of counts from the vectorizer and returns a vector of $p_i$s. You can also set $\alpha$ in the constructor (the default value is $1$). Verify your code with the examples above.

##Solution

In [0]:
class MultinomialParametersEstimator:
  """Calculates parameters for Multinomial/multivariate Bernoulli model"""
  def __init__(self, alpha=1):
    self.alpha = alpha
    
  def get_params_for_tag(self, counts):
    # Aggregate counts column-wise without mutating the caller's lists
    totals = [sum(column) for column in zip(*counts)]
    K = len(totals)  # vocabulary size
    n = sum(totals)  # total number of word occurrences for this tag
    # Smoothed empirical frequencies: (n_i + alpha) / (n + K * alpha)
    return [(n_i + self.alpha) / (n + K * self.alpha) for n_i in totals]

  def get_params(self, counts, tags):
    # Group the rows of counts by tag, keeping tags in order of first appearance
    rows_by_tag = {}
    for row, tag in zip(counts, tags):
      rows_by_tag.setdefault(tag, []).append(row)
    return [self.get_params_for_tag(rows) for rows in rows_by_tag.values()]

counts=[[1,2,3],[3,4,5],[3,4,5]]
tags=[1,1,0]
e=MultinomialParametersEstimator()
e.get_params(counts,tags)

[[0.23809523809523808, 0.3333333333333333, 0.42857142857142855],
 [0.26666666666666666, 0.3333333333333333, 0.4]]

In the example **Multinomial model parameters estimation** we had
$$counts=[[0,1,0],[0,1,0],[3,0,0]]$$

and
$$
\begin{aligned}
p_1&=0.58, \\
p_2&=0.4, \\
p_3&=0.02.
\end{aligned}
$$

In [0]:
counts = [[0,1,0],[0,1,0],[3,0,0]]
m = MultinomialParametersEstimator(0.1)
m.get_params_for_tag(counts)

[0.5849056603773585, 0.39622641509433965, 0.01886792452830189]

In the example **Multivariate Bernoulli model parameters estimation** we had
$$\text{bool counts}=[[0,1,0],[0,1,0],[1,0,0]]$$

and
$$
\begin{aligned}
p_1&=0.33, \\
p_2&=0.64, \\
p_3&=0.03.
\end{aligned}
$$

In [0]:
counts = [[0,1,0],[0,1,0],[1,0,0]]
m = MultinomialParametersEstimator(0.1)
m.get_params_for_tag(counts)

[0.33333333333333337, 0.6363636363636365, 0.030303030303030307]

#Calculating probabilities in multinomial model

$$
L(\{p_1,p_2,p_3\}|\{x_1,x_2,x_3\})=P(\{x_1,x_2,x_3\}|\{p_1,p_2,p_3\})=\frac{(x_1+x_2+x_3)!}{x_1!x_2!x_3!} p_{1}^{x_1}p_{2}^{x_2}p_{3}^{x_3}
$$

In [0]:
import math 

class multinomialLikelihoodCalculator:
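  def likelihood(self, x, p):
    # Total number of words in the document: x_1 + ... + x_K
    total = 0

    for i in range (0, len(x)):
      total += x[i]

    # gamma(total + 1) equals total!, the numerator of the multinomial coefficient
    P = math.gamma(total+1)

    for i in range(0, len(x)):
      # Multiply by p_i^x_i / x_i!
      P *= (p[i])**(x[i])/math.gamma(x[i] + 1)

    return P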

  def calculateLikelihoods(self, xx, pp):
    out = []
    for i in range(0, len(pp)):
      out.append(self.likelihood(xx, pp[i]))
    
    return out

  
LC = multinomialLikelihoodCalculator()
LC.calculateLikelihoods([2,4,5],[[0.33,0.64,0.03],[0.2,0.3,0.4]])

[3.076715106533376e-06, 0.022992076800000004]
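
For longer documents the direct product can overflow or underflow: `math.gamma` raises an overflow error once its argument exceeds roughly 171, and a product of many small probabilities shrinks toward zero. A common workaround is to work with logarithms. The sketch below is one possible variant, not part of the seminar code; the function name `multinomial_log_likelihood` is an assumption, and it presumes every $p_i>0$, which holds when smoothing with $\alpha>0$ is used.

```python
import math

def multinomial_log_likelihood(x, p):
    """log P(x | p) for the multinomial model, using lgamma for the factorials."""
    log_like = math.lgamma(sum(x) + 1)  # log of (x_1 + ... + x_K)!
    for x_i, p_i in zip(x, p):
        if x_i > 0:  # terms with x_i == 0 contribute log(1) = 0
            log_like += x_i * math.log(p_i) - math.lgamma(x_i + 1)
    return log_like

log_l = multinomial_log_likelihood([2, 4, 5], [0.33, 0.64, 0.03])
print(math.exp(log_l))  # close to 3.0767e-06, the first value printed above
```

Because the logarithm is monotonic, comparing log-likelihoods selects the same class as comparing likelihoods.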

#Document Word counter

In [0]:
  
import string

class wordCounter:
  def count(self, vocabulary, document):
    counts = []
    table = str.maketrans({key: None for key in string.punctuation})
    d_without_punct = document.translate(table) # deletes all punctuation
    words = d_without_punct.lower().split()
    # Words of the document that are not in the vocabulary are ignored
    for i in vocabulary:
      counts.append(words.count(i))
    return counts

WC = wordCounter()
WC.count(["s","d"],"s s d hjhkj")

[2, 1]

#Posterior calculator

In [0]:
class posteriorCalculator:
  def calculatePosteriors(self, L, p):
    # Unnormalized posteriors: likelihood L[i] times prior p[i] for each class
    out=[]
    for i in range(0, len(L)):
      out.append(p[i] * L[i])
    return out

PC = posteriorCalculator()
PC.calculatePosteriors([0.6, 0.4],[0.3, 0.7])

[0.18, 0.27999999999999997]
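
These products are proportional to the posteriors but do not sum to one, because the evidence term $P(x)$ is omitted. That is enough for picking the most probable class. If actual probabilities are wanted, they can be normalised as in the sketch below; the function name `normalize` is an assumption.

```python
def normalize(unnormalized):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(unnormalized)
    return [value / total for value in unnormalized]

posteriors = normalize([0.6 * 0.3, 0.4 * 0.7])
print([round(p, 3) for p in posteriors])  # [0.391, 0.609]
```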

#Text classification with our own corpus

In [0]:
categories=["Physics", "Economy", "Religion"]

P=["In Newtonian mechanics, linear momentum, translational momentum, or simply momentum (pl. momenta) is the product of the mass and velocity of an object.",
   "A neutrino (denoted by the Greek letter ν) is a fermion (an elementary particle with half-integer spin) that interacts only via the weak subatomic force and gravity.",
   "In physics, the center of mass of a distribution of mass in space is the unique point where the weighted relative position of the distributed mass sums to zero. This is the point to which a force may be applied to cause a linear acceleration without an angular acceleration."]

E=["Money is any item or verifiable record that is generally accepted as payment for goods and services and repayment of debts, such as taxes, in a particular country or socio-economic context.",
   "Originally money was a form of receipt, representing grain stored in temple granaries in Sumer in ancient Mesopotamia and later in Ancient Egypt. In this first stage of currency, metals were used as symbols to represent value stored in the form of commodities. ",
   "In economics, inflation is a increase in the general price level of goods and services in an economy over a period of time. When the general price level rises, each unit of currency buys fewer goods and services; consequently, inflation reflects a reduction in the purchasing power per unit of money – a loss of real value in the medium of exchange and unit of account within the economy."]

R=["Christianity is an Abrahamic monotheistic religion based on the life and teachings of Jesus of Nazareth. Its adherents, known as Christians, believe that Jesus is the Christ, the Son of God, and the savior of all people, whose coming as the Messiah was prophesied in the Hebrew Bible, called the Old Testament in Christianity, and chronicled in the New Testament.",
   "Traditionalist Catholicism is a set of religious beliefs made up of the customs, traditions, liturgical forms, public, private and group devotions, and presentations of the teaching of the Catholic Church before the Second Vatican Council",
   "Most modern scholars believe that John the Baptist performed a baptism on Jesus, and view it as a historical event to which a high degree of certainty can be assigned."]

tags = [0,0,0,1,1,1,2,2,2]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

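# TF-IDF features fed into a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

vectorizer = TfidfVectorizer()

# Print the vocabulary extracted from each category separately
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vectorizer.fit_transform(P)
print(vectorizer.get_feature_names_out())
vectorizer.fit_transform(E)
print(vectorizer.get_feature_names_out())
vectorizer.fit_transform(R)
print(vectorizer.get_feature_names_out())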


training_set={'data':P+E+R,'target':tags}
model.fit(training_set['data'], training_set['target']);
labels = model.predict(["Economy is physics"])
print(categories[labels[0]])

In [0]:
categories=["Physics", "Economy", "Religion"]

P=["In Newtonian mechanics, linear momentum, translational momentum, or simply momentum (pl. momenta) is the product of the mass and velocity of an object.",
   "A neutrino (denoted by the Greek letter ν) is a fermion (an elementary particle with half-integer spin) that interacts only via the weak subatomic force and gravity.",
   "In physics, the center of mass of a distribution of mass in space is the unique point where the weighted relative position of the distributed mass sums to zero. This is the point to which a force may be applied to cause a linear acceleration without an angular acceleration."]

E=["Money is any item or verifiable record that is generally accepted as payment for goods and services and repayment of debts, such as taxes, in a particular country or socio-economic context.",
   "Originally money was a form of receipt, representing grain stored in temple granaries in Sumer in ancient Mesopotamia and later in Ancient Egypt. In this first stage of currency, metals were used as symbols to represent value stored in the form of commodities. ",
   "In economics, inflation is a increase in the general price level of goods and services in an economy over a period of time. When the general price level rises, each unit of currency buys fewer goods and services; consequently, inflation reflects a reduction in the purchasing power per unit of money – a loss of real value in the medium of exchange and unit of account within the economy."]

R=["Christianity is an Abrahamic monotheistic religion based on the life and teachings of Jesus of Nazareth. Its adherents, known as Christians, believe that Jesus is the Christ, the Son of God, and the savior of all people, whose coming as the Messiah was prophesied in the Hebrew Bible, called the Old Testament in Christianity, and chronicled in the New Testament.",
   "Traditionalist Catholicism is a set of religious beliefs made up of the customs, traditions, liturgical forms, public, private and group devotions, and presentations of the teaching of the Catholic Church before the Second Vatican Council",
   "Most modern scholars believe that John the Baptist performed a baptism on Jesus, and view it as a historical event to which a high degree of certainty can be assigned."]

tags = [0,0,0,1,1,1,2,2,2]
corpus = P + E + R

d="Physics is economics money force and gravity"

TE = tagsEstimator()
tagsPriorProb = TE.calculatePriors(tags)


SCV = SimpleCountVectorizer(binary=False)
SCV.fit_transform(corpus)
arrayOfCounts = SCV.toarray()
vocabulary =   SCV.get_feature_names()

WC = wordCounter()
countOfDocument = WC.count(vocabulary, d)
MPE = MultinomialParametersEstimator(0.1)
estimatedParams = MPE.get_params(arrayOfCounts, tags)
LC = multinomialLikelihoodCalculator()
likelihood = LC.calculateLikelihoods(countOfDocument, estimatedParams)
PC = posteriorCalculator()
posterior = PC.calculatePosteriors(likelihood, tagsPriorProb)
index_max = max(range(len(posterior)), key=posterior.__getitem__)
print("Answer:", categories[index_max])



Answer: Physics
