<a href="https://colab.research.google.com/github/boywaiter/nlp_notes/blob/master/nlp_notes02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2 Linear Text Classification

**Text classification**: given a text document, assign it a discrete label $y\in\mathcal{Y}$, where $\mathcal Y$ is the set of possible labels.

## 2.1 The bag of words
A document (an instance) is represented as a column vector, $\boldsymbol x=[0, 1, 1, 0, 0, 2, 0, 1, 13, 0 ,\ldots]^\textrm{T}$, where $x_j$ is the count of the $j$-th word. The length of $\boldsymbol x$ is $V=|\mathcal V|$, the size of the vocabulary.

**Bag of words**: a representation that keeps only word counts and discards word order.
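
As a minimal sketch of how such a count vector could be built, the snippet below maps a tokenized document onto a fixed vocabulary. The helper name `bag_of_words`, the toy vocabulary, and the example sentence are all assumptions made for illustration; words outside the vocabulary are simply ignored here.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in tokens, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, chosen only for this example.
vocabulary = ["bad", "good", "movie", "not", "the"]
tokens = "the movie was not good not good".split()

x = bag_of_words(tokens, vocabulary)
print(x)

# Word order is lost: the reversed document yields the same vector.
print(bag_of_words(tokens[::-1], vocabulary) == x)
```

Because only counts survive, any reordering of the same tokens produces an identical $\boldsymbol x$.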

**Weights** $\boldsymbol \theta$ assign a score to each word, measuring its compatibility with a label.

To predict the label $\hat y$ for a given $\boldsymbol x$, we compute a score $\Psi(\boldsymbol x, y)$ for each $y\in \mathcal Y$,
$$
\Psi (\boldsymbol x, y)= \boldsymbol \theta \cdot \boldsymbol f(\boldsymbol x, y)=\sum_{j=1}^{KV} \theta_j\cdot f_j(\boldsymbol x,y) \tag{2.1}
$$
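where the feature function $\boldsymbol f(\boldsymbol x, y)$ returns a column vector of length $K\times V$, with $K=|\mathcal Y|$ the number of labels. The vector holds $\boldsymbol x$ in the block that belongs to label $y$ and $(K-1)\times V$ zeros elsewhere. This may seem awkward, but it lets a single weight vector score every label: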
$$
{\boldsymbol{f}(\boldsymbol{x}, y=1)=[\boldsymbol{x} ; \underbrace{0 ; 0 ; \ldots ; 0}_{(K-1)\times V}] }\tag{2.3}
$$
$$
{\boldsymbol{f}(\boldsymbol{x}, y=2)=[\underbrace{0 ; 0 ; \ldots ; 0 }_{V};\boldsymbol{x} ; \underbrace{0 ; 0 ; \ldots ; 0}_{(K-2)\times V}]}\tag{2.4}
$$
$$
{\boldsymbol{f}(\boldsymbol{x}, y=K)=[\underbrace{0 ; 0 ; \ldots ; 0}_{(K-1) \times V} ; \boldsymbol{x}]}\tag{2.5}
$$
The weight vector $\boldsymbol \theta$ is thus a column vector of length $K\times V$, i.e., $\boldsymbol \theta\in \mathbb R^{KV}$, where the entries $\theta_{(k-1)V+1:kV}$ are the weights for label $y=k$.

We predict the label $\hat y$ with
$$
\hat y =\mathop{\text{argmax}}\limits_{y\in \mathcal Y} \Psi(\boldsymbol x,y)\tag{2.6}
$$
$$
\Psi(\boldsymbol x,y )=\boldsymbol \theta\cdot\boldsymbol f(\boldsymbol x,y)\tag{2.7}
$$
It is common to add an **offset feature** at the end of $\boldsymbol x$, which is always 1; its weight acts as a bias. With the offset, each label's block has length $V+1$, so $\boldsymbol\theta\in\mathbb R^{K(V+1)}$.




In [0]:
import torch

In [0]:
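K = 3  # number of labels
V = 5  # vocabulary size
x = torch.randint(10, (V, 1))  # random word counts in [0, 10)
x = torch.cat((x, torch.ones(1, 1, dtype=torch.long)), 0)  # append the offset feature
weights = torch.ones(K * (V + 1))  # one weight per (label, feature) pair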

In [0]:
def feature_function(x, y):
  # Sparse f(x, y): map each index in the block of label y (1-based) to its count.
  return {(y-1)*(V+1)+i: x[i] for i in range(0, V+1)}

In [0]:
print(feature_function(x,2))

{6: tensor([7]), 7: tensor([5]), 8: tensor([8]), 9: tensor([7]), 10: tensor([1]), 11: tensor([1])}


In [0]:
def compute_score(x, y, weights):
  # Psi(x, y) = theta . f(x, y), summed over the nonzero block only.
  total = 0
  for feature, count in feature_function(x, y).items():
    total += weights[feature] * count
  return total

In [0]:
print(x)
print(weights)
total=compute_score(x,2,weights)
print(total)

tensor([[7],
        [5],
        [8],
        [7],
        [1],
        [1]])
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
tensor([29.])
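
The cells above score a single label. A minimal sketch of the full prediction rule in Eq. 2.6 follows, written with plain Python lists so that it runs on its own. The helper names `feature_vector`, `score`, and `predict`, and the weight values, are assumptions made for illustration; the counts reuse the sample run above.

```python
def feature_vector(x, y, num_labels):
    """Dense f(x, y): x placed in the block of label y (1-based), zeros elsewhere."""
    block = len(x)
    f = [0] * (num_labels * block)
    start = (y - 1) * block
    f[start:start + block] = x
    return f

def score(theta, x, y, num_labels):
    """Psi(x, y) = theta . f(x, y)."""
    return sum(t * f_j for t, f_j in zip(theta, feature_vector(x, y, num_labels)))

def predict(theta, x, num_labels):
    """Return the label with the highest score (Eq. 2.6)."""
    return max(range(1, num_labels + 1), key=lambda y: score(theta, x, y, num_labels))

K = 3
x = [7, 5, 8, 7, 1, 1]  # five word counts plus the offset feature
# Hypothetical weights: a constant within each label's block.
theta = [1.0] * 6 + [2.0] * 6 + [0.5] * 6

scores = [score(theta, x, y, K) for y in range(1, K + 1)]
y_hat = predict(theta, x, K)
print(scores, y_hat)
```

Since the counts sum to 29, the three scores are 29, 58 and 14.5, and the second label is predicted.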


## 2.2 Naive Bayes
$\textrm{p}(\boldsymbol x, y)$ is the **joint probability** of a text instance and its label. A dataset of $N$ labeled instances is $\{(\boldsymbol x^{(i)}, y^{(i)})\}_{i=1}^N$. We assume these instances are **independent and identically distributed (IID)**, so the joint probability of the whole dataset factors as $\textrm{p}(\boldsymbol x^{(1:N)}, y^{(1:N)})=\prod_{i=1}^N \textrm{p}(\boldsymbol x^{(i)}, y^{(i)})$.

**Maximum Likelihood Estimation**:
$$
\begin{eqnarray}
\hat{\boldsymbol \theta} &=& \mathop{\text{argmax}}\limits_{\boldsymbol \theta} \textrm{p}(\boldsymbol x^{(1:N)}, y^{(1:N)};\boldsymbol \theta)\tag{2.8}\\
&=&\mathop{\text{argmax}}\limits_{\boldsymbol \theta}\prod_{i=1}^N \textrm{p}(\boldsymbol x^{(i)}, y^{(i)};\boldsymbol \theta) \tag{2.9}\\
&=&\mathop{\text{argmax}}\limits_{\boldsymbol \theta}\sum_{i=1}^N \log\textrm{p}(\boldsymbol x^{(i)}, y^{(i)};\boldsymbol \theta)\tag{2.10}
\end{eqnarray}
$$
The probability $\textrm{p}(\boldsymbol x^{(i)}, y^{(i)};\boldsymbol \theta)$ is defined through a generative model, an idealized random process that is assumed to have generated the observed data. The joint probability is factored using the chain rule,
$$
\textrm{p}(\boldsymbol x^{(i)}, y^{(i)};\boldsymbol \theta)=\textrm{p}(\boldsymbol x^{(i)}| y^{(i)};\boldsymbol \phi)\times \textrm{p}(y^{(i)};\boldsymbol \mu)\tag{2.11}
$$
where $\boldsymbol \theta=\{\boldsymbol \mu, \boldsymbol \phi\}$.
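
To make Eqs. 2.9 to 2.11 concrete, here is a small sketch that evaluates the log joint of a toy dataset. It rests on assumptions not stated above: a categorical prior $\textrm{p}(y;\boldsymbol\mu)=\mu_y$ and a multinomial-style likelihood $\textrm{p}(\boldsymbol x|y;\boldsymbol\phi)\propto\prod_j \phi_{y,j}^{x_j}$, with the multinomial coefficient dropped because it does not depend on the parameters. The function names, parameter values, and data are invented for the example.

```python
import math

def joint(x, y, mu, phi):
    """p(x | y; phi) * p(y; mu), without the multinomial coefficient."""
    p = mu[y]
    for j, count in enumerate(x):
        p *= phi[y][j] ** count
    return p

def log_joint(x, y, mu, phi):
    """log p(x | y; phi) + log p(y; mu), the chain-rule factorization of Eq. 2.11 in log space."""
    log_prior = math.log(mu[y])
    log_likelihood = sum(count * math.log(phi[y][j]) for j, count in enumerate(x))
    return log_likelihood + log_prior

def dataset_log_likelihood(data, mu, phi):
    """Sum of per-instance log joints, the objective of Eq. 2.10."""
    return sum(log_joint(x, y, mu, phi) for x, y in data)

# Hypothetical parameters for K = 2 labels and V = 3 words.
mu = [0.5, 0.5]
phi = [[0.5, 0.25, 0.25],
       [0.25, 0.25, 0.5]]
data = [([2, 0, 1], 0), ([0, 1, 2], 1)]

total = dataset_log_likelihood(data, mu, phi)

# The log of the product (Eq. 2.9) matches the sum of logs (Eq. 2.10).
product = 1.0
for x, y in data:
    product *= joint(x, y, mu, phi)
print(total, math.log(product))
```

Working in log space leaves the maximizer unchanged, because the logarithm is monotonic, and it avoids the numerical underflow that a product over many instances would cause.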

## 2.3 Discriminative learning

In [0]:
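Word_count_max = 10  # counts are drawn from [1, Word_count_max)
N = 10  # number of instances
V = 5   # vocabulary size
K = 2   # number of labels
X = torch.randint(low=1, high=Word_count_max, size=(N, V))
X = torch.cat((X, torch.ones(N, 1, dtype=torch.long)), 1)  # append the offset feature
y = torch.randint(0, K, size=(N, 1))  # random labels in {0, ..., K-1}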

In [10]:
print(X)

tensor([[2, 1, 3, 2, 9, 1],
        [5, 7, 3, 9, 7, 1],
        [2, 9, 3, 7, 4, 1],
        [6, 5, 3, 8, 1, 1],
        [2, 1, 9, 9, 3, 1],
        [8, 1, 5, 3, 1, 1],
        [5, 9, 2, 3, 5, 1],
        [3, 8, 8, 2, 2, 1],
        [1, 8, 3, 1, 7, 1],
        [7, 7, 8, 5, 8, 1]])


In [0]:
import numpy as np

def feature_function_binary(x, y):
  # Sparse f(x, y) for labels y in {0, 1}: counts in the block of label y, zeros in the other block.
  dict1 = {y*(V+1)+i: x[i].item() for i in range(0, V+1)}
  dict2 = {(1-y)*(V+1)+i: 0 for i in range(0, V+1)}
  dict3 = {}
  dict3.update(dict1)
  dict3.update(dict2)
  return dict3

def update_theta(theta, D1, D2):
  # Perceptron update: theta += f(x, y) - f(x, hat_y), with D1 = f(x, y) and D2 = f(x, hat_y).
  for feature, count in D1.items():
    #print("feature=",feature," count=",count)
    theta[feature] += count - D2[feature]
  return theta

def compute_score_binary(x, y, weights):
  # Psi(x, y) = theta . f(x, y).
  total = 0
  for feature, count in feature_function_binary(x, y).items():
    total += weights[feature] * count
  return total

def perceptron(X, y):
  # Run a fixed number of steps on randomly sampled instances, starting from zero weights.
  t = 0
  theta = torch.zeros(K*(V+1), dtype=torch.float)
  times = 100
  while t < times:
    i = np.random.randint(0, N, 1).item()
    x = X[i]
    # Predict the higher-scoring label; ties go to label 0.
    hat_y = 0 if compute_score_binary(x, 0, theta).item() >= compute_score_binary(x, 1, theta).item() else 1
    # Update the weights only when the prediction is wrong.
    if hat_y != y[i].item():
      theta = update_theta(theta, feature_function_binary(x, y[i].item()),
                           feature_function_binary(x, hat_y))
    t += 1
  return theta
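
In the notation of Section 2.1, the step that `perceptron` and `update_theta` carry out can be written as follows. This restates what the code above does and is not a separate derivation. For a sampled instance $(\boldsymbol x^{(i)}, y^{(i)})$, first predict

$$
\hat y =\mathop{\text{argmax}}\limits_{y\in \mathcal Y} \boldsymbol \theta\cdot\boldsymbol f(\boldsymbol x^{(i)},y)
$$

and, only if $\hat y\neq y^{(i)}$, update

$$
\boldsymbol \theta \leftarrow \boldsymbol \theta + \boldsymbol f(\boldsymbol x^{(i)}, y^{(i)}) - \boldsymbol f(\boldsymbol x^{(i)}, \hat y)
$$

Because $\boldsymbol f$ places $\boldsymbol x^{(i)}$ in a single block, a mistake adds $\boldsymbol x^{(i)}$ to the weights of the true label and subtracts it from the weights of the wrongly predicted label, leaving all other weights unchanged.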

In [12]:
theta = perceptron(X,y)
print(theta)

feature= 6  count= 2
feature= 7  count= 1
feature= 8  count= 3
feature= 9  count= 2
feature= 10  count= 9
feature= 11  count= 1
feature= 0  count= 0
feature= 1  count= 0
feature= 2  count= 0
feature= 3  count= 0
feature= 4  count= 0
feature= 5  count= 0
feature= 0  count= 5
feature= 1  count= 9
feature= 2  count= 2
feature= 3  count= 3
feature= 4  count= 5
feature= 5  count= 1
feature= 6  count= 0
feature= 7  count= 0
feature= 8  count= 0
feature= 9  count= 0
feature= 10  count= 0
feature= 11  count= 0
feature= 6  count= 8
feature= 7  count= 1
feature= 8  count= 5
feature= 9  count= 3
feature= 10  count= 1
feature= 11  count= 1
feature= 0  count= 0
feature= 1  count= 0
feature= 2  count= 0
feature= 3  count= 0
feature= 4  count= 0
feature= 5  count= 0
feature= 0  count= 7
feature= 1  count= 7
feature= 2  count= 8
feature= 3  count= 5
feature= 4  count= 8
feature= 5  count= 1
feature= 6  count= 0
feature= 7  count= 0
feature= 8  count= 0
feature= 9  count= 0
feature= 10  count= 0
featur