In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
# %cd .. 
import sys
sys.path.append("..")
import statnlpbook.util as util

<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# Text Classification 

In many applications we need to automatically classify some input text with respect to a set of classes or labels. For example,

* for information retrieval it is useful to classify documents into a set of topics, such as "sport" or "business",
* for sentiment analysis we classify tweets as "positive" or "negative", and
* for spam filtering we need to distinguish between ham and spam.

<!-- TODO: Load Web Corpus, 4 Universities, something were Maxent works -->

## Text Classification as Structured Prediction
We can formalize text classification as the simplest instance of [structured prediction](/template/statnlpbook/02_methods/00_structuredprediction) where the input space \\(\Xs\\) consists of sequences of words, and the output space \\(\Ys\\) is a set of labels such as \\(\Ys=\\{ \text{sports},\text{business}\\} \\) in document classification or \\(\Ys=\\{ \text{positive},\text{negative}, \text{neutral}\\} \\) in sentiment prediction. On a high level, our goal is to define a model \\(s_{\params}(\x,y)\\) that assigns high *scores* to the label \\(y\\) that fits the text \\(\x\\), and lower scores otherwise. The model is parametrized by \\(\params\\), and we learn these parameters from a training set \\(\train\\) of \\((\x,y)\\) pairs. To classify a text \\(\x\\) we solve the maximization problem $\argmax_{y \in \Ys} s_{\params}(\x,y)$, which is trivial when the number of classes is small because we can simply score every label.
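
The prediction step can be sketched in a few lines. This is only an illustration: `predict`, `toy_score` and the cue-word table `CUES` are hypothetical names, and the keyword-counting scorer merely stands in for a learned $s_{\params}(\x,y)$.

```python
def predict(score, x, labels):
    """Return the label y in `labels` that maximises score(x, y)."""
    return max(labels, key=lambda y: score(x, y))

# Hypothetical toy scorer: counts label-specific cue words in the text.
CUES = {"positive": {"good", "great"}, "negative": {"bad", "awful"}}

def toy_score(x, y):
    return sum(1 for word in x if word in CUES[y])

labels = ["positive", "negative"]
prediction = predict(toy_score, ["a", "great", "movie"], labels)
print(prediction)
```

Because the label set is small, enumerating all labels is enough; no search procedure is needed.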

<!-- TODO: Show a silly classifier example? --> 

In the following we present two typical approaches to text classification: Naive Bayes and discriminative linear classifiers. We will also see that both can in fact use the same model structure, and differ only in how the model parameters are trained.

### Naive Bayes
One of the most widely used approaches to text classification relies on the so-called Naive Bayes (NB) model. In NB we use a distribution $p^{\text{NB}}_{\params}$ for $s_\params$. In particular, we use the *a posteriori* probability of a label \\(y\\) given the input text \\(\x\\) as the score for that label:

\begin{equation}
  s_{\params}(\x,y) = p^{\text{NB}}_{\params}(y|\x)
\end{equation}

By Bayes' rule we get

\begin{equation}
    p^{\text{NB}}_{\params}(y|\x) =
  \frac{p^{\text{NB}}_{\params}(\x|y) p^\text{NB}_{\params}(y)}{p^{\text{NB}}_{\params}(\x)}  
\end{equation}

and when an input \\(\x\\) is fixed we can focus on 

\begin{equation}\label{eq:NB}
\prob^{\text{NB}}_{\params}(\x,y)= p^{\text{NB}}_{\params}(\x|y) p^\text{NB}_{\params}(y)
\end{equation}

because in this case \\(p^{\text{NB}}_{\params}(\x)\\) is a constant factor that does not depend on \\(y\\), so it does not change which label scores highest. In the above \\(p^{\text{NB}}_{\params}(\x|y)\\) is the *likelihood*, and \\(p^\text{NB}_{\params}(y) \\) is the *prior*.
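
A small numeric check makes this concrete. The prior and likelihood values below are invented for illustration only; the point is that normalising by the evidence rescales every label's score by the same factor, so the best label under the joint and under the posterior coincide.

```python
# Hypothetical numbers for one fixed input x and two labels.
prior = {"spam": 0.4, "ham": 0.6}            # p(y)
likelihood = {"spam": 0.02, "ham": 0.005}    # p(x|y)

# Joint p(x,y) = p(x|y) p(y), evidence p(x) = sum_y p(x,y).
joint = {y: likelihood[y] * prior[y] for y in prior}
evidence = sum(joint.values())

# Posterior p(y|x) = p(x,y) / p(x).
posterior = {y: joint[y] / evidence for y in joint}

best_joint = max(joint, key=joint.get)
best_posterior = max(posterior, key=posterior.get)
print(best_joint, best_posterior)
```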

<!--Let us assume that we have a number \\(K(\x)\\) of feature functions \\(f_k(\x)\\) that represent the input \\(\x\\). For example, in document classification this set could be used to represent the text \\(\x = (x_1,\ldots,x_n)\\) as a bag of words by setting \\(f_k(\x) = x_k\\) and \\(K(\x) = n\\). We could also use bigrams instead, setting \\(f_k(\x) = (x_k,x_{k+1})\\) and \\(K(\x) = n-1\\), or any other representation that is effective for distinguishing between classes of text.-->

The "naivety" of NB stems from a conditional independence assumption we make for the likelihood \\(p^{\text{NB}}_{\params}(\x|y)\\). Recall that two events \\(a\\) and \\(b\\) are conditionally independent given a third event \\(c\\) if \\(p(a,b|c) = p(a|c) p(b|c)\\). In particular, for the likelihood in NB we have:

\begin{equation}
  p^{\text{NB}}_{\params}(\x|y) = 
  \prod_{i=1}^{\text{length}(\x)} p^{\text{NB}}_{\params}(x_i|y)
\end{equation}

That is, NB makes the assumption that the observed words are independent of each other when *conditioned on the label* \\(y\\).
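
The factorised likelihood can be turned into a scoring function as in the following minimal sketch. The `prior` and `word_prob` tables are hypothetical placeholders for parameters that would normally be estimated from training data, and the sketch assumes every word of the input appears in `word_prob`. It works with log-probabilities, a common choice (not something prescribed by the model itself) that avoids numerical underflow when many small factors are multiplied.

```python
import math

# Hypothetical parameters; in practice they are estimated from training data.
prior = {"positive": 0.5, "negative": 0.5}              # p(y)
word_prob = {                                           # p(x_i | y)
    "positive": {"great": 0.6, "boring": 0.1, "movie": 0.3},
    "negative": {"great": 0.1, "boring": 0.6, "movie": 0.3},
}

def log_joint(x, y):
    """log p(x, y) = log p(y) + sum_i log p(x_i | y)."""
    return math.log(prior[y]) + sum(math.log(word_prob[y][w]) for w in x)

x = ["great", "movie"]
scores = {y: log_joint(x, y) for y in prior}
prediction = max(scores, key=scores.get)
print(prediction)
```

Since the logarithm is monotonic, taking the argmax over log-scores selects the same label as taking it over the joint probabilities themselves.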