# Lecture 3

## Naive Bayes' Classifier and Text Classification

### Text Classification
* Text classification -- use keywords and probabilities to classify text as a specific category
    * May be applied to spam detection, mood / sentiment analysis, author identification, identifying political affiliation, word sense disambiguation
* Text classification is a machine learning problem -- apply either supervised or unsupervised ML
    * Supervised ML: utilize fixed set of classes C, train a classifier from a set of labeled (document, class) pairs
        * Discriminative vs. generative models
    * Unsupervised ML: unknown set of classes C, topic modeling, utilize clustering

### Supervised Learning
* Requires training data set consisting of labeled data
    * Each input vector $x_i$ has some label $y_i$ 
    * Goal: determine a hypothesis function $h(x)$ that approximates the true relationship between $x$ and $y$
        1. Should be consistent with the training data
        2. Model should be generalizable to unseen examples
    * Trade-off between (1) and (2) -- fitting the training data more closely can make a model too specific (overfitting) and less accurate on unseen examples, and vice versa

### Representing Documents
* To represent the sample text: “to be, or not to be”
    * Set-of-words representation: {to, be, or, not}
    * Bag-of-words representation: {to: 2, be: 2, or: 1, not: 1}
    * Vector-space model: each word corresponds to one dimension in the vector space, entries may be stored as: 
        * Binary (word appears / does not appear)
        * Raw or normalized frequency counts
        * Weighted frequency counts
        * Probabilities
    * Issues / complications with these models: language is ordered, but these models disregard order and associations between words
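
The representations above can be sketched in a few lines of Python. This is a minimal illustration, assuming a deliberately simple tokenizer (lowercase, alphabetic runs only) and an alphabetical ordering of the vocabulary; neither choice comes from the lecture.

```python
import re
from collections import Counter

text = "to be, or not to be"

# Assumed tokenizer: lowercase the text and keep alphabetic runs only.
tokens = re.findall(r"[a-z]+", text.lower())

# Set-of-words: which words occur, ignoring counts and order.
set_of_words = set(tokens)

# Bag-of-words: how often each word occurs, ignoring order.
bag_of_words = Counter(tokens)

# Vector-space model: fix an ordering of the vocabulary, one dimension per word.
vocabulary = sorted(set_of_words)  # ['be', 'not', 'or', 'to']
binary_vector = [1 if w in set_of_words else 0 for w in vocabulary]
count_vector = [bag_of_words[w] for w in vocabulary]          # raw counts
normalized_vector = [c / len(tokens) for c in count_vector]   # relative frequencies
```

Note that "to be, or not to be" and "not to be, or to be" produce identical vectors, which is the loss of word order mentioned above.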

### Probabilities in NLP
* Ambiguity and uncertainty exist everywhere in NLP
* Uncertainty over the “correct interpretation” of speech / text
* Utilize probabilities to combine evidence from multiple sources to determine which interpretation of the text is most likely to be correct
* Bayesian Statistics: observe some evidence (words in a document) and infer the “correct interpretation” or topic of the text
    * Prior probabilities -- probability of an interpretation prior to seeing any evidence
    * Conditional (posterior) probability -- probability of an interpretation after taking evidence into account

### Probability Basics
* Begin with a sample space $\Omega$
    * Each $\omega \in \Omega$ is a possible outcome
* Probability distributions assign a probability to each basic outcome
    * Sum of probabilities of all possible outcomes is 1
* Random variable -- function that maps from the outcomes/sample space to the set of real numbers (or to a set of booleans)
* Joint probability -- probability of several events occurring together

$$P(A \cap B) = P(A, B)$$

* Conditional probability -- probability of an event occurring given some knowledge of another event 

$$P(A | B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A, B)}{P(B)}$$ 

### Rules for Conditional Probability
* Product Rule:

$$ P(A, B) = P(B) \cdot P(A | B) = P(A) \cdot P(B | A) $$

* Chain Rule (generalization of product rule):

$$ \begin{align}
P(A_1, A_2, ..., A_n) &= P(A_1 | A_2, A_3, ..., A_n) \cdot P(A_2, A_3, ..., A_n) \\
&= P(A_1) \cdot P(A_2 | A_1) \cdot ... \cdot P(A_n | A_1, A_2, ..., A_{n - 1})
\end{align}
$$
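
As a small worked instance of the chain rule with $n = 3$, take the events to be the first three words of the sample text "to be, or not to be" (an illustrative choice, not from the lecture):

$$ P(\text{to}, \text{be}, \text{or}) = P(\text{to}) \cdot P(\text{be} | \text{to}) \cdot P(\text{or} | \text{to}, \text{be}) $$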

* Bayes' Rule:

$$ P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}
$$
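
A quick numeric sketch of Bayes' Rule, taking $A$ = "message is spam" and $B$ = "message contains the word *free*". All numbers are hypothetical, and the function name `bayes_posterior` is made up for this example. The denominator $P(B)$ is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) given P(A), P(B | A), and P(B | not A)."""
    # Law of total probability: P(B) = P(B | A) P(A) + P(B | not A) P(not A)
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    # Bayes' Rule: P(A | B) = P(B | A) P(A) / P(B)
    return p_b_given_a * prior_a / evidence

# Hypothetical numbers: 20% of mail is spam; "free" appears in 50% of spam
# and in 5% of non-spam messages.
p_spam_given_free = bayes_posterior(0.2, 0.5, 0.05)
print(round(p_spam_given_free, 3))  # about 0.714
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of roughly 0.71.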

### Independence
* Two events $A$ and $B$ are independent if 

$$ P(A) = P(A | B) $$

or equivalently, 

$$ P(A, B) = P(A) \cdot P(B) $$

* Two events $B$ and $C$ are **conditionally independent** given $A$ if

$$ P(B, C | A) = P(B | A) P(C | A) $$

or equivalently, 

$$ P(B | A, C) = P(B | A) \text{  and  } P(C | A, B) = P(C | A) $$
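
The independence test $P(A, B) = P(A) \cdot P(B)$ can be checked by brute force on a small sample space. The sketch below assumes two fair six-sided dice (an example chosen here, not from the lecture) and uses exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate on outcomes) under the uniform distribution."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def first_even(w):
    return w[0] % 2 == 0

def sum_is_seven(w):
    return w[0] + w[1] == 7

def sum_is_eight(w):
    return w[0] + w[1] == 8

def independent(a, b):
    """Check P(A, B) == P(A) * P(B) exactly."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)

print(independent(first_even, sum_is_seven))  # True:  3/36 == (1/2) * (6/36)
print(independent(first_even, sum_is_eight))  # False: 3/36 != (1/2) * (5/36)
```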

### Probabilities and Supervised Learning
* Given: training data consisting of training examples

$$
  \text{data} = (x_1, y_1), ... , (x_n, y_n)
$$

Goal: learn a mapping $h$ from x to y. 

* Learn the mapping using $P(y | x)$
* Two approaches:
    * Discriminative algorithms learn $P(y | x)$ directly
    * Generative algorithms use Bayes' Rule:

$$ P(y | x) = \frac{P(x | y) \cdot P(y)}{P(x)} $$

### Discriminative Algorithms
* Model conditional distribution of the label given the data $P(y | x)$
* Examples: linear and log-linear models, support vector machine (SVM), decision trees, random forests

### Generative Algorithms
* Assume the observed data is being "generated" by a "hidden" class label
* Build a **different conditional distribution** for each class
* Estimate $P(x | y)$ and $P(y)$, then use Bayes' Rule:

$$ P(y | x) = \frac{P(x | y) \cdot P(y)}{P(x)} $$

* Examples: Naive Bayes, Hidden Markov Models, Gaussian Mixture Models, PCFGs

### Naive Bayes
* Assumption: the attributes are conditionally independent of each other given the label, which may not be true in reality

$$ \begin{align}
P(Label, X_1, ..., X_d) &= P(Label) \prod_i P(X_i | Label) \\
y^* &= \arg \max_y P(y) \prod_i P(x_i | y)
\end{align}
$$
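
The decision rule $y^* = \arg \max_y P(y) \prod_i P(x_i | y)$ can be sketched as follows. The labels, words, and probability tables are hypothetical values made up for illustration. The product is computed as a sum of logarithms, which gives the same arg max and avoids numerical underflow on long documents; that is an implementation choice, not part of the formula above.

```python
import math

# Hypothetical parameters, invented for this example (not estimated from data).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.5, "meeting": 0.1, "money": 0.4},
    "ham": {"free": 0.1, "meeting": 0.7, "money": 0.2},
}

def predict(words):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    scores = {}
    for label in prior:
        score = math.log(prior[label])
        for w in words:
            score += math.log(likelihood[label][w])
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["free", "money"]))    # spam: 0.4*0.5*0.4 = 0.08  vs ham: 0.6*0.1*0.2 = 0.012
print(predict(["free", "meeting"]))  # ham:  0.6*0.1*0.7 = 0.042 vs spam: 0.4*0.5*0.1 = 0.02
```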

### Training the Naive Bayes' Classifier
* Goal: estimate $P(Label)$ and $P(X_i | Label)$ from the training data
* Estimate the prior $P(y)$ and the class-conditional likelihoods $P(x_i | y)$ using **Maximum Likelihood Estimates (MLE)**:

$$ \begin{align}
P(y) &= \frac{Count(y)}{\sum_{y' \in Y} Count(y')} \\
P(x_i | y) &= \frac{Count(x_i, y)}{\sum_{x'} Count(x', y)}
\end{align}
$$

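* May face issues if a count is ever 0: a single estimate $P(x_i | y) = 0$ makes the whole product for class $y$ zero, and a zero denominator leaves the estimate undefined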
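
The two MLE formulas amount to counting and normalizing. Below is a minimal sketch, assuming each training example is a `(tokens, label)` pair and that $Count(y)$ counts documents while $Count(x_i, y)$ counts word occurrences within class $y$. The function name `train_naive_bayes` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """Return MLE estimates (prior, likelihood) from (tokens, label) pairs."""
    label_counts = Counter()            # Count(y)
    word_counts = defaultdict(Counter)  # Count(x, y)
    for tokens, label in documents:
        label_counts[label] += 1
        word_counts[label].update(tokens)

    total_docs = sum(label_counts.values())
    prior = {y: c / total_docs for y, c in label_counts.items()}

    likelihood = {}
    for y, counts in word_counts.items():
        total_words = sum(counts.values())
        likelihood[y] = {w: c / total_words for w, c in counts.items()}
    return prior, likelihood

# Toy training set, made up for illustration.
training_data = [
    (["free", "money", "free"], "spam"),
    (["meeting", "today"], "ham"),
    (["money", "meeting"], "ham"),
]
prior, likelihood = train_naive_bayes(training_data)
print(prior)       # spam: 1/3, ham: 2/3
print(likelihood)  # e.g. P(free | spam) = 2/3, P(meeting | ham) = 1/2
```

In this toy data the word "free" never occurs in a "ham" document, so its MLE estimate for that class is 0 (here it is simply absent from `likelihood["ham"]`), which is the zero-count problem noted above.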

### Independence Assumption
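* The independence assumption is important, because otherwise we would have to estimate the full joint conditional distribution $P(X_1, X_2, ..., X_d | Label)$, which for discrete attributes requires a number of probabilities that grows exponentially with $d$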
* Allows us to estimate each $P(X_i | Label)$ independently
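
To get a feel for the savings, the sketch below counts free parameters under the simplifying assumption that every attribute is binary (an assumption made here for the arithmetic, not stated in the lecture). A full joint table over $d$ binary attributes has $2^d - 1$ free entries per class, while Naive Bayes needs only one probability per attribute per class.

```python
def joint_parameters(d, num_classes):
    """Free parameters for a full table P(X_1, ..., X_d | Label), binary attributes."""
    return num_classes * (2 ** d - 1)

def naive_bayes_parameters(d, num_classes):
    """Free parameters for the factored model: one P(X_i = 1 | Label) per attribute and class."""
    return num_classes * d

print(joint_parameters(10, 2))        # 2046
print(naive_bayes_parameters(10, 2))  # 20
```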