# Classification of Text

## Text Classification

### Examples of Text Classification
1. Topic identification
2. Spam detection
3. Sentiment analysis: is this movie review positive or negative?
4. Spelling correction: weather or whether? color or colour?

### Supervised Classification
1. Learn a **classification model** on properties ('features') and their importance ('weights') from labeled instances 
2. Phases: training phase; inference phase
3. Datasets: training data set; validation data set; test data set

### Classification Paradigms
1. When there are only two possible classes; |Y| = 2:
**Binary Classification**
2. When there are more than two possible classes; |Y| > 2:
**Multi-class Classification**
3. When data instances can have two or more labels:
**Multi-label Classification**

### Qs to ask in Supervised Learning
1. Training phase
    - what are the features? how do you represent them?
    - what is the classification model/algorithm?
    - what are the model parameters?
2. Inference phase
    - what is the expected performance? what is a good measure?

## Identifying Features from Text

### Why is textual data unique?
1. Textual data presents a unique set of challenges
2. All the information you need is in text
3. But features can be pulled out from text at different granularities.

### Types of textual features
1. Words
    - by far the most common class of features
    - handling commonly-occurring words: Stop Words
    - normalization: make lower case vs. leave as-is
    - stemming/lemmatization
2. Characteristics of words
    - capitalization
    - parts of speech of words in a sentence
    - grammatical structure, sentence parsing
    - grouping words of similar meaning, semantics
        - {buy, purchase}
        - {Mr., Ms., Dr., Prof.}
        - numbers/digits
        - dates
3. Depending on the classification task, features may come from inside words or from word sequences
    - bigrams, trigrams, n-grams; "White House"
    - character sub-sequences in words: 'ing', 'ion', ...
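
The feature types listed above can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list, the regular expression used for tokenizing, and the helper names (`tokenize`, `word_ngrams`, `char_ngrams`) are assumptions, not part of any particular library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "to"}

def tokenize(text, lowercase=True):
    """Split text into word tokens, optionally normalizing to lower case."""
    if lowercase:
        text = text.lower()
    return re.findall(r"[A-Za-z0-9']+", text)

def word_ngrams(tokens, n):
    """Return contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Return contiguous character sub-sequences of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = tokenize("The White House is in Washington")
content_words = [t for t in tokens if t not in STOP_WORDS]  # stop words removed
bigrams = word_ngrams(tokens, 2)                            # includes "white house"
suffix_like = char_ngrams("running", 3)                     # includes "ing"

print(content_words)
print(bigrams)
print(suffix_like)
```

Whether to lower-case, drop stop words, or keep character n-grams depends on the task; for example, lower-casing would hide the capitalization cue that separates "White House" from "white house".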

## Naive Bayes Classifiers

### Case Study: Classifying text search queries
1. You want to classify search queries into 3 classes: Entertainment, Computer Science, and Zoology. The most common class of the three is Entertainment.
2. If the query is "python", it could belong to any of the classes: 1) python the snake; 2) Python the programming language; 3) Monty Python. But the most probable class, given "python", is Zoology.
3. If the query is "python download", the most probable class is Computer Science.

### Probabilistic Model
1. In the example above, a probabilistic model tells you the likelihood of each class before you have any information.
2. You then update the likelihood of each class given new information.
3. Prior Probability: Pr(y=Entertainment), Pr(y=CS), Pr(y=Zoology)
4. Posterior Probability: Pr(y=Zoology|x='python')


### Bayes' Rule
1. $Posterior\ probability = \frac{Prior\ probability \times Likelihood}{Evidence}$
2. $Pr(y|X) = \frac{Pr(y) \times Pr(X|y)}{Pr(X)}$
3. $Pr(y=CS|"python") = \frac{Pr(y=CS) \times Pr("python"|y=CS)}{Pr("python")}$
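
As a worked example of Bayes' rule for the query "python", the sketch below plugs in made-up numbers. All priors and likelihoods here are hypothetical values chosen only so that Entertainment is the most common class a priori, as in the case study; they are not estimated from real data.

```python
# Hypothetical priors Pr(y): Entertainment is the most common class.
priors = {"Entertainment": 0.5, "Computer Science": 0.3, "Zoology": 0.2}

# Hypothetical likelihoods Pr("python" | y).
likelihood = {"Entertainment": 0.001, "Computer Science": 0.005, "Zoology": 0.01}

# Numerator of Bayes' rule for each class: Pr(y) * Pr("python" | y).
joint = {y: priors[y] * likelihood[y] for y in priors}

# Evidence Pr("python") is the sum of the numerators over all classes.
evidence = sum(joint.values())

# Posterior Pr(y | "python").
posterior = {y: joint[y] / evidence for y in joint}
best = max(posterior, key=posterior.get)

for y, p in posterior.items():
    print(f"Pr(y={y} | 'python') = {p:.3f}")
print("most probable class:", best)
```

With these numbers the posterior favors Zoology even though Entertainment has the largest prior, because the evidence term is the same for every class and only the product of prior and likelihood decides the ranking.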

### Naive Bayes Classification
1. Naive assumption: given the class label, features are assumed to be independent of each other.
2. Formulation 
<img src="https://img.ceclinux.org/eb/d3d44d2259c613b95a0d373805245fc41c5809.png">
3. Example
<img src="https://img.ceclinux.org/21/b3eee56e85c2377062853774f9055e17135e21.png">
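
A minimal sketch of the classification rule, assuming the usual formulation $y^* = \arg\max_y Pr(y) \prod_i Pr(x_i|y)$. The function name `nb_predict` and every probability below are hypothetical and chosen only to mirror the search-query case study. Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long inputs and does not change the arg max.

```python
import math

def nb_predict(query_words, priors, likelihoods):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in query_words:
            # Naive assumption: each word contributes independently given the class.
            score += math.log(likelihoods[label][word])
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical parameters, for illustration only.
priors = {"Entertainment": 0.5, "Computer Science": 0.3, "Zoology": 0.2}
likelihoods = {
    "Entertainment":    {"python": 0.001, "download": 0.002},
    "Computer Science": {"python": 0.005, "download": 0.01},
    "Zoology":          {"python": 0.01,  "download": 0.0001},
}

print(nb_predict(["python"], priors, likelihoods))
print(nb_predict(["python", "download"], priors, likelihoods))
```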


### Naive Bayes: Parameters
1. Prior probabilities: Pr(y) for all y in Y
2. Likelihood: $Pr(x_{i}|y)$ for all features $x_i$ and labels y in Y
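3. If there are 3 classes (|Y| = 3) and 100 features in X, how many parameters does a naive Bayes model have?
4. Ans: |Y| + 2 × |X| × |Y| = 3 + 2 × 100 × 3 = 603 (one prior per class, plus two likelihood values per binary feature per class)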
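
The count above can be checked with a few lines of arithmetic. This sketch assumes binary features, so each feature contributes two likelihood values per class; the function name is hypothetical.

```python
def naive_bayes_parameter_count(num_classes, num_features, values_per_feature=2):
    """Count priors plus per-class likelihood entries (binary features by default)."""
    priors = num_classes
    likelihoods = values_per_feature * num_features * num_classes
    return priors + likelihoods

total = naive_bayes_parameter_count(num_classes=3, num_features=100)
print(total)
```

Note that this counts every stored probability. Because probabilities in each distribution sum to 1, the number of values that must actually be estimated is smaller.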

### Naive Bayes: Smoothing
1. If feature $x_i$ never occurs in documents labeled y, the estimated likelihood $Pr(x_i|y)$ is 0, which forces the posterior $Pr(y|x_i)$ to 0 as well
2. So the parameters should be smoothed
    - **Laplace smoothing** or **additive smoothing**: add dummy counts, e.g. add a count of 1 to every feature
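
A small sketch of additive smoothing for word likelihoods within one class, assuming a multinomial-style estimate (word count divided by total word count). The helper name, the vocabulary, and the toy documents are hypothetical.

```python
from collections import Counter

def smoothed_likelihoods(docs_for_class, vocabulary, alpha=1.0):
    """Estimate Pr(word | class) with additive smoothing; alpha=1 is Laplace smoothing."""
    counts = Counter(word for doc in docs_for_class for word in doc)
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Hypothetical toy data: "download" never appears in Zoology documents.
vocabulary = ["python", "download", "snake"]
zoology_docs = [["python", "snake"], ["snake"]]

unsmoothed = smoothed_likelihoods(zoology_docs, vocabulary, alpha=0.0)
smoothed = smoothed_likelihoods(zoology_docs, vocabulary, alpha=1.0)

print(unsmoothed["download"])  # zero count gives zero probability
print(smoothed["download"])    # small but non-zero after smoothing
```

Without smoothing, a single unseen word would zero out the whole product for that class, no matter how strongly the other words point to it.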
    
### Summary 
1. Naive Bayes is a probabilistic model
2. It assumes features are independent of each other, given the class label
3. This assumption is not necessarily true, even in text mining
4. For text classification problems, naive Bayes usually provides a very strong baseline.

## Naive Bayes Variations
### 2 Classic Naive Bayes Variants for Text
1. Multinomial Naive Bayes
    - Assumption: Data follows a multinomial distribution
    - Each feature is a count (word occurrence counts, TF-IDF weighting, ...)
        - TF: term frequency
        - IDF: inverse document frequency (TF-IDF gives more weight to words that are frequent in a document but rare across the collection)
2. Bernoulli Naive Bayes
    - Assumption: Data follows a multivariate Bernoulli distribution
    - Each feature is binary (e.g. word is present/ absent)
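
The difference between the two variants shows up in how the same text is turned into a feature vector. The sketch below uses a hypothetical three-word vocabulary and plain Python, not any library's vectorizer.

```python
from collections import Counter

vocabulary = ["python", "download", "snake"]
tokens = "python python download".split()

counts = Counter(tokens)
present = set(tokens)

# Multinomial view: how many times each vocabulary word occurs.
multinomial_features = [counts[w] for w in vocabulary]

# Bernoulli view: whether each vocabulary word occurs at all.
bernoulli_features = [int(w in present) for w in vocabulary]

print(multinomial_features)
print(bernoulli_features)
```

The Bernoulli representation discards repetition, so a document mentioning "python" twice looks the same as one mentioning it once.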

## Support Vector Machines
1. Classifier = Function on input data
2. Decision Boundaries
    - a classification function is represented by decision surfaces
    - <img src="https://img.ceclinux.org/d3/8440af9c80b446b33144ef4c28a7ac874e705f.png">
    - Linear boundaries
        - easy to find; easy to evaluate
        - more generalizable
        - Maximum Margin
        <img src="https://img.ceclinux.org/b4/bf2205c52fa4846d8ca60ead603c167ddcdc24.png">
        - **Support Vector Machines** are maximum-margin classifiers
        <img src="https://img.ceclinux.org/72/214124d871334fc95a3ce579c192c280ebf8a6.png">
        
### Support Vector Machines (SVM)
1. SVMs are **linear classifiers** that find a hyperplane to separate **two classes** of data: positive and negative 
2. Given training data $(x_1, y_1)$, $(x_2, y_2)$, etc., where $x_i = (x_{i1}, x_{i2}, ..., x_{in})$ is an instance vector and $y_i$ is one of {-1, +1}
3. SVM finds a linear function defined by a weight vector $w$ and a bias $b$

    $f(x_i) = \langle w, x_i \rangle + b$

    if $f(x_i) \geq 0$, $y_i = +1$, else $y_i = -1$
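
The decision rule can be written out directly. This sketch covers only prediction with an already-known weight vector and bias; the values of `w` and `b` are hypothetical, and finding them (the maximum-margin optimization) is not shown.

```python
def decision_function(w, x, b):
    """Compute f(x) = <w, x> + b for plain Python lists."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

def svm_predict(w, x, b):
    """Return +1 if f(x) >= 0, otherwise -1."""
    return 1 if decision_function(w, x, b) >= 0 else -1

# Hypothetical weights and bias, for illustration only.
w = [2.0, -1.0]
b = -0.5

print(decision_function(w, [1.0, 0.5], b), svm_predict(w, [1.0, 0.5], b))
print(decision_function(w, [0.0, 1.0], b), svm_predict(w, [0.0, 1.0], b))
```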

###  SVM: Multi-class classification
1. One vs. Rest
    - n-class SVM has n classifiers
    - <img src="https://img.ceclinux.org/ea/57365ec7e7e8c669b8bef83082f4d19597decf.png">
2. One vs. One
    - n-class SVM has C(n,2) classifiers
    <img src="https://img.ceclinux.org/26/9ab9469893c2d4e8593edeb414fe8a6464522e.png">
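
To make the classifier counts concrete, the sketch below enumerates the binary problems each strategy would train for the three classes of the earlier case study. It only lists the pairings; no classifier is trained.

```python
from itertools import combinations
from math import comb

classes = ["Entertainment", "Computer Science", "Zoology"]

# One vs. Rest: one binary problem per class (that class against all others).
one_vs_rest = [(c, "rest") for c in classes]

# One vs. One: one binary problem per unordered pair of classes.
one_vs_one = list(combinations(classes, 2))

print(len(one_vs_rest), one_vs_rest)
print(len(one_vs_one), one_vs_one)

# The gap grows quickly with the number of classes n: n versus C(n, 2).
n = 10
print(n, comb(n, 2))
```

For three classes both strategies happen to need three classifiers; for ten classes it is 10 versus 45.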

### SVM Parameters
1. Parameter C
    - regularization: how much importance should you give individual data points as compared to a better generalized model
    - larger C = less regularization
         - fit training data as well as possible, every data point is important
2. Other params
     - Linear kernels usually work best for text data
         - other kernels include rbf, polynomial
     - multi_class: ovr (one-vs-rest)
     - class_weight: different classes can get different weights
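
A short sketch of how these parameters might be set in scikit-learn, assuming a bag-of-words pipeline built with `CountVectorizer`. The four training queries and their labels are hypothetical toy data, and the chosen values (`C=1.0`, `class_weight="balanced"`) are only examples, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data, for illustration only.
train_texts = [
    "python download",
    "install python package",
    "python snake habitat",
    "snake venom",
]
train_labels = ["cs", "cs", "zoology", "zoology"]

model = make_pipeline(
    CountVectorizer(),  # word-count features
    SVC(
        kernel="linear",          # linear kernel, as suggested for text
        C=1.0,                    # larger C means less regularization
        class_weight="balanced",  # reweight classes by inverse frequency
    ),
)
model.fit(train_texts, train_labels)

predicted = model.predict(["download python"])
print(predicted)
```

In practice C is usually tuned on a validation set or by cross-validation rather than fixed by hand.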
     
### Summary
1. SVMs tend to be among the most accurate classifiers, especially on high-dimensional data
2. Strong theoretical foundation (optimization theory)
3. Handle only numeric features
    - convert categorical features to numeric features
    - normalization needed (all in 0-1 range)
4. Hyperplane hard to interpret

## Learning Text Classifiers in Python
### Model Selection in Scikit-learn
```
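from sklearn import model_selection
from sklearn.svm import SVC

# train_data and train_labels are assumed to hold the features and labels
# simple hold-out split: about one third of the data is kept for testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    train_data, train_labels, test_size=0.333, random_state=0)

# 5-fold cross-validation; clfrSVM is assumed here to be a linear SVM
clfrSVM = SVC(kernel='linear')
predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)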
```

### Supervised Text Classification in NLTK
Classification algorithms in NLTK:

- NaiveBayesClassifier
- DecisionTreeClassifier
- ConditionalExponentialClassifier
- MaxentClassifier
- WekaClassifier
- SklearnClassifier

### Using NLTK's NaiveBayesClassifier
Example: Naive Bayes
```
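import nltk
from nltk.classify import NaiveBayesClassifier

# train_set and test_set are lists of (featureset, label) pairs
classifier = NaiveBayesClassifier.train(train_set)

# predict the label of one instance, or of a list of instances
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)

# accuracy on held-out labeled data
nltk.classify.util.accuracy(classifier, test_set)

# all labels seen during training
classifier.labels()

# features that best discriminate between labels
classifier.show_most_informative_features()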
```
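
NLTK classifiers are trained on a list of `(featureset, label)` pairs, where each featureset is a dictionary mapping feature names to values. The sketch below shows one way such a `train_set` might be built in plain Python; the helper name `word_features` and the three labeled queries are hypothetical.

```python
def word_features(text):
    """Map each lower-cased word to True, a simple bag-of-words featureset."""
    return {word: True for word in text.lower().split()}

# Hypothetical labeled queries, for illustration only.
labeled_queries = [
    ("python download", "Computer Science"),
    ("python snake", "Zoology"),
    ("Monty Python film", "Entertainment"),
]

# The structure expected by NaiveBayesClassifier.train: (featureset, label) pairs.
train_set = [(word_features(text), label) for text, label in labeled_queries]

# An unlabeled instance is just a featureset without a label.
unlabeled_instance = word_features("download python")

print(train_set[0])
print(unlabeled_instance)
```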

Example: SVM
``` 
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

clfnb = SklearnClassifier(MultinomialNB()).train(train_set)

clfsvm = SklearnClassifier(SVC(kernel='linear')).train(train_set)
```
