# Document classification
- applications: information management, document retrieval, routing, and filtering systems
- challenges:
    - semantics, homonyms
    - high-dimensionality problem
    
## Document collection

## Preprocessing/Feature extraction
- Tokenization
- Removing stop words
- Stemming
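
The three steps above can be sketched as a small pipeline. This is a minimal illustration, not a production preprocessor: the stop list is a tiny hypothetical one, and `crude_stem` is a simple suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop list (an assumption); real systems use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The cats are chasing the mice in gardens"))
# ['cat', 'chas', 'mice', 'garden']
```

Note that the crude stemmer produces non-words such as `chas`; that is acceptable here because stems only need to be consistent, not readable.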


- dimensionality reduction: omitting unimportant features
    - feature extraction vs feature selection methods

## Feature selection
- Select subset of features by some criterion
    - Information gain*
    - Term frequency
    - Chi-square*
    - Expected cross entropy
    - Odds Ratio
    - the weight of evidence of text
    - Mutual information 
    - Gini index
    - genetic algorithm (GA) optimization (?)

_* denotes most commonly used_
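
As an illustration of one criterion from the list, the chi-square score of a term for a class can be computed from a 2x2 contingency table of document counts. The sketch below assumes documents are given as sets of terms; the toy corpus and the function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one class.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def chi_square_scores(docs, labels, target):
    """Score every vocabulary term against one target class."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        a = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab == target)
        b = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab != target)
        c = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab == target)
        d = len(docs) - a - b - c
        scores[term] = chi_square(a, b, c, d)
    return scores

docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
scores = chi_square_scores(docs, labels, "sport")
for term in sorted(scores, key=lambda t: (-scores[t], t)):
    print(term, round(scores[term], 3))
```

Feature selection then keeps the top-scoring terms. Here `team` scores 0 because it occurs equally often in both classes, so it would be dropped first.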


- learning process: finding attributes in the examples that distinguish objects of different classes
    - avoid overfitting; do cross-validation



## Indexing
- This is document representation
- Transform full text into document vector 
    - vector space model: each document is represented by a vector of words
    - Bag of words/Vector space model
        - Disadvantages: high dimensionality & loss of info (correlation with adjacent words, semantic relationships)
        - To overcome these problems: term weighting
- Term weighting metrics:
    - Boolean weighting
    - word frequency weighting
    - TF-IDF
    - entropy
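
A minimal sketch of TF-IDF weighting over tokenized documents follows. It assumes one common variant (raw term frequency times `log(N / df)`); other variants normalize or smooth differently, and the toy corpus is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["cat", "sat", "cat"], ["dog", "sat"], ["dog", "barked"]]
vectors = tf_idf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

A term that occurs in every document gets weight 0 under this variant, which is how the weighting suppresses uninformative words.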
    
## Classification
### Rocchio's
- for use in relevance feedback
- vector space method
- centroid; average vector
- inductive learning process
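
The centroid idea can be sketched as follows: average the training vectors of each class, then assign a new document to the class whose centroid is most similar. This is a simplified, assumed form of Rocchio's method that uses positive examples only (the relevance-feedback formula also subtracts a weighted centroid of negative examples); the data and function names are hypothetical.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Average a list of sparse {term: weight} vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def train_centroids(vectors, labels):
    """Build one centroid (prototype vector) per class."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    return {label: centroid(vs) for label, vs in by_class.items()}

def classify(centroids, vec):
    """Pick the class whose centroid is closest in cosine similarity."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

train = [{"ball": 2, "goal": 1}, {"ball": 1, "team": 1},
         {"vote": 2, "law": 1}, {"vote": 1, "party": 1}]
labels = ["sport", "sport", "politics", "politics"]
centroids = train_centroids(train, labels)
print(classify(centroids, {"goal": 1, "team": 1}))  # sport
```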

### KNN
- case-based learning
- Euclidean distance / Cosine similarity
- majority voting
- Advantage:
    - nonparametric
    - simple and easy to implement
    - fast to train (lazy learner: no explicit training phase)
- Disadvantage:
    - classification time is long
    - hard to choose the optimal k
    - uses all features
    - need to compute distances using all document features
    - not robust to noise and irrelevant features
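
A minimal sketch of kNN with cosine similarity and majority voting is shown below. The training pairs and the choice `k=3` are hypothetical. Every query is compared against every training document, which is where the long classification time comes from.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_classify(train, query, k=3):
    """train: list of (vector, label) pairs; returns the majority label of the k nearest."""
    ranked = sorted(train, key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ({"ball": 1, "goal": 1}, "sport"),
    ({"ball": 2, "team": 1}, "sport"),
    ({"goal": 1, "team": 1}, "sport"),
    ({"vote": 1, "law": 1}, "politics"),
    ({"vote": 2, "party": 1}, "politics"),
]
print(knn_classify(train, {"ball": 1, "goal": 1}, k=3))  # sport
```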
    
### Bayesian classifier/Naive Bayes
- module classifier
- multivariate Bernoulli and multinomial models
- spam filtering & email categorization
- Advantage:
    - easy implementation & computation
    - need small amount of training data
    - gives good results as long as correct category is more probable than other categories
- Disadvantage:    
    - Poor performance when features are correlated
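
The multinomial model can be sketched in a few lines. This version assumes add-one (Laplace) smoothing and ignores query terms never seen in training; both are illustrative choices, and the toy spam/ham data is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class counts, per-class term counts and the vocabulary."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
        vocab.update(doc)
    return class_counts, term_counts, vocab

def classify_nb(model, doc):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        total = sum(term_counts[label].values())
        score = math.log(count / n_docs)  # log prior
        for term in doc:
            if term in vocab:
                # Add-one smoothed log likelihood of the term given the class.
                score += math.log((term_counts[label][term] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["win", "money", "now"], ["free", "money"],
        ["meeting", "at", "noon"], ["project", "meeting", "now"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(classify_nb(model, ["free", "money", "now"]))  # spam
```

Summing per-term log probabilities is exactly the independence assumption; correlated terms get counted as if they were separate evidence, which is the weakness listed above.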

### Decision Tree
- entropy criterion
- Advantage:
    - easy to understand and interpret
- Disadvantage:
    - overfitting
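
To make the entropy criterion concrete, the sketch below computes the information gain of splitting a labeled corpus on the presence of a term. A tree learner would pick the term with the highest gain at each node. The data and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Entropy reduction from splitting on whether `term` occurs in a document."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_term, without_term) if part)
    return entropy(labels) - remainder

docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
print(information_gain(docs, labels, "ball"))  # perfect split
print(information_gain(docs, labels, "team"))  # uninformative split
```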

### Decision Rule
- rule-based inference
- DNF model
- inductive learning rule
- heuristics to reduce number of features
- Advantage: 
    - ability to create local dictionary for each separate category
- Disadvantage:
    - inability to assign a document to a single category
    - need help from human experts
    - does not work well with a large number of features
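
A DNF rule set can be sketched as one disjunction of conjunctions per category. The rules below are entirely hypothetical and hand-written (a real system would induce them from training data); the example also shows how several rules can fire for the same document.

```python
# Hypothetical rule set: each category maps to a DNF rule, i.e. a list of
# conjunctions; each conjunction is a list of (term, must_be_present) literals.
RULES = {
    "sport": [[("ball", True), ("vote", False)], [("goal", True)]],
    "politics": [[("vote", True)], [("law", True), ("ball", False)]],
}

def matches(conjunction, terms):
    """True if every literal of the conjunction holds for the term set."""
    return all((term in terms) == present for term, present in conjunction)

def classify(terms, rules=RULES):
    """Return every category whose DNF rule fires (possibly none or several)."""
    terms = set(terms)
    return [cat for cat, dnf in rules.items()
            if any(matches(conj, terms) for conj in dnf)]

print(classify(["ball", "team"]))  # ['sport']
print(classify(["goal", "vote"]))  # ['sport', 'politics']
print(classify(["team"]))          # []
```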

### SVM
- need both positive and negative training set
- decision surface, hyperplane, support vectors
- generally binary
- Advantage:
    - works in high-dimensional feature spaces; the least important features have little influence on the decision surface
- Disadvantage:
    - high complexity of training and categorization algorithm
    - parameter optimization
    - kernel selection
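
As a rough sketch of the binary, linear case, the code below trains a separating hyperplane by sub-gradient descent on the regularized hinge loss. This is a simple stand-in for the quadratic-programming solvers normally used, with no kernel; `lam`, `lr`, `epochs` and the toy data are arbitrary assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Sub-gradient descent on the L2-regularized hinge loss.

    X: list of feature vectors, y: labels in {+1, -1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Example is misclassified or inside the margin: push it out.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Side of the hyperplane: +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Features: [count of "ball", count of "vote"]; +1 = sport, -1 = politics.
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print(predict(w, b, [3, 0]), predict(w, b, [0, 3]))
```

Both a positive and a negative training set are needed because the updates push the two classes to opposite sides of the margin.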

### Neural Networks
- input units = terms (usually far more numerous than output units); output units = categories
- term weights assigned to input units, activation propagated forward
- single-layer perceptron, multi-layer perceptron, BPNN, MBPNN
- Advantage:
    - ability to work with large sets of features
    - able to classify correctly in the presence of noise
- Disadvantage:
    - large computational cost
    - hard to interpret for a typical user (black box)
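
The simplest of the networks listed, a single-layer perceptron with one output unit, can be sketched as below. Term weights go in, the weighted sum is propagated forward, and weights are adjusted on errors. The feature layout, learning rate and epoch count are hypothetical choices.

```python
def predict(w, b, x):
    """Forward pass: weighted sum of the inputs followed by a step activation."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def train_perceptron(X, y, lr=1.0, epochs=10):
    """Perceptron learning rule: adjust weights only on misclassified examples."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - predict(w, b, xi)
            if error != 0:
                w = [wj + lr * error * xj for wj, xj in zip(w, xi)]
                b += lr * error
    return w, b

# Input units: [ball, goal, vote, law]; output 1 = sport, 0 = politics.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y = [1, 1, 0, 0]
w, b = train_perceptron(X, y)
print([predict(w, b, xi) for xi in X])  # [1, 1, 0, 0]
```

Multi-layer networks add hidden units between input and output and train them with backpropagation; the forward pass has the same shape.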
    

### Linear Least Square Fit
- mapping approach
- input/output vector pair
- Disadvantage:
    - computational cost
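
For reference, the mapping is usually written as a least-squares problem over the training pairs. The notation here is an assumption, following the common presentation of LLSF: the columns of $A$ are the training document (input) vectors and the columns of $B$ are the corresponding category (output) vectors.

$$F_{LS} = \arg\min_{F} \lVert FA - B \rVert^2$$

A new document vector $x$ is then mapped to $y = F_{LS}\,x$, and categories are ranked by the components of $y$. Solving for $F_{LS}$ over a large term-by-document matrix is where the computational cost comes from.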

### Voting
- classifier committee 
- k experts
- combining the experts:
    - majority voting
    - weighted majority voting
    - dynamic classifier selection
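
The first two combination schemes can be sketched directly. The expert predictions and weights below are hypothetical; in practice a weight might be, for example, each expert's accuracy on a validation set.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Return the label predicted by the most experts."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total expert weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

predictions = ["sport", "politics", "politics"]   # one label per expert
weights = [0.9, 0.3, 0.4]                         # assumed expert reliabilities

print(majority_vote(predictions))                    # politics
print(weighted_majority_vote(predictions, weights))  # sport
```

The two schemes disagree here because the single reliable expert outweighs the two weaker ones.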
    
## Evaluating
- Precision & Recall
- Fallout = $\frac{FP_i}{FP_i + TN_i}$
- Error = $\frac{FN_i + FP_i}{TP_i + FN_i + FP_i + TN_i}$
- Accuracy = $\frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}$


- Micro-averaging vs Macro-averaging
- Break-even point
- F-measure
- Interpolation
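
The difference between the two averaging schemes is easiest to see on numbers. The sketch below computes precision, recall and F1 per category, then averages them (macro) or pools the counts first (micro). The per-category counts are made up for illustration.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def averaged_scores(counts):
    """counts: {category: (tp, fp, fn)}; returns macro and micro averages."""
    ps = [precision(tp, fp) for tp, fp, fn in counts.values()]
    rs = [recall(tp, fn) for tp, fp, fn in counts.values()]
    # Macro: every category counts equally.
    macro_p, macro_r = sum(ps) / len(ps), sum(rs) / len(rs)
    # Micro: pool the counts, so large categories dominate.
    tp_sum = sum(tp for tp, fp, fn in counts.values())
    fp_sum = sum(fp for tp, fp, fn in counts.values())
    fn_sum = sum(fn for tp, fp, fn in counts.values())
    micro_p, micro_r = precision(tp_sum, fp_sum), recall(tp_sum, fn_sum)
    return {
        "macro_p": macro_p, "macro_r": macro_r, "macro_f1": f1(macro_p, macro_r),
        "micro_p": micro_p, "micro_r": micro_r, "micro_f1": f1(micro_p, micro_r),
    }

counts = {"sport": (90, 10, 10), "politics": (1, 1, 9)}  # hypothetical counts
scores = averaged_scores(counts)
for name, value in scores.items():
    print(name, round(value, 3))
```

The poorly classified small category pulls the macro averages down much more than the micro averages.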

## Footnote
- see Korde 2012 for table of classifier advantages & disadvantages
- see Bilski 2011 for mini blurbs about description, advantages & disadvantages; like Khan et al 2010