# Chapter 8: Classification

## Classification vs. Prediction

- **Classification**
    - determines categorical class labels
    - e.g., safe vs. risky; weather condition
- **Prediction**
    - models continuous-valued functions
- Typical applications
    - loan approval, target marketing, medical diagnosis, fraud detection, etc.

## Classification

- Step 1: **Learning**
    - model construction
    - training set, class labels
- Step 2: **Classification**
    - test set, accuracy

## Supervised vs. Unsupervised

- **Supervised learning (classification)**
    - supervision: training data accompanied by class labels
    - new data is classified based on training set
- **Unsupervised learning (clustering)**
    - class labels of the training data are unknown
    - aims to establish the existence of classes or clusters in the data

## Issues: Evaluation Criteria

- **Accuracy**: classification vs. prediction
- **Speed**: time to construct / use the model
- **Robustness**: handling noise & missing values
- **Scalability**: large amounts of data
- **Interpretability**: understanding and insight
- **Goodness of rules**: e.g., decision tree size, compactness of classification rules

## Decision Tree Induction

- Basic algorithm (a greedy algorithm)
    - **top-down, recursive, divide-and-conquer**
    - attribute selection
    - attribute split
- Stopping conditions
    - all samples belong to the same class
    - no remaining attributes: **majority voting**
    - no samples left
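
The greedy procedure and its three stopping conditions can be sketched as a short recursive function. This is a minimal illustration, not a full ID3/C4.5 implementation: it assumes discrete-valued attributes and picks the attribute with the lowest $Info_A(D)$ (equivalently, the highest information gain, defined under Attribute Selection Measures). The names `build_tree`, `classify`, and `domains` are hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def majority(labels):
    """Most frequent class label (majority voting)."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, attributes, domains):
    """rows: list of dicts; attributes: names still usable; domains: attr -> known values."""
    if len(set(labels)) == 1:          # all samples belong to the same class
        return labels[0]
    if not attributes:                 # no remaining attributes: majority voting
        return majority(labels)

    def info_a(attr):
        parts = {}
        for row, label in zip(rows, labels):
            parts.setdefault(row[attr], []).append(label)
        return sum(len(p) / len(labels) * info(p) for p in parts.values())

    best = min(attributes, key=info_a)  # attribute selection (greedy)
    rest = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}}
    for value in domains[best]:         # attribute split: one branch per value
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        if not subset:                  # no samples left: use the parent's majority
            node["branches"][value] = majority(labels)
        else:
            sub_rows, sub_labels = zip(*subset)
            node["branches"][value] = build_tree(
                list(sub_rows), list(sub_labels), rest, domains)
    return node


def classify(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree
```

Leaves are plain class labels and internal nodes are dictionaries, which keeps the sketch short at the cost of assuming class labels are never dictionaries themselves.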

## Splitting Attributes

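- **Discrete**-valued: one branch per known value of the attribute
- **Continuous**-valued: two branches, $A \leq$ split_point and $A >$ split_point
- **Discrete**-valued, **binary** tree required: two branches, $A \in$ splitting_subset and $A \notin$ splitting_subset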

## Example: Training Set & Decision Tree

![Example Training Set & Decision Tree](./img/8.1.png)

## Attribute Selection Measures

- **Information gain** (ID3/C4.5)
    - $D$, $m$ classes $\displaystyle C_i \quad\quad p_i = |C_{i,D}|/|D| \quad\quad Info(D) = -\sum_{i=1}^m p_i \log_2(p_i)$
    - expected information (entropy) needed to classify $D$
    - information needed to classify $D$ using $A$
        - attribute $A$: $\displaystyle a_1, a_2, \ldots, a_v \quad\quad Info_A(D) = \sum_{j=1}^v \frac{|D_j|}{|D|} \times Info(D_j)$
    - information gain $\displaystyle \quad\quad Gain(A) = Info(D) - Info_A(D)$
- Comparison of the three measures
    - good results in general but some biases
    - **information gain**: multi-valued attributes
    - **gain ratio**: unbalanced splits
    - **gini index**: multi-valued, equal-sized & pure partitions, not good when number of classes is large
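
The three information gain formulas translate almost directly into code. The following is a small sketch; the function names `info`, `info_after_split`, and `gain` are chosen here for illustration, and the data in the usage lines is made up.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): expected information (entropy) needed to classify D, in bits."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def info_after_split(values, labels):
    """Info_A(D): weighted entropy of the partitions D_j induced by attribute A."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(p) / total * info(p) for p in partitions.values())


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(values, labels)


# Hypothetical toy data: the attribute separates the two classes perfectly,
# so the gain equals the full entropy of the label distribution.
labels = ["yes", "yes", "no", "no"]
attribute = ["a", "a", "b", "b"]
print(gain(attribute, labels))
```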

## Information Gain Example

![Information Gain Example](./img/8.2.png)

## Information Gain

- Continuous-valued attribute $A$
- Determine the **best split point** for $A$
    - sort $A$ values in increasing order
    - consider the midpoint of adjacent values: $(a_i + a_{i+1}) / 2$
    - pick the midpoint w/ the minimum $Info_A(D)$
- Split: $D_1$: $A \leq$ split point; $D_2$: $A >$ split point
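
The split-point search can be sketched as follows. This is an illustrative version that evaluates every midpoint by brute force; the name `best_split_point` and the sample values are assumptions made for the example.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split_point, Info_A(D)) for the midpoint with minimum Info_A(D)."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))           # sort A values in increasing order
    best = None
    for low, high in zip(distinct, distinct[1:]):
        midpoint = (low + high) / 2          # midpoint of adjacent values
        d1 = [y for v, y in pairs if v <= midpoint]
        d2 = [y for v, y in pairs if v > midpoint]
        info_a = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        if best is None or info_a < best[1]:
            best = (midpoint, info_a)
    return best


# Hypothetical ages and class labels.
split, info_a = best_split_point([25, 30, 45, 50], ["no", "no", "yes", "yes"])
print(split)
```

Re-scanning the data for each candidate keeps the code short; a production implementation would usually update the class counts incrementally while walking the sorted values.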

## Gain Ratio (C4.5)

- Information gain is biased towards attributes with a large number of values
    - e.g., customerID, productID
- C4.5 (a successor of ID3)
    - $\displaystyle SplitInfo_A(D) = -\sum_{j=1}^v \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
    - select attribute with **maximum gain ratio**
        - $\displaystyle gainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$
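
A small sketch shows how the gain ratio penalizes an ID-like attribute. The data and the names `entropy`, `gain`, and `gain_ratio` are hypothetical; returning 0 when $SplitInfo_A(D)$ is 0 is a choice made for this example, not part of the definition above.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Entropy (in bits) of the value distribution in items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())


def gain(values, labels):
    """Information gain of splitting labels by the attribute values."""
    total = len(labels)
    parts = {}
    for value, label in zip(values, labels):
        parts.setdefault(value, []).append(label)
    info_a = sum(len(p) / total * entropy(p) for p in parts.values())
    return entropy(labels) - info_a


def gain_ratio(values, labels):
    """Gain(A) / SplitInfo_A(D); SplitInfo is the entropy of the partition sizes."""
    split_info = entropy(values)
    return gain(values, labels) / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
customer_id = [1, 2, 3, 4]             # unique per tuple, like customerID
student = ["yes", "yes", "no", "no"]   # two values, also separates the classes

# Both attributes have the same information gain, but the ID-like
# attribute gets a lower gain ratio because its SplitInfo is larger.
print(gain(customer_id, labels), gain_ratio(customer_id, labels))
print(gain(student, labels), gain_ratio(student, labels))
```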

## Gini Index (CART)

- **Gini Index**
    - $\displaystyle Gini(D) = 1 - \sum_{i=1}^m p_i^2$
- **Binary split** using attribute A
    - $\displaystyle Gini_A(D) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$
- **Reduction in impurity**
    - $\Delta Gini(A) = Gini(D) - Gini_A(D)$
- Select attribute with **largest impurity reduction**
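
The Gini computations are equally short in code. The sketch below uses hypothetical function names and toy partitions.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(d1, d2):
    """Gini_A(D) for a binary split of D into d1 and d2."""
    total = len(d1) + len(d2)
    return len(d1) / total * gini(d1) + len(d2) / total * gini(d2)


def gini_reduction(d1, d2):
    """Reduction in impurity: Gini(D) - Gini_A(D)."""
    return gini(d1 + d2) - gini_split(d1, d2)


# Hypothetical binary split of six tuples.
left = ["yes", "yes", "yes"]
right = ["no", "no", "yes"]
print(gini_reduction(left, right))
```

The attribute (and, for a binary split, the subset or split point) with the largest value of `gini_reduction` would be selected.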

## Overfitting & Tree Pruning

- **Overfitting of the training data**
    - too many branches, some of which reflect anomalies due to noise or outliers
    - poor accuracy for unseen data
- Tree pruning to avoid overfitting
    - **prepruning**: halt tree construction early
    - **postpruning**: remove branches from a "fully-grown" tree

## Tree Pruning Example

![Tree Pruning Example](./img/8.3.png)

## Bayesian Classification

- A statistical classifier:
    - predicts class membership probabilities
- Foundation: based on Bayes' Theorem
- Performance (naive Bayesian classifier)
    - comparable to decision tree & some neural network classifiers
- Incremental

## Naive Bayesian Classifier

- $X = (x_1,x_2,\ldots,x_n)$ (i.e., $n$ attributes)
- $m$ classes: $C_1,C_2,\ldots,C_m$
- **Classification: maximal $P(C_i | X)$**
- Based on Bayes' Theorem $\quad\quad \displaystyle P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)}$
- Since $P(X)$ is constant for all classes, only need to maximize $P(X|C_i)P(C_i)$
- **Naive assumption**: class conditional independence (no dependence between attributes)
    - $\displaystyle P(X|C_i) = \prod_{k=1}^n P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
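- If $A_k$ is categorical, $P(x_k|C_i)$ is the number of tuples in $C_i$ with value $x_k$ for $A_k$, divided by $|C_{i,D}|$
- If $A_k$ is continuous-valued, assume a Gaussian distribution with mean $\mu_{C_i}$ and standard deviation $\sigma_{C_i}$
    - $\displaystyle P(x_k|C_i) = g(x_k,\mu_{C_i},\sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\sigma_{C_i}}e^{-\frac{(x_k-\mu_{C_i})^2}{2\sigma_{C_i}^2}}$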
- **Advantage**
    - easy to compute, good results in most cases
- **Disadvantage**
    - assumption: class conditional independence
    - dependencies exist in practice
        - e.g., hospital patients: age, family history, fever, cough, lung cancer, diabetes, etc.
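
The decision rule can be sketched in a few lines: compute $P(X|C_i)P(C_i)$ per class as a product of per-attribute likelihoods and pick the maximum. All priors, likelihoods, means, and standard deviations below are made-up values, and the names `gaussian`, `score`, and `classify` are hypothetical.

```python
from math import exp, pi, prod, sqrt


def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for continuous-valued attributes."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)


def score(prior, likelihoods):
    """P(X|C_i) * P(C_i) under class conditional independence."""
    return prior * prod(likelihoods)


def classify(class_models):
    """class_models: class -> (prior, list of P(x_k|C_i)); returns the best class."""
    return max(class_models, key=lambda c: score(*class_models[c]))


# Hypothetical model: one categorical likelihood and one Gaussian
# likelihood (age = 35) per class.
models = {
    "yes": (0.6, [0.5, gaussian(35, mu=38, sigma=12)]),
    "no": (0.4, [0.2, gaussian(35, mu=25, sigma=5)]),
}
print(classify(models))
```

With many attributes the product can underflow; summing logarithms of the factors is a common alternative that leaves the ranking of classes unchanged.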

## Naive Bayesian Classifier Example

- 2 classes: buys_computer
    - $C_1$: yes
    - $C_2$: no
- $X$ = (age $\leq$ 30, income=medium, student=yes, credit_rating=fair)
- $P(C_i|X) = P(X|C_i)P(C_i)/P(X)$
    - $P(X|C_i)$ is the product of
        - P(age $\leq$ 30 | $C_i$)
        - P(income=medium | $C_i$)
        - P(student=yes | $C_i$)
        - P(credit_rating=fair | $C_i$)

![Naive Bayesian Classifier Example](./img/8.4.png)

## Avoid 0-Probability

- The 0-probability problem
    - e.g., income: low(0), medium(990), high(10)
- Laplacian correction (or Laplace estimator)
    - add 1 to each case: non-zero, close to original probabilities
    - e.g., income: low(1), medium(991), high(11)
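
The correction is a one-line change to the probability estimate. The sketch below reuses the income counts from the example; the function name `laplace_probabilities` and the parameter `k` are hypothetical.

```python
def laplace_probabilities(counts, k=1):
    """Add k to every count before normalizing, so no estimate is zero."""
    total = sum(counts.values()) + k * len(counts)
    return {value: (n + k) / total for value, n in counts.items()}


income = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_probabilities(income)
print(corrected["low"], corrected["medium"], corrected["high"])
```

The corrected estimates are 1/1003, 991/1003, and 11/1003, compared with the uncorrected 0, 0.990, and 0.010.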

## IF-THEN Rules

- IF condition THEN conclusion
    - R: IF age = youth AND student = yes THEN buys_computer = yes
    - rule antecedent/precondition (IF part); rule consequent (THEN part)
- Assessment of a rule
    - **coverage**(R) = $n_{\text{covers}} / |D| \quad\quad$ **accuracy**(R) = $n_{\text{correct}} / n_{\text{covers}}$

## Rule Assessment Example

- R: IF age $\leq$ 30 AND student = yes THEN buys_computer = yes
- coverage (R) = 2/14
- accuracy (R) = 2/2
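
Coverage and accuracy can be computed by filtering the data set with the rule's precondition. The five tuples below are made up for illustration (they are not the 14-tuple data set behind the numbers above), and `assess_rule` is a hypothetical helper.

```python
def assess_rule(condition, predicted_class, data, class_attr):
    """Return (coverage, accuracy) of the rule IF condition THEN predicted_class."""
    covered = [t for t in data if condition(t)]
    correct = [t for t in covered if t[class_attr] == predicted_class]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical tuples.
data = [
    {"age": 25, "student": "yes", "buys_computer": "yes"},
    {"age": 28, "student": "yes", "buys_computer": "no"},
    {"age": 29, "student": "no", "buys_computer": "no"},
    {"age": 40, "student": "yes", "buys_computer": "yes"},
    {"age": 45, "student": "no", "buys_computer": "yes"},
]

# R: IF age <= 30 AND student = yes THEN buys_computer = yes
coverage, accuracy = assess_rule(
    lambda t: t["age"] <= 30 and t["student"] == "yes",
    "yes", data, "buys_computer")
print(coverage, accuracy)
```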

## Rule-Based Classification

- Rule R is **triggered** when its precondition is satisfied
- R is the only rule triggered: R fires and returns its class prediction
- No rule is triggered: fall back to a default class
- More than one rule is triggered: a conflict resolution strategy is needed
    - **size ordering**: prefer the rule with the most attribute tests
    - **class-based ordering**: order classes by importance (e.g., prevalence, misclassification cost)
    - **rule-based ordering** (decision list): priority list ordered by rule quality or by experts

## Rule Extraction

- From a decision tree
- Each root to leaf path
- Leaf: class prediction
- Rules are exhaustive and mutually exclusive

## Classifier Accuracy Measures

- Partition: training data and testing data
- Accuracy, recognition rate
- Error rate, misclassification rate
- **Confusion matrix**

![Confusion matrix](./img/8.5.png)

- **Sensitivity**: t_pos / pos
- **Specificity**: t_neg / neg
- **Precision**: t_pos / (t_pos + f_pos)
- $\displaystyle accuracy = sensitivity\frac{pos}{pos + neg} + specificity\frac{neg}{pos + neg}$
- Costs and benefits of TP,TN,FP,FN
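
The measures follow directly from the four cells of the confusion matrix. The sketch below uses made-up counts and a hypothetical function name, and it computes accuracy through the sensitivity/specificity identity to show that it agrees with (t_pos + t_neg) / (pos + neg).

```python
def classifier_measures(t_pos, f_neg, f_pos, t_neg):
    """Sensitivity, specificity, precision, and accuracy from a confusion matrix."""
    pos = t_pos + f_neg   # all actual positives
    neg = f_pos + t_neg   # all actual negatives
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    accuracy = (sensitivity * pos / (pos + neg)
                + specificity * neg / (pos + neg))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
    }


# Hypothetical counts: 100 positives and 100 negatives.
m = classifier_measures(t_pos=90, f_neg=10, f_pos=30, t_neg=70)
print(m)
```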

## Classifier/Predictor Evaluation

- **Holdout, random sampling**

![Classifier predictor evaluation](./img/8.6.png)

- **Cross-validation**
    - divide into $k$ subsamples
    - use $k-1$ subsamples for training and one for testing, rotating over all $k$ subsamples (k-fold cross-validation)
- **Bootstrapping** (e.g., .632 bootstrap)
    - sample with replacement $\Rightarrow$ training data
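
Both resampling schemes can be sketched with index lists. The function names and the `seed` parameter are assumptions for this example; in the bootstrap sketch, the tuples that are never drawn serve as the test set.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal subsamples
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; unseen indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


for train, test in k_fold_indices(10, k=5):
    print(sorted(test))

train, test = bootstrap_sample(1000)
print(len(set(train)) / 1000)   # fraction of distinct tuples drawn
```

The name ".632" comes from the probability that a given tuple is never drawn in $d$ draws, $(1 - 1/d)^d \approx e^{-1} \approx 0.368$, so about 63.2% of the distinct tuples end up in the training sample.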

## Predictor Error Measures

Absolute error: $\displaystyle |y_i - y_i'|$

Square error: $\displaystyle (y_i - y_i')^2$

Mean absolute error: $\displaystyle \frac{\sum_{i=1}^d |y_i - y_i'|}{d}$

Mean square error: $\displaystyle \frac{\sum_{i=1}^d (y_i - y_i')^2}{d}$

Relative absolute error: $\displaystyle \frac{\sum_{i=1}^d |y_i - y_i'|}{\sum_{i=1}^d |y_i - \bar{y}|}$

Relative square error: $\displaystyle \frac{\sum_{i=1}^d (y_i - y_i')^2}{\sum_{i=1}^d (y_i - \bar{y})^2}$
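
The four aggregate measures can be computed in one pass, as in the sketch below. The function name and the sample values are hypothetical; `y` holds the actual values $y_i$ and `y_pred` the predicted values $y_i'$.

```python
def error_measures(y, y_pred):
    """Mean absolute/square error and their relative variants."""
    d = len(y)
    mean_y = sum(y) / d
    abs_errors = [abs(a - p) for a, p in zip(y, y_pred)]
    sq_errors = [(a - p) ** 2 for a, p in zip(y, y_pred)]
    return {
        "mean_absolute_error": sum(abs_errors) / d,
        "mean_square_error": sum(sq_errors) / d,
        # Relative measures compare against always predicting the mean of y.
        "relative_absolute_error": sum(abs_errors) / sum(abs(a - mean_y) for a in y),
        "relative_square_error": sum(sq_errors) / sum((a - mean_y) ** 2 for a in y),
    }


print(error_measures([1, 2, 3, 4], [1, 2, 3, 6]))
```

The relative measures are undefined when all actual values are equal, because the denominators become zero.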

## Model Selection

- Choose between two models $M_1$ and $M_2$
- Mean error rate? (k-fold cross validation)
    - estimated error on future data
- Difference between error rates of $M_1$ & $M_2$
    - statistically significant? or by chance?
    - $\displaystyle t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$
- **t-test**
    - $\displaystyle var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^k \left[ err(M_1)_i - err(M_2)_i - (\overline{err}(M_1) - \overline{err}(M_2)) \right]^2$

## T-test Example

- 10-fold cross-validation: 10 pairs of error values
- T-table (e.g., [https://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm](https://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm))
- Degree of freedom: $v = 10 - 1 = 9$
- Significance level: $\alpha = 0.05$
- Two-sided test: $1 - \alpha/2 = 1 - 0.05/2 = 0.975$
- Check T-table $(v=9, 0.975)$: critical value is $2.262$
- Compute $t$; the difference is statistically significant if $|t| > 2.262$
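
The test can be sketched in code using the $t$ and $var(M_1 - M_2)$ formulas from the Model Selection section (note that the variance there divides by $k$). The ten error pairs are made up, and `paired_t` is a hypothetical name.

```python
from math import sqrt


def paired_t(err_m1, err_m2):
    """t statistic for paired error rates from the same k folds."""
    k = len(err_m1)
    diffs = [a - b for a, b in zip(err_m1, err_m2)]
    mean_diff = sum(diffs) / k
    var = sum((d - mean_diff) ** 2 for d in diffs) / k
    return mean_diff / sqrt(var / k)


# Hypothetical error rates from 10-fold cross-validation.
err_m1 = [0.2, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.4]
err_m2 = [0.1] * 10

t = paired_t(err_m1, err_m2)
critical = 2.262   # T-table value for v = 9 at 0.975
print(t, abs(t) > critical)
```

If every fold shows exactly the same difference, the variance is zero and the statistic is undefined; this sketch does not guard against that case.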

## Model Selection: ROC Curves

- Visual comparison of models
- X: **false positive rate**
    - f_pos / neg
- Y: **true positive rate**
    - t_pos / pos
- **Area below curve**:
    - a measure of accuracy; a model whose curve lies on the diagonal line has area 0.5 (no better than random guessing)
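
One way to obtain the curve is to rank the test tuples by the classifier's score and sweep a threshold from high to low. The sketch below makes several assumptions: labels are 1 (positive) or 0 (negative), both classes are present, no two tuples share a score, and the area is computed with the trapezoid rule. The names `roc_points` and `area_below_curve` are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    t_pos = f_pos = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:      # lower the threshold one tuple at a time
        if is_positive:
            t_pos += 1
        else:
            f_pos += 1
        points.append((f_pos / neg, t_pos / pos))
    return points


def area_below_curve(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical scores: one negative tuple is ranked above one positive tuple.
points = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(points, area_below_curve(points))
```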

## Ensemble Methods

- Use a combination of multiple models to increase accuracy
- Popular ensemble methods
    - **bagging**: equal-weight votes
    - **boosting**: weighted votes

## Bagging: Bootstrap Aggregation

- Analogy: diagnosis by multiple doctors
    - multiple doctors' **majority vote**
- Training: given a data set $D$ of $d$ tuples
    - training set $D_i$: $d$ tuples sampled randomly with replacement (i.e., bootstrap)
    - classifier $M_i$ is learned for $D_i$
- Classification: majority vote
- Prediction: average of multiple predictions
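
Bagging for classification can be sketched independently of the base learner. In the example below, `learn` is any function that takes a list of `(x, label)` tuples and returns a classifier; the one-nearest-neighbour learner and the data are stand-ins chosen for brevity, and all names are hypothetical.

```python
import random
from collections import Counter


def bagging_fit(data, learn, k, seed=0):
    """Learn k classifiers, each from a bootstrap sample of d tuples."""
    rng = random.Random(seed)
    d = len(data)
    return [learn([rng.choice(data) for _ in range(d)]) for _ in range(k)]


def bagging_classify(models, x):
    """Each classifier casts one equal-weight vote; return the majority class."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def learn_1nn(sample):
    """Toy base learner: predict the label of the nearest training value."""
    return lambda x: min(sample, key=lambda t: abs(t[0] - x))[1]


# Hypothetical one-dimensional data with two classes.
data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
models = bagging_fit(data, learn_1nn, k=25)
print(bagging_classify(models, 0), bagging_classify(models, 13))
```

For prediction of continuous values, the vote would be replaced by the average of the individual predictions.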

## Boosting

- Analogy: diagnosis by multiple doctors
    - **weighted** by previous diagnosis accuracy
- Weights are assigned to each training tuple
- A series of $k$ classifiers is iteratively learned
- After classifier $M_i$ is learned, adjust weights so $M_{i+1}$ pays more attention to tuples that were misclassified by $M_i$
- $M^*$ combines the votes of all $k$ classifiers, weighted by individual accuracy

## Adaboost

- $D: (X_1,y_1),(X_2,y_2),\ldots,(X_d,y_d)$
- Initial weight of each tuple: $1/d$
- Round $i\ (i=1,\ldots,k) \quad\quad \displaystyle error(M_i) = \sum_{j=1}^d w_j \times err(X_j)$
    - $D_i$ sample $d$ tuples w/ replacement from $D$
    - Pr(choose $\text{tuple}_j$) based on $\text{tuple}_j$'s weight
    - Learn $M_i$ from $D_i$, compute its error rate
    - Reduce weights of correctly classified tuples
        - $\displaystyle w_j = w_j \times \frac{error(M_i)}{1 - error(M_i)}$
    - Normalize tuple weights so sum is 1.0
    - Weight of $M_i$'s vote
        - $\displaystyle weight(M_i) = \log\frac{1 - error(M_i)}{error(M_i)}$
- Classification: weighted votes of $k$ classifiers
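
The weight bookkeeping for a single round can be sketched as follows. The sketch assumes the classifier $M_i$ has already been learned and that its error satisfies $0 < error(M_i) < 1$; `misclassified[j]` plays the role of $err(X_j)$, and the function name `adaboost_round` is hypothetical.

```python
from math import log


def adaboost_round(weights, misclassified):
    """Return (error, new tuple weights, vote weight) for one boosting round."""
    # error(M_i): sum of the weights of the misclassified tuples
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    factor = error / (1 - error)
    # reduce the weights of correctly classified tuples only
    updated = [w if wrong else w * factor
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    normalized = [w / total for w in updated]   # weights sum to 1.0 again
    vote_weight = log((1 - error) / error)      # weight of M_i's vote
    return error, normalized, vote_weight


# Four tuples with initial weight 1/d; the first one is misclassified.
error, weights, vote = adaboost_round([0.25] * 4, [True, False, False, False])
print(error, weights, vote)
```

After normalization, the misclassified tuple carries half of the total weight, so it is much more likely to be sampled into the next round's training set.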

## Ensemble Methods: Accuracy

- **Bagging**
    - often significantly better than single classifier
    - noise: not considerably worse, more robust
    - prediction: proven to improve accuracy
- **Boosting**
    - generally better than bagging
    - may overfit the model to misclassified data