# Module 3: Evaluation

## Model Evaluation & Selection
### Learning Objectives 
1. Understand why accuracy only gives a partial picture of a classifier's performance.

2. Understand the motivation and definition of important evaluation metrics in machine learning.

3. Learn how to use a variety of evaluation metrics to evaluate supervised machine learning models.

4. Learn about choosing the right metric for selecting between models or for doing parameter tuning.

### About Evaluation
1. Different applications have very different goals.

2. Accuracy is widely used, but many other metrics are possible, e.g.
    - user satisfaction (web search)
    - amount of revenue (e-commerce)
    - increase in patient survival rates (medical)

3. It's very important to choose evaluation methods that match the goal of your application.

4. Compute your selected evaluation metric for multiple different models.

5. Then select the model with the 'best' value of the evaluation metric.

### Accuracy with Imbalanced Classes
1. Suppose you have two classes:
    - Relevant (R): the positive class
    - Not_Relevant (N): the negative class
2. Out of 1000 randomly selected items, on average 
    - One item is relevant and has an R label
    - The rest of the items (999 of them) are not relevant and labelled N.
3. Recall that:
    Accuracy = **#correct predictions / # total instances**
4. You build a classifier to predict relevant items, and see that its accuracy on a test set is 99.9%. (You may think: wow, this is amazing. But wait..)

5. For comparison, suppose we had a 'dummy' classifier that didn't look at the features at all, and always just blindly predicted the most frequent class (i.e. the negative N class)

6. Assuming a test set of 1000 instances, what would this dummy classifier's accuracy be?
    - $Accuracy_{DUMMY} = 999/1000 = 99.9\%$
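
The arithmetic above can be checked with a few lines of NumPy. This is a minimal sketch: the arrays `y_test` and `y_dummy` are hypothetical and only mimic the 1-in-1000 example.

```python
import numpy as np

# Hypothetical test set matching the example:
# 1 relevant item (label 1) and 999 not-relevant items (label 0).
y_test = np.zeros(1000, dtype=int)
y_test[0] = 1

# A "dummy" rule that ignores the features and always predicts
# the most frequent (negative) class.
y_dummy = np.zeros_like(y_test)

# Accuracy = # correct predictions / # total instances
dummy_accuracy = np.mean(y_dummy == y_test)

# Number of relevant items the dummy rule actually finds.
relevant_found = int(np.sum((y_dummy == 1) & (y_test == 1)))

print(f"dummy accuracy: {dummy_accuracy:.3f}")
print(f"relevant items found: {relevant_found}")
```

The dummy rule matches the 99.9% accuracy while finding none of the relevant items, which is why accuracy alone is misleading here.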

### Dummy classifiers completely ignore the input data
1. Dummy classifiers serve as a sanity check on your classifier's performance

2. They provide a *null metric* (e.g. null accuracy) baseline

3. Dummy classifiers should not be used for real problems

4. Some commonly used settings for the strategy parameter of DummyClassifier in scikit-learn:
    - most_frequent: predicts the most frequent label in the training set
    - stratified: random predictions based on training set class distribution
    - uniform: generates predictions uniformly at random
    - constant: always predicts a constant label provided by the user
        - a major motivation of this method is F1-scoring, when the positive class is in the minority
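
As a rough illustration of two of these strategies, the sketch below fits `DummyClassifier` on made-up imbalanced data (990 negatives, 10 positives; the data and variable names are assumptions). For brevity it scores on the same data it was fitted on, which is harmless here because dummy classifiers ignore the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced data: 990 negatives (0) and 10 positives (1).
# The features are random noise; dummy classifiers never look at them.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 990 + [1] * 10)

# Baseline 1: always predict the most frequent training label.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
null_accuracy = majority.score(X, y)  # score() returns accuracy

# Baseline 2: always predict the (minority) positive class.
always_positive = DummyClassifier(strategy="constant", constant=1).fit(X, y)
positive_accuracy = always_positive.score(X, y)

print(f"null accuracy (most_frequent): {null_accuracy:.2f}")
print(f"accuracy (constant=1):         {positive_accuracy:.2f}")
```

A real classifier's accuracy should be compared against `null_accuracy`; anything close to it deserves a closer look.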
        
### What if my classifier accuracy is close to the null accuracy baseline?
This could be a sign of:
1. Ineffective, erroneous or missing features

2. Poor choice of kernel or hyperparameter

3. Large class imbalance

### Dummy Regressors
*strategy* parameter options:
1. mean: predicts the mean of the training targets

2. median: predicts the median of the training targets

3. quantile: predicts a user-provided quantile of the training targets

4. constant: predicts a constant user-provided value

### Binary Prediction Outcomes 
<img src="https://img.ceclinux.org/6c/34448493ba9f254a61091449ba6f77c530231f.png">

### Confusion Matrix for Binary Prediction Task
<img src="https://img.ceclinux.org/18/7739eb732a66ee82dd5f6c78d1e33a2ab7828a.png">

**Note: Always look at the confusion matrix for your classifier.**
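
One way to obtain a binary confusion matrix in scikit-learn is `confusion_matrix`. The sketch below uses hypothetical label arrays constructed to reproduce the counts used in this module's worked formulas (TN = 400, FP = 7, FN = 17, TP = 26).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels built to give TN = 400, FP = 7, FN = 17, TP = 26.
y_true = np.array([0] * 400 + [0] * 7 + [1] * 17 + [1] * 26)
y_pred = np.array([0] * 400 + [1] * 7 + [0] * 17 + [1] * 26)

# Rows are true classes, columns are predicted classes (label order 0, 1),
# so the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```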

## Confusion Matrices & Basic Evaluation Metrics

### **Accuracy**: for what fraction of all instances is the classifier's prediction correct (for either positive or negative class)?
<img src="https://img.ceclinux.org/9a/80004a130a4e73aad41819829f140cf7721c59.png">

### **Classification Error (1 - Accuracy)**: for what fraction of all instances is the classifier's prediction *incorrect*?
$ClassificationError = \frac {FP + FN}{TN + TP + FN + FP} = \frac{7+17}{400+26+17+7} = 0.053$

### **Recall**, or **True Positive Rate (TPR)**: what fraction of all positive instances does the classifier *correctly* identify as positive?
Recall is also known as:

- True Positive Rate (TPR)
- Sensitivity
- Probability of detection

$Recall = \frac {TP} {TP + FN} = \frac {26} {26 + 17} = 0.60$

### **Precision**: what fraction of *positive* predictions are correct?
$Precision = \frac {TP}{TP + FP} = \frac {26}{26 + 7} = 0.79$

### **False positive rate (FPR)**: what fraction of all negative instances does the classifier *incorrectly* identify as positive?
False Positive Rate is also known as fall-out; it equals 1 - Specificity (Specificity is the true negative rate).

$FPR = \frac {FP}{TN + FP} = \frac {7}{400+7} = 0.02$
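
The formulas in this section can be checked directly from the four confusion-matrix counts. A minimal sketch in plain Python, using the counts from the worked examples above:

```python
# Counts from the worked examples above.
tn, fp, fn, tp = 400, 7, 17, 26
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # fraction of all predictions that are correct
error = (fp + fn) / total         # classification error = 1 - accuracy
recall = tp / (tp + fn)           # true positive rate / sensitivity
precision = tp / (tp + fp)        # fraction of positive predictions that are correct
fpr = fp / (fp + tn)              # false positive rate
specificity = tn / (tn + fp)      # true negative rate = 1 - FPR

for name, value in [
    ("accuracy", accuracy),
    ("error", error),
    ("recall", recall),
    ("precision", precision),
    ("FPR", fpr),
    ("specificity", specificity),
]:
    print(f"{name:12s} {value:.3f}")
```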

### The Precision-Recall Tradeoff
<img src="https://img.ceclinux.org/c5/0a5988cacc5b7c718b9dae42e7cf4e04ca99b0.png">
<img src="https://img.ceclinux.org/f0/4165bf69694b90c80bf4ebacb7f0fb91547fd5.png">
<img src="https://img.ceclinux.org/ad/9e1952c54a420b374e73d0b5177f4afe961708.png">

### There is often a tradeoff between precision and recall
1. Recall-oriented machine learning tasks:
    - Search and information extraction in legal discovery
    - Tumor detection
    - Often paired with a human expert to filter out false positives
2. Precision-oriented machine learning tasks:
    - Search engine ranking, query suggestion
    - Document classification
    - Many customer-facing tasks (users remember failures!)
    
### F-score: generalizes F1-score for combining precision & recall into a single number
<img src="https://img.ceclinux.org/e7/2dca94ed7c096a518ef78ec8f079499ffcf1db.png">

In scikit-learn, these metrics are available in sklearn.metrics as accuracy_score, precision_score, recall_score and f1_score, or all together via classification_report.
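
As a sketch of how the F-score relates to precision and recall, the example below compares scikit-learn's `f1_score` and `fbeta_score` against the standard definition $F_\beta = (1+\beta^2)\frac{P \cdot R}{\beta^2 P + R}$. The helper `f_beta` and the label arrays are hypothetical, introduced only for illustration; the labels reproduce the counts TN = 400, FP = 7, FN = 17, TP = 26.

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

# Hypothetical labels giving TN = 400, FP = 7, FN = 17, TP = 26.
y_true = np.array([0] * 400 + [0] * 7 + [1] * 17 + [1] * 26)
y_pred = np.array([0] * 400 + [1] * 7 + [0] * 17 + [1] * 26)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)


def f_beta(p, r, beta):
    """F-beta from precision p and recall r.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    return (1 + beta**2) * p * r / (beta**2 * p + r)


f1 = f1_score(y_true, y_pred)             # beta = 1: harmonic mean of P and R
f2 = fbeta_score(y_true, y_pred, beta=2)  # recall-oriented

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F1={f1:.3f} (by hand: {f_beta(precision, recall, 1):.3f})")
print(f"F2={f2:.3f} (by hand: {f_beta(precision, recall, 2):.3f})")
```

Because recall is lower than precision in this example, the recall-oriented F2 comes out lower than F1.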

## Classifier Decision Functions

### Decision Functions (decision_function)
1. Each classifier score value per test point indicates how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values)

2. Choosing a fixed decision threshold gives a classification rule.

3. By sweeping the decision threshold through the entire range of possible score values, we get a series of classification outcomes that form a curve.
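
A minimal sketch of thresholding decision scores, assuming a logistic regression model on synthetic data (the dataset and the threshold values are illustrative assumptions, not part of the original notes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# One signed score per test point: large positive means confident "positive",
# large negative means confident "negative".
scores = clf.decision_function(X_test)

# For this linear model, a threshold of 0 reproduces clf.predict.
default_pred = (scores > 0).astype(int)

# Sweeping the threshold yields a series of different classification rules.
positives_per_threshold = []
for threshold in (-2.0, 0.0, 2.0):
    pred = (scores > threshold).astype(int)
    positives_per_threshold.append(int(pred.sum()))
    print(f"threshold={threshold:+.1f}: {pred.sum()} predicted positive")
```

Raising the threshold can only shrink the set of points predicted positive, which is what traces out the curves discussed later.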

### Predicted Probability of Class Membership (predict_proba)
1. Typical rule: choose most likely class
    - e.g. predict class 1 if its estimated probability is > 0.50
2. Adjusting the threshold affects the predictions of the classifier

3. Higher threshold results in a more conservative classifier
    - e.g. only predict class 1 if the estimated probability of class 1 is above 70%
    - this typically increases precision: the classifier predicts class 1 less often, but when it does, a high proportion of those predictions are correct
4. Not all models provide realistic probability estimates

<img src="https://img.ceclinux.org/66/45dcd40322e36e7c00f3fda299e06f3072782b.png">
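
The effect of a more conservative probability threshold can be sketched as follows. The model, the synthetic data, and the `results` dictionary are assumptions for illustration; only `predict_proba` and the 0.50 / 0.70 thresholds come from the notes above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Column 1 holds the estimated probability of class 1.
prob_class1 = clf.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.50, 0.70):
    y_pred = (prob_class1 > threshold).astype(int)
    results[threshold] = {
        "n_positive": int(y_pred.sum()),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    print(threshold, results[threshold])
```

The higher threshold never predicts class 1 more often and never has higher recall; precision usually rises, although that is not guaranteed on every dataset.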

## Precision-recall and ROC Curves

### Precision-Recall Curves
1. X-axis: Precision

2. Y-axis: Recall

3. Top right corner:
    - the 'ideal' point
    - precision = 1.0
    - recall = 1.0

4. "Steepness" of P-R curves is important:
    - maximize precision
    - while maximizing recall

<img src="https://img.ceclinux.org/93/89e371aa62b37ddd90cfff2eafd36eccf00f47.png">
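
A precision-recall curve can be computed with `precision_recall_curve`, which sweeps the threshold over the classifier's scores. The sketch below assumes a logistic regression model and synthetic data; plotting is left as a comment so the example runs without a display.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).decision_function(X_test)

# One (precision, recall) pair per candidate threshold, plus a final
# (precision=1, recall=0) point that has no corresponding threshold.
precision, recall, thresholds = precision_recall_curve(y_test, scores)

# Operating point whose threshold is closest to the default of 0.
closest_zero = int(np.argmin(np.abs(thresholds)))
print(
    f"threshold near 0: precision={precision[closest_zero]:.2f}, "
    f"recall={recall[closest_zero]:.2f}"
)

# To draw the curve with precision on the x-axis, as in these notes:
# import matplotlib.pyplot as plt; plt.plot(precision, recall)
```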

### ROC Curves
Receiver operating characteristic curve

1. X-axis: False Positive Rate

2. Y-axis: True Positive Rate

3. Top left corner:
    - the 'ideal' point
    - false positive rate of zero
    - true positive rate of one

4. 'Steepness' of ROC curves is important
    - maximize the true positive rate
    - while minimizing the false positive rate

<img src="https://img.ceclinux.org/a1/c32857808150f646216766ee6140a016a54159.png">
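
A ROC curve can be computed in the same way with `roc_curve`. This sketch again assumes a logistic regression model on synthetic data; the area under the curve (`auc`) is included as a common single-number summary, which the notes above do not discuss.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).decision_function(X_test)

# False positive rate (x-axis) and true positive rate (y-axis)
# for each candidate threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)

# Area under the ROC curve: 1.0 is ideal, 0.5 corresponds to random guessing.
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.3f}")

# To draw the curve:
# import matplotlib.pyplot as plt; plt.plot(fpr, tpr)
```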

## Multi-Class Evaluation
1. Multi-class evaluation is an extension of the binary case
    - A collection of true vs predicted binary outcomes, one per class
    - Confusion matrices are especially useful
    - Classification report
2. Overall evaluation metrics are averages across classes
    - But there are different ways to average multi-class results
    - The support (number of instances) for each class is important to consider, e.g. in case of imbalanced classes
3. Multi-label classification: each instance can have multiple labels (not covered here)

### Multi-Class Confusion Matrix
<img src="https://img.ceclinux.org/3d/1b9658381d93761e91f336afd1dfea2b764d50.png">

**Always look at the confusion matrix of each of your models to gain useful insights.**
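
A small sketch of a multi-class confusion matrix and classification report. The three class names and the label lists are made up for illustration.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true = ["cat", "cat", "cat", "dog", "dog", "bird", "bird", "bird", "bird", "bird"]
y_pred = ["cat", "dog", "cat", "dog", "dog", "bird", "bird", "cat", "bird", "bird"]
labels = ["bird", "cat", "dog"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Row sums give the support (number of true instances) of each class.
support = cm.sum(axis=1)
print(dict(zip(labels, support.tolist())))

# Per-class precision, recall, F1 and support, plus averaged metrics.
print(classification_report(y_true, y_pred, labels=labels))
```

Off-diagonal entries show which classes get confused with which, for example one bird predicted as cat and one cat predicted as dog here.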

### Micro vs Macro Average
#### Macro-average
1. Each **class** has equal weight

2. Compute metric within each class

3. Average resulting metrics across classes
<img src="https://img.ceclinux.org/39/52703afcf6ee0c3c4834dc99af01fc655f6230.png">

#### Micro-average
1. Each **instance** has equal weight

2. Largest classes have most influence

3. Aggregate outcomes across all classes

4. Compute metric with aggregate outcomes
<img src="https://img.ceclinux.org/d7/f36d014927a1762df9658cfda98732732ddaea.png">

### Macro-Average vs Micro-Average
1. If the classes have about the same number of instances, macro- and micro-average will be about the same

2. If some classes are much larger (more instances) than others, and you want to:
    - weight your metric toward the largest ones, use micro-averaging
    - weight your metric toward the smallest ones, use macro-averaging
3. If the micro-average is much lower than the macro-average then examine the larger classes for poor metric performance

4. If the macro-average is much lower than the micro-average then examine the smaller classes for poor metric performance
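
The difference between the two averages can be seen on a tiny, made-up example with one large class and two single-instance classes, using the `average` parameter of `recall_score`:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced 3-class labels: class 0 has 8 instances,
# classes 1 and 2 have one instance each.
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 8 + [0] + [2]   # the single class-1 instance is misclassified

# Per-class recall: class 0 -> 8/8, class 1 -> 0/1, class 2 -> 1/1.
per_class = recall_score(y_true, y_pred, average=None)

# Macro: average the per-class recalls, each class weighted equally.
macro = recall_score(y_true, y_pred, average="macro")

# Micro: pool all instances first, each instance weighted equally.
micro = recall_score(y_true, y_pred, average="micro")

print("per-class recall:", per_class)
print(f"macro={macro:.3f} micro={micro:.3f}")
print(f"accuracy={accuracy_score(y_true, y_pred):.3f}")
```

Here the macro-average is much lower than the micro-average, which points to poor performance on one of the small classes (class 1).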

## Regression Evaluation
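1. Typically the $R^2$ score (r2_score) is enough
    - Reminder: it estimates how well future instances will be predicted
    - Best possible score is 1.0
    - A constant model that always predicts the mean of the targets scores 0.0 (worse models can score below 0)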
2. Alternative metrics include:
    - mean_absolute_error (absolute difference of target & predicted values); corresponds to the expected value of the L1 norm loss
    - mean_squared_error (squared difference of target & predicted values); corresponds to the expected value of the L2 norm loss
    - median_absolute_error (robust to outliers)
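
A small sketch comparing these regression metrics on made-up values, where one prediction is far off so the effect of an outlier is visible:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

# Hypothetical targets and predictions; the last prediction is off by 4.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)       # mean of |error|
mse = mean_squared_error(y_true, y_pred)        # mean of error^2
medae = median_absolute_error(y_true, y_pred)   # median of |error|

# Always predicting the mean of the targets gives R^2 = 0.
r2_constant = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(f"R^2={r2:.2f} MAE={mae:.2f} MSE={mse:.2f} MedAE={medae:.2f}")
print(f"R^2 of constant mean prediction: {r2_constant:.2f}")
```

The single large error dominates the mean squared error, while the median absolute error is barely affected by it.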
    
### Dummy Regressors
As in classification, comparison to a 'dummy' prediction model that uses a fixed rule can be useful. scikit-learn provides *dummy regressors* for this purpose.

The DummyRegressor class implements 4 simple baseline rules for regression, using the **strategy** parameter:

- **mean** predicts the mean of the training target values
- **median** predicts the median of the training target values
- **quantile** predicts a user-provided quantile of the training target values (e.g. value at the 75th percentile)
- **constant** predicts a custom constant value provided by the user
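
The four strategies can be sketched on a tiny made-up target vector (the data and the chosen constant are assumptions for illustration):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical training data; the features are ignored by dummy regressors.
X = np.zeros((5, 1))
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # note the outlier


def baseline_prediction(**params):
    """Fit a DummyRegressor with the given settings and return its single prediction."""
    return float(DummyRegressor(**params).fit(X, y).predict(X[:1])[0])


mean_pred = baseline_prediction(strategy="mean")
median_pred = baseline_prediction(strategy="median")
q75_pred = baseline_prediction(strategy="quantile", quantile=0.75)
const_pred = baseline_prediction(strategy="constant", constant=10.0)

print(f"mean={mean_pred} median={median_pred} q75={q75_pred} constant={const_pred}")
```

The outlier pulls the mean baseline far above the median baseline, which is worth keeping in mind when choosing which baseline to compare against.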

## Model Selection: Optimizing Classifiers for Different Evaluation Metrics

### Model Selection Using Evaluation Metrics
1. Train/test on same data
    - single metric
    - typically overfits and likely won't generalize well to new data
    - but can serve as a sanity check: low accuracy on the training set may indicate an implementation problem
2. Single train/test split
    - single metric
    - speed and simplicity
    - lack of variance information
3. K-fold cross-validation
    - K train-test splits
    - Average metric over all splits
    - Can be combined with parameter grid search: GridSearchCV (default cv=5 since scikit-learn 0.22; cv=3 in earlier versions)
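
K-fold cross-validation with different evaluation metrics can be sketched with `cross_val_score` and its `scoring` parameter. The model and the synthetic imbalanced dataset below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced binary data: roughly 90% negative, 10% positive.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

clf = LogisticRegression(max_iter=1000)

# Same model, same 5 folds, three different evaluation metrics.
cv_scores = {}
for metric in ("accuracy", "recall", "roc_auc"):
    cv_scores[metric] = cross_val_score(clf, X, y, cv=5, scoring=metric)
    print(f"{metric:9s} mean={cv_scores[metric].mean():.3f} "
          f"std={cv_scores[metric].std():.3f}")
```

Unlike a single train/test split, the per-fold scores also give a rough idea of the variance of each metric.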

### Training, Validation, and Test Framework for Model Selection and Evaluation
1. Using only cross-validation or a test set to do model selection may lead to more subtle overfitting / optimistic generalization estimates

2. Instead, use three data splits:
    - training set (model building)
    - validation set (model selection)
    - test set (final evaluation)
3. In practice:
    - create an initial training/test split
    - do cross-validation on the training data for model/parameter selection
    - save the held-out test set for final model evaluation
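
The three "in practice" steps can be sketched as follows. The SVC model, the gamma grid, the recall metric, and the synthetic data are all illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced binary data (an assumption for illustration only).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Step 1: create an initial training/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Step 2: cross-validate on the training data only, selecting the
# parameter value that optimizes the chosen metric (recall here).
param_grid = {"gamma": [0.001, 0.01, 0.1, 1.0]}
grid = GridSearchCV(SVC(), param_grid=param_grid, scoring="recall", cv=5)
grid.fit(X_train, y_train)

# Step 3: evaluate the selected model once on the held-out test set.
# grid.score uses the same metric that was passed as `scoring`.
test_recall = grid.score(X_test, y_test)

print("best gamma:", grid.best_params_["gamma"])
print(f"cross-validated recall: {grid.best_score_:.3f}")
print(f"held-out test recall:   {test_recall:.3f}")
```

Changing `scoring` (for example to "precision" or "roc_auc") may select a different parameter value, which is the point of optimizing for the metric that matches the application.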

### Concluding Notes
1. Accuracy is often not the right evaluation metric for many real-world machine learning tasks
    - False positives and false negatives may need to be treated very differently
    - Make sure you understand the needs of your application and choose an evaluation metric that matches your application, user, or business goals
2. Examples of additional evaluation methods include:
    - Learning curve: How much does accuracy (or other metric) change as a function of the amount of training data?
    - Sensitivity analysis: How much does accuracy (or other metric) change as a function of key learning parameter values?
    