# Model Performance

Terminology

* Labeled data - split into *training set* and *test set*

Model fit - see [here](https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html)

* Underfitting - performs poorly on both the training set and the test set.
    * Model fails to capture the relationship between inputs and output.
    * Fix by adding features, adding more complex features, adding more examples to the training set, or optimizing hyperparameters
* Overfitting - performs well on the training data and poorly on the test data
    * Model 'memorizes' the training data and fails to generalize to unseen data
    * Correct by removing the more complex features or optimizing hyperparameters
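
A minimal sketch of how underfitting and overfitting can show up when comparing training error with test error. Everything here is an assumption made for illustration: the synthetic noisy sine data, the use of polynomial degree as a stand-in for model complexity, and the helper names `rmse` and `fit_and_score`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial of the given degree; return (train RMSE, test RMSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return (rmse(y_train, np.polyval(coeffs, x_train)),
            rmse(y_test, np.polyval(coeffs, x_test)))

# Synthetic data (illustrative assumption): noisy samples of a sine wave.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, x_test.size)

# Degree 1 is too simple for a sine wave; degree 9 can pass through all 10 training points.
scores = {d: fit_and_score(d, x_train, y_train, x_test, y_test) for d in (1, 3, 9)}
for degree, (train_err, test_err) in scores.items():
    print(f"degree={degree}: train RMSE={train_err:.3f}, test RMSE={test_err:.3f}")
```

A model that is poor on both sets suggests underfitting; a near-zero training error paired with a noticeably larger test error suggests overfitting.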
    
Supervised Learning Algorithm Types

* Regression - output is continuous numeric
* Binary classification - output is binary
* Multi-class classification - Categorical - one of many possible outcomes

## Regression Model Performance

Common techniques for evaluating model performance:

* Visually observe using plots
    * Plot both predicted and true values for a visual comparison
* Residual histograms
    * The residual is the difference between the true target and the predicted target.
    * Ideally centered around 0 with a bell shape, which means the errors are random in nature rather than inherent in the model.
* Evaluate with metrics like root mean square error (RMSE)
    * Metrics are handy for quantifying model performance
    * Square the difference between the actual and predicted values, average the squares, then take the square root.
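
The residual and RMSE calculations can be sketched in plain Python as follows. The sample values and the function names `residuals` and `rmse` are made up for illustration.

```python
import math

def residuals(y_true, y_pred):
    """Residual = true target minus predicted target, per sample."""
    return [t - p for t, p in zip(y_true, y_pred)]

def rmse(y_true, y_pred):
    """Square each residual, average the squares, then take the square root."""
    res = residuals(y_true, y_pred)
    return math.sqrt(sum(r * r for r in res) / len(res))

y_true = [3.0, 5.0, 7.0, 9.0]   # made-up actual values
y_pred = [2.0, 5.0, 8.0, 11.0]  # made-up predictions

print(residuals(y_true, y_pred))  # [1.0, 0.0, -1.0, -2.0]
print(rmse(y_true, y_pred))       # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.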

Look at [this notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/PerformanceEvaluation/regression_model_performance.ipynb)




## Binary Classifier Performance

Pass/fail, true/false, 1/0

* Positive class - the condition we are interested in detecting
* Negative class - the normal condition

* Example: students to be admitted - the positive class is admitted, the negative class is not admitted
* Example: individuals at risk of heart disease - the positive class is those at risk, the negative class is those not at risk

Typically the positive class is the class the algorithm needs to detect.

Some algorithms produce a binary output; others produce a raw score that is the probability of being positive. For the latter, choose a cutoff value: scores below the cutoff are classified as negative, and the rest as positive.
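
A minimal sketch of applying a cutoff to raw scores. The function name `to_labels` and the scores are hypothetical; treating a score exactly equal to the cutoff as positive is a convention assumed here.

```python
def to_labels(scores, cutoff=0.5):
    """Map raw scores to 1 (positive) when score >= cutoff, else 0 (negative)."""
    return [1 if s >= cutoff else 0 for s in scores]

raw_scores = [0.10, 0.45, 0.50, 0.80, 0.95]  # made-up scores

print(to_labels(raw_scores))              # [0, 0, 1, 1, 1]
print(to_labels(raw_scores, cutoff=0.9))  # [0, 0, 0, 0, 1]
```

Raising the cutoff turns some predicted positives into predicted negatives, which is the lever behind the threshold tradeoffs.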

Compare output to label to determine performance.

Evaluation techniques, binary outputs:

* Plots
* [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
* Metrics like recall and precision

### Plots

* Look at the [binary rawscore performance evaluation notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/PerformanceEvaluation/binary_classifier_rawscore_evaluation.ipynb) for some plot examples.

### Confusion Matrix

* Useful in evaluating the performance of binary classifiers

|                  | Predicted Positive | Predicted Negative |
|------------------|--------------------|-----------------|
| Actual Positive  | True positive      | False  negative |
| Actual Negative  | False positive     | True negative   |

True Positive tells us how many samples were correctly classified as positive. True Negative tells us how many samples were correctly classified as negative. False negative tells us how many positive samples were misclassified and false positive tells us how many negative samples were misclassified.

If you sum the values in a row, you get the number of actual positives or actual negatives. If you sum the values in a column, you get the number of predicted positives or predicted negatives.

Can also use fractional values in the matrix by dividing the positive row by the number of actual positives, and the negative row by the number of actual negatives. This gives the true positive rate, false negative rate, etc.

Confusion matrix can be computed using sklearn - see [here](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/PerformanceEvaluation/binary_classifier_performance.ipynb)
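
A small sketch using `confusion_matrix` from `sklearn.metrics`. The labels and predictions are made up; passing `labels=[1, 0]` is a choice made here so the layout matches the table above, since sklearn otherwise sorts the labels and puts the negative class first.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # made-up labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # made-up predictions

# Rows are actual classes, columns are predicted classes, positive class first.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm
print(cm)  # [[3 1]
           #  [1 5]]

# Row-normalized version: each row is divided by its count of actual samples,
# giving TPR and FNR in the first row, FPR and TNR in the second.
cm_rates = confusion_matrix(y_true, y_pred, labels=[1, 0], normalize="true")
print(cm_rates)
```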

### Metrics

Review - lecture 29

True positive rate

* Aka TPR, Recall, Probability of Detection
* Count of samples correctly classified as positive / Number of actual positives
    * TP/(TP + FN)
* Recall value closer to 1 better, closer to 0 worse
* Example: radar operator watching skies for enemy planes.
    * Positive class is enemy plane, negative is friendly plane
    * True positive rate, or probability of detection - the probability of correctly classifying an enemy plane

True negative rate

* Count of samples correctly classified as negative / Number of actual negatives
* Closest to 1 is better, closer to 0 is worse
* The probability of correctly classifying a friendly plane

False positive rate

* FPR, Probability of false alarm
* How many negatives were falsely classified as positives (fraction)
* Count of negative samples incorrectly classified as positive / Count of actual negatives
* Values closer to 0 are better, closer to 1 are worse
* Probability of false alarm - probability of misclassifying a friendly plane as an enemy plane

False negative rate

* FNR, Misses
* How many positives were misclassified as negative (fraction)
* Count of positive samples incorrectly classified as negative / Count of actual positives
* Closer to 0 is better, closer to 1 worse
* Probability of misclassifying an enemy plane as a friendly plane

Precision

* True Positive / (True Positive + False Positive)
* How many positives classified by the algorithm are really positive?
* Closer to 1 better
* Precision would go up as enemy planes are correctly identified, while minimizing false alarms

Accuracy

* Measure of overall performance
* (True positives + True negatives) / (Positive + Negative)
* How many positives and negatives were correctly classified (fraction)
* Closer to 1 better
* Not a good measure for skewed datasets
* Accuracy would go up when enemy planes and friendly planes are correctly identified

F1 Score

* The [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of precision and recall
* 2(precision)(recall) / (precision + recall)
* closer to 1 better
* Another measure of how well the model can identify positives

Sklearn has a classification_report function available
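
The formulas in this section can be checked by hand from the four confusion-matrix counts. This is a sketch only: the function name `binary_metrics` and the plane counts are made up, and there is no guard against division by zero.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the metrics above from the four confusion-matrix counts."""
    recall = tp / (tp + fn)     # true positive rate
    precision = tp / (tp + fp)
    return {
        "tpr": recall,
        "tnr": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "precision": precision,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Made-up skewed data: 100 enemy planes (positive), 900 friendly planes (negative).
m = binary_metrics(tp=80, fn=20, fp=90, tn=810)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.89 while precision is only about 0.47, which illustrates why accuracy alone can mislead on skewed datasets.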

### Area Under the Curve Metrics

Some binary classifiers are based on algorithms that generate a score between 0 and 1, with a cutoff threshold used to partition the two classes.

* Often 0.5 is used as the cut off
* Sometimes a higher threshold makes sense, for example 0.8 before something is classified as spam, reducing the number of regular emails classified as spam with the tradeoff of a higher false negative rate.
* Sometimes a lower threshold makes sense, for example 0.3 to detect a disease with the tradeoff of a higher false alarm rate.

Can use area under the curve to evaluate performance under different thresholds - see [this](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/PerformanceEvaluation/binary_classifier_rawscore_evaluation.ipynb) notebook.

AUC stands for Area Under the Curve. The curve is a plot with the probability of false alarm (false positive rate) on the x-axis and the probability of detection (true positive rate, or recall) on the y-axis. This curve is called the Receiver Operating Characteristic (ROC) curve.

By plotting False Alarm versus Recall at various cutoff thresholds, we can form a curve. A good model has an AUC closer to 1. An AUC of 0.5 is equivalent to a random guess; an AUC closer to 0 is unusual and indicates the model is flipping results.

Can use the `roc_auc_score` function from `sklearn.metrics`
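
A short sketch with `roc_curve` and `roc_auc_score`. The four labels and scores are made up for illustration; `flipped_auc` is a hypothetical name used to show what happens when the scores are inverted.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]            # made-up labels
y_score = [0.1, 0.4, 0.35, 0.8]  # made-up raw scores

# Each threshold yields one (false positive rate, true positive rate) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")  # 0.75

# Inverting the scores flips the ranking, so the AUC becomes 1 - 0.75.
flipped_auc = roc_auc_score(y_true, [1 - s for s in y_score])
print(f"flipped AUC = {flipped_auc:.2f}")  # 0.25
```

The AUC of 0.75 matches the fraction of (positive, negative) pairs in which the positive sample gets the higher score: 3 of the 4 pairs here.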



## Multiclass Classifier Performance

Multiclass classification algorithms predict one of many classes

* Example: grade prediction - A, B, C, D
* Classifier assigns a score for each class indicating the likelihood of the sample belonging to that class
* Can use the same metrics as per binary classifier, using one versus all
* See [this notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/PerformanceEvaluation/multiclass_classifier_performance.ipynb)
* See also the [AWS docs](https://docs.aws.amazon.com/machine-learning/latest/dg/multiclass-classification.html)

Class level vs model level metrics - average class level metrics to get model performance

* See [here](https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin)
* Macro - doesn't consider number of samples of each class, not a good measure for skewed data
* Weighted - assigns a weight to each class based on the number of samples of each class, better as it accounts for skewed distributions
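
A sketch of class-level versus model-level metrics, using recall as the example metric and `recall_score` from `sklearn.metrics`. The skewed grade data is made up for illustration.

```python
from sklearn.metrics import recall_score

# Made-up, skewed grade data: 6 A's, 3 B's, 1 C.
y_true = ["A"] * 6 + ["B"] * 3 + ["C"]
y_pred = ["A"] * 6 + ["B", "B", "A"] + ["B"]

labels = ["A", "B", "C"]

# Class-level recall, one value per class (one versus all).
per_class = recall_score(y_true, y_pred, labels=labels, average=None)
print(per_class)  # A: 6/6, B: 2/3, C: 0/1

# Macro: plain average of the class-level values, ignoring class sizes.
macro = recall_score(y_true, y_pred, labels=labels, average="macro")
print(f"macro    = {macro:.3f}")     # (1 + 2/3 + 0) / 3 = 0.556

# Weighted: class-level values averaged with weights equal to the class sample counts.
weighted = recall_score(y_true, y_pred, labels=labels, average="weighted")
print(f"weighted = {weighted:.3f}")  # (6*1 + 3*(2/3) + 1*0) / 10 = 0.800
```

The single misclassified C sample pulls the macro average down sharply, while the weighted average stays close to the performance on the dominant class.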