# Classification Metrics

In some scenarios overall accuracy is all we care about, whereas in others the cost of misclassifying a single data point is huge. For example, when a bank decides whether a customer is eligible for a loan, it may be acceptable to misclassify some eligible customers as not eligible. But when a doctor classifies patients as having cancer or not, declaring a patient who has cancer to be cancer-free would be a serious mistake.

__Confusion Matrix__

A confusion matrix is a table that describes the performance of a classification model on a set of test data for which the true values are known. It is useful, but it is a point estimate computed on a single test set, so on its own it says nothing about statistical significance.


__Confusion matrix Terms:__
- True Positives (TP): Observations where the actual and predicted transactions were fraud

- True Negatives (TN): Observations where the actual and predicted transactions weren’t fraud

- False Positives (FP), also called False Alarms or Type I Errors: Observations where the actual transactions weren’t fraud but were predicted to be fraud

- False Negatives (FN), also called Type II Errors: Observations where the actual transactions were fraud but weren’t predicted to be fraud
- Ideal scenario: zero values for FP and FN.
- Samples in the FP set are actually negatives, and samples in the FN set are actually positives.
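The four terms above can be counted directly from paired labels. The following is a minimal sketch; the function name `confusion_counts` and the sample labels are made up for illustration (1 = fraud, 0 = not fraud).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted fraud, was fraud
            else:
                fp += 1  # predicted fraud, was not fraud (false alarm)
        else:
            if actual == positive:
                fn += 1  # missed fraud
            else:
                tn += 1  # predicted not fraud, was not fraud
    return tp, tn, fp, fn


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```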

__Accuracy (Acc)__

- Acc = (tp + tn) / (tp + tn + fp + fn)
- Classification accuracy is the fraction of correct predictions over the total number of instances. Accuracy can be a good metric if the classes are balanced.
- It doesn't tell us where the model is making errors. Answering this "where" question is an essential part of model-building.

__Classification Error / Error Rate / Misclassification Rate (ERR)__

- ERR = 1 - Acc
- Measures the ratio of incorrect predictions over the total number of instances evaluated.
- Applicable to multi-class and multi-label problems.
- Another problem with accuracy is that two classifiers can yield the same accuracy but perform differently, because they make different kinds of errors.
- Both accuracy and error rate are sensitive to imbalanced data, which makes accuracy an unreliable performance metric in that setting. To cope with this problem, we can choose to penalize false positives or false negatives. This gives two alternative metrics: precision and recall.
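A small worked example may make the imbalance problem concrete. The counts below are hypothetical: 990 negatives and 10 positives, scored against a model that always predicts "negative".

```python
def accuracy(tp, tn, fp, fn):
    """Acc = (tp + tn) / (tp + tn + fp + fn)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced test set: the model never predicts the positive class,
# so every negative is correct and every positive is missed.
tp, tn, fp, fn = 0, 990, 0, 10
acc = accuracy(tp, tn, fp, fn)
err = 1 - acc

print(f"accuracy={acc:.2f} error_rate={err:.2f}")
# Accuracy looks excellent even though not a single positive was found.
```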


__Precision__

Precision is the ability of a model to identify only the relevant data points. For example, for a text search on a set of documents, precision is the number of correct results (TP) divided by the number of all returned results (everything predicted as positive, i.e. TP + FP).

- Precision is the probability that a sample the classifier labels as positive is actually positive.
- Precision is the ability of the classifier not to label a negative sample as positive. With precision, we evaluate the model by the quality of its ‘positive’ predictions.
- Recall is TP/actual_yes whereas Precision is TP/predicted_yes.
- Neither Precision nor Recall considers TNs.

__Recall__

Recall is the ability of a model to find all the relevant cases within a dataset. For example, for a text search on a set of documents, recall is the number of correct results (TP) divided by the number of results that should have been returned (everything that actually belongs to the positive class, i.e. TP + FN). A recall of 1.0 can be reached trivially by flagging everyone: we would catch every terrorist, but precision would be very low, i.e. we would detain many innocents.

- Recall is the ability of the classifier to find all the positive samples. With recall, we evaluate the model against the ground-truth positives.
- e.g. If a sample is positive for the disease, what’s the probability that the system will pick it up?
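Both definitions reduce to one-line ratios over the confusion-matrix counts. This sketch uses hypothetical counts; returning 0.0 when the denominator is zero is an assumed convention, not something the text prescribes.

```python
def precision(tp, fp):
    """TP / predicted_yes. Returns 0.0 if nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / actual_yes. Returns 0.0 if there are no actual positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 40 predicted positive (30 correct), 50 actual positives.
p = precision(tp=30, fp=10)
r = recall(tp=30, fn=20)
print(f"precision={p:.2f} recall={r:.2f}")
```

Note that TN does not appear in either function, which is why neither metric changes when more true negatives are added.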

__F1 (F-beta measures)__

Used when the target variable is imbalanced. The F1 score is the harmonic mean of precision and recall. We use the harmonic mean instead of a simple average because it punishes extreme values: a classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0. The F1 score gives equal weight to both measures.

F1 is the F-beta measure with beta = 1. A beta below 1 (e.g. 0.5) weights precision more heavily, and a beta above 1 (e.g. 2) weights recall more heavily. F1 scores can be used to compare two models. F1 is used where true negatives don’t matter much. The best value for recall, precision and F1 is 1 and the worst value is 0.

A difference in F1 score between models reflects a difference in performance on the positive class. When the positive class is small, the F1 score makes more sense than accuracy. This is a common situation in fraud detection, where positive labels are few.
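The F-beta family can be sketched in a few lines. The formula used is the standard one, F_beta = (1 + beta²) · P · R / (beta² · P + R); the input values are hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when precision and recall are both 0
    return (1 + b2) * precision * recall / denom


# The extreme case from the text: the simple average is 0.5, F1 is 0.
extreme_f1 = f_beta(1.0, 0.0)

# Hypothetical classifier with precision 0.75 and recall 0.6.
f1 = f_beta(0.75, 0.6)
f2 = f_beta(0.75, 0.6, beta=2.0)        # leans towards recall
f_half = f_beta(0.75, 0.6, beta=0.5)    # leans towards precision
print(extreme_f1, round(f1, 4), round(f2, 4), round(f_half, 4))
```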

__Sensitivity or TPR or Recall__

- Sensitivity is the True Positive Rate and is also called positive recall, while specificity (TNR) is called negative recall.

__Specificity or TNR (the counterpart of Recall for the negative class)__

- Specificity = TN/(TN+FP).
- Sensitivity and Specificity may give you a biased picture, especially for imbalanced classes.
- Both precision and recall are necessary to determine if the classifier is performing well.
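As a quick sketch of the two rates, with hypothetical counts (100 actual positives, 1000 actual negatives):

```python
def sensitivity(tp, fn):
    """True Positive Rate: TP / (TP + FN), i.e. recall on the positive class."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True Negative Rate: TN / (TN + FP), i.e. recall on the negative class."""
    return tn / (tn + fp)


# Hypothetical counts for illustration.
tpr = sensitivity(tp=80, fn=20)
tnr = specificity(tn=900, fp=100)
fpr = 1 - tnr  # the x-axis of a ROC curve
print(f"sensitivity={tpr:.2f} specificity={tnr:.2f} fpr={fpr:.2f}")
```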


__Matthews Correlation Coefficient (MCC)__
- Similar to a correlation coefficient; its values lie between -1 and +1. A score of +1 is a perfect model, 0 is no better than random prediction, and -1 means total disagreement between predictions and actual labels. This property makes MCC easy to interpret.
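The notes do not spell out the formula, so here is a minimal sketch using the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Returning 0.0 when the denominator is zero is a common convention and is assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=5, tn=5, fp=0, fn=0)          # every prediction correct
inverted = mcc(tp=0, tn=0, fp=5, fn=5)         # every prediction wrong
always_neg = mcc(tp=0, tn=990, fp=0, fn=10)    # 99% accurate, but uninformative
print(perfect, inverted, always_neg)
```

The last case shows why MCC is often preferred on imbalanced data: a model that always predicts the majority class gets high accuracy but an MCC of 0.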


__ROC Curve (TPR vs 1-specificity aka FPR)__

ROC is used mainly for binary classification problems (multi-class use requires a scheme such as one-vs-rest). The ROC curve is a function of the threshold and shows the performance of a classification model at all thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis).

 
- The ROC curve/AUC score is most useful when we are comparing a model against itself, e.g. different versions or settings of the same model.
- The AUC value can range from 0 to 1. A random classifier has an AUC of about 0.5.
- ROC-AUC score is independent of the threshold set for classification because it only considers the rank of each prediction and not its absolute value. The same is not true for F1 score which needs a threshold value in case of probabilities output
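To make the "function of threshold" idea concrete, the sketch below sweeps every distinct score as a threshold and records one (FPR, TPR) point per threshold, then integrates with the trapezoid rule. The labels and scores are hypothetical, and the helper names are invented for this example.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, starting at (0, 0), one per distinct threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
auc = trapezoid_auc(points)
print(points, auc)
```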

__AUC__

- AUC is one of the popular ranking-type metrics. AUC is the area under the ROC curve. A perfect classifier has AUC = 1 and its ROC curve passes through the point (0, 1): 100% sensitivity (no FN) and 100% specificity (no FP).

- The probabilistic interpretation of ROC-AUC score is that if we randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC. Here, rank is determined according to order by predicted values.
- AUC is the percentage of the ROC plot that is underneath the curve.
- The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.
- AUC is useful even when there is high class imbalance (unlike classification accuracy)
  - Fraud case
    - Null accuracy is almost 99%
    - AUC is useful here
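The probabilistic interpretation given above can be checked by brute force: compare every positive sample with every negative sample and count how often the positive one is ranked higher. This is a sketch on hypothetical data; counting ties as half a win is the usual convention and is assumed here.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed tie handling
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)

# Squaring the scores changes their values but not their order,
# so the AUC is unchanged: only the ranking matters.
auc_squared = pairwise_auc(y_true, [s ** 2 for s in scores])
print(auc, auc_squared)
```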

General rule of thumb for interpreting AUC:

- .90-1 = Excellent
- .80-.90 = Good
- .70-.80 = Fair
- .60-.70 = Poor
- .50-.60 = Fail

AUC ROC uses the predicted probabilities to assess the model’s performance, but it only takes into account their order. It therefore does not measure the model’s ability to assign higher probabilities to samples that are more likely to be positive; log loss does.

Whereas the AUC is computed with regards to binary classification with a varying decision threshold, log loss actually takes “certainty” of classification into account.


__Precision Recall Curve (PRC)__

When dealing with highly skewed datasets (class imbalance), Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. PRC is also used for binary classification problems. 

__Log loss/ Logarithmic Loss / Logistic Loss / Cross-Entropy Loss__

AUC ROC considers the predicted probabilities for determining our model’s performance. However, it only takes into account the order of the probabilities, so it does not capture the model’s ability to predict higher probabilities for samples more likely to be positive. In that case, we could use the log loss, which is the negative average of the log of the probability the model assigned to the true class of each instance. The penalty grows without bound as the predicted probability of the true class approaches 0. It’s better to be somewhat wrong than emphatically wrong!

 
- When working with Log Loss, the classifier must assign probability to each class for all the samples.
- Log loss measures the UNCERTAINTY of the probabilities of the model by comparing them to the true labels and penalising the false classifications.
- Log loss is only defined for two or more labels.
- Log Loss declines as the predicted probability of the true class improves, so a Log Loss nearer to 0 indicates better predictions and a Log Loss further from 0 indicates worse ones.
- Log Loss lies in the range [0, ∞).
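A minimal sketch of binary log loss follows, to show the "somewhat wrong vs. emphatically wrong" effect. The probabilities are hypothetical, and clipping with `eps` to avoid `log(0)` is an assumed implementation detail.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Binary log loss; probs are predicted probabilities of the positive class."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees exactly 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0]
confident_right = log_loss(y_true, [0.9, 0.1])
somewhat_wrong = log_loss(y_true, [0.4, 0.6])
emphatically_wrong = log_loss(y_true, [0.01, 0.99])
print(round(confident_right, 4), round(somewhat_wrong, 4), round(emphatically_wrong, 4))
```

All three prediction sets use the same two samples; only the stated confidence differs, and the loss rises sharply as confident predictions turn out wrong.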

__Gini Coefficient__

- Gini = 2*AUC – 1
- Gini above 60% is a good model. For the case in hand we get Gini as 92.7%.

__Concordant – Discordant ratio__


__Cohen's kappa__

Kappa is similar to Accuracy score, but it takes into account the accuracy that would have happened anyway through random predictions.

Kappa = (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)
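The formula above can be evaluated from a binary confusion matrix. In this sketch, "expected accuracy" is taken to be the agreement you would get by chance if predictions and labels were independent but kept their marginal frequencies (the usual definition); the counts are hypothetical.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: P(pred=yes) * P(actual=yes) + P(pred=no) * P(actual=no)
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical counts: observed accuracy is 0.70, chance agreement is 0.50.
kappa = cohens_kappa(tp=20, fp=5, fn=10, tn=15)
print(round(kappa, 3))
```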

__Choice of Metrics__

It depends on the business objective and the cost considerations. The largest difference between ROC AUC and PR AUC is that ROC AUC measures how well the model handles both the positive class and the negative class, whereas PR AUC really only looks at the positive class. So in a balanced-class situation where you care about both negative and positive classes, the ROC AUC metric works great. In an imbalanced situation PR AUC is preferred, but keep in mind that it only measures how well the model handles the positive class!

- If we care about the absolute probabilistic difference, go with log loss.
- If we care only about the final class prediction and we don’t want to tune the threshold, go with the AUC score.
- F1 score is sensitive to the threshold, so we would want to tune it first before comparing models.

__Q When will we prefer F1 over ROC-AUC?__

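Prefer the PR curve, and metrics built on precision and recall such as F1, whenever the positive class is rare or when we care more about the false positives than the false negatives. Otherwise ROC-AUC is a reasonable default.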

To train binary classifiers, choose the appropriate metric for the task, evaluate the classifiers using cross-validation, select the precision/ recall tradeoff that fits our needs, and compare various models using ROC curves and ROC AUC scores.

# Regression Metrics


- RMSE (root mean squared error)
- MAE (mean absolute error)
- WMAE (weighted mean absolute error)
- RMSLE (root mean squared logarithmic error)…



In regression models, the most commonly known evaluation metrics include:

__R-squared (R2)__, which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.
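As a sketch of "proportion of variation explained", R2 can be computed as 1 minus the ratio of residual variation to total variation. This uses the standard definition R2 = 1 − SS_res / SS_tot; the data are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
r2 = r_squared(y_true, y_pred)
print(round(r2, 4))
```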

__Root Mean Squared Error (RMSE)__

- RMSE rests on the assumption that errors are unbiased and follow a normal distribution. Compared to mean absolute error, RMSE gives more weight to large errors and punishes them more.

- Because the MSE is squared, its units do not match those of the original output. RMSE is the square root of MSE, which brings the units back.
- Since the MSE and RMSE both square the residual, they are similarly affected by outliers.
- The RMSE is analogous to the standard deviation and is a measure of how widely the residuals are spread out.
- Generally, RMSE will be higher than or equal to MAE. The lower the RMSE, the better the model.
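The points above can be seen on a toy example: RMSE is never below MAE, and a single large error moves RMSE much more than MAE. The data are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical data with small errors only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae_clean, rmse_clean = mae(y_true, y_pred), rmse(y_true, y_pred)

# Same data plus one badly mispredicted point (error of 10).
mae_outlier = mae(y_true + [10.0], y_pred + [20.0])
rmse_outlier = rmse(y_true + [10.0], y_pred + [20.0])

print(round(mae_clean, 3), round(rmse_clean, 3))
print(round(mae_outlier, 3), round(rmse_outlier, 3))
```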

__Residual Standard Error (RSE)__, also known as the model sigma, is a variant of the RMSE adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice, the difference between RMSE and RSE is very small, particularly for large multivariate data.

__Mean Absolute Error (MAE)__: like the RMSE, the MAE measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes, MAE = mean(abs(observeds - predicteds)). MAE is less sensitive to outliers than RMSE.

- Average of the absolute difference between the original values and the predicted values.
- Does not give any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data.
- The smaller the MAE, the better the model.
- Robust to outliers
- Range [0, +infinity)

__Mean Squared Error__

- Takes the average of the square of the difference between the original values and the predicted values.
- As we take the square of the error, the effect of larger errors (sometimes outliers) becomes more pronounced than that of smaller errors. The model is penalized more for predictions that differ greatly from the corresponding actual value.
- Before applying MSE, we must eliminate all nulls/infinites from the input.
- Not robust to outliers
- Range [0, +infinity)

__Root Mean Squared Logarithmic Error__

- We take the log of the predictions and actual values.
- This changes what we measure: the difference of logs is the log of a ratio, so RMSLE reflects the relative error rather than the absolute difference.
- RMSLE is usually used when we don’t want to penalize huge differences between the predicted and the actual values when both are huge numbers.
- If both predicted and actual values are small: RMSE and RMSLE are approximately the same.
- If either the predicted or the actual value is big: RMSE > RMSLE
- If both predicted and actual values are big: RMSE > RMSLE (RMSLE becomes almost negligible by comparison)

The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables don't contribute significantly to explaining the outcome. In other words, for a least-squares fit, including additional variables will never decrease the R2 or increase the RMSE measured on the training data, whether or not the variables are useful. So, we need a more robust metric to guide the model choice.

Concerning R2, there is an adjusted version, called Adjusted R-squared, which adjusts the R2 for having too many variables in the model.
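The notes do not give the formula, so this sketch uses the standard one: Adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The input values are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p predictors (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical: the same raw R2 of 0.90 on 50 observations,
# reached with 5 predictors versus 20 predictors.
adj_few = adjusted_r2(0.90, n=50, p=5)
adj_many = adjusted_r2(0.90, n=50, p=20)
print(round(adj_few, 4), round(adj_many, 4))
```

With the raw R2 held fixed, the model that needed more variables gets the lower adjusted score.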

Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp - that are commonly used for model evaluation and selection. Each is intended as an estimate of the model prediction error (MSE). The lower these metrics, the better the model.

AIC stands for Akaike’s Information Criterion, a metric developed by the Japanese statistician Hirotugu Akaike in 1970. The basic idea of AIC is to penalize the inclusion of additional variables in a model. It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model.

AICc is a version of AIC corrected for small sample sizes.

BIC (Bayesian Information Criterion) is a variant of AIC with a stronger penalty for including additional variables in the model.

Mallows Cp: A variant of AIC developed by Colin Mallows.
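The notes describe these criteria only in words. As a sketch, the standard likelihood-based definitions are AIC = 2k − 2·ln(L), AICc = AIC + (2k² + 2k)/(n − k − 1) and BIC = k·ln(n) − 2·ln(L), where k is the number of estimated parameters, n the number of observations, and ln(L) the maximized log-likelihood. The numbers below are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction (requires n > k + 1)."""
    return aic(log_likelihood, k) + (2 * k * k + 2 * k) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fitted model: log-likelihood -100, 3 parameters, 50 observations.
a = aic(-100.0, k=3)
ac = aicc(-100.0, k=3, n=50)
b = bic(-100.0, k=3, n=50)
print(a, round(ac, 3), round(b, 3))
```

The per-parameter penalty is 2 for AIC and ln(n) for BIC, so BIC penalizes extra variables more strongly whenever n is 8 or larger.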


Common metrics in regression:

__Mean Squared Error vs. Mean Absolute Error__

RMSE gives a relatively high weight to large errors. The RMSE is most useful when large errors are particularly undesirable.

The MAE is a linear score: all the individual differences are weighted equally in the average. MAE is more robust to outliers than MSE.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$$

__Root Mean Squared Logarithmic Error__

RMSLE penalizes an under-predicted estimate more than an over-predicted estimate of the same absolute size (RMSE, by contrast, treats the two symmetrically).

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}$$

where $p_i$ is the $i$-th prediction, $a_i$ the $i$-th actual response, and $\log(b)$ the natural logarithm of $b$.
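A short sketch of the RMSLE formula, using `math.log1p(x)` for log(x + 1). The numbers are hypothetical and chosen to show two properties: under-prediction costs more than over-prediction by the same amount, and doubling a large value costs about the same as doubling a small one.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error; inputs must be greater than -1."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Same absolute miss of 50, in opposite directions.
under = rmsle([100.0], [50.0])    # predicted too low
over = rmsle([100.0], [150.0])    # predicted too high

# Same relative miss (prediction is double the actual) at two scales.
small_scale = rmsle([10.0], [20.0])       # absolute error 10
large_scale = rmsle([1000.0], [2000.0])   # absolute error 1000

print(round(under, 4), round(over, 4))
print(round(small_scale, 4), round(large_scale, 4))
```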

__Weighted Mean Absolute Error__

The weighted average of absolute errors. MAE and RMSE assume that each prediction provides equally precise information about the error variation, i.e. that the standard deviation of the error term is constant over all the predictions. WMAE relaxes this by giving each error its own weight $w_i$. Example: recommender systems (different weights for past and recent products).

$$\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i\,\lvert y_i - \hat{y}_i\rvert$$
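A minimal sketch of WMAE follows. The values and weights are hypothetical; here the last observation is treated as twice as important as the others.

```python
def wmae(y_true, y_pred, weights):
    """Weighted mean absolute error: sum(w * |y - y_hat|) / sum(w)."""
    weighted_errors = sum(w * abs(y - p) for y, p, w in zip(y_true, y_pred, weights))
    return weighted_errors / sum(weights)


# Hypothetical data for illustration.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
weights = [1.0, 1.0, 2.0]

weighted = wmae(y_true, y_pred, weights)
unweighted = wmae(y_true, y_pred, [1.0, 1.0, 1.0])  # equal weights reduce to plain MAE
print(weighted, round(unweighted, 4))
```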


Minimizing the squared error over a set of numbers results in finding its mean, and minimizing the absolute error results in finding its median. This is the reason why MAE is robust to outliers whereas RMSE is not.
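This claim can be checked by brute force: try every integer candidate as the single "prediction" for a small data set and see which one minimizes each loss. The data are hypothetical and include one outlier (100).

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]
candidates = range(0, 101)  # every integer guess from 0 to 100

# The constant that minimizes the total squared error.
best_l2 = min(candidates, key=lambda c: sum((x - c) ** 2 for x in data))

# The constant that minimizes the total absolute error.
best_l1 = min(candidates, key=lambda c: sum(abs(x - c) for x in data))

print(best_l2, mean(data))     # squared error is pulled towards the outlier
print(best_l1, median(data))   # absolute error stays with the bulk of the data
```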





__Q Why are some scores like MSE negative in scikit-learn?__

Some model evaluation metrics such as mean squared error (MSE) are negative when calculated in scikit-learn.

This is confusing, because error scores like MSE cannot actually be negative, with the smallest value being zero or no error.

The scikit-learn library has a unified model scoring system which assumes that all model scores are maximized. For this system to work with scores that are minimized, like MSE and other measures of error, those scores are inverted by making them negative.

This can also be seen in the name of the metric, e.g. 'neg' is used in 'neg_mean_squared_error'.

When interpreting the negative error scores, you can ignore the sign to recover the error itself. When comparing models, the score closer to zero is the better one.
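The convention can be illustrated without scikit-learn itself. The sketch below is plain Python, not the library's implementation: it negates MSE so that "pick the maximum score" selects the model with the smallest error. The model names and predictions are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: always >= 0, lower is better."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def neg_mse(y_true, y_pred):
    """Negated MSE: always <= 0, higher is better."""
    return -mse(y_true, y_pred)


y_true = [3.0, 5.0, 7.0]
# Hypothetical predictions from two candidate models.
predictions = {
    "model_a": [2.5, 5.0, 8.0],
    "model_b": [4.0, 6.0, 9.0],
}

scores = {name: neg_mse(y_true, preds) for name, preds in predictions.items()}
best = max(scores, key=scores.get)  # a single "maximize the score" rule now works
print(scores, best)
```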

__MAE vs. MSE__

- Although RMSE is more complex and biased towards larger deviations, it is still the default metric of many models, because a loss function defined in terms of squared error is smoothly differentiable, whereas minimizing Mean Absolute Error requires more complicated tools such as linear programming.
- If we want a metric just to compare two models from an interpretation point of view, then MAE may be a better choice.
- The units of both RMSE and MAE are the same as those of the y values, which is not true for R².
- Minimizing the squared error (𝐿2) over a set of numbers results in finding its mean, and minimizing the absolute error (𝐿1) results in finding its median.


__Adjusted R² over RMSE__

The absolute value of RMSE does not by itself tell how good or bad a model is, because it depends on the scale of the target. It can only be used to compare models, whereas Adjusted R² can be read on its own. For example, if a model has an adjusted R² of 0.05 then it is definitely bad.

However, if we care only about prediction accuracy then RMSE is the better choice. It is computationally simple, easily differentiable, and the default metric for most models.

## Multi class classification

- Multi-class: classify a set of images of fruits into exactly one of these categories: apples, bananas, and oranges.

- Multi-label: tag a blog post with one or more topics like technology, religion, politics.


- Multi-class classifiers can distinguish between more than two classes.

Random Forest classifiers or Naive Bayes classifiers are capable of handling multiple classes directly. Others (Support Vector Machine classifiers or linear classifiers) are strictly binary classifiers.

Task: 0 - 9 digits classification

__One vs All (OvA) Classification Strategy__

Train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). When we want to classify an image, we get the decision score from each classifier for that image and we select the class whose classifier outputs the highest score.

Example: Almost all classification algorithms.

__One vs One (OvO) Strategy__

Train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. If there are N classes, we need to train N × (N – 1) / 2 classifiers. For the MNIST problem, this means training 45 binary classifiers. When we want to classify an image, we have to run the image through all 45 classifiers and see which class wins the most duels. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

Example: Support Vector Machines scale poorly with the size of the training set, so for them it is faster to train many classifiers on small training sets than to train few classifiers on large training sets.

Scikit-Learn detects when we try to use a binary classification algorithm for a multi-class classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO). For the MNIST problem, under the hood, Scikit-Learn trains 10 binary classifiers, gets their decision scores for the image, and selects the class with the highest score.

If we want to force Scikit-Learn to use OvO or OvA, we can use the OneVsOneClassifier or OneVsRestClassifier classes.
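The two decision rules can be sketched without training anything. In the plain-Python sketch below, the per-class scores and the `toy_duel` function are hypothetical stand-ins for trained binary classifiers; only the counting and voting logic is the point.

```python
from collections import Counter
from itertools import combinations

classes = list(range(10))                     # digits 0-9
ova_count = len(classes)                      # one detector per class
ovo_pairs = list(combinations(classes, 2))    # one classifier per pair of classes


def ova_predict(scores_by_class):
    """OvA: pick the class whose detector outputs the highest decision score."""
    return max(scores_by_class, key=scores_by_class.get)


def ovo_predict(duel_winner):
    """OvO: run every pairwise duel and pick the class that wins the most."""
    votes = Counter(duel_winner(a, b) for a, b in ovo_pairs)
    return votes.most_common(1)[0][0]


# Hypothetical decision scores for one image: the 7-detector is most confident.
toy_scores = {digit: 0.1 for digit in classes}
toy_scores[7] = 0.9


def toy_duel(a, b):
    """Hypothetical pairwise classifier: 3 wins any duel it takes part in."""
    return 3 if 3 in (a, b) else min(a, b)


print(ova_count, len(ovo_pairs))
print(ova_predict(toy_scores), ovo_predict(toy_duel))
```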


# NLP Metric 

__BLEU (Bilingual Evaluation Understudy)__

It is mostly used to measure the quality of machine translation with respect to the human translation. It uses a modified form of precision metric.

Example: Reference: The cat is sitting on the mat

Machine Translation 1: On the mat is a cat

Machine Translation 2: There is cat sitting cat

Machine Translation 3: The cat is sitting on the tam
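The "modified precision" at the core of BLEU can be sketched on the three translations above. Each candidate n-gram is counted at most as many times as it appears in the reference (clipping). Lowercasing and whitespace tokenization are assumptions of this sketch, and the full BLEU score, which combines several n-gram orders with a brevity penalty, is not implemented here.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0


reference = "The cat is sitting on the mat"
mt1 = "On the mat is a cat"
mt2 = "There is cat sitting cat"
mt3 = "The cat is sitting on the tam"

for name, mt in [("MT1", mt1), ("MT2", mt2), ("MT3", mt3)]:
    unigram = modified_precision(mt, reference, n=1)
    bigram = modified_precision(mt, reference, n=2)
    print(name, round(unigram, 3), round(bigram, 3))
```

In Machine Translation 2 the word "cat" appears twice but is credited only once, because the reference contains it once. Word order only starts to matter from bigrams upward, which is where Machine Translation 1 loses most of its score.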