# Regression Metrics


- RMSE (root mean squared error)
- MAE (mean absolute error)
- WMAE (weighted mean absolute error)
- RMSLE (root mean squared logarithmic error)…



In regression models, the most commonly used evaluation metrics include:

__R-squared (R2)__, which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.
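
As a minimal sketch of the "proportion of variation explained" definition, R2 can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample numbers below are made up for illustration.

```python
def r_squared(observed, predicted):
    """R2 = 1 - SS_res / SS_tot (proportion of variation explained)."""
    mean_obs = sum(observed) / len(observed)
    # Variation left unexplained by the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total variation of the observed values around their mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(observed, predicted)
print(round(r2, 3))  # 0.95
```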

__Root Mean Squared Error (RMSE)__

- RMSE is most appropriate when the errors are unbiased and roughly normally distributed. Compared with mean absolute error, RMSE gives more weight to large errors and penalizes them more heavily.
- Because the MSE is squared, its units do not match those of the original output. RMSE is the square root of MSE, which brings the metric back to the original units.
- Since the MSE and RMSE both square the residuals, they are similarly affected by outliers.
- The RMSE is analogous to the standard deviation and measures how widely the residuals are spread.
- RMSE is always greater than or equal to MAE. The lower the RMSE, the better the model.
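
The relationship between RMSE and MAE can be checked on a toy example. This is only a sketch: the helper names and the numbers are invented, and the two prediction sets are chosen so that the total absolute error is the same while one of them concentrates it in a single large miss.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error."""
    mse = sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)


observed = [10.0, 20.0, 30.0, 40.0]
even_errors = [12.0, 18.0, 32.0, 38.0]    # every error has magnitude 2
one_big_error = [10.0, 20.0, 30.0, 48.0]  # a single error of magnitude 8

print(mae(observed, even_errors), rmse(observed, even_errors))      # 2.0 2.0
print(mae(observed, one_big_error), rmse(observed, one_big_error))  # 2.0 4.0
```

Both prediction sets have the same MAE, but RMSE doubles when the error is concentrated in one observation. The two metrics coincide only when all errors have the same magnitude.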

__Residual Standard Error (RSE)__, also known as the model sigma, is a variant of the RMSE adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice, the difference between RMSE and RSE is very small, particularly when the number of observations is large relative to the number of predictors.
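
For a linear model with an intercept, the adjustment can be written out as follows. The symbols are assumptions introduced here: $\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares, $n$ is the number of observations, and $p$ is the number of predictors.

$$
\mathrm{RMSE} = \sqrt{\frac{\mathrm{RSS}}{n}}, \qquad \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}
$$

The two differ only in the denominator, so the gap shrinks as $n$ grows relative to $p$.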

__Mean Absolute Error (MAE)__: like the RMSE, the MAE measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes, MAE = mean(abs(observed - predicted)). MAE is less sensitive to outliers than RMSE.

- Average of the absolute differences between the original values and the predicted values.
- Does not give any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data.
- The smaller the MAE, the better the model.
- Robust to outliers
- Range [0, +infinity)

__Mean Squared Error__

- Takes the average of the squared differences between the original values and the predicted values.
- Because the errors are squared, larger errors (sometimes outliers) have a more pronounced effect than smaller errors. The model is penalized more for predictions that differ greatly from the corresponding actual values.
- Before applying MSE, we must remove or impute all null and infinite values in the input.
- Not robust to outliers
- Range [0, +infinity)

__Root Mean Squared Logarithmic Error__

- We take the log of the predictions and of the actual values before computing the squared error.
- What changes is the quantity being measured: RMSLE reflects the relative (ratio) error rather than the absolute error.
- RMSLE is usually used when we don't want to penalize huge differences between the predicted and the actual values when both are huge numbers.
- If both predicted and actual values are small: RMSE and RMSLE are approximately the same.
- If either the predicted or the actual value is big: RMSE > RMSLE
- If both predicted and actual values are big: RMSE > RMSLE (RMSLE becomes almost negligible in comparison)
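
The scale behaviour described in the list can be sketched with two toy data sets that have the same relative errors at very different magnitudes. The helper names and the numbers are assumptions made for this illustration; the log is taken as `log(1 + x)`.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: +20% and -15% errors at a small and at a large scale.
small_actual, small_pred = [0.10, 0.20], [0.12, 0.17]
big_actual, big_pred = [10_000.0, 20_000.0], [12_000.0, 17_000.0]

# Small values: the two metrics are of the same order of magnitude.
print(rmse(small_actual, small_pred), rmsle(small_actual, small_pred))
# Big values: RMSE is in the thousands while RMSLE stays well below 1.
print(rmse(big_actual, big_pred), rmsle(big_actual, big_pred))
```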

The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables do not contribute significantly to explaining the outcome. In other words, when measured on the training data of a least-squares fit, including additional variables will never decrease the R2 and never increase the RMSE. So we need a more robust metric to guide model choice.

Concerning R2, there is an adjusted version, called Adjusted R-squared, which penalizes the R2 for the number of variables in the model.
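
A minimal sketch of the usual adjustment formula, 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the numbers are illustrative assumptions.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same raw R2 and sample size, but a different number of predictors.
few = adjusted_r_squared(0.80, n_samples=50, n_predictors=5)
many = adjusted_r_squared(0.80, n_samples=50, n_predictors=20)

print(round(few, 3))   # 0.777
print(round(many, 3))  # 0.662
```

With the raw R2 held fixed, the model that uses more predictors receives the lower adjusted score.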

Additionally, there are four other important metrics (AIC, AICc, BIC and Mallows Cp) that are commonly used for model evaluation and selection. Each combines a measure of fit with a penalty for model complexity and serves as a proxy for the model's prediction error on new data. The lower these metrics, the better the model.

__AIC (Akaike's Information Criterion)__ is a metric developed by the Japanese statistician Hirotugu Akaike in the early 1970s. The basic idea of AIC is to penalize the inclusion of additional variables in a model. It adds a penalty that increases the score whenever additional terms are included. The lower the AIC, the better the model.

__AICc__ is a version of AIC corrected for small sample sizes.

__BIC (Bayesian Information Criterion)__ is a variant of AIC with a stronger penalty for including additional variables in the model.
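
For reference, the standard textbook forms of these criteria are given below. The symbols are introduced here and do not come from the notes above: $\hat{L}$ is the maximized likelihood of the model, $k$ the number of estimated parameters, and $n$ the number of observations.

$$
\begin{aligned}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\mathrm{AICc} &= \mathrm{AIC} + \frac{2k^2 + 2k}{n - k - 1} \\
\mathrm{BIC} &= k\ln n - 2\ln\hat{L}
\end{aligned}
$$

The per-parameter penalty is $2$ for AIC and $\ln n$ for BIC, so BIC penalizes extra variables more strongly as soon as $n \ge 8$. The AICc correction term vanishes as $n$ grows.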

__Mallows Cp__ is a criterion closely related to AIC for linear regression, developed by Colin Mallows.


__Common metrics in regression:__

__Mean Squared Error vs. Mean Absolute Error:__ RMSE gives a relatively high weight to large errors, so it is most useful when large errors are particularly undesirable.

__The MAE is a linear score:__ all the individual differences are weighted equally in the average. MAE is more robust to outliers than MSE.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
$$

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert
$$

__Root Mean Squared Logarithmic Error:__ RMSLE penalizes an under-predicted estimate more than an over-predicted estimate of the same absolute size (unlike RMSE, which penalizes both equally).

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}
$$

where $p_i$ is the $i$-th prediction, $a_i$ is the $i$-th actual response, and $\log(b)$ is the natural logarithm of $b$.
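
The asymmetry can be seen on a single observation. In this sketch (function name and numbers are illustrative), the prediction misses the actual value by the same absolute amount in both directions, so the squared error is identical, while the squared log error is not.

```python
import math


def squared_log_error(actual, predicted):
    """One term of the RMSLE sum: (log(p + 1) - log(a + 1))^2."""
    return (math.log1p(predicted) - math.log1p(actual)) ** 2


actual = 100.0
under = squared_log_error(actual, 60.0)   # under-prediction by 40
over = squared_log_error(actual, 140.0)   # over-prediction by 40

# The plain squared errors are equal, the squared log errors are not.
print((60.0 - actual) ** 2 == (140.0 - actual) ** 2)  # True
print(under > over)                                   # True
```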

__Weighted Mean Absolute Error__

The weighted average of absolute errors. MAE and RMSE assume that each prediction provides equally precise information about the error variation, i.e. that the standard deviation of the error term is constant over all the predictions. WMAE relaxes this assumption by giving each prediction its own weight $w_i$. Example: recommender systems, where past and recent products may be weighted differently.

$$
\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i \lvert y_i - \hat{y}_i \rvert
$$
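
A short sketch of the weighted average in plain Python. The weights below are hypothetical "recency" weights chosen only to show the effect; with equal weights the function reduces to the ordinary MAE.

```python
def wmae(observed, predicted, weights):
    """Weighted mean absolute error: sum(w * |y - y_hat|) / sum(w)."""
    weighted_errors = sum(
        w * abs(y - p) for y, p, w in zip(observed, predicted, weights)
    )
    return weighted_errors / sum(weights)


observed = [10.0, 20.0, 30.0]
predicted = [12.0, 20.0, 24.0]
recency_weights = [1.0, 2.0, 5.0]  # hypothetical: the latest item counts most

print(wmae(observed, predicted, recency_weights))   # 4.0
print(wmae(observed, predicted, [1.0, 1.0, 1.0]))   # plain MAE, about 2.67
```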


Minimizing the squared error over a set of numbers yields its mean, whereas minimizing the absolute error yields its median. This is why MAE is robust to outliers whereas RMSE is not.
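
This can be verified numerically with a brute-force search over candidate constants. The data set (including the outlier) and the grid of candidates are arbitrary choices for this sketch.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100 is an outlier


def total_squared_error(c):
    return sum((v - c) ** 2 for v in values)


def total_absolute_error(c):
    return sum(abs(v - c) for v in values)


# Candidate constants 0.0, 0.1, ..., 100.0
candidates = [i / 10 for i in range(0, 1001)]

best_l2 = min(candidates, key=total_squared_error)
best_l1 = min(candidates, key=total_absolute_error)

print(best_l2, statistics.mean(values))    # 22.0 22.0
print(best_l1, statistics.median(values))  # 3.0 3.0
```

The outlier drags the squared-error minimizer up to 22, while the absolute-error minimizer stays at 3.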





__Q: Why are some scores like MSE negative in scikit-learn?__

Some model evaluation metrics, such as mean squared error (MSE), are reported as negative values by scikit-learn.

This is confusing, because error scores like MSE cannot actually be negative; the smallest possible value is zero, meaning no error.

The scikit-learn library has a unified model scoring system which assumes that all model scores are maximized. For this system to work with scores that are minimized, like MSE and other measures of error, those scores are inverted by making them negative.

This can also be seen in the name of the metric, e.g. the prefix `neg` in `neg_mean_squared_error`.

When interpreting these negated scores, flip the sign to recover the usual error value; a score closer to zero is better.
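
A small sketch of this behaviour using `cross_val_score` with the `neg_mean_squared_error` scorer. The synthetic data set and the choice of `LinearRegression` are assumptions made only for the demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, used only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scorer returns the negated MSE for each fold, so higher is better.
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to get the ordinary (non-negative) MSE per fold.
mse_per_fold = -scores

print(scores)
print(mse_per_fold.mean())
```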

__MAE vs. MSE__

- Although RMSE is more complex and biased toward large deviations, it is still the default metric for many models, because a loss function defined in terms of squared error is smoothly differentiable, whereas the absolute error is not differentiable at zero, which complicates gradient-based optimization.
- If we only want a metric to compare two models from an interpretability point of view, MAE may be a better choice.
- RMSE and MAE are both expressed in the same units as the y values, which is not true of R-squared (it is unitless).
- Minimizing the squared error (𝐿2) over a set of numbers results in finding its mean, and minimizing the absolute error (𝐿1) results in finding its median.


__Adjusted R² over RMSE__

The absolute value of RMSE does not by itself tell how good or bad a model is, because it depends on the scale of the target; it is mainly useful for comparing models on the same data. Adjusted R², by contrast, can be read on its own. For example, a model with an adjusted R² of 0.05 explains almost none of the variance and is clearly a poor fit.

However, if we care only about prediction accuracy, RMSE is a good choice. It is computationally simple, easily differentiable, and the default metric for many models.

## Multi-class classification

- Multi-class: classify a set of images of fruits into exactly one of these categories: apples, bananas, or oranges.
- Multi-label: tag a blog post with one or more topics, such as technology, religion, or politics.
- Multi-class classifiers can distinguish between more than two classes.

Random Forest classifiers and Naive Bayes classifiers can handle multiple classes directly. Others (such as Support Vector Machine classifiers or linear classifiers) are strictly binary classifiers.

Task: classification of the digits 0 to 9.

One vs All (OvA) Classification Strategy:
Train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). When we want to classify an image, we get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.

Example: almost all classification algorithms.

One vs One (OvO) Strategy:
Train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. If there are N classes, we need to train N × (N - 1) / 2 classifiers. For the MNIST problem, this means training 45 binary classifiers. When we want to classify an image, we have to run the image through all 45 classifiers and see which class wins the most duels. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

Example: Support Vector Machines scale poorly with the size of the training set, so for them it is faster to train many classifiers on small training sets than to train a few classifiers on large training sets.

Scikit-Learn detects when we try to use a binary classification algorithm for a multi-class classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO). For the MNIST problem, under the hood, Scikit-Learn trains 10 binary classifiers, gets their decision scores for the image, and selects the class with the highest score.

If we want to force Scikit-Learn to use OvO or OvA, we can use the `OneVsOneClassifier` or `OneVsRestClassifier` classes.
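
A minimal sketch of forcing each strategy and counting the underlying binary classifiers. It assumes scikit-learn's small built-in 8×8 digits data set (`load_digits`) as a stand-in for MNIST, and `LinearSVC` as the binary base estimator; both are choices made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# 10-class digits data (8x8 images), used here instead of full MNIST.
X, y = load_digits(return_X_y=True)

# Wrap the same binary base estimator in each multi-class strategy.
ovo = OneVsOneClassifier(LinearSVC(dual=False, random_state=0)).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC(dual=False, random_state=0)).fit(X, y)

n_classes = len(ovo.classes_)
print(n_classes)             # 10
print(len(ovo.estimators_))  # 45 = 10 * 9 / 2 pairwise classifiers
print(len(ovr.estimators_))  # 10, one per class
```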


# NLP Metric 

__BLEU (Bilingual Evaluation Understudy)__

It is mostly used to measure the quality of a machine translation relative to a human translation. It uses a modified form of the precision metric.

Example:

Reference: The cat is sitting on the mat

Machine Translation 1: On the mat is a cat

Machine Translation 2: There is cat sitting cat

Machine Translation 3: The cat is sitting on the tam
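
As a rough sketch of the "modified precision" idea, the snippet below computes only the clipped unigram precision for the three candidates above: each candidate word is counted at most as many times as it appears in the reference. This is not the full BLEU score (no higher-order n-grams, no brevity penalty), and the lowercasing and whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate against one reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clip each word's count by its count in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


reference = "The cat is sitting on the mat"
candidates = [
    "On the mat is a cat",
    "There is cat sitting cat",
    "The cat is sitting on the tam",
]

precisions = [modified_unigram_precision(c, reference) for c in candidates]
for sentence, precision in zip(candidates, precisions):
    print(f"{precision:.3f}  {sentence}")
```

Translation 1 scores 5/6 even though its word order is wrong, and Translation 2 scores 3/5 because the repeated "cat" is only credited once. Unigram precision alone is therefore not enough, which is why BLEU also uses longer n-grams.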