# Model Evaluation Methods

There are multiple stages in developing a machine learning model for use in a software application. It follows that there are multiple places where one needs to evaluate the model. Roughly speaking, the first phase involves prototyping, where we try out different models to find the best one (model selection). Once we are satisfied with a prototype model, we deploy it into production, where it will go through further testing on live data. Figure 1-1 illustrates this workflow.

Why is it so complicated? Two reasons. First of all, note that online and offline evaluations may measure very different metrics. Offline evaluation might use one of the metrics like accuracy or precision-recall, which we discuss in Chapter 2. Furthermore, training and validation might even use different metrics, but that’s an even finer point (see the note in Chapter 2). Online evaluation, on the other hand, might measure business metrics such as customer lifetime value, which may not be available on historical data but are closer to what your business really cares about (more about picking the right metric for online evaluation in Chapter 5).

Secondly, note that there are two sources of data: historical and live. Many statistical models assume that the distribution of data stays the same over time. (The technical term is that the distribution is stationary.) But in practice, the distribution of data changes over time, sometimes drastically. This is called distribution drift. As an example, think about building a recommender for news articles. The trending topics change every day, sometimes every hour; what was popular yesterday may no longer be relevant today. One can imagine the distribution of user preference for news articles changing rapidly over time. Hence it’s important to be able to detect distribution drift and adapt the model accordingly.

One way to detect distribution drift is to continue to track the model’s performance on the validation metric on live data. If the performance is comparable to the validation results when the model was built, then the model still fits the data. When performance starts to degrade, then it’s probable that the distribution of live data has drifted sufficiently from historical data, and it’s time to retrain the model. Monitoring for distribution drift is often done “offline” from the production environment. Hence we are grouping it into offline evaluation.

## >>>>>> Online Evaluation Mechanisms:

Once a satisfactory model is found during the prototyping phase, it can be deployed to production, where it will interact with real users and live data. The online phase has its own testing procedure. The most commonly used form of online testing is A/B testing, which is based on statistical hypothesis testing. The basic concepts may be well known, but there are many pitfalls and challenges in doing it correctly. Chapter 5 goes into a checklist of questions to ask when running an A/B test, so as to avoid some of the pernicious pitfalls. A less well-known form of online model selection is an algorithm called multiarmed bandits. We’ll take a look at what it is and why it might be a better alternative to A/B tests in some situations.


## >>>>>> Offline Evaluation Mechanisms:

As alluded to earlier, the main task during the prototyping phase is to select the right model to fit the data. The model must be evaluated on a dataset that’s statistically independent from the one it was trained on. Why? Because its performance on the training set is an overly optimistic estimate of its true performance on new data. The process of training the model has already adapted to the training data. A more fair evaluation would measure the model’s performance on data that it hasn’t yet seen. In statistical terms, this gives an estimate of the generalization error, which measures how well the model generalizes to new data. So where does one obtain new data? Most of the time, we have just the one dataset we started out with. The statistician’s solution to this problem is to chop it up or resample it and pretend that we have new data.

One way to generate new data is to hold out part of the training set and use it only for evaluation. This is known as hold-out validation. The more general method is known as k-fold cross-validation. There are other, lesser known variants, such as bootstrapping or jackknife resampling. These are all different ways of chopping up or resampling one dataset to simulate new data. Chapter 3 covers offline evaluation and model selection.

#### [1] Hold-out Set Validation


#### [2] K-fold Cross Validation


#### [3] Leave-one-out Cross Validation


#### [4] Bootstrapping



## >>>>>> Model Evaluation Metrics:

Chapter 2 focuses on evaluation metrics. Different machine learning tasks have different performance metrics. If I build a classifier to detect spam emails versus normal emails, then I can use classification performance metrics such as average accuracy, log-loss, and area under the curve (AUC). If I’m trying to predict a numeric score, such as Apple’s daily stock price, then I might consider the root-mean-square error (RMSE). If I am ranking items by relevance to a query submitted to a search engine, then there are ranking losses such as precision-recall (also popular as a classification metric) or normalized discounted cumulative gain (NDCG). These are examples of performance metrics for various tasks.

### -->> Regression Problems:

#### [1] RMSE


#### [2] Quantiles of Errors


#### [3] “Almost Correct” Predictions




### -->> Classification Problems:


#### [0] Accuracy


#### [0-1] Per-Class Accuracy


#### [1] Confusion Matrix


#### [2] Gain / Lift Chart


#### [3] Kolmogorov-Smirnov Chart


#### [4] Area Under the ROC curve (AUC – ROC)


#### [5] Kappa Statistics


#### [6] F1 Score (also F-score or F-measure)


#### [7] Log-Loss



### -->> Ranking Problems:


#### [1] Precision-Recall


#### [2] Precision-Recall Curve and the F1 Score


#### [3] NDCG




#### **** Caution: The Difference Between Training Metrics and Evaluation Metrics

Sometimes, the model training procedure may use a different metric (also known as a loss function) than the evaluation. This can happen when we are reappropriating a model for a different task than it was designed for. For instance, we might train a personalized recommender by minimizing the loss between its predictions and observed ratings, and then use this recommender to produce a ranked list of recommendations.

This is not an optimal scenario. It makes the life of the model difficult: it’s being asked to do a task that it was not trained to do! Avoid this when possible. It is always better to train the model to directly optimize for the metric it will be evaluated on. But for certain metrics, this may be very difficult or impossible. (For instance, it’s very hard to directly optimize the AUC.) Always think about what is the right evaluation metric, and see if the training procedure can optimize it directly.

#### **** Caution: Skewed Datasets—Imbalanced Classes, Outliers, and Rare Data

It’s easy to write down the formula of a metric. It’s not so easy to interpret the actual metric measured on real data. Book knowledge is no substitute for working experience. Both are necessary for successful applications of machine learning.

Always think about what the data looks like and how it affects the metric. In particular, always be on the lookout for data skew. By data skew, I mean the situations where one “kind” of data is much more rare than others, or when there are very large or very small outliers that could drastically change the metric.

Earlier, we mentioned how imbalanced classes could be a caveat in measuring per-class accuracy. This is one example of data skew: one of the classes is much more rare compared to the other class. It is problematic not just for per-class accuracy, but for all of the metrics that give equal weight to each data point. Suppose the positive class is only a tiny portion of the observed data, say 1%, a common situation for real-world datasets such as click-through rates for ads, user-item interaction data for recommenders, malware detection, etc. This means that a “dumb” baseline classifier that always classifies incoming data as negative would achieve 99% accuracy. A good classifier should have accuracy much higher than 99%. Similarly, if looking at the ROC curve, only the top left corner of the curve would be important, so the AUC would need to be very high in order to beat the baseline. See Figure 2-4 for an illustration of these gotchas.

Any metric that gives equal weight to each instance of a class has a hard time handling imbalanced classes, because by definition, the metric will be dominated by the class(es) with the most data. Furthermore, they are problematic not only for the evaluation stage, but even more so when training the model. If class imbalance is not properly dealt with, the resulting model may not know how to predict the rare classes at all.

Data skew can also create problems for personalized recommenders. Real-world user-item interaction data often contains many users who rate very few items, as well as items that are rated by very few users. Rare users and rare items are problematic for the recommender, both during training and evaluation. When not enough data is available in the training data, a recommender model would not be able to learn the user’s preferences, or the items that are similar to a rare item. Rare users and items in the evaluation data would lead to a very low estimate of the recommender’s performance, which compounds the problem of having a badly trained recommender.

Outliers are another kind of data skew. Large outliers can cause problems for a regressor. For instance, in the Million Song Dataset, a user’s score for a song is taken to be the number of times the user has listened to this song. The highest score is greater than 16,000! This means that any error made by the regressor on this data point would dwarf all other errors. The effect of large outliers during evaluation can be mitigated through robust metrics such as quantiles of errors. But this would not solve the problem for the training phase. Effective solutions for large outliers would probably involve careful data cleaning, and perhaps reformulating the task so that it’s not sensitive to large outliers.

## >>>>>> Hyperparameter Tuning:

You may have heard of terms like hyperparameter search, autotuning (which is just a shorter way of saying hyperparameter search), or grid search (a possible method for hyperparameter search). Where do those terms fit in? To understand hyperparameter search, we have to talk about the difference between a model parameter and a hyperparameter. In brief, model parameters are the knobs that the training algorithm knows how to tweak; they are learned from data. Hyperparameters, on the other hand, are not learned by the training method, but they also need to be tuned. 

To make this more concrete, say we are building a linear classifier to differentiate between spam and nonspam emails. This means that we are looking for a line in feature space that separates spam from nonspam. The training process determines where that line lies, but it won’t tell us how many features (or words) to use to represent the emails. The line is the model parameter, and the number of features is the hyperparameter. Hyperparameters can get complicated quickly. Much of the prototyping phase involves iterating between trying out different models, hyperparameters, and features. Searching for the optimal hyperparameter can be a laborious task. This is where search algorithms such as grid search, random search, or smart search come in. These are all search methods that look through hyperparameter space and find good configurations. Hyperparameter tuning is covered in detail in Chapter 4.










## >>>>>> The Pitfalls of A/B Testing:

A/B testing has emerged as the predominant method of online testing in the industry today. It is often used to answer questions like, “Is my new model better than the old one?” or “Which color is better for this button, yellow or blue?” In the A/B testing setup, there is a new model (or design) and an incumbent model (or design). There is some notion of live traffic, which is split into two groups: A and B, or control and experiment. Group A is routed to the old model, and group B is routed to the new model. Their performance is compared and a decision is made about whether the new model performs substantially better than the old model. That is the rough idea, and there is a whole statistical machinery that makes this statement much more precise.

This machinery is known as statistical hypothesis testing. It decides between a null hypothesis and an alternative hypothesis. Most of the time, A/B tests are formulated to answer the question, “Does this new model lead to a statistically significant change in the key metric?” The null hypothesis is often “the new model doesn’t change the average value of the key metric,” and the alternative hypothesis “the new model changes the average value of the key metric.” The test for the average value (the population mean, in statistical speak) is the most common, but there are tests for other population parameters as well.
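As a rough illustration of the test on the population mean, here is a minimal sketch of a two-sided Welch's t-test, one common choice for comparing two group means. The function name `welch_t_test` is hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that only holds reasonably well for large samples.

```python
import math
from statistics import mean, variance


def welch_t_test(control, experiment):
    """Return (t statistic, approximate two-sided p-value) for two samples."""
    n_a, n_b = len(control), len(experiment)
    # Standard error of the difference in means, allowing unequal variances.
    se = math.sqrt(variance(control) / n_a + variance(experiment) / n_b)
    t = (mean(experiment) - mean(control)) / se
    # Normal approximation to the t distribution (assumes large samples).
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return t, p_value
```

A small p-value is evidence against the null hypothesis that the new model leaves the average value of the key metric unchanged.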

##### Pitfalls of A/B Testing:

- Complete Separation of Experiences
- Which Metric?
- How Much Change Counts as Real Change?
- One-Sided or Two-Sided Test?
- How Many False Positives Are You Willing to Tolerate?
- How Many Observations Do You Need?
- Is the Distribution of the Metric Gaussian?
- Are the Variances Equal?
- What Does the p-Value Mean?
- Multiple Models, Multiple Hypotheses
- How Long to Run the Test?
- Catching Distribution Drift

##### Multi-Armed Bandits: An Alternative
With all of the potential pitfalls in A/B testing, one might ask whether there is a more robust alternative. The answer is yes, but not exactly for the same goals as A/B testing. If the ultimate goal is to decide which model or design is the best, then A/B testing is the right framework, along with its many gotchas to watch out for.

However, if the ultimate goal is to maximize total reward, then multiarmed bandits and personalization are the way to go. The name “multiarmed bandits” (MAB) comes from gambling. A slot machine is a one-armed bandit; each time you pull the lever, it outputs a certain reward (most likely negative). Multiarmed bandits are like a room full of slot machines, each one with an unknown random payoff distribution. The task is to figure out which arm to pull and when, in order to maximize the reward. There are many MAB algorithms: linear UCB, Thompson sampling (or Bayesian bandits), and Exp3 are some of the most well known. John Myles White wrote a wonderful book that explains these algorithms. Steven Scott wrote a great survey paper on Bayesian bandit algorithms. Sergey Feldman has a few blog posts on this topic as well.

If you have multiple competing models and you care about maximizing overall user satisfaction, then you might try running an MAB algorithm on top of the models that decides when to serve results from which model. Each incoming request is an arm pull; the MAB algorithm selects the model, forwards the query to it, gives the answer to the user, observes the user’s behavior (the reward for the model), and adjusts the estimate for the payoff distribution. As folks from zulily and RichRelevance can attest, MABs can be very effective at increasing overall reward.

On top of plain multiarmed bandits, personalizing the reward to individual users or user groups may provide additional gains. Different users often have different rewards for each model. Shoppers in Atlanta, GA, may behave very differently from shoppers in Sydney, Australia. Men may buy different things than women. With enough data, it may be possible to train a separate MAB for each user group or even each user. It is also possible to use contextual bandits for personalization, where one can fold in information about the user’s context into the models for the reward distribution of each model.
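The arm-pull loop described above can be sketched with Thompson sampling on binary rewards. This is a toy simulation under stated assumptions: the function name `thompson_sampling` and the `true_rates` list (the hidden success probability of each model) are hypothetical, and each arm's payoff is modeled with a Beta posterior starting from a uniform prior.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate Thompson sampling; return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible payoff rate for each arm from its Beta posterior.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_arms)]
        arm = samples.index(max(samples))
        # Observe a simulated binary reward and update the chosen arm's posterior.
        reward = 1 if rng.random() < true_rates[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

In this simulation most traffic ends up on the arm with the highest payoff rate, which is the sense in which a bandit maximizes total reward rather than producing a clean comparison between models.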

## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

## <> Offline Evaluation Mechanisms:

#### [1] Hold-Out Validation

Hold-out validation is simple. Assuming that all data points are i.i.d. (independently and identically distributed), we simply randomly hold out part of the data for validation. We train the model on the larger portion of the data and evaluate validation metrics on the smaller hold-out set.

Computationally speaking, hold-out validation is simple to program and fast to run. The downside is that it is less powerful statistically. The validation results are derived from a small subset of the data, so the resulting estimate of the generalization error is less reliable. It is also difficult to compute any variance information or confidence intervals on a single dataset.

Use hold-out validation when there is enough data such that a subset can be held out, and this subset is big enough to ensure reliable statistical estimates.
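A random hold-out split can be sketched in a few lines. The function name `holdout_split` and the default 20% hold-out fraction are illustrative assumptions, not values prescribed by the text.

```python
import random


def holdout_split(data, holdout_fraction=0.2, seed=0):
    """Randomly split data into (train, holdout) lists."""
    indices = list(range(len(data)))
    # Shuffle so the hold-out set is a random sample (relies on the i.i.d. assumption).
    random.Random(seed).shuffle(indices)
    n_holdout = int(len(data) * holdout_fraction)
    holdout = [data[i] for i in indices[:n_holdout]]
    train = [data[i] for i in indices[n_holdout:]]
    return train, holdout
```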

#### [2] Cross-Validation

Cross-validation is another validation technique. It is not the only validation technique, and it is not the same as hyperparameter tuning. So be careful not to get the three (the concept of model validation, cross-validation, and hyperparameter tuning) confused with each other. Cross-validation is simply a way of generating training and validation sets for the process of hyperparameter tuning. Hold-out validation, another validation technique, is also valid for hyperparameter tuning, and is in fact computationally much cheaper.

There are many variants of cross-validation. The most commonly used is k-fold cross-validation. In this procedure, we first divide the training dataset into k folds (see Figure 3-2). For a given hyperparameter setting, each of the k folds takes turns being the hold-out validation set; a model is trained on the rest of the k – 1 folds and measured on the held-out fold. The overall performance is taken to be the average of the performance on all k folds. Repeat this procedure for all of the hyperparameter settings that need to be evaluated, then pick the hyperparameters that resulted in the highest k-fold average.

Another variant of cross-validation is leave-one-out cross-validation. This is essentially the same as k-fold cross-validation, where k is equal to the total number of data points in the dataset.

Cross-validation is useful when the training dataset is so small that one can’t afford to hold out part of the data just for validation purposes.
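The k-fold procedure can be sketched as an index generator plus an averaging loop. The names `k_fold_indices`, `cross_validate`, and the `train_and_score` callback (which would train a model on the training indices and return its score on the validation indices) are hypothetical. The sketch assumes the data has already been shuffled, since folds are taken as contiguous blocks.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / k
```

Calling `k_fold_indices(n, n)` gives leave-one-out cross-validation, since every fold then holds exactly one data point.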

#### [3] Bootstrap and Jackknife

Bootstrap is a resampling technique. It generates multiple datasets by sampling from a single, original dataset. Each of the “new” datasets can be used to estimate a quantity of interest. Since there are multiple datasets and therefore multiple estimates, one can also calculate things like the variance or a confidence interval for the estimate.

Bootstrap is closely related to cross-validation. It was inspired by another resampling technique called the jackknife, which is essentially leave-one-out cross-validation. One can think of the act of dividing the data into k folds as a (very rigid) way of resampling the data without replacement; i.e., once a data point is selected for one fold, it cannot be selected again for another fold.

Bootstrap, on the other hand, resamples the data with replacement. Given a dataset containing N data points, bootstrap picks a data point uniformly at random, adds it to the bootstrapped set, puts that data point back into the dataset, and repeats.

Why put the data point back? A real sample would be drawn from the real distribution of the data. But we don’t have the real distribution of the data. All we have is one dataset that is supposed to represent the underlying distribution. This gives us an empirical distribution of data. Bootstrap simulates new samples by drawing from the empirical distribution. The data point must be put back, because otherwise the empirical distribution would change after each draw.

Obviously, the bootstrapped set may contain the same data point multiple times. (See Figure 3-2 for an illustration.) If the random draw is repeated N times, then the expected ratio of unique instances in the bootstrapped set is approximately 1 – 1/e ≈ 63.2%. In other words, roughly two-thirds of the original dataset is expected to end up in the bootstrapped dataset, with some amount of replication.

One way to use the bootstrapped dataset for validation is to train the model on the unique instances of the bootstrapped dataset and validate results on the rest of the unselected data. The effects are very similar to what one would get from cross-validation.
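A minimal sketch of one bootstrap draw, returning the unique selected indices for training and the unselected indices for validation. The function name `bootstrap_split` is hypothetical.

```python
import random


def bootstrap_split(n_samples, seed=0):
    """Draw n_samples indices with replacement; return (drawn, unique, unselected)."""
    rng = random.Random(seed)
    # Sampling with replacement: the same index may be drawn several times.
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    selected = set(drawn)
    unselected = [i for i in range(n_samples) if i not in selected]
    return drawn, sorted(selected), unselected
```

For a large dataset, `len(unique) / n_samples` should land close to the 1 – 1/e ratio discussed above, and repeating the draw with different seeds gives multiple estimates from which a variance can be computed.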

#### [4] Leave-one-out Cross Validation

xxxxxxxxxxxx

## <> Online Evaluation Mechanisms:

xxxxxxx

## <> Hyperparameter Tuning:

In the realm of machine learning, hyperparameter tuning is a “meta” learning task. It happens to be one of my favorite subjects because it can appear like black magic, yet its secrets are not impenetrable. In this chapter, we’ll talk about hyperparameter tuning in detail: why it’s hard, and what kind of smart tuning methods are being developed to do something about it.

## <> The Pitfalls of A/B Testing:

## <> Model Evaluation Metrics:

#### >>>> Regression Problem:

#### [1] Root Mean Squared Error (RMSE)

RMSE is the most popular evaluation metric used in regression problems. It assumes that errors are unbiased and follow a normal distribution. Here are the key points to consider about RMSE:

- Taking the square root brings the metric back to the same units as the target, so large deviations show up on an interpretable scale.
- Squaring the errors prevents positive and negative errors from cancelling each other out, so the metric reflects the magnitude of the error term.
- It avoids absolute error values, which are awkward in mathematical calculations (the absolute value is not differentiable at zero).
- When we have more samples, reconstructing the error distribution using RMSE is considered to be more reliable.
- RMSE is highly affected by outlier values. Hence, make sure you’ve removed outliers from your data set prior to using this metric.
- Compared to mean absolute error, RMSE gives more weight to large errors and penalizes them more heavily.

The RMSE metric is given by:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$

where $\hat{y}_i$ is the prediction, $y_i$ is the original value, and $N$ is the number of data points.

In [None]:
# -------------- R

In [None]:
# -------------- Python
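A minimal pure-Python sketch of the RMSE calculation; the function name `rmse` and its argument names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```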

#### [2] Quantiles of Errors

RMSE may be the most common metric, but it has some problems. Most crucially, because it is an average, it is sensitive to large outliers. If the regressor performs really badly on a single data point, the average error could be very big. In statistical terms, we say that the mean is not robust (to large outliers).

Quantiles (or percentiles), on the other hand, are much more robust. To see why this is, let’s take a look at the median (the 50th percentile), which is the element of a set that is larger than half of the set, and smaller than the other half. If the largest element of a set changes from 1 to 100, the mean should shift, but the median would not be affected at all.

One thing that is certain with real data is that there will always be “outliers.” The model will probably not perform very well on them. So it’s important to look at robust estimators of performance that aren’t affected by large outliers. It is useful to look at the median absolute percentage error:

$$\mathrm{MAPE} = \operatorname{median}\left(\left|\frac{y_i - \hat{y}_i}{y_i}\right|\right)$$

It gives us a relative measure of the typical error. Alternatively, we could compute the 90th percentile of the absolute percent error, which would give an indication of an “almost worst case” behavior.

In [None]:
# ----------- R


In [None]:
# ----------- Python
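A minimal sketch of the median and 90th-percentile absolute percent error using the standard library. The function names are hypothetical, the percentile uses the "inclusive" interpolation method of `statistics.quantiles` (one of several reasonable conventions), and true values are assumed to be nonzero.

```python
from statistics import median, quantiles


def absolute_percent_errors(y_true, y_pred):
    """Absolute percent error per data point (assumes no true value is zero)."""
    return [abs((t - p) / t) for t, p in zip(y_true, y_pred)]


def median_ape(y_true, y_pred):
    """Median absolute percent error: a robust measure of the typical error."""
    return median(absolute_percent_errors(y_true, y_pred))


def percentile_ape(y_true, y_pred, pct=90):
    """The pct-th percentile of the absolute percent error."""
    cuts = quantiles(absolute_percent_errors(y_true, y_pred), n=100, method="inclusive")
    return cuts[pct - 1]
```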


#### [3] “Almost Correct” Predictions

Perhaps the easiest metric to interpret is the percent of estimates that differ from the true value by no more than X%. The choice of X depends on the nature of the problem. For example, the percent of estimates within 10% of the true values would be computed by percent of |(yi – ŷi)/yi| < 0.1. This gives us a notion of the precision of the regression estimate.

In [None]:
# ------------ R


In [None]:
# ------------- Python
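A minimal sketch of the "within X%" metric; the function name `fraction_within` and the default tolerance of 0.1 (10%) are illustrative, and true values are assumed to be nonzero.

```python
def fraction_within(y_true, y_pred, tolerance=0.1):
    """Fraction of predictions whose relative error is below the tolerance."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs((t - p) / t) < tolerance)
    return hits / len(y_true)
```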


#### >>>> Classification Problem:

#### [0] Accuracy

Accuracy simply measures how often the classifier makes the correct prediction. It’s the ratio between the number of correct predictions and the total number of predictions (the number of data points in the test set):

accuracy = # correct predictions / # total data points

In [None]:
# ------------ R


In [None]:
# ------------ Python
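A minimal sketch of the accuracy formula above; the function name `accuracy` is illustrative.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of data points."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```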


#### [0-1] Per-Class Accuracy

A variation of accuracy is the average per-class accuracy, that is, the average of the accuracy for each class. Accuracy is an example of what’s known as a micro-average, and average per-class accuracy is a macro-average. In the above example, the average per-class accuracy would be (80% + 97.5%)/2 = 88.75%. Note that in this case, the average per-class accuracy is quite different from the accuracy.

In general, when there are different numbers of examples per class, the average per-class accuracy will be different from the accuracy. (Exercise for the curious reader: Try proving this mathematically!) Why is this important? When the classes are imbalanced, i.e., there are a lot more examples of one class than the other, then the accuracy will give a very distorted picture, because the class with more examples will dominate the statistic. In that case, you should look at the per-class accuracy, both the average and the individual per-class accuracy numbers.

Per-class accuracy is not without its own caveats. For instance, if there are very few examples of one class, then test statistics for that class will have a large variance, which means that its accuracy estimate is not as reliable as other classes. Taking the average of all the classes obscures the confidence measurement of individual classes.

per-class accuracy = # correct predictions (per class) / # total data points (per class)

In [None]:
# ------------- R


In [None]:
# ------------- Python
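A minimal sketch of per-class accuracy and its macro-average. The function names are hypothetical, and the counts in the usage note are made up purely so that the per-class figures come out to the 80% and 97.5% quoted above.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true class label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {label: correct[label] / total[label] for label in total}


def average_per_class_accuracy(y_true, y_pred):
    """Macro-average: the unweighted mean of the per-class accuracies."""
    scores = per_class_accuracy(y_true, y_pred)
    return sum(scores.values()) / len(scores)
```

With 100 hypothetical positives (80 predicted correctly) and 200 hypothetical negatives (195 predicted correctly), the per-class accuracies are 0.8 and 0.975, and their macro-average is 0.8875, while the overall accuracy is 275/300.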


#### [1] Confusion Matrix

A confusion matrix is an N × N matrix, where N is the number of classes being predicted. For the binary problem at hand, N = 2, and hence we get a 2 × 2 matrix. In general, we are concerned with one of the metrics defined below. For instance, a pharmaceutical company will be more concerned with minimizing false positive diagnoses, and hence with high specificity. On the other hand, an attrition model will be more concerned with sensitivity. Confusion matrices are generally used only with models that output class labels.

- [Accuracy] : the proportion of the total number of predictions that were correct.
- [Positive Predictive Value or Precision] : the proportion of predicted positive cases that were actually positive.
- [Negative Predictive Value] : the proportion of predicted negative cases that were actually negative.
- [Sensitivity or Recall] : the proportion of actual positive cases that were correctly identified.
- [Specificity] : the proportion of actual negative cases that were correctly identified.


In [None]:
# --------------- R


In [None]:
# --------------- Python
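A minimal sketch of the binary confusion-matrix counts and the metrics derived from them. The function names and the `positive` label argument are illustrative, and the sketch assumes every denominator is nonzero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def confusion_metrics(tp, fp, fn, tn):
    """Metrics derived from the four cells of a 2 x 2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),                  # positive predictive value
        "negative_predictive_value": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),                # recall
        "specificity": tn / (tn + fp),
    }
```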


#### [2] Gain / Lift Chart

http://www.listendata.com/2014/08/excel-template-gain-and-lift-charts.html

http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html

Gain or lift is a measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model. Gain and lift charts are visual aids for evaluating the performance of classification models. However, in contrast to the confusion matrix, which evaluates models on the whole population, a gain or lift chart evaluates model performance in a portion of the population. Expect most of the targets to be captured in the first few bins.

- Use the model to calculate a probability for each observation.
- Sort the observations by probability in descending order.
- Bin all observations into 10 bins of 10% each.
- Baseline (random model): each 10% of the observations captures 10% of the targets.
- Model line: for each 10% of the observations, calculate the percentage of targets captured in that bin.

Gain chart: plot the baseline and the model line against the % bin.

Lift chart: plot the base ratio (= 1) and the model ratio (= model line / baseline) against the % bin.

In [None]:
# --------------- R

In [None]:
# --------------- Python
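A minimal sketch of the numbers behind a gain and lift chart. It assumes the cumulative form of the chart (the share of all targets captured in the top b bins), binary 0/1 labels, and at least one target in the data; the function name `gain_and_lift` is hypothetical.

```python
def gain_and_lift(y_true, scores, n_bins=10):
    """Return (cumulative gains, lifts), one value per bin, best-scored bin first."""
    # Sort labels by predicted probability, highest first.
    ranked = [t for _, t in sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)]
    total_targets = sum(ranked)
    n = len(ranked)
    gains, lifts = [], []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)
        gain = sum(ranked[:cutoff]) / total_targets  # share of targets captured so far
        baseline = b / n_bins                        # share a random model would capture
        gains.append(gain)
        lifts.append(gain / baseline)
    return gains, lifts
```

Plotting `gains` against the bin percentage gives the model line of the gain chart, and `lifts` gives the model ratio of the lift chart.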

#### [3] Kolmogorov-Smirnov Chart

http://www.saedsayad.com/model_evaluation_c.htm

The K-S (Kolmogorov-Smirnov) chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions. The K-S is 100 if the scores partition the population into two separate groups, in which one group contains all the positives and the other all the negatives.

On the other hand, if the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population, and the K-S would be 0. In most classification models the K-S will fall between 0 and 100, and the higher the value, the better the model is at separating the positive from the negative cases.

- Calculate a probability for each observation.
- Sort the observations by probability and bin them by probability: 100% to 90%, 89% to 80%, and so on.
- Calculate the cumulative proportion of target and non-target cases in each bin.

K-S chart: plot the cumulative proportions of target and non-target cases against the probability bins.

K-S statistic: the largest difference between the cumulative proportions of target and non-target cases.


In [None]:
# --------------- R

In [None]:
# --------------- Python
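A minimal sketch of the K-S statistic on the 0 to 100 scale used above. As a simplifying assumption it walks through the sorted observations one at a time instead of using probability bins, and it expects binary 0/1 labels with distinct scores; the function name `ks_statistic` is hypothetical.

```python
def ks_statistic(y_true, scores):
    """Largest gap between cumulative target and non-target proportions, times 100."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    for _, label in ranked:
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        # Gap between the two cumulative curves at this point in the ranking.
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return 100 * best
```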

#### [4] Area Under the ROC curve (AUC – ROC)

https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/

ROC (receiver operating characteristic) curve

This is again one of the popular metrics used in the industry. The biggest advantage of using the ROC curve is that it is independent of changes in the proportion of responders. This statement will get clearer in the following sections. The ROC curve is the plot of sensitivity against (1 - specificity). (1 - specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate.

Note that the area of the entire square is 1*1 = 1. Hence the AUC is the ratio of the area under the curve to the total area.

AUC Standard:
- .90-1 = excellent (A)
- .80-.90 = good (B)
- .70-.80 = fair (C)
- .60-.70 = poor (D)
- .50-.60 = fail (F)

Points to remember:

1. A model that outputs a class label is represented as a single point on the ROC plot.

2. Such models cannot be compared with each other, because the judgement needs to be made on a single metric and not on multiple metrics. For instance, a model with parameters (0.2, 0.8) and a model with parameters (0.8, 0.2) can come from the same underlying model, hence these metrics should not be directly compared.

3. In the case of a probabilistic model, we are fortunate enough to get a single number, the AUC-ROC. But we still need to look at the entire curve to make conclusive decisions. It is also possible that one model performs better in some regions and another performs better in others.


Why should you use ROC and not metrics like the lift curve?

- Lift depends on the total response rate of the population. Hence, if the response rate of the population changes, the same model will give a different lift chart. One solution is a true lift chart (the ratio of lift to perfect-model lift at each decile), but such a ratio rarely makes sense for the business.

- The ROC curve, on the other hand, is almost independent of the response rate. This is because its two axes come from column-wise calculations on the confusion matrix, so the numerator and denominator of both the x and y axes change on a similar scale when the response rate shifts.

In [None]:
# --------------- R

In [None]:
# --------------- Python

#### [5] Kappa Statistics

- Usually used for imbalanced data

Calculation:

| | T | F |
|---|---|---|
| **T** | a (20) | b (5) |
| **F** | c (10) | d (15) |

- The probability of random agreement on T

$$P(T) = \frac{a+b}{a+b+c+d} \cdot \frac{a+c}{a+b+c+d} = 0.3$$

- The probability of random agreement on F

$$P(F) = \frac{c+d}{a+b+c+d} \cdot \frac{b+d}{a+b+c+d} = 0.2$$

- The overall probability of random agreement

$$P(ALL) = P(T) + P(F) = 0.5$$

- The observed proportionate agreement

$$P(o) = \frac{a+d}{a+b+c+d} = 0.7$$

- Kappa statistic

$$K = \frac{P(o) - P(ALL)}{1 - P(ALL)} = \frac{0.7 - 0.5}{1 - 0.5} = 0.4$$
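The calculation steps above can be sketched in a few lines of Python. This is a minimal illustration for a 2x2 table only; the function name `cohen_kappa` and its argument order are assumptions made for this example.

```python
def cohen_kappa(a, b, c, d):
    """Kappa for the 2x2 table [[a, b], [c, d]] used in the calculation above."""
    n = a + b + c + d
    p_observed = (a + d) / n                  # observed proportionate agreement
    p_true = ((a + b) / n) * ((a + c) / n)    # chance agreement on T
    p_false = ((c + d) / n) * ((b + d) / n)   # chance agreement on F
    p_chance = p_true + p_false
    # Note: undefined (division by zero) when chance agreement is exactly 1.
    return (p_observed - p_chance) / (1 - p_chance)


kappa = cohen_kappa(20, 5, 10, 15)
print(round(kappa, 3))  # 0.4
```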


##### Measurement

- Perfect agreement gives κ = 1.
- κ = 0 does not mean perfect disagreement; it only means that agreement is no better than chance, since it indicates that the diagonal cell probabilities are simply the products of the corresponding marginals.
- If agreement is greater than agreement by chance, then κ > 0.
- If agreement is less than agreement by chance, then κ < 0.
- The minimum possible value is κ = −1.
- As a common rule of thumb, a kappa higher than 0.75 indicates excellent agreement, while a kappa lower than 0.4 indicates poor agreement.




##### Pros

- Kappa statistics are easily calculated and software is readily available (e.g., SAS PROC FREQ).
- Kappa statistics are appropriate for testing whether agreement exceeds chance levels for binary and nominal ratings.

##### Cons

- Kappa is not really a chance-corrected measure of agreement.
- Kappa is an omnibus index of agreement. It does not make distinctions among various types and sources of disagreement.
- Kappa is influenced by trait prevalence (distribution) and base-rates. As a result, kappas are seldom comparable across studies, procedures, or populations (Thompson & Walter, 1988; Feinstein & Cicchetti, 1990).
- Kappa may be low even though there are high levels of agreement and even though individual ratings are accurate. Whether a given kappa value implies a good or a bad rating system or diagnostic method depends on what model one assumes about the decision-making of raters (Uebersax, 1988).
- With ordered category data, one must select weights arbitrarily to calculate weighted kappa (Maclure & Willett, 1987).
- Kappa requires that the two raters or procedures use the same rating categories. There are situations where one is interested in measuring the consistency of ratings for raters that use different categories (e.g., one uses a scale of 1 to 3, another uses a scale of 1 to 5).
- Tables that purport to categorize ranges of kappa as "good," "fair," "poor" etc. are inappropriate; do not use them.

In [None]:
# --------------- R

In [None]:
# --------------- Python

#### [6] F1 Score (also F-score or F-measure)

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test: p is the number of correct positive results divided by the number of all positive results returned, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.

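Calculation (rows are the predicted class, columns are the actual class):

| | T | F |
|---|---|---|
| **T** | a (20) | b (5) |
| **F** | c (10) | d (15) |

$$\text{precision} = \frac{a}{a+b} = \frac{20}{25} = 0.8$$

$$\text{recall} = \frac{a}{a+c} = \frac{20}{30} \approx 0.667$$

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \approx 0.727, \qquad 0 \le F_1 \le 1$$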
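As a quick check of the arithmetic, the following sketch computes precision, recall, and F1 from the cell counts, using the harmonic-mean form 2 · precision · recall / (precision + recall). The function name `precision_recall_f1` is a hypothetical helper for this example.

```python
def precision_recall_f1(a, b, c):
    """a = true positives, b = false positives, c = false negatives.

    The fourth cell d (true negatives) does not appear in any of the three formulas.
    """
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(a=20, b=5, c=10)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```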

Measurement:

- More than 0.7 (Good)
- Less than 0.3 (Bad)

Keep in mind:

- The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance.[3] Earlier works focused primarily on the F1 score, but with the proliferation of large scale search engines, performance goals changed to place more emphasis on either precision or recall[4] and so $F_\beta$ is seen in wide application.

- The F-score is also used in machine learning.[5] Note, however, that the F-measures do not take the true negatives into account, and that measures such as the Matthews correlation coefficient, Informedness or Cohen's kappa may be preferable to assess the performance of a binary classifier.[2]

- The F-score has been widely used in the natural language processing literature, such as the evaluation of named entity recognition and word segmentation.
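For readers who want to see how $F_\beta$ shifts the emphasis between precision and recall, here is a minimal sketch using the standard definition $F_\beta = (1+\beta^2)\,\dfrac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$. The function name `f_beta` and the choice to return 0 when both inputs are 0 are assumptions of this example.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 puts more weight on recall, beta < 1 on precision, beta = 1 gives F1.
    """
    if precision == 0 and recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall from the 2x2 example above (a=20, b=5, c=10).
precision, recall = 20 / 25, 20 / 30

f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2.0)       # recall-heavy
f_half = f_beta(precision, recall, beta=0.5)   # precision-heavy
print(round(f_half, 3), round(f1, 3), round(f2, 3))
```

Since recall is lower than precision in this example, the recall-heavy score comes out lowest and the precision-heavy score highest.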



In [None]:
# --------------- R

In [None]:
# --------------- Python

#### [7] Log-Loss

Log-loss, or logarithmic loss, gets into the finer details of a classifier. In particular, if the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log-loss can be used. The probability can be understood as a gauge of confidence. If the true label is 0 but the classifier thinks it belongs to class 1 with probability 0.51, then even though the classifier would be making a mistake, it’s a near miss because the probability is very close to the decision boundary of 0.5. Log-loss is a “soft” measurement of accuracy that incorporates this idea of probabilistic confidence.

$$\text{log-loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$

Formulas like this are incomprehensible without years of grueling, inhuman training. Let’s unpack it. $p_i$ is the probability that the $i$th data point belongs to class 1, as judged by the classifier. $y_i$ is the true label and is either 0 or 1. Since $y_i$ is either 0 or 1, the formula essentially “selects” either the left or the right summand. The minimum is 0, which happens when the prediction and the true label match up. (We follow the convention that defines 0 log 0 = 0.)

The beautiful thing about this definition is that it is intimately tied to information theory: log-loss is the cross entropy between the distribution of the true labels and the predictions, and it is very closely related to what’s known as the relative entropy, or Kullback–Leibler divergence. Entropy measures the unpredictability of something. Cross entropy incorporates the entropy of the true distribution, plus the extra unpredictability when one assumes a different distribution than the true distribution. So log-loss is an information-theoretic measure to gauge the “extra noise” that comes from using a predictor as opposed to the true labels. By minimizing the cross entropy, we maximize the accuracy of the classifier.
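The log-loss formula can be sketched directly in Python. This is a minimal illustration, not a reference implementation: the function name `log_loss` is chosen for this example, and clipping probabilities to `[eps, 1 - eps]` is an assumption made here to avoid `log(0)`, used in place of the 0 log 0 = 0 convention mentioned above.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # assumption: clip so log(0) never occurs
        # Because y is 0 or 1, exactly one of the two summands is non-zero.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# True label is 0 in all three cases; only the predicted probability of class 1 differs.
confident_hit = log_loss([0], [0.01])
near_miss = log_loss([0], [0.51])
confident_miss = log_loss([0], [0.99])
print(round(confident_hit, 3), round(near_miss, 3), round(confident_miss, 3))
```

The near miss at 0.51 is penalized only slightly more than a coin flip would be, while the confident mistake at 0.99 receives a much larger penalty.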

In [None]:
# ------------- R


In [None]:
# ------------- Python


#### >>>> Ranking Problem:

#### [1] Precision-Recall

Precision and recall are actually two metrics. But they are often used together. Precision answers the question, “Out of the items that the ranker/classifier predicted to be relevant, how many are truly relevant?” Whereas, recall answers the question, “Out of all the items that are truly relevant, how many are found by the ranker/classifier?” Figure 2-3 contains a simple Venn diagram that illustrates precision versus recall.

Mathematically, precision and recall can be defined as the following:

$$\text{precision} = \frac{\#\ \text{happy correct answers}}{\#\ \text{total items returned by ranker}}$$

$$\text{recall} = \frac{\#\ \text{happy correct answers}}{\#\ \text{total relevant items}}$$

Frequently, one might look at only the top k items from the ranker, k = 5, 10, 20, 100, etc. Then the metrics would be called “precision@k” and “recall@k.”

When dealing with a recommender, there are multiple “queries” of interest; each user is a query into the pool of items. In this case, we can average the precision and recall scores for each query and look at “average precision@k” and “average recall@k.” (This is analogous to the relationship between accuracy and average per-class accuracy for classification.)
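A minimal sketch of precision@k, recall@k, and their averages over queries follows. The function names, the item identifiers, and the two hypothetical users are assumptions made for illustration. Dividing by k even when fewer than k items are returned is one possible convention.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


# Each hypothetical user is one query: (ranked recommendations, truly relevant items).
queries = {
    "user_1": (["a", "b", "c", "d", "e"], {"a", "c", "f"}),
    "user_2": (["x", "y", "z", "w"], {"y"}),
}

k = 3
avg_precision_at_k = sum(
    precision_at_k(ranked, relevant, k) for ranked, relevant in queries.values()
) / len(queries)
avg_recall_at_k = sum(
    recall_at_k(ranked, relevant, k) for ranked, relevant in queries.values()
) / len(queries)

print(round(avg_precision_at_k, 3), round(avg_recall_at_k, 3))
```

Here "average" means the mean over queries, as described above; it is a different quantity from the "average precision" (AP) summary that information retrieval uses for a single ranked list.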

In [None]:
# --------------- R


In [None]:
# --------------- Python


#### [2] Precision-Recall Curve and the F1 Score

When we change k, the number of answers returned by the ranker, the precision and recall scores also change. By plotting precision versus recall over a range of k values, we get the precision-recall curve. This is closely related to the ROC curve. (Exercise for the curious reader: What’s the relationship between precision and the false-positive rate? What about recall?)

Just like it’s difficult to compare ROC curves to each other, the same goes for the precision-recall curve. One way of summarizing the precision-recall curve is to fix k and combine precision and recall. One way of combining these two numbers is via their harmonic mean:

$$F_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Unlike the arithmetic mean, the harmonic mean tends toward the smaller of the two elements. Hence the F1 score will be small if either precision or recall is small.
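The sketch below traces a precision-recall curve by varying k over a single ranked list and reports the F1 score at each cutoff. The function name `pr_curve` and the toy ranking are hypothetical, and setting F1 to 0 when there are no hits is a convention chosen for this example.

```python
def pr_curve(ranked, relevant):
    """Return (k, precision, recall, F1) for every cutoff k = 1..len(ranked)."""
    rows = []
    hits = 0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        precision = hits / k
        recall = hits / len(relevant)
        # Harmonic mean of precision and recall; defined as 0 when there are no hits.
        f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
        rows.append((k, precision, recall, f1))
    return rows


ranked = ["a", "b", "c", "d", "e"]   # ranker output, best first
relevant = {"a", "c", "f"}           # "f" is relevant but never returned

curve = pr_curve(ranked, relevant)
for k, p, r, f1 in curve:
    print(k, round(p, 3), round(r, 3), round(f1, 3))

best_k = max(curve, key=lambda row: row[3])[0]
print("best k by F1:", best_k)
```

At k = 1 precision is perfect but recall is only 1/3, and the F1 score of 0.5 sits well below their arithmetic mean of about 0.67, which illustrates how the harmonic mean is pulled toward the smaller value.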

In [None]:
# ------------- R


In [None]:
# -------------- Python


#### [3] NDCG

Precision and recall treat all retrieved items equally; a relevant item in position k counts just as much as a relevant item in position 1. But this is not usually how people think. When we look at the results from a search engine, the top few answers matter much more than answers that are lower down on the list.

NDCG tries to take this behavior into account. NDCG stands for normalized discounted cumulative gain. There are three closely related metrics here: cumulative gain (CG), discounted cumulative gain (DCG), and finally, normalized discounted cumulative gain. Cumulative gain sums up the relevance of the top k items. Discounted cumulative gain discounts items that are further down the list. Normalized discounted cumulative gain, true to its name, is a normalized version of discounted cumulative gain. It divides the DCG by the perfect DCG score, so that the normalized score always lies between 0.0 and 1.0. See the Wikipedia article for detailed mathematical formulas.

DCG and NDCG are important metrics in information retrieval and in any application where the positioning of the returned items is important.
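The three quantities can be sketched as follows. This is one common formulation, assumed here for illustration: the gain at position i is discounted by log2(i + 1), and the "perfect" DCG is computed by re-sorting the same returned items by relevance. Other variants exist (for example, exponential gains, or an ideal ranking drawn from all relevant items rather than only the returned ones). The function names and relevance grades are hypothetical.

```python
import math


def cumulative_gain(relevances, k=None):
    """Sum of the relevance grades of the top-k items (position is ignored)."""
    rels = relevances if k is None else relevances[:k]
    return sum(rels)


def dcg(relevances, k=None):
    """Discounted cumulative gain: the item at position i is divided by log2(i + 1)."""
    rels = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(rels, start=1))


def ndcg(relevances, k=None):
    """DCG divided by the DCG of the best possible ordering of the same items."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of six returned items, in ranked order.
relevances = [3, 2, 3, 0, 1, 2]

print(cumulative_gain(relevances))
print(round(dcg(relevances), 3))
print(round(ndcg(relevances), 3))
```

Swapping two items changes DCG and NDCG but leaves the cumulative gain untouched, which is the position sensitivity that precision and recall lack.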

In [None]:
# ---------- R


In [1]:
# ---------- Python
