# Evaluating Predictive Performance for Classifiers

This notebook discusses methods for evaluating the predictive performance of classification models. The discussion here accompanies pages 131-154 of our textbook. As in the case of regression, it is important to consider multiple metrics because each gives insight into a different facet of our expected prediction errors. There are many more metrics to consider when evaluating a classifier. We start with some simple metrics and then move toward more complex metrics applicable to scenarios with a rare class or with imbalanced costs associated with misclassifications.

As usual, we'll start by loading the libraries we will use (it is good practice to load these all at once near the beginning of your analysis, so any reader knows what packages they'll need to run your notebook). The libraries we read in are the same as last time, except we load the `LogisticRegression` model from `sklearn.linear_model` instead of the `LinearRegression` model. It is worth noting that `LogisticRegression` is a classifier while `LinearRegression` is a regressor. Finally, we read in the `bank_df` data frame from a public GitHub repository. We'll be interested in predicting whether or not a customer will accept an offer on a given solicitation -- the response column here is `y`.

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
%matplotlib inline
!pip install dmba
from dmba import liftChart, gainsChart

bank_df = pd.read_csv("https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/bank.csv", sep = ";")
print(bank_df.info())
bank_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      452

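|  | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |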


From the output associated with `bank_df.info()`, we can see that there are no missing values in the dataset, but we have several categorical variables. We'll start by getting dummy variables associated with our categorical columns.

In [2]:
print(bank_df.shape)
bank_df = pd.get_dummies(bank_df, columns = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y"], drop_first = True)
print(bank_df.shape)
print(bank_df.columns)

(4521, 17)
(4521, 43)
Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_married', 'marital_single', 'education_secondary',
       'education_tertiary', 'education_unknown', 'default_yes', 'housing_yes',
       'loan_yes', 'contact_telephone', 'contact_unknown', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_other', 'poutcome_success', 'poutcome_unknown', 'y_yes'],
      dtype='object')


Notice that the width of the dataset has again exploded, since we've utilized *one-hot-encoding* for lots of categorical columns. Now we'll use `train_test_split()` to randomly select our *training*, *test*, and *safe* sets.

In [3]:
X = bank_df.drop("y_yes", axis = 1)
y = bank_df["y_yes"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.75, random_state = 370)

X_safe, X_test, y_safe, y_test = train_test_split(X_temp, y_temp, test_size = 0.6, random_state = 570)
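
As a quick sanity check on the two calls above, the sketch below works out how many records should land in each set. It assumes that a fractional `test_size` is rounded up to a whole number of records and that the remaining records go to the other part; the helper name `split_sizes` is ours and is not part of any library.

```python
import math

def split_sizes(n_records, test_fraction):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part gets ceil(test_fraction * n_records)
    records and the first part gets whatever remains.
    """
    n_second = math.ceil(test_fraction * n_records)
    return n_records - n_second, n_second

n_total = 4521  # rows in bank_df
n_train, n_temp = split_sizes(n_total, 0.75)  # first split: train vs. temp
n_safe, n_test = split_sizes(n_temp, 0.6)     # second split: safe vs. test

print(n_train, n_safe, n_test)
```

Under that assumption the training set holds only about a quarter of the records, and the test set holds 2035 records, which is the denominator used in the error-rate calculations later in this notebook.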

Now that we've got our data split into *training*, *test*, and *safe* sets, we can train our logistic regression classifier.

In [4]:
#We instantiate a logistic regression classifier, and choose the 
#liblinear solver rather than the default lbfgs solver to avoid
#a non-convergence warning. You do not need to worry about this 
#here -- we may discuss it later in our course when we discuss
#logistic regression in detail.
log_clf = LogisticRegression(solver = "liblinear")

log_clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

## Assessing Classifiers

Before we discuss how to assess classifiers, it is important to discuss how a classifier works.

#### How do classifiers work?

There are two major types of classifier: (i) a classifier which just provides a prediction of class membership, and (ii) a classifier which provides a *propensity* (likelihood) for class membership. This second type of classifier is useful when we would like to *rank-order* records according to their probability of belonging to a class of interest.

The *logistic regression classifier* we built is one of these *rank-order* classifiers. Since a rank-order classifier provides a propensity rather than an actual class prediction, we must choose a cutoff on the propensity for class membership. Is 0.5 the right cutoff, or is 0.25 a better cutoff? Should we just classify according to the class with the highest propensity? The answer to all of these questions depends on the setting and our objectives. We'll talk more about this challenge when we discuss *logistic regression* in detail. For now, let's assume that we've already converted these class propensities into a class prediction.
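
To make the role of the cutoff concrete, here is a minimal sketch using made-up propensities (they are not taken from the bank data). The helper `classify` is a hypothetical name of ours; it simply labels a record 1 whenever its propensity meets or exceeds the chosen cutoff.

```python
import numpy as np

# Hypothetical propensities for six records (not taken from the bank data)
propensities = np.array([0.05, 0.20, 0.30, 0.45, 0.60, 0.90])

def classify(propensities, cutoff):
    """Label a record 1 when its propensity meets or exceeds the cutoff."""
    return (propensities >= cutoff).astype(int)

labels_050 = classify(propensities, 0.50)
labels_025 = classify(propensities, 0.25)

print(labels_050)  # predictions with a cutoff of 0.50
print(labels_025)  # predictions with a cutoff of 0.25
```

Lowering the cutoff from 0.50 to 0.25 flags four of these records instead of two, so the same propensities can lead to quite different class predictions.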

#### Assessing a classifier's performance

In classification a prediction is either correct or it is wrong -- there are not varying degrees of correctness. Since this is the case, our error metrics from the regression setting are irrelevant. In classification we are either interested in a *raw* or *weighted* misclassification error.

**Benchmarking for Classification:** Consider an application with $m$ classes. We may choose to ignore all of the characteristics of each record and simply assign every observation to the most prevalent class. We would then expect to give an incorrect classification on every observation not belonging to this most prevalent class -- our benchmark classification error would then be near
$$E = 1 - \max_{k}\,\mathcal{P}\left[Y = k\right]$$
where the maximum is taken over the $m$ classes.

Any model using feature characteristics to make classifications should do at least as well as this benchmark *naive* model. Classification, however, does come with an additional challenge -- we are often interested in most accurately flagging a rare case (customers leading to sales from cold-calls, a positive diagnosis for a rare disease, etc.). Since this is the case, a raw misclassification rate is not the ideal metric. Instead, we introduce several alternative metrics to compensate for different objectives and imbalanced costs associated with errors coming from each of the various classes. Explanations of the raw and weighted misclassification error rates are below, but first we introduce the confusion matrix.

**Confusion Matrix:** Consider a classification scenario in which we have $m$ classes. A confusion matrix is an $m\times m$ matrix which summarizes our model performance. In row $i$, column $j$ we list the number of observations which are truly from class $i$ and which we have labeled as class $j$. Notice that if $i=j$ we have correct classifications, while if $i\neq j$ we have misclassifications. The confusion matrix for our logistic regression classifier (`log_clf`) appears below.

In order to construct the confusion matrix, we first use our logistic regression model to predict the class to which each observation in our *test* set belongs. We can do this with the model's `.predict()` method. This method automatically transforms a propensity into a class prediction using the 0.5 threshold as the cutoff for belonging to the class of interest (the class labeled by 1 in the response column). The logistic regression classifier also comes with a `.predict_proba()` method, which returns one column of predicted probabilities per class; the second column holds the probability of the observation belonging to the class of interest. We can use that column in conjunction with a boolean (true/false) test associated with a desired threshold value -- see the commented-out lines in the code block below.

In [5]:
predictions = log_clf.predict(X_test)

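#To make propensity predictions and then utilize our own threshold of 0.4 for belonging to the y_yes class.
#Column 1 of the predict_proba output holds the propensity for the class labeled 1.
#predictions = log_clf.predict_proba(X_test)[:, 1]
#predictions = (predictions >= 0.4).astype(int)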

pd.crosstab(y_test, predictions)

| y_yes (rows) / col_0 (columns) | 0 | 1 |
|---|---|---|
| 0 | 1757 | 29 |
| 1 | 195 | 54 |


In [6]:
(150)*195

29250

**Classification Error Rate:** If we have a confusion matrix $A = \left(a_{i,j}\right)$, then the classification error rate is given by

$$E = 1 - \frac{\sum{a_{i,i}}}{n}$$

where $a_{i,i}$ denotes an entry along the diagonal of the confusion matrix -- that is, correct classifications.

From the confusion matrix we can see that for 1757 bank customers we correctly predicted that they would refuse the promotion and for 54 customers we correctly predicted that they would accept the promotion. Our error rate is:

$$E = 1 - \frac{1757 + 54}{2035} \approx 0.1101$$
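
The same calculation can be sketched in code, alongside the naive benchmark described earlier. The array below simply re-enters the counts from the confusion matrix above; the variable names are ours.

```python
import numpy as np

# Confusion matrix from the crosstab above: rows are actual classes,
# columns are predicted classes (0 = no, 1 = yes)
conf_mat = np.array([[1757, 29],
                     [195, 54]])

n = conf_mat.sum()

# Correct classifications sit on the diagonal
error_rate = 1 - np.trace(conf_mat) / n

# Naive benchmark: label every record with the most prevalent actual class
class_totals = conf_mat.sum(axis=1)
naive_error = 1 - class_totals.max() / n

print(f"model error rate: {error_rate:.4f}")
print(f"naive error rate: {naive_error:.4f}")
```

With these counts the model's error rate (about 11.0%) is only slightly below the naive benchmark (about 12.2%), which is worth keeping in mind when judging whether the classifier is performing well.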

Does an error rate of about 11% signify good classifier performance? Let's think more about the bank's objectives...

The bank really wants customers to accept its promotion. Say that the bank expects to earn an additional \\$150 per year for each customer who accepts a promotion. Furthermore, say that mailers cost about \\$3 each to send out, including postage. If we sent a mailer to every person we predicted would accept the offer, we would spend $\$3\cdot (29 + 54) = \$249$, but we would earn back $\$150\cdot 54 = \$8,100$ -- a net gain of $\$7,851$ -- not bad! While this is a nice return on investment, we also missed out on 195 individuals who would have accepted the offer had we extended it to them. This results in an additional hidden cost of $\$150\cdot 195=\$29,250$. We've gone from feeling pretty proud of ourselves to feeling pretty terrible. We are in a scenario where the cost of misclassifying an observation as a yes (false positive) is really low ($\$3$) but the cost of misclassifying an observation as a no (false negative) is much larger ($\$150$). We would gladly accept a higher classification error rate if it meant catching more of the individuals who will accept the offer.
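
The arithmetic in this scenario can be reproduced with a few lines of code. The dollar figures are the assumed values from the paragraph above, and the counts come from the confusion matrix; the variable names are ours.

```python
# Assumed figures from the scenario above
profit_per_acceptance = 150  # dollars earned per customer who accepts
cost_per_mailer = 3          # dollars per mailer sent

# Counts from the confusion matrix
true_positives = 54    # predicted yes, actually yes
false_positives = 29   # predicted yes, actually no
false_negatives = 195  # predicted no, actually yes

# We mail everyone we predicted would accept
mailing_cost = cost_per_mailer * (true_positives + false_positives)
revenue = profit_per_acceptance * true_positives
net_gain = revenue - mailing_cost

# Revenue lost on customers who would have accepted but were never mailed
missed_revenue = profit_per_acceptance * false_negatives

print(mailing_cost, revenue, net_gain, missed_revenue)
```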

We've decided that the raw classification error rate is not appropriate for this scenario. In the case where misclassifications are not all equally undesirable, we pursue different metrics of model performance.

**Sensitivity (or recall):** Measures the ability of the model to detect the important class members accurately. If Class $k$ is the most important class, then the sensitivity is measured as

$$\text{sensitivity} = \frac{a_{k,k}}{\sum{a_{k,i}}} = \frac{TP}{TP + FN} = \frac{\text{True Positives}}{\text{Total Positives}}$$

**Specificity:** Measures the ability of the model to correctly rule out members of the unimportant classes. Again, if Class $k$ is the most important class, the specificity is measured as

$$\text{specificity} = \frac{\sum_{i\neq k}{a_{i,i}}}{\sum_{i\neq k}\sum_{j}{a_{i,j}}} = \frac{TN}{TN + FP} = \frac{\text{True Negatives}}{\text{Total Negatives}}$$

Note that the specificity takes the proportion of correctly classified observations out of all of those cases whose true class is not Class $k$.

**Precision:** Measures how accurate the model's positive predictions are -- that is, if the model predicts the class of interest, precision measures how often that prediction is correct. Considering again that Class $k$ is the most important class, the precision is measured as

$$\text{precision} = \frac{a_{k,k}}{\sum_{i = 1}^{m}{a_{i, k}}} = \frac{TP}{TP + FP} = \frac{\text{True Positives}}{\text{Total Predicted Positives}}$$
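
As an illustration, the sketch below applies the three definitions above to the confusion matrix from our test set, treating the accepting class (labeled 1) as the class of interest. The array re-enters the counts from the crosstab; the variable names are ours.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes (0 = no, 1 = yes)
conf_mat = np.array([[1757, 29],
                     [195, 54]])

tn, fp = conf_mat[0]  # actual no:  true negatives, false positives
fn, tp = conf_mat[1]  # actual yes: false negatives, true positives

sensitivity = tp / (tp + fn)  # share of actual accepters we flagged
specificity = tn / (tn + fp)  # share of actual refusers we ruled out
precision = tp / (tp + fp)    # share of flagged customers who accept

print(f"sensitivity: {sensitivity:.3f}")
print(f"specificity: {specificity:.3f}")
print(f"precision:   {precision:.3f}")
```

At the 0.5 cutoff the classifier rules out refusers very well (specificity near 0.98) but catches only about 22% of the customers who would accept, which is exactly the weakness that the raw error rate hides.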

While the sensitivity, specificity, and precision provide three additional error metrics, they are not *tuned* to the discrepancies in misclassification costs. We introduce weighted classification error below.

To identify the average misclassification cost, we replace the entries in the confusion matrix (raw counts) with values corresponding to the total misclassification cost. For example, if we have misclassified 11 observations as Class 1 and the cost associated with misclassifying a non-Class 1 observation as belonging to Class 1 is $\$3$, then the entry in this new version of the confusion matrix would be $\$3\cdot 11 = \$33$ rather than the 11 which appeared in the original version of the confusion matrix. Notice that using this strategy we ignore the benefit of correct classifications -- their *cost* is 0.

**Average Misclassification Cost:** Consider the two class case, where the cost of misclassifying a Class 1 observation as belonging to Class 2 is $q_1$ and the cost of misclassifying a Class 2 observation as belonging to Class 1 is $q_2$. The average misclassification cost then is

$$\text{Average Misclassification Cost}  = \frac{\left(q_1\cdot a_{1,2}\right) + \left(q_2\cdot a_{2,1}\right)}{n}$$

where $a_{1,2}$ and $a_{2,1}$ are the numbers of misclassified observations from the original confusion matrix.

In our example of the bank's promotional offer, the average misclassification cost becomes:

$$\text{Average Misclassification Cost}  = \frac{\left(\$3\cdot 29\right) + \left(\$150\cdot 195\right)}{2035} \approx \$14.42 \text{ per person}$$
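
A minimal sketch of this calculation follows, assuming we store the costs in a matrix with the same layout as the confusion matrix. The name `cost_mat` and its layout are our own choices for illustration.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes (0 = no, 1 = yes)
conf_mat = np.array([[1757, 29],
                     [195, 54]])

# cost_mat[i, j]: cost in dollars of labeling a class-i record as class j.
# Correct classifications (the diagonal) are assigned a cost of 0.
cost_mat = np.array([[0, 3],
                     [150, 0]])

# Multiply counts by costs entry by entry, then average over all records
total_cost = (cost_mat * conf_mat).sum()
avg_cost = total_cost / conf_mat.sum()

print(f"total cost: ${total_cost}, average cost: ${avg_cost:.2f} per person")
```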

The goal now becomes to minimize this average cost by adjusting the propensity threshold for indicating the accepting class. Is 50% the right cutoff? Could we lower the average cost by reducing or increasing this required threshold?
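
One way to explore these questions is to recompute the average cost over a grid of candidate cutoffs and keep the cheapest. The sketch below does this on a tiny made-up data set (not the bank data), using the \\$3 and \\$150 costs from our scenario as defaults; `average_cost` is a hypothetical helper of ours, not a library function.

```python
import numpy as np

def average_cost(y_true, propensities, cutoff, fp_cost=3, fn_cost=150):
    """Average misclassification cost when propensity >= cutoff is labeled 1."""
    predicted = (propensities >= cutoff).astype(int)
    false_pos = np.sum((predicted == 1) & (y_true == 0))
    false_neg = np.sum((predicted == 0) & (y_true == 1))
    return (fp_cost * false_pos + fn_cost * false_neg) / len(y_true)

# Hypothetical labels and propensities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
propensities = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90])

# Evaluate a few candidate cutoffs and keep the one with the lowest cost
cutoffs = [0.1, 0.3, 0.5]
costs = {c: average_cost(y_true, propensities, c) for c in cutoffs}
best_cutoff = min(costs, key=costs.get)

for c in cutoffs:
    print(f"cutoff {c}: average cost ${costs[c]:.2f}")
print("lowest-cost cutoff:", best_cutoff)
```

In this toy example the 0.5 cutoff is by far the most expensive because a single missed accepter costs \\$150, while lowering the cutoff too far starts to waste money on mailers. With real data, the cutoff should be chosen on held-out records rather than on the records used to report final performance.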

**Generalizing Average Misclassification Cost** to more than two classes is easily done. The cost of misclassifying a Class *i* observation as belonging to Class *j* can be denoted by $q_{ij}$. We compute the average misclassification cost then as:

$$\text{Average Misclassification Cost}  = \frac{\sum_{i\neq j}{\left(q_{ij}\cdot a_{i,j}\right)}}{n}$$

This can become quite complicated quickly -- for example, in the 5 class case, we have $5\cdot 4 = 20$ different types of misclassifications. This requires the estimation of $20$ potentially different misclassification costs. The more we *estimate*, the more *uncertainty* we bring to our model. It is important to consider costs and benefits when deciding whether to proceed with a multiclass scenario or to try to reduce the problem down to just two classes.
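
The generalized formula can be sketched with a cost matrix of the same shape as the confusion matrix. Both matrices below are entirely hypothetical three-class examples chosen for illustration; they do not come from the bank data.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows actual, columns predicted)
conf_mat = np.array([[50, 4, 6],
                     [3, 30, 2],
                     [5, 1, 19]])

# Hypothetical costs q_ij of labeling a class-i record as class j;
# the diagonal is 0 because correct classifications carry no cost
cost_mat = np.array([[0, 2, 10],
                     [1, 0, 4],
                     [20, 5, 0]])

m = conf_mat.shape[0]
n_error_types = m * (m - 1)  # number of distinct costs we must estimate

# Sum of q_ij * a_ij over all off-diagonal entries, divided by n
avg_cost = (cost_mat * conf_mat).sum() / conf_mat.sum()

print(n_error_types, round(avg_cost, 4))
```

Even with only three classes there are six separate costs to estimate, which illustrates how quickly the bookkeeping grows as classes are added.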

### Ranking Performance

Recall that some classifiers output a propensity for class membership rather than just a straight class prediction. This can be really useful in the rare class case in measuring how accurately your classifier flags its *most likely members*. We can sort the observations by propensity (predicted likelihood of class membership), and then see how well our classifier outperforms random guessing. We use a cumulative gains chart for this, much as we did in our discussion of assessing regression models.

In [None]:
#Propensity for the class of interest (column 1 corresponds to y_yes = 1)
propensities = log_clf.predict_proba(X_test)[:, 1]
forGains_df = pd.DataFrame()
forGains_df["y"] = y_test
forGains_df["propensity"] = propensities
#Sort records from most to least likely to accept the offer
forGains_df = forGains_df.sort_values("propensity", ascending=False)
#Numeric 0/1 indicator so the gains can be accumulated
forGains_df["y_num"] = (forGains_df["y"] == 1).astype(int)
forGains_df = forGains_df.drop("y", axis = 1)
gainsChart(forGains_df["y_num"], figsize=(4, 4))

plt.tight_layout()
plt.show()

As in our regression notebook, remember that the further our cumulative gains curve pulls away from the diagonal "average gains" line, the better our classifier is at flagging observations belonging to the positive class. It looks like we do pretty well here, even with a quite simple model!
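
To see what the chart is plotting, the sketch below computes a cumulative gains curve by hand on a small made-up example (not the bank data) and compares it with the diagonal baseline we would expect from selecting records at random. This is our own illustration of the idea, not a description of how `gainsChart` is implemented.

```python
import numpy as np

# Hypothetical outcomes and propensities, for illustration only
y_true = np.array([0, 1, 0, 1, 0, 0, 1, 0])
propensities = np.array([0.10, 0.80, 0.30, 0.65, 0.20, 0.70, 0.40, 0.05])

# Rank records from highest to lowest propensity
order = np.argsort(-propensities)

# Running count of positives found as we work down the ranked list
cumulative_gains = np.cumsum(y_true[order])

# Expected positives found when the same number of records is chosen at random
n_selected = np.arange(1, len(y_true) + 1)
baseline = n_selected * y_true.sum() / len(y_true)

print(cumulative_gains)
print(baseline)
```

In this example all three positives are found within the top four ranked records, whereas random selection would be expected to find only 1.5 of them by that point.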

### Summary

Okay, now you've had some experience in evaluating the performance of both regressors and classifiers. You know that there are multiple metrics to consider in each case and that each metric provides insight towards a different aspect of model performance. In order to properly evaluate model performance, we need to be cognisant of the type of response we want to predict as well as our objectives. This distinction is particularly important in the classification setting -- do we want a model that is as accurate as possible when all class outcomes are considered, do we care most about being able to identify or "rule-out" a class of interest, are we trying to optimize our classifier with respect to imbalanced mis-classification costs? There's so much to consider, which is part of what makes analytics so challenging and interesting.

I wouldn't be surprised if you told me that these past three weeks have been pretty difficult. The course content has been tough -- we've covered a lot of foundational material which we will leverage throughout the remainder of our course. While the material doesn't get easy from here on out, it does get a bit more manageable. From week to week, we will start to focus on particular predictive modeling techniques. Keeping the *End-to-[almost]-End* notebook handy will certainly be helpful as you consider the structure of our notebooks from here on out and as you consider applying these analytics techniques to novel projects. In addition, keeping the *Model Performance Evaluation* notebooks nearby will help you as you consider how best to evaluate models in our coursework scenarios as well as when you utilize predictive modeling techniques in new and less structured environments. You've done some significant leg-work to get to this point and that is no small feat -- good work! We'll continue next time with a two-week foray into *Linear Regression Models* -- see you then.