In [1]:
import pandas
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
pandas.set_option('display.max_columns', 20)
pandas.set_option('display.width', 350)
  
#set seed to be able to reproduce the results
np.random.seed(4684)

The vast majority of real datasets are highly unbalanced. Typically, the minority class is between 1% and 3% for the most important metrics, e.g. conversion rate, ad click-through rate, fraud, email click rate.


Building a model with unbalanced data presents a few challenges. The biggest problem is that if you predict everything as the majority class, you will get very high accuracy. And many models are internally built to optimize accuracy. So feed into the model a dataset with 2% class 1 and 98% class 0, and you will get as an output a model that predicts (almost) everything as class 0 with ~98% accuracy. 98% accuracy might look good at first, but a model like that is hardly useful, because it finds almost none of the class 1 events you actually care about.
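The point above can be reproduced without any model at all. The sketch below uses made-up labels (2% class 1, not the emails dataset) and a "model" that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 20 events of class 1 and 980 of class 0 (2% vs 98%)
y_true = np.array([1] * 20 + [0] * 980)

# A "model" that predicts the majority class for every single event
y_pred = np.zeros_like(y_true)

# Share of events classified correctly
accuracy = (y_pred == y_true).mean()
# Share of actual class 1 events that were predicted as 1
recall_class1 = y_pred[y_true == 1].mean()

print(f"accuracy: {accuracy:.2f}, class 1 recall: {recall_class1:.2f}")
# prints: accuracy: 0.98, class 1 recall: 0.00
```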

In [2]:
data = pandas.read_csv('./dataset/emails.csv')

In [3]:
#get dummy variables from categorical ones
data_dummy = pandas.get_dummies(data, drop_first=True).drop('email_id', axis=1)
  
#split into train and test to avoid overfitting
train, test = train_test_split(data_dummy, test_size = 0.34)
  
#build the model. We choose a RF, but this issue applies to pretty much all models
rf = RandomForestClassifier(n_estimators=50, oob_score=True)
rf.fit(train.drop('clicked', axis=1), train['clicked'])
#let's print OOB confusion matrix
print(pandas.DataFrame(confusion_matrix(train['clicked'], rf.oob_decision_function_[:,1].round(), labels=[0, 1])))

       0    1
0  64342  269
1   1342   14


In [4]:
#and let's print test set confusion matrix
print(pandas.DataFrame(confusion_matrix(test['clicked'], rf.predict(test.drop('clicked', axis=1)), labels=[0, 1])))

       0    1
0  33126  144
1    706    7


Firstly, let’s make sure we understand the confusion matrix output. The first one is for RF OOB errors and the second one is for the test set.


Top left (where column == 0 and row == 0) are the true negatives. These are the events that the model classifies as 0s and are indeed 0s. Expect this number to always be very large. After all, it is the easy part of the problem, given how many 0s we have in the dataset.


Bottom left (where column == 0 and row == 1) are the false negatives: the events that the model classifies as 0, but are in fact 1s. It is crucial to minimize this as much as possible. If the model classifies everything as 0, this number will be large.


Top right (where column == 1 and row == 0) are the false positives: the events that are 0s, but the model classifies them as 1. At first, this number will be very low. Again, if the model is classifying everything as 0, there will be very few events here. As you improve the model, this number will go up. While you should avoid making this number go up too much (after all, they are still misclassifications), that’s part of the trade-off. It is fine to see this number increase in order to increase the number of events correctly classified as 1.


Bottom right (where column == 1 and row == 1) are the true positives. Your goal should be increasing this number. And the only way to do it is by decreasing the number of events that the model classifies as 0, but are in fact 1s (column == 0 and row == 1).


To summarize all this, we can compute class 0 and class 1 errors: the share of actual class 0 events that are misclassified, and the share of actual class 1 events that are misclassified. If the model classifies everything as the majority class, class error for class 0 will be close to 0 (meaning it is perfect), but the other one will be close to 1 (meaning useless). Your goal is to decrease class 1 error without increasing class 0 error too much.
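As a small illustration of the layout described above (rows are actual classes, columns are predicted classes), here is a sketch that builds the 2x2 matrix by hand. The six labels and predictions are hypothetical, chosen only so that every cell is non-empty.

```python
import numpy as np

# Hypothetical actual labels and model predictions
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1])

# Count events per (actual, predicted) pair: row = actual, column = predicted
conf = np.zeros((2, 2), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    conf[actual, predicted] += 1

tn, fp = conf[0]  # row 0: actual 0s predicted as 0 / as 1
fn, tp = conf[1]  # row 1: actual 1s predicted as 0 / as 1

class0_error = fp / (tn + fp)  # share of actual 0s that are misclassified
class1_error = fn / (fn + tp)  # share of actual 1s that are misclassified

print(conf)
print(class0_error, class1_error)
```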

In [5]:
#confusion matrix test set
conf_matrix = pandas.DataFrame(confusion_matrix(test['clicked'], rf.predict(test.drop('clicked', axis=1)), labels=[0, 1]))
#class0/1 errors are 1 -  (correctly classified events/total events belonging to that class)
class0_error = 1 - conf_matrix.loc[0,0]/(conf_matrix.loc[0,0]+conf_matrix.loc[0,1])
class1_error = 1 - conf_matrix.loc[1,1]/(conf_matrix.loc[1,0]+conf_matrix.loc[1,1])
  
print(pandas.DataFrame( {'test_class0_error':[class0_error],
                        'test_class1_error':[class1_error]
}))

   test_class0_error  test_class1_error
0           0.004328           0.990182


1. Play with model cut-off point

So obviously the models we built above are useless. Our goal is to force the model to classify more events as class 1, even if this means losing overall accuracy.

The simplest way we can try to achieve that is to change the model cut-off point. A couple of words about the cut-off point: every model, for each event, returns a probability between 0 and 1. Typically, the default cut-off value is 0.5, meaning that if that probability is lower than 0.5, the model predicts 0, else it predicts 1. But you can change that cut-off value. In an R random forest, that probability is simply the proportion of trees that predict 1. So, if my RF has 50 trees and 10 of those predict 1, the final model score will be 0.2. In Python (scikit-learn), it is the mean of the probabilities returned by each tree.
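To make the two scoring rules concrete, here is a minimal sketch for a single event. The per-tree probabilities are invented for illustration; in a real forest they would come from the individual trees.

```python
import numpy as np

# Hypothetical class 1 probabilities returned by 50 trees for one event:
# 10 trees lean towards class 1 (0.7), 40 trees lean towards class 0 (0.1)
tree_probs = np.array([0.7] * 10 + [0.1] * 40)

# Proportion of trees that predict 1 (the vote-based score described above)
vote_score = (tree_probs >= 0.5).mean()  # 10 / 50

# Mean of the per-tree probabilities (the averaging-based score)
mean_score = tree_probs.mean()  # (10 * 0.7 + 40 * 0.1) / 50

print(round(vote_score, 2), round(mean_score, 2))
# prints: 0.2 0.22
```

The two scores are close but not identical, and with the default 0.5 cut-off both would classify this event as 0.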

Right now, our model is using the default 0.5 cut-off value. So it feels pretty straightforward that, if we want to increase the number of events classified as class 1, we can just change the cut-off value accordingly. Maybe if we create a rule like: everything >= 0.1 will be class 1, we will improve our class 1 error.
So, let’s see how modifying the cut-off point will change my model output.
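Applying a custom cut-off is a one-liner once you have the predicted probabilities. A minimal sketch with made-up scores (the 0.1 threshold is the example rule from the text, not a recommendation):

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 for five events
scores = np.array([0.02, 0.08, 0.10, 0.35, 0.60])

# Default rule: class 1 only when the score reaches 0.5
default_pred = np.where(scores >= 0.5, 1, 0)

# Lowered cut-off: everything >= 0.1 becomes class 1
lowered_pred = np.where(scores >= 0.1, 1, 0)

print(default_pred)  # [0 0 0 0 1]
print(lowered_pred)  # [0 0 1 1 1]
```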

In [6]:
from sklearn.metrics import roc_curve
  
#get test set predictions as a probability
pred_prob=rf.predict_proba(test.drop('clicked', axis=1))[:,1]
#get false positive rate and true positive rate, for different cut-off points
#and let's save them in a dataset. 
fpr, tpr, thresholds = roc_curve(test['clicked'],pred_prob)
# we will focus on class errors, defined as
# class0_error = fpr and class1_error = 1 - tpr
error_cutoff=pandas.DataFrame({'cutoff':pandas.Series(thresholds),
                               'class0_error':pandas.Series(fpr),
                               'class1_error': 1 - pandas.Series(tpr)
                                })

In [7]:
#let's also add accuracy to the dataset, i.e. overall correctly classified events.
#This is: (tpr * positive samples in the test set + tnr * negative samples in the test set)/total_events_in_the_test_set
error_cutoff['accuracy']=((1-error_cutoff['class0_error'])*sum(test['clicked']==0)+(1-error_cutoff['class1_error'])*sum(test['clicked']==1))/len(test['clicked'])
  
print(error_cutoff)

        cutoff  class0_error  class1_error  accuracy
0     1.960000      0.000000      1.000000  0.979019
1     0.960000      0.000030      1.000000  0.978989
2     0.944000      0.000090      1.000000  0.978931
3     0.900000      0.000120      1.000000  0.978901
4     0.860000      0.000180      1.000000  0.978842
...        ...           ...           ...       ...
1178  0.001429      0.188127      0.674614  0.801666
1179  0.001176      0.188157      0.674614  0.801636
1180  0.001053      0.188218      0.674614  0.801577
1181  0.000909      0.188248      0.674614  0.801548
1182  0.000000      1.000000      0.000000  0.020981

[1183 rows x 4 columns]


Let’s understand this output:


Cutoff are the different cut-off values we are considering. In principle, it goes from 0 to 1. In this case, it goes from 0 to 0.96, the highest score the model assigns to any test event; any cut-off above 0.96 (i.e. 0.97, 0.98, etc.) classifies every event as class 0. That is exactly what the first row (1.96) shows, so its value does not really matter: the scikit-learn version used here arbitrarily sets it to 1 + max(cutoff), and newer releases report it as inf instead.


Class0_error is 1 - true negatives/all negative events. This is also called the false positive rate. It simply means: of all class 0 events, what proportion can I correctly classify? Then take 1 minus that number.


Class1_error is 1 - true positives/all positive events. This is also called the false negative rate. It simply means: of all class 1 events, what proportion can I correctly classify? Then take 1 minus that number.


Accuracy. This is simply the model accuracy, so correctly classified events divided by all events.
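Each row of the table can also be computed by hand for a single cut-off, which may help when checking the definitions. The sketch below uses a tiny hypothetical dataset (8 negatives, 2 positives), and `errors_at_cutoff` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def errors_at_cutoff(y_true, scores, cutoff):
    """Return (class0_error, class1_error, accuracy) for one cut-off value."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(scores) >= cutoff).astype(int)
    negatives = y_true == 0
    positives = y_true == 1
    class0_error = (pred[negatives] == 1).mean()  # false positive rate
    class1_error = (pred[positives] == 0).mean()  # false negative rate
    accuracy = (pred == y_true).mean()
    return class0_error, class1_error, accuracy

# Hypothetical labels and predicted class 1 probabilities
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.0, 0.0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.04, 0.4]

for cutoff in (0.5, 0.1, 0.03):
    print(cutoff, errors_at_cutoff(y_true, scores, cutoff))
```

Lowering the cut-off from 0.5 to 0.03 moves class1_error from 1 to 0 while class0_error climbs from 0 to 0.5, the same pattern as in the table above.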


The most important thing about this table is understanding the trade-offs as I decrease the cut-off value. Decreasing the cut-off value leads to an increase in class0_error (which is bad), but an improvement in class1_error (which is good). The two extremes (when either class0_error = 0 and class1_error = 1 or the other way round) are obviously useless. They mean just classifying all events as a given class. The point is finding the best combination of the two class errors.


Also, as you can see, accuracy is largest when the model classifies everything as the majority class. That’s obvious. If you have 98% majority class events, you get a great looking 98% accuracy by classifying everything as class 0. But that’s again useless. As you improve class 1 error, you will see accuracy decrease. That’s ok, you are willing to accept that (not that you have any other options). But keep this trade-off in mind, making sure accuracy doesn’t go down too much.


The most common way to optimize class0 and class1 errors taking into account the trade-off is by maximizing this formula:


(1-class1_error) - class0_error


The maximum value of this is when both class errors are 0, which will never happen in real life. However, at its core, this simply means that I am willing to increase class0_error by a certain number as long as class1_error goes down by a larger number.


Imagine my starting point is class0_error = 0 and class1_error = 1. The formula above returns (1 - 1) - 0 = 0, so it is very bad. If I decrease class1_error by 0.1, and class0_error increases by only 0.05, that number becomes (1 - 0.9) - 0.05 = 0.05, so it went up and I am happy. This makes sense, given that the loss on one side was offset by a larger gain on the other side.
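The arithmetic above can be checked in a couple of lines. This quantity (true positive rate minus false positive rate) is commonly known as Youden's J statistic; the function name `trade_off_score` is just an illustrative label.

```python
def trade_off_score(class0_error, class1_error):
    """(1 - class1_error) - class0_error, i.e. true positive rate minus false positive rate."""
    return (1 - class1_error) - class0_error

# Starting point: everything predicted as class 0
start = trade_off_score(class0_error=0.0, class1_error=1.0)

# class1_error improves by 0.1 while class0_error worsens by only 0.05
after = trade_off_score(class0_error=0.05, class1_error=0.9)

print(round(start, 2), round(after, 2))
# prints: 0.0 0.05
```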


In [8]:
#let's check best combination of class0 and class1 errors
error_cutoff['optimal_value'] = 1 - error_cutoff['class1_error'] - error_cutoff['class0_error']
  
print(error_cutoff.sort_values('optimal_value', ascending=False).head(20))

        cutoff  class0_error  class1_error  accuracy  optimal_value
1172  0.002500      0.186444      0.676017  0.803284       0.137539
1178  0.001429      0.188127      0.674614  0.801666       0.137258
1179  0.001176      0.188157      0.674614  0.801636       0.137228
1180  0.001053      0.188218      0.674614  0.801577       0.137168
1181  0.000909      0.188248      0.674614  0.801548       0.137138
1173  0.002222      0.186865      0.676017  0.802872       0.137118
1174  0.002000      0.187136      0.676017  0.802607       0.136848
1171  0.002857      0.185993      0.677419  0.803696       0.136587
1175  0.001818      0.187647      0.676017  0.802107       0.136337
1176  0.001667      0.187767      0.676017  0.801989       0.136216
1177  0.001538      0.188037      0.676017  0.801724       0.135946
1169  0.003125      0.185813      0.678822  0.803843       0.135365
1170  0.003077      0.185873      0.678822  0.803784       0.135305
1165  0.003636      0.184491      0.680224  0.80

So, according to this, the best cut-off value is ~0.0025. Meaning that all events whose RF predicted probability is >= 0.0025, will be predicted as 1, else 0. That probability is the mean calculated across the probabilities returned by each tree.


Keep in mind that this formula can be useful, but at the end of the day domain knowledge and a strong business sense are much more important than optimizing some formula. For instance, there might be cases in which the cost of a false positive is much larger than the cost of a false negative. In any case, it would be pretty straightforward to modify that formula accordingly.


All that formula is saying is: I am willing to accept an increase of X in class0_error as long as class1_error decreases by more than X. You can quickly modify it. Imagine the cost of a class 1 error is much larger than the cost of a class 0 error. You could say something like: I am willing to accept an increase of X in class0_error as long as class1_error decreases by more than X/2. Mathematically, this would be like maximizing:


(1-class1_error)*2 - class0_error


So now a decrease in class1_error weighs twice as much as an increase in class0_error.
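Here is a rough sketch of how the weight can change which cut-off wins. The three candidate rows are invented for illustration, and `weighted_score` and `best_cutoff` are hypothetical helper names.

```python
def weighted_score(class0_error, class1_error, class1_weight=1.0):
    """(1 - class1_error) * class1_weight - class0_error."""
    return (1 - class1_error) * class1_weight - class0_error

# Hypothetical candidates: (cutoff, class0_error, class1_error)
candidates = [
    (0.50, 0.00, 1.00),
    (0.10, 0.10, 0.80),
    (0.01, 0.45, 0.50),
]

def best_cutoff(rows, class1_weight):
    """Cut-off of the row with the highest weighted score."""
    best_row = max(rows, key=lambda r: weighted_score(r[1], r[2], class1_weight))
    return best_row[0]

print(best_cutoff(candidates, class1_weight=1))  # 0.1
print(best_cutoff(candidates, class1_weight=2))  # 0.01
```

With equal weights the moderate cut-off wins; once class 1 errors count double, the more aggressive cut-off becomes the better choice.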


Finally, let’s check that everything makes sense by going back to our original model and rebuilding the confusion matrix with the new cut-off point.

In [9]:
#we already have predicted probabilities from the previous step, i.e.
#pred_prob=rf.predict_proba(test.drop('clicked', axis=1))[:,1]
  
#let's create a 0/1 vector according to the best cutoff found above (~0.0025)
best_cutoff = error_cutoff.sort_values('optimal_value', ascending=False)['cutoff'].values[0]
predictions=np.where(pred_prob>=best_cutoff,1,0)
#get confusion matrix for those predictions
#confusion matrix test set
conf_matrix_new = pandas.DataFrame(confusion_matrix(test['clicked'], predictions, labels=[0, 1]))
print(conf_matrix_new)

       0     1
0  27067  6203
1    482   231


In [10]:
class0_error = 1 - conf_matrix_new.loc[0,0]/(conf_matrix_new.loc[0,0]+conf_matrix_new.loc[0,1])
class1_error = 1 - conf_matrix_new.loc[1,1]/(conf_matrix_new.loc[1,0]+conf_matrix_new.loc[1,1])
print(pandas.DataFrame( {'cutoff':[best_cutoff],
                         'test_class0_error_new':[class0_error],
                         'test_class1_error_new':[class1_error]
}))

   cutoff  test_class0_error_new  test_class1_error_new
0  0.0025               0.186444               0.676017


It worked! As you can see, the class 0 and class 1 errors are exactly the same as in the error_cutoff dataset when the cut-off is 0.0025.


Pros and Cons
Pros of using the cut-off value strategy with unbalanced data

✓ The interpretation is very straightforward. Since one class is far less likely to happen than the other one, I just force the model to classify more events as the minority class by lowering the classification threshold. You are not modifying the data distribution or, in general, doing any transformation that leads to a loss of interpretability.


Cons of using the cut-off value strategy with unbalanced data

✗ It is not always possible to use this approach. If your classes are highly unbalanced, your model might not find any way to separate them. That would lead to basically no model and all events being classified as the majority class with probability 1. In that case, since all events have the same probability, there is no point in doing a cut-off analysis. More on this in the next section.
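One quick way to spot this situation before running a cut-off analysis is to check whether the predicted probabilities take more than one distinct value. The helper below is a hypothetical sketch, not part of scikit-learn.

```python
import numpy as np

def cutoff_analysis_is_possible(scores):
    """True when the predicted probabilities take more than one distinct value."""
    return np.unique(np.asarray(scores)).size > 1

# Hypothetical degenerate model: class 1 probability is 0 for every event
degenerate_scores = np.zeros(1000)

# Hypothetical usable model: scores vary across events
usable_scores = np.array([0.0, 0.002, 0.01, 0.3])

print(cutoff_analysis_is_possible(degenerate_scores))  # False
print(cutoff_analysis_is_possible(usable_scores))      # True
```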