Cut-off approach not working

In [1]:
import pandas
import numpy as np
from graphviz import Source
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
  
#fix the seed so that the train/test split and the models are reproducible
np.random.seed(4684)
  
pandas.set_option('display.max_columns', 20)
pandas.set_option('display.width', 350)

  


In [2]:
data = pandas.read_csv('./dataset/emails.csv')

In [3]:
#get dummy variables from categorical ones
data_dummy = pandas.get_dummies(data, drop_first=True).drop('email_id', axis=1)
  
#split into train and test so we can evaluate the model on data it has not seen
train, test = train_test_split(data_dummy, test_size = 0.34)
  
#build the model. We choose a single decision tree here, but this issue might apply to all models.
tree=DecisionTreeClassifier(min_impurity_decrease=0.001)
tree.fit(train.drop('clicked', axis=1),train['clicked'])
  
#visualize it
export_graphviz(tree, out_file="tree.dot", feature_names=train.drop('clicked', axis=1).columns, proportion=True, rotate=True)
s = Source.from_file("tree.dot")
s.view()


'tree.dot.pdf'

No splits at all! Not even on just one variable. So if we check the probability assigned to each event on the test set, we get:

In [4]:
tree_probabilities = tree.predict_proba(test.drop('clicked', axis=1))[:,1]
print(stats.describe(tree_probabilities))

DescribeResult(nobs=33983, minmax=(0.020555732411660376, 0.020555732411660376), mean=0.020555732411660376, variance=0.0, skewness=0.0, kurtosis=-3.0)


That is: min and max are the same! Our model is returning a probability of ~0.0206 for every single event in the test set. This is simply the proportion of the minority class in the training data. In a situation like this, there is obviously no point in optimizing the cut-off point. If we predict class 1 whenever the probability is at or above the cut-off, we will have class0 error = 0 and class1 error = 1 for all cut-offs above ~0.0206, and class0 error = 1 and class1 error = 0 for all cut-offs at or below ~0.0206. No other option.


Something like this happens very often when the data are highly unbalanced and/or hard to separate.
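
To make the point above concrete, here is a minimal sketch of a cut-off sweep over constant scores. The toy labels (2 clicks out of 100 emails) and the helper class_errors are hypothetical and only mirror the situation described; they are not taken from the email dataset.

```python
# Hypothetical toy data: 2 clicks out of 100 emails, every score identical,
# mimicking the constant probability returned by the tree above.
y_true = [1] * 2 + [0] * 98
scores = [0.0206] * len(y_true)

def class_errors(y_true, scores, cutoff):
    """Return (class0_error, class1_error), predicting 1 when score >= cutoff."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    n0 = sum(1 for y in y_true if y == 0)
    n1 = sum(1 for y in y_true if y == 1)
    wrong0 = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    wrong1 = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    return wrong0 / n0, wrong1 / n1

# Try cut-offs 0.00, 0.05, ..., 1.00 and collect the distinct outcomes.
cutoffs = [i / 100 for i in range(0, 101, 5)]
outcomes = {class_errors(y_true, scores, c) for c in cutoffs}
print(outcomes)
```

However many cut-offs are tried, only two distinct (class0_error, class1_error) pairs can appear: everything predicted as class 1, or everything predicted as class 0.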

Changing weights

If we cannot use the model cut-off point, the next best strategy is to change the weights of the events in your training set. Specifically, you want to increase the weight of the minority class. This will push the model towards correctly predicting more class 1 events, but it comes at the expense of class 0. So, similarly to when we changed the cut-off point, expect class 1 error to go down and class 0 error to go up.


A way to look at why changing weights works is the following:


The main problem we have when building the model is that the accuracy the model would achieve by classifying everything as majority class is very high. So the model has no incentive to try to correctly predict class 1 events, especially because doing so means misclassifying some class 0 events and, therefore, losing accuracy.


By changing weights, you effectively increase the number of minority class events, so the accuracy coming from classifying everything as majority class drops. The extreme case is that you make the classes balanced, so the class 1 and class 0 proportion becomes 50/50. The model now has to try to separate the classes, just as it does in common ML problems where classes are not heavily unbalanced. After all, if the starting class proportion is 50/50, any non-random split will improve on that and move accuracy above 50%.
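
The argument above can be checked with a little arithmetic. The sketch below assumes 2 minority and 98 majority events per 100 (roughly the click rate seen earlier) and uses a hypothetical helper, majority_baseline_accuracy, to compute the weighted accuracy of a model that labels everything as majority class.

```python
def majority_baseline_accuracy(n_minority, n_majority, minority_weight):
    """Weighted accuracy of always predicting the majority class.

    Each minority event counts minority_weight times, each majority event once.
    """
    weighted_minority = n_minority * minority_weight
    return n_majority / (n_majority + weighted_minority)

# Assumed class counts: 2 minority vs 98 majority events.
for weight in (1, 10, 40, 49, 100):
    acc = majority_baseline_accuracy(2, 98, weight)
    print("minority weight %3d -> baseline accuracy %.3f" % (weight, acc))
```

With weight 1 the lazy baseline scores 98%, while at weight 49 it drops to exactly 50%, which is the point where the model can no longer look good by ignoring class 1.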


Unfortunately, there is no hard rule on how to optimize class weights. In general, an approach similar to the one used for the cut-off point will work. That is, try a few different weights, look at class errors and pick the weight where the trade-off between the two class errors is the best.

In [5]:
# RF weights can be passed as class weights inside RandomForestClassifier. 
#Then, for each weight configuration, we save class errors and accuracy. 
#Finally, pick the best combination of both class errors 
  
#Build 20 RF models with different weights
class0_error = []
class1_error = []
accuracy = []
  
#apply weights from 10 to 200 with 10 as a step
for i in range(10,210,10):
    rf = RandomForestClassifier(n_estimators=50, oob_score=True, class_weight={0:1,1:i})
    rf.fit(train.drop('clicked', axis=1), train['clicked'])
    #let's get confusion matrix
    conf_matrix = pandas.DataFrame(confusion_matrix(test['clicked'], rf.predict(test.drop('clicked', axis=1)), labels=[0, 1]))
    #class0/1 errors are 1 -  (correctly classified events/total events belonging to that class)
    class0_error.append( 1 - conf_matrix.loc[0,0]/(conf_matrix.loc[0,0]+conf_matrix.loc[0,1]))
    class1_error.append( 1 - conf_matrix.loc[1,1]/(conf_matrix.loc[1,0]+conf_matrix.loc[1,1]))
    accuracy.append((conf_matrix.loc[1,1]+conf_matrix.loc[0,0])/conf_matrix.values.sum())
  
dataset_weights = pandas.DataFrame ({'minority_class_weight': pandas.Series(range(10,210,10)),
                                     'class0_error': pandas.Series(class0_error),
                                     'class1_error': pandas.Series(class1_error),
                                     'accuracy':     pandas.Series(accuracy)
                                   })
  
print(dataset_weights)

    minority_class_weight  class0_error  class1_error  accuracy
0                      10      0.029967      0.957924  0.950564
1                      20      0.051398      0.928471  0.930200
2                      30      0.064743      0.906031  0.917606
3                      40      0.073880      0.890603  0.908984
4                      50      0.079892      0.894811  0.903010
5                      60      0.081244      0.896213  0.901657
6                      70      0.085903      0.886396  0.897302
7                      80      0.086384      0.883590  0.896890
8                      90      0.089690      0.882188  0.893682
9                     100      0.088188      0.879383  0.895212
10                    110      0.090592      0.882188  0.892799
11                    120      0.090382      0.886396  0.892917
12                    130      0.090532      0.879383  0.892917
13                    140      0.092005      0.879383  0.891475
14                    150      0.092065 

Let’s understand this output:


Minority_class_weight is the weight given to the minority class. So 10 means each minority class event is weighted 10 times more than a majority class one, and so on. In our dataset, we had a 2:98 ratio between the classes. That means that when we give the minority class a weight of ~50, we achieve a roughly balanced dataset, i.e. a ~100:98 ratio.


Class0 and class1 error as well as accuracy are as described in the previous section. You can see that, as you increase the weight of the minority class, more and more points get classified as class 1, so class0 error will increase (bad), class1 error will decrease (good), and accuracy will go down (bad). And btw this is exactly the same trade-off we saw when lowering the cut-off. However, at some point (here, somewhere around a weight of 70 to 100) those values flatten out on the test set.


The formula we previously used to optimize the two class errors, (1-class1_error) - class0_error, can also be used here:

In [6]:
#Calculate trade-off between class errors
dataset_weights['optimal_value'] = 1 - dataset_weights['class1_error'] - dataset_weights['class0_error']
  
#Order by optimal_value and pick the first row
print(dataset_weights.sort_values('optimal_value', ascending=False).head(1))

   minority_class_weight  class0_error  class1_error  accuracy  optimal_value
3                     40       0.07388      0.890603  0.908984       0.035517


So, according to this, the best weight is 40, meaning we want to multiply the minority class weight by 40. If we do that, we will get to a ratio between the classes of 80:98 (the starting point was 2:98), pretty close to balanced.
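
As a quick sanity check, the same score can be recomputed by hand from the rows printed earlier. The quantity (1-class1_error) - class0_error is the true positive rate minus the false positive rate, which is usually called Youden's J statistic. The sketch below copies only the rows that are fully visible in the output above (weights 10 to 140); the helper name optimal_value is just an illustrative choice.

```python
# (minority_class_weight, class0_error, class1_error), copied from the
# rows fully shown in the printed table (weights 10 to 140).
rows = [
    (10, 0.029967, 0.957924), (20, 0.051398, 0.928471),
    (30, 0.064743, 0.906031), (40, 0.073880, 0.890603),
    (50, 0.079892, 0.894811), (60, 0.081244, 0.896213),
    (70, 0.085903, 0.886396), (80, 0.086384, 0.883590),
    (90, 0.089690, 0.882188), (100, 0.088188, 0.879383),
    (110, 0.090592, 0.882188), (120, 0.090382, 0.886396),
    (130, 0.090532, 0.879383), (140, 0.092005, 0.879383),
]

def optimal_value(class0_error, class1_error):
    # True positive rate minus false positive rate
    return (1 - class1_error) - class0_error

# Rank the weights by score, best first.
scored = sorted(((optimal_value(c0, c1), w) for w, c0, c1 in rows), reverse=True)
best_value, best_weight = scored[0]
for value, weight in scored[:3]:
    print(weight, round(value, 6))
```

The top few weights score within a few thousandths of each other, so the exact winner may well change with a different random seed or train/test split.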


And that’s it. The model is no longer predicting all events as majority class. And we managed to improve class 1 error without hurting accuracy and class 0 error too much.

Pros and Cons
Pros of using the weight strategy with unbalanced data

✓ It is very flexible. You can try as many different weights as you wish and find out the ones that work best for you. You have way less flexibility with the cut-off approach. And btw you could also use this on top of the cut-off approach, i.e. first change the weights and then find the best cut-off.

✓ It is pretty easy to explain. You can simply say: since we had fewer class 1 events, we increased their number to force the model to learn about them.
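
The idea of changing weights first and then tuning the cut-off can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (generated with make_classification at roughly 5% minority class, not the email dataset), and the minority weight of 20 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email data (an assumption): ~5% minority class.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, stratify=y, random_state=0)

# Step 1: fit with an up-weighted minority class (the weight 20 is arbitrary).
rf = RandomForestClassifier(n_estimators=50, class_weight={0: 1, 1: 20},
                            random_state=0)
rf.fit(X_train, y_train)
proba = rf.predict_proba(X_test)[:, 1]

# Step 2: sweep the cut-off on the weighted model's probabilities.
results = []
for cutoff in [i / 10 for i in range(1, 10)]:
    pred = (proba >= cutoff).astype(int)
    class0_error = float((pred[y_test == 0] == 1).mean())
    class1_error = float((pred[y_test == 1] == 0).mean())
    results.append((cutoff, class0_error, class1_error,
                    1 - class1_error - class0_error))

# Pick the cut-off with the best trade-off between the two class errors.
best = max(results, key=lambda r: r[3])
print("best cut-off: %.1f, class0_error: %.3f, class1_error: %.3f" % best[:3])
```

Like the rest of this section, the sketch picks the cut-off on the test set for simplicity. With enough data, choosing it on a separate validation set would give a less optimistic estimate.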

Cons of using the weight strategy with unbalanced data

✗ You risk overfitting if the starting number of minority class events is not large enough. Imagine you have just 10 minority class events. By increasing their weight, you are essentially creating many events just like them. So now you have tons of events which are all the same. The model will correctly classify those, but this will hardly generalize to new data. Changing weights, and resampling in general, does not work when your sample size is too small.

✗ It can be computationally expensive if you have to try many different weights and build a model for each weight. Making data balanced without specifically looking for the optimal weights is a commonly used short-cut. As with all short-cuts, use it with caution though.
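
For the balancing short-cut, scikit-learn estimators such as RandomForestClassifier accept class_weight='balanced', which sets each class weight to n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python on hypothetical labels with the 2:98 ratio discussed above; balanced_class_weights is an illustrative helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Class weights following the n_samples / (n_classes * count) rule."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels with the 2:98 ratio discussed above.
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)

# How much more one minority event counts than one majority event.
relative_weight = weights[1] / weights[0]
print(weights, relative_weight)
```

The relative weight comes out at about 49, in line with the ~50 mentioned earlier as the weight that balances a 2:98 dataset. It is a reasonable starting point, though not necessarily the weight with the best trade-off between class errors.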