## Misclassification cost as part of training

There are two ways to introduce cost into the learning function of an algorithm with Scikit-learn:

- Defining the **class_weight** parameter, for those estimators that allow it, when we instantiate the estimator.
- Passing a **sample_weight** vector with a weight for every single observation when we fit the estimator.


With either the **class_weight** parameter or the **sample_weight** vector, we indicate that the loss function should be modified to accommodate the class imbalance and the cost attributed to each misclassification.

## Parameters

**class_weight**: can take 'balanced' as an argument, in which case each class is weighted inversely proportional to its frequency, that is, `n_samples / (n_classes * np.bincount(y))`. Alternatively, it can take a dictionary of {class: penalty} pairs. In this case, mistakes in samples of class[i] are penalized with class_weight[i].

So if class_weight = {0: 1, 1: 10}, misclassifications of observations of class 1 are penalized 10 times more than misclassifications of observations of class 0.
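To make the 'balanced' option concrete, here is a minimal sketch using `compute_class_weight` from `sklearn.utils.class_weight`. The 90/10 toy label vector is an assumption for illustration only; it is not the dataset used later in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels (assumed for illustration): 90 observations of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# 'balanced' weights follow n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
balanced = dict(zip(classes.tolist(), weights.tolist()))
print(balanced)

# the ratio between the two weights equals the imbalance ratio (90 / 10)
print(weights[1] / weights[0])
```

The resulting dictionary has the same {class: penalty} shape that we could pass by hand to class_weight.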

**sample_weight** is a vector of the same length as y, containing the weight or penalty for each individual observation. In principle, it is more flexible, because it allows us to set weights for individual observations and not for the class as a whole. For example, we could set higher penalties for fraudulent applications that are more costly (money-wise) than for fraudulent applications that involve little money.
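As a rough sketch of the fraud example, the snippet below builds a per-observation weight vector from the money at stake. The `amount` array and the "amount divided by 10" scaling are hypothetical choices made for illustration; in practice the scheme should reflect the real cost of each mistake.

```python
import numpy as np

# hypothetical data: target (1 = fraud) and the amount of money at stake
y = np.array([0, 0, 1, 0, 1, 0])
amount = np.array([50.0, 20.0, 1000.0, 75.0, 10.0, 30.0])

# assumed weighting scheme: legitimate observations get weight 1;
# fraudulent ones get a weight that grows with the amount, never below 1
sample_weight = np.where(y == 1, np.maximum(amount / 10.0, 1.0), 1.0)
print(sample_weight)
```

A vector like this would then be passed to the estimator at fit time, for example `model.fit(X, y, sample_weight=sample_weight)`.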

## Important

If you use both class_weight and sample_weight, the two are multiplied, so the final penalty for each observation will be **the product of the 2**. Be very careful.
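The sketch below illustrates this combination on synthetic data (an assumption; it is not the notebook's dataset). It fits one model with both class_weight and sample_weight, and a second model where the class penalty has been multiplied into sample_weight by hand, then checks that the coefficients agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced data (an assumption, not the notebook's dataset)
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

rng = np.random.RandomState(0)
w = rng.uniform(1.0, 3.0, size=len(y))  # arbitrary per-observation weights

# model A: class_weight and sample_weight used together
model_a = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
model_a.fit(X, y, sample_weight=w)

# model B: no class_weight; the class penalty is multiplied in by hand
combined = w * np.where(y == 1, 10.0, 1.0)
model_b = LogisticRegression(max_iter=1000)
model_b.fit(X, y, sample_weight=combined)

# both models should end up with (numerically) the same coefficients
same = np.allclose(model_a.coef_, model_b.coef_, atol=1e-4)
print(same)
```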

## Demo

In this notebook I will introduce cost-sensitive learning to Logistic Regression. But we can do the same with almost every other classifier in Scikit-learn using **sample_weight** or, in those estimators that have that parameter, **class_weight**.

## Classifiers that support class_weight

In [4]:
import sklearn
sklearn.__version__

'1.0.2'

In [5]:
# Let's find out which classifiers from sklearn support class_weight
# as part of the __init__ method, that is, when we set them up

from sklearn.utils import all_estimators

estimators = all_estimators(type_filter='classifier')

for name, class_ in estimators:
    try:
        # instantiate with default parameters and look for the attribute
        if hasattr(class_(), 'class_weight'):
            print(name)
    except Exception:
        # some classifiers have required arguments and cannot be
        # instantiated with defaults; skip them
        pass

DecisionTreeClassifier
ExtraTreeClassifier
ExtraTreesClassifier
LinearSVC
LogisticRegression
LogisticRegressionCV
NuSVC
PassiveAggressiveClassifier
Perceptron
RandomForestClassifier
RidgeClassifier
RidgeClassifierCV
SGDClassifier
SVC


Not all classifiers support class_weight. For those that don't, like GradientBoostingClassifier, we can still use sample_weight when we fit the estimator.
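For instance, a minimal sketch with GradientBoostingClassifier could look like the following. It uses `compute_sample_weight` from `sklearn.utils.class_weight` to mimic 'balanced' weighting; the synthetic data and the small `n_estimators` value are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# synthetic imbalanced data (an assumption, not the notebook's dataset)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# one weight per observation, inversely proportional to its class frequency
sample_weight = compute_sample_weight(class_weight='balanced', y=y)

# the estimator has no class_weight parameter, so the cost goes in at fit time
gbm = GradientBoostingClassifier(n_estimators=20, random_state=0)
gbm.fit(X, y, sample_weight=sample_weight)

# under 'balanced' weighting both classes carry the same total weight
total_0 = sample_weight[y == 0].sum()
total_1 = sample_weight[y == 1].sum()
print(total_0, total_1)
```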

## Logistic Regression with class_weight and sample_weight

In this demo, we are going to introduce the misclassification cost in Logistic Regression, using class_weight and then sample_weight.

In [1]:
# import libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [9]:
# let's load the dataset

df = pd.read_csv('../kdd2004.csv').sample(10000)
df.head()

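| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3566 | 69.43 | 21.05 | 0.97 | 26.0 | 15.5 | 1539.3 | 1.58 | 0.38 | -3.0 | -73.0 | ... | 1813.6 | 1.75 | 0.47 | 0.0 | -84.0 | 1015.2 | -0.64 | 0.05 | -0.69 | -1 |
| 90644 | 89.27 | 24.73 | -0.79 | -25.5 | 32.0 | 2791.3 | 1.49 | -0.98 | -20.0 | -91.0 | ... | 3785.4 | -1.14 | 0.41 | 0.0 | -103.0 | 1286.1 | 1.28 | 0.33 | -0.02 | -1 |
| 21409 | 78.88 | 24.75 | -1.41 | -14.0 | -11.0 | 1468.9 | 0.89 | -0.29 | 4.0 | -93.0 | ... | 965.7 | 0.95 | -0.54 | 7.0 | -84.0 | 833.4 | 0.53 | 0.28 | 0.93 | -1 |
| 28471 | 92.74 | 21.74 | 0.38 | 20.0 | -12.0 | 1080.8 | 0.6 | 0.44 | 19.5 | -85.0 | ... | 806.1 | -0.49 | 2.27 | 10.0 | -38.0 | 85.0 | 0.61 | 0.43 | 0.58 | -1 |
| 40266 | 75.58 | 24.72 | -1.0 | -32.0 | 45.0 | 1539.8 | 0.41 | 0.9 | 9.0 | -74.5 | ... | 2081.4 | -1.94 | 5.11 | 23.0 | -111.0 | 903.4 | 0.53 | 0.18 | -0.25 | -1 |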


In [10]:
df.shape

(10000, 75)

In [13]:
# check balance ratio
df['target'].value_counts()/len(df['target'])

-1    0.9916
 1    0.0084
Name: target, dtype: float64

In [12]:
# split the data into train and test

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis =1),
                                                   df['target'],
                                                   test_size=0.3,
                                                   random_state=0)

X_train.shape, X_test.shape

((7000, 74), (3000, 74))

In [2]:
# define a function to call logistic regression model

def run_log(X_train, X_test, y_train, y_test,class_weight):
    
    log = LogisticRegression(penalty='l2',
                             solver='newton-cg',
                             max_iter=10,
                             class_weight=class_weight,  # weights/cost
                             random_state=0,
                            n_jobs=2)
    
    log.fit(X_train, y_train)
    
    train_preds = log.predict_proba(X_train)
    test_preds = log.predict_proba(X_test)
    
    print('Train ROC: {}'.format(roc_auc_score(y_train, train_preds[:,1])))
    print('Test ROC: {}'.format(roc_auc_score(y_test, test_preds[:,1])))

In [15]:
# running the log model without the class weight values
# i.e. original data and no balancing

run_log(X_train, X_test, y_train, y_test,None)

Train ROC: 0.9321113528456024
Test ROC: 0.8492481718179666


In [16]:
# with class_weight='balanced'
# this technique uses the class imbalance ratio to determine the cost

run_log(X_train, X_test, y_train, y_test,'balanced')

Train ROC: 0.9912647093753396
Test ROC: 0.9470045221811114


In [18]:
# alternatively, we can pass a different cost
# in a dictionary, if we know it already

run_log(X_train,
          X_test,
          y_train,
          y_test,
          class_weight={-1:1, 1:10}) 
# {-1:1, 1:10} means that misclassifying an observation of class 1 (the minority class) costs 10,
# and misclassifying an observation of class -1 (the majority class) costs 1

Train ROC: 0.978605234099219
Test ROC: 0.9141408478778139


In [21]:
# changing class weights
run_log(X_train,
          X_test,
          y_train,
          y_test,
          class_weight={-1:1, 1:50}) 

Train ROC: 0.9813165348918891
Test ROC: 0.916021975557798


In [22]:
# changing class weights
run_log(X_train,
          X_test,
          y_train,
          y_test,
          class_weight={-1:1, 1:99}) 

Train ROC: 0.9911484746722055
Test ROC: 0.9464065478192623


In [23]:
# changing class weights
run_log(X_train,
          X_test,
          y_train,
          y_test,
          class_weight={-1:1, 1:120}) 

Train ROC: 0.9912495483271047
Test ROC: 0.948013603916732


- We should always check the class weights we choose: weights that are too extreme can make the model overfit to one class and neglect the others.
- In these runs, the closer the minority-class weight gets to the imbalance ratio (roughly 118:1 in the sampled data), the better the model performs.
- Once the weight exceeds the imbalance ratio, the model performance seems to remain roughly constant.
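Rather than trying weights one at a time, the comparison above can be sketched as a cross-validated search over candidate class weights. This is only an illustration: the synthetic data, the 0/1 labels, and the candidate weights in `param_grid` are all assumptions, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced data (an assumption, not the notebook's dataset)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# candidate costs to compare (hypothetical values)
param_grid = {'class_weight': [None, 'balanced', {0: 1, 1: 10}, {0: 1, 1: 50}]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring='roc_auc',  # same metric used throughout this notebook
    cv=3,
)
search.fit(X, y)

# best class_weight and the mean cross-validated ROC-AUC of each candidate
print(search.best_params_)
print(search.cv_results_['mean_test_score'])
```

Using cross-validation instead of a single test set reduces the chance of picking a weight that only looks good on one particular split.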

## Using Sample Weight

In [3]:
# logistic regression + sample weight

def run_logit(X_train, X_test, y_train, y_test,sample_weight):
    
    log = LogisticRegression(penalty='l2',
                             solver='newton-cg',
                             max_iter=10,
                             random_state=0,
                            n_jobs=2)
    
    log.fit(X_train, y_train,sample_weight=sample_weight)
    
    train_preds = log.predict_proba(X_train)
    test_preds = log.predict_proba(X_test)
    
    print('Train ROC: {}'.format(roc_auc_score(y_train, train_preds[:,1])))
    print('Test ROC: {}'.format(roc_auc_score(y_test, test_preds[:,1])))

In [25]:
# this is the same as the earlier run with no class weight
run_logit(X_train, X_test, y_train, y_test,None)

Train ROC: 0.9321113528456024
Test ROC: 0.8492481718179666


In [26]:
sample_weight = np.where(y_train==1,99,1)
# set the weight of an observation to 99 if its target is 1,
# i.e. it belongs to the minority class; otherwise set the weight to 1.
# sample_weight holds one weight for every sample or observation

run_logit(X_train, X_test, y_train, y_test,sample_weight)

Train ROC: 0.9911484746722055
Test ROC: 0.9464065478192623


- This result is the same as the one we obtained earlier by passing class_weight = {-1:1, 1:99}

**We can see that cost-sensitive learning improves the performance of our model for this dataset**
- We can try running the same model on some other datasets

In [4]:
## from imblearn datasets
from imblearn.datasets import fetch_datasets

In [18]:
datasets_ls = [
    'car_eval_34',
    'ecoli',
    'thyroid_sick',
    'arrhythmia',
    'ozone_level'
]

In [5]:
data = fetch_datasets()['ecoli']
data

{'data': array([[0.49, 0.29, 0.48, ..., 0.56, 0.24, 0.35],
        [0.07, 0.4 , 0.48, ..., 0.54, 0.35, 0.44],
        [0.56, 0.4 , 0.48, ..., 0.49, 0.37, 0.46],
        ...,
        [0.61, 0.6 , 0.48, ..., 0.44, 0.39, 0.38],
        [0.59, 0.61, 0.48, ..., 0.42, 0.42, 0.37],
        [0.74, 0.74, 0.48, ..., 0.31, 0.53, 0.52]]),
 'target': array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -

In [6]:
# build a dataframe with the target as the first column
df = pd.DataFrame(data.data)
df.insert(0, 'target', data.target)
df.head()

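| | target | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 0.49 | 0.29 | 0.48 | 0.5 | 0.56 | 0.24 | 0.35 |
| 1 | -1 | 0.07 | 0.4 | 0.48 | 0.5 | 0.54 | 0.35 | 0.44 |
| 2 | -1 | 0.56 | 0.4 | 0.48 | 0.5 | 0.49 | 0.37 | 0.46 |
| 3 | -1 | 0.59 | 0.49 | 0.48 | 0.5 | 0.52 | 0.45 | 0.36 |
| 4 | -1 | 0.23 | 0.32 | 0.48 | 0.5 | 0.55 | 0.25 | 0.35 |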


In [7]:
df.shape

(336, 8)

In [8]:
df['target'].value_counts()/len(df['target'])

-1    0.895833
 1    0.104167
Name: target, dtype: float64

In [22]:
# define a function to call logistic regression model

def run_log1(X_train, X_test, y_train, y_test,class_weight):
    
    log = LogisticRegression(penalty='l2',
                             solver='newton-cg',
                             max_iter=10,
                             class_weight=class_weight,  # weights/cost
                             random_state=0,
                            n_jobs=2)
    
    log.fit(X_train, y_train)
    
    train_preds = log.predict_proba(X_train)
    test_preds = log.predict_proba(X_test)
    
    print('Train ROC: {}'.format(roc_auc_score(y_train, train_preds[:,1])))
    print('Test ROC: {}'.format(roc_auc_score(y_test, test_preds[:,1])))
    
    return roc_auc_score(y_test, test_preds[:,1])

In [27]:
results_dict = {}

# fetch the datasets once instead of on every iteration
all_datasets = fetch_datasets()

for dataset in datasets_ls:
    print(dataset)

    data = all_datasets[dataset]
    results_dict[dataset] = {}

    X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                        data.target,
                                                        test_size=0.3,
                                                        random_state=0)
    print(X_train.shape, X_test.shape)

    # test ROC-AUC without weights and with class_weight='balanced'
    results_0 = run_log1(X_train, X_test, y_train, y_test, None)
    results_1 = run_log1(X_train, X_test, y_train, y_test, 'balanced')
    results_dict[dataset]['Not_balanced'] = results_0
    results_dict[dataset]['Balanced'] = results_1

    # sample_weight = np.where(y_train == 1, 99, 1)
    # run_logit(X_train, X_test, y_train, y_test, sample_weight)

car_eval_34
(1209, 21) (519, 21)
Train ROC: 0.9987859868192854
Test ROC: 0.9970405143381977
Train ROC: 0.9988148918950167
Test ROC: 0.9965812838044699
ecoli
(235, 7) (101, 7)
Train ROC: 0.9302539565697461
Test ROC: 0.9492753623188406
Train ROC: 0.9370629370629371
Test ROC: 0.9456521739130435
thyroid_sick
(2640, 52) (1132, 52)
Train ROC: 0.8707358610817982
Test ROC: 0.8753492952545086
Train ROC: 0.9605818557950497
Test ROC: 0.9328737613097803
arrhythmia
(316, 278) (136, 278)
Train ROC: 0.9974424552429668
Test ROC: 0.96484375
Train ROC: 1.0
Test ROC: 0.923828125
ozone_level
(1775, 72) (761, 72)
Train ROC: 0.8140988436983794
Test ROC: 0.6474259974259974
Train ROC: 0.908009286128845
Test ROC: 0.77001287001287


In [28]:
results_dict

{'car_eval_34': {'Not_balanced': 0.9970405143381977,
  'Balanced': 0.9965812838044699},
 'ecoli': {'Not_balanced': 0.9492753623188406, 'Balanced': 0.9456521739130435},
 'thyroid_sick': {'Not_balanced': 0.8753492952545086,
  'Balanced': 0.9328737613097803},
 'arrhythmia': {'Not_balanced': 0.96484375, 'Balanced': 0.923828125},
 'ozone_level': {'Not_balanced': 0.6474259974259974,
  'Balanced': 0.77001287001287}}
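To compare the datasets side by side, a dictionary like this can be turned into a small table. The sketch below copies the test ROC-AUC values printed above (rounded to four decimals) so that it runs on its own; the `Difference` column is an addition for illustration.

```python
import pandas as pd

# test ROC-AUC values copied (rounded) from the results_dict output
results = {
    'car_eval_34': {'Not_balanced': 0.9970, 'Balanced': 0.9966},
    'ecoli': {'Not_balanced': 0.9493, 'Balanced': 0.9457},
    'thyroid_sick': {'Not_balanced': 0.8753, 'Balanced': 0.9329},
    'arrhythmia': {'Not_balanced': 0.9648, 'Balanced': 0.9238},
    'ozone_level': {'Not_balanced': 0.6474, 'Balanced': 0.7700},
}

# one row per dataset, one column per configuration
summary = pd.DataFrame.from_dict(results, orient='index')

# positive values mean class_weight='balanced' helped on that dataset
summary['Difference'] = summary['Balanced'] - summary['Not_balanced']
print(summary.sort_values('Difference', ascending=False))
```

In these runs, 'balanced' weighting helps on thyroid_sick and ozone_level, makes almost no difference on car_eval_34 and ecoli, and lowers the test score on arrhythmia, so the benefit depends on the dataset.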