# Cost Sensitive Learning

Most classification algorithms assume that all misclassification errors cost the same, which is often not the case. For instance, misclassifying a sick person as healthy is more costly than the reverse (the person may not get the treatment they need), and misclassifying a fraudulent claim as legitimate costs more than misclassifying a legitimate claim as fraudulent.

Cost Sensitive Learning (CSL) is a type of learning that takes misclassification costs into account. The goal is not to minimise the error rate (equivalently, to maximise accuracy) but to minimise the total misclassification cost, which is why CSL treats different kinds of misclassification differently.

Cost-insensitive:
* Minimise error rate
* Same cost for all misclassifications

Cost-sensitive:
* Minimise cost
* Different misclassification costs

A cost matrix looks like a confusion matrix, except that each cell holds the cost of that outcome rather than a count. Standard machine learning models use a 0-1 loss function, which assigns a cost of 0 to a correctly classified observation and a cost of 1 to an incorrectly classified one. Cost-sensitive learning applies different costs to different classification errors.
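
To make the difference between the 0-1 loss and a cost matrix concrete, here is a minimal sketch. The counts and the cost values are hypothetical, and the layout (rows are predicted classes, columns are actual classes) is an assumption made for this illustration.

```python
import numpy as np

# Rows are predicted classes, columns are actual classes (assumed layout).
# Hypothetical counts: 90 true negatives, 5 false negatives,
# 3 false positives, 2 true positives.
counts = np.array([[90, 5],
                   [3, 2]])

# 0-1 loss: every error costs 1, every correct prediction costs 0.
zero_one_cost = np.array([[0, 1],
                          [1, 0]])

# Illustrative cost matrix: a false negative costs 10, a false positive costs 1.
cost_matrix = np.array([[0, 10],
                        [1, 0]])


def total_cost(counts, costs):
    """Sum of the counts weighted element-wise by the cost of each cell."""
    return int((counts * costs).sum())


errors = total_cost(counts, zero_one_cost)    # 5 + 3 errors
weighted = total_cost(counts, cost_matrix)    # 5 * 10 + 3 * 1
print(errors, weighted)
```

Under the 0-1 loss the two kinds of error are indistinguishable, while the cost matrix makes the five false negatives dominate the total.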

# Types of Costs

Classic machine learning algorithms use a constant error cost.

Conditional cost refers to a misclassification cost that depends on the circumstances, that is, on the nature of the individual case or observation. For fraud, the cost depends on how much money is involved in the application; for a medical diagnosis, it depends on the disease and the patient. Costs can also depend on time, as in sensor defect detection, where the time of detection matters.

A potential solution is to expand the classification target by adding one or more classes (for example: healthy, sick and young, sick and elderly; or defect now, defect in a week, defect in a month, no defect).
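
As a rough illustration of expanding the target, the sketch below turns a binary diagnosis plus the patient's age into three classes. The function name, the class labels and the age cut-off of 65 are all hypothetical choices made for this example.

```python
def expanded_target(is_sick, age, elderly_from=65):
    """Map a binary diagnosis plus age to one of three target classes.

    `elderly_from` is a hypothetical cut-off chosen only for illustration.
    """
    if not is_sick:
        return "healthy"
    return "sick_elderly" if age >= elderly_from else "sick_young"


# Hypothetical patients as (is_sick, age) pairs.
patients = [(False, 30), (True, 42), (True, 71)]
labels = [expanded_target(sick, age) for sick, age in patients]
print(labels)
```

Each of the new classes can then be given its own misclassification cost.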

Cost of tests or features, such as acquiring variables from third parties in finance or carrying out tests in medicine.

Cost of teacher or intervention (professional costs), such as fraud investigators in finance or medical professionals in medicine.

Computational cost, such as data storage and the time needed to train models.

Data cost: acquiring the data, labelling the data, etc.

Human-computer cost: the cost of the people involved in acquiring the data and building the models, such as data analysts, engineers and domain experts.

# Obtaining the Costs 

The effectiveness of CSL relies on the supplied cost matrix.

If the costs are set too low, the classifier stays biased towards the majority class and does not find the proper classification boundary.

If the costs are set too high, generalisation may suffer because the classifier can no longer predict the majority class well.

We can obtain the costs from a cost matrix provided by a domain expert, who can estimate the real costs, or we can use a heuristic approach: derive them from the data, from the imbalance ratio, or by optimisation.

We must also take into account the factors that influence the ability of a classifier to identify rare events (small sample size, class separability, within-class sub-clusters).
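
The imbalance-ratio heuristic can be sketched as follows. The labels are hypothetical. The second option uses the inverse-frequency formula, n_samples / (n_classes * count), which is the one scikit-learn documents for class_weight="balanced".

```python
import numpy as np

# Hypothetical labels with a 95:5 imbalance.
y = np.array([0] * 95 + [1] * 5)

counts = np.bincount(y)                       # observations per class
imbalance_ratio = counts.max() / counts.min()

# Option 1: weight the minority class by the imbalance ratio.
ratio_weights = {0: 1.0, 1: float(imbalance_ratio)}

# Option 2: inverse-frequency weights, n_samples / (n_classes * count).
balanced_weights = len(y) / (len(counts) * counts)

print(ratio_weights, balanced_weights)
```

Both options encode the same relative penalty between the classes; they only differ in overall scale. Either is a starting point rather than a real cost, so it is still worth validating the choice on held-out data.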

# Cost Sensitive Approaches

Cost Sensitive Learning (CSL) can be separated into two approaches:

* Direct approaches:
    * Incorporate the misclassification cost into the training of a classifier.
* Meta-learning:
    * Pre-processing (Under-sampling and over-sampling)
    * Post-processing (modify the outputs of a classifier)
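
As one possible illustration of the pre-processing route, the sketch below over-samples each observation in proportion to the cost of misclassifying its class. The helper name and the toy data are hypothetical, and repeating rows deterministically is only one of several ways to resample.

```python
import numpy as np


def oversample_by_cost(X, y, class_cost):
    """Repeat each row as many times as the integer cost of its class."""
    repeats = np.array([class_cost[int(label)] for label in y])
    idx = np.repeat(np.arange(len(y)), repeats)
    return X[idx], y[idx]


X = np.arange(10).reshape(5, 2)      # toy features
y = np.array([0, 0, 0, 0, 1])        # a single minority observation

X_res, y_res = oversample_by_cost(X, y, class_cost={0: 1, 1: 4})
print(np.bincount(y_res))
```

Any cost-insensitive classifier trained on the resampled data then sees the minority class as often as the majority class.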

## Cost Sensitive Learning with Scikit-Learn

There are two ways to include cost in scikit-learn:
* The class_weight parameter, for estimators that support it
* A sample_weight vector, passed when we fit the estimator, with a weight for every single observation

We shouldn't apply both class_weight and sample_weight together, as the final weight of each observation will be the product of the two.
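
The relationship between the two options can be checked on a small example. This is a sketch on synthetic data generated with make_classification (not the dataset used in this notebook), and the weight of 10 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced toy data (an assumption for this illustration).
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9, 0.1],
                           random_state=0)

class_weight = {0: 1, 1: 10}
sample_weight = np.where(y == 1, 10, 1)

# The same penalty expressed once per class and once per observation.
lr_cw = LogisticRegression(class_weight=class_weight, max_iter=1000).fit(X, y)
lr_sw = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)
same_fit = np.allclose(lr_cw.coef_, lr_sw.coef_, atol=1e-4)

# Passing both at once multiplies them: class 1 would get 10 * 10 = 100.
effective = np.array([class_weight[int(c)] for c in y]) * sample_weight
print(same_fit, np.unique(effective))
```

A class-level weight is therefore just a sample weight that is constant within each class, which is why supplying both compounds the penalty.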

In [1]:
from sklearn.utils import all_estimators

In [2]:
estimators = all_estimators(type_filter="classifier")

for name, class_ in estimators:
    try:
        # keep classifiers whose default instance exposes a class_weight attribute
        if hasattr(class_(), "class_weight"):
            print(name)
    except Exception:
        # some estimators cannot be instantiated without arguments; skip them
        pass

DecisionTreeClassifier
ExtraTreeClassifier
ExtraTreesClassifier
LinearSVC
LogisticRegression
LogisticRegressionCV
NuSVC
PassiveAggressiveClassifier
Perceptron
RandomForestClassifier
RidgeClassifier
RidgeClassifierCV
SGDClassifier
SVC


In [5]:
# Import libraries

import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [6]:
def load_data():
    
    df = pd.read_csv("./kdd2004.csv")
    df["target"] = df["target"].map({-1:0,1:1})
    
    return df

In [7]:
df = load_data()

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=["target"], axis=1),
    df["target"],
    test_size=0.33,
    random_state=24)

In [10]:
def run_lr(X_train, X_test, y_train, y_test, class_weight):


    lr = LogisticRegression(penalty='l2',
                            class_weight=class_weight,
                            random_state=24,
                            solver='newton-cg',
                            max_iter=10,
                            n_jobs=-1,
                            )
    lr.fit(X_train,y_train)
    
    y_pred = lr.predict_proba(X_test)[:,1]
    
    print(f"ROC-AUC for the train set: {roc_auc_score(y_train,lr.predict_proba(X_train)[:,1])}")
    print(f"ROC-AUC for the test set: {roc_auc_score(y_test,y_pred)}")

In [11]:
run_lr(X_train, X_test, y_train, y_test,class_weight="balanced")

ROC-AUC for the train set: 0.9751809866633914
ROC-AUC for the test set: 0.9667857356132781


In [12]:
run_lr(X_train, X_test, y_train, y_test,class_weight=None)

ROC-AUC for the train set: 0.9298414285928388
ROC-AUC for the test set: 0.9108892048109519


In [14]:
run_lr(X_train, X_test, y_train, y_test,class_weight={0:1,1:10})

ROC-AUC for the train set: 0.9473086540392941
ROC-AUC for the test set: 0.9301425638772336


In [15]:
run_lr(X_train, X_test, y_train, y_test,class_weight={0:1,1:100})

ROC-AUC for the train set: 0.9769361713077744
ROC-AUC for the test set: 0.9694883258922125


In [16]:
def run_lr(X_train, X_test, y_train, y_test, sample_weight):


    lr = LogisticRegression(penalty='l2',
                            random_state=24,
                            solver='newton-cg',
                            max_iter=10,
                            n_jobs=-1,
                            )
    lr.fit(X_train,y_train,sample_weight=sample_weight)
    
    y_pred = lr.predict_proba(X_test)[:,1]
    
    print(f"ROC-AUC for the train set: {roc_auc_score(y_train,lr.predict_proba(X_train)[:,1])}")
    print(f"ROC-AUC for the test set: {roc_auc_score(y_test,y_pred)}")

In [17]:
run_lr(X_train, X_test, y_train, y_test,sample_weight=None)

ROC-AUC for the train set: 0.9298414285928388
ROC-AUC for the test set: 0.9108892048109519


In [18]:
run_lr(X_train,
       X_test,
       y_train,
       y_test,
       sample_weight=np.where(y_train==1,99,1)) #equivalent to {0:1,1:99}

ROC-AUC for the train set: 0.97664019811729
ROC-AUC for the test set: 0.9699606071850528


In [19]:
run_lr(X_train,
       X_test,
       y_train,
       y_test,
       sample_weight=np.where(y_train==1,200,1)) #equivalent to {0:1,1:200}

ROC-AUC for the train set: 0.9796471066405542
ROC-AUC for the test set: 0.9727835078071077


## Estimating the Cost with Cross-Validation

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [21]:
rf = RandomForestClassifier(n_estimators=50,
                            random_state=24,
                            max_depth=2,
                            n_jobs=-1,
                            class_weight=None
                            )

In [22]:
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 2, 10],
    "class_weight": [None, {0: 1, 1: 10}, {0: 1, 1: 100}, {0: 1, 1: 1000}]
}

In [25]:
grid_search = GridSearchCV(rf,
                           param_grid=param_grid,
                           scoring="roc_auc",
                           cv=2)

In [26]:
grid_search.fit(X_train,y_train)

GridSearchCV(cv=2,
             estimator=RandomForestClassifier(max_depth=2, n_estimators=50,
                                              n_jobs=-1, random_state=24),
             param_grid={'class_weight': [None, {0: 1, 1: 10}, {0: 1, 1: 100},
                                          {0: 1, 1: 1000}],
                         'max_depth': [None, 2, 10],
                         'n_estimators': [10, 50, 100]},
             scoring='roc_auc')

In [27]:
grid_search.best_score_

0.9832796433551112

In [28]:
grid_search.best_params_

{'class_weight': {0: 1, 1: 100}, 'max_depth': 10, 'n_estimators': 100}

In [31]:
grid_search.best_estimator_

RandomForestClassifier(class_weight={0: 1, 1: 100}, max_depth=10, n_jobs=-1,
                       random_state=24)

In [32]:
grid_search.score(X_test,y_test)

0.9911605560675483

# Bayes Conditional Risk

As a recap, given a cost matrix, an observation should be classified into the class that has the minimum expected cost.

The cost of assigning an observation to a certain class is called the expected cost or Bayes risk, and we denote it $R(i|x)$.

* $R(i|x)$ is the expected cost of classifying observation $x$ into class $i$.
* $P(j|x)$ is the probability that observation $x$ belongs to class $j$.
* $C(i,j)$ is the cost of assigning an observation of class $j$ to class $i$.

$$R(i|x) = \sum_j P(j|x) \, C(i,j)$$

An observation should be classified into the class that has the minimum cost, or minimum risk.

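Example: for observation $x_1$, assuming binary classification with 1 being the minority class:

$P(0|x_1) = 0.8$ and $P(1|x_1) = 0.2$ (probabilities of belonging to class 0 and class 1)

$C(1,0) = 1$ and $C(0,1) = 10$ (costs of misclassification), with $C(0,0) = C(1,1) = 0$

$$R(0|x_1) = P(0|x_1) \, C(0,0) + P(1|x_1) \, C(0,1) = 0.8 \times 0 + 0.2 \times 10 = 2$$

$$R(1|x_1) = P(0|x_1) \, C(1,0) + P(1|x_1) \, C(1,1) = 0.8 \times 1 + 0.2 \times 0 = 0.8$$

Since $R(1|x_1) < R(0|x_1)$, the observation is assigned to class 1 even though class 0 is more probable.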
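
The same calculation can be sketched with NumPy, using the probabilities and costs of the example and assuming that correct predictions cost 0. The array layout (rows are predicted classes, columns are true classes) follows the C(i,j) notation.

```python
import numpy as np

# cost[i, j] = C(i, j): cost of predicting class i when the true class is j.
cost = np.array([[0, 10],
                 [1, 0]])

# Posterior probabilities P(0|x) and P(1|x) for the observation in the example.
posterior = np.array([0.8, 0.2])

# R(i|x) = sum_j P(j|x) * C(i, j), computed for every class i at once.
risk = cost @ posterior
predicted_class = int(np.argmin(risk))
print(risk, predicted_class)
```

The risk of predicting class 0 is 2 and the risk of predicting class 1 is 0.8, so the minimum-risk decision is class 1.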

Therefore, assuming correct predictions cost nothing ($C(0,0) = C(1,1) = 0$), the classifier assigns an instance $x$ to the positive class if and only if $R(1|x) \le R(0|x)$, that is:

$$P(0|x) \, C(1,0) \le P(1|x) \, C(0,1)$$

Substituting $P(0|x) = 1 - P(1|x)$ gives:

$$C(1,0) \le P(1|x) \, \big(C(0,1) + C(1,0)\big)$$

So the probability threshold above which a classifier should assign an observation to class 1 is:

$$\frac{C(1,0)}{C(0,1) + C(1,0)} \le P(1|x)$$

In our example above, $\frac{1}{10 + 1} \approx 0.09 \le P(1|x)$.
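
A minimal sketch of applying this threshold to predicted probabilities follows. The function name and the probability values are hypothetical, and correct predictions are assumed to cost 0.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability of class 1 above which predicting 1 minimises expected cost.

    cost_fp is C(1,0) and cost_fn is C(0,1); correct predictions are
    assumed to cost 0.
    """
    return cost_fp / (cost_fp + cost_fn)


threshold = cost_threshold(cost_fp=1, cost_fn=10)   # 1 / 11

# Hypothetical P(1|x) values for three observations.
probabilities = [0.05, 0.2, 0.6]
predictions = [int(p >= threshold) for p in probabilities]
print(threshold, predictions)
```

With these costs, an observation is assigned to class 1 as soon as its probability of belonging to class 1 reaches roughly 0.09, far below the usual 0.5.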

# Meta Cost

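MetaCost is a procedure that makes a cost-insensitive algorithm cost-sensitive. It can be applied to any algorithm, whether it returns probabilities or classes. It wraps the algorithm in a bagging ensemble, where the class probabilities of each observation are estimated as the average probability (or the fraction of votes) across the ensemble. The Bayes optimal prediction, which takes the misclassification costs into account, is then used to re-assign the label of each training observation, and the algorithm is trained one last time on the relabelled data.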
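
The relabelling step at the heart of the procedure can be sketched as below. This is an illustration under stated assumptions, not the code of the metacost package: the helper name is hypothetical, the averaged probabilities are made up, and the cost matrix is read with rows as predicted classes and columns as true classes.

```python
import numpy as np


def relabel_min_cost(avg_proba, cost_matrix):
    """Return, per observation, the class with the lowest expected cost.

    avg_proba[n, j]   : ensemble-averaged estimate of P(j | x_n)
    cost_matrix[i, j] : cost of predicting class i when the true class is j
    """
    expected_cost = avg_proba @ cost_matrix.T   # shape (n_obs, n_classes)
    return expected_cost.argmin(axis=1)


# Hypothetical averaged probabilities from a bagged ensemble.
avg_proba = np.array([[0.995, 0.005],
                      [0.95, 0.05],
                      [0.40, 0.60]])
cost_matrix = np.array([[0, 100],
                        [1, 0]])

new_labels = relabel_min_cost(avg_proba, cost_matrix)
print(new_labels)
```

With a cost of 100 for missing the minority class, the second observation is relabelled as class 1 although its estimated probability of being class 1 is only 0.05.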

In [38]:
from metacost import MetaCost

In [39]:
lr = LogisticRegression(penalty="l2",
                        solver="newton-cg",
                        random_state=24,
                        max_iter=10,
                        n_jobs=-1)

In [49]:
# Baseline: both types of error have the same cost
cost_matrix = np.array([[0,1],[1,0]])
cost_matrix

array([[0, 1],
       [1, 0]])

In [41]:
metacost_ = MetaCost(estimator=lr,
                    cost_matrix = cost_matrix,
                    n_estimators=50,
                    n_samples=None,
                    p=True,
                    q=True)

In [42]:
metacost_.fit(X_train,y_train)

resampling data and training ensemble
Finished training ensemble
evaluating optimal class per observation
Finished re-assigning labels
Training model on new data
Finished training model on data with new labels


In [43]:
y_pred = metacost_.predict_proba(X_test)[:,1]

print(f"ROC-AUC for the train set: {roc_auc_score(y_train,metacost_.predict_proba(X_train)[:,1])}")
print(f"ROC-AUC for the test set: {roc_auc_score(y_test,y_pred)}")

ROC-AUC for the train set: 0.8993202419552555
ROC-AUC for the test set: 0.8834548425250986


In [50]:
# Test with a cost of 1 for misclassifying the majority class and 100 for misclassifying the minority class
cost_matrix = np.array([[0,100],[1,0]])
cost_matrix

array([[  0, 100],
       [  1,   0]])

In [51]:
metacost_ = MetaCost(estimator=lr,
                    cost_matrix = cost_matrix,
                    n_estimators=50,
                    n_samples=None,
                    p=True,
                    q=True)

In [52]:
metacost_.fit(X_train,y_train)

resampling data and training ensemble
Finished training ensemble
evaluating optimal class per observation
Finished re-assigning labels
Training model on new data
Finished training model on data with new labels


In [53]:
y_pred = metacost_.predict_proba(X_test)[:,1]

print(f"ROC-AUC for the train set: {roc_auc_score(y_train,metacost_.predict_proba(X_train)[:,1])}")
print(f"ROC-AUC for the test set: {roc_auc_score(y_test,y_pred)}")

ROC-AUC for the train set: 0.934969494024405
ROC-AUC for the test set: 0.915486776849617


Compared with the equal-cost run, the ROC-AUC on the test set increases from about 0.883 to about 0.915.