## Cost-Sensitive Learning

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a model. It is closely related to imbalanced learning, which is concerned with classification on datasets with a skewed class distribution.

In other words, it focuses on learning and using models on data where different predictions carry uneven penalties or costs.


#### Not All Classification Errors Are Equal

Most machine learning algorithms designed for classification assume that there is an equal number of examples for each observed class. This is not always the case in practice, and datasets that have a skewed class distribution are referred to as imbalanced classification problems.

In addition to assuming that the class distribution is balanced, most machine learning algorithms also assume that all prediction errors made by a classifier (so-called misclassifications) are equally costly. This is typically not the case for binary classification problems, especially those that have an imbalanced class distribution.

Machine learning algorithms that treat each type of misclassification error as the same are unable to meet the needs of these types of problems. As such, both the underrepresentation of the minority class in the training data and the increased importance of correctly identifying examples from the minority class make imbalanced classification one of the most challenging problems in applied machine learning.

Traditionally, machine learning algorithms are trained on a dataset and seek to minimize error. Fitting a model on data solves an optimization problem where we explicitly seek to minimize error. A range of functions can be used to calculate the error of a model on training data, and the more general term is loss. We seek to minimize the loss of a model on the training data, which is the same as talking about error minimization.


- Error Minimization:  The conventional goal when training a machine learning algorithm is to minimize the error of the model on a training dataset.

- Cost:  The penalty associated with an incorrect prediction.

- Cost Minimization:  The goal of cost-sensitive learning is to minimize the cost of a model on a training dataset.

- Cost Matrix:  A matrix that assigns a cost to each cell in the confusion matrix.

"Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly."
 - The Foundations of Cost-Sensitive Learning, 2001


The values of the cost matrix must be carefully defined. Like the choice of error function for traditional machine learning models, the choice of costs or cost function will determine the quality and utility of the model that is fit on the training data.
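To make the idea of a cost matrix concrete, here is a minimal sketch that multiplies a confusion matrix by a cost matrix cell by cell and sums the result. The labels, the predictions, and the cost values (1 for a false positive, 10 for a false negative) are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels and predictions (0 = majority class, 1 = minority class)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# assumed cost matrix with the same layout as the confusion matrix:
# correct predictions cost 0, a false positive costs 1, a false negative costs 10
cost_matrix = np.array([[0, 1],
                        [10, 0]])

# total cost is the sum of (count * cost) over every cell
total_cost = int(np.sum(cm * cost_matrix))
print(cm)
print('Total cost: %d' % total_cost)
```

Two models with the same error rate can have very different total costs under such a matrix, which is why cost, rather than error, becomes the quantity to minimize.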

There are perhaps three main groups of cost-sensitive methods that are most relevant for imbalanced learning; they are:

- Cost-Sensitive Resampling
- Cost-Sensitive Algorithms
- Cost-Sensitive Ensembles 

#### Cost-Sensitive Resampling

Data sampling is a technique that can be used for cost-sensitive learning directly. Instead of sampling with a focus on balancing the skewed class distribution, the focus is on changing the composition of the training dataset to meet the expectations of the cost matrix. This might involve directly sampling the data distribution or using a method to weight examples in the dataset. Such methods may be referred to as cost-proportionate weighting of the training dataset or cost-proportionate sampling.
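The two variants can be sketched as follows. This is only an illustration under stated assumptions: the per-class costs (1 and 100) are made up, the decision tree is an arbitrary choice of model, and sampling with replacement in proportion to cost is just one simple way to realize cost-proportionate sampling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic dataset with roughly a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=2)

# assumed misclassification cost for each example, looked up from its class label
class_cost = {0: 1.0, 1: 100.0}
costs = np.array([class_cost[label] for label in y])

# option 1: cost-proportionate weighting, passing the costs as sample weights
model = DecisionTreeClassifier(max_depth=3, random_state=1)
model.fit(X, y, sample_weight=costs)

# option 2: cost-proportionate sampling, drawing examples with probability
# proportional to their cost (with replacement)
rng = np.random.default_rng(1)
idx = rng.choice(len(y), size=len(y), replace=True, p=costs / costs.sum())
X_res, y_res = X[idx], y[idx]

print('Minority fraction before sampling: %.3f' % y.mean())
print('Minority fraction after sampling:  %.3f' % y_res.mean())
```

With these assumed costs, the resampled dataset contains the two classes in roughly equal proportion, so a standard learner trained on it behaves as if minority-class errors were about 100 times more expensive.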

#### Cost-Sensitive Algorithms

Machine learning algorithms are rarely developed specifically for cost-sensitive learning. Instead, the wealth of existing machine learning algorithms can be modified to make use of the cost matrix.

Many such algorithm-specific augmentations have been proposed for popular algorithms, like decision trees and support vector machines.

The scikit-learn Python machine learning library provides examples of these cost-sensitive extensions via the `class_weight` argument on the following classifiers:

- SVC
- DecisionTreeClassifier
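As a brief sketch of how the `class_weight` argument is passed to these two classifiers, the example below evaluates each on a small synthetic imbalanced dataset. The dataset size, the 1:20 weighting for the tree, and the 5-fold evaluation are assumptions chosen to keep the example fast, not recommended settings.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# small synthetic dataset with roughly a 1:20 class distribution
X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.95], flip_y=0,
                           random_state=2)

# class_weight accepts either the string 'balanced' or an explicit dictionary
models = {
    'SVC': SVC(gamma='scale', class_weight='balanced'),
    'DecisionTreeClassifier': DecisionTreeClassifier(class_weight={0: 1, 1: 20},
                                                     random_state=1),
}

# evaluate each cost-sensitive model with stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
    results[name] = mean(scores)
    print('%s mean ROC AUC: %.3f' % (name, results[name]))
```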

Another more general approach to modifying existing algorithms is to use the costs as a penalty for misclassification when the algorithms are trained. Given that most machine learning algorithms are trained to minimize error, the cost of misclassification is added to the error or used to weight the error during the training process.

The scikit-learn library also provides examples of these cost-sensitive extensions via the `class_weight` argument on the following classifiers:

- LogisticRegression
- RidgeClassifier

The Keras Python deep learning library also provides access to this kind of cost-sensitive augmentation for neural networks via the `class_weight` argument on the `fit()` function when training models.

#### Cost-Sensitive Ensembles

pass


### Cost-Sensitive Logistic Regression

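Logistic regression can be made cost-sensitive by weighting the loss for each class. The weighting can penalize the model less for errors made on examples from the majority class and penalize the model more for errors made on examples from the minority class. The result is a version of logistic regression that performs better on imbalanced classification tasks, generally referred to as cost-sensitive or weighted logistic regression.

As a baseline, the following example evaluates standard (unweighted) logistic regression on a synthetic dataset with roughly a 1:100 class distribution.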

```python
# fit a logistic regression model on an imbalanced classification dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

```
Mean ROC AUC: 0.985
```


#### Logistic Regression for Imbalanced Classification

Logistic regression is an effective model for binary classification tasks, although by default it is not effective at imbalanced classification. Logistic regression can be modified to be better suited for imbalanced classification. The coefficients of the logistic regression algorithm are fit using an optimization algorithm that minimizes the negative log likelihood (loss) for the model on the training dataset.

$$
\min \sum_{i=1}^{n} -\left( \log(\hat{y}_i) \times y_i + \log(1 - \hat{y}_i) \times (1 - y_i) \right)
$$

This involves the repeated use of the model to make predictions followed by an adaptation of the coefficients in a direction that reduces the loss of the model. The calculation of the loss for a given set of coefficients can be modified to take the class balance into account. By default, the errors for each class may be considered to have the same weighting, say 1.0. These weightings can be adjusted based on the importance of each class.


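$$
\min \sum_{i=1}^{n} -\left( w_1 \times \log(\hat{y}_i) \times y_i + w_0 \times \log(1 - \hat{y}_i) \times (1 - y_i) \right)
$$

Here $w_1$ is the weight for the positive class (the term that is active when $y_i = 1$) and $w_0$ is the weight for the negative class (the term that is active when $y_i = 0$).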

The weighting is applied to the loss so that smaller weight values result in a smaller error value and, in turn, smaller updates to the model coefficients.

A larger weight value results in a larger error calculation and, in turn, larger updates to the model coefficients.
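The effect of the weights on the loss can be sketched directly in NumPy. The function name `weighted_log_loss` and its parameters `w_pos` (weight on the term for positive examples) and `w_neg` (weight on the term for negative examples) are hypothetical names introduced for this illustration, as are the example labels and predicted probabilities.

```python
import numpy as np

def weighted_log_loss(y_true, y_hat, w_pos=1.0, w_neg=1.0, eps=1e-15):
    """Sum of class-weighted negative log likelihood terms (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # clip probabilities away from 0 and 1 to avoid taking log(0)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    # term that is active for positive examples (y = 1)
    pos_term = w_pos * y_true * np.log(y_hat)
    # term that is active for negative examples (y = 0)
    neg_term = w_neg * (1 - y_true) * np.log(1 - y_hat)
    return float(-np.sum(pos_term + neg_term))

# three majority-class examples predicted well, one minority-class example predicted poorly
y_true = [0, 0, 0, 1]
y_hat = [0.1, 0.2, 0.1, 0.3]

unweighted = weighted_log_loss(y_true, y_hat)
weighted = weighted_log_loss(y_true, y_hat, w_pos=100.0, w_neg=1.0)
print('Unweighted loss: %.3f' % unweighted)
print('Weighted loss:   %.3f' % weighted)
```

With equal weights the single poorly predicted minority example contributes only part of the total loss; with a large weight on the positive term it dominates, so an optimizer would prioritize correcting it.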

#### Weighted Logistic Regression with Scikit-Learn

The `LogisticRegression` class provides the `class_weight` argument, which can be specified as a model hyperparameter. The `class_weight` is a dictionary that defines each class label (e.g. 0 and 1) and the weighting to apply in the calculation of the negative log likelihood when fitting the model.

The class weighting can be defined in multiple ways; for example:

- Domain expertise, determined by talking to subject matter experts.

- Tuning, determined by a hyperparameter search such as a grid search.

- Heuristic, specified using a general best practice.

A best practice for using the class weighting is to use the inverse of the class distribution present in the training dataset. For example, the class distribution of the synthetic dataset used here is a 1:100 ratio for the minority class to the majority class. Inverting this ratio gives a weighting of 1 for the majority class and 100 for the minority class.
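The heuristic can be computed from the data rather than written by hand. The sketch below uses scikit-learn's `compute_class_weight` helper in `'balanced'` mode and checks it against the formula `n_samples / (n_classes * np.bincount(y))`; the absolute values differ from 1 and 100, but their ratio is the inverse of the class distribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight

# same synthetic dataset as in the examples in this section
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=2)

classes = np.unique(y)
counts = np.bincount(y)

# weights chosen by the 'balanced' heuristic
heuristic = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# the same weights computed manually from the class counts
manual = len(y) / (len(classes) * counts)

print('Class counts:', dict(zip(classes.tolist(), counts.tolist())))
print('Balanced weights:', dict(zip(classes.tolist(), heuristic.tolist())))
print('Minority-to-majority weight ratio: %.1f' % (heuristic[1] / heuristic[0]))
```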

We can evaluate the logistic regression algorithm with a class weighting using the same evaluation procedure defined in the previous section. We would expect the class-weighted version of logistic regression to perform better than the standard version without any class weighting.

The example below reports the mean ROC AUC score, in this case showing a better score than the unweighted version of logistic regression: 0.989 as compared to 0.985.

```python
# weighted logistic regression model on an imbalanced classification dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=2)
# define model (the 1:100 majority-to-minority weighting expressed as fractions)
weights = {0: 0.01, 1: 1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

```
Mean ROC AUC: 0.989
```


We can also apply the heuristic class weighting directly with the `LogisticRegression` class by setting the `class_weight` argument to `'balanced'`. For example:

```python
# weighted logistic regression for class imbalance with heuristic weights
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=2)
# define model, setting the class weight to 'balanced'
model = LogisticRegression(solver='lbfgs', class_weight='balanced')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

```
Mean ROC AUC: 0.989
```


#### Grid Search Weighted Logistic Regression


Using a class weighting that is the inverse ratio of the training data is just a heuristic. It is possible that better performance can be achieved with a different class weighting, and this too will depend on the choice of performance metric used to evaluate the model. In this section, we will grid search a range of different class weightings for weighted logistic regression and discover which results in the best ROC AUC score.


In the results below, we can see that the 1:100 majority-to-minority class weighting achieved the best mean ROC AUC score. This matches the configuration for the general heuristic. It might be interesting to explore even more severe class weightings to see their effect on the mean ROC AUC score.


```python
# grid search class weights with logistic regression for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# define grid of class weightings, from favoring the majority to favoring the minority
balance = [{0: 100, 1: 1}, {0: 10, 1: 1}, {0: 1, 1: 1}, {0: 1, 1: 10}, {0: 1, 1: 100}]
param_grid = dict(class_weight=balance)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)
# report the best configuration
print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean_score, stdev, param in zip(means, stds, params):
    print('%f (%f) with: %r' % (mean_score, stdev, param))
```

```
Best: 0.988943 using {'class_weight': {0: 1, 1: 100}}
0.982148 (0.017020) with: {'class_weight': {0: 100, 1: 1}}
0.983465 (0.015555) with: {'class_weight': {0: 10, 1: 1}}
0.985242 (0.013456) with: {'class_weight': {0: 1, 1: 1}}
0.987973 (0.009846) with: {'class_weight': {0: 1, 1: 10}}
0.988943 (0.006354) with: {'class_weight': {0: 1, 1: 100}}
```
