# Cost-Sensitive Learning

In [None]:
%conda install xgboost

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/flynn/anaconda3

  added / updated specs:
    - xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0           9 KB
    conda-4.11.0               |   py39h06a4308_0        14.4 MB
    libxgboost-1.5.0           |       h295c915_1         2.0 MB
    py-xgboost-1.5.0           |   py39h06a4308_1         163 KB
    xgboost-1.5.0              |   py39h06a4308_1          25 KB
    ------------------------------------------------------------
                                           Total:        16.6 MB

The following NEW packages will be INSTALLED:

  _py-xgboost-mutex  pkgs/main/linux-64::_py-xgboost-mutex-2.0-cpu_0
  libxgboost         pkgs/main/linux-64::libxgboost-1.5.0-h295c915_1
  py-xgboost         p

## Grid Search Weighted Logistic Regression for Imbalanced Classification

Using a class weighting that is the inverse of the class distribution in the training data is only a heuristic. A different class weighting may give better performance, and which weighting is best also depends on the performance metric used to evaluate the model. In this section, we grid search a range of class weightings for weighted logistic regression and find which one results in the best ROC AUC score.
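To make the heuristic concrete, here is a minimal sketch of inverse-frequency class weights in plain Python. The helper name `inverse_frequency_weights` and the toy label list are assumptions for illustration only; the formula `n_samples / (n_classes * class_count)` is the one scikit-learn documents for `class_weight='balanced'`.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for the given labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# toy labels with a 99:1 skew (hypothetical data, not the dataset generated below)
y_toy = [0] * 990 + [1] * 10
weights = inverse_frequency_weights(y_toy)
print(weights)
```

With a 99:1 split, the minority class ends up weighted 99 times more heavily than the majority class. The grid search then checks whether some other ratio scores better.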

In [1]:
# grid search class weights with logistic regression for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC  # only needed for the alternative model below
# generate a synthetic dataset with a 99:1 class imbalance
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# model = SVC(gamma='scale')
# define grid: each dict maps class label -> weight
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
param_grid = dict(class_weight=balance)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv,
    scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)
# report the best configuration
print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
# report all configurations (mean and standard deviation of ROC AUC across folds)
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('%f (%f) with: %r' % (mean, stdev, param))

Best: 0.988943 using {'class_weight': {0: 1, 1: 100}}
0.982148 (0.017020) with: {'class_weight': {0: 100, 1: 1}}
0.983465 (0.015555) with: {'class_weight': {0: 10, 1: 1}}
0.985242 (0.013456) with: {'class_weight': {0: 1, 1: 1}}
0.987973 (0.009846) with: {'class_weight': {0: 1, 1: 10}}
0.988943 (0.006354) with: {'class_weight': {0: 1, 1: 100}}


## Cost-Sensitive Gradient Boosting with XGBoost

In [2]:
# grid search positive class weights with xgboost for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
# generate a synthetic dataset with a 99:1 class imbalance
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=2, weights=[0.99], flip_y=0, random_state=7)
# define model
model = XGBClassifier()
# define grid of candidate weights for the positive (minority) class
weights = [1, 10, 25, 50, 75, 99, 100, 1000]
param_grid = dict(scale_pos_weight=weights)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv,
    scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)
# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations (mean and standard deviation of ROC AUC across folds)
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.960155 using {'scale_pos_weight': 1000}
0.953721 (0.035950) with: {'scale_pos_weight': 1}
0.958254 (0.028362) with: {'scale_pos_weight': 10}
0.957892 (0.027283) with: {'scale_pos_weight': 25}
0.959157 (0.027430) with: {'scale_pos_weight': 50}
0.959241 (0.028015) with: {'scale_pos_weight': 75}
0.959305 (0.028286) with: {'scale_pos_weight': 99}
0.959505 (0.028213) with: {'scale_pos_weight': 100}
0.960155 (0.028721) with: {'scale_pos_weight': 1000}
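
For context on the grid values: the XGBoost documentation suggests, as a starting point, setting `scale_pos_weight` to the number of negative examples divided by the number of positive examples. A minimal sketch of that calculation follows. The helper name `heuristic_scale_pos_weight` is hypothetical and not part of XGBoost, and the toy labels are an assumption standing in for a 99:1 dataset.

```python
def heuristic_scale_pos_weight(labels, positive=1):
    """Return count(negative) / count(positive) for a binary label sequence."""
    n_pos = sum(1 for label in labels if label == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos

# toy labels with a 99:1 skew (hypothetical data, not the generated dataset)
y_toy = [0] * 9900 + [1] * 100
spw = heuristic_scale_pos_weight(y_toy)
print(spw)
```

The value 99 in the grid corresponds to this heuristic for a 99:1 split. In the results above, its mean ROC AUC differs from the best value (1000) by far less than one standard deviation, so the grid search shows no clear winner among the larger weights.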
