## Estimating the Cost with Cross-Validation

We mentioned that there are three ways of estimating the cost:

- A domain expert provides the cost
- Balance ratio (we did this in the previous notebook)
- Cross-validation: find the cost as a hyperparameter

In this notebook, we will find the cost with a hyperparameter search and cross-validation. The cost is passed to the model through the `class_weight` parameter.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [2]:
# load data
# sample only 10,000 observations to speed up the computation

data = pd.read_csv('../kdd2004.csv').sample(10000)

data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65,66,67,68,69,70,71,72,73,target
125847,54.11,29.11,1.34,31.5,-46.0,1400.7,-1.87,1.64,13.5,-85.5,...,714.5,0.06,-0.08,-21.0,-50.0,145.2,1.59,-0.34,-1.64,-1
144780,60.18,27.27,-0.47,-8.5,-12.5,895.8,0.04,-0.28,-1.0,-65.5,...,927.2,-1.96,0.75,0.0,-40.0,420.5,-0.93,0.22,0.44,-1
44247,55.35,25.41,0.95,-10.0,-22.0,1463.0,0.86,0.61,11.5,-85.0,...,1396.1,1.31,1.01,0.0,-83.0,718.1,-0.09,0.55,0.59,-1
80256,63.67,25.48,-0.09,-9.0,81.5,3950.4,-1.05,-0.82,-13.5,-67.0,...,3774.7,0.55,0.81,3.0,-69.0,928.9,-0.44,0.24,-0.38,-1
71866,37.16,25.61,-0.44,-46.5,83.0,1802.5,1.37,-0.25,-19.0,-59.5,...,2199.2,1.58,0.93,-4.0,-125.0,57.0,1.15,-0.08,-0.45,-1


In [3]:
# imbalanced target

data.target.value_counts() / len(data)

-1    0.9919
 1    0.0081
Name: target, dtype: float64
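
For comparison with the balance-ratio approach from the previous notebook, the proportions printed above can be turned into a single cost. This is only a small arithmetic sketch: the two fractions are copied by hand from the output, and the variable names are illustrative.

```python
# class proportions copied from the value_counts() output above
majority_fraction = 0.9919  # target == -1
minority_fraction = 0.0081  # target == 1

# balance ratio: how many majority observations there are per minority one
balance_ratio = majority_fraction / minority_fraction

print(round(balance_ratio, 1))
```

The ratio is a little above 120, so the largest minority weight tried in the grid search later in this notebook (100) is of the same order of magnitude.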

In [4]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((7000, 74), (3000, 74))
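
With fewer than 1% positives, a purely random split can leave the train and test sets with slightly different class proportions. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical labels (`y_demo`, 1% positives) instead of the KDD data, just to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical labels: 100 positives (1) and 9,900 negatives (-1)
y_demo = np.array([1] * 100 + [-1] * 9900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

# stratify=y_demo keeps the class proportions equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,
    random_state=0,
    stratify=y_demo)

# fraction of positives in each set
print((y_tr == 1).mean(), (y_te == 1).mean())
```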

In [5]:
# set up the initial random forest
# (n_estimators, max_depth and class_weight are overridden by the grid search)

rf = RandomForestClassifier(n_estimators=50,
                            random_state=39,
                            max_depth=2,
                            n_jobs=4,
                            class_weight=None)

In [6]:
# set up the parameter search grid,
# including the class weight (the misclassification cost per class)

param_grid = {
  'n_estimators': [10, 50, 100],
  'max_depth': [None, 2, 3],
  'class_weight': [None, {-1:1, 1:10}, {-1:1, 1:100}],
}
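
To get a feeling for how much work this grid implies, the combinations can be enumerated with `ParameterGrid` from scikit-learn. This is a minimal sketch: the grid is repeated here so the snippet runs on its own, and the fit count assumes the 2-fold cross-validation and the default `refit=True` used in the search below.

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the cell above, repeated so the snippet is self-contained
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 2, 3],
    'class_weight': [None, {-1: 1, 1: 10}, {-1: 1, 1: 100}],
}

n_combinations = len(ParameterGrid(param_grid))  # 3 * 3 * 3

# each combination is fitted once per fold, plus one final refit
# of the best combination on the whole training set
n_folds = 2
n_fits = n_combinations * n_folds + 1

print(n_combinations, n_fits)
```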

In [7]:
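# grid search over the cost and the tree parameters,
# scored by ROC-AUC with 2-fold cross-validation
search = GridSearchCV(estimator=rf,
                      scoring='roc_auc',
                      param_grid=param_grid,
                      cv=2,
                      ).fit(X_train, y_train)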
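
Beyond the single best combination, the `cv_results_` attribute of a fitted `GridSearchCV` shows how each cost performed. The sketch below is an illustration on a synthetic dataset from `make_classification` (an assumption, not the KDD data), with a reduced grid named `demo_grid` so that it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the notebook's data: roughly 5% positives
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95], random_state=0)
y = np.where(y == 1, 1, -1)  # use the same -1 / 1 labels as the notebook

# reduced grid: only the cost varies
demo_grid = {'class_weight': [None, {-1: 1, 1: 10}, {-1: 1, 1: 100}]}

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(
        n_estimators=20, max_depth=2, random_state=39),
    scoring='roc_auc',
    param_grid=demo_grid,
    cv=2,
).fit(X, y)

# one row per parameter combination
results = pd.DataFrame(demo_search.cv_results_)

# dicts are awkward to display and sort, so keep a string version of the cost
results['class_weight'] = results['params'].apply(
    lambda p: str(p['class_weight']))

summary = results[
    ['class_weight', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')

print(summary)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the differences between costs are larger than the fold-to-fold variation.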

In [8]:
# best mean cross-validated ROC-AUC
search.best_score_

0.9768145161290323

In [9]:
search.best_params_

{'class_weight': {-1: 1, 1: 100}, 'max_depth': 2, 'n_estimators': 100}

In [10]:
search.best_estimator_

RandomForestClassifier(class_weight={-1: 1, 1: 100}, max_depth=2, n_jobs=4,
                       random_state=39)

In [11]:
# ROC-AUC of the refitted best model on the test set
search.score(X_test, y_test)

0.9949042016806723
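
Because the search was set up with `scoring='roc_auc'`, `search.score` returns the ROC-AUC rather than the accuracy. The sketch below checks this by computing the same number with `roc_auc_score`. It uses synthetic data and a reduced grid (both assumptions made to keep the example fast), not the KDD data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the notebook's data: roughly 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0)
y = np.where(y == 1, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(
        n_estimators=20, max_depth=2, random_state=39),
    scoring='roc_auc',
    param_grid={'class_weight': [None, {-1: 1, 1: 10}]},
    cv=2,
).fit(X_tr, y_tr)

# classes_ is sorted as [-1, 1], so column 1 holds the probability of label 1
positive_proba = demo_search.best_estimator_.predict_proba(X_te)[:, 1]

manual_auc = roc_auc_score(y_te, positive_proba)
scorer_auc = demo_search.score(X_te, y_te)

print(manual_auc, scorer_auc)
```

With about 0.8% positives, a test set of 3,000 rows contains only around two dozen positive observations, so a gap between the cross-validated and the test ROC-AUC is to be expected from sampling noise alone.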

**HOMEWORK**

Try other machine learning algorithms and other datasets available in imbalanced-learn.
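
As one possible starting point, the same search can be run with a logistic regression instead of a random forest. This is a sketch under assumptions: it uses synthetic data so that it runs without downloads, and the pipeline and grid names are illustrative. To use a real dataset, the synthetic `X` and `y` could be replaced with one of the datasets returned by `fetch_datasets` from `imblearn.datasets`, which downloads the data on first use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for an imbalanced dataset: roughly 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0)
y = np.where(y == 1, 1, -1)

# scale the features, then fit a logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# the cost is again searched as a hyperparameter;
# 'balanced' lets scikit-learn derive the weights from the class frequencies
logit_grid = {
    'logisticregression__class_weight': [
        None, 'balanced', {-1: 1, 1: 10}, {-1: 1, 1: 100}],
}

logit_search = GridSearchCV(
    estimator=pipe,
    scoring='roc_auc',
    param_grid=logit_grid,
    cv=2,
).fit(X, y)

print(logit_search.best_params_, logit_search.best_score_)
```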