## Estimating the Cost with Cross-Validation

We mentioned that there are three ways of estimating the misclassification cost:

- Domain Expert provides the cost
- Balance Ratio (which we covered in our previous notebook)
- Cross-validation: find cost as hyper-parameter

Here, we will find the cost with a hyperparameter search and cross-validation, using a random forest model.

In [1]:
# import libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

In [2]:
# get data
df = pd.read_csv('../kdd2004.csv').sample(10000)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65,66,67,68,69,70,71,72,73,target
18443,62.5,28.1,-0.98,-28.0,23.0,1631.1,0.88,0.69,-0.5,-95.5,...,1469.0,0.08,2.85,-9.0,-106.0,328.7,1.14,0.34,0.8,-1
135939,79.08,22.73,0.18,11.0,73.0,2320.0,0.56,0.59,18.5,-74.0,...,2740.9,0.13,-0.02,-9.0,-60.0,734.2,-0.07,0.04,-0.37,-1
22707,63.81,19.4,-1.14,-22.0,317.0,4910.1,0.78,-0.58,-6.0,-44.0,...,7001.3,1.56,0.74,7.0,-29.0,67.0,1.31,0.4,0.43,-1
75422,80.95,30.59,-0.42,3.0,0.0,858.0,-0.17,3.66,26.0,-96.0,...,561.0,0.07,1.38,-1.0,-41.0,470.7,-1.31,0.46,0.81,-1
102072,52.42,26.15,-0.04,-8.0,19.0,949.9,0.9,0.95,21.0,-67.0,...,1647.0,-3.0,3.23,0.0,-28.0,-32.0,1.32,0.47,0.51,-1


In [4]:
# balance ratio
df['target'].value_counts()/len(df['target'])

-1    0.9894
 1    0.0106
Name: target, dtype: float64

In [9]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis = 1),
                                                   df['target'],
                                                   test_size=0.3,
                                                   random_state=39)
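With only about 1% positives, a purely random split can leave the two partitions with slightly different class ratios. A possible variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses made-up labels with the same proportions as the sample above (9894 negatives and 106 positives, counts assumed from the printed ratio) and a dummy feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with the class proportions printed above (assumed counts)
y = np.array([-1] * 9894 + [1] * 106)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-column feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=39
)

# Fraction of positives in each partition; stratification keeps them close
train_pos_rate = (y_tr == 1).mean()
test_pos_rate = (y_te == 1).mean()
print(f"train: {train_pos_rate:.4f}, test: {test_pos_rate:.4f}")
```

Whether stratification changes the results on the real data is not something this sketch shows; it only illustrates the option.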

In [5]:
# let's define a random forest model

rf = RandomForestClassifier(n_estimators=10, random_state=0, n_jobs=2)

In [8]:
# let's define grid_params for the grid search

grid_params = {
    'n_estimators': [20, 50, 100],
    'max_depth': [None, 2, 3, 4],
    'class_weight': [None, 'balanced', {-1: 1, 1: 10}, {-1: 1, 1: 100}],
}
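To see where `'balanced'` sits relative to the explicit cost dictionaries in the grid, it helps to compute the weights it implies. scikit-learn documents the balanced heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to labels rebuilt from the balance ratio printed earlier (9894 negatives, 106 positives, assumed counts for the full 10,000-row sample); the helper name `balanced_weights` is hypothetical.

```python
import numpy as np

def balanced_weights(y):
    """Hypothetical helper: per-class weight = n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Labels rebuilt from the printed balance ratio (assumed counts)
y = np.array([-1] * 9894 + [1] * 106)

weights = balanced_weights(y)
ratio = weights[1] / weights[-1]  # cost of a positive relative to a negative
print(weights)
print(f"minority/majority weight ratio: {ratio:.1f}")
```

Under these assumed counts the ratio is 9894/106, roughly 93, so `'balanced'` falls between the `{-1: 1, 1: 10}` and `{-1: 1, 1: 100}` entries. Inside the grid search the weights are computed on each training fold, so the exact values will differ slightly.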

In [12]:
# fit the grid search model

model = GridSearchCV(rf,
                     param_grid=grid_params,
                     n_jobs=2,
                     scoring='roc_auc',  # we want to optimize based on roc_auc score
                     cv=3)   # 3-fold cross validation

model.fit(X_train,y_train)

GridSearchCV(cv=3,
             estimator=RandomForestClassifier(n_estimators=10, n_jobs=2,
                                              random_state=0),
             n_jobs=2,
             param_grid={'class_weight': [None, 'balanced', {-1: 1, 1: 10},
                                          {-1: 1, 1: 100}],
                         'max_depth': [None, 2, 3, 4],
                         'n_estimators': [20, 50, 100]},
             scoring='roc_auc')

In [13]:
model.best_params_

{'class_weight': 'balanced', 'max_depth': 3, 'n_estimators': 100}

In [14]:
model.best_score_

0.9718215898371372

- So, of the hyperparameter combinations covered by the grid search, this one gives the best performance on this data, with the mean cross-validated ROC-AUC shown above.
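The best combination alone does not show how much the choice of cost mattered. One way to check is to look at `cv_results_`, which stores the mean and standard deviation of the score for every candidate. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption standing in for the kdd2004 sample, with labels 0/1 instead of -1/1) and a reduced grid over `class_weight` only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (assumption: stands in for the real dataset)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0),
    param_grid={'class_weight': [None, 'balanced', {0: 1, 1: 10}, {0: 1, 1: 100}]},
    scoring='roc_auc',
    cv=3,
)
search.fit(X, y)

# One row per candidate, sorted so the best-ranked cost comes first
results = pd.DataFrame(search.cv_results_)
columns = ['param_class_weight', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')
print(summary)
```

If the mean scores of several candidates differ by less than their standard deviations, the search has not clearly separated those costs.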

In [20]:
# let's compare with logistic regression (LogisticRegressionCV also tunes C internally)
log = LogisticRegressionCV(random_state=0,
                           cv=3)

In [21]:
par_grid = {
    'max_iter' : [10, 50, 100],
    'class_weight' : [None, 'balanced',{-1:1,1:10},{-1:1,1:100}]
}

In [23]:
log_model = GridSearchCV(log,
                         param_grid=par_grid,
                         n_jobs=2,
                         scoring='roc_auc',  # we want to optimize based on roc_auc score
                         cv=3)   # 3-fold cross validation

log_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

GridSearchCV(cv=3, estimator=LogisticRegressionCV(cv=3, random_state=0),
             n_jobs=2,
             param_grid={'class_weight': [None, 'balanced', {-1: 1, 1: 10},
                                          {-1: 1, 1: 100}],
                         'max_iter': [10, 50, 100]},
             scoring='roc_auc')
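The convergence warnings above suggest either raising `max_iter` or scaling the data. A possible way to do the latter inside the search is to wrap the scaler and the classifier in a pipeline, so that scaling is fitted on each training fold only. The sketch below is a minimal illustration under several assumptions: synthetic data from `make_classification` with labels 0/1, features rescaled by hand to mimic unscaled columns, and a plain `LogisticRegression` swapped in for `LogisticRegressionCV` to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption: stands in for the real dataset)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)
# Spread the columns over very different scales, as in raw tabular data
X = X * np.logspace(0, 3, X.shape[1])

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=1000, random_state=0))

scaled_search = GridSearchCV(
    pipe,
    param_grid={'logisticregression__class_weight':
                [None, 'balanced', {0: 1, 1: 10}, {0: 1, 1: 100}]},
    scoring='roc_auc',
    cv=3,
)
scaled_search.fit(X, y)

print(scaled_search.best_params_)
print(scaled_search.best_score_)
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid key is `logisticregression__class_weight`.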

In [24]:
log_model.best_params_

{'class_weight': 'balanced', 'max_iter': 100}

In [25]:
log_model.best_score_

0.9611650544504017

- The random forest reaches a higher cross-validated ROC-AUC (about 0.972) than the logistic regression (about 0.961).
- We can repeat this on other datasets and with different hyperparameter grids.
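Both scores above are cross-validation averages computed on the training portion. A natural follow-up is to score the tuned model on the held-out test set with `roc_auc_score`. The sketch below shows the pattern on synthetic data (an assumption, with labels 0/1 rather than -1/1) and a reduced grid; it does not reproduce the numbers from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (assumption: stands in for the real dataset)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=39
)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0),
    param_grid={'class_weight': [None, 'balanced']},
    scoring='roc_auc',
    cv=3,
)
search.fit(X_tr, y_tr)

# With the default refit=True, the best candidate is refitted on all of X_tr,
# so the search object can predict directly. Column 1 is the positive class.
test_scores = search.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_scores)

print(f"cross-validated ROC-AUC: {search.best_score_:.3f}")
print(f"held-out test ROC-AUC:   {test_auc:.3f}")
```

A large gap between the two numbers would suggest that the cross-validated score is optimistic for the chosen hyperparameters.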