# Credit Card Fraud Detection - Model Parameter Tuning

Coded by Emile Badran in April/2018

This notebook is part of the "Credit Card Fraud Detection" project. It contains the RandomizedSearchCV code and outputs that were used to tune the parameters of the machine learning models. See the final notebook for the outputs and conclusion.

**Acknowledgments**

The dataset used in this notebook was collected and analysed for research on big data mining and fraud detection by the Machine Learning Group of the Université Libre de Bruxelles (http://mlg.ulb.ac.be). The data can be downloaded from https://www.kaggle.com/mlg-ulb/creditcardfraud/data

Authors of the original paper are: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. *Calibrating Probability with Undersampling for Unbalanced Classification.* In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

In [1]:
import warnings
warnings.filterwarnings('ignore')

import time
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline

import multiprocessing
from imblearn.over_sampling import RandomOverSampler 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC
from sklearn import ensemble

%env JOBLIB_TEMP_FOLDER=/tmp

env: JOBLIB_TEMP_FOLDER=/tmp


### Importing and inspecting the data set

In [2]:
# This notebook runs on Google Cloud

# Import Google Cloud-related libraries
import google.datalab.storage as storage
from io import BytesIO

# Define the storage bucket name and data file variable
mybucket = storage.Bucket('la-data')
data_file = mybucket.object('creditcard.csv')

# Create a readable URI object with the data in binary format
uri = data_file.uri
%gcs read --object $uri --variable data

# Read the binary data as CSV into a Pandas data frame
raw_data = pd.read_csv(BytesIO(data))
raw_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [3]:
# Inspect the number of rows and columns in the raw_data data frame
raw_data.shape

(284807, 31)

#### Column metadata:
- Time: Number of seconds elapsed between each transaction and the first transaction in the data set (which spans two days)
- V1 to V28: PCA components
- Amount: USD amount for each transaction
- Class: Whether the transaction was a fraud (1) or not-fraud (0)

The dataset contains credit card transactions made in September 2013 by European cardholders. Due to confidentiality issues, the original features are not provided: features V1 to V28 are the principal components obtained from a PCA transformation, so all input variables are numerical.

### Resampling the data set to compensate for class imbalance
There are 492 frauds out of 284,807 transactions, so frauds account for only 0.172% of the data. Fraud transactions will be oversampled so the algorithms can be properly trained.
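
To make the imbalance concrete, the short calculation below derives the fraud rate from the two counts quoted above. It also shows the accuracy that a trivial "never fraud" rule would reach, which is why accuracy alone is a misleading metric here. The variable names are illustrative only.

```python
# Counts quoted in the text above
n_fraud = 492
n_total = 284_807

fraud_rate = n_fraud / n_total                      # fraction of transactions that are fraud
legit_per_fraud = (n_total - n_fraud) / n_fraud     # legitimate transactions per fraud
baseline_accuracy = 1 - fraud_rate                  # accuracy of always predicting "not fraud"

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
print(f"Accuracy of a 'never fraud' rule: {baseline_accuracy:.4%}")
```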

In [6]:
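# Create non-overlapping train and test sample data frames from a single shuffle
# (two independent shuffle() calls would let the same rows land in both samples)
shuffled_data = shuffle(raw_data)
sample_train = shuffled_data.iloc[:189543] # train sample with 2/3 of the data set
sample_test = shuffled_data.iloc[189543:] # test sample with the remaining 1/3 (95,264 rows)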

# Random over sample fraud transactions for training purposes
ros = RandomOverSampler(ratio='minority')
x_train_balanced, y_train_balanced = ros.fit_sample(sample_train.iloc[:,:-1], sample_train.Class)

# The resulting balanced training sample has an equal amount of fraud/no-fraud transactions:
np.unique(y_train_balanced, return_counts=True)

(array([0, 1]), array([189217, 189217]))
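
Conceptually, random oversampling duplicates minority-class rows, drawn with replacement, until both classes are the same size. The sketch below reproduces that idea in plain NumPy on hypothetical toy data. It is an illustration of the technique under that assumption, not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels: 95 majority-class rows (0) and 5 minority-class rows (1)
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(len(y), 3))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])

print(np.unique(y_balanced, return_counts=True))
```

Because the added rows are exact copies, oversampling should only be applied to the training sample; copies in the test sample would inflate the scores.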

### Selecting the most valuable features
Features with lower predictive potential are filtered out to reduce processing time and improve model stability. The score function used to rank the features is the ANOVA F-value.

In [7]:
# Select the 10 best features:
kbest = SelectKBest(f_classif, k=10).fit(x_train_balanced, y_train_balanced)

# Filter train and test matrices with only the k best features:
kbest_train_balanced = kbest.transform(x_train_balanced)
kbest_sample_test = kbest.transform(sample_test.iloc[:,:-1])

# Inspect sample sizes and number of dimensions
print(x_train_balanced.shape)
print(y_train_balanced.shape)
print(kbest_train_balanced.shape)
print(kbest_sample_test.shape)

(378434, 30)
(378434,)
(378434, 10)
(95264, 10)


The feature space was reduced from 30 to 10 variables.
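
As a rough illustration of how the ANOVA F-value ranks features, the sketch below builds a hypothetical toy matrix in which only the first column depends on the class, then lets SelectKBest keep the single best column. The data and column layout are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Hypothetical toy features: column 0 shifts with the class, columns 1-3 are pure noise
X = rng.normal(size=(n, 4))
X[:, 0] += 2.0 * y

selector = SelectKBest(f_classif, k=1).fit(X, y)

# A large F-value means the class means differ strongly relative to the within-class spread
print("F-values:", np.round(selector.scores_, 1))
print("Selected columns:", selector.get_support(indices=True))
```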

# Predicting credit card fraud
Since this is a classification problem, the following models are used to detect credit card fraud:

- Naive Bayes
- KNN
- Random Forest
- Logistic Regression
- Support Vector Classifier
- Gradient Boosting Classifier

For each model except Naive Bayes, RandomizedSearchCV is used to randomly sample and test a number of parameter combinations (between 8 and 30, depending on the model) and select the most promising one. Each combination is evaluated with two-fold cross-validation, so a search over 20 combinations totals 40 fits.

RandomizedSearchCV lets you choose the scoring function used to rank the parameter combinations. Here, we're using 'recall', so the search returns the combination with the highest mean cross-validated sensitivity score $ \frac{TP}{TP+FN} $.
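
The sensitivity formula can be checked by hand against scikit-learn's `recall_score`. The labels below are hypothetical toy values chosen for the example.

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy labels: 1 = fraud, 0 = not fraud
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # frauds correctly flagged
fn = np.sum((y_true == 1) & (y_pred == 0))  # frauds missed

recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)
print(recall_manual, recall_sklearn)
```

The false positive in the example (a legitimate transaction flagged as fraud) does not affect the score, which is why sensitivity should be read alongside the confusion matrix.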

The models were then run with the parameters obtained from RandomizedSearchCV. The overall accuracy scores of the different models are compared in the conclusion section at the end of this notebook.
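
The search workflow can be sketched on a small scale as follows. The synthetic data set and the reduced parameter grid are assumptions made to keep the example fast; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the balanced training matrix
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

params = {'n_neighbors': [3, 5, 10], 'weights': ['uniform', 'distance']}

# Sample 4 of the 6 possible combinations and score each with 2-fold recall
search = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=params,
                            n_iter=4, cv=2, scoring='recall', random_state=0)
search.fit(X, y)

# Each candidate is ranked by its mean recall across the folds
for candidate, score in zip(search.cv_results_['params'], search.cv_results_['mean_test_score']):
    print(candidate, round(score, 3))

print('Best:', search.best_params_)
```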

### Cross-validation custom function
This custom function fits a model on the balanced training set and evaluates it on consecutive, equally sized slices of the test set, printing a confusion matrix and a sensitivity score for each slice. It is used to evaluate all models.

In [8]:
# Define a function to cross-validate all models
def cv_models(model, folds):
    
    # Start the timer function to inspect the amount of time necessary to run the method
    start_time = time.time()
    
    # This "if" statement makes models run in the main program to prevent infinite multi-processing loops
    if __name__ == '__main__':
      multiprocessing.set_start_method('forkserver', force=True)

      # Fit the model with the selected parameters
      model.fit(kbest_train_balanced,y_train_balanced)

      # Cross-validate the model 
      print('\nCross-Validation:')

      # Split the data set to the number of folds
      step = int(len(sample_test)/folds)

      # A "for loop" calls the predict function on consecutive sub-samples, each 1/folds of the size of the test data set
      start = 0
      sensitivity = []
      for i in range(folds):
        stop = start+step

        # Call the predict function for every sub-sample
        model_predicted = model.predict(kbest_sample_test[start:stop])

        # Print sample range and confusion matrix
        # (predictions are passed first, so rows are predicted classes and columns are true classes)
        print('\nSample range: ', start, 'to', stop)
        print(confusion_matrix(model_predicted, sample_test.Class[start:stop]))
        
        # Calculate and print sensitivity scores (AKA recall scores) for every sub-sample
        recall = recall_score(sample_test.Class[start:stop], model_predicted)
        sensitivity.append(recall)
        print('Sensitivity: ', recall)
        start += step

      # Calculate the average sensitivity
      print('\nAverage sensitivity = ', np.mean(sensitivity))

      # Stop the timer function and inspect the time taken to run the method
      print("\n--- time elapsed %s seconds ---" % (time.time() - start_time))
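
When reading the confusion matrices printed by this function, note that `confusion_matrix` is documented as `confusion_matrix(y_true, y_pred)`, whereas `cv_models` passes the predictions first. The matrix is therefore transposed: rows are predicted classes and columns are true classes. The toy labels below are hypothetical and only illustrate the two layouts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0])

# scikit-learn's documented order: rows are true classes, columns are predictions
cm_true_first = confusion_matrix(y_true, y_pred)

# Order used in cv_models: rows are predictions, columns are true classes
cm_pred_first = confusion_matrix(y_pred, y_true)

print(cm_true_first)
print(cm_pred_first)

# In the prediction-first layout, sensitivity is bottom-right / (top-right + bottom-right)
tp = cm_pred_first[1, 1]
fn = cm_pred_first[0, 1]
sensitivity = tp / (tp + fn)
print(sensitivity)
```

Applied to the first Naive Bayes slice reported below, this reading gives 23 / (6 + 23), which matches the printed sensitivity of 0.793.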

### RandomizedSearchCV custom function
This custom function wraps the RandomizedSearchCV method to save notebook space.

In [9]:
# Define a function to run RandomizedSearchCV
def randomized_cv(model, params, iterations, num_cv):

  # Declare randomized search CV
  random_CV = RandomizedSearchCV(estimator=model, param_distributions=params,
                    n_iter=iterations, verbose=5, cv=num_cv, n_jobs=1, scoring='recall')
  
  # Start the timer function
  start_time = time.time()
  
  # This "if" statement makes models run in the main program to prevent infinite multi-processing loops
  if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver', force=True)

    # Fit the random search model
    random_CV.fit(kbest_train_balanced,y_train_balanced)

    # Stop the timer function and inspect the time taken to run the method
    print("\n--- %s seconds ---" % (time.time() - start_time))

    # Print the best parameters from RandomizedSearchCV
    print(random_CV.best_params_)

## Naive Bayes

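Since the outcome data is binary, the Bernoulli Naive Bayes classifier is used. Note that BernoulliNB also treats the input features as binary: with the default `binarize=0.0` shown in the output below, each continuous feature is thresholded at zero before the model is fit.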

In [12]:
# Instantiate the model
bnb = BernoulliNB(fit_prior=True)

# Inspect the default parameters
print(bnb)

# Call the cross-validation function
cv_models(bnb, 5)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

Cross-Validation:

Sample range:  0 to 19052
[[18471     6]
 [  552    23]]
Sensitivity:  0.7931034482758621

Sample range:  19052 to 38104
[[18451     4]
 [  570    27]]
Sensitivity:  0.8709677419354839

Sample range:  38104 to 57156
[[18468     3]
 [  552    29]]
Sensitivity:  0.90625

Sample range:  57156 to 76208
[[18427     4]
 [  582    39]]
Sensitivity:  0.9069767441860465

Sample range:  76208 to 95260
[[18464     3]
 [  552    33]]
Sensitivity:  0.9166666666666666

Average sensitivity =  0.8787929202128119

--- time elapsed 0.3151257038116455 seconds ---


## KNN Classifier
The KNN classification model finds the 'K' most similar (or nearest) data points to each observation. Similarity is measured according to a proximity metric that can be defined in the model's parameters. Each neighbor votes for its own class; the algorithm calculates the probability of each class as its share of the votes, $ \frac{votes_i}{k} $, and returns the class with the highest probability.
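
The voting rule can be sketched in a few lines of NumPy. The helper `knn_predict` and the toy points are hypothetical and use uniform weights with Euclidean distance; scikit-learn's KNeighborsClassifier is considerably more efficient.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by a uniform vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Share of votes for each class among the k neighbours: votes_i / k
    classes, votes = np.unique(y_train[nearest], return_counts=True)
    probabilities = votes / k
    return classes[np.argmax(probabilities)], dict(zip(classes.tolist(), probabilities.tolist()))

# Hypothetical toy data: class 0 clustered near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

label, probs = knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=5)
print(label, probs)
```

With k=5, the three nearby class-1 points outvote the two closest class-0 points, so the new observation is assigned to class 1.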

#### KNN 1st attempt
Here we cross-validate KNN with default parameters. The n_jobs parameter is set to -1, to use all processor cores.

In [10]:
# Call the KNN classifier with default parameters
knn = KNeighborsClassifier(n_jobs=-1)

# Inspect the default parameters
print(knn)

# Call the cross-validation function
cv_models(knn, 5)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
           weights='uniform')

Cross-Validation:

Sample range:  0 to 19052
[[19008     1]
 [   11    32]]
Sensitivity:  0.9696969696969697

Sample range:  19052 to 38104
[[19013     1]
 [   14    24]]
Sensitivity:  0.96

Sample range:  38104 to 57156
[[19003     2]
 [   11    36]]
Sensitivity:  0.9473684210526315

Sample range:  57156 to 76208
[[18990     3]
 [   14    45]]
Sensitivity:  0.9375

Sample range:  76208 to 95260
[[19004     2]
 [   17    29]]
Sensitivity:  0.9354838709677419

Average sensitivity =  0.9500098523434686

--- time elapsed 9.358150243759155 seconds ---


#### Run RandomizedSearchCV
Randomly search for the best parameters having sensitivity as the scoring metric.

In [79]:
# Create the RandomizedSearchCV parameter grid:
params = {'n_neighbors': [5, 10, 50, 100],
               'weights': ['uniform', 'distance'],
               'algorithm': ['auto'],
               'leaf_size': [5, 10, 20, 30],
               'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski']}

# Declare the model:
knn = KNeighborsClassifier()

# Fit randomized_cv with scoring metric set to "recall"
randomized_cv(knn, params, 20, 2)

Fitting 2 folds for each of 20 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 16.5min finished



--- 991.262886762619 seconds ---
{'algorithm': 'kd_tree', 'n_neighbors': 5, 'leaf_size': 20, 'weights': 'distance', 'metric': 'euclidean'}


#### KNN 2nd attempt
Run the second attempt with the parameters returned by RandomizedSearchCV.

In [10]:
# Call the KNN classifier with selected parameters
knn = KNeighborsClassifier(algorithm='kd_tree', n_neighbors=5, leaf_size=20, weights='distance',
                           metric='euclidean', n_jobs=-1)

# Call the cross-validation function
cv_models(knn, 5)


Cross-Validation:

Sample range:  0 to 19052
[[19022     1]
 [    2    27]]
Sensitivity:  0.9642857142857143

Sample range:  19052 to 38104
[[19015     2]
 [    6    29]]
Sensitivity:  0.9354838709677419

Sample range:  38104 to 57156
[[19020     0]
 [    4    28]]
Sensitivity:  1.0

Sample range:  57156 to 76208
[[19022     2]
 [    5    23]]
Sensitivity:  0.92

Sample range:  76208 to 95260
[[19002     1]
 [    8    41]]
Sensitivity:  0.9761904761904762

Average sensitivity =  0.9591920122887865

--- time elapsed 9.078984022140503 seconds ---


## Random Forest Classifier

Random Forest is a 'bagging' ensemble model consisting of multiple decision trees. Each tree is grown on a randomly selected sample of the data, and the trees can vary in depth (meaning the number of branches and leaves). Each tree gets a "vote" on the outcome of each observation, and the prediction with the most votes is returned.
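
The bagging-and-voting idea can be sketched with individual decision trees, each fit on a bootstrap sample. The synthetic data and the number of trees are assumptions made for the example. This is a simplified illustration: scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees = []
for seed in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))

# Every tree votes on every observation; the majority class wins
votes = np.array([tree.predict(X) for tree in trees])   # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)

print('Training accuracy of the vote:', (majority == y).mean())
```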

#### Random Forest Classifier 1st attempt
Run the model with default parameters.

In [9]:
# Call the model with default parameters
rfc = ensemble.RandomForestClassifier(n_jobs=-1)

# Inspect the default parameters
print(rfc)

# Call the cross-validation function
cv_models(rfc, 5)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Cross-Validation:

Sample range:  0 to 19052
[[19023     3]
 [    1    25]]
Sensitivity:  0.8928571428571429

Sample range:  19052 to 38104
[[19016     3]
 [    0    33]]
Sensitivity:  0.9166666666666666

Sample range:  38104 to 57156
[[19016     2]
 [    0    34]]
Sensitivity:  0.9444444444444444

Sample range:  57156 to 76208
[[19017     2]
 [    2    31]]
Sensitivity:  0.9393939393939394

Sample range:  76208 to 95260
[[19024     2]
 [    1    25]]
Sensitivity:  0.9259259259259259

Average sensitivity =  0.9238576238576238

--- time elapsed 7.862975835800171 seconds ---

#### Run RandomizedSearchCV
Randomly search for the best parameters having sensitivity as the scoring metric.

In [11]:
# Create the RandomizedSearchCV parameter grid:
params = {'n_estimators': [10, 20, 50, 100, 200],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [10, 20, 30, None],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 5, 10],
               'bootstrap': [True, False]}

# Declare the model:
rfc = ensemble.RandomForestClassifier()

# Fit randomized_cv with scoring metric set to "recall"
randomized_cv(rfc, params, 30, 2)

Fitting 2 folds for each of 30 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  8.8min finished



--- 554.3161761760712 seconds ---
{'max_depth': 20, 'bootstrap': False, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 20}


#### Random Forest Classifier 2nd attempt

Run the model with RandomizedSearchCV's "best parameters". 

In [15]:
# Call the model with the selected parameters
rfc = ensemble.RandomForestClassifier(bootstrap=False, min_samples_leaf=5, min_samples_split=2, n_estimators=20,
                                     max_features='sqrt', max_depth=20, n_jobs=-1)

# Call the cross-validation function
cv_models(rfc, 5)


Cross-Validation:

Sample range:  0 to 19052
[[19017     0]
 [    2    33]]
Sensitivity:  1.0

Sample range:  19052 to 38104
[[19015     2]
 [    3    32]]
Sensitivity:  0.9411764705882353

Sample range:  38104 to 57156
[[19018     2]
 [    2    30]]
Sensitivity:  0.9375

Sample range:  57156 to 76208
[[19010     1]
 [    5    36]]
Sensitivity:  0.972972972972973

Sample range:  76208 to 95260
[[19013     0]
 [    4    35]]
Sensitivity:  1.0

Average sensitivity =  0.9703298887122417

--- time elapsed 44.30052733421326 seconds ---


## Logistic Regression

Logistic regression models the log odds of getting y=1 (a fraud transaction) rather than y=0 (non-fraud) as a linear function of the features. The probability of fraud is then obtained by applying the logistic function to the log odds.
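
The relationship between log odds and probability can be written out directly. The log-odds values below are hypothetical; in practice they would come from the fitted model's linear predictor.

```python
import numpy as np

def log_odds_to_probability(log_odds):
    """Apply the logistic (sigmoid) function to turn log odds into probabilities."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical log-odds values for three transactions
log_odds = np.array([-3.0, 0.0, 3.0])
probabilities = log_odds_to_probability(log_odds)
print(probabilities)

# Inverting the transformation recovers the log odds: log(p / (1 - p))
recovered = np.log(probabilities / (1 - probabilities))
print(recovered)
```

Log odds of zero correspond to a probability of 0.5, which is the default decision threshold between the two classes.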

#### Logistic regression 1st attempt
Cross-validate the model with default parameters. The n_jobs parameter is set to -1, to use all processor cores.

In [10]:
# Call the model with default parameters
logit = LogisticRegression(n_jobs=-1)

# Inspect the default parameters
print(logit)

# Call the cross-validation function
cv_models(logit, 5)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Cross-Validation:

Sample range:  0 to 19052
[[18605     2]
 [  419    26]]
Sensitivity:  0.9285714285714286

Sample range:  19052 to 38104
[[18596     4]
 [  420    32]]
Sensitivity:  0.8888888888888888

Sample range:  38104 to 57156
[[18585     5]
 [  431    31]]
Sensitivity:  0.8611111111111112

Sample range:  57156 to 76208
[[18556     1]
 [  463    32]]
Sensitivity:  0.9696969696969697

Sample range:  76208 to 95260
[[18613     7]
 [  412    20]]
Sensitivity:  0.7407407407407407

Average sensitivity =  0.8778018278018278

--- time elapsed 2.6593966484069824 seconds ---


#### Run RandomizedSearchCV
Since the specificity of logistic regression isn't great, using 'recall' as the RandomizedSearchCV scoring parameter led to parameters with 100% sensitivity but low specificity. Therefore, the scoring metric of RandomizedSearchCV was changed to the default (the estimator's accuracy score).
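
The risk of optimising for recall alone can be seen with a degenerate toy example: a model that flags every transaction as fraud reaches perfect sensitivity while its specificity drops to zero. The labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy test fold: 5 frauds among 1,000 transactions
y_true = np.array([1] * 5 + [0] * 995)

# A degenerate model that flags every transaction as fraud
y_pred_all_fraud = np.ones_like(y_true)

sensitivity = recall_score(y_true, y_pred_all_fraud)               # TP / (TP + FN)
specificity = recall_score(y_true, y_pred_all_fraud, pos_label=0)  # TN / (TN + FP)

print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
```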

In [11]:
# Declare the parameters
params = {'penalty': ['l1', 'l2'],
          'C': [1, 10, 50, 100],
          'fit_intercept': [True, False],
          'warm_start': [True, False],
          'solver': ['liblinear', 'saga']}

# Declare the model
logit = LogisticRegression()

# Fit randomized_cv WITH DEFAULT SCORING METRIC
randomized_cv(logit, params, 10, 2)

Fitting 2 folds for each of 10 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   14.1s finished



--- 22.311493635177612 seconds ---
{'warm_start': False, 'solver': 'saga', 'fit_intercept': True, 'penalty': 'l2', 'C': 10}


#### Logistic Regression 2nd attempt

Run the model with RandomizedSearchCV's "best parameters". 

In [8]:
# Declare the model with the selected parameters
logit = LogisticRegression(penalty='l1', solver='liblinear', C=1, warm_start=False,
                           fit_intercept=True, n_jobs=-1)

# Call the custom cross-validation function
cv_models(logit, 5)


Cross-Validation:

Sample range:  0 to 19052
[[18604     3]
 [  424    21]]
Sensitivity:  0.875

Sample range:  19052 to 38104
[[18585     1]
 [  434    32]]
Sensitivity:  0.9696969696969697

Sample range:  38104 to 57156
[[18597     1]
 [  428    26]]
Sensitivity:  0.9629629629629629

Sample range:  57156 to 76208
[[18593     0]
 [  433    26]]
Sensitivity:  1.0

Sample range:  76208 to 95260
[[18568     7]
 [  452    25]]
Sensitivity:  0.78125

Average sensitivity =  0.9177819865319865

--- time elapsed 3.438904285430908 seconds ---


## Support Vector Machine Classifier
The Support Vector Machine Classifier (SVC) is an effective machine learning method for high dimensional spaces, including when the number of variables (or dimensions) exceeds the number of samples.

#### Support Vector Machine Classifier 1st attempt
Cross-validate the model with default parameters.

In [14]:
# Call the model with default parameters
svc = svm.SVC()

# Inspect the default parameters
print(svc)

# Call the cross-validation function
cv_models(svc, 5)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Cross-Validation:

Sample range:  0 to 19052
[[18933     0]
 [   88    31]]
Sensitivity:  1.0

Sample range:  19052 to 38104
[[18954     3]
 [   68    27]]
Sensitivity:  0.9

Sample range:  38104 to 57156
[[18949     5]
 [   75    23]]
Sensitivity:  0.8214285714285714

Sample range:  57156 to 76208
[[18920     7]
 [   92    33]]
Sensitivity:  0.825

Sample range:  76208 to 95260
[[18942     2]
 [   83    25]]
Sensitivity:  0.9259259259259259

Average sensitivity =  0.8944708994708994

--- time elapsed 4193.549197673798 seconds ---


#### Run RandomizedSearchCV
Randomly search for the best parameters having sensitivity as the scoring metric.

In [None]:
# Create the RandomizedSearchCV parameter grid:
params = {'kernel':['linear', 'rbf'],
              'C':[0.7, 1],
              'decision_function_shape':['ovo'],
              'gamma':['auto',1,100], 'cache_size':[1000]}

# Declare the model
svc = svm.SVC()

# Fit randomized_cv with scoring metric set to "recall"
randomized_cv(svc, params, 12, 2)

Fitting 2 folds for each of 12 candidates, totalling 24 fits
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=auto 
[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=auto, score=0.8901972390757458, total= 8.5min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=auto 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  9.5min remaining:    0.0s


[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=auto, score=0.8920364458913811, total= 8.4min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=auto 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 19.0min remaining:    0.0s


[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=auto, score=0.9969452254613872, total=25.5min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=auto 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 45.4min remaining:    0.0s


[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=auto, score=0.9973363211635625, total=27.2min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=1 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 73.5min remaining:    0.0s


[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=1, score=0.8901972390757458, total= 8.6min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=1 
[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=1, score=0.8920364458913811, total= 8.5min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=1 
[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=1, score=1.0, total=20.7min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=1 
[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=rbf, gamma=1, score=1.0, total=20.1min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=100 
[CV]  cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=100, score=0.8901972390757458, total= 8.1min
[CV] cache_size=1000, C=0.7, decision_function_shape=ovo, kernel=linear, gamma=100 
[CV]  cache

#### Support Vector Machine Classifier 2nd attempt

Run the model with RandomizedSearchCV's "best parameters". 

In [11]:
# Call the model with selected parameters
svc = svm.SVC(cache_size=1000, C=1, decision_function_shape='ovo', kernel='linear', gamma=1)

# Call the custom cross-validation function
cv_models(svc, 5)


Cross-Validation:

Sample range:  0 to 19052
[[18721     4]
 [  300    27]]
Sensitivity:  0.8709677419354839

Sample range:  19052 to 38104
[[18701     4]
 [  317    30]]
Sensitivity:  0.8823529411764706

Sample range:  38104 to 57156
[[18727     1]
 [  290    34]]
Sensitivity:  0.9714285714285714

Sample range:  57156 to 76208
[[18710     2]
 [  311    29]]
Sensitivity:  0.9354838709677419

Sample range:  76208 to 95260
[[18713     3]
 [  303    33]]
Sensitivity:  0.9166666666666666

Average sensitivity =  0.9153799584349869

--- time elapsed 2961.659015893936 seconds ---


In [12]:
# Call the model with selected parameters
svc = svm.SVC(cache_size=1000, C=1, decision_function_shape='ovr', kernel='linear', gamma=1)

# Call the custom cross-validation function
cv_models(svc, 5)


Cross-Validation:

Sample range:  0 to 19052
[[18721     4]
 [  300    27]]
Sensitivity:  0.8709677419354839

Sample range:  19052 to 38104
[[18701     4]
 [  317    30]]
Sensitivity:  0.8823529411764706

Sample range:  38104 to 57156
[[18727     1]
 [  290    34]]
Sensitivity:  0.9714285714285714

Sample range:  57156 to 76208
[[18710     2]
 [  311    29]]
Sensitivity:  0.9354838709677419

Sample range:  76208 to 95260
[[18713     3]
 [  303    33]]
Sensitivity:  0.9166666666666666

Average sensitivity =  0.9153799584349869

--- time elapsed 3026.675800561905 seconds ---
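
The two runs above differ only in `decision_function_shape` and return identical confusion matrices. This is expected for a two-class problem, where there is a single decision boundary and the 'ovo' and 'ovr' settings make no difference to the predictions. A minimal sketch on synthetic data (an assumption made for speed, not the credit card data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pred_ovo = SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X, y).predict(X)
pred_ovr = SVC(kernel='linear', C=1, decision_function_shape='ovr').fit(X, y).predict(X)

# With only two classes there is a single decision boundary either way
print('Identical predictions:', np.array_equal(pred_ovo, pred_ovr))
```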


## Gradient Boosting
The gradient boosting classifier used here is an ensemble of "weak" decision trees that are fit in sequence. The residuals of each decision tree are used as the outcome to be predicted by the subsequent tree. The default loss function is the binomial deviance (the negative log-likelihood, as in logistic regression). Predictions are given by adding the values of all decision trees.
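
The sequential fitting of residuals can be sketched with plain regression trees. For simplicity, the example assumes a regression target and squared-error loss, where the residuals are simply `y - prediction`; the classifier used in this notebook applies the same idea to the gradient of its classification loss. The toy data, learning rate and number of trees are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant prediction
errors = []

for _ in range(50):
    residuals = y - prediction                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # add the new tree's contribution
    errors.append(np.mean((y - prediction) ** 2))

print('MSE after 1 tree:  ', round(errors[0], 4))
print('MSE after 50 trees:', round(errors[-1], 4))
```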

In [11]:
# Call the model with default parameters
gbc = ensemble.GradientBoostingClassifier()

# Inspect the default parameters
print(gbc)

# Call the cross-validation function
cv_models(gbc, 5)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

Cross-Validation:

Sample range:  0 to 19052
[[18868     0]
 [  151    33]]
Sensitivity:  1.0

Sample range:  19052 to 38104
[[18850     1]
 [  177    24]]
Sensitivity:  0.96

Sample range:  38104 to 57156
[[18842     1]
 [  172    37]]
Sensitivity:  0.9736842105263158

Sample range:  57156 to 76208
[[18806     2]
 [  198    46]]
Sensitivity:  0.9583333333333334

Sample range:  76208 to 95260
[[18850     2]
 [  171    29]]
Sensitivity:  0.9354838709677419

Average sensitivity =  0.9655002829654782

--- time elapsed 64.4233

#### Run RandomizedSearchCV
Randomly search for the best parameters having sensitivity as the scoring metric.

In [None]:
# Create the RandomizedSearchCV parameter grid:
params = {'loss': ['deviance', 'exponential'],
          'max_depth': [3,10],
          'n_estimators': [100],
          'criterion': ['friedman_mse','mae']}


# Declare the model:
gbc = ensemble.GradientBoostingClassifier()

# Fit randomized_cv with scoring metric set to "recall"
randomized_cv(gbc, params, 8, 2)

Fitting 2 folds for each of 8 candidates, totalling 16 fits
[CV] max_depth=3, loss=deviance, n_estimators=100, criterion=friedman_mse 
[CV]  max_depth=3, loss=deviance, n_estimators=100, criterion=friedman_mse, score=0.9935105427257834, total= 1.7min
[CV] max_depth=3, loss=deviance, n_estimators=100, criterion=friedman_mse 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.7min remaining:    0.0s


[CV]  max_depth=3, loss=deviance, n_estimators=100, criterion=friedman_mse, score=0.9934894044284733, total= 1.7min
[CV] max_depth=10, loss=deviance, n_estimators=100, criterion=friedman_mse 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.4min remaining:    0.0s


[CV]  max_depth=10, loss=deviance, n_estimators=100, criterion=friedman_mse, score=1.0, total= 9.1min
[CV] max_depth=10, loss=deviance, n_estimators=100, criterion=friedman_mse 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 12.6min remaining:    0.0s


[CV]  max_depth=10, loss=deviance, n_estimators=100, criterion=friedman_mse, score=1.0, total= 9.9min
[CV] max_depth=3, loss=exponential, n_estimators=100, criterion=friedman_mse 
[CV]  max_depth=3, loss=exponential, n_estimators=100, criterion=friedman_mse, score=0.9774665750673783, total= 1.8min
[CV] max_depth=3, loss=exponential, n_estimators=100, criterion=friedman_mse 
[CV]  max_depth=3, loss=exponential, n_estimators=100, criterion=friedman_mse, score=0.984241399355282, total= 1.7min
[CV] max_depth=10, loss=exponential, n_estimators=100, criterion=friedman_mse 
[CV]  max_depth=10, loss=exponential, n_estimators=100, criterion=friedman_mse, score=1.0, total=10.1min
[CV] max_depth=10, loss=exponential, n_estimators=100, criterion=friedman_mse 
[CV]  max_depth=10, loss=exponential, n_estimators=100, criterion=friedman_mse, score=1.0, total=10.2min
[CV] max_depth=3, loss=deviance, n_estimators=100, criterion=mae .....


#### Gradient Boosting Classifier 2nd attempt

Run the model with RandomizedSearchCV's "best parameters". 

In [14]:
# Call the model with selected parameters
gbc = ensemble.GradientBoostingClassifier(max_depth=3, loss='exponential', n_estimators=100, criterion='friedman_mse')

# Call the custom cross-validation function
cv_models(gbc, 5)


Cross-Validation:

Sample range:  0 to 19052
[[18796     2]
 [  227    27]]
Sensitivity:  0.9310344827586207

Sample range:  19052 to 38104
[[18817     2]
 [  204    29]]
Sensitivity:  0.9354838709677419

Sample range:  38104 to 57156
[[18818     1]
 [  202    31]]
Sensitivity:  0.96875

Sample range:  57156 to 76208
[[18789     1]
 [  220    42]]
Sensitivity:  0.9767441860465116

Sample range:  76208 to 95260
[[18803     1]
 [  213    35]]
Sensitivity:  0.9722222222222222

Average sensitivity =  0.9568469523990192

--- time elapsed 76.4148519039154 seconds ---
