In [1]:
pip install optuna

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting optuna
  Downloading optuna-3.1.0-py3-none-any.whl (365 kB)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting cmaes>=0.9.1
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.10.2-py3-none-any.whl (212 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.10.2 cmaes-0.9.1 colorlog-6.7.0 optuna-3.1.0
Note: you may 

**Exercise 1: (5 points) Using the bucket that you created in the last homework assignment and the pandas
library, read the train.csv and test.csv data files and create two data-frames called train and
test, respectively.**

In [6]:
import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import optuna

from precision_recall_cutoff import precision_recall_cutoff
import cost_functions

from tqdm import tqdm
from scipy.stats import boxcox
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier 
from sklearn.metrics import classification_report, make_scorer, confusion_matrix

## Defining the s3 bucket
s3= boto3.resource('s3')
bucket_name= 'craig-shaffer-data-445-bucket'
bucket= s3.Bucket(bucket_name)

## Defining the file to be read from s3 bucket
file_key = 'train.csv'
file_key2 = 'test.csv'

bucket_object = bucket.Object(file_key)
bucket_object2 = bucket.Object(file_key2)

file_object = bucket_object.get()
file_object2 = bucket_object2.get()

file_content_stream = file_object.get('Body')
file_content_stream2 = file_object2.get('Body')

## Reading the datafiles
train = pd.read_csv(file_content_stream, sep = '|')
test = pd.read_csv(file_content_stream2, sep = '|')

*Engineering variables from previous homeworks*

In [7]:
#variable one: low trust level (trustLevel for fraud is never >2)
train['lowTrust'] = np.where(train['trustLevel'] <= 2, 1, 0)
test['lowTrust'] = np.where(test['trustLevel'] <= 2, 1, 0)

#variable two: low value per second (the highest value per second (VPS) in the fraud data set is .231, but will set it to .25)
train['lowVPS'] = np.where(train['valuePerSecond'] <= 0.25, 1, 0)
test['lowVPS'] = np.where(test['valuePerSecond'] <= 0.25, 1, 0)

#variable three: low scan time (noticeable difference in quartiles for fraud and not fraud)
train['lowTotalScanTime'] = np.where(train['totalScanTimeInSeconds'] < 1000, 1, 0)
test['lowTotalScanTime'] = np.where(test['totalScanTimeInSeconds'] < 1000, 1, 0)

#variable four: high scannedLineItemsPerSecond (SLIPS) (SLIPS doesn't exceed .308 in fraud but goes up to 11 in not fraud)
train['highSLIPS'] = np.where(train['scannedLineItemsPerSecond'] > 0.35 , 1, 0)
test['highSLIPS'] = np.where(test['scannedLineItemsPerSecond'] > 0.35 , 1, 0)

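#variable five: boxcox transformation on scannedLineItemsPerSecond
#(lambda is estimated on train and reused on test so both sets get the same transformation)
train['boxcox_SLIPS'], slips_lambda = boxcox(train['scannedLineItemsPerSecond'])
test['boxcox_SLIPS'] = boxcox(test['scannedLineItemsPerSecond'], lmbda = slips_lambda)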

#variable six: 1/grandTotal
train['1_grandTotal'] = 1/(train['grandTotal'])
test['1_grandTotal'] = 1/(test['grandTotal'])

#variable seven: natural log of totalScanTimeInSeconds
train['log_totalScanTimeInSeconds']= np.log(train['totalScanTimeInSeconds'])
test['log_totalScanTimeInSeconds']= np.log(test['totalScanTimeInSeconds'])

#variable eight: lineItemVoidsPerPosition^2
train['squared_lineItemVoidsPerPosition']= np.power(train['lineItemVoidsPerPosition'], 2)
test['squared_lineItemVoidsPerPosition']= np.power(test['lineItemVoidsPerPosition'], 2)

#variable nine: attempted a scan without registration
train['madeScansWithoutRegistration'] = np.where(train['scansWithoutRegistration'] > 0, 1, 0)
test['madeScansWithoutRegistration'] = np.where(test['scansWithoutRegistration'] > 0, 1, 0)

#variable ten: made a modification to quantity
train['madeModification'] = np.where(train['quantityModifications'] > 0, 1, 0)
test['madeModification'] = np.where(test['quantityModifications'] > 0, 1, 0)

#3 heredity principle features
train['heredity_interaction_1'] = train['trustLevel'] * train['lowTrust']
test['heredity_interaction_1'] = test['trustLevel'] * test['lowTrust']

train['heredity_interaction_2'] = train['trustLevel'] * train['scannedLineItemsPerSecond']
test['heredity_interaction_2'] = test['trustLevel'] * test['scannedLineItemsPerSecond']

train['heredity_interaction_3'] = train['lowTrust'] * train['scannedLineItemsPerSecond']
test['heredity_interaction_3'] = test['lowTrust'] * test['scannedLineItemsPerSecond']


#decision tree features
train['tree_interaction_1'] = np.where(train['heredity_interaction_3'] <= 0.012, 1, 0)
test['tree_interaction_1'] = np.where(test['heredity_interaction_3'] <= 0.012, 1, 0)

train['tree_interaction_2'] = np.where((train['heredity_interaction_3'] > 0.012) & 
                                       (train['totalScanTimeInSeconds'] <= 993.0) &
                                       (train['heredity_interaction_1'] > 1.5) &
                                       (train['scansWithoutRegistration'] <= 7.5), 1, 0)
test['tree_interaction_2'] = np.where((test['heredity_interaction_3'] > 0.012) & 
                                       (test['totalScanTimeInSeconds'] <= 993.0) &
                                       (test['heredity_interaction_1'] > 1.5) &
                                       (test['scansWithoutRegistration'] <= 7.5), 1, 0)

train['tree_interaction_3'] = np.where((train['heredity_interaction_3'] > 0.012) & 
                                       (train['totalScanTimeInSeconds'] <= 993.0) &
                                       (train['heredity_interaction_1'] <= 1.5) &
                                       (train['valuePerSecond'] <= 0.119), 1, 0)
test['tree_interaction_3'] = np.where((test['heredity_interaction_3'] > 0.012) & 
                                       (test['totalScanTimeInSeconds'] <= 993.0) &
                                       (test['heredity_interaction_1'] <= 1.5) &
                                       (test['valuePerSecond'] <= 0.119), 1, 0)

In [8]:
#defining the input (top 7 features) and target variable (fraud)
x_train_7 = train[['log_totalScanTimeInSeconds', 'trustLevel', 'tree_interaction_1', 'heredity_interaction_3', 
                 'boxcox_SLIPS','scansWithoutRegistration','lineItemVoids']]
y_train = train['fraud']

#top 6 features
x_train_6 = x_train_7.drop(columns = ['lineItemVoids'])

#top 5 features
x_train_5 = x_train_6.drop(columns = ['scansWithoutRegistration'])

In [10]:
#defining scorer
my_scorer = make_scorer(cost_functions.cost_function, greater_is_better = True, needs_proba = True)
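
The scorer above wraps `cost_functions.cost_function`, which is not shown in this notebook. As a rough sketch of what such a function might compute, the snippet below applies the payoff matrix from the DATA MINING CUP 2019 task (detected fraud: +5, false accusation: -25, missed fraud: -5, correct non-fraud: 0) and searches a grid of cutoffs. The names `PAYOFF`, `dmc_profit`, and `best_cutoff`, as well as the cutoff grid, are assumptions for illustration and may differ from the actual helper.

```python
# Payoff per transaction in euros, keyed by (actual, predicted).
# Assumed to mirror the DATA MINING CUP 2019 evaluation rule.
PAYOFF = {(0, 0): 0, (0, 1): -25, (1, 0): -5, (1, 1): 5}


def dmc_profit(y_true, y_pred):
    """Total payoff of hard 0/1 predictions."""
    return sum(PAYOFF[(int(t), int(p))] for t, p in zip(y_true, y_pred))


def best_cutoff(y_true, y_proba, cutoffs=None):
    """Return (profit, cutoff) for the most profitable cutoff.

    Ties are resolved in favour of the smallest cutoff.
    """
    if cutoffs is None:
        cutoffs = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    best_profit, best_c = None, None
    for c in cutoffs:
        y_pred = [int(p >= c) for p in y_proba]
        profit = dmc_profit(y_true, y_pred)
        if best_profit is None or profit > best_profit:
            best_profit, best_c = profit, c
    return best_profit, best_c


y_true = [1, 1, 0, 0]
y_proba = [0.9, 0.8, 0.3, 0.1]
print(dmc_profit(y_true, [1, 0, 1, 0]))  # 5 - 5 - 25 + 0 = -25
print(best_cutoff(y_true, y_proba))      # (10, 0.35)
```

Because a single false accusation costs as much as five detected frauds earn, negative cross-validated scores are plausible under this rule even for a reasonable classifier.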

**Exercise 2: (85 points) Using the train data-frame (including the top 7 features from homework assignment 5), do the following:**

- (i) Consider a model to predict fraud. Then, do the following:
  - With the top 5 important features and using the GridSearchCV function with cv = 3, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.
  - With the top 6 important features and using the GridSearchCV function with cv = 3, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.
  - With the top 7 important features and using the GridSearchCV function with cv = 3, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.

From the above three scenarios, identify the best model; that is, the model (input features
and hyper-parameters) that has the best performance.

In [15]:
#Model: Random Forest

#defining parameter dictionary
rf_param_grid = {'n_estimators': [100, 300, 500],
                 'min_samples_split': [10, 15],
                 'min_samples_leaf': [5, 7],
                 'max_depth' : [3, 5, 7]}

#GridSearchCV w/ top 5 most important features:----------
rf_grid_search_1 = GridSearchCV(estimator = RandomForestClassifier(), param_grid = rf_param_grid, 
                              cv = 3, scoring = my_scorer).fit(x_train_5, y_train)

print('Best hyper-parameter combination for RandomForestClassifier with top-5 variables: \n', rf_grid_search_1.best_params_)
print('\nBest score:\n', rf_grid_search_1.best_score_)
print('\n----------')
#GridSearchCV w/ top 6 most important features:----------
rf_grid_search_2 = GridSearchCV(estimator = RandomForestClassifier(), param_grid = rf_param_grid, 
                              cv = 3, scoring = my_scorer).fit(x_train_6, y_train)

print('\nBest hyper-parameter combination for RandomForestClassifier with top-6 variables: \n', rf_grid_search_2.best_params_)
print('\nBest score:\n', rf_grid_search_2.best_score_)
print('\n----------')
#GridSearchCV w/ top 7 most important features:----------
rf_grid_search_3 = GridSearchCV(estimator = RandomForestClassifier(), param_grid = rf_param_grid, 
                              cv = 3, scoring = my_scorer).fit(x_train_7, y_train)

print('\nBest hyper-parameter combination for RandomForestClassifier with top-7 variables: \n', rf_grid_search_3.best_params_)
print('\nBest score:\n', rf_grid_search_3.best_score_)

Best hyper-parameter combination for RandomForestClassifier with top-5 variables: 
 {'max_depth': 7, 'min_samples_leaf': 7, 'min_samples_split': 15, 'n_estimators': 100}

Best score:
 -26.666666666666668

----------

Best hyper-parameter combination for RandomForestClassifier with top-6 variables: 
 {'max_depth': 7, 'min_samples_leaf': 7, 'min_samples_split': 15, 'n_estimators': 100}

Best score:
 -18.333333333333332

----------

Best hyper-parameter combination for RandomForestClassifier with top-7 variables: 
 {'max_depth': 7, 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 100}

Best score:
 -30.0


- (ii) Consider a model different from part (i) to predict fraud. Then, do the following:
  - With the top 5 important features and using the RandomizedSearchCV function with cv = 3 and n_iter = 30, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.
  - With the top 6 important features and using the RandomizedSearchCV function with cv = 3 and n_iter = 30, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.
  - With the top 7 important features and using the RandomizedSearchCV function with cv = 3 and n_iter = 30, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.

From the above three scenarios, identify the best model; that is, the model (input features and hyper-parameters) that has the best performance.

In [17]:
#Model: AdaBoost

#defining parameter dictionary
ada_param_grid = {'n_estimators': [100, 300],
                  'estimator__min_samples_split': [10, 15],
                  'estimator__min_samples_leaf': [5, 7],
                  'estimator__max_depth': [3, 5, 7],
                  'learning_rate': [0.001, 0.01, 0.1]}

#RandomizedSearchCV w/ top 5 most important features:----------
ada_randomized_search_1 = RandomizedSearchCV(estimator = AdaBoostClassifier(estimator = DecisionTreeClassifier()), 
                                             param_distributions = ada_param_grid, cv = 3, scoring = my_scorer,
                                             n_jobs = -1, n_iter = 30).fit(x_train_5, y_train)

print('Best hyper-parameter combination for AdaBoostClassifier with top-5 variables: \n', ada_randomized_search_1.best_params_)
print('\nBest score:\n', ada_randomized_search_1.best_score_)
print('\n----------')
#RandomizedSearchCV w/ top 6 most important features:----------
ada_randomized_search_2 = RandomizedSearchCV(estimator = AdaBoostClassifier(estimator = DecisionTreeClassifier()), 
                                             param_distributions = ada_param_grid, cv = 3, scoring = my_scorer,
                                             n_jobs = -1, n_iter = 30).fit(x_train_6, y_train)

print('Best hyper-parameter combination for AdaBoostClassifier with top-6 variables: \n', ada_randomized_search_2.best_params_)
print('\nBest score:\n', ada_randomized_search_2.best_score_)
print('\n----------')
#RandomizedSearchCV w/ top 7 most important features:----------
ada_randomized_search_3 = RandomizedSearchCV(estimator = AdaBoostClassifier(estimator = DecisionTreeClassifier()), 
                                             param_distributions = ada_param_grid, cv = 3, scoring = my_scorer,
                                             n_jobs = -1, n_iter = 30).fit(x_train_7, y_train)

print('Best hyper-parameter combination for AdaBoostClassifier with top-7 variables: \n', ada_randomized_search_3.best_params_)
print('\nBest score:\n', ada_randomized_search_3.best_score_)

Best hyper-parameter combination for AdaBoostClassifier with top-5 variables: 
 {'n_estimators': 100, 'learning_rate': 0.001, 'estimator__min_samples_split': 10, 'estimator__min_samples_leaf': 5, 'estimator__max_depth': 5}

Best score:
 -18.333333333333332

----------
Best hyper-parameter combination for AdaBoostClassifier with top-6 variables: 
 {'n_estimators': 300, 'learning_rate': 0.1, 'estimator__min_samples_split': 15, 'estimator__min_samples_leaf': 5, 'estimator__max_depth': 7}

Best score:
 1.6666666666666667

----------
Best hyper-parameter combination for AdaBoostClassifier with top-7 variables: 
 {'n_estimators': 100, 'learning_rate': 0.01, 'estimator__min_samples_split': 15, 'estimator__min_samples_leaf': 7, 'estimator__max_depth': 3}

Best score:
 -3.3333333333333335
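
The outputs above do not state which scenario wins for each model. A small sketch that tabulates the printed `best_score_` values and picks the highest one per model is given below; the dictionary and function names are illustrative only. Since neither estimator was given a `random_state`, a rerun may produce different scores and possibly a different ranking.

```python
# Mean cross-validated scores copied from the printed outputs above.
rf_scores = {'top-5': -26.666666666666668,
             'top-6': -18.333333333333332,
             'top-7': -30.0}
ada_scores = {'top-5': -18.333333333333332,
              'top-6': 1.6666666666666667,
              'top-7': -3.3333333333333335}


def best_scenario(scores):
    """Return the (feature set, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


for model_name, scores in [('RandomForest', rf_scores), ('AdaBoost', ada_scores)]:
    features, score = best_scenario(scores)
    print(f'{model_name}: best with {features} features (score = {score:.2f})')
```

On these numbers, both the random forest (-18.33) and AdaBoost (1.67) score best with the top 6 features, using the hyper-parameters printed for that scenario.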


- (iii) Consider a model different from parts (i) & (ii) to predict fraud. Then, do the following:
  - With the top 5 important features and using the Optuna framework using 3 folds and N_TRIALS = 30, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.
  - With the top 6 important features and using the Optuna framework using 3 folds and N_TRIALS = 30, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.
  - With the top 7 important features and using the Optuna framework using 3 folds and N_TRIALS = 30, run a hyper-parameter tuning procedure on the model. Please see page 4 of DATA-MINING-CUP-2019-task.pdf file to understand how the model should be evaluated.

From the above three scenarios, identify the best model; that is, the model (input features and hyper-parameters) that has the best performance.

In [None]:
#Model: Gradient Boosting

#defining parameter dictionary
#(GradientBoostingClassifier takes the tree parameters directly, so no 'estimator__' prefix)
gb_param_grid = {'n_estimators': [100, 300],
                 'min_samples_split': [10, 15],
                 'min_samples_leaf': [5, 7],
                 'max_depth': [3, 5, 7],
                 'learning_rate': [0.01]}

#Optuna w/ top 5 most important features:----------

#Optuna w/ top 6 most important features:----------

#Optuna w/ top 7 most important features:----------

**Exercise 3: (70 points) Using the train data-frame and the models from exercise 2, split the train data-frame into two data-frames: training (80%) and validation (20%) taking into account the proportions of 0s and 1s. Then, do the following:**

- (i) Consider the best model from exercise 2(i). Build that model on the training data-frame. After that, predict the likelihood of fraud on the validation and test data-frames.

- (ii) Consider the best model from exercise 2(ii). Build that model on the training data-frame. After that, predict the likelihood of fraud on the validation and test data-frames.

- (iii) Consider the best model from exercise 2(iii). Build that model on the training data-frame. After that, predict the likelihood of fraud on the validation and test data-frames.

Using the predictions on the validation data-frame from parts (i), (ii), and (iii) as inputs and the actual fraud values from the validation data-frame as the target variable, build a meta-learner to predict fraud. Make sure to tune the hyper-parameters of the meta-learner keeping in mind how the results are going to be evaluated. For more info, see page 4 of DATA-MINING-CUP-2019-task.pdf file. Finally, use the best meta-learner to predict the likelihood of fraud in the test data-frame. Submit the likelihoods in a csv file. Also submit the associated cut-off value.