# Unit3_Lesson6_Project1: Credit Card Fraud
Data Source: https://www.kaggle.com/mlg-ulb/creditcardfraud<br>
**Question:** Using this credit card fraud dataset, develop an algorithm to predict fraud. Prioritize correctly finding fraud over correctly labeling non-fraudulent transactions. In the outcome column (`Class`), 1 signifies fraud and 0 signifies a non-fraudulent transaction.

Note: There is not much visibility into this dataset because it has already been processed with PCA, so I'm going to go straight into modelling the predictor.

-  First I will use a decision tree to classify the data, then employ the bagging technique using Random Forest.
-  I will then use logistic regression to see whether there is any disparity.<br>

In [81]:
#import relevant tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import math

In [82]:
#import data, create a dataframe and split into dependent/independent variables
PATH = 'unit3_data/creditcard_fraud.csv'
data = pd.read_csv(PATH)

#data into X and Y
Y = data.Class
#X = data.loc[:,'Time':'Amount']
X = data.loc[:, ~data.columns.isin(['Class'])]

#split data into train-test by position using ratio 9:1 (no shuffling)
offset = int(X.shape[0]*0.9)
x_train, x_test = X[: offset], X[offset :]
y_train, y_test = Y[: offset], Y[offset :]
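Because the split above is positional rather than shuffled, the two portions need not contain the same share of fraud. The sketch below illustrates this on hypothetical toy labels using plain Python lists in place of the pandas Series; `positional_split` and `positive_rate` are illustrative helper names, not part of the notebook.

```python
def positional_split(labels, train_fraction=0.9):
    """Split a sequence by position, mirroring the offset-based split above."""
    offset = int(len(labels) * train_fraction)
    return labels[:offset], labels[offset:]


def positive_rate(labels):
    """Fraction of labels equal to 1 (fraud)."""
    return sum(labels) / len(labels)


# Hypothetical toy labels in row order: all fraud cases happen to come early.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

train_labels, test_labels = positional_split(toy_labels)

# Compare the share of fraud in each portion.
print("train fraud rate:", positive_rate(train_labels))
print("test fraud rate: ", positive_rate(test_labels))
```

Running the same comparison on `y_train` and `y_test` (for example with `y_train.mean()` and `y_test.mean()`) would show how evenly fraud is spread across the two portions here.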

### Now I will create a function to run models and output their error rates. I will focus on missed fraud (i.e. labelling `fraud` as `non-fraud`), which the function below reports as the `Type II Error`, and try to minimize it as much as possible; the `Type I Error` is the reverse case, labelling `non-fraud` as `fraud`. In each modelling case I will evaluate errors for both the training and test datasets

In [83]:
#function for running model 
def run_model(model, x_train, y_train, x_test, y_test):
    
    # Initialize and fit the model.
    model.fit(x_train, y_train)
    
    #predict training and test data
    predict_train = model.predict(x_train)
    predict_test = model.predict(x_test)
    
    # create confusion tables (rows = actual class, columns = predicted class)
    table_train = pd.crosstab(y_train, predict_train, margins=True)
    table_test = pd.crosstab(y_test, predict_test, margins=True)
    
    # Type I (non-fraud predicted as fraud) and Type II (fraud predicted as non-fraud)
    # error rates for training data predictions, as fractions of all training observations
    train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
    train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']
    
    # the same Type I and Type II error rates for testing data predictions
    test_tI_errors = table_test.loc[0.0,1.0] / table_test.loc['All','All']
    test_tII_errors = table_test.loc[1.0,0.0] / table_test.loc['All','All']
    
    #output accuracies
    print((
    'Training set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors, train_tII_errors, test_tI_errors, test_tII_errors))
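
The quantities that `run_model` prints can be illustrated on a tiny hand-made example. This is a minimal sketch in plain Python, with hypothetical helper names (`confusion_counts`, `error_summary`) and made-up labels; it also reports recall (the share of actual fraud that is caught), which `run_model` does not compute but which is often easier to read when fraud is rare.

```python
def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels where 1 means fraud."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1  # Type I error: non-fraud predicted as fraud
        elif a == 1 and p == 0:
            fn += 1  # Type II error: fraud predicted as non-fraud
        else:
            tp += 1
    return tn, fp, fn, tp


def error_summary(actual, predicted):
    """Error rates over all observations (as in run_model), plus recall."""
    tn, fp, fn, tp = confusion_counts(actual, predicted)
    total = tn + fp + fn + tp
    return {
        "type_I_rate": fp / total,
        "type_II_rate": fn / total,
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


# Hypothetical data: 95 non-fraud rows followed by 5 fraud rows.
actual = [0] * 95 + [1] * 5
# The imagined model raises one false alarm and catches 2 of the 5 frauds.
predicted = [0] * 94 + [1] + [1, 1, 0, 0, 0]

summary = error_summary(actual, predicted)
print(summary)
```

In this toy case the Type II rate looks small because it is divided by all 100 rows, while recall shows that only 2 of the 5 fraud cases were found.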


### Decision Tree

In [84]:
from sklearn.tree import DecisionTreeClassifier

##### First let's run DT with default parameters

In [85]:
#create model with default parameters
dtm_default = DecisionTreeClassifier()

#run model in our customized function
run_model(dtm_default, x_train, y_train, x_test, y_test)

Training set accuracy:
Percent Type I errors: 0.0
Percent Type II errors: 0.0

Test set accuracy:
Percent Type I errors: 0.0005266669007408447
Percent Type II errors: 0.00028088901372845055


##### This is a very good result; let's see if we can make it better by specifying some key parameters in our DT

In [93]:
#set three important performance parameters.
#Due to the class imbalance in this dataset, one very important parameter that must be 
#set for this data is `min_samples_split`. The value should normally be 0.5-1% of all observations in a balanced-class case;
#here I will use 0.25% 
params = {'criterion':'gini', 
         'max_depth':8,
         'max_features':int(math.sqrt(len(X.columns))),
         'min_samples_split':int(0.0025*len(X))
        }

dtm_params = DecisionTreeClassifier(**params)
run_model(dtm_params, x_train, y_train, x_test, y_test)

Training set accuracy:
Percent Type I errors: 0.0004330422976990239
Percent Type II errors: 0.000542278192614093

Test set accuracy:
Percent Type I errors: 0.0001755556335802816
Percent Type II errors: 0.0003862223938766195


##### We can see a significant improvement in `Type-I-Error` on the test set here, from `0.0005266669007408447` to `0.0001755556335802816`. `Type-II-Error`, however, rose from `0.00028088901372845055` to `0.0003862223938766195`. As computed in `run_model`, Type II errors are fraud cases labelled as non-fraud, which is the error this project sets out to minimize, so the tuned tree trades fewer false alarms for slightly more missed fraud.

### Now I will use GBM to iterate over several trees for further improvement of this model

## Gradient Boosting Model
Here I will retain the tree parameters I applied above for the DT. In optimizing the GBM, my parameter of focus is `n_estimators`. So I will choose a relatively high `learning_rate` (which I can reduce later), choose a `subsample` value of 0.8 for each tree, and then run `GridSearchCV` to obtain the optimum `n_estimators`.

In [99]:
from sklearn.ensemble import GradientBoostingClassifier  #GBM algorithm
from sklearn.model_selection import cross_val_score   #Additional sklearn functions
from sklearn.model_selection import GridSearchCV   #Performing grid search
from sklearn import metrics

In [106]:
#candidate values for n_estimators: 20, 30, ..., 80
param_test = {'n_estimators':range(20,81,10)}
#note: the `iid` argument is omitted because it was removed in scikit-learn 0.24
gsearch = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, 
                                     min_samples_split=int(0.0025*len(X)),
                                     max_depth=8,
                                     max_features='sqrt',
                                     subsample=0.8,
                                     random_state=10), param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(x_train,y_train)
gsearch.best_estimator_, gsearch.best_params_, gsearch.best_score_



(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.1, loss='deviance', max_depth=8,
               max_features='sqrt', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=712,
               min_weight_fraction_leaf=0.0, n_estimators=20,
               n_iter_no_change=None, presort='auto', random_state=10,
               subsample=0.8, tol=0.0001, validation_fraction=0.1,
               verbose=0, warm_start=False),
 {'n_estimators': 20},
 0.86727666910062)

##### As we can see, the best `n_estimators` in this grid is 20, with a mean cross-validated ROC AUC of about 0.867. This `n_estimators` value, along with the other parameters mentioned above, will now be used to create our GBM model.
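
One thing worth checking after a grid search is whether the selected value sits on the edge of the grid, since the true optimum may then lie outside the range that was searched. A minimal sketch of such a check follows; `best_on_grid_edge` is a hypothetical helper, not a scikit-learn function.

```python
def best_on_grid_edge(grid_values, best_value):
    """Return True if the selected value is the smallest or largest candidate."""
    ordered = sorted(grid_values)
    return best_value in (ordered[0], ordered[-1])


# Same candidate grid as in the search above.
n_estimators_grid = range(20, 81, 10)

# 20 is the value reported by gsearch.best_params_ above.
print(best_on_grid_edge(n_estimators_grid, 20))
```

Here the selected value of 20 is the smallest candidate, which suggests that values below 20 may be worth searching as well.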

In [115]:
#Create GBM model
gbm = GradientBoostingClassifier(learning_rate=0.1, 
                                     min_samples_split=int(0.0025*len(X)),
                                     max_depth=8,
                                     max_features='sqrt',
                                     subsample=0.8,
                                     random_state=10,
                                     n_estimators=20)
#run model in our customized function
run_model(gbm, x_train, y_train, x_test, y_test)

Training set accuracy:
Percent Type I errors: 0.0003238064027839548
Percent Type II errors: 0.0007958615200955034

Test set accuracy:
Percent Type I errors: 0.00028088901372845055
Percent Type II errors: 0.0004915557740247885


##### This result is not impressive; in fact the `Type-I-Error` is worse than that of our tuned DT above. Maybe we should have run another `GridSearch` with a lower starting value for `n_estimators`, say range(5,20,5). Be that as it may, I will proceed with the same model while reducing our `learning_rate` to 0.01

In [114]:
#Create GBM model
gbm = GradientBoostingClassifier(learning_rate=0.01, 
                                     min_samples_split=int(0.0025*len(X)),
                                     max_depth=8,
                                     max_features='sqrt',
                                     subsample=0.8,
                                     random_state=10,
                                     n_estimators=20)
#run model in our customized function
run_model(gbm, x_train, y_train, x_test, y_test)

Training set accuracy:
Percent Type I errors: 0.0
Percent Type II errors: 0.00178678713825363

Test set accuracy:
Percent Type I errors: 0.0
Percent Type II errors: 0.0007022225343211264


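##### Our `Type-I-Error` has now dropped to zero on both sets, but `Type-II-Error` rose from `0.0004915557740247885` to `0.0007022225343211264` on the test set and from `0.0007958615200955034` to `0.00178678713825363` on the training set. Since `run_model` counts fraud labelled as non-fraud as a Type II error, this model raises no false alarms but misses more fraud than the previous one, so it is not an improvement for the goal of identifying fraud.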
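
To see what these rates mean in terms of individual transactions, the printed fractions can be converted back to counts. The sketch below assumes the dataset has 284,807 rows (the size of the Kaggle file, and consistent with `min_samples_split=712` in the grid search output); that row count and the helper `rate_to_count` are assumptions, not something stated in the notebook.

```python
TOTAL_ROWS = 284_807                # assumption: row count of the Kaggle file
TRAIN_ROWS = int(TOTAL_ROWS * 0.9)  # same offset rule as the split above
TEST_ROWS = TOTAL_ROWS - TRAIN_ROWS


def rate_to_count(rate, n_rows):
    """Convert an error rate over all rows back to a whole number of rows."""
    return round(rate * n_rows)


# Type II rates (fraud labelled as non-fraud) copied from the two GBM runs above,
# as (training rate, test rate).
reported_type_II = {
    "learning_rate=0.1": (0.0007958615200955034, 0.0004915557740247885),
    "learning_rate=0.01": (0.00178678713825363, 0.0007022225343211264),
}

for name, (train_rate, test_rate) in reported_type_II.items():
    missed_train = rate_to_count(train_rate, TRAIN_ROWS)
    missed_test = rate_to_count(test_rate, TEST_ROWS)
    print(name, "missed fraud (train, test):", missed_train, missed_test)
```

Under that assumption, the `learning_rate=0.01` model leaves 458 training and 20 test fraud cases undetected, against 204 and 14 for `learning_rate=0.1`.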