## Credit Card Default Problem
 - Predicting Credit Card Default Payment Problem
 - Data is based on open-source customer default payment data from Taiwan for a period in 2005
 - Data can be sourced from UCI [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#)

### Problem specification

**<mark>Introduction</mark>**

Over the years, millions of adults all over the world have defaulted on their monthly credit card payments.
In recent times, in order to gain maximum market share, banks have over-issued credit cards to borrowers with suspect credit ratings or histories. The problem is further exacerbated by the availability of cheap credit and by card holders' overuse of their credit cards for consumption irrespective of their ability to repay the debt.
In this exercise, an ensemble of Machine Learning (ML) models is used to predict this credit default risk. A number of past studies have developed predictive models of this type. One such study is the seminal paper by [Yeh, I. C., and Lien, C. H.](https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.1-lesson/assets/datasets/DefaultCreditCardClients_yeh_2009.pdf). Such models typically use financial information such as client transaction data, demographic data and repayment details to predict a card holder's risk of default.

In this study, I will be using the Taiwan credit card data provided by Yeh, I. C., and Lien, C. H., which can be sourced from the UCI [location](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#). The data consists of 30,000 samples with 23 independent variables; the dependent variable indicates whether the client defaulted on the payment for the next month.

---

**<mark>References</mark>**
- "Feature Engineering Made Easy" by Sinan Ozdemir and Divya Susarla (Chapter 5)
- Jason Brownlee's ML blog located [here](https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/)







### Approach Steps

- Read Data
- Data Cleaning:
    * Check for missing data and correct with data Imputation
    * Rename columns where necessary
- Exploratory Data Analysis:
    * Descriptive Statistics
    * Check feature data types
    * Identify features which are continuous (Numeric)
    * Identify features which are categorical (Discrete/Nominal)
    * Measure class imbalance
    * Measure correlation between the features
    * Measure skewness of univariate distributions
- Feature Preprocessing:
    * Convert categorical to numeric data (using [one-hot](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) encoding when the feature has more than 2 unique values)
    * Convert categorical to numeric data (using [label](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) encoding when the categorical variable has only 2 unique values)
    * Normalize or re-scale the numeric features, for example to the range 0-1 using [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) or to zero mean and unit variance using [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) from Scikit Learn
    * Resolve any class imbalance using re-sampling (over-sampling/under-sampling) techniques such as SMOTE (Synthetic Minority Oversampling Technique). A Python library for doing this can be found [here](http://contrib.scikit-learn.org/imbalanced-learn/stable/install.html)
- Feature Selection and Dimensionality Reduction
    * Use the Pearson correlation coefficient (including a heatmap) to find the features most highly correlated with the response (label)
    * Apply hypothesis testing with Scikit Learn's [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
    * Apply Recursive Feature Elimination with Scikit Learn's [RFE](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)
    * Use Principal Component Analysis [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to remove redundant data
- Specify the classification Models:
    * K-Nearest Neighbor [KNN](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
    * Naive Bayes [GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
    * Decision Tree [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
    * Random Forest [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
    * Support Vector Machines [svm](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
    * Quadratic Discriminant Analysis [QuadraticDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html#sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis)
    * Deep MLP (Multi Layer Perceptron) Neural Net [KerasClassifier](https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/)
- Evaluation of the Classification Models:
    * Stratified K-fold Cross validation [StratifiedKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)
    * Receiver Operating Characteristics (ROC) curves evaluation [roc_curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve)
    * Area Under Curve of ROC evaluation [auc](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc)
    * Precision and Recall Curves [precision_recall_curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html)
    * Confusion Matrix [evaluation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and [plot](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py) 
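
The encoding and scaling steps listed under Feature Preprocessing can be sketched as follows. This is only a minimal illustration on a made-up four-row frame, not the notebook's actual preprocessing: the `toy` frame and its values are assumptions (only the column names mirror the dataset), and pandas' `get_dummies` is used for the one-hot step for brevity where Scikit Learn's `OneHotEncoder` would serve the same purpose.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy data; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    'SEX': ['male', 'female', 'female', 'male'],                           # 2 unique values
    'EDUCATION': ['university', 'graduate', 'high school', 'university'],  # more than 2 unique values
    'LIMIT_BAL': [20000.0, 120000.0, 90000.0, 50000.0],                    # numeric
})

# Label-encode the binary feature: classes are sorted, so female -> 0 and male -> 1
toy['SEX'] = LabelEncoder().fit_transform(toy['SEX'])

# One-hot encode the multi-valued feature into one indicator column per category
toy = pd.get_dummies(toy, columns=['EDUCATION'])

# Rescale the numeric feature into the range 0-1
toy[['LIMIT_BAL']] = MinMaxScaler().fit_transform(toy[['LIMIT_BAL']])

print(toy)
```

One-hot encoding avoids implying an order between categories, which is why it is preferred over label encoding once a feature has more than two values.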

### Define imports

In [2]:
import numpy as np
import pandas as pd
import os
import pprint
import matplotlib.pylab as pl

### Read Credit Card Default data and define global variables

In [3]:
SEED = 10
DATA_FILE_PATH_XLS = './data/default of credit card clients.xls'
DATA_FILE_PATH_CSV = './data/default_of_credit_card_clients.csv'
RAW_CREDIT_DATA_RAW = pd.read_excel(DATA_FILE_PATH_XLS, sheet_name=0, skiprows=1, header=0)
RAW_CREDIT_DATA_RAW.to_csv(DATA_FILE_PATH_CSV)
RAW_CREDIT_DATA = pd.read_csv(DATA_FILE_PATH_CSV, index_col=0)
N_FEATURES = len(RAW_CREDIT_DATA.columns) - 1
N_SAMPLES = RAW_CREDIT_DATA.shape[0]
DEPENDENT_VARIABLE = 'default payment next month'
INDEPENDENT_VARIABLES = list(set(RAW_CREDIT_DATA.columns) - set([DEPENDENT_VARIABLE]))
N_RESPONSES = 1
SPLIT_FRACTION = 0.30
CV = 5
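
`SEED`, `SPLIT_FRACTION` and `CV` are defined here but not used in the visible cells. One plausible use, shown as a hedged sketch on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical and not part of the notebook), is a stratified hold-out split followed by stratified K-fold cross-validation on the training part.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

SEED = 10
SPLIT_FRACTION = 0.30
CV = 5

# Hypothetical stand-in data: 1000 samples, 3 features, roughly 20% positives
rng = np.random.RandomState(SEED)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.uniform(size=1000) < 0.2).astype(int)

# Hold out SPLIT_FRACTION of the rows, preserving the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=SPLIT_FRACTION, stratify=y_demo, random_state=SEED)

# CV stratified folds over the training part, shuffled reproducibly
skf = StratifiedKFold(n_splits=CV, shuffle=True, random_state=SEED)
fold_sizes = [len(val_idx) for _, val_idx in skf.split(X_train, y_train)]

print("train/test sizes:", len(y_train), len(y_test))
print("positive rate (train, test):", y_train.mean(), y_test.mean())
print("validation fold sizes:", fold_sizes)
```

Stratifying matters for an imbalanced label such as default/no default, because a plain random split can leave a fold with too few positive cases to score reliably.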

### Exploratory Data Analysis

In [4]:
pp = pprint.PrettyPrinter(indent=4)
def basicExploratoryDataFacts():
    """
    Print the number of features and samples, the label name and the feature names
    """
    print("Number of (independent) features: {}".format(N_FEATURES))
    print("Number of samples/observations: {}".format(N_SAMPLES))
    print("Label/Dependent attribute: '{}'".format(DEPENDENT_VARIABLE))
    print("Features are:")
    pp.pprint(INDEPENDENT_VARIABLES)

def getIndependentAndDependentVariables():
    """
    Extract the independent/dependent variables from the problem data
    """
    # Create our feature matrix
    X = RAW_CREDIT_DATA.drop(DEPENDENT_VARIABLE, axis=1)
    # create our response variable
    y = RAW_CREDIT_DATA[DEPENDENT_VARIABLE]
    return (X, y)

In [6]:
basicExploratoryDataFacts()
X, y = getIndependentAndDependentVariables()

Number of (independent) features: 24
Number of samples/observations: 30000
Label/Dependent attribute: 'default payment next month'
Features are:
[   'AGE',
    'SEX',
    'PAY_6',
    'PAY_4',
    'PAY_5',
    'PAY_2',
    'PAY_3',
    'PAY_0',
    'BILL_AMT5',
    'BILL_AMT4',
    'BILL_AMT6',
    'LIMIT_BAL',
    'BILL_AMT3',
    'BILL_AMT2',
    'ID',
    'PAY_AMT6',
    'PAY_AMT5',
    'PAY_AMT4',
    'PAY_AMT3',
    'PAY_AMT2',
    'PAY_AMT1',
    'BILL_AMT1',
    'MARRIAGE',
    'EDUCATION']
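
The listing above includes `ID`, which appears to be a row identifier; that would explain why 24 features are reported rather than the 23 independent variables described in the introduction. A minimal sketch, assuming `ID` carries no predictive signal and should be dropped, together with one way of measuring the class imbalance mentioned in the approach steps. The `demo` frame is a hypothetical stand-in for `RAW_CREDIT_DATA` with made-up values.

```python
import pandas as pd

DEPENDENT_VARIABLE = 'default payment next month'

# Hypothetical five-row stand-in for RAW_CREDIT_DATA; values are made up
demo = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'LIMIT_BAL': [20000, 120000, 90000, 50000, 50000],
    'AGE': [24, 26, 34, 37, 57],
    DEPENDENT_VARIABLE: [1, 1, 0, 0, 0],
})

# Drop the identifier together with the label to obtain the feature matrix
X_demo = demo.drop(['ID', DEPENDENT_VARIABLE], axis=1)
y_demo = demo[DEPENDENT_VARIABLE]

# Class imbalance: share of each label value among all samples
class_share = y_demo.value_counts(normalize=True)

print(list(X_demo.columns))
print(class_share)
```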


### Classification Models
- Define the imports for the models
- Define a MLP Neural Net classifier model
- Define other classifier models
- Define the evaluation of the models

#### Imports for the Classifier models

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import (train_test_split, StratifiedKFold, cross_val_score, GridSearchCV)
from sklearn.preprocessing import (LabelEncoder, StandardScaler, MinMaxScaler)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import (confusion_matrix, accuracy_score)
import keras.models as km
import keras.layers as kl
from keras.regularizers import l2
from keras.wrappers.scikit_learn import KerasClassifier
from tensorflow import set_random_seed
%matplotlib inline

Using TensorFlow backend.


#### Create MLP Model

In [8]:
def createMlpModel(
    num_inputs=N_FEATURES,
    num_outputs=N_RESPONSES,
    input_nodes=10,
    hidden_act='relu',
    output_act='sigmoid',
    hidden_nodes=10,
    optimizer='rmsprop',
    init='glorot_uniform'):
    '''
    Create a MLP model using Keras: two hidden layers followed by an output layer
    '''
    model = km.Sequential()
    # First hidden layer; input_dim fixes the number of input features
    model.add(kl.Dense(input_nodes, input_dim=num_inputs, activation=hidden_act, kernel_initializer=init))
    # Second hidden layer
    model.add(kl.Dense(hidden_nodes, activation=hidden_act, kernel_initializer=init))
    # Output layer: one unit per response; with a sigmoid it yields the probability of default
    model.add(kl.Dense(num_outputs, activation=output_act, kernel_initializer=init))
    # Compile model with the binary cross-entropy loss suited to a 0/1 label
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
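
To make the architecture concrete, the forward pass that such a network computes can be written out in NumPy. This is only an illustrative sketch under stated assumptions: the weights are random rather than trained, the biases are zero, and the helper names (`relu`, `sigmoid`, `mlp_forward`) are hypothetical. Keras performs the equivalent computation internally.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, weights, biases):
    """Forward pass: ReLU on every layer except the last, which uses a sigmoid."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(np.dot(a, W) + b)
    return sigmoid(np.dot(a, weights[-1]) + biases[-1])

rng = np.random.RandomState(10)
# Layer sizes follow the defaults above: 24 inputs, two layers of 10 nodes, 1 output
n_inputs, input_nodes, hidden_nodes = 24, 10, 10
shapes = [(n_inputs, input_nodes), (input_nodes, hidden_nodes), (hidden_nodes, 1)]
weights = [rng.normal(scale=0.1, size=s) for s in shapes]   # untrained, random weights
biases = [np.zeros(s[1]) for s in shapes]

X_demo = rng.normal(size=(4, n_inputs))   # four made-up samples
proba = mlp_forward(X_demo, weights, biases)
print(proba.shape)   # one value per sample
print(proba.ravel())
```

Because the last activation is a sigmoid, every output lies strictly between 0 and 1 and can be read as a predicted probability of default, which is what the binary cross-entropy loss expects.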

#### Define other Classifier models

In [9]:
from collections import OrderedDict
def createMlpParams():
    """
    Creates a dictionary of MLP params
    """
    optimizers = ['rmsprop', 'adam']
    init = ['glorot_uniform', 'normal', 'uniform']
    epochs = [50, 100, 150]
    batches = [5, 10, 20]
    param_dict = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
    return param_dict

def createModelParams():
    """
    Creates a dictionary of Model Parameters
    """
    model_params = {}
    # Logistic Regression
    lr_params = {'C':[1e-1, 1e0, 1e1, 1e2], 'penalty':['l1', 'l2']}
    model_params['Logistic'] = lr_params

    # Random Forest
    forest_params = {'n_estimators': [10, 50, 100], 'max_depth': [None, 1, 3, 5, 7]}
    model_params['RFC'] = forest_params
    
    # SVM
    svm_params = {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf', 'linear']}
    model_params['SVM'] = svm_params
                  
    # QDA
    qda_params = {'reg_param':[0, 1e-1, 1e-2, 1e0, 1e1], 'tol':[1e-4, 1e-5]}
    model_params['QDA'] = qda_params
    
    # KNN
    knn_params = {'n_neighbors': [1, 3, 5, 7]}
    knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}
    model_params['KNN'] = knn_pipe_params
    
    # MLP
    mlp_params = createMlpParams()
    mlp_pipe_params = {'classifier__{}'.format(k): v for k, v in mlp_params.items()}
    model_params['MLP'] = mlp_pipe_params
    return model_params
        
    
def createAllModels():
    """
    Creates a dictionary of classification models
    """
    # liblinear supports both the 'l1' and 'l2' penalties used in the parameter grid
    clf1 = LogisticRegression(random_state=0, solver='liblinear')
    clf2 = RandomForestClassifier(random_state=0)
    clf3 = SVC(random_state=0, probability=True)
    clf4 = QuadraticDiscriminantAnalysis()
    # KNN and the MLP are sensitive to feature scale, so both are scaled inside a pipeline
    knn = KNeighborsClassifier()
    clf5 = Pipeline([('scale', StandardScaler()), ('classifier', knn)])
    mlp = KerasClassifier(build_fn=createMlpModel, epochs=150, batch_size=10, verbose=0)
    clf6 = Pipeline([('scale', StandardScaler()), ('classifier', mlp)])
    classifiers = [('Logistic', clf1), ('RFC', clf2), ('SVM', clf3), ('QDA', clf4), ('KNN', clf5), ('MLP', clf6)]    
    models = OrderedDict(classifiers)
    return models

def createAllModelsAndParams(models, params):
    """
    Returns a dictionary of all the models/params
    """
    models_and_params = {key:(params[key], models[key]) for key in models}
    return models_and_params

def countNumberOfCalculations(params):
    """
    Returns the number of parameter combinations tried across all the grid searched models
    (each combination is fitted once per cross-validation fold)
    """
    count = 0
    for key in params:
        model_param = params[key]
        # A grid tries every combination, so multiply the lengths of the value lists
        n_combinations = 1
        for key2 in model_param:
            n_combinations *= len(model_param[key2])
        count += n_combinations
    return count

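def createEnsembleVotingModel(optimal_models):
    """
    Creates a soft-voting ensemble from a dictionary of (name -> estimator).
    The weights assume six estimators in the order used by createAllModels
    """
    eclf = VotingClassifier(estimators=list(optimal_models.items()), weights=[2, 1, 1, 1, 1, 2], voting='soft')
    return eclf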
    

def evaluateBestModels(models_and_params, X, y, cv=CV):
    """
    Evaluates each model with a cross-validated grid search over its parameters
    """
    best_models = {}
    model_names = []
    model_accuracy = []
    model_best_params = []
    model_avg_fit_time = []
    model_avg_score_time = []
    total_count = len(models_and_params)
    curr_count = 0
    for model_name, model_and_param in models_and_params.items():
        curr_count += 1
        per_progress = 100.0*(float(curr_count)/float(total_count))
        params = model_and_param[0]
        model = model_and_param[1]
        grid = GridSearchCV(model, # the model to grid search
                        params, # the parameter set to try
                        cv=cv, # the number of cross-validation folds
                        error_score=0., n_jobs=-1) # if a parameter set raises an error, continue and score it as 0
        grid.fit(X, y) # fit the model and parameters
        best_models[model_name] = grid
        model_names.append(model_name)
        model_accuracy.append("{}".format(grid.best_score_))
        model_best_params.append("{}".format(grid.best_params_))
        model_avg_fit_time.append("{}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
        model_avg_score_time.append("{}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))
        print("Current processing model: {0} and the progress is: {1} ".format(model_name, per_progress))
    results = {
        'Model Names': model_names,
        'Best Accuracy': model_accuracy,
        'Best Params': model_best_params,
        'Avg Fit Time': model_avg_fit_time,
        'Avg Score Time': model_avg_score_time
    }
    results_table = pd.DataFrame(results)
    return best_models, results_table
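
The `classifier__` prefix used for the KNN and MLP grids, and the size of the search that `GridSearchCV` performs, can be checked in isolation. The sketch below is a small stand-alone illustration rather than part of the notebook's pipeline; it reuses the KNN and SVM grids from above and relies on Scikit Learn's `ParameterGrid` to enumerate the combinations.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()), ('classifier', KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
knn_params = {'n_neighbors': [1, 3, 5, 7]}
knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}
pipe.set_params(classifier__n_neighbors=3)   # the same naming scheme the grid search relies on

# ParameterGrid enumerates every combination that a grid search would try
svm_params = {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf', 'linear']}
n_knn = len(ParameterGrid(knn_pipe_params))
n_svm = len(ParameterGrid(svm_params))
print("KNN combinations:", n_knn)
print("SVM combinations:", n_svm)
```

Each combination is fitted once per cross-validation fold, so the number of fits grows multiplicatively with every parameter added to a grid. This is worth keeping in mind for the slower models such as the SVM and the MLP.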
    

In [None]:
model_params = createModelParams()
models = createAllModels()
models_and_params = createAllModelsAndParams(models, model_params)
best_models, results_table = evaluateBestModels(models_and_params, X, y) 

Current processing model: KNN and the progress is: 16.6666666667 
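
Once the grid searches have finished, the tuned estimators could be combined into the soft-voting ensemble and scored with the area under the ROC curve. The following is a hedged sketch of that step on synthetic data, not a record of what the notebook ran: the data, the two small grids and the equal weights are all assumptions made for illustration, and `best_estimator_` is the refitted model that `GridSearchCV` exposes by default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical synthetic data standing in for the credit card features
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.8, 0.2], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=10)

# Two small grid searches, standing in for the fitted grids in best_models
grids = {
    'Logistic': GridSearchCV(LogisticRegression(solver='liblinear'),
                             {'C': [0.1, 1.0, 10.0]}, cv=5),
    'RFC': GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                        {'max_depth': [3, 5]}, cv=5),
}
for grid in grids.values():
    grid.fit(X_train, y_train)

# Soft voting averages the predicted class probabilities of the tuned estimators
estimators = [(name, grid.best_estimator_) for name, grid in grids.items()]
eclf = VotingClassifier(estimators=estimators, voting='soft', weights=[1, 1])
eclf.fit(X_train, y_train)

# Area under the ROC curve on the held-out rows
test_auc = roc_auc_score(y_test, eclf.predict_proba(X_test)[:, 1])
print("hold-out ROC AUC: {:.3f}".format(test_auc))
```

Passing `best_estimator_` rather than the `GridSearchCV` object itself avoids repeating the whole search every time the ensemble is fitted.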


In [16]:
l = [(1,2), (1,4)]
l.append((1,3))
l1, l2 = zip(*l)
print("l1 = {0}\nl2 = {1}".format(l1, l2))

l1 = (1, 1, 1)
l2 = (2, 4, 3)


In [1]:
x = {'A': range(10), 'B':range(10,100,10)}
for key, value in x.items():
    print(key, value)

('A', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
('B', [10, 20, 30, 40, 50, 60, 70, 80, 90])
