# The Team

- Daniel Davieau
- Nathan Wall
- Olga Tanyuk
- Paul Panek

# Data Preparation Part 1

In [8]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_csv('data/CreditcardDefaults.csv')
df.rename(columns={'default payment next month':'default'}, inplace=True)

#set index to the "ID" value and remove the ID column
df.index = df.ID
del df['ID']

#Create Lists for Analysis
continuous_features = ['LIMIT_BAL', 'BILL_AMT1', 'BILL_AMT2','BILL_AMT3',
                       'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
                       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5',
                       'PAY_AMT6']
ordinal_features = ['EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0','PAY_2', 'PAY_3',
                    'PAY_4', 'PAY_5', 'PAY_6','default']

#Convert datatypes
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)

# Map unidentified EDUCATION codes (0, 5, 6) to 4 ('Other')
df['EDUCATION'] = df['EDUCATION'].replace(to_replace=[0, 5, 6], value=4)

# Map the unidentified MARRIAGE code 0 to 3 ('Other')
df['MARRIAGE'] = df['MARRIAGE'].replace(to_replace=0, value=3)

# Log-transform the continuous variables, since each has a roughly
# exponential distribution; payment amounts get +1 so that zero maps to log(1) = 0
df["log_LIMIT_BAL"]=np.log(df.LIMIT_BAL)
df["log_PAY_AMT1"]=np.log(df.PAY_AMT1+1)
df["log_PAY_AMT2"]=np.log(df.PAY_AMT2+1)
df["log_PAY_AMT3"]=np.log(df.PAY_AMT3+1)
df["log_PAY_AMT4"]=np.log(df.PAY_AMT4+1)
df["log_PAY_AMT5"]=np.log(df.PAY_AMT5+1)
df["log_PAY_AMT6"]=np.log(df.PAY_AMT6+1)

#Create a separate dataset with only useful variables as identified in Lab1 and Mini-lab1.
df = df[['SEX','EDUCATION','MARRIAGE','AGE', 'default'
            ,'PAY_0','PAY_2', 'PAY_3', 'PAY_4', 'PAY_5','PAY_6', "log_LIMIT_BAL"
            ,"log_PAY_AMT1","log_PAY_AMT2","log_PAY_AMT3","log_PAY_AMT4","log_PAY_AMT5"
            ,"log_PAY_AMT6"]]

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 18 columns):
SEX              30000 non-null int64
EDUCATION        30000 non-null int64
MARRIAGE         30000 non-null int64
AGE              30000 non-null int64
default          30000 non-null int64
PAY_0            30000 non-null int64
PAY_2            30000 non-null int64
PAY_3            30000 non-null int64
PAY_4            30000 non-null int64
PAY_5            30000 non-null int64
PAY_6            30000 non-null int64
log_LIMIT_BAL    30000 non-null float64
log_PAY_AMT1     30000 non-null float64
log_PAY_AMT2     30000 non-null float64
log_PAY_AMT3     30000 non-null float64
log_PAY_AMT4     30000 non-null float64
log_PAY_AMT5     30000 non-null float64
log_PAY_AMT6     30000 non-null float64
dtypes: float64(7), int64(11)
memory usage: 4.3 MB


In [9]:
# One-hot encoding of "EDUCATION" and "MARRIAGE".
tmp_df_1 = pd.get_dummies(df.EDUCATION,prefix='EDUCATION')
tmp_df_2 = pd.get_dummies(df.MARRIAGE,prefix='MARRIAGE')
dfsub1 = pd.concat((df,tmp_df_1,tmp_df_2),axis=1)
#Drop variables for which we used one-hot encoding
del dfsub1['EDUCATION']
del dfsub1['MARRIAGE']


split = np.random.rand(len(dfsub1)) < 0.8

df_train = dfsub1[split]
df_test = dfsub1[~split]

from sklearn.preprocessing import StandardScaler
# fit training for scaling after upsampling
X_train = df_train.drop(columns=['default']).values     
scl_obj = StandardScaler()
scl_obj.fit(X_train)

print("Dimensions of training data " , df_train.shape)
print("Dimensions of test are " , df_test.shape)

Dimensions of training data  (24107, 23)
Dimensions of test are  (5893, 23)


In [10]:
dfsub1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 23 columns):
SEX              30000 non-null int64
AGE              30000 non-null int64
default          30000 non-null int64
PAY_0            30000 non-null int64
PAY_2            30000 non-null int64
PAY_3            30000 non-null int64
PAY_4            30000 non-null int64
PAY_5            30000 non-null int64
PAY_6            30000 non-null int64
log_LIMIT_BAL    30000 non-null float64
log_PAY_AMT1     30000 non-null float64
log_PAY_AMT2     30000 non-null float64
log_PAY_AMT3     30000 non-null float64
log_PAY_AMT4     30000 non-null float64
log_PAY_AMT5     30000 non-null float64
log_PAY_AMT6     30000 non-null float64
EDUCATION_1      30000 non-null uint8
EDUCATION_2      30000 non-null uint8
EDUCATION_3      30000 non-null uint8
EDUCATION_4      30000 non-null uint8
MARRIAGE_1       30000 non-null uint8
MARRIAGE_2       30000 non-null uint8
MARRIAGE_3       30000 non-null uint8
dtypes: float64(7), int64(9), uint8(7)

# Data Preparation Part 2


###  Field Definitions:

- The items in the final data set are shown below.
- Monetary amounts are in New Taiwan dollars (NT$)


- SEX
     -  1 = Male
     -  2 = Female
        

- AGE = Credit Card holder age in years


- default:   Whether or not the customer defaulted (Target for predicting default)

    
- Payment history (2005)
     - PAY_0 = September
     - PAY_2 = August
     - PAY_3 = July
     - PAY_4 = June
     - PAY_5 = May
     - PAY_6 = April
          -  -1 = payment received on time
          -   1 = payment received one month late
          -   2 = payment received two months late
          -   "......"
          -   9 = payment received nine months late or more
         

- log_LIMIT_BAL:  Natural log of the Credit Limit (max amount of credit allowed) 

    
- log_PAY_AMT:  Natural log of the amount paid by month + $1
     - log_PAY_AMT1 = September
     - log_PAY_AMT2 = August
     - log_PAY_AMT3 = July
     - log_PAY_AMT4 = June
     - log_PAY_AMT5 = May
     - log_PAY_AMT6 = April

    
 - EDUCATION (one-hot encoded when predicting default; when EDUCATION is the target, the un-encoded form listed further below is used)
     - EDUCATION_1 = Graduate School
     - EDUCATION_2 = University
     - EDUCATION_3 = High School
     - EDUCATION_4 = Other

        
 - EDUCATION (appears in this form only as the target when predicting Education; not one-hot encoded)
     - 1 = Graduate School
     - 2 = University
     - 3 = High School
     - 4 = Other

        
 - MARRIAGE (One-Hot Encoded)
     - MARRIAGE_1 = Married
     - MARRIAGE_2 = Single
     - MARRIAGE_3 = Other
    
    
###  OverSampling:

Due to class imbalance in the default variable, we oversampled records with default = 1 to create a balanced training set.
The test set was not re-sampled.
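
The oversampling step described above can be sketched as follows. This is a minimal illustration that assumes the minority class is resampled with replacement until it matches the majority class count. The toy data frame, the `oversample_minority` helper, and the use of `sklearn.utils.resample` are assumptions made for this example, not necessarily the method used for the models in this notebook.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(train_df, target="default", random_state=0):
    """Hypothetical helper: upsample the smaller class of a binary target."""
    counts = train_df[target].value_counts()
    if counts.min() == counts.max():
        # Already balanced, nothing to do
        return train_df.copy()
    majority = train_df[train_df[target] == counts.idxmax()]
    minority = train_df[train_df[target] == counts.idxmin()]
    # Draw minority rows with replacement until both classes are the same size
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority),
                           random_state=random_state)
    return pd.concat([majority, minority_up])


# Toy training set (made-up values): 8 non-defaults, 2 defaults
toy_train = pd.DataFrame({
    "AGE": [25, 31, 40, 52, 36, 29, 47, 33, 58, 44],
    "default": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

balanced = oversample_minority(toy_train)
print(balanced["default"].value_counts())
```

Only the training rows are passed to the helper. The test rows keep their original class proportions, so that evaluation reflects the real default rate.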

### Scaling:

Explanatory variables have been standard-scaled in the test and train sets for both modeling objectives (default and Education).
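
A minimal sketch of the scaling step, assuming the scaler is fit on the training data only and then applied to both sets (the earlier cell fits `scl_obj` on `X_train`; the transform calls here are the assumed follow-up). The toy arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (made-up values), two features each
X_train_toy = np.array([[1.0, 10.0],
                        [2.0, 20.0],
                        [3.0, 30.0],
                        [4.0, 40.0]])
X_test_toy = np.array([[2.5, 25.0],
                       [5.0, 50.0]])

scaler = StandardScaler()
scaler.fit(X_train_toy)                        # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)   # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # per-feature mean of the scaled training data
print(X_train_scaled.std(axis=0))   # per-feature std of the scaled training data
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.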

### Excluded Fields:

Data elements related to amounts billed by month were included in the original data set, but are excluded here.  The exploratory analysis from earlier work indicated that Amount Billed is highly correlated with Payment Amount, yet has less explanatory value.


In [12]:
# One-hot encode "MARRIAGE" only; "EDUCATION" stays un-encoded because it is the target here.
tmp_df_2 = pd.get_dummies(df.MARRIAGE,prefix='MARRIAGE')
dfsub2 = pd.concat((df,tmp_df_2),axis=1)
del dfsub2['MARRIAGE']

from sklearn.model_selection import ShuffleSplit

# Build X and y for the Education objective: EDUCATION is the target
if 'EDUCATION' in dfsub2:
    y = dfsub2['EDUCATION'].values # get the labels we want
    del dfsub2['EDUCATION'] # get rid of the class label
    X = dfsub2.values # use everything else to predict!

    ## X and y are now NumPy arrays; calling 'values' on the pandas objects
    #    converts them into plain arrays for use with scikit-learn
    
    
# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 5
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
dfsub2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 19 columns):
SEX              30000 non-null int64
AGE              30000 non-null int64
default          30000 non-null int64
PAY_0            30000 non-null int64
PAY_2            30000 non-null int64
PAY_3            30000 non-null int64
PAY_4            30000 non-null int64
PAY_5            30000 non-null int64
PAY_6            30000 non-null int64
log_LIMIT_BAL    30000 non-null float64
log_PAY_AMT1     30000 non-null float64
log_PAY_AMT2     30000 non-null float64
log_PAY_AMT3     30000 non-null float64
log_PAY_AMT4     30000 non-null float64
log_PAY_AMT5     30000 non-null float64
log_PAY_AMT6     30000 non-null float64
MARRIAGE_1       30000 non-null uint8
MARRIAGE_2       30000 non-null uint8
MARRIAGE_3       30000 non-null uint8
dtypes: float64(7), int64(9), uint8(3)
memory usage: 4.0 MB
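
As a hedged illustration of how a `ShuffleSplit` object like `cv_object` is consumed, the sketch below iterates over its train/test index pairs. The toy arrays, the `cv_toy` name, and the fixed `random_state` are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy stand-ins for X and y (made-up values): 20 rows, 4 education levels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20) % 4 + 1

cv_toy = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

fold_sizes = []
for train_idx, test_idx in cv_toy.split(X_toy, y_toy):
    # Each iteration yields integer row positions for one random 80/20 split
    X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
    y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)
```

Unlike k-fold cross validation, each `ShuffleSplit` iteration draws an independent random split, so a row can appear in the test portion of more than one iteration.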


# Modeling and Evaluation 1

The choice of evaluation metric should consider the purpose for which the projects are undertaken.  Our objective with this data set is to predict two variables:

1. Whether a given customer defaults or not, and
2. The level of Education for a given customer.
    
- Predicting whether a customer will default.

A client may use this model to determine which customer-facing strategies to deploy for a given customer.  If a customer is expected to default, the collection strategy should be more intensive, and management of credit lines should be more conservative.  While such strategies are designed to limit losses from defaults, they may also lead to sub-optimal customer experiences for those who do not default.  Customers who are treated as likely to default, but who would not have defaulted, are expected to attrit.

Precision is therefore important: low precision would lead to higher customer attrition.  Recall is also important: low recall represents a missed opportunity to apply loss-mitigation strategies to customers who need them, and will result in high default costs.

Since balancing these two factors is important, we will choose F1 as the evaluation metric for this suite of models. 
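
To make the trade-off concrete, the sketch below computes precision, recall, and F1 (their harmonic mean, 2PR / (P + R)) on a small set of made-up labels using `sklearn.metrics`. The label vectors are purely illustrative and are not results from this data set.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = default, 0 = no default
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Precision: of the customers flagged as defaulters, how many actually default
precision = precision_score(y_true_toy, y_pred_toy)
# Recall: of the customers who actually default, how many were flagged
recall = recall_score(y_true_toy, y_pred_toy)
# F1: harmonic mean of the two, low if either one is low
f1 = f1_score(y_true_toy, y_pred_toy)

print(precision, recall, f1)
```

Here two of three flagged customers are true defaulters and two of four defaulters are caught, so F1 sits between the two scores and is pulled toward the lower one.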

- Predicting Education

One use for this suite of models is to target marketing strategies for other products by education level when the customer has not provided that information.

The evaluation metric should again consider a balance between Recall and Precision.  Low Recall again represents a missed opportunity to apply a more targeted strategy, and low Precision would result in the wrong strategy applied to a given prospect.  
Applying the wrong strategy can be expected to decrease the credibility of the business in the customer's view, and may make the customer less likely to respond to other offers.

Since F1 considers both, we will choose it as the evaluation metric for this suite of models as well.
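
Education has four classes, so F1 has to be aggregated across classes. The text above does not say which averaging is used, so the sketch below shows two common options from `sklearn.metrics.f1_score` as assumptions: `average='macro'` (every class counts equally) and `average='weighted'` (classes weighted by their frequency). The labels are made up.

```python
from sklearn.metrics import f1_score

# Made-up labels: 1 = Graduate School, 2 = University, 3 = High School, 4 = Other
y_true_edu = [1, 1, 2, 2, 2, 3, 3, 4]
y_pred_edu = [1, 2, 2, 2, 3, 3, 3, 4]

per_class_f1 = f1_score(y_true_edu, y_pred_edu, average=None)       # one score per class
macro_f1 = f1_score(y_true_edu, y_pred_edu, average="macro")        # unweighted mean
weighted_f1 = f1_score(y_true_edu, y_pred_edu, average="weighted")  # support-weighted mean

print(per_class_f1, macro_f1, weighted_f1)
```

Macro averaging gives the rare "Other" class the same influence as the large "University" class, while weighted averaging lets the larger classes dominate. Which is preferable depends on how much the small groups matter to the marketing use case.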








In [None]:
# Modeling and Evaluation 2

In [None]:
# Modeling and Evaluation 3

In [None]:
# Modeling and Evaluation 4

In [None]:
# Modeling and Evaluation 5

In [None]:
# Modeling and Evaluation 6

In [None]:
# Deployment

In [None]:
# Exceptional Work