# In-depth Analysis (Applying Machine Learning)

## Step 1) Read the Manual

Before going further, we review what the dataset documentation on Kaggle and the UCI Machine Learning Repository says about the data.

From Kaggle, an overview of the variables:

There are 25 variables:

* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default.payment.next.month: Default payment (1=yes, 0=no)

And from UCI:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 
* X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
* X2: Gender (1 = male; 2 = female). 
* X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
* X4: Marital status (1 = married; 2 = single; 3 = others). 
* X5: Age (year). 
* X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
* X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. 
* X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. 

Potential issue: We'll want to group values 5 and 6 for Education into one value (looking at the Kaggle description) since they both stand for "unknown". And perhaps we'll want to include 4 in that grouping since it has the value of "others".
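
Before recoding, it helps to see which codes actually occur in the column. The following is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the dataset), showing how the two "unknown" codes could be folded into "others":

```python
import pandas as pd

# Toy EDUCATION column; illustrative values only, not real data
education = pd.Series([1, 2, 2, 3, 4, 5, 6, 2], name="EDUCATION")

# Count how often each code appears
print(education.value_counts().sort_index())

# Collapse the two "unknown" codes (5, 6) into "others" (4)
grouped = education.replace({5: 4, 6: 4})
print(sorted(grouped.unique()))
```

Whether 4 really belongs with 5 and 6 is a judgment call; the sketch only shows the mechanics.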

## Step 2) Review the Data Types

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import random
import sklearn

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

In [2]:
filename = 'UCI_Credit_Card.csv'

In [3]:
data = pd.read_csv(filename, index_col=0)

In [4]:
pd.set_option('display.max_columns', 500)
data.sample(5).transpose()

ID,10636,6326,27138,16412,4569
LIMIT_BAL,90000.0,50000.0,490000.0,200000.0,30000.0
SEX,2.0,1.0,1.0,2.0,1.0
EDUCATION,2.0,3.0,3.0,2.0,2.0
MARRIAGE,2.0,1.0,1.0,2.0,2.0
AGE,24.0,51.0,45.0,38.0,23.0
PAY_0,0.0,-1.0,0.0,0.0,0.0
PAY_2,0.0,-1.0,0.0,0.0,0.0
PAY_3,0.0,0.0,2.0,0.0,0.0
PAY_4,0.0,0.0,0.0,0.0,0.0
PAY_5,-2.0,0.0,0.0,0.0,0.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
LIMIT_BAL                     30000 non-null float64
SEX                           30000 non-null int64
EDUCATION                     30000 non-null int64
MARRIAGE                      30000 non-null int64
AGE                           30000 non-null int64
PAY_0                         30000 non-null int64
PAY_2                         30000 non-null int64
PAY_3                         30000 non-null int64
PAY_4                         30000 non-null int64
PAY_5                         30000 non-null int64
PAY_6                         30000 non-null int64
BILL_AMT1                     30000 non-null float64
BILL_AMT2                     30000 non-null float64
BILL_AMT3                     30000 non-null float64
BILL_AMT4                     30000 non-null float64
BILL_AMT5                     30000 non-null float64
BILL_AMT6                     30000 non-null float64
PAY_AMT1  

All columns in this dataset have a numeric type. They are either float-valued (continuous) or int-valued (discrete). Nothing seems to be off, so we may continue.
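
The same check can be done programmatically rather than by reading the `info()` output. This is a rough sketch on a toy frame standing in for `data` (the column values are made up), using `pandas.api.types.is_numeric_dtype`:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame standing in for `data`; values are made up
toy = pd.DataFrame({"LIMIT_BAL": [90000.0, 50000.0], "SEX": [2, 1], "AGE": [24, 51]})

# Collect any column whose dtype is not numeric
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]

# Count missing values per column
missing = toy.isnull().sum()

print(non_numeric)
print(missing.sum())
```

An empty list and a total of zero would confirm that every column is numeric and fully populated.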

In [6]:
display(data.shape)

(30000, 24)

## Step 3) Fixing the Issues (Data Cleaning):

#### Problem 1: Get rid of Bad Column Names


In [7]:
## Rename columns
data.rename(columns={'PAY_0': 'PAY_1', 'default.payment.next.month': 'default'}, inplace=True)

#### Problem 2: Replace Negative Values with 0 in Pay_X columns

To deal with the negative values in the PAY_X columns, a sensible solution is to convert all non-positive values to 0. The dataset description says that a value of -1 means "pay duly" and that positive values represent a payment delay by that number of months. Converting the -1 and -2 values to 0, and letting 0 represent "pay duly", is therefore logical.

In [8]:
for i in range(1,7):
    data.loc[data["PAY_" + str(i)] < 0, "PAY_" + str(i)] = 0
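
The loop above could also be written as a single vectorized step with `DataFrame.clip`. A minimal sketch on a toy frame (the values are illustrative only):

```python
import pandas as pd

# Toy PAY_X columns; values are illustrative only
toy = pd.DataFrame({"PAY_1": [-2, -1, 0, 2], "PAY_2": [-1, 0, 1, 3]})
pay_cols = ["PAY_1", "PAY_2"]

# Raise every negative status to 0 in one vectorized step
toy[pay_cols] = toy[pay_cols].clip(lower=0)
print(toy)
```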

#### Problem 3: Get rid of Values of 0 for Marriage

A logical move is to group the 0 values with the "Other" values, coded as 3, so that is what we'll do:


In [9]:
data.loc[data["MARRIAGE"] == 0, 'MARRIAGE'] = 3

"Other" for marriage can possibly refer to divorced, widowed, separated, etc.

#### Problem 4: Get rid of 0 Values for Education

Currently coded as:
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

We see that 0 is not even in the dataset description, and that there are two values for "unknown". A logical move is therefore to convert the 0, 5, and 6 values to 4, which is what we'll do. "Other" can refer to education below high school or perhaps vocational training.

In [10]:
replace = (data["EDUCATION"] == 0) | (data["EDUCATION"] == 5) | (data["EDUCATION"] == 6) 
data.loc[replace,'EDUCATION'] = 4

## Step 4) Analysis

### Supervised Learning

#### Preprocessing

We now need to encode SEX, EDUCATION, and MARRIAGE appropriately.

In [11]:
replace_map = {'SEX': {1:"Male", 2:"Female"}, 'EDUCATION': {1: "Grad School", 2: "University", 3:"High School", 4:"Other"}, 'MARRIAGE': {1:"Married", 2:"Single", 3:"Other"}}
data.replace(replace_map, inplace=True)

In [12]:
# # Changed

# data['default'] = data['default'].astype('category') 
# #Convert default variable from int64 to categorical variable

# data = pd.get_dummies(data, columns=['SEX', 'EDUCATION', 'MARRIAGE'], prefix=['SEX', 'EDUCATION', 'MARRIAGE'])

# # Not important because will just use X and y
# col_at_end = ['default']
# data = data[[column for column in data if column not in col_at_end] + [column for column in col_at_end if column in data]]
# ## Put default column at the end of the dataframe

# # Don't use Pandas categoricals
# data['PAY_1'] = data.PAY_1.astype('category')
# data['PAY_2'] = data.PAY_2.astype('category')
# data['PAY_3'] = data.PAY_3.astype('category')
# data['PAY_4'] = data.PAY_4.astype('category')
# data['PAY_5'] = data.PAY_5.astype('category')
# data['PAY_6'] = data.PAY_6.astype('category')


In [None]:
data = pd.get_dummies(data, columns=['SEX', 'EDUCATION', 'MARRIAGE'], prefix=['SEX', 'EDUCATION', 'MARRIAGE'], drop_first=True)
# switch this to sklearn OneHotEncoder later

In [13]:
sklearn.__version__

'0.20.2'

Let's examine the PAY_X columns now:

In [None]:
df = data[['PAY_6', 'PAY_5', 'BILL_AMT6', 'PAY_AMT6']]
df.columns = ['Repayment status in April', 'Repayment status in May', 'Amount of bill statement in April', 'Amount of previous payment in April']

In [None]:
df.loc[(df['Amount of bill statement in April'] < df['Amount of previous payment in April']) & (df['Repayment status in April'] < df['Repayment status in May'])]

These rows represent instances where the repayment status is **worse** in May than in April even though the April payment exceeded the April bill.

Should I convert the PAY_X variables to 0/1? Maybe see what kind of accuracy we get with each, then decide.
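
For reference, binarizing would amount to flagging any delay at all. A minimal sketch on toy values (illustrative only), assuming the negative codes have already been set to 0:

```python
import pandas as pd

# Toy PAY_X columns after cleaning; values are illustrative only
toy = pd.DataFrame({"PAY_1": [0, 0, 1, 3], "PAY_2": [0, 2, 0, 8]})

# 1 = any payment delay, 0 = paid duly
binary = (toy > 0).astype(int)
print(binary)
```

This discards how long the delay was, which is the information the comparison of accuracies would be testing.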

##### KNN

**Supervised Learning with Scikit Learn - Classification**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)

In [None]:
y = data['default'].values
X = data.drop('default', axis=1).values

Ask question about this on SB or SO

In [None]:
X.shape

In [None]:
knn.fit(X, y)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

In [None]:
knn = KNeighborsClassifier(n_neighbors=8)

In [None]:
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
print("Test set predictions: \n {}".format(y_pred))

In [None]:
knn.score(X_test, y_test)

In [None]:
# Setup arrays to store train and test accuracies
neighbors = np.arange(1,21)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))


In [None]:
# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)
    
    #Fit the classifier to the training data
    knn.fit(X_train, y_train)
    
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    
    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate Plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()

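What stands out is that testing accuracy increases for even numbers of neighbors while training accuracy decreases. An even k is unusual for a binary KNN classifier because the votes can tie, so it is not obvious why this pattern appears. One plausible explanation is that tied votes are resolved in favor of one class; if that class is the majority class (no default), testing accuracy would rise at even k.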
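
To see how a tie can arise with an even k, here is a minimal sketch on made-up one-dimensional data (not the credit data). The query point is equally far from one example of each class, so with `n_neighbors=2` each class gets one vote:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: one training point per class
X_toy = np.array([[0.0], [1.0]])
y_toy = np.array([0, 1])
query = np.array([[0.5]])

knn_even = KNeighborsClassifier(n_neighbors=2).fit(X_toy, y_toy)

# Each class receives exactly one of the two votes
proba = knn_even.predict_proba(query)
print(proba)

# The predicted label depends on how the library breaks the tie
print(knn_even.predict(query))
```

Checking which label `predict` returns here is a quick way to test whether tie-breaking could explain the even/odd pattern in the plot.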

Also, our accuracy is not that high; it never exceeds 80%. If we just predicted no default every time, we would get an accuracy of about 78%, so this is disappointing.

In [None]:
1  - data['default'].value_counts()[1]/len(data) ## percentage of records that are 'no default' records
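
The "always predict no default" baseline can also be expressed as a model, which makes it easy to score with the same tooling as KNN. A minimal sketch using `DummyClassifier` on made-up labels with roughly the same imbalance (78 non-defaults, 22 defaults; these are not the real counts):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 78 non-defaults, 22 defaults
y_toy = np.array([0] * 78 + [1] * 22)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```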

Pretty sure I need to normalize the features to prevent the money features from overwhelming the categorical features in KNN.
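
A small numeric sketch of why that matters: with raw features, the Euclidean distance between two clients is driven almost entirely by the NT-dollar column. The two clients and the feature ranges below are made up for illustration.

```python
import numpy as np

# Two made-up clients: [LIMIT_BAL in NT dollars, PAY_1 status]
a = np.array([200000.0, 0.0])
b = np.array([210000.0, 8.0])

# Raw Euclidean distance: the 10,000 NT dollar gap swamps the 8-month gap
raw = np.linalg.norm(a - b)

# Min-max scale each feature using assumed (hypothetical) ranges
lo = np.array([10000.0, 0.0])
hi = np.array([1000000.0, 8.0])
diff_scaled = np.abs((a - lo) / (hi - lo) - (b - lo) / (hi - lo))
scaled = np.linalg.norm(diff_scaled)

print(raw, scaled)
print(diff_scaled)
```

After scaling, the difference in repayment status contributes far more to the distance than the difference in credit limit, which is the opposite of the raw case.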

**Supervised Learning with Scikit Learn - Regression**

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
knn = KNeighborsClassifier(n_neighbors=12)

In [None]:
cv_results = cross_val_score(knn, X_train, y_train, cv=5)

In [None]:
cv_results

**Supervised Learning with Scikit Learn - Fine Tuning Your Model**

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))

##### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
logreg = LogisticRegression()

In [None]:
logreg.fit(X_train, y_train)

In [None]:
y_pred = logreg.predict(X_test)

In [None]:
from sklearn.metrics import roc_curve

In [None]:
y_pred_prob = logreg.predict_proba(X_test)[:,1]
fpr, tpr, thresholds  = roc_curve(y_test, y_pred_prob)

In [None]:
plt.plot([0,1], [0,1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred_prob)

In [None]:
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

In [None]:
cv_scores

*Hyperparameter Tuning*

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {'n_neighbors': np.arange(1,10)}

In [None]:
knn = KNeighborsClassifier()

In [None]:
knn_cv = GridSearchCV(knn, param_grid, cv=5)

In [None]:
knn_cv.fit(X,y)

In [None]:
knn_cv.best_params_

In [None]:
knn_cv.best_score_

In [None]:
# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [None]:
# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

In [None]:
# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

In [None]:
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

In [None]:
# Fit it to the data
logreg_cv.fit(X,y)

In [None]:
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

##### Decision Trees

In [None]:
# Import necessary modules 
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3,None],
             "max_features": randint(1,9),
             "min_samples_leaf": randint(1,9),
             "criterion": ["gini", "entropy"]}

In [None]:
# Instantiate a Decision Tree Classifier: tree
tree = DecisionTreeClassifier()

In [None]:
# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

In [None]:
# Fit it to the data
tree_cv.fit(X,y)

In [None]:
# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

In [None]:
X.shape

*Hold-out set for final evaluation*

**Supervised Learning with Scikit Learn - Preprocessing and Pipelines**

*Preprocessing data*

In [None]:
data2 = pd.read_csv(filename, index_col=0)
## Rename columns
data2.rename(columns={'PAY_0': 'PAY_1', 'default.payment.next.month': 'default'}, inplace=True)
for i in range(1,7):
    data2.loc[data2["PAY_" + str(i)] < 0, "PAY_" + str(i)] = 0
data2.loc[data2["MARRIAGE"] == 0, 'MARRIAGE'] = 3
replace = (data2["EDUCATION"] == 0) | (data2["EDUCATION"] == 5) | (data2["EDUCATION"] == 6) 
data2.loc[replace,'EDUCATION'] = 4

In [None]:
replace_map = {'SEX': {1:"Male", 2:"Female"}, 'EDUCATION': {1: "Grad School", 2: "University", 3:"High School", 4:"Other"}, 'MARRIAGE': {1:"Married", 2:"Single", 3:"Other"}}
data2.replace(replace_map, inplace=True)

In [None]:
df2 = pd.get_dummies(data2, drop_first=True) #Use sklearn instead

In [None]:
df2.shape

In [None]:
df2.head().shape

In [None]:
df2.info()

In [None]:
# Convert the PAY_X columns of df2 (not data) to pandas categoricals
for i in range(1, 7):
    df2['PAY_' + str(i)] = df2['PAY_' + str(i)].astype('category')

In [None]:
col_at_end = ['default']
df2 = df2[[column for column in df2 if column not in col_at_end] + [column for column in col_at_end if column in df2]]
## Put default column at the end of the dataframe

In [None]:
df2.boxplot(column='AGE')

In [None]:
df2.info()

In [None]:
df2.describe().T

*Handling Missing Data*

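Use OneHotEncoder in sklearn for SEX, EDUCATION, and MARRIAGE. Use LabelEncoder in sklearn for PAY_X (if anything needs to be done at all). Note that LabelEncoder is meant for target labels; OrdinalEncoder is the equivalent for feature columns.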
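
A minimal sketch of the OneHotEncoder route on a toy frame (the rows are made up). It assumes a scikit-learn version that supports the `drop` parameter (0.21 or later), which is newer than the 0.20.2 used in this notebook:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical columns; rows are made up
toy = pd.DataFrame({
    "SEX": ["Male", "Female", "Female", "Male"],
    "EDUCATION": ["University", "Grad School", "High School", "Other"],
    "MARRIAGE": ["Married", "Single", "Other", "Single"],
})

# drop="first" mirrors pd.get_dummies(..., drop_first=True)
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_)
print(encoded.shape)
```

Unlike `pd.get_dummies`, the fitted encoder remembers the categories, so the same transformation can be reapplied to a test set.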

*Centering and Scaling*

In [None]:
from sklearn.preprocessing import scale

In [None]:
y = data['default'].values
X = data.drop('default', axis=1).values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

In [None]:
data.shape

Hmm, at this point, should I scale the PAY_X columns as well? Might be best to just binarize them. Nope. Gives lower accuracy.

In [None]:
scaled_features = data.copy()

In [None]:
col_names = ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 
             'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 
            'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

In [None]:
features = scaled_features[col_names]

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler = scaler.fit(features.values)

In [None]:
features = scaler.transform(features.values)

In [None]:
scaled_features[col_names] = features

In [None]:
scaled_features.head()

In [None]:
y = scaled_features['default'].values
X = scaled_features.drop('default', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
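
One thing to be aware of: the scaler above is fit on all rows before the split, so the test rows influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. A minimal sketch on random toy data (not the credit data; the names `X_toy`, `X_tr`, and so on are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random toy features and labels; placeholders for the real X and y
rng = np.random.RandomState(21)
X_toy = rng.uniform(0, 100000, size=(200, 3))
y_toy = rng.randint(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=21, stratify=y_toy
)

# Learn min and max from the training rows only, then apply to both splits
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.min(), X_tr_s.max())
print(X_te_s.min(), X_te_s.max())
```

The scaled test values may fall slightly outside [0, 1], which is expected when the test set contains values beyond the training range.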

In [None]:
knn = KNeighborsClassifier()

In [None]:
knn_scaled = knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

In [None]:
knn_scaled.score(X_test, y_test)

In [None]:
parameters = {'n_neighbors': np.arange(1,50)}

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

In [None]:
from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(knn, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

In [None]:
cv.best_params_

In [None]:
cv.score(X_test, y_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

The focus for now is on getting the preprocessing right by hand, but the cells below call `pipeline`, so it is defined here:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Chain standardization and k-NN into a single estimator
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
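
A Pipeline can also be tuned as a whole, so that the scaler is refit inside each cross-validation fold. The following is a minimal sketch on synthetic data from `make_classification` (not the credit data); `pipe`, `grid`, and `search` are placeholder names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=21, stratify=y_toy
)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step parameters are addressed as <step name>__<parameter name>
grid = {"knn__n_neighbors": np.arange(1, 10)}
search = GridSearchCV(pipe, param_grid=grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```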

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

In [None]:
knn_scaled = pipeline.fit(X_train, y_train)

In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

In [None]:
knn_scaled.score(X_test, y_test)

In [None]:
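# Note: X_train still comes from scaled_features (MinMax-scaled), so this is not a truly unscaled baseline
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)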

In [None]:
knn_unscaled.score(X_test, y_test)