# Challenge

This notebook addresses the credit scoring challenge: improve on the state of the art by predicting the probability that somebody will experience financial distress in the next two years.
I approach the problem with linear models (Generalized Linear Models) and ensemble gradient boosting.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import ensemble, linear_model
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

First, we load the datasets into pandas DataFrames.

In [2]:
data_test = pd.read_csv('./DataSets/cs-test.csv')
# The first column only holds an id, so we exclude it from the features.
# Dropping it is safe because the rows are already sorted.
data_test = data_test.drop(data_test.columns[[0]], axis=1)

Some values are missing, so I have to handle them. I use the `fillna` method to replace every missing value with zero.

Reference:

https://youtu.be/O5v4NrSCw_A

In [3]:
data_test.fillna(value = 0, inplace = True)
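
To make the effect of `fillna` concrete, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the dataset. It also shows the column median as a common alternative fill value, which is an assumption of this sketch and not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names; the values are illustrative only.
toy = pd.DataFrame({
    "income": [3000.0, np.nan, 5000.0],
    "dependents": [0.0, 2.0, np.nan],
})

# What the notebook does: replace every missing value with zero.
filled_zero = toy.fillna(value=0)

# Alternative (not used in this notebook): fill with each column's median.
filled_median = toy.fillna(toy.median())

print(filled_zero)
print(filled_median)
```

Filling with zero is simple, but it treats "unknown" as a real value of zero, so it is worth comparing against other fill strategies.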

In [4]:
data_expected_answer = pd.read_csv('./DataSets/sampleEntry.csv')
# The first column only holds an id, so we exclude it from the targets.
# Dropping it is safe because the rows are already sorted.
data_expected_answer = data_expected_answer.drop(data_expected_answer.columns[[0]], axis=1)

# My own assumptions
The given dataset does not include the expected answers for the training data, so we use the test data for training instead, splitting it into two parts: a training part and a testing part.

In [5]:
# the training part is every row except the last 100
X_train_prt = data_test[:-100]
Y_train_prt = data_expected_answer[:-100]

In [6]:
# the testing part is the last 100 rows
X_test_prt = data_test[-100:]
Y_test_prt = data_expected_answer[-100:]

In [7]:
print(X_train_prt.shape,
Y_train_prt.shape,
X_test_prt.shape,
Y_test_prt.shape)

(101403, 11) (101403, 1) (100, 11) (100, 1)
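
The slices above always hold out the last 100 rows. As a hedged alternative, a random hold-out can be drawn with `train_test_split` from `sklearn.model_selection`. The sketch below uses synthetic arrays, not the notebook's data, and the `random_state` value is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the target column.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

# An integer test_size asks for exactly that many held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, random_state=0)

print(X_train.shape, X_test.shape)
```

A shuffled split avoids depending on whatever order the rows happen to be stored in.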


# The First Approach
I use a Generalized Linear Model, ordinary least squares, as the first approach. I chose it because I suspect the features have some linear relationship with the target, so I decided to test that assumption.

In [8]:
reg = linear_model.LinearRegression()

In [9]:
reg.fit(X_train_prt, Y_train_prt)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The coefficients are listed below. Because the model is a linear regression, there is one coefficient for each column.

In [10]:
reg.coef_

array([[  0.00000000e+00,   1.22611370e-05,  -1.49867544e-03,
          4.75828908e-02,   1.05031111e-06,  -1.48703508e-08,
         -7.88632213e-04,   4.72314331e-02,  -4.18733966e-04,
         -8.92942362e-02,   2.51099618e-03]])

In [11]:
predict_test = reg.predict(X_test_prt)

The MSE below is low, which suggests that the model, with the coefficients listed above, fits the data reasonably well.

In [12]:
MSE1 = mean_squared_error(Y_test_prt, predict_test)
print(MSE1)

0.00945691072358
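
For reference, the mean squared error is the average of the squared differences between the actual and the predicted values. A minimal sketch with made-up numbers (not from the dataset) checks a manual computation against `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only.
y_true = np.array([0.10, 0.20, 0.40])
y_pred = np.array([0.00, 0.25, 0.30])

# Average of the squared errors, computed by hand.
manual_mse = np.mean((y_true - y_pred) ** 2)

# The same quantity from scikit-learn.
sklearn_mse = mean_squared_error(y_true, y_pred)

print(manual_mse, sklearn_mse)
```

Because MSE is an average, a low value can still hide individual predictions that are far off.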


# Surprise
Surprisingly, I found that some predictions are badly distorted, so I think this model is not good enough. In the example below, the predicted probability is about -0.029, but the actual value is about 0.122. A negative probability is not meaningful, so we have to try another way.

In [13]:
print(Y_test_prt.iloc[-2], predict_test[-2])

Probability    0.121994
Name: 101501, dtype: float64 [-0.02948929]
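
A negative value is possible because ordinary least squares places no bound on its output. A minimal sketch on synthetic data (made up here, not the competition data) shows the effect, along with `np.clip` as one simple post-processing step. Clipping is an assumption of this sketch, not something the notebook applies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic one-feature data whose targets all lie in [0, 1].
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.3, 0.5, 0.7])

model = LinearRegression().fit(X, y)

# Predicting outside the training range gives values outside [0, 1].
raw = model.predict(np.array([[-2.0], [6.0]]))

# Force the predictions back into the valid probability range.
clipped = np.clip(raw, 0.0, 1.0)

print(raw, clipped)
```

Clipping fixes the range but not the underlying fit, which is why the next section tries a different model.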


# Second Approach
In this section I use an ensemble method, gradient tree boosting. Ensemble methods combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability over a single estimator.

In [14]:
# Note: newer scikit-learn releases name this loss 'squared_error' instead of 'ls'.
clf = ensemble.GradientBoostingRegressor(n_estimators=500,
                                         max_depth=4,
                                         learning_rate=0.1,
                                         loss='ls')

In [15]:
Y = Y_train_prt.values.ravel()

In [16]:
clf.fit(X_train_prt, Y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=4, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=500, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

In [17]:
a = clf.predict(X_test_prt)

In [18]:
MSE2 = mean_squared_error(Y_test_prt, a)
print(MSE2)

0.000151599804143


The gradient boosting regressor with n_estimators = 500 and learning_rate = 0.1 is quite accurate on this split (an MSE of about 0.00015, compared with about 0.0095 for the linear model). I set max_depth = 4 to control the size of each tree.
With loss = 'ls' the ensemble minimizes the squared error: each new tree is fitted to the residuals left by the trees before it, so the model improves iteratively.
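
To illustrate how least-squares boosting improves step by step, here is a hand-rolled sketch on synthetic data. It is a simplified illustration of the idea, not scikit-learn's actual implementation. The data, the number of rounds, and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the credit data).
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from the mean target
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    tree.fit(X, residuals)
    # Move the prediction a small step toward the residuals.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks as trees are added. The learning rate controls how much of each tree's correction is applied.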

# Summary
So we can use linear regression to get a coefficient for every column (representing the relationship between that column and the prediction), but this model does not handle the error well. Gradient tree boosting handles it better by iteratively minimizing the least squares loss.
In the case above, I used n_estimators = 500, max_depth = 4, learning_rate = 0.1, and loss = 'ls'. Feel free to explore other settings and to read more about ensemble methods, especially gradient tree boosting.

References:

http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting

http://www.saedsayad.com/docs/gbm2.pdf