# Exercise 03

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name|Description|Type|
|----|----|----|
|SeriousDlqin2yrs|Person experienced 90 days past due delinquency or worse|Y/N|
|RevolvingUtilizationOfUnsecuredLines|Total balance on credit divided by the sum of credit limits|percentage|
|age|Age of borrower in years|integer|
|NumberOfTime30-59DaysPastDueNotWorse|Number of times borrower has been 30-59 days past due|integer|
|DebtRatio|Monthly debt payments divided by monthly gross income|percentage|
|MonthlyIncome|Monthly income|real|
|NumberOfOpenCreditLinesAndLoans|Number of open loans and lines of credit|integer|
|NumberOfTimes90DaysLate|Number of times borrower has been 90 days or more past due|integer|
|NumberRealEstateLoansOrLines|Number of mortgage and real estate loans|integer|
|NumberOfTime60-89DaysPastDueNotWorse|Number of times borrower has been 60-89 days past due|integer|
|NumberOfDependents|Number of dependents in family|integer|


Read the data into Pandas

In [131]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f)

data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [132]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.1

Impute the missing values of Age and Number of Dependents.

In [133]:
# check for missing values
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [134]:
# drop rows with any missing values
data.dropna().shape

(108648, 12)

In [135]:
# drop rows where Age is missing
data[data.age.notnull()].shape

(108648, 12)

In [136]:
# mean Age
data.age.mean()

51.36130439584714

In [137]:
# median Age
data.age.median()

51.0

In [138]:
data.loc[data.age.isnull()]

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
55,55,0,0.004264,,0.0,0.111444,2000.0,6.0,1.0,0.0,0.0,
60,60,0,0.234218,,0.0,0.116614,8600.0,19.0,0.0,0.0,0.0,
77,77,0,0.363200,,0.0,0.480524,2900.0,4.0,0.0,1.0,0.0,
117,117,0,0.000000,,2.0,0.370876,3000.0,14.0,0.0,1.0,0.0,
126,126,0,0.000000,,1.0,0.726567,3477.0,5.0,0.0,1.0,0.0,
138,138,0,0.000000,,0.0,0.907539,2400.0,6.0,0.0,1.0,0.0,
155,155,0,0.078739,,0.0,0.166215,4800.0,6.0,0.0,1.0,0.0,
162,162,0,1.000000,,1.0,0.358101,1937.0,4.0,1.0,0.0,0.0,
163,163,0,0.013501,,0.0,0.183464,19000.0,8.0,0.0,1.0,0.0,
193,193,0,0.012151,,0.0,0.008887,4500.0,12.0,0.0,0.0,0.0,


In [139]:
# fill missing values for Age with the median age
# (assign back to the column; inplace=True on an attribute may not update the frame)
data['age'] = data['age'].fillna(data['age'].median())

In [111]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0
5,5,0,0.213179,74.0,0.0,0.375607,3500.0,3.0,0.0,1.0,0.0,1.0
6,6,0,0.754464,39.0,0.0,0.20994,3500.0,8.0,0.0,0.0,0.0,0.0
7,7,0,0.189169,57.0,0.0,0.606291,23684.0,9.0,0.0,4.0,0.0,2.0
8,8,0,0.644226,30.0,0.0,0.309476,2500.0,5.0,0.0,0.0,0.0,0.0
9,9,0,0.018798,51.0,0.0,0.531529,6501.0,7.0,0.0,2.0,0.0,2.0


In [140]:
data.age

0         45.0
1         40.0
2         38.0
3         30.0
4         49.0
5         74.0
6         39.0
7         57.0
8         30.0
9         51.0
10        46.0
11        40.0
12        64.0
13        53.0
14        43.0
15        25.0
16        43.0
17        38.0
18        39.0
19        32.0
20        58.0
21        58.0
22        69.0
23        24.0
24        58.0
25        28.0
26        24.0
27        57.0
28        42.0
29        64.0
          ... 
112885    31.0
112886    48.0
112887    63.0
112888    57.0
112889    55.0
112890    43.0
112891    58.0
112892    83.0
112893    51.0
112894    44.0
112895    61.0
112896    52.0
112897    55.0
112898    64.0
112899    43.0
112900    37.0
112901    82.0
112902    26.0
112903    49.0
112904    28.0
112905    31.0
112906    62.0
112907    46.0
112908    59.0
112909    22.0
112910    50.0
112911    74.0
112912    44.0
112913    30.0
112914    64.0
Name: age, Length: 112915, dtype: float64

In [141]:
# drop rows where NumberOfDependents is missing
data[data.NumberOfDependents.notnull()].shape

(108648, 12)

In [142]:
# mean NumberOfDependents
data.NumberOfDependents.mean()

0.8565735218319711

In [143]:
# median NumberOfDependents
data.NumberOfDependents.median()

0.0

In [144]:
data.loc[data.NumberOfDependents.isnull()]

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
55,55,0,0.004264,51.0,0.0,0.111444,2000.0,6.0,1.0,0.0,0.0,
60,60,0,0.234218,51.0,0.0,0.116614,8600.0,19.0,0.0,0.0,0.0,
77,77,0,0.363200,51.0,0.0,0.480524,2900.0,4.0,0.0,1.0,0.0,
117,117,0,0.000000,51.0,2.0,0.370876,3000.0,14.0,0.0,1.0,0.0,
126,126,0,0.000000,51.0,1.0,0.726567,3477.0,5.0,0.0,1.0,0.0,
138,138,0,0.000000,51.0,0.0,0.907539,2400.0,6.0,0.0,1.0,0.0,
155,155,0,0.078739,51.0,0.0,0.166215,4800.0,6.0,0.0,1.0,0.0,
162,162,0,1.000000,51.0,1.0,0.358101,1937.0,4.0,1.0,0.0,0.0,
163,163,0,0.013501,51.0,0.0,0.183464,19000.0,8.0,0.0,1.0,0.0,
193,193,0,0.012151,51.0,0.0,0.008887,4500.0,12.0,0.0,0.0,0.0,


In [145]:
# fill missing values for NumberOfDependents with the mean number of dependents
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].mean())

In [118]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0
5,5,0,0.213179,74.0,0.0,0.375607,3500.0,3.0,0.0,1.0,0.0,1.0
6,6,0,0.754464,39.0,0.0,0.20994,3500.0,8.0,0.0,0.0,0.0,0.0
7,7,0,0.189169,57.0,0.0,0.606291,23684.0,9.0,0.0,4.0,0.0,2.0
8,8,0,0.644226,30.0,0.0,0.309476,2500.0,5.0,0.0,0.0,0.0,0.0
9,9,0,0.018798,51.0,0.0,0.531529,6501.0,7.0,0.0,2.0,0.0,2.0
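
The two imputation steps above can be checked on a small frame before applying them to the full dataset. The sketch below is only an illustration: the `toy` DataFrame and its values are made up, and only the column names come from the credit data. It fills `age` with the median and `NumberOfDependents` with the mean, as in the cells above, and confirms that no missing values remain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the credit data (values are made up).
toy = pd.DataFrame({
    "age": [45.0, 40.0, np.nan, 30.0, 49.0],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0, 0.0],
})

# Median imputation for age, mean imputation for the number of dependents.
# The result is assigned back to the column rather than modified in place.
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["NumberOfDependents"] = toy["NumberOfDependents"].fillna(
    toy["NumberOfDependents"].mean()
)

print(toy)
print("remaining missing values:", toy.isnull().sum().sum())
```

Note that mean imputation produces a fractional number of dependents; the median or the mode would keep the column integer-valued, which may or may not matter for the model.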


# Exercise 3.2

From the set of features, select those that maximize the model's **F1 score** using K-fold cross-validation.
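
One way to approach this, sketched under assumptions, is to score each candidate feature subset by its mean F1 across folds and keep the best one. The example uses synthetic data from `make_classification` as a stand-in for the credit data, restricts the search to subsets of exactly three columns to keep it short, and the names `X_demo`, `y_demo` and `scores` are illustrative only.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, imbalanced stand-in for the credit data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# Score every subset of exactly three columns by its mean F1 across the folds.
scores = {}
for subset in combinations(range(X_demo.shape[1]), 3):
    f1 = cross_val_score(model, X_demo[:, list(subset)], y_demo, cv=kf, scoring="f1")
    scores[subset] = f1.mean()

best_subset = max(scores, key=scores.get)
print(best_subset, round(scores[best_subset], 3))
```

With the ten credit features an exhaustive search over all subsets is still feasible (1023 non-empty subsets), but each one costs a full cross-validation run, so a greedy search may be preferable.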

In [147]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# define X and y
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

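X.head()

Unnamed: 0.1,Unnamed: 0,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0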

In [155]:
list(X)

['Unnamed: 0',
 'RevolvingUtilizationOfUnsecuredLines',
 'age',
 'NumberOfTime30-59DaysPastDueNotWorse',
 'DebtRatio',
 'MonthlyIncome',
 'NumberOfOpenCreditLinesAndLoans',
 'NumberOfTimes90DaysLate',
 'NumberRealEstateLoansOrLines',
 'NumberOfTime60-89DaysPastDueNotWorse',
 'NumberOfDependents']

In [None]:
from sklearn.feature_selection import RFE
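
The cell above imports `RFE` (recursive feature elimination) but does not show how it was meant to be used. As one possible usage, and not necessarily what was intended here, the sketch below wraps a logistic regression in `RFE` on synthetic data; `X_demo` and `y_demo` are stand-ins, and keeping three features is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the credit data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# RFE repeatedly fits the estimator and drops the feature with the
# smallest absolute coefficient until the requested number remains.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 for kept features, larger values were dropped earlier
```

RFE ranks features by coefficient magnitude, not by F1, so the subset it returns would still need to be scored with cross-validated F1 to answer the exercise.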

In [156]:

# define X from the named feature columns (this leaves out 'Unnamed: 0')
feature_cols = [
    'RevolvingUtilizationOfUnsecuredLines',
    'age',
    'NumberOfTime30-59DaysPastDueNotWorse',
    'DebtRatio',
    'MonthlyIncome',
    'NumberOfOpenCreditLinesAndLoans',
    'NumberOfTimes90DaysLate',
    'NumberRealEstateLoansOrLines',
    'NumberOfTime60-89DaysPastDueNotWorse',
    'NumberOfDependents',
]

X = data[feature_cols]


In [None]:



# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
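
Accuracy alone can be misleading when one class is rare, which is one reason the exercise asks for F1. The sketch below uses made-up labels (93 negatives and 7 positives, an assumed proportion rather than the real class balance) to show that a model predicting "no default" for everyone scores high accuracy and zero F1.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 93 non-defaults and 7 defaults (made-up proportions).
y_true = np.array([0] * 93 + [1] * 7)

# A "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

majority_accuracy = metrics.accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positives are predicted.
majority_f1 = metrics.f1_score(y_true, y_majority, zero_division=0)

print(majority_accuracy)
print(majority_f1)
```

Comparing the model's accuracy with the share of the majority class in `y_test` (for example via `y_test.value_counts(normalize=True)`) gives a quick sanity check.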




In [120]:
# simulate splitting a dataset of 25 observations into 5 folds
import numpy as np
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False)

# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
# use train/test as loop names so the `data` DataFrame is not overwritten
for iteration, (train, test) in enumerate(kf.split(np.arange(25)), start=1):
    print('{:^9} {} {:^25}'.format(str(iteration), str(train), str(test)))

Iteration                   Training set observations                   Testing set observations
    1     [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [0 1 2 3 4]       
    2     [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [5 6 7 8 9]       
    3     [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]     [10 11 12 13 14]     
    4     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]     [15 16 17 18 19]     
    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]     [20 21 22 23 24]     


In [121]:
# Create k-folds (no shuffling, so random_state is not needed)
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make predictions for testing set
    y_pred_class = logreg.predict(X_test)

    # calculate testing accuracy
    results.append(metrics.accuracy_score(y_test, y_pred_class))

In [122]:
pd.Series(results).describe()

count    10.000000
mean      0.932728
std       0.002739
min       0.929236
25%       0.931208
50%       0.931940
75%       0.935218
max       0.936498
dtype: float64

In [123]:
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression(C=1e9)

results = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')

In [124]:
pd.Series(results).describe()

count    10.000000
mean      0.932755
std       0.000281
min       0.932341
25%       0.932629
50%       0.932696
75%       0.932932
max       0.933310
dtype: float64
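
The spread across folds is much smaller here than in the manual loop. A likely reason is that `cross_val_score` with an integer `cv` and a classifier uses stratified folds, so every fold keeps roughly the same class ratio, whereas plain `KFold` splits by position. The sketch below compares the two on synthetic data and switches the metric to F1, which is what Exercise 3.2 asks for; `X_demo` and `y_demo` are stand-ins for the credit data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the credit data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
model = LogisticRegression(max_iter=1000)

# Plain K-fold: consecutive blocks of rows, class ratio may vary per fold.
plain = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=10), scoring="f1")

# Stratified K-fold: each fold preserves the overall class ratio.
stratified = cross_val_score(
    model, X_demo, y_demo, cv=StratifiedKFold(n_splits=10), scoring="f1"
)

# Passing cv=10 with a classifier selects stratified folds automatically.
default_cv = cross_val_score(model, X_demo, y_demo, cv=10, scoring="f1")

print("plain      mean/std:", plain.mean(), plain.std())
print("stratified mean/std:", stratified.mean(), stratified.std())
```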

# Exercise 3.3

Now, which set of features is best when selected by AUC?

In [None]:
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f)
data.head(5)

In [None]:
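# define X and y
feature_cols = ['RevolvingUtilizationOfUnsecuredLines', 'age', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
# the data was reloaded above, so repeat the imputation from Exercise 3.1
data['age'] = data['age'].fillna(data['age'].median())
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].mean())
X = data[feature_cols]
y = data['SeriousDlqin2yrs']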

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
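
The cell above reports accuracy, while the exercise asks for a selection based on AUC. As a rough sketch of how that could look, the example below runs a greedy forward selection scored by cross-validated ROC AUC. It uses synthetic data in place of the credit data, and `mean_auc`, `selected` and the stopping rule are illustrative choices, not a prescribed method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the credit data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000)


def mean_auc(columns):
    """Mean cross-validated ROC AUC using only the given column indices."""
    scores = cross_val_score(model, X_demo[:, columns], y_demo, cv=5, scoring="roc_auc")
    return scores.mean()


selected, best_auc = [], 0.0
remaining = list(range(X_demo.shape[1]))

# Greedy forward selection: add the column that raises AUC the most,
# and stop as soon as no remaining column improves it.
while remaining:
    candidate_scores = {c: mean_auc(selected + [c]) for c in remaining}
    best_col = max(candidate_scores, key=candidate_scores.get)
    if candidate_scores[best_col] <= best_auc:
        break
    selected.append(best_col)
    remaining.remove(best_col)
    best_auc = candidate_scores[best_col]

print(selected, round(best_auc, 3))
```

The `roc_auc` scorer works from predicted probabilities, not class labels, so no decision threshold has to be chosen. On the real data the column indices would be replaced by the names in `feature_cols`.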