# Exercise 03

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether to grant a loan. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of Open loans |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas

In [300]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f)  # comma-separated file, read directly from the zip archive

data.head()

| |Unnamed: 0|SeriousDlqin2yrs|RevolvingUtilizationOfUnsecuredLines|age|NumberOfTime30-59DaysPastDueNotWorse|DebtRatio|MonthlyIncome|NumberOfOpenCreditLinesAndLoans|NumberOfTimes90DaysLate|NumberRealEstateLoansOrLines|NumberOfTime60-89DaysPastDueNotWorse|NumberOfDependents|
|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0|0|1|0.766127|45.0|2.0|0.802982|9120.0|13.0|0.0|6.0|0.0|2.0|
|1|1|0|0.957151|40.0|0.0|0.121876|2600.0|4.0|0.0|0.0|0.0|1.0|
|2|2|0|0.65818|38.0|1.0|0.085113|3042.0|2.0|1.0|0.0|0.0|0.0|
|3|3|0|0.23381|30.0|0.0|0.03605|3300.0|5.0|0.0|0.0|0.0|0.0|
|4|4|0|0.907239|49.0|1.0|0.024926|63588.0|7.0|0.0|1.0|0.0|0.0|


In [301]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.1

Impute the missing values of age and NumberOfDependents.

In [302]:
# Identify the missing values
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [303]:
# The variables with missing data are age and NumberOfDependents.
# Their missing values will be replaced by imputation,
# using measures of central tendency.

Measures of central tendency for age

In [304]:
# mode Age
data.age.mode()

0    49.0
dtype: float64

In [305]:
# mean Age
data.age.mean()

51.36130439584714

In [306]:
# median Age
data.age.median()

51.0

Imputing the age variable

In [307]:
# fill missing values for age with the mean age
# (assigning back avoids relying on in-place modification through attribute access)
data['age'] = data['age'].fillna(data['age'].mean())

In [308]:
# Check for missing values again after imputation:
# age no longer has any
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                        0
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

Imputing the "NumberOfDependents" variable

In [309]:
# mode NumberOfDependents
data.NumberOfDependents.mode()

0    0.0
dtype: float64

In [310]:
# mean NumberOfDependents
data.NumberOfDependents.mean()

0.8565735218319711

In [311]:
# median NumberOfDependents
data.NumberOfDependents.median()

0.0

In [312]:
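# fill missing values for NumberOfDependents with the median NumberOfDependents
# (assigning back avoids relying on in-place modification through attribute access)
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].median())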

In [313]:
# Check for missing values again after imputation:
# NumberOfDependents no longer has any
data.isnull().sum()

Unnamed: 0                              0
SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64
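The imputation steps above can be collected into one small helper. This is only a minimal sketch: the function name `impute_columns` and the toy DataFrame are hypothetical and not part of the original notebook, and the helper returns a filled copy instead of modifying the data in place.

```python
import numpy as np
import pandas as pd


def impute_columns(df, strategies):
    """Return a copy of df with NaNs filled column by column.

    strategies maps a column name to 'mean', 'median', or 'mode'.
    """
    out = df.copy()
    for col, how in strategies.items():
        if how == 'mean':
            fill = out[col].mean()
        elif how == 'median':
            fill = out[col].median()
        elif how == 'mode':
            fill = out[col].mode().iloc[0]  # first mode if there are ties
        else:
            raise ValueError('unknown strategy: {}'.format(how))
        out[col] = out[col].fillna(fill)
    return out


# toy data (not the Kaggle file), only to show the behaviour
toy = pd.DataFrame({'age': [30.0, np.nan, 50.0],
                    'NumberOfDependents': [0.0, 2.0, np.nan]})
filled = impute_columns(toy, {'age': 'mean', 'NumberOfDependents': 'median'})
print(filled)
```

On the real data the call would presumably be `impute_columns(data, {'age': 'mean', 'NumberOfDependents': 'median'})`, matching the choices made in the cells above.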

# Exercise 3.2

From the set of features, select the subset that maximizes the model's **F1 score**, using K-fold cross-validation.

Model with all variables

In [314]:
# train/test split
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed from scikit-learn

y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

In [315]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_class)

array([[26302,    13],
       [ 1888,    26]], dtype=int64)

In [316]:
from sklearn.metrics import precision_score, recall_score, f1_score
print('precision_score ', precision_score(y_test, y_pred_class))

precision_score  0.6666666666666666
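Precision alone hides how few defaulters the model catches. As a worked check, the sketch below recomputes precision, recall, and F1 directly from the confusion matrix printed above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes). It also computes the accuracy of always predicting class 0 on this test set as a baseline.

```python
import numpy as np

# confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[26302, 13],
               [1888, 26]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted defaulters that really defaulted
recall = tp / (tp + fn)      # share of real defaulters that were found
f1 = 2 * precision * recall / (precision + recall)
baseline_accuracy = (tn + fp) / cm.sum()  # always predicting class 0

print('precision', precision)
print('recall   ', recall)
print('f1       ', f1)
print('baseline ', baseline_accuracy)
```

With these counts recall is about 0.014 and F1 about 0.027, while the all-zeros baseline already reaches about 0.932 accuracy. This suggests accuracy is a weak criterion for comparing feature sets on data this imbalanced.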


First model

In [317]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# define X and y
feature_cols1 = ['age','MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 
                'NumberOfDependents','DebtRatio','NumberOfTime60-89DaysPastDueNotWorse']
X1 = data[feature_cols1]
y1 = data['SeriousDlqin2yrs']

In [318]:
# train/test split
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, random_state=1)

# train a logistic regression model
logreg1 = LogisticRegression(C=1e9)
logreg1.fit(X1_train, y1_train)

# make predictions for testing set
y1_pred_class = logreg1.predict(X1_test)

# calculate testing accuracy
print(metrics.accuracy_score(y1_test, y1_pred_class))

0.932657904991321


Second model

In [335]:
# define X and y
feature_cols2 = ['age','MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 
                'NumberOfDependents','DebtRatio']
X2 = data[feature_cols2]
y2 = data['SeriousDlqin2yrs']




In [320]:
# train/test split
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state=2)

# train a logistic regression model
logreg2 = LogisticRegression(C=1e9)
logreg2.fit(X2_train, y2_train)

# make predictions for testing set
y2_pred_class = logreg2.predict(X2_test)

# calculate testing accuracy
print(metrics.accuracy_score(y2_test, y2_pred_class))

0.932657904991321


Third model

In [321]:
# define X and y
feature_cols3 = ['age','MonthlyIncome', 'NumberRealEstateLoansOrLines', 
                'NumberOfDependents','DebtRatio']
X3 = data[feature_cols3]
y3 = data['SeriousDlqin2yrs']

In [322]:
# train/test split
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, random_state=3)

# train a logistic regression model
logreg3 = LogisticRegression(C=1e9)
logreg3.fit(X3_train, y3_train)

# make predictions for testing set
y3_pred_class = logreg3.predict(X3_test)

# calculate testing accuracy
print(metrics.accuracy_score(y3_test, y3_pred_class))

0.9343937085975416


Fourth model

In [323]:
# define X and y
feature_cols4 = ['age','MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 
                'NumberOfDependents']
X4 = data[feature_cols4]
y4 = data['SeriousDlqin2yrs']

In [324]:
# train/test split
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, random_state=4)

# train a logistic regression model
logreg4 = LogisticRegression(C=1e9)
logreg4.fit(X4_train, y4_train)

# make predictions for testing set
y4_pred_class = logreg4.predict(X4_test)

# calculate testing accuracy
print(metrics.accuracy_score(y4_test, y4_pred_class))

0.9323390839207907


Accuracy summary

In [325]:
# print() returns None, so compute each accuracy first and print afterwards
a1 = metrics.accuracy_score(y1_test, y1_pred_class)
a2 = metrics.accuracy_score(y2_test, y2_pred_class)
a3 = metrics.accuracy_score(y3_test, y3_pred_class)
a4 = metrics.accuracy_score(y4_test, y4_pred_class)
for acc in (a1, a2, a3, a4):
    print(acc)

0.932657904991321
0.932657904991321
0.9343937085975416
0.9323390839207907


K-fold cross-validation

In [326]:
# simulate splitting a dataset of 100 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False)

# print the contents of each training and testing set
# (the loop variables are named so that they do not overwrite the `data` DataFrame)
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, (train_idx, test_idx) in enumerate(kf.split(range(100)), start=1):
    print('{:^9} {} {:^25}'.format(str(iteration), str(train_idx), str(test_idx)))

Iteration                   Training set observations                   Testing set observations
    1     [20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
 92 93 94 95 96 97 98 99] [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
    2     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 40 41 42 43
 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
 92 93 94 95 96 97 98 99] [20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
    3     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67
 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
 92 93 94 95 96 97 98 99] [40 41 42 43 4

In [327]:
# Create k-folds (no shuffling, so random_state is not needed)
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make predictions for testing set
    y_pred_class = logreg.predict(X_test)

    # calculate testing accuracy
    results.append(metrics.accuracy_score(y_test, y_pred_class))

In [328]:
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression(C=1e9)

results = cross_val_score(logreg, X, y, cv=5, scoring='accuracy')

In [329]:
pd.Series(results).describe()

count    5.000000
mean     0.932746
std      0.000219
min      0.932560
25%      0.932604
50%      0.932607
75%      0.932914
max      0.933044
dtype: float64

# Exercise 3.3

Now, which is the best set of features when selected by AUC?

In [330]:
print(metrics.roc_auc_score(y1_test, y1_pred_class))

0.5065450511059741


In [331]:
print(metrics.roc_auc_score(y2_test, y2_pred_class))

0.5


In [332]:
print(metrics.roc_auc_score(y3_test, y3_pred_class))

0.5


In [333]:
print(metrics.roc_auc_score(y4_test, y4_pred_class))

0.5
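The scores above pass hard class predictions to `roc_auc_score`, which collapses the ROC curve to a single threshold. A model that predicts the same class for every test row then scores exactly 0.5, which is consistent with the results for models 2 to 4. AUC is normally computed from a continuous score such as the predicted probability of the positive class. The sketch below contrasts the two calls on synthetic, imbalanced stand-in data; the data and variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced stand-in data (not the Kaggle file)
X_demo, y_demo = make_classification(n_samples=5000, n_features=6,
                                     n_informative=3, weights=[0.93],
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only one operating point of the ROC curve
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from predicted probabilities of class 1: uses the full ranking
proba = model.predict_proba(X_te)
auc_from_scores = roc_auc_score(y_te, proba[:, 1])

print('AUC from labels       ', auc_from_labels)
print('AUC from probabilities', auc_from_scores)
```

For the models in this exercise the equivalent call would presumably be `metrics.roc_auc_score(y1_test, logreg1.predict_proba(X1_test)[:, 1])`, and likewise for the other three. For a fair comparison, all four models should also be evaluated on the same test split.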
