# Exercise 11

## Credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are what banks use to decide whether or not to grant a loan. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans (installment, such as a car loan or mortgage) and lines of credit (such as credit cards) |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f)

y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

data.head()

|Unnamed: 0.1|Unnamed: 0|SeriousDlqin2yrs|RevolvingUtilizationOfUnsecuredLines|age|NumberOfTime30-59DaysPastDueNotWorse|DebtRatio|MonthlyIncome|NumberOfOpenCreditLinesAndLoans|NumberOfTimes90DaysLate|NumberRealEstateLoansOrLines|NumberOfTime60-89DaysPastDueNotWorse|NumberOfDependents|
|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0|0|1|0.766127|45.0|2.0|0.802982|9120.0|13.0|0.0|6.0|0.0|2.0|
|1|1|0|0.957151|40.0|0.0|0.121876|2600.0|4.0|0.0|0.0|0.0|1.0|
|2|2|0|0.65818|38.0|1.0|0.085113|3042.0|2.0|1.0|0.0|0.0|0.0|
|3|3|0|0.23381|30.0|0.0|0.03605|3300.0|5.0|0.0|0.0|0.0|0.0|
|4|4|0|0.907239|49.0|1.0|0.024926|63588.0|7.0|0.0|1.0|0.0|0.0|


In [2]:
# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
X['age'] = X['age'].fillna(X['age'].mean())

In [3]:
X['NumberOfDependents'] = X['NumberOfDependents'].fillna(0)

In [4]:
y.mean()

0.06744896603639906

# Exercise 11.1

Draw 60 bootstrap (bagged) samples

Train the following set of classifiers, one per bootstrap sample:

* 10 Decision Trees where max_depth=None
* 10 Decision Trees where max_depth=2
* 10 Decision Trees using an under-sampled dataset where max_depth=None
* 10 Decision Trees using an under-sampled dataset where max_depth=2
* 10 GaussianNB
* 10 GaussianNB using an under-sampled dataset

**Note:** each under-sampled set is drawn from its own bagged sample, so every under-sampled set is different.

In [5]:
import numpy as np

# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

n_estimators = 60

# set a seed for reproducibility
np.random.seed(123)

n_samples = X_train.shape[0]

# create bootstrap samples (will be used to select rows from the DataFrame)
samples = [np.random.choice(a=n_samples, size=n_samples, replace=True) for _ in range(n_estimators)]

In [6]:

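def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assumes the minority class is the positive class (y == 1)
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    # Number of negatives needed so that positives make up target_percentage
    n_samples_0_new = n_samples_1 / target_percentage - n_samples_1
    # Share of negatives to keep, capped at 1 so it stays a valid probability
    n_samples_0_new_per = min(n_samples_0_new / n_samples_0, 1.0)

    # Keep each row with probability n_samples_0_new_per (only negatives use this draw)
    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples).astype(bool)

    # Keep the sampled negatives plus every positive row
    filter_ = ((y == 0) & rand_1) | (y == 1)

    return X[filter_], y[filter_]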
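
The arithmetic behind `n_samples_0_new` can be checked in isolation. The snippet below is a minimal sketch; `negatives_to_keep` is a hypothetical helper introduced only for illustration, and the count of 100 positives is made up. It shows how many negatives are needed for a given target share of positives. Because `UnderSampling` keeps each negative through a random binomial draw, the number it actually keeps matches this figure only in expectation.

```python
def negatives_to_keep(n_pos, target_percentage):
    """Negatives needed so that positives are `target_percentage` of the result."""
    return n_pos / target_percentage - n_pos

# Hypothetical bagged sample containing 100 positives
balanced = negatives_to_keep(100, 0.5)   # as many negatives as positives
quarter = negatives_to_keep(100, 0.25)   # three negatives per positive

# Resulting share of positives in each case
share_balanced = 100 / (100 + balanced)
share_quarter = 100 / (100 + quarter)
```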

10 Decision trees max_depth=None

In [7]:
from sklearn.tree import DecisionTreeClassifier
models = []

for i in range(10):
    models.append(DecisionTreeClassifier(max_depth=None))
    models[i].fit(X_train.iloc[samples[i]], y_train.iloc[samples[i]])

10 Decision trees max_depth=2

In [8]:
for i in range(10, 20):
    models.append(DecisionTreeClassifier(max_depth=2))
    models[i].fit(X_train.iloc[samples[i]], y_train.iloc[samples[i]])

In [9]:
len(models)

20

10 Decision trees max_depth=None using an under sampled set

In [10]:
for i in range(20, 30):
    X_u, y_u = UnderSampling(X_train.iloc[samples[i]], y_train.iloc[samples[i]], 
                             target_percentage=0.5, seed=i)
    models.append(DecisionTreeClassifier(max_depth=None))
    models[i].fit(X_u, y_u)

10 Decision trees max_depth=2 using an under sampled set

In [11]:
for i in range(30, 40):
    X_u, y_u = UnderSampling(X_train.iloc[samples[i]], y_train.iloc[samples[i]], 
                             target_percentage=0.5, seed=i)
    models.append(DecisionTreeClassifier(max_depth=2))
    models[i].fit(X_u, y_u)

10 GaussianNB

In [12]:
from sklearn.naive_bayes import GaussianNB

In [13]:
for i in range(40, 50):
    models.append(GaussianNB())
    models[i].fit(X_train.iloc[samples[i]], y_train.iloc[samples[i]])

10 GaussianNB using an under sampled set

In [14]:
for i in range(50, 60):
    X_u, y_u = UnderSampling(X_train.iloc[samples[i]], y_train.iloc[samples[i]], 
                             target_percentage=0.5, seed=i)
    models.append(GaussianNB())
    models[i].fit(X_u, y_u)

In [15]:
len(models)

60

# Exercise 11.2

Ensemble using majority voting

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=2)

In [18]:
from sklearn import metrics
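
One possible way to approach the majority vote is sketched below. It is not the notebook's official solution. It uses a small made-up prediction matrix (`toy_preds`, `toy_true`) so that it runs on its own; in the notebook the matrix would presumably be built with something like `np.array([m.predict(X_test) for m in models])`. Counting a tie as a positive vote is an assumption.

```python
import numpy as np
from sklearn import metrics

def majority_vote(pred_matrix):
    """Combine 0/1 predictions of shape (n_models, n_obs) by majority vote.

    Ties (exactly half of the models voting 1) count as positive; that is an
    assumption of this sketch, not something the exercise specifies.
    """
    pred_matrix = np.asarray(pred_matrix)
    return (pred_matrix.mean(axis=0) >= 0.5).astype(int)

# Toy stand-in for the stacked predictions of three models on five observations
toy_preds = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
])
toy_true = np.array([1, 0, 0, 1, 1])

y_vote = majority_vote(toy_preds)

# The three metrics requested by the exercise
scores = {
    "accuracy": metrics.accuracy_score(toy_true, y_vote),
    "f1": metrics.f1_score(toy_true, y_vote),
    "f2": metrics.fbeta_score(toy_true, y_vote, beta=2),
}
```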

# Exercise 11.3

Estimate the probability as the percentage of models that predict positive

Vary the probability threshold and select the one that maximizes the fbeta_score
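
A minimal sketch of the threshold search follows, again on made-up toy predictions rather than the notebook's `models`. The helper name `best_threshold` and the grid of candidate thresholds are assumptions chosen for illustration. On real data the threshold should be selected on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.metrics import fbeta_score

def best_threshold(y_true, proba, thresholds, beta=2):
    """Return the threshold with the highest F-beta score, and that score."""
    scores = [
        fbeta_score(y_true, (proba >= t).astype(int), beta=beta, zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))  # the first maximum wins when several tie
    return float(thresholds[best]), float(scores[best])

# Toy stand-in: 0/1 predictions of three models on six observations
toy_preds = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
])
toy_true = np.array([1, 1, 0, 0, 1, 0])

# Estimated probability = share of models that predict positive
proba = toy_preds.mean(axis=0)

thresholds = np.linspace(0.1, 0.9, 9)
best_t, best_score = best_threshold(toy_true, proba, thresholds, beta=2)
```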

# Exercise 11.4

Ensemble using Weighted Voting

Calculate the weights using the out-of-bag (OOB) error

In [16]:
samples_oob = []
# collect the "out-of-bag" observations (rows never drawn) for each sample
for sample in samples:
    samples_oob.append(sorted(set(range(n_samples)) - set(sample)))
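
As a quick sanity check on the out-of-bag sets: for a large sample, each bootstrap draw leaves out roughly 1 - 1/e (about 36.8%) of the rows. The standalone sketch below illustrates this on synthetic row indices. The sample size and seed are arbitrary assumptions, and it uses NumPy's `default_rng` rather than the notebook's global seed.

```python
import numpy as np

rng = np.random.default_rng(123)
n_rows = 10_000  # arbitrary size, chosen only for the illustration

# One bootstrap sample of row positions, drawn with replacement
boot = rng.choice(n_rows, size=n_rows, replace=True)

# Out-of-bag rows are those that were never drawn
oob = np.setdiff1d(np.arange(n_rows), boot)
oob_share = len(oob) / n_rows  # should be close to 1 - 1/e
```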

In [19]:
errors = np.zeros(n_estimators)

for i in range(n_estimators):
    y_pred_ = models[i].predict(X_train.iloc[samples_oob[i]])
    errors[i] = 1 - metrics.accuracy_score(y_train.iloc[samples_oob[i]], y_pred_)
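
One common way to turn the OOB errors into voting weights is to make each weight proportional to the model's OOB accuracy, `1 - error`, and normalise the weights to sum to one. The sketch below assumes that scheme; the exercise does not prescribe a formula, and the toy predictions and error values are made up.

```python
import numpy as np

def weighted_vote(pred_matrix, oob_errors, threshold=0.5):
    """Weighted vote over 0/1 predictions of shape (n_models, n_obs).

    Weights are proportional to OOB accuracy (1 - error). This is one
    possible choice, assumed here for illustration.
    """
    pred_matrix = np.asarray(pred_matrix)
    accuracy = 1.0 - np.asarray(oob_errors, dtype=float)
    weights = accuracy / accuracy.sum()
    # Weighted share of the vote that goes to the positive class
    weighted_share = weights @ pred_matrix
    return (weighted_share >= threshold).astype(int), weights

# Toy stand-in: three models, four observations
toy_preds = np.array([
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
toy_errors = np.array([0.05, 0.5, 0.6])  # made-up OOB errors

y_weighted, weights = weighted_vote(toy_preds, toy_errors)
```

In this toy case the first model has a much lower OOB error than the other two, so its weight exceeds one half and the ensemble follows it even where the other two models disagree.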

# Exercise 11.5

Train an AdaBoostClassifier and a GradientBoostingClassifier using the training set and an under-sampled set
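
The sketch below shows one way this comparison could be set up. To stay self-contained it uses a synthetic imbalanced dataset from `make_classification` instead of the credit data, and a simplified under-sampling step (keep all positives plus an equally sized random subset of negatives) instead of the notebook's `UnderSampling` function. All variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 7% positives, similar to the credit data
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=2
)

# Simplified under-sampling: all positives plus as many random negatives
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_tr == 1)
neg_idx = np.flatnonzero(y_tr == 0)
keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
under_idx = np.concatenate([pos_idx, keep_neg])

training_sets = {
    "full": np.arange(len(y_tr)),
    "under-sampled": under_idx,
}

# Fit both boosting models on both training sets and score them with F2
boost_scores = {}
for name, cls in [("AdaBoost", AdaBoostClassifier),
                  ("GradientBoosting", GradientBoostingClassifier)]:
    for label, idx in training_sets.items():
        model = cls(random_state=0).fit(X_tr[idx], y_tr[idx])
        y_hat = model.predict(X_te)
        boost_scores[(name, label)] = fbeta_score(
            y_te, y_hat, beta=2, zero_division=0
        )
```

`boost_scores` then holds one F2 score for each combination of model and training set, which makes the effect of under-sampling on recall-weighted performance easy to compare.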