# Exploring Ensemble Methods
In this assignment, we will explore the use of boosting. We will use the pre-implemented gradient boosted trees in scikit-learn. You will:

* Use Pandas to do some feature engineering.
* Train a boosted ensemble of decision trees (gradient boosted trees) on the LendingClub dataset.
* Predict whether a loan will default along with prediction probabilities (on a validation set).
* Evaluate the trained model and compare it with a baseline.
* Find the most positive and negative loans using the learned model.
* Explore how the number of trees influences classification performance.


In [None]:
# Import some libs

import pandas
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load LendingClub Dataset
This assignment will use the [LendingClub](https://www.lendingclub.com/) dataset used in the previous two assignments.

In [None]:
loans_df = pandas.read_csv('/content/drive/MyDrive/FUNIX Progress/MLP303x_1.1-A_EN/data/lending-club-data.csv', low_memory=False)

# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans_df['safe_loans'] = loans_df['bad_loans'].apply(lambda x : +1 if x==0 else -1)
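# drop() returns a new DataFrame, so assign the result to actually remove the column
loans_df = loans_df.drop(columns=['bad_loans'])
loans_df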

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,is_inc_v,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,not_compliant,status,inactive_loans,emp_length_num,grade_num,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none,safe_loans
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,,10+ years,RENT,24000.0,Verified,20111201T000000,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 12/22/11 > I need to upgra...,credit_card,Computer,860xx,AZ,27.65,0.0,19850101T000000,1.0,,,3.0,0.0,13648,83.7,9.0,f,0.0,0.0,5861.07,5831.78,5000.00,861.07,0.00,0.00,0.000,20150101T000000,171.62,,20150101T000000,0.0,,1,0,Fully Paid,1,11,5,0.4,1.0,1.0,1.0,0,8.143500,20141201T000000,1,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,20111201T000000,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 12/22/11 > I plan to use t...,car,bike,309xx,GA,1.00,0.0,19990401T000000,5.0,,,3.0,0.0,1687,9.4,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.00,117.08,1.110,20130401T000000,119.66,,20130901T000000,0.0,,1,0,Charged Off,1,1,4,0.8,1.0,1.0,1.0,1,2.393200,20161201T000000,1,1,1,-1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,20111201T000000,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,,small_business,real estate business,606xx,IL,8.72,0.0,20011101T000000,2.0,,,2.0,0.0,2956,98.5,10.0,f,0.0,0.0,3003.65,3003.65,2400.00,603.65,0.00,0.00,0.000,20140601T000000,649.91,,20150201T000000,0.0,,1,0,Fully Paid,1,11,4,1.0,1.0,1.0,1.0,0,8.259550,20141201T000000,1,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,20111201T000000,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 12/21/11 > to pay for prop...,other,personel,917xx,CA,20.00,0.0,19960201T000000,1.0,35.0,,10.0,0.0,5598,21.0,37.0,f,0.0,0.0,12226.30,12226.30,10000.00,2209.33,16.97,0.00,0.000,20150101T000000,357.48,,20150101T000000,0.0,,1,0,Fully Paid,1,11,4,0.2,1.0,1.0,1.0,0,8.275850,20141201T000000,0,1,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.90,156.46,A,A4,Veolia Transportaton,3 years,RENT,36000.0,Source Verified,20111201T000000,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,,wedding,My wedding loan I promise to pay back,852xx,AZ,11.20,0.0,20041101T000000,3.0,,,9.0,0.0,7963,28.3,12.0,f,0.0,0.0,5631.38,5631.38,5000.00,631.38,0.00,0.00,0.000,20150101T000000,161.03,,20150201T000000,0.0,,1,0,Fully Paid,1,4,6,0.8,1.0,1.0,1.0,0,5.215330,20141201T000000,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
122602,9856168,11708132,6000,6000,6000,60 months,23.40,170.53,E,E5,,,MORTGAGE,45600.0,Source Verified,20140101T000000,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 12/13/13 > Having major me...,medical,Medical,317xx,GA,1.50,1.0,19840101T000000,0.0,15.0,,3.0,0.0,1199,14.6,13.0,f,0.0,0.0,511.49,511.49,163.71,347.78,0.00,0.00,0.000,20140401T000000,170.53,,20140901T000000,0.0,15.0,1,0,Charged Off,1,0,2,1.0,0.0,1.0,1.0,1,4.487630,20190101T000000,0,1,0,-1
122603,9795013,11647121,15250,15250,15250,36 months,17.57,548.05,D,D2,Installer,10+ years,MORTGAGE,65000.0,Verified,20140101T000000,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 12/11/13 > Pay off high in...,debt_consolidation,Fresh,190xx,PA,11.26,1.0,19980701T000000,2.0,23.0,54.0,8.0,2.0,6122,15.2,26.0,f,0.0,0.0,15904.90,15905.00,15250.00,654.95,0.00,0.00,0.000,20140401T000000,14810.30,,20140401T000000,0.0,56.0,1,0,Fully Paid,1,11,3,0.4,0.0,0.0,1.0,0,10.117800,20170101T000000,0,0,0,1
122604,9695736,11547808,8525,8525,8525,60 months,18.25,217.65,D,D3,MANAGER,5 years,MORTGAGE,37536.0,Verified,20140101T000000,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,,medical,Medical expenses,011xx,MA,12.28,4.0,19941101T000000,0.0,3.0,,12.0,0.0,5318,10.7,26.0,f,0.0,0.0,2029.93,2029.93,360.08,510.45,0.00,1159.40,11.594,20140501T000000,217.65,,20141001T000000,0.0,4.0,1,0,Charged Off,1,6,3,0.6,0.0,1.0,1.0,0,6.958120,20190101T000000,0,1,0,-1
122605,9684700,11536848,22000,22000,22000,60 months,19.97,582.50,D,D5,Chief of Interpretation (Park Ranger),10+ years,MORTGAGE,78000.0,Verified,20140101T000000,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,377xx,TN,18.45,0.0,19970601T000000,5.0,,116.0,18.0,1.0,18238,46.3,30.0,f,0.0,0.0,4677.92,4677.92,1837.04,2840.88,0.00,0.00,0.000,20141201T000000,17.50,,20150201T000000,0.0,,1,0,Charged Off,1,11,3,1.0,1.0,0.0,1.0,0,8.961540,20190101T000000,1,0,1,-1


## Selecting features

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.


In [None]:
target = 'safe_loans'
features = ['grade',                     # grade of the loan (categorical)
            'sub_grade_num',             # sub-grade of the loan as a number from 0 to 1
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'payment_inc_ratio',         # ratio of the monthly payment to income
            'delinq_2yrs',               # number of delinquencies
            'delinq_2yrs_zero',          # no delinquencies in last 2 years
            'inq_last_6mths',            # number of creditor inquiries in last 6 months
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'open_acc',                  # number of open credit accounts
            'pub_rec',                   # number of derogatory public records
            'pub_rec_zero',              # no derogatory public records
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
            'int_rate',                  # interest rate of the loan
            'total_rec_int',             # interest received to date
            'annual_inc',                # annual income of borrower
            'funded_amnt',               # amount committed to the loan
            'funded_amnt_inv',           # amount committed by investors for the loan
            'installment',               # monthly payment owed by the borrower
           ]

## Skipping observations with missing values

Recall from the lectures that one common approach to coping with missing values is to **skip** observations that contain missing values.

We run the following code to do so:

In [None]:
n = loans_df.shape[0]
loans_df = loans_df[features + [target]].dropna()
na = loans_df.shape[0]

print ("Drop {} and keep {}".format(n - na, na))

Drop 29 and keep 122578
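To make the effect of `dropna` concrete, here is a minimal sketch on a made-up four-row frame. The `toy` frame and its values are invented for illustration and are not taken from the LendingClub data. Any row with a missing value in one of the selected columns is discarded.

```python
import numpy as np
import pandas

# Toy frame, invented for illustration only.
toy = pandas.DataFrame({
    "dti": [27.65, np.nan, 8.72, 20.00],
    "grade": ["B", "C", None, "C"],
    "safe_loans": [1, -1, 1, 1],
})

cols = ["dti", "grade", "safe_loans"]
n_before = toy.shape[0]
toy_clean = toy[cols].dropna()   # drop every row with at least one missing value
n_after = toy_clean.shape[0]

print("Drop {} and keep {}".format(n_before - n_after, n_after))
```

Rows 1 and 2 each contain one missing value, so only rows 0 and 3 survive.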


## Subsample dataset to make sure classes are balanced
Just as we did in the previous assignment, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We use `random_state=1` so that everyone gets the same results.

In [None]:
safe_loans_raw = loans_df[loans_df[target] == +1]
risky_loans_raw = loans_df[loans_df[target] == -1]

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = risky_loans_raw.shape[0]/safe_loans_raw.shape[0]

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(frac=percentage, random_state=1)

# Combine the risky_loans with the downsampled version of safe_loans
# (DataFrame.append was removed in pandas 2.0; pandas.concat is the replacement)
loans_data = pandas.concat([risky_loans, safe_loans])

print ("Percentage of safe loans                 : {}".format(safe_loans.shape[0] / loans_data.shape[0]))
print ("Percentage of risky loans                : {}".format(risky_loans.shape[0] / loans_data.shape[0]))
print ("Total number of loans in our new dataset : {}".format(loans_data.shape[0]))

Percentage of safe loans                 : 0.5
Percentage of risky loans                : 0.5
Total number of loans in our new dataset : 46294
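The undersampling idea can be sketched on a tiny invented label column. This is a variation on the cell above, not a copy of it: it samples by count (`n=`) rather than by fraction (`frac=`), which sidesteps rounding, and the `toy` frame is hypothetical.

```python
import pandas

# Toy labels, invented for illustration: 8 safe loans (+1), 2 risky loans (-1).
toy = pandas.DataFrame({"safe_loans": [1] * 8 + [-1] * 2})

safe = toy[toy["safe_loans"] == 1]
risky = toy[toy["safe_loans"] == -1]

# Keep every risky loan and draw an equally sized random sample of safe loans.
safe_down = safe.sample(n=len(risky), random_state=1)
balanced = pandas.concat([risky, safe_down])

print(balanced["safe_loans"].value_counts())
```

After balancing, each class contributes the same number of rows, at the price of discarding six of the eight safe loans.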


## Transform categorical data into binary features
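
One-hot encoding replaces each categorical column with one indicator column per category. A minimal sketch on invented values (the `toy` frame and the explicit `categorical` list are assumptions made for illustration) shows the shape of the result:

```python
import pandas

# Toy frame, invented for illustration only.
toy = pandas.DataFrame({
    "grade": ["A", "B", "A"],
    "dti": [11.2, 27.65, 1.0],
})

categorical = ["grade"]  # listed by hand here; the notebook detects these by dtype

# One indicator column per category, named <column>_<category>.
dummies = pandas.get_dummies(toy[categorical], prefix=categorical)
encoded = pandas.concat([toy.drop(columns=categorical), dummies], axis=1)

print(list(encoded.columns))
print(encoded["grade_A"].sum())  # number of rows with grade A
```

Summing an indicator column, as in the last line, counts how many rows fall in that category.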

In [None]:
print(loans_data.dtypes)
categorical_variables = list(loans_data.select_dtypes(include=['object']).columns)
print(categorical_variables)

one_hot_data = pandas.get_dummies(loans_data[categorical_variables], prefix=categorical_variables)
# drop() returns a new DataFrame; reassign it to remove the categorical columns.
loans_data = loans_data.drop(columns=categorical_variables)
loans_data = pandas.concat([loans_data, one_hot_data], axis=1)

print(loans_data['grade_A'].values.sum())
print(loans_data.dtypes)

grade                     object
sub_grade_num            float64
short_emp                  int64
emp_length_num             int64
home_ownership            object
dti                      float64
purpose                   object
payment_inc_ratio        float64
delinq_2yrs              float64
delinq_2yrs_zero         float64
inq_last_6mths           float64
last_delinq_none           int64
last_major_derog_none      int64
open_acc                 float64
pub_rec                  float64
pub_rec_zero             float64
revol_util               float64
total_rec_late_fee       float64
int_rate                 float64
total_rec_int            float64
annual_inc               float64
funded_amnt                int64
funded_amnt_inv            int64
installment              float64
safe_loans                 int64
dtype: object
['grade', 'home_ownership', 'purpose']
6522
sub_grade_num                 float64
short_emp                       int64
emp_length_num                  int64
dti

## Train-test split

We split the data into a training set and a validation set, with 80% of the data in the training set and 20% in the validation set. We set the NumPy seed to 1 (`np.random.seed(1)`) so that everyone gets the same result.

In [None]:
np.random.seed(1)

train_data, validation_data = train_test_split(loans_data, test_size=0.2)

# Gradient boosted tree classifier
Gradient boosted trees are a powerful variant of boosting methods; they have been used to win many Kaggle competitions, and have been widely used in industry. We will explore the predictive power of multiple decision trees as opposed to a single decision tree.
<br>
Now, let's use scikit-learn's built-in gradient boosting classifier [sklearn.ensemble.GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) to create a gradient boosted classifier on the training data. You will need to import **sklearn, sklearn.ensemble, and numpy.**
<br>
You will have to first convert the DataFrame into a numpy data matrix. See the API for more information. You will also have to extract the label column. Make sure to set **max_depth=6** and **n_estimators=5.**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model_5 = GradientBoostingClassifier(n_estimators=5, max_depth=6, random_state=0)

X = train_data.loc[:, train_data.columns != target].values
y = train_data[target].values

model_5.fit(X, y)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=6,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=5,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=0, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

# Making predictions

Just like we did in previous sections, let us consider a few positive and negative examples **from the validation set**. We will do the following:
* Predict whether or not a loan is likely to default.
* Predict the probability with which the loan is likely to default.

In [None]:
# Select all positive and negative examples.
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

# Select 2 examples from the validation set for positive & negative loans
sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

# Combine the 4 examples into a single dataset
# (DataFrame.append was removed in pandas 2.0; pandas.concat is the replacement)
sample_validation_data = pandas.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data[target]

93539     1
10285     1
121325   -1
74825    -1
Name: safe_loans, dtype: int64

### Predicting on sample validation data

For each row in the **sample_validation_data**, write code to make **model_5** predict whether or not the loan is classified as a **safe loan**.

**Hint:** Use the `predict` method in `model_5` for this.

In [None]:
print(model_5.predict(sample_validation_data.loc[:, train_data.columns != target].values))
print(model_5.predict_proba(sample_validation_data.loc[:, train_data.columns != target].values))

[-1  1 -1 -1]
[[0.54974932 0.45025068]
 [0.41588311 0.58411689]
 [0.6787601  0.3212399 ]
 [0.5564696  0.4435304 ]]


**Quiz Question:** What percentage of the predictions on `sample_validation_data` did `model_5` get correct?
<br>
**Your answer:**
<br>
**Checkpoint:** Can you verify that for all the predictions with `probability >= 0.5`, the model predicted the label **+1**?
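
One way to run this kind of check is sketched here on synthetic data (the data, the variable names, and the smaller `max_depth` are assumptions made to keep the example self-contained). The columns of `predict_proba` follow the order of `classes_`, so with labels -1 and +1 the second column is the probability of +1. The sketch uses a strict `> 0.5` so that an exact tie does not matter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, generated only for this sketch; labels recoded to -1/+1.
X_toy, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y_toy = np.where(y01 == 1, 1, -1)

clf = GradientBoostingClassifier(n_estimators=5, max_depth=3, random_state=0)
clf.fit(X_toy, y_toy)

# classes_ is sorted, so column 0 is P(-1) and column 1 is P(+1).
pos_col = list(clf.classes_).index(1)
proba_pos = clf.predict_proba(X_toy)[:, pos_col]
labels = clf.predict(X_toy)

# Every example whose P(+1) exceeds 0.5 should carry the label +1.
agree = bool(np.all(labels[proba_pos > 0.5] == 1))
print(clf.classes_, agree)
```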

## Evaluating the model on the validation data
Recall that the accuracy is defined as follows:
$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

Evaluate the accuracy of the **model_5** on the **validation_data**.

**Hint**: Use the `.score()` method in the model.
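
The definition above can also be computed by hand, which is a handy sanity check on what `.score()` returns for a classifier (mean accuracy). The labels and predictions here are invented for illustration:

```python
import numpy as np

# Hypothetical labels and predictions, invented for illustration.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])

# Fraction of examples whose predicted label matches the true label.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 3 of 5 correct
```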

**Quiz Question:** What is the accuracy of **model_5**?
<br>
**Your answer:**

In [None]:
# YOUR CODE HERE
val_X = validation_data.loc[:, train_data.columns != target].values
val_y = validation_data[target].values
print(model_5.score(val_X, val_y))

0.6595744680851063


In [None]:
predictions = model_5.predict(val_X)

In [None]:
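# YOUR CODE HERE
# Positive class = safe loan (+1), negative class = risky loan (-1).
pos = val_y == 1
neg = val_y == -1

false_neg = np.sum(predictions[pos] != val_y[pos])  # safe loans predicted risky
false_pos = np.sum(predictions[neg] != val_y[neg])  # risky loans predicted safe

# Printed as (false negatives, false positives)
print(false_neg, false_pos)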

1479 1673
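
As a cross-check on counts like these, scikit-learn's `confusion_matrix` can be used. Taking +1 (safe) as the positive class, a false positive is a risky loan predicted safe and a false negative is a safe loan predicted risky. The sketch below uses six invented labels rather than the validation data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, invented for illustration.
y_true = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, -1, 1, 1, 1, -1])

# With labels=[-1, 1], rows are true classes and columns are predicted classes,
# so ravel() yields (true neg, false pos, false neg, true pos).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
print("false positives:", fp, "false negatives:", fn)
```

Passing `labels=[-1, 1]` explicitly fixes the row and column order, so the unpacking does not depend on which labels happen to appear in the data.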


**Quiz Question:** What is the number of **false positives**?
<br>
**Your answer:**
<br>
**Quiz Question:** What is the number of **false negatives**?
<br>
**Your answer:**

## Comparison with decision trees

In the earlier assignment, we saw that the prediction accuracy of the decision trees was around **0.64** (rounded). In this assignment, we saw that **model_5** has an accuracy of **0.66** (rounded).

Here, we quantify the benefit of the extra 2% increase in accuracy of **model_5** in comparison with a single decision tree from the original decision tree assignment.

As in the earlier assignment, we calculate the cost of the mistakes made by the model, using the same costs as before:

* **False negatives**: Assume a cost of \$10,000 per false negative.
* **False positives**: Assume a cost of \$20,000 per false positive.

Assume that the number of false positives and false negatives for the learned decision tree was

* **False negatives**: 1936
* **False positives**: 1503

Using the costs defined above and the number of false positives and false negatives for the decision tree, we can calculate the total cost of the mistakes made by the decision tree model as follows:

```
cost = $10,000 * 1936  + $20,000 * 1503 = $49,420,000
```

The total cost of the mistakes made by the model is \$49.42M. That is a **lot of money**!
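
The same arithmetic can be wrapped in a small helper so that it is easy to reuse with other counts. The function name `mistake_cost` and the constant names are hypothetical, introduced only for this sketch:

```python
COST_FALSE_NEGATIVE = 10_000  # dollars per false negative
COST_FALSE_POSITIVE = 20_000  # dollars per false positive

def mistake_cost(false_negatives, false_positives):
    """Total dollar cost of a model's mistakes under the costs above."""
    return (COST_FALSE_NEGATIVE * false_negatives
            + COST_FALSE_POSITIVE * false_positives)

# Decision tree counts quoted above: 1936 false negatives, 1503 false positives.
tree_cost = mistake_cost(false_negatives=1936, false_positives=1503)
print(tree_cost)  # 49420000
```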

**Quiz Question**: Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (**model_5**) as evaluated on the **validation_data**?
<br>
**Your answer:**

In [None]:
# YOUR CODE HERE
cost = 10000 * 1479  + 20000 * 1673
print(cost)

48250000


## Most positive & negative loans.

In this section, we will find the loans that are most likely to be predicted **safe**. We can do this in a few steps:

* **Step 1**: Use the **model_5** (the model with 5 trees) and make **probability predictions** for all the loans in the **validation_data**.
* **Step 2**: Similar to what we did in the very first assignment, add the probability predictions as a column called **predictions** into the validation_data.
* **Step 3**: Sort the data (in decreasing order) by the probability predictions.

Start here with **Step 1** & **Step 2**. Make predictions using **model_5** for examples in the **validation_data**. Use the `predict_proba` method, whose second column holds the probability of the **+1** (safe) class.

In [None]:
predictions_prob = model_5.predict_proba(val_X)[:, 1]  # column 1 = P(safe_loans == +1)
validation_data['predictions'] = predictions_prob

**Checkpoint:** For each row, the probabilities should be a number in the range **[0, 1]**. We have provided a simple check here to make sure your answers are correct.
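
A check of this kind could look like the following sketch. The helper name `probabilities_in_range` is hypothetical and the sample values are only illustrative:

```python
import numpy as np

def probabilities_in_range(p):
    """Return True if every value in p lies in the closed interval [0, 1]."""
    p = np.asarray(p, dtype=float)
    return bool(np.all((p >= 0.0) & (p <= 1.0)))

print(probabilities_in_range([0.45, 0.58, 0.32, 0.44]))  # valid probabilities
print(probabilities_in_range([0.45, 1.20]))              # 1.20 is out of range
```

In the notebook, the same helper could be applied to the new column, for example `probabilities_in_range(validation_data['predictions'])`.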

Now, we are ready to go to **Step 3**. You can now use the `predictions` column to sort the loans in **validation_data** (in descending order) by prediction probability. Find the top 5 loans with the highest probability of being predicted as a **safe loan**.

**Quiz Question**: What grades are the top 5 loans?
<br>
**Your answer**:

In [None]:
# YOUR CODE HERE
high_safe = validation_data.sort_values('predictions', ascending=False)
print (high_safe[:5].loc[:, ['grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F', 'grade_G', 'predictions', target]])

low_safe = validation_data.sort_values('predictions', ascending=True)
print (low_safe[:5].loc[:, ['grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F', 'grade_G', 'predictions', target]])

        grade_A  grade_B  grade_C  ...  grade_G  predictions  safe_loans
29080         1        0        0  ...        0      0.66411           1
54108         1        0        0  ...        0      0.66411           1
9624          1        0        0  ...        0      0.66411           1
85067         1        0        0  ...        0      0.66411          -1
116768        1        0        0  ...        0      0.66411           1

[5 rows x 9 columns]
       grade_A  grade_B  grade_C  ...  grade_G  predictions  safe_loans
30677        1        0        0  ...        0     0.308491          -1
25697        0        0        1  ...        0     0.308606          -1
52678        0        1        0  ...        0     0.308606          -1
24272        0        1        0  ...        0     0.310796          -1
29273        0        0        1  ...        0     0.310835          -1

[5 rows x 9 columns]
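
Because the grade is spread over several indicator columns, and wide frames are truncated when printed, it can help to collapse those columns back into a single label. A minimal sketch on an invented frame (only three grade columns, made-up probabilities):

```python
import pandas

# Toy frame, invented for illustration only.
grade_cols = ["grade_A", "grade_B", "grade_C"]
toy = pandas.DataFrame({
    "grade_A": [1, 0, 0],
    "grade_B": [0, 1, 0],
    "grade_C": [0, 0, 1],
    "predictions": [0.66, 0.31, 0.45],
})

# Take the two highest-probability rows, then recover each row's grade
# as the name of the indicator column that is set.
top = toy.sort_values("predictions", ascending=False).head(2).copy()
top["grade"] = top[grade_cols].idxmax(axis=1).str.replace("grade_", "", regex=False)

print(top[["grade", "predictions"]])
```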


## Effect of adding more trees
In this section, we will train 5 different ensemble classifiers in the form of gradient boosted trees. We will train models with 10, 50, 100, 200, and 500 trees. We use the **n_estimators** parameter of `GradientBoostingClassifier` to set the number of trees.

Let's get started by training one model for each value of **n_estimators**:

In [None]:
model_10 = GradientBoostingClassifier(n_estimators=10, max_depth=6, random_state=0).fit(X, y)
model_50 = GradientBoostingClassifier(n_estimators=50, max_depth=6, random_state=0).fit(X, y)
model_100 = GradientBoostingClassifier(n_estimators=100, max_depth=6, random_state=0).fit(X, y)
model_200 = GradientBoostingClassifier(n_estimators=200, max_depth=6, random_state=0).fit(X, y)
model_500 = GradientBoostingClassifier(n_estimators=500, max_depth=6, random_state=0).fit(X, y)

In [None]:
print('Train accuracy {}, val accuracy {}'.format(model_10.score(X, y), model_10.score(val_X, val_y)))
print('Train accuracy {}, val accuracy {}'.format(model_50.score(X, y), model_50.score(val_X, val_y)))
print('Train accuracy {}, val accuracy {}'.format(model_100.score(X, y), model_100.score(val_X, val_y)))
print('Train accuracy {}, val accuracy {}'.format(model_200.score(X, y), model_200.score(val_X, val_y)))
print('Train accuracy {}, val accuracy {}'.format(model_500.score(X, y), model_500.score(val_X, val_y)))

Train accuracy 0.671769947347104, val accuracy 0.6619505346149692
Train accuracy 0.7183745105980829, val accuracy 0.6785830003240091
Train accuracy 0.7470500877548265, val accuracy 0.682579112215142
Train accuracy 0.7874173079519373, val accuracy 0.6858192029376823
Train accuracy 0.8665316592412583, val accuracy 0.6830111243114807
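
Training a separate model per ensemble size is the approach taken here. As an alternative sketch (not what the cells above do), `GradientBoostingClassifier.staged_predict` yields the predictions of a single fitted model after each additional tree, so validation accuracy can be traced for every ensemble size from one fit. The synthetic data and the smaller settings are assumptions chosen to keep the example quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, generated only for this sketch.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1)

n_trees = 50
clf = GradientBoostingClassifier(n_estimators=n_trees, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_trees trees.
val_acc = [np.mean(pred == y_val) for pred in clf.staged_predict(X_val)]

best = int(np.argmax(val_acc)) + 1  # ensemble size with the highest validation accuracy
print("best number of trees:", best, "validation accuracy:", val_acc[best - 1])
```

The last entry of `val_acc` matches `clf.score(X_val, y_val)`, and the best size need not be the largest one, which is the same pattern the table above shows between 200 and 500 trees.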


**Quiz Question:** Which model has the **best** accuracy on the **validation_data**?
<br>
**Your answer**: model_200
<br>
**Quiz Question:** Is it always true that the model with the most trees will perform best on test data?
<br>
**Your answer**: No