# -----------------------------
# TO BE COMPLETED
# -----------------------------

# Exploring Ensemble Methods

In this assignment, we will explore the use of boosting. We will use the pre-implemented gradient boosted trees. You will:

* Use pandas DataFrames to do some feature engineering.
* Train a boosted ensemble of decision-trees (gradient boosted trees) on the LendingClub dataset.
* Predict whether a loan will default along with prediction probabilities (on a validation set).
* Evaluate the trained model and compare it with a baseline.
* Find the most positive and negative loans using the learned model.
* Explore how the number of trees influences classification performance.

Let's get started!

In [1]:
import pandas as pd
import numpy as np

In [2]:
import ast

In [3]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

# Load LendingClub dataset

We will be using the [LendingClub](https://www.lendingclub.com/) data. As discussed earlier, [LendingClub](https://www.lendingclub.com/) is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors.

Just like we did in previous assignments, we will build a classification model to predict whether or not a loan provided by lending club is likely to default.

Let us start by loading the data.

In [1]:
loans = pd.read_csv("./data/lending-club-data.csv", index_col=False)


Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset. We have done this in previous assignments, so we won't belabor this here.

In [51]:
loans.columns

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'is_inc_v', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans', 'bad_loans',
       'emp_length_num', 'grade_num', 'sub_gra

## Modifying the target column

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column **1** means a risky (bad) loan and **0** means a safe loan.

As in past assignments, in order to make this more intuitive and consistent with the lectures, we reassign the target to be:
* **+1** as a safe  loan, 
* **-1** as a risky (bad) loan. 

We put this in a new column called `safe_loans`.

In [52]:
loans["safe_loans"] = loans["bad_loans"].apply(lambda x : +1 if x==0 else -1)
loans.drop("bad_loans", inplace=True, axis=1)

## Selecting features

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.


In [53]:
target = 'safe_loans'
features = ['grade',                     # grade of the loan (categorical)
            'sub_grade_num',             # sub-grade of the loan as a number from 0 to 1
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'payment_inc_ratio',         # ratio of the monthly payment to income
            'delinq_2yrs',               # number of delinquencies 
            'delinq_2yrs_zero',          # no delinquencies in last 2 years
            'inq_last_6mths',            # number of creditor inquiries in last 6 months
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'open_acc',                  # number of open credit accounts
            'pub_rec',                   # number of derogatory public records
            'pub_rec_zero',              # no derogatory public records
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
            'int_rate',                  # interest rate of the loan
            'total_rec_int',             # interest received to date
            'annual_inc',                # annual income of borrower
            'funded_amnt',               # amount committed to the loan
            'funded_amnt_inv',           # amount committed by investors for the loan
            'installment',               # monthly payment owed by the borrower
           ]
loans = loans[features + [target]]

## Skipping observations with missing values

Recall from the lectures that one common approach to coping with missing values is to **skip** observations that contain missing values.

Let's see how many NaN values are in each column of the DataFrame:

In [54]:
loans.isnull().sum()

grade                     0
sub_grade_num             0
short_emp                 0
emp_length_num            0
home_ownership            0
dti                       0
purpose                   0
payment_inc_ratio         4
delinq_2yrs              29
delinq_2yrs_zero         29
inq_last_6mths           29
last_delinq_none          0
last_major_derog_none     0
open_acc                 29
pub_rec                  29
pub_rec_zero             29
revol_util                0
total_rec_late_fee        0
int_rate                  0
total_rec_int             0
annual_inc                4
funded_amnt               0
funded_amnt_inv           0
installment               0
safe_loans                0
dtype: int64

Get the row indices of the NaN values:

In [55]:
print("delinq_2yrs:    ",loans[loans.isnull()["delinq_2yrs"]].index.tolist())
print("open_acc:       ",loans[loans.isnull()["open_acc"]].index.tolist())
print("annual_inc:     ",loans[loans.isnull()["annual_inc"]].index.tolist())

delinq_2yrs:     [38141, 38142, 38151, 38164, 38172, 38175, 38186, 38201, 38206, 38207, 38208, 38209, 38210, 38211, 38212, 38213, 38214, 38215, 38216, 38217, 38218, 38219, 38220, 38221, 38222, 38223, 38224, 38225, 38226]
open_acc:        [38141, 38142, 38151, 38164, 38172, 38175, 38186, 38201, 38206, 38207, 38208, 38209, 38210, 38211, 38212, 38213, 38214, 38215, 38216, 38217, 38218, 38219, 38220, 38221, 38222, 38223, 38224, 38225, 38226]
annual_inc:      [38141, 38142, 38172, 38225]


Now we remove the rows that contain NaN values:

In [56]:
num_rows_all = loans.shape[0]
loans = loans.dropna(axis = 0)
num_rows = loans.shape[0]

# Count the number of rows with missing data
print("Dropping",num_rows_all - num_rows,"observations; keeping",num_rows)

Dropping 29 observations; keeping 122578


Fortunately, there are not too many missing values. We are retaining most of the data.

### Resetting the index
The index of the DataFrame needs to be reset after dropping rows. The downloadable index files appear to refer to the index after the reset. Indeed, if the index is not reset before creating `train_data`, the `train_data` DataFrame will contain NaN values for the following index values:

**{38141, 38151, 38175, 38186, 38208, 38209, 38211, 38223, 38225, 38226}**

These correspond to index values of rows that contained NaN in the initial `loans` DataFrame, i.e. rows that have been dropped.

In [57]:
loans = loans.reset_index(drop=True)
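
To see why the reset matters, here is a minimal sketch on a toy DataFrame. The data and the `positions` list are made up for illustration; they only assume that the index files number the rows 0..n-1. After `dropna`, the surviving rows keep their old labels, so such a list no longer matches them until the index is reset.

```python
import numpy as np
import pandas as pd

# Toy data: the row labelled 1 has a missing value and will be dropped.
toy = pd.DataFrame({"x": [10.0, np.nan, 30.0, 40.0]})
toy = toy.dropna(axis=0)
labels_before = toy.index.tolist()   # old labels survive, leaving a gap

# Hypothetical index list that assumes rows are numbered 0..n-1.
positions = [0, 1, 2]

toy_reset = toy.reset_index(drop=True)
labels_after = toy_reset.index.tolist()
selected = toy_reset.loc[positions, "x"].tolist()

print(labels_before)
print(labels_after)
print(selected)
```

Without the reset, label 1 would be missing from the toy frame, which mirrors the NaN rows described above.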

In [58]:
loans.head()

Unnamed: 0,grade,sub_grade_num,short_emp,emp_length_num,home_ownership,dti,purpose,payment_inc_ratio,delinq_2yrs,delinq_2yrs_zero,...,pub_rec_zero,revol_util,total_rec_late_fee,int_rate,total_rec_int,annual_inc,funded_amnt,funded_amnt_inv,installment,safe_loans
0,B,0.4,0,11,RENT,27.65,credit_card,8.1435,0.0,1.0,...,1.0,83.7,0.0,10.65,861.07,24000.0,5000,4975,162.87,1
1,C,0.8,1,1,RENT,1.0,car,2.3932,0.0,1.0,...,1.0,9.4,0.0,15.27,435.17,30000.0,2500,2500,59.83,-1
2,C,1.0,0,11,RENT,8.72,small_business,8.25955,0.0,1.0,...,1.0,98.5,0.0,15.96,603.65,12252.0,2400,2400,84.33,1
3,C,0.2,0,11,RENT,20.0,other,8.27585,0.0,1.0,...,1.0,21.0,16.97,13.49,2209.33,49200.0,10000,10000,339.31,1
4,A,0.8,0,4,RENT,11.2,wedding,5.21533,0.0,1.0,...,1.0,28.3,0.0,7.9,631.38,36000.0,5000,5000,156.46,1


## Make sure the classes are balanced

We saw in an earlier assignment that this dataset is also imbalanced. We will undersample the larger class (safe loans) in order to balance out our dataset. We used `seed=1` to make sure everyone gets the same results.

**This step is not needed here: the index files we download when not using SFrame already take care of the class imbalance.**

**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this [paper](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5128907&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F69%2F5173046%2F05128907.pdf%3Farnumber%3D5128907 ). For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.

## Transform categorical data into binary features
### Let's use scikit-learn for that
Some features are categorical. We loop through each categorical feature and, for a feature with **k categories**, we use LabelEncoder and OneHotEncoder to create **k binary features**.
Then we concatenate the created features to **loans**.

First, get the names of the features to which one-hot encoding shall be applied:

In [59]:
categorical_variables = []
for feat_name, feat_type in zip(loans.columns.tolist(), loans.dtypes):
    if feat_type == object:
        categorical_variables.append(feat_name)
print("categorical_variables:", categorical_variables)

categorical_variables: ['grade', 'home_ownership', 'purpose']


In [60]:
le = LabelEncoder()
ohe = OneHotEncoder(sparse=False)

**For the concat below to work, the index of `loans` must be used when creating the DataFrame from `np_ar_ohe`. Otherwise concat gives incorrect results.**

In [61]:
for f in categorical_variables:
    le.fit(loans[f])
    col_names = []
    for le_class in le.classes_:
        col_names.append(f + "." + le_class)
    np_ar = le.transform(loans[f])
    np_ar_ohe = ohe.fit_transform(np_ar.reshape(-1,1))
    loans = pd.concat([loans, pd.DataFrame(np_ar_ohe, columns=col_names, index=loans.index)],axis=1)

### Now let's drop old features

In [62]:
loans.drop(categorical_variables, axis=1, inplace=True)
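
As a possible alternative to the LabelEncoder/OneHotEncoder loop, pandas can build the same kind of binary columns in one call. The sketch below uses a made-up toy frame rather than the real `loans` data, and assumes the `feature.category` naming used above is what we want. It may also be convenient because recent scikit-learn releases name the `sparse` argument of `OneHotEncoder` `sparse_output` instead.

```python
import pandas as pd

# Toy stand-in for the loans DataFrame (values are made up).
toy_loans = pd.DataFrame({
    "grade": ["B", "C", "A"],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "dti": [27.65, 1.0, 11.2],
})

# One binary column per category, named "<feature>.<category>";
# the original categorical columns are dropped automatically.
encoded = pd.get_dummies(
    toy_loans,
    columns=["grade", "home_ownership"],
    prefix_sep=".",
    dtype=float,
)

print(encoded.columns.tolist())
```

Because `get_dummies` works on the DataFrame itself, the index alignment issue with `concat` does not arise in this variant.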

In [63]:
loans.head()

Unnamed: 0,sub_grade_num,short_emp,emp_length_num,dti,payment_inc_ratio,delinq_2yrs,delinq_2yrs_zero,inq_last_6mths,last_delinq_none,last_major_derog_none,...,purpose.debt_consolidation,purpose.home_improvement,purpose.house,purpose.major_purchase,purpose.medical,purpose.moving,purpose.other,purpose.small_business,purpose.vacation,purpose.wedding
0,0.4,0,11,27.65,8.1435,0.0,1.0,1.0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.8,1,1,1.0,2.3932,0.0,1.0,5.0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0,11,8.72,8.25955,0.0,1.0,2.0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.2,0,11,20.0,8.27585,0.0,1.0,1.0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.8,0,4,11.2,5.21533,0.0,1.0,3.0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### And finally recreate features list

In [64]:
features = loans.columns.tolist()
features.remove("safe_loans")

The full list of features is now:

In [65]:
print(features)

['sub_grade_num', 'short_emp', 'emp_length_num', 'dti', 'payment_inc_ratio', 'delinq_2yrs', 'delinq_2yrs_zero', 'inq_last_6mths', 'last_delinq_none', 'last_major_derog_none', 'open_acc', 'pub_rec', 'pub_rec_zero', 'revol_util', 'total_rec_late_fee', 'int_rate', 'total_rec_int', 'annual_inc', 'funded_amnt', 'funded_amnt_inv', 'installment', 'grade.A', 'grade.B', 'grade.C', 'grade.D', 'grade.E', 'grade.F', 'grade.G', 'home_ownership.MORTGAGE', 'home_ownership.OTHER', 'home_ownership.OWN', 'home_ownership.RENT', 'purpose.car', 'purpose.credit_card', 'purpose.debt_consolidation', 'purpose.home_improvement', 'purpose.house', 'purpose.major_purchase', 'purpose.medical', 'purpose.moving', 'purpose.other', 'purpose.small_business', 'purpose.vacation', 'purpose.wedding']


In [95]:
len(features)

44

## Download training and validation sets

In [66]:
f = open("./data/module-8-assignment-1-train-idx.json")
train_data_index = f.readline()
f.close()

# Transform the read string into a list
train_data_index = ast.literal_eval(train_data_index)

train_data = loans.loc[train_data_index]

In [67]:
f = open("./data/module-8-assignment-1-validation-idx.json")
validation_data_index = f.readline()
f.close()

# Transform the read string into a list
validation_data_index = ast.literal_eval(validation_data_index)

validation_data = loans.loc[validation_data_index]
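
The two cells above repeat the same open/read/parse steps. A small helper could factor this out; the sketch below assumes each index file is valid JSON holding a single list of integers, and the helper name `load_index_list` and the demo file are hypothetical.

```python
import json
import os
import tempfile

def load_index_list(path):
    """Read a JSON file holding a list of integer row indices."""
    with open(path) as f:
        return json.load(f)

# Demonstration with a throwaway file (contents are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-idx.json")
    with open(demo_path, "w") as f:
        f.write("[1, 6, 7, 10]")
    demo_index = load_index_list(demo_path)

print(demo_index)
```

In the notebook this would be used as `loans.loc[load_index_list("./data/module-8-assignment-1-train-idx.json")]`.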

# Gradient boosted tree classifier

Gradient boosted trees are a powerful variant of boosting methods; they have been used to win many [Kaggle](https://www.kaggle.com/) competitions, and have been widely used in industry.  We will explore the predictive power of multiple decision trees as opposed to a single decision tree.

**Additional reading:** If you are interested in gradient boosted trees, here is some additional reading material:
* [Advanced material on boosted trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)


We will now train models to predict `safe_loans` using the features above. In this section, we will experiment with training an ensemble of 5 trees.

We will use scikit-learn's built-in gradient boosting classifier **GradientBoostingClassifier** with settings **max_depth=6** and **n_estimators=5**.

In [71]:
model_5 = GradientBoostingClassifier(n_estimators = 5, max_depth = 6)

In [72]:
model_5.fit(train_data[features], train_data["safe_loans"])

GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=6, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=5, presort='auto',
              random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

# Making predictions

Just like we did in previous sections, let us consider a few positive and negative examples **from the validation set**. We will do the following:
* Predict whether or not a loan is likely to default.
* Predict the probability with which the loan is likely to default.

Select two positive and two negative examples:

In [73]:
# Select all positive and negative examples.
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

# Select 2 examples from the validation set for positive & negative loans
sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

Append the 4 examples into a single dataset:

In [74]:
sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data

Unnamed: 0,sub_grade_num,short_emp,emp_length_num,dti,payment_inc_ratio,delinq_2yrs,delinq_2yrs_zero,inq_last_6mths,last_delinq_none,last_major_derog_none,...,purpose.debt_consolidation,purpose.home_improvement,purpose.house,purpose.major_purchase,purpose.medical,purpose.moving,purpose.other,purpose.small_business,purpose.vacation,purpose.wedding
22,0.2,0,3,29.44,6.30496,0.0,1.0,0.0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26,0.6,1,1,12.19,13.4952,0.0,1.0,0.0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24,0.4,0,3,13.97,2.96736,3.0,0.0,0.0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
41,1.0,0,11,16.33,1.90524,0.0,1.0,0.0,1,1,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Predicting on sample validation data

For each row in the **sample_validation_data**, write code to make **model_5** predict whether or not the loan is classified as a **safe loan**.

**Hint:** Use the `predict` method in `model_5` for this.

In [75]:
prediction = model_5.predict(sample_validation_data[features])

In [76]:
print("prediction:")
print(prediction.reshape(-1,1))
print("sample_validation_data[target]:")
print(sample_validation_data[target])

prediction:
[[ 1]
 [ 1]
 [-1]
 [ 1]]
sample_validation_data[target]:
22    1
26    1
24   -1
41   -1
Name: safe_loans, dtype: int64


**Quiz Question:** What percentage of the predictions on `sample_validation_data` did `model_5` get correct?

**Quiz Response:** 3 of 4 predictions are correct = **75%**

### Prediction probabilities

For each row in the **sample_validation_data**, what is the probability (according to **model_5**) of a loan being classified as **safe**?

In [77]:
prediction_proba = model_5.predict_proba(sample_validation_data[features])

In [78]:
print("prediction_proba:")
print(prediction_proba)

prediction_proba:
[[ 0.41642331  0.58357669]
 [ 0.46949689  0.53050311]
 [ 0.53807792  0.46192208]
 [ 0.39591639  0.60408361]]


**Quiz Question:** According to **model_5**, which loan is the least likely to be a safe loan?

**Quiz Response:** **Third** (probability of only 46% of being safe)

**Checkpoint:** Can you verify that for all the predictions with `probability >= 0.5`, the model predicted the label **+1**?
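
The checkpoint can be verified with a few lines of NumPy. This sketch copies the four probability rows and the four predicted labels printed above, and assumes that column 1 holds the probability of class **+1** (in the notebook, `model_5.classes_` shows the actual column order).

```python
import numpy as np

# Probabilities printed above: column 0 is P(-1), column 1 is P(+1),
# following the sorted class order [-1, +1] (assumed here).
proba = np.array([
    [0.41642331, 0.58357669],
    [0.46949689, 0.53050311],
    [0.53807792, 0.46192208],
    [0.39591639, 0.60408361],
])
predicted = np.array([1, 1, -1, 1])   # labels printed by model_5.predict

p_safe = proba[:, 1]
labels_from_proba = np.where(p_safe >= 0.5, 1, -1)

checkpoint_ok = bool((labels_from_proba == predicted).all())
least_safe_row = int(np.argmin(p_safe))   # 0-based position

print(checkpoint_ok, least_safe_row)
```

The 0-based position 2 is the third loan, the only one whose probability of being safe falls below 0.5.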

## Evaluating the model on the validation data

Recall that the accuracy is defined as follows:
$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

Evaluate the accuracy of the **model_5** on the **validation_data**.

**Using scikit-learn**

In [79]:
print("model_5 score (accuracy):",model_5.score(validation_data[features], validation_data[target]))

model_5 score (accuracy): 0.661460577337


**Using math!**

In [80]:
prediction = model_5.predict(validation_data[features])

In [81]:
correctly_classified_examples = (prediction == validation_data[target]).sum()
total_examples = validation_data.shape[0]
print("model_5 score (accuracy):",correctly_classified_examples/total_examples)

model_5 score (accuracy): 0.661460577337


Calculate the number of **false positives** made by the model.

**Using scikit-learn (confusion matrix)**

In [82]:
confusion_matrix(validation_data[target],prediction)

array([[3020, 1652],
       [1491, 3121]])

**Using math!!**

In [83]:
false_positives = ((prediction == 1) & (validation_data[target] == -1)).sum()
print("False positives:",false_positives)

False positives: 1652


**Quiz Question**: What is the number of **false positives** on the **validation_data**?

**Quiz Response**: **1652**

Calculate the number of **false negatives** made by the model.

In [84]:
false_negatives = ((prediction == -1) & (validation_data[target] == 1)).sum()
print("False negatives:",false_negatives)

False negatives: 1491
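
All four counts can also be read off the confusion matrix at once. The sketch below copies the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted order [-1, +1].

```python
import numpy as np

# Confusion matrix printed above; rows are true labels and columns are
# predicted labels, both in the order [-1, +1].
cm = np.array([[3020, 1652],
               [1491, 3121]])

true_neg, false_pos, false_neg, true_pos = cm.ravel()
accuracy = (true_neg + true_pos) / cm.sum()

print(false_pos, false_neg, round(accuracy, 4))
```

The accuracy recovered this way agrees with the value reported by `model_5.score` earlier.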


## Comparison with decision trees

In the earlier assignment, we saw that the prediction accuracy of the decision trees was around **0.64** (rounded). In this assignment, we saw that **model_5** has an accuracy of **0.66** (rounded).

Here, we quantify the benefit of the extra 2 percentage points of accuracy of **model_5** in comparison with a single decision tree from the original decision tree assignment.

As in the earlier assignment, we calculate the cost of the mistakes made by the model. We again consider the same costs as follows:

* **False negatives**: Assume a cost of \$10,000 per false negative.
* **False positives**: Assume a cost of \$20,000 per false positive.

Assume that the number of false positives and false negatives for the learned decision tree was

* **False negatives**: 1936
* **False positives**: 1503

Using the costs defined above and the number of false positives and false negatives for the decision tree, we can calculate the total cost of the mistakes made by the decision tree model as follows:

```
cost = $10,000 * 1936  + $20,000 * 1503 = $49,420,000
```

The total cost of the mistakes of the model is \$49.42M. That is a **lot of money**!

**Quiz Question**: Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (**model_5**) as evaluated on the **validation_data**?

In [85]:
print("cost for boosting:", 10000 * false_negatives + 20000 * false_positives)

cost for boosting: 47950000
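
To compare both models under the same cost assumptions (\$10,000 per false negative, \$20,000 per false positive), the arithmetic can be wrapped in a small function. The function name `mistake_cost` is hypothetical; the error counts are the ones quoted in this notebook.

```python
def mistake_cost(false_negatives, false_positives,
                 cost_fn=10_000, cost_fp=20_000):
    """Total cost in dollars for the given error counts."""
    return cost_fn * false_negatives + cost_fp * false_positives

# Counts quoted above for the single decision tree and for model_5.
tree_cost = mistake_cost(false_negatives=1936, false_positives=1503)
boosted_cost = mistake_cost(false_negatives=1491, false_positives=1652)

print(tree_cost, boosted_cost, tree_cost - boosted_cost)
```

Using keyword arguments makes it harder to swap the two counts, which matters because the two kinds of mistakes carry different costs.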


**Reminder**: Compare the cost of the mistakes made by the boosted trees model with the decision tree model. The extra 2 percentage points of prediction accuracy can translate to well over a million dollars! And it was so easy to get by simply boosting our decision trees.

## Most positive & negative loans.

In this section, we will find the loans that are most likely to be predicted **safe**. We can do this in a few steps:

* **Step 1**: Use the **model_5** (the model with 5 trees) and make **probability predictions** for all the loans in the **validation_data**.
* **Step 2**: Similar to what we did in the very first assignment, add the probability predictions as a column called **predictions** into the validation_data.
* **Step 3**: Sort the data (in decreasing order) by the probability predictions.

Start here with **Step 1** & **Step 2**. Make predictions using **model_5** for examples in the **validation_data**. Use the `predict_proba` method to obtain probabilities.

In [86]:
model_5.predict_proba(validation_data[features])

array([[ 0.53807792,  0.46192208],
       [ 0.39591639,  0.60408361],
       [ 0.52012758,  0.47987242],
       ..., 
       [ 0.53530977,  0.46469023],
       [ 0.52280924,  0.47719076],
       [ 0.53807792,  0.46192208]])

In [87]:
validation_data["predictions"] = pd.DataFrame(model_5.predict_proba(validation_data[features]).reshape(-1,1))

**Checkpoint:** For each row, the probabilities should be a number in the range **[0, 1]**. We have provided a simple check here to make sure your answers are correct.

In [88]:
print("Your loans      :")
print(validation_data["predictions"].head(4))
print("Expected answer :[0.4492515948736132, 0.6119100103640573,0.3835981314851436, 0.3693306705994325]")

Your loans      :
24    0.494036
41    0.529939
60    0.530096
93    0.572321
Name: predictions, dtype: float64
Expected answer :[0.4492515948736132, 0.6119100103640573,0.3835981314851436, 0.3693306705994325]


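## ---------------
## The checkpoint fails here, and it does look like an index problem.
## ---------------

`predict_proba` returns an array with two columns, so `.reshape(-1,1)` produces twice as many values as there are rows, and wrapping them in a new DataFrame gives them a 0-based index. The assignment then aligns on index labels, which do not match the labels of **validation_data** (24, 41, 60, ...), so each row receives an unrelated value or NaN. Assigning `model_5.predict_proba(validation_data[features])[:, 1]` directly would keep one probability per row. The sorted results below inherit the same misalignment. Even with that change, the expected values were presumably produced by a different model, so an exact match is not guaranteed.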
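
The index question can be reproduced on a toy example. In this sketch the frame, its labels, and the probability array are all made up. A Series stands in for the one-column DataFrame used above, since both are aligned on index labels when assigned to a column.

```python
import numpy as np
import pandas as pd

# Toy validation frame whose labels are not 0..n-1 (values are made up).
toy_val = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[24, 41, 60])

# Stand-in for the (n, 2) array returned by predict_proba.
proba = np.array([[0.54, 0.46],
                  [0.40, 0.60],
                  [0.52, 0.48]])

# Reshaping to one column yields 2n values with a fresh 0-based index,
# so none of the labels 24, 41, 60 find a match here.
misaligned = pd.Series(proba.reshape(-1), index=range(proba.size))
toy_val["bad"] = misaligned

# Taking column 1 as a plain array assigns by position, one value per row.
toy_val["predictions"] = proba[:, 1]

print(toy_val)
```

With more rows, some labels would happen to exist in the 0-based index and pick up unrelated values instead of NaN, which is harder to spot.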

Now, we are ready to go to **Step 3**. You can now use the `predictions` column to sort the loans in **validation_data** (in descending order) by prediction probability. Find the top 5 loans with the highest probability of being predicted as a **safe loan**.

In [89]:
validation_data_sorted = validation_data.sort_values("predictions", ascending=False)

In [90]:
validation_data_sorted[["predictions","grade.A", "grade.B","grade.C", "grade.D", "grade.E", "grade.F"]].head(5)

Unnamed: 0,predictions,grade.A,grade.B,grade.C,grade.D,grade.E,grade.F
3128,0.675019,0.0,0.0,0.0,0.0,1.0,0.0
8218,0.675019,0.0,0.0,0.0,1.0,0.0,0.0
570,0.675019,0.0,0.0,0.0,1.0,0.0,0.0
1134,0.675019,0.0,1.0,0.0,0.0,0.0,0.0
2012,0.674571,0.0,0.0,1.0,0.0,0.0,0.0


**Quiz Question**: What grades are the top 5 loans?

Let us repeat this exercise to find the top 5 loans (in the **validation_data**) with the **lowest probability** of being predicted as a **safe loan**:

**Checkpoint:** You should expect to see 5 loans with the grade ['**D**', '**C**', '**C**', '**C**', '**B**'] or with ['**D**', '**C**', '**B**', '**C**', '**C**'].

## Effect of adding more trees

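In this assignment, we will train 5 different ensemble classifiers in the form of gradient boosted trees. We will train models with 10, 50, 100, 200, and 500 trees. We use the **n_estimators** parameter of `GradientBoostingClassifier` to set the number of trees.

Let's get started with a model with **n_estimators = 10**:

In [None]:
model_10 = GradientBoostingClassifier(n_estimators = 10, max_depth = 6)
model_10.fit(train_data[features], train_data[target])

Now, train 4 models with **n_estimators** set to:
* `n_estimators = 50`
* `n_estimators = 100`
* `n_estimators = 200`
* `n_estimators = 500`

Let us call these models **model_50**, **model_100**, **model_200**, and **model_500**. scikit-learn prints no training progress by default (`verbose=0`), so there is no output to suppress.

**Warning:** This could take a couple of minutes to run.

In [None]:
model_50 = GradientBoostingClassifier(n_estimators = 50, max_depth = 6).fit(train_data[features], train_data[target])
model_100 = GradientBoostingClassifier(n_estimators = 100, max_depth = 6).fit(train_data[features], train_data[target])
model_200 = GradientBoostingClassifier(n_estimators = 200, max_depth = 6).fit(train_data[features], train_data[target])
model_500 = GradientBoostingClassifier(n_estimators = 500, max_depth = 6).fit(train_data[features], train_data[target])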

## Compare accuracy on entire validation set

Now we will compare the predictive accuracy of our models on the validation set. Evaluate the **accuracy** of the 10, 50, 100, 200, and 500 tree models on the **validation_data**. Use the `.score` method.
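
Beyond scoring separately trained models, scikit-learn's `staged_predict` reports the predictions of a single fitted ensemble after each added tree. The sketch below is only an illustration on synthetic data (not the LendingClub loans), with a small number of shallow trees chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loans data (not the LendingClub dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                   random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ... trees.
val_accuracy = [float(np.mean(pred == y_val))
                for pred in model.staged_predict(X_val)]

best_n_trees = int(np.argmax(val_accuracy)) + 1
print(len(val_accuracy), best_n_trees)
```

The tree count with the best validation accuracy need not be the largest one, which is the point of the quiz questions that follow.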

**Quiz Question:** Which model has the **best** accuracy on the **validation_data**?

**Quiz Question:** Is it always true that the model with the most trees will perform best on test data?

## Plot the training and validation error vs. number of trees

Recall from the lecture that the classification error is defined as

$$
\mbox{classification error} = 1 - \mbox{accuracy} 
$$

In this section, we will plot the **training and validation errors versus the number of trees** to get a sense of how these models are performing. We will compare the 10, 50, 100, 200, and 500 tree models. You will need [matplotlib](http://matplotlib.org/downloads.html) in order to visualize the plots. 

First, make sure this block of code runs on your computer.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
def make_figure(dim, title, xlabel, ylabel, legend):
    plt.rcParams['figure.figsize'] = dim
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if legend is not None:
        plt.legend(loc=legend, prop={'size':15})
    plt.rcParams.update({'font.size': 16})
    plt.tight_layout()

In order to plot the classification errors (on the **train_data** and **validation_data**) versus the number of trees, we will need lists of these errors, which we get by applying the `.score` method and subtracting the resulting accuracy from 1.

**Steps to follow:**

* **Step 1:** Calculate the classification error of each model on the training data (**train_data**).
* **Step 2:** Store the training errors into a list (called `training_errors`) that looks like this:
```
[train_err_10, train_err_50, ..., train_err_500]
```
* **Step 3:** Calculate the classification error of each model on the validation data (**validation_data**).
* **Step 4:** Store the validation classification error into a list (called `validation_errors`) that looks like this:
```
[validation_err_10, validation_err_50, ..., validation_err_500]
```
Once that has been completed, the rest of the code should be able to evaluate correctly and generate the plot.


Let us start with **Step 1**. Write code to compute the classification error on the **train_data** for models **model_10**, **model_50**, **model_100**, **model_200**, and **model_500**.

Now, let us run **Step 2**. Save the training errors into a list called **training_errors**

In [None]:
training_errors = [train_err_10, train_err_50, train_err_100, 
                   train_err_200, train_err_500]

Now, onto **Step 3**. Write code to compute the classification error on the **validation_data** for models **model_10**, **model_50**, **model_100**, **model_200**, and **model_500**.

Now, let us run **Step 4**. Save the validation errors into a list called **validation_errors**

In [None]:
validation_errors = [validation_err_10, validation_err_50, validation_err_100, 
                     validation_err_200, validation_err_500]
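
Steps 1 to 4 can be sketched with a small helper. This is a minimal illustration on synthetic data with smaller tree counts than the assignment uses; the helper name `classification_error` and the `models` dictionary are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def classification_error(model, X, y):
    """Classification error = 1 - accuracy, using the model's score method."""
    return 1.0 - model.score(X, y)

# Synthetic stand-in data and small tree counts to keep the sketch fast.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

tree_counts = [5, 10, 20]
models = {n: GradientBoostingClassifier(n_estimators=n, max_depth=3,
                                        random_state=1).fit(X_train, y_train)
          for n in tree_counts}

training_errors = [classification_error(models[n], X_train, y_train)
                   for n in tree_counts]
validation_errors = [classification_error(models[n], X_val, y_val)
                     for n in tree_counts]
print(training_errors, validation_errors)
```

In the notebook, the same helper would give, for example, `train_err_10 = classification_error(model_10, train_data[features], train_data[target])`.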

Now, we will plot the **training_errors** and **validation_errors** versus the number of trees. We will compare the 10, 50, 100, 200, and 500 tree models. We provide some plotting code to visualize the plots within this notebook. 

Run the following code to visualize the plots.

In [None]:
plt.plot([10, 50, 100, 200, 500], training_errors, linewidth=4.0, label='Training error')
plt.plot([10, 50, 100, 200, 500], validation_errors, linewidth=4.0, label='Validation error')

make_figure(dim=(10,5), title='Error vs number of trees',
            xlabel='Number of trees',
            ylabel='Classification error',
            legend='best')

**Quiz Question**: Does the training error reduce as the number of trees increases?

**Quiz Question**: Is it always true that the validation error will reduce as the number of trees increases?