# Week 9, Classification and Regression Trees

**_Author: Jessica Cervi_**

**Expected time = 2 hours**

    
## Assignment overview

Decision trees are a non-parametric supervised learning method used for classification and regression. They learn from data to approximate a target function with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data, which also raises the risk of overfitting.
Decision trees build classification or regression models in the form of trees. They break down a data set into smaller and smaller subsets while incrementally developing the associated decision tree. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. Leaf nodes represent a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. 
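
To make the if-then-else picture concrete, here is a minimal sketch of a tiny, hand-written tree. The function name `predict_default`, the feature names, the thresholds, and the labels are all hypothetical and chosen only for illustration; a fitted tree would learn its own rules from the data.

```python
def predict_default(loan_amount, selective_college):
    """Hand-written toy tree: every threshold and label here is made up."""
    # Root node: split on the (hypothetical) loan amount
    if loan_amount <= 20000:
        return "no default"   # leaf node
    # Decision node: second split on a binary feature
    if selective_college == 1:
        return "no default"   # leaf node
    return "default"          # leaf node

print(predict_default(15000, 0))
print(predict_default(30000, 0))
```

Each `if` plays the role of a decision node and each `return` plays the role of a leaf node.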



This assignment is designed to help you apply the machine learning algorithms you have learnt using packages in Python. Python concepts, instructions, and starter code are embedded within this Jupyter Notebook to help guide you as you progress through the assignment. Remember to run the code of each code cell prior to submitting the assignment. Upon completing the assignment, we encourage you to compare your work against the solution file to perform a self-assessment.


### Learning objectives


- Discuss the difference between conducting splits on categorical and numerical input variables
- Demonstrate how to measure purity in categorical models using entropy and the Gini index
- Compare strategies for pruning a classification tree
- Explain the differences between classification and regression trees
- Use bagging to move from a single tree to an ensemble of trees
- Outline two algorithms for de-correlating decision trees
- Share real-life applications of decision trees

## Index:

#### Week 9:   Classification and regression trees


- [Part 1](#part1)- Importing the data set and exploratory data analysis (EDA)
- [Part 2](#part2)- Translate the categorical predictors into numerical predictors
- [Part 3](#part3)- Shuffle the data set
- [Part 4](#part4)- Calculate the accuracy of the Naïve benchmark on the validation set.
- [Part 5](#part5)- Train a decision tree using the default settings
- [Part 6](#part6)- Train a decision tree using different maximum depths for the tree
- [Part 7](#part7)- Retrain the best classifier using all the samples



## Classification and regression trees

In Week 9, you learnt about classification and regression trees.
The basic idea behind the tree-building algorithm for classification can be summarised as follows:

- Load the data set
- Select the best attribute using Attribute Selection Measures (ASM) to split the records.
- Make that attribute a decision node and break the data set into smaller subsets.
- Start building the tree by repeating this process recursively for each child until one of the following conditions is met:
    - All the tuples belong to the same class (that is, they share the same value of the target attribute).
    - There are no more remaining attributes.
    - There are no more instances.
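
The attribute selection step can be sketched in a few lines. The example below assumes the Gini index or entropy as the selection measure and a single numeric attribute; the helper names (`gini`, `entropy`, `best_threshold`) and the toy data are hypothetical, and this is not how any particular library implements the search.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # splitting at the maximum leaves one side empty
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: one numeric attribute and a binary class label
amounts = [5, 10, 15, 20, 25, 30]
defaulted = [0, 0, 0, 1, 1, 1]
print(gini(defaulted), entropy(defaulted))
print(best_threshold(amounts, defaulted))
```

A weighted impurity of zero means both children are pure, which corresponds to the first stopping condition in the list above.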


## Predict defaults for student loan applications

For this exercise, we will use the data set `loandata.xlsx` to predict defaults for student loan applications using a classification tree. We will perform the following steps:

1. Load the data set `loandata.xlsx` into Python.
2. Translate the categorical predictors into numerical predictors. 
3. Shuffle the data set and split it into 50% training data, 25% validation data and 25% test data.
4. Calculate the accuracy of the Naïve benchmark  on the validation set.
5. Train a decision tree using the default settings.
6. Retry the previous step using different maximum depths for the tree. 
7. Choose the most appropriate tree depth and justify your choice. Re-train the best classifier using all the samples from both the training and the validation set. Retrain the best classifier on all samples (including the test set) and describe the tree that you obtain.

[Back to top](#Index:) 

<a id='part1'></a>

### Part 1 -  Importing the data set and exploratory data analysis (EDA)

We begin by importing the necessary libraries. We will then use `pandas` to import the data set. 

In [None]:
import numpy as np
import pandas as pd
from sklearn import tree, ensemble

Notice that this week's data set is in `.xlsx` format.

Complete the code cell below by adding the name of the data set as a `str` to `.read_excel()`. Assign the dataframe to the variable `df`.


Before building any machine learning algorithms, we should explore the data.

We begin by visualising the first ten rows of the dataframe `df` using the function `.head()`. By default, `.head()` displays the first five rows of a dataframe.

Complete the code cell below by passing the desired number of rows as an `int` to the function `.head()`.

In [None]:
df.head( )

For your convenience, here is a brief description of what some of the columns represent:
    
- field: the field in which each student is studying
- graduationYear: the year in which each student graduated
- loanAmount: the amount each student owes
- selective College: binary-valued column: 1 for students who attend a selective college, 0 for students who do not
- sex: sex of the student


[Back to top](#Index:) 

<a id='part2'></a>

### Part 2 -  Translate the categorical predictors into numerical predictors

How do we handle categorical features?

In most of the well-established machine learning systems, categorical variables are handled naturally. However, when dealing with decision trees using `scikit-learn`, we need to encode (translate) categorical features into numerical features.

Arguably, the easiest way to achieve this is by using the `pandas` function `get_dummies()` that converts categorical variables into dummy/indicator variables.
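
As a rough illustration of what `get_dummies()` does, the sketch below encodes a small, made-up dataframe. The rows are hypothetical and the column names only loosely mirror the assignment's data.

```python
import pandas as pd

# Hypothetical toy rows; not taken from loandata.xlsx
toy = pd.DataFrame({
    "loanAmount": [10000, 25000, 18000],
    "sex": ["F", "M", "F"],
    "Default": ["No", "Yes", "No"],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Numeric columns pass through unchanged, while each text column is replaced by one indicator column per category, named `<column>_<category>`.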



Observe the dataframe `df` above. Which columns do you think are affected by the `get_dummies()` function?

**TYPE YOUR ANSWER BY DOUBLE CLICKING ON THIS CELL**

Complete the code cell below by using the function on the dataframe `df`.

In [None]:
#encode categorical variables 
df = 

The indicator columns `Default_Yes` and `Default_No` carry the same information, and we are interested in the students who default on their loan, so we only need to keep the column `Default_Yes`.

Complete the code cell below by using the function `.drop()` on `df` to eliminate the *column* `Default_No`. The `axis` parameter in `.drop()` controls whether the function acts on rows or columns.
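
The effect of `axis` is easiest to see on a small example. The dataframe below is hypothetical and unrelated to the loan data.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # hypothetical data

without_row = toy.drop(0, axis=0)    # axis=0: drop the row labelled 0
without_col = toy.drop("b", axis=1)  # axis=1: drop the column labelled "b"

print(without_row.shape, without_col.shape)
```

Note that `.drop()` returns a new dataframe by default, so the result has to be assigned to a variable to be kept.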



In [None]:
#drop Default_No
#df = 

Run the code cell below to visualise the new dataframe with the encoded columns.

In [None]:
df

[Back to top](#Index:) 

<a id='part3'></a> 

### Part 3 - Shuffle the data set

Now, we want to shuffle the data: one way of doing this is by converting our dataframe to a `NumPy` array and then using the `.shuffle()` function to achieve this. Run the code cell below to convert the `df` into a `NumPy` array.

In [None]:
Xy=np.array(df)

For reproducibility, set the random seed to 2: assign the value to the variable `seed` and pass it to the `NumPy` function `random.seed()`. Next, complete the code cell below by using the function `random.shuffle()` on `Xy`.
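
The sketch below shows why seeding matters: shuffling twice with the same seed gives the same row order. The helper `shuffled_copy` and the toy array are hypothetical and are used only for this demonstration.

```python
import numpy as np

def shuffled_copy(rows, seed):
    """Shuffle a copy of `rows` along its first axis after seeding NumPy's global generator."""
    out = rows.copy()
    np.random.seed(seed)    # returns None; it only sets the generator state
    np.random.shuffle(out)  # in place: reorders the rows and keeps each row intact
    return out

toy = np.arange(12).reshape(6, 2)  # hypothetical 6 x 2 array
first = shuffled_copy(toy, seed=2)
second = shuffled_copy(toy, seed=2)
print(np.array_equal(first, second))
```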

In [None]:
seed = 2
np.random.seed(seed)   # np.random.seed() returns None, so store the seed value separately
np.random.shuffle(Xy)  # shuffles the rows of Xy in place

Before splitting the data into a training set, a validation set, and a test set, we need to divide `Xy` into two arrays: the first one, `X`, is a 2D array containing all the predictors, and the second one, `y`, is a 1D array containing the response.
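
As a reminder of how `NumPy` slicing separates columns, here is a minimal sketch on a made-up array whose last column plays the role of the response. The names `data`, `predictors`, and `response` are hypothetical.

```python
import numpy as np

# Hypothetical array: two predictor columns followed by one response column
data = np.array([[1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0],
                 [3.0, 0.0, 0.0]])

predictors = data[:, :-1]  # all rows, every column except the last -> 2D
response = data[:, -1]     # all rows, last column only -> 1D

print(predictors.shape, response.shape)
```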

Run the code cell below to generate `X`. Complete the remaining code to define `y`.

In [None]:
X=Xy[:,:-1]

In [None]:
#define y
y= 

Because we need to split into sets with certain dimensions according to the instructions given above, it would be useful to know how big our `X` and `y` are.

Run the code cell below to retrieve this information.

In [None]:
print(len(X))
print(len(y))

Next, we need to split the data into 50% training data, 25% validation data, and 25% test data.

Run the code below to split `X` into training, validation and test sets.

In [None]:
trainsize = 1000         # number of training rows
trainplusvalsize = 500   # despite the name, this is the number of validation rows
X_train = X[:trainsize]
X_val = X[trainsize:trainsize + trainplusvalsize]
X_test = X[trainsize + trainplusvalsize:]
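
The sizes in the cell above are hard-coded, and they match the 50/25/25 proportions only if `X` has 2000 rows. As an alternative sketch, the sizes can be derived from the number of rows. The helper `split_sizes` and the stand-in list `rows` are hypothetical.

```python
def split_sizes(n_rows, train_frac=0.5, val_frac=0.25):
    """Return (train, validation, test) row counts; the test set takes the remainder."""
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    return n_train, n_val, n_rows - n_train - n_val

rows = list(range(2000))  # hypothetical stand-in for the shuffled rows
n_train, n_val, n_test = split_sizes(len(rows))
train = rows[:n_train]
val = rows[n_train:n_train + n_val]
test = rows[n_train + n_val:]
print(len(train), len(val), len(test))
```

Giving the remainder to the test set ensures that no row is lost when the number of rows is not divisible by four.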


Following the syntax used above, complete the cell below to split `y` into training set, a validation set, and a test set.

**HINT:** Remember that `y` is a 1D array!

In [None]:
y_train=
y_val=
y_test=


[Back to top](#Index:)

<a id='part4'></a>

### Part 4 - Calculate the accuracy of the Naïve benchmark on the validation set

In this part, we want to calculate the accuracy of the Naïve benchmark on both the `y` training and validation sets. In other words, we want to understand how accurate our predictions would be if we assumed that no one defaulted on their student loans.

Accuracy can be computed by comparing actual values and predicted values. In this example, the formulae to compute accuracy are:

$$ \text{acc_train} = 1 - \frac{\sum{\text{y_train}}}{\text{len(y_train)}},$$

$$ \text{acc_val} = 1 - \frac{\sum{\text{y_val}}}{\text{len(y_val)}}.$$

Note that $\frac{\sum{\text{y_train}}}{\text{len(y_train)}}$ reflects the proportion of students who defaulted on their loan in the training set, and $\frac{\sum{\text{y_val}}}{\text{len(y_val)}}$ reflects the proportion of students who defaulted on their loan in the validation set.
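
The formula can be checked on a small, made-up response vector. The values in `y_toy` are hypothetical; the cross-check scores the constant prediction "no default" directly against the labels.

```python
# Hypothetical 0/1 response: 1 marks a default, 0 marks no default
y_toy = [0, 0, 1, 0, 0, 0, 1, 0]

default_rate = sum(y_toy) / len(y_toy)  # proportion of ones
naive_accuracy = 1 - default_rate       # accuracy of always predicting 0

# Cross-check by scoring the constant prediction directly
predictions = [0] * len(y_toy)
matches = sum(p == t for p, t in zip(predictions, y_toy))
print(naive_accuracy, matches / len(y_toy))
```

Both numbers agree because every row with a response of 0 is a correct prediction and every row with a response of 1 is a mistake.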

Compute the required accuracy in the code cell below.

In [None]:
acc_train = 
acc_val = 

Run the code cell below to print the results to screen. What can you say about the baseline accuracy if we predict that no students defaulted (i.e., everyone belongs to the majority class)?

In [None]:
print ( 'Naïve guess train and validation', acc_train , acc_val)

[Back to top](#Index:) 

<a id='part5'></a>

### Part 5 - Train a decision tree using the default settings

The easiest way to create a decision tree model is by using the class `DecisionTreeClassifier`, which is part of the `tree` module of `scikit-learn` (`sklearn`). Because we imported this module as `tree` in Part 1, the class is available as `tree.DecisionTreeClassifier()`.

As we will see, there are ways to improve the accuracy of our tree. For now, let's build a classifier using the default settings.
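
To show the general fit-and-score pattern without touching the loan data, here is a minimal sketch on synthetic data generated with `make_classification`. The variable names (`X_demo`, `demo_clf`, and so on) and the 200/100 split are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan data set
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, y_tr = X_demo[:200], y_demo[:200]
X_va, y_va = X_demo[200:], y_demo[200:]

demo_clf = DecisionTreeClassifier(random_state=0)  # no depth limit by default
demo_clf.fit(X_tr, y_tr)

print('train:', demo_clf.score(X_tr, y_tr), 'validation:', demo_clf.score(X_va, y_va))
```

With default settings the tree keeps splitting until its leaves are pure, so it fits this training set perfectly; the validation score is the more informative number.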

In the code cell below, use `DecisionTreeClassifier()` to define a classifier `clf` . Next, use the method `fit()` of your classifier to fit your training sets, `X_train` and `y_train`.

In [None]:
clf =
#Fit X_train and y_train 


Run the code cell below to visualise the new scores on the training and validation sets.

In [None]:
print ( 'Full tree guess train/validation ',clf.score(X_train, y_train),clf.score(X_val, y_val))


Based on the results obtained so far, what do you think of this classifier?

**TYPE YOUR ANSWER BY DOUBLE CLICKING ON THIS CELL**

[Back to top](#Index:) 

<a id='part6'></a>

### Part 6 - Train a decision tree using  different maximum depths for the tree

One way we can optimise the decision tree algorithm is by adjusting the maximum depth of the tree. This process is an example of pre-pruning. 
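
The effect of the `max_depth` parameter can be seen by comparing the depth of an unrestricted tree with that of a restricted one. This is a small sketch on synthetic data; the names `full_tree` and `shallow_tree` and the limit of 3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan data set
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

print('depth without limit:', full_tree.get_depth())
print('depth with max_depth=3:', shallow_tree.get_depth())
```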

In the following example, we will compute the scores of decision trees fitted on the same data with maximum depths ranging from 1 up to `max_depth = 15`.

We begin by defining the variables `bestdepth` and `bestscore`, assuming the *worst case scenario*. Run the code cell below to initialise the variables as desired.

In [None]:
bestdepth=-1
bestscore=0
max_depth = 15

Next, we will write a for loop to progressively compute the new training/validation scores for different depths.

Here is the pseudocode for the for loop you will need to implement:

```python

for i in range(max_depth):
    # define a new classifier clf with max_depth = i+1
    # fit the X and y training sets with the new classifier
    # compute the updated trainscore using .score() on the training set
    # compute the updated valscore using .score() on the validation set
    # print the scores
    print('Depth:', i+1, 'Training Score:', trainscore, 'Validation Score:', valscore)

    # if valscore is better than bestscore:
        # update the value of bestscore
        # set bestdepth to the current depth, i+1
    
```

In [None]:
for i in range(max_depth):
    clf = 
    #fit the training sets
    clf.fit( )
    #update trainscore
    trainscore=
    #update valscore
    valscore=
    print( 'Depth:', i+1, 'Train Score:', trainscore, 'Validation Score:', valscore)
    if    :
        #update bestscore
        bestscore=
        #update depth
        bestdepth=
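
For comparison, here is a sketch of the same kind of depth search on synthetic data. It is not the solution for the loan data: the data, the variable names, and the range of depths (1 to 10) are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan data set
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, y_tr = X_demo[:250], y_demo[:250]
X_va, y_va = X_demo[250:], y_demo[250:]

best_depth, best_score = -1, 0.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_score = model.score(X_va, y_va)
    if val_score > best_score:  # strict '>' keeps the shallowest depth on ties
        best_depth, best_score = depth, val_score

print('best depth:', best_depth, 'validation score:', best_score)
```

The depth is selected on the validation score rather than the training score, because the training score tends to keep rising as the tree gets deeper.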

Choose the most appropriate tree depth and justify your choice.

**TYPE YOUR ANSWER BY DOUBLE CLICKING ON THIS CELL**

[Back to top](#Index:) 

<a id='part7'></a>

### Part 7 - Retrain the best classifier using all the samples

For the last part of this assignment, retrain the best classifier using all the samples from the training and the validation sets *together*. 

We begin by re-defining our `X_trainval` and `y_trainval`. Below, we have defined `X_trainval` for you.

In [None]:
X_trainval=X[:trainsize + trainplusvalsize,:]

Following the syntax given above, define `y_trainval`.

Again, remember that `y` is a 1D array! 

In [None]:
y_trainval = 

To retrain the best classifier, re-define `clf` using `DecisionTreeClassifier()` with `max_depth` equal to the `bestdepth` computed in Part 6. Next, fit the classifier to the sets just defined above.

Complete the code cell below:

In [None]:
clf = 
#fit X_trainval and y_trainval

Finally, evaluate the retrained classifier on the test set, which it has not seen during training.

Do so by using the method `score()` to compute the score on the test sets `X_test` and `y_test`. Assign the result to `test_score`.
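
The overall pattern (fit on training plus validation rows, then score once on held-out test rows) can be sketched on synthetic data as follows. The split sizes, the value of `chosen_depth`, and all variable names are assumptions for illustration and do not come from the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan data set
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
n_train, n_val = 200, 100  # hypothetical 50/25/25 split
X_tv, y_tv = X_demo[:n_train + n_val], y_demo[:n_train + n_val]
X_te, y_te = X_demo[n_train + n_val:], y_demo[n_train + n_val:]

chosen_depth = 3  # assumed result of a depth search
final_clf = DecisionTreeClassifier(max_depth=chosen_depth, random_state=0)
final_clf.fit(X_tv, y_tv)

demo_test_score = final_clf.score(X_te, y_te)
print('test score:', demo_test_score)
```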
 
 


In [None]:
test_score = 

In [None]:
print('testing set score', test_score)

What do you observe?

**TYPE YOUR ANSWER BY DOUBLE-CLICKING ON THIS CELL**

CONGRATULATIONS ON COMPLETING THE WEEK 9 ASSIGNMENT!