# Self-study try-it activity 10.1: Selecting tree depth in Python


    
## Overview

Decision trees are non-parametric supervised learning methods used for both classification and regression tasks. They structure decisions as a tree, comprising a root node, internal decision nodes, branches and leaf nodes, where each leaf represents a final prediction or classification. The tree recursively splits data into subsets based on feature values, forming simple if-then-else rules. 

While deeper trees can capture complex patterns, they also increase the risk of overfitting. Decision trees are intuitive and easy to visualise, require minimal data preparation and learn to approximate the target variable from data features.
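
To make the idea of if-then-else rules concrete, the sketch below writes a very small tree by hand. The function `predict_default` and its thresholds are hypothetical and chosen only for illustration; they are not learned from `loandata.csv`.

```python
# Hypothetical hand-written tree: the threshold values are made up for illustration.
def predict_default(loan_amount, selective_college):
    """Return 1 (default) or 0 (no default) by following if-then-else rules."""
    if loan_amount > 30000:          # root node: split on the loan amount
        if selective_college == 1:   # internal decision node
            return 0                 # leaf node
        return 1                     # leaf node
    return 0                         # leaf node

print(predict_default(40000, 0))
print(predict_default(40000, 1))
print(predict_default(10000, 0))
```

A fitted tree has the same shape, except that the features and thresholds at each node are chosen from the data.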

### About this assignment

This assignment is designed to help you apply machine learning algorithms in Python. You’ll work within a Jupyter Notebook that includes embedded instructions, relevant Python concepts and starter code to guide your progress. Be sure to run all code cells before submitting your work. After completing the assignment, you are advised to compare your results with the provided solution file for self-assessment.

### About this notebook

This notebook is structured into seven parts as follows:

- [Part 1](#part1): Import the data set and exploratory data analysis 

- [Part 2](#part2): Translate the categorical predictors into numerical predictors

- [Part 3](#part3): Shuffle the data set

- [Part 4](#part4): Calculate the accuracy of the naive benchmark on the validation set

- [Part 5](#part5): Train a decision tree using the default settings

- [Part 6](#part6): Train a decision tree using different maximum depths for the tree

- [Part 7](#part7): Retrain the best classifier using all of the samples

## Classification and regression trees

The basic idea behind the classification and regression tree algorithm can be summarised as follows:

- Load the data set.

- Select the best attribute using attribute selection measures to split the records.

- Make that attribute a decision node, and break the data set into smaller subsets.

- Start building the tree by repeating this process recursively for each child until one of the following conditions is met:
    - All the tuples belong to the same class.
    - There are no more remaining attributes.
    - There are no more instances.
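
The attribute selection step can be sketched with one common selection measure, Gini impurity, which is also the default criterion of `DecisionTreeClassifier` in `scikit-learn`. The helper names `gini` and `best_binary_split` and the toy arrays are assumptions made for this illustration; this is a minimal sketch of the idea, not the implementation the library uses.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1D array of class labels (0 means the node is pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best split X[:, j] <= t."""
    best = (None, None, gini(y))  # start from the impurity of the unsplit node
    n = len(y)
    for j in range(X.shape[1]):
        # Splitting at the largest value would leave the right-hand side empty, so skip it
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data: the first feature separates the classes perfectly, the second does not
X_toy = np.array([[1, 0], [2, 1], [3, 0], [4, 1]])
y_toy = np.array([0, 0, 1, 1])

print(gini(y_toy))
print(best_binary_split(X_toy, y_toy))
```

Applying `best_binary_split` again to each resulting subset, until a subset is pure or cannot be split further, gives the recursive procedure described above.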

### Predict defaults for student loan applications

For this exercise, you will use the data set `loandata.csv` to predict defaults for student loan applications using decision trees.

You will perform the following steps:

1. Load the data set `loandata.csv` into Python.

2. Translate the categorical predictors into numerical predictors.

3. Split the data set into 50 per cent training data, 25 per cent validation data and 25 per cent test data.

4. Calculate the accuracy of the naive benchmark on the validation set.

5. Train a decision tree using the default settings.

6. Retry the previous step using different maximum depths for the tree.

7. Choose the most appropriate tree depth and justify your choice. Retrain the best classifier using all the samples from both the training and the validation set. Retrain the best classifier on all samples (including the test set), and describe the tree that you obtain.

[Back to top](#Index:)

<a id='part1'></a>

### Part 1: Import the data set and exploratory data analysis

Begin by importing the necessary libraries. You will then use `pandas` to import the data set.

In [None]:
import numpy as np
import pandas as pd
from sklearn import tree, ensemble
from sklearn import preprocessing

Assign the data frame to the variable `df`.

In [None]:
df=pd.read_csv('data/loandata.csv')



Before building any machine learning algorithms, you need to explore the data.

Begin by visualising the first ten rows of the data frame `df` using the function `.head()`. By default, `.head()` displays the first five rows of a data frame.

Complete the code cell below by passing the desired number of rows as an `int` to the function `.head()`.

In [None]:
df.head(10)


For your convenience, here is a brief description of what some of the columns represent:
    
- `field`: the field in which each student is studying

- `graduationYear`: the year in which each student graduated

- `loanAmount`: the amount each student owes

- `selectiveCollege`: binary-valued column: 1 for students who attend a selective college, 0 for students who do not

- `sex`: sex of the student


[Back to top](#Index:)

<a id='part2'></a>

### Part 2: Translate the categorical predictors into numerical predictors


Many well-established machine learning systems handle categorical variables natively. However, decision trees in `scikit-learn` expect numerical inputs, so you need to encode (translate) categorical features into numerical features.

Arguably, the easiest way to achieve this is by using the `pandas` function `get_dummies()`, which converts categorical variables into dummy/indicator variables.


Complete the code cell below by using the data frame `df`.

In [None]:
df=pd.get_dummies(df)



Because you are only interested in whether a student defaults on their loan, you only need to keep the column `Default_Yes`.

Complete the code cells below by performing the following steps:

- Select the `bool` columns and convert them to the `int` data type.

- Use the function `drop()` on `df` to eliminate the *column* `Default_No`. The `axis` parameter in `.drop()` controls whether the function acts on rows or columns.

- Convert the column `Default_Yes` to the `int` data type.



Run the code cells below to apply these steps to the data frame.

In [None]:
bool_cols = df.select_dtypes('bool').columns
df[bool_cols] = df[bool_cols].astype(int)


In [None]:
df=df.drop(['Default_No'], axis=1)
y = df['Default_Yes'].astype(int)
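
The encoding steps in this part can be tried on a toy data frame first. The sketch below assumes a categorical column named `Default` with the values `Yes` and `No`, which is inferred from the dummy column names used above; all values are made up.

```python
import pandas as pd

# Toy data frame: the values are made up, only the column names mirror the notebook
toy = pd.DataFrame({
    'loanAmount': [10000, 25000, 40000, 15000],
    'field': ['Science', 'Arts', 'Science', 'Arts'],
    'Default': ['Yes', 'No', 'Yes', 'No'],
})

# One indicator column per category; numerical columns are left unchanged
encoded = pd.get_dummies(toy)

# Indicator columns may be boolean, depending on the pandas version; cast them to int
bool_cols = encoded.select_dtypes('bool').columns
encoded = encoded.astype({col: int for col in bool_cols})

# axis=1 drops a column; axis=0 (the default) would look for a row label instead
encoded = encoded.drop(['Default_No'], axis=1)

print(encoded.columns.tolist())
print(encoded['Default_Yes'].tolist())
```

In this toy example, `Default_Yes` ends up as the last column, which is the layout that the slicing in Part 3 relies on.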

[Back to top](#Index:)

<a id='part3'></a>

### Part 3: Shuffle the data set

Convert the DataFrame to a `NumPy` array, and ensure that the last column contains integer values, which is often done to prepare target labels for machine learning tasks.

In [None]:
Xy=np.array(df)
Xy[:,-1] = Xy[:,-1].astype(int)
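
A `NumPy` array holds a single data type, so integers assigned into one column of a floating-point array are stored as floats again. The minimal sketch below shows this on a made-up array `arr`; whether it matters for `Xy` depends on the data type that `np.array(df)` produces, which is not shown here.

```python
import numpy as np

# Made-up array with a floating-point feature column and a 0/1 label column
arr = np.array([[1.5, 1.0], [2.5, 0.0]])

# Assigning into a slice keeps the data type of the whole array
arr[:, -1] = arr[:, -1].astype(int)
print(arr.dtype)

# Extracting the column into its own array gives integer labels
labels = arr[:, -1].astype(int)
print(labels.dtype)
```

If integer labels are needed, converting the label column after it has been separated from the predictors is one way to obtain them.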

For reproducibility, set the random seed to 2. You can do this by using the `NumPy` function `random.seed()`.

Assign the seed value to the variable `seed` and pass it to `random.seed()`.

Next, complete the code cell below by using the function `random.shuffle()` on $Xy$.

In [None]:
seed = 2
np.random.seed(seed)  # seed() returns None, so store the value separately
np.random.shuffle(Xy)  # shuffles the rows of Xy in place
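
The sketch below illustrates two properties of this step on a made-up array `toy`: the same seed gives the same row order, and `random.shuffle()` reorders whole rows, so each label stays attached to its predictors.

```python
import numpy as np

# Made-up array: the last column plays the role of the label
toy = np.array([[10, 0], [20, 1], [30, 0], [40, 1]])

np.random.seed(2)
first = toy.copy()
np.random.shuffle(first)   # shuffles along the first axis (rows), in place

np.random.seed(2)
second = toy.copy()
np.random.shuffle(second)

# Same seed, same order
print(np.array_equal(first, second))

# The rows themselves are unchanged; only their order differs
print(sorted(map(tuple, first)) == sorted(map(tuple, toy)))
```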

Before splitting the data into a training set, a validation set and a test set, you need to divide $Xy$ into two arrays: the first, $X$, is a 2D array containing all the predictors, and the second, $y$, is a 1D array containing the response.

Run the code cell below to generate $X$. Complete the remaining code to define $y$.

In [None]:
X=Xy[:,:-1]


In [None]:
y=Xy[:,-1]


Because you need to split the data into sets of specific sizes according to the instructions given above, it is useful to know how large $X$ and $y$ are.

Run the code cell below to retrieve this information.

In [None]:
print(len(X))
print(len(y))

Next, you need to split the data into 50 per cent training data, 25 per cent validation data and 25 per cent test data.

Run the code below to split $X$ into training, validation and test sets.

In [None]:
trainsize = 1000
trainplusvalsize = 500  # despite its name, this is the size of the validation set
X_train=X[:trainsize]
X_val=X[trainsize:trainsize + trainplusvalsize]
X_test=X[trainsize + trainplusvalsize:]


Following the same syntax, complete the cell below to split `y` into a training set, a validation set and a test set.

**Hint:** Remember that `y` is a 1D array.

In [None]:
y_train=y[:trainsize]
y_val=y[trainsize:trainsize + trainplusvalsize]
y_test=y[trainsize + trainplusvalsize:]
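
The split sizes can also be derived from the sample count instead of being hard-coded. The sketch below assumes a hypothetical helper `split_sizes` and a hypothetical sample count of 2000; use the length printed above for the real data.

```python
import numpy as np

def split_sizes(n_samples, train_frac=0.5, val_frac=0.25):
    """Return the (train, validation, test) sizes; the test set takes the remainder."""
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test

# 2000 is a hypothetical sample count used only for illustration
n_train, n_val, n_test = split_sizes(2000)
print(n_train, n_val, n_test)

# Slice a stand-in array in the same way as X and y are sliced above
data = np.arange(2000)
train = data[:n_train]
val = data[n_train:n_train + n_val]
test = data[n_train + n_val:]
print(len(train), len(val), len(test))
```

Assigning the remainder to the test set ensures that no sample is lost when the fractions do not divide the sample count evenly.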

[Back to top](#Index:)

<a id='part4'></a>

### Part 4: Calculate the accuracy of the naive benchmark on the validation set

In this part, you want to calculate the accuracy of the naive benchmark on both the $y$ training and validation sets. In other words, you want to understand how accurate your predictions would be, assuming that no one defaulted on their student loans.

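Accuracy is computed by comparing the actual values with the predicted values. Because the naive benchmark predicts 0 (no default) for everyone, it is correct exactly for the students who did not default. In this example, the formulae to compute accuracy are: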

$$\text{acc_train} = 1 - \frac{\sum{\text{y_train}}}{\text{len(y_train)}},$$

$$ \text{acc_val} = 1 - \frac{\sum{\text{y_val}}}{\text{len(y_val)}}.$$

Note that $\frac{\sum{\text{y_train}}}{\text{len(y_train)}}$ reflects the proportion of students who defaulted on their loan in the training set, and $\frac{\sum{\text{y_val}}}{\text{len(y_val)}}$ reflects the proportion of students who defaulted on their loan in the validation set.

Compute the required accuracy in the code cell below.

In [None]:
acc_train = 1-sum(y_train)/len(y_train)
acc_val = 1-sum(y_val)/len(y_val)


Run the code cell below to print the results to screen. What can you say about the baseline accuracy if you predict that no students defaulted (i.e. everyone belongs to the majority class)?

In [None]:
print ( 'Naïve guess train and validation', acc_train , acc_val)
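
The formula can be checked on a small made-up label array. The sketch below uses a hypothetical array `y_toy` and confirms that the formula gives the same value as scoring an all-zero prediction directly.

```python
import numpy as np

# Made-up labels: 1 = default, 0 = no default
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# Accuracy of the naive benchmark from the formula above
naive_acc = 1 - y_toy.sum() / len(y_toy)

# The same quantity computed by comparing an all-zero prediction with the labels
all_zero_pred = np.zeros_like(y_toy)
direct_acc = np.mean(all_zero_pred == y_toy)

print(naive_acc, direct_acc)
```

Any classifier trained later should be judged against this baseline rather than against zero.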

[Back to top](#Index:)

<a id='part5'></a>

### Part 5: Train a decision tree using the default settings

The easiest way to create a decision tree model is by using the function `DecisionTreeClassifier()`. This function is part of the `tree` module of `Scikit-learn` (`sklearn`).

Later, you will explore ways to improve the accuracy of the tree. For now, build a classifier using the default settings.

In the code cell below, use `DecisionTreeClassifier()` to define a classifier `clf`.

Next, use the method `fit()` of your classifier to fit your training sets, `X_train` and `y_train`.

In [None]:
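from sklearn import preprocessing

# Encode the labels as integer class indices so that the tree treats them as classes
lab = preprocessing.LabelEncoder()
y_train_categorical = lab.fit_transform(y_train)
# Create a tree with the default settings (no depth limit) and fit it to the training set
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train_categorical)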


Run the code cell below to visualise the new scores on the training and validation sets.

In [None]:
print ( 'Full tree guess train/validation ',clf.score(X_train, y_train),clf.score(X_val, y_val))
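
A tree with the default settings keeps splitting until its leaves are pure, so it can usually fit the training set perfectly even when the labels are noisy. The sketch below illustrates this on synthetic data; the arrays `X_demo` and `y_demo` and the noise level are assumptions made for the illustration and are unrelated to the loan data.

```python
import numpy as np
from sklearn import tree

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_tr, y_tr = X_demo[:200], y_demo[:200]
X_va, y_va = X_demo[200:], y_demo[200:]

# Default settings: no limit on the depth of the tree
full = tree.DecisionTreeClassifier(random_state=0)
full.fit(X_tr, y_tr)

print('Depth of the full tree:', full.get_depth())
print('Training score:', full.score(X_tr, y_tr))
print('Validation score:', full.score(X_va, y_va))
```

A gap between the training score and the validation score is the usual sign that the tree has memorised noise in the training set.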


[Back to top](#Index:)

<a id='part6'></a>

### Part 6: Train a decision tree using  different maximum depths for the tree

One way to optimise the decision tree algorithm is by adjusting the maximum depth of the tree. This process is an example of pre-pruning.

In the following example, you will compute the scores for decision trees on the same data with maximum depths from 1 to `max_depth = 15`.

You will begin by defining the variables `bestdepth` and `bestscore`, assuming the *worst-case scenario*. Run the code cell below to initialise the variables as desired.

In [None]:
bestdepth=-1
bestscore=0
max_depth = 15

Next, write a loop to progressively compute the new train/validation scores for different depths.

Here is the pseudocode for the for loop you will need to implement:

```python

for i in range(max_depth):
    # Define a new classifier clf with max_depth = i+1
    # Fit the new classifier to the X and y training sets
    # Compute the updated trainscore using .score() on the training set
    # Compute the updated valscore using .score() on the validation set
    # Print the scores
    print ( 'Depth:', i+1, 'Training Score:', trainscore, 'Validation Score:', valscore)
     
    # If valscore is better than bestscore:
        # Update the value of bestscore
        # Set bestdepth to the current depth, i+1
    
```

In [None]:
for i in range(max_depth):
    clf = tree.DecisionTreeClassifier(max_depth=i+1)
    clf.fit(X_train,y_train)
    trainscore=clf.score(X_train,y_train)
    valscore=clf.score(X_val,y_val)
    print( 'Depth:', i+1, 'Train Score:', trainscore, 'Validation Score:', valscore)

    if valscore>bestscore:
        bestscore=valscore
        bestdepth=i+1


Depths 2 to 4 yield the same, highest validation accuracy (0.872), with training accuracy steady at around 0.891. These depths offer the best balance between performance and generalisability. Because the loop updates `bestdepth` only when the validation score strictly improves, ties are resolved in favour of the shallowest depth.
From depth 5 onward, training accuracy continues to increase while validation accuracy declines, indicating the onset of overfitting.
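
How the selection loop behaves when several depths share the best validation score can be checked in isolation. In the sketch below, the list `val_scores` is hypothetical: only the value 0.872 is taken from the results above, and the other entries are made up.

```python
# Hypothetical validation scores for depths 1 to 5
val_scores = [0.850, 0.872, 0.872, 0.872, 0.860]

bestdepth = -1
bestscore = 0
for i, valscore in enumerate(val_scores):
    # Strict comparison: a later depth with an equal score does not replace the current best
    if valscore > bestscore:
        bestscore = valscore
        bestdepth = i + 1

print(bestdepth, bestscore)
```

With a strict comparison the shallowest of the tied depths is kept, which favours the simplest tree among equally accurate candidates.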


[Back to top](#Index:)

<a id='part7'></a>

### Part 7: Retrain the best classifier using all of the samples

For the last part of this assignment, retrain the best classifier using all the samples from the training and the validation sets *together*.

Begin by defining `X_trainval` and `y_trainval`. Below, the `X_trainval` array has been defined for you.

In [None]:
X_trainval=X[:trainsize + trainplusvalsize,:]

Following the syntax given above, define `y_trainval`.

Again, remember that `y` is a 1D array.

In [None]:
y_trainval=y[:trainsize + trainplusvalsize]


To retrain the best classifier on the combined sets, redefine `clf` using `DecisionTreeClassifier()` with `max_depth` equal to the `bestdepth` computed in Part 6.

Next, fit the classifier to the sets just defined above.

Complete the code cell below:

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=bestdepth)
clf.fit(X_trainval,y_trainval)


Use the function `score()` to compute the score on the test set. Assign the result to `test_score`.



In [None]:
test_score = clf.score(X_test,y_test)


In [None]:
print('testing set score', test_score)

 This is the score of your best classifier on the test set.