# Computational Astrophysics 2021
---
## Eduard Larrañaga

Observatorio Astronómico Nacional\
Facultad de Ciencias\
Universidad Nacional de Colombia

---

## 02. The Problem of Over-Fitting the Decision Trees
### About this notebook

In this worksheet we will illustrate the problem of over-fitting decision trees.

---

Over-fitting means that an algorithm tries to accommodate every data point in the training set, including the outliers, so that its ability to predict the general trend is diminished.


In this worksheet, we will illustrate how decision trees tend to overfit the data if their growth is left unchecked. We will use the same dataset of galaxies.


### Loading the Data

As before, we will use the dataset provided as a NumPy structured array in a binary format (.npy) called 'sdss_galaxy_colors.npy'.


In [None]:
import numpy as np

In [None]:
path='' #Define an empty string to use in case of local working

In [None]:
# Working in Google Colab requires mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# we define the path to the files
path = '/content/drive/MyDrive/Colab Notebooks/CA2021/11. Decision Trees/presentation/'

In [None]:
data = np.load(path+'sdss_galaxy_colors.npy')
data

array([(19.84132, 19.52656, 19.46946, 19.17955, 19.10763, b'QSO', 0.539301  , 6.543622e-05),
       (19.86318, 18.66298, 17.84272, 17.38978, 17.14313, b'GALAXY', 0.1645703 , 1.186625e-05),
       (19.97362, 18.31421, 17.47922, 17.0744 , 16.76174, b'GALAXY', 0.04190006, 2.183788e-05),
       ...,
       (19.82667, 18.10038, 17.16133, 16.5796 , 16.19755, b'GALAXY', 0.0784592 , 2.159406e-05),
       (19.98672, 19.75385, 19.5713 , 19.27739, 19.25895, b'QSO', 1.567295  , 4.505933e-04),
       (18.00024, 17.80957, 17.77302, 17.72663, 17.7264 , b'QSO', 0.4749449 , 6.203324e-05)],
      dtype=[('u', '<f8'), ('g', '<f8'), ('r', '<f8'), ('i', '<f8'), ('z', '<f8'), ('spec_class', 'S6'), ('redshift', '<f8'), ('redshift_err', '<f8')])

In this kind of data structure, the `dtype` attribute lists the name and data type of each field (feature). For our example, we identify the following:

| dtype | Feature|
|:-:|:-:|
|`u` |u band filter|
|`g` |g band filter|
|`r` |r band filter|
|`i` |i band filter|
|`z` |z band filter|
|`spec_class` |spectral class|
|`redshift` |redshift|
|`redshift_err` |redshift error|
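
As a minimal sketch of how such fields are accessed by name, the following uses a two-row stand-in array (values copied from the output above, keeping only four of the eight fields) instead of the full file. The name `demo` is introduced here only for illustration.

```python
import numpy as np

# Two-row stand-in for the real file, with a subset of its fields
demo = np.array(
    [(19.84132, 19.52656, b'QSO', 0.539301),
     (19.86318, 18.66298, b'GALAXY', 0.1645703)],
    dtype=[('u', '<f8'), ('g', '<f8'), ('spec_class', 'S6'), ('redshift', '<f8')],
)

# The field names are stored in the dtype
print(demo.dtype.names)

# Indexing with a field name returns that column as a regular array
u_minus_g = demo['u'] - demo['g']
print(u_minus_g)
```

The same indexing, e.g. `data['redshift']`, should work on the full array loaded above.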


The number of samples (galaxies) in this dataset is

In [None]:
n = data.size
n

50000

---
### Training the Decision Tree

As we have seen, decision trees have many advantages such as 
- They are simple to implement
- They are easy to interpret
- The data does not require too much preparation
- Decision trees are (usually) computationally efficient.

However, decision trees also have some limitations. One of the biggest is that they tend to over-fit the data if their growth is not constrained. Over-fitting means that they build an overly complicated tree that attempts to account for every outlier in the data. This problem appears because the algorithm tries to optimise the decision locally at each node.


To implement the decision tree, we will use the functions defined in the previous worksheet to build the features (4 color indices) and the targets (redshift).

In [None]:
# Function returning the 4 color indices and the redshifts

features, targets = ...
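
The function from the previous worksheet is not reproduced here. As a rough sketch only, one possible form is shown below, assuming that the 4 color indices are the differences of adjacent bands (u-g, g-r, r-i, i-z). The name `get_features_targets` and the two-row `demo` array are hypothetical and serve only as an illustration.

```python
import numpy as np

def get_features_targets(data):
    """Hypothetical helper: 4 color indices as features, redshift as target.

    Assumes the color indices are differences of adjacent bands.
    """
    features = np.column_stack([
        data['u'] - data['g'],
        data['g'] - data['r'],
        data['r'] - data['i'],
        data['i'] - data['z'],
    ])
    targets = data['redshift']
    return features, targets

# Two rows copied from the array printed above (magnitudes and redshift only)
demo = np.array(
    [(19.84132, 19.52656, 19.46946, 19.17955, 19.10763, 0.539301),
     (19.86318, 18.66298, 17.84272, 17.38978, 17.14313, 0.1645703)],
    dtype=[('u', '<f8'), ('g', '<f8'), ('r', '<f8'),
           ('i', '<f8'), ('z', '<f8'), ('redshift', '<f8')],
)

demo_features, demo_targets = get_features_targets(demo)
print(demo_features.shape)  # one row per object, one column per color index
```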


We will also split the data into training and testing subsets. You can choose the size of the split (if in doubt, a 50:50 split is fine).

In [None]:
split = n//2  # index separating the training and testing subsets
train_features = features[:split]
test_features = features[split:]
train_targets = ...
test_targets = ...

As before, we will use the class `sklearn.tree.DecisionTreeRegressor`.

However, in order to reduce over-fitting, we can limit the number of rows (levels) of decision nodes, called the **tree depth**. The maximum depth of the learned tree is controlled with the `max_depth` argument of `DecisionTreeRegressor`.

To set the maximum depth to 5 we use

In [None]:
from sklearn.tree import DecisionTreeRegressor

dec_tree = DecisionTreeRegressor(max_depth=5)
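
To see what this argument does, the following sketch compares a depth-limited tree with an unconstrained one. It uses synthetic data (a noisy sine curve, not the galaxy sample), and the names `X`, `y`, `shallow` and `free` are introduced only for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a smooth trend plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(6 * X[:, 0]) + rng.normal(0, 0.1, size=200)

# One tree limited to 5 levels, one allowed to grow freely
shallow = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
free = DecisionTreeRegressor(random_state=0).fit(X, y)

# Compare the depth and the number of leaves of both trees
print(shallow.get_depth(), shallow.get_n_leaves())
print(free.get_depth(), free.get_n_leaves())
```

A tree of depth 5 can have at most 32 leaves, whereas the unconstrained tree keeps splitting until it has isolated the individual noisy points.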


More information about `DecisionTreeRegressor` is available at

https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

The decision tree is trained using the method `.fit()` and the training subsets of the 'features' and 'targets' arrays.

In [None]:
dec_tree.fit(train_features, train_targets)

#### Testing the Decision Tree

Once the decision tree is ready, we will apply the method `.predict()` to the test subset.

In [None]:
predictions = dec_tree.predict(test_features)

In order to evaluate the decision tree, we will again use the median of the absolute differences between the predictions and the target values, i.e.

\begin{equation}
\text{eval_dec_tree} = \text{median}\left\lbrace \left| \text{predictions}_i - \text{targets}_i \right|\right\rbrace
\end{equation}

Use the same function defined in the previous worksheet. 
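
For reference, the measure in the equation above can be sketched as follows. The function name `median_diff` and the small arrays are hypothetical, since the function from the previous worksheet is not shown here.

```python
import numpy as np

def median_diff(predicted, actual):
    """Median of the absolute differences between predictions and targets."""
    return np.median(np.abs(predicted - actual))

# Made-up values for illustration: three good predictions and one large miss
demo_predictions = np.array([0.10, 0.52, 1.40, 0.08])
demo_targets = np.array([0.12, 0.50, 1.00, 0.08])

print(median_diff(demo_predictions, demo_targets))
```

Because the median is used, the single large miss (0.40) has little influence on the result, which stays close to the typical difference of about 0.02.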

In [None]:
eval_dec_tree = ...

#### Over-Fitting and Tree Depth

In order to see how the tree is overfitting the data, we will examine how the decision tree performs for different tree depths. 

One might expect that the deeper the tree, the better it performs. However, we will show that once the model overfits, there is an important difference between the accuracy of the predictions for the training data and for the testing data.

**1. Define a function that creates decision trees with maximum depths in the range 1 to 40. For each depth, the function must use the decision tree to predict the redshift of the training and test subsets and calculate the corresponding median of the differences to evaluate the algorithm.**

**2. Plot the median of the differences vs tree depth.**

The plot should look like this

<center>
<img src="https://groklearning-cdn.com/problems/8Cet6iLGMbP2L8t7SVkEEg/overfitting.png" width=450>
</center>

The above graphic shows how the accuracy of the decision tree on the **training set** gets better as we allow the tree to grow to greater depths. **In fact, at a depth of around 27, the error goes to zero!**

On the other hand, the accuracy of the predictions for the **test set** gets better initially but then gets worse at larger tree depths. Hence, this plot shows how at a tree depth of around 19, the decision tree starts to **overfit** the data! This happens because the algorithm tries to fit the outliers in the training set, which reduces its general predictive accuracy.

In order to prevent the over-fitting problem, we note that the best accuracy of the predictions for the test set is reached at a tree depth of around 19 or 20; therefore, we will set the tree depth to this value.
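
The procedure described above (scan the depth, compare training and test errors, keep the depth with the smallest test error) can be sketched as follows. This is only an illustration on synthetic data, not on the galaxy sample, so the resulting depth will differ from the value quoted above. The helper `median_diff` and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the SDSS sample): a smooth trend plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 1))
y = np.sin(6 * X[:, 0]) + rng.normal(0, 0.3, size=400)

# 50:50 split into training and testing subsets
split = X.shape[0] // 2
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def median_diff(predicted, actual):
    """Median of the absolute differences between predictions and targets."""
    return np.median(np.abs(predicted - actual))

# Train one tree per maximum depth and record both errors
depths = np.arange(1, 21)
train_err, test_err = [], []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=int(depth), random_state=0)
    tree.fit(X_train, y_train)
    train_err.append(median_diff(tree.predict(X_train), y_train))
    test_err.append(median_diff(tree.predict(X_test), y_test))

# The depth with the smallest test error is the one to keep
best_depth = depths[np.argmin(test_err)]
print("training error at the largest depth:", train_err[-1])
print("test error at the largest depth:", test_err[-1])
print("depth with the smallest test error:", best_depth)
```

Plotting `train_err` and `test_err` against `depths` (for example with matplotlib) should give the same qualitative picture as the figure above: the training error keeps decreasing with depth while the test error does not.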