# Weak Learners and Decision Trees

In [20]:
import sys
sys.executable
import numpy as np
# import matplotlib as pyplot

This week we will learn about weak learners: specifically, decision stumps and decision trees. These provide the building blocks that we will need for ensemble learning (week 7).

## Weak Learners

A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples slightly better than random guessing). While the specific learning rule can take many forms, the basic concept stays constant. Essentially, the goal is to split the data using a simple rule, i.e. for a two-class problem:

$$ h(\mathbf{x},\theta): \mathbb{R}^d \times \Gamma \rightarrow \{0,1\} $$ 

Here, $\Gamma$ represents the full parameter space, and  $\theta= (\phi, \psi, \tau)$ reflects the parameters for a specific weak learner; of these $\phi=\phi (\mathbf{x})$ selects a subset of features, $\psi$ defines the method to split the data and $\tau$ determines the thresholds used to split the data. 

Examples of simple weak learners are shown in the figure below:

<img src="imgs/weakleaners.png" style="max-width:100%; width: 70%; max-width: none">

Here (a) shows what is known as an 'axis-aligned' weak learner. More specifically, this means that $\psi$ is a simple split of the data made by thresholding on the value of a single feature ($x_1$, selected by $\phi(\mathbf{x})$). The choice of threshold is optimised so as to maximally separate the different classes of the data. As an example, try different values of ```threshold``` below so as to maximally separate the classes ```{0,1}``` of the ```dataset``` below. Note, ```dataset``` represents a set of examples with a single feature (first column); the second column represents the class. The code below assigns all data examples strictly below the threshold to the first group and all remaining examples (at or above the threshold) to the second group; select an appropriate threshold and run the code.


In [21]:
dataset = np.asarray([[1,0],
           [2,0],
           [3,0],
           [4,0],
           [5,0],
           [6,1],
           [7,1],
           [8,1],
           [9,1],
           [10,1]])

threshold=6

group1labels=[]; group2labels=[]

for row in dataset:
        # if the value of this feature for this row is less than the (threshold) value, 
        # split into left branch, else split into right
        if row[0] < threshold:
            group1labels.append(row[1])
        else:
            group2labels.append(row[1])
            
print('With threshold {} the group1labels are {}'.format(threshold, group1labels))
print('With threshold {} the group2labels are {}'.format(threshold, group2labels))

With threshold 6 the group1labels are [0, 0, 0, 0, 0]
With threshold 6 the group2labels are [1, 1, 1, 1, 1]


(b), on the other hand, represents fitting a simple linear classifier (e.g. perceptron/logistic regression/linear SVM) to a subset of features (in this case 2). Thus here $\phi(\mathbf{x})$ has been used to select 2 features from the data set and $\psi$ has parameterised the slope and intercept of the line such that:

$$ h(\mathbf{x},\theta) = [\tau_1 > \phi(\mathbf{x}).\psi > \tau_2 ]$$ 

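Thus, $\phi(\mathbf{x}).\psi$ reflects the projection of the feature subset onto the line and $\tau_1$ and $\tau_2$ reflect the thresholds. The square brackets denote an indicator function: examples whose projection lies between $\tau_2$ and $\tau_1$ are assigned to one group ($h=1$) and all other examples are assigned to the second group ($h=0$). Setting $\tau_1=\infty$ or $\tau_2=-\infty$ recovers a single threshold.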
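As a minimal sketch of the rule above (not code from the lecture), the function below evaluates $[\tau_1 > \phi(\mathbf{x}).\psi > \tau_2]$ for one example. The names `linear_weak_learner` and `feature_idx`, and all numeric values, are illustrative assumptions.

```python
import numpy as np

def linear_weak_learner(x, feature_idx, psi, tau1, tau2):
    """Return 1 if tau1 > phi(x).psi > tau2, else 0 (hypothetical helper)."""
    # phi(x): select a subset of the features
    phi_x = np.asarray(x, dtype=float)[feature_idx]
    # project the selected features onto the direction psi
    projection = float(np.dot(phi_x, psi))
    # indicator function: 1 inside the interval (tau2, tau1), 0 outside
    return int(tau1 > projection > tau2)

# illustrative values only: use features 0 and 1, direction (1, -1),
# and a single effective threshold at 0 (tau1 is set to infinity)
x = [2.0, 1.0, 5.0]
print(linear_weak_learner(x, [0, 1], np.array([1.0, -1.0]), tau1=np.inf, tau2=0.0))
```

Setting one threshold to infinity, as in the usage line, reduces the rule to a single linear decision boundary.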

Finally, (c) takes this one step further and instead learns a non-linear classifier (e.g. a non-linear SVM with RBF kernel) such that 

$$ h(\mathbf{x},\theta) = [\tau_1 > \phi(\mathbf{x})^T.\psi\phi(\mathbf{x}) > \tau_2 ]$$ 

And thus $\psi$ reflects a conic section in $\mathbb{R}^2$.
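To make the quadratic form concrete, here is a small sketch under the assumption that $\psi$ is a $2 \times 2$ matrix. With $\psi$ set to the identity and $\tau_1 = 1$, the decision boundary is the unit circle (one example of a conic section). The function name and values are illustrative, not taken from the lecture.

```python
import numpy as np

def conic_weak_learner(x, feature_idx, psi, tau1, tau2):
    """Return 1 if tau1 > phi(x)^T psi phi(x) > tau2, else 0 (hypothetical helper)."""
    phi_x = np.asarray(x, dtype=float)[feature_idx]
    # quadratic form phi(x)^T psi phi(x)
    value = float(phi_x @ psi @ phi_x)
    return int(tau1 > value > tau2)

psi = np.eye(2)  # identity matrix: the quadratic form is the squared distance from the origin
inside = conic_weak_learner([0.5, 0.5], [0, 1], psi, tau1=1.0, tau2=-np.inf)
outside = conic_weak_learner([2.0, 0.0], [0, 1], psi, tau1=1.0, tau2=-np.inf)
print(inside, outside)
```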

## Decision Stumps

A decision stump represents both a weak learner _and_ one level of a decision tree, for example:


<img src="imgs/decisionstump.png" style="max-width:100%; width: 20%; max-width: none">


Here, a decision is made on a single feature from the data, and the choice reflects an axis-aligned classifier (a simple threshold on that feature). In what follows we will focus on classification stumps. However, details on regression trees are provided in the video lecture.

### Weak Learning rules for classification

We will discuss two options for classification tree cost functions; these are Information Gain and the Gini Index.

- ***Information Gain*** represents the decrease in entropy obtained after a dataset is split on an attribute.

- ***Gini Index*** reflects how mixed the classes are following the split. Perfect separation results in a score of 0, whereas the worst case split (that results in 50/50 classes in each group) results in a Gini score of 0.5 (for a 2 class problem).

The calculation for ***Information Gain*** is as follows:

$$ I(S_j,\theta_j) = H(S_j) - \sum_{i \in \{L,R\}} \frac{|S_j^i|}{|S_j|}H(S^i_j) $$

Where $H(S_j)$ represents entropy (or the amount of disorder in a system): 

$$H(S_j)=-\sum_{y_k \in Y} p(y_k) \log_2 p(y_k); $$

$Y$ is the set of class labels (i.e. {0,1} for a binary problem); $p(y_k)$ is the proportion of examples that have class $k$ reaching the current (in this case parent) node; $|S_j|$ is the total number of examples reaching node $j$ and $|S_j^i|$ is the number of examples passing down branch $i$ from node $j$.

Looking at a toy example:

<img src="imgs/InformationGainexample.png" style="max-width:100%; width: 70%; max-width: none">

Here, we start with 23 examples at the parent node (14 o and 9 +). We are interested in testing a split that results in the right child node taking 11 instances (4 o and 7 +) and the left child node taking 12 instances (10 o and 2+). We calculate the entropies for the parent node and each child node as follows:


<img src="imgs/InformationGainexample2.png" style="max-width:100%; width: 70%; max-width: none">

The final cost is then estimated from a weighted sum of the entropies of each child, $H(S^i_j)$:

$$ \frac{11}{23}  \times 0.946 + \frac{12}{23}\times 0.650 =0.792 $$

which is subtracted from the original parent entropy ($H(S_j)$):

$$ I(S_j,\theta_j)= 0.966-0.792=0.174 $$

Here the weights are estimated from the proportion of examples in each node relative to that in the parent node, i.e. $\frac{|S_j^i|}{|S_j|}$.
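The arithmetic of the toy example can be checked with a few lines of Python. This is a sketch using only the class counts quoted in the text; the helper names `entropy` and `information_gain` are illustrative.

```python
from math import log2

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(class_counts)
    # classes with zero count are skipped, since p*log2(p) tends to 0 as p tends to 0
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted sum of the child entropies."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [14, 9]                 # 14 'o' and 9 '+'
left, right = [10, 2], [4, 7]    # counts of ('o', '+') in each child

print(round(entropy(parent), 3), round(entropy(right), 3), round(entropy(left), 3))
print(round(information_gain(parent, [left, right]), 3))
```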

The ***Gini Index*** is estimated as:

$$Gini = 1 -\sum_{y_k\in Y} p(y_k)^2 $$

Where, again, $p(y_k)$ represents the proportion of examples that have class $k$ reaching the node. Thus, estimating this for each child node separately:

<img src="imgs/Gini_example.png" style="max-width:100%; width: 70%; max-width: none">
<a id='gini'></a>
And once again left and right splits are combined as a weighted sum:

$$ I(S_j,\theta_j) = \sum_{i \in \{L,R\}} \frac{|S_j^i|}{|S_j|}Gini_i $$

Which for this example returns:

$$ I(S_j,\theta_j)= \frac{11}{23} \times 0.463 + \frac{12}{23} \times 0.278 =0.366 $$
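The same toy example can be checked numerically. The sketch below uses only the class counts quoted earlier (right child 4 o and 7 +, left child 10 o and 2 +); the helper names `gini` and `weighted_gini` are illustrative.

```python
def gini(class_counts):
    """Gini index of a node, given the count of each class."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(children):
    """Size-weighted sum of the Gini index of each child node."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

left, right = [10, 2], [4, 7]    # counts of ('o', '+') in each child

print(round(gini(right), 3), round(gini(left), 3))
print(round(weighted_gini([left, right]), 3))
```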

Note, for datasets containing many categorical variables, Information Gain is biased in favour of attributes with more categories. The Gini index is the default criterion in Scikit-Learn.

Further, remember that when optimising costs, ***Gini Index*** must be ***MINIMISED*** and ***Information Gain*** must be ***MAXIMISED***.


## Exercise 1: building a decision stump classifier from scratch

In what follows we will go step by step through the process of building and training a decision stump to perform classification. This will build towards Exercise 2 (optional), where multiple decision stumps are stacked together into a tree.

In the first part of the exercise we will use the Gini Coefficient to identify the best split of the data for one node in our tree. To test this we will use the following toy dataset $\mathbf{X}$ made up of ten examples (rows) each with two features (columns 1 and 2) and binary labels (column 3):

In [22]:
X = np.asarray([[2.771244718,1.784783929,0],
           [1.728571309,1.169761413,0],
           [3.678319846,2.81281357,0],
           [3.961043357,2.61995032,0],
           [2.999208922,2.209014212,0],
           [7.497545867,3.162953546,1],
           [9.00220326,3.339047188,1],
           [7.444542326,0.476683375,1],
           [10.12493903,3.234550982,1],
           [6.642287351,3.319983761,1]])

We will go through the process step by step, editing the below functions and testing them on this data, until we are sure our code functions correctly.

###  Exercise 1.1: calculate the gini coefficient for a given split

<a id='step1'></a>

First we need to write our own function for evaluating the cost for any proposed split of the data using the Gini Coefficient.

In the below function the input is a list (```branch```) with the total number of examples for each class in the split. Since we have two classes the length of the list must be two, and given 10 data examples the sum of this list can be at most 10. Thus one possible example might be the list ```[2, 8]```. 

Thus we must calculate the gini coefficient by first estimating the total number of examples reaching this tree node, then estimating the proportions of each class ($p(y_k)$) and using these to estimate the coefficient.

Complete the function by replacing all ```None ``` statements with the correct code. 

1. Sum the elements of ```branch``` to estimate the total number of examples
2. Estimate $p(y_k)$: the proportion of total items in the split that belong to each class
3. Subtract $p(y_k)^2$ from the current estimate of the Gini Coefficient. See how here we have initialised our Gini as 1 outside of the loop; we can then iteratively subtract the squared proportion for each class from this total using the shorthand ```-=``` notation

**If you get stuck there are more hints that you can unlock by viewing the hidden cell below (go to View -> Cell Toolbar -> toggle ```Hide code``` and uncheck)**

**Hidden cell below**

In [23]:
# Extra hints

# 1. to estimate the total number of examples in a list you might use np.sum 
# 2. to estimate p(y_k) take the specific example [2, 8] 
#    for this given list the total items is 10 and we have 2 elements of class 0 and 8 elements of class 1
# 3. As Gini is initialised to 1 we can achieve $gini=(1- \sum_{y_k \in Y} p(y_k)^2)$ 
#    by subtracting the squared proportion (for each class) at each iteration of the loop

In [24]:
def gini_coefficient(branch):
    
    """
        Estimates Gini Coefficient for a given class split
        input:
            branch: list of length k (where k= number of classes).
                   The values at each index reflect the total number of instances 
                   of each class, for this proposed branch split
                             
        output:
            gini: gini coefficient for this split 
    """
    # estimating total number of samples in branch split (by summing contents of split list)
    split_size=np.sum(branch)
    gini=1
    # iterating over all items in the array
    for class_total in branch:
        # estimating p*p for this class label; subtracting from current gini total
        proportion_class_k=(class_total/split_size)
        gini-=proportion_class_k*proportion_class_k
        
    return gini

Let's test this, assuming the split of data in our branch is [2,8]:

In [25]:
gini_coefficient([2,8])

0.31999999999999984
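For comparison, the same quantity can be computed without an explicit loop. This is an alternative sketch, not the exercise solution; the name `gini_coefficient_vectorised` is an assumption, and the empty-branch case returns 1 to mirror the loop version above (which leaves `gini` at its initial value when the list is empty).

```python
import numpy as np

def gini_coefficient_vectorised(branch):
    """Gini coefficient from a list of per-class counts, using array operations."""
    counts = np.asarray(branch, dtype=float)
    total = counts.sum()
    if total == 0:
        # no examples in this branch: nothing to subtract from the initial value of 1
        return 1.0
    proportions = counts / total
    return 1.0 - np.sum(proportions ** 2)

print(gini_coefficient_vectorised([2, 8]))
```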

###  Exercise 1.2: Propose splits 
<a id='step2'></a>


Now that we are able to estimate the cost of any proposed split we need to be able to generate suggested splits of our dataset. For this we need a function that, given a) some feature to split on (```index``` - column index) and b) some threshold to split on (```value```), will 
1. check all the values of the features at that indexed position, and 
2. split the data into a left branch (if that data example's feature is below the threshold) and into a right branch (if that data example's feature is at or above the threshold). We will call this function ```test_split```.

Edit the code here to input an ```if``` statement that checks for each row whether the value of the feature (indicated by the position variable ```index```) is below the threshold ```value```. If the feature value is below the threshold then add the row to the '```left```' list; otherwise add it to the '```right```' list.

**Note that:**

- ```for row in dataset:``` will slice rows from ```dataset```, so what we are asking you to code is a check against the feature value for the feature located at ```index```. 
- you need to subsequently choose (based on that threshold) whether you add that row to the ```left``` list or the ```right``` list. How do you add items to a list?

**More hints in Hidden cell below**

In [26]:
# hints
# 1. for a given row, how do you return the value of the feature at a column position given by 'index'
# 2. to add to a list you might want to use 'append'

In [27]:
def test_split(index, value, dataset):
    """
        Split a dataset based on an attribute and an attribute value 
        input:
            index = feature/attribute index (i.e. data column index) on which to split
            value = threshold value (everything below this goes to left split, 
                    everything at or above goes to right)
            dataset = array (n_samples,n_features+1) 
                    rows are examples 
                    last column indicates class membership
                    remaining columns reflect features/attributes of data
                             
        output:
            left,right: data arrays reflecting data split into left and right branches, respectively
    """
    
    # create empty list that you will populate with rows of dataset 
    left=[]
    right = []
    # the loop below will slice rows from data set
    for row in dataset:
        # if the value of this feature for this row is less than  
        # the (threshold) value split into left branch, else split into right
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
	
    return np.asarray(left), np.asarray(right)

Let's now estimate a split:

1. Split on the ***first*** feature.  
2. Propose a threshold based on the value at the 7th row of our data matrix $\mathbf{X}$ (***remember in both cases that python indexes from zero!!***)
3. Look at $\mathbf{X}$ - is this correct?

In [18]:
index=0
rowindex=6
threshold=X[rowindex,index]
print('the value of the feature {} at row {} of the data set is {}'.format(index,rowindex,threshold))

the value of the feature 0 at row 6 of the data set is 9.00220326


Thus, estimating the split of the data by thresholding on this value gives us:

In [28]:
branches=test_split(index, X[rowindex,index], X)

print('Our left branch is \n {}'.format(branches[0]))
print('Our right branch is \n {}'.format(branches[1]))

Our left branch is 
 [[2.77124472 1.78478393 0.        ]
 [1.72857131 1.16976141 0.        ]
 [3.67831985 2.81281357 0.        ]
 [3.96104336 2.61995032 0.        ]
 [2.99920892 2.20901421 0.        ]
 [7.49754587 3.16295355 1.        ]
 [7.44454233 0.47668338 1.        ]
 [6.64228735 3.31998376 1.        ]]
Our right branch is 
 [[ 9.00220326  3.33904719  1.        ]
 [10.12493903  3.23455098  1.        ]]


Thus, ```branches``` here is a 2-object tuple, with the first object reflecting the left branch and the second object reflecting the right branch. Note that this specific notation will be useful for the functions that follow.
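The same left/right tuple can also be produced with a boolean mask rather than a loop. This is an alternative sketch, not the exercise solution; the name `split_with_mask` and the small array are illustrative assumptions. One difference worth noting: an empty branch here keeps its two-dimensional shape `(0, n_columns)`, whereas converting an empty list with `np.asarray` gives shape `(0,)`.

```python
import numpy as np

def split_with_mask(index, value, dataset):
    """Split rows into (left, right) by thresholding one column with a boolean mask."""
    dataset = np.asarray(dataset)
    below = dataset[:, index] < value    # True for rows strictly below the threshold
    return dataset[below], dataset[~below]

# small illustrative array: one feature column followed by a label column
data = np.array([[1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]])
left, right = split_with_mask(0, 3.0, data)
print(left)
print(right)
```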

###  Exercise 1.3: Estimate total cost of split


In step 1 we calculated the Gini coefficient for one branch of a split (let's call this $Gini_i$, taking the notation from [above](#gini)), and in step 2 we defined a function to propose a potential split of the data. 

We now need a function that will estimate the Gini coefficient for both branches of the split and combine to give a final cost:

$$
\begin{align}
I(S_j,\theta_j) &= \sum_{i \in \{L,R\}} \frac{|S_j^i|}{|S_j|} Gini_i \\
&= \sum_{i \in \{L,R\}} \frac{total\_examples\_branch_i}{total\_examples\_node} Gini_i 
\end{align}
$$

Thus, to estimate the full cost of any split, we must now estimate the cost for both the left and right branches and sum them, weighted by their relative proportions.

The below function first estimates the total number of examples reaching the node (by summing total rows from both branches). Then it loops over each branch to estimate the cost of each. 

***Note*** For this it needs to create the list of class counts that is passed as the input argument ```branch``` in step 1 ([```gini_coefficient```](#step1)). In the below function this variable is ```class_counts_for_branch```.

It then calculates the ```gini_coefficient``` for the given branch, and sums the result weighted by the proportion of data in that branch relative to the total reaching the node.

Let's implement this function. 

***To Do***

1. Variable ```branch_per_class```: For each ```branch``` of the proposed split, slice all data examples of the given class for that loop (line 31, given by variable ```class_val```). 
  -  Remember the last column of our dataset ```X``` is the labels and, following ```test_split```, ```X``` is split into two branches - these get passed to the ```split_cost``` function as the ```split``` argument
  - ```split``` is iterated over (line 21), so each loop looks at one branch in turn, with its data passed to the ```branch``` variable
  - ```branch_per_class``` wants all the rows from ```branch``` corresponding to a specific class, given by variable ```class_val``` (you might see that it does this for both classes - line 25)
2. Count the total number of rows for this slice (line 33) 
3. Use this to estimate the gini coefficient for that branch - save to variable ```gini_split``` (line 37, using the function from step 1) 
4. Weight this by sample size, relative to ```total_samples``` (line 39). This is then summed over loops (line 40), so that each branch is added to the total cost weighted by the proportion of the total samples that reach it
    
**More hints in Hidden cell below**

In [37]:
# Hints

# 1. to return all rows in ```branch``` corresponding to a specific ```class_val``` 
#     you need to consider the last column ```branch[:,-1]``` and slice all rows where this equals ```class_val```
 

In [38]:
def split_cost(split,classes): 
    
    """
        Estimates the cost for a proposed split 
        input:
            split: tuple of form (L,R) where L reflects the data for the left split and
                    R reflects data for the right split
            classes: list of class values i.e. [0,1]
                             
        output:
            cost: weighted sum of gini coefficients for left and right sides of the split
    """
    cost=0
    total_samples=0
    
    # estimate the relative size of each branch
    for branch in split:
        total_samples+=branch.shape[0]
    
    # for each (left/right) split on the proposed tree
    for br_index,branch in enumerate(split):
        # initialise list of class counts for this branch
        class_counts_for_branch=[]
        # for each class value, count the data examples (rows) that have this class, in this branch 
        for class_val in classes:
            
            if branch.shape[0] == 0: # don't continue if size of split is 0
                continue
           
            # slice data to return only rows from branch which have this specific class value  
            branch_per_class=branch[branch[:,-1]==class_val]
            # count the number of rows in for this class in this branch and append 
            total_rows=branch_per_class.shape[0]
            class_counts_for_branch.append(total_rows)

        # estimate the gini coefficient for this split 
        gini_split=gini_coefficient(class_counts_for_branch)
        # estimated the weighted contribution for this split 
        weighted_by_sample_size=gini_split*(branch.shape[0]/total_samples)
        cost+=weighted_by_sample_size
                        
        
    return cost

Let's estimate the total cost of the split proposed from step 2:

In [39]:
class_values=[0,1]
splitcost=split_cost(branches,class_values)

print('The cost of the proposed split is: ', splitcost)

The cost of the proposed split is:  0.375
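This value can be checked by hand from the branches printed in step 2: the left branch holds 8 examples (5 of class 0 and 3 of class 1) and the right branch holds 2 examples (both of class 1), so

$$ Gini_L = 1 - \left(\frac{5}{8}\right)^2 - \left(\frac{3}{8}\right)^2 = 0.469, \qquad Gini_R = 1 - 0^2 - 1^2 = 0 $$

$$ I(S_j,\theta_j) = \frac{8}{10} \times 0.469 + \frac{2}{10} \times 0 = 0.375 $$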


###  Exercise 1.4: Choose optimal feature/threshold split 

<a id='step4'></a>

Finally, we now need to put this all together by looping through all possible features (all but the last column of our data matrix), and all possible thresholds to determine the best split for this node.

The **output of this function is a dictionary** (see return statement). This saves all variables required for later predicting on that node, specifically: 1) the feature ```index``` that the node is split on; 2) the threshold ```value``` on which the data is split; and 3) the tuple of data arrays reflecting the resulting split (```branches```)

Edit the below function to:

1. edit line 28 to loop over all features (all columns of the dataset, except the last, which instead reflects the class). **hint** define the correct range of values  
2. for each feature index (line 28), try returning the branches corresponding to different thresholds by proposing splits corresponding to thresholding on the value of that feature for each row (line 31) 
  - **hint** look back over the previous exercises - which function proposes splits of the data? What arguments does it require? 
  - **hint** line 32 iterates over rows of the data set - how can you use this to propose a threshold value? 
  - **note** this returns the tuple ```branches``` with two data arrays corresponding to the left and right branches for that split

3. Given ```branches```, estimate the ```cost``` of the proposed split (**hint** step 3)
4. write an if statement that updates the variables ```best_cost```, ```best_split```, ```best_index``` and ```best_value``` _provided_ that ```cost``` (given in 3.) is lower than the one held previously



In [None]:
# Hints 

# 1. ```test_split``` splits the data according to a given threshold on a specific feature (Ex 1.2); 
#  -  here the feature is given by ```index```
#  - and the threshold value corresponds to the value of that row at location of ```index``` (see also)





In [31]:

def get_best_split(dataset):
    """
        Search through all attributes and all possible thresholds to find the best split for the data
        input:
            dataset = array (n_samples,n_features+1) 
                    rows are examples 
                    last column indicates class membership
                    remaining columns reflect features/attributes of data
                             
        output:
            dict containing: 1) 'index' : index of feature used for splitting on
                             2) 'value': value of threshold split on
                             3) 'branches': tuple of data arrays reflecting the optimal split into left and right branches
                             
    """
    
    # estimating the total number of classes by looking for the total number of different unique values 
    # in the final column of the data set (which represents class labels)
    class_values=np.unique(dataset[:,-1])
    
    # initialising optimal values prior to refinement
    best_cost=sys.float_info.max # initialise to max float
    best_value=sys.float_info.max # initialise to max float
    best_index=dataset.shape[1]+1 # initialise as greater than total number of features
    best_split=tuple() # the best_split variable should contain the output of test_split that corresponds to the optimal cost

    #iterating over all features/attributes (columns of dataset)
    for index in np.arange(dataset.shape[1]-1):

        #Trialling splits defined by each row value for this attribute
        for r_index,row in enumerate(dataset):
            branches=test_split(index, row[index], dataset)

            cost=split_cost(branches,class_values)
            if cost < best_cost:
                best_cost=cost
                best_split=branches
                best_index=index
                best_value=row[index]
                print('Best cost={}; Best feature={}; Best row={}'.format(best_cost,index,r_index) )
                
    return {'index':best_index, 'value':best_value, 'branches':best_split}


Now that our functions for splitting on a single node are complete, let's find the best split of the single-feature toy ```dataset``` defined at the start of the notebook:

In [169]:
split = get_best_split(dataset)

print('The optimal left branch is \n {}'.format(split['branches'][0]))
print('The optimal right branch is \n {}'.format(split['branches'][1]))



Best cost=0.5; Best feature=0; Best row=0
Best cost=0.4444444444444444; Best feature=0; Best row=1
Best cost=0.375; Best feature=0; Best row=2
Best cost=0.2857142857142857; Best feature=0; Best row=3
Best cost=0.1666666666666666; Best feature=0; Best row=4
Best cost=0.0; Best feature=0; Best row=5
The optimal left branch is 
 [[1 0]
 [2 0]
 [3 0]
 [4 0]
 [5 0]]
The optimal right branch is 
 [[ 6  1]
 [ 7  1]
 [ 8  1]
 [ 9  1]
 [10  1]]
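Once the feature index and threshold are known, the stump classifies a new example with a single comparison. The sketch below assumes each branch is labelled with its majority class (0 on the left and 1 on the right for the split printed above); `predict_stump`, `left_label` and `right_label` are hypothetical names, not part of the exercise code.

```python
def predict_stump(row, index, value, left_label, right_label):
    """Classify one example with a decision stump: a single threshold on one feature."""
    return left_label if row[index] < value else right_label

# parameters matching the optimal split printed above: feature 0, threshold 6
stump = {'index': 0, 'value': 6}
new_examples = [[2.5], [6.0], [9.1]]
predictions = [predict_stump(row, stump['index'], stump['value'], 0, 1)
               for row in new_examples]
print(predictions)
```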


## Decision Trees

A decision tree is a hierarchy of decision stumps:

<img src="imgs/decisiontree.png" style="max-width:100%; width: 50%; max-width: none">

The top node is the root node and the terminal nodes are the leaf nodes; in between, we refer to the input node of each decision stump as the parent node, which splits data down left and right branches to two child nodes.

As before, for decision stumps, nodes reflect questions we ask of the data, e.g. the threshold we choose (or, in the case of regression, the constant functions we fit). Typically, each is fit on a single feature. Edges then reflect the answers to that question. These are binary choices, as for classification stumps: if a feature value is less than a threshold it takes the left branch; otherwise it takes the right branch.

This simplicity confers certain advantages:
- It returns learning models which are easy to interpret 
- It requires little data preparation (no normalization)
- It is able to handle both numerical and categorical data 
- It is able to handle multi-output (multi-class) problems

And very importantly, the general approach can be applied for classification, where at each node an axis-aligned classifier (threshold) is fit to optimally separate the data which reaches that node. For example, in the video below (as explained in the video lecture), the algorithm splits first on feature $x_1$ to largely separate pink, red and yellow crosses from the light blue and dark green crosses (note crosses are changed to different shapes in video lectures to improve differentiation for those with colour blindness).

The second decision stump splits on $x_2$, and the third on $x_1$ again, all so as to partition the feature space up into blocks which optimally separate the different classes. In this way, hopefully you can see it is able to learn a non-linear decision boundary.

In [2]:
import io
import base64
from IPython.display import HTML

video = io.open('RFclassifier.mp4', 'r+b').read()
encoded = base64.b64encode(video)
# HTML(data='''  <video   alt="test" controls>
#                 <source src="data:video/mp4;base64,{0}" type="video/mp4" />
#              </video>'''.format(encoded.decode('ascii')))

HTML("""
<video width="1000" height="500" controls>
  <source src="RFclassifier.mp4" type="video/mp4">
</video>
""")

For regression, the approach is similar: a series of thresholds is placed on the x-axis, and for each split a function is fit to minimise the error between true and predicted y values. In this case (and as standard for Scikit-Learn implementations) the function fits a constant prediction at each split, i.e. y = 0.6 and -1.1 for the left and right branches of the first split. As more and more splits are made the tree is able to estimate a closer and closer fit to this sinusoidal function.

In [3]:
HTML("""
<video width="1000" height="500" controls>
  <source src="RFregression.mp4" type="video/mp4">
</video>
""")
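The regression case described above can be sketched for a single split. The code below is a minimal illustration, assuming squared error as the cost and the branch mean as the constant prediction; the function name `best_regression_split` and the sampled sine data are assumptions, and the numbers it prints are not expected to match those in the video.

```python
import numpy as np

def best_regression_split(x, y):
    """Return (threshold, left_mean, right_mean, sse) for the best single split.

    Each branch predicts a constant (the mean of its y values); the best
    threshold minimises the summed squared error over both branches.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    best = None
    for threshold in np.unique(x):
        left, right = y[x < threshold], y[x >= threshold]
        if left.size == 0 or right.size == 0:
            continue  # skip splits that leave one branch empty
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[3]:
            best = (threshold, left.mean(), right.mean(), sse)
    return best

# illustrative data: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)
threshold, left_mean, right_mean, sse = best_regression_split(x, y)
print(threshold, left_mean, right_mean)
```

Applying the same search recursively to each branch would give the piecewise-constant fit described in the text.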

In what follows we will build from our decision stump classifier to create a full decision tree classifier learning algorithm. This section is optional but incorporates the idea of learning a nested dictionary which stores the parameters of the weak learning rules for each stump in the tree.
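To give a feel for the nested dictionary idea, the sketch below walks a small hand-written tree. It assumes each node stores the keys `index`, `value`, `left` and `right`, where a branch is either another node dictionary or a class label; the `predict` function and the tree values are illustrative and not learned from data.

```python
def predict(node, row):
    """Walk a nested-dict tree until a class label (a non-dict value) is reached."""
    # choose the branch using this node's feature index and threshold
    branch = node['left'] if row[node['index']] < node['value'] else node['right']
    if isinstance(branch, dict):
        # the branch is another decision stump: keep descending
        return predict(branch, row)
    # the branch is a terminal node: return its class label
    return branch

# hand-written two-level tree (illustrative values only)
tree = {'index': 0, 'value': 5.0,
        'left': 0,
        'right': {'index': 1, 'value': 2.0, 'left': 1, 'right': 0}}

print(predict(tree, [3.0, 9.9]), predict(tree, [7.0, 1.0]), predict(tree, [7.0, 3.0]))
```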

## (optional) Exercise 2: building and testing a complete decision tree

<a id='Ex2'></a>

Once we are able to evaluate the best split on a single node of a tree we can then start to think about building nodes together in order to refine prediction of class labels from our data. 

To complete this we need three more things:

1. A function that assigns each leaf (terminal) node a label (associated with the most common label of training points reaching that node)
2. A recursive function that decides (based on the configuration of data reaching each node) whether to continue splitting the data or to assign that node as a terminal node
3. A function that recursively splits data (according to pt 2) in order to build a  tree


***2.1***

The goal of our tree is to make a prediction of class labels for unseen data. For classification problems, this means that each terminal node must have an assigned class. We do this by picking the most popular label from the training data that reach that node.

EDIT the below function to:
1. from ```outcomes``` (class labels of the training set) estimate ```counts```: the total number of instances of each class (***HINT*** see numpy documentation for np.bincount)
2. finally return ```most_common_class```: the class with the biggest contribution to ```counts``` (***HINT***  np.argmax)

In [170]:
# Create a terminal node value
def to_terminal(group):
    
    """
        Assigns a label according to the most common class label of the data
        input:
            group = array (n_samples,n_features+1) 
                    rows are examples 
                    last column indicates class membership
                    remaining columns reflect features/attributes of data
                             
        output:
            class label for this terminal node
    """
    
    # set outcomes equal to the final column of the input array  
    # - as this indicates the labels of the training data 
    outcomes = group[:,-1]
    print(outcomes)
    counts = np.bincount(outcomes.astype(int))
    most_common_class=np.argmax(counts)
    return most_common_class

Now test on the split received by the left branch of the above example

In [171]:
left_label=to_terminal(split['branches'][0])

print('The label assigned to the left branch is:', left_label)

[0 0 0 0 0]
The label assigned to the left branch is: 0
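`np.bincount` requires non-negative integer labels, which is why the function above casts with `astype(int)`. As an alternative sketch for labels that are not small integers, `np.unique` with `return_counts=True` can be used instead; the name `most_common_label` is an assumption.

```python
import numpy as np

def most_common_label(labels):
    """Return the most frequent label; ties go to the smallest label in sorted order."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

print(most_common_label(np.array([0., 0., 1., 0., 1.])))
print(most_common_label(np.array(['cat', 'dog', 'dog'])))
```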


***2.2***

The next step is to generate a function (```run_split```) that will recursively split the data until a termination criterion is met. Termination criteria include:
- the case where all examples are assigned to a single branch (and thus cannot be further subdivided)
- the case where the node has reached a predefined ```max_depth``` for the tree
- the case where the total number of examples reaching the node is equal to or less than a pre-defined ```min_size```

The input to the below function is a dictionary (here ```node```) representing the results output from [Exercise 1.4](#step4), containing the keys: ```index``` (feature that the node is split on), ```value``` (threshold on which the data is split) and ```branches``` (the tuple of data arrays reflecting the resulting split).

The function starts by first extracting the split data from the node dictionary (key='```branches```') and then deleting this information from the node, such that the node can then be cleanly updated as a terminal node or split again.

Then the code checks serially through all possible outcomes for the node:
1. In the case that either the left or right branch of the split is empty, then set this node as a terminal node and estimate the label for this prediction using ```to_terminal``` (STEP 1, above)
2. In the case that the node is at the ```max_depth``` allowed for this tree then assign as terminal and estimate the label for this prediction using ```to_terminal``` (again STEP 1, above)
3. In the case that the number of examples reaching the node is equal to or less than the ```min_size``` then, again, assign as terminal and estimate the label for this prediction using ```to_terminal``` from STEP 1 above
4. Finally, assuming that we instead have nodes that support further splits, for each of the ```left``` and ```right``` branches in turn, call ```get_best_split``` and then, recursively, ```run_split```

Please familiarise yourself with the code of this function, to be sure that you understand exactly what each line of code is doing. Identify which lines correspond to each of points 1-4 above.

In [172]:
              
def run_split(node, max_depth, min_size, depth):
     
    """
        Recursively splits nodes until termination criterion is met
        input:
            node = dict containing: 1) 'index' : index of feature used for splitting on
                             2) 'value': value of threshold split on
                             3) 'branches': tuple of data arrays reflecting the optimal split into left and right branches
            max_depth: int determining max allowable depth for the tree
            min_size : int determining minimum number of examples allowed for any branch
            depth: current depth of tree              
            
            
        Output:
            None; node is modified in place, so that its 'left' and 'right' keys hold either
            child node dicts or terminal class labels, together representing the whole tree
    """
    left, right = node['branches']
    del(node['branches'])
    # check for whether all data has been assigned to one branch; if so assign both branches the same label
    if left.shape[0]==0 :
        node['left'] = node['right'] = to_terminal(right)       
        return
    if right.shape[0]==0 :
        node['left'] = node['right'] = to_terminal(left)       
        return
    # check for max depth; if reached then estimate labels for both branches
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    # in first instance check whether the number of examples reaching the left node is at or below the allowed limit
    # if so assign as a terminal node, if not then split again
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_best_split(left)
        run_split(node['left'], max_depth, min_size, depth+1)
    
    # process right child as for left
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_best_split(right)
        run_split(node['right'], max_depth, min_size, depth+1)

*** 2.3  *** 

Finally, pool everything together through a top level ```build_tree``` function. Replace each ```None``` in this function with the correct code for complete construction of a tree through:

1. creating a root node split by calling ```get_best_split``` ([STEP 4](#step4)) on the full training set
2. recursively building the rest of the tree by calling ```run_split```. 

These two steps completely specify the tree.

In [177]:
def build_tree(train, max_depth, min_size):
    """
    Builds and returns final decision tree
    
    input:
        train : training data array (n_samples,n_features)
        max_depth: user defined max tree depth (int)
        min_size: user defined minimum number of examples allowed for any branch (int)
    """
    # create a root node split by calling get_best_split on the full training set
    root = get_best_split(train)
    # now build the tree using run_split
    run_split(root, max_depth, min_size, 1)
    return root

Let's test our code on our full data set. Please edit the below call so that it trains on the full data set ```dataset``` with ```max_depth```= 3 and ```min_size```= 1

In [178]:
tree = build_tree(np.asarray(dataset), 3, 1)
print('Decision Tree: \n {}'.format(tree))

Best cost=0.5; Best feature=0; Best row=0
Best cost=0.4444444444444444; Best feature=0; Best row=1
Best cost=0.375; Best feature=0; Best row=2
Best cost=0.2857142857142857; Best feature=0; Best row=3
Best cost=0.1666666666666666; Best feature=0; Best row=4
Best cost=0.0; Best feature=0; Best row=5
Best cost=0.0; Best feature=0; Best row=0
[0 0 0 0 0]
Best cost=0.0; Best feature=0; Best row=0
[1 1 1 1 1]
Decision Tree: 
 {'index': 0, 'value': 6, 'left': {'index': 0, 'value': 1, 'left': 0, 'right': 0}, 'right': {'index': 0, 'value': 6, 'left': 1, 'right': 1}}


As you can hopefully see, the ```left``` and ```right``` keys of the dictionary also contain dictionaries at nodes where the data can be split further. 
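
The nested structure is easier to read when printed one node per line. The sketch below is one possible way to do this; the helper ```tree_lines``` and the variable ```example_tree``` are hypothetical names introduced here, and the tree literal is simply copied from the output printed above:

```python
def tree_lines(node, depth=0):
    """Return one text line per node of a nested-dict tree, indented by depth."""
    indent = '  ' * depth
    if isinstance(node, dict):
        # internal node: rows with feature value < threshold go to the left branch
        lines = ['{}[x{} < {}]'.format(indent, node['index'], node['value'])]
        lines += tree_lines(node['left'], depth + 1)
        lines += tree_lines(node['right'], depth + 1)
        return lines
    # terminal node: the stored value is the predicted class label
    return ['{}label {}'.format(indent, node)]

# tree copied from the printed output above
example_tree = {'index': 0, 'value': 6,
                'left': {'index': 0, 'value': 1, 'left': 0, 'right': 0},
                'right': {'index': 0, 'value': 6, 'left': 1, 'right': 1}}

lines = tree_lines(example_tree)
print('\n'.join(lines))
```

Each bracketed line is a split test, and the two lines indented beneath it are its left and right children.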

### Part 3: Making Predictions

Now that we have our tree we can make predictions. For this we require a function that recursively checks whether the ```left``` and ```right``` branches at each new depth reflect new node dicts or terminal nodes. Once a terminal node is reached the function returns a predicted class label:


In [179]:
def predict_row(node, row):
    
    """
    Predict from a decision tree, by interrogating node branches recursively
    
    input:
        node = decision tree represented as dict containing: 
                1) 'index' : index of feature used for splitting on
                2) 'value': value of threshold split on
                3) 'left', 'right': child nodes, each either a further node dict or a terminal class label
        row: single row of test data matrix    
    output:
        predicted class label
       
    """
    if row[node['index']] < node['value']:
         # if the result for the left branch returns another dictionary then repeat
        if isinstance(node['left'], dict):
            return predict_row(node['left'], row)
        else:
            # else if it's an integer you've reached a terminal node so return label
            return node['left']
    else:
         # if the result for the right branch returns another dictionary then repeat
        if isinstance(node['right'], dict):
            return predict_row(node['right'], row)
        else:
            # else if it's an integer you've reached a terminal node so return label
            return node['right']

Now take a new data example ```[8.5,4.32,1]``` (whose final entry is the true label) and predict its label:

In [180]:
testdata=np.asarray([8.5,4.32,1])

prediction=predict_row(tree, testdata)
print('Expected={}, Got={}'.format(testdata[-1], prediction))


Expected=1.0, Got=1
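
The same idea extends from a single row to a whole test set. The sketch below is only an illustration: it repeats the logic of ```predict_row``` in compact form so that it runs on its own, and the helper ```accuracy```, the tree literal ```example_tree``` (copied from the printed tree above) and the rows in ```test_rows``` are all hypothetical additions, with each row holding one feature followed by its label:

```python
def predict_row(node, row):
    """Compact restatement of predict_row: follow branches until a label is reached."""
    branch = 'left' if row[node['index']] < node['value'] else 'right'
    child = node[branch]
    return predict_row(child, row) if isinstance(child, dict) else child

def accuracy(tree, rows):
    """Fraction of rows whose final column equals the tree's prediction."""
    correct = sum(predict_row(tree, row) == row[-1] for row in rows)
    return correct / len(rows)

# tree copied from the printed output above
example_tree = {'index': 0, 'value': 6,
                'left': {'index': 0, 'value': 1, 'left': 0, 'right': 0},
                'right': {'index': 0, 'value': 6, 'left': 1, 'right': 1}}

# made-up test rows of the form [feature, label]; the last one is deliberately mislabelled
test_rows = [[2.0, 0], [5.5, 0], [6.0, 1], [9.1, 1], [7.3, 0]]
predictions = [predict_row(example_tree, row) for row in test_rows]
print(predictions, accuracy(example_tree, test_rows))
```

Four of the five made-up rows are classified correctly, so the reported accuracy is 0.8.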


## Tree Pruning

Decision Trees are prone to overfitting, as increasing the number of splits means that data becomes subdivided into leaves at ever finer levels of granularity - increasing the chance that the decision function becomes fit to noise in the data. This will reduce the generalisation performance, leading to lower test accuracies.

For this reason all standard implementations of decision trees also offer pruning. This reduces overfitting by removing branches that contribute least to the prediction accuracy. Example methods for pruning include:

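- ***Reduced error pruning:*** starting from the leaves, nodes are removed for as long as prediction accuracy (typically measured on held-out validation data) is unaffected
-  ***Cost complexity pruning:*** this generates a series of trees $T_0, \dots, T_M$, where $T_0$ is the initial tree and $T_M$ represents the result of pruning everything away and leaving the root alone. The algorithm obtains these trees through an iterative process. At each step $i$:
<br>
    - Remove a subtree from tree $T_{i-1}$ to obtain $T_i$, where the specific subtree to remove is chosen by minimizing: 
    
    <br>
    $$\frac{\mathrm{err}(T_i) - \mathrm{err}(T_{i-1})}{|T_{i-1}| - |T_i|} $$
    <br>
    Here $\mathrm{err}(T)$ is the error rate of tree $T$ on the training data and $|T|$ is its number of leaves. Then the best tree is selected from the list $T_0, \dots, T_M$ so as to optimise accuracy on held-out data (for example a validation set or cross-validation), since training accuracy alone would always favour the unpruned tree $T_0$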
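
As a worked illustration of the cost complexity criterion, the sketch below evaluates the ratio for three candidate subtrees and picks the smallest. All numbers are invented for the example, the names ```pruning_cost``` and ```candidates``` are hypothetical, and tree size is assumed to mean the number of leaves:

```python
def pruning_cost(err_before, size_before, err_after, size_after):
    """Increase in error per leaf removed when pruning T_{i-1} down to T_i."""
    return (err_after - err_before) / (size_before - size_after)

# invented current tree T_{i-1}: error rate 0.10 with 8 leaves
err_prev, size_prev = 0.10, 8

# invented candidates: name -> (error after pruning, leaves after pruning)
candidates = {
    'prune subtree A': (0.12, 5),
    'prune subtree B': (0.11, 7),
    'prune subtree C': (0.16, 4),
}

costs = {name: pruning_cost(err_prev, size_prev, err, size)
         for name, (err, size) in candidates.items()}
best = min(costs, key=costs.get)   # candidate with the smallest error increase per leaf
for name, cost in costs.items():
    print('{}: {:.4f}'.format(name, cost))
print('chosen:', best)
```

Subtree A is chosen here because it removes three leaves for an error increase of only 0.02, the cheapest trade per leaf among the three.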


## Decision Trees for Regression

The basic algorithm for Regression Trees is the same as for classification; however, we must modify our cost and change how we make predictions from leaf nodes. 

For the cost we can use the Mean Squared Error (MSE), i.e. the L2 loss of the targets relative to the prediction. The prediction can be made in several ways from the data reaching the node (see the figure below): either through a constant function (that just fits the mean), or through a polynomial function (a straight line or curve) fit more closely to the data. It is even possible to use a probabilistic model [Criminisi 2013]. 

<img src="imgs/DT_regression_predictor_models.png" style="max-width:100%; width: 70%; max-width: none">

In scikit-learn and most standard implementations, simply the mean (case (a)) is used. The MSE is then estimated as:

$$ MSE= \frac{1}{N} \sum_{k=1}^{N} (y_k-\bar{y})^2 $$

where $y_k$ is the true label of the $k$-th of the $N$ samples reaching a child node, and $\bar{y}$ is the mean label of all samples reaching that child node. The full cost of any split is again modelled as a weighted sum of costs for all child nodes:

$$ I(S_j,\theta_j) = \sum_{i \in \{L,R\}} \frac{|S_j^i|}{|S_j|}MSE_i $$
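
A minimal sketch of this cost, assuming the constant (mean) predictor in each child, is given below. The function names ```node_mse``` and ```split_cost``` and the target values are hypothetical, and an empty branch is assumed to contribute zero cost:

```python
import numpy as np

def node_mse(y):
    """MSE of the targets y around their own mean (the constant prediction)."""
    y = np.asarray(y, dtype=float)
    if y.size == 0:
        return 0.0                      # assumption: empty branch adds no cost
    return float(np.mean((y - y.mean()) ** 2))

def split_cost(y_left, y_right):
    """Weighted sum of child MSEs, weighted by the fraction of samples in each child."""
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    return (n_left / n) * node_mse(y_left) + (n_right / n) * node_mse(y_right)

# made-up regression targets reaching the left and right branches
y_left = [1.0, 2.0, 3.0]
y_right = [10.0, 12.0]
print(node_mse(y_left), node_mse(y_right), split_cost(y_left, y_right))
```

For these values the child MSEs are 2/3 and 1, so the weighted cost is (3/5)(2/3) + (2/5)(1) = 0.8.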

# Further Exercises

Try constructing a regression tree from scratch, using the above classification tree as the basis but:
1.  creating a new MSE cost, and 
2.  editing the prediction function accordingly.

Try it out on the following toy dataset (taken from:
http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html#sphx-glr-auto-examples-tree-plot-tree-regression-py)

Compare your result with the fit shown in that example.
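
For convenience, the toy dataset can be generated along the following lines. This is a sketch that assumes the linked example still builds its data as a sine curve with noise added to every fifth target, so please check it against the linked page. The array name ```train``` is a hypothetical choice that places the continuous target in the final column, mirroring the layout used for the classification data above:

```python
import numpy as np

# noisy sine curve: 80 sorted inputs in [0, 5), targets sin(x)
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))   # perturb every fifth target

# stack as [feature, target] so the target sits in the last column
train = np.column_stack([X, y])
print(train.shape)
```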


# Further resources

Note: these examples are inspired by the following online tutorial: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/ 

# References

[Criminisi 2013]  Criminisi, Antonio, and Jamie Shotton, eds. Decision forests for computer vision and medical image analysis. Springer Science & Business Media, 2013.