## Helper Notebook 7.2: Decision tree midpoint split for Boston dataset

This notebook continues to build intuition for how a decision tree can be implemented. In this lab we take a quick look at a basic `TreeNode` class, fit a midpoint stump and a midpoint tree to the Boston data, and then do a quick review of recursion.

## Tools

#### Libraries:

- numpy: for processing
- pandas: for loading and handling the data
- lolviz: for visualization of graphs

#### Datasets:

- Boston housing (read from a local `boston_house_prices.csv` file)

## Setup

For this lab you will need another library called `lolviz`, which you can install from within the Jupyter notebook. Before doing so, install `graphviz`, which is easy to do on a Mac:

```
brew install graphviz
```

In [None]:
import pandas as pd
import numpy as np

from lolviz import treeviz

from types import SimpleNamespace


def load_boston(return_X_y=False):
    """Replacement function for loading in Boston House Prices"""
    df = pd.read_csv('boston_house_prices.csv')
    X = df.drop(columns=['MEDV'])   # features: every column except the target
    y = df['MEDV'].to_numpy()       # target: median home value

    if return_X_y:
        return X, y

    # Bundle features and target so they can be read as .data and .target
    return SimpleNamespace(data=X, target=y)

### Tree Node class

Let's take our tree node and enhance it a little with some extra attributes to make it more like a decision tree node.

In [None]:
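class TreeNode: # acts as decision node and leaf. it's a leaf if split is None
  def __init__(self, split=None, prediction=None, left=None, right=None):
    self.split = split
    self.prediction = prediction
    self.left = left
    self.right = right
  def __repr__(self):
    # There is no `value` attribute, so report the prediction for a leaf
    # and the split point for a decision node
    if self.split is None:
      return f"leaf({self.prediction})"
    return f"split({self.split})"
  def __str__(self):
    return self.__repr__()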

In [None]:
boston = load_boston()
X = boston.data
y = boston.target
X.head()

### Boston midpoint stump

The following demonstrates a simple stump: the data passed to `stumpfit` is split into two pieces at the midpoint of the range of `x`, provided `x` has more than one unique value (which also implies more than one observation). Otherwise `stumpfit` returns a single leaf.

In [None]:
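def stumpfit(x, y):
    if len(x)==1 or len(np.unique(x))==1: # if one x value, make leaf
        # several observations can share the same x, so average their y values
        return TreeNode(prediction=np.mean(y))
    split = (min(x) + max(x)) / 2 # midpoint of the range of x
    t = TreeNode(split=split)
    t.left = TreeNode(prediction=np.mean(y[x<split]))   # mean of y below the split
    t.right = TreeNode(prediction=np.mean(y[x>=split])) # mean of y at or above the split
    return t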

In [None]:
print(len(X), "records")
age = X.AGE  # the AGE column of the DataFrame
stump = stumpfit(age,y)
treeviz(stump)
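
To make the midpoint rule concrete, here is a minimal sketch of the same arithmetic on a tiny made-up array. The values are hypothetical and are not taken from the Boston data. Note that the split is the midpoint of the *range* of `x`, not the median, so the two sides can hold very different numbers of observations.

```python
import numpy as np

# Toy data (hypothetical values, not from the Boston dataset)
x = np.array([10.0, 20.0, 30.0, 90.0])
y = np.array([1.0, 2.0, 3.0, 10.0])

split = (x.min() + x.max()) / 2     # midpoint of the range: (10 + 90) / 2
left_pred = y[x < split].mean()     # mean of y for the three points below the split
right_pred = y[x >= split].mean()   # mean of y for the single point at or above it

print(split, left_pred, right_pred)  # 50.0 2.0 10.0
```

Three of the four observations land on the left side here, which is the kind of imbalance a midpoint split can produce on skewed data.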

### Boston midpoint tree

Now we can modify `stumpfit` so that, instead of splitting the data only once, it splits recursively at the midpoint of `x` for as long as `x` has more than one unique value. Each call splits its data into two pieces and recurses on each piece until the stopping condition is met, which results in a much larger tree.

In [None]:
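def treefit(x, y):
    if len(x)==1 or len(np.unique(x))==1: # if one x value, make leaf
        # several observations can share the same x, so average their y values
        return TreeNode(prediction=np.mean(y))
    split = (min(x) + max(x)) / 2 # midpoint of the range of x
    t = TreeNode(split=split)
    t.left  = treefit(x[x<split],  y[x<split])   # recurse on points below the split
    t.right = treefit(x[x>=split], y[x>=split])  # recurse on points at or above it
    return t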

In [None]:
root = treefit(age,y)
treeviz(root)
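
Once a tree is fit, the same recursive pattern can be used to walk it. The sketch below is self-contained and works on plain NumPy arrays with made-up values. The names `Node`, `fit`, `predict`, and `depth` are hypothetical helpers introduced for illustration and are not part of this notebook; `fit` follows the midpoint logic described above and, as an assumption, uses the mean of `y` at each leaf.

```python
import numpy as np

class Node:
    """Hypothetical stand-in for TreeNode: a leaf when split is None."""
    def __init__(self, split=None, prediction=None, left=None, right=None):
        self.split = split
        self.prediction = prediction
        self.left = left
        self.right = right

def fit(x, y):
    """Midpoint tree on NumPy arrays; stops when x has one unique value."""
    if len(np.unique(x)) == 1:
        return Node(prediction=np.mean(y))
    split = (x.min() + x.max()) / 2
    below = x < split
    return Node(split=split,
                left=fit(x[below], y[below]),
                right=fit(x[~below], y[~below]))

def predict(node, x_value):
    """Follow the splits down to a leaf and return its prediction."""
    if node.split is None:
        return node.prediction
    child = node.left if x_value < node.split else node.right
    return predict(child, x_value)

def depth(node):
    """Number of splits on the longest path from this node to a leaf."""
    if node.split is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

# Toy data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([5.0, 6.0, 7.0, 20.0])
tree = fit(x, y)

print(predict(tree, 2.2))   # 6.0
print(predict(tree, 50.0))  # 20.0
print(depth(tree))          # 3
```

Both `predict` and `depth` have the same shape as `treefit`: a base case for leaves and a recursive case that delegates to the children.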

## Dynamic method call demo

This is a small example of how different classes can define methods with the same name; Python calls the correct one based on the class of the object at run time.

In [None]:
class DecisionNode:
    def hello(self):
        print("decision")

class LeafNode:
    def hello(self):
        print("leaf")

In [None]:
d = DecisionNode()
d.hello()
l = LeafNode()
l.hello()

In [None]:
def foo(x):
    x.hello()

In [None]:
foo(d)

In [None]:
foo(l)
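
The same idea can carry a tree's prediction logic. In the sketch below, a decision node and a leaf node each define a `predict` method, so the caller never has to check which kind of node it holds. The constructors and the `predict` methods are assumptions added for illustration; they redefine the two demo classes above with hypothetical extra attributes.

```python
class DecisionNode:
    def __init__(self, split, left, right):
        self.split = split
        self.left = left
        self.right = right

    def predict(self, x_value):
        # Pick a child, then let that child decide what to do next
        child = self.left if x_value < self.split else self.right
        return child.predict(x_value)

class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, x_value):
        # A leaf ignores x_value and returns its stored prediction
        return self.prediction

# Hand-built tree with made-up split points and predictions
tree = DecisionNode(50.0,
                    LeafNode(2.0),
                    DecisionNode(80.0, LeafNode(7.0), LeafNode(10.0)))

print(tree.predict(30.0))  # 2.0
print(tree.predict(60.0))  # 7.0
print(tree.predict(95.0))  # 10.0
```

Compared with a single node class that tests `split is None`, this design moves the leaf-versus-decision check into method dispatch.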

## Getting back to the beginning

When doing recursion it can be confusing to track how results are returned back to earlier calls of the function. Sometimes it is easiest to draw a picture, or print something to the screen, to track the function calls and returns.

In [None]:
def f():
    g()
    print("back from g()")
    
def g():
    h()
    print("back from h()")

def h():
    print("hi I'm h!")

In [None]:
f()
print("back from f()")

`f` calls `g`, which calls `h`, and each call remembers where it came from. Just imagine that `f`, `g`, and `h` are the same function and you'll see that recursion also remembers where it came from.

Where to return is tracked per function **call**, not per function **definition**.
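
As a sketch of the print-to-screen approach for a genuinely recursive function, the example below sums a list and indents each message by the depth of the call. The function `total` and its `depth` parameter are hypothetical names chosen for this illustration.

```python
def total(values, depth=0):
    """Recursively sum a list, printing each call and each return."""
    indent = "  " * depth
    print(f"{indent}call total({values})")
    if not values:
        result = 0                     # base case: empty list
    else:
        # recursive case: first element plus the sum of the rest
        result = values[0] + total(values[1:], depth + 1)
    print(f"{indent}return {result}")
    return result

answer = total([3, 4, 5])
print(answer)  # 12
```

The calls nest four levels deep before the first `return` line prints, and then each call resumes exactly where it left off, one level of indentation at a time.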