In [33]:
import numpy as np
from task1 import *

## Our Algorithm

The core idea behind our algorithm is a binary search tree in which each node stores a condition. To make a prediction for an input $\vec{x}$, we start at the root and evaluate the current node's condition: if it is true for $\vec{x}$, we move down the right subtree; otherwise, we move down the left subtree.
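
To make the traversal concrete, here is a minimal sketch of a node and a prediction loop. The class name `Node`, its fields, and the function `predict_one` are assumptions made for illustration, and the condition is assumed to be a threshold test on a single feature; the actual implementation may differ.

```python
import numpy as np


class Node:
    """Hypothetical node layout; the field names are assumptions."""

    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature tested at this node
        self.threshold = threshold  # split value for that feature
        self.left = left            # followed when the condition is false
        self.right = right          # followed when the condition is true
        self.value = value          # prediction stored at a leaf

    def is_leaf(self):
        return self.left is None and self.right is None


def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf's stored value."""
    while not node.is_leaf():
        # Assumed condition: x[feature] >= threshold. True goes right, false goes left.
        if x[node.feature] >= node.threshold:
            node = node.right
        else:
            node = node.left
    return node.value


# A tiny hand-built tree that splits on feature 0 at the value 10.
root = Node(feature=0, threshold=10.0, left=Node(value=1.5), right=Node(value=4.0))
prediction_small = predict_one(root, np.array([3.0, 7.0]))   # goes left
prediction_large = predict_one(root, np.array([12.0, 7.0]))  # goes right
```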

A few fundamental questions to consider are:
1. How easy will it be to build this tree?
2. How likely is the tree to be balanced, which would improve prediction performance?

For question 1, building the tree will be fairly simple if we use partitioning. We also need logic to decide whether a node should be a leaf node. A node becomes a leaf under any of three conditions:
1. The impurity of the dataset for the node is 0.
2. The maximum number of leaf nodes has already been created.
3. The height of the tree has reached the maximum value.
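
The three conditions can be sketched as a single check. The function name `should_be_leaf` and all of its parameters are hypothetical, and the impurity is assumed to be the sum of squared deviations of the targets from their mean, which matches the sum of squares used later for choosing splits.

```python
import numpy as np


def should_be_leaf(y, depth, n_leaves, max_depth, max_leaves):
    """Return True if any of the three stopping conditions holds.

    All parameter names are assumptions made for this sketch.
    """
    # 1. Impurity is 0: with a sum-of-squares impurity, every target is equal.
    #    np.isclose guards against floating-point round-off in the mean.
    impurity = float(np.sum((y - np.mean(y)) ** 2))
    if np.isclose(impurity, 0.0):
        return True
    # 2. The leaf budget is already used up.
    if n_leaves >= max_leaves:
        return True
    # 3. The node sits at the maximum allowed depth.
    return depth >= max_depth
```

For example, `should_be_leaf(np.array([2.0, 2.0, 2.0]), depth=0, n_leaves=0, max_depth=5, max_leaves=8)` is true because all targets are identical, regardless of the other limits.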

## Understanding partitioning



The goal of partitioning is to avoid handing each node its own copy of the dataset while building the tree.

Let $X$ denote our dataset and let $\vec{x}$ denote a sample in $X$.

Assume we've found a sample $\vec{x}^{(s)} \in X$ at which to split the data. This means we want to split the data at the sample $\vec{x}^{(s)}$ along the feature $f$. One way to do this is to partition the dataset into two separate datasets, $X_{l}$ and $X_{r}$.

We do this as follows. For each sample $\vec{x} \in X$, if $\vec{x}_{f} < \vec{x}^{(s)}_{f}$, it goes into $X_{l}$; otherwise, it goes into $X_{r}$. This becomes efficient because we can store just the indices of these samples instead of creating copies of the original dataset. With NumPy, an easy way to do this is to create a boolean mask for the rows we want in each partition.

The following code is an example of how this works.

In [36]:
# creating an array of random integers
X = np.random.default_rng().integers(20, size=(4, 4))
X

array([[ 5, 18,  7, 10],
       [15, 13,  2, 16],
       [12,  5,  6,  2],
       [18,  4,  1,  0]])

In [None]:
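# Partition the data along the second feature (column index 1),
# splitting at the value taken from the second sample (row index 1).
f = 1                         # second feature
split_value = X[1, f]         # second sample's value for feature f
mask = X[:, f] < split_value  # True for rows that belong in the left partition
X_l = X[mask]                 # rows where feature f is below the split value
X_r = X[~mask]                # the remaining rows form the right partition
X_l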

array([[12,  5,  6,  2],
       [18,  4,  1,  0]])
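
Boolean indexing such as `X[mask]` returns a copy of the selected rows, so a tree builder that wants to avoid copies can keep only row indices per node and split the index array instead. The sketch below shows one way this might look; the names `node_idx`, `left_idx`, and `right_idx` are assumptions, and the array values are hard-coded from the output above so the result is reproducible.

```python
import numpy as np

# Same values as the example array above, hard-coded for reproducibility.
X = np.array([[ 5, 18,  7, 10],
              [15, 13,  2, 16],
              [12,  5,  6,  2],
              [18,  4,  1,  0]])

f = 1                  # second feature
split_value = X[1, f]  # value of the second sample on that feature (13)

# The row indices a node currently owns; the root owns every row.
node_idx = np.arange(X.shape[0])

# Compare only the rows this node owns, then split the index array itself.
mask = X[node_idx, f] < split_value
left_idx = node_idx[mask]    # rows with feature f <  split_value
right_idx = node_idx[~mask]  # rows with feature f >= split_value

print(left_idx)   # [2 3]
print(right_idx)  # [0 1]
```

Child nodes can repeat the same step on `left_idx` and `right_idx`, so `X` itself is only ever read, never duplicated.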

## Determining the best split

To determine the best split, we combine a function that computes the sum of squared errors for each possible split of the data with NumPy's `argmin()` function.

We start with a dataset $X$. For each sample $\vec{x} \in X$, we split the data into two subsets $X_l$ and $X_r$ using our partitioning method and compute the sum
$$
\sum_{i=1}^{n_{l}}{ (y_{i} - \bar{y}_{l})^{2} } + \sum_{j=1}^{n_{r}}{ (y_{j} - \bar{y}_{r})^{2} }
$$
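Here $n_{l}$ and $n_{r}$ are the numbers of samples in $X_l$ and $X_r$, and $\bar{y}_{l}$ and $\bar{y}_{r}$ are the mean target values of the two subsets. Applying `argmin()` to the array of these sums returns the index of the sample that minimizes the sum.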
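
A minimal sketch of this search for a single, fixed feature is shown below. The function names `split_sse` and `best_split_sample` are hypothetical, and treating an empty subset as contributing 0 to the sum is an assumption of this sketch rather than something stated above.

```python
import numpy as np


def split_sse(X, y, f, s):
    """Sum of squared errors after splitting on feature f at sample s."""
    mask = X[:, f] < X[s, f]
    y_l, y_r = y[mask], y[~mask]
    # Assumption: an empty side contributes nothing to the sum.
    sse_l = np.sum((y_l - y_l.mean()) ** 2) if y_l.size else 0.0
    sse_r = np.sum((y_r - y_r.mean()) ** 2) if y_r.size else 0.0
    return float(sse_l + sse_r)


def best_split_sample(X, y, f):
    """Index of the sample whose value on feature f gives the lowest sum."""
    scores = np.array([split_sse(X, y, f, s) for s in range(X.shape[0])])
    return int(np.argmin(scores)), scores


# Toy data: the targets jump from 1 to 5 between the second and third samples.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([1.0, 1.0, 5.0, 5.0])
best, scores = best_split_sample(X_demo, y_demo, f=0)
print(best)  # 2
```

Splitting at the third sample (index 2) puts the two targets equal to 1 on the left and the two equal to 5 on the right, so both sums are 0. Searching over every feature as well would mean repeating this loop for each `f` and keeping the overall minimum.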