In [33]:
import numpy as np
from task1 import *

## Understanding partitioning



The goal of partitioning is to avoid sending a copy of the entire dataset to each individual node while building the tree.

Let $X$ denote our dataset and let $\vec{x}$ denote a sample in $X$.

Assume we've found a sample $\vec{x}^{(s)} \in X$ at which to split the data. This means we want to split the data at the sample $\vec{x}^{(s)}$ along the feature $f$. One way to do this is to partition the dataset into two separate datasets $X_{l}$ and $X_{r}$.

We do this as follows. For each sample $\vec{x} \in X$, if $\vec{x}_{f} < \vec{x}^{(s)}_{f}$ then it goes into $X_{l}$; otherwise, it goes into $X_{r}$. This becomes efficient because we can simply store the indices of these samples instead of the samples themselves. In numpy, an easy way to do this is to create a boolean mask for the rows we want in each partition.

The following code is an example of how this works.

In [36]:
# creating an array of random integers
X = np.random.default_rng().integers(20, size=(4, 4))
X

array([[ 5, 18,  7, 10],
       [15, 13,  2, 16],
       [12,  5,  6,  2],
       [18,  4,  1,  0]])

In [37]:
# Partition the data along the second feature, using the second
# sample (second row) as the split point.
f = 1  # second feature (0-based column index)
split_value = X[1, f]  # value of feature f in the second sample
mask = X[:, f] < split_value  # True for rows that belong in the left partition
X_l = X[mask]
X_l

array([[12,  5,  6,  2],
       [18,  4,  1,  0]])
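
The cell above only builds $X_{l}$. As a minimal sketch of the index-based idea described earlier, the snippet below derives both partitions from the same mask and keeps only row indices until the rows are actually needed. It hard-codes the array printed above so the result is reproducible; the names `left_idx` and `right_idx` are illustrative and are not taken from `task1`.

```python
import numpy as np

# Same values as the random array printed above, fixed here for reproducibility.
X = np.array([[5, 18, 7, 10],
              [15, 13, 2, 16],
              [12, 5, 6, 2],
              [18, 4, 1, 0]])

f = 1                   # split along the second feature
split_value = X[1, f]   # feature value of the second sample

# Boolean mask: True where the sample belongs in the left partition.
mask = X[:, f] < split_value

# Store row indices instead of copies of the rows (hypothetical names).
left_idx = np.flatnonzero(mask)
right_idx = np.flatnonzero(~mask)

# Materialize the partitions only when the rows are needed.
X_l = X[left_idx]
X_r = X[right_idx]
```

Because the comparison is strict, the split sample itself (row 1) ends up in $X_{r}$. Every row lands in exactly one of the two index arrays, so a node can hold just its index array and look up rows in the shared `X`.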