# Lecture worksheet 20 solutions

## Question 1

### Q1.1 The Two Cultures

What did you think of the assigned reading? Did you find Breiman's arguments in favor of "algorithmic modeling" persuasive? If you read any of the responses, did you find them persuasive?

*Please see the discussion video recording for Ramesh and Yan Shuo's thoughts on the paper.*

### Q1.2 True/False

For each of the following, specify whether it's true or false. 

*Optional: If it's False, explain why.*

(a) All nonparametric models are hard to interpret.

**False: some nonparametric models (e.g., small decision trees) are very easy to interpret.**

(b) Nonparametric methods are always faster than parametric methods.

**False: both parametric methods and nonparametric methods can be slow. For example, a large, complex (parametric) Bayesian model could be much slower than a simple (nonparametric) decision tree. On the other hand, (parametric) linear regression could be much faster than fitting a large (nonparametric) random forest.**

(c) The term "nonparametric" can mean either: methods that don't make assumptions about the distribution of the data/parameters; **_or_** methods where the number of parameters (e.g., coefficients) is infinite or increases with the number of data points.

**True**

(d) Decision trees can only be used for classification.

**False: decision trees can be used for regression or classification.**

(e) Random forests use a collection of decision trees where each tree is trained on fewer data points.

**False: each tree is trained on a _bootstrap sample_ of the original data set, which samples the same number of points with replacement.**
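
To make the answer to (e) concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy. The seed and the value of `N` are arbitrary choices for illustration and are not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

N = 10_000
indices = np.arange(N)

# A bootstrap sample draws N indices *with replacement* from the N originals.
bootstrap = rng.choice(indices, size=N, replace=True)

# The sample has the same size as the original data set...
sample_size = len(bootstrap)

# ...but some points are repeated and others are left out entirely.
unique_fraction = len(np.unique(bootstrap)) / N

print(f"Bootstrap sample size: {sample_size}")
print(f"Fraction of distinct original points: {unique_fraction:.3f}")
```

For large `N`, the fraction of distinct points should come out near $1 - 1/e \approx 0.632$: each tree sees the same *number* of points, but not the same *set* of points.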

## Question 2: Implementing (part of) a random forest

Fill in the blanks in the following code that implements a random forest for regression, based on what was discussed in lecture. Assume the rest of the code is already written (including the parts that set up the instance variables, etc.).

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class RandomForestRegressor:
    
    def fit(self, X, y):
        """Fits the model
        
        X: array, num_pts x num_features
        y: array, num_pts
        """
        # A list of all the trees in the forest
        self.trees = []
        
        # Loop over the trees in the forest
        for i in range(self.n_estimators):
            tree = DecisionTreeRegressor(...) # No need to fill this in
            
            ### SOLUTION
            N, K = X.shape
            
            data_indices = np.arange(N)
            feature_indices = np.arange(K)
            
            rows = np.random.choice(data_indices, N, replace=True)
            columns = np.random.choice(feature_indices, K//3, replace=False)
            
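            # np.ix_ selects every (row, column) combination; X[rows, columns]
            # would try to pair the two index arrays elementwise and fail
            X_for_tree = X[np.ix_(rows, columns)]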
            y_for_tree = y[rows] 
            ### END SOLUTION
            
            tree.fit(X_for_tree, y_for_tree)
            self.trees.append(tree)
            
    def predict(self, X):
        # Array that will hold the sum of the predictions from all trees
        prediction_sum = np.zeros(X.shape[0])
        
        # Loop over the trees in the forest
        for tree in self.trees:
        ### SOLUTION
            one_tree_prediction = tree.predict(X)
            prediction_sum += one_tree_prediction
        return prediction_sum / len(self.trees)
        ### END SOLUTION
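
One practical detail when trees are fit on a subset of columns: a scikit-learn tree expects the same columns at prediction time that it saw during fitting. The sketch below is one possible way to handle that, by remembering each tree's column indices. The class name `SketchForestRegressor`, the `feature_subsets` attribute, and the constructor arguments are hypothetical names introduced for illustration. They are not the instance variables assumed by the worksheet.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class SketchForestRegressor:
    """Illustrative only: names and defaults here are assumptions."""

    def __init__(self, n_estimators=25, seed=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        N, K = X.shape
        n_cols = max(1, K // 3)  # keep at least one feature per tree
        self.trees = []
        self.feature_subsets = []
        for _ in range(self.n_estimators):
            rows = self.rng.choice(N, size=N, replace=True)        # bootstrap rows
            cols = self.rng.choice(K, size=n_cols, replace=False)  # feature subset
            tree = DecisionTreeRegressor()
            tree.fit(X[np.ix_(rows, cols)], y[rows])
            self.trees.append(tree)
            self.feature_subsets.append(cols)
        return self

    def predict(self, X):
        prediction_sum = np.zeros(X.shape[0])
        # Each tree only sees the columns it was trained on
        for tree, cols in zip(self.trees, self.feature_subsets):
            prediction_sum += tree.predict(X[:, cols])
        return prediction_sum / len(self.trees)


# Small synthetic check (the data are made up for illustration)
rng = np.random.default_rng(1)
X_demo = rng.uniform(-1, 1, size=(200, 6))
y_demo = X_demo[:, 0] + 2 * X_demo[:, 1]

forest = SketchForestRegressor(n_estimators=10).fit(X_demo, y_demo)
preds = forest.predict(X_demo)
print(preds.shape)
```

Storing the column indices next to each tree keeps `predict` a simple loop, at the cost of one extra list on the instance.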

## Question 3: Trees, Forests, Bias, and Variance

Recall from much earlier in the semester that we can write the expected squared loss (i.e., the risk) of a decision $\delta(x)$ and a true value of the parameter $\theta$ as

$$
E[(\delta(x) - \theta)^2] = \underbrace{E\left[\left(\delta(x) - E[\delta(x)]\right)^2\right]}_{\text{variance of } \delta(x)} + \underbrace{(E[\delta(x)] - \theta)^2}_{\text{bias}^2}
$$

If our decision is a prediction for $y$ that we call $\hat{y}$, then $\theta = y$ and $\delta(x) = \hat{y}(x)$:

$$
E[(\hat{y}(x) - y)^2] = \underbrace{E\left[\left(\hat{y}(x) - E[\hat{y}(x)]\right)^2\right]}_{\text{variance of prediction } \hat{y}(x)} + \underbrace{(E[\hat{y}(x)] - y)^2}_{\text{bias}^2}
$$
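
The decomposition can be checked numerically. The following is a minimal simulation sketch, not part of the original worksheet: the value of `theta`, the sample size, and the shrunken-mean estimator are all made-up choices, picked only so that both the bias and the variance are nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 2.0                # "true" parameter (made-up value)
n, n_trials = 10, 100_000  # points per data set, number of simulated data sets

# Each row is one simulated data set x. The decision delta(x) is a shrunken
# sample mean, so it is biased toward zero and still varies with the data.
x = rng.normal(loc=theta, scale=1.0, size=(n_trials, n))
delta = 0.8 * x.mean(axis=1)

risk = np.mean((delta - theta) ** 2)     # E[(delta - theta)^2]
variance = np.var(delta)                 # E[(delta - E[delta])^2]
bias_sq = (np.mean(delta) - theta) ** 2  # (E[delta] - theta)^2

print(f"risk              = {risk:.4f}")
print(f"variance + bias^2 = {variance + bias_sq:.4f}")
```

The two printed numbers agree up to floating-point error, because the identity holds exactly for empirical averages as well as for expectations.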

Fill in each blank in the following statement with either "bias" or "variance", and explain:

**Decision trees (with no limit on depth) have high __variance__, and low __bias__. By averaging them in a random forest, we lower the variance.**

#### SOLUTION

* Decision trees have high variance: small changes in the training data can produce a very different tree, and therefore very different predictions.
* Decision trees have low bias: averaged over many training sets, the prediction of a deep tree ends up pretty close to the true value. This is because decision trees can capture complex structure in the data.



Here's an example that illustrates that. The first two cells are duplicated from the lecture notes:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.tree import DecisionTreeClassifier

In [None]:
### This cell is exactly the same as the version from the notes
N_test = 500
N_train = 150

np.random.seed(2021)

# Create a training dataset
x1_train = np.random.uniform(-1, 1, N_train)
x2_train = np.random.uniform(-1, 1, N_train)

y_train = (x1_train * x2_train > 0).astype(np.int64)


# Create a feature matrix that we can use for classification
X_train = np.vstack([x1_train, x2_train]).transpose()


# Create a test dataset
x1_test = np.random.uniform(-3, 3, N_test)
x2_test = np.random.uniform(-3, 3, N_test)

y_test = (x1_test * x2_test > 0).astype(np.int64)


# Create a feature matrix that we can use to evaluate
X_test = np.vstack([x1_test, x2_test]).transpose()

def draw_results(x1, x2, color, plot_title=''):
    plt.figure()
    plt.scatter(x1, x2, c=color, cmap='viridis', alpha=0.7);
    plt.colorbar()
    plt.title(plot_title)
    plt.axis('equal')
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')
    plt.tight_layout()
    
draw_results(
    x1_train, x2_train, color=y_train,
    plot_title="Training data"
)

This cell randomly flips a few of the $y$-values in the training dataset and then trains a decision tree. Try running the cell several times and look at the results. The `flip_p` parameter, which controls what proportion of the $y$-values get flipped, is initially 0. 

Try running the cell a few times with `flip_p = 0`. How well does the decision tree do when there's no noise in the training dataset?

In [None]:
flip_p = 0.0  # probability of flipping y for training

flipped_y = y_train.copy()
flips = np.random.random(N_train) < flip_p
flipped_y[flips] = 1 - flipped_y[flips]

tree = DecisionTreeClassifier()
tree.fit(X_train, flipped_y)

probs = tree.predict_proba(X_test)[:, 1]
y_hat = (probs > 0.5).astype(np.int64)

draw_results(
    x1_train, x2_train, color=flipped_y,
    plot_title="Training data (with noise)"
)

draw_results(
    x1_test, x2_test, color=probs, 
    plot_title="Predicted probability of y=1"
)

accuracy = np.mean(y_test == y_hat)
print(f"Accuracy on test set: {accuracy}")

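Now, try setting `flip_p` to 0.1 and rerun the cell several times to see how the results change. Because a different random set of labels is flipped on each run, the training data vary from run to run, and you should observe a lot of variance in the predictions $\hat{y}(x)$.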
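
To put a number on that variance, one option is to repeat the noisy fit many times and measure how much the predicted probability at each test point changes between runs. The sketch below does this for a single tree and for scikit-learn's `RandomForestClassifier`. It regenerates the data so that it runs on its own. The helper `prediction_variance`, the number of runs, and the forest size are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2021)
N_train, N_test, flip_p, n_runs = 150, 500, 0.1, 20

# Same data-generating process as the cells above, regenerated here
X_train = rng.uniform(-1, 1, size=(N_train, 2))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(np.int64)
X_test = rng.uniform(-3, 3, size=(N_test, 2))


def prediction_variance(make_model):
    """Variance of P(y=1) across noisy refits, averaged over test points."""
    probs = np.empty((n_runs, N_test))
    for run in range(n_runs):
        # Flip a fresh random subset of the training labels on every run
        flipped_y = y_train.copy()
        flips = rng.random(N_train) < flip_p
        flipped_y[flips] = 1 - flipped_y[flips]

        model = make_model(run).fit(X_train, flipped_y)
        probs[run] = model.predict_proba(X_test)[:, 1]
    return probs.var(axis=0).mean()


tree_var = prediction_variance(
    lambda seed: DecisionTreeClassifier(random_state=seed)
)
forest_var = prediction_variance(
    lambda seed: RandomForestClassifier(n_estimators=100, random_state=seed)
)

print(f"Single tree:   average prediction variance = {tree_var:.4f}")
print(f"Random forest: average prediction variance = {forest_var:.4f}")
```

Under these assumptions the forest's predictions should vary noticeably less from run to run than the single tree's, which is the variance reduction described in the solution to Question 3.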