In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab07.ipynb")

# Lab 7: Nonparametric methods

The goal of this lab is to explore and interpret several nonparametric methods for regression.

The code you need to write is indicated with `...`. There is additional documentation for each part as you go along.

In preparation for this lab, we recommend that you review the slides and demos from Lectures 13 and 14.

## Collaboration Policy
You can submit the lab in pairs (groups of two, no more than two). **If you choose to work in a pair, please make sure to add your group member on Gradescope for both the written and code submissions.**

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually and do not share your code with anyone other than your partner**. If you do discuss the assignments with people other than your partner, please **include their names** in the cell below.

`<Collaborator Name> <Collaborator e-mail>`

## Submission
**For full credit, this assignment should be completed and submitted before Wednesday, Mar 12th, 2025 at 05:00 PM PST.**

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns

from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

sns.set()

import hashlib
def get_hash(num, significance=4):
    """Helper function for assessing correctness."""
    # Round first so tiny floating-point differences don't change the hash
    num = round(num, significance)
    return hashlib.md5(str(num).encode()).hexdigest()

# 1. Model comparison

In this lab, we'll be working with the hybrid car dataset ([which you may remember from Data 8](https://inferentialthinking.com/chapters/15/1/Correlation.html)).

It contains data on 153 different models of hybrid car from 1997 to 2013, with the price (`msrp`), gas mileage (`mpg`), type of car (`class`), and how fast the car accelerates in km/hour/second (`acceleration`).

We're going to try to predict the price using other features of the car.

In [None]:
hybrid = pd.read_csv('hybrid.csv')

X_cols = ["year", "acceleration", "mpg"] # Columns used for prediction
y_col = "msrp" # The column we're trying to predict

hybrid

This cell generates all pairs of scatterplots for numerical variables in the data. You should see the same trends discussed in the chapter of the Data 8 textbook linked above.

In [None]:
sns.pairplot(hybrid)

### 1(a) Splitting the data

We'll start by splitting the data into training and test sets. Use the scikit-learn function `train_test_split` to make two dataframes called `train` and `test`. The test set should have $30\%$ of the data (46 rows).

The `train_test_split` function has an argument called `random_state` that lets you ensure that it uses the same random split every time: you should set that argument to `101` to pass the tests.
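
As a minimal sketch of how `test_size` and `random_state` behave, the example below splits a small made-up dataframe (the `toy` data is hypothetical and not part of the lab) twice with the same seed and checks that both splits agree:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe standing in for a real dataset
toy = pd.DataFrame({"x": range(10), "y": [2 * v for v in range(10)]})

# test_size is the fraction of rows held out; random_state fixes the shuffle
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)
toy_train_again, toy_test_again = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))
print(toy_test.equals(toy_test_again))  # same seed, same split
```

Changing `random_state` would generally produce a different split, which is why the tests here depend on a specific value.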

In [None]:
from sklearn.model_selection import train_test_split

train, test = ...

In [None]:
grader.check("q1a")

### 1(b) Predicting the output

#### 1.b.(i) Linear regression

Use linear regression to predict the MSRP from year, acceleration, and MPG. Add a new column to the `train` and `test` dataframes called `linear_pred` with the predictions from linear regression.

*Hint: throughout this lab, you should use the default values of all parameters for all models we're experimenting with.*

*Hint: for this lab, you don't need to worry about pandas warnings about setting a value on a copy of a slice.*

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

...

train["linear_pred"] = ...
test["linear_pred"] = ...

In [None]:
grader.check("q1b_i")

Run the following cell that computes the training set error and test set error.

In [None]:
train_rmse = np.mean((train["linear_pred"] - train["msrp"]) ** 2) ** 0.5
test_rmse = np.mean((test["linear_pred"] - test["msrp"]) ** 2) ** 0.5

print("Training set error for linear model:", train_rmse)
print("Test set error for linear model:    ", test_rmse)

#### 1.b.(ii) Decision Trees

Recall that a decision tree is a method for classification and regression that uses a tree-like structure to decide what value to predict for a point.


In this question, we'll use a decision tree for regression instead of classification. When we built a decision tree for classification in lecture, we made decisions about splitting based on how homogeneous the $y$-values were. Now, we'll instead make splits based on the residuals for predicting at that node. 

Let's look at an example, assuming we're using mean squared error as our loss. If we make our first split based on whether or not `mpg <= M`, we'll have some average MSRP for the low-MPG cars (at or below `M`), along with the residuals we'd get if we used that average to predict the MSRP for all the low-MPG cars. We have the same information for the high-MPG cars (above `M`). A good value of `M` will make the mean squared residuals for the two groups as small as possible. So, at each node, we choose the split that makes the mean squared error on each side as small as possible.
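
The search for a good `M` can be sketched by hand on a single feature. The code below is an illustration under simplifying assumptions, not scikit-learn's implementation: the helper names `split_cost` and `best_threshold` and the toy numbers are made up, and the cost is the total squared residual across both sides when each side predicts its own mean.

```python
import numpy as np

def split_cost(x, y, threshold):
    """Total squared residual when each side of the split predicts its own mean."""
    sides = (y[x <= threshold], y[x > threshold])
    return sum(((side - side.mean()) ** 2).sum() for side in sides if len(side) > 0)

def best_threshold(x, y):
    """Try the midpoints between consecutive distinct x values and keep the cheapest."""
    values = np.unique(x)  # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2
    costs = [split_cost(x, y, t) for t in candidates]
    return candidates[int(np.argmin(costs))]

# Hypothetical data: low-MPG cars are expensive, high-MPG cars are cheap
toy_mpg = np.array([20.0, 25.0, 30.0, 45.0, 50.0, 55.0])
toy_msrp = np.array([60000.0, 58000.0, 62000.0, 25000.0, 24000.0, 26000.0])

print(best_threshold(toy_mpg, toy_msrp))
```

On this toy data the cheapest split falls between the two price clusters, because any other threshold mixes expensive and cheap cars on one side and inflates the residuals.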

Compute the prediction from a decision tree with the default parameters for scikit-learn's `DecisionTreeRegressor` (i.e., no limit on tree depth). Add a new column to the `train` and `test` dataframes called `tree_pred` with the predictions from the decision tree.

*Hint: your code should look very similar to your answer from 1.b.(i).*

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor()

...

train["tree_pred"] = ...
test["tree_pred"] = ...

In [None]:
grader.check("q1b_ii")

Run the following cell that computes the training set error and test set error.

In [None]:
train_rmse = np.mean((train["tree_pred"] - train["msrp"]) ** 2) ** 0.5
test_rmse = np.mean((test["tree_pred"] - test["msrp"]) ** 2) ** 0.5

print("Training set error for decision tree:", train_rmse)
print("Test set error for decision tree:    ", test_rmse)

#### 1.b.(iii) Random Forest

Recall that a random forest is the combination of a large number of decision trees.
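
To make "combination" concrete: for regression, scikit-learn's forest prediction is the average of its trees' predictions. The sketch below checks this on synthetic data (the `X_toy` and `y_toy` arrays are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, unrelated to the hybrid car dataset
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(60, 3))
y_toy = X_toy[:, 0] + 2 * X_toy[:, 1] + rng.normal(0, 0.1, size=60)

toy_forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Predict with every fitted tree, then average across trees
per_tree = np.array([tree.predict(X_toy) for tree in toy_forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, toy_forest.predict(X_toy)))
```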

Compute the prediction from a random forest using scikit-learn's `RandomForestRegressor`, with the following parameters:
* 100 trees (default)
* no limit on each tree's depth (default)
* Use the `max_features` parameter so that only one feature is considered at each split. (*The recommended value for random forest regression is to consider 1/3 of the features at each split, and in this case we have 3 features.*)
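
In scikit-learn, `max_features` limits how many features are considered when searching for each individual split, so a single tree can still end up using several features overall. A small sketch on synthetic data (all names and values here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with three features
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 1, size=(80, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5])

toy_forest = RandomForestRegressor(max_features=1, random_state=0).fit(X_toy, y_toy)
first_tree = toy_forest.estimators_[0]

print(len(toy_forest.estimators_))   # number of trees (default)
print(first_tree.max_features_)      # features considered per split

# Features used anywhere in the first tree; leaf nodes are marked with a negative index
used = {int(f) for f in first_tree.tree_.feature if f >= 0}
print(sorted(used))
```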

Add a new column to the `train` and `test` dataframes called `forest_pred` with the predictions from the random forest.

*Hint: your code should look very similar to your answers from 1.b.(i) and 1.b.(ii).*

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(max_features=1)

...

train["forest_pred"] = ...
test["forest_pred"] = ...

In [None]:
grader.check("q1b_iii")

Run the following cell that computes the training set error and test set error.

In [None]:
train_rmse = np.mean((train["forest_pred"] - train["msrp"]) ** 2) ** 0.5
test_rmse = np.mean((test["forest_pred"] - test["msrp"]) ** 2) ** 0.5

print("Training set error for random forest:", train_rmse)
print("Test set error for random forest:    ", test_rmse)

<!-- BEGIN QUESTION -->

### 1(c) Accuracy comparison

Of the Decision Tree model and the Random Forest model, which one does best on the training set? Why? Which model does best on the test set? Why?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 1.d Interpretability

#### 1.d.i Linear Regression

Let's look at the coefficients from the linear regression model:

In [None]:
# You can just run this cell to print out the coefficients for each feature:
print(X_cols)
linear_model.coef_

<!-- BEGIN QUESTION -->

Using this result, fill in the blanks in the following two statements:

"Each year, linear regression predicts that the average price changes by $\$ \rule{1cm}{0.15mm}$ ". (Your answer should be a positive or negative number)

"Linear regression predicts that cars with better gas mileage are $\rule{1cm}{0.15mm}$ expensive." (Your answer should be either 'more' or 'less')

_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### 1.d.ii Decision trees

We'll use the `plot_tree` function to draw the decision tree:

In [None]:
from sklearn.tree import plot_tree
plt.figure(figsize=(16, 4))
plot_tree(tree_model);

We can see that the tree is quite deep and complex. Let's take a closer look at the nodes at the top:

In [None]:
plt.figure(figsize=(12, 7))
plot_tree(tree_model, max_depth=2, fontsize=14, feature_names=X_cols);

There are a few things we can see right away:


* The first line tells us which feature to split on: values at or below the threshold go to the left, and values above it go to the right.
* The third line tells us the number of training samples that made it that far into the tree.
* The fourth line tells us the average $y$-value (in this case, MSRP) of all the training samples that made it that far into the tree.
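
The same quantities can be read programmatically from a fitted tree. The sketch below fits a depth-1 tree on made-up data (the `stump` model and its inputs are hypothetical) and prints the root node's threshold, sample count, and mean $y$-value, followed by a text rendering of the tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Two well-separated clusters, so the root split is easy to predict by eye
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, y_toy)

root = 0  # node 0 is always the root
print("threshold:", stump.tree_.threshold[root])
print("samples:  ", stump.tree_.n_node_samples[root])
print("mean y:   ", stump.tree_.value[root, 0, 0])

# Text version of the tree diagram
print(export_text(stump, feature_names=["x"]))
```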

<!-- BEGIN QUESTION -->

Just by looking at the first few layers, we can already see that the decision tree has pulled out the most expensive cars into some of the branches, and the less expensive ones into other branches.

Suppose we had stopped growing the tree at this point. That would have given us four leaf nodes, each with very different mean MSRP. Describe the node that contains the most expensive cars in plain English.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### 1.d.iii Random Forests

Unfortunately, random forests are much harder to interpret than either of the other two methods that we've tried. In this case, with so few features, we might be able to look at the top of each tree and find similarities across most or all of the trees, but in high-dimensional problems, each tree should see a very different subset of features, and this becomes much harder.

# 2. Explanations

Many methods for explainable ML use the following setup to explain a specific prediction from a complex model:

1. Construct a simpler, easier-to-explain model (e.g., linear regression, decision tree, etc.) that behaves similarly to the complex model for data points near the specific point we're trying to explain.
2. Interpret the simpler model.
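
The two steps above can be sketched as follows. This is a simplified illustration, not a specific published method: `complex_model` is a hypothetical stand-in for a hard-to-interpret model, and the neighborhood size (`0.1`) and sample count are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def complex_model(X):
    """Hypothetical stand-in for a model that is hard to interpret directly."""
    return X[:, 0] ** 2 + 3 * X[:, 1]

rng = np.random.default_rng(0)
point = np.array([2.0, 1.0])  # the prediction we want to explain

# Step 1: sample points near `point` and fit a simple model to the
# complex model's predictions on those points
neighbors = point + rng.normal(0, 0.1, size=(500, 2))
surrogate = LinearRegression().fit(neighbors, complex_model(neighbors))

# Step 2: read the surrogate's coefficients as local slopes
print(surrogate.coef_)
```

Near this point, the surrogate's first coefficient should be close to the local slope of $x_1^2$ (about $2 \times 2 = 4$) and the second close to 3, even though the complex model is not linear overall.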

In this question, we'll try to see if we can come up with an explanation for the worst predictions from each model.

### 2.a Linear Regression

#### 2.a.i Finding the worst predictions

Find the two cars in the test set where linear regression does the worst (i.e., has the highest absolute error). Your answer should be a dataframe with two rows from `test`, one for the worst error followed by one for the second-worst error. While solving this problem, you can add extra columns to `test` if you wish.
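
As a rough illustration of ranking rows by absolute error, here is a sketch on a tiny made-up dataframe (the column names `actual`, `pred`, and `abs_error` are hypothetical, not the lab's):

```python
import pandas as pd

# Hypothetical predictions and true values
toy = pd.DataFrame({
    "actual": [10.0, 20.0, 30.0, 40.0],
    "pred":   [12.0, 11.0, 30.5, 55.0],
})

# Absolute error per row, then the two largest, worst first
toy["abs_error"] = (toy["pred"] - toy["actual"]).abs()
worst_two = toy.sort_values("abs_error", ascending=False).head(2)

print(worst_two)
```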


In [None]:
# Hint: There are a couple of ways to do this. One suggestion is to first create a new column with the
# absolute linear error and then sort the dataframe based on this absolute linear error.

...
worst_linear_predicted_cars_df = ...

In [None]:
grader.check("q2a_i")

In [None]:
worst_linear_predicted_cars_df

<!-- BEGIN QUESTION -->

#### 2.a.ii Explanation

Using the coefficients of the linear model that we found earlier, explain why linear regression's predictions for these two cars were the way they were. Is the explanation from the linear model consistent with the trends you observed at the beginning in the visualizations?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 2.b Random Forests

#### 2.b.i Finding the worst predictions for the random forest

Find the two cars in the test set where random forest does the worst (i.e., has the highest absolute error). Your answer should be a dataframe with two rows from `test`, one for each of the two cars. You can add extra columns to `test` if you wish.

*Hint: your code should be very similar to your code for 2.a.i.*

In [None]:
...
worst_forest_predicted_cars_df = ...

In [None]:
grader.check("q2b_i")

In [None]:
worst_forest_predicted_cars_df

# 3. Feature Engineering and Interpretability

In this question, we will be exploring the effect of feature engineering on the interpretability of a given model using a toy dataset. Let's start by loading the data, which has already been split for you into train and test sets:

In [None]:
# This is the same plotting function from lecture

def draw_results(x1, x2, color, plot_title=''):
    plt.figure()
    plt.scatter(x1, x2, c=color, cmap='viridis', alpha=0.7);
    plt.colorbar()
    plt.title(plot_title)
    plt.axis('equal')
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')
    plt.tight_layout()

In [None]:
# Import datasets
ring_train = pd.read_csv('ring_train.csv')
ring_test = pd.read_csv('ring_test.csv')
draw_results(ring_train['x1'], ring_train['x2'], color=ring_train['y'])

We know from lecture that without any additional features, logistic regression will use a line as a decision boundary. 
Where would you draw the best line to classify these points? (No need to answer, but please think about it.)
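
To see why the boundary is a line: the model predicts class 1 exactly when the linear score $w_1 x_1 + w_2 x_2 + b$ is positive. The sketch below checks this on synthetic, linearly separable data (the data and the names `toy_model`, `X_toy`, `y_toy` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data whose true boundary is the line x1 + x2 = 0
rng = np.random.default_rng(0)
X_toy = rng.normal(0, 1, size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Recompute the linear score by hand from the fitted parameters
w1, w2 = toy_model.coef_[0]
b = toy_model.intercept_[0]
linear_score = w1 * X_toy[:, 0] + w2 * X_toy[:, 1] + b

# Predictions match the sign of the linear score
print(np.array_equal(toy_model.predict(X_toy), (linear_score > 0).astype(int)))
```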

We are now going to fit a simple logistic regression model on the data using the following lines of code. 

In [None]:
# No need to write any code here: just understand.
X_train = ring_train[['x1', 'x2']].values
y_train = ring_train['y'].values

X_test = ring_test[['x1', 'x2']].values
y_test = ring_test['y'].values

In [None]:
# No need to write any code here: just understand.
from sklearn.linear_model import LogisticRegression

X_train = ring_train[['x1', 'x2']].values
y_train = ring_train['y'].values

X_test = ring_test[['x1', 'x2']].values
y_test = ring_test['y'].values

# penalty=None disables regularization (scikit-learn >= 1.2; older versions use penalty='none')
model_simple_features = LogisticRegression(
    penalty=None, solver='lbfgs'
)

model_simple_features.fit(X_train, y_train)

probs = model_simple_features.predict_proba(X_test)[:, 1]
y_hat = (probs > 0.5).astype(np.int64)

draw_results(
    X_test[:, 0], X_test[:, 1], color=probs, 
    plot_title="Logistic regression predicted probs (no feature eng)"
)
draw_results(
    X_test[:, 0], X_test[:, 1], color=y_hat, 
    plot_title="Logistic regression prediction (no feature eng)"
)

accuracy = np.mean(y_test == y_hat)
print(f"Accuracy on test set: {accuracy}")

Comparing the labels classified by the model with the true labels, we notice that the simple logistic regression model is not ideal partly because the true decision boundary is nonlinear. In order to improve on the model, we will engineer new features. 

With the checkerboard dataset, we engineered a new feature, $x_1 \times x_2$, which was just what we needed. For this dataset, the feature that we need is a little more complicated.

Instead of designing it by hand, we'll take inspiration from neural networks and add many random features. Each one is a random linear combination of the inputs, with coefficients drawn at random between -1 and 1.

Don't forget that we also need to apply a nonlinearity, or else the linear combinations won't help us when applying logistic regression. In this example, we'll use the sigmoid function. For example, one feature might be $\sigma(-0.37x_1 + 0.82x_2)$.
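
One way to see why the nonlinearity matters is to compare matrix ranks: purely linear combinations of $x_1$ and $x_2$ add no new directions to the feature matrix, while their sigmoids do. The sketch below uses synthetic inputs, and all variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X_toy = rng.normal(0, 2, size=(50, 2))     # synthetic (x1, x2) pairs
coeffs = rng.uniform(-1, 1, size=(2, 5))   # five random (c1, c2) pairs, one per column

linear_feats = X_toy @ coeffs              # c1*x1 + c2*x2 for each pair

# Append the new features to the original two columns
linear_aug = np.hstack([X_toy, linear_feats])
sigmoid_aug = np.hstack([X_toy, sigmoid(linear_feats)])

print(np.linalg.matrix_rank(linear_aug))   # no new information without a nonlinearity
print(np.linalg.matrix_rank(sigmoid_aug))  # the sigmoid features add new directions
```

Because the linear-only columns are exact combinations of $x_1$ and $x_2$, logistic regression on them can represent only the same linear boundaries as before.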

### 3.a Add random features

Complete the cell below to add random features to the dataset. As described above, we first generate a pair of coefficients $(c_1, c_2)$ uniformly at random from $(-1, 1)$, and then, for both the training set and the test set, add a column whose values are $\sigma(c_1 x_1 + c_2 x_2)$.

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def add_random_feature(train_data, test_data):
    # Returns the modified train_data and test_data
    coeffs = ...
    # This code gives the feature a convenient name
    feat_name = f"σ({coeffs[0]:0.2f}x1 + {coeffs[1]:0.2f}x2)"

    for dataset in (train_data, test_data):
        linear_combination = np.dot(dataset[['x1', 'x2']], coeffs)
        feature = ...
        dataset[feat_name] = feature
    return train_data, test_data

In [None]:
grader.check("q3a")

Using the code you completed in 3a, we can now add 10 random features to both the training set and the test set using the following code:

In [None]:
# This cell uses the code you wrote to add 10 random features to the
# train and test sets.

ring_train_feats = ring_train.copy()
ring_test_feats = ring_test.copy()
for i in range(10):
    ring_train_feats, ring_test_feats = (
        add_random_feature(ring_train_feats, ring_test_feats)
    )
ring_train_feats.head()

We can now train a new logistic regression model with these 10 additional features.

In [None]:
# No need to write any code here: just understand.
X_train = ring_train_feats.iloc[:, 1:].values
y_train = ring_train_feats['y'].values

X_test = ring_test_feats.iloc[:, 1:].values
y_test = ring_test_feats['y'].values

# penalty=None disables regularization (scikit-learn >= 1.2; older versions use penalty='none')
model_features = LogisticRegression(
    penalty=None, solver='lbfgs'
)

model_features.fit(X_train, y_train)

probs = model_features.predict_proba(X_test)[:, 1]
y_hat = (probs > 0.5).astype(np.int64)

draw_results(
    X_test[:, 0], X_test[:, 1], color=probs, 
    plot_title="Logistic regression predicted probs (random features)"
)

draw_results(
    X_test[:, 0], X_test[:, 1], color=y_hat, 
    plot_title="Logistic regression prediction (random features)"
)

accuracy = np.mean(y_test == y_hat)
print(f"Accuracy on test set: {accuracy}")

You should see that the accuracy is already better. Now that our model has improved, let's try to interpret it.

### 3.b Coefficients

Fill in the blanks in the following code that creates a dataframe with the coefficients from the logistic regression model.

*Hint: you may find it helpful to refer to the demo from lecture.*

In [None]:
feature_names = ring_train_feats.columns[1:]
logistic_coeff_vals = ...
len(logistic_coeff_vals)
coefficient_df = pd.DataFrame(
    {'feature': feature_names, 'coefficients': logistic_coeff_vals}
)

coefficient_df

In [None]:
grader.check("q3b")

<!-- BEGIN QUESTION -->

### 3.c Interpretability

This model has better performance on the data but at the same time, it loses some interpretability. Explain why this logistic regression model is harder to interpret than the simpler (and worse-performing) one from earlier.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Congratulations! You have finished Lab 7!

In [None]:
import matplotlib.image as mpimg
from otter.export import export_notebook
from os import path
from IPython.display import display, HTML
export_notebook("lab07.ipynb", filtering=True, pagebreaks=True)
if(path.exists('lab07.pdf')):
    img = mpimg.imread('cute_animal_dot_jpg.jpg')
    imgplot = plt.imshow(img)
    imgplot.axes.get_xaxis().set_visible(False)
    imgplot.axes.get_yaxis().set_visible(False)
    plt.show()
    display(HTML("Download your PDF <a href='lab07.pdf' download>here</a>."))

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)