# Homework 4 Supplemental Notebook

## DSC 40A, Fall 2021

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from itertools import combinations

### Helper Functions

Here, we'll define several functions that you'll need to use in this notebook. **Don't reinvent the wheel; use the functions that are here!**

In [None]:
def solve_normal_equations(X, y):
    '''Returns the optimal parameter vector, w*, given a design matrix X and observation vector y.'''
    return np.linalg.solve(X.T @ X, X.T @ y)

def create_design_matrix(df, columns, intercept=True):
    '''Creates a design matrix by taking the specified columns from the DataFrame df.
       Adds a column of all 1s as the first column if intercept is True, which is the default.
       The argument columns should be a list.
    '''
    df = df.copy()
    df['1'] = 1
    if intercept:
        return df[['1'] + columns].values
    else:
        return df[columns].values
    
def mean_squared_error(X, y, w):
    '''Returns the mean squared error of the predictions Xw and observations y.'''
    return np.mean((y - X @ w)**2)
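
As a quick sanity check of what these helpers compute, here is a minimal sketch on hypothetical toy data (not the tips dataset). It solves the normal equations directly and compares the result with NumPy's general least-squares solver. The names `x_toy`, `y_toy`, and `X_toy` are made up for this illustration.

```python
import numpy as np

# Hypothetical toy data, not the tips dataset: y = 2 + 3x exactly.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2 + 3 * x_toy

# Design matrix with a leading column of 1s for the intercept.
X_toy = np.column_stack([np.ones_like(x_toy), x_toy])

# Normal equations: solve (X^T X) w = X^T y, as solve_normal_equations does.
w_normal = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# NumPy's least-squares solver, for comparison.
w_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Mean squared error of the fitted rule, as mean_squared_error does.
mse_toy = np.mean((y_toy - X_toy @ w_normal) ** 2)
print(w_normal, w_lstsq, mse_toy)
```

Because the toy data lie exactly on a line, both solvers should recover an intercept of 2 and a slope of 3, with an MSE of essentially 0.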

## Problem 2 – Billy's Back!

**Disclaimer:** While this problem seems quite long, the amount of work you have to do is quite minimal. Most of the code has already been implemented for you; you will generally just need to tweak a few things and interpret the results. You will see the text <a style="color:red"><b>Your Job</b></a> next to each of your action items. You will not have to submit this notebook anywhere; each subpart will specify what you need to include in your PDF writeup.
 

Run the cell below to load in a dataset containing information about the tips Billy, our avocado farmer from Homework 3 and the Midterm, received over the last month as a waiter at Dirty Birds.

In [None]:
tips = pd.read_csv('data/billy_tips.csv')
tips

Each row corresponds to a single table that he served. Throughout this question, our goal will be to predict `tip` using some or all of the other features in the DataFrame.

Let's start by just using `total_bill` to predict `tip`. Here's a scatter plot showing the relationship between the two variables:

In [None]:
px.scatter(tips, x='total_bill', y='tip', title='Tip vs. Total Bill')

The functions defined in the **Helper Functions** section make it easy to fit a linear prediction rule:

In [None]:
X_one_feature = create_design_matrix(tips, ['total_bill'])
y = tips['tip']

# Notice that X_one_feature has two columns
X_one_feature

In [None]:
# Finding w*
w_one_feature = solve_normal_equations(X_one_feature, y)
w_one_feature

We can now use this prediction rule to make predictions:

In [None]:
# Dot product of an augmented feature vector for a total bill of 15 with the optimal parameter vector
np.array([1, 15]) @ w_one_feature

In [None]:
x_range = np.linspace(0, 60)

fig = go.Figure()
fig.add_trace(go.Scatter(x = tips['total_bill'], y = y, mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_one_feature[0] + w_one_feature[1] * x_range, 
                         name = 'linear prediction rule', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title = 'Total Bill', yaxis_title = 'Tip')

The mean squared error of this prediction rule is as follows:

In [None]:
mse_one_feature = mean_squared_error(X_one_feature, y, w_one_feature)
mse_one_feature

We'll define the DataFrame `prediction_rules` solely to keep track of the prediction rules we've used so far along with their MSEs. You will not need to interface with it at all in this assignment.

In [None]:
prediction_rules = pd.DataFrame(index=['total_bill'], columns=['MSE'])
prediction_rules.loc['total_bill'] = mse_one_feature

prediction_rules

### Part A – Making predictions using the single-feature model (2 Points)


Let's suppose Billy works for a day as a waiter at [Nobu San Diego](https://www.noburestaurants.com/sandiego/home/), a very expensive sushi restaurant. He waits a table whose total bill is \$350. He decides to use the above linear prediction rule to predict the tip that he will receive.

<p style="color:red"><b>Your Job</b></p> What tip would the above single-feature model predict for a total bill of 350? Is this prediction likely to be accurate? Why or why not? Report your answers to these questions in your PDF writeup.

In [None]:
# Calculate the prediction here.
... # TODO (Hint: See the example prediction for a total bill of 15 above)

### Part B – Using two features (5 Points)

Now, let's suppose we want to use `total_bill` AND `table_size` to predict `tip`.

<p style="color:red"><b>Your Job</b></p> 

Below, complete the following tasks:

- i. Assign `X_two_features` to the design matrix for this new prediction rule.
- ii. Assign `w_two_features` to the optimal parameter vector for this new prediction rule. Write the resulting vector in your PDF writeup.
- iii. Assign `mse_two_features` to the mean squared error of this new prediction rule. Write the result in your PDF writeup.
- iv. Write the resulting prediction rule as a formula in your PDF writeup, using the numbers you found in task ii. The only variables in your formula should be `total_bill` and `table_size` (or, if you prefer, $x^{(1)}$ and $x^{(2)}$).
- v. Did adding `table_size` as a feature make our prediction rule significantly more accurate as compared to the prediction rule that used just `total_bill`? How can you tell? Write your answer in your PDF writeup.

Tasks i, ii, and iii should each only take one line; remember to use the functions defined for you at the start of the notebook. This subpart should not take very long.

In [None]:
X_two_features = ... # TODO
w_two_features = ... # TODO
w_two_features

In [None]:
mse_two_features = ... # TODO
mse_two_features

In [None]:
# Don't change this cell, just run it
prediction_rules.loc['total_bill, table_size'] = mse_two_features
prediction_rules

If you completed tasks i-iii correctly, you should see a scatter plot of the original data points and your prediction rule below.

In [None]:
XX, YY = np.mgrid[0:60:2, 0:8:2]
Z = w_two_features[0] + w_two_features[1] * XX + w_two_features[2] * YY
plane = go.Surface(x=XX, y=YY, z=Z)

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=tips['total_bill'], 
                           y=tips['table_size'], 
                           z=tips['tip'], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene = dict(
    xaxis_title = 'Total Bill',
    yaxis_title = 'Table Size',
    zaxis_title = 'Tip'))

### Part C – Comparing coefficients (2 Points)

Which feature is more important in predicting tip – `total_bill` or `table_size`?

Assuming you answered Part B correctly, run the cell below to create a standardized design matrix, where the two columns for `total_bill` and `table_size` are standardized to have mean 0 and standard deviation 1.

In [None]:
X_two_features_standardized = X_two_features.copy()
X_two_features_standardized[:, 1:] = (X_two_features[:, 1:] - np.mean(X_two_features[:, 1:], axis=0)) / X_two_features[:, 1:].std(axis=0, ddof=0)
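
To see what this standardization does, here is a small sketch on a hypothetical design matrix (the numbers are made up and `X_demo` is not part of the assignment). It applies the same column-wise formula and checks that each feature column ends up with mean 0 and standard deviation 1, while the column of 1s is left alone.

```python
import numpy as np

# Hypothetical design matrix: a column of 1s plus two made-up features.
X_demo = np.array([[1.0, 10.0, 2.0],
                   [1.0, 20.0, 4.0],
                   [1.0, 30.0, 3.0],
                   [1.0, 40.0, 5.0]])

# Standardize every column except the first (the intercept column).
X_demo_std = X_demo.copy()
features = X_demo[:, 1:]
X_demo_std[:, 1:] = (features - features.mean(axis=0)) / features.std(axis=0, ddof=0)

col_means = X_demo_std[:, 1:].mean(axis=0)
col_stds = X_demo_std[:, 1:].std(axis=0, ddof=0)
print(col_means, col_stds)
```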

<p style="color:red"><b>Your Job</b></p> 

Determine `w_two_features_standardized`, the standardized regression coefficients for our two-feature prediction rule. In your PDF writeup, provide the value of `w_two_features_standardized` as well as which feature is more important for predicting `tip`.

In [None]:
w_two_features_standardized = ... # TODO
w_two_features_standardized

### Part D – Using polynomial features (3 Points)

Let's revisit the scatter plot of tip vs. total bill:

In [None]:
px.scatter(tips, x='total_bill', y='tip', title='Tip vs. Total Bill')

As we did in class, let's see if using higher-degree polynomial features yields a more accurate prediction rule. Specifically, let's try and create a degree 4 polynomial prediction rule, using the features `total_bill`, `total_bill^2`, `total_bill^3`, and `total_bill^4`.

In [None]:
# Making a copy of the tips DataFrame so that we don't modify the original data
tips_with_poly_features = tips.copy()

In [None]:
# Computing total_bill^2
tips_with_poly_features['total_bill^2'] = tips_with_poly_features['total_bill']**2

<p style="color:red"><b>Your Job</b></p>

- i. Add columns `total_bill^3` and `total_bill^4` to the DataFrame `tips_with_poly_features`. Provide a screenshot of your code in your PDF writeup.
- ii. Define `X_poly`, `w_poly`, and `mse_poly` to be the design matrix, optimal parameter vector, and mean squared error of our new 4th degree polynomial prediction rule. In your PDF writeup, include the values of `w_poly` and `mse_poly`.
- iii. Write the resulting prediction rule as a formula in your PDF writeup, using the numbers you found in task ii. The only variable in your formula should be `total_bill`, and powers of it (or $x$, if you prefer).

Again, this subpart should only take a few minutes.

In [None]:
tips_with_poly_features['total_bill^3'] = ... # TODO
tips_with_poly_features['total_bill^4'] = ... # TODO
tips_with_poly_features

In [None]:
X_poly = ... # TODO
w_poly = ... # TODO
w_poly

In [None]:
mse_poly = ... # TODO
mse_poly

In [None]:
# Don't change this cell, just run it
prediction_rules.loc['total_bill 4th degree poly'] = mse_poly
prediction_rules

### Part E – Interpreting the model with polynomial features (2 Points)

Assuming you completed Part D correctly, run the following cell to see a visualization of our 4th degree polynomial prediction rule.

In [None]:
x_range = np.linspace(0, 50)

fig = go.Figure()
fig.add_trace(go.Scatter(x = tips['total_bill'], y = tips['tip'], mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_poly[0] + w_poly[1] * (x_range) + w_poly[2] * (x_range**2) + \
                             w_poly[3] * (x_range**3) + w_poly[4] * (x_range**4),
                         name = '4th degree polynomial prediction rule', 
                         line=dict(color='#F7CF5D', width=5)))

fig.update_layout(xaxis_title = 'Total Bill', yaxis_title = 'Tip')

As you saw, the 4th degree polynomial prediction rule seems to fit the data the best so far, since its MSE is the lowest.

In [None]:
prediction_rules

But let's see what happens when we "zoom out" and look at how this prediction rule behaves.

In [None]:
x_range = np.linspace(0, 70)

fig = go.Figure()
fig.add_trace(go.Scatter(x = tips['total_bill'], y = tips['tip'], mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_poly[0] + w_poly[1] * (x_range) + w_poly[2] * (x_range**2) + \
                             w_poly[3] * (x_range**3) + w_poly[4] * (x_range**4),
                         name = '4th degree polynomial prediction rule', 
                         line=dict(color='#F7CF5D', width=5)))

fig.update_layout(xaxis_title = 'Total Bill', yaxis_title = 'Tip')

Let's again suppose Billy works for a day as a waiter at [Nobu San Diego](https://www.noburestaurants.com/sandiego/home/). He waits a table whose total bill is \$350. He decides to use the above 4th degree polynomial prediction rule to predict the tip that he will receive.

<p style="color:red"><b>Your Job</b></p> What tip would the above polynomial model predict for a total bill of 350? Why is a prediction rule with a lower MSE not necessarily better than a prediction rule with a higher MSE, as is the case here? Report your answers to these questions in your PDF writeup.

In [None]:
# Calculate the prediction here.
... # TODO

### Part F – Using categorical features (4 Points)

There was another column in our original DataFrame, `tips`, that we haven't yet looked at: `day`.

In [None]:
tips

There are three possible values of `day`: `'Thur'`, `'Sat'`, and `'Sun'`.

In [None]:
px.bar(tips.groupby('day').count()['total_bill'].loc[['Thur', 'Sat', 'Sun']])

Note that unlike `total_bill` and `table_size`, `day` is **categorical**. Its values are strings rather than numbers, so we can't place it directly in our design matrix and solve the normal equations.

A naïve solution would be to encode `'Thur'` as 1, `'Sat'` as 2, and `'Sun'` as 3, but this would make it seem like Sunday is "more" than Saturday or Thursday in some regard, which it is not – these are all just different days of the week.

A more robust and common solution is called **one-hot encoding** (OHE). You will be exposed to it in more detail in DSC 80, but we want to show you an example of how it works now since it's a natural extension of what we've already covered.

Let's first get it working on a toy example. Let's pretend we have a DataFrame with just 5 rows and 2 columns, `total_bill` and `day`. Call it `mini_tips`.

In [None]:
# Don't worry about what this code is doing, just run the cell.
mini_tips = pd.DataFrame()
mini_tips['total_bill'] = tips['total_bill'].iloc[:5]
mini_tips['day'] = ['Sat', 'Sun', 'Sun', 'Thur', 'Sat']

mini_tips

When we **one-hot encode** a categorical variable, we create a new column for each unique value of that categorical variable. In this case, we'd create three new columns, one each for `'Thur'`, `'Sat'`, and `'Sun'`.

Each of these new columns is binary, meaning it contains only the values 1 and 0. 
- The new column for `'Thur'`, which we'll call `is_thur`, will contain a 1 for rows where the value of `day` is `'Thur'`, and 0 for all other rows. 
- Similarly, the new columns for `'Sat'` and `'Sun'`, which we'll call `is_sat` and `is_sun`, will contain a 1 for rows where the value of `day` is `'Sat'` or `'Sun'`, respectively, and 0 for all other rows.

Again, you'll see more efficient ways to do this in later courses, but here's one way to one-hot encode using a technique you saw in DSC 10 – boolean comparisons.

In [None]:
(mini_tips['day'] == 'Thur')

Repeating this for all three values of `day`:

In [None]:
mini_tips['is_thur'] = (mini_tips['day'] == 'Thur').astype(int)
mini_tips['is_sat'] = (mini_tips['day'] == 'Sat').astype(int)
mini_tips['is_sun'] = (mini_tips['day'] == 'Sun').astype(int)

# Dropping the day column. We've encoded it numerically, we don't need it anymore.
mini_tips = mini_tips.drop(columns=['day'])

mini_tips

Now we've converted a categorical feature into three numerical features, so we're good to go!
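
For reference, pandas also provides a built-in helper for this, `pd.get_dummies`. The sketch below applies it to a hypothetical toy frame shaped like `mini_tips` (the bill amounts are made up). Note that it names the new columns `day_Sat`, `day_Sun`, and `day_Thur` rather than `is_sat`, `is_sun`, and `is_thur`.

```python
import pandas as pd

# Hypothetical toy frame mirroring the shape of mini_tips (bill values are made up).
toy = pd.DataFrame({
    'total_bill': [12.5, 20.0, 8.75, 31.0, 15.25],
    'day': ['Sat', 'Sun', 'Sun', 'Thur', 'Sat'],
})

# One 0/1 column per unique value of day; new columns are named day_<value>.
toy_ohe = pd.get_dummies(toy, columns=['day'], dtype=int)
print(toy_ohe)
```

The boolean-comparison approach in this notebook and `pd.get_dummies` should produce the same 0/1 values; only the column names differ.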

**There's just one more thing.** Since we're used to fitting linear prediction rules with an intercept term, our design matrix generally has a column of all 1s in it. In the case of `mini_tips`, which contains three binary columns, this would look like:

In [None]:
create_design_matrix(mini_tips, list(mini_tips.columns))

This design matrix contains redundant information! Specifically, we can recreate the column of all 1s by adding together the three one-hot encoded columns:

In [None]:
# Note that the 0, 1, 2, 3, 4 that you see is the index of this Series, which is irrelevant for our purposes
mini_tips['is_thur'] + mini_tips['is_sat'] + mini_tips['is_sun']

What this means is that our design matrix $X$ suffers from multicollinearity, and is not **full rank**. There are multiple nasty side effects of this – there is no unique solution for $\vec{w}^*$ and it makes our optimal parameters more difficult to interpret.

You'll explore this problem in later statistics and data science courses, so don't worry if this is a bit confusing. **For now, know this – the way to avoid this problem is to drop one of the one-hot encoded columns.** That way, there is no redundant information in the design matrix, and we don't run into any issues. (As you'll see later on, this is not "getting rid" of any information, so it will not impact our predictions.)

In [None]:
# We've arbitrarily chosen to drop is_thur, but it would make no difference if we instead dropped is_sat or is_sun.
mini_tips = mini_tips.drop(columns=['is_thur'])
mini_tips

In [None]:
create_design_matrix(mini_tips, list(mini_tips.columns))
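
As a check on the rank argument above, here is a minimal sketch using hypothetical one-hot columns and made-up observations (`ohe`, `y_demo`, and the other names are illustrative only). It uses `np.linalg.matrix_rank` to show that the design matrix with all three one-hot columns is not full rank, and `np.linalg.lstsq` to show that the fitted predictions are the same no matter which one-hot column is dropped.

```python
import numpy as np

# Hypothetical one-hot columns for five rows: is_thur, is_sat, is_sun.
ohe = np.array([[0, 1, 0],
                [0, 0, 1],
                [0, 0, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
ones = np.ones((5, 1))
y_demo = np.array([2.0, 3.5, 3.0, 1.5, 2.5])  # made-up observations

X_full = np.hstack([ones, ohe])               # intercept + all three columns
X_drop_thur = np.hstack([ones, ohe[:, 1:]])   # intercept + is_sat, is_sun
X_drop_sat = np.hstack([ones, ohe[:, [0, 2]]])  # intercept + is_thur, is_sun

# The full matrix has 4 columns but a smaller rank; the reduced one is full rank.
rank_full = np.linalg.matrix_rank(X_full)
rank_drop_thur = np.linalg.matrix_rank(X_drop_thur)
print(X_full.shape[1], rank_full, X_drop_thur.shape[1], rank_drop_thur)

# Predictions do not depend on which one-hot column was dropped.
pred_drop_thur = X_drop_thur @ np.linalg.lstsq(X_drop_thur, y_demo, rcond=None)[0]
pred_drop_sat = X_drop_sat @ np.linalg.lstsq(X_drop_sat, y_demo, rcond=None)[0]
print(np.allclose(pred_drop_thur, pred_drop_sat))
```

The individual coefficients differ between the two reduced matrices, since each is measured relative to whichever day was dropped, but the predictions agree.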

Now we have a design matrix that is ready to go. Let's replicate this process on our full dataset.

In [None]:
# Run this cell.
tips_ohe = tips.copy()
tips_ohe['is_sat'] = (tips_ohe['day'] == 'Sat').astype(int)
tips_ohe['is_sun'] = (tips_ohe['day'] == 'Sun').astype(int)

# Design matrix with two one-hot encoded columns.
X_ohe = create_design_matrix(tips_ohe, ['total_bill', 'is_sat', 'is_sun'])

In [None]:
w_ohe = solve_normal_equations(X_ohe, y)
w_ohe

Let's now plot the resulting prediction rule. We've zoomed into the region where the total bills are less than 30 to make the prediction rule more clear.

In [None]:
x_range = np.linspace(0, 30)

under_30 = tips[tips['total_bill'] < 30]

fig = go.Figure()
fig.add_trace(go.Scatter(x = under_30['total_bill'], y = under_30['tip'], mode = 'markers', name = 'actual'))

# Line for Thursday
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_ohe[0] + w_ohe[1] * x_range, 
                         name = 'Thursday', 
                         line=dict(color='red')))

# Line for Saturday
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_ohe[0] + w_ohe[2] + w_ohe[1] * x_range, 
                         name = 'Saturday', 
                         line=dict(color='gold')))

# Line for Sunday
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_ohe[0] + w_ohe[3] + w_ohe[1] * x_range, 
                         name = 'Sunday', 
                         line=dict(color='green')))

fig.update_layout(xaxis_title = 'Total Bill', yaxis_title = 'Tip')

It looks like the prediction rule is actually three separate lines, each of which has the same slope but a different intercept!

Let's try and understand why this is the case.

In [None]:
w_ohe

Our prediction rule is of the following form:

$$\text{predicted tip} = 0.908 + 0.105 (\text{total bill}) - 0.069 (\text{is saturday}) + 0.091 (\text{is sunday})$$

<p style="color:red"><b>Your Job</b></p>

- What is the intercept of the line for when `day` is Thursday?
- What is the intercept of the line for when `day` is Saturday?
- What is the intercept of the line for when `day` is Sunday?

Write the numerical answers for all three questions in your writeup PDF. That is the only action item you have for Part F, but please ask questions if any of this subpart was unclear.

Just for completeness, we'll also compute the MSE of this prediction rule:

In [None]:
mse_ohe = mean_squared_error(X_ohe, y, w_ohe)
prediction_rules.loc['total_bill + OHE day'] = mse_ohe
prediction_rules

This new prediction rule didn't have a much lower MSE than the prediction rule that used `total_bill` only. That's not all that surprising, since the three lines above look quite similar.

That's it for Problem 2! We hope you now have a better understanding of multiple linear regression and feature engineering.

## Supplement for Problem 3

Note that this question is entirely in the PDF of the homework. The code we've written here just serves to help you understand that the sum of the residuals when we have an intercept term is truly 0, using the data from Problem 2 as an example. Feel free to experiment here.

In [None]:
np.sum(y - X_one_feature @ w_one_feature)

In [None]:
np.sum(y - X_two_features @ w_two_features)

## Problem 4 – Least Absolute Deviations Regression

### Part B

We're providing you with all of the following functions.

In [None]:
def generate_all_combinations(data, k=2):
    """Returns the unique sets of length k from the dataset."""
    return list(combinations(data, k))

In [None]:
def plane_mae(a, b, c, data):
    """Computes the mean absolute error for a given plane."""
    loss = 0
    n = len(data)
    for i in range(n):
        x_i, y_i, z_i = data[i]
        loss += abs(z_i - (a * x_i + b * y_i + c))
    return loss / n
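
The loop above computes the average of the absolute errors one point at a time. As an illustration only, the same quantity can be sketched with NumPy array operations. The function name `plane_mae_vectorized` and the three points below are hypothetical and are not part of the assignment.

```python
import numpy as np

def plane_mae_vectorized(a, b, c, data):
    """Hypothetical vectorized equivalent of plane_mae: mean of |z - (a*x + b*y + c)|."""
    x, y, z = np.asarray(data, dtype=float).T
    return np.mean(np.abs(z - (a * x + b * y + c)))

# Made-up (x, y, z) points; the plane z = x + 2y + 1 passes through the first two
# and misses the third by 2.
points = [[1.0, 1.0, 4.0], [2.0, 0.0, 3.0], [0.0, 1.0, 5.0]]
mae_demo = plane_mae_vectorized(1, 2, 1, points)
print(mae_demo)
```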

In [None]:
def find_best_plane(planes, data):
    """Finds the best plane given a list of planes and the dataset."""
    lowest_mae = float("inf")
    best_plane = None

    for plane in planes:
        a, b, c = plane
        mae = plane_mae(a, b, c, data)
        if mae < lowest_mae:
            lowest_mae = mae
            best_plane = plane

    return best_plane

Once again, we'll use the tips dataset. We'll use `total_bill` and `table_size` to predict `tip`, as we did in Problem 2. Below, we create a matrix in the form that the above functions expect. The first column, $x$, contains total bills, the second column, $y$, contains table sizes, and the third column, $z$, contains tips.

(Note that we're also only taking the first 50 rows of `tips`, since the process we're going to implement is very slow.)

In [None]:
tips_for_p4 = tips[['total_bill', 'table_size', 'tip']].iloc[:50].values.tolist()
tips_for_p4
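
To get a sense of why the brute-force approach is slow, the short sketch below counts how many triplets of points there are among 50 rows, using `math.comb` (available in Python 3.8 and later). Each of those triplets defines one candidate plane whose error must then be computed over all 50 points.

```python
from itertools import combinations
from math import comb

n_rows = 50
n_triplets = comb(n_rows, 3)  # number of ways to choose 3 points out of 50
print(n_triplets)

# Cross-check the counting formula against itertools on a smaller case.
small_count = len(list(combinations(range(6), 3)))
print(small_count == comb(6, 3))
```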

Note that the code `generate_all_combinations(tips_for_p4, 3)` will return the following:

In [None]:
generate_all_combinations(tips_for_p4, 3)

<p style="color:red"><b>Your Job</b></p>

Complete the implementation of the function `generate_all_planes`, which takes in a list of point triplets in the above format and returns a list of $(a, b, c)$ triplets, such that the $i$th $(a, b, c)$ triplet defines a plane that contains the three points in the $i$th element of the list `triplets`.

For instance, the first element of `generate_all_planes(generate_all_combinations(tips_for_p4, 3))` should be an $(a, b, c)$ triplet defining the plane $z = ax + by + c$ that passes through the three points `[16.99, 2.0, 1.01], [10.34, 3.0, 1.66], [21.01, 3.0, 3.5]`.

In [None]:
def generate_all_planes(triplets):
    '''Returns an (a, b, c) triplet defining a plane for every triplet of points in the list triplets.'''
    planes = []
    for triplet in triplets:
        A, B, C = np.array(triplet) # Unpacks all three points
        A_to_B = ... # TODO (Hint: see the hint in the PDF for part (a))
        A_to_C = ... # TODO (Hint: see the hint in the PDF for part (a))
        normal = ... # TODO (Hint: This should be a vector normal to the plane. Look into np.cross.)
        point = np.dot(normal, A)
        x, y, z = normal
        a, b, c = ... # TODO (Hint: Try determining what a, b, and c should be on paper first.)
        planes.append((a, b, c))
    return planes

If you implemented `generate_all_planes` correctly, the following cell should print out the best $a, b, c$ triplet (i.e. the LAD plane) for the tips dataset. You may see some `RuntimeWarning`s; you can ignore those.

In [None]:
triplets = generate_all_combinations(tips_for_p4, 3)
planes = generate_all_planes(triplets)
a_best, b_best, c_best = find_best_plane(planes, tips_for_p4)
print("The best (a, b, c) triplet for the tips data is", (a_best, b_best, c_best))

Here's a plot of the resulting plane. Remember that we only used the first 50 rows of `tips` to fit this prediction rule.

In [None]:
XX, YY = np.mgrid[0:60:2, 0:8:2]
Z = c_best + a_best * XX + b_best * YY
plane = go.Surface(x=XX, y=YY, z=Z)
fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=np.array(tips_for_p4)[:, 0], 
                           y=np.array(tips_for_p4)[:, 1], 
                           z=np.array(tips_for_p4)[:, 2], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene = dict(
    xaxis_title = 'Total Bill',
    yaxis_title = 'Table Size',
    zaxis_title = 'Tip'))

Nice work! 💪