In [None]:
# Please don't change this cell, but do make sure to run it.
import otter
grader = otter.Notebook()

# Homework 4 Supplemental Notebook

## DSC 40A, Spring 2024

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pd.options.plotting.backend = "plotly"

# DSC 40A preferred styles.
pio.templates["dsc40a"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+dsc40a"

### Helper Functions

Here, we'll define several functions that you'll need to use in this notebook. **Don't reinvent the wheel, use the functions that are here!**
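
As a rough sketch of what the helpers in the next cell compute, the toy example below solves the normal equations directly and cross-checks the result against NumPy's least-squares routine. The data is made up (it is not the tips dataset), and the names `x_toy`, `X_toy`, `y_toy`, `w_toy`, `w_lstsq`, and `mse_toy` are hypothetical.

```python
import numpy as np

# Made-up toy data (hypothetical, not from the tips dataset): y = 1 + 2x exactly.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1 + 2 * x_toy

# Design matrix: a column of all 1s followed by the feature column.
X_toy = np.column_stack([np.ones(len(x_toy)), x_toy])

# Solve the normal equations X^T X w = X^T y for w.
w_toy = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# Cross-check against NumPy's built-in least-squares solver.
w_lstsq = np.linalg.lstsq(X_toy, y_toy, rcond=None)[0]

# Mean squared error of the predictions X w against the observations y.
mse_toy = np.mean((y_toy - X_toy @ w_toy) ** 2)

print('normal equations:', w_toy)
print('lstsq:           ', w_lstsq)
print('MSE:             ', mse_toy)
```

Because the toy observations lie exactly on a line, both approaches should recover the same intercept and slope, and the MSE should be zero up to floating-point error.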

In [None]:
def solve_normal_equations(X, y):
    '''Returns the optimal parameter vector, w*, given a design matrix X and observation vector y.'''
    return np.linalg.solve(X.T @ X, X.T @ y)

def create_design_matrix(df, columns, intercept=True):
    '''Creates a design matrix by taking the specified columns from the DataFrame df.
       Adds a column of all 1s as the first column if intercept is True, which is the default.
       The argument columns should be a list.
    '''
    df = df.copy()
    df['1'] = 1
    if intercept:
        return df[['1'] + columns].values
    else:
        return df[columns].values
    
def mean_squared_error(X, y, w):
    '''Returns the mean squared error of the predictions Xw and observations y.'''
    return np.mean((y - X @ w) ** 2)

## Problem 5: Billy the Waiter 🧑‍🍳

**Disclaimer:** While this problem seems quite long, the amount of work you have to do is quite minimal. Most of the code has already been implemented for you; you will generally just need to tweak a few things and interpret the results. You will see the text <a style="color:red"><b>Your Job</b></a> next to each of your action items.

**Ultimately, you will submit this notebook to the Homework 4, Problem 5 autograder on Gradescope. It is entirely autograded, and is worth 14 points.**

Run the cell below to load in a dataset containing information about the tips Billy received over the last month as a waiter at Dirty Birds.

In [None]:
tips = px.data.tips().rename(columns={'size': 'table_size'}).replace('Fri', 'Thur')
tips

Each row corresponds to a single table that he served. Throughout this question, our goal will be to predict `'tip'` using some or all of the other features in the DataFrame.

Let's start by just using `'total_bill'` to predict `'tip'`. Here's a scatter plot showing the relationship between the two variables:

In [None]:
# pio.renderers.default = 'browser' # If the plot doesn't load in your notebook, uncomment this line and run again.

fig = px.scatter(tips, x='total_bill', y='tip', title='Tip vs. Total Bill')
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip')

The functions defined in the **Helper Functions** section make it easy to fit a linear hypothesis function:

In [None]:
X_one_feature = create_design_matrix(tips, ['total_bill'])
y = tips['tip']

# Notice that X_one_feature has two columns.
X_one_feature

In [None]:
# Finding w*.
w_one_feature = solve_normal_equations(X_one_feature, y)
w_one_feature

We can now use this hypothesis function to make predictions:

In [None]:
# Dot product of an augmented feature vector for a total bill of 15 with the optimal parameter vector.
np.array([1, 15]) @ w_one_feature

In [None]:
x_range = np.linspace(0, 60)

fig = go.Figure()
fig.add_trace(go.Scatter(x=tips['total_bill'], y=y, mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_one_feature[0] + w_one_feature[1] * x_range, 
                         name='Linear Hypothesis Function', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip')

The mean squared error of this hypothesis function is as follows:

In [None]:
mse_one_feature = mean_squared_error(X_one_feature, y, w_one_feature)
mse_one_feature

We'll define the DataFrame `hypothesis_functions` solely to keep track of the hypothesis functions we've used so far along with their MSEs. (We'll update this DataFrame for you.)

In [None]:
hypothesis_functions = pd.DataFrame(index=['total_bill'], columns=['MSE'])
hypothesis_functions.loc['total_bill'] = mse_one_feature
hypothesis_functions

<!--
BEGIN QUESTION
name: q5a
points: 2
-->

### Problem 5(a): Making predictions using the single-feature model (2 points)

Let's suppose Billy works for a day as a waiter at [Nobu San Diego](https://www.noburestaurants.com/sandiego/home/), a very expensive sushi restaurant. He waits a table whose total bill is \$350. He decides to use the above linear hypothesis function to predict the tip that he will receive.

<p style="color:red"><b>Your Job</b></p>

1. What tip would the above single-feature model predict for a total bill of \$350? In the cell below, assign the answer to the variable `prediction_for_350`. (Try and use the `@` symbol as part of your answer!)
1. Is this prediction likely to be accurate? If so, in the cell below, assign the variable `is_accurate` to `True`, otherwise, assign it to `False`. Before assigning `is_accurate` to either `True` or `False`, you should think about what makes a prediction about the future likely to be accurate vs. not.

**Note**: You should not round any numbers at any point in this notebook!

In [None]:
prediction_for_350 = ...
is_accurate = ...

# Don't change the line below.
print(f'The predicted tip for a total bill of $350 is ${round(prediction_for_350, 2)}, and we {"do" if is_accurate else "do not"} think this prediction is likely to be accurate.')

In [None]:
grader.check("q5a")

<!--
BEGIN QUESTION
name: q5b
points: 2
-->

### Problem 5(b): Using two features (2 points)

Now, let's suppose we want to use `'total_bill'` AND `'table_size'` to predict `'tip'`.

<p style="color:red"><b>Your Job</b></p> 

Below, complete the following tasks:

1. Assign `X_two_features` to the design matrix for this new hypothesis function.
1. Assign `w_two_features` to the optimal parameter vector for this new hypothesis function.
1. Assign `mse_two_features` to the mean squared error of this hypothesis function.
1. Did adding `'table_size'` as a feature make our hypothesis function significantly more accurate as compared to the hypothesis function that used just `'total_bill'`? If so, assign `much_more_accurate` to `True`, otherwise assign it to `False`.

Tasks 1, 2, and 3 should each only take one line; remember to use the functions defined for you at the start of the notebook. Problem 5(b) as a whole should not take very long.

In [None]:
X_two_features = ...
w_two_features = ...
mse_two_features = ...
much_more_accurate = ...

# Don't change the lines below.
print('first five rows of design matrix:\n', X_two_features[:5])
print('optimal parameter vector:', w_two_features)
print('MSE:', mse_two_features)
print('much more accurate:', 'yes' if much_more_accurate else 'no')

In [None]:
grader.check("q5b")

If you completed Problem 5(b) correctly, you should see a 3D scatter plot of the original data points and your hypothesis function below.

In [None]:
XX, YY = np.mgrid[0:60:2, 0:8:2]
Z = w_two_features[0] + w_two_features[1] * XX + w_two_features[2] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Reds')

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=tips['total_bill'], 
                           y=tips['table_size'], 
                           z=tips['tip'], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene = dict(
    xaxis_title='Total Bill',
    yaxis_title='Table Size',
    zaxis_title='Tip'), title='Tip vs. Total Bill and Table Size')

Don't change this cell, just run it.

In [None]:
hypothesis_functions.loc['total_bill and table_size'] = mse_two_features
hypothesis_functions

### Problem 5(c): Comparing coefficients (2 points)

Which feature is more important in predicting tip – `'total_bill'` or `'table_size'`?

Assuming you answered Problem 5(b) correctly, run the cell below to create a standardized design matrix, where the two columns for `'total_bill'` and `'table_size'` are standardized to have mean 0 and standard deviation 1.

In [None]:
X_two_features_standardized = X_two_features.copy()
X_two_features_standardized[:, 1:] = (X_two_features[:, 1:] - np.mean(X_two_features[:, 1:], axis=0)) / X_two_features[:, 1:].std(axis=0, ddof=0)
X_two_features_standardized[:5]

<!--
BEGIN QUESTION
name: q5c
points: 2
-->

<p style="color:red"><b>Your Job</b></p> 

1. Assign `w_two_features_standardized` to an array containing the standardized regression coefficients for our two-feature hypothesis function.
1. Assign `more_important` to either `'total_bill'` or `'table_size'`, depending on which of the two features you think is more important in predicting `'tip'`.

In [None]:
w_two_features_standardized = ...
more_important = ...
w_two_features_standardized, more_important

In [None]:
grader.check("q5c")

Don't change this cell, just run it.

In [None]:
hypothesis_functions.loc['total_bill and table_size std'] = mean_squared_error(X_two_features_standardized, y, w_two_features_standardized)
hypothesis_functions

The MSEs of the last two hypothesis functions were the same! The only difference is that when we standardized the features in creating the most recent hypothesis function, we were able to compare the coefficients directly.
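
To see why the MSEs match, note that when the design matrix has an intercept column, standardizing a feature only shifts and rescales its column, so the set of achievable predictions does not change. The sketch below checks this on randomly generated data (not the tips dataset). All names in it (`x1`, `x2`, `y_toy`, `X_raw`, `X_std`, `fit_toy`, `mse_toy`) are hypothetical, and the two small functions simply mirror the helper functions from the start of the notebook.

```python
import numpy as np

# Randomly generated toy data (hypothetical, not the tips dataset).
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 50, n)   # feature measured on a large scale
x2 = rng.uniform(1, 6, n)    # feature measured on a small scale
y_toy = 1 + 0.1 * x1 + 0.2 * x2 + rng.normal(0, 0.5, n)

# Raw design matrix with an intercept column.
X_raw = np.column_stack([np.ones(n), x1, x2])

# Standardize every column except the intercept (mean 0, standard deviation 1).
features = X_raw[:, 1:]
feature_sds = features.std(axis=0)
X_std = X_raw.copy()
X_std[:, 1:] = (features - features.mean(axis=0)) / feature_sds

def fit_toy(X, y):
    '''Solves the normal equations for the optimal parameter vector.'''
    return np.linalg.solve(X.T @ X, X.T @ y)

def mse_toy(X, y, w):
    '''Mean squared error of the predictions Xw.'''
    return np.mean((y - X @ w) ** 2)

w_raw = fit_toy(X_raw, y_toy)
w_std = fit_toy(X_std, y_toy)

print('MSE (raw):         ', mse_toy(X_raw, y_toy, w_raw))
print('MSE (standardized):', mse_toy(X_std, y_toy, w_std))
print('raw coefficients:         ', w_raw[1:])
print('standardized coefficients:', w_std[1:])
```

In this sketch, each standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation, which is what puts the coefficients on a common scale.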

### Problem 5(d): Using polynomial features (3 points)

Let's revisit the scatter plot of `'tip'` vs. `'total_bill'`:

In [None]:
fig = px.scatter(tips, x='total_bill', y='tip', title='Tip vs. Total Bill')
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip')

As we did in class, let's see if using higher-degree polynomial features yields a better hypothesis function. Specifically, let's try and create a degree 4 polynomial hypothesis function, using the features `'total_bill'`, `'total_bill^2'`, `'total_bill^3'`, and `'total_bill^4'`.

In [None]:
# Making a copy of the tips DataFrame so that we don't modify the original data.
tips_with_poly_features = tips.copy()

In [None]:
# Computing total_bill^2.
tips_with_poly_features['total_bill^2'] = tips_with_poly_features['total_bill'] ** 2
tips_with_poly_features.head()

<!--
BEGIN QUESTION
name: q5d
points: 3
-->

<p style="color:red"><b>Your Job</b></p>

1. Add columns `'total_bill^3'` and `'total_bill^4'` to the DataFrame `tips_with_poly_features`.
1. Define `X_poly`, `w_poly`, and `mse_poly` to be the design matrix, optimal parameter vector, and mean squared error of our new 4th degree polynomial hypothesis function. Note that this hypothesis function should be of the form:

    $$H(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$$

    where $x$ is the `'total_bill'`.

Again, this subpart should only take a few minutes.

In [None]:
tips_with_poly_features = ...
X_poly = ...
w_poly = ...
mse_poly = ...

# Don't change the lines below.
print('first five rows of design matrix:\n', X_poly[:5])
print('optimal parameter vector:', w_poly)
print('MSE:', mse_poly)

In [None]:
grader.check("q5d")

Don't change this cell, just run it.

In [None]:
hypothesis_functions.loc['total_bill 4th degree poly'] = mse_poly
hypothesis_functions

### Problem 5(e): Interpreting the model with polynomial features (2 points)

Assuming you completed Problem 5(d) correctly, run the following cell to see a visualization of our 4th degree polynomial hypothesis function.

In [None]:
x_range = np.linspace(0, 50)

fig = go.Figure()
fig.add_trace(go.Scatter(x=tips['total_bill'], y=tips['tip'], mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_poly[0] + w_poly[1] * (x_range) + w_poly[2] * (x_range**2) + \
                             w_poly[3] * (x_range**3) + w_poly[4] * (x_range**4),
                         name='4th Degree Polynomial Hypothesis Function', 
                         line=dict(color='red', width=5)))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', title='Tip vs. Total Bill')

As you saw, the 4th degree polynomial hypothesis function seems to fit the data the best so far, since its MSE is the lowest.

In [None]:
hypothesis_functions

But let's see what happens when we "zoom out" and look at how this hypothesis function behaves.

In [None]:
x_range = np.linspace(-20, 70)

fig = go.Figure()
fig.add_trace(go.Scatter(x=tips['total_bill'], y=tips['tip'], mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_poly[0] + w_poly[1] * (x_range) + w_poly[2] * (x_range**2) + \
                             w_poly[3] * (x_range**3) + w_poly[4] * (x_range**4),
                         name='4th Degree Polynomial Hypothesis Function', 
                         line=dict(color='red', width=5)))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', title='Tip vs. Total Bill')

<!--
BEGIN QUESTION
name: q5e
points: 2
-->

Let's again suppose Billy works for a day as a waiter at [Nobu San Diego](https://www.noburestaurants.com/sandiego/home/). He waits a table whose total bill is \$350. He decides to use the above 4th degree polynomial hypothesis function to predict the tip that he will receive.

<p style="color:red"><b>Your Job</b></p>

What tip would the above polynomial model predict for a total bill of \$350? In the cell below, assign the answer to the variable `poly_prediction_for_350`.

Then, think about **why** a hypothesis function with a lower MSE is not necessarily better than a hypothesis function with a higher MSE. You don't need to write your answer anywhere, but discuss it with someone (either a tutor or a peer) before submitting Homework 4.

In [None]:
poly_prediction_for_350 = ...

# Don't change the line below.
print(f'The predicted tip for a total bill of $350 is ${round(poly_prediction_for_350, 2)}.')

In [None]:
grader.check("q5e")

### Problem 5(f): Using categorical features (3 points)

There was another column in our original DataFrame, `tips`, that we haven't yet looked at: `'day'`.

In [None]:
tips.head()

In [None]:
px.bar(tips['day'].value_counts().loc[['Thur', 'Sat', 'Sun']])

Note that unlike `'total_bill'` and `'table_size'`, `'day'` is **categorical**. Its values are strings rather than numbers, so we can't place it in our design matrix as-is and solve for the best hypothesis function.

A naïve solution would be to encode `'Thur'` as 1, `'Sat'` as 2, and `'Sun'` as 3, but this would make it seem like Sunday is "more" than Saturday or Thursday in some regard, which it is not – these are all just different days of the week.

A more robust and common solution is called **one hot encoding** (OHE). You will be exposed to it in more detail in DSC 80, but we want to show you an example of how it works now since it's a natural extension of what we've already covered.

Let's first get it working on a toy example. Let's pretend we have a DataFrame with just 5 rows and 2 columns, `'total_bill'` and `'day'`. Call it `mini_tips`.

In [None]:
# Don't worry about what this code is doing, just run the cell.
mini_tips = pd.DataFrame()
mini_tips['total_bill'] = tips['total_bill'].iloc[:5]
mini_tips['day'] = ['Sat', 'Sun', 'Sun', 'Thur', 'Sat']
mini_tips

When we **one hot encode** a categorical variable, we create a new column for each unique value of that categorical variable. In this case, we'd create three new columns, one each for `'Thur'`, `'Sat'`, and `'Sun'`.

Each of these new columns is binary, meaning they only contain the values 1 and 0. 
- The new column for `'Thur'`, which we'll call `'is_thur'`, will contain a 1 for rows where the value of `'day'` is `'Thur'`, and 0 for all other rows. 
- Similarly, the new column for `'Sun'`, which we'll call `'is_sun'`, will contain a 1 for rows where the value of `'day'` is `'Sun'`, and 0 for all other rows.

Again, you'll see more efficient ways to do this in later courses, but here's one way to one hot encode using a technique you saw in DSC 10 – Boolean comparisons.

In [None]:
(mini_tips['day'] == 'Thur')

Repeating this for all three days:

In [None]:
mini_tips['is_thur'] = (mini_tips['day'] == 'Thur').astype(int)
mini_tips['is_sat'] = (mini_tips['day'] == 'Sat').astype(int)
mini_tips['is_sun'] = (mini_tips['day'] == 'Sun').astype(int)

# Dropping the 'day' column. We've encoded it numerically, we don't need it anymore.
mini_tips = mini_tips.drop(columns=['day'])
mini_tips

Now we've converted a categorical feature into three numerical features, so we're good to go!
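
As one possible alternative to writing a Boolean comparison for each day, pandas has a built-in function, `pd.get_dummies`, that one hot encodes a column in a single call. The sketch below compares the two approaches on a made-up DataFrame with the same structure as `mini_tips`. The names `toy`, `manual`, and `auto` are hypothetical, and the bill amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical toy DataFrame mirroring the structure of mini_tips (made-up bills).
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, 15.0, 30.0, 25.0],
    'day': ['Sat', 'Sun', 'Sun', 'Thur', 'Sat'],
})

# Manual one hot encoding using Boolean comparisons, one new column per day.
manual = toy.drop(columns=['day'])
for day in ['Thur', 'Sat', 'Sun']:
    manual['is_' + day.lower()] = (toy['day'] == day).astype(int)

# Built-in one hot encoding. The new columns are named 'day_Sat', 'day_Sun',
# and 'day_Thur', and the original 'day' column is dropped automatically.
auto = pd.get_dummies(toy, columns=['day'], dtype=int)

print(manual)
print(auto)
```

Both results contain the same 0s and 1s; only the column names and column order differ.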

**There's just one more thing.** Since we're used to fitting linear hypothesis functions with an intercept term, our design matrix generally has a column of all 1s in it. In the case of `mini_tips`, which contains three binary columns, this would look like:

In [None]:
create_design_matrix(mini_tips, list(mini_tips.columns))

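This design matrix contains redundant information! Specifically, we can recreate the column of all 1s by adding together the three one-hot encoded columns, as the next two cells show. This matters because we find $\vec{w}^*$ by solving the normal equations,

$$X^TX\vec{w} = X^Ty$$

and the closed-form solution

$$\vec{w}^* = (X^TX)^{-1}X^Ty$$

only exists when $X^TX$ is invertible, which requires the columns of $X$ to be linearly independent.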

In [None]:
X_not_full_rank = create_design_matrix(mini_tips, list(mini_tips.columns))
X_not_full_rank

In [None]:
# Note that the 0, 1, 2, 3, 4 that you see is the index of this Series, which is irrelevant for our purposes.
mini_tips['is_thur'] + mini_tips['is_sat'] + mini_tips['is_sun']

What this means is that our design matrix $X$ suffers from multicollinearity, and is not **full rank**. There are multiple nasty side effects of this – there is no unique solution for $\vec{w}^*$ and it makes our optimal parameters more difficult to interpret.

You'll explore this problem in later statistics and data science courses, so don't worry if this is a bit confusing. **For now, know this – the way to avoid this problem is to drop one of the one hot encoded columns.** That way, there is no redundant information in the design matrix, and we don't run into any issues. This is not "getting rid" of any information, so it will not impact our predictions – if we know it is not Saturday or Sunday, it must be Thursday.
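
A quick way to check both claims numerically is to compare matrix ranks and predictions before and after dropping a column. The sketch below does this on a small made-up design matrix (not the tips dataset). The names `X_full`, `X_dropped`, `y_toy`, and the invented numbers are hypothetical, and `np.linalg.lstsq` is used for the rank-deficient matrix only because it still returns one of the many minimizers when $X^TX$ is not invertible.

```python
import numpy as np

# Hypothetical toy design matrix with columns:
# intercept, total_bill, is_thur, is_sat, is_sun (made-up values).
X_full = np.array([
    [1, 10.0, 0, 1, 0],
    [1, 20.0, 0, 0, 1],
    [1, 15.0, 0, 0, 1],
    [1, 30.0, 1, 0, 0],
    [1, 25.0, 0, 1, 0],
    [1, 12.0, 1, 0, 0],
])
y_toy = np.array([1.5, 3.0, 2.5, 4.0, 3.5, 2.0])

# The three one hot encoded columns sum to the intercept column,
# so the rank is smaller than the number of columns.
rank_full = np.linalg.matrix_rank(X_full)

# Drop the is_thur column (index 2); the remaining columns are independent.
X_dropped = np.delete(X_full, 2, axis=1)
rank_dropped = np.linalg.matrix_rank(X_dropped)

# lstsq still returns a minimizer for the rank-deficient matrix.
w_full = np.linalg.lstsq(X_full, y_toy, rcond=None)[0]
# With full column rank, the normal equations have a unique solution.
w_dropped = np.linalg.solve(X_dropped.T @ X_dropped, X_dropped.T @ y_toy)

print('columns:', X_full.shape[1], 'rank:', rank_full)
print('columns:', X_dropped.shape[1], 'rank:', rank_dropped)
print('same predictions:', np.allclose(X_full @ w_full, X_dropped @ w_dropped))
```

In this sketch the two matrices have the same rank, which is why dropping the column loses no information and leaves the predictions unchanged.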

In [None]:
# We've arbitrarily chosen to drop 'is_thur', but it would make no difference if we instead dropped 'is_sat' or 'is_sun'.
mini_tips = mini_tips.drop(columns=['is_thur'])
mini_tips

In [None]:
create_design_matrix(mini_tips, list(mini_tips.columns))

Now we have a design matrix that is ready to go. Let's replicate this process on our full dataset.

In [None]:
# Run this cell.
tips_ohe = tips.copy()
tips_ohe['is_sat'] = (tips_ohe['day'] == 'Sat').astype(int)
tips_ohe['is_sun'] = (tips_ohe['day'] == 'Sun').astype(int)

# Design matrix with two one-hot encoded columns.
X_ohe = create_design_matrix(tips_ohe, ['total_bill', 'is_sat', 'is_sun'])
print('first five rows of design matrix:\n', X_ohe[:5])

In [None]:
w_ohe = solve_normal_equations(X_ohe, y)
w_ohe

Let's now plot the resulting hypothesis function. We've zoomed into the region where the total bills are less than 30 to make the hypothesis function more clear.

In [None]:
x_range = np.linspace(0, 30)

under_30 = tips[tips['total_bill'] < 30]

fig = go.Figure()
fig.add_trace(go.Scatter(x=under_30['total_bill'], y=under_30['tip'], mode='markers', name='actual'))

# Line for Thursday.
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_ohe[0] + w_ohe[1] * x_range, 
                         name='Thursday', 
                         line=dict(color='blue', width=4)))

# Line for Saturday.
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_ohe[0] + w_ohe[2] + w_ohe[1] * x_range, 
                         name='Saturday', 
                         line=dict(color='orange', width=4)))

# Line for Sunday.
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_ohe[0] + w_ohe[3] + w_ohe[1] * x_range, 
                         name='Sunday', 
                         line=dict(color='red', width=4)))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', title='Tip vs. Total Bill')

It looks like the hypothesis function is actually three separate lines, each of which has the same slope but a different intercept!

Let's try and understand why this is the case.

In [None]:
w_ohe

Our hypothesis function is of the following form:

$$\text{predicted tip} = 0.908 + 0.105 (\text{total bill}) - 0.069 (\text{is saturday}) + 0.091 (\text{is sunday})$$

<!--
BEGIN QUESTION
name: q5f
points: 3
-->

<p style="color:red"><b>Your Job</b></p>

Below, assign `intercept_thur`, `intercept_sat`, and `intercept_sun` to the **$y$-intercepts** of the three lines above, corresponding to when the `'day'` is Thursday, Saturday, or Sunday. You should do this using code, pulling values from `w_ohe`, but you should think conceptually about where each of the three intercepts is coming from.

In [None]:
intercept_thur = ...
intercept_sat = ...
intercept_sun = ...

# Don't change the lines below.
print('Intercept for Thursday:', intercept_thur)
print('Intercept for Saturday:', intercept_sat)
print('Intercept for Sunday:', intercept_sun)

In [None]:
grader.check("q5f")

Just for completeness, we'll also compute the MSE of this hypothesis function:

In [None]:
mse_ohe = mean_squared_error(X_ohe, y, w_ohe)
hypothesis_functions.loc['total_bill + OHE day'] = mse_ohe
hypothesis_functions

This new hypothesis function didn't have a much lower MSE than the hypothesis function that used `'total_bill'` only. That's not all that surprising, since the three lines above look quite similar.

<hr>

## Ready to Submit?

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells. 
1. Read through the notebook to make sure all cells ran and all tests passed.
1. Run the cell below to run all tests, and make sure that they all pass.
1. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
1. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.

Remember that we will run hidden test cases on your submission after the due date.

In [None]:
grader.check_all()