# Lecture 8 Supplementary Notebook

## DSC 40A, Summer 2024

The following cell sets up the necessary imports – don't worry too much about it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns

from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg")

pd.options.plotting.backend = "plotly"

# DSC 80 preferred styles
pio.templates["dsc80"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+dsc80"

from IPython.display import HTML, display

Let's load in the commute times dataset as a `pandas` DataFrame.

In [None]:
df = pd.read_csv('data/commute-times.csv')
df.head()

There are many columns in here, but the only ones we're interested in for now are `'departure_hour'` and `'minutes'`.

In [None]:
df[['departure_hour', 'minutes']]

In [None]:
pio.renderers.default = 'plotly_mimetype+notebook' # If the plot doesn't load for you, run this first.

In [None]:
fig = px.scatter(df,
           x='departure_hour',
           y='minutes',
           size=np.ones(len(df)) * 50,
           size_max=8)
fig.update_xaxes(title='Home Departure Time (AM)')
fig.update_yaxes(title='Minutes to School')
fig.update_layout(title='Commuting Time vs. Home Departure Time')
fig.update_layout(width=700)

## Finding the Regression Line, Using the Old Formulas

Recall, the formulas for the optimal intercept ($w_0^*$) and slope ($w_1^*$) are

$$w_1^* = r \frac{\sigma_y}{\sigma_x}$$

$$w_0^* = \bar{y} - w_1^* \bar{x}$$

In [None]:
def slope(x, y):
    return np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)

In [None]:
def intercept(x, y):
    return np.mean(y) - slope(x, y) * np.mean(x)
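
As a sanity check on these two formulas, here is a minimal sketch that compares them against `np.polyfit` on made-up data. The synthetic `x` and `y` below are hypothetical stand-ins, not the commute times dataset, and the names `w0_check` and `w1_check` are introduced only for this illustration.

```python
import numpy as np

# Hypothetical synthetic data (NOT the commute times dataset).
rng = np.random.default_rng(0)
x = rng.uniform(6, 11, size=50)
y = 140 - 8 * x + rng.normal(0, 5, size=50)

def slope(x, y):
    # w1* = r * (sigma_y / sigma_x)
    return np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)

def intercept(x, y):
    # w0* = mean(y) - w1* * mean(x)
    return np.mean(y) - slope(x, y) * np.mean(x)

# With deg=1, np.polyfit returns [slope, intercept] of the least squares line.
w1_check, w0_check = np.polyfit(x, y, 1)

formulas_match = bool(np.allclose([intercept(x, y), slope(x, y)],
                                  [w0_check, w1_check]))
print(formulas_match)
```

If the formulas are implemented correctly, the two approaches should agree up to floating-point rounding.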

In [None]:
w0_star = intercept(df['departure_hour'], df['minutes'])
w1_star = slope(df['departure_hour'], df['minutes'])

# Just fancy printing – ignore these next two lines.
rule_string = ('$$\\text{Predicted Commute Time (in Minutes)} = ' +
               f'{round(w0_star, 2)} + {round(w1_star, 2)}' +
               '\\cdot \\left( \\text{Departure Hour} \\right)$$')
display(HTML(f'<h4>The best linear predictor for this dataset is</h4><br><center>{rule_string}</center>'))

In [None]:
hline = px.line(x=[5.5, 11.5], y=[97.405, 48.265]).update_traces(line={'color': 'red', 'width': 4})
fline1 = go.Figure(fig.data + hline.data)
fline1.update_xaxes(title='Home Departure Time (AM)')
fline1.update_yaxes(title='Minutes to School')
fline1.update_layout(title='<span style="color:red">Predicted Commute Time</span> = 142.45 - 8.19 * Departure Hour')
fline1.update_layout(width=700, margin={'t': 60})

Now that we have $w_0^*$ and $w_1^*$, we can use them to make predictions.

In [None]:
# The predicted commute time if I leave at 8:30AM.
w0_star + w1_star * 8.5
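
As a rough arithmetic check, assuming the rounded coefficients shown in the plot title above ($142.45$ and $-8.19$), and writing 8:30AM as a departure hour of $8.5$, the prediction works out to approximately:

$$142.45 - 8.19 \cdot 8.5 = 142.45 - 69.615 \approx 72.8 \text{ minutes}$$

The value computed in code may differ slightly, since it uses the unrounded coefficients.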

## Finding the Regression Line, Using the Normal Equations

Using our linear algebraic formulation, the optimal intercept and slope are given by the vector $\vec{w}^*$, where:

$$\vec{w}^* = ({X^TX})^{-1} X^T\vec{y}$$

Here:
- $X$ is an $n \times 2$ matrix, called the **design matrix**, defined as:

$${ X} = \begin{bmatrix} { 1} & { x_1} \\ { 1} & { x_2} \\ \vdots & \vdots \\ { 1} & { x_n} \end{bmatrix}$$

- $\vec{y}$ is an $n$-dimensional vector, called the **observation vector**, defined as:

$$\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

Let's construct $X$ and $\vec{y}$ in code.

First, the design matrix.

In [None]:
# Create a new DataFrame by taking the 'departure_hour' column from df.
X = df[['departure_hour']].copy()
X

In [None]:
# Add a column of all 1s to X.
X['1'] = 1
X

In [None]:
# Change the order of the columns and convert to an array.
X = X[['1', 'departure_hour']].to_numpy()
X
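
The same design matrix can also be built directly in `numpy`, without going through a DataFrame. The sketch below is one possible alternative, using a small hypothetical array `departure_hour` in place of `df['departure_hour']`; the name `X_sketch` is made up for this illustration.

```python
import numpy as np

# Hypothetical departure hours standing in for df['departure_hour'].
departure_hour = np.array([8.5, 7.25, 10.0, 9.75])

# Place a column of 1s next to the feature column to form an n x 2 matrix.
X_sketch = np.column_stack([np.ones(len(departure_hour)), departure_hour])
print(X_sketch.shape)
```

Either way, the first column should be all 1s and the second column should hold the departure hours, matching the definition of $X$ above.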

$\vec{y}$ is already created for us: it's just the `'minutes'` column in `df`.

In [None]:
y = df['minutes'].to_numpy()
y

Now, let's implement:

$$\vec{w}^* = ({X^TX})^{-1} X^T\vec{y}$$

In [None]:
# The @ symbol performs matrix multiplication!
w_star_linalg = np.linalg.inv(X.T @ X) @ X.T @ y
w_star_linalg
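
Explicitly inverting $X^TX$ mirrors the formula, but it is not the only way to compute $\vec{w}^*$. As a hedged sketch on hypothetical synthetic data (not the commute times dataset), the code below compares the inverse-based formula with two alternatives: `np.linalg.solve`, which solves $(X^TX)\vec{w} = X^T\vec{y}$ without forming the inverse, and `np.linalg.lstsq`, which minimizes squared error directly. The names `w_inv`, `w_solve`, and `w_lstsq` are introduced only for this example.

```python
import numpy as np

# Hypothetical synthetic data (NOT the commute times dataset).
rng = np.random.default_rng(1)
x = rng.uniform(6, 11, size=40)
y = 140 - 8 * x + rng.normal(0, 5, size=40)
X = np.column_stack([np.ones(len(x)), x])

# The formula as written: w* = (X^T X)^{-1} X^T y.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve the linear system (X^T X) w = X^T y without computing an inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares solver; the first returned value is the solution vector.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv, w_solve, w_lstsq)
```

All three should agree up to floating-point rounding when $X^TX$ is invertible.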

These numbers look familiar!

In [None]:
# Old formulas.
w0_star, w1_star

Indeed, they match the `w0_star` and `w1_star` we found using our old formulas, up to floating-point rounding.

## Making Predictions

We know how to make predictions with the old formulas:

In [None]:
# The predicted commute time if I leave at 8:30AM.
w0_star + w1_star * 8.5

How do we make predictions with the new formulas?

To find the predicted commute time for every departure hour in our dataset, we can multiply $X$ by the optimal parameter vector, $\vec{w}^*$.

$$\vec{h}^* = X \vec{w}^*$$

$\vec{h}^*$ above is the optimal **hypothesis vector**.

In [None]:
all_preds = X @ w_star_linalg
all_preds
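
One way to check a hypothesis vector is to look at the errors $\vec{y} - X\vec{w}^*$. Rearranging $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ gives $X^T(\vec{y} - X\vec{w}^*) = \vec{0}$, so the errors should be orthogonal to every column of $X$. The sketch below illustrates this on hypothetical synthetic data (not the commute times dataset); `errors` and `orthogonality_check` are names made up for this example.

```python
import numpy as np

# Hypothetical synthetic data (NOT the commute times dataset).
rng = np.random.default_rng(2)
x = rng.uniform(6, 11, size=40)
y = 140 - 8 * x + rng.normal(0, 5, size=40)
X = np.column_stack([np.ones(len(x)), x])

w_star = np.linalg.inv(X.T @ X) @ X.T @ y

# Hypothesis vector: one prediction per row of X.
h_star = X @ w_star

# Errors between observed and predicted values.
errors = y - h_star

# Each entry is the dot product of a column of X with the errors.
orthogonality_check = X.T @ errors
print(orthogonality_check)
```

Both entries should be very close to 0; tiny nonzero values are floating-point rounding.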

To make a prediction for a single data point, we take the **dot product** of the optimal parameter vector, $\vec{w}^*$ (`w_star_linalg`), with a vector of the form $\begin{bmatrix} 1 & x_\text{new} \end{bmatrix}^T$, since this is what the rows of $X$ look like.

In [None]:
# Also the predicted commute time if I leave at 8:30AM.
np.dot(w_star_linalg, np.array([1, 8.5]))
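
The single-point dot product and the matrix-vector product are the same computation at different scales. Here is a small sketch, assuming the rounded coefficients from the plot title above in place of `w_star_linalg`; the array `new_hours` and the names `X_new`, `preds`, and `single_pred` are hypothetical.

```python
import numpy as np

# Rounded coefficients taken from the plot title (an assumption; the
# notebook's w_star_linalg holds the unrounded values).
w_rounded = np.array([142.45, -8.19])

# Prediction for one new departure hour via a dot product.
single_pred = np.dot(w_rounded, np.array([1, 8.5]))

# Predictions for several hypothetical departure hours at once:
# build a small design matrix whose rows look like [1, x_new].
new_hours = np.array([7.0, 8.5, 10.0])
X_new = np.column_stack([np.ones(len(new_hours)), new_hours])
preds = X_new @ w_rounded

print(single_pred, preds)
```

The middle entry of `preds` corresponds to the row `[1, 8.5]`, so it should equal `single_pred`.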

This gives us the same prediction as before!

## Multiple Linear Regression

In [None]:
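# Pull out the digits between the two slashes in 'date' and store them as integers.
# (Assumes the dates are strings with the day in the middle, e.g. month/day/year.)
df['day_of_month'] = df['date'].str.extract(r'/(\d+)/').astype(int)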

In [None]:
df[['departure_hour', 'day_of_month', 'minutes']]

Let's create our new design matrix, $X$:

$$X = \begin{bmatrix}
		1      & \text{departure hour}_1 & \text{day}_1   \\
		1      & \text{departure hour}_2 & \text{day}_2    \\
		\vdots & \vdots & \vdots \\
		1      & \text{departure hour}_n & \text{day}_n
	\end{bmatrix}$$

In [None]:
X = df[['departure_hour', 'day_of_month']].copy()
X['1'] = 1
X = X[['1', 'departure_hour', 'day_of_month']].to_numpy()
X

In [None]:
w_star_multiple = np.linalg.inv(X.T @ X) @ X.T @ y
w_star_multiple
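
To build some confidence that the same formula works with more than one feature, here is a sketch on hypothetical, noiseless synthetic data (not the commute times dataset). The coefficients in `true_w` are made up; because `y` is generated exactly from them, the normal equations should recover them up to floating-point rounding.

```python
import numpy as np

# Hypothetical synthetic features (NOT the commute times dataset).
rng = np.random.default_rng(3)
n = 200
hour = rng.uniform(6, 11, size=n)
day = rng.integers(1, 31, size=n).astype(float)  # days 1 through 30

# Design matrix with columns: 1, departure hour, day of month.
X = np.column_stack([np.ones(n), hour, day])

# Made-up "true" parameters, used to generate y with no noise.
true_w = np.array([140.0, -8.0, 0.5])
y = X @ true_w

# Same formula as in the single-feature case.
w_recovered = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_recovered)
```

With real, noisy data the fitted parameters will not match any "true" values exactly, but the computation is identical.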

What do our predictions look like? With two features, the predictions lie on a plane, which we can draw alongside the original data.

In [None]:
XX, YY = np.mgrid[5:14:1, 0:31:1]
Z = w_star_multiple[0] + w_star_multiple[1] * XX + w_star_multiple[2] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Reds')

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=df['departure_hour'], 
                           y=df['day_of_month'], 
                           z=df['minutes'], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene=dict(xaxis_title='Departure Hour',
                             yaxis_title='Day of Month',
                             zaxis_title='Commute Time'),
                  title='Commute Time vs. Departure Hour and Day of Month',
                  width=1000, height=500)

How do we make predictions for new datapoints?

In [None]:
# The predicted commute time if I leave at 8:30AM on the 15th of the month.
np.dot(w_star_multiple, np.array([1, 8.5, 15]))

In [None]:
# The predicted commute time if I leave at 8:30AM on the 30th of the month.
np.dot(w_star_multiple, np.array([1, 8.5, 30]))
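
One way to read these two predictions: only the day of month differs between them, so under this model the gap depends only on the day coefficient. Writing the entries of `w_star_multiple` as $w_0^*$, $w_1^*$, and $w_2^*$ (notation assumed here for illustration):

$$\left( w_0^* + 8.5 \, w_1^* + 30 \, w_2^* \right) - \left( w_0^* + 8.5 \, w_1^* + 15 \, w_2^* \right) = 15 \, w_2^*$$

So the difference between the two outputs above should be 15 times `w_star_multiple[2]`.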