# Lecture 9 Supplementary Notebook

## DSC 40A, Summer 2024

The following cell sets up the necessary imports – don't worry too much about it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns

from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg")

pd.options.plotting.backend = "plotly"

# DSC 80 preferred styles
pio.templates["dsc80"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+dsc80"

from IPython.display import HTML

Let's load in the sales dataset as a `pandas` DataFrame.

In [None]:
data = pd.read_csv('data/sales.csv')
data.head()

## Multiple Linear Regression

Using our linear algebraic formulation, the optimal intercept and slope are given by the vector $\vec{w}^*$, where:

$$\vec{w}^* = ({X^TX})^{-1} X^T\vec{y}$$

Here:
- $X$ is an $n \times 2$ matrix, called the **design matrix**, defined as:

$${ X} = \begin{bmatrix} { 1} & { x_1} \\ { 1} & { x_2} \\ \vdots & \vdots \\ { 1} & { x_n} \end{bmatrix}$$

- $\vec{y}$ is an $n$-dimensional vector, called the **observation vector**, defined as:

$$\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
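
As a minimal sketch of the formula above, the cell below builds a design matrix and observation vector from a tiny made-up dataset and evaluates $(X^TX)^{-1}X^T\vec{y}$ directly. The arrays `x_toy` and `y_toy` are illustrative assumptions, not part of the sales data.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 1 + 2x.
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

# Design matrix: a column of ones (for the intercept) next to the feature column.
X_toy = np.column_stack([np.ones_like(x_toy), x_toy])

# Direct translation of w* = (X^T X)^{-1} X^T y.
w_toy = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy
print(w_toy)  # intercept first, then slope
```

Since the toy points lie exactly on a line, the recovered intercept and slope should match that line up to floating-point error.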

### Using just one feature

Before we perform multiple linear regression, let's first perform simple linear regression. We'll try to use square footage to predict net sales; our hypothesis function will be:

$$
\text{predicted net sales} = w_0 + w_1 \cdot \text{square feet}
$$

In [None]:
# pio.renderers.default = 'plotly_mimetype+notebook' # If the plot doesn't load for you, run this first.

In [None]:
fig = px.scatter(data, x='sq_ft', y='net_sales')
fig.update_layout(xaxis_title='Square Feet', yaxis_title='Net Sales', title='Net Sales vs. Square Footage')

It seems like $w_1^*$, the optimal slope, should be positive. To find $w_0^*$ and $w_1^*$, we'll solve the normal equations.

In [None]:
def solve_normal_equations(X, y):
    '''Returns the optimal parameter vector, w*, given a design matrix X and observation vector y.'''
    return np.linalg.solve(X.T @ X, X.T @ y)
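
One way to sanity-check a function like this is to compare it against `np.linalg.lstsq`, NumPy's built-in least squares solver. The following is only a sketch on randomly generated data; the names `normal_equations_sketch`, `X_check`, and `y_check` are hypothetical and exist just for this illustration.

```python
import numpy as np

def normal_equations_sketch(X, y):
    # Solve (X^T X) w = X^T y as a linear system, without inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data: an intercept column plus one random feature.
rng = np.random.default_rng(0)
X_check = np.column_stack([np.ones(20), rng.normal(size=20)])
y_check = rng.normal(size=20)

w_solve = normal_equations_sketch(X_check, y_check)
# lstsq returns (solution, residuals, rank, singular values); keep the solution.
w_lstsq, *_ = np.linalg.lstsq(X_check, y_check, rcond=None)

print(np.allclose(w_solve, w_lstsq))
```

When the columns of the design matrix are linearly independent, both approaches should agree up to floating-point error.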

In [None]:
data['1'] = 1

X_one_feature_model = data[['1', 'sq_ft']]
X_one_feature_model.to_numpy()

In [None]:
y = data['net_sales']

In [None]:
w_one_feature_model = solve_normal_equations(X_one_feature_model, y)
w_one_feature_model

This is telling us that the best-fitting line to this dataset is

$$\text{predicted net sales} = 2.577 + 85.389 \cdot \text{square feet}$$

To get predictions for all observations in my dataset:

In [None]:
X_one_feature_model @ w_one_feature_model

Let's draw a plot of our hypothesis function.

In [None]:
x_range = np.linspace(0, 10)

fig = go.Figure()
fig.add_trace(go.Scatter(x=data['sq_ft'], y=y, mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_one_feature_model[0] + w_one_feature_model[1] * x_range, 
                         name='Linear Hypothesis Function', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title='Square Feet', yaxis_title='Net Sales', title='Net Sales vs. Square Footage')

It's also worth calculating the mean squared error of this hypothesis function, so that we can compare it to our later hypothesis functions.

In [None]:
def mean_squared_error(X, y, w):
    '''Returns the mean squared error of the predictions X @ w against the observations y.'''
    return np.mean((y - X @ w)**2)

mean_squared_error(X_one_feature_model, y, w_one_feature_model)
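
For reference, mean squared error is the average of the squared residuals, $\frac{1}{n} \sum_{i=1}^n (y_i - H(x_i))^2$. The sketch below works through that computation by hand on three made-up points; `X_small`, `y_small`, and `w_small` are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical design matrix, observations, and parameters.
X_small = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [1.0, 2.0]])
y_small = np.array([1.0, 2.0, 5.0])
w_small = np.array([1.0, 1.0])  # predictions are 1, 2, 3

residuals = y_small - X_small @ w_small  # 0, 0, 2
mse_small = np.mean(residuals ** 2)      # (0 + 0 + 4) / 3
print(mse_small)
```

Note that the mean is taken over the individual squared residuals; summing them first would give the total squared error instead.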

### Using two features

Let's now try to predict net sales from two variables: the square footage (size) of the store, and the number of competing stores in the area. Our model will be:

$$
\text{predicted net sales} = w_0 + w_1 \cdot \text{square feet} + w_2 \cdot \text{competitors}
$$

Suppose $w_0^*$, $w_1^*$, and $w_2^*$ are our hypothesis function's optimal parameters. Do you expect $w_1^*$ to be positive or negative? What about $w_2^*$?

In [None]:
fig = px.scatter(data, x='sq_ft', y='net_sales')
fig.update_layout(xaxis_title='Square Feet', yaxis_title='Net Sales', title='Net Sales vs. Square Footage')

In [None]:
fig = px.scatter(data, x='competing_stores', y='net_sales')
fig.update_layout(xaxis_title='Number of Competing Stores', yaxis_title='Net Sales', title='Net Sales vs. Number of Competing Stores')

Looking at separate scatter plots only tells part of the story. Let's look at a 3D scatter plot, with one axis for square footage, one axis for competing stores, and one axis for net sales.

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter3d(x=data['sq_ft'], 
                           y=data['competing_stores'], 
                           z=data['net_sales'], mode='markers'))

fig.update_layout(scene=dict(
    xaxis_title='Square Footage',
    yaxis_title='Competing Stores',
    zaxis_title='Net Sales'),
    title='Net Sales vs. Square Footage and Number of Competing Stores')

Our goal is to find the best fitting **plane** to this set of points.

### Question 🤔

At the start of this notebook, we fit a hypothesis function with a single feature, square feet, and got that the weight of that feature was $w_1^* = 85.389$.

We are about to fit a hypothesis function with two features, square feet and competing stores.

**Question:** Is the weight of the square feet feature, $w_1^*$, for this **new** hypothesis function guaranteed to be equal to 85.389?

A. Yes

B. No

Our design matrix is:
    
$$
\begin{pmatrix}
 1 & s_1 & c_1\\
 1 & s_2 & c_2\\
 \vdots & \vdots & \vdots\\
 1 & s_n & c_n
\end{pmatrix}
$$

where $s_i$ is the size of the $i$th store, and $c_i$ is the number of stores competing with the $i$th store. In code:

In [None]:
X_two_feature_model = data[['1', 'sq_ft', 'competing_stores']].to_numpy()
X_two_feature_model

Using the function `solve_normal_equations` that we already built:

In [None]:
w_two_feature_model = solve_normal_equations(X_two_feature_model, y)
w_two_feature_model

This is telling us that the best-fitting plane to this dataset is

$$\text{predicted net sales} = 303.491 + 45.151 \cdot \text{square feet} - 21.585 \cdot \text{competing stores}$$

**Note that the weight of $\text{square feet}$ in this hypothesis function is different from the weight of $\text{square feet}$ in the hypothesis function that only had one feature!**
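
To see why a weight can change when another feature is added, here is a small sketch on made-up data (the arrays `s`, `c`, and `y_toy` and the helper `fit` are hypothetical, not the sales data). The target is built exactly as $2 + 3s - 4c$, and the two features are correlated, so a model that only sees $s$ has to absorb part of the effect of $c$ into its slope.

```python
import numpy as np

def fit(X, y):
    # Solve the normal equations (X^T X) w = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical, correlated features and a target built exactly from both.
s = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
c = np.array([1.0, 1.0, 2.0, 3.0, 3.0])
y_toy = 2 + 3 * s - 4 * c

ones = np.ones_like(s)
w_one = fit(np.column_stack([ones, s]), y_toy)     # uses s only
w_two = fit(np.column_stack([ones, s, c]), y_toy)  # uses s and c

print('one feature: ', w_one)
print('two features:', w_two)
```

In this toy setup the two-feature fit recovers the weight of 3 on `s`, while the one-feature fit reports a much smaller slope on `s`, because `c` tends to be large exactly when `s` is large.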

In [None]:
XX, YY = np.mgrid[-1:10:2, 0:16:2]
Z = w_two_feature_model[0] + w_two_feature_model[1] * XX + w_two_feature_model[2] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Reds')

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=data['sq_ft'], 
                           y=data['competing_stores'], 
                           z=data['net_sales'], mode='markers', marker={'color': '#656DF1'}))

fig.update_layout(scene=dict(
    xaxis_title='Square Footage',
    yaxis_title='Competing Stores',
    zaxis_title='Net Sales'),
    title='Net Sales vs. Square Footage and Number of Competing Stores')

As before, let's calculate the MSE:

In [None]:
mean_squared_error(X_two_feature_model, y, w_two_feature_model)

Note that this is significantly lower than the MSE of the model with just one feature:

In [None]:
mean_squared_error(X_one_feature_model, y, w_one_feature_model)

### All features

Let's fit a hypothesis function using all of the features.

In [None]:
column_order = ['1', 'sq_ft', 'competing_stores', 'inventory', 'advertising', 'district_size']
X_all_features = data[column_order].to_numpy()
X_all_features

In [None]:
w_all_features = solve_normal_equations(X_all_features, y)
w_all_features

In [None]:
for i, feature in enumerate(column_order):
    if feature == '1':
        print(f'intercept:\t{w_all_features[0]:0.3f}')
    else:
        print(f'{feature}:\t{w_all_features[i]:0.3f}')

The MSE of this model is even lower!

In [None]:
mean_squared_error(X_all_features, y, w_all_features)

Note that I can't visualize this hypothesis function, since I would need to be able to visualize in 6D, but I can still find this hypothesis function's predictions:

In [None]:
X_all_features @ w_all_features

## Interpreting parameters

### Which feature is most "important"?

We should standardize in order to account for the difference in units and scale between the features.

**Question:** What would happen if I try to standardize the column of all 1s? 🧐

In [None]:
features = data[column_order].iloc[:, 1:].to_numpy()

In [None]:
standardized_features = (features - features.mean(axis=0)) / features.std(axis=0)

In [None]:
X_standardized = np.column_stack([
    np.ones(data.shape[0]),
    standardized_features
])

In [None]:
w_standardized = solve_normal_equations(X_standardized, y)
w_standardized

In [None]:
for i, feature in enumerate(column_order):
    if feature == '1':
        print(f'intercept:\t{w_standardized[0]:0.3f}')
    else:
        print(f'{feature}:\t{w_standardized[i]:0.3f}')

The district size appears to have the largest effect on the net sales.

In [None]:
mean_squared_error(X_standardized, y, w_standardized)

Note that standardizing has no impact on the actual predictions made by our hypothesis function, and hence the MSE – it just makes the weights more interpretable.
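
The claim above can be checked numerically. The following sketch uses randomly generated features on deliberately different scales (all names here, including `fit` and `features_toy`, are hypothetical) and compares the predictions of a model fit on raw features against one fit on standardized features.

```python
import numpy as np

def fit(X, y):
    # Solve the normal equations (X^T X) w = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data: two features with very different means and spreads.
rng = np.random.default_rng(42)
features_toy = rng.normal(loc=[100.0, 5.0], scale=[30.0, 2.0], size=(25, 2))
y_toy = rng.normal(size=25)

standardized_toy = (features_toy - features_toy.mean(axis=0)) / features_toy.std(axis=0)

ones = np.ones(features_toy.shape[0])
X_raw = np.column_stack([ones, features_toy])
X_std = np.column_stack([ones, standardized_toy])

preds_raw = X_raw @ fit(X_raw, y_toy)
preds_std = X_std @ fit(X_std, y_toy)

print(np.allclose(preds_raw, preds_std))
```

The weights themselves differ between the two fits, but the predictions should agree up to floating-point error, because standardizing each column is an invertible linear change of variables that the intercept and weights can compensate for.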

## Feature engineering and transformations

### Example: Quadratic hypothesis functions

Let's look at a new dataset of cars.

In [None]:
cars = sns.load_dataset('mpg').dropna()
cars.head()

In [None]:
px.scatter(cars, x='horsepower', y='mpg', title='MPG vs. Horsepower')

A regular linear model isn't a great fit here, since the relationship between horsepower and MPG looks curved rather than straight.

In [None]:
cars['1'] = 1
w_cars_one_feature = solve_normal_equations(cars[['1', 'horsepower']], cars['mpg'])
w_cars_one_feature

In [None]:
x_range = np.linspace(40, 220)

fig = go.Figure()
fig.add_trace(go.Scatter(x=cars['horsepower'], y=cars['mpg'], mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_cars_one_feature[0] + w_cars_one_feature[1] * x_range, 
                         name = 'Linear Hypothesis Function', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title='Horsepower', yaxis_title='MPG', title='MPG vs. Horsepower')

What if we add $\text{horsepower}^2$ as a feature? This would mean fitting a hypothesis function of the form

$$\text{predicted MPG} = w_0 + w_1 \cdot \text{horsepower} + w_2 \cdot \text{horsepower}^2$$

In [None]:
cars['horsepower^2'] = cars['horsepower']**2

In [None]:
cars[['1', 'horsepower', 'horsepower^2']]

In [None]:
w_cars_squared = solve_normal_equations(cars[['1', 'horsepower', 'horsepower^2']], cars['mpg'])
w_cars_squared

Let's look at the resulting hypothesis function.

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=cars['horsepower'], y=cars['mpg'], mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_cars_one_feature[0] + w_cars_one_feature[1] * x_range, 
                         name='Linear Hypothesis Function', 
                         line=dict(color='red')))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_cars_squared[0] + w_cars_squared[1] * x_range + w_cars_squared[2] * x_range**2, 
                         name='Quadratic Hypothesis Function', 
                         line=dict(color='#F7CF5D', width=5)))

fig.update_layout(xaxis_title='Horsepower', yaxis_title='MPG', title='MPG vs. Horsepower')

Note: this hypothesis function is **quadratic as a function of horsepower**, but it's still **linear as a function of the parameters**. This means we can still use the normal equations to find $\vec{w}^*$.
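
As a small illustration of this point, the sketch below generates made-up data from an exact quadratic, $y = 1 + 2x + 3x^2$, and recovers the three coefficients with the normal equations. The arrays `x_quad` and `y_quad` are hypothetical; the only change from a linear fit is the extra $x^2$ column in the design matrix.

```python
import numpy as np

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
x_quad = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_quad = 1 + 2 * x_quad + 3 * x_quad ** 2

# The model is linear in (w0, w1, w2), so x and x^2 are just two feature columns.
X_quad = np.column_stack([np.ones_like(x_quad), x_quad, x_quad ** 2])

w_quad = np.linalg.solve(X_quad.T @ X_quad, X_quad.T @ y_quad)
print(w_quad)
```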

### Example: Amdahl's Law

In [None]:
X_amdahl = np.array([[1, 1],
                     [1, 1/2],
                     [1, 1/4]])

y_amdahl = np.array([8, 4, 3])

In [None]:
solve_normal_equations(X_amdahl, y_amdahl)
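
The cells above give only the arrays, so the following is one possible reading rather than the notebook's stated setup. It assumes the second column of `X_amdahl` is $\frac{1}{p}$ for $p = 1, 2, 4$ processors, that `y_amdahl` holds the corresponding running times, and that the hypothesis function is $H(p) = w_0 + w_1 \cdot \frac{1}{p}$. The names `p` and `t` are hypothetical.

```python
import numpy as np

# Assumed interpretation: processor counts and measured running times.
p = np.array([1.0, 2.0, 4.0])
t = np.array([8.0, 4.0, 3.0])

# The feature is 1/p, so the model is still linear in w0 and w1.
X_from_p = np.column_stack([np.ones_like(p), 1 / p])

w_amdahl = np.linalg.solve(X_from_p.T @ X_from_p, X_from_p.T @ t)
print(w_amdahl)
```

Under these assumptions, `X_from_p` is identical to `X_amdahl`, and the transformed feature $\frac{1}{p}$ plays the same role that $\text{horsepower}^2$ played in the previous example.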

### Example: Transformations

In [None]:
# This cell generates our dataset.
np.random.seed(28)
x_fake = np.linspace(0, 20, 50) + np.random.normal(loc=0, scale=0.5, size=50)
y_fake = 0.5*np.random.normal(loc=2, scale=0.5, size=50) * np.e**(0.2 * x_fake)

In [None]:
px.scatter(x=x_fake, y=y_fake)

As per the lecture slides, we're trying to find a hypothesis function of the form

$$H(x) = w_0 e^{w_1 x}$$

We re-wrote this as

$$\log H(x) = \log w_0 + w_1 x$$

As a result, our design matrix $X$ is still

$$X = \begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

but our observation vector is now

$$\vec{z} = \begin{bmatrix} \log y_1 \\ \log y_2 \\ \vdots \\ \log y_n \end{bmatrix}$$

and our parameter vector is

$$\vec{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \log w_0 \\ w_1 \end{bmatrix}$$

In [None]:
X_trans = np.vstack([
    np.ones_like(x_fake),
    x_fake
]).T

z_trans = np.log(y_fake)

In [None]:
b_trans = solve_normal_equations(X_trans, z_trans)
b_trans

Now that we have $\vec{b}^*$, we need to solve for $\vec{w}^*$:

In [None]:
b0, b1 = b_trans

In [None]:
w0_star = np.e**b0
w1_star = b1

In [None]:
w0_star, w1_star
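
To check that this back-transformation behaves as expected, here is a minimal sketch on noise-free, made-up data generated from $y = 2e^{0.5x}$ (the arrays `x_exact` and `y_exact` are hypothetical). With no noise, fitting a line to $\log y$ and undoing the transformation should recover the original constants.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2 * e^(0.5 x).
x_exact = np.linspace(0, 4, 9)
y_exact = 2 * np.exp(0.5 * x_exact)

# Fit a line to log(y): log y = b0 + b1 * x.
X_exact = np.column_stack([np.ones_like(x_exact), x_exact])
z_exact = np.log(y_exact)
b0_exact, b1_exact = np.linalg.solve(X_exact.T @ X_exact, X_exact.T @ z_exact)

# Undo the transformation: b0 = log(w0), so w0 = e^(b0); b1 is w1 directly.
w0_exact = np.exp(b0_exact)
w1_exact = b1_exact
print(w0_exact, w1_exact)
```

With noisy data, as in the cells above, the recovered values minimize squared error in $\log y$, which is not necessarily the same as minimizing squared error in $y$ itself.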

Let's look at a plot of the resulting hypothesis function, $H(x) = 0.965 e^{0.196 x}$, to make sure it looks reasonable.

In [None]:
x_range = np.arange(0, 25)

fig = go.Figure()
fig.add_trace(go.Scatter(x=x_fake, y=y_fake, mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w0_star * np.e**(w1_star * x_range), 
                         name='Exponential Hypothesis Function', 
                         line=dict(color='red')))