<a href="https://colab.research.google.com/github/dlsun/Data402-F21/blob/main/Optimization_for_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Read in the housing prices data set.

In [1]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/dlsun/Data402-F21/main/data/housing-prices.txt",
                 sep="\t")
df

Unnamed: 0,Price,Living.Area,Bathrooms,Bedrooms,Fireplaces,Lot.Size,Age,Fireplace
0,142212,1982,1.0,3,0,2.00,133,False
1,134865,1676,1.5,3,1,0.38,14,True
2,118007,1694,2.0,3,1,0.96,15,True
3,138297,1800,1.0,2,2,0.48,49,True
4,129470,2088,1.0,3,1,1.84,29,True
...,...,...,...,...,...,...,...,...
1052,107695,1802,2.0,4,1,0.97,56,True
1053,236737,3239,3.5,4,2,2.50,1,True
1054,154829,1440,2.0,2,1,0.61,66,True
1055,179492,2030,2.5,3,1,1.00,3,True


Let's fit a linear regression model that predicts price, using the number of bedrooms and the number of bathrooms as features. We can use linear algebra to get a closed-form solution for the intercept $\alpha$ and the coefficients $\beta$.

Recall from class that if we know $\beta$, then the optimal intercept $\alpha$ is $\bar y - \bar{\bf x} \beta$. Therefore, we can write the objective in terms of $\beta$ as:

\begin{align}
J(\beta) &= \frac{1}{n} \sum_{i=1}^n ((y_i - \bar y) - ({\bf x}_i - \bar {\bf x})\beta)^2 \\
&= \frac{1}{n} \sum_{i=1}^n (\tilde y_i - \tilde {\bf x}_i\beta)^2
\end{align}
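As a quick sanity check, here is a minimal sketch that confirms numerically that plugging the optimal intercept into the objective gives the same value as the centered form. It uses synthetic data rather than the housing data, and all names ending in `_sim` are illustrative assumptions.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the housing data).
rng = np.random.default_rng(0)
n_sim = 200
X_sim = rng.normal(size=(n_sim, 2))
y_sim = 3.0 + X_sim @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=n_sim)

# Any candidate beta works; the identity holds for every beta.
beta_sim = np.array([0.7, 0.2])

# Optimal intercept for this beta: alpha = ybar - xbar @ beta.
alpha_sim = y_sim.mean() - X_sim.mean(axis=0) @ beta_sim

# Objective with the intercept plugged in.
J_original = np.mean((y_sim - alpha_sim - X_sim @ beta_sim) ** 2)

# Objective written in terms of the centered data.
Xc_sim = X_sim - X_sim.mean(axis=0)
yc_sim = y_sim - y_sim.mean()
J_centered = np.mean((yc_sim - Xc_sim @ beta_sim) ** 2)

print(J_original, J_centered)
```

The two printed values should agree up to floating-point rounding, whatever `beta_sim` is.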

Now, we take the derivative:

\begin{align}
\frac{\partial J}{\partial \beta} &= \frac{1}{n} \sum_{i=1}^n -2 \tilde {\bf x}_i^T (\tilde y_i - \tilde {\bf x}_i\beta).
\end{align}
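One way to gain confidence in this derivative is to compare it with a finite-difference approximation. The sketch below does this on synthetic centered data; the helper names `objective` and `gradient` and the `_sim` variables are assumptions made for illustration.

```python
import numpy as np

def objective(beta, Xc, yc):
    """Mean squared error on centered data."""
    return np.mean((yc - Xc @ beta) ** 2)

def gradient(beta, Xc, yc):
    """Analytic gradient: (1/n) * sum of -2 * x_i^T (y_i - x_i beta)."""
    return -2.0 / len(yc) * Xc.T @ (yc - Xc @ beta)

# Synthetic centered data (an assumption; not the housing data).
rng = np.random.default_rng(0)
Xc_sim = rng.normal(size=(100, 2))
Xc_sim -= Xc_sim.mean(axis=0)
yc_sim = rng.normal(size=100)
yc_sim -= yc_sim.mean()

beta0 = np.array([0.5, -1.0])
eps = 1e-6

# Central differences along each coordinate direction.
numeric_grad = np.array([
    (objective(beta0 + eps * e, Xc_sim, yc_sim)
     - objective(beta0 - eps * e, Xc_sim, yc_sim)) / (2 * eps)
    for e in np.eye(2)
])
analytic_grad = gradient(beta0, Xc_sim, yc_sim)

print(numeric_grad, analytic_grad)
```

Because $J$ is quadratic in $\beta$, the central difference should match the analytic gradient very closely.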

Setting this equal to 0 and solving, we obtain:

$$ \sum_{i=1}^n \tilde {\bf x}_i^T \tilde {\bf x}_i\beta = \sum_{i=1}^n \tilde {\bf x}_i^T \tilde y_i.$$

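We can rewrite these expressions in terms of matrices, where $\tilde X$ is the matrix whose rows are the $\tilde {\bf x}_i$ and $\tilde {\bf y}$ is the vector of the $\tilde y_i$: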

$$ \tilde X^T \tilde X \beta = \tilde X^T \tilde {\bf y}. $$
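To see that the sums and the matrix products really are the same objects, here is a small sketch on synthetic centered data. The `_sim` names are assumptions used only for this illustration.

```python
import numpy as np

# Synthetic centered data (an assumption; not the housing data).
rng = np.random.default_rng(1)
Xc_sim = rng.normal(size=(50, 2))
Xc_sim -= Xc_sim.mean(axis=0)
yc_sim = rng.normal(size=50)
yc_sim -= yc_sim.mean()

# Sum form: each row x_i contributes the outer product x_i^T x_i
# on the left and x_i^T y_i on the right.
lhs_sum = sum(np.outer(x, x) for x in Xc_sim)
rhs_sum = sum(x * y_i for x, y_i in zip(Xc_sim, yc_sim))

# Matrix form.
lhs_mat = Xc_sim.T @ Xc_sim
rhs_mat = Xc_sim.T @ yc_sim

print(np.allclose(lhs_sum, lhs_mat), np.allclose(rhs_sum, rhs_mat))
```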

At this point, we can solve a system of linear equations to obtain $\beta$. Provided $\tilde X^T \tilde X$ is invertible, we can also write an explicit expression for $\beta$ using the matrix inverse:

$$ \beta = (\tilde X^T \tilde X)^{-1} \tilde X^T \tilde {\bf y}. $$

In [None]:
import numpy as np

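n = len(df)
X = df[["Bedrooms", "Bathrooms"]]
y = df["Price"]

# center the features and the response
Xc = X - X.mean()
yc = y - y.mean()

# solve the normal equations Xc^T Xc beta = Xc^T yc for beta
beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
# recover the intercept from the means
alpha = y.mean() - X.mean() @ beta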

alpha, beta
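As a cross-check on the centering approach, the sketch below compares it with a direct least-squares fit that includes an explicit column of ones for the intercept. It runs on synthetic data, and the `_sim` names and the design matrix `A_sim` are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption; not the housing data).
rng = np.random.default_rng(0)
n_sim = 200
X_sim = rng.normal(size=(n_sim, 2))
y_sim = 3.0 + X_sim @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=n_sim)

# Centering approach: solve the normal equations for beta, then recover alpha.
Xc_sim = X_sim - X_sim.mean(axis=0)
yc_sim = y_sim - y_sim.mean()
beta_normal = np.linalg.solve(Xc_sim.T @ Xc_sim, Xc_sim.T @ yc_sim)
alpha_normal = y_sim.mean() - X_sim.mean(axis=0) @ beta_normal

# Direct approach: prepend a column of ones and fit (alpha, beta) jointly.
A_sim = np.column_stack([np.ones(n_sim), X_sim])
coef_lstsq, *_ = np.linalg.lstsq(A_sim, y_sim, rcond=None)

print(alpha_normal, beta_normal)
print(coef_lstsq)
```

The first entry of `coef_lstsq` should match `alpha_normal`, and the remaining entries should match `beta_normal`.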

Alternatively, we can use gradient descent to estimate $\beta$.

The gradient (which we calculated above) is:

$$ \frac{\partial J}{\partial\beta} = \frac{1}{n} (-2) \tilde X^T(\tilde {\bf y} - \tilde X\beta). $$



In [None]:
learning_rate = 0.1  # step size; adjust this number until the iterates converge

# start with an initial guess of beta
beta = np.zeros(2)
for _ in range(1000):
  # gradient of J at the current beta, computed on the centered data
  grad = -2 / n * Xc.T @ (yc - Xc @ beta)
  beta = beta - learning_rate * grad

beta
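Rather than running a fixed number of iterations, we could stop once the gradient is close to zero and compare the result with the closed-form solution. The following is a minimal sketch on synthetic data; the function `gradient_descent`, its `tol` and `max_iter` parameters, and the `_sim` names are assumptions for illustration. It also computes the largest learning rate for which gradient descent on this quadratic objective converges, which is $2 / \lambda_{\max}$ for the largest eigenvalue $\lambda_{\max}$ of the Hessian $\frac{2}{n} \tilde X^T \tilde X$.

```python
import numpy as np

def gradient_descent(Xc, yc, learning_rate=0.1, tol=1e-10, max_iter=10_000):
    """Minimize mean((yc - Xc @ beta)**2); stop when the gradient norm < tol."""
    n, p = Xc.shape
    beta = np.zeros(p)
    for it in range(max_iter):
        grad = -2.0 / n * Xc.T @ (yc - Xc @ beta)
        if np.linalg.norm(grad) < tol:
            break
        beta = beta - learning_rate * grad
    return beta, it

# Synthetic data (an assumption; not the housing data).
rng = np.random.default_rng(0)
n_sim = 200
X_sim = rng.normal(size=(n_sim, 2))
y_sim = 3.0 + X_sim @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=n_sim)
Xc_sim = X_sim - X_sim.mean(axis=0)
yc_sim = y_sim - y_sim.mean()

# Largest usable step size for this quadratic: 2 / (largest Hessian eigenvalue).
hessian_sim = 2.0 / n_sim * Xc_sim.T @ Xc_sim
max_rate = 2.0 / np.linalg.eigvalsh(hessian_sim).max()

beta_gd, n_iter = gradient_descent(Xc_sim, yc_sim, learning_rate=0.1)
beta_closed = np.linalg.solve(Xc_sim.T @ Xc_sim, Xc_sim.T @ yc_sim)

print(max_rate, n_iter)
print(beta_gd, beta_closed)
```

If the chosen learning rate is above `max_rate`, the iterates diverge, which is one way to interpret the trial and error needed to pick a step size.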

## Practice Exercises

Continue with the data example above.

1. In the example above, we did gradient descent on $\beta$ only, assuming that we would calculate $\alpha$ separately after we have estimated $\beta$. Implement gradient descent on all parameters $(\alpha, \beta)$ simultaneously.

2. Use linear algebra to come up with a closed-form solution to ridge regression, which minimizes:

$$ \underset{\alpha, {\bf \beta}}{\text{minimize}}\ \frac{1}{n}\sum_{i=1}^n (y_i - \alpha - {\bf x}_i \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2. $$

3. Use gradient descent to estimate $\beta$ for the ridge regression problem in Exercise 2.

4. Use gradient descent to estimate $\alpha$ and $\beta$ together for the ridge regression problem in Exercise 2.