Welcome to Lab 2! The goal of this lab is to learn how to implement linear regression. (This will also be helpful for homework 1!)

This class will simulate a coding interview, where you will be asked to code up solutions on the fly, often without the aid of a code compiler or an IDE.

The class will be divided into 4 groups. Each group will present their solutions at the board. Within each group, assign one person to be the "coder," who will type up the solution after the group presents it and run it to see if it works. If the code does not work as expected, work together to debug it.

## Let's implement linear regression!

To start, we'll need some data. So let's first think about how we would simulate some data.

Q1: We would like to simulate some data to fit a linear regression model. What's a common distributional assumption for $Y$ given $X$? Formally define your probability model and point out which values correspond to the parameters of the probability model.

A common assumption is $Y = X\beta + \beta_0 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$ is a mean-zero, normally distributed random variable with unknown variance $\sigma^2$ that is independent of $X$. Equivalently, $Y \mid X \sim N(X\beta + \beta_0, \sigma^2)$. The parameters of the probability model are the intercept $\beta_0$, the coefficients $\beta$, and the noise variance $\sigma^2$.

Q2: Let's simulate $n=100$ observations with $p=3$ variables. Let the coefficients of the linear model be $\beta=(1,0.5,0)^\top$ and let the intercept be zero. Simulate the data such that $Y = X \beta + \epsilon$ where $\epsilon \sim N(0, 0.1)$ and $X_j \sim N(0,4)$ for all $j=1,2,3$. Here the second argument of $N(\cdot,\cdot)$ is the variance, not the standard deviation. (Code Required)

In [3]:
import numpy as np
import pandas as pd

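np.random.seed(10)  # fix the seed so the simulation is reproducible
beta = np.array([1, 0.5, 0])  # true coefficients (a 1-D array needs no transpose)

def simulate_data(n):
    # N(0, 4) has variance 4, so the standard deviation (scale) is 2
    X = np.random.normal(loc=0, scale=2, size=(n, 3))
    # N(0, 0.1) has variance 0.1, so the standard deviation is sqrt(0.1)
    eps = np.random.normal(loc=0, scale=np.sqrt(0.1), size=n)
    mu = X @ beta  # equivalent to np.dot(X, beta)
    y = mu + eps
    return X, y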

datX, datY = simulate_data(100)
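One easy mistake in Q2 is passing a variance where `np.random.normal` expects a standard deviation. The sketch below is a hypothetical sanity check, not part of the lab solution: it draws a much larger sample (the names `rng`, `n_check`, `X_check`, and `eps_check` are made up for illustration) and confirms that the empirical variances land near 4 and 0.1.

```python
import numpy as np

# Hypothetical check: use a large sample so empirical moments are close
# to their targets. This uses a separate generator and does not touch
# the lab's seeded global state.
rng = np.random.default_rng(0)
n_check = 100_000

X_check = rng.normal(loc=0.0, scale=2.0, size=(n_check, 3))          # target variance 4
eps_check = rng.normal(loc=0.0, scale=np.sqrt(0.1), size=n_check)    # target variance 0.1

x_var = X_check.var(axis=0)
eps_var = eps_check.var()

print("Empirical variance of each X column:", x_var)
print("Empirical variance of eps:", eps_var)
```

With only $n=100$ observations the same estimates would be noticeably noisier, so a check like this is more informative on a large sample.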

Q3: Read through the linear regression documentation on sklearn's website: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. Explain what each of the arguments in the LinearRegression class is for. (Written response)

Answers discussed in class.
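As a rough companion to the in-class discussion, the sketch below exercises two of the constructor arguments that recent scikit-learn versions document: `fit_intercept` (whether to estimate an intercept) and `positive` (whether to constrain coefficients to be nonnegative). The other documented arguments, `copy_X` and `n_jobs`, affect memory handling and parallelism rather than the fitted values. The data and the names `X_demo`, `y_demo`, `with_intercept`, `no_intercept`, and `nonneg` are made up for illustration; check the documentation for your installed version, since the argument list has changed over time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data with a true intercept of 5 (an assumption for this demo only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + X_demo @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# fit_intercept=True (the default) estimates an intercept term
with_intercept = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# fit_intercept=False forces the fitted line through the origin
no_intercept = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)

# positive=True constrains every coefficient to be nonnegative
nonneg = LinearRegression(positive=True).fit(X_demo, y_demo)

print("With intercept:   ", with_intercept.intercept_, with_intercept.coef_)
print("Without intercept:", no_intercept.intercept_, no_intercept.coef_)
print("Nonnegative coefs:", nonneg.intercept_, nonneg.coef_)
```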

Q4: Fit linear regression using sklearn on your simulated data. What values are you going to use for the LinearRegression parameters? How will you obtain the estimated coefficients and intercept? Map/draw out your plan, then write up the code. (Code Required)

In [8]:
import sklearn.linear_model


reg = sklearn.linear_model.LinearRegression().fit(datX, datY)

coefficients = reg.coef_
intercept = reg.intercept_

print("Intercept:", intercept)
print("Coefficients:", coefficients)


Intercept: 0.027279474080975058
Coefficients: [ 0.99507555  0.49232657 -0.0131667 ]


Q5: Let's recall lecture. What optimization problem do we try to solve when fitting linear regression? (Written Response and Equation)

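The optimization problem for linear regression is to find the coefficients that minimize the sum of squared residuals between the observed responses and the model's predictions (ordinary least squares).

Equation: $\min_\beta \| Y - X\beta \|^2$, where $X$ includes a column of ones if the model has an intercept, so that $\beta$ contains the intercept as well.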

Q6: How do we derive the closed form solution to the optimization problem? Describe the procedure. Derive the closed form solution step by step. (Written Response and equations.)

To find the closed-form solution to the optimization problem, we:
1. Take the gradient of the loss with respect to the model parameters.
2. Set the gradient equal to zero and solve for the parameter values that satisfy this equation.
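The two steps above can be written out as follows. This is a sketch under two assumptions: $X$ already contains a column of ones (as in the `tildeX` matrix used in Q7), so $\beta$ includes the intercept, and $X^\top X$ is invertible.

$$
\begin{aligned}
L(\beta) &= \| Y - X\beta \|^2 = (Y - X\beta)^\top (Y - X\beta) \\
&= Y^\top Y - 2\beta^\top X^\top Y + \beta^\top X^\top X \beta, \\
\nabla_\beta L(\beta) &= -2 X^\top Y + 2 X^\top X \beta, \\
\nabla_\beta L(\hat\beta) = 0 &\iff X^\top X \hat\beta = X^\top Y, \\
\hat\beta &= (X^\top X)^{-1} X^\top Y.
\end{aligned}
$$

Because the loss is convex in $\beta$, the point where the gradient vanishes is a global minimizer.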

Q7: Implement the closed form solution. (Code Required)

In [12]:
import numpy as np

# augment X matrix with a ones column
tildeX = np.c_[np.ones((datX.shape[0], 1)), datX]  

# Apply the closed form solution to estimate coefficients
beta_closed_form = np.linalg.inv(tildeX.T @ tildeX) @ tildeX.T @ datY

# Extract intercept and coefficients
intercept_closed_form = beta_closed_form[0]
coefficients_closed_form = beta_closed_form[1:]

print("Closed-form solution intercept:", intercept_closed_form)
print("Closed-form solution coefficients:", coefficients_closed_form)

Closed-form solution intercept: 0.027279474080975114
Closed-form solution coefficients: [ 0.99507555  0.49232657 -0.0131667 ]


Q8: Run your code on the simulated dataset. Do you get the same answer as scikit-learn?
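One way to answer this programmatically is to compare the estimates with `np.allclose` rather than by eye. The sketch below is self-contained: it repeats the Q2 simulation settings and then compares scikit-learn against the normal equations and against `np.linalg.lstsq`. Solving the linear system with `np.linalg.solve` instead of forming an explicit inverse is a swapped-in variant of the Q7 code, generally preferred for numerical stability. The names `sk_params`, `X_aug`, `normal_eq`, and `lstsq_sol` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate data with the same settings as Q2
np.random.seed(10)
beta_true = np.array([1.0, 0.5, 0.0])
X = np.random.normal(loc=0, scale=2, size=(100, 3))
eps = np.random.normal(loc=0, scale=np.sqrt(0.1), size=100)
y = X @ beta_true + eps

# scikit-learn estimate, stacked as [intercept, coef_1, coef_2, coef_3]
reg = LinearRegression().fit(X, y)
sk_params = np.r_[reg.intercept_, reg.coef_]

# Closed form: solve (X^T X) b = X^T y without forming the inverse
X_aug = np.column_stack([np.ones(len(X)), X])
normal_eq = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Least-squares solver applied directly to the augmented design matrix
lstsq_sol, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("sklearn vs normal equations:", np.allclose(sk_params, normal_eq))
print("sklearn vs lstsq:           ", np.allclose(sk_params, lstsq_sol))
```

The estimates should agree up to floating-point error, which matches the printed outputs for Q4 and Q7: they differ only in the last few digits of the intercept.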

(Bonus) Q9: Alternatively, we can perform linear regression by running gradient descent to solve the optimization problem. Your task is to implement the gradient descent procedure and check your answers. (Code Required)
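A minimal sketch of one possible approach, not a reference solution: batch gradient descent on the mean squared error, which has the same minimizer as the sum of squared errors. The function name `gradient_descent`, the step size, the iteration count, and the freshly simulated data are all assumptions chosen for illustration. The step size in particular must be small enough for the data at hand, or the iterates will diverge.

```python
import numpy as np

def gradient_descent(X, y, step_size=0.01, n_iters=5000):
    """Minimize (1/n) * ||y - X_aug @ params||^2 by batch gradient descent.

    Returns params = [intercept, coef_1, ..., coef_p].
    """
    n = X.shape[0]
    X_aug = np.column_stack([np.ones(n), X])  # add an intercept column
    params = np.zeros(X_aug.shape[1])         # start from all zeros
    for _ in range(n_iters):
        residual = X_aug @ params - y
        grad = (2.0 / n) * X_aug.T @ residual  # gradient of the mean squared error
        params -= step_size * grad
    return params

# Illustrative data following the Q2 settings (separate generator and seed)
rng = np.random.default_rng(0)
X_gd = rng.normal(loc=0.0, scale=2.0, size=(100, 3))
y_gd = X_gd @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=np.sqrt(0.1), size=100)

gd_params = gradient_descent(X_gd, y_gd)

# Check against a direct least-squares solve on the same data
X_gd_aug = np.column_stack([np.ones(len(X_gd)), X_gd])
exact_params, *_ = np.linalg.lstsq(X_gd_aug, y_gd, rcond=None)

print("Gradient descent:", gd_params)
print("Least squares:   ", exact_params)
print("Match:", np.allclose(gd_params, exact_params, atol=1e-6))
```

Dividing the gradient by $n$ keeps a workable step size roughly independent of the sample size, which is why the sketch uses the mean rather than the sum of squared errors.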