# Lab 3: Linear Regression and the Gauss-Markov theorem
Welcome to DS102 lab!

The goals of this lab are to get some practice with applying linear regression, and to observe what happens when the Gauss-Markov theorem is applicable in practice.

The code you need to write is marked with a "TODO" comment (for example, "TODO: fill in"). There is additional documentation for each part as you go along.

## Course Policies
### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually.** If you do discuss the assignments with others please include their names in the cell below.

**Submission:** to submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run all), and then print as a pdf (File > download as > pdf) and submit it to Gradescope.

**This assignment should be completed and submitted before Thursday February 13, 2020 at 11:59 PM.**

# Collaborators
Write the names of your collaborators in this cell.

# Setup
Let's begin by importing the libraries we will use.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Part 1: Ordinary Least Squares estimator with zero-mean errors
In the first part of this lab, we will apply the ordinary least squares (OLS) estimator in the context of linear regression. The objective is to use linear regression to understand how the number of hours per week a student spends on Data 102 affects the student's grade.

## Ground truth model
Let $y$ be a student's grade, let $x$ be the number of hours the student spends per week on the class, and let $\epsilon$ be some zero-mean normally distributed random error. For this lab, we provide the ground truth that $y = \beta x + \epsilon$ for some true $\beta$. Specifically, we set the true $\beta$ to be $\beta= 5$, which means each additional hour spent on Data 102 increases a student's grade by 5 points (aside from random error).


## Simulate dataset
Here we simulate the dataset described above. We set the true $\beta$ to be 5.0. We draw $n$ samples where $x^{(1)},...,x^{(n)}$ are drawn from a uniform distribution between 0 and 20 hours per week, and $\epsilon^{(1)},...,\epsilon^{(n)}$ are normally distributed with mean 0. The ground truth labels are $y^{(i)} = \beta x^{(i)} + \epsilon^{(i)}$.

In [None]:
# No TODOs here, just run this cell and observe the data being generated.
TRUE_BETA=5.0

def simulate_data_unbiased(n):
    # Sample Xs.
    X = np.random.uniform(0,20,n)
    # Sample epsilon errors.
    epsilons = np.random.normal(0,10,n)
    # Calculate ground truth ys.
    y = TRUE_BETA * X + epsilons
    return X, y

X, y = simulate_data_unbiased(100)

def plot_data(input_xs, input_ys, predictions = None):
    plt.figure(figsize=(16, 8))
    plt.scatter(input_xs, input_ys)
    if predictions is not None:
        plt.plot(input_xs, predictions, c='blue', linewidth=2)
    plt.xlabel("Number of hours per week spent on Data 102", fontsize=14)
    plt.ylabel("Grade", fontsize=14)
    plt.show()

plot_data(X, y)

## 1a) Calculate the Ordinary Least Squares (OLS) estimator
First we are interested in finding the best approximation $y^{(i)} \approx \hat{\beta} x^{(i)}$, where $\hat{\beta}$ is our estimate of how much a student's grade goes up for every extra hour they spend on Data 102. We want to find the constant $\hat{\beta}$ that minimizes $\sum_{i=1}^n (y^{(i)} - \hat{\beta} x^{(i)})^2$. This estimator $\hat{\beta}$ is known as the **Ordinary Least Squares (OLS)** estimator.

Specifically, given a dataset of $n$ samples with $d$ features, let $X$ be an $n \times d$ matrix where each row is a feature vector $x^{(i)}$ in $\mathbb{R}^d$, and let $y$ be a vector in $\mathbb{R}^n$ where each entry is a label $y^{(i)}$. Recall from class that the OLS estimator that minimizes $\sum_{i=1}^n (y^{(i)} - (\hat{\beta})^Tx^{(i)})^2$ is:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
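
As a rough illustration of the matrix formula, the sketch below applies it to a small noiseless toy dataset with $d = 2$ features. The names `X_toy`, `y_toy`, `beta_toy`, and `beta_hat_toy` are made up for this example and are not part of the lab. The sketch solves the normal equations $X^T X \hat{\beta} = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly; that is a common numerical choice, not something the lab requires.

```python
import numpy as np

# Hypothetical toy data (not the lab's dataset): n = 4 samples, d = 2 features.
X_toy = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [4.0, 3.0]])
beta_toy = np.array([5.0, -1.0])   # assumed "true" coefficients for this example
y_toy = X_toy @ beta_toy           # noiseless labels, so OLS should recover beta_toy

# Solve the normal equations (X^T X) beta_hat = X^T y.
beta_hat_toy = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)
print(beta_hat_toy)
```

Because the toy labels contain no noise, the recovered coefficients should match `beta_toy` up to floating-point error. In this lab $d = 1$, so $X^T X$ and $X^T y$ are each a single number.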

In [None]:
def calculate_beta_hat(X, y):
    beta_hat = # TODO: calculate beta_hat as the OLS estimator. 
    return beta_hat

beta_hat = calculate_beta_hat(X, y)
print("The OLS estimator beta_hat is: %.5f" % (beta_hat))
print("The linear model is: y = {:.5}X".format(beta_hat))

### Compute OLS predictions
Given this $\hat{\beta}$, we compute the predictions for the points $x^{(i)}$ we have in the data set. For each data point $x^{(i)}$, the prediction we compute is $\hat{\beta} x^{(i)}$. 

In this part, we will compute these predictions, and then plot the predictions to see how well they fit the data.

In [None]:
# TODO: calculate the predictions from the beta_hat estimated above.
predictions = # TODO: calculate the predictions.
plot_data(X, y, predictions=predictions)

## 1b) Gauss-Markov theorem
### Question: If we use the OLS estimator $\hat{\beta}$ to estimate the true $\beta$ from the dataset we simulated above, does the Gauss-Markov theorem apply?

TODO: fill in your answer.

If the Gauss-Markov theorem applies, then averaged over the randomness from the random draws of data, the mean of the OLS estimator $\hat{\beta}$ should equal the true $\beta$. Here, we calculate the mean and standard deviation of $\hat{\beta}$ by conducting multiple trials with sample size $n=100$, and averaging over the OLS estimator $\hat{\beta}$ found in each trial. 
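
To see why the mean should match, here is a short derivation for the one-feature model of this lab. It treats the $x^{(i)}$ as fixed and assumes only that each $\epsilon^{(i)}$ has mean zero given the $x^{(i)}$. This is a sketch of the standard argument, not a full statement of the theorem.

$$
\begin{aligned}
\hat{\beta}
  &= \frac{\sum_{i=1}^n x^{(i)} y^{(i)}}{\sum_{i=1}^n (x^{(i)})^2}
   = \frac{\sum_{i=1}^n x^{(i)} \left(\beta x^{(i)} + \epsilon^{(i)}\right)}{\sum_{i=1}^n (x^{(i)})^2}
   = \beta + \frac{\sum_{i=1}^n x^{(i)} \epsilon^{(i)}}{\sum_{i=1}^n (x^{(i)})^2}, \\
\mathbb{E}\left[\hat{\beta} \mid x^{(1)}, \dots, x^{(n)}\right]
  &= \beta + \frac{\sum_{i=1}^n x^{(i)} \, \mathbb{E}\left[\epsilon^{(i)} \mid x^{(i)}\right]}{\sum_{i=1}^n (x^{(i)})^2}
   = \beta .
\end{aligned}
$$

The first expression is the $d = 1$ case of $(X^T X)^{-1} X^T y$. The last equality uses $\mathbb{E}[\epsilon^{(i)} \mid x^{(i)}] = 0$ for every $i$.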

In [None]:
def OLS_estimate_mean_and_std(num_trials=50, n=100):
    """Calculates the mean and standard deviation of the OLS estimate beta_hat over num_trials random trials.
    
    Hint: For each trial in num_trials, you should draw n samples of data, and calculate beta_hat 
    from those n samples. After that, you should end up with num_trials different beta_hats. 
    Compute the mean and standard deviation of the resulting beta_hats.

    Args:
      num_trials: Number of trials. Calculate a different beta_hat for each trial using n samples,
        and find the mean and standard deviation of all of the num_trials different beta_hats you calculated.
      n: Number of samples to draw in each trial.
    
    Returns: beta_hat_mean, beta_hat_std. The first element is the mean of beta_hat over num_trials 
      trials, and the second element is the standard deviation of beta_hat over num_trials trials. Hint: 
      try the numpy functions of np.mean and np.std. Google these functions if you get stuck.
    
    """
    beta_hats = []
    for _ in range(num_trials):
        beta_hat =  # TODO: calculate beta_hat for this trial.
        
        beta_hats.append(beta_hat)
    beta_hats = np.array(beta_hats)
    return np.mean(beta_hats), np.std(beta_hats)

mean, std = OLS_estimate_mean_and_std()
print("Mean of OLS estimate beta_hat after 50 random trials: %.5f" % (mean))
print("Standard deviation of OLS estimate beta_hat after 50 random trials: %.5f" % (std))

### Question: Does the OLS estimate $\hat{\beta}$ appear to satisfy the Gauss-Markov theorem based on its mean over 50 random trials above?
TODO: fill in your answer.

If the Gauss-Markov theorem applies, then as the sample size $n$ grows, the OLS estimator $\hat{\beta}$ should approach the true $\beta$. Here, we calculate $\hat{\beta}$ for different sample sizes $n$ ranging from small sample sizes (only $n=5$) to large sample sizes ($n = 1028$). 
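
One way to quantify "approach" is through the spread of $\hat{\beta}$. For the one-feature model with independent errors of standard deviation $\sigma$, the variance of $\hat{\beta}$ given the $x^{(i)}$ is $\sigma^2 / \sum_i (x^{(i)})^2$. The sketch below replaces $\sum_i (x^{(i)})^2$ by its expectation $n \cdot \mathbb{E}[x^2]$, which is an approximation. It uses the simulation's values $\sigma = 10$ and $x \sim \text{Uniform}(0, 20)$, for which $\mathbb{E}[x^2] = 400/3$. The function name `approx_beta_hat_std` and the constants `SIGMA` and `E_X_SQUARED` are hypothetical and not part of the lab.

```python
import numpy as np

SIGMA = 10.0                 # standard deviation of epsilon in simulate_data_unbiased
E_X_SQUARED = 20.0 ** 2 / 3  # E[x^2] for x ~ Uniform(0, 20)

def approx_beta_hat_std(n):
    """Approximate std of beta_hat, replacing sum(x^2) by n * E[x^2]."""
    return SIGMA / np.sqrt(n * E_X_SQUARED)

# Smallest, default, and largest sample sizes used in this lab.
for n in [5, 100, 1028]:
    print(n, approx_beta_hat_std(n))
```

Under this approximation the standard deviation shrinks in proportion to $1/\sqrt{n}$, which gives a rough yardstick for comparing against the simulated estimates.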

In [None]:
def OLS_estimates_different_sample_sizes(sample_sizes):
    """Computes OLS estimate beta_hat for each sample size in sample_sizes.
    
    Args: sample_sizes: list of integers, each representing a sample size n.
    
    Returns: beta_hats: list of beta_hats, same length as sample_sizes.
      Each entry in beta_hats corresponds to the OLS estimate beta_hat for a 
      sample size in sample_sizes.
    """
    beta_hats=[]

    for n in sample_sizes:
        beta_hat = # TODO: calculate beta_hat for this sample size.
        
        beta_hats.append(beta_hat)
    
    return beta_hats

In [None]:
# Plot resulting beta_hats for each sample size.
# No TODOs here, just run this cell.
sample_sizes = [4+2**i for i in range(11)]
print("Getting beta_hat estimates for sample sizes:", sample_sizes)
beta_hats = OLS_estimates_different_sample_sizes(sample_sizes)
plt.plot(sample_sizes, beta_hats, label='beta_hat estimate')
plt.plot(sample_sizes, TRUE_BETA*np.ones_like(sample_sizes), label='true beta', color='black')
plt.legend()
plt.xlabel("Sample size n", fontsize=14)
plt.ylabel("beta_hat estimate", fontsize=14)
plt.show()

### Question: Does the OLS estimate $\hat{\beta}$ appear to satisfy the Gauss-Markov theorem based on its behavior as $n$ grows? 

TODO: fill in your answer

# Part 2: Biased $\epsilon$ errors
Previously, the $\epsilon$ errors all had mean 0. Therefore, the Gauss-Markov theorem applied, and the OLS estimator $\hat{\beta}$ worked well on average: it approached the true $\beta$ as the number of samples increased. In this section, we explore what happens when the $\epsilon$ errors do NOT have mean 0.


Suppose students who spent 12 or fewer hours per week on the class tended to over-report their hours. Then the grades we observe for students who *reported* a given number of hours at or below 12 will be lower than the true average grade of students who *actually* spent that many hours. To simulate this, for any example where the number of hours is greater than 4 and at most 12, we generate the data with an $\epsilon$ error that has a negative mean. We assume students who spent 4 hours or fewer on the class reported honestly, so their errors keep mean 0, as do the errors for students above 12 hours.
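
The rule for which examples get a biased error is easy to misread at the boundaries, so here is a minimal sketch of the error mean as a function of hours. The helper name `epsilon_mean` is hypothetical and does not appear in the lab. The thresholds 4 and 12 and the mean of -20 are taken from the simulation cell that follows.

```python
import numpy as np

def epsilon_mean(hours):
    """Mean of the epsilon error for each value in hours (hypothetical helper)."""
    hours = np.asarray(hours, dtype=float)
    # Biased (mean -20) only when 4 < hours <= 12; zero mean otherwise.
    return np.where((hours > 4) & (hours <= 12), -20.0, 0.0)

print(epsilon_mean([2, 4, 8, 12, 15]))
```

Exactly 4 hours falls on the honest side and exactly 12 hours falls on the biased side.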

In [None]:
# No TODOs here, just run this cell and observe the data with biased errors compared to the previous data with
# zero-mean errors.
def simulate_data_biased(n):
    # Sample Xs. Use a local X so the function does not silently read the global X.
    X = np.random.uniform(0,20,n)
    epsilons = []
    epsilon_mean_less_than_12 = -20
    for x in X: 
        # We assume students who spent 4 hours or fewer on the class were actually honest.
        if x <= 12 and x > 4:
            epsilon = np.random.normal(epsilon_mean_less_than_12,10)
            epsilons.append(epsilon)
        else: 
            epsilon = np.random.normal(0,10)
            epsilons.append(epsilon)
    # Calculate ys with the (possibly biased) errors.
    y = TRUE_BETA * X + np.array(epsilons)
    return X, y

def plot_data_biased_and_unbiased(X_unbiased, y_unbiased, X_biased, y_biased, predictions = None):
    plt.figure(figsize=(16, 8))
    plt.scatter(X_unbiased, y_unbiased, label="Data with zero mean epsilon errors")
    plt.scatter(X_biased, y_biased, label="Data with biased epsilon errors")
    if predictions is not None:
        plt.plot(X_biased, predictions, c='blue', linewidth=2)
    plt.xlabel("Number of hours per week spent on Data 102", fontsize=14)
    plt.ylabel("Grade", fontsize=14)
    plt.legend()
    plt.show()
    
n = 100
X_biased, y_biased = simulate_data_biased(n)
X_unbiased, y_unbiased = simulate_data_unbiased(n)
plot_data_biased_and_unbiased(X_unbiased, y_unbiased, X_biased, y_biased)

## 2a) Calculate the OLS estimator $\hat{\beta}$ on the biased data.

In [None]:
# Calculate beta_hat from the biased data.
beta_hat = calculate_beta_hat("""TODO""", """TODO""") # TODO: fill in the datasets to use to calculate beta_hat.
print("The OLS estimator beta_hat is: %.5f" % (beta_hat))
print("The linear model is: y = {:.5}X".format(beta_hat))

### Compute OLS predictions.
Compute the predictions from the OLS estimator calculated from the biased data.

In [None]:
# Compute the predictions from the beta_hat.
predictions = # TODO: calculate beta_hat predictions over the biased data.
plot_data(X_biased, y_biased, predictions=predictions)

## 2b) Gauss-Markov theorem
### Question: If we use the OLS estimator $\hat{\beta}$ to estimate the true $\beta$ from the dataset with biased errors that we simulated above, does the Gauss-Markov theorem apply? 

TODO: fill in your answer.

If the Gauss-Markov theorem applies, then averaged over the randomness from the random draws of data, the mean of the OLS estimator $\hat{\beta}$ should equal the true $\beta$. Here, we calculate the mean and standard deviation of $\hat{\beta}$ by conducting multiple trials with sample size $n=100$, and averaging over the OLS estimator $\hat{\beta}$ found in each trial. 
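
When the errors are allowed to have a nonzero mean, a short calculation shows what the mean of $\hat{\beta}$ becomes. Write $\mu(x) = \mathbb{E}[\epsilon \mid x]$ for the error mean at $x$ (notation introduced here, not in the lab), and treat the $x^{(i)}$ as fixed. A sketch for the one-feature model:

$$
\begin{aligned}
\hat{\beta}
  &= \frac{\sum_{i=1}^n x^{(i)} y^{(i)}}{\sum_{i=1}^n (x^{(i)})^2}
   = \beta + \frac{\sum_{i=1}^n x^{(i)} \epsilon^{(i)}}{\sum_{i=1}^n (x^{(i)})^2}, \\
\mathbb{E}\left[\hat{\beta} \mid x^{(1)}, \dots, x^{(n)}\right]
  &= \beta + \frac{\sum_{i=1}^n x^{(i)} \, \mu(x^{(i)})}{\sum_{i=1}^n (x^{(i)})^2} .
\end{aligned}
$$

The second term vanishes when $\mu(x) = 0$ for every $x$. Otherwise its sign and size depend on where the $x^{(i)}$ fall.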

In [None]:
def OLS_estimate_mean_and_std_biased(num_trials=50, n=100):
    """Calculates the mean and standard deviation of the OLS estimate beta_hat over num_trials random trials.
    This should be the same as your previous function, but using the biased simulated data.
    
    Args: num_trials: Number of trials over which to average beta_hat. 
    
    Returns: beta_hat_mean, beta_hat_std.
    """
    beta_hats = []
    for _ in range(num_trials):
        beta_hat = # TODO: calculate beta_hat for this trial.
        # This should be the same as your previous function, but using the biased simulated data.
        
        beta_hats.append(beta_hat)
    beta_hats = np.array(beta_hats)
    return np.mean(beta_hats), np.std(beta_hats)

mean, std = OLS_estimate_mean_and_std_biased()
print("Mean of OLS estimate beta_hat after 50 random trials: %.5f" % (mean))
print("Standard deviation of OLS estimate beta_hat after 50 random trials: %.5f" % (std))

### Question: Does the OLS estimate $\hat{\beta}$ appear to satisfy the Gauss-Markov theorem based on its mean over 50 random trials above?
TODO: fill in your answer.


If the Gauss-Markov theorem applies, then as the sample size $n$ grows, the OLS estimator $\hat{\beta}$ should approach the true $\beta$. Here, we calculate $\hat{\beta}$ for different sample sizes $n$ ranging from small sample sizes (only $n=5$) to large sample sizes ($n = 1028$). 

In [None]:
def OLS_estimates_different_sample_sizes_biased(sample_sizes):
    """Computes OLS estimate beta_hat for each sample size in sample_sizes.
    
    Args: sample_sizes: list of integers, each representing a sample size n.
    
    Returns: beta_hats: list of beta_hats, same length as sample_sizes.
      Each entry in beta_hats corresponds to the OLS estimate beta_hat for a 
      sample size in sample_sizes.
    """
    beta_hats=[]
    
    for n in sample_sizes:
        beta_hat =  # TODO: calculate beta_hat for each sample size.
        # This should be the same as your previous function, but using the biased simulated data.
        
        beta_hats.append(beta_hat)
        
    return beta_hats

In [None]:
# Plot resulting beta_hats for each sample size.
# No TODOs here, just run this cell.
sample_sizes = [4+2**i for i in range(11)]
print("Getting beta_hat estimates for sample sizes:", sample_sizes)
beta_hats = OLS_estimates_different_sample_sizes_biased(sample_sizes)
plt.plot(sample_sizes, beta_hats, label='beta_hat estimate')
plt.plot(sample_sizes, TRUE_BETA*np.ones_like(sample_sizes), label='true beta', color='black')
plt.legend()
plt.xlabel("Sample size n", fontsize=14)
plt.ylabel("beta_hat estimate", fontsize=14)
plt.show()

### Question: Does the OLS estimate $\hat{\beta}$ appear to satisfy the Gauss-Markov theorem based on its behavior as $n$ grows? 

TODO: fill in your answer