In [None]:
# Please don't change this cell, but do make sure to run it.
import otter
grader = otter.Notebook()

# Homework 3 Supplemental Notebook

## DSC 40A, Spring 2024

In this notebook, you'll answer Problems 6(b) and 6(c) in Homework 3. In addition to submitting your answers PDF to the Homework 3 assignment on Gradescope, also submit this notebook to the Homework 3, Problems 6(b) and 6(c) autograder on Gradescope and **wait until you see all public test cases pass!** Note that there **are hidden test cases** for both parts.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations

Below, we load the example dataset we're going to use in Problem 6. The $x$ values are stored in the array `x_values`, and the $y$ values are stored in the array `y_values`.

In [None]:
data = np.genfromtxt('data/lad.csv', delimiter=',') # Import the data.

def separate_data(data):
    '''Separate an nx2 array of data into an x array and a y array.'''
    x_values = data[:, 0]
    y_values = data[:, 1]
    return x_values, y_values

x_values, y_values = separate_data(data)

In [None]:
x_values

In [None]:
y_values

<!--
BEGIN QUESTION
name: q6b
points: 3
-->

### Problem 6(b)

Complete the implementation of the function `least_squares_regression`. It takes in two arrays, `x` and `y`, and should return a **tuple** of the form `(w0_star, w1_star)`, where `w0_star` and `w1_star` are the intercept and slope, respectively, of the linear hypothesis function that minimizes mean squared error when using `x` to predict `y`.

Note that your implementation of `least_squares_regression` should work on any arrays `x` and `y`, not just the `x_values` and `y_values` arrays defined above.

In [None]:
def least_squares_regression(x, y):
    ...

In [None]:
grader.check("q6b")

Great! Run the cell below to find the equation of the least squares regression line for this dataset.

In [None]:
intercept_sq, slope_sq = least_squares_regression(x_values, y_values)
from IPython.display import Markdown
Markdown(f'''
The least squares regression line for our dataset is $H^*(x) = {round(intercept_sq, 4)} + {round(slope_sq, 4)}x$.
''')

Now, let's turn our attention to implementing least absolute deviation regression. Recall from the PDF the following statement:

> If you have a dataset with $n$ data points in $\mathbb{R}^k$, where $k \leq n$, then one of the optimal LAD regression lines must pass through $k$ data points.

Our strategy, then, is to:

1. Look at all possible pairs of points in `data`.
1. Calculate the intercept and slope between each pair of points.
1. Compute the mean absolute deviation of the predictions made by each line.
1. Pick the intercept and slope that had the lowest mean absolute deviation.

Steps 1 and 2 are implemented for you below. Try to understand how they work.
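
To make steps 1 and 2 concrete, here is a minimal sketch on a hypothetical three-point toy array. The name `toy` and its values are made up for illustration and are not the homework data. The sketch assumes the points have distinct $x$-values.

```python
from itertools import combinations
import numpy as np

# Hypothetical toy dataset: each row is one (x, y) point.
toy = np.array([[0.0, 1.0],
                [1.0, 3.0],
                [2.0, 4.0]])

# Step 1: every unordered pair of rows.
toy_pairs = list(combinations(toy, 2))
n_pairs = len(toy_pairs)

# Step 2: slope and intercept of the line through the first pair,
# which here is the pair (0, 1) and (1, 3).
p1, p2 = toy_pairs[0]
toy_slope = (p2[1] - p1[1]) / (p2[0] - p1[0])
toy_intercept = p1[1] - toy_slope * p1[0]

print(n_pairs, toy_intercept, toy_slope)
```

With $n$ points there are $\binom{n}{2} = \frac{n(n-1)}{2}$ pairs, so three points give three candidate lines.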

In [None]:
def generate_all_unique_pairs(data):
    """Generate a list of all possible pairs of points from the data."""
    return list(combinations(data, 2))

In [None]:
def generate_all_lines(pairs_of_points):
    """ Generate the (w0, w1) pair for the line that goes through each given pair of points.
        Uses the fact that there is a unique line that passes through any two points.
    """
    lines = []

    for pair in pairs_of_points:
        point_1, point_2 = pair
        # Assumes the two points have different x-values; otherwise this divides by zero.
        slope = (point_2[1] - point_1[1]) / (point_2[0] - point_1[0])
        intercept = point_1[1] - slope * point_1[0]
        lines.append((intercept, slope))

    return lines

The cell below defines a list, `lines`, which contains several tuples. Each tuple corresponds to the intercept and slope of a different line, which passes through a different pair of points in `data`.
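
One edge case is worth keeping in mind: `generate_all_lines` divides by the difference in $x$-values, so two points that share an $x$-value have no finite slope. With NumPy floats, that division typically yields `inf` or `nan` with a warning rather than raising an error. The sketch below uses a hypothetical helper, `has_duplicate_x`, which is not part of the assignment, to check for this on made-up toy arrays.

```python
import numpy as np

def has_duplicate_x(data):
    """Hypothetical helper: return True if any two rows share the same x-value."""
    x = data[:, 0]
    return len(np.unique(x)) < len(x)

# Made-up toy arrays, not the homework data.
distinct_x = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 4.0]])
repeated_x = np.array([[0.0, 1.0], [0.0, 3.0], [2.0, 4.0]])

print(has_duplicate_x(distinct_x), has_duplicate_x(repeated_x))
```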

In [None]:
pairs = generate_all_unique_pairs(data) # Generate all unique pairs of data points from data.
lines = generate_all_lines(pairs) # Generate all (w0, w1) pairs corresponding to each unique pair of data points.
lines

<!--
BEGIN QUESTION
name: q6c
points: 5
-->


### Problem 6(c)

Now, your job is to implement steps 3 and 4 of the process outlined above. That is, you must:

3. Compute the mean absolute deviation of the predictions made by each line.
4. Pick the intercept and slope that had the lowest mean absolute deviation.

To do so, you'll implement the following two functions:

**`mean_absolute_error`**

Complete the implementation of the function `mean_absolute_error`. It should take in an intercept, `w0`, a slope, `w1`, and a 2D array `data`, and return the mean absolute error of the predictions made by the line defined by `w0` and `w1` on the given `data`. That is, it should return the value of:

$$R_{\text{abs}}(w_0, w_1) = \frac{1}{n} \displaystyle\sum_{i=1}^{n} \big|y_i - (w_0 + w_1x_i)\big|$$
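
As a quick check of the formula by hand, consider a hypothetical toy dataset (not the homework data) with the points $(0, 1)$, $(1, 3)$, and $(2, 4)$, and the line with $w_0 = 1$ and $w_1 = 2$. The predictions are $1$, $3$, and $5$, so:

$$R_{\text{abs}}(1, 2) = \frac{1}{3} \Big( \big|1 - 1\big| + \big|3 - 3\big| + \big|4 - 5\big| \Big) = \frac{1}{3}$$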

<br>

**`find_best_mad_line`**

Then, complete the implementation of `find_best_mad_line`. It should take in:

- `lines`, a list of tuples, formatted like the variable `lines` defined above, and
- `data`, a 2D array containing our dataset, formatted like the variable `data` defined above.

It should loop through each `line` in `lines`, and return the `(w0, w1)` pair (as a **tuple**) that defines the line with the lowest mean absolute error on the data. If multiple `(w0, w1)` pairs have the same lowest mean absolute error, you can return any one of them.

In [None]:
def mean_absolute_error(w0, w1, data):
    # After calling separate_data(data),
    # x_values and y_values are defined the same way as in 6(b).
    x_values, y_values = separate_data(data)
    
    ...
    
def find_best_mad_line(lines, data):
    ...

In [None]:
grader.check("q6c")

Nice job – you've implemented the necessary steps to perform least absolute deviations regression! Let's call `find_best_mad_line` to see what the best least absolute deviations line is for our dataset.

In [None]:
intercept_abs, slope_abs = find_best_mad_line(lines, data)

In [None]:
from IPython.display import Markdown
Markdown(f'''
The least absolute deviations line for our dataset is $H^*(x) = {round(intercept_abs, 4)} + {round(slope_abs, 4)}x$.
''')

### Problem 6(d)

Now that we have calculated the least squares regression line and the least absolute deviation regression line for our data, let's try plotting them together to see the difference! Generate a scatter plot with the data in black, the least squares line in blue, and the LAD line in red.

Below is some code to get you started. Make use of the functions you have written above to find the slopes and intercepts of the blue and red lines. **Remember to include a picture of your plot in your PDF; this problem is not autograded.**

In [None]:
intercept_abs, slope_abs = find_best_mad_line(generate_all_lines(generate_all_unique_pairs(data)), data)
intercept_sq, slope_sq = least_squares_regression(x_values, y_values)

# Add your code to generate the plot here.
...

<hr>

## Ready to Submit?

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells. 
1. Read through the notebook to make sure all cells ran and all tests passed.
1. Run the cell below to run all tests, and make sure that they all pass.
1. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
1. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.

Remember that we will run hidden test cases on your submission after the due date.

In [None]:
grader.check_all()