# Homework 3 Supplemental Notebook

## DSC 40A, Fall 2021

## Problem 3 – Least Absolute Deviation Regression

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations

Below, we've implemented least squares regression, which you will need later on when drawing your graph.

In [None]:
def least_squares_regression(x, y):
    """ Return the intercept and slope (w0, w1) of the least squares regression line for the input data. """
    x = np.array(x)
    y = np.array(y)
    
    r = np.corrcoef(x, y)[0, 1]
    
    w1 = r * np.std(y) / np.std(x)
    w0 = np.mean(y) - w1 * np.mean(x)
    
    return (w0, w1)
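
As a sanity check on the formulas above, here is a minimal sketch that compares them against `np.polyfit` on a small synthetic dataset. The arrays `x` and `y` and the names `slope_check` and `intercept_check` are made up for illustration; they are not part of the homework data.

```python
import numpy as np

# Synthetic data, made up purely for illustration (not the homework data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Slope and intercept from the correlation-based formulas used above.
r = np.corrcoef(x, y)[0, 1]
w1 = r * np.std(y) / np.std(x)
w0 = np.mean(y) - w1 * np.mean(x)

# np.polyfit with degree 1 returns coefficients highest power first: (slope, intercept).
slope_check, intercept_check = np.polyfit(x, y, 1)

print(np.isclose(w1, slope_check), np.isclose(w0, intercept_check))
```

Both comparisons should agree up to floating-point error, since a degree-1 polynomial fit minimizes the same squared error.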

Now let's read in the data we'll be working with and use the function above to find the slope and intercept of the least squares regression line.

In [None]:
data = np.genfromtxt("data/hw3data.csv", delimiter=",") # Import the data

def separate_data(data):
    '''Separate an nx2 array of data into an x array and a y array.'''
    x_values = data[:, 0]
    y_values = data[:, 1]
    return x_values, y_values

x_values, y_values = separate_data(data)

print(r"The (w_0^*, w_1^*) pair for the least squares regression line is:", least_squares_regression(x_values, y_values))

### Part B: Least Absolute Deviation Regression

For this part, you will need to implement two functions. 

The first function will calculate the mean absolute error for a given $(w_0, w_1)$ pair. It takes in the values of $w_0$ and $w_1$, along with the data as an $n \times 2$ array (the same format that `separate_data` expects), and returns a float equal to the mean absolute error, defined as $$R_{abs}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \big|y_i - (w_0 + w_1x_i)\big|$$

The second function will go through all of the candidate lines and pick the one with the lowest mean absolute error. It takes in a list of $(w_0, w_1)$ pairs, each represented as a tuple, and the data in the same format as the first function. It should return the $(w_0, w_1)$ pair with the lowest mean absolute error. If multiple $(w_0, w_1)$ pairs tie for the lowest mean absolute error, you can return any one of them.
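
To make the definition concrete, here is a small worked sketch on made-up numbers. It uses plain Python lists for brevity, and `toy_points`, `toy_lines`, and `toy_mae` are hypothetical names that do not appear elsewhere in this notebook. It also uses Python's built-in `min` with a key function, which is one possible way to pick the best candidate, not necessarily the structure the skeleton below has in mind.

```python
# Toy data and candidate lines, made up for illustration only.
toy_points = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0)]
toy_lines = [(1.0, 2.0), (1.0, 1.5), (0.0, 2.0)]  # (w0, w1) pairs

def toy_mae(w0, w1, points):
    """Average of |y - (w0 + w1 * x)| over all points (hypothetical helper)."""
    total = 0.0
    for x, y in points:
        total += abs(y - (w0 + w1 * x))
    return total / len(points)

# For (w0, w1) = (1, 2) the predictions are 1, 3, 5, so the
# residuals are 0, 0, -1 and the mean absolute error is 1/3.
print(toy_mae(1.0, 2.0, toy_points))

# min with a key function returns the first pair achieving the smallest error.
best = min(toy_lines, key=lambda line: toy_mae(line[0], line[1], toy_points))
print(best)
```

Here the pair `(1.0, 1.5)` wins: its only nonzero residual is 0.5 at the middle point, giving a mean absolute error of 0.5/3.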

In [None]:
def mean_absolute_error(w0, w1, data):
    """ Return the mean absolute error evaluated at (w0, w1) for the given data.
        Hint: You can do this with a for loop, or you can do this using np.abs and np.mean.
    """
    x_values, y_values = separate_data(data)
    return ... # TODO

In [None]:
def find_best_line(lines, data):
    """ Return the (w0, w1) pair from the list of lines that has the lowest mean absolute error.
        Hint: The structure of this function is not that different from the structure of the function
              you wrote in Homework 2, Problem 4e.
    """
    best_line = ... # TODO
    best_mae = np.inf
    for ...: # TODO
        mae = ... # TODO
        if mae < best_mae:
            best_line = ... # TODO
            best_mae = ...  # TODO
    return ... # TODO

The following two functions are provided for you. The first, `generate_all_unique_pairs`, generates all unique pairs of points from the data; the second, `generate_all_lines`, generates the line through each of those pairs. You don't need to understand the first function, but you should understand the second.

In [None]:
def generate_all_unique_pairs(data):
    """ Generate a list of all possible pairs of points from the data. """
    return list(combinations(data, 2))
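
For readers curious about what `combinations` produces, here is a minimal sketch on four made-up points. The names `toy_data` and `toy_pairs` are hypothetical and used only for this illustration.

```python
from itertools import combinations
import numpy as np

# Four made-up points, one (x, y) pair per row.
toy_data = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 4.0], [3.0, 8.0]])

# Iterating over a 2-D array yields its rows, so each pair holds two rows.
toy_pairs = list(combinations(toy_data, 2))

# n points give n * (n - 1) / 2 unordered pairs; here 4 * 3 / 2 = 6.
print(len(toy_pairs))

# Pairs come out in index order, and no point is paired with itself.
first, second = toy_pairs[0]
print(first, second)
```

Because the number of pairs grows quadratically with the number of points, this brute-force approach is only practical for small datasets.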

In [None]:
def generate_all_lines(pairs_of_points):
    """ Generate each (w0, w1) pair for the line that goes through each given pair of points.
        Uses the fact that there is a unique line that passes through any two points.
    """
    lines = []

    for pair in pairs_of_points:
        point_1, point_2 = pair
        slope = (point_2[1] - point_1[1]) / (point_2[0] - point_1[0])
        intercept = point_1[1] - slope * point_1[0]
        lines.append((intercept, slope))

    return lines
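
The slope and intercept arithmetic inside the loop can be checked by hand on a single pair. The sketch below uses two made-up points and repeats the same two formulas; it is only an illustration, not a replacement for `generate_all_lines`.

```python
# Two made-up points for illustration.
point_1 = (1.0, 2.0)
point_2 = (3.0, 6.0)

# Rise over run: (6 - 2) / (3 - 1).
slope = (point_2[1] - point_1[1]) / (point_2[0] - point_1[0])
# Solve y = intercept + slope * x for the intercept using point_1: 2 - 2 * 1.
intercept = point_1[1] - slope * point_1[0]

# The resulting line passes through both points.
assert intercept + slope * point_1[0] == point_1[1]
assert intercept + slope * point_2[0] == point_2[1]

print((intercept, slope))
```

Note that the slope formula divides by the difference in x-coordinates, so two points with the same x-coordinate do not define a $(w_0, w_1)$ pair. With plain Python floats that division raises `ZeroDivisionError`; with NumPy floats it produces `inf` or `nan` along with a runtime warning. Whether the homework data contains such pairs is not checked here.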

Now that our procedure for finding the optimal LAD regression line is implemented, we can calculate the LAD regression line for our data.

In [None]:
pairs = generate_all_unique_pairs(data) # Generate all unique pairs of data points from data
lines = generate_all_lines(pairs) # Generate all (w0, w1) pairs corresponding to each unique pair of data points 

print("The (w_0^*, w_1^*) pair for the LAD regression line is: ", find_best_line(lines, data)) # Calculate and print intercept and slope

### Part C: Plotting the result

Now that we have calculated the least squares regression line and the least absolute deviation regression line for our data, let's plot them together to see the difference! Generate a scatterplot with the data in black, the least squares line in blue, and the LAD line in red.

Below is some code to get you started. Make use of the functions you have written above to find the slopes and intercepts of the blue and red lines. 

In [None]:
w0_abs, w1_abs = ... # TODO (Hint: The answer can be found by piecing together the code in the previous code cell)
w0_sq, w1_sq = ... # TODO

# Add your code to generate the plot here
plt.figure(figsize=(10, 5)) # Don't change this
plt.plot(..., ..., color="blue", label="Least Squares Line") # TODO
plt.plot(..., ..., color="red", label="LAD Line") # TODO
plt.scatter(..., ..., color="black") # TODO
plt.legend()
plt.show();