In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw02.ipynb")

<div class="alert alert-success">

#### Homework 2 Supplemental Notebook
    
# Empirical Risk and Simple Linear Regression

### EECS 245, Fall 2025 at the University of Michigan
    
</div>

### Instructions

Most homeworks will have Jupyter Notebooks, like this one, designed to supplement the theoretical problems. 

To write and run code in this notebook, you have two options:

1. **Use the EECS 245 DataHub.** To do this, click the link provided in the Homework 2 PDF. Before doing so, read the instructions on the [**Tech Support**](https://eecs245.org/tech-support/#option-1-using-the-eecs-245-datahub) page on how to use the DataHub.
1. **Set up a Jupyter Notebook environment locally, and use `git` to clone our course repository.** For instructions on how to do this, see the [**Tech Support**](https://eecs245.org/tech-support) page of the course website.

To receive credit for the programming portion of the homework, you'll need to submit your completed notebook to the autograder on Gradescope. Your submission time for Homework 2 is the **later** of your PDF and code submission times.

Remember that homework problems have hidden test cases. The public test cases in your notebook only verify that your answer is in the correct format and on the right track; your results on the hidden tests will be available to you on Gradescope after we release grades.

In [None]:
# Run this cell.
import numpy as np

## Problem 5: Simple LADs 🧍

---

In Chapter 1.4, we explored simple linear regression, and defined it as the problem of finding the values of $w_0$ (intercept) and $w_1$ (slope) that minimize mean squared error for the model $h(x_i) = w_0 + w_1 x_i$.

\begin{align*}
R_{\text{sq}}(w_0, w_1) &= \frac{1}{n} \sum_{i=1}^{n} (y_i -(w_0 + w_1x_i))^2
\end{align*}

The optimal slope and intercept were denoted $w_1^*$ and $w_0^*$, respectively, and have closed-form solutions that are stated in Chapter 1.4. When using squared loss to find our optimal parameters, linear regression is often called "least squares regression." 
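For reference, here is a minimal sketch of those closed-form solutions, written in the commonly stated form $w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$ and $w_0^* = \bar{y} - w_1^* \bar{x}$. Chapter 1.4 may express the same quantities in different notation. The toy data and variable names below are illustrative only, and `np.polyfit` is used purely as an independent check.

```python
import numpy as np

# Toy dataset (illustrative; the same four points used in examples in this notebook).
x = np.array([1, 2, -1, 4])
y = np.array([15, 6, 7, 8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares slope and intercept.
w1_star = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0_star = y_bar - w1_star * x_bar

# np.polyfit with degree 1 returns [slope, intercept] for the least squares line.
slope_check, intercept_check = np.polyfit(x, y, 1)
```

Both approaches should agree up to floating point error.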

**What if we used a different loss function instead?**

In this question, we'll implement another type of linear regression: simple least absolute deviation (LAD) regression. LAD regression uses absolute loss to measure the quality of predictions, rather than squared loss. Put another way, to find the optimal slope $w_1^*$ and intercept $w_0^*$ for LAD regression, we minimize mean absolute error:

\begin{align*}
R_{\text{abs}}(w_0, w_1) &= \frac{1}{n} \sum_{i=1}^{n} |y_i -(w_0 + w_1x_i)|
\end{align*}

The "simple" in "simple LAD" refers to the fact that our hypothesis function $h(x_i) = w_0 + w_1 x_i$, like in regular simple linear regression, only uses a single input feature.

Since the absolute value function is not differentiable at zero, $R_{\text{abs}}$ is not differentiable everywhere, so we cannot just take partial derivatives of $R_{\text{abs}}$ with respect to $w_0$ and $w_1$, set them equal to zero, and solve for the values of $w_0$ and $w_1$, as we did to minimize $R_{\text{sq}}$. Minimizing mean absolute error was more tractable in the constant model case, but in general it requires techniques beyond the scope of this class.
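To make the differentiability issue concrete, here is a small sketch using the constant model $h(x_i) = w_0$ on a toy dataset. The helper name `mae_constant` and the step size `eps` are assumptions made for illustration. The sketch estimates the slope of the mean absolute error just to the left and just to the right of a data point; the two slopes disagree, which is the "kink" that prevents us from setting a derivative equal to zero.

```python
import numpy as np

y = np.array([15, 6, 7, 8])  # toy y-values (illustrative)

def mae_constant(c):
    """Mean absolute error of the constant prediction c on the toy data."""
    return np.mean(np.abs(y - c))

eps = 1e-6  # small step for one-sided difference quotients

# Slopes of the risk immediately to the left and right of the data point c = 7.
left_slope = (mae_constant(7) - mae_constant(7 - eps)) / eps    # approximately -0.5
right_slope = (mae_constant(7 + eps) - mae_constant(7)) / eps   # approximately 0.0
```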

In order to generate the optimal LAD regression line we are going to leverage the following theorem (which, luckily, we won't need to prove):

> The regression model that minimizes mean absolute error passes directly through at least $k$ points, where $k$ is the number of parameters of the model.

This theorem is useful to us because it allows us to adopt a very conceptually simple, albeit not very efficient, strategy to compute an optimal simple LAD regression line. Since our hypothesis function has $k = 2$ parameters, an intercept $w_0$ and a slope $w_1$, we can simply:

1. Generate all possible pairs of points in the dataset. By the theorem, some optimal LAD line passes through both points of at least one of these pairs.
1. For each pair of points:
    1. Find the equation of the line that passes through the pair. Denote the intercept and slope of this line $w_0$ and $w_1$, respectively.
    1. Compute the mean absolute error of the line with intercept $w_0$ and slope $w_1$, i.e. compute $R_\text{abs}(w_0, w_1)$.
1. Return the $(w_0, w_1)$ combination with the minimum value of $R_\text{abs}(w_0, w_1)$. By the above theorem, this line is guaranteed to minimize mean absolute error.

Notice that unlike with simple linear regression, the optimal simple LAD regression line may not be unique!
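To see the theorem and the non-uniqueness in a smaller setting, consider the constant model $h(x_i) = w_0$, which has $k = 1$ parameter. There, the theorem says that some optimal constant equals one of the data values, so it suffices to try each data value as a candidate. The sketch below does this on a toy dataset; the names `mae_constant`, `candidates`, and `minimizers` are illustrative assumptions, and the grid search is only a sanity check.

```python
import numpy as np

y = np.array([15, 6, 7, 8])  # toy y-values (illustrative)

def mae_constant(c):
    """Mean absolute error of the constant prediction c on the toy data."""
    return float(np.mean(np.abs(y - c)))

# With k = 1 parameter, the candidates are the data values themselves.
candidates = sorted(set(y.tolist()))
risks = {c: mae_constant(c) for c in candidates}

best_risk = min(risks.values())
minimizers = [c for c, r in risks.items() if np.isclose(r, best_risk)]  # [7, 8]

# Sanity check: no constant on a fine grid does better than the best candidate.
grid_best = min(mae_constant(c) for c in np.linspace(0, 20, 2001))
```

Here two different constants achieve the same minimum mean absolute error, a small-scale version of the non-uniqueness noted above.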

In this question, you will ultimately complete the implementation of the `SimpleLAD` **class**, which can be used as follows:

```python
>>> model = SimpleLAD()
>>> model.fit([1, 2, -1, 4], [15, 6, 7, 8])
>>> model.intercept_
7.2
>>> model.coef_
0.2
>>> model.predict(5)
8.2
>>> model.predict([5, -3.5, 5])
array([8.2, 6.5, 8.2])
```

You might recognize this syntax from Lab 2, as this is how model classes work in `sklearn`. Indeed, part of the point of this problem is to get you familiarized with how to use machine learning models in code (in addition to understanding how they might be implemented).

To help you, we've defined a helper function, `generate_pairs`. Observe how it works below.

In [None]:
def generate_pairs(x, y):
    from itertools import combinations 
    tuples = zip(x, y)
    return list(combinations(tuples, 2))

generate_pairs([1, 2, -1, 4], [15, 6, 7, 8])

Now, it's your turn!

### Problem 5a) (2 pts)

Complete the implementation of the function `generate_lines`. 
- `generate_lines` takes in a list, `pairs`, in which each element is a tuple. Each tuple is itself made up of two tuples, corresponding to a pair of points. The input to `generate_lines` may look like:

```python
    [((1, 2), (3, 7)), ((1, 10), (-4, 20))]
```

- `generate_lines` returns a list with the same length as `pairs`, in which each element is a tuple of the form `(intercept, slope)`. Element `i` of the returned list should be a tuple containing the intercept and slope of the line passing through the two points in `pairs[i]` (the order of the outputted lines should be the same as the order of the inputted pairs).

Example behavior is given below.

```python
>>> generate_lines([((1, 2), (3, 7)), ((1, 10), (-4, 20))])
[(-0.5, 2.5), (12.0, -2.0)]
```

For more context on the example above:
- The input to `generate_lines` contains two pairs of points.
- The first pair of points, $(1, 2) \text{ and } (3, 7)$, sit on the line $y = -0.5 + 2.5x$. The intercept of this line is -0.5 and the slope is 2.5, so the first returned tuple is `(-0.5, 2.5)`.
- The second pair of points, $(1, 10) \text{ and } (-4, 20)$, sit on the line $y = 12 - 2x$. The intercept of this line is 12 and the slope is -2, so the second returned tuple is `(12.0, -2.0)`.


Some guidance:
- A fact from high school algebra is that given any two distinct points, there is exactly one line that passes through them. You'll need to figure out how to programmatically find the intercept and slope of this line, given any two arbitrary points $(x_1, y_1)$ and $(x_2, y_2)$.
- There is theoretically the risk of a `ZeroDivisionError` if a pair contains two points with the same $x$-coordinate. We won't test your code on such examples.
- You don't have to manually convert the values in the output tuples to floats – this will likely happen automatically because your calculations will involve division, and if it doesn't, don't worry about it.
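One way to sanity-check your own output, sketched under the assumption of a hypothetical helper named `passes_through` (it is not part of the assignment or the autograder), is to confirm that each returned `(intercept, slope)` tuple actually goes through both points of the corresponding pair:

```python
import math

def passes_through(line, pair):
    """Return True if the line (intercept, slope) goes through both points in pair.

    Hypothetical checking helper; uses a small tolerance for floating point error.
    """
    intercept, slope = line
    return all(
        math.isclose(intercept + slope * x, y, abs_tol=1e-9)
        for x, y in pair
    )

ok_first = passes_through((-0.5, 2.5), ((1, 2), (3, 7)))       # True
ok_second = passes_through((12.0, -2.0), ((1, 10), (-4, 20)))  # True
wrong_line = passes_through((0.0, 2.0), ((1, 2), (3, 7)))      # False
```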

In [None]:
def generate_lines(pairs):
    ...
    
# Feel free to change this input to make sure your function works correctly.
generate_lines([((1, 2), (3, 7)), ((1, 10), (-4, 20))])

In [None]:
grader.check("p05_a")

### Problem 5b) (2 pts)

Complete the implementation of the function `mae_of_candidate_line`, which takes in four inputs:
- `intercept`, a float,
- `slope`, a float,
- `x`, a 1D list/array of numbers, and
- `y`, a 1D list/array of numbers.

`mae_of_candidate_line` should return the mean absolute error from using the line with intercept `intercept` and slope `slope` to predict `y` from `x`.

Example behavior is given below.

```python
>>> mae_of_candidate_line(5, 2, [1, 2, -1, 4], [15, 6, 7, 8])
5.0
```

For more context on the example above:

- There are four points in the dataset provided: $(1, 15)$, $(2, 6)$, $(-1, 7)$, and $(4, 8)$.
- The line we're using to make predictions is $h(x_i) = 5 + 2x_i$. This line, and the four points above, are visualized below:

<center><img src="imgs/mae-example.png" width=400></center>

- From left to right in the plot, the absolute errors of the line's predictions are 4, 8, 3, and 5. So, the mean of absolute errors is $\frac{4+8+3+5}{4} = 5$.

Don't use a `for`-loop!
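As a reminder of what loop-free code can look like, here is a short sketch of elementwise `numpy` operations on a toy input. The variable names are illustrative. Note that arithmetic on a plain Python list behaves differently from arithmetic on an array, so converting inputs with `np.array` first is often helpful.

```python
import numpy as np

x = [1, 2, -1, 4]

repeated = 2 * x              # list repetition: [1, 2, -1, 4, 1, 2, -1, 4]

x_arr = np.array(x)
doubled = 2 * x_arr           # elementwise: array([ 2,  4, -2,  8])
shifted = 5 + doubled         # scalar is broadcast: array([ 7,  9,  3, 13])
magnitudes = np.abs(x_arr)    # elementwise absolute value: array([1, 2, 1, 4])
average = magnitudes.mean()   # reduces the array to a single number: 2.0
```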

In [None]:
def mae_of_candidate_line(intercept, slope, x, y):
    ...

# Feel free to change this input to make sure your function works correctly.  
mae_of_candidate_line(5, 2, [1, 2, -1, 4], [15, 6, 7, 8])

In [None]:
grader.check("p05_b")

### Problem 5c) (5 pts)

Now, put it all together. Complete the implementation of the `SimpleLAD` class, which has two methods, apart from the constructor.

#### `fit`

`fit` takes in two* 1D lists/arrays, `x` and `y`. Using the previously-defined helper functions, `fit` determines the intercept and slope that minimize mean absolute error on the dataset defined by `x` and `y`.
                
`fit` should not return anything, but should instead set the values of `self.intercept_` (the optimal intercept) and `self.coef_` (the optimal slope; we use the attribute name `coef_` instead of `slope_` to match `sklearn`'s naming conventions).

If there are multiple optimal combinations of intercepts and slopes, set `self.intercept_` and `self.coef_` to any one of those combinations.

*As you'll see in the method stub, `fit` takes in a third argument (at the start), named `self`. The role of the `self` argument is to be able to access attributes and methods of the current instance of the `SimpleLAD` class. Read [this article](https://www.geeksforgeeks.org/self-in-python-class/) for more information on the role of the `self` argument in Python.

<br>

#### `predict`

`predict` takes in a single (non-`self`) input, named `x_new`, which can either be a single value or list/array of values.

- If `x_new` is a single value, `predict` should return a single value, corresponding to the predicted $y$-value for the passed in $x$-value, using the already-found `self.intercept_` and `self.coef_`.
- If `x_new` is a list or array, `predict` should return an **array** corresponding to the predicted $y$-values for the passed in $x$-values, using the already-found `self.intercept_` and `self.coef_`.

`fit` must be called before `predict`; if not, raise an `AttributeError`.
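To illustrate the mechanism, here is a sketch with a hypothetical, unrelated class named `Thermostat` (not part of the assignment). Attributes set through `self` in one method only exist after that method has run, so reading them earlier raises an `AttributeError`, which can be caught and re-raised with a clearer message.

```python
class Thermostat:
    """Hypothetical toy class, used only to illustrate `self` and AttributeError."""

    def calibrate(self, offset):
        # Attributes assigned via `self` are stored on this particular instance.
        self.offset_ = offset

    def read(self, raw):
        try:
            # Fails with AttributeError if `calibrate` has not set `self.offset_` yet.
            return raw + self.offset_
        except AttributeError:
            raise AttributeError('Cannot use `read` before `calibrate`.')


device = Thermostat()
try:
    device.read(20)
except AttributeError as err:
    message = str(err)        # 'Cannot use `read` before `calibrate`.'

device.calibrate(0.5)
reading = device.read(20)     # 20.5
```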

<br>

Example behavior is given below.

```python
>>> model = SimpleLAD()
>>> model.fit([1, 2, -1, 4], [15, 6, 7, 8])
>>> model.intercept_
7.2
>>> model.coef_
0.2
>>> model.predict(5)
8.2
>>> model.predict([5, -3.5, 5])
array([8.2, 6.5, 8.2])
```

For more context on the example above:

- There are four points in the dataset provided: $(1, 15)$, $(2, 6)$, $(-1, 7)$, and $(4, 8)$.
- The helper functions `generate_pairs`, `generate_lines`, and `mae_of_candidate_line` helped us deduce that the line with the minimum mean absolute error on this dataset is $h(x_i) = 7.2 + 0.2x_i$, so `model.intercept_` is `7.2` and `model.coef_` is `0.2`.
- Using the fitted hypothesis function $h(x_i) = 7.2 + 0.2x_i$ on the inputs 5, -3.5, and 5 gives us the predictions $h(5) = 8.2$, $h(-3.5) = 6.5$, and $h(5) = 8.2$, so we return an array with those three values. (Note that we return an array even though the inputs were provided as a list.) When using this hypothesis function on the single input 5, we return just the value $h(5) = 8.2$, not as an array.

In [None]:
class SimpleLAD:
    
    def __init__(self):
        """
        __init__ is the name given to the constructor method in a Python class.
        We don't need to do anything to initialize a SimpleLAD object, so this constructor
        doesn't actually do anything.
        """
        pass
    
    def fit(self, x, y):
        if len(x) != len(y):
            raise ValueError(f'Dimension mismatch: x has length {len(x)} while y has length {len(y)}')
            
        ...
        
        # The last two lines in the body of `fit` should be the two below.
        self.intercept_ = ...
        self.coef_ = ...
        
    def predict(self, x_new):
        if isinstance(x_new, list):
            x_new = np.array(x_new)
        try:
            ...
        except AttributeError:
            raise AttributeError('Cannot use `predict` before `fit`.')
            
            
# Feel free to change the inputs below to make sure your class implementation works correctly.
model = SimpleLAD()
model.fit([1, 2, -1, 4], [15, 6, 7, 8])
preds = model.predict([5, -3.5, 5])
print(f'''
model.intercept_ = {model.intercept_}
model.coef_ = {model.coef_}
model.predict([5, -3.5, 5]) = {preds}
''')

In [None]:
grader.check("p05_c")

Now that our implementation of `SimpleLAD` is complete, we can use it to fit real datasets! Run the cell below to load in a table with two columns, `x` and `y`.

In [None]:
import pandas as pd
data_for_lad = pd.read_csv('data/data-for-lad.csv')
data_for_lad.head()

Run the cell below to draw a scatter plot of `y` vs. `x`.

In [None]:
import plotly.express as px

fig = px.scatter(data_for_lad, x='x', y='y', color_discrete_sequence=['#444']).update_layout(width=800, height=400)
fig.show(renderer='png', scale=2)

There's a clear linear association at the bottom, with some outliers spread throughout the top. Let's see how the best-fitting lines look on this dataset, when the lines are chosen by minimizing mean squared error vs. mean absolute error.

First, we'll find the standard simple linear regression line, i.e. the one that minimizes mean squared error. We'll use `sklearn` to do this.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model_mse = LinearRegression()
model_mse.fit(X=data_for_lad[['x']], y=data_for_lad['y'])

In [None]:
model_mse.intercept_

In [None]:
# An array with one optimal parameter.
# sklearn's LinearRegression supports multiple regression, meaning it stores
# the coef_ attribute in a way that is flexible enough to hold multiple slope parameters.
model_mse.coef_

Now, let's compute the least absolute deviations line, i.e. the one that minimizes mean absolute error. **This is where your hard work comes in!**

In [None]:
model_lad = SimpleLAD()
model_lad.fit(data_for_lad['x'].to_numpy(), data_for_lad['y'].to_numpy())

In [None]:
model_lad.intercept_

In [None]:
model_lad.coef_

Let's graph both of these lines!

In [None]:
import plotly.graph_objects as go

fig = px.scatter(data_for_lad, x='x', y='y', color_discrete_sequence=['#888']).update_layout(width=800, height=400)

fig.add_trace(
    go.Scatter(
        x=[-1, 11],
        y=model_mse.predict([[-1], [11]]),
        mode='lines',
        name='Best Line when Minimizing MSE',
        line={'color': '#00274C'}
    )
)

fig.add_trace(
    go.Scatter(
        x=[-1, 11],
        y=model_lad.predict([-1, 11]),
        mode='lines',
        name='Best Line when Minimizing MAE',
        line={'color': '#FFCB05'}
    )
)

fig.show(renderer='png', scale=2)

What do you notice? There's nothing you need to write or comment on here, but you should think about what makes the lines appear so different, and **why** this is happening.

### Problem 5d) (0 pts, optional)

We've built a naïve implementation of simple LAD (least absolute deviations) regression. Suppose $n$ is the number of points in the dataset that we fit a `SimpleLAD` object on. Which of the following most accurately describes the runtime of `SimpleLAD.fit`, in Big-O notation? Assign `naive_lad_runtime` to an integer between 1 and 8, inclusive, corresponding to your answer among the choices below.

1. $O(1)$
2. $O(n)$
3. $O(n^2)$
4. $O(n^3)$
5. $O(\log n)$
6. $O(n \log n)$
7. $O(n!)$
8. $O(2^n)$

There are no hidden tests here, and this part is worth 0 points.

_Hint: When computing the theoretical runtime of an algorithm, it doesn't matter which language or package an operation is implemented in – a fast `numpy` vectorized operation still involves a loop!_

In [None]:
naive_lad_runtime = ...
naive_lad_runtime

In [None]:
grader.check("p05_d")

## Finish Line 🏁

Congratulations! You're ready to submit the programming portion of Homework 2.

To submit your work to Gradescope:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all public tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download`, then upload your notebook to Gradescope under "Homework 2, Problem 5 Code".
5. Stick around for a few minutes while the Gradescope autograder grades your work. Make sure you see that all **public tests** have passed on Gradescope. **Remember that homeworks have hidden tests!**