In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab02.ipynb")

<div class="alert alert-success">

#### Lab 2
    
# Empirical Risk and Simple Linear Regression

### EECS 245, Fall 2025 at the University of Michigan
    
</div>

### Instructions

Most labs will have Jupyter Notebooks, like this one, designed to supplement the in-person worksheet. 

To write and run code in this notebook, you have two options:

1. **Use the EECS 245 DataHub.** To do this, click the "code" link under Lab 2 on the course website. Log in with your uniqname and set a password.
1. **Set up a Jupyter Notebook environment locally, and use `git` to clone our course repository.** For instructions on how to do this, see the [**Tech Support**](https://eecs245.org/tech-support) page of the course website.

To receive credit for the lab, you'll need to submit your notebook with all TODO tasks completed to Gradescope and show your TA that all test cases have passed before the end of the lab session. Instructions on how to do this are at the bottom of the notebook.

## Arrays

---

We use `import` statements to add the objects (values, functions, classes) defined in other modules to our programs. There are a few different ways to `import`.


- **Option 1**: `import module`.
<br><small>Now, every time we want to use a name in `module`, we must write `module.<name>`.</small>

In [None]:
import math

In [None]:
math.sqrt(15)

In [None]:
sqrt(15) # Errors!

- **Option 2**: `import module as m`.
<br>Now, every time we want to use a name in `module`, we can write `m.<name>` instead of `module.<name>`.

In [None]:
# This is the standard way that we will import numpy.
import numpy as np

In [None]:
np.pi

In [None]:
# What does this do? We'll learn in a few weeks!
np.linalg.inv([[2, 1], 
               [3, 4]])

- **Option 3**: `from module import ...`.
<br><small>This way, we explicitly state the names we want to import from `module`.<br>To import everything, write `from module import *`.</small>

In [None]:
# Importing a particular function from the requests module.
from requests import get

In [None]:
# This typically fills up the namespace with a lot of unnecessary names, so use sparingly.
from math import *

In [None]:
sqrt
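
To get a feel for how many names `from math import *` brings in, here is a small sketch that counts them without actually running the star import. It relies on the fact that `math` does not define `__all__`, so a star import brings in every name that doesn't start with an underscore. The variable names `public_names` and `shadowed` are just for illustration.

```python
import builtins
import math

# Names that `from math import *` would add to our namespace.
public_names = [name for name in dir(math) if not name.startswith('_')]
print(len(public_names))

# Names that would replace a Python built-in of the same name.
shadowed = [name for name in public_names if hasattr(builtins, name)]
print(shadowed)
```

One of the names in `shadowed` is `pow`: after a star import, `pow(2, 3)` calls `math.pow` and returns a float instead of the built-in's integer result.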

### NumPy

<center>
<img src='imgs/numpy.png' width=300>
</center>

NumPy (pronounced "num pie") is a Python library (module) that provides support for **arrays** and operations on them.

The `pandas` library, which we will sometimes use to work with data in tables, works in conjunction with `numpy`.

To use `numpy`, we need to import it. It's usually imported as `np` (but doesn't have to be!). It also has to be installed on your computer first, but you already did that when you set up your environment.

In [None]:
import numpy as np

### Arrays

The core data structure in `numpy` is the array. Moving forward, "array" will always refer to a `numpy` array.

One way to instantiate an array is to pass a list as an argument to the function `np.array`.

In [None]:
np.array([4, 9, 1, 2])

- Arrays, unlike lists, must be **homogeneous** – all elements must be of the same type.

In [None]:
# All elements are converted to strings!
np.array([1961, 'michigan'])
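
One way to check which common type `numpy` settled on is the `dtype` attribute. The sketch below uses made-up arrays (`ints`, `mixed_numbers`, and `mixed_text` are illustrative names) and prints `dtype.kind`, a one-letter code for the family of the type.

```python
import numpy as np

ints = np.array([4, 9, 1, 2])
mixed_numbers = np.array([4, 9.5, 1, 2])    # ints and a float
mixed_text = np.array([1961, 'michigan'])   # an int and a string

print(ints.dtype.kind)           # 'i' for integer
print(mixed_numbers.dtype.kind)  # 'f' for float: every element became a float
print(mixed_text.dtype.kind)     # 'U' for Unicode string
```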

### Array-number arithmetic

Arrays make it easy to perform the same operation on every element **without a `for`-loop**.

This behavior is formally known as **broadcasting**. We also often say these operations are **vectorized**.

<center><img src="imgs/broadcasting.jpg" width=800></center>

In [None]:
temps = [68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temps

In [None]:
temp_array = np.array(temps)

In [None]:
# Increase all temperatures by 3 degrees.
temp_array + 3

In [None]:
# Halve all temperatures.
temp_array / 2

In [None]:
# Convert all temperatures to Celsius.
(5 / 9) * (temp_array - 32)

**Note**: In none of the above cells did we actually modify `temp_array`! Each of those expressions created a new array. To actually change `temp_array`, we need to reassign it to a new array.

In [None]:
temp_array

In [None]:
temp_array = (5 / 9) * (temp_array - 32) # Only run this once!

In [None]:
# Now in Celsius!
temp_array

### ⚠️ The dangers of unnecessary `for`-loops

Under the hood, `numpy` is implemented in C and Fortran, which are compiled languages that are much faster than Python. As a result, these **vectorized** operations are much quicker than if we used a vanilla Python `for`-loop.

We can time code in a Jupyter Notebook. Let's try and square a long sequence of integers and see how long it takes with a Python loop:

In [None]:
%%timeit
squares = []
for i in range(1_000_000):
    squares.append(i * i)

In vanilla Python, this takes about 0.03 seconds per loop.<br><br>In `numpy`:

In [None]:
%%timeit
squares = np.arange(1_000_000) ** 2

This takes only about 0.0008 seconds per loop, approximately 40x faster! (These numbers will likely be different on your computer or DataHub, but you'll see the second technique is much, much quicker.)
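
The `%%timeit` magic only works inside Jupyter. As a rough sketch of the same comparison in plain Python, the standard library's `timeit` module can be used instead. The function names and the smaller input size `n` below are choices made for this illustration, and the exact timings will vary by machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than above so the sketch runs quickly

def python_squares():
    # Square each integer one at a time with a Python loop.
    squares = []
    for i in range(n):
        squares.append(i * i)
    return squares

def numpy_squares():
    # Square every integer at once; int64 avoids overflow on any platform.
    return np.arange(n, dtype=np.int64) ** 2

# Average time per call over 10 calls.
loop_seconds = timeit.timeit(python_squares, number=10) / 10
numpy_seconds = timeit.timeit(numpy_squares, number=10) / 10
print(f"loop: {loop_seconds:.5f} s, numpy: {numpy_seconds:.5f} s")
```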

### Element-wise arithmetic

We can apply arithmetic operations to multiple arrays, provided they have the same length.

The result is computed **element-wise**, which means that the arithmetic operation is applied to one pair of elements from each array at a time.

<center><img src="imgs/elementwise.jpg" width=900></center>

In [None]:
a = np.array([4, 5, -1])
b = np.array([2, 3, 2])

In [None]:
a + b

In [None]:
a / b

In [None]:
a ** 2 + b ** 2

In [None]:
# Broadcasting implicitly replaces the 3 with np.array([3, 3, 3]),
# so that the shapes match up.
a + 3
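
To see why the arrays need to have the same length, the sketch below adds `a` to a shorter, made-up array `c` and catches the resulting error. The `try`/`except` is only there so the cell doesn't stop with a traceback.

```python
import numpy as np

a = np.array([4, 5, -1])
c = np.array([1, 2])  # hypothetical array with a different length

try:
    a + c
    error_message = None
except ValueError as err:
    # numpy can't pair up 3 elements with 2 elements.
    error_message = str(err)

print(error_message)
```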

### Array methods

Arrays come equipped with several handy methods; some examples are below, but you can read about them all [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

In [None]:
arr = np.array([3, 8, 4, -3.2])

In [None]:
(2 ** arr).sum()

In [None]:
(2 ** arr).mean()

In [None]:
(2 ** arr).max()

In [None]:
(2 ** arr).argmax()

In [None]:
# An attribute, not a method.
arr.shape

<div class="alert alert-info" markdown="1">

### Task 1

</div>


Congrats – you won the lottery! 🎉 

Here's how your payout works: on the first day of September, you are paid \\$0.01. Every day thereafter, your pay doubles, so on the second day you're paid \\$0.02, on the third day you're paid \\$0.04, on the fourth day you're paid \\$0.08, and so on.

September has 30 days.

Assign `total_winnings` to a **one-line expression** that uses the numbers `2` and `30`, along with the function `np.arange` and at least one array method, to compute the total amount **in dollars** you will be paid in September.

No loops allowed – before answering, experiment with how `np.arange` works.
    
_Hint: We have a [🎥 walkthrough video](https://youtu.be/w_witptT6Ts?si=1g42U-wIITfuax_a) of this problem, but don't watch it until you're stuck!_
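
As a starting point for experimenting, here are a few calls to `np.arange` with different numbers of arguments. The specific numbers are arbitrary and aren't part of the solution.

```python
import numpy as np

first_five = np.arange(5)         # start defaults to 0; the stop value is excluded
two_to_five = np.arange(2, 6)     # explicit start and stop
every_third = np.arange(1, 10, 3) # start, stop, and step

print(first_five)    # [0 1 2 3 4]
print(two_to_five)   # [2 3 4 5]
print(every_third)   # [1 4 7]
```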

In [None]:
total_winnings = ...

In [None]:
grader.check("task01")

<div class="alert alert-info" markdown="1">

### Task 2

</div>

In Activity 1 of the lab, you derived the formula for the harmonic mean of a collection of numbers, $y_1, y_2, \ldots, y_n$. You were first exposed to the harmonic mean in last week's lab, when you discovered that it's used for finding the average of several rates of change (e.g. speeds).

$$\text{harmonic mean}(y_1, y_2, \ldots, y_n) = \frac{n}{\sum_{i = 1}^n \frac{1}{y_i}}$$

Complete the implementation of the function `harmonic_mean`, which takes in **an array of positive numbers** and returns their harmonic mean. Assume that the array is non-empty and correctly formatted. Example behavior is given below.

```python
>>> harmonic_mean(np.array([1, 2, 3]))
1.6363636363636365

>>> harmonic_mean(np.array([3, 10]))
4.615384615384615
```
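
The first example can be checked by hand by plugging $y_1 = 1$, $y_2 = 2$, $y_3 = 3$ into the formula with $n = 3$:

$$\text{harmonic mean}(1, 2, 3) = \frac{3}{\frac{1}{1} + \frac{1}{2} + \frac{1}{3}} = \frac{3}{\frac{11}{6}} = \frac{18}{11} \approx 1.636$$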

Your solution should only be one line, and shouldn't involve any loops.

In [None]:
def harmonic_mean(vals):
    ...
    
# Feel free to change this input to make sure your function works correctly.
harmonic_mean(np.array([1, 2, 3]))

In [None]:
grader.check("task02")

## Simple Linear Regression

---

As we'll see in Chapter 1.4 (and tomorrow's lecture), the formulas for the optimal slope, $w_1^*$, and optimal intercept, $w_0^*$, that minimize mean squared error for the simple linear regression model, $h(x_i) = w_0 + w_1 x_i$, are:

$$w_1^* = \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$

Let's load in the commute times dataset from lecture and implement these formulas to see how they work. Run the cell below.

In [None]:
import pandas as pd
commutes = pd.read_csv('data/commute-times.csv')
commutes

The full dataset above has several columns, but for now, we only care about the following two.

In [None]:
commutes[['departure_hour', 'minutes']]

In [None]:
# Run this cell to visualize the data.
# Don't worry about how the plotting code works for now;
# most of it handles formatting.

import plotly.express as px

fig = px.scatter(
    commutes,
    x='departure_hour',
    y='minutes',
    size=np.ones(len(commutes)) * 50,
    size_max=8
)

fig.update_xaxes(
    title='Home Departure Time (AM)',
    gridcolor='#f0f0f0',
    showline=True,
    linecolor="black",
    linewidth=1,
)
fig.update_yaxes(
    title='Commute Time (Minutes)',
    gridcolor='#f0f0f0',
    showline=True,
    linecolor="black",
    linewidth=1,
)
fig.update_traces(marker_color="#3D81F6", marker_line_width=0)
fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    margin=dict(l=60, r=60, t=60, b=60),
    width=700,
    font=dict(
        family="Palatino Linotype, Palatino, serif",
        color="black"
    )
)

fig.show(renderer='notebook')

The first column, `departure_hour`, is our $x$ variable, while `minutes` is our $y$ variable. Below, we save both columns as arrays.

In [None]:
commutes_x = commutes['departure_hour'].to_numpy()
commutes_y = commutes['minutes'].to_numpy()

In [None]:
commutes_x

<div class="alert alert-info" markdown="1">

### Task 3

</div>

Complete the implementations of the functions `optimal_slope` and `optimal_intercept`. Both functions should take in two arrays, `x` and `y`, and should return the corresponding optimal parameter. Example behavior is given below.

```python
>>> optimal_slope(commutes_x, commutes_y)
-8.186941724265552

>>> optimal_intercept(commutes_x, commutes_y)
142.4482415877287

>>> optimal_slope(np.array([1, 2, 3]), np.array([3, 6, 9]))
3.0

>>> optimal_intercept(np.array([1, 2, 3]), np.array([3, 6, 9]))
0.0
```

For your convenience, the formulas for the optimal slope, $w_1^*$, and optimal intercept, $w_0^*$, are given below.

$$w_1^* = \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$
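
To see the formulas in action, here is the third example worked by hand, with $x = (1, 2, 3)$ and $y = (3, 6, 9)$, so that $\bar{x} = 2$ and $\bar{y} = 6$:

$$w_1^* = \frac{(1 - 2)(3 - 6) + (2 - 2)(6 - 6) + (3 - 2)(9 - 6)}{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2} = \frac{3 + 0 + 3}{1 + 0 + 1} = 3, \qquad w_0^* = 6 - 3 \cdot 2 = 0$$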

As before, no looping allowed! You will need to piece together several array operations to compute the optimal slope; approach your answer one step at a time.

In [None]:
def optimal_slope(x, y):
    ...
    
def optimal_intercept(x, y):
    ...
    
# Feel free to change these inputs to make sure your functions work correctly.
optimal_slope(commutes_x, commutes_y)

In [None]:
grader.check("task03")

If you completed Task 3 correctly, you should see the line with the optimal intercept and slope for the commute times data below. If the line doesn't look quite right, re-check your work in Task 3.

In [None]:
def data_and_line(x, y):
    '''
    Draws a scatter plot of (x[0], y[0]), (x[1], y[1]), ..., (x[n-1], y[n-1])
    along with the optimal simple linear regression line.
    '''
    fig = px.scatter(
        x=x,
        y=y,
        size=np.ones(len(x)) * 50,
        size_max=8
    )

    fig.update_xaxes(
        title='Home Departure Time (AM)',
        gridcolor='#f0f0f0',
        showline=True,
        linecolor="black",
        linewidth=1,
    )
    fig.update_yaxes(
        title='Commute Time (Minutes)',
        gridcolor='#f0f0f0',
        showline=True,
        linecolor="black",
        linewidth=1,
    )
    fig.update_traces(marker_color="#3D81F6", marker_line_width=0)
    fig.update_layout(
        plot_bgcolor='white',
        paper_bgcolor='white',
        margin=dict(l=60, r=60, t=60, b=60),
        width=700,
        font=dict(
            family="Palatino Linotype, Palatino, serif",
            color="black"
        ),
        xaxis_range=[5.5, 11.5],
        yaxis_range=[45, 140]
    )

    w1_star, w0_star = optimal_slope(x, y), optimal_intercept(x, y)

    fig.add_traces(
        px.line(
            x=np.linspace(6, 11),
            y=w0_star + w1_star * np.linspace(6, 11)
        ).data
    )

    fig.update_traces(line_color='orange', line_width=4, selector=dict(mode='lines'))

    fig.show(renderer='notebook')
    
data_and_line(commutes_x, commutes_y)

As you work through Activity 5 on the worksheet, feel free to use the functions `optimal_slope`, `optimal_intercept`, and `data_and_line` to validate your guesses.

In [None]:
# If we double each x-coordinate, and add 100 to each y-coordinate, what happens to the slope? Why?
optimal_slope(2 * commutes_x, commutes_y + 100)

Finally, note that it's rare to implement such formulas by hand in practice. Instead, a more common approach is to use a package like `sklearn`, which has implementations of lots of common machine learning models, including (but not limited to) linear regression.

All machine learning models in `sklearn` work using the same four basic steps.

1. Import the relevant class.

In [None]:
from sklearn.linear_model import LinearRegression

2. Instantiate an object from that class.

In [None]:
model = LinearRegression()

3. Fit the model. In other words, this is saying "go minimize mean squared error".<br><small>The <code>.reshape(-1, 1)</code> reformats <code>commutes_x</code> into a 2D array, which <code>sklearn</code> expects as the first argument to <code>fit</code>, since theoretically we could have multiple input variables.</small>

In [None]:
model.fit(commutes_x.reshape(-1, 1), commutes_y)
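
To see what `.reshape(-1, 1)` does on its own, here is a small sketch with a made-up array (`x` and its values are hypothetical, not taken from the commute data). The `-1` asks `numpy` to infer the number of rows from the array's length, and the `1` asks for a single column.

```python
import numpy as np

x = np.array([6.5, 8.0, 10.25])  # hypothetical departure hours

x_2d = x.reshape(-1, 1)

print(x.shape)     # (3,): a 1D array with 3 elements
print(x_2d.shape)  # (3, 1): a 2D array with 3 rows and 1 column
print(x_2d)
```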

4. Make predictions, and optionally, look at the optimal parameters.

In [None]:
# Predicted commute time if we leave at 8:30AM.
model.predict([[8.5]])

In [None]:
model.intercept_, model.coef_[0]

In [None]:
# Same as us!
optimal_intercept(commutes_x, commutes_y), optimal_slope(commutes_x, commutes_y)

Notice that `sklearn` found the same optimal slope and optimal intercept as we did! That's because under the hood, `sklearn` is following the same three-step modeling recipe that we discussed in lecture. It also minimizes mean squared error: the residual sum of squares is just $n$ times the mean squared error, so the same parameters minimize both. [The documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) tells us this!

> LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual **sum of squares** between the observed targets in the dataset, and the targets predicted by the linear approximation.

We'll use `sklearn` more heavily in a few weeks, when we return to multiple linear regression. For now, it's just worth pointing out that it follows the same math that we have been following.

## Finish Line 🏁

Congratulations! You're ready to submit the programming portion of Lab 2.

To submit your work to Gradescope:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all public tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download`, then upload your notebook to Gradescope under "Lab 2".
5. Stick around for a few minutes while the Gradescope autograder grades your work. Make sure you see that all **public tests** have passed on Gradescope.