In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab09.ipynb")

<table style="width: 100%;">
<tr style="background-color: transparent;">
<td width="100px"><img src="https://cs104williams.github.io/assets/cs104-logo.png" width="90px" style="text-align: center"/></td>
<td>
  <p style="margin-bottom: 0px; text-align: left; font-size: 18pt;"><strong>CSCI 104: Data Science and Computing for All</strong><br>
                Williams College<br>
                Fall 2025</p>
</td>
</tr>
</table>


# Lab 9: Linear Regression

<hr style="margin: 0px; border: 3px solid #500082;"/>

<h2>Instructions</h2>

- Before you begin, execute the cell at the TOP of the notebook to load the provided tests, as well as the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute these cells again.  
- Be sure to consult your [Python Reference](https://cs104williams.github.io/assets/python-library-ref.html)!
- Complete this notebook by filling in the cells provided. For problems asking you to write explanations, you **must** provide your answer in the designated space. 
- Please be sure to not re-assign variables throughout the notebook.  For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously.
- This lab has hidden tests. That means that even if the visible tests say 100% passed, your final grade may not be 100%. We will run more tests for correctness once everyone turns in the lab.
- To use one or more late days on this lab, please fill out our [late day form](https://forms.gle/4sD16h3hN1xRqQM27) **before** the due date.

<hr/>
<h2>Setup</h2>


In [None]:
# Run this cell to set up the notebook.
# These lines import the numpy, datascience, and cs104 libraries.

import numpy as np
from datascience import *
from cs104 import *
%matplotlib inline

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 1. Triple Jump Distances vs. Vertical Jump Heights (40 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Build intuition for values of Pearson correlation coefficient and calculate this value on real data.
- Fit a linear regression line to data and diagnose it.
- Build intuition for using linear regression for prediction. 
</font>

Does skill in one sport imply skill in a related sport?  The answer might be different for different activities. Let's find out whether it's true for the [triple jump](https://en.wikipedia.org/wiki/Triple_jump) (a horizontal jump similar to a long jump) and the [vertical jump](https://en.wikipedia.org/wiki/Vertical_jump).  Since we're learning about linear regression, we will look specifically for a *linear* association between skill level in the two sports.

The following data was collected by observing 40 collegiate-level soccer players. Each athlete's distances in both events were **measured in centimeters**. Run the cell below to load the data.

In [None]:
# Run this cell to load the data
jumps = Table.read_table('triple_vertical.csv')
jumps

Here is a scatter plot of the data.

In [None]:
jumps.scatter("triple")

#### Part 1.1 Examine the Data (5 pts)


Before computing the actual value, do you predict the correlation coefficient $r$ between `triple` and `vertical` to be closest to 0, 0.5, or -0.5?  Assign the variable `predicted_r` to your answer.

In [None]:
predicted_r = ...

In [None]:
grader.check("p1.1")

#### Part 1.2 Correlation Coefficient (5 pts)


Examine the documentation for the `pearson_correlation` function in our [inference library](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html) and use it to compute the correlation between `triple` and `vertical`.  
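
As a reminder of what the correlation coefficient measures, here is a minimal NumPy sketch on made-up data. The arrays `x` and `y` and the helper names `standard_units` and `correlation` are illustrative assumptions, not part of the course library. The sketch computes $r$ as the mean of the products of the two variables in standard units; for the lab itself, use `pearson_correlation` as described in the documentation.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """Pearson r: the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up data for illustration only (not the jumps data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = correlation(x, y)
print(r)  # a value between -1 and 1; close to 1 for this nearly linear data
```

Because $r$ is computed in standard units, it does not change if either variable is converted to different units (for example, centimeters to meters).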

In [None]:
jumps_r = ...
jumps_r

In [None]:
grader.check("p1.2")

#### Part 1.3 Regression Line (5 pts)


Examine the documentation for the `linear_regression` function in our [inference library](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html) and use it to compute the  slope `jumps_a` and intercept `jumps_b` of the best fitting line for this data.
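
For intuition about where the slope and intercept come from, the sketch below computes them from the correlation coefficient on made-up data: the slope is $r \cdot \text{SD}(y) / \text{SD}(x)$, and the intercept makes the line pass through the point of averages. The function name `fit_line` and the data are assumptions for illustration; in the lab, `linear_regression` does this work for you.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept for the line y = a*x + b."""
    r = np.corrcoef(x, y)[0, 1]
    a = r * np.std(y) / np.std(x)      # slope
    b = np.mean(y) - a * np.mean(x)    # intercept: line passes through (mean x, mean y)
    return a, b

# Made-up data for illustration only (not the jumps data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a, b = fit_line(x, y)
print(a, b)
```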

In [None]:
...
jumps_a = ...
jumps_b = ...
print('jumps_a = ', jumps_a, '; jumps_b = ', jumps_b)

In [None]:
grader.check("p1.3")

<!-- BEGIN QUESTION -->

#### Part 1.4 Plot Regression Line (5 pts)


Plot your regression line on top of the data with `plot_scatter_with_line` from our [inference library](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html).

In [None]:
...

<!-- END QUESTION -->

#### Part 1.5 Diagnostics: R^2 Score (5 pts)


We now wish to see whether this line is a good fit for the data using several diagnostics.  We'll first compute the $R^2$ score. Let $y$ be the observed data, $\bar{y}$ be the mean across all of the observed $y$, and $\hat{y}$ be the predictions given by the linear regression line we fit. Then 

$$R^2 = 1 - \frac{\sum_i{(y_i-\hat{y}_i)^2}}{\sum_i{(y_i- \bar{y})^2}}$$

Implement this equation in the function below. 
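
As a worked check of the formula, take the three points $(1, 1)$, $(2, 2)$, $(3, 3)$ and the line $\hat{y} = 0.75x$ (the same made-up case used in the test cell for this part). The predictions are $0.75$, $1.5$, and $2.25$, and $\bar{y} = 2$, so

$$R^2 = 1 - \frac{(0.25)^2 + (0.5)^2 + (0.75)^2}{(-1)^2 + 0^2 + 1^2} = 1 - \frac{0.875}{2} = 0.5625$$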

In [None]:
def r2_score(table, x_label, y_label, a, b):
    """
    R-squared score (also called the "coefficient of determination")
    for the predictions given y = ax + b
    """ 
    y = table.column(y_label)
    y_mean = np.mean(y)

    y_hat = line_predictions(a, b, table.column(x_label)) 
    residual = y - y_hat # Recall, the residual is y - y_hat
    
    numerator = ...
    denominator = ...
    ...

In [None]:
# Here's a simple table to test your function.
# Do not modify this cell!

example = Table().with_columns('x', make_array(1,2,3),
                               'y', make_array(1,2,3))

# perfect prediction, so R^2 score should be 1.
check(r2_score(example, 'x', 'y', 1, 0) == 1)

# prediction same as mean, so R^2 score should be 0.
check(r2_score(example, 'x', 'y', 0, 2) == 0)

# Better than the mean, but not the best.
check(r2_score(example, 'x', 'y', 0.75, 0) == 0.5625)

In [None]:
# Run this cell to check your implementation of the R^2 score.  It should be about 0.69559.
jumps_r2_score = r2_score(jumps, 'triple', 'vertical', jumps_a, jumps_b)
jumps_r2_score

In [None]:
grader.check("p1.5")

In general, and assuming that there is in fact a true linear relationship between the variables in question, an $R^2$ score close to 1 (roughly 0.7 and above) indicates that the regression line has good predictive power for new data that would fall within the range of values we've already seen.  Values much less than 0.7 indicate that, while a correlation still exists, there is sufficient variance in the data that any individual prediction may not be very accurate.

<!-- BEGIN QUESTION -->

#### Part 1.6 Diagnostics: Residuals (5 pts)


As a second diagnostic test, we'll plot the residuals for your regression line using `plot_residuals` from our [inference library](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html). 
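
When reading a residual plot, it helps to know what a least-squares line guarantees. The sketch below uses made-up data and `np.polyfit` (an assumption for illustration; the lab uses `linear_regression`) to show that the residuals of a least-squares line average to zero and have no linear trend in $x$. What remains to look for is curvature or a changing spread.

```python
import numpy as np

# Made-up data for illustration only (not the jumps data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.1, 5.9, 8.3, 9.7, 12.2])

# Least-squares slope and intercept via NumPy.
slope, intercept = np.polyfit(x, y, 1)

# Residual = observed value minus predicted value.
residuals = y - (slope * x + intercept)

print(np.round(residuals, 3))
print(np.mean(residuals))                # essentially 0, up to rounding error
print(np.corrcoef(x, residuals)[0, 1])   # essentially 0: no linear trend remains
```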

In [None]:
# Don't change anything in the line below. Just execute the cell. 
plot_residuals(jumps, 'triple', 'vertical', jumps_a, jumps_b)

Describe the residual plot.  Are there any patterns that would lead you to believe the regression line is not a good fit?

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

#### Part 1.7 Prediction (5 pts)


The world record for the triple jump distance is 18.29 *meters* by Jonathan Edwards. What is the prediction for Edwards' vertical jump using your regression line?

*Hint:* Make sure to convert units (e.g. inches, centimeters, meters)!

In [None]:
triple_record_vert_est = ...
print("Predicted vertical jump distance: {:f} centimeters".format(triple_record_vert_est))

In [None]:
grader.check("p1.7")

<!-- BEGIN QUESTION -->

#### Part 1.8 Prediction Accuracy (5 pts)


Do you think it makes sense to use this line to predict Edwards' vertical jump? 

*Hint:* Compare Edwards' triple jump distance to the range of triple jump distances in `jumps`. Is it relatively similar to the rest of the data? 
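
One way to make this comparison concrete is to check whether a new value falls inside the range of the observed values, after putting both in the same units. The sketch below uses made-up numbers and a hypothetical helper `within_observed_range`; it is not part of the course library.

```python
import numpy as np

def within_observed_range(x_values, x_new):
    """Return True if x_new lies between the smallest and largest observed x."""
    return bool(np.min(x_values) <= x_new <= np.max(x_values))

# Made-up observations, in centimeters (not the jumps data).
observed_cm = np.array([480.0, 520.0, 610.0, 655.0, 700.0])

# A new value reported in meters must be converted to centimeters first.
new_value_m = 9.5
new_value_cm = new_value_m * 100

print(within_observed_range(observed_cm, new_value_cm))  # False: 950 cm is outside 480-700 cm
print(within_observed_range(observed_cm, 600.0))         # True
```

A prediction for a value outside the observed range is an extrapolation: the line was never checked against data in that region.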

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 2. Cryptocurrencies (30 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Fit a linear regression line to data and diagnose it.
- Write functions to use linear regression for prediction. 
</font>

In this question, we'll examine the relationship between the prices of two of the most valuable cryptocurrencies, Bitcoin (BTC) and Ethereum (ETH). Each one has a dollar price attached to it at any given moment in time. For example, on December 1st, 2017, one BTC cost $\$10,859.56$ and one ETH cost $\$424.64.$

For fun, here are the current prices of [Bitcoin](https://www.coinbase.com/price/bitcoin) and [Ethereum](https://www.coinbase.com/price/ethereum)!

**You want to predict the price of ETH at some point in time based on the price of BTC.** Below, we load two [tables](https://www.kaggle.com/jessevent/all-crypto-currencies/data) called `btc` and `eth`. Each has 6 columns:
* `Date`, the date
* `Day`, the number of days since Bitcoin started being sold as a financial product
* `Open`, the value of the currency at the beginning of the day
* `Close`, the value of the currency at the end of the day
* `Adj Close`, the adjusted value of the currency at the end of the day
* `Volume`, the number of units traded that day

In [None]:
btc = Table.read_table('BTC-USD.csv')
btc

In [None]:
eth = Table.read_table('ETH-USD.csv')
eth

<!-- BEGIN QUESTION -->

#### Part 2.1 Examine the Data (5 pts)


In the cell below, create a line plot that visualizes the BTC and ETH open prices as a function of time. Both BTC and ETH open prices should be plotted on the same graph. 

*Hint:* The table function `.join()` may be helpful here.
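
To see what a join accomplishes, here is a small NumPy sketch of the underlying idea: two series recorded on partly different days are lined up by keeping only the days they share. The arrays are made up, and `np.intersect1d` stands in for the table operation purely for illustration; in the lab, use the table function `.join()`.

```python
import numpy as np

# Two made-up price series keyed by day number.
days_a = np.array([1, 2, 3, 4])
price_a = np.array([10.0, 11.0, 12.0, 13.0])

days_b = np.array([2, 3, 4, 5])
price_b = np.array([0.5, 0.6, 0.7, 0.8])

# Find the days present in both series and the matching row positions in each.
shared_days, rows_a, rows_b = np.intersect1d(days_a, days_b, return_indices=True)

aligned_a = price_a[rows_a]
aligned_b = price_b[rows_b]

print(shared_days)  # [2 3 4]
print(aligned_a)    # [11. 12. 13.]
print(aligned_b)    # [0.5 0.6 0.7]
```

After alignment, row $i$ of each series refers to the same day, which is what plotting both on one graph or computing a correlation requires.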

In [None]:
# Create a line plot of btc and eth open prices as a function of time
opens = ...

opens.plot("Day")

<!-- END QUESTION -->

#### Part 2.2 Correlation Coefficient (5 pts)


Now, calculate the correlation coefficient between the opening prices of BTC and ETH. 

In [None]:
crypto_r = ...
crypto_r

In [None]:
grader.check("p2.2")

#### Part 2.3 Regression Line (5 pts)


Compute the slope `crypto_a` and intercept `crypto_b` for the best fitting line for our data.

In [None]:
...
crypto_a = ...
crypto_b = ...

In [None]:
grader.check("p2.3")

#### Part 2.4 Prediction (5 pts)


Write a function `eth_predictor` which takes an opening BTC price and predicts the opening price of ETH.


In [None]:
def eth_predictor(btc_price):
    ...

In [None]:
# Don't change this cell. 
# Does the output make sense? 
eth_predictor(20000)

In [None]:
grader.check("p2.4")

<!-- BEGIN QUESTION -->

#### Part 2.5 Residuals (5 pts)


Use the `plot_regression_and_residuals` function from our [inference library](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html) to visualize our regression line and the residuals for it.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Part 2.6 Does the Linear Model Fit the Data? (5 pts)


Considering the shape of the scatter plot of the true data, is the linear regression model we used reasonable? If so, what features or characteristics make this model reasonable? If not, what features or characteristics make it unreasonable? 



<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 3. Inference for Prediction (20 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Create confidence intervals for predictions.
- Examine and interpret uncertainty in estimated predictions. 
</font>

One of the primary uses of regression is to make predictions for individuals who were not in our original dataset but who are similar enough to be considered part of the same population. 
For example, in Question 1, we may wish to predict the vertical jump of a new athlete based on their triple jump distance.

Assuming that there is a true linear relationship between the two types of jumps for the whole population, we would use that true line to predict the vertical jump height for a specific triple jump distance. 


Since we don't know the true line, we instead fit a regression line to our sample of jumps.  However, we know that the sample might have been different, which would mean the regression line would have been different too, as would our prediction.  As in other estimation and prediction problems, we should characterize our uncertainty in the sampling process by reporting a prediction as a confidence interval.

<!-- BEGIN QUESTION -->

#### Part 3.1 Prediction from Resamples (5 pts)


Let's first examine the variability in predicting the vertical jump for someone with a 600 cm triple jump. To do so, we'll use bootstrapping to generate new samples and look at the variability of different samples.

Complete the code below to plot the regression lines for four bootstrap resamples of the `jumps` table.  Recall that you can create a resample of `jumps` via `jumps.sample()`.

Run this cell several times.
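
To see why the lines change from run to run, the sketch below resamples a made-up data set with replacement and refits a line each time. It uses plain NumPy (`rng.integers` to pick rows and `np.polyfit` to fit) as a stand-in for `jumps.sample()` and `linear_regression`; the helper name `resample_fit` and the data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample of (x, y) pairs; not the jumps data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.4])

def resample_fit(x, y, rng):
    """Draw rows with replacement, then refit the least-squares line."""
    n = len(x)
    rows = rng.integers(0, n, size=n)   # row indices; repeats are allowed
    slope, intercept = np.polyfit(x[rows], y[rows], 1)
    return slope, intercept

# Four resamples give four slightly different lines.
for _ in range(4):
    slope, intercept = resample_fit(x, y, rng)
    print(round(slope, 3), round(intercept, 3))
```

Each resample repeats some rows and omits others, so each fitted line differs a little from the line fit to the original sample.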

In [None]:
import matplotlib.pyplot as plots

for i in np.arange(0,4):
    # Create a resample of the jumps sample.
    resample = ...
    
    # Compute the slope and intercept for the linear regression line
    # for the resample.
    a,b = ...
    
    # Do not change the lines below. This will plot all four lines from the 
    # four bootstrap resamples on the same plot.
    xlims = make_array(550,650)
    plots.plot(xlims, a * xlims + b, lw=4)

<!-- END QUESTION -->

#### Part 3.2 Bootstrapping a Confidence Interval (5 pts)


Let's look at a specific `x` value (for example, 600 in the plot above). We'll call this the `x_target_value`. Our goal is to estimate a confidence interval for the predicted `y`s for this `x_target_value`. 

To do so, we'll use our favorite bootstrapping idiom, with the following steps within the simulation loop:
- Obtain a bootstrap resample of the `observed_sample`
- Fit a linear regression line to that new sample 
- Predict $\hat{y}$ given the fitted linear regression line and the `x_target_value`
- Append this predicted $\hat{y}$ to the `bootstrap_statistics` array. 

Complete the following steps within the function below. 
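
The same steps can be sketched in plain NumPy on made-up data, which may help you see the shape of the loop before filling in the course-library version. Everything here is an assumption for illustration: the function name `bootstrap_predictions_sketch`, the data, and the use of `np.polyfit` in place of `linear_regression`.

```python
import numpy as np

def bootstrap_predictions_sketch(x_target, x, y, num_trials, rng):
    """Predicted y at x_target from lines fit to bootstrap resamples."""
    n = len(x)
    predictions = np.empty(num_trials)
    for trial in range(num_trials):
        rows = rng.integers(0, n, size=n)                    # resample rows with replacement
        slope, intercept = np.polyfit(x[rows], y[rows], 1)   # refit the line
        predictions[trial] = slope * x_target + intercept    # predict at the target x
    return predictions

rng = np.random.default_rng(1)

# Made-up sample of (x, y) pairs; not the jumps data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.4])

preds = bootstrap_predictions_sketch(4.5, x, y, 200, rng)
print(preds.mean(), preds.std())
```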

In [None]:
def bootstrap_prediction(x_target_value, observed_sample, x_label, y_label, num_trials):
    """
    Create bootstrap resamples from the given sample (which is represented as a table)
    and predict y given x_target_value. 
    """
    bootstrap_statistics = make_array()
    
    for i in np.arange(0, num_trials): 
        simulated_resample = observed_sample.sample()
        
        a,b = linear_regression(simulated_resample, x_label, y_label)
        
        y_hat = ...
        
        bootstrap_statistics = ...
    
    return bootstrap_statistics

# Do not change the lines below
# This is code to test your function. Do these numbers look reasonable?
three_predictions = bootstrap_prediction(600, jumps, 'triple', 'vertical', 3)
three_predictions

In [None]:
grader.check("p3.2")

<!-- BEGIN QUESTION -->

#### Part 3.3 Confidence Intervals for Predicted Vertical Jump (5 pts)


The following function uses your `bootstrap_prediction` function to create a 95% CI for the vertical jump corresponding to a given triple jump target value.
We call that function to predict the vertical jump for a 600 cm triple jump.  We show the confidence interval on a scatter plot that also shows the original sample.  The red dot corresponds to the prediction for the original sample.
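
A 95% confidence interval built this way is the middle 95% of the bootstrap statistics. The sketch below shows that idea with `np.percentile` on made-up numbers. The helper `percentile_interval` is hypothetical, and the course's `confidence_interval` function may compute percentiles with a slightly different convention.

```python
import numpy as np

def percentile_interval(level, statistics):
    """Endpoints of the middle `level` percent of an array of statistics."""
    tail = (100 - level) / 2
    left = np.percentile(statistics, tail)
    right = np.percentile(statistics, 100 - tail)
    return left, right

rng = np.random.default_rng(2)

# Stand-in for 1000 bootstrap predictions (made-up values centered near 50).
fake_statistics = rng.normal(loc=50.0, scale=2.0, size=1000)

left, right = percentile_interval(95, fake_statistics)
print(np.round([left, right], 2))
```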

In [None]:
# We have provided the function for you. You do not need to change anything. 
def predict_vertical_jump(triple_target_value):
    bootstrap_statistics = bootstrap_prediction(triple_target_value, jumps, 'triple', 'vertical', 1000)

    results = Table().with_columns("Bootstrap distribution for y_hat given x="+str(triple_target_value), bootstrap_statistics)
    plot = jumps.scatter('triple', 'vertical', fit_line=True)
    left_right = confidence_interval(95, bootstrap_statistics)
    plot.set_title('y_hat 95% Confidence Interval: ' + str(np.round(left_right,2)))
    plot.y_interval(left_right, x=triple_target_value)
    plot.dot(triple_target_value, line_predictions(jumps_a, jumps_b, triple_target_value))

In [None]:
# This might take a minute or two to run. 
predict_vertical_jump(600)

Precisely state your vertical jump prediction confidence interval based on the bootstrapped results above. 

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Part 3.4 Predictions for Different Target Values (5 pts)


Run the following cell to repeat the bootstrap process for three different triple jump distances: 400, 600, and 800 cm.

In [None]:
# This might take a minute or two to run. 
with Figure(1,3):
    predict_vertical_jump(400)
    predict_vertical_jump(600)
    predict_vertical_jump(800)

Examine these three plots.  Do the confidence intervals seem consistent with the observed sample we do have?  Do they have the same width?  Why are they the same, or why do they differ?

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

<hr class="m-0" style="border: 3px solid #500082;"/>

# You're Done!
Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to 
the corresponding assignment. For Lab N, the assignment will be called "Lab N Autograder".

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.


**Note:** It may take a couple minutes to run all of the tests before creating the zip file.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)