In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("pre09.ipynb")

<table style="width: 100%;">
<tr style="background-color: transparent;">
<td width="100px"><img src="https://cs104williams.github.io/assets/cs104-logo.png" width="90px" style="text-align: center"/></td>
<td>
  <p style="margin-bottom: 0px; text-align: left; font-size: 18pt;"><strong>CSCI 104: Data Science and Computing for All</strong><br>
                Williams College<br>
                Fall 2025</p>
</td>
</tr>
</table>


# Prelab 9: Correlation and Linear Regression

**Instructions**
- Before you begin, execute the cell at the TOP of the notebook to load the provided tests, as well as the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute these cells again.
- Be sure to consult your [Python Reference](https://cs104williams.github.io/assets/python-library-ref.html)!
- Complete this notebook by filling in the cells provided. 
- Please be sure not to reassign variables throughout the notebook.  For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously.
- There are no hidden tests in prelabs.

<hr/>
<h2>Setup</h2>


In [None]:
# Run this cell to set up the notebook.
# These lines import the numpy, datascience, and cs104 libraries.

import numpy as np
from datascience import *
from cs104 import *
%matplotlib inline

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 1. Choose the right statistical inference technique (40 pts)



We have now seen three broad statistical inference problems: 
* **Hypothesis testing**: simulating samples from a null hypothesis and calculating a p-value.
* **Estimation**: estimating a population parameter from a sample via bootstrap resampling and the percentile method. 
* **Association**: quantifying a linear relationship between two variables.

Here is a breakdown of specific techniques used to approach these problems: 

---

* **(A)  Single category hypothesis test.** We simulate samples using the proportion that the null hypothesis specifies for the single category. The test statistic is the absolute difference between the observed and null proportions. 
* **(B) Multiple categories hypothesis test.** We simulate a sample from the null hypothesis by sampling from the null distribution. The test statistic is the total variation distance (TVD). 
* **(C) Numeric subgroup hypothesis test.** We have data for the full population, and we simulate from the null hypothesis by random sampling from the full population. The test statistic is the absolute difference between the mean in the observed sample and the mean in the full population. 
* **(D) Permutation hypothesis test.** We simulate from the null hypothesis by shuffling the labels of the two groups. The test statistic is the absolute difference in means between Group A and Group B (real or shuffled).  
<span>&nbsp;</span>
* **(E) Confidence interval from bootstrapping.** We estimate a confidence interval by bootstrap resampling and the percentile method.  
<span>&nbsp;</span>
* **(F) Correlation coefficients.** We measure the strength of a linear relationship between two variables. 
* **(G) Linear Regression.** We compute the slope and intercept of the line that best fits a set of data points.

---
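To make techniques (D) and (E) concrete, here is a minimal sketch using only NumPy and made-up data. It is not the course library's implementation; the function names `permutation_p_value` and `bootstrap_mean_interval` are hypothetical and chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def permutation_p_value(group_a, group_b, trials=2000):
    """Technique (D): shuffle group labels, compare absolute differences in means."""
    observed = abs(np.mean(group_a) - np.mean(group_b))
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    simulated = np.empty(trials)
    for i in range(trials):
        shuffled = rng.permutation(pooled)  # shuffling values is equivalent to shuffling labels
        simulated[i] = abs(np.mean(shuffled[:n_a]) - np.mean(shuffled[n_a:]))
    # p-value: fraction of shuffled statistics at least as large as the observed one
    return np.mean(simulated >= observed)

def bootstrap_mean_interval(sample, trials=2000, level=95):
    """Technique (E): resample with replacement, then take percentiles of the means."""
    means = np.empty(trials)
    for i in range(trials):
        resample = rng.choice(sample, size=len(sample), replace=True)
        means[i] = np.mean(resample)
    tail = (100 - level) / 2
    return np.percentile(means, tail), np.percentile(means, 100 - tail)

# Made-up data, for illustration only
group_a = rng.normal(10, 2, size=50)
group_b = rng.normal(12, 2, size=50)

p_value = permutation_p_value(group_a, group_b)
lower, upper = bootstrap_mean_interval(group_a)
print("p-value:", p_value)
print("95% interval for the mean of group_a:", lower, upper)
```

The two functions answer different kinds of questions: the first asks whether an observed difference is surprising under a null hypothesis, while the second gives a range of plausible values for a population parameter.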

Assign the letter 'A' through 'G' to each of the following real-world scenarios. Each letter, 'A' through 'G', may be used more than once. 

#### Part 1.1 Zoos (5 pts)


A small zoo has five different types of animals. They are worried they are not allocating costs according to the popularity of the animals.  They have data on the proportion of costs spent on each of the five types of animals and the percentage of visitors for each of the five types of animals. They want to know: Is there a statistically significant difference between the distribution of costs and the distribution of visitors? 

In [None]:
zoos = ...

In [None]:
grader.check("p1.1")

#### Part 1.2 Braking Distance (5 pts)


A car company is testing a new model.  They measure the distance it takes for that car to come to a stop after applying the brakes when traveling at different speeds.  They determine that braking distance and speed have a Pearson correlation coefficient of 0.9.  They want to know: what is the braking distance when traveling at 57.5 miles per hour?

In [None]:
braking = ...

In [None]:
grader.check("p1.2")

#### Part 1.3 Voter Turnout (5 pts)


In a small town's local election, 60% of registered voters turned out to vote, while historically the voter turnout is around 50%. Local officials want to know: Is this year's voter turnout significantly different than what has been observed historically? 

In [None]:
voters = ...

In [None]:
grader.check("p1.3")

#### Part 1.4 Batteries (5 pts)


A manufacturing company produces batteries and wants to estimate the average lifespan of its batteries. They sample 100 batteries and find that the average lifespan is 20 days of use. They wish to know: what is an estimate of the average lifespan for all of their batteries?

In [None]:
batteries = ...

In [None]:
grader.check("p1.4")

#### Part 1.5 Teachers (5 pts)


A large public school system has several dozen different instructors teaching high school geometry. They want to know: Does one instructor's teaching method result in significantly different student exam scores compared to the others? 

In [None]:
teachers = ...

In [None]:
grader.check("p1.5")

#### Part 1.6 Smartphones (5 pts)


An electronics company has set an acceptable defect rate of 5% for its smartphones. After quality control testing, they find that only 2% of smartphones have defects. They want to know: Is their observed defect rate significantly different from the acceptable defect rate? 

In [None]:
phones = ...

In [None]:
grader.check("p1.6")

#### Part 1.7 Office Locations (5 pts)


A data science company has offices in two locations: downtown urban offices and suburban offices. They measure employee productivity based on the number of Jupyter Notebook cells coded. They want to determine if there is a statistically significant difference between the mean productivity of employees in their two office locations. 

In [None]:
locations = ...

In [None]:
grader.check("p1.7")

#### Part 1.8 Reaction Time (5 pts)


A medical study on aging measures the reaction time in people who are asked to push a button when they hear a beep.  They conducted this study on 500 individuals ranging in age from 20 to 80.  They wish to know: how strongly is reaction time correlated with age?

In [None]:
reactions = ...

In [None]:
grader.check("p1.8")

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 2. How Faithful is Old Faithful? (50 pts)


Old Faithful is a geyser in Yellowstone National Park that is famous for erupting on a fairly regular schedule. Run the cell below to see Old Faithful in action!

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("wE8NDuzt8eg")

In [None]:
# For the curious: the cell above is how to display a YouTube video in a
# Jupyter notebook.  The argument to YouTubeVideo is the part
# of the URL (called a "query parameter") that identifies the
# video.  For example, the full URL for this video is:
#   https://www.youtube.com/watch?v=wE8NDuzt8eg

Some of Old Faithful's eruptions last longer than others.  Whenever there is a long eruption, it is usually followed by an even longer wait before the next eruption. If you visit Yellowstone, you might want to predict when the next eruption will happen, so that you can see the rest of the park instead of waiting by the geyser.
 
Today, we will use a dataset on eruption durations and waiting times to see if we can make such predictions accurately with linear regression.

The dataset has one row for each observed eruption.  It includes the following columns:
- `duration`: Eruption duration, in minutes
- `wait`: Time between this eruption and the next, also in minutes

Run the next cell to load the dataset.

In [None]:
faithful = Table.read_table("faithful.csv")
faithful

We would like to use linear regression to make predictions, but that won't work well if the data aren't roughly linearly related.  To check that, we should look at the data.

#### Part 2.1 Visualize the Data (5 pts)


Make a scatter plot of the data to examine whether we can predict wait times from eruption durations.  It's conventional to put the column we want to predict -- wait time -- on the vertical axis and the other column on the horizontal axis.


In [None]:
plot = ...

In [None]:
grader.check("p2.1")

When you examine your scatter plot above, you should see that eruption duration and waiting time seem to have a roughly positive linear association.  Further, the eruption durations seem to cluster; there are a bunch of short eruptions and a bunch of longer ones. We'll first explore **correlation** to determine whether it is reasonable to apply linear regression to the whole dataset.  

#### Part 2.2 Correlation Warm-up (5 pts)


As a quick sanity check, (without running any code) which of the following values do you think will be closest to the correlation coefficient between eruption duration and waiting time in this dataset?

* -1
* 0
* 1

Assign `correlation` to the number corresponding to your guess.

In [None]:
correlation = ...

In [None]:
grader.check("p2.2")

#### Part 2.3 Correlation Coefficient (5 pts)


Now, use the [pearson_correlation()](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html) function from our library to calculate the correlation coefficient. 
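
For intuition about what a correlation coefficient measures, here is a minimal NumPy sketch on made-up data. It computes `r` as the mean of the products of the two variables in standard units. This is an illustration only, not the course library's implementation, and the helper names `standard_units` and `correlation` are hypothetical.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """r is the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up data, for illustration only
x_example = np.array([1, 2, 3, 4, 5])
y_example = np.array([2, 4, 5, 4, 5])

r_example = correlation(x_example, y_example)
print(r_example)
```

The result always lies between -1 and 1, and it matches NumPy's built-in `np.corrcoef(x_example, y_example)[0, 1]` on the same data.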

In [None]:
r = ...
r

In [None]:
grader.check("p2.3")

Since `r` is high, we can proceed under the assumption that eruption duration and waiting time are linearly related, and go on to perform linear regression. 

#### Part 2.4 Linear Regression (5 pts)


Let's first run the regression and then examine various aspects of our optimal choice of prediction line.  

Using functions in our [inference library](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html), set `slope` and `intercept` to be the slope and intercept of the line that minimizes mean squared error (the line of "best fit").  For this regression, remember that we have `'duration'` on the x-axis and `'wait'` on the y-axis. 
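
As a reminder of where the line of best fit comes from, the sketch below works through the standard formulas on made-up data: the slope is `r` times the ratio of the SDs, and the intercept makes the line pass through the point of averages. It uses only NumPy and hypothetical variable names; it is not the inference library's code, and in your answer you should use the library functions.

```python
import numpy as np

# Made-up data, for illustration only
x_example = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_example = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# slope = r * SD(y) / SD(x); intercept = mean(y) - slope * mean(x)
r_example = np.corrcoef(x_example, y_example)[0, 1]
slope_example = r_example * np.std(y_example) / np.std(x_example)
intercept_example = np.mean(y_example) - slope_example * np.mean(x_example)

def mean_squared_error(slope, intercept):
    """Average squared vertical distance between the points and a line."""
    predicted = slope * x_example + intercept
    return np.mean((y_example - predicted) ** 2)

best_mse = mean_squared_error(slope_example, intercept_example)
nearby_mse = mean_squared_error(slope_example + 0.1, intercept_example)
print("slope:", slope_example, "intercept:", intercept_example)
print("MSE of best line:", best_mse, "MSE of a nearby line:", nearby_mse)
```

Nudging the slope away from the computed value increases the mean squared error, which is what "best fit" means here.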

In [None]:
...
slope = ...
intercept = ...

# Examine your slope and intercept
print("Slope =", slope)
print("Intercept =", intercept)

In [None]:
grader.check("p2.4")

Run the following cell to plot your line.

In [None]:
plot_scatter_with_line(faithful, 'duration', 'wait', slope, intercept)

#### Part 2.5 Predicted Times for 2- and 5-minute Eruptions (5 pts)


The slope and intercept characterize the regression line.  

We often use linear regression for prediction. Recall, to predict the waiting time (y) for a given duration (x), multiply the eruption's duration by `slope` and then add `intercept`.

Using this, compute the predicted waiting time for an eruption that lasts 2 minutes, and for an eruption that lasts 5 minutes.
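
The multiply-then-add rule above can be sketched with made-up numbers. The slope and intercept below are hypothetical values and are not the ones for the `faithful` data; the helper name `predict` is also just for illustration.

```python
import numpy as np

def predict(x, slope, intercept):
    """Regression prediction: multiply x by the slope, then add the intercept."""
    return slope * np.asarray(x, dtype=float) + intercept

# Hypothetical line, for illustration only
example_slope = 0.6
example_intercept = 2.2

single_prediction = predict(2.5, example_slope, example_intercept)
several_predictions = predict([1, 3, 5], example_slope, example_intercept)
print(single_prediction)
print(several_predictions)
```

Because the arithmetic works elementwise, the same expression handles one value or a whole array of values.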

In [None]:
two_minute_predicted_waiting_time = ...
five_minute_predicted_waiting_time = ...

# Here is a helper function to print out your predictions.
# Don't modify the code below.
def print_prediction(duration, predicted_waiting_time):
    print("After an eruption lasting", np.round(duration, 2),
          "minutes, we predict you'll wait", np.round(predicted_waiting_time, 2),
          "minutes until the next eruption.\n")

print_prediction(2, two_minute_predicted_waiting_time)
print_prediction(5, five_minute_predicted_waiting_time)

In [None]:
grader.check("p2.5")

#### Part 2.6 Predictions for Unobserved Durations (5 pts)


In the `faithful` dataset, we did not observe any eruption that lasted exactly 0, 2.5, or 60 minutes.  

By only **looking** at the regression line we plotted above (**and no code**), give your estimates for the predicted waiting time for an eruption that lasts 0 minutes, 2.5 minutes, and an hour. 

In [None]:
zero_minute_predicted_waiting_time = ...
two_point_five_minute_predicted_waiting_time = ...
hour_predicted_waiting_time = ...

print_prediction(0, zero_minute_predicted_waiting_time)
print_prediction(2.5, two_point_five_minute_predicted_waiting_time)
print_prediction(60, hour_predicted_waiting_time)

In [None]:
grader.check("p2.6")

#### Part 2.7 Prediction Accuracy (5 pts)


While we can predict the waiting time for any length eruption, predictions may not be meaningful if they are for data far away from any observed values.  Which of the following statements best captures whether our three predictions are reliable?  

In the cell below, assign the variable `predictions` to the integer matching the correct statement: 

1.  All three are reliable predictions.
2.  The predictions for 0 minutes and 2.5 minutes are reliable, since they are both close to observed values, whereas the prediction for a 60 minute eruption is not reliable since we observed no actual eruptions anywhere close to that length and it may have a very different character than shorter ones.
3.  The prediction for 2.5 minutes is reliable, since it is close to observed values, whereas the prediction for a 60 minute eruption is not reliable since we observed no actual eruptions anywhere close to that length and it may have a very different character than shorter ones.  Further, the prediction for a 0 minute eruption is meaningless since a 0 minute eruption is physically impossible. 


In [None]:
predictions = ...

In [None]:
grader.check("p2.7")

#### Part 2.8 Predictions (5 pts)


To compute accuracy of our linear regression predictions, we will eventually compare our predicted wait times to the true wait times. 

As a first step, for each eruption in the `faithful` table, use the linear regression you just developed and the function `line_predictions` (from our inference library) to predict the waiting time.  Put these numbers into a column in a new table called `faithful_predictions`. The first row should look like this:

|duration|wait|predicted wait|
|-|-|-|
|3.6|79|72.1011|

In [None]:
# Use a new name here so that `predictions` from Part 2.7 is not reassigned.
predicted_waits = ...
faithful_predictions = faithful.with_column("predicted wait", predicted_waits)
faithful_predictions.show(5)

In [None]:
grader.check("p2.8")

#### Part 2.9 Residuals (5 pts)


 How close were we?  Compute the **residual** for each eruption in the dataset.  The residual is the actual waiting time minus the predicted waiting time.  
 
Add the residuals to `faithful_predictions` as a new column called `residual` and name the resulting table `faithful_residuals`.
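
As a small illustration of the definition, the sketch below computes residuals for made-up data and a hypothetical least-squares line (slope 0.6, intercept 2.2, which is the best-fit line for these five points). The names are for illustration only and are not part of the assignment.

```python
import numpy as np

# Made-up data, for illustration only
x_example = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
actual_example = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Predictions from the least-squares line for these points
predicted_example = 0.6 * x_example + 2.2

# Residual = actual value minus predicted value
residuals_example = actual_example - predicted_example
print(residuals_example)
print("mean residual:", np.mean(residuals_example))
```

A positive residual means the line under-estimated that point, and a negative residual means it over-estimated. For a least-squares line, the residuals average out to zero (up to rounding).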

In [None]:
...
faithful_residuals = ...
faithful_residuals

In [None]:
grader.check("p2.9")

#### Part 2.10 Plotting the Residuals (5 pts)


Here is a plot of the residuals you computed.  Each point corresponds to one eruption.  It shows how much our prediction over- or under-estimated the waiting time.

In [None]:
plot = faithful_residuals.scatter("duration", "residual", color="r")

plot.line(y=0, color='darkblue', lw=4)

The residual plot of a good regression shows no pattern. That is, the residuals look about the same, above and below the horizontal line at 0, across the range of the predictor variable.  Does this plot give you confidence in your regression's output?  Set `good_regression` to True or False to indicate your answer.
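
For contrast, here is a sketch of what a residual pattern looks like when a line is the wrong model. The data are made up and deliberately curved (`y = x ** 2`), and the line is fit with NumPy's `np.polyfit` rather than the course library.

```python
import numpy as np

# Made-up, deliberately non-linear data
x_curve = np.arange(0, 11, dtype=float)
y_curve = x_curve ** 2

# Fit a straight line anyway, then compute the residuals
slope_curve, intercept_curve = np.polyfit(x_curve, y_curve, 1)
residuals_curve = y_curve - (slope_curve * x_curve + intercept_curve)
print(np.round(residuals_curve, 2))
```

The residuals are positive at both ends and negative in the middle, forming a U shape. A systematic shape like this is the kind of pattern that signals a poor linear fit, whereas residuals scattered evenly around 0 do not.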

In [None]:
good_regression = ...

In [None]:
grader.check("p2.10")

Earlier, you should have found that the correlation is fairly close to 1, so the line fits fairly well on the training data.  That means the residuals are small (close to 0) relative to the waiting times. We can see that visually by plotting the waiting times and residuals together:

In [None]:
with Figure(1,2):
    faithful_residuals.scatter("duration", "wait", title="actual waiting time", color="blue", ylim=(-20,100))
    faithful_residuals.scatter("duration", "residual", title="residuals", color="red", ylim=(-20,100))

<hr class="m-0" style="border: 3px solid #500082;"/>

# You're Done!
Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to 
the corresponding assignment. For Prelab N, the assignment will be called "Prelab N Autograder".

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)