In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw10.ipynb")

# Homework 10: Linear Regression

**Recommended Readings**: 

* [The Regression Line](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html)
* [Method of Least Squares](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html)
* [Least Squares Regression](https://www.inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that require written explanations, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure not to reassign variables!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This homework has hidden tests on it. That means even though tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.**

Directly sharing answers is not okay, but discussing problems is encouraged.

In [None]:
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from datetime import datetime

## Triple Jump Distances vs. Vertical Jump Heights 

Does skill in one sport imply skill in a related sport?  The answer might be different for different activities. Let's find out whether it's true for the [triple jump](https://en.wikipedia.org/wiki/Triple_jump) (a horizontal jump similar to a long jump) and the [vertical jump](https://en.wikipedia.org/wiki/Vertical_jump).  Since we're learning about linear regression, we will look specifically for a *linear* association between skill level in the two sports.

The following data was collected by observing 40 collegiate-level soccer players. Each athlete's distances in both events were measured in centimeters. Run the cell below to load the data.

In [None]:
jumps = Table.read_table('triple_vertical.csv')
jumps

### Task 01 📍

Create a function `standard_units` that converts the values in the array `data` to standard units.

_Points:_ 4

In [None]:
def standard_units(data):
    ...

In [None]:
grader.check("task_01")

### Task 02 📍

Now, using the `standard_units` function, define the function `correlation` which computes the correlation between `x` and `y`.
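
As a reminder of the definitions involved, here is a minimal sketch on made-up numbers, using only the Python standard library rather than the `datascience` and NumPy tools used in this notebook. The names `to_standard_units`, `correlation_sketch`, `toy_x`, and `toy_y` are hypothetical and are not part of the assignment. It assumes the textbook definitions: standard units subtract the mean and divide by the (population) standard deviation, and the correlation is the mean of the products of paired values in standard units.

```python
from statistics import mean, pstdev

def to_standard_units(values):
    """Hypothetical helper: (value - mean) / population SD for each value."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def correlation_sketch(xs, ys):
    """Mean of the products of paired values, both in standard units."""
    xs_su = to_standard_units(xs)
    ys_su = to_standard_units(ys)
    return mean(a * b for a, b in zip(xs_su, ys_su))

# Made-up data, not the jumps table.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 6, 8, 10]  # exactly linear in toy_x, so r should be 1

toy_r = correlation_sketch(toy_x, toy_y)
print(toy_r)
```

Because `toy_y` is an exact increasing linear function of `toy_x`, the result should equal 1 up to floating-point rounding.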

_Points:_ 2

In [None]:
def correlation(x, y):
    ...

In [None]:
grader.check("task_02")

### Task 03 📍🔎

<!-- BEGIN QUESTION -->

Before running a regression, it's important to see what the data looks like, because our eyes are good at picking out unusual patterns in data.  Draw a scatter plot, **that includes the regression line**, with the triple jump distances on the horizontal axis and the vertical jump heights on the vertical axis.

See [the documentation on `scatter`](http://data8.org/datascience/_autosummary/datascience.tables.Table.scatter.html#datascience.tables.Table.scatter) for instructions on how to have Python draw the regression line automatically.

*Hint:* The `fit_line` argument may be useful here!

_Points:_ 2

In [None]:
...

<!-- END QUESTION -->

### Task 04 📍🔎

<!-- BEGIN QUESTION -->

1. Does the correlation coefficient $r$ look closest to 0, .5, or -.5? 
2. Provide a brief explanation of your choice.

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Task 05 📍

Create a function called `parameter_estimates` that takes in the argument `tbl`, a two-column table where the first column is the x-axis and the second column is the y-axis. It should return an array with three elements: 
1. the **correlation coefficient** of the two columns
2. the **slope**
3. the **intercept** of the regression line that predicts the second column from the first, in original units.

*Hint:* This is a rare occasion where it's better to implement the function using column indices instead of column names, in order to be able to call this function on any table. If you need a reminder about how to use column indices to pull out individual columns, please refer to [the Tables section of the textbook](https://www.inferentialthinking.com/chapters/06/Tables.html#accessing-the-data-in-a-column).
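
The relationship between $r$ and the regression line in original units can be sketched on made-up numbers as follows. This assumes the textbook formulas (slope is $r$ times the SD of $y$ divided by the SD of $x$; intercept is the mean of $y$ minus the slope times the mean of $x$). The helper `line_from_summary` and the toy lists are hypothetical, and the sketch uses plain Python lists rather than a table.

```python
from statistics import mean, pstdev

def line_from_summary(r, xs, ys):
    """Hypothetical helper: slope and intercept in original units, given r."""
    slope = r * pstdev(ys) / pstdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1, so r = 1.
toy_x = [0, 1, 2, 3]
toy_y = [1, 3, 5, 7]

toy_slope, toy_intercept = line_from_summary(1.0, toy_x, toy_y)
print(toy_slope, toy_intercept)
```

For these points the recovered line should match $y = 2x + 1$ up to rounding.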

_Points:_ 4

In [None]:
def parameter_estimates(tbl):
    ...
    return make_array(r, slope, intercept)

parameters = parameter_estimates(jumps) 
print('r:', parameters.item(0), '; slope:', parameters.item(1), '; intercept:', parameters.item(2))

In [None]:
grader.check("task_05")

### Task 06 📍

Now, suppose you want to go the other way and predict a triple jump distance given a vertical jump distance. What would the regression parameters of this linear model be? How do they compare to the regression parameters from the model where you were predicting vertical jump distance given a triple jump distance (in Task 05)?

Set `regression_changes` to an array of 3 elements, with each element indicating whether or not the corresponding item returned by `parameter_estimates` changes when vertical and triple switch roles as $x$ and $y$. For example, if $r$ and the slope change but the intercept does not, `regression_changes` would be assigned `make_array(True, True, False)`.

_Points:_ 3

In [None]:
regression_changes = ...
regression_changes

In [None]:
grader.check("task_06")

### Task 07 📍

Let's use `parameters` (from Task 05) to predict what certain athletes' vertical jump heights would be given their triple jump distances.

The world record for the triple jump distance is 18.29 *meters* by Jonathan Edwards. What is the prediction for Edwards' vertical jump using this line?

*Hint:* Make sure to convert from meters to centimeters!
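
The general pattern of converting units before applying a regression line might look like the sketch below. Every number here (`example_slope`, `example_intercept`, `distance_m`) is made up for illustration and is unrelated to the values your `parameters` array contains.

```python
CM_PER_M = 100  # centimeters per meter

def predict(slope, intercept, x):
    """Prediction from a line: slope * x + intercept."""
    return slope * x + intercept

example_slope = 0.1        # made-up value
example_intercept = -2.0   # made-up value
distance_m = 6.5           # made-up distance in meters

# The line was fit on centimeters, so convert before predicting.
distance_cm = distance_m * CM_PER_M
predicted = predict(example_slope, example_intercept, distance_cm)
print(predicted)
```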

_Points:_ 2

In [None]:
triple_record_vert_est = ...
print("Predicted vertical jump distance: {:f} centimeters".format(triple_record_vert_est))

In [None]:
grader.check("task_07")

### Task 08 📍🔎

<!-- BEGIN QUESTION -->

1. Do you think it makes sense to use this line to predict Edwards' vertical jump?
2. Justify your response.

*Hint:* Compare Edwards' triple jump distance to the triple jump distances in `jumps`. Is it relatively similar to the rest of the data (shown in Task 03)? 

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Cryptocurrencies

Imagine you're an investor in December 2017. Cryptocurrencies, online currencies backed by secure software, are becoming extremely valuable, and you want in on the action!

The two most valuable cryptocurrencies are Bitcoin (BTC) and Ethereum (ETH). Each one has a dollar price attached to it at any given moment in time. For example, on December 1st, 2017, one BTC costs $\$10,859.56$ and one ETH costs $\$424.64.$

For fun, here are the current prices of [Bitcoin](https://www.coinbase.com/price/bitcoin) and [Ethereum](https://www.coinbase.com/price/ethereum)!

**You want to predict the price of ETH at some point in time based on the price of BTC.** Below, we load two [tables](https://www.kaggle.com/jessevent/all-crypto-currencies/data) called `btc` and `eth`. Each has 5 columns:
* `date`, the date
* `open`, the value of the currency at the beginning of the day
* `close`, the value of the currency at the end of the day
* `market`, the market cap or total dollar value invested in the currency
* `day`, the number of days since the start of our data

In [None]:
btc = Table.read_table('btc.csv')
btc.show(5)

In [None]:
eth = Table.read_table('eth.csv')
eth.show(5)

### Task 09 📍🔎

<!-- BEGIN QUESTION -->

In the cell below, create an overlaid line plot that visualizes the BTC and ETH open prices as a function of the day. Both BTC and ETH open prices should be plotted on the same graph.

*Hint*: [Section 7.3](https://inferentialthinking.com/chapters/07/3/Overlaid_Graphs.html#overlaid-line-plots) in the textbook might be helpful!

_Points:_ 2

In [None]:
# Create a line plot of btc and eth open prices as a function of time
...

<!-- END QUESTION -->

### Task 10 📍

Now, calculate the correlation coefficient between the opening prices of BTC and ETH using the `correlation` function you defined earlier.

_Points:_ 3

In [None]:
r = ...
r

In [None]:
grader.check("task_10")

### Task 11 📍

Write a function `eth_predictor` which takes an opening BTC price and predicts the opening price of ETH. Again, it will be helpful to use the function `parameter_estimates` that you defined earlier in this homework.

*Hint*: Double-check what the `tbl` input to `parameter_estimates` must look like!

*Note:* Make sure that your `eth_predictor` is using least squares linear regression.

_Points:_ 2

In [None]:
def eth_predictor(btc_price):
    parameters = ...
    slope = ...
    intercept = ...
    ...

In [None]:
grader.check("task_11")

### Task 12 📍🔎

<!-- BEGIN QUESTION -->

Now, using the `eth_predictor` function you just defined, make a scatter plot with BTC prices along the x-axis and both real and predicted ETH prices along the y-axis. The color of the dots for the real ETH prices should be different from the color for the predicted ETH prices.

*Hint 1:* An example of how such a scatter plot is generated can be found [in the Regression Line section of the textbook](https://inferentialthinking.com/chapters/15/2/Regression_Line.html).

*Hint 2:* Think about the table that must be produced and used to generate this scatter plot. What data should the columns represent? Based on the data that you need, how many columns should be present in this table? Also, what should each row represent? Constructing the table will be the main part of this question; once you have this table, generating the scatter plot should be straightforward as usual.

_Points:_ 2

In [None]:
btc_open = ...
eth_pred = ...
eth_pred_actual = ...
...

<!-- END QUESTION -->

### Task 13 📍🔎

<!-- BEGIN QUESTION -->

Considering the shape of the scatter plot of the true data, is the model we used reasonable?
* If so, what features or characteristics make this model reasonable? 
* If not, what features or characteristics make it unreasonable?

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Evaluating NBA Game Predictions

### A Brief Introduction to Sports Betting

In a basketball game, each team scores some number of points.  Conventionally, the team playing at its own arena is called the "home team", and their opponent is called the "away team".  The winner is the team with more points at the end of the game.

We can summarize what happened in a game by the "**outcome**", defined as **the away team's score minus the home team's score**:

$$\text{outcome} = \text{points scored by the away team} - \text{points scored by the home team}$$

If this number is positive, the away team won.  If it's negative, the home team won. 
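
As a small illustration of the sign convention, here is a sketch with made-up scores. The function names `outcome` and `winner` are hypothetical and are not used elsewhere in this notebook.

```python
def outcome(away_points, home_points):
    """Away team's score minus home team's score."""
    return away_points - home_points

def winner(away_points, home_points):
    """Positive outcome: away team won. Negative outcome: home team won."""
    return "away" if outcome(away_points, home_points) > 0 else "home"

# Made-up scores for two games.
print(outcome(102, 95), winner(102, 95))
print(outcome(88, 101), winner(88, 101))
```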

In order to facilitate betting on games, analysts at casinos try to predict the outcome of the game. This prediction of the outcome is called the **spread.**


In [None]:
spreads = Table.read_table("spreads.csv")
spreads

Here's a scatter plot of the outcomes and spreads, with the spreads on the horizontal axis.

In [None]:
spreads.scatter("Spread", "Outcome")

From the scatter plot, you can see that the spread and outcome are almost never 0, aside from one case of the spread being 0. This is because a game of basketball never ends in a tie. One team has to win, so the outcome can never be 0. The spread is almost never 0 because it's chosen to estimate the outcome.

Let's investigate how well the casinos are predicting game outcomes.

One question we can ask is: Is the casino's prediction correct on average? In other words, for every value of the spread, is the average outcome of games assigned that spread equal to the spread? If not, the casino would apparently be making a systematic error in its predictions.

### Task 14 📍

Compute the correlation coefficient between outcomes and spreads.

*Note:* It might be helpful to use the `correlation` function.

_Points:_ 2

In [None]:
spread_r = ...
spread_r

In [None]:
grader.check("task_14")

### Task 15 📍

Among games with a spread between 3.5 and 6.5 (including both 3.5 and 6.5), what was the average outcome?
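
The idea of averaging over an inclusive range can be sketched on a handful of made-up games. This uses plain Python lists of `(spread, outcome)` pairs, not the `spreads` table, and `toy_games` is a hypothetical name. Your answer should do the equivalent filtering on the table itself.

```python
from statistics import mean

# Made-up (spread, outcome) pairs.
toy_games = [(3.0, 10), (3.5, 4), (5.0, -2), (6.5, 7), (7.0, 1)]

# Keep outcomes whose spread is between 3.5 and 6.5, endpoints included.
in_range = [o for s, o in toy_games if 3.5 <= s <= 6.5]
toy_average = mean(in_range)
print(in_range, toy_average)
```

Note that the games with spreads of exactly 3.5 and 6.5 are kept, while 3.0 and 7.0 are excluded.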

_Points:_ 2

In [None]:
spreads_around_5 = ...
spread_5_outcome_average = ...
print("Average outcome for spreads around 5:", spread_5_outcome_average)

In [None]:
grader.check("task_15")

### Task 16 📍

Use the function `parameter_estimates` that you defined earlier to compute the least-squares linear regression line that predicts outcomes from spreads, in original units. 

We have provided a two column table for you in the cell below with the first column representing `Spread` (x) and the second column representing `Outcome` (y), which you should use as an argument to the function.

_Points:_ 4

In [None]:
compute_tbl = spreads.select('Spread', 'Outcome')
estimates = ...
spread_slope = ...
spread_intercept = ...
print("Slope:", round(spread_slope, 3))
print("Intercept", round(spread_intercept, 3))

In [None]:
grader.check("task_16")

### Task 17 📍🔎

<!-- BEGIN QUESTION -->

Suppose that we create another model that simply predicts the average outcome regardless of the value for spread. Does this new model minimize the mean squared error? Why or why not?

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Fitting a Least-Squares Regression Line

Recall that the least-squares regression line is the unique straight line that minimizes root mean squared error (RMSE) among all possible fit lines. Using this property, we can find the equation of the regression line by finding the pair of slope and intercept values that minimize root mean squared error. 
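
To make the minimizing property concrete, here is a minimal sketch on made-up data using only the standard library. It computes the least-squares line with a closed form (sum of cross-deviations over sum of squared $x$ deviations, which is algebraically the same slope as $r \cdot \text{SD}_y / \text{SD}_x$) and checks that nudging the slope or intercept never lowers the RMSE. The name `rmse_of_line` and the toy lists are hypothetical.

```python
from math import sqrt
from statistics import mean

def rmse_of_line(xs, ys, slope, intercept):
    """Root mean squared error of the line's predictions."""
    return sqrt(mean((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)))

# Made-up data.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 1, 4, 3, 5]

# Closed-form least-squares slope and intercept.
mean_x, mean_y = mean(toy_x), mean(toy_y)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(toy_x, toy_y))
sxx = sum((x - mean_x) ** 2 for x in toy_x)
ls_slope = sxy / sxx
ls_intercept = mean_y - ls_slope * mean_x

best = rmse_of_line(toy_x, toy_y, ls_slope, ls_intercept)

# RMSE of nearby lines: each one should be at least as large as `best`.
nearby = [
    rmse_of_line(toy_x, toy_y, ls_slope + ds, ls_intercept + di)
    for ds in (-0.1, 0, 0.1)
    for di in (-0.5, 0, 0.5)
]
print(ls_slope, ls_intercept, best)
```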

### Task 18 📍

Define a function called `errors`.  

* It should take three arguments:
    1. a table `tbl` like `spreads` (with the same column names and meanings, but not necessarily the same data)
    2. the `slope` of a line (a number)
    3. the `intercept` of a line (a number).
* It should **return an array of the errors** made when a line with that slope and intercept is used to predict outcome from spread for each game in the given table.

*Note*: Make sure you are returning an array of the errors, and not the RMSE. 
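
The difference between an array of errors and a single RMSE number can be seen in the sketch below, which uses plain Python lists and made-up values. It assumes the convention that an error is the actual value minus the predicted value. The name `line_errors` and the toy lists are hypothetical.

```python
def line_errors(xs, ys, slope, intercept):
    """One error per point: actual y minus the line's prediction."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Made-up spreads and outcomes.
toy_spreads = [-4, 0, 6]
toy_outcomes = [-1, 3, 2]

toy_errors = line_errors(toy_spreads, toy_outcomes, 0.5, 1)
print(toy_errors)
```

The result has one entry per game, in the same order as the inputs.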

_Points:_ 3

In [None]:
def errors(tbl, slope, intercept):
    ...

In [None]:
grader.check("task_18")

### Task 19 📍🔎

1. Using `errors`, compute the errors for the line with slope `0.5` and intercept `25` on the `spreads` dataset. Name that array `outcome_errors`.  
2. Then, make a scatter plot of the errors.

*Hint:* To make a scatter plot of the errors, plot the error for each outcome in the dataset.  Put the actual spread on the horizontal axis and the outcome error on the vertical axis.

_Points:_ 1

In [None]:
outcome_errors = ...
...

In [None]:
grader.check("task_19")

You should find that the errors are almost all negative.  That means our line is not the best fit to our data.  Let's find a better one.

### Task 20 📍

Define a function called `fit_line`.  It should take a table like `spreads` (with the same column names and meanings) as its argument.  It should return an array containing the slope (as the first element) and intercept (as the second element) of the least-squares regression line predicting outcome from spread for that table.

*Hint*: Define a function `rmse` within `fit_line` that takes a slope and intercept as its arguments. `rmse` will use the table passed into `fit_line` to compute predicted outcomes and then return the root mean squared error between the predicted and actual outcomes. Within `fit_line`, you can call `rmse` the way you would any other function.

If you haven't tried to use the `minimize` function yet, now is a great time to practice. Check out an [example from the textbook using the minimize function](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html#numerical-optimization).
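
The nested-function pattern described in the hint can be illustrated on a simpler, one-parameter problem. In this sketch the inner function `mse` uses `values` from the enclosing call without receiving it as an argument, and a crude grid search stands in for `minimize`. The names `best_constant` and `mse`, and the candidate grid, are hypothetical and chosen only for illustration.

```python
from statistics import mean

def best_constant(values):
    # The inner function can read `values` from the enclosing function,
    # so it only needs the parameter being optimized as its argument.
    def mse(c):
        return mean((v - c) ** 2 for v in values)

    # Crude grid search over 0.0, 0.1, ..., 10.0 (a stand-in for `minimize`).
    candidates = [i / 10 for i in range(0, 101)]
    return min(candidates, key=mse)

# For made-up data, the best constant prediction is the mean, 3.0.
print(best_constant([1, 2, 6]))
```

In `fit_line`, the same structure applies with two parameters (slope and intercept) and the table passed to `fit_line` playing the role of `values`.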

_Points:_ 3

In [None]:
def fit_line(tbl):
    # Your code may need more than 1 line below here.
    def rmse(..., ...):
        return ... 
    return ... 
    
# Here is an example call to your function.  To test your function,
# figure out the right slope and intercept by hand.
example_table = Table().with_columns(
    "Spread", make_array(0, 1),
    "Outcome", make_array(1, 3))
fit_line(example_table)

In [None]:
grader.check("task_20")

### Task 21 📍

Use `fit_line` to fit a line to `spreads`, and assign the output to `best_line`. Assign the first and second elements in `best_line` to `best_line_slope` and `best_line_intercept`, respectively.

Then, set `new_errors` to the array of errors that we get by calling `errors` with our new line. The provided code will graph the corresponding residual plot with a best fit line.

*Hint:* Make sure that the residual plot makes sense. What qualities should the best fit line of a residual plot have?

_Points:_ 4

In [None]:
best_line = ...
best_line_slope = ...
best_line_intercept = ...

new_errors = ...

# This code displays the residual plot, given your values for the best_line_slope and best_line_intercept
Table().with_columns("Spread", 
                    spreads.column("Spread"), 
                    "Outcome errors", 
                    new_errors
                   ).scatter("Spread", "Outcome errors", fit_line=True)

# This just prints your slope and intercept
"Slope: {:g} | Intercept: {:g}".format(best_line_slope, best_line_intercept)

In [None]:
grader.check("task_21")

### Task 22 📍🔎

<!-- BEGIN QUESTION -->

The slope and intercept pair you found in Task 21 should be very similar to the values that you found in Task 16. Why does minimizing RMSE produce the same slope and intercept as the formulas used earlier?

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Submit your Homework to Canvas

Once you have finished working on the homework tasks, prepare to submit your work in Canvas by completing the following steps.

1. In the related Canvas Assignment page, check the rubric to know how you will be scored for this assignment.
2. Double-check that you have run the code cell near the end of the notebook that contains the command `grader.check_all()`. This command will run all of the tests on your responses to the auto-graded tasks marked with 📍.
3. Double-check your responses to the manually graded tasks marked with 📍🔎.
4. Select the menu item "File" and "Save Notebook" in the notebook's Toolbar to save your work and create a specific checkpoint in the notebook's work history.
5. Select the menu items "File", "Download" in the notebook's Toolbar to download the notebook (.ipynb) file. 
6. In the related Canvas Assignment page, click Start Assignment or New Attempt to upload the downloaded .ipynb file.

**Keep in mind that the autograder does not always check for correctness. Sometimes it just checks for the format of your answer, so passing the autograder for a question does not mean you got the answer correct for that question.**

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()