# Homework 9: Linear Regression

**Helpful Resource:**

- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Readings:**

* [The Regression Line](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html)
* [Method of Least Squares](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html)
* [Least Squares Regression](https://www.inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html)

**Instructions:**

  - Please complete this notebook by filling in the cells provided. 
  - For problems that ask for a written explanation, you **must** provide your answer in the designated space. 
  - Directly sharing answers is not okay, but discussing problems with your instructor or with other students is encouraged. 
  - You should start early so that you have time to get help if you're stuck.

In [None]:
# Run this cell to set up the notebook, but please don't change it.

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from datetime import datetime

## 1. Triple Jump Distances vs. Vertical Jump Heights 

Does skill in one sport imply skill in a related sport?  The answer might be different for different activities. Let's find out whether it's true for the [triple jump](https://en.wikipedia.org/wiki/Triple_jump) (a horizontal jump similar to a long jump) and the [vertical jump](https://en.wikipedia.org/wiki/Vertical_jump).  Since we're learning about linear regression, we will look specifically for a *linear* association between skill level in the two sports.

The following data was collected by observing 40 collegiate-level soccer players. Each athlete's distances in both events were measured in centimeters. Run the cell below to load the data.

In [None]:
# Run this cell to load the data
jumps = Table.read_table('triple_vertical.csv')
jumps

**Question 1.1.** Create a function `standard_units` that converts the values in the array `data` to standard units. **(5 points)**


In [None]:
def standard_units(data):
    ''' data is an array; returns a new array, data in standard units'''
    ...

In [None]:
# A quick test: you should get this result:
# array([-0.98058068, -0.39223227,  1.37281295])
data = make_array(1, 2, 5)
standard_units(data)
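
The expected values in the quick test above can be reproduced step by step. The sketch below uses plain NumPy rather than the `datascience` helpers, and it assumes that standard units are defined with the population SD (the default behavior of `np.std`). The variable names are illustrative only.

```python
import numpy as np

values = np.array([1, 2, 5])

mean = np.mean(values)        # 8/3
sd = np.std(values)           # population SD (ddof=0), NumPy's default
z = (values - mean) / sd      # deviation from the mean, measured in SDs

print(z)
print(np.mean(z), np.std(z))  # approximately 0 and 1
```

An array in standard units always has mean 0 and SD 1 (up to rounding), which makes a handy sanity check for any implementation.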

**Question 1.2.** Now, using the `standard_units` function, define the function `correlation` which computes the correlation between `x` and `y` (two arrays of numbers, with the same length). **(5 points)**


In [None]:
def correlation(x, y):
    ''' returns the mean product of standard values '''
    ...

In [None]:
# Check 1: A perfect positive correlation of 1
x = make_array(1, 2, 3, 4)
y = make_array(2, 4, 6, 8)
correlation(x, y)  # should be 1.0

In [None]:
# Check 2: 
x = make_array(1, 2, 4, 8)
y = make_array(3, 0, 12, 10)
correlation(x, y)  # should be about 0.7
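
As an independent cross-check of Check 2, NumPy's built-in `np.corrcoef` can be used instead of the mean product of standard units. This is only a verification sketch with plain NumPy arrays; the name `r_check` is an assumption made for illustration.

```python
import numpy as np

x = np.array([1, 2, 4, 8])
y = np.array([3, 0, 12, 10])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r_check = np.corrcoef(x, y)[0, 1]
print(round(r_check, 3))  # about 0.706
```

If a hand-written `correlation` function disagrees with this value, a common culprit is mixing the sample SD (`ddof=1`) with a mean that divides by `n`.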

<!-- BEGIN QUESTION -->

**Question 1.3.** Before running a regression, it's important to see what the data looks like, because our eyes are good at picking out unusual patterns in data.  Use the `jumps` table to draw a scatter plot, **that includes the regression line**, with the triple jump distances on the horizontal axis and the vertical jump heights on the vertical axis. **(5 points)**

See the documentation on `scatter` [here](http://data8.org/datascience/_autosummary/datascience.tables.Table.scatter.html#datascience.tables.Table.scatter) for instructions on how to have Python draw the regression line automatically.

*Hint:* The `fit_line` argument may be useful here!


In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.4.** Does the correlation coefficient $r$ look closest to 0, 0.5, or -0.5? Explain. **(5 points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.5.** Create a function called `parameter_estimates` that takes in the argument `tbl`, a two-column table where the first column is the x-axis and the second column is the y-axis. It should return an array with three elements: the **(1) correlation coefficient r** of the two columns and the **(2) slope** and **(3) intercept** of the regression line that predicts the second column from the first, in original units. **(5 points)**

*Hint:* This is a rare occasion where it’s better to implement the function using column indices instead of column names, in order to be able to call this function on any table. If you need a reminder about how to use column indices to pull out individual columns, please refer to [this](https://www.inferentialthinking.com/chapters/06/Tables.html#accessing-the-data-in-a-column) section of the textbook.
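
As covered in the recommended readings, the regression line in original units has slope $r \cdot \text{SD}_y / \text{SD}_x$ and intercept $\text{mean}_y - \text{slope} \cdot \text{mean}_x$. The sketch below applies those formulas to toy NumPy arrays (the same numbers as Check 2 above, not the `jumps` table) and compares the result with `np.polyfit`, which fits a least-squares line. All variable names here are illustrative assumptions.

```python
import numpy as np

# Toy data for illustration only.
x = np.array([1, 2, 4, 8])
y = np.array([3, 0, 12, 10])

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
fit_slope, fit_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
print(fit_slope, fit_intercept)
```

The two printed pairs should agree up to floating-point rounding, since the regression line is the least-squares line.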


In [None]:
def parameter_estimates(tbl):
    "returns an array of the correlation (r), slope (m), and intercept (b)"
    r = ...
    m = ...
    b = ...
    return make_array(r, m, b)

parameters = parameter_estimates(jumps) 
print('r:', parameters.item(0), '; slope:', parameters.item(1), '; intercept:', parameters.item(2))

**Question 1.6.** Let's use `parameters` (from Question 1.5) to predict what certain athletes' vertical jump heights would be given their triple jump distances. **(5 points)**

The world record for the triple jump distance is 18.29 *meters* by Jonathan Edwards. What is the prediction for Edwards' vertical jump using this line?

*Hint:* Make sure to convert from meters to centimeters!


In [None]:
# Hint: prediction = m * x + b
...
triple_record_vert_est = ...
print("Predicted vertical jump distance: {:f} centimeters".format(triple_record_vert_est))

<!-- BEGIN QUESTION -->

**Question 1.7.** Do you think it makes sense to use this line to predict Edwards' vertical jump? **(5 points)**

*Hint:* Compare Edwards' triple jump distance to the triple jump distances in `jumps`. Is it relatively similar to the rest of the data (shown in Question 1.3)? 
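
One way to make the comparison in the hint concrete is to check whether a new x-value falls inside the range of x-values the line was fitted on. The helper `is_outside_observed_range` and the sample distances below are hypothetical; they are made up for illustration and are not taken from `jumps`.

```python
import numpy as np

def is_outside_observed_range(x_new, x_observed):
    """Return True when x_new lies outside [min, max] of the observed values."""
    return bool(x_new < np.min(x_observed) or x_new > np.max(x_observed))

# Made-up triple jump distances in centimeters (illustration only).
observed_cm = np.array([480.0, 530.0, 575.0, 610.0, 660.0])

print(is_outside_observed_range(600.0, observed_cm))  # False: inside the range
print(is_outside_observed_range(900.0, observed_cm))  # True: beyond the maximum
```

A prediction at an x-value outside the observed range is an extrapolation: the data give no direct evidence that the linear trend continues there.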


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 2. Cryptocurrencies

Imagine you're an investor in December 2017. Cryptocurrencies, online currencies backed by secure software, are becoming extremely valuable, and you want in on the action!

The two most valuable cryptocurrencies are Bitcoin (BTC) and Ethereum (ETH). Each one has a dollar price attached to it at any given moment in time. For example, on December 1st, 2017, one BTC costs $\$10,859.56$ and one ETH costs $\$424.64.$

For fun, here are the current prices of [Bitcoin](https://www.coinbase.com/price/bitcoin) and [Ethereum](https://www.coinbase.com/price/ethereum)!

**You want to predict the price of ETH at some point in time based on the price of BTC.** Below, we load two [tables](https://www.kaggle.com/jessevent/all-crypto-currencies/data) called `btc` and `eth`. Each has 5 columns:
* `date`, the date
* `open`, the value of the currency at the beginning of the day
* `close`, the value of the currency at the end of the day
* `market`, the market cap or total dollar value invested in the currency
* `day`, the number of days since the start of our data

In [None]:
btc = Table.read_table('btc.csv')
btc.show(5)

In [None]:
eth = Table.read_table('eth.csv')
eth.show(5)

<!-- BEGIN QUESTION -->

**Question 2.1.** In the cell below, create an overlaid line plot that visualizes the BTC and ETH open prices as a function of the day. Both BTC and ETH open prices should be plotted on the same graph. **(5 points)**

*Hint*: [Section 7.3](https://inferentialthinking.com/chapters/07/3/Overlaid_Graphs.html#overlaid-line-plots) in the textbook might be helpful!


In [None]:
# Create a line plot of btc and eth open prices as a function of time
...

<!-- END QUESTION -->

**Question 2.2.** Now, calculate the correlation coefficient between the opening prices of BTC and ETH using the `correlation` function you defined earlier. **(5 points)**


In [None]:
r = ...
r

If you did this correctly, you will see a correlation greater than 0.9.

**Question 2.3.** Write a function `eth_predictor` which takes an opening BTC price and predicts the opening price of ETH. Again, it will be helpful to use the function `parameter_estimates` that you defined earlier in this homework. **(5 points)**

*Hint*: Double-check what the `tbl` input to `parameter_estimates` must look like!

*Note:* Make sure that your `eth_predictor` is using least squares linear regression.
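
To illustrate what "least squares" means in the note above, the sketch below fits a line to made-up data and shows that nudging the slope or intercept away from the fitted values does not lower the root mean squared error (RMSE). The `rmse` helper and the data are assumptions for illustration; they are unrelated to the `btc` and `eth` tables.

```python
import numpy as np

def rmse(slope, intercept, x, y):
    """Root mean squared error of the line y = slope * x + intercept."""
    predictions = slope * x + intercept
    return np.sqrt(np.mean((y - predictions) ** 2))

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 4.0, 8.0])
y = np.array([3.0, 0.0, 12.0, 10.0])

best_slope, best_intercept = np.polyfit(x, y, 1)  # least-squares line
best_error = rmse(best_slope, best_intercept, x, y)

# Each perturbed line should have an RMSE at least as large as the fitted line's.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]
for d_slope, d_intercept in nudges:
    other = rmse(best_slope + d_slope, best_intercept + d_intercept, x, y)
    print(other >= best_error)
```

A predictor built from the regression slope and intercept is therefore already the least-squares linear predictor; no separate optimization step is needed.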


In [None]:
def eth_predictor(btc_price):
    ''' Takes a Bitcoin price and returns a predicted Ethereum price'''
    parameters = ...
    m = ...
    b = ...
    return ...

In [None]:
# Quick test
eth_predictor(1000)  # about 52.5, right?

<!-- BEGIN QUESTION -->

**Question 2.4.** Now, using the `eth_predictor` function you just defined, make a scatter plot with BTC prices along the x-axis and both real and predicted ETH prices along the y-axis. The color of the dots for the real ETH prices should be different from the color for the predicted ETH prices. **(5 points)**

*Hint 1:* An example of how such a scatter plot is generated can be found [here](https://inferentialthinking.com/chapters/15/2/Regression_Line.html).

*Hint 2:* Think about the table that must be produced and used to generate this scatter plot. What data should the columns represent? Based on the data that you need, how many columns should be present in this table? Also, what should each row represent? Constructing the table will be the main part of this question; once you have this table, generating the scatter plot should be straightforward as usual.


In [None]:
btc_open = ...
eth_pred = ...
eth_actual = ...
new_tbl = ...
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.5.** Considering the shape of the scatter plot of the true data, is the model we used reasonable? If so, what features or characteristics make this model reasonable? If not, what features or characteristics make it unreasonable? **(5 points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Great! You're done with Homework 9! ###

**Important submission steps:** 
  - Make sure you have run all the cells from top to bottom.
  - Save your notebook.
  - Export as HTML.
  - Upload to Moodle for submission.