# Homework 5: Linear Regression


**Reading**: 
* [Prediction](https://www.inferentialthinking.com/chapters/15/prediction.html)

Please complete this notebook by filling in the cells provided.

For all problems that require a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure not to reassign variables! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

Directly sharing answers is not okay, but discussing problems with the instructor or with other students is encouraged. Refer to the syllabus page to learn more about how to learn cooperatively.

You should start early so that you have time to get help if you're stuck.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Triple Jump Distances vs. Vertical Jump Heights


Does skill in one sport imply skill in a related sport?  The answer might be different for different activities.  Let us find out whether it's true for the [triple jump](https://en.wikipedia.org/wiki/Triple_jump) (a horizontal jump similar to a long jump) and the vertical jump.  Since we're learning about linear regression, we will look specifically for a *linear* association between skill level in the two sports.

The following data were collected by observing 40 collegiate-level soccer players.  Each athlete's distances in both jump activities were measured in centimeters. Run the cell below to load the data.

In [2]:
# Run this cell to load the data
jumps = Table.read_table('triple_vertical.csv')
jumps

**Question 1**

Create the function `standard_units` so that it converts the values in the array `data` to standard units.
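
As a reminder of the definition (a sketch, assuming the population SD convention used in the assigned reading), a value $x_i$ from an array with mean $\bar{x}$ and standard deviation $\mathrm{SD}_x$ is expressed in standard units as:

$$z_i = \frac{x_i - \bar{x}}{\mathrm{SD}_x}$$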

<!--
BEGIN QUESTION
name: q1_1
manual: false
-->

In [3]:
def standard_units(data):
    ...

**Question 2**

Now, using `standard_units`, define the function `correlation`, which computes the correlation between `x` and `y`.
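
For reference, the correlation coefficient can be written as the average of the products of the paired values in standard units. Here $z_{x,i}$ and $z_{y,i}$ denote the $i$-th values of `x` and `y` after conversion to standard units, and $n$ is the number of pairs (the symbols are notation introduced for this note, not names from the notebook):

$$r = \frac{1}{n} \sum_{i=1}^{n} z_{x,i} \, z_{y,i}$$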

<!--
BEGIN QUESTION
name: q1_2
manual: false
-->

In [6]:
def correlation(x, y):
    ...

#### Question 3
Before running a regression, it's important to see what the data look like, because our eyes are good at picking out unusual patterns in data.  Draw a scatter plot with the triple jump distances on the horizontal axis and the vertical jump heights on the vertical axis **that also shows the regression line**.

See the documentation on `scatter` [here](http://data8.org/datascience/_autosummary/datascience.tables.Table.scatter.html#datascience.tables.Table.scatter) for instructions on how to have Python draw the regression line automatically.

<!--
BEGIN QUESTION
name: q1_3
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [9]:
...

#### Question 4
Does the correlation coefficient `r` look closest to 0, .5, or -.5? Explain. 

<!--
BEGIN QUESTION
name: q1_4
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

#### Question 5
Create a function called `parameter_estimates`. It takes as its argument a table with two columns. The first column holds the x-axis values, and the second column holds the y-axis values. It should compute the correlation between the two columns, then compute the slope and intercept of the regression line that predicts the second column from the first, in original units (centimeters). It should return an array with three elements: the correlation coefficient of the two columns, the slope of the regression line, and the intercept of the regression line. Then, assign `parameters` to the result of calling the `parameter_estimates` function on the `jumps` table in order to see the regression parameters for predicting vertical jump height from triple jump distance.

*Hint:* This is a rare occasion where it’s better to implement the function using column indices instead of column names, in order to be able to call this function on any table. If you need a reminder about how to use column indices to pull out individual columns, you can refer to [the textbook](https://www.inferentialthinking.com/chapters/06/Tables.html#accessing-the-data-in-a-column).

It may also be useful to refer to our in-class demonstrations.
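
The relationship between the correlation and the regression line in original units can be sketched on toy data. This is only an illustration under stated assumptions: the arrays `toy_x` and `toy_y` are invented, `np.corrcoef` stands in for a hand-written correlation function, and `np.polyfit` is used purely as an independent cross-check.

```python
import numpy as np

# Toy data, invented purely for illustration (not the homework's data).
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# NumPy's built-in correlation, used here in place of a hand-written function.
r = np.corrcoef(toy_x, toy_y)[0, 1]

# Regression line in original units:
#   slope     = r * (SD of y) / (SD of x)
#   intercept = (mean of y) - slope * (mean of x)
slope = r * np.std(toy_y) / np.std(toy_x)
intercept = np.mean(toy_y) - slope * np.mean(toy_x)

# Independent cross-check: least-squares fit of a degree-1 polynomial.
check_slope, check_intercept = np.polyfit(toy_x, toy_y, 1)

print('r:', r, '; slope:', slope, '; intercept:', intercept)
```

The two routes should agree up to floating-point rounding, which is one way to sanity-check a regression function on a small example before applying it to a real table.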

<!--
BEGIN QUESTION
name: q1_5
manual: false
-->

In [10]:
def parameter_estimates(t):
    ...
    return make_array(r, slope, intercept)

parameters = ...
print('r:', parameters.item(0), '; slope:', parameters.item(1), '; intercept:', parameters.item(2))

#### Question 6
Let's use `parameters` to predict what certain athletes' vertical jump heights would be given their triple jump distances.

The world record for the triple jump distance is 18.29 *meters*, set by Jonathan Edwards. What is the prediction for Edwards's vertical jump using this line?

**Hint:** Make sure to convert from meters to centimeters!
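
The arithmetic of a prediction with a unit conversion might look like the sketch below. The slope and intercept are hypothetical placeholders chosen only to show the steps; they are not the values fitted from the `jumps` table.

```python
# Hypothetical regression parameters, for illustration only.
example_slope = 0.1        # cm of vertical jump per cm of triple jump
example_intercept = -10.0  # cm

distance_m = 18.29
distance_cm = distance_m * 100  # 1 meter = 100 centimeters

# A regression prediction is slope * x + intercept, with x in the
# same units (centimeters) that the line was fitted on.
predicted_cm = example_slope * distance_cm + example_intercept
print(distance_cm, predicted_cm)
```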

<!--
BEGIN QUESTION
name: q1_7
manual: false
-->

In [22]:
triple_record_vert_est = ...
print("Predicted vertical jump distance: {:f} centimeters".format(triple_record_vert_est))

## 2. Cryptocurrencies


Imagine you're an investor in December 2017. Cryptocurrencies, online currencies backed by secure software, are becoming extremely valuable, and you want in on the action!

The two most valuable cryptocurrencies are Bitcoin (BTC) and Ethereum (ETH). Each one has a dollar price attached to it at any given moment in time. For example, on December 1st, 2017, one BTC costs $\$$10859.56 and one ETH costs $\$$424.64. 

**You want to predict the price of ETH at some point in time based on the price of BTC.** Below, we [load](https://www.kaggle.com/jessevent/all-crypto-currencies/data) two tables called `btc` and `eth`. Each has 5 columns:
* `date`, the date
* `open`, the value of the currency at the beginning of the day
* `close`, the value of the currency at the end of the day
* `market`, the market cap or total dollar value invested in the currency
* `day`, the number of days since the start of our data

In [5]:
btc = Table.read_table('btc.csv')
btc

In [6]:
eth = Table.read_table('eth.csv')
eth

#### Question 1

In the cell below, create a line plot that visualizes the BTC and ETH open prices as a function of time. Both BTC and ETH open prices should be plotted on the same graph.

<!--
BEGIN QUESTION
name: q2_1
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [8]:
# Create a line plot of btc and eth open prices as a function of time
...

#### Question 2

Now, calculate the correlation coefficient between the opening prices of BTC and ETH using the `correlation` function you defined earlier.

<!--
BEGIN QUESTION
name: q2_2
manual: false
-->

In [9]:
r = ...
r

#### Question 3
Regardless of your conclusions above, write a function `eth_predictor` which takes an opening BTC price and predicts the opening price of ETH. Again, it will be helpful to use the function `parameter_estimates` that you defined earlier in this homework.

**Note:** Make sure that your `eth_predictor` is using least squares linear regression.

<!--
BEGIN QUESTION
name: q2_3
manual: false
-->

In [15]:
def eth_predictor(btc_price):
    parameters = ...
    slope = ...
    intercept = ...
    ...

#### Question 4

Now, using the `eth_predictor` you defined in the previous question, make a scatter plot with BTC prices along the x-axis and both real and predicted ETH prices along the y-axis. The color of the dots for the real ETH prices will be different from the color for the predicted ETH prices.

Hints:
* An example of such a scatter plot is generated <a href="https://www.inferentialthinking.com/chapters/15/2/regression-line.html">here</a>.
* Think about the table that must be produced and used to generate this scatter plot. What data should the columns represent? Based on the data that you need, how many columns should be present in this table? Also, what should each row represent? Constructing the table will be the main part of this question; once you have this table, generating the scatter plot should be straightforward as usual.

<!--
BEGIN QUESTION
name: q2_4
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [18]:
btc_open = ...
eth_pred = ...
eth_pred_actual = ...
...

#### Question 5
Considering the shape of the scatter plot of the true data, is the model we used reasonable? If so, what features or characteristics make this model reasonable? If not, what features or characteristics make it unreasonable?

<!--
BEGIN QUESTION
name: q2_5
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## 3. Evaluating NBA Game Predictions


#### A brief introduction to sports betting

In a basketball game, each team scores some number of points.  Conventionally, the team playing at its own arena is called the "home team," and the other team is called the "away team."  The winner is the team with more points.

We can summarize what happened in a game by the "**outcome**", defined as **the away team's score minus the home team's score**:

$$\text{outcome} = \text{points scored by the away team} - \text{points scored by the home team}$$

If this number is positive, the away team won.  If it's negative, the home team won. 
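
The sign convention can be checked with a tiny sketch. The function name `game_outcome` and the scores are made up for illustration.

```python
def game_outcome(away_points, home_points):
    """Outcome as defined above: away team's score minus home team's score."""
    return away_points - home_points

print(game_outcome(102, 95))   # 7: positive, so the away team won
print(game_outcome(88, 101))   # -13: negative, so the home team won
```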

In order to facilitate betting on games, analysts at casinos try to predict the outcome of the game. This prediction of the outcome is called the **spread.**


In [3]:
spreads = Table.read_table("spreads.csv")
spreads

Here's a scatter plot of the outcomes and spreads, with the spreads on the horizontal axis.

In [4]:
spreads.scatter("Spread", "Outcome")

From the scatter plot, you can see that the spread and outcome are almost never 0, aside from 1 case of the spread being 0. This is because a game of basketball never ends in a tie. One team has to win, so the outcome can never be 0. The spread is almost never 0 because it's chosen to estimate the outcome.

Let's investigate how well the casinos are predicting game outcomes.

One question we can ask is: Is the casino's prediction correct on average? In other words, for every value of the spread, is the average outcome of games assigned that spread equal to the spread? If not, the casino would apparently be making a systematic error in its predictions.
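
To make "correct on average" concrete, here is a minimal sketch on invented numbers (not the contents of `spreads.csv`). It groups games by spread and compares each group's average outcome with the spread itself. The names `toy_spreads` and `toy_outcomes` are hypothetical.

```python
import numpy as np

# Toy spreads and outcomes, invented for illustration only.
toy_spreads = np.array([-5.0, -5.0, 3.0, 3.0, 3.0])
toy_outcomes = np.array([-9.0, -1.0, 10.0, -2.0, 1.0])

# For each distinct spread, average the outcomes of the games given that spread.
average_outcome_by_spread = {}
for s in np.unique(toy_spreads):
    games_with_spread = toy_outcomes[toy_spreads == s]
    average_outcome_by_spread[float(s)] = float(np.mean(games_with_spread))

print(average_outcome_by_spread)
```

In this toy example each group's average equals its spread, so the predictions would be correct on average even though no individual game matched its spread exactly.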

#### Question 1
Compute the correlation coefficient between outcomes and spreads. 

**Note:** It might be helpful to use the `correlation` function.

<!--
BEGIN QUESTION
name: q3_1
manual: false
-->

In [5]:
spread_r = ...
spread_r

#### Question 2
Among games with a spread between 3.5 and 6.5 (including both 3.5 and 6.5), what was the average outcome? 

*Hint:* Read the documentation for the predicate `are.between_or_equal_to` [here](http://data8.org/datascience/predicates.html#datascience.predicates.are.between_or_equal_to).

<!--
BEGIN QUESTION
name: q3_2
manual: false
-->

In [8]:
spreads_around_5 = ...
spread_5_outcome_average = ...
print("Average outcome for spreads around 5:", spread_5_outcome_average)

#### Question 3
Compute the slope of the least-squares linear regression line that predicts outcomes from spreads, in original units.

<!--
BEGIN QUESTION
name: q3_3
manual: false
-->

In [11]:
spread_slope = ...
spread_slope

#### Question 4
Suppose that we create another predictor that simply predicts the average outcome regardless of the value for spread. Does this new predictor minimize least squared error? 

<!--
BEGIN QUESTION
name: q3_4
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## 4. Submission


Download this IPython notebook and upload it to your git repository. Instructions for this can be found [here](https://cs.slu.edu/~stylianou/1070/submitting_assignments.html).