# SW 282: Lab 9 - Prediction

---

### Professor Erin Kerrison

In [None]:
from datascience import *
import numpy as np
import pyreadstat
from sklearn.metrics import r2_score
from ipywidgets import interact, Dropdown
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [9,6]

Let's begin by importing one of Salkind's datasets from Chapter 16. These data show how a group of participants performed on a timed test. The two variables are the average amount of time in seconds each participant took on each item ("Time") and the number of guesses it took to get each item correct ("Correct").

Please run the cell below, so you can have a look at the ten study participants' scores.

In [None]:
df, _ = pyreadstat.read_sav("ch-16-dataset-2.sav")
data = Table.from_df(df)
data

Recall from a previous lab that we can create a scatter plot of two variables using the function `Table.scatter(x_col, y_col)`.

In [None]:
data.scatter("Time", "Correct")

<div class="alert alert-info">

**QUESTION:** Based on your reading of the visualization above, what can you say about the relationship between these two variables?  What of its strength?  And its apparent direction? What is your hypothesis about the variables' association?

</div>

_**Type your answer here, replacing this text.**_

In the case of single-variable linear regression, we can calculate $R^2$ using the `sklearn` library. However, we must first fit a model line to our data. For this purpose, we will use the function `np.polyfit(x, y, d)`, which fits a polynomial of degree `d` that predicts `y` from the data `x`. Because we are fitting a line, we will use $d=1$. This function returns an array of model coefficients: the first value is the slope of the line, $a$, and the second is the intercept, $b$, such that

$$\Large
y = ax+b
$$
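
To see the coefficient order that `np.polyfit` returns, here is a minimal sketch on made-up points that lie exactly on the line $y = 2x + 1$. The arrays `x` and `y` are illustrative assumptions and are not part of the lab dataset.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1 (not the lab data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Degree 1 fits a line; the slope comes first, then the intercept
slope, intercept = np.polyfit(x, y, 1)

print("slope (a): {:.5f}".format(slope))
print("intercept (b): {:.5f}".format(intercept))
```

Because the points sit exactly on a line, the fitted slope and intercept should recover the values used to build `y`, up to floating-point rounding.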

In [None]:
#Don't let any of these more advanced expressions scare you!! They simply
#need to be run before we can fit a regression line.

model_coeffs = np.polyfit(data.column("Time"), data.column("Correct"), 1)
model_coeffs

Now we can replot the scatterplot with our line of best fit using the `fit_line=True` argument.

In [None]:
data.scatter("Time", "Correct", fit_line=True)

<div class="alert alert-info">

**QUESTION:** Does this line corroborate the scatterplot interpretation you offered above?  What does it confirm or reject? 

</div>

_**Type your answer here, replacing this text.**_

In the cell below, we use our `model_coeffs` array to calculate our `Correct` predictions and then use the `r2_score` function to calculate $R^2$.

In [None]:
correct_pred = model_coeffs[0] * data.column("Time") + model_coeffs[1]
r2_score(data.column("Correct"), correct_pred)
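
For readers curious about what this score measures, the sketch below computes $R^2$ by hand as $1 - SS_{res}/SS_{tot}$ using only `numpy`. The values in `x` and `y` are made up for illustration and are not the lab data. For a straight-line fit with an intercept, this quantity should also match the squared correlation coefficient.

```python
import numpy as np

# Made-up illustrative values (not the lab data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a line and compute its predictions
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_res / ss_tot

# Squared Pearson correlation between x and y, for comparison
r = np.corrcoef(x, y)[0, 1]

print("R^2 by hand: {:.5f}".format(r_squared))
print("r squared:   {:.5f}".format(r ** 2))
```

A value close to 1 means the line accounts for most of the variation in `y`; a value close to 0 means it accounts for very little.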

<div class="alert alert-info">

**QUESTION:** What does the $R^2$ value that you just calculated suggest? Is that value consistent with your understanding of the scatterplot visualization and the best fit line for the data?

</div>

_**Type your answer here, replacing this text.**_

In the cells below, fill in the `time` variable with an $x$ value and then run the cell to see the predicted value of `Correct`.  

I have filled in the first cell for you to see the predicted number of guesses it takes to get each item correct if a test-taker spent an average of 5 seconds on each question.

Please replace the ellipses and run the second and third cells using an X ("Time") input of your choosing.

In [None]:
time = 5
pred = model_coeffs[0] * time + model_coeffs[1]
print("At time {} the prediction for Correct is: {:.5f}".format(time, pred))

In [None]:
time = ...
pred = model_coeffs[0] * time + model_coeffs[1]
print("At time {} the prediction for Correct is: {:.5f}".format(time, pred))

In [None]:
time = ...
pred = model_coeffs[0] * time + model_coeffs[1]
print("At time {} the prediction for Correct is: {:.5f}".format(time, pred))

Now let's look at the difference between predictions and actual values. In the cell below, we write a function `show_pred_and_actual` that will predict the `Correct` value based on a given `Time` value and print both the prediction and actual value. We then create a widget that will allow you to choose different values of `Time` from the data table.

In [None]:
def show_pred_and_actual(time):
    actual = data.where("Time", time).column("Correct")[0]
    pred = model_coeffs[0] * time + model_coeffs[1]
    print("Time: {}".format(time))
    print("Actual value: {}".format(actual))
    print("Predicted value: {:.5f}".format(pred))
    
interact(show_pred_and_actual, time=Dropdown(options=sorted(data.column("Time"))));
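
The difference between an actual value and its prediction is often called a residual. As a rough sketch, the cell below computes every residual at once on made-up values (the arrays `x` and `y` are assumptions for illustration, not the lab data) rather than one dropdown choice at a time.

```python
import numpy as np

# Made-up illustrative values (not the lab data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a line, then subtract each prediction from its actual value
slope, intercept = np.polyfit(x, y, 1)
predictions = slope * x + intercept
residuals = y - predictions

for xi, yi, pi, ri in zip(x, y, predictions, residuals):
    print("x: {}  actual: {}  predicted: {:.5f}  residual: {:.5f}".format(xi, yi, pi, ri))

# For a least-squares line with an intercept, residuals sum to about zero
print("Sum of residuals: {:.8f}".format(residuals.sum()))
```

Some residuals are positive and some are negative, so the line overshoots some points and undershoots others.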

<div class="alert alert-info">

**QUESTION:**  Try out 3 of the time (X variable) input options. What do you notice about how the error (or difference between the predicted and actual Y values) changes as the time input changes? Does this surprise you? Why or why not?  

</div>

_**Type your answer here, replacing this text.**_