# Lab 10: Regression

In [None]:
# Run this cell to set up the notebook, but please don't change it.
import numpy as np
import math
from datascience import *

# These lines set up the plotting functionality and formatting.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

## 1. The Dataset

In this lab, we are exploring movie screenplays. In particular, we have compiled a list of 5,000 words (including stemmed words) that occur in conversations between movie characters. For each movie, our dataset tells us the frequency with which each of these words occurs in certain conversations in its screenplay. All words have been converted to lowercase.

Run the cell below to read the `movies` table.

In [None]:
movies = Table.read_table('movies.csv')
movies.where("Title", "wild wild west").select(0, 1, 2, 3, 4, 14, 49, 1042, 4004)

The above cell prints a few columns of the row for the comedy movie *Wild Wild West*. The movie contains 3446 words. The word "it" appears 74 times, making up $\frac{74}{3446} \approx 0.021364$ of the words in the movie. The word "england" doesn't appear at all.
This numerical representation of a body of text, one that describes only the frequencies of individual words, is called a bag-of-words representation. A lot of information is discarded in this representation: the order of the words, the context of each word, who said what, the cast of characters and actors, etc. However, a bag-of-words representation is often used for machine learning applications as a reasonable starting point, because a great deal of information is also retained and expressed in a convenient and compact format.
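
As a rough illustration of the bag-of-words idea, the sketch below turns a made-up line of dialogue into word proportions. The helper `word_proportions` and the sample text are assumptions for this example only; they are not how `movies.csv` was built.

```python
from collections import Counter

def word_proportions(text):
    """Map each lowercase word in text to its share of all words in text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# A made-up line of dialogue, not taken from the dataset.
toy_dialogue = "It is what it is and it is fine"
proportions = word_proportions(toy_dialogue)

print(proportions["it"])               # 3 of the 9 words are "it"
print(proportions.get("england", 0))   # a word that never appears has proportion 0
```

Word order is lost here: any reshuffling of the same nine words gives the same dictionary.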

All movie titles are unique. The `row_for_title` function provides fast access to the one row for each title. 

*Note: All movies in our dataset have their titles lower-cased.* 

In [None]:
title_index = movies.index_by('Title')
def row_for_title(title):
    """Return the row for a title, similar to the following expression (but faster)
    
    movies.where('Title', title).row(0)
    """
    return title_index.get(title)[0]
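
To see why an index lookup can beat a `where` scan, here is a minimal plain-Python sketch. It assumes the index behaves like a dictionary from each title to the list of rows with that title; the rows and the names `toy_rows`, `toy_index`, `toy_row_for_title`, and `toy_row_by_scan` are invented for illustration.

```python
# Made-up rows of (title, word count); not real values from movies.csv.
toy_rows = [("movie a", 1200), ("movie b", 3400), ("movie c", 560)]

# Build the index once: each title maps to the list of rows with that title.
toy_index = {}
for row in toy_rows:
    toy_index.setdefault(row[0], []).append(row)

def toy_row_for_title(title):
    """Look up the single row for a title in the prebuilt index."""
    return toy_index[title][0]

def toy_row_by_scan(title):
    """Find the same row by checking every row in turn (slower on large tables)."""
    return [row for row in toy_rows if row[0] == title][0]

print(toy_row_for_title("movie b"))
print(toy_row_by_scan("movie b"))
```

Both functions return the same row; the indexed version avoids rechecking every title on each call.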

For example, the fastest way to find the frequency of "none" in the movie *The Terminator* is to access the `'none'` item from its row. Check the original table to see if this worked for you!

In [None]:
row_for_title('the terminator').item('none') 

#### Question 1.1
Set `expected_row_sum` to the number that you __expect__ will result from summing all proportions in each row, excluding the first five columns.

<!--
BEGIN QUESTION
name: q1_1
-->

In [None]:
# Set expected_row_sum to a number that's the (approximate) sum of each row of word proportions.
expected_row_sum = ...

This dataset was extracted from [a dataset from Cornell University](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). After transforming the dataset (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies), we created this new dataset containing the frequency of 5000 common words in each movie.

## 2. Relationships Between Pairs of Words

For this section we will look at the relationship between the proportion of the word `her` in screenplays versus the word `she`. We would like to use linear regression to make predictions about this relationship, but that won't work well if the data aren't roughly linearly related. To check that, we should look at the data. Run the cell below to construct a new table `she_her` that contains only the proportions of the words `she` and `her` in the screenplays.

In [None]:
she_her = movies.select("she", "her")
she_her

**Question 2.1.** Make a scatter plot of the data.  It's conventional to put the column we want to predict on the vertical axis and the other column on the horizontal axis. Let's say we want to use the proportions of `she` to predict the proportions of `her`.

<!--
BEGIN QUESTION
name: q2_1
-->

In [None]:
...

**Question 2.2.** Are the proportions of `she` and `her` in the screenplays roughly linearly related based on the scatter plot above?

<!--
BEGIN QUESTION
name: q2_2
-->

*Write your answer here, replacing this text.*

We're going to continue with the assumption that they are linearly related, so it's reasonable to use linear regression to analyze this data.

We'd next like to plot the data in standard units. If you don't remember the definition of standard units, textbook section [14.2](https://www.inferentialthinking.com/chapters/14/2/Variability.html#standard-units) might help!
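
As a reminder of the definition, here is a small sketch on made-up numbers. The helper name `standard_units` and the array `toy` are assumptions for this example; `np.std` computes the population standard deviation, which is the version the textbook uses.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# Made-up values, not taken from the movies table.
toy = np.array([2.0, 4.0, 6.0, 8.0])
toy_su = standard_units(toy)

print(toy_su)
print(np.mean(toy_su), np.std(toy_su))  # mean is about 0 and SD is about 1
```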

**Question 2.3.** Compute the mean and standard deviation for the proportions of `she` and `her` in the screenplays.  **Then** create a table called `she_her_standard` containing the proportions of `she` and `her` in the screenplays in standard units.  The columns should be named `she (standard units)` and `her (standard units)`.

<!--
BEGIN QUESTION
name: q2_3
-->

In [None]:
she_mean = ...
she_std = ...
her_mean = ...
her_std = ...

she_her_standard = Table().with_columns(
    "she (standard units)", ...,
    "her (standard units)", ...
)

she_her_standard

**Question 2.4.** Plot the data again, but this time in standard units.

<!--
BEGIN QUESTION
name: q2_4
-->

In [None]:
...

You'll notice that this plot looks the same as the last one!  However, the data and axes are scaled differently.  So it's important to read the ticks on the axes.

**Question 2.5.** Which would you guess best describes the correlation between the proportions of `she` and `her` in this dataset?

1. correlation is positive (but not close to zero)
2. correlation is close to zero
3. correlation is negative (but not close to zero)

Assign `correlation` to the number corresponding to your guess.

<!--
BEGIN QUESTION
name: q2_5
-->

In [None]:
correlation = ...

**Question 2.6.** Compute the correlation `r`.  

*Hint:* Use `she_her_standard`.  Section [15.1](https://www.inferentialthinking.com/chapters/15/1/Correlation.html#calculating-r) explains how to do this.


<!--
BEGIN QUESTION
name: q2_6
-->

In [None]:
r = ...
r

## 3. The regression line
Correlation is the **slope of the regression line when the data are put in standard units**.
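
The claim above can be checked on made-up numbers. In this sketch, `correlation` computes r as the mean of the products of standard units, and `np.polyfit` fits a least-squares line to the standardized data. The toy arrays and helper names are assumptions for illustration, not part of the lab's data.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """r is the mean of the products of the two variables in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up data with a roughly linear trend.
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r_toy = correlation(toy_x, toy_y)

# Least-squares line through the standardized points: returns (slope, intercept).
su_slope, su_intercept = np.polyfit(standard_units(toy_x), standard_units(toy_y), 1)

print("r:", r_toy)
print("slope in standard units:", su_slope)
print("intercept in standard units:", su_intercept)
```

The fitted slope matches r up to rounding, and the intercept is essentially zero because both standardized variables have mean 0.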

The next cell plots the regression line in standard units:

$$\text{her proportion in standard units} = r \times \text{she proportion in standard units}.$$

Then, it plots the data in standard units again, for comparison.

In [None]:
def draw_line(slope=0, intercept=0, x=make_array(-1.5, 6), color='r'):
    y = x*slope + intercept
    plots.plot(x, y, color=color)

she_her_standard.scatter('she (standard units)', 'her (standard units)')
draw_line(slope = r)

How would you take a point in standard units and convert it back to original units?  We'd have to "stretch" its horizontal position by `she_std` and its vertical position by `her_std`. That means the same thing would happen to the slope of the line.

Stretching a line horizontally makes it less steep, so we divide the slope by the stretching factor.  Stretching a line vertically makes it more steep, so we multiply the slope by the stretching factor.
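
The stretching argument can be tried on two points of a line through the origin. All numbers below are hypothetical: `x_stretch` and `y_stretch` merely play the roles that `she_std` and `her_std` play in the lab. Shifting by the means is left out because a shift does not change a slope.

```python
import numpy as np

su_slope = 0.5      # hypothetical slope in standard units
x_stretch = 4.0     # hypothetical horizontal stretching factor
y_stretch = 10.0    # hypothetical vertical stretching factor

# Two points on the line y = su_slope * x.
x_su = np.array([0.0, 1.0])
y_su = su_slope * x_su

# Stretch each coordinate by its own factor.
x_orig = x_su * x_stretch
y_orig = y_su * y_stretch

# Rise over run after stretching.
stretched_slope = (y_orig[1] - y_orig[0]) / (x_orig[1] - x_orig[0])
print(stretched_slope)  # equals su_slope * y_stretch / x_stretch
```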

**Question 3.1.** Calculate the slope of the regression line in original units, and assign it to `slope`.

(If the "stretching" explanation is unintuitive, consult section [15.2](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html#the-equation-of-the-regression-line) in the textbook.)

<!--
BEGIN QUESTION
name: q3_1
-->

In [None]:
slope = ...
slope

We know that the regression line passes through the point `(she_mean, her_mean)`.  You might recall from high-school algebra that the equation for the line is therefore:

$$\text{her proportion} - \verb|her_mean| = \texttt{slope} \times (\text{she proportion} - \verb|she_mean|)$$

The rearranged equation becomes:

$$\text{her proportion} = \texttt{slope} \times \text{she proportion} + (- \texttt{slope} \times \verb|she_mean| + \verb|her_mean|)$$
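
The rearrangement can be sanity-checked on made-up data. In the sketch below, `np.polyfit` supplies a least-squares slope, and the intercept is then computed from the two means as in the equation above. The arrays and the `toy_` names are assumptions for illustration.

```python
import numpy as np

# Made-up data, not taken from the movies table.
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.0, 4.5, 5.5, 8.0, 10.0])

# Least-squares line: returns (slope, intercept).
toy_slope, polyfit_intercept = np.polyfit(toy_x, toy_y, 1)

# Intercept from the point of averages: -slope * mean(x) + mean(y).
toy_intercept = -toy_slope * np.mean(toy_x) + np.mean(toy_y)

print(toy_intercept, polyfit_intercept)

# The line passes through (mean of x, mean of y).
print(toy_slope * np.mean(toy_x) + toy_intercept, np.mean(toy_y))
```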


**Question 3.2.** Calculate the intercept in original units and assign it to `intercept`.

<!--
BEGIN QUESTION
name: q3_2
-->

In [None]:
intercept = ...
intercept

## 3. Investigating the regression line
The slope and intercept tell you exactly what the regression line looks like. To predict the proportion of the words in a screenplay that are `her` , multiply the proportion of the words in a screenplay that are `she` by `slope` and then add `intercept`.

**Question 3.1.** Compute the predicted proportion of words that are `her` for a screenplay in which `she` makes up 0.005 of the words, and for a screenplay in which `she` makes up 0.015 of the words.

<!--
BEGIN QUESTION
name: q3_1
-->

In [None]:
her_prop_for_she005 = ...
her_prop_for_she015 = ...

# Here is a helper function to print out your predictions.
# Don't modify the code below.
def print_prediction(she, predicted_her):
    print("For a screenplay with 'she' occurring for", she,
          "proportion of the words, we predict that 'her' would occur in the screenplay", predicted_her,
          "proportion of the words.")

print_prediction(0.005, her_prop_for_she005)
print_prediction(0.015, her_prop_for_she015)

The next cell plots the line that goes between those two points, which is (a segment of) the regression line.

In [None]:
she_her.scatter('she', 'her')
draw_line(
    slope = slope,
    intercept = intercept,
    x = make_array(0.005, 0.015)
)

**Question 3.2.** Make predictions for the proportion of words that are `her` for each screenplay in the `she_her` table.  (Of course, we know exactly what the proportions are! We are doing this so we can see how accurate our predictions are.)  Put these numbers into a column in a new table called `her_predictions` that also includes the columns of the `she_her` table.

<!--
BEGIN QUESTION
name: q3_2
-->

In [None]:
her_predictions = ...
her_predictions

**Question 3.3.** How close were we?  Compute the *residual* for each screenplay in the dataset.  The residual is the actual proportion of the words that are `her` minus the predicted proportion of the words that are `her`.  Add the residuals to `her_predictions` as a new column called `residual` and name the resulting table `her_residuals`.

<!--
BEGIN QUESTION
name: q3_3
-->

In [None]:
her_residuals = ...
her_residuals

Here is a plot of the residuals you computed.  Each point corresponds to one screenplay.  It shows how much our prediction over- or under-estimated the proportion of `her` words in the screenplay.

In [None]:
her_residuals.scatter("she", "residual", color="r")
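
For a sense of what to expect from a residual plot, here is a sketch on made-up data. It fits a least-squares line with `np.polyfit`, computes actual minus predicted values, and checks two standard properties of regression residuals. The arrays and `toy_` names are assumptions for illustration.

```python
import numpy as np

# Made-up data, not taken from the movies table.
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toy_y = np.array([1.5, 3.5, 4.0, 6.5, 7.0, 9.5])

toy_slope, toy_intercept = np.polyfit(toy_x, toy_y, 1)

toy_predicted = toy_slope * toy_x + toy_intercept
toy_residuals = toy_y - toy_predicted   # actual minus predicted

print(toy_residuals)
print(np.mean(toy_residuals))                    # about 0
print(np.corrcoef(toy_x, toy_residuals)[0, 1])   # about 0
```

Residuals from a least-squares line average out to zero and show no linear trend against the predictor, so a visible pattern in a residual plot suggests that a straight line is not capturing the relationship.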

## 4. More Words for Multiple Regression

#### Question 4.1
Choose a new word in the dataset that you think could be used to predict the proportion of the word `her` in a screenplay. The new word should have a correlation of greater than 0.05 or less than -0.05 with `her`. The code to plot the scatter plot and line of best fit is given for you; you just need to calculate the correct values for `r_her_new_word`, `slope_her_new_word`, and `intercept_her_new_word`.

*Hint: It's easier to think of words with a positive correlation, i.e. words that are often mentioned together*.

<!--
BEGIN QUESTION
name: q4_1
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [None]:
new_word = ...

# This array should make your code cleaner!
arr_new_word = movies.column(new_word)

new_word_su = ...

r_her_new_word = ...

slope_her_new_word = ...
intercept_her_new_word = ...

# DON'T CHANGE THESE LINES OF CODE
movies.scatter(new_word, 'her')
max_x = max(movies.column(new_word))
print('Correlation:', r_her_new_word)
plots.plot([0, max_x * 1.3], [intercept_her_new_word, intercept_her_new_word + slope_her_new_word * (max_x*1.3)], color='gold');

In class we saw that the best-fit regression line used to predict y from x minimizes the root mean squared error (rmse).

$$
\text{rmse} ~=~ \sqrt{\text{mean}\left((y - y_{\text{estimate}})^2\right)}
$$
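
The formula can be tried out on made-up data. The function `toy_rmse` below is a hypothetical stand-in with invented arrays; it compares the least-squares line from `np.polyfit` against a few nearby lines.

```python
import numpy as np

# Made-up data, not taken from the movies table.
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.0, 4.5, 5.5, 8.0, 10.0])

def toy_rmse(any_slope, any_intercept):
    """Root mean squared error of the line any_slope * x + any_intercept."""
    estimate = any_slope * toy_x + any_intercept
    return np.sqrt(np.mean((toy_y - estimate) ** 2))

best_slope, best_intercept = np.polyfit(toy_x, toy_y, 1)
best_rmse = toy_rmse(best_slope, best_intercept)

print("least-squares line:", best_rmse)
print("steeper line:      ", toy_rmse(best_slope + 0.1, best_intercept))
print("shifted line:      ", toy_rmse(best_slope, best_intercept + 0.1))
```

Every nearby line has a larger rmse than the least-squares line.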



#### Question 4.2
Define the function below to calculate the root mean squared error for a regression line that uses the proportion of `new_word` as predictor for the proportion of `her`.

In [None]:
def her_new_word_rmse(any_slope, any_intercept):

    ...

    return ...

Run the cells below to see that the `minimize` function will (most likely) output the same values for the slope and intercept that you calculated in Question 4.1.

In [None]:
minimize(her_new_word_rmse)

In [None]:
slope_her_new_word, intercept_her_new_word

This method of using the `minimize` function to find the parameters with the minimum root mean squared error can be extended to nonlinear models as well as to models with multiple predictors.
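
To illustrate the multiple-predictor case without the lab's data, the sketch below builds noise-free toy data from known coefficients and recovers them. It does not use `minimize`; it swaps in NumPy's least-squares solver `np.linalg.lstsq`, which for a linear model finds the coefficients with the smallest rmse. All names and numbers here are assumptions for illustration.

```python
import numpy as np

# Made-up predictors and a response built from known coefficients (no noise).
rng = np.random.default_rng(0)
toy_x1 = rng.uniform(0, 1, 50)
toy_x2 = rng.uniform(0, 1, 50)
toy_y = 2.0 * toy_x1 - 1.0 * toy_x2 + 0.5

def two_predictor_rmse(a, b, c):
    """rmse of the estimate a * x1 + b * x2 + c on the toy data."""
    estimate = a * toy_x1 + b * toy_x2 + c
    return np.sqrt(np.mean((toy_y - estimate) ** 2))

# One column per predictor, plus a column of ones for the constant c.
design = np.column_stack([toy_x1, toy_x2, np.ones_like(toy_x1)])
coefficients, _, _, _ = np.linalg.lstsq(design, toy_y, rcond=None)
a_fit, b_fit, c_fit = coefficients

print(a_fit, b_fit, c_fit)
print(two_predictor_rmse(a_fit, b_fit, c_fit))   # essentially 0 for noise-free data
print(two_predictor_rmse(1.0, 0.0, 0.0))         # a worse guess gives a larger rmse
```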

#### Question 4.3

Use the proportions of your `new_word` and `she` to construct a multiple regression model to predict the proportion of the words in the script that are `her`.


$$
\text{her}_{\text{estimate}} ~=~ a \times \text{new word} + b \times \text{she} + c
$$

for constants $a$, $b$, and $c$.

Define the function below to calculate the root mean squared error for a regression model that uses the proportions of these two words as predictors for the proportion of `her` in a screenplay.

In [None]:
def she_new_her_rmse(a, b, c):

    ...

    return ...

#### Question 4.4

Use the `minimize` function to find the coefficients $a$, $b$, and $c$ that result in the minimal root mean squared error.

#### Question 4.5

For the movie *The Terminator*, what proportion of the words in the script does your model predict will be `her`, using the proportions of your `new_word` and `she` in the screenplay?

You have finished lab 10! We'll use this data again in lab 11 in an attempt to classify the genre of movies.