# Explaining Variation (COMPLETE)

## Chapter 4.1-4.2 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# pull in data from a google drive csv file
horror_movies <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vT165lczx17dL_bOkgzPtkOh17faIoa7IfFJDVwVN2CpifXRyZ7Y0j2sQaqm7yo_Garzk_RIbJMJm-m/pub?gid=988069214&single=true&output=csv")


<div class="teacher-note">
    <b>Goals of this section:</b> In this chapter, students are introduced to the concept of explaining variation in one variable with variation in another variable. At this point, students are given an informal definition of "explain variation": knowing an observation’s value on an explanatory variable helps you make a better guess as to its value on an outcome variable. They will learn to apply this concept as they form hypotheses and as they make and interpret visualizations of bivariate relationships.

- Students will learn the importance of distinguishing outcome variables from explanatory variables, and will use word equations of the form Outcome = Explanatory + Other Stuff to represent hypotheses about bivariate relationships.
- Students will see word equations as informal models of relationships, and will recognize the importance of “other stuff” at the end to represent the idea that there are many unmeasured causes of variation in the outcome variable (a concept we will build on later in the book when students are introduced to the concept of error).
- Students will learn to use R to explore a hypothesized relationship between two variables, starting with scatter plots, when both explanatory and outcome variables are quantitative, and they will begin to develop a visual sense of when an explanatory variable does or does not explain much variation in an outcome.
- Students will notice the correspondence between the structure of word equations (outcome = explanatory) and the R code used in this book (gf_point(outcome ~ explanatory)). They will also learn that the outcome variable is usually mapped to the y-axis in a plot, the explanatory to the x-axis.

    
A <a href="https://docs.google.com/document/d/1PpI8hcXhjXy6WX4_5jJs4YZq7oQO51Q9uhcMXcq8ty8/edit?tab=t.5y2a0ykmi2fk#heading=h.wjaasjj3pg90" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
</div>

## 1 Predicting Movie Revenue

Imagine you're a movie analyst trying to predict how much money a new horror movie will make at the box office. You don't know the name of the movie, but you do have some data from [The Movie Database](https://www.themoviedb.org/) on previous movie releases and how much money they have made. 

It includes variables such as:
- `vote_count` The number of votes this movie has gotten on [The Movie Database](https://www.themoviedb.org/)
- `avg_rating` The average rating (0-10) from users on [The Movie Database](https://www.themoviedb.org/)

### 1.1 Take a look at the head of a data set called `horror_movies` (it's also available in your guided notes). What do you think each of the other variables means?
    
Each row is a horror movie. The names have been redacted.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/4.1-4.2-head-horror-movies-missing-revenue.jpg" width="100%">

### 1.2 Definitions of all the variables

The data set includes the following:

- `title` The title of the movie (but we have redacted it for now; you'll see those values later)
- `month` The month the movie was released (1-12)
- `vote_count` The number of votes this movie has gotten on [The Movie Database](https://www.themoviedb.org/)
- `avg_rating` The average rating (0-10) from users on [The Movie Database](https://www.themoviedb.org/)
- `runtime_min` The length of the movie (in minutes)
- `budget_mil` The movie's budget (in millions of dollars)
- `revenue_mil` How much revenue the movie took in at the box office, in millions of dollars (we have removed those values for now; you'll see them later)


### 1.3 Definition of outcome and explanatory variables

Using past data to make predictions—like predicting a movie’s revenue—is something statisticians and data analysts do all the time. To do this well, it's important to understand the two roles that variables can play.

The **outcome variable** is what you're trying to predict. There's usually just one. 

**Explanatory variables** are the other things you know about each observation that might help you make a better prediction of the outcome.


<div class="discussion-question">

### 1.4 *Discussion Questions:* Which variable in this data would be a good outcome variable in this situation? Which would be the explanatory variables? 
How can you tell? If you had to choose only one explanatory variable, which would you choose and why? Which variable do you think would be the least useful explanatory variable?

</div>


<div class="teacher-note"> 

<b>Teacher Note:</b> Students will likely identify <code>revenue_mil</code> as the outcome variable—it’s what we’re trying to predict. 

Encourage them to talk about potential explanatory variables in the context of a hypothesis: What might help us make a better prediction of revenue, and why? For example: “A movie with more votes probably had more viewers,” or “Higher-rated or higher quality movies might earn more money.” 

</div>

### 1.5 Description of word equations 

Your hypothesis about how an explanatory variable might help to predict an observation's value on an outcome variable can be expressed as a word equation. You can think of word equations as *informal models* of the world. Word equations take the form:

- **Outcome Variable = Explanatory Variable + Other Stuff**

For example, one hypothesis might be that the budget of a movie might be related to revenue. We could express this idea as:

- **revenue_mil = budget_mil + Other Stuff**

<div class="discussion-question">

### 1.6 *Discussion Question:* Why do we include "Other Stuff" in a word equation? 

</div>

<div class="teacher-note">
    <b>Teacher Note:</b> 

- No model is perfect--this is an important idea in statistical modeling--so we include “Other Stuff” to represent all the factors that affect the outcome but aren’t captured by the explanatory variable.   
- A big budget might contribute toward a movie’s big revenue, but it won’t tell the whole story. Some expensive movies flop, while small-budget films can become blockbusters. “Other Stuff” covers everything else that might influence revenue, such as the cast, the story, timing, and viral marketing.
</div>

<div class="guided-notes">    

### 1.7 Write a word equation to represent your hypothesis for explaining movie revenue

Use as the explanatory variable the variable you think will be most helpful in predicting a movie’s revenue. 
    
</div>

<div class="guided-notes">    

### 1.8 Use your hypothesis to match movies to revenue. Here are the revenue values (in millions) for the six movies: 17, 18, 24, 42, 71, 117

Based on your hypothesis and the data in the table, fill in the empty revenue cells with your best guesses. Which revenue value do you think goes with which movie?

</div>
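
One way to turn a hypothesis into guesses is to line the movies up by the explanatory variable and hand out the revenue values in the same order. The sketch below does this in R. The revenue values come from the exercise; the `budget_mil` values are made up for illustration and are not the values in the table.

```r
# Revenue values (in millions) given in the exercise
revenues <- c(17, 18, 24, 42, 71, 117)

# HYPOTHETICAL budgets (in millions) for six movies, made up for illustration
budget_mil <- c(20, 5, 60, 11, 35, 9)

# Hypothesis: revenue_mil = budget_mil + Other Stuff
# Strategy: the smallest budget gets the smallest revenue, and so on up
guess_revenue <- numeric(length(budget_mil))
guess_revenue[order(budget_mil)] <- sort(revenues)

# Show each movie's budget next to its guessed revenue
data.frame(budget_mil, guess_revenue)
```

Swapping in a different explanatory variable only changes which column is passed to `order()`. The guesses will rarely be exactly right, which is what "Other Stuff" stands for in the word equation.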

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/4.1-4.2-head-horror-movies-missing-revenue.jpg" width="100%">

<div class="guided-notes">    

### 1.9 Write down the definition of "explain variation".
    
> <span style="font-size: 20px;">If we know a case’s value on one variable, we can make a better prediction of its value on another variable.</span>
    
</div>

<div class="guided-notes">    

### 1.10 When you filled in the missing values, how did knowing a movie's value on the explanatory variable help you make a better guess as to its `revenue_mil`?

</div>

<div class="teacher-note">
<b>Sample Responses:</b>
    
- I used avg_rating as the explanatory variable, so I matched the highest revenue_mil values with the movies that had the highest average ratings.
- I picked budget_mil as my explanatory variable. I figured bigger-budget movies usually make more money, so I matched higher revenues to those.

Encourage students to reference their word equation and make clear connections between the variable they chose and the revenue predictions they made.

</div>

## 2 Explore the `horror_movies` Data Frame

Now, let’s step back and explore the full `horror_movies` data set. The data frame contains information from [The Movie Database](https://www.themoviedb.org/) on 50 of the highest-earning horror movies from the past 100 years.

### 2.1 Write code to inspect the `horror_movies` data frame and see what’s inside. 
Do you see anything interesting, surprising, or confusing about the data?

In [None]:
# write code

# sample code
head(horror_movies)
glimpse(horror_movies)

<div class="guided-notes">   
    
### 2.2 Write code to create a scatter plot to explore this hypothesis: revenue_mil = vote_count + other stuff

A scatter plot can help us visualize the relationship between an explanatory variable and an outcome variable.

The general format for creating a scatter plot in R is: `gf_point(outcome ~ explanatory, data = dataframe)`
    
</div>
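
The formula inside `gf_point()` mirrors the word equation: the outcome goes to the left of `~` and the explanatory variable to the right. A small base R sketch (it does not need the `horror_movies` data) shows how R reads such a formula. The object names `hypothesis`, `outcome`, and `explanatory` are chosen here for illustration.

```r
# The word equation  revenue_mil = vote_count + Other Stuff
# becomes the R formula below ("Other Stuff" is not written in the formula)
hypothesis <- revenue_mil ~ vote_count

# all.vars() lists the variable names in a formula, outcome first
all.vars(hypothesis)

# The left side of ~ is the outcome; the right side is the explanatory variable
outcome     <- as.character(hypothesis[[2]])
explanatory <- as.character(hypothesis[[3]])
```

Because the outcome sits on the left of `~`, it is the variable that ends up on the y-axis of the scatter plot.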

In [None]:
# edit this
# gf_point(outcome ~ explanatory, data = dataframe)
# try adding color="blue" as an argument

gf_point(revenue_mil ~ vote_count, color="blue", data=horror_movies)

Optional: if you'd like to see a list of all the colors available in R, here is a well-known R color cheatsheet (you can also find it by searching for "Rcolor"): http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

<div class="discussion-question">

### 2.3 Key Discussion Question: Using the definition above, do you think this explanatory variable (vote_count) explains some of the variation in `revenue_mil`?
If yes, where in the graph do you see evidence of this? If no, why not?

</div>

<div class="teacher-note">

<b>Teacher Note:</b>  

Students generally recognize a positive relationship ("positive correlation"), but we want to link it to the idea of "explain variation."

<b>Sample Responses:</b>  
- As vote_count increases, revenue_mil also tends to increase. That means if we matched up higher revenues with higher vote counts, that could lead to pretty good predictions.
- (Uncommon, but worth pointing out the negative space in the scatter plot.) There are basically no movies with low vote counts and revenues above 300 million. Looking at the places with no data can also be informative.

</div>


<div class="guided-notes">   

### 2.4 Write code to create a scatter plot to explore this hypothesis: revenue_mil = avg_rating + other stuff
    
</div>

In [None]:
# write code

# sample code
gf_point(revenue_mil ~ avg_rating, color="forestgreen", data=horror_movies)

<div class="guided-notes">
    
### 2.5 Comparing vote_count and avg_rating: What features of the scatter plots tell you that vote_count explains variation in revenue_mil better than avg_rating?
    
</div>

<div class="teacher-note">

<b>Sample Responses:</b>  
- The vote_count scatter plot has a clear pattern. The avg_rating plot just looks random. 
- In the vote_count plot, as vote_count increases, revenue_mil increases. That’s not true for avg_rating.  
    
</div>


<div class="guided-notes">
    
### 2.6 What is the difference between a strong and weak explanatory variable when looking at a scatter plot?
    
</div>

<div class="teacher-note">
<b>Teacher Note:</b>  
    
This question helps students generalize what they observed in 2.5.

<b>Sample Responses:</b>  
- A strong variable creates a pattern where knowing X helps you guess Y. A weak one doesn’t help you guess at all.  
- In a strong relationship, the points kind of line up. In a weak one, they’re all over the place.
- If the scatter plot looks like a random cloud, the variable probably doesn’t explain variation in the outcome.

</div>
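
The contrast between a strong and a weak explanatory variable can also be checked with numbers. The sketch below uses made-up data (not the `horror_movies` values): `y_strong` rises with `x`, while `y_weak` holds the same ten values in scrambled order. It compares the average outcome for cases with low `x` against cases with high `x`.

```r
# Made-up data for illustration only
x        <- 1:10
y_strong <- c(12, 19, 33, 38, 52, 58, 73, 77, 92, 99)  # rises with x
y_weak   <- c(58, 12, 92, 33, 77, 19, 99, 38, 52, 73)  # same values, scrambled

# Split the cases into a low-x half and a high-x half
low_x <- x <= 5

# Gap between the average outcome in the high-x half and the low-x half
gap_strong <- mean(y_strong[!low_x]) - mean(y_strong[low_x])
gap_weak   <- mean(y_weak[!low_x]) - mean(y_weak[low_x])

gap_strong  # large gap: knowing x changes our best guess a lot
gap_weak    # small gap: knowing x barely changes our best guess
```

Both outcomes have exactly the same overall spread, so the difference lies only in how much knowing `x` helps. A scatter plot of each outcome against `x` would show the same contrast visually: points that roughly line up versus a cloud.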


## 3 Practice What You Learned

### 3.1 Explore some other explanatory variables
Try some other explanatory variables. For each, consider what the word equation might be; write the R code to graph it; then apply the informal definition of "explain variation." Which variables appear to explain variation in `revenue_mil`? Which do not?

In [None]:
# code here


Which variables appear to explain variation in revenue_mil? Which do not?

