# 7B: Revisiting the Respect Study

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

linktocsv <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vShHDu7P5XnUWo_xBtB67I00R-TTCyB73GyILmjlMq3LGM-kHFg-rIBQwB4upLOU7mTXG6fg8QxTLhB/pub?gid=191363782&single=true&output=csv"
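# Read the published CSV; the first row holds the column names
respect_study <- read.csv(linktocsv, header = TRUE)
# Vegemite eaten = second-spoon measurement before minus the measurement after
respect_study$veg_eaten <- respect_study$spoon2_before - respect_study$spoon2_after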

## 1.0 - Explore Variation

Remember when we explored the respect hypothesis, in which students were asked to eat some bitter-tasting medicine (which was actually Vegemite)? The researchers thought that giving more respectful instructions would make students more willing to eat the Vegemite.

1.1 - Write this hypothesis as a word equation.

The data frame is called `respect_study` and has already been loaded into this Jupyter notebook.

As a reminder, the data frame has many variables, but these two are the most relevant:

- `respect_condition`: whether the participant watched the "Respect" or "No Respect" video

- `veg_eaten`: how much Vegemite on the 2nd spoon was eaten (grams)

1.2 - Take a look at the data frame using R. What are the cases in this data frame? How many are there?

1.3 - Let’s take a look at the distribution of Vegemite eaten by participants in each condition.
Run the code below. Find the tallest bar. What does it mean?

In [None]:
gf_histogram(~veg_eaten, data = respect_study, fill = "lightgreen", color = "midnightblue") %>%
    gf_facet_grid(respect_condition ~ .)

1.4 - Let's assume this was a well-conducted experiment. In that case, there are two possible reasons why these two groups differ in how much Vegemite they ate. What are those two reasons?

## 2.0 - Model Variation

2.1 - How would we write the DGP of randomness using a word equation? How about in GLM notation? 

2.2 - If we used the empty model to predict how much Vegemite students ate, what value would it predict for each student?

2.3 - But enough with the empty model. Let’s create a more complex model, one that might be able to explain something!

Find the best fitting model that goes with our hypothesis, **VEG_EATEN = RESPECT_CONDITION + OTHER STUFF** and put the numbers into GLM notation (by modifying this equation):

$$Y_i = b_0 + b_1 X_i + e_i$$
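
As a minimal sketch of where such numbers come from, the block below fits a two-group model with `lm()` on a tiny made-up data set. The `toy` data frame, its column names, and its values are invented for illustration and are not the respect study data. Assuming R's default coding for a two-group variable, $b_0$ should match the mean of the first group and $b_1$ the difference between the two group means.

```r
# Made-up numbers for illustration only; NOT the respect study data
toy <- data.frame(
    condition = c("No Respect", "No Respect", "No Respect", "Respect", "Respect", "Respect"),
    eaten = c(0.1, 0.2, 0.3, 0.2, 0.4, 0.6)
)

# Fit the two-group model and pull out its two estimates
toy_model <- lm(eaten ~ condition, data = toy)
b0 <- unname(coef(toy_model)[1])
b1 <- unname(coef(toy_model)[2])

# Compare the estimates to the two group means
group_means <- tapply(toy$eaten, toy$condition, mean)
print(group_means)
print(c(b0 = b0, b1 = b1))
```

The same pattern should carry over to the real data by swapping in `respect_study`, `veg_eaten`, and `respect_condition`.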

2.4 - Interpret the respect model’s estimates ($b_0$ and $b_1$). Where are those numbers in the faceted histogram below?

In [None]:
veg_eaten_stats <- favstats(veg_eaten ~ respect_condition, data = respect_study)

gf_histogram(~veg_eaten, data = respect_study, fill = "lightgreen", color = "midnightblue") %>%
    gf_facet_grid(respect_condition ~ .) %>% 
    gf_vline(xintercept = ~mean, data = veg_eaten_stats, color = "orange3")

2.5 - Also, what do the $Y_i$ and $X_i$ in the GLM notation stand for? Where are those in the plot?

## 3.0 - Using the Respect Model for Making Predictions

3.1 - Which does the respect model predict: 

a. whether or not someone is in the respect group? 

b. how much Vegemite someone would eat? 


3.2 - Why do we call it the “respect model”?

3.3 - What would the $X_i$ value be for someone in the **no respect** group? What would the respect model predict for someone in the no respect group?

3.4 - What would $X_i$ be for someone in the **respect** group? What would the respect model predict for someone in the respect group?


3.5 - Save the predictions of the respect model into the data frame. 
Let's call these `respect_pred`.
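
Here is one possible way to save a model's predictions, sketched on made-up data rather than the respect study. The `toy` data frame and the names `toy_model` and `toy_pred` are hypothetical; `lm()` and `predict()` are base R functions.

```r
# Made-up numbers for illustration only; NOT the respect study data
toy <- data.frame(
    condition = c("No Respect", "No Respect", "No Respect", "Respect", "Respect", "Respect"),
    eaten = c(0.1, 0.2, 0.3, 0.2, 0.4, 0.6)
)

# Fit the two-group model, then store its prediction for every row
toy_model <- lm(eaten ~ condition, data = toy)
toy$toy_pred <- predict(toy_model)

# Rows in the same group all get the same prediction
print(toy)
```

With the notebook's data, the analogous step would store the predictions in `respect_study$respect_pred`.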

Modify the code below to also include the predictions from the respect model.

In [None]:
head(select(respect_study, respect_condition, veg_eaten), 10)

3.6 - Save the residuals from the respect model into the data frame. Let's call these `respect_resid`.
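
A residual is what is left over after the model's prediction is subtracted from the observed value. The sketch below shows one way to save residuals with base R's `resid()`, again on made-up data (the `toy` data frame and the `toy_*` names are hypothetical), and then checks that prediction plus residual rebuilds each observed value.

```r
# Made-up numbers for illustration only; NOT the respect study data
toy <- data.frame(
    condition = c("No Respect", "No Respect", "No Respect", "Respect", "Respect", "Respect"),
    eaten = c(0.1, 0.2, 0.3, 0.2, 0.4, 0.6)
)

toy_model <- lm(eaten ~ condition, data = toy)
toy$toy_pred <- unname(predict(toy_model))   # what the model predicts for each row
toy$toy_resid <- unname(resid(toy_model))    # observed value minus prediction

# DATA = MODEL + ERROR, checked row by row
rebuilt <- toy$toy_pred + toy$toy_resid
print(all.equal(toy$eaten, rebuilt))
```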

Modify the code below to also include the residuals from the respect model.

In [None]:
head(select(respect_study, respect_condition, veg_eaten, respect_pred), 10)

## 4.0 - Residuals: How “Off” are these Predictions?

4.1 - Run the code below.

(Everybody gestures) Where are the predictions in the histograms below?
Where are the residuals in the histograms below?

In [None]:
veg_eaten_stats <- favstats(veg_eaten ~ respect_condition, data = respect_study)

gf_histogram(~veg_eaten, data = respect_study, fill = "lightgreen", color = "midnightblue") %>%
    gf_facet_grid(respect_condition ~ .) %>% 
    gf_vline(xintercept = ~mean, data = veg_eaten_stats, color = "orange3")

4.2 - Which of these shows us an example of some residuals, the one with the blue lines or the one with the red lines?

|                                                                           |                                                                           | 
|---------------------------------------------------------------------------|---------------------------------------------------------------------------|
|<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_NYd4B5J1-7-B-Two-Histograms-A.png" title="histograms with blue lines" width="100%"/>|<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_44ypmQLp-7-B-Two-Histograms-B.png" title="histograms with red lines" width="100%"/>|


4.3 - What do those residuals represent: the variation that has been explained or left unexplained by the respect model? How does GLM represent these residuals?

4.4 - The best fitting GLM was:

$$veg\_eaten_i = 0.14 + 0.09respect\_condition_i + e_i$$

Does the $Y_i$ ($veg\_eaten_i$) in the data below actually equal the model prediction plus the residual?

In [None]:
head(select(respect_study, respect_condition, veg_eaten, respect_pred, respect_resid), 10)

4.5 - Run the code below. 

(Everybody gestures) Where are the predictions of the empty model in the visualization below?
Where are the residuals from the empty model?

What do those residuals represent: the variation that has been explained or left unexplained by the empty model? 

In [None]:
# Run this code
empty_stats <- favstats(~veg_eaten, data = respect_study)

gf_histogram(~veg_eaten, data = respect_study, fill = "lightgreen", color = "midnightblue") %>%
    gf_facet_grid(respect_condition ~ .) %>% 
    gf_vline(xintercept = ~mean, data = empty_stats, color = "blue")

4.6 - Try adding the predictions of the respect model to the histogram above (so both models will be represented in the visualization).

4.7 - Which part of the visualization depicts:

- the residuals from the empty model?
- the residuals from the respect model?
- the part of the error that has been explained? Which model did the explaining? 

## 5.0 - How Helpful is the Respect Model?

5.1 - Maybe we can just add up the residuals from the empty model and see whether that total is bigger than the total of the residuals from the respect model.

What is the problem with trying to add up these residuals (any of these residuals)? How does the sum of squares solve this problem?
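
To try this idea out, here is a small sketch on made-up numbers (the `toy_eaten` values are invented for illustration and are not the respect study data). It adds up the residuals from an empty model, then adds up the squared residuals instead.

```r
# Made-up numbers for illustration only; NOT the respect study data
toy_eaten <- c(0.1, 0.2, 0.3, 0.2, 0.4, 0.6)

# Residuals from the empty model: each value minus the overall mean
empty_resid <- toy_eaten - mean(toy_eaten)

# Adding up raw residuals: negatives and positives are combined
sum_of_resid <- sum(empty_resid)

# Adding up squared residuals: every term is zero or positive
sum_of_squares <- sum(empty_resid^2)

print(c(sum_of_resid = sum_of_resid, sum_of_squares = sum_of_squares))
```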

5.2 - Now run `supernova()` on the respect model. Find the three sums of squares.
 
- What unit are those numbers in?
- Which tells us SS (the leftover error) from the empty model?
- Which tells us SS from the respect model?
- Which tells us how much of the error has been explained?

5.3 - What is the relationship between SS Model, SS Error, and SS Total?

Take a look at the word equation for the respect hypothesis. Any similarities?

5.4 - What is the relationship between PRE and the different sums of squares?
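
As a rough check on these relationships, the sketch below computes the three sums of squares by hand on made-up data instead of reading them from `supernova()` output. The `toy` data frame and the `ss_*` names are hypothetical, and this by-hand arithmetic is only an illustration, not a description of how `supernova()` does its work.

```r
# Made-up numbers for illustration only; NOT the respect study data
toy <- data.frame(
    condition = c("No Respect", "No Respect", "No Respect", "Respect", "Respect", "Respect"),
    eaten = c(0.1, 0.2, 0.3, 0.2, 0.4, 0.6)
)
toy_model <- lm(eaten ~ condition, data = toy)
grand_mean <- mean(toy$eaten)

# Squared distances: data to grand mean, data to prediction, prediction to grand mean
ss_total <- sum((toy$eaten - grand_mean)^2)
ss_error <- sum(resid(toy_model)^2)
ss_model <- sum((predict(toy_model) - grand_mean)^2)

# Proportion of the empty model's error that the two-group model removes
pre <- ss_model / ss_total

print(c(ss_total = ss_total, ss_model = ss_model, ss_error = ss_error, pre = pre))
print(all.equal(ss_total, ss_model + ss_error))
```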

5.5 - Interpret the PRE for the respect model. Is that a lot? Is that a little? Is it hard to tell?

5.6 - What would it mean if PRE = 0.00, or if PRE = 1.00? Could PRE = 2.00?

5.7 - Is it possible that we could have gotten this PRE from a random process (where these groups are essentially the same but they are just a little different because of randomness)?
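
One way to explore this question is to shuffle the group labels, which breaks any real link between group and outcome, and see what PREs come out of randomness alone. The sketch below does this on made-up data using base R's `sample()`. The `toy` data frame, the helper `pre_of()`, and the choice of 1000 shuffles are all assumptions made for illustration.

```r
# Made-up numbers for illustration only; NOT the respect study data
toy <- data.frame(
    condition = c("No Respect", "No Respect", "No Respect", "Respect", "Respect", "Respect"),
    eaten = c(0.1, 0.2, 0.3, 0.2, 0.4, 0.6)
)

# Hypothetical helper: PRE of a two-group model compared to the empty model
pre_of <- function(outcome, group) {
    ss_total <- sum((outcome - mean(outcome))^2)
    ss_error <- sum(resid(lm(outcome ~ group))^2)
    (ss_total - ss_error) / ss_total
}

set.seed(1)  # makes the shuffles repeatable
observed_pre <- pre_of(toy$eaten, toy$condition)

# Shuffle the labels many times and record the PRE from each shuffle
shuffled_pres <- replicate(1000, pre_of(toy$eaten, sample(toy$condition)))

# Proportion of shuffles with a PRE at least as large as the observed one
print(mean(shuffled_pres >= observed_pre))
```

If shuffled labels often produce a PRE as large as the observed one, randomness alone remains a plausible explanation for the difference between groups.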