# 4B: Research Design and Launching Distance

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

LaunchBears <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vR-n7sbDgxBhPlIew383bKP1wbOjyexLAXcevLMMGFHmWgUTWTEn9jWn1cUcujAmjByussTNozcAto0/pub?gid=745552022&single=true&output=csv", header=TRUE)

## 1.0 - Revisiting the Gummy Bear Catapult Experiment

Remember when we created our candy catapults? Now that we’ve learned some new data science skills, let’s revisit that data and think about this question: 

> What affects gummy bear launching distance?

A sample data set has been loaded for you into `LaunchBears`. 

(**Note to Instructors**: If you did lessons 2A and 2B with your students, you can upload the actual data from your class if you prefer.)

1.1 - Do you have any questions about these variables? 

- `LiftCategory`: number of lift sticks as a categorical variable (L1 = one stick, L2 = two sticks, L3 = three sticks)
- `NumLifts`: number of lift sticks as a number (1, 2, 3)
- `Dist_cm`: how far the gummy bear was launched, in centimeters
- `BearColor`: color of the gummy bear
- `Group`: which group of students did the launches
- `BearHt`: how tall the bear was, in millimeters
- `Tool`: which measuring tool was used
- `CatapultColor`: the color of the catapult
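
The two lift-stick variables carry the same information in two forms, one numeric and one categorical. As a minimal sketch of how a numeric count could be recoded into labeled categories in base R (the toy vector `num_lifts` is hypothetical, and this is not necessarily how `LiftCategory` was created in the data set):

```r
# Hypothetical toy values standing in for the NumLifts column
num_lifts <- c(1, 3, 2, 2, 1, 3)

# Recode the numeric counts as a factor with the labels L1, L2, L3
lift_category <- factor(num_lifts, levels = c(1, 2, 3), labels = c("L1", "L2", "L3"))

# The numeric version supports arithmetic such as a mean...
mean(num_lifts)

# ...while the categorical version is treated as group labels to be counted
table(lift_category)
```

R treats the numeric version as a quantity and the factor version as a set of groups, which is why faceted plots in this lesson use `LiftCategory`.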

1.2 - Let’s take a look at the distribution of launching distances. What do you notice?

In [None]:
# Sample histogram (students can also switch to a density histogram and overlay a density plot)
gf_histogram(~Dist_cm, data = LaunchBears, fill = "aquamarine1", color = "darkblue")


# Sample boxplot with jitter
gf_boxplot(Dist_cm ~ 1, data = LaunchBears, fill = "palegoldenrod", color = "seagreen4") %>%
  gf_jitter(color = "plum4")

1.3 - Hmmm... some gummy bears were launched a lot farther than others. In your opinion, which of the variables in `LaunchBears` might affect distance? Which variables probably do not affect distance?

1.4 - Make a few visualizations to look at the variables that might affect distance.

1.5 - Now make some visualizations to look at the variables that might NOT affect distance.

1.6 - Which variable(s) seems best at explaining some of the variation in distance?

## 2.0 - Modeling the Possibilities with Word Equations

A lot of people think `LiftCategory` might help us explain the variation in launch distance, such that, if we knew how many lift sticks the catapult had, we could adjust our prediction of distance a little bit. 

2.1 - If we knew that a gummy bear was launched with 3 lift sticks, how would you adjust your prediction? How about just 1 lift stick?

2.2 - How would we write a word equation for the hypothesis that the number of lift sticks explains some of the variation in distance? 

2.3 - We could be wrong though. So how would we write a word equation for the hypothesis that the number of lift sticks does not explain the variation in distance? 

We will consider each word equation a little theory (or “hypothesis” or “model”) of the DGP that created our launching distance data.

## 3.0 - Explaining variation with `LiftCategory`

Let’s check out a visualization that many of you may have made, but be careful not to jump to conclusions.

In [None]:
gf_histogram(~ Dist_cm, data = LaunchBears) %>%
  gf_facet_grid(LiftCategory ~ .) %>%
  gf_boxplot(width = 5, fill = "white")

3.1 - What are some reasons (from the data) for suspecting that `LiftCategory` really does explain some of the variation in `Dist_cm` scores?

3.2 - What are some reasons (from the data) for suspecting that `LiftCategory` really does **_not_** explain the variation in `Dist_cm` scores?

3.3 - Is it possible to have gotten this pattern of data by chance? For example, if we just shuffled these distances into three different groups randomly?

3.4 - Take a look at the diagram of “explaining variation” below. What aspects of the histograms represent the “explained variation” part? What aspects of the histograms represent the “unexplained” variation?

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_4C_sources_variation.png" title="Explaining variation diagram" />


## 4.0 - Simulating Unexplained Variation using a Random Data Generating Process (DGP)

4.1 - Remember Westvaco? We couldn’t “simulate” age discrimination **_but_** we could simulate firing people “randomly.” How did we “simulate” a random process then? Which R function works like that?

4.2 - The definition of “random” includes the idea that it doesn’t systematically pick numbers that are bigger or smaller (it’s unbiased). Why is the `sample()` function a “random” process?
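
One way to see what “unbiased” means here is to draw a large number of values and check that none of them is favored. The following is a minimal sketch in base R. The six values, the number of draws, and the seed are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed so the same draws can be reproduced

values <- 1:6

# Draw 60,000 values with replacement; each value has the same chance every time
draws <- sample(values, size = 60000, replace = TRUE)

# Proportion of draws that landed on each value
proportions <- table(draws) / length(draws)
round(proportions, 3)
```

If `sample()` systematically preferred bigger or smaller numbers, some of these proportions would sit well away from 1/6.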

4.3 - Now we have a slightly different situation from Westvaco – we have launching distances in three groups. How will the R function `shuffle()` mimic a random process?

Here is some code that shuffles the distances (`Dist_cm`) into different `LiftCategory` groups **_and_** makes a visualization all at the same time. 

In [None]:
# Run this code a few times
gf_histogram(~ shuffle(Dist_cm), data = LaunchBears) %>%
  gf_facet_grid(LiftCategory ~ .) %>%
  gf_boxplot(width = 2.5, fill = "white")

4.4 - What is this code doing? Explain. How is that different from the code for creating a visualization of the actual data?
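
To make the idea of shuffling concrete, here is a minimal sketch on a tiny made-up data frame. It uses base R's `sample()` as a stand-in for `shuffle()` so that it runs without any packages. The data frame `toy` and its values are hypothetical, not taken from `LaunchBears`.

```r
set.seed(1)  # arbitrary seed so the same shuffle can be reproduced

# Hypothetical toy data: three launches per lift category
toy <- data.frame(
  LiftCategory = rep(c("L1", "L2", "L3"), each = 3),
  Dist_cm = c(50, 55, 60, 80, 85, 90, 110, 115, 120)
)

# With no size argument, sample() returns the same values in a random order
toy$Shuffled <- sample(toy$Dist_cm)

# The overall set of distances is unchanged; only the group each one lands in differs
identical(sort(toy$Shuffled), sort(toy$Dist_cm))

# Group means for the real pairing versus the shuffled pairing
tapply(toy$Dist_cm, toy$LiftCategory, mean)
tapply(toy$Shuffled, toy$LiftCategory, mean)
```

The group labels stay where they are while the distances move, so any connection between group and distance in the shuffled column comes from chance alone.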

4.5 - If we shuffled the `Dist_cm` values into three random groups, would our actual data look similar or different from the randomly shuffled data? (Feel free to try shuffling a few times in order to develop your intuition.)

4.6 - We ran `shuffle()` (the code we showed you above) a bunch of times. One of the faceted histograms below is the empirical sample (the sample from the real data). Can you tell which one it is? What makes it look different from the samples we created by shuffling?

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_4c_shuffles.jpg" title="Several shuffled histograms" />

## 5.0 - What did we learn from shuffling?

5.1 - Do you think the likelihood of getting a pattern of data like our empirical sample from a random process (like shuffling) is high? Low? Medium? Explain your reasoning.
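
One rough way to put a number on “high, low, or medium” is to shuffle many times and count how often the shuffled groups end up as spread apart as the real ones. The sketch below is only an illustration under stated assumptions: the data frame `toy` is made up, the helper `mean_gap()` is a hypothetical summary (the gap between the largest and smallest group mean), and base R's `sample()` stands in for `shuffle()`.

```r
set.seed(2024)  # arbitrary seed so the same shuffles can be reproduced

# Hypothetical toy data: five launches per lift category
toy <- data.frame(
  LiftCategory = rep(c("L1", "L2", "L3"), each = 5),
  Dist_cm = c(52, 60, 48, 71, 65,
              80, 95, 77, 88, 102,
              110, 98, 125, 117, 130)
)

# Hypothetical summary: gap between the largest and smallest group mean
mean_gap <- function(dist, group) {
  group_means <- tapply(dist, group, mean)
  max(group_means) - min(group_means)
}

# Gap in the toy data as recorded
observed_gap <- mean_gap(toy$Dist_cm, toy$LiftCategory)

# Gaps from 1000 random shuffles of the distances across the same group labels
shuffled_gaps <- replicate(1000, mean_gap(sample(toy$Dist_cm), toy$LiftCategory))

# Proportion of shuffles with a gap at least as large as the observed one
prop_as_extreme <- mean(shuffled_gaps >= observed_gap)

observed_gap
prop_as_extreme
```

To try the same idea on the class data, `toy$Dist_cm` and `toy$LiftCategory` could be replaced with the corresponding `LaunchBears` columns. A small proportion would suggest that a purely random process rarely produces groups as different as the ones we observed.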

5.2 - Let’s think about our “whole thing” diagram. If we create a bunch of histograms from shuffled data, where would we put that in this diagram? Is it empirical data or simulated data?

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_4c_dist_triad.jpg" title="The Distribution Triad" />

5.3 - If the likelihood of getting a sample like the empirical sample from a random process is low, which theory of the DGP would that rule out? Which theory would it support? 

5.4 - What are your overall conclusions when it comes to launching distance? Which model seems better: the one that takes `LiftCategory` into account (`Dist_cm` = `LiftCategory` + Other Stuff) OR the one that leaves `LiftCategory` out (`Dist_cm` = Other Stuff)? Does that mean `LiftCategory` helps us be perfectly accurate in predicting launching distance? Why or why not?

## 6.0 - Reflect and Connect

6.1 - In your own words, what do you think it means for a variable to “explain variation” in an outcome?

6.2 - Compare and contrast today’s example to the Westvaco “whole thing” exercise. What was similar? What was different?