# 5A: The 5-A-Day Challenge: Investigating State Differences in Vegetable Consumption

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# Updated USStates data with election data
States <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSGPPBJEKYHYu563seooxlUmhkUGVHQUeedgHsviPhYv5OtsOOOBnPKdFsIB1J0y6zmxXQ7I_CNnh64/pub?gid=2140707809&single=true&output=csv")

# Keep only the variables used in this lesson
States <- States %>% select(State, Region, Population, IQ, Grade8Math, HighSchool, College, Veg, Smokers, PhysicalActivity, Obesity)

# Drop any rows with missing values
States <- na.omit(States)

## 1.0 - The Data

This dataset includes information about the 50 US states. For this lesson, we have included the following variables: 

- `State` Name of state 
- `Region` Area of the country: MW=Midwest, NE=Northeast, S=South, or W=West
- `Population` Number of residents (in millions)
- `IQ` Mean IQ score of residents
- `Grade8Math` Mean score on standardized test administered to 8th graders
- `HighSchool` Percentage of high school graduates
- `College` Percentage of residents with college degrees
- `Veg` Percentage of residents who eat the daily recommended servings of fruits and vegetables
- `Smokers` Percentage of residents who smoke
- `PhysicalActivity` Percentage of residents who have participated in physical activity in the past month
- `Obesity` Percentage of residents classified as obese

1.1 - Today, we will be working with the variable `Veg`: the percentage of residents in a state who report eating the recommended amount of fruits and vegetables each day. 

In order to get the `Veg` percentages, samples of people in all 50 states were asked this question: “Did you eat at least 5 servings of fruits and vegetables yesterday?” What would be your answer to that question? 

1.2 - Could people lie? Could people try to tell the truth and still get it wrong? How would that be an example of “measurement error” (the idea that the data are “off” from what actually happened)? How could you reduce the measurement error that results from lying?

1.3 - ***Competing hypotheses*** 

Which variable do you think will help us explain more variation in `Veg`: `College` or `PhysicalActivity`? Make a prediction. Review the variable descriptions above. 

1.4 - Create visualizations to explore the two hypotheses above. 

In [None]:
# Create your visualizations here




1.5 - Which relationship appears to be "stronger"? In other words, which variable explains more variation in `Veg`: `PhysicalActivity` or `College`? Are you surprised by this? 

## 2.0 - Explaining Variation

2.1 - It looks like `College` may be a good predictor of `Veg`. 

What are some reasons (from the data) for suspecting that `College` really does explain some of the variation in `Veg`? How would we write this as a word equation?

2.2 - Does every state fit this pattern? What are some reasons (from the data) for suspecting that `College` does NOT explain all of the variation in `Veg`? How would we write this as a word equation?

2.3 - Is it possible to have gotten this pattern of data by chance? Write a word equation that represents this possibility.

## 3.0 - Maybe It’s All Just Other Stuff

3.1 - If we knew nothing about a state, or if we suspected that the pattern of `College` predicting `Veg` was just a random fluke, what should we predict a randomly chosen state's `Veg` to be? 

3.2 - We could add your prediction (the mean, the median, or any other number) to the scatterplot. For now, let’s try adding the mean. 

Note that just as `gf_vline()` adds a vertical line at a chosen x-value, `gf_hline()` adds a horizontal line at a chosen y-value. 

In [None]:
Veg_stats <- favstats(~ Veg, data = States)
Veg_stats

# Add the empty model to this scatterplot
gf_point(Veg ~ College, data = States) %>%



3.3 - Why is the mean represented as a line? Why not a single dot? 

3.4 - Fit the empty model for `Veg`. Then use this model to predict each state’s `Veg` percentage. We’ll then plot the predictions right on top of our original scatterplot.

In [None]:
# How do we fit the empty model?
EmptyModel <-

# How do we generate the predictions from it?
States$Prediction <- 

# This will plot the predictions from the empty model
gf_point(Veg ~ College, data = States, size = 3, alpha = .8) %>%
    gf_point(Prediction ~ College, size = 3, alpha = .1, color = "orange")
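
As a minimal sketch of what fitting and predicting from an empty model involves, the example below uses a small made-up data frame (`toy` and its `score` column are hypothetical, not part of `States`). It assumes the empty model is fit with base R's `lm()` and the `~ NULL` formula, then shows that every prediction equals the mean of the outcome.

```r
# Hypothetical toy data, for illustration only (not the States data)
toy <- data.frame(score = c(10, 12, 15, 19, 24))

# An empty model has no explanatory variable: outcome ~ NULL
toy_empty <- lm(score ~ NULL, data = toy)

# The single estimated parameter is the mean of the outcome
toy_mean <- unname(coef(toy_empty)[1])

# predict() returns one prediction per row; with an empty model
# every row gets the same predicted value
toy$prediction <- unname(predict(toy_empty))

print(toy_mean)
print(toy)
```

Because the empty model knows nothing about any particular row, it makes the same prediction for all of them, which is why its predictions fall on a flat line when plotted.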

3.5 - Now we will start writing the word equation for the empty model as **Veg = Mean + Other Stuff** (instead of the old way: **Veg = Other Stuff**). 

In the cell below, insert our mean into the equation (i.e., replace "Mean" with the prediction of our empty model):

$\text{Veg} = \text{Mean} + \text{Other Stuff}$
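
The word equation can be read as arithmetic on each observation. The sketch below uses made-up numbers (the `score` vector is hypothetical, not from `States`) to show one way to split each value into a "Mean" part and an "Other Stuff" part, and to check that the two parts add back up to the original value.

```r
# Hypothetical toy values, for illustration only (not the States data)
score <- c(10, 12, 15, 19, 24)

# "Mean" part of the word equation: the same number for every observation
mean_part <- mean(score)

# "Other Stuff" part: whatever is left over after subtracting the mean
other_stuff <- score - mean_part

# Adding the two parts back together reproduces each original value
rebuilt <- mean_part + other_stuff

print(data.frame(score, mean_part, other_stuff, rebuilt))
```

In this sketch the "Other Stuff" values are positive for observations above the mean and negative for those below it, and they sum to zero.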

## 4.0 - Simulating a Random Data Generating Process

4.1 - Remember the gummy bear launches and `NumLifts` experiment? How did we “simulate” a random data generating process? Why is that a “random” process? Which R function acts like that?

4.2 - One pattern we saw in our data is that states with high `College` generally also have high `Veg`. But the definition of “random” includes the idea that high numbers on one variable don’t systematically go with high (or low) numbers on the other. Instead, randomness means that high numbers could go with low, medium, OR high numbers! If we shuffled `Veg` in this data frame, would we generate data that look just like our empirical sample?

With R, we don’t just have to wonder. We can actually do it. Run the code below a few times. What is it doing? 

In [None]:
gf_point(shuffle(Veg) ~ College, data = States)
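
To see what shuffling does to individual rows, here is a small sketch on made-up data (the `toy` data frame is hypothetical, not from `States`). It uses base R's `sample()` as a stand-in for `shuffle()`, on the assumption that both simply reorder the values at random.

```r
# Hypothetical toy data, for illustration only (not the States data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(11, 13, 12, 16, 18, 21)
)

set.seed(1)  # arbitrary seed so the sketch is repeatable

# sample() with no size argument returns the same values in a random order
toy$y_shuffled <- sample(toy$y)

# The set of y values is unchanged by shuffling; only the order differs
same_values <- identical(sort(toy$y), sort(toy$y_shuffled))

# Compare the columns row by row: each x is now paired with a random y
print(toy)
print(same_values)
```

Removing the `set.seed()` line and rerunning gives a different random pairing each time, much like rerunning the plotting cell above.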

4.3 - Could we put the `shuffle` around `College`? Try it.

4.4 - One of the scatterplots below is the empirical sample. Does it look any different from the shuffled scatterplots? What are you looking for that is different in the empirical sample?

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-05A-shuffscatter.png" title="a bunch of shuffled scatterplots" />

4.5 - Do you think the likelihood of getting a pattern of data like the empirical sample from a random process is high? Low? Medium? Explain your reasoning.

4.6 - What does your answer to the question above suggest about the relationship between `College` and `Veg` that we see in our empirical data?

## 5.0 - Connecting the Empty Model and Shuffle

5.1 - If we shuffled `Veg` or `College` in this data frame, would we estimate a different empty model? In other words, would the empty model change?

5.2 - With R, we can try it and see what happens. Explain what each line of code is doing.

In [None]:
# 5.3 - What’s this about?
States$shuffled_Veg <- shuffle(States$Veg)

In [None]:
# 5.4 - What’s this about?
shuffled_Veg_stats <- favstats(~ shuffled_Veg, data = States)
Veg_stats <- favstats(~ Veg, data = States)

In [None]:
# 5.5 - What’s this about?
gf_point(shuffled_Veg ~ College, data = States, color = "dodgerblue") %>%
    gf_hline(yintercept = ~mean, data = shuffled_Veg_stats)

In [None]:
# 5.6 - What’s this about?
gf_point(Veg ~ College, data = States) %>%
    gf_hline(yintercept = ~mean, data = Veg_stats)

5.7 - Why is the mean the same on both graphs? Is that just a coincidence? If you ran the shuffle again, would you get a different mean? What if you shuffled `College` instead: would that result in a different mean for `Veg`?

5.8 - Why is the empty model a stand-in for a DGP of randomness?