# Ch 1-14 Test: Advanced Models


```r
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

# Prepare the candy data: rename peanutyalmondy to nutty, rescale sugarpercent
# by a factor of 100, and convert the TRUE/FALSE indicators to factors
candy_rankings <- fivethirtyeight::candy_rankings %>%
    mutate(nutty = as.factor(peanutyalmondy),
           sugarpercent = sugarpercent * 100,
           chocolate = as.factor(chocolate),
           fruity = as.factor(fruity),
           hard = as.factor(hard),
           bar = as.factor(bar)) %>%
    # Keep only the nine variables used in this test
    select(competitorname, winpercent, chocolate, fruity, nutty, hard, bar,
           sugarpercent, pricepercent)
```




*FiveThirtyEight* conducted an online study. Participants were shown two candies at a time (randomly selected from a collection of 85 different Halloween candies) and were asked which of the two they preferred. The results are in a data frame called `candy_rankings`, which has 85 rows and 9 variables. (The original data frame had more variables; we have reduced it for convenience.)

The variables are:

- `competitorname` the name of the Halloween candy
- `winpercent` indicates the overall popularity of each candy measured as the percentage of times the candy was chosen over other candies across 269,000 matchups
- `chocolate` whether the candy contains chocolate (coded TRUE or FALSE)
- `fruity` whether the candy contains fruit (coded TRUE or FALSE)
- `nutty` whether the candy contains peanuts, peanut butter, or almonds (coded TRUE or FALSE)
- `hard` whether the candy is a hard candy (coded TRUE or FALSE)
- `bar` whether the candy is a bar (coded TRUE or FALSE)
- `sugarpercent` the amount of sugar in the candy (measured as a percentile in comparison to all the other candies)
- `pricepercent` the price of the candy (measured as a percentile in comparison to all the other candies)

### Explore Variation

**Question 1** 

Use R to look at the first six lines of this data frame. What are the cases in this data frame?

**Question 2** 

Make a visualization to look at the variation in `winpercent` (the percentage of the time a particular candy won when matched up against other candies in a head-to-head contest). Describe the shape, center, and spread of this distribution.

**Question 3** 

Which candy had the highest `winpercent` (i.e., what is the "best" candy according to this data)? 
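
The exploration steps in Questions 1 to 3 can be sketched in base R. This is a minimal sketch on a hypothetical toy data frame (`snacks`, with made-up `name` and `score` columns), not the `candy_rankings` data; the same calls should carry over once the real data frame and variable names are swapped in.

```r
# Hypothetical toy data (made-up values, for illustration only)
snacks <- data.frame(
  name  = c("A", "B", "C", "D", "E", "F", "G", "H"),
  score = c(52, 47, 66, 38, 71, 59, 44, 63),
  stringsAsFactors = FALSE
)

# head() returns the first six rows by default; each row is one case
first_rows <- head(snacks)
print(first_rows)

# A histogram shows the shape, center, and spread of a quantitative variable
hist(snacks$score, main = "Distribution of score", xlab = "score")

# which.max() gives the row index of the largest value
top_name <- snacks$name[which.max(snacks$score)]
print(top_name)
```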

### The Chocolate Model of Variation

If we wanted to buy some treats for an upcoming event or holiday, it would be helpful to know what kind of candy people prefer. Let's use the `candy_rankings` data set to test the hypothesis that candy with chocolate is preferred over candy without chocolate.

**Question 4** 

Make a visualization to explore this idea and place the best fitting model onto it.

**Question 5** 

Below we have found the best fitting values for the chocolate model and depicted the model in the visualization. Which of these correctly represents this model in GLM notation?

- $chocolate_i = 18.78 + 42.14(winpercent_i) + e_i$

- $winpercent_i = 18.78 + 42.14(chocolate_i) + e_i$

- $winpercent_i = 42.14 + 18.78(chocolate_i) + e_i$

- $chocolate_i = 42.14 + 18.78(winpercent_i) + e_i$
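
As a reminder of how fitted values map onto GLM notation, here is a minimal sketch using base R's `lm()` on hypothetical toy data (the `toy` data frame and its `outcome` and `group` columns are made up for illustration). With a two-level factor as the predictor, `lm()` reports an intercept and one coefficient for the non-reference level.

```r
# Hypothetical toy data: a quantitative outcome and a TRUE/FALSE grouping factor
toy <- data.frame(
  outcome = c(40, 44, 42, 60, 62, 58),
  group   = factor(c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))
)

# The outcome goes on the left of ~ and the explanatory variable on the right
fit <- lm(outcome ~ group, data = toy)
b <- coef(fit)
print(b)

# b[1] ("(Intercept)") is the mean of the reference group (group == FALSE);
# b[2] ("groupTRUE") is the difference between the two group means
group_means <- tapply(toy$outcome, toy$group, mean)
print(group_means)
```

In GLM notation this toy model would be written as $outcome_i = b_0 + b_1(group_i) + e_i$, with the two printed coefficients filling in $b_0$ and $b_1$.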

**Question 6**  

Should the chocolate variable be included in the model of the DGP? (In other words, can we reject the idea that $\beta_1 = 0$ and there is no effect of chocolate in the DGP?) Why or why not?

### The Sugar Model of Variation

Perhaps it's really just a matter of how much sugar is in the candy, and those with more sugar will be favored more. Let's put it to the test.

Here is a visualization with the best fitting model on it.

**Question 7**

Find the parameter estimates of the best fitting model predicting `winpercent` with `sugarpercent`. Write the model in GLM notation.

**Question 8**

Interpret $b_1$: what does this number mean in the context of `winpercent` and `sugarpercent`?

**Question 9** 

What would be the predicted `winpercent` for a candy that had 0% sugar?
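
For a model with a quantitative predictor, the intercept and slope can be read from the fitted model, and a prediction at any predictor value can be obtained with `predict()`. The sketch below assumes hypothetical toy data (`toy`, with made-up `amount` and `outcome` columns) rather than the candy data.

```r
# Hypothetical toy data: a quantitative predictor and a quantitative outcome
toy <- data.frame(
  amount  = c(0, 20, 40, 60, 80, 100),
  outcome = c(44, 46, 51, 52, 57, 59)
)

fit <- lm(outcome ~ amount, data = toy)
b <- coef(fit)
print(b)  # b[1] is the intercept (b0), b[2] is the slope (b1)

# The slope is the change in the predicted outcome per one-unit increase in amount.
# The prediction at amount = 0 is the intercept, because the slope term drops out.
pred_at_zero <- predict(fit, newdata = data.frame(amount = 0))
print(pred_at_zero)
```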

### Evaluating Single-Predictor Models (the `chocolate` model and the `sugarpercent` model)

**Question 10** 

Which of these two models (the chocolate model or the sugar percent model) explains more variation in `winpercent`? Explain how you decided this.
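
One way to compare how much variation two models explain is PRE (proportional reduction in error): the share of the empty model's sum of squares that the model removes. The sketch below computes it by hand in base R on hypothetical toy data; `pre_of()` is a hypothetical helper written for this illustration, not a library function.

```r
# Hypothetical toy data with one categorical and one quantitative predictor
toy <- data.frame(
  outcome = c(41, 45, 48, 52, 58, 63, 66, 71),
  group   = factor(c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)),
  amount  = c(10, 30, 50, 70, 20, 40, 60, 80)
)

# Hypothetical helper: PRE = (SS Total - SS Error) / SS Total
pre_of <- function(fit) {
  y <- model.response(model.frame(fit))
  ss_total <- sum((y - mean(y))^2)     # error left by the empty model
  ss_error <- sum(residuals(fit)^2)    # error left by the fitted model
  (ss_total - ss_error) / ss_total
}

pre_group  <- pre_of(lm(outcome ~ group, data = toy))
pre_amount <- pre_of(lm(outcome ~ amount, data = toy))
print(c(group = pre_group, amount = pre_amount))
```

The model with the larger PRE leaves less unexplained error. For models fit with `lm()`, this value matches the R-squared reported by `summary()`.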

### Even More Complex Models

To explain more of the variation in `winpercent`, let's create a multivariate model that includes both of these explanatory variables in a single model: `chocolate` and `sugarpercent`. 

We have created a visualization and put the best fitting multivariate model on it for you.

**Question 11**

Find the parameter estimates for the best fitting model and use them to fill in the GLM equation below.

$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$
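
A two-predictor model of this form can be fit in base R by adding both explanatory variables to the right side of the formula. The sketch below assumes hypothetical toy data (`toy`, with made-up columns) and shows how the three estimates combine into a prediction for one case.

```r
# Hypothetical toy data with one categorical and one quantitative predictor
toy <- data.frame(
  outcome = c(41, 45, 48, 52, 58, 63, 66, 71),
  group   = factor(c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)),
  amount  = c(10, 30, 50, 70, 20, 40, 60, 80)
)

fit <- lm(outcome ~ group + amount, data = toy)
b <- coef(fit)
print(b)  # b0 = "(Intercept)", b1 = "groupTRUE", b2 = "amount"

# A prediction needs a value for every predictor in the model
new_case <- data.frame(
  group  = factor("TRUE", levels = levels(toy$group)),
  amount = 50
)
pred <- predict(fit, newdata = new_case)

# The same prediction assembled by hand: b0 + b1 * 1 + b2 * amount
by_hand <- b[1] + b[2] * 1 + b[3] * 50
print(c(predict = unname(pred), by_hand = unname(by_hand)))
```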

**Question 12**

Interpret the three parameter estimates $b_0$, $b_1$, and $b_2$.

**Question 13** 

Based on the model, what would you predict the `winpercent` to be for a candy that contains chocolate?

- 38.26 + 18.27

- 38.26 + 18.27 + .086

- 38.26 + 18.27 + .086*sugarpercent

- We can’t tell from the output 

**Question 14** 

Compare the PRE for the three models: the chocolate one-predictor model, the sugarpercent one-predictor model, and the multivariate model. Which model reduces the most error? 

**Question 15**

Why are the PREs for the individual variables different in the ANOVA table for the one-predictor model versus the multivariate model? For example, the PRE for chocolate is .4052 in the one-predictor model and .4005 in the multivariate model.

**Question 16**

According to the ANOVA table and assuming that we will set an alpha ($\alpha$) of .05, which model or models of the DGP should we reject? Put an (X) in front of the models that we should reject.

- ( ) $winpercent_i = \beta_0 + \epsilon_i$
- ( ) $winpercent_i = \beta_0 + \beta_1chocolate_i + \epsilon_i$
- ( ) $winpercent_i = \beta_0 + \beta_2sugar_i  + \epsilon_i$
- ( ) $winpercent_i = \beta_0 + \beta_1chocolate_i + \beta_2sugar_i + \epsilon_i$


**Question 17**

Based on all this analysis, which model is the best model of the DGP (the one that reduces as much error as possible as simply as possible)? Estimate the 95% confidence intervals for the parameters of that model.

**Question 18**

Interpret the 95% confidence intervals for the parameters in the model you selected in the previous question. 
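
For reference, confidence intervals for the parameters of a fitted `lm()` model can be obtained with base R's `confint()`. This is a minimal sketch on hypothetical toy data (made-up values), not the candy data.

```r
# Hypothetical toy data: a quantitative outcome and a TRUE/FALSE grouping factor
toy <- data.frame(
  outcome = c(41, 45, 48, 52, 58, 63, 66, 71),
  group   = factor(c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE))
)

fit <- lm(outcome ~ group, data = toy)

# One row per parameter; the columns are the lower and upper 95% bounds
ci <- confint(fit, level = 0.95)
print(ci)

# Flag the parameters whose interval excludes 0
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
print(excludes_zero)
```

An interval that excludes 0 means that a DGP in which that parameter equals 0 would be rejected at the corresponding alpha level.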