# 8A: Life Expectancies Around the World

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# Load the data

csvlink <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRVo82UG25_Gq0zYVRXUZX5U4vSVcpdFZ_Pn-U-81kACXbDEE5JJx6D41cTxbYAci0HhOJDGpfHe_5H/pub?gid=1875103213&single=true&output=csv"

HappyPlanetIndexData <- read.csv(csvlink, header = TRUE) 

We'll be looking at a data set (collected in 2021) called `HappyPlanetIndexData`. It has 147 countries and the following variables:

- `Country` Name of country
- `Region` Region of the world, coded as: 1 = Latin America, 2 = N America & Oceania, 3 = Western Europe, 4 = Middle East & N Africa, 5 = Sub-Saharan Africa, 6 = South Asia, 7 = Eastern Europe & Central Asia, 8 = East Asia
- `Happiness` Score on a 0-10 scale for average level of happiness (10 being happiest)
- `LifeExpectancy` Average life expectancy (in years)
- `Footprint` Ecological footprint - a measure of the (per capita) ecological impact
- `HPI` Happy Planet Index (0-100 scale) combines well-being and ecological impact
- `HPIRank` HPI rank for the country
- `GDPperCapita` Gross Domestic Product (per capita)
- `Population` Population (in millions)

Last time we explored this data we looked at how different countries vary in their happiness levels. This time, let's use the information in this data to help us think about why countries vary in their life expectancies. 

## 1.0 - Explore Variation in Life Expectancy

1.1 - Explore variation in life expectancy with a visualization. What do you notice?

1.2 - One of the bits of information we have in this dataset is `GDPperCapita`. This is often used as a measure of "how rich" a country is. Using Google, try to find out: What is GDP? What is GDP per capita? What does this tell us? Why do you think it is used as a measure of wealth?

1.3 - Do you think `GDPperCapita` can help us explain variation in `LifeExpectancy`? 
What do you think the relationship will look like? What will richer countries be like?
Poorer countries? 

1.4 - Write this idea as a word equation.

1.5 - Predict what a visualization exploring this word equation will look like.
Then make it. What do you notice?

## 2.0 - Explaining Variation in Life Expectancy with GDP

2.1 - So far we've learned about group models (models that predict how one group might
have a different outcome than another group). Can we create a group model of this hypothesis?
Why or why not? 

2.2 - What will happen if we use `ntile()` to create groups? What kind of groups will it make?
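To check a prediction about `ntile()`, it can help to try the idea on a tiny example first. The sketch below uses made-up GDP values (not from the data set) and a hypothetical helper, `rank_groups()`, written in base R as a stand-in for `ntile()`. It is only an approximation of the idea (rank the values, then cut the ranks into equal-sized groups) and may not match `ntile()` exactly when the groups cannot be made equal in size.

```r
# Made-up GDP-per-capita values for illustration (not from the data set)
toy_gdp <- c(500, 900, 1200, 2000, 3500, 8000, 25000, 60000)

# Hypothetical base-R stand-in for ntile(): rank the values,
# then cut the ranks into n groups of (roughly) equal size
rank_groups <- function(x, n) {
  ceiling(rank(x, ties.method = "first") * n / length(x))
}

toy_group <- rank_groups(toy_gdp, 2)

# How many values land in each group?
table(toy_group)

# What range of GDP values does each group cover?
tapply(toy_gdp, toy_group, range)
```

Comparing the two outputs shows whether the groups are defined by how many cases they hold or by how wide a range of values they cover.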

2.3 - Let's take a look at the scatterplot of GDP and life expectancy below.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_LifeExpGDPScatter.png" title="scatterplot of life expectancy by GDP per capita" width="75%" align="center"/>

In the code below, we are going to color in the dots of this scatterplot with different colors. We've added this argument: `color = ~GDP2` to `gf_point()`.

Before you run the code, make a prediction about how the scatterplot will change. Which dots are going to have the same colors? How many colors will there be? 

In [None]:
HappyPlanetIndexData$GDP2 <- ntile(HappyPlanetIndexData$GDPperCapita, 2)
HappyPlanetIndexData$GDP2 <- factor(HappyPlanetIndexData$GDP2, levels = c(1,2), labels = c("low GDP", "high GDP"))
gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData, color = ~GDP2) 

2.4 - Why are the dots on the right colored differently from the dots on the left? Why does it seem like the blue-ish dots are kind of a skinny/narrow group and the purple dots are a wider group?

2.5 - If we made a jitter plot of `LifeExpectancy` grouped by `GDP2`, which of these would it (roughly) look like? Why? (We also included some boxplots around the dots.)

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_rVjZWg8m-7-C-Boxplots.png" title="three choices: first group lower than second group, groups equal, first group higher than second group" />

2.6 - What are some differences between the scatterplot and the jitter plot?

2.7 - If we shuffled `LifeExpectancy` so it randomly relates to `GDP2`, which of these would most of the shuffled plots (roughly) look like?

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_rVjZWg8m-7-C-Boxplots.png" title="three choices: first group lower than second group, groups equal, first group higher than second group" />
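The shuffling idea in 2.7 can be sketched with a small made-up data set (the values below are not from `HappyPlanetIndexData`). This version uses base R's `sample()` to shuffle the outcome; the helper `mean_diff()` is a hypothetical name introduced just for this illustration.

```r
set.seed(8)  # arbitrary seed so the sketch is repeatable

# Made-up outcome values for two groups (not from the data set)
toy <- data.frame(
  group   = factor(rep(c("low", "high"), each = 6), levels = c("low", "high")),
  outcome = c(58, 61, 63, 65, 66, 70, 72, 75, 77, 79, 80, 83)
)

# Hypothetical helper: difference between group means (high minus low)
mean_diff <- function(outcome, group) {
  means <- tapply(outcome, group, mean)
  unname(means["high"] - means["low"])
}

# Difference in the data as observed
observed_diff <- mean_diff(toy$outcome, toy$group)

# Shuffle the outcome 1000 times, breaking any link between outcome and group
shuffled_diffs <- replicate(1000, mean_diff(sample(toy$outcome), toy$group))

observed_diff
summary(shuffled_diffs)
```

Looking at where the shuffled differences tend to center gives a sense of what a "randomly related" plot would usually look like.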

2.8 - Based on exploring the data with visualizations, which model seems like it might be the 
better model of the Data Generating Process: 

- **LifeExpectancy = GDP2 + Error**
- **LifeExpectancy = MEAN + Error**

What makes you think that?

## 3.0 - Modeling Variation in Life Expectancy with GDP2

If we think the GDP2 model is better, the next question is *how much better*? 

3.1 - Find the best fitting GDP2 model (**LifeExpectancy = GDP2 + Error**). How much better is that
model compared to the empty model (**LifeExpectancy = MEAN + Error**)?
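As a rough illustration of what "how much better" means, the sketch below fits an empty model and a two-group model to made-up data (not the real data set) and computes the sums of squares and PRE by hand in base R. In this course, `supernova()` reports these same quantities; the hand calculation here is only meant to show where the numbers come from.

```r
# Made-up data (not from the data set): four "low" and four "high" cases
toy <- data.frame(
  group   = factor(rep(c("low", "high"), each = 4), levels = c("low", "high")),
  outcome = c(60, 62, 64, 66, 74, 76, 78, 80)
)

# Empty model: one prediction (the mean) for everyone
empty_model <- lm(outcome ~ 1, data = toy)

# Group model: a separate prediction for each group
group_model <- lm(outcome ~ group, data = toy)

# Error left over from each model, measured as a sum of squared residuals
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(group_model)^2)

# Error reduced by the group model, and that reduction as a proportion (PRE)
ss_model <- ss_total - ss_error
pre <- ss_model / ss_total

c(SS_Total = ss_total, SS_Error = ss_error, SS_Model = ss_model, PRE = pre)
```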

3.2 - Write code to make predictions from the GDP2 model and save them as `GDP2_predict`. The commented-out code below, once completed and uncommented, will layer those predictions on top of a jitter plot of the data.

In [None]:
#HappyPlanetIndexData$GDP2_predict <- 

#gf_jitter(LifeExpectancy ~ GDP2, data = HappyPlanetIndexData, color = ~GDP2) %>%
#gf_jitter(GDP2_predict ~ GDP2, color = "darkorange", height = 0)

3.3 - Where are $b_0$ and $b_1$ in this jitter plot?
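One way to think about $b_0$ and $b_1$ is to compare them with the group means. The sketch below does this with made-up data (not from the data set), assuming the first factor level is the reference group, which is R's default coding for `lm()`.

```r
# Made-up data (not from the data set): four "low" and four "high" cases
toy <- data.frame(
  group   = factor(rep(c("low", "high"), each = 4), levels = c("low", "high")),
  outcome = c(60, 62, 64, 66, 74, 76, 78, 80)
)

group_model <- lm(outcome ~ group, data = toy)

# Parameter estimates from the two-group model
b0 <- unname(coef(group_model)[1])  # estimate for the first level ("low")
b1 <- unname(coef(group_model)[2])  # adjustment added for the second level ("high")

# Group means, for comparison with b0 and b0 + b1
group_means <- tapply(toy$outcome, toy$group, mean)

# The model makes one prediction per group
toy$predicted <- predict(group_model)

b0
b1
group_means
unique(toy$predicted)
```

Comparing `b0` and `b0 + b1` with the two group means suggests where each estimate would show up on a jitter plot.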

## 4.0 - Will Three Groups be Better than Two?

We can quantify *how much better* the GDP2 model is over the empty model. But, for now, it is hard for us to evaluate whether that is a lot of "reduced error" or not. We can use values like PRE to compare different kinds of models!  

4.1 - We've created a GDP2 model. Do you think a GDP3 model would explain more error? 
Less error? Make a prediction: How much more/less error?

4.2 - Here we have copied the code used to make `GDP2`. Modify the code to make `GDP3`.

In [None]:
HappyPlanetIndexData$GDP3 <- ntile(HappyPlanetIndexData$GDPperCapita, 3)
HappyPlanetIndexData$GDP3 <- factor(HappyPlanetIndexData$GDP3, levels = c(1,2,3), labels = c("low GDP", "medium GDP", "high GDP"))

4.3 - Let's take a look at the original scatterplot again.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_LifeExpGDPScatter.png" title="scatterplot of life expectancy by GDP per capita" width="75%" align="center"/>

Where will `GDP3` split up the groups? Try coloring the dots by `GDP3` in the scatterplot below.

Will the three groups vary less on life expectancy compared to the two groups created in the GDP2 model?

In [None]:
gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) 

4.4 - What will a jitter plot of `LifeExpectancy` by `GDP3` look like? Describe the pattern. Then try making it with R.

4.5 - Based on your visualizations, do you think the GDP3 model will have more or less error than the GDP2 model? Will it have a bigger or smaller PRE? SS Model? SS Error? SS Total?

## 5.0 - Modeling Variation in Life Expectancy with GDP3

5.1 - Find the best fitting GDP3 model (**LifeExpectancy = GDP3 + Error**). How much better is that
model compared to the empty model (**LifeExpectancy = MEAN + Error**)?

5.2 - Is the best fitting GDP3 model better than the GDP2 model? What makes you think that?

5.3 - Put the predictions of the GDP3 model onto this jitter plot. 

In [None]:
gf_jitter(LifeExpectancy ~ GDP3, data = HappyPlanetIndexData) 

5.4 - What does the model predict for a low GDP country? How about a medium GDP or high GDP country?

5.5 - What does $b_0$ tell us? How about $b_1$ and $b_2$? What part of the graph corresponds to these numbers?

## 6.0 - A Ridiculous Model

Now let's make a complex model that has 145 groups (note that there are only 147 countries in this data set), so almost every country gets to be its own group.

In [None]:
HappyPlanetIndexData$GDP145 <- ntile(HappyPlanetIndexData$GDPperCapita, 145)
HappyPlanetIndexData$GDP145 <- as.factor(HappyPlanetIndexData$GDP145)

6.1 - Check out the jitter plot below. All the dots are a slightly different color. Why?

In [None]:
gf_jitter(LifeExpectancy ~ GDP145, data = HappyPlanetIndexData, color = ~GDP145, show.legend = FALSE)  +  
    scale_color_hue(l=50, c=75)

6.2 - The code below creates the GDP145 model. How many more parameter estimates (the $b$s) will it have compared to the empty model?

Also, try to find that number in the `supernova` table. Where does it appear?

In [None]:
GDP145_model <- lm(LifeExpectancy ~ GDP145, data = HappyPlanetIndexData)

6.3 - We have placed the predictions of the GDP145 model on the jitter plot below. Why is it that the rainbow dots are now all covered by dark orange dots?

In [None]:
HappyPlanetIndexData$GDP145_predict <- predict(GDP145_model)

gf_jitter(LifeExpectancy ~ GDP145, data = HappyPlanetIndexData, color = ~GDP145, show.legend = FALSE) %>%
    gf_jitter(GDP145_predict ~ GDP145, color = "darkorange") +  
    scale_color_hue(l=50, c=75)

## 7.0 - Comparing Models and the F Ratio

7.1 - Of all the models we looked at today (GDP2, GDP3, GDP145), which model is "better"? 
Why is it better? 

7.2 - Judging by PRE, which model is better? Judging by F, which model is better?

7.3 - Why does F penalize the GDP145 model?

7.4 - Some countries are missing from this data. If I told you about a relatively rich country
that is not in the data set, which model would be most useful in generating a prediction of its
life expectancy? Why?

7.5 - Are models with a high PRE always better? Why or why not?
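The comparison between PRE and F in this section can be explored on simulated data. The sketch below is an illustration under stated assumptions, not the analysis of `HappyPlanetIndexData`: the data are randomly generated, `rank_groups()` is a hypothetical base-R stand-in for `ntile()`, and F is computed by hand as (SS Model / df Model) divided by (SS Error / df Error).

```r
set.seed(145)  # arbitrary seed so the sketch is repeatable

# Simulated data (not from the data set): 32 cases, y loosely related to x
n <- 32
x <- sort(runif(n, 0, 100))
y <- 60 + 0.2 * x + rnorm(n, sd = 5)

# Hypothetical base-R stand-in for ntile(): equal-sized groups based on rank
rank_groups <- function(x, k) {
  ceiling(rank(x, ties.method = "first") * k / length(x))
}

# Fit a k-group model and summarize it
fit_summary <- function(k) {
  group <- factor(rank_groups(x, k))
  model <- lm(y ~ group)
  ss_total <- sum((y - mean(y))^2)
  ss_error <- sum(resid(model)^2)
  ss_model <- ss_total - ss_error
  df_model <- k - 1   # parameters beyond the empty model
  df_error <- n - k   # cases left over after estimating k parameters
  c(groups = k,
    parameters = length(coef(model)),
    PRE = ss_model / ss_total,
    F = (ss_model / df_model) / (ss_error / df_error))
}

# Compare models with 2, 4, and 16 groups
results <- t(sapply(c(2, 4, 16), fit_summary))
results
```

Because each grouping here splits the previous groups in half, PRE cannot go down as groups are added. F divides the reduction in error by the number of extra parameters used to get it, so it does not have to move in the same direction as PRE.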