<div class="alert alert-block alert-danger">

# 15B: Wasteful Countries (COMPLETE)

**Use with textbook version 6.0+**


**Lesson assumes students have read up through page: 15.8**


</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this notebook, students will explore their own hypotheses about why some countries produce more waste than others. They will try to predict where the model lines should fall on a visualization before fitting the model. Then they will fit, compare, and evaluate the additive and interaction models for their hypothesis and justify which model to retain. They should weigh the p-value, PRE, and F for each model, as well as the interaction row of the `supernova()` output.
 
#### Includes:

- Trying to predict the lines of a multivariate model
- Comparing the interaction model to the additive model to see which one is the better fit
- Evaluating PRE, F, and p-value for additive vs interaction models
- Interpreting the interaction row of an ANOVA table


</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 40-55 Mins

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# Adjust scientific notation
options(scipen = 10)

## About the Data

This `waste` dataset looks at the amount of waste (garbage/trash, and hazardous waste) produced by each country.

**Description of Variables:**

- `country` Name of the country
- `region` Regional category of the country
- `econ_dev` Level of economic development (Developed, Developing, In Transition)
- `gdp_usd` Gross Domestic Product (GDP) per Capita (USD)
- `total_msw_waste` Total Municipal Solid Waste (MSW) per Year (tons)
- `population` Population size
- `haz_waste` Hazardous Waste per Year (tons)
- `nat_agency` Whether the country has a national agency for enforcing solid waste laws
- `nat_law` Whether the country has a national law governing solid waste management
- `msw_capita` MSW per Capita (tons per person)
- `haz_waste_capita` Hazardous Waste per Capita (tons per person)
- `percent_urban` Percentage of the country that has been urbanized
- `urbanization_rate` The rate of urbanization (Growing Rapidly, Growing, Decreasing)
- `HDI` Human Development Index
- `adult_obesity` Adult Obesity Rate by BMI

Data Source: [Bootstrap World](https://bootstrapworld.org/materials/spring2023/en-us/lessons/choosing-your-dataset/pages/datasets-and-starter-files.html)

<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  5-10 mins

</div>

## 1.0 - Explore Variation

Before you begin, take a look at the variables in the data frame. If you have any questions about what a variable means, look it up online or ask your instructor for clarification.

In [None]:
waste <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTMbRKfQDrLmn5ebPaNeqVykHd1ZlkzygXEjKPICJcjd6tv_Zy75O6cby_-6WmkrNGeTt8UflJpoa0Z/pub?gid=0&single=true&output=csv")
head(waste)
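
Beyond `head()`, a quick structural check can help students see which variables are quantitative and which are categorical before they form hypotheses. The sketch below uses only base R and assumes `waste` was loaded by the cell above; it is one possible way to look over the data, not a required step.

```r
# Assumes `waste` was loaded by the cell above
str(waste)               # variable names, types, and the first few values
table(waste$econ_dev)    # number of countries at each level of economic development
colSums(is.na(waste))    # count of missing values in each column
```

Columns with many missing values are worth noting, since countries missing a predictor or the outcome are dropped when a model is fit.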

1.1 - Every country produces tons (literally) of waste each year. This waste can be categorized as Municipal Solid Waste, or MSW (e.g., typical garbage that goes to the landfill), or hazardous waste (e.g., ignitable, reactive, corrosive, or toxic waste that poses health or environmental threats). For now, let's explore MSW. Create a visualization of the distribution of MSW per capita, and describe the distribution. How much MSW does a person in a typical country tend to produce per year?


In [None]:
# Sample Response
gf_histogram(~msw_capita, data = waste, fill = "aquamarine4") %>%
    gf_boxplot(width = 3, fill = "white")

favstats(~msw_capita, data = waste)

<div class="alert alert-block alert-warning">

**Sample Response**

The average person (per country) tends to produce about 0.37 tons of MSW per year.

The distribution is skewed a bit to the right. Most countries produce less than half a ton per person per year, but there are a few outlier countries that produce more than 1 ton per person per year.

</div>

1.2 - DISCUSS: Why do you think there are some countries that produce more (or less) waste per person than other countries? Come up with some theories.

<div class="alert alert-block alert-warning">

**Sample Response**

Student responses will vary. Encourage them to use the variables in the data frame. They may also come up with ideas that go beyond this data frame, but steer them to see that they won't be able to analyze those ideas without seeking out additional data.

</div>

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time: 25-30 mins

</div>

## 2.0 - Explain Variation in Waste

2.1 - Pick one of your theories to try to explain variation in `msw_capita`, and write it as a word equation.

<div class="alert alert-block alert-warning">

**Sample Response**

*One example of a model students might explore:*

msw_capita = percent_urban + econ_dev + other stuff

</div>

2.2 - DISCUSS: Create a visualization of the variables in your word equation. Before fitting or overlaying any models, what do you notice? Describe any trends you see in the visualization.

In [None]:
# Sample Response
gf_point(msw_capita ~ percent_urban, color = ~econ_dev, data = waste)

<div class="alert alert-block alert-warning">

**Sample Response**


As percent_urban goes up, msw_capita also appears to go up. A lot of the Developing group appears to be clumped together lower on percent_urban and msw_capita, but is also mixed in a lot with the In Transition group. The In Transition group is a little higher on percent_urban and msw_capita, while the Developed group seems to be clumped highest on percent_urban and msw_capita, but also has a few really high outlier values on both percent_urban and msw_capita.
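
The visual impressions above can be checked numerically. The sketch below assumes `waste` is already loaded and that the `coursekata` package is attached; `urban_msw_cor` is an illustrative name, not something defined elsewhere in the notebook.

```r
library(coursekata)

# Assumes `waste` is already loaded by the earlier cell
# Center and spread of MSW per capita within each development level
favstats(msw_capita ~ econ_dev, data = waste)

# Strength of the linear trend between urbanization and MSW per capita,
# using only countries that have both values present
urban_msw_cor <- cor(waste$percent_urban, waste$msw_capita, use = "complete.obs")
urban_msw_cor
```

Group means that rise from Developing to In Transition to Developed, together with a positive correlation, would be consistent with the pattern described in the scatterplot.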

</div>

2.3 - Before fitting anything, try to fit your own model onto the graph. If you had to draw your own line of best fit for each group, where would you draw them?

<div class="alert alert-block alert-warning">

**Note to Instructors**

This is meant to get students thinking about why multivariate models are fit the way they are. Looking at the data, would they give every group a similar slope and intercept, or not? Their answer points toward whether an additive or an interaction model may be the better fit for the data.

*Note: One option might be to have them fit the lines using R. An example of the code is found below:*

`gf_point(msw_capita ~ percent_urban, color = ~econ_dev, data = waste) %>%`
    `gf_abline(intercept = 0, slope = 0)`

*This option may take a bit more time than drawing on screen, paper, or the whiteboard.*

**Sample Approaches:** 

Students will likely take a wide variety of approaches. One potential approach is described below. 

For the model: msw_capita = percent_urban + econ_dev + other stuff

For this graph, students may draw the lines for the Developing and In Transition groups very close to each other. They may draw a similar slope for the Developed group, but with a slightly higher y-intercept.  

</div>

In [None]:
# Sample code if you want to have students draw lines in R

# Placeholder line: students replace intercept and slope with their own guesses
gf_point(msw_capita ~ percent_urban, color = ~econ_dev, data = waste) %>%
    gf_abline(intercept = 0, slope = 0)

    
# The lines below use the fitted additive and interaction model estimates.
# Students should NOT fit the model first to get these parameters. Instead,
# they should adjust intercept/slope by trial and error until they get
# lines they think look right (they don't have to match these exactly).

# Additive Model
gf_point(msw_capita ~ percent_urban, color = ~econ_dev, data = waste) %>%
    gf_abline(intercept = (0.366281 - 0.29841), slope = 0.002593, color = "purple") %>%
    gf_abline(intercept = 0.366281, slope = 0.002593, color = "turquoise") %>%
    gf_abline(intercept = (0.366281 - 0.245697), slope = 0.002593, color = "brown")
 
lm(msw_capita ~ econ_dev + percent_urban, data = waste)

# Interaction Model
gf_point(msw_capita ~ percent_urban, color = ~econ_dev, data = waste) %>%
    gf_abline(intercept = (0.3026967 - 0.1327807), slope = (0.0033943 - 0.0034708), color = "purple") %>%
    gf_abline(intercept = 0.3026967, slope = 0.0033943, color = "turquoise") %>%
    gf_abline(intercept = (0.3026967 - 0.1715497), slope = (0.0033943 - 0.0009871), color = "brown")
 
lm(msw_capita ~ econ_dev * percent_urban, data = waste)
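
For instructors who would rather not hard-code the estimates, the group intercepts used in the additive `gf_abline()` calls above can be rebuilt from the fitted coefficients. This is a minimal sketch: it assumes `waste` is loaded, that "Developed" is the reference level of `econ_dev`, and that the coefficient names match those shown in the code (check `names(b)` first). `b` and `group_intercepts` are illustrative names.

```r
library(coursekata)

# Assumes `waste` is loaded and "Developed" is the reference level of econ_dev,
# so the coefficient names below match what lm() prints; check names(b) first.
add_model <- lm(msw_capita ~ econ_dev + percent_urban, data = waste)
b <- coef(add_model)

# Additive model: one shared slope, and each group's intercept is the
# reference-group intercept plus that group's adjustment
group_intercepts <- c(
    Developed       = unname(b["(Intercept)"]),
    Developing      = unname(b["(Intercept)"] + b["econ_devDeveloping"]),
    `In Transition` = unname(b["(Intercept)"] + b["econ_devIn Transition"])
)
group_intercepts
unname(b["percent_urban"])   # common slope for all three lines
```

The same idea extends to the interaction model, where each non-reference group also gets its own slope adjustment added to the reference slope.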

2.4 - Now, go ahead and fit your model as an additive model as well as an interaction model, and create visualizations to depict each model. How do the lines of each model compare to the lines you predicted?

In [None]:
# Sample Additive Model
add_model <- lm(msw_capita ~ econ_dev + percent_urban, data = waste)
add_model
gf_point(msw_capita ~ percent_urban, color = ~econ_dev, data = waste) %>%
    gf_model(add_model)

# Sample Interaction Model
int_model <- lm(msw_capita ~ econ_dev * percent_urban, data = waste)
int_model
gf_point(msw_capita ~ percent_urban, color = ~econ_dev, data = waste) %>%
    gf_model(int_model)
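
To compare the two sets of lines numerically rather than only by eye, one option is to ask each model for its prediction at the same value of `percent_urban`. The sketch below assumes `add_model` and `int_model` from the cell above, that `percent_urban` is on a 0-100 scale, and that the level names match those in the data; `new_countries` and the value 50 are hypothetical choices for illustration.

```r
# Assumes add_model and int_model were fit in the cell above.
# Three hypothetical countries, one per development level, all 50% urbanized.
new_countries <- data.frame(
    econ_dev = c("Developed", "Developing", "In Transition"),
    percent_urban = 50
)

# Predicted MSW per capita from each model for the same hypothetical countries
new_countries$additive <- predict(add_model, newdata = new_countries)
new_countries$interaction <- predict(int_model, newdata = new_countries)
new_countries
```

If the two prediction columns are close across the range of `percent_urban` seen in the data, the interaction model's lines are not telling a very different story from the additive model's.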

2.5 - Do the lines of the two models show very different patterns? What does this suggest?

<div class="alert alert-block alert-warning">

**Sample Response**

If the lines show a similar pattern to the empty model then this suggests they are not much different from the empty model. If the model lines for an interaction model are not much different from the additive model, this suggests that the additive model may be the better model. However, the more the models diverge (from the empty model or from each other) the more variation they are likely to explain.
 
</div>

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  10-15 mins

</div>

## 3.0 - Compare and Evaluate the Additive vs Interaction Models

3.1 - Evaluate your models and justify which model to retain: the empty model, the additive model, or the interaction model. Use statistics to back up your answer.

In [None]:
# Sample Additive Model
add_model <- lm(msw_capita ~ econ_dev + percent_urban, data = waste)
supernova(add_model)

# Sample Interaction Model
int_model <- lm(msw_capita ~ econ_dev * percent_urban, data = waste)
supernova(int_model)
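
As a cross-check on the `supernova()` tables, the same comparison can be sketched in base R. This assumes `add_model` and `int_model` from the cell above; `pre_by_model` and `nested_test` are illustrative names. The PRE for a whole model is the same quantity base R reports as R-squared, and `anova()` on two nested models tests whether the extra terms reduce error enough to justify their degrees of freedom.

```r
# Assumes add_model and int_model were fit in the cell above

# PRE for each whole model (reported by base R as R-squared)
pre_by_model <- c(
    additive    = summary(add_model)$r.squared,
    interaction = summary(int_model)$r.squared
)
pre_by_model

# Nested model comparison: do the interaction terms explain enough
# additional variation beyond the additive model?
nested_test <- anova(add_model, int_model)
nested_test
```

The F and p-value in the second row of `nested_test` should correspond to the interaction row of `supernova(int_model)`, since both ask what the interaction terms add over the additive model.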

<div class="alert alert-block alert-warning">

**Sample Response**

**msw_capita = percent_urban + econ_dev + other stuff**

Both models have a p-value less than .05. This means that if the empty model were the true model of the DGP, we would be unlikely to get parameter estimates as far from zero as these. In other words, the data are hard to reconcile with there being no relationship among these variables in the DGP, so we reject the empty model.

In this case, though, the additive model is likely the better choice. Although the interaction model is significant overall, the interaction row shows that the interaction terms add almost nothing to the explained variation beyond the additive model (F=0.76, PRE=.01, p=.47). Comparing the overall models tells the same story: the PREs are very close (.55 for the interaction model compared to .54 for the additive model), but the F is much smaller for the interaction model (40.66 compared to 67.46). We are therefore getting less explained variation per degree of freedom from the interaction model than from the additive model.
 
***Additive Model***
- F = 67.46
- PRE = .54
- p-value < .05

***Interaction Model***
- F = 40.66
- PRE = .55
- p-value < .05

> ***Interaction Row***
- F = 0.76
- PRE < .01
- p-value = .47
 
</div>

3.2 - Summarize your findings. What has your analysis today revealed about waste in the world, and what might be your suggestions for how countries can reduce their waste, or some future analyses you would like to do?

<div class="alert alert-block alert-warning">

**Sample Response**

According to this analysis, MSW per capita tends to increase as the percentage of the country that has been urbanized increases, and it also tends to be higher in economically developed countries. While both of these factors help explain variation in waste, economic development (with a higher PRE) seems to explain more of the variation than urbanization.

I would recommend market policies and incentives for more reusable products.

Based on this, another theory I would like to explore is whether population would be a stronger predictor in the model than percent_urban because there can be non-urbanized areas that still have a lot of people, and more people probably make more waste.
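
A minimal sketch of that follow-up analysis, assuming `waste` is already loaded: fit one model with `percent_urban` and one with `population` alongside `econ_dev`, then compare how much variation each explains. `urban_model`, `pop_model`, and `pre_compare` are illustrative names.

```r
# Assumes `waste` is already loaded by the earlier cell
urban_model <- lm(msw_capita ~ econ_dev + percent_urban, data = waste)
pop_model   <- lm(msw_capita ~ econ_dev + population, data = waste)

# PRE (reported by base R as R-squared) for each model
pre_compare <- c(
    percent_urban = summary(urban_model)$r.squared,
    population    = summary(pop_model)$r.squared
)
pre_compare
```

These two models are not nested, and they may be fit on slightly different sets of countries if the predictors have different missing values, so the PREs are a rough comparison and not a formal test.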

 
</div>