<div class="alert alert-block alert-danger">

# 9E: Intro to Polynomial and Log Models - COMPLETE

**Note**: This lesson assumes the class did 9D. This is an extra notebook that you can use to introduce polynomial and log models (not covered in the textbook), if you would like to. 
    

</div>

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

csvlink<-"https://docs.google.com/spreadsheets/d/e/2PACX-1vRVo82UG25_Gq0zYVRXUZX5U4vSVcpdFZ_Pn-U-81kACXbDEE5JJx6D41cTxbYAci0HhOJDGpfHe_5H/pub?gid=1875103213&single=true&output=csv"

HappyPlanetIndexData <- read.csv(csvlink, header = TRUE) 

# Data cleaning and making two new categorical variables
HappyPlanetIndexData <- na.omit(HappyPlanetIndexData) %>%
  mutate(GDP2 = factor(ntile(GDPperCapita, 2), levels = c(1,2), labels = c("low GDP", "high GDP")),
         GDP3 = factor(ntile(GDPperCapita, 3), levels = c(1,2,3), labels = c("low GDP", "med GDP", "high GDP"))) %>%
  filter(GDPperCapita > 0)

# Gets rid of scientific notation
options(scipen = 999)

We'll be looking at a data set (collected in 2021) called the `HappyPlanetIndexData`. It has 147 countries and the following variables:

- `Country`	Name of country
- `Region` Region code: 1 = Latin America, 2 = N. America & Oceania, 3 = Western Europe, 4 = Middle East & N. Africa, 5 = Sub-Saharan Africa, 6 = South Asia, 7 = Eastern Europe & Central Asia, 8 = East Asia
- `Happiness` Score on a 0-10 scale for average level of happiness (10 being happiest)
- `LifeExpectancy` Average life expectancy (in years)
- `Footprint` Ecological footprint - a measure of the (per capita) ecological impact
- `HPI`	Happy Planet Index (0-100 scale) combines well-being and ecological impact
- `HPIRank`	HPI rank for the country
- `GDPperCapita` Gross Domestic Product (per capita)
- `Population` Population (in millions)
- `GDP2` Category based on GDP (low GDP or high GDP)
- `GDP3` Category based on GDP (low GDP, med GDP, or high GDP)

Last time we explored this data, we looked at how countries vary in life expectancy by GDP group (using either `GDP2` or `GDP3`). We found that three groups based on GDP predict life expectancy more efficiently than two groups (and this holds up even against 4, 5, or 6 groups!). 

The three group variable (`GDP3`) made a model that was even better than a linear model using `GDPperCapita`. That model was also pretty efficient but just not as good as the three group model.

But many of us noticed a problem: `LifeExpectancy` and `GDPperCapita` just don't have a relationship that can be modeled with a **straight** line. Could we base a model on a function that bends instead? That's what we will explore today.

## 1.0: Describe the Relationship Between Life Expectancy and GDP per Capita

**1.1:** Take a close look at the scatterplot depicting the relationship between `LifeExpectancy` and `GDPperCapita`. Describe the shape of the linear model that you would want to create.

In [None]:
gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData)

<div class="alert alert-block alert-warning">

**Sample Response**

- Students may describe the shape as if they are drawing it "going straight up and then making a sharp right turn".
- Students may describe the relationship between the variables "as GDP increases, life expectancy goes up a LOT at first but after a certain point, even if GDP keeps going up, life expectancy doesn't really go up as fast."
- You may also want to look out for gestures and drawings.

</div>

**1.2:**  In your prior algebra classes, you may have learned about some functions that curve. What are some of those functions? Maybe we can try some of them today.

<div class="alert alert-block alert-warning">

**Sample Responses**

- Students are unlikely to say "polynomials" but they may give examples of polynomials such as "parabolas" ($y = x^2$) and the quadratic function ($y = ax^2 + bx + c$). Point out to students that these all have $x^\text{power}$ and these are called "polynomials".
- Some students may also cite these functions:
    - exponential
    - logarithmic (rare)
    - sine/cosine or trigonometric (rare)
    
</div>

**1.3:** In our class, we have used GLM notation rather than the notation of traditional algebra courses. We're going to explore the models listed below (two polynomial models and a logarithmic one), but first let's take a minute to write polynomial and log functions in GLM notation.

In GLM notation, we usually write the simplest term first. We also use $b_0$, $b_1$, $b_2$, etc. rather than $a$, $b$, $c$. Try rewriting these equations in GLM notation. 

**1.3a:** $Y = aX^2 + bX + c$

**1.3b:** $Y = aX^3 + bX^2 + cX + d$

**1.3c:** $Y = a\log(X) + b$

Are there any benefits of writing using GLM notation?


<div class="alert alert-block alert-warning">

**Sample Response**

**1.3a:** $Y = b_0 + b_1X + b_2X^2$

**1.3b:** $Y = b_0 + b_1X + b_2X^2 + b_3X^3$

**1.3c:** $Y = b_0 + b_1\log(X)$
    
Students may not notice, but statisticians like to be able to compare equations. Writing from simplest to more complex terms makes it easier to compare two equations and see what is different.

</div>

## 2.0 - Straight vs. Curved Models

**2.1:** The graph below shows the `LifeExpectancy` values predicted by the linear model. Take a moment to write the GLM notation for this best-fitting linear function.

In [None]:
# this code will depict the predictions of the model
linear_model <- lm(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData)
gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) %>%
  gf_point(predict(linear_model) ~ GDPperCapita, color = "magenta")

# what code would help you write the function in GLM notation?



In [None]:
# COMPLETE
linear_model

<div class="alert alert-block alert-warning">

**Sample Responses**

- $LifeExp_i = 65.9081478 + 0.0002642GDPperCapita_i$
- $Y_i =   65.9081478 + 0.0002642X_i$
- Note for students that when we write *just* the function, or *just* the model, we don't need the error term. Also we aren't going to round at all in this notebook mostly to impress upon students that it's a pain in the butt to write these numbers. That's why people actually end up preferring to use the math notation $b_0$ and $b_1$!

</div>

**2.2:** The code below outputs the ANOVA table. How much error is explained by the model? How many degrees of freedom beyond the empty model are used by this model?

In [None]:
supernova(linear_model)

<div class="alert alert-block alert-warning">

**Sample Responses**

- PRE is 57.7%, the F is 192.672
- 1 additional df compared to the empty model

</div>

## 3.0 - Polynomial Model (start with degree 2)

In a polynomial, we're going to have the same 2 variables (the outcome and the explanatory) but we can do things like square or cube the explanatory variable! Like this:

$Y_i = b_0 + b_1X_i + b_2X_i^2$ <- here is a polynomial model with $X_i$ squared! This is a polynomial of degree 2.

$Y_i = b_0 + b_1X_i + b_2X_i^2 + b_3X_i^3$ <- here is a polynomial model with $X_i$ cubed! This is a polynomial of degree 3.

The `lm()` function is very handy because it can also handle these polynomial functions. We use the function `poly()` and tell R the degree we want. We'd put in 2 if we'd like to include terms up to $X_i^2$, and 3 if we'd like terms up to $X_i^3$.

(Note: For now also include the argument `raw=TRUE` because that will help you interpret the coefficients in a manner that builds upon your previous algebra courses. For other statistical reasons, statisticians prefer to use something called *orthogonal* coefficients but let's leave that aside for the moment.)
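To see what `raw=TRUE` changes, here is a minimal sketch on a small made-up data set (the names `x`, `y`, `raw_model`, `by_hand`, and `orthog_model` are hypothetical and not part of the lesson data). It suggests that `poly(x, 2, raw = TRUE)` gives the same coefficients as typing the squared term by hand with `I(x^2)`, and that the orthogonal version makes the same predictions even though its coefficients look different.

```r
# Made-up data for illustration only (not the Happy Planet Index data)
set.seed(10)
x <- runif(50, min = 1, max = 20)
y <- 3 + 2 * x - 0.08 * x^2 + rnorm(50)

# Three ways to ask lm() for a polynomial of degree 2
raw_model    <- lm(y ~ poly(x, 2, raw = TRUE))  # coefficients on X and X^2
by_hand      <- lm(y ~ x + I(x^2))              # same model, typed out term by term
orthog_model <- lm(y ~ poly(x, 2))              # orthogonal version (R's default)

# The raw and by-hand coefficients match; the orthogonal ones look different
coef(raw_model)
coef(by_hand)
coef(orthog_model)

# The raw and orthogonal models still make the same predictions
all.equal(fitted(raw_model), fitted(orthog_model))
```

The raw coefficients are the ones that line up with $b_0$, $b_1$, and $b_2$ in the GLM notation, which is why the lesson uses them.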

In [None]:
# how to create the best-fitting polynomial of degree 2 model
poly2_model <- lm(LifeExpectancy ~ poly(GDPperCapita, 2, raw=TRUE), data = HappyPlanetIndexData)

# how to graph the predictions of this model
gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) %>%
  gf_point(predict(poly2_model) ~ GDPperCapita, color = "magenta")

**3.1:** Take a moment to write the GLM notation for this best-fitting polynomial of degree 2 function.

In [None]:
# COMPLETE
poly2_model

<div class="alert alert-block alert-warning">

**Sample Responses**

- $LifeExp_i = 62.578136016270 + 0.000590101776GDPperCapita_i + -0.000000003973GDPperCapita_i^2$
- $Y_i = 62.578136016270 + 0.000590101776X_i + -0.000000003973X_i^2$
- Hopefully students start to see that polynomials (quadratic equations in this case) in the real world don't just have nice whole numbers that are easily factorable.

</div>

**3.2:** Visually, how does this model compare to the linear model? What are things you like and don't like about it?

<div class="alert alert-block alert-warning">

**Sample Responses**

- PRO: The curve seems good especially for countries with GDP per capita less than 90,000. 
- CON: After 90,000, the model curves back down... that doesn't seem right that very rich countries would have lower life expectancies.    
- Point out to students that this is a "parabola" ($y = x^2$). 
    
</div>

**3.3:** Connecting to algebra:

- What does $b_0$ mean? Is it still the y-intercept (the predicted y when x is 0)? 
- Which coefficient makes this parabola this "upside-down u-shape"? 


<div class="alert alert-block alert-warning">

**Sample Responses**

- Yes! $b_0$ still means the same thing. Polynomials also have a y-intercept.
- That the $b_2$ (the coefficient in front of the $X^2$ term) is negative.
    
</div>
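The role of the negative $b_2$ can be checked with a little algebra: a parabola $b_0 + b_1X + b_2X^2$ turns around at $X = -b_1 / (2b_2)$. The sketch below plugs in the rounded coefficients from the sample response above, so the result is only approximate. The helper `predict_poly2()` is a hypothetical name introduced here for illustration.

```r
# Rounded coefficients copied from the sample response for poly2_model
b0 <- 62.578136016270
b1 <- 0.000590101776
b2 <- -0.000000003973

# A parabola b0 + b1*X + b2*X^2 turns around at X = -b1 / (2 * b2)
turning_point <- -b1 / (2 * b2)
turning_point

# Hypothetical helper: the prediction of the degree 2 model at a given GDP per capita
predict_poly2 <- function(x) b0 + b1 * x + b2 * x^2

# Because b2 is negative, predictions on either side of the turning point are lower
predict_poly2(c(turning_point - 20000, turning_point, turning_point + 20000))
```

With these rounded values the turning point lands in the mid-70,000s. The curve is nearly flat around its peak, so on the scatterplot the downturn may only become noticeable at somewhat higher GDP values.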

**3.4:** Just like the linear model, we can use `supernova()` to find the PRE and F for this model. How do the PRE and F compare to those of the linear model? How many additional degrees of freedom does it use compared to the empty model?

In [None]:
# COMPLETE
supernova(linear_model)

supernova(poly2_model)

<div class="alert alert-block alert-warning">

**Sample Responses**

- PRE is better for the polynomial degree 2 model (.73) than the linear model (.57). That means more of the error has been reduced by this model. 
- But in adding the $X_i^2$ term, we actually spent another degree of freedom in order to estimate $b_2$. Notice that the df Model is now 2 (instead of 1).
    
</div>
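For anyone curious where PRE, F, and the model degrees of freedom come from, here is a minimal sketch that computes them by hand in base R. It uses made-up data (the names `x`, `y`, `empty_model`, and `quad_model` are hypothetical), so the numbers will not match the Happy Planet Index output; the assumption is that these quantities correspond to the PRE, F, and df entries in the ANOVA table.

```r
# Made-up data for illustration only (not the Happy Planet Index data)
set.seed(20)
x <- runif(100, min = 1, max = 100)
y <- 50 + 8 * log(x) + rnorm(100, sd = 2)

empty_model <- lm(y ~ 1)
quad_model  <- lm(y ~ poly(x, 2, raw = TRUE))

# Sum of squared error left over by each model
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(quad_model)^2)

# PRE: proportion of the empty model's error that the model removes
pre <- (ss_total - ss_error) / ss_total

# Degrees of freedom: one per parameter estimated beyond b0
df_model <- length(coef(quad_model)) - 1   # b1 and b2
df_error <- df.residual(quad_model)

# F: error reduced per df spent, relative to error remaining per df left
f_ratio <- ((ss_total - ss_error) / df_model) / (ss_error / df_error)

c(PRE = pre, F = f_ratio, df_model = df_model)
```

Because F divides the error reduced by the degrees of freedom spent, a model has to "earn" each extra term.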

## 4.0 - Other Polynomial Models

Now you try. We'll add the code we used for the degree 2 polynomial model below. Try to see if having 3 or 4 terms (or even more) will be a lot better!

In [None]:
# modify this for 3 or 4 terms
poly_model <- lm(LifeExpectancy ~ poly(GDPperCapita, 2, raw=TRUE), data = HappyPlanetIndexData)

gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) %>%
  # plot the predictions of the model you fit above
  gf_point(predict(poly_model) ~ GDPperCapita, color = "magenta")

In [None]:
# COMPLETE VERSION
# for polynomial, degree 3
poly3_model <- lm(LifeExpectancy ~ poly(GDPperCapita, 3, raw=TRUE), data = HappyPlanetIndexData)

gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) %>%
  gf_point(predict(poly3_model) ~ GDPperCapita, color = "orange2")

# for polynomial, degree 4
poly4_model <- lm(LifeExpectancy ~ poly(GDPperCapita, 4, raw=TRUE), data = HappyPlanetIndexData)

gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) %>%
  gf_point(predict(poly4_model) ~ GDPperCapita, color = "orange3")


**4.1:** Isn't it a real pain to write all those numbers? Write the GLM notation using $b$s for your model instead. So elegant!

<div class="alert alert-block alert-warning">

**Sample Responses**

- $Y_i = b_0 + b_1X_i + b_2X_i^2 + b_3X_i^3$
- $Y_i = b_0 + b_1X_i + b_2X_i^2 + b_3X_i^3 + b_4X_i^4$
</div>

**4.2:** Is a degree 3 or degree 4 polynomial better? Which explains the most variation? Which uses the most degrees of freedom?

In [None]:
# COMPLETE VERSION

supernova(poly3_model)

supernova(poly4_model)

<div class="alert alert-block alert-warning">

**Sample Response**

- The poly3 and poly4 models have the same PRE. However, the poly4 model has a lower F and uses one more degree of freedom than the poly3 model. 
- The PRE is the same because the predictions from poly4 just aren't that different from poly3.

</div>
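The pattern of PRE leveling off as terms are added can be explored with a short loop. This is a sketch on made-up data (the names `x`, `y`, `degrees`, and `results` are hypothetical), so the specific values say nothing about the Happy Planet Index data; it only illustrates how PRE and F can be collected for several polynomial degrees at once.

```r
# Made-up data for illustration only (not the Happy Planet Index data)
set.seed(30)
x <- runif(100, min = 1, max = 100)
y <- 50 + 8 * log(x) + rnorm(100, sd = 2)

# Fit polynomial models of degree 1 through 4 and collect PRE and F for each
degrees <- 1:4
results <- t(sapply(degrees, function(d) {
  model_summary <- summary(lm(y ~ poly(x, d, raw = TRUE)))
  c(degree = d,
    PRE = model_summary$r.squared,
    F = unname(model_summary$fstatistic["value"]))
}))
results
```

Adding a term can never make PRE go down, because the larger model can always reproduce the smaller one by setting the new coefficient to 0. F, however, accounts for the degrees of freedom spent, so it can fall when a new term adds little.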

## 5.0 - Log Models

The polynomial models are flexible because we can make a variety of curves to fit the data. But each additional term spends a degree of freedom. This particular relationship between life expectancy and GDP per capita looks like **the inverse** of an exponential pattern. (In an exponential pattern, the rate of increase gets faster, and the slope gets steeper, as x gets larger.) 

**5.1:** Given the definition of the exponential pattern, try to say the inverse of it.

<div class="alert alert-block alert-warning">

**Sample Response**

- The rate of increase gets slower, the slope is less steep, as x gets larger.

</div>

### Logarithms versus Exponential Functions

The inverse of an exponential function is the logarithmic function. Just like division is the inverse of multiplication and subtraction is the inverse of addition, logs are the inverse of exponents.

Here is a log function written in GLM notation.

$Y_i = b_0 + b_1\log(X_i)$

When we write "log" we mean the natural log (sometimes written as `ln`). In statistics, when we use the log function, we almost always mean the natural log. This is also the case in R:
- the natural log function, $Y = \ln(X)$, is `log(X)` 
- the exponential function, $Y = e^X$, is `exp(X)`
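A quick way to get a feel for these two functions is to try them in R. The sketch below uses only base R; the values in `x` and the names `exp_steps` and `log_steps` are made up for illustration.

```r
# log() and exp() undo each other
log(exp(2))
exp(log(10))

# log() is the natural log; other bases have to be requested
log(100, base = 10)
log10(100)

# Take equal steps in X and see how much each function changes per step
x <- c(10000, 20000, 30000, 40000, 50000)
exp_steps <- diff(exp(x / 10000))   # exponential: each step adds more than the last
log_steps <- diff(log(x))           # log: each step adds less than the last
exp_steps
log_steps
```

The shrinking steps in `log_steps` are the "rate of increase gets slower" pattern described above.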

**5.2:** Below we have written the code for finding the best-fitting log model. Take a look at the code from earlier to figure out how to put the predictions of the log model onto the graph.


In [None]:
log_model <- lm(LifeExpectancy ~ log(GDPperCapita), data = HappyPlanetIndexData)

# put the predictions of the log model onto the graph
# gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) 



In [None]:
# COMPLETE

log_model <- lm(LifeExpectancy ~ log(GDPperCapita), data = HappyPlanetIndexData)

# put the predictions of the log model onto the graph
gf_point(LifeExpectancy ~ GDPperCapita, data = HappyPlanetIndexData) %>%
  gf_point(predict(log_model) ~ GDPperCapita, color = "magenta") 


**5.3:** Visually, what do you think about this model? 
- Will the PRE or F be better than some of the previous models? 
- Based on the GLM notation for this model, how many more degrees of freedom does it use compared to the empty model?


<div class="alert alert-block alert-warning">

**Sample Response**

- Students often really like this model and have high hopes for the PRE.
- If they look at the GLM notation (many forget to do so, $Y_i = b_0 + b_1\log(X_i)$), they will see that it uses just one more parameter ($b_1$) than the empty model! If they notice this, they may also predict the F will be high.
</div>

**5.4:** Use the ANOVA table to check out your predictions about PRE, F, and the degrees of freedom.

In [None]:
# COMPLETE
supernova(log_model)

# if they want to compare to other models
supernova(poly4_model)
supernova(poly3_model)
supernova(linear_model)

**5.5:** What is so great about the log model? Is it even better than the reigning champion: the 3-group model (using `GDP3`)?

In [None]:
# COMPLETE
GDP3_model <- lm(LifeExpectancy ~ GDP3, data = HappyPlanetIndexData)
supernova(GDP3_model)

<div class="alert alert-block alert-warning">

**Sample Response**

- The highest PRE compared to any other model. Highest F compared to other models.
- Only uses 1 degree of freedom more than empty model. (It's just as simple as the linear model but does so much better.)
    
Log functions! Aren't they great!?!?!
</div>