<div class="alert alert-block alert-danger">

# 13B: Happy College Students (COMPLETE)

**Use with textbook version 6.0+**


**Lesson assumes students have read up through page: 13.9**


</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this lesson, students will explore a dataset that estimates happiness at colleges in three different countries from geotagged Instagram photos. The focus is to examine whether the sample PRE of the multivariate model could have been generated by the empty model (you may also choose to swap out F for PRE). The goal is to connect the sampling distribution created by the empty model to the p-value.

#### Includes:

- Interpreting the overall PRE, F, and p-value (the Model row of the ANOVA table)
- Connecting the empty model to the null hypothesis (the random model of the DGP)

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 60-75 Mins

</div>

In [None]:
# this loads the packages and functions used in coursekata 
suppressPackageStartupMessages({
    library(coursekata)
})

# loads data, keeps the variables used in this lesson, and drops rows with missing values
univ <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vToEmFPyz81gyEvNWm7V8J7Iq9cqhfZ_Dg7XB7siW1M1v7GaQjBrgxYTbRsNbILytQ2ySeHCzVQ7NlQ/pub?output=csv")
happy_univ <- na.omit(select(univ, university_name, code, percent_smiles, num_students, student_staff_ratio, female_male_ratio, world_rank_times, world_rank_cwur, national_rank, quality_of_education))

# removes scientific notation from this notebook
options(scipen = 999)

<img src="https://news.calstatela.edu/wp-content/uploads/2021/07/Streamers2.jpg" alt="thrilled college students at graduation" width = 80%>

## Good Times in College

College can be life changing in many ways, and for many people it's the site of some happy memories. For example, for [one in 7 married Americans](https://www.prnewswire.com/news-releases/only-1-in-3-us-marriage-proposals-are-a-surprise-engagement-ring-spend-rises-according-to-the-knot-2017-jewelry--engagement-study-300552669.html), college is the place where they met their future spouse.

Data analysts at [Resume.io](https://resume.io) used Instagram and Amazon Rekognition (an AI tool that measures levels of emotion in human faces) to estimate happiness levels at universities. Thousands of Instagram photos geotagged at different universities were put through the AI tool. The data set called `happy_univ` contains the proportion of faces that register as "happy" for 126 universities in three different countries (United States, Great Britain, and Australia).

You can read more about Resume.io's methodology [here](https://resume.io/blog/the-happiest-schools-in-the-us-uk-and-australia). Special thanks to [Dr. Jess Hartnett](https://twitter.com/Notawful) for finding this data set.

## College rankings

Information about happiness from IG photos was joined with other available information about colleges, such as the [Times Higher Education world rankings](https://www.timeshighereducation.com/world-university-rankings/2016/world-ranking/detailed) and the [Center for World University Rankings (CWUR) rankings](https://cwur.org/2015.php).



<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  20-25 mins

</div>

## 1.0: Explore Variation



In [None]:
head(happy_univ)

**1.1:** Which variables do you want to know more about?

### About `happy_univ`

The `happy_univ` data frame has data from 126 universities:

- `university_name`  name of university
- `code` the country the university is in represented by the [ISO code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#:~:text=ISO%203166%2D1%20alpha%2D3%20codes%20are%20three%2Dletter,special%20areas%20of%20geographical%20interest.) (USA, GBR, and AUS) 
- `percent_smiles` the percentage of IG photos geotagged at the university that contained a "happy" face according to [Amazon Rekognition](https://aws.amazon.com/rekognition/image-features/?trk=a7f57dee-fc58-4084-9037-cb552d58a5d5&sc_channel=ps&s_kwcid=AL!4422!3!637214078536!!!g!!&ef_id=Cj0KCQiA2-2eBhClARIsAGLQ2Rm9FFHZ3SZWDNTC4EAnVW-3hR1kLU6lXBjQFV8zJ7WsAPOtUGTQ3Q0aAmbWEALw_wcB:G:s)
- `num_students` how many fully enrolled students (from Times Higher Ed, 2016)
- `student_staff_ratio` the ratio of students to staff (from Times Higher Ed, 2016)
- `female_male_ratio` the ratio of female to male students (from Times Higher Ed, 2016)
- `world_rank_times` international rank (from Times Higher Ed, 2016), lower numbers indicate higher rankings (e.g., 1 is a better rank than 10); beyond 200 the rankings are categorical (e.g., there may be several schools ranked 300)
- `world_rank_cwur` international rank (from [Center for World University Rankings, 2015](https://cwur.org/2015.php)), lower numbers indicate higher rankings (e.g., 1 is a better rank than 10)
- `national_rank` rank compared to other schools in the same country (from [Center for World University Rankings, 2015](https://cwur.org/2015.php)), lower numbers indicate higher rankings (e.g., 1 is a better rank than 10)
- `quality_of_education` based on the academic success of a university’s alumni, and measured by the number of a university's alumni who have won prestigious academic distinctions relative to the university's size (from [Center for World University Rankings, 2015](https://cwur.org/2015.php))


**1.2:** Write a few word equations to try and predict variation in `percent_smiles` and explore with visualizations.

**1.3:** Take a look at the visualization below. What's the corresponding word equation to go with it?

In [None]:
gf_point(percent_smiles ~ world_rank_times, data = happy_univ, color = ~code) %>%
  gf_facet_wrap(~ code)

<div class="alert alert-block alert-warning">

**Sample Response:**

percent_smiles = country + world_rank_times + other stuff

</div>


## 2.0: Model Variation

**2.1:** Create a multivariate model and place it onto the visualization below.

In [None]:
# complete
mult_model <- lm(percent_smiles ~ world_rank_times + code, data = happy_univ)

gf_point(percent_smiles ~ world_rank_times, data = happy_univ, color = ~code) %>%
  gf_facet_wrap(~ code) %>%
  gf_model(mult_model)

**2.2:** Interpret the parameter estimates.

In [None]:
mult_model

<div class="alert alert-block alert-warning">

**Sample Response:**

- $b_0$ = 0.48: the predicted percent_smiles for a university in AUS (the reference group) when world_rank_times is zero.
- $b_1$ = 0.00003172: the adjustment we make to the predicted percent_smiles for every 1 unit increase in world_rank_times.
- $b_2$ = -0.05: the amount we adjust the intercept for a university in GBR (relative to AUS).
- $b_3$ = 0.00067399: the amount we adjust the intercept for a university in USA (relative to AUS).

</div>
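
To make the interpretation concrete, here is a minimal sketch in base R that plugs the rounded estimates from the sample response into the model equation. The rank of 100 and the helper name `predict_smiles` are hypothetical, chosen only for illustration; the results are approximate because the estimates are rounded.

```r
# Rounded estimates copied from the sample response above
b0 <- 0.48         # intercept: AUS when world_rank_times is 0
b1 <- 0.00003172   # adjustment per 1-unit increase in world_rank_times
b2 <- -0.05        # intercept adjustment for GBR
b3 <- 0.00067399   # intercept adjustment for USA

# Hypothetical helper: the comparisons act as 0/1 dummy codes for country
predict_smiles <- function(rank, country) {
  b0 + b1 * rank + b2 * (country == "GBR") + b3 * (country == "USA")
}

# Predictions for an illustrative university ranked 100 in each country
countries <- c("AUS", "GBR", "USA")
preds <- setNames(predict_smiles(100, countries), countries)
print(round(preds, 4))
```

Because the model has no interaction term, all three countries share the slope $b_1$ and differ only in their intercepts, which is why the model lines look parallel across the facets.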


## 3.0: Evaluate Models

**3.1, Discussion:** What are some ways we might evaluate this model in comparison to the empty model in the DGP?

<div class="alert alert-block alert-warning">

**Sample Responses:**

*Students might say some combination or variation of the following:*

- Check if we have reduced a lot of error compared to the total error of the empty model (e.g., check PRE or F; compare SS Total, SS Model, SS Error)
- Compare how "correct" the predictions of each model are
- Check the p-value on the first line of ANOVA table
- Check how "big" the $b$ estimates are (if they predict a big change or not much change from the empty model)
- Use confidence intervals to see if 0 is within the range of plausible $\beta$s

To follow up, have students sort these ideas into two groups: those that evaluate how much variation the multivariate model explains (compared to the empty model) **in the sample**, and those that evaluate it **in the DGP**.

Explain variation in the **Sample**:
- Check if we have reduced a lot of error compared to the total error of the empty model (e.g., check PRE or F; compare SS Total, SS Model, SS Error)
- Compare how "correct" the predictions of each model are
- Check how "big" the $b$ estimates are (if they predict a big change or not much change from the empty model)

Explain variation in the **DGP**:
- See if PRE or F or $b_1$ is unlikely to be generated by the empty model (e.g., with shuffle).
- Check the p-value on the first line of ANOVA table
- Use confidence intervals to see if 0 is within the range of plausible $\beta$s
</div>


**3.2:** This multivariate model *does* explain more variation in this sample than the empty model. How much more? Is that value "a lot"?

In [None]:
# sample responses
pre(percent_smiles ~ world_rank_times + code, data = happy_univ)
fVal(percent_smiles ~ world_rank_times + code, data = happy_univ)

supernova(mult_model)

<div class="alert alert-block alert-warning">

**Sample Response:**

The PRE is about .074: the multivariate model reduces error by about 7.4 percent compared to the empty model. This is a low-to-moderate amount of explained variation.

</div>
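
For instructors who want to show where PRE and F come from, the following is a rough sketch in base R that rebuilds both from sums of squares. It uses simulated stand-in data, not `happy_univ`; the data frame `sim`, its column names, and the effect sizes are all assumptions made up for illustration.

```r
# Simulated stand-in data (NOT the real happy_univ data)
set.seed(13)
n <- 126
sim <- data.frame(
  world_rank = runif(n, 1, 400),
  code = factor(sample(c("AUS", "GBR", "USA"), n, replace = TRUE))
)
sim$smiles <- 0.48 - 0.05 * (sim$code == "GBR") + rnorm(n, sd = 0.08)

empty_model <- lm(smiles ~ 1, data = sim)                  # one parameter
full_model  <- lm(smiles ~ world_rank + code, data = sim)  # four parameters

ss_total <- sum(resid(empty_model)^2)  # error left by the empty model
ss_error <- sum(resid(full_model)^2)   # error left by the multivariate model
ss_model <- ss_total - ss_error        # error reduced by the multivariate model

# PRE: proportion of the empty model's error that the full model removes
pre_by_hand <- ss_model / ss_total

# F: error reduced per df spent, relative to error left per df remaining
df_model <- 3
df_error <- n - 4
f_by_hand <- (ss_model / df_model) / (ss_error / df_error)

print(c(PRE = pre_by_hand, F = f_by_hand))
```

On the real data, `pre()`, `fVal()`, and the Model row of `supernova(mult_model)` report these same quantities.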


The problem is this: The empty model of the DGP (one where `percent_smiles` is unrelated to `world_rank_times` and `code`) **could** have created a PRE as large as (or larger than) .0744. It's possible. But is it likely? 

**3.3:** Is our sample PRE one of the likely PREs to be generated by the empty model? Use simulations to figure this out.

In [None]:
# modify this code
pre(percent_smiles ~ world_rank_times + code, data = happy_univ)

pre(shuffle(percent_smiles) ~ world_rank_times + code, data = happy_univ)


<div class="alert alert-block alert-warning">

**Sample Responses:**

Here we created a sampling distribution of 1000 PREs generated by the empty model. Most of the PREs generated by the empty model are quite low (pretty close to 0). The "unlikely" PREs are the upper .05 of PREs. Note that the sample PRE (.07) is in that unlikely area.

</div>


In [None]:
#complete

sdopre <- do(1000) * pre(shuffle(percent_smiles) ~ world_rank_times + code, data = happy_univ)
sample_pre <- pre(percent_smiles ~ world_rank_times + code, data = happy_univ)

gf_histogram(~ pre, data = sdopre, fill = ~upper(pre, .05)) %>%
  gf_point(0 ~ sample_pre) + 
  # annotations for teaching 
  annotate("text", x = .1, y = 25, label = "Unlikely zone") + 
  annotate("text", x = .02, y = 25, label = "Likely zone") +
  annotate("text", x = .07, y = 100, label = "1000 PREs generated by empty model")

**3.4:** Approximately what percentage of the PREs generated by the empty model are larger than the sample PRE (.0744)? 

<div class="alert alert-block alert-warning">

**Sample Responses:**

Note that this value will be similar to the p-value for the whole model (.024). That is what we are depicting when we create a sampling distribution using the empty model!
</div>
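
If students ask what the shuffling is doing "under the hood," here is a hedged sketch of the same idea in base R. It again uses simulated stand-in data rather than `happy_univ`, and the names `sim` and `r2` are hypothetical; `sample()` plays the role of `shuffle()`, and `replicate()` plays the role of `do()`.

```r
# Simulated stand-in data (NOT the real happy_univ data)
set.seed(2024)
n <- 126
sim <- data.frame(
  world_rank = runif(n, 1, 400),
  code = factor(sample(c("AUS", "GBR", "USA"), n, replace = TRUE))
)
sim$smiles <- 0.48 - 0.04 * (sim$code == "GBR") + rnorm(n, sd = 0.1)

# Hypothetical helper: PRE (R-squared) of the multivariate model for outcome y
r2 <- function(y) summary(lm(y ~ world_rank + code, data = sim))$r.squared

sample_pre <- r2(sim$smiles)

# Shuffling the outcome breaks any link to the predictors,
# which mimics a DGP where the empty model is true
shuffled_pres <- replicate(1000, r2(sample(sim$smiles)))

# Proportion of shuffled PREs at least as large as the sample PRE
p_shuffle <- mean(shuffled_pres >= sample_pre)

# p-value from the F distribution, as reported in an ANOVA table
f <- summary(lm(smiles ~ world_rank + code, data = sim))$fstatistic
p_anova <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(c(shuffle = p_shuffle, anova = p_anova))
```

The two values should be close but not identical: the shuffled proportion changes a little from run to run, while the ANOVA p-value comes from a mathematical model of the same sampling distribution.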


In [None]:
# complete
# proportion of shuffled PREs larger than the sample PRE (see the TRUE column)
tally(sdopre$pre > sample_pre, format = "proportion")

# we can connect the result from tally to the p-value in the ANOVA table
supernova(mult_model)

**3.5:** So what do we think -- is the multivariate model a better contender than the empty model? 

<div class="alert alert-block alert-warning">

**Sample Responses:**

Yes, based on this analysis, we'd reject the empty model -- it isn't easily able to produce the PRE of our sample.

</div>


**3.6:** Does this mean the two predictors (`world_rank_times` and country `code`) are both equally valuable predictors? Why or why not?

<div class="alert alert-block alert-warning">

**Sample Responses:**

No, it doesn't! In fact, some students may look at the ANOVA table and start to suspect that country is more predictive than rank. Students may also compare the multivariate model to a single-predictor model.

These are good ideas to discuss because they lead into the next chapter, where we will more directly compare multivariate models to single-predictor models. 

</div>


**3.7, bonus:** Here are the ANOVA tables for two models:
- **percent smile = rank + country code + other stuff**
- **percent smile = country code + other stuff**

Which has the higher F? Why?

In [None]:
supernova(lm(percent_smiles ~ world_rank_times + code, data = happy_univ))
supernova(lm(percent_smiles ~ code, data = happy_univ))

<div class="alert alert-block alert-warning">

The nice thing about F is that it takes the degrees of freedom spent into account. The country code model is "cheaper" because it spends only 2 degrees of freedom more than the empty model (the multivariate model spends 3). At the same time, the country code model explains almost as much variation (PRE around .07) as the multivariate model (PRE just a little higher, .074). Because F is like "bang for your buck" (variation explained per degree of freedom spent), F is actually higher for the simpler model.

</div>
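
The "bang for your buck" idea can be checked with a little arithmetic. The sketch below converts each PRE into an F using only the degrees of freedom. It assumes all 126 universities are used in both models and takes the rounded PREs (.0744 and roughly .07) from the discussion above, so the results are approximate; `f_from_pre` is a hypothetical helper name.

```r
# Hypothetical helper: F from PRE, df spent beyond the empty model, and sample size
f_from_pre <- function(pre, df_model, n) {
  df_error <- n - df_model - 1
  (pre / df_model) / ((1 - pre) / df_error)
}

n <- 126  # assumption: no rows dropped beyond the 126 universities

f_multi <- f_from_pre(0.0744, df_model = 3, n = n)  # rank + country code
f_code  <- f_from_pre(0.07,   df_model = 2, n = n)  # country code only

# p-value for the multivariate model from the F distribution
p_multi <- pf(f_multi, 3, n - 3 - 1, lower.tail = FALSE)

print(round(c(F_multi = f_multi, F_code = f_code, p_multi = p_multi), 3))
```

Spending a third degree of freedom on rank raises PRE only slightly, so the explained variation per df drops and F falls with it.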

## 4.0: Concluding Thoughts

**4.1:** Why might some countries have more "happy" photos than others? 

**4.2:** What are some issues or questions that might arise from using Instagram photos and AI emotion detection for this research? 

<div class="alert alert-block alert-warning">

Some issues that might arise:
- Are photos on IG representative of students' experiences? People might post happier pictures.
- Are college students even on IG? 
- These are geo-tagged photos... maybe they are just visiting the college and not even enrolled there!
- Will AI detect emotions accurately? What if it's just a smiling person in the background?
- Research has found that [people who are on social media more are unhappier](https://www.scientificamerican.com/article/why-social-media-makes-people-unhappy-and-simple-ways-to-fix-it/) -- could it be true that colleges that post a lot of happy photos have unhappy people who are on social media a lot?
</div>