# Extending From Two-Group to Three-Group Models

## Chapter 8.1-8.3 Overview Notebook

In [None]:
# teacher: please enter a value for class_id
class_id <- "arthur" # put any unique id here (e.g., teacher email address)
user_id <- Sys.getenv("JUPYTERHUB_USER") # no need to do anything here; this gets the user's Jupyter ID

# run this to set up the notebook
library(coursekata)
library(gridExtra)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# generate pet data
set.seed(1)
n <- 90
pet_type <- rep(c("Dog+Cat", "Dog", "Cat"), each = n / 3)
happiness <- round(c(
  rnorm(n/3, mean = 5, sd = 1.8),  # Dog+Cat
  rnorm(n/3, mean = 5.5, sd = 1.8),  # Dog
  rnorm(n/3, mean = 4.5, sd = 1.8)   # Cat
), 1)
pet_owners <- sample(data.frame(pet_type, happiness)) %>%
  select(-orig.id)
# write.csv(pet_owners, "pet_owners.csv", row.names = FALSE)


## 1 Happiness of Pet Owners

There are, of course, cat people and dog people. But which people are happier? And what about people with both? Believe it or not, there is a lot of research on this. But today let's just look at a small sample of data from a survey of pet owners.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/cats-dogs-1.png">

### 1.1 The survey questions

The participating pet owners answered a number of questions, two of which were:

1. Do you own a pet? If so, what kind? Responses were coded as `Cat`, `Dog`, `Dog+Cat`, or `Neither`.
2. On a scale of 1 to 10, how happy are you these days? 1 = extremely unhappy; 10 = extremely happy.

Run the cell below to see what the survey looked like and try answering it.

In [None]:
IRdisplay::display_html(sprintf(
  '<iframe src="https://uclatall.github.io/jupyter-survey/cats-n-dogs-v3.html?class_id=%s&user_id=%s" width="100%%" height="520" 
    sandbox="allow-scripts allow-same-origin" 
    style="border: none; box-shadow: none; background: white; "></iframe>',
  class_id, user_id
))

### 1.2 Write some code to see what's in the `pet_owners` dataset and to explore the distributions of the variables. 

Note: A few participants reported having neither a cat nor a dog. Because that group was so small, we’ve excluded them from this dataset to keep things simple.

The `pet_owners` dataset has two variables:

- `pet_type`: `Cat`, `Dog`, `Dog+Cat`
- `happiness`: a 1-10 scale with 1 = extremely unhappy, 10 = extremely happy

In [None]:
# code here



<div class="guided-notes">

### 1.3 Who do you think would be happier on average? A cat owner, a dog owner, or someone who owns both dogs and cats? Explain your hypothesis. Also write a word equation that says: if we knew a person's `pet_type`, we would be able to make a better prediction of their `happiness`. 
    
</div>


### 1.4 Write code to visualize the relationship between `pet_type` and `happiness` 

In [None]:
# code here




<div class="discussion-question">

### 1.5 Discussion Question: What do you think? Can you tell who is happier on average?
    
</div>

## 2 Fitting and Interpreting Three-Group Model

Most of the concepts and R code you used for constructing and interpreting two-group models can be directly applied to three-group models. There are a few minor differences, but mostly, you can extend your understanding from two-group to three-group models without much trouble. Let's start with fitting the model.

<div class="guided-notes">

### 2.1 Write the code you think you would use to fit the three-group model of happiness based on `pet_type` (Cat, Dog, or Dog+Cat) in the second row of the table.
    
</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:22%"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:39%">Two-Group Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:39%">Three-Group Model</td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top;"><b>Example</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>guess ~ condition</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>happiness ~ pet_type</code></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px"><b>R to Fit Model</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>lm(guess ~ condition, data = anchor_data)</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"> </td>
    </tr>
  </tbody>  
</table>

### 2.2 Write code in the cell below to fit and save `pet_type_model`. Then overlay the model onto the jitter plot of `happiness` by `pet_type`.

In [None]:
# fit and save the model here

# modify this code to overlay the model onto the plot
gf_jitter(happiness ~ pet_type, data = pet_owners, width = .2)



<div class="discussion-question">

### 2.3 Discussion Question: Now that you see the `pet_type` model's predictions, which group seems happier on average? Any other patterns you notice?

</div>

<div class="discussion-question">

### 2.4 Discussion Questions: How many unique predictions do you think the `pet_type_model` will make? What will those predictions be?   

</div>

### 2.5 Print out the parameter estimates for a three-group model

What are the parameter estimates for `pet_type_model`?

In [None]:
# code here



<div class="guided-notes">

### 2.6 The R output has three numbers labeled `(Intercept)`, `pet_typeDog`, and `pet_typeDog+Cat`. Label them as $b_0$, $b_1$, and $b_2$.
    
</div>


<div class="guided-notes">

### 2.7 Draw and write on the jitter plot to show where $b_0$, $b_1$, and $b_2$ are represented   

</div>


In [None]:
# run this code
pet_type_model

gf_jitter(happiness ~ pet_type, data = pet_owners, width = .2) %>%
gf_model(pet_type_model)

<div class="guided-notes">

### 2.8 Write in the first three rows how you would interpret each of the parameter estimates:  
- $b_0$ `(Intercept)`  
- $b_1$ `pet_typeDog`  
- $b_2$ `pet_typeDog+Cat`  

Hint: Build off what you know about interpreting $b_0$ and $b_1$ in two-group models. In the table, we put in an example of how we might interpret a two-group model that only had Cat and Dog owners. 

</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:24%"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:38%">Two-Group Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:38%">Three-Group Model</td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px">$b_0$<br><code>Intercept</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">mean of reference group</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
       <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px">$b_1$<br><code>pet_typeDog</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">adjustment to get from mean of reference group to mean of second group</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"> </td>
    </tr>
    <tr>
       <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px">$b_2$<br><code>pet_typeDog+Cat</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">NA</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px"><b>Model specification (with variable names)</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$\text{happiness}_i = b_0 + b_1\text{pet_typeDog}_i + e_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px"><b>Model specification (with $Y$, $X_{1i}$, $X_{2i}$)</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$Y_i = b_0 + b_1X_{1i} + e_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
       <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px;"><b>Residual</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$e_i = Y_i - \hat{Y}_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
  </tbody>
</table>  

## 3 How R Generates Predictions — GLM Notation for the Three-Group Model

The three-group model generates three parameter estimates. These parameter estimates are part of the function that generates the three model predictions, which are the three group means. But how do they do that? Let's look at the GLM notation for the two-group model, and see if we can figure out how it would work for the three-group model.

For example, if we only had the Cat and Dog groups, we might write a two-group model like this:<br><br>

<span style="font-size: 20px">
$$\underbrace{\text{happiness}_i}_{\text{DATA}} \;=\; \underbrace{(b_0 + b_1\text{pet_typeDog}_i)}_{\text{MODEL}} \;+\; \underbrace{e_i}_{\text{ERROR}}$$</span>

<div class="guided-notes">

### 3.1 In the next row of the large table in your guided notes, write how you think the three-group model would be written using variable names 
</div>

<div class="guided-notes">

### 3.2 In the next row of the table, write the same model in GLM notation, this time using $Y_i$ to represent the outcome variable, and $X_{1i}$ and $X_{2i}$ to represent the explanatory variables.  

</div>

<div style="font-size: 18px; line-height: 1.4; border: 2px solid black; padding: 10px;">
    
We use <b>dummy coding</b> to represent predictor variables in the General Linear Model. (There are other ways to represent predictor variables; this is just one option.) This means that regardless of how many $X$ variables we have, each one can take only two values: 0 or 1. 
    
Each group (beyond the reference group) is represented with an $X_i$. If there is more than one $X_i$ we name them $X_{1i}$, $X_{2i}$, and so on. For all of the $X$s, 1 = is in that group; 0 = is not in that group.
    
</div>

### 3.3 How the model makes predictions

A model is a function that generates a prediction for every row in a dataset. The model function for the three-group model can be written like this: 

<span style="font-size: 20px">$$\hat{Y}_i=b_0+b_1X_{1i}+b_2X_{2i}$$</span>

<div class="discussion-question">

### 3.4 Discussion Questions: Why do we use $\hat{Y}_i$ instead of $Y_i$ in this equation? Why is there no $e_i$? 

</div>

<div class="guided-notes">
    
### 3.5 In the table below, fill in the values that $X_{1i}$ and $X_{2i}$ would have for each of the three groups. Then substitute those values into the function that makes predictions for the three-group model, and fill in the other missing cells in the table. 

</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 10%>Group</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 5%>$X_{1i}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 5%>$X_{2i}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 23%>Function</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 13%>Prediction</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;">Interpretation of Prediction</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 80px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">Cat</td>
      <td style="border: 1px solid black; text-align: left; text-align: center;">0</td>
      <td style="border: 1px solid black; text-align: left; text-align: center;">0</td>
      <td style="border: 1px solid black; text-align: center;">$b_0+b_1(\text{__})+b_2(\text{__})$</td>
      <td style="border: 1px solid black; text-align: center;">$$\\b_0$$</td>
      <td style="border: 1px solid black; text-align: center;"> </td>
    </tr>
    <tr style="height: 80px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">Dog</td>
        <td style="border: 1px solid black; font-weight: bold; text-align: center"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;">$b_0+b_1(\ 1\ )+b_2(\ 0\ )$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;">$$\\b_0 + b_1$$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
    </tr>
    <tr style="height: 80px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">Dog+Cat</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;">$b_0+b_1(\text{__})+b_2(\text{__})$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;">mean happiness of Dog+Cat group</td>
    </tr>
  </tbody>  
</table>

<div class="discussion-question">
    
### 3.6 Discussion Question: In the three-group pet model, what do the dummy-coded variables mean?

- What does $X_{1i}$ represent?
- What does $X_{2i}$ represent?


</div>

<div style="font-size: 18px; line-height: 1.4; border: 2px solid black; padding: 10px;">

R does not require you to set up dummy-coded variables; it does that automatically. So, if you create the <code>pet_type_model</code> using <code>lm(happiness ~ pet_type, data = pet_owners)</code>, R transforms that into a 3-group model with two dummy-coded $X$ variables.
    
</div>
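
One way to see the dummy coding R sets up is to ask a fitted model for its model matrix. The sketch below is only an illustration: `toy_pets`, `toy_model`, and the happiness values in it are made up for this example (they are not the `pet_owners` data), and it uses only base R functions (`lm()`, `model.matrix()`, `coef()`, `tapply()`).

```r
# Hypothetical toy data, invented for illustration (not the pet_owners data)
toy_pets <- data.frame(
  pet_type  = c("Cat", "Dog", "Dog+Cat", "Dog", "Cat", "Dog+Cat"),
  happiness = c(4, 6, 5, 8, 2, 7)
)

# Fit the three-group model; R builds the dummy-coded Xs for us
toy_model <- lm(happiness ~ pet_type, data = toy_pets)

# One row per person: a column of 1s for the intercept,
# then the two 0/1 columns that R created from pet_type
dummy_codes <- model.matrix(toy_model)
dummy_codes

# Plug each row's 0/1 codes into b0 + b1*X1 + b2*X2
toy_predictions <- as.vector(dummy_codes %*% coef(toy_model))
toy_predictions

# Compare the predictions with the mean happiness of each group
group_means <- tapply(toy_pets$happiness, toy_pets$pet_type, mean)
group_means
```

If the sketch behaves as expected, every prediction should match the mean of that person's group, which is one way to check your answers in the table above.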

<div class="discussion-question">
    
### 3.7 Discussion Question: If there were a 4-group model (e.g., Cat, Dog, Dog+Cat, Neither), how many dummy-coded $X$s would R set up?

</div>

<div class="discussion-question">

### 3.8 Discussion Question: How would you write a 4-group model in GLM notation? What would you add to the 3-group model shown below?

<span style="font-size: 24px">$$\hat{Y}_i=b_0+b_1X_{1i}+b_2X_{2i}$$</span>

Is there a pattern between the number of $X$s and the number of groups?
    
</div>

## 4 Analyzing Error From the Three-Group Model

As with the two-group model, we can assess how well the three-group model fits the data by analyzing the error around the model. Our calculations of error start, as always, with residuals; from there we move to sums of squares. We then assess model fit by comparing SS Total (from the empty model) to the leftover SS Error (from the three-group model) to see how much error the model has reduced.
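
The steps in that comparison can be sketched by hand in base R. This is a minimal illustration under stated assumptions: `toy_pets` and its happiness values are invented for this example (not the `pet_owners` data), and the object names (`toy_empty`, `toy_model`, `ss_total`, `ss_error`, `ss_model`, `pre`) are made up here. It is not how `supernova()` is implemented; it only mirrors the arithmetic described above.

```r
# Hypothetical toy data, invented for illustration (not the pet_owners data)
toy_pets <- data.frame(
  pet_type  = rep(c("Cat", "Dog", "Dog+Cat"), each = 3),
  happiness = c(3, 4, 5, 6, 7, 8, 4, 5, 6)
)

# Empty model (one prediction for everyone) and three-group model
toy_empty <- lm(happiness ~ NULL, data = toy_pets)
toy_model <- lm(happiness ~ pet_type, data = toy_pets)

# Square and sum the residuals from each model
ss_total <- sum(resid(toy_empty)^2)  # error around the empty model
ss_error <- sum(resid(toy_model)^2)  # error left over after using pet_type

# Error reduced by the model, and that reduction as a proportion of SS Total
ss_model <- ss_total - ss_error
pre <- ss_model / ss_total

c(SS_Total = ss_total, SS_Model = ss_model, SS_Error = ss_error, PRE = pre)
```

With these made-up numbers the three sums of squares should come out to 20, 14, and 6, so the model and error pieces add back up to the total.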

<div class="guided-notes">

### 4.1 In the final row of the table (shown below), fill in the formula used to calculate residuals from the three-group model
    
</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:24%"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:38%">Two-Group Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:38%">Three-Group Model</td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px">$b_0$<br><code>Intercept</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">mean of reference group</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
       <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px">$b_1$<br><code>pet_typeDog</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">adjustment to get from mean of reference group to mean of second group</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"> </td>
    </tr>
    <tr>
       <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px">$b_2$<br><code>pet_typeDog+Cat</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">NA</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px"><b>Model specification (with variable names)</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$\text{happiness}_i = b_0 + b_1\text{pet_typeDog}_i + e_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px"><b>Model specification (with $Y$, $X_{1i}$, $X_{2i}$)</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$Y_i = b_0 + b_1X_{1i} + e_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
       <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px;"><b>Residual</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$e_i = Y_i - \hat{Y}_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
  </tbody>
</table>  

### 4.2 Run `supernova()` for the `pet_type_model`

In [None]:
# code here
supernova(pet_type_model)

<div class="discussion-question">

### 4.3 Discussion Questions: How much error has been reduced from the empty model to the pet type model? What percentage of the variation in happiness is explained by pet type under this model?
    
Do you see evidence that DATA = MODEL + ERROR? Explain.
    
</div>

## 5 Practice What You've Learned

### 5.1 Gather data from your own class and analyze it.

In [None]:
IRdisplay::display_html(sprintf(
  '<iframe src="https://uclatall.github.io/jupyter-survey/cats-n-dogs-v3.html?class_id=%s&user_id=%s" width="100%%" height="520" 
    sandbox="allow-scripts allow-same-origin" 
    style="border: none; box-shadow: none; background: white; "></iframe>',
  class_id, user_id
))

In [None]:
# run this code to import only data from your class
sheet_url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTm5IafEMmLJBMdaGLiDzAsFu0lEQYXQeKJDNlPVSm33FwdoWdUjYgki1RlDQ-gVfVVH78MfEeNzozm/pub?gid=0&single=true&output=csv"
data <- suppressWarnings(utils::read.csv(sheet_url, header = TRUE, stringsAsFactors = FALSE)) %>%
  mutate(date = as.POSIXct(date, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")) %>%
  filter(class_id == !!class_id) %>%
  select(date, user_id, class_id, pet_type, happiness)

head(data)

### 5.2 Make a data visualization to explore the variation in happiness based on pet type.

In [None]:
# make a visualization here


### 5.3 Create a model and place it onto your data visualization above.

### 5.4 Write the best-fitting model in GLM notation. (Feel free to use b0, b1, X1i, X2i)

Fill in the ...

Yi = ... + ei

### 5.5 Interpret each parameter estimate.

- b0:
- b1: 
- b2:
- b3:


### 5.6 Interpret each dummy-coded variable.

- X1i: 
- X2i: 
- X3i: 
