# Salamander Growth in Mack Creek (COMPLETE)
## Chapter 7.1-7.4 Making Two-Group Models

In [None]:
# run this cell to set up the notebook
suppressMessages(library(coursekata))
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# this will pull in the giant_sal data
giant_sal <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8kCNPCY830PwnzJLnriG70NRApkgELQEJ_n2DY0-b8AgBhIAl2sH00RILA3Indfji8pX-baJu7_Z5/pub?gid=2069091587&single=true&output=csv")

<div class="teacher-note">
<b>Teacher Note:</b> The goal of this notebook is to give students a chance to practice what they are learning about the two-group model with a new data set. 
</div>

## 1. The `giant_sal` data set: Salamander Growth in Mack Creek

<img src="https://lter.github.io/lterdatasampler/reference/figures/and_salamander.jpg" alt="Coastal Giant Salamander at Mack Creek study site" width="250" style="float: left; margin-right: 20px; height: auto;"> 

Biologists studying coastal giant salamanders in Mack Creek (Willamette National Forest, Oregon) wanted to know how the harvesting of forests affects salamander growth. Forest harvesting involves cutting down trees, which could change the creek environment in which the salamanders live and thus impact how large they grow to be. 

To investigate this, they captured and measured the length of 60 coastal giant salamanders. Some salamanders were captured in areas where trees had recently been cut down (clear-cut), while others came from untouched old-growth forest. Your task is to explore this data set, called `giant_sal`, to find out whether forest harvesting impacts salamander size.

### Variables in the `giant_sal` Data Frame:

- `year`: The year the salamander was caught and measured  
- `location`: The section of forest in which the salamander was caught (clear cut or old growth)
- `habitat`: Where in the creek the salamander lived: Coded as `pools` (deeper, slower-moving parts of the creek) or `cascades` (shallow, rocky areas with faster-moving water)
- `length_mm`: Body length in millimeters, measured from its snout to the start of its tail (excluding the tail)  

Note: This is a subset of data from [the Long Term Ecological Research program (LTER) Network](https://lter.github.io/lterdatasampler/reference/index.html).

### 1.1 Run some code to look at a few rows of the `giant_sal` data

In [None]:
# sample response

# run code here
head(giant_sal)

### 1.2 Write a word equation to represent the hypothesis that location affects salamander length

<div class="teacher-note">

<b>Sample Response</b>: length_mm = location + other stuff
    
</div>

## 2. Explore Variation in `length_mm`
### 2.1 Make a data visualization to explore the hypothesis that location affects salamander length

In [None]:
# sample response

# make a data visualization
gf_jitter(length_mm ~ location, data = giant_sal, width = .1)

### 2.2 Based on your visualization, do you think knowing `location` helps you make a better prediction of `length_mm`? What specific patterns or features do you notice in your visualization that support your answer?

<div class="teacher-note">

<b>Sample Response</b>: Yes, I would predict the length was a little longer if I knew the salamander was from the clear cut forest area.
 
</div>

## 3. Model Variation in `length_mm`

### 3.1 Fit and save the best-fitting model of `length_mm` using `location` as the explanatory variable. Then overlay your model onto your data visualization in the code block below.

In [None]:
# sample response

# fit and save the model
location_model <- lm(length_mm ~ location, data = giant_sal)

# place your model on a data visualization
gf_jitter(length_mm ~ location, data = giant_sal, width = .1) %>%
  gf_model(location_model)


### 3.2 Print out the parameter estimates for your model (using the code cell below). What is the estimate $b_0$? What is the estimate $b_1$?

In [None]:
# sample response

# print out the parameter estimates
location_model

<div class="teacher-note">

<b>Sample Response</b>: $b_0 = 62.30$; $b_1 = -14.77$
 
</div>

### 3.3 Write an expression in GLM notation that will generate predictions of the location model. Use $X_i$ to represent location.

<div class="teacher-note">

<b>Sample Response</b>: Model prediction = $62.30 + (-14.77)X_i$
 
</div>

### 3.4 What does it mean in this context when $X_i= 1$?

<div class="teacher-note">

<b>Sample Response</b>: We can look at the output of `lm()` and see that when $X_i = 1$, the location is old growth forest. When $X_i = 0$, the salamander was found in clear cut forest, not in old growth forest.
 
</div>
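
To see the 0/1 coding directly, the sketch below uses base R's `model.matrix()` on a small made-up data frame. The data frame `toy`, its six values, and the name `toy_model` are hypothetical and are not taken from `giant_sal`; the point is only to show how R turns a two-level `location` variable into $X_i$.

```r
# Hypothetical toy data (NOT the real giant_sal values), for illustration only
toy <- data.frame(
  location  = c("clear cut", "clear cut", "clear cut",
                "old growth", "old growth", "old growth"),
  length_mm = c(60, 64, 62, 45, 50, 49)
)

# Fit the two-group model, as in section 3.1
toy_model <- lm(length_mm ~ location, data = toy)

# model.matrix() shows the X values that lm() actually used.
# The column "locationold growth" is X_i: 1 for old growth, 0 otherwise.
X <- model.matrix(toy_model)
print(X)
```

R picks the reference group (the one coded 0) alphabetically, which is why "clear cut" gets 0 and "old growth" gets 1 in this sketch.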


### 3.5 What does the $b_1$ estimate mean in this context? Why is it negative?

<div class="teacher-note">

<b>Sample Response</b>: The $b_1$ estimate is -14.77. It is the amount we must add to $b_0$ in order to get the predicted length for the old growth forest salamanders (i.e., when $X_i = 1$). It is negative because the salamanders living in old growth forest are, on average, shorter than those living in clear cut forest.
 
</div>
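
One way to check this interpretation is to compare the parameter estimates with the group means. The sketch below does this in base R on a hypothetical six-row data frame (`toy` and its values are invented for illustration, not drawn from `giant_sal`). If the interpretation is right, $b_0$ should equal the mean of the reference group and $b_1$ should equal the difference between the two group means.

```r
# Hypothetical toy data (NOT the real giant_sal values), for illustration only.
# By construction, the clear cut mean is 62 and the old growth mean is 48.
toy <- data.frame(
  location  = c("clear cut", "clear cut", "clear cut",
                "old growth", "old growth", "old growth"),
  length_mm = c(60, 64, 62, 45, 50, 49)
)

toy_model <- lm(length_mm ~ location, data = toy)

# Parameter estimates from the fitted model
b0 <- as.numeric(coef(toy_model)[1])
b1 <- as.numeric(coef(toy_model)[2])

# Group means computed directly from the data
group_means     <- tapply(toy$length_mm, toy$location, mean)
mean_clear_cut  <- as.numeric(group_means["clear cut"])
mean_old_growth <- as.numeric(group_means["old growth"])

# b0 matches the clear cut mean; b1 matches (old growth mean - clear cut mean)
print(c(b0 = b0, mean_clear_cut = mean_clear_cut))
print(c(b1 = b1, difference = mean_old_growth - mean_clear_cut))
```

Students can run the same comparison on `giant_sal` to confirm that a negative $b_1$ simply reflects a lower mean in the group coded 1.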

### 3.6 Use your model with the `predict()` function to generate predicted values for each salamander in the `giant_sal` data frame. Store these predictions as a new column in the data frame.

In [None]:
# sample response

# generate predictions and save them back into giant_sal$location_prediction
giant_sal$location_prediction <- predict(location_model)
# take a look at the updated giant_sal data frame
head(giant_sal)

### 3.7 How many different predictions can your model make? Explain how the model uses the parameter estimates $b_0$ and $b_1$ to make each prediction.

<div class="teacher-note">

<b>Sample Response</b>: 

The model makes two predictions.
    
1. The predicted length in the clear cut location is $b_0 + b_1(0)$ or just $b_0$.
    
2. The predicted length in the old growth location is $b_0 + b_1(1)$ or just $b_0+b_1$.    
    
</div>
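
The two predictions can also be computed by hand and compared with what `predict()` returns. The following is a minimal sketch on hypothetical data (the data frame `toy` and the names `by_hand` and `from_predict` are made up for this example); it assumes the same 0/1 coding described above, with old growth coded as 1.

```r
# Hypothetical toy data (NOT the real giant_sal values), for illustration only
toy <- data.frame(
  location  = c("clear cut", "clear cut", "clear cut",
                "old growth", "old growth", "old growth"),
  length_mm = c(60, 64, 62, 45, 50, 49)
)

toy_model <- lm(length_mm ~ location, data = toy)
b0 <- as.numeric(coef(toy_model)[1])
b1 <- as.numeric(coef(toy_model)[2])

# X_i is 1 for old growth and 0 for clear cut
x <- as.numeric(toy$location == "old growth")

# Prediction by hand: b0 + b1 * X_i
by_hand <- b0 + b1 * x

# Prediction from predict(), one value per row of the data
from_predict <- as.numeric(predict(toy_model))

# The two columns should agree, and only two distinct values should appear
print(data.frame(location = toy$location, by_hand, from_predict))
```

However many rows the data frame has, a two-group model only ever produces two distinct predicted values, one per group.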

## 4. Compare the Two-Group Model to the Empty Model

### 4.1 Run the code below to visualize the `location_model` on a jitter plot
Notice that we added some code to make the model predictions red.

In [None]:
# run this code
gf_jitter(length_mm ~ location, data = giant_sal, width = .1) %>%
  gf_model(location_model, color="red")

### 4.2 Now add some code in the cell below to fit and save the empty model as `empty_model`, and then add the empty model onto the graph as well.

In [None]:
# sample response

# fit and save the empty model here
empty_model <- lm(length_mm ~ NULL, data=giant_sal)
# run this code
gf_jitter(length_mm ~ location, data = giant_sal, width = .1) %>%
  gf_model(location_model, color="red") %>%
  gf_model(empty_model)

### 4.3 Compare the empty model to the group model. Which model makes better predictions? Which model has less error? 

<div class="teacher-note">

<b>Sample Response</b>: The group model's predictions are higher for salamanders in the clear-cut forest and lower for those in the old-growth forest. This matches the actual data because salamanders tend to be longer in the clear-cut forest and shorter in the old-growth forest. So, the group model's predictions match the data better than the empty model's single prediction does. Because the group model's predictions are closer to the actual lengths, there is less error around them.
    
</div>
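
"Less error" can be made concrete by adding up the squared residuals from each model. The sketch below is one possible way to do that in base R, using hypothetical data rather than `giant_sal` (the data frame `toy` and the names `toy_empty`, `toy_group`, `ss_empty`, and `ss_group` are invented for this example). It assumes sum of squared residuals as the measure of error.

```r
# Hypothetical toy data (NOT the real giant_sal values), for illustration only
toy <- data.frame(
  location  = c("clear cut", "clear cut", "clear cut",
                "old growth", "old growth", "old growth"),
  length_mm = c(60, 64, 62, 45, 50, 49)
)

# Empty model: one prediction (the grand mean) for every salamander
toy_empty <- lm(length_mm ~ NULL, data = toy)

# Two-group model: one prediction per location
toy_group <- lm(length_mm ~ location, data = toy)

# Sum of squared residuals (leftover error) for each model
ss_empty <- sum(resid(toy_empty)^2)
ss_group <- sum(resid(toy_group)^2)

print(c(ss_empty = ss_empty, ss_group = ss_group))
```

A smaller sum of squares for the group model means its predictions sit closer to the data, which is the pattern the jitter plot shows visually.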