<div class="alert alert-block alert-success">

**Reference Guide for R (student resource) -** Check out our <a href = "https://docs.google.com/document/d/1dXT3NO4QjtbIrq29fH6pXHKgt7odZN7BIeU_TBoia4I/edit?usp=sharing">reference guide</a> for a full listing of useful R commands for this project. 

</div>

<img src="https://skewthescript.org/s/erasing_student_debt.jpg">

## Data Science Project: Use data to determine the best and worst colleges for conquering student debt.

### Notebook 2: Simple Linear Regression

Does college pay off? We'll use some of the latest data from the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a> to answer that question. 

In this notebook (the 2nd of 4 total notebooks), you'll use R to create scatterplots, fit simple linear regression models, and compare the strength of your models. By the end of this notebook, you'll see what factors make certain colleges better investments than others.

In [None]:
## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command loads a useful package of R commands
library(coursekata)

<div class="alert alert-block alert-success">

### The Dataset (`four_year_colleges.csv`)

**General description** - In this notebook, we'll be using the `four_year_colleges.csv` file, which only includes schools that offer four-year bachelor's degrees and/or higher graduate degrees. Community colleges and trade schools often have different goals (e.g. facilitating transfers, direct career education) than institutions that offer four-year bachelor's degrees. By comparing four-year colleges only to other four-year colleges, we'll have clearer analyses and conclusions. 

This data is a subset of the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a>. The data is current as of the 2020-2021 school year.

**Description of all variables:** See <a href="https://docs.google.com/document/d/1C3eR6jZQ2HNbB5QkHaPsBfOcROZRcZ0FtzZZiyyS9sQ/edit">here</a>

**Detailed data file description:** See <a href="https://docs.google.com/spreadsheets/d/1fa_Bd3_eYEmxvKPcu3hK2Dgazdk-9bkeJwONMS6u43Q/edit?usp=sharing">here</a></div>

### 1.0 - Creating scatterplots

To begin, let's download our data. We'll download the `four_year_colleges.csv` file from the `skewthescript.org` website and store it in an R dataframe called `dat`.

In [None]:
## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads the data
dat <- read.csv('https://skewthescript.org/s/four_year_colleges.csv')

<div class="alert alert-block alert-warning"> 

**1.1 -** Use the `head` command to print out the first several rows of the dataset.
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-warning"> 

**1.2 -** Use the `dim` command to find the number of colleges (rows) and number of variables (columns) in our dataset.
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-info">

**Check yourself:** Your code should have printed out two numbers: 1053 and 26.</div>

A good measure of whether attending a certain college "pays off" is its **student loan default rate**. If a college is low-cost and prepares students for high-paying jobs, few students will default on their loans. If a college is high-cost and does not prepare students for high-paying jobs, many students will have trouble paying off their loans (high default rate).

So, our main **outcome variable** in this analysis will be `default_rate`. We're going to use scatterplots to see how strongly different **predictor variables** correlate with default rates. In particular, we're going to explore how well each of the following variables predicts colleges' default rates:
- `pct_PELL` - percent of student body that receives [PELL grants](https://www.benefits.gov/benefit/417). Note: PELL grants are government scholarships given to students from low-income families
- `grad_rate` - percent of students who successfully graduate
- `net_tuition` - Net tuition (tuition minus average discounts and allowances) per student, in thousands of dollars

To begin, let's create a **scatterplot** of colleges' default rates and the percent of their student body that receive PELL grants. We can use the `gf_point` command to make the graph:

In [None]:
## Run this code but do not edit it
# Create scatterplot: default_rate ~ pct_PELL
gf_point(default_rate ~ pct_PELL, data = dat)

We see that there's a positive relationship between `pct_PELL` and `default_rate`. The colleges with the highest rates of PELL grant recipients (low-income students) also tend to have higher student loan default rates. In other words, if you were to fit a model to this data, it would predict higher default rates at schools that serve more PELL recipients.
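To put a number on the direction of a relationship, one option is the correlation coefficient. The sketch below uses a small made-up dataset (the `toy` dataframe and its `pct_aid` column are hypothetical, not part of `dat`) and base R's `cor` command. It only illustrates the idea of a positive relationship and is not an analysis of the real data.

```r
# Hypothetical toy data: six made-up colleges (NOT from the College Scorecard)
toy <- data.frame(
  pct_aid      = c(10, 20, 30, 40, 50, 60),        # made-up predictor values
  default_rate = c(1.5, 2.0, 4.1, 5.2, 6.8, 9.0)   # made-up outcome values
)

# cor() returns the correlation coefficient r, which lies between -1 and 1
r <- cor(toy$pct_aid, toy$default_rate)

# A positive r means the outcome tends to rise as the predictor rises
r > 0
```

If the points trended downward instead, `r` would come out negative.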

We must keep in mind: **correlation is not causation**. The scatterplot shows us that default rates and PELL recipient rates are positively correlated. However, the graph doesn't show us a clear causal explanation behind the correlation. For example, here are several causal explanations that this graph can't clarify:
- PELL recipients may only be able to afford to attend low-quality colleges. These colleges have higher default rates because they fail to prepare students for the workforce.
- PELL recipients may have fewer familial resources to weather the storms of financial emergencies in the first few years after college. So, the schools that serve PELL recipients at high rates will also have more of their students defaulting on loans (regardless of the school's quality).
- PELL recipients may have attended lower-quality high schools, which don't properly prepare them for college. So, these students may drop out of college at higher rates, which raises their chances of defaulting on student loans. 

Or, it could be a combination of all those explanations! We can't tell from this analysis alone.

<div class="alert alert-block alert-warning"> 

**1.3 -** In the next question, you will create a scatterplot that visualizes the relationship between `grad_rate` and `default_rate`. Before doing so, make a prediction: Do you expect student loan default rates to positively or negatively correlate with graduation rates? Why?
    
</div>

**Double-click this cell to type your answer here:**

<div class="alert alert-block alert-warning"> 

**1.4 -** Create a scatterplot that visualizes the relationship between `grad_rate` (predictor) and `default_rate` (outcome).
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-info">

**Check yourself:** Your code should have generated a scatterplot with the x-axis labeled with `grad_rate` and the y-axis labeled with `default_rate`.</div>

<div class="alert alert-block alert-warning"> 

**1.5 -** Using your scatterplot, describe the relationship between graduation rates and student loan default rates. For instance, are these variables positively or negatively related? How can you tell? Does this corroborate your prediction from **Question 1.3**? Explain.
    
</div>

**Double-click this cell to type your answer here:** 

### 2.0 - Simple linear regression (one predictor)

<div class="alert alert-block alert-warning"> 

**2.1 -** If you haven't taken AP Stats, watch <a href="https://youtu.be/hvWgu4A0VA4">this video</a>, which provides an introduction to linear regression. 

**Note:** This video is adapted from other materials and covers data from a separate context. However, the video provides a good intro to the concepts and models we'll be using in this section of the project.
    
</div>

Let's create a linear regression model relating `pct_PELL` (x) and `default_rate` (y). To visualize our model, we can graph the line modeled by our equation on top of the scatterplot relating `pct_PELL` to `default_rate`. We use the `gf_point` command to produce the scatterplot, the `gf_lm` command to graph our linear model, and the `%>%` symbol to put the elements together on the same graph:

In [None]:
## Run this code but do not edit it
# Overlay linear model of default_rate ~ pct_PELL on top of scatterplot
gf_point(default_rate ~ pct_PELL, data = dat) %>% gf_lm(color = "orange")

<div class="alert alert-block alert-warning"> 

**2.2 -** Is the slope value of this model positive or negative? How can you tell?
    
</div>

**Double-click this cell to type your answer here:** 

R can help us find the equation that models this linear regression line. As shown in the video, we can model a linear trend between a predictor (x) and outcome (y) using this linear regression formula:

$$
\hat{y} = \beta_{0} + \beta_{1}x
$$
Where:
- $\hat{y}$ (pronounced "y hat") is the predicted y-value (predicted outcome value)
- $\beta_{0}$ (pronounced "beta zero") is the y-intercept --> the predicted y-value (outcome value) when x = 0 (the predictor's value is 0)
- $\beta_{1}$ (pronounced "beta 1") is the slope --> the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- $x$ is the x-value (predictor value)

To fit a linear regression model to a set of data in R, we use the `lm` command. `lm` stands for "linear model." Here, we use `lm` to find the linear regression model relating `pct_PELL` (x) and `default_rate` (y).

In [None]:
## Run this code but do not edit it
# Create and display linear model: default_rate ~ pct_PELL
PELL_model <- lm(default_rate ~ pct_PELL, data = dat)
PELL_model

The output of the `lm` command is a bit clunky, but here's what it means:
- The `(Intercept)` value is the y-intercept ($\beta_{0}$)
- The `pct_PELL` value is the coefficient for the predictor. In other words, it's the slope ($\beta_{1}$)

So, our regression equation can be written as:

$$
\hat{y} = -0.9327 + (0.1765)x
$$
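
As a rough illustration of how the regression equation above produces a prediction, here is a minimal sketch that plugs in one predictor value. The coefficients are copied from the equation. The value `x = 40` is a hypothetical college, and the sketch assumes `pct_PELL` is recorded on a 0 to 100 scale. The names `b0`, `b1`, and `y_hat` are made up for this example.

```r
# Coefficients copied from the regression equation above
b0 <- -0.9327   # y-intercept (beta zero)
b1 <- 0.1765    # slope (beta one)

# Hypothetical college: 40% of students receive PELL grants
# (assumes pct_PELL is measured on a 0-100 scale)
x <- 40

# Predicted default rate: y-hat = b0 + b1 * x
y_hat <- b0 + b1 * x
y_hat
```

Changing `x` and re-running the last two lines shows how the prediction moves along the line.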

<div class="alert alert-block alert-warning">

**2.3 -** Identify the slope value and interpret what it means (in context).
    
</div>

**Double-click this cell to type your answer here:** 

<div class="alert alert-block alert-warning">

**2.4 -** Use the `gf_point` and `gf_lm` commands to visualize a linear regression model for predicting `default_rate` (outcome) using `grad_rate` (predictor).
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-info">

**Check yourself:** Your scatterplot should have a line on it with a negative slope.</div>

<div class="alert alert-block alert-warning">

**2.5 -** Use the `lm` command to find the linear regression model you visualized above. Store the model in an object called `grad_model` and print it to see its values.
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-info">

**Check yourself:** If you print out `grad_model`, you should see two numbers: 14.46 and -0.1584.</div>

<div class="alert alert-block alert-warning">

**2.6 -** Identify the slope value and interpret what it means (in context).
    
</div>

**Double-click this cell to type your answer here:** 

### 3.0 - Analyzing strength $(R^2)$

In addition to the direction of a relationship (positive or negative), we can also look at the **strength** of a relationship. The strength is a measure of the **quality of our model's predictions.** A key metric for analyzing the strength of a model is $R^2$. The following diagram (from <a href = "https://skewthescript.org/3-3-a">Skew The Script</a>) shows the $R^2$ values of various linear models:

<img src="https://skewthescript.org/s/r_squared.PNG">

In the "weak" correlations, we see that our predictions (the linear model) tend to be far away from the actual data values (the points). If we used a model with weak correlation to predict **new** data values, our predictions would have high error. If we used a model with strong correlation to predict **new** data values, our predictions would have low error.

$R^2$ takes values between 0 and 1 (alternatively: 0% and 100%). The stronger the model, the closer $R^2$ gets to 1 (or 100%). The weaker the model, the closer $R^2$ gets to 0 (or 0%). An intuitive way to think about it: for the perfectly strong correlations, the model gives 100% perfect predictions. The models explain 100% of the variation in the data, so $R^2 = 100\%$. As the correlations get weaker, they start leaving room for error, since the models capture less of the variation in the data. So, the $R^2$ value declines from 100%, approaching 0% if there's no correlation (model adds no prediction power compared to naive guessing).
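
To make the "compared to naive guessing" idea concrete, here is a minimal sketch that computes $R^2$ by hand on made-up data. It uses a standard definition: 1 minus the model's squared error divided by the squared error from always guessing the mean. The vectors `x` and `y` and the names `toy_model`, `ss_resid`, and `ss_total` are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data (made up for illustration)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# Fit a simple linear regression model
toy_model <- lm(y ~ x)

# Squared error left over after using the model's predictions
ss_resid <- sum((y - fitted(toy_model))^2)

# Squared error from naive guessing (always predicting the mean of y)
ss_total <- sum((y - mean(y))^2)

# R^2: share of the variation in y that the model explains
r_squared <- 1 - ss_resid / ss_total
r_squared
```

Because these toy points fall almost exactly on a line, the model leaves very little error behind, so `r_squared` lands close to 1.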

**Optional Resource:** If you'd like a more thorough explanation of the math behind $R^2$, check out <a href="https://youtu.be/bMccdk8EdGo">this video</a>.

To see the $R^2$ values of our linear regression models, we can use the `summary` command. For example, here we get the `summary` printout of `grad_model`.

In [None]:
## Run this code but do not edit it
# Summarize default_rate ~ grad_rate model
summary(grad_model)

There's a lot going on in this printout. For now, focus on the bottom of the printed information. The `Multiple R-squared` value is the $R^2$ value for the model. In this case, $R^2 = 51.5\%$. So, we can say that the correlation between graduation rates and student loan default rates is moderately strong. This model would yield moderately strong predictions for default rates if used to predict on new colleges.
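
If you'd rather pull out just the $R^2$ value instead of reading the full printout, the object returned by `summary` stores it in a component called `r.squared`. Here is a minimal sketch on made-up data. The `toy` dataframe and the names `toy_model` and `toy_r2` are hypothetical.

```r
# Hypothetical toy data (made up for illustration)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(3.0, 2.2, 4.9, 4.1, 7.2, 5.8)
)

# Fit a simple linear regression model
toy_model <- lm(y ~ x, data = toy)

# Extract only the R-squared value from the summary
toy_r2 <- summary(toy_model)$r.squared

# Show it as a percentage, rounded to one decimal place
round(toy_r2 * 100, 1)
```

The same `$r.squared` pattern should work on any model created with `lm`.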

<div class="alert alert-block alert-warning">

**3.1 -** Let's consider a new variable: `net_tuition` (tuition minus average discounts and allowances per student, in thousands of dollars). How well does a school's tuition predict its student loan default rate? Let's start exploring. Go ahead and create a scatterplot that visualizes the relationship between `net_tuition` (predictor) and `default_rate` (outcome). **Overlay a linear regression model on the graph** using the `%>% gf_lm(color = "orange")` command.
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-warning">

**3.2 -** Use the `lm` command to find the linear regression model you visualized above. Store the model in an object called `tuition_model` and print out the model's values.
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-info">

**Check yourself:** If you print out `tuition_model`, you should see two numbers: 8.0029 and -0.2077.</div>

<div class="alert alert-block alert-warning">

**3.3 -** Use the `summary` command to find the $R^2$ value of your linear model.
    
</div>

In [None]:
# Your code goes here


<div class="alert alert-block alert-info">

**Check yourself:** The $R^2$ value for `tuition_model` should be 0.1882.</div>

<div class="alert alert-block alert-warning">

**3.4 -** When evaluating different college options to predict if attending them would "pay off," many students look very closely at the tuition and costs of attending. Very few students look at colleges' graduation rates. Is this reasonable or a mistake? Justify your answers using the $R^2$ values for the `grad_model` and `tuition_model`.
    
</div>

**Double-click this cell to type your answer here:** 

<div class="alert alert-block alert-warning">

**3.5 -** The correlation between tuition costs and student loan default rates is **negative**. This means that as tuition costs get higher, **fewer** students tend to default on their student loans. Is that possible? What might be going on here?
    
</div>

**Double-click this cell to type your answer here:** 

### Feedback (Required)

Please take 2 minutes to fill out <a href="https://forms.gle/ePwTHdSeAc8FvVjg7">this anonymous notebook feedback form</a>, so we can continue improving this notebook for future years!