<div class="alert alert-block alert-info"><b>Instructor Note:</b> This project is intended for AP Statistics and/or AP Computer Science students seeking to learn modern data science skills. We'll use data from the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a> to see which colleges are worth the price of admission. Instructions, walkthrough videos, and additional support for teachers can be found <a href="https://skewthescript.org/data-science-challenge">here</a>.
<br>In this notebook (the 2nd of 4 total notebooks), students will...
<li>Create scatterplots </li><li>Fit simple linear regression models </li><li>Compare the strength of predictor variables</li></div>

<div class="alert alert-block alert-success">

**Reference Guide for Python (student resource) -** Check out our <a href = "https://docs.google.com/document/d/1DaWN9HWInSBxSMhU0b5BetHlBM4n-ylqDiA8SJrhb7c/edit?tab=t.0">reference guide</a> for a full listing of useful Python commands for this project. 

</div>

<img src="https://skewthescript.org/s/erasing_student_debt.jpg">

## Data Science Project: Use data to determine the best and worst colleges for conquering student debt.

### Notebook 2: Simple Linear Regression

Does college pay off? We'll use some of the latest data from the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a> to answer that question. 

In this notebook (the 2nd of 4 total notebooks), you'll use Python to create scatterplots, fit simple linear regression models, and compare the strength of your models. By the end of this notebook, you'll see what factors make certain colleges better investments than others.

In [None]:
## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads a useful package of Python commands
%pip install seaborn

import pyodide_http
pyodide_http.patch_all()

import pandas as pd
import statsmodels.formula.api as smf  # For statistical modeling
import matplotlib.pyplot as plt  # For creating static visualizations
import seaborn as sns  # For advanced data visualization

<div class="alert alert-block alert-success">

### The Dataset (`four_year_colleges.csv`)

**General description** - In this notebook, we'll be using the `four_year_colleges.csv` file, which only includes schools that offer four-year bachelor's degrees and/or higher graduate degrees. Community colleges and trade schools often have different goals (e.g. facilitating transfers, direct career education) than institutions that offer four-year bachelor's degrees. By comparing four-year colleges only to other four-year colleges, we'll have clearer analyses and conclusions. 

This data is a subset of the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a>. The data is current as of the 2020-2021 school year.

**Description of all variables:** See <a href="https://docs.google.com/document/d/1C3eR6jZQ2HNbB5QkHaPsBfOcROZRcZ0FtzZZiyyS9sQ/edit">here</a>

**Detailed data file description:** See <a href="https://docs.google.com/spreadsheets/d/1fa_Bd3_eYEmxvKPcu3hK2Dgazdk-9bkeJwONMS6u43Q/edit?usp=sharing">here</a></div>

### 1.0 - Creating scatterplots

To begin, let's download our data. Our full dataset is included in a file named `four_year_colleges.csv`, which we are retrieving from a public GitHub repository that we use to store data files. The command below downloads the data from the file and stores it in a pandas DataFrame object called `four_year_col`. The Python package pandas is useful for data manipulation and analysis.

In [None]:
## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads the data
four_year_col = pd.read_csv('https://ds-modules.github.io/ds-challenge-assets/four_year_colleges.csv')

<div class="alert alert-block alert-warning"> 

**1.1 -** Use the `head` command to print out the first several rows of the dataset.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-warning"> 

**1.2 -** Use the `shape` command to find the number of colleges (rows) and number of variables (columns) in our dataset.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** Your code should have printed out two numbers: 1053 and 26.</div>

A good measure of whether attending a certain college "pays off" is its **student loan default rate**. If a college is low-cost and prepares students for high-paying jobs, few students will default on their loans. If a college is high-cost and does not prepare students for high-paying jobs, many students will have trouble paying off their loans (high default rate).

So, our main **outcome variable** in this analysis will be `default_rate`. We're going to use scatterplots to see how strongly different **predictor variables** correlate with default rates. In particular, we're going to explore how well each of the following variables predicts colleges' default rates:
- `pct_PELL` - percent of student body that receives [PELL grants](https://www.benefits.gov/benefit/417). Note: PELL grants are government scholarships given to students from low-income families
- `grad_rate` - percent of students who successfully graduate
- `net_tuition` - Net tuition (tuition minus average discounts and allowances) per student, in thousands of dollars

To begin, let's create a **scatterplot** of colleges' default rates and the percent of their student body that receive PELL grants. We can use the seaborn command `scatterplot` on the pandas DataFrame, `four_year_col`:

In [None]:
## Run this code but do not edit it
# Create scatterplot comparing default_rate (outcome) to pct_PELL (predictor)
sns.scatterplot(x='pct_PELL', y='default_rate', data=four_year_col, alpha=0.7)
plt.title('Scatterplot: Default Rate vs Percentage of PELL')
plt.show()

<div class="alert alert-block alert-info"><b>Instructor Note:</b> If the x and y values tend to rise and fall together (e.g. as x increases, y tends to increase | as x decreases, y tends to decrease), we say the variables are <b>positively</b> related. A linear model of the data would have a positive slope. If the x and y values are inversely related (e.g. as x increases, y tends to decrease | as x decreases, y tends to increase), we say the variables are <b>negatively</b> related. A linear model of the data would have a negative slope.
</div>

We see that there's a positive relationship between `pct_PELL` and `default_rate`. The colleges with the highest rates of PELL grant recipients (low-income students) also tend to have higher student loan default rates. In other words, if you were to fit a model to this data, it would predict higher default rates at schools that serve more PELL recipients.
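One common way to put a number on the direction of a relationship is the correlation coefficient, $r$: it is positive when x and y tend to rise together and negative when one tends to fall as the other rises. The sketch below computes it by hand on made-up values (not taken from `four_year_colleges.csv`). The function name `pearson_r` and the `toy_` lists are hypothetical, chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much each varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up values for illustration only
toy_x = [1, 2, 3, 4, 5]
toy_y_up = [2, 4, 5, 4, 5]    # tends to rise as toy_x rises
toy_y_down = [5, 4, 5, 4, 2]  # tends to fall as toy_x rises

r_up = pearson_r(toy_x, toy_y_up)
r_down = pearson_r(toy_x, toy_y_down)
print(f"r for the rising pattern:  {r_up:.3f}")
print(f"r for the falling pattern: {r_down:.3f}")
```

The sign of $r$ matches the sign of the slope you would get from a linear model of the same points.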

We must keep in mind: **correlation is not causation**. The scatterplot shows us that default rates and PELL recipient rates are positively correlated. However, the graph doesn't show us a clear causal explanation behind the correlation. For example, here are several causal explanations that this graph can't clarify:
- PELL recipients may only be able to afford to attend low-quality colleges. These colleges have higher default rates because they fail to prepare students for the workforce.
- PELL recipients may have fewer family resources to help them weather financial emergencies in the first few years after college. So, the schools that serve PELL recipients at high rates will also have more of their students defaulting on loans (regardless of the school's quality).
- PELL recipients may have attended lower-quality high schools, which don't properly prepare them for college. So, these students may drop out of college at higher rates, which raises their chances of defaulting on student loans. 

Or, it could be a combination of all those explanations! We can't tell from this analysis alone.

<div class="alert alert-block alert-warning"> 

**1.3 -** In the next question, you will create a scatterplot that visualizes the relationship between `grad_rate` and `default_rate`. Before doing so, make a prediction: Do you expect student loan default rates to positively or negatively correlate with graduation rates? Why?

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-info">

**Instructor Note:** As stated in the dataset description above, <code>default_rate</code> describes the percent of <b>all</b> of a school's borrowers that are in default on their student loans. This includes students who have graduated, transferred, or did not complete their programs.
</div>

<div class="alert alert-block alert-warning"> 

**1.4 -** Create a scatterplot that visualizes the relationship between `grad_rate` (predictor) and `default_rate` (outcome).

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** Your code should have generated a scatterplot with the x-axis labeled with `grad_rate` and the y-axis labeled with `default_rate`.</div>

<div class="alert alert-block alert-warning"> 

**1.5 -** Using your scatterplot, describe the relationship between graduation rates and student loan default rates. For instance, are these variables positively or negatively related? How can you tell? Does this corroborate your prediction from **Question 1.3**? Explain.

</div>

**Double-click this cell to type your answer here:** ...

### 2.0 - Simple linear regression (one predictor)

<div class="alert alert-block alert-warning"> 

**2.1 -** If you haven't taken AP Stats, watch <a href="https://youtu.be/hvWgu4A0VA4">this video</a>, which provides an introduction to linear regression.

**Note:** This video is adapted from other materials and covers data from a separate context. However, the video provides a good intro to the concepts and models we'll be using in this section of the project.

</div>

<div class="alert alert-block alert-info"><b>Instructor Note:</b> If your students want to cover the discussion question at the end of the video, you can download the handout key or slide deck from the <a href = "https://skewthescript.org/3-2">full Skew The Script lesson page</a>, which has useful teaching materials for covering the discussion question. Note: Covering the discussion question is not required for successful completion of this project.
</div>

Let's create a linear regression model relating `pct_PELL` (x) and `default_rate` (y). To visualize our model, we can graph the line modeled by our equation on top of the scatterplot relating `pct_PELL` to `default_rate`. We use the seaborn `lmplot` command to produce the scatterplot and linear model. The parameters `line_kws` and `scatter_kws` allow us to pass keyword arguments to the line and to the scatterplot points, respectively:

In [None]:
## Run this code but do not edit it
# Overlay linear model of default_rate and pct_PELL on top of scatterplot
# Scatterplot with regression line
sns.lmplot(x='pct_PELL', y='default_rate', data=four_year_col, line_kws={'color': 'orange'}, scatter_kws={'alpha': 0.7})
plt.title('Scatterplot with Linear Model: Default Rate vs Percentage of PELL')
plt.show()

<div class="alert alert-block alert-warning"> 

**2.2 -** Is the slope value of this model positive or negative? How can you tell?

</div>

**Double-click this cell to type your answer here:** ...

Python can help us find the equation that models this linear regression line. As shown in the video, we can model a linear trend between a predictor (x) and outcome (y) using this linear regression formula:

$$
\hat{y} = \beta_{0} + \beta_{1}x
$$
Where:
- $\hat{y}$ (pronounced "y hat") is the predicted y-value (predicted outcome value)
- $\beta_{0}$ (pronounced "beta zero") is the y-intercept --> the predicted y-value (outcome value) when x = 0 (the predictor's value is 0)
- $\beta_{1}$ (pronounced "beta 1") is the slope --> the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- $x$ is the x-value (predictor value)
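To make the formula concrete, here is a minimal sketch of one standard way to choose $\beta_{0}$ and $\beta_{1}$: the least-squares formulas, written out in plain Python. The data are made up (not from `four_year_colleges.csv`), and the names `fit_line`, `toy_x`, and `toy_y` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up values for illustration only
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

beta_0, beta_1 = fit_line(toy_x, toy_y)
print(f"Intercept (beta_0): {beta_0:.2f}")
print(f"Slope (beta_1): {beta_1:.2f}")

# Predicted y-value (y hat) for a new x-value of 6
y_hat = beta_0 + beta_1 * 6
print(f"Predicted y when x = 6: {y_hat:.2f}")
```

For these toy points, the fitted line works out to $\hat{y} = 2.2 + 0.6x$.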

To fit a linear regression model to a set of data in Python, we use another package called `statsmodels`, which provides a set of modeling functions under the name `statsmodels.formula.api`. Its `ols` command stands for "ordinary least squares," a method for determining a best-fit line that you may have used in high school math. Here, we use `ols` to find the linear regression model relating `pct_PELL` (x) and `default_rate` (y). We then extract the y-intercept ($\beta_{0}$) and slope ($\beta_{1}$) of the best-fit line. A common way to represent the relationship between variables is the notation `default_rate ~ pct_PELL`, where the outcome variable (`default_rate`) is modeled as a function of the predictor variable (`pct_PELL`). We will use this notation frequently.

In [None]:
## Run this code but do not edit it
# Create and display linear model: default_rate ~ pct_PELL
pell_model = smf.ols(formula='default_rate ~ pct_PELL', data=four_year_col).fit()

print(f"Intercept: {pell_model.params['Intercept']}")
print(f"Slope: {pell_model.params['pct_PELL']}")

The printed output is a bit clunky, but here's what it means:
- The `pell_model.params['Intercept']` value is the y-intercept ($\beta_{0}$)
- The `pell_model.params['pct_PELL']` value is the coefficient for the predictor. In other words, it's the slope ($\beta_{1}$)

So, our regression equation can be written as:

$$
\hat{y} = -0.9327 + (0.1765)x
$$
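
As a quick worked example of reading this equation, the sketch below plugs one made-up value into it. The coefficients are copied from the equation above. The value 40 is a hypothetical school where 40% of students receive PELL grants, and the sketch assumes that `pct_PELL` and `default_rate` are both recorded on a 0 to 100 percent scale. The function name `predict_default_rate` is also hypothetical.

```python
# Coefficients copied from the regression equation above
INTERCEPT = -0.9327  # beta_0
SLOPE = 0.1765       # beta_1

def predict_default_rate(pct_pell):
    """Return the model's predicted default rate for a given pct_PELL value."""
    return INTERCEPT + SLOPE * pct_pell

# Hypothetical school (assumption: pct_PELL is on a 0-100 scale)
example_pct_pell = 40
predicted = predict_default_rate(example_pct_pell)
print(f"Predicted default rate: {predicted:.4f}")
```

Keep in mind that this is only the model's prediction. Actual schools with the same `pct_PELL` value can sit above or below the line.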

<div class="alert alert-block alert-warning"> 

**2.3 -** Identify the slope value and interpret what it means (in context).

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-warning"> 

**2.4 -** Use the `seaborn` package's `lmplot` command to visualize a linear regression model for predicting `default_rate` (outcome) using `grad_rate` (predictor).

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** Your scatterplot should have a line on it with a negative slope.</div>

<div class="alert alert-block alert-warning"> 

**2.5 -** Use the `statsmodels.formula.api` package's `ols` command to find the linear regression model you visualized above. Store the model in an object called `grad_model` and extract the intercept and slope as we did earlier.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** If you print out the intercept and slope of `grad_model`, you should see two numbers: approximately 14.46 and -0.1584.</div>

<div class="alert alert-block alert-warning"> 

**2.6 -** Identify the slope value and interpret what it means (in context).

</div>

**Double-click this cell to type your answer here:** ...

### 3.0 - Analyzing strength $(R^2)$

In addition to the direction of a relationship (positive or negative), we can also look at the **strength** of a relationship. The strength is a measure of the **quality of our model's predictions.** A key metric for analyzing the strength of a model is $R^2$. The following diagram (from <a href = "https://skewthescript.org/3-3-a">Skew The Script</a>) shows the $R^2$ values of various linear models:

<img src="https://skewthescript.org/s/r_squared.PNG">

In the "weak" correlations, we see that our predictions (the linear model) tend to be far away from the actual data values (the points). If we used a model with weak correlation to predict **new** data values, our predictions would have high error. If we used a model with strong correlation to predict **new** data values, our predictions would have low error.

$R^2$ takes values between 0 and 1 (alternatively: 0% to 100%). The stronger the model, the closer $R^2$ gets to 1 (or 100%). The weaker the model, the closer $R^2$ gets to 0 (or 0%). An intuitive way to think about it: for perfectly strong correlations, the model gives 100% perfect predictions. The model explains 100% of the variation in the data, so $R^2 = 100\%$. As the correlations get weaker, they start leaving room for error, since the models capture less of the variation in the data. So, the $R^2$ value declines from 100%, approaching 0% if there's no correlation (the model adds no prediction power compared to naive guessing).
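
The idea of "variation explained" can be sketched in a few lines of plain Python. The example below uses made-up values (not from `four_year_colleges.csv`) and treats "naive guessing" as always predicting the average y-value, which is one common reading of that phrase. The names `r_squared`, `toy_x`, and `toy_y` are hypothetical.

```python
def r_squared(ys, predictions):
    """Return R^2: the share of variation in ys captured by the predictions."""
    mean_y = sum(ys) / len(ys)
    # Total variation: squared distances from each y to the average y
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Leftover variation: squared distances from each y to its prediction
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return 1 - ss_residual / ss_total

# Made-up values for illustration only
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

# Predictions from the least-squares line for these points: y-hat = 2.2 + 0.6x
line_predictions = [2.2 + 0.6 * x for x in toy_x]

# Naive guessing: predict the average y-value for every point
naive_predictions = [sum(toy_y) / len(toy_y)] * len(toy_y)

r2_line = r_squared(toy_y, line_predictions)
r2_naive = r_squared(toy_y, naive_predictions)
print(f"R^2 of the linear model: {r2_line:.2f}")
print(f"R^2 of naive guessing:   {r2_naive:.2f}")
```

Here the line captures 60% of the variation in the toy y-values, while naive guessing captures none of it.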

**Optional Resource:** If you'd like a more thorough explanation of the math behind $R^2$, check out <a href="https://youtu.be/bMccdk8EdGo">this video</a>.

To see the $R^2$ values of our linear regression models, we can use the `summary` command. For example, here we get the `summary` printout of `grad_model`.

In [None]:
## Run this code but do not edit it
# Summarize default_rate ~ grad_rate model
grad_model.summary()

There's a lot going on in this printout. For now, focus on the top-right of the printed information. The `R-squared` value is the $R^2$ value for the model. In this case, $R^2 = 0.515$ or $51.5\%$. So, we can say that the correlation between graduation rates and student loan default rates is moderately strong. This model would yield moderately strong predictions for default rates if used on new colleges.
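
If you only need the $R^2$ value and not the full printout, a fitted `statsmodels` model also stores it in its `rsquared` attribute. The sketch below shows this on a tiny made-up DataFrame rather than the college data. The names `toy_df`, `toy_x`, `toy_y`, and `toy_model` are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up values for illustration only (not from four_year_colleges.csv)
toy_df = pd.DataFrame({
    "toy_x": [1, 2, 3, 4, 5],
    "toy_y": [2, 4, 5, 4, 5],
})

# Fit the model toy_y ~ toy_x, just like default_rate ~ pct_PELL earlier
toy_model = smf.ols(formula="toy_y ~ toy_x", data=toy_df).fit()

# Read R^2 directly instead of searching for it in the summary table
toy_r2 = toy_model.rsquared
print(f"R-squared: {toy_r2:.3f}")
```

The same attribute should be available on any model fitted with `ols(...).fit()`, so it can serve as a quick check against the value shown in the `summary` printout.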

<div class="alert alert-block alert-warning"> 

**3.1 -** Let's consider a new variable: `net_tuition` (tuition minus average discounts and allowances per student, in thousands of dollars). How well does a school's tuition predict its student loan default rate? Let's start exploring. Go ahead and create a scatterplot that visualizes the relationship between `net_tuition` (predictor) and `default_rate` (outcome). **Overlay a linear regression model on the graph** as we did earlier in the notebook.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-warning"> 

**3.2 -** Use the `ols` command to find the linear regression model you visualized above. Store the model in an object called `tuition_model` and print out the model's intercept and slope.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** If you print out the intercept and slope of `tuition_model`, you should see two numbers: approximately 8.0029 and -0.2077.</div>

<div class="alert alert-block alert-warning"> 

**3.3 -** Use the `summary` command to find the $R^2$ value of your linear model.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** The $R^2$ value for `tuition_model` should be about 0.188 (the `summary` printout rounds 0.1882 to three decimal places).</div>

<div class="alert alert-block alert-warning"> 

**3.4 -** When evaluating different college options to predict if attending them would "pay off," many students look very closely at the tuition and costs of attending. Very few students look at colleges' graduation rates. Is this reasonable or a mistake? Justify your answers using the $R^2$ values for the `grad_model` and `tuition_model`.

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-warning"> 

**3.5 -** The correlation between tuition costs and student loan default rates is **negative**. This means that as tuition costs get higher, **fewer** students tend to default on their student loans. Is that possible? What might be going on here?

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-info">

**Instructor Note:** Spend some time on discussing this question, since it provides a nice setup for the next notebook (multiple regression).
</div>

<div class="alert alert-block alert-success">


### Summer Opportunity: Do you want to learn more about Data Science & AI?
Join our Data Science & AI Summer Bootcamp, where you'll take your learning from this project to the next level. **No prior coding or statistics experience required!** Designed by Harvard grads, the bootcamp allows students from all experience levels to dive deeper into data science concepts, from the basics (e.g. linear regression) to the advanced (e.g. AI neural networks). Students learn in a supportive and collaborative environment, and they walk away with their own real-world project that can be shared on college and internship applications.

📢 Scholarships are available! We're committed to making this opportunity accessible to all students.

📝 Applications are considered on a rolling basis. Final application deadline: **May 30, 2025**

🔗 Learn more and apply here: https://skewthescript.org/bootcamps
</div>
