<div class="alert alert-block alert-info"><b>Instructor Note:</b> This project is intended for AP Statistics and/or AP Computer Science students seeking to learn modern data science skills. We'll use data from the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a> to see which colleges are worth the price of admission. Instructions, walkthrough videos, and additional support for teachers can be found <a href="https://skewthescript.org/data-science-challenge">here</a>.
<br>In this notebook (the 3rd of 4 total notebooks), students will...
<li>Fit multiple regression models </li><li>Interpret the coefficients of multiple regression models </li><li>Compare the strength of different multiple regression models</li></div>

<div class="alert alert-block alert-success">

**Reference Guide for Python (student resource) -** Check out our <a href = "https://docs.google.com/document/d/1DaWN9HWInSBxSMhU0b5BetHlBM4n-ylqDiA8SJrhb7c/edit?tab=t.0">reference guide</a> for a full listing of useful Python commands for this project. 

</div>

<img src="https://skewthescript.org/s/college_price_tag.png">

## Data Science Project: Use data to determine the best and worst colleges for conquering student debt.

### Notebook 3: Multiple Regression

Does college pay off? We'll use some of the latest data from the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a> to answer that question. 

In this notebook (the 3rd of 4 total notebooks), you'll use Python to create a more advanced type of model: the multiple regression model. In doing so, you'll be able to isolate which factors (controlling for other variables) make certain colleges worth the price of admission.

In [None]:
## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads a useful package of Python commands
%pip install seaborn

import pyodide_http
pyodide_http.patch_all()

import pandas as pd
import statsmodels.formula.api as smf  # For statistical modeling
import matplotlib.pyplot as plt  # For creating static visualizations
import seaborn as sns  # For advanced data visualization

<div class="alert alert-block alert-success">

### The Dataset (`four_year_colleges.csv`)

**General description** - In this notebook, we'll be using the `four_year_colleges.csv` file, which only includes schools that offer four-year bachelor's degrees and/or higher graduate degrees. Community colleges and trade schools often have different goals (e.g. facilitating transfers, direct career education) than institutions that offer four-year bachelor's degrees. By comparing four-year colleges only to other four-year colleges, we'll have clearer analyses and conclusions. 

This data is a subset of the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a>. The data is current as of the 2020-2021 school year.

**Description of all variables:** See <a href="https://docs.google.com/document/d/1C3eR6jZQ2HNbB5QkHaPsBfOcROZRcZ0FtzZZiyyS9sQ/edit">here</a>

**Detailed data file description:** See <a href="https://docs.google.com/spreadsheets/d/1fa_Bd3_eYEmxvKPcu3hK2Dgazdk-9bkeJwONMS6u43Q/edit?usp=sharing">here</a></div>

### 1.0 - Motivating multiple regression

To begin, let's download our data. We'll download the `four_year_colleges.csv` file from a public GitHub repository that we're using to store data files, and we'll store it in a pandas DataFrame called `four_year_col`.


In [None]:
## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# This command downloads data from the file 'four_year_colleges.csv' and stores it in an object called `four_year_col`
four_year_col = pd.read_csv('https://ds-modules.github.io/ds-challenge-assets/four_year_colleges.csv')
four_year_col.head()

As before, we're going to use **student loan default rates** as our key **outcome variable** in determining whether college "pays off."

In the previous notebook, we looked at the following predictors of student loan default rates:
- `pct_PELL` - percent of student body that receives PELL grants. Note: PELL grants are government scholarships given to students from low-income families
- `grad_rate` - percent of students who successfully graduate
- `net_tuition` - Net tuition (tuition minus average discounts and allowances) per student, in thousands of dollars

In the last notebook, we fit a simple linear regression model to predict `default_rate` (outcome) using `net_tuition` (predictor). Below is the scatterplot we produced (along with a visual of our linear model):

In [None]:
## Run this code but do not edit it
# create scatterplot: default_rate ~ net_tuition, with linear model overlaid
sns.lmplot(x='net_tuition', y='default_rate', data=four_year_col, line_kws={'color': 'orange'}, scatter_kws={'alpha': 0.7})
plt.title('Scatterplot with Linear Model: Default Rate vs Net Tuition')
plt.show()

<div class="alert alert-block alert-info"><b>Instructor Note:</b> As stated in the dataset description, <code>net_tuition</code> is given in thousands of dollars. So, an x-value of 10 on our graph indicates a net tuition of $10,000.</div>

<div class="alert alert-block alert-warning">

**1.1 (Review Question) -** Is the correlation between tuition costs and student loan default rates positive or negative? Does the direction of the relationship surprise you? Why or why not?

</div>

**Double-click this cell to type your answer here:** The correlation is negative (as tuition increases, default rates tend to decrease). This is surprising. As tuition costs go up, you'd expect students to take on more debt. As they take on more debt, they'd be more likely to default on their loans. So, without seeing the data, I would have expected the relationship to be positive (as tuition rises, default rates also rise).

Below is the same graphic except, this time, we color the colleges by their graduation rates. Take a look:

In [None]:
## Run this code but do not edit it
# show scatter for default_rate ~ net_tuition, color by grad_rate
sns.scatterplot(
    x='net_tuition', 
    y='default_rate', 
    hue='grad_rate',  # Color points by grad_rate
    data=four_year_col,
    palette='viridis',
    alpha=0.7  # Adjust transparency for better visualization
)
plt.title('Scatterplot: Default Rate vs Net Tuition (Colored by Grad Rate)')
plt.show()

**Note:** As stated in the dataset description above, <code>default_rate</code> describes the percent of <b>all</b> of a school's borrowers that are in default on their student loans. This includes students who have graduated, transferred, or did not complete their programs.

**There's a lot going on in this graph.** For help, we recommend watching <a href="https://youtu.be/vEXWnOs72oQ">this video</a>, which discusses how to interpret graphs that visualize multiple variables at once.

<div class="alert alert-block alert-warning">

**1.2 -** Look at the bottom-right corner of the graph. These are colleges that charge their students a lot of money (high tuition) yet, somehow, they have low student loan default rates. Describe the graduation rates of these schools.

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-warning">

**1.3 -** Look at the top-left corner of the graph. These are colleges that don't charge a lot (low tuitions) yet, somehow, their students have high default rates. Describe the graduation rates of these schools.

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-warning">

**1.4 -** Based on your answers to the previous two questions, give a possible reason why students at **lower-cost** schools (who, presumably, have less initial debt than their peers) somehow have **higher** loan default rates.

</div>

**Double-click this cell to type your answer here:** ...

In data science, we say that graduation rates and tuition are **confounded**. Since they both rise and fall together, it can be hard to tell which is really "making the difference" in default rates. Is it possible to "tease out" which factor is *more* directly associated with students being able to pay off their loans? The next section will introduce you to a new type of modeling - multiple regression - that can help us answer this question.
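A quick way to check whether two predictors "rise and fall together" is to compute their correlation. The sketch below uses a small made-up table (the numbers are hypothetical and are **not** taken from `four_year_colleges.csv`) just to show the pandas call. The same call could be tried on the real columns of `four_year_col`.

```python
import pandas as pd

# Hypothetical toy data (NOT from four_year_colleges.csv), for illustration only
toy = pd.DataFrame({
    'net_tuition': [5, 8, 12, 15, 20, 25],    # thousands of dollars
    'grad_rate':   [35, 42, 55, 60, 78, 85],  # percent
})

# Pearson correlation between the two predictors
r = toy['net_tuition'].corr(toy['grad_rate'])
print(f"Correlation between net_tuition and grad_rate: {r:.3f}")
```

A correlation close to 1 or -1 means the two predictors move together strongly, which is exactly the situation where it's hard to tell their effects apart.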

### 2.0 - Fitting and interpreting a multiple regression model

Again, let's show the scatterplot between `net_tuition` (predictor) and `default_rate` (outcome), along with the linear model:

In [None]:
## Run this code but do not edit it
# create scatterplot: default_rate ~ net_tuition, with linear model overlaid
sns.lmplot(x='net_tuition', y='default_rate', data=four_year_col, line_kws={'color': 'orange'}, scatter_kws={'color':'black','alpha': 0.7})
plt.title('Scatterplot with Linear Model: Default Rate vs Net Tuition')
plt.show()

<div class="alert alert-block alert-warning">

**2.1 (Review Question) -** Use the `smf.ols` command (followed by `.fit()`) to fit and store the linear regression model that's visualized above, using `net_tuition` (predictor) in order to predict `default_rate` (outcome). Save the model in an object called `tuition_model` and print out the model's coefficients (`tuition_model.params`).

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** If you print out `tuition_model.params`, you should see two numbers: 8.0029 and -0.2077.</div>

Recall that simple linear regressions follow this formula:

$$
\hat{y} = \beta_{0} + \beta_{1}x
$$
Where:
- $\hat{y}$ is the predicted y-value (predicted outcome value)
- $\beta_{0}$ is the y-intercept --> the predicted y-value (outcome value) when x = 0 (the predictor's value is 0)
- $\beta_{1}$ is the slope --> the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- $x$ is the x-value (predictor value)

<div class="alert alert-block alert-warning">

**2.2 (Review Question) -** What is the slope value from our `tuition_model`? Interpret the meaning of this value (in context).

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-info"><b>Instructor note:</b> Because <code>net_tuition</code> is given in thousands of dollars, a 1-unit increase in <code>net_tuition</code> can be interpreted as a 1,000 dollar increase in tuition (this is reflected in the sample response for interpreting the slope value). It's important to clarify this with students.
</div>

<div class="alert alert-block alert-warning">

**2.3 (Review Question) -** Use the `summary` command on `tuition_model` to see summary information about the linear model. What is the $R^2$ value from our `tuition_model`? What does this value indicate about the strength of the model?

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** You should find an $R^2$ value of 0.1882 </div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-info"><b>Student Note:</b> There's a lot going on in the following section. We recommend taking a break to watch <a href="https://youtu.be/kFB2Dp_gaWQ">this video</a>, which provides an overview of multiple regression models and walks through interpreting the values from this model. Once you're done with the video, continue reading below.</div>

So far, we have only been working with simple linear regressions: models that use one predictor variable (`net_tuition`) to predict the outcome variable (`default_rate`). If we'd like to use multiple predictor variables at once in order to model our outcome, we can use a technique called **multiple regression**.

For example, imagine we want to use **both** `net_tuition` ($x_{1}$) and `grad_rate` ($x_{2}$) to predict `default_rate` ($y$). We can write a new model with multiple predictors, like this:

$$
\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2}
$$
Where:
- $\hat{y}$ is the predicted `default_rate`
- $x_{1}$ is the `net_tuition`
- $x_{2}$ is the `grad_rate`

This means that...
- $\beta_{1}$ is the slope for `net_tuition` --> the slope between `default_rate` and `net_tuition`, controlling for all other predictors
- $\beta_{2}$ is the slope for `grad_rate` --> the slope between `default_rate` and `grad_rate`, controlling for all other predictors
- $\beta_{0}$ is the y-intercept --> the predicted y-value when $x_{1} = 0$ and $x_{2} = 0$ (when `net_tuition` and `grad_rate` are both 0)

Let's go ahead and fit this model, so we can understand what this all really means. With Python's `statsmodels.formula.api` package, using multiple predictors (such as `net_tuition` and `grad_rate`) is simple: we include both of them in the formula of our `ols` command, joined by a `+`. See below:

In [None]:
## Run this code but do not edit it
tuition_grad_model = smf.ols(formula='default_rate ~ net_tuition + grad_rate', data=four_year_col).fit()
print(f"Intercept: {tuition_grad_model.params['Intercept']}")
print(f"Slope Net Tuition: {tuition_grad_model.params['net_tuition']}")
print(f"Slope Grad Rate: {tuition_grad_model.params['grad_rate']}")


As you can see, the values have changed a bit, and an extra slope term has now appeared in our model. We can plug these values into our model like so:

$$
\hat{y} = 14.479 + (0.007)x_{1} + (-0.160)x_{2}
$$

Here's how we can interpret the slopes in our model:
- $\beta_{1} = 0.007$ --> For every 1,000 dollar increase in `net_tuition`, we expect a 0.007 percentage point **increase** in `default_rate`, controlling for `grad_rate`
- $\beta_{2} = -0.160$ -->  For every 1 percentage point increase in `grad_rate`, we expect a 0.160 percentage point **decrease** in `default_rate`, controlling for `net_tuition`
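To make the equation concrete, here's a minimal sketch that plugs values into it by hand. The coefficients are the rounded values from the equation above. The college used (20 thousand dollars net tuition, 60 percent graduation rate) and the helper function `predict_default_rate` are hypothetical, made up only for illustration.

```python
# Rounded coefficients copied from the fitted equation above
intercept = 14.479
slope_net_tuition = 0.007   # per 1,000 dollars of net tuition
slope_grad_rate = -0.160    # per 1 percentage point of grad rate

def predict_default_rate(net_tuition, grad_rate):
    """Hypothetical helper: predicted default rate (in percent) from the equation."""
    return intercept + slope_net_tuition * net_tuition + slope_grad_rate * grad_rate

# Hypothetical college: $20,000 net tuition, 60% graduation rate
pred = predict_default_rate(20, 60)
print(f"Predicted default rate: {pred:.3f}")

# Same graduation rate, but $1,000 more in net tuition:
# the prediction changes by exactly the net_tuition slope
diff = predict_default_rate(21, 60) - predict_default_rate(20, 60)
print(f"Change for +$1,000 tuition at the same grad rate: {diff:.3f}")
```

The second calculation is what "controlling for `grad_rate`" means in practice: we hold the graduation rate fixed and change only the tuition.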

The key is that multiple regression allows you to **control for other predictors**, which helps us **account for confounding**. When we control for graduation rates (i.e., when we compare colleges with similar graduation rates), we see that tuition is now **positively** related to default rates. In other words, if two colleges have similar graduation rates, we'd expect the one that charges more in tuition to have a higher default rate.

So, charging students more for school is, in fact, associated with higher rates of default - as long as we're comparing among schools with similar graduation rates.
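If the sign flip feels mysterious, it can help to watch it happen on data where we already know the answer. The sketch below uses **simulated** data (the columns `sim_tuition`, `sim_grad`, and `sim_default` and all of the numbers are assumptions made up for this illustration, not the College Scorecard data). By construction, higher tuition *raises* the simulated default rate, but tuition also rises with graduation rate, which *lowers* it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data, for illustration only (NOT the College Scorecard data)
rng = np.random.default_rng(0)
n = 1000
grad = rng.uniform(20, 90, n)                     # graduation rate (percent)
tuition = 0.3 * grad + rng.normal(0, 2, n)        # tuition rises with grad rate
default = 15 - 0.2 * grad + 0.2 * tuition + rng.normal(0, 1, n)

sim = pd.DataFrame({'sim_tuition': tuition, 'sim_grad': grad, 'sim_default': default})

# Simple regression: tuition only (grad rate is a hidden confounder)
simple = smf.ols(formula='sim_default ~ sim_tuition', data=sim).fit()

# Multiple regression: tuition, controlling for grad rate
multiple = smf.ols(formula='sim_default ~ sim_tuition + sim_grad', data=sim).fit()

print(f"Tuition slope, simple model:   {simple.params['sim_tuition']:.3f}")
print(f"Tuition slope, multiple model: {multiple.params['sim_tuition']:.3f}")
```

In the simple model, the tuition slope comes out negative because tuition is "standing in" for graduation rate. Once graduation rate is included as its own predictor, the tuition slope turns positive, close to the 0.2 that was built into the simulation.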

<div class="alert alert-block alert-info"><b>Instructor Note:</b> You may need to use some direct instruction to walk students through the introduction to multiple regression (shown above).
</div>

Just as we can use the `summary` command to find the $R^2$ value of a simple linear regression, we can use `summary` to find the $R^2$ of our multiple regression model:

In [None]:
## Run this code but do not edit it
# summary of tuition_grad_model
tuition_grad_model.summary()

<div class="alert alert-block alert-warning">

**2.4 -** How does the $R^2$ value of our multiple regression model (`tuition_grad_model`) compare to the $R^2$ value of our simple linear regression model (`tuition_model`)? Did adding `grad_rate` alongside `net_tuition` help make the model's predictions stronger? Explain.

</div>

**Double-click this cell to type your answer here:** ...
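When comparing models, it can be handy to pull out the $R^2$ value on its own rather than reading it off the full summary table. Fitted `statsmodels` models store it in the `rsquared` attribute. The sketch below shows this on simulated data. The columns `x1`, `x2`, and `y` are hypothetical placeholders, not variables from our dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with hypothetical columns, for illustration only
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 2 + 1.0 * x1 + 1.5 * x2 + rng.normal(0, 1, n)
toy = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# Fit a one-predictor model and a two-predictor model
one_pred = smf.ols(formula='y ~ x1', data=toy).fit()
two_pred = smf.ols(formula='y ~ x1 + x2', data=toy).fit()

# Pull out R^2 directly from each fitted model
r2_one = one_pred.rsquared
r2_two = two_pred.rsquared
print(f"R^2 with x1 only:   {r2_one:.4f}")
print(f"R^2 with x1 and x2: {r2_two:.4f}")
```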

### 3.0 - Making your own multiple regression models

<div class="alert alert-block alert-warning">

**3.1 -** There's no reason that you have to stop at 2 predictors. Your model could have many predictors! Use the `smf.ols` command to create a model that predicts `default_rate` using three predictor variables: `net_tuition`, `grad_rate`, and `pct_PELL`. Store the model in an object called `tuition_grad_pell_model` and then print out the model's coefficients.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** When you print out the model's coefficients, you should see four numbers: 8.513, 0.031, -0.117, 0.090 </div>

<div class="alert alert-block alert-warning">

**3.2 -** Interpret (in context) the slope value for `pct_PELL` from your model.

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-warning">

**3.3 -** Use the `summary` command to find the $R^2$ value of the `tuition_grad_pell_model`.

</div>

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info">

**Check yourself:** The output should show an $R^2$ value of 0.5775 </div>

<div class="alert alert-block alert-warning">

**3.4 -** Compare the $R^2$ values from `tuition_grad_model` and `tuition_grad_pell_model`. Did adding `pct_PELL` strengthen the model's predictions? If so, did it strengthen the model's predictions by a large amount? Explain.

</div>

**Double-click this cell to type your answer here:** ...

<div class="alert alert-block alert-warning">

**3.5 -** Create *your own* multiple regression model, using variables of your own choosing. Analyze the slope values from at least two separate predictors and try to maximize the $R^2$ value. 

**Hints:** 
- Look at the dataset description <a href="https://docs.google.com/document/d/1C3eR6jZQ2HNbB5QkHaPsBfOcROZRcZ0FtzZZiyyS9sQ/edit">here</a> to identify good potential predictor variables for your model.
- You may be tempted to use *all* the variables in the dataset as predictors. This may not be the best idea. The next notebook will explore why.

</div>
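If you want to try out several combinations of predictors, one convenient pattern is to keep the predictor names in a list and build the formula string from it. This is only a sketch: `predictor_a`, `predictor_b`, and `predictor_c` are hypothetical placeholders that you would swap for real column names from the dataset description.

```python
# Hypothetical placeholder names; replace with real columns from the dataset
outcome = 'default_rate'
predictors = ['predictor_a', 'predictor_b', 'predictor_c']

# Join the predictors with ' + ' to build a statsmodels-style formula string
formula = outcome + ' ~ ' + ' + '.join(predictors)
print(formula)

# The string could then be passed to the ols command, e.g.:
# my_model = smf.ols(formula=formula, data=four_year_col).fit()
```

Changing the model then only requires editing the list, which makes it easier to compare $R^2$ values across different sets of predictors.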

In [None]:
# Your code goes here
...

<div class="alert alert-block alert-info"><b>Instructor Notes:</b> 
<li> It can be fun to turn this last question into a class competition: Who can make the strongest model (i.e., the highest R^2 value)?</li><li>Students may wonder: Why not include every single variable as a predictor? Won't it just add strength to the model? Well, not really. Let students know that this is an issue that we'll explore in the next notebook.</li></div>

