# Homework: Linear Regression

## 00. Import Libraries

## 0. Problem Definition

You are the data analyst acting on a taskforce exploring first-year retention rates at a college in NY state. "Retention rate" is defined as the percentage of students who return to school the fall following their freshman year. Your goal in this assignment is to build a model that predicts what the retention rate will be, and to identify and rank possible interventions.

Data have been collected from the US government's [college scorecard](https://collegescorecard.ed.gov/data/) and columns cut down to a reasonable number, in the file `MERGED2019_20_PP.csv` found in the same folder as this file. The columns are:
* `UNITID` an ID number for the institution
* `OPEID` and `OPEID6` OPE ID numbers (8 and 6 digit) for the institution.
* `INSTNM` College's name
* `CITY` City where the college is located
* `STABBR` State abbreviation
* `ZIP` ZIP Code
* `SCH_DEG` Degree primarily granted at that school. "3" means "bachelor's degrees".
* `MAIN` If the college has multiple locations, this is a 1 if it's the main campus.
* `ADM_RATE` Acceptance rate.
* `SAT_AVG` Average SAT score of students
* `DISTANCE` This is a 1 if the school is online, 0 otherwise (this used to be called "distance learning").
* `UGDS` Total number of undergraduate students.
* `UGDS_*` Percentage of the student body that is white, black, or asian.
* `PPTUG_EFF` Percentage of students who are part-time students.
* `COST` Total cost, including tuition and fees.
* `INEXPFTE` How much money the institution spends, per student.
* `AVGFACSAL` Average faculty salary per month.
* `PFTFAC` Percentage of the faculty that are full-time.
* `PCTPELL` Percentage of students who are PELL Grant eligible
* `RET_FT4` The retention rate.
* `MED_EARN_WNE_P7` Median salary of former students 7 years after they begin their education (does not include students currently enrolled in graduate school).

Yes, this is *after* Dr. Hallenbeck cut out 95% of the columns.


## 1. Dataset and Preprocessing

Load in the provided dataset into a variable called `college`.

Some of the columns in the dataset are "junk" that don't carry any useful value, and others are categorical data with far too many different categories (more than 20 or so). Remove these columns from the dataset.

Let's focus only on institutions that primarily award bachelor's degrees. Cut out all other schools.
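As a rough illustration of the dropping and filtering steps above, here is a minimal sketch on a toy table. The toy rows and the `drop_cols` list are assumptions made for the example; which columns count as "junk" in the real file is for you to decide and justify.

```python
import pandas as pd

# Toy stand-in for the real file. In the notebook you would start from
# college = pd.read_csv("MERGED2019_20_PP.csv") instead.
college = pd.DataFrame({
    "UNITID": [1, 2, 3, 4],
    "INSTNM": ["A College", "B University", "C Institute", "D College"],
    "STABBR": ["NY", "NJ", "NY", "PA"],
    "SCH_DEG": [3, 2, 3, 1],
    "SAT_AVG": [1100.0, None, 1250.0, 980.0],
    "RET_FT4": [0.78, 0.61, 0.85, None],
})

# Identifier and free-text columns (an assumed list for this toy table).
drop_cols = ["UNITID", "INSTNM"]
college = college.drop(columns=drop_cols)

# Keep only schools whose primary degree is the bachelor's (code 3).
college = college[college["SCH_DEG"] == 3].reset_index(drop=True)

print(college)
```

After the filter, `SCH_DEG` is constant, so it no longer carries information for a model.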

There were a lot of NAs in the dataset. To make this homework possible, Dr. Hallenbeck wrote the code below to *impute* a lot of values in the dataset, by setting all of the missing values to the median for the column. Not doing this would have reduced the dataset to only 400 institutions. Run the code below to perform the imputing.

Briefly, describe at least one bias introduced by either imputing or simply removing the missing data.

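```python
# Fill missing values with each column's median, leaving the state
# abbreviation and the response (RET_FT4) untouched.
for c in college.columns:
    if c in ["STABBR", "RET_FT4"]:
        continue
    college.loc[college[c].isna(), c] = college[c].median()
```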
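To get a feel for what median imputation does to a column, the sketch below uses made-up data (not the college file) and compares the spread before and after filling. The 40% missing rate and the normal distribution are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"SAT_AVG": rng.normal(1100, 120, size=200)})

# Knock out roughly 40% of the values to mimic missing data.
missing = rng.random(200) < 0.4
toy.loc[missing, "SAT_AVG"] = np.nan

std_observed = toy["SAT_AVG"].std()  # NaNs are skipped by default
imputed = toy["SAT_AVG"].fillna(toy["SAT_AVG"].median())
std_imputed = imputed.std()

print(f"std of observed values: {std_observed:.1f}")
print(f"std after median fill:  {std_imputed:.1f}")
```

The filled column has a smaller standard deviation because every imputed school sits at exactly the same value. Keep that in mind when you discuss bias.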

## 2. Linear Regression

Produce a linear regression model predicting student retention rates, and print out the summary of the model.

Does the variable for average SAT scores have a statistically significant contribution to the model? Why or why not?

Regardless of its statistical significance, interpret what the slope associated with average SAT score means.
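One possible way to fit the model and read off the SAT slope is sketched below with the `statsmodels` formula interface. The data are synthetic, and the choice of only two predictors is an assumption for the example; your model on the real data will have more terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "SAT_AVG": rng.normal(1100, 120, n),
    "PCTPELL": rng.uniform(0.1, 0.7, n),
})
# Synthetic response with known slopes, so the fit has something to recover.
toy["RET_FT4"] = (0.30 + 0.0004 * toy["SAT_AVG"] - 0.20 * toy["PCTPELL"]
                  + rng.normal(0, 0.05, n))

model = smf.ols("RET_FT4 ~ SAT_AVG + PCTPELL", data=toy).fit()
print(model.summary())

# Slope and p-value for one predictor, pulled out by name.
print(model.params["SAT_AVG"], model.pvalues["SAT_AVG"])
```

The slope is read as the expected change in `RET_FT4` for a one-point increase in `SAT_AVG`, holding the other predictors fixed. The p-value column in the summary is what speaks to statistical significance.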

As noted in the description above, the `DISTANCE` variable is properly a category, not a number, even though it is coded as 0 and 1 in the dataset itself. If we had forced it into being a string, so that we could make dummy variables out of it using `pd.get_dummies()`, what effect would that have had on either the output of the model or its interpretation?
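To explore this question, you might try the conversion on a tiny example first. The sketch below uses a made-up five-row column. Passing `drop_first=True` is an assumption about how you would avoid a redundant dummy column.

```python
import pandas as pd

toy = pd.DataFrame({"DISTANCE": [0, 1, 0, 0, 1]})

# Force the flag to a string category, then dummy-code it.
as_text = toy["DISTANCE"].astype(str)
dummies = pd.get_dummies(as_text, prefix="DISTANCE", drop_first=True)

# Recent pandas versions return booleans, so cast before comparing.
same = (dummies["DISTANCE_1"].astype(int) == toy["DISTANCE"]).all()
print(dummies.columns.tolist(), same)
```

Compare the resulting dummy column with the original 0/1 column, and think about what that implies for the fitted coefficients.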

Using the residuals, evaluate the quality of the model. Be explicit how you know what the issues or good qualities of it are.
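A residuals-versus-fitted plot is the usual tool here. As a numerical stand-in, the sketch below fits a straight line to deliberately curved synthetic data and averages the residuals in the lower, middle, and upper thirds of the fitted values. The data-generating formula is an assumption made purely to show what a problem looks like.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
x = rng.uniform(0, 1, n)
# Curved truth: a straight line cannot capture the squared term.
y = 0.5 + 0.3 * x + 0.8 * (x - 0.5) ** 2 + rng.normal(0, 0.02, n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Average residual in each third of the fitted values.
order = np.argsort(fitted)
low, mid, high = np.array_split(resid[order], 3)
print(low.mean(), mid.mean(), high.mean())
```

For a well-specified model, all three averages would sit near zero. A pattern where the ends and the middle have opposite signs indicates curvature the model missed.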

How well does this model typically do at predicting retention rates? That is, what is the typical error?
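One common reading of "typical error" is the residual standard error, which is closely related to the root-mean-square error (RMSE). The sketch below computes both by hand with NumPy on synthetic data. The noise level of 0.05 is an assumption built into the toy data so the result can be sanity-checked.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(1100, 120, n)
y = 0.30 + 0.0004 * x + rng.normal(0, 0.05, n)  # true noise sd is 0.05

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# RMSE divides by n; the residual standard error divides by n - p,
# where p is the number of fitted coefficients (here 2).
rmse = np.sqrt(np.mean(resid ** 2))
rse = np.sqrt(np.sum(resid ** 2) / (n - X.shape[1]))
print(f"RMSE = {rmse:.4f}, residual standard error = {rse:.4f}")
```

If you fit your model with `statsmodels`, `np.sqrt(model.mse_resid)` should give the same residual standard error. State your answer in the units of the retention rate.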

## 3. Commentary and Analysis

Several colleagues on the task force have made suggestions about what you should investigate. Respond to their statements. Be explicit about how you know what you know.

**Professor Benjamins:** "Our college has been relying on adjunct labor for too long. Retention rates are dropping because faculty are overworked. We should instead be focusing on converting part-time faculty to full-time, and increasing salaries across the board to attract and retain better instructors."

Let's assume Professor Benjamins's line of reasoning is correct. By how much would we expect the retention rate to change if the percentage of the faculty which were full-time was raised from 0.7 to 0.8?
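The arithmetic behind this kind of question can be sketched as follows. The coefficient value is hypothetical, so read the real one from your fitted model. The sketch also assumes the coefficient is on the full-time share; if your column measures the part-time share instead, the change in the predictor has the opposite sign.

```python
# Hypothetical coefficient; replace with the value from your model.
beta_pftfac = 0.05

full_time_before = 0.7
full_time_after = 0.8

# In a linear model, predicted change = slope * change in the predictor,
# with every other predictor held fixed.
change_in_retention = beta_pftfac * (full_time_after - full_time_before)
print(f"{change_in_retention:+.4f}")
```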

**Dr. IHeartNY:** "We are accepting too many students from New Jersey, who just aren't as good as the NY students and tend to leave for New Jersey-related reasons."

**Dr. Octavius:** "Our main campus is too large and intimidating to freshmen. We should focus on our smaller satellite campuses and helping students to transfer in after a year or two."

What factors related to race and social class seem to have a significant impact on retention rates?

For two of the factors above, indicate the direction of the relationship, and how you know.

For example, if average family income were a column, you might say "the higher the average family income, the lower the retention rate, as seen by blah blah blah."

Suggest reasons for these connections. You will likely need to do some research on this. If you do, make sure to cite your sources in your answer.