<hr size=7 color=#8D84B5 > </hr> 

<div align="center">

# <font color = #6b4cde face="Verdana"> **Universities and Gentrification** </font>
## <font color = #6b4cde face="Verdana"> **UMD CMSC320 Data Science, Spring 2023** </font>
## <font color = #6b4cde face="Verdana"> **Joe Diaz and Connor Pymm** </font>
</center>

</div>

<hr size=7 color=#8D84B5 > </hr> 

### 🙏RUN ME FIRST🙏

In [None]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

<hr size=7 color=#8D84B5 > </hr> 

<div align="center">

## <font color = #6b4cde face="Verdana"> **Data Curation** </font>
</center>

</div>

<hr size=7 color=#8D84B5 > </hr> 

### Selecting Datasets

To analyze colleges and their surrounding regions, we needed three things:
a subset of colleges to study, a dataset with yearly characteristics of those colleges,
and a dataset with yearly characteristics of their nearby geographical areas.

Initially, we decided to limit our analysis to the top 100 universities in the country 
according to current US News rankings, under the assumption that more highly ranked universities 
might have a more significant impact on their respective communities. We used Andy Reiter's
“U.S. News & World Report Historical Liberal Arts College and University Rankings” dataset (**citation**).
  
To obtain college characteristics, we used the College Scorecard, an extensive
Department of Education dataset on accredited universities that has
a public API for programmatically querying data.
  
To obtain characteristics of the region around each university, we needed a dataset containing
demographic and economic data for defined geographical regions associated with the university's location.
We found that the Census Bureau's yearly American Community Survey data had the housing cost and income data we
wanted to analyze, and that its Public Use Microdata API allowed us to programmatically request that
data for geographical groupings called "Public Use Microdata Areas," which are the smallest geographical entities that
the Census collects yearly data from.

### Extract, Transform, and Load

We queried a *substantial* amount of data from *ridiculously large* datasets,
and requesting federal data from the Department of Education and the Census required
registering for and using API keys. We therefore kept the source datasets that
we were able to download in full in our repository under ETL/source_data, and
created modules that make the federal API requests and load the results into csv files for
later use.
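
As a rough illustration of this fetch-once, reuse-later pattern, here is a minimal sketch. The helper name `load_or_fetch` and the stand-in `fake_fetch` function are hypothetical and are not taken from the ETL/ modules; a real fetch function would make the API request instead.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(csv_path: Path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached csv if it exists; otherwise fetch, cache, and return."""
    if csv_path.exists():
        return pd.read_csv(csv_path)
    df = fetch()
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False)
    return df


# Stand-in for an API call (hypothetical); it records how often it is invoked.
fetch_calls = []


def fake_fetch() -> pd.DataFrame:
    fetch_calls.append(1)
    return pd.DataFrame({"ipeds_id": [1, 2], "size": [100, 200]})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "generated_data" / "demo.csv"
    first = load_or_fetch(path, fake_fetch)   # fetches and writes the csv
    second = load_or_fetch(path, fake_fetch)  # reads the cached csv instead
    cache_written = path.exists()
```

The second call never touches the fetch function, which is the property that keeps API keys and request quotas out of the analysis notebook.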
  
Dataframes generated from queried data were stored as csv under ETL/generated_data
and loaded into the notebook when needed. Specifically, we built ScorecardData.csv using
our scorecard_client.py module, which defines a CollegeScorecardClient object that can query
DoE data given a valid API key, a set of desired variables, and a set of colleges identified by IPEDS IDs.
We built college_FIPs.csv by combining the university list we got from Reiter with state FIPS data from DoE
and county FIPS codes that we collected manually, university by university.
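
For readers unfamiliar with the College Scorecard API, the sketch below shows roughly what the inputs to such a query might look like. It is not the code in scorecard_client.py: the function `build_scorecard_params` is hypothetical, the IPEDS IDs are placeholders, and the endpoint URL and the `api_key`, `id`, and `fields` parameter names are our understanding of the public API, so check them against the official documentation before relying on them.

```python
from typing import Dict, Iterable

# Assumed endpoint of the public College Scorecard API (verify in the official docs).
SCORECARD_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"


def build_scorecard_params(
    api_key: str,
    ipeds_ids: Iterable[int],
    variables: Iterable[str],
    year: int,
) -> Dict[str, str]:
    """Build query parameters for one year of data on a set of colleges.

    Assumption: yearly variables are requested as "<year>.<variable>",
    while school-level fields such as "school.name" carry no year prefix.
    """
    yearly_fields = [f"{year}.{name}" for name in variables]
    return {
        "api_key": api_key,
        "id": ",".join(str(i) for i in ipeds_ids),
        "fields": ",".join(["id", "school.name"] + yearly_fields),
    }


params = build_scorecard_params(
    api_key="YOUR_API_KEY",          # placeholder, not a real key
    ipeds_ids=[100001, 100002],      # placeholder IDs, not real colleges
    variables=["student.size", "cost.tuition.in_state"],
    year=2019,
)
# These parameters could then be sent with an HTTP client,
# for example requests.get(SCORECARD_URL, params=params).
```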

For the rest of this tutorial, we will be using the data we collected by default, but if you would like to
recreate the analysis of this tutorial using a different set of colleges, and thus your own datasets, you can
fork this repository and use the modules provided in the ETL/ directory to do so.

<hr size=7 color=#8D84B5 > </hr> 

<div align="center">

## <font color = #6b4cde face="Verdana"> **Data Processing** </font>
</center>

</div>

<hr size=7 color=#8D84B5 > </hr> 

### Loading and Representation

Here, we load the data we have downloaded or generated locally into our
notebook for use in our analysis. We stored each of our datasets as
csv, so they load easily into pandas DataFrames.

In [None]:
# Read dataframes from generated data
CollegeScorecard_df_raw = pd.read_csv("generated_data/ScorecardData.csv")
fips_df = pd.read_csv("generated_data/college_FIPs.csv")
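# Collapse the CPI table to one value per year by averaging its "Value" column;
# the result is a Series indexed by Year.
cpi_df = pd.read_csv("source_data/cpi_all.csv").groupby("Year")["Value"].mean()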

# Join county FIPs codes into College Scorecard dataframe for use later in
# associating with Census geographies.
scard_df = CollegeScorecard_df_raw.copy()
scard_df = pd.merge(scard_df, fips_df[["school.name", "county_fips"]], on="school.name", how="left")

scard_df.head()
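
Because the join above matches on school names, a name spelled differently in the two files would silently leave `county_fips` empty. A minimal sketch of one way to check for that, using toy data with made-up college names and FIPS codes, is:

```python
import pandas as pd

# Toy stand-ins for the two tables; all names and codes are made up.
scorecard = pd.DataFrame(
    {
        "school.name": ["College A", "College B", "College C"],
        "year": [2019, 2019, 2019],
    }
)
fips = pd.DataFrame(
    {
        "school.name": ["College A", "College B"],
        "county_fips": [11111, 22222],
    }
)

# indicator=True adds a "_merge" column recording where each row came from;
# validate="many_to_one" raises if a school appears twice in the FIPS table.
merged = scorecard.merge(
    fips,
    on="school.name",
    how="left",
    indicator=True,
    validate="many_to_one",
)

# Schools present in the Scorecard data but absent from the FIPS table.
unmatched = merged.loc[merged["_merge"] == "left_only", "school.name"].unique().tolist()
```

An empty `unmatched` list on the real data would indicate that every school received a county code.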

### Data Cleaning and Reshaping

The data still uses the variable names and formatting of our original sources.
Those variable names are unwieldy and not ideal for usage
in analysis later, so we rename our columns to be more human readable and
developer friendly. Additionally, cost data in our sources does not account for
inflation, so we use an all-consumers/all-goods CPI to convert our dollar
values to constant dollars.
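
The inflation adjustment divides each nominal amount by that year's CPI and multiplies by 100, which expresses it in dollars of the CPI's base period. The sketch below walks through the arithmetic on toy data; the CPI readings and costs are made up purely so the numbers are easy to follow.

```python
import pandas as pd

# Made-up CPI readings, two per year, in the same Year/Value layout.
cpi_readings = pd.DataFrame(
    {
        "Year": [2010, 2010, 2020, 2020],
        "Value": [200.0, 220.0, 250.0, 270.0],
    }
)

# Average the readings within each year: 2010 -> 210.0, 2020 -> 260.0.
cpi_by_year = cpi_readings.groupby("Year")["Value"].mean()

# Made-up nominal costs for the same two years.
costs = pd.DataFrame({"year": [2010, 2020], "net_cost": [21000.0, 26000.0]})

# adjusted = nominal / CPI(year) * 100
costs["net_cost_adjusted"] = costs["net_cost"] / costs["year"].map(cpi_by_year) * 100
```

Here both rows come out to the same adjusted value even though the nominal cost rose, because in this toy example the price increase exactly tracks the CPI increase.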

In [None]:
# Rename columns to be more readable, usable
scard_df = scard_df.rename(
    columns={
        "student.size": "size",
        "cost.tuition.in_state": "in_state_tuition",
        "cost.tuition.out_of_state": "out_state_tuition",
        "cost.avg_net_price.public": "public_net_price",
        "cost.avg_net_price.private": "private_net_price",
        "id": "ipeds_id",
        "school.name": "name",
        "school.carnegie_size_setting": "size_setting",
        "school.zip": "zip",
        "school.state_fips": "state_fips",
        "school.region_id": "region_id",
        "school.locale": "locale",
        "school.ownership": "ownership"
    }
)

# Combine public and private net prices into a single net price column.
# Ownership code 1 denotes a public institution; all other codes are private.
scard_df["net_cost"] = scard_df["public_net_price"].where(
    scard_df["ownership"] == 1, scard_df["private_net_price"]
)

# Look up the annual average CPI for each row's year.
# Note: a year missing from cpi_df yields NaN here instead of raising a KeyError.
cpi_by_row = scard_df["year"].map(cpi_df)

# Adjust dollar amounts for inflation: nominal value / CPI * 100.
scard_df["net_cost_adjusted"] = scard_df["net_cost"] / cpi_by_row * 100
scard_df["in_tuition_adjusted"] = scard_df["in_state_tuition"] / cpi_by_row * 100
scard_df["out_tuition_adjusted"] = scard_df["out_state_tuition"] / cpi_by_row * 100

# Drop the public/private net price columns now that they are combined.
scard_df = scard_df.drop(columns=["public_net_price", "private_net_price"])
scard_df.head()

We can note that some rows have no cost data associated with them.
Since we will use this cost data later in our analysis, we need to either interpolate the missing values
or drop the incomplete rows. Here, we experiment with dropping rows with missing data.

In [None]:
scard_clipped = scard_df.dropna(subset=["net_cost", "in_state_tuition", "out_state_tuition"]).copy()
scard_clipped
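
Before deciding between dropping and interpolating, it can help to see where the gaps fall. A minimal sketch on toy data (the values are made up) that counts missing cost entries per year:

```python
import numpy as np
import pandas as pd

# Toy data: cost fields are partly missing in 2008 and complete in 2009.
toy = pd.DataFrame(
    {
        "year": [2008, 2008, 2009, 2009],
        "net_cost": [np.nan, np.nan, 15000.0, 18000.0],
        "in_state_tuition": [9000.0, np.nan, 9500.0, 30000.0],
    }
)
cost_cols = ["net_cost", "in_state_tuition"]

# isna() marks missing cells as True; summing per year counts them.
missing_by_year = toy[cost_cols].isna().groupby(toy["year"]).sum()
```

If the missing values cluster in the earliest years, as in this toy table, dropping those rows amounts to shortening the time range instead of punching holes in it.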

It seems that the clipped dataframe left after dropping null cost data contains just the data from 2009 onward.
To verify this, we query the original dataset by restricting the years alone.
If cost data is complete from 2009 to 2020, the resulting dataframe should equal the
dataframe produced by dropping null data. Run the next code cell to confirm this.

In [None]:
scard_clipped_year = scard_df[scard_df["year"] >= 2009].copy()
scard_clipped_year.equals(scard_clipped)
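
`DataFrame.equals` returns a single boolean, so a `False` result gives no hint about which rows differ. A minimal sketch, on made-up toy data, of one way to locate the disagreement by comparing the row indices of the two filtered frames:

```python
import numpy as np
import pandas as pd

# Toy data with a gap in 2010, so the two filters disagree.
toy = pd.DataFrame(
    {
        "year": [2008, 2009, 2010, 2011],
        "net_cost": [np.nan, 15000.0, np.nan, 16000.0],
    }
)

by_dropna = toy.dropna(subset=["net_cost"])
by_year = toy[toy["year"] >= 2009]

same = by_year.equals(by_dropna)

# Index labels that appear in exactly one of the two filtered frames.
only_in_one = by_year.index.symmetric_difference(by_dropna.index)
differing_rows = toy.loc[only_in_one]
```

Since both frames are row subsets of the same original dataframe, an empty `only_in_one` means they contain exactly the same rows.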