# EDA for Data Incubator project proposal, fall 2020

## How does performance relate to compensation in publicly funded universities?

### -1- Get data

#### Start with the raw csv file downloaded from website

Link to [Urban Institute's data explorer](https://educationdata.urban.org/data-explorer/colleges/) (super cool). 

Link to [my document](https://github.com/dagny099/does_good_payoff/blob/master/docs/getting-started.rst) showing selection criteria and variables.

*Rounds of downloads* <br>
Rd_1: TX, FL         ... Approximate results: 188k records from 838 institutions <br>
Rd_2: OR, AZ, MI, OH ... Approximate results: 169k records from 752 institutions <br>
Rd_3: NY, GA, MN     ... Approximate results: 170k records from 755 institutions <br>
Rd_4: CA             ... Approximate results: 164k records from 729 institutions <br>
Rd_5: VA, PA, IN, WI ... Approximate results: 178k records from 791 institutions <br>

Note: There appears to be a query limit of around 200k records, hence multiple rounds of downloads. 

Rationale behind states chosen:
- Question of interest is geared towards public funding and higher education outcomes
- Wikipedia lists the Top 10 university campuses by enrollment, by year
- Rd 1. I chose to include the states where those campuses are located (based on the [2018–19 academic year](https://en.wikipedia.org/wiki/List_of_United_States_public_university_campuses_by_enrollment#2018%E2%80%9319_enrollment))
- Rd 2-4. Include more geographic diversity by adding CA, NY, OR, MI
- Rd 5. Include any state with a university in the top-10 since 2009, added IN, PA

Manually summed size of aggregated csv files: **70MB**

In [None]:
# Use bash to combine all files with same columns from multiple downloads:
 
!(head -1 ../../EducationDataPortal_TX_FL_years_after_entry.csv && tail -n +2 -q ../../EducationDataPortal*_years_after_entry.csv ) > ../../EducationDataPortal_years_after_entry_ALL.csv
!(head -1 ../../EducationDataPortal_TX_FL_level_of_study.csv && tail -n +2 -q ../../EducationDataPortal*_level_of_study.csv ) > ../../EducationDataPortal_level_of_study_ALL.csv
!(head -1 ../../EducationDataPortal_TX_FL_institutions.csv && tail -n +2 -q ../../EducationDataPortal*_institutions.csv ) > ../../EducationDataPortal_institutions_ALL.csv

!mv ../../EducationDataPortal_years_after_entry_ALL.csv ../data/raw/higherEd/usa/
!mv ../../EducationDataPortal_level_of_study_ALL.csv ../data/raw/higherEd/usa/
!mv ../../EducationDataPortal_institutions_ALL.csv ../data/raw/higherEd/usa/
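
The same concatenation could also be done in pandas, which avoids depending on a shell. This is only a sketch: the helper name `combine_downloads` and the two demo files are made up for illustration, and it assumes every matching file has identical columns.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_downloads(pattern):
    """Concatenate every CSV matching `pattern` (hypothetical helper).

    Assumes all matching files share the same columns.
    """
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Tiny self-contained demo: two made-up files standing in for the real downloads
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"unitid": [1, 2], "year": [2005, 2005]}).to_csv(
    os.path.join(demo_dir, "EducationDataPortal_A_institutions.csv"), index=False)
pd.DataFrame({"unitid": [3], "year": [2006]}).to_csv(
    os.path.join(demo_dir, "EducationDataPortal_B_institutions.csv"), index=False)

combined = combine_downloads(
    os.path.join(demo_dir, "EducationDataPortal*_institutions.csv"))
print(combined.shape)
```

As with the bash version, the combined file should be written somewhere the glob pattern will not pick it up on a second run.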



In [2]:
import pandas as pd
import numpy as np
%matplotlib inline

pathDir = '../data/raw/higherEd/usa/'
filename_inst = 'EducationDataPortal_institutions_ALL.csv'
filename_los = 'EducationDataPortal_level_of_study_ALL.csv'
filename_yae = 'EducationDataPortal_years_after_entry_ALL.csv'

# Read downloaded csv files
tmp = pd.read_csv(pathDir+filename_inst)
tmp_los = pd.read_csv(pathDir+filename_los)
tmp_yae = pd.read_csv(pathDir+filename_yae)



### -2- Reduce data, Clean data

In [13]:
# I know I want to look at Degree-granting institutions but let's see what other inst_category values there are:
want='Degree-granting, primarily baccalaureate or above'
print(f"There are {tmp[tmp.inst_category==want].shape[0]} records from {want} institutions")
print(f"# which is {round(tmp[tmp.inst_category==want].shape[0]/tmp.shape[0],2)} of all {tmp.shape[0]} records\n")

print(tmp.inst_category.value_counts())


There are 10997 records from Degree-granting, primarily baccalaureate or above institutions
# which is 0.08 of all 129945 records

Nondegree-granting, sub-baccalaureate                      12777
Degree-granting, primarily baccalaureate or above          10997
Degree-granting, associate's and certificates               9291
Degree-granting, not primarily baccalaureate or above       2867
Degree-granting, graduate with no undergraduate degrees     1981
Not applicable                                              1075
Missing/not reported                                         114
Nondegree-granting, above the baccalaureate                   51
Name: inst_category, dtype: int64


In [3]:
# Let's look at the size of those institutions:
tmp[tmp.inst_category==want]['inst_size'].value_counts()

1,000-4,999             4454
Under 1,000             2890
5,000-9,999             1237
20,000 and above         894
10,000-19,999            878
Missing/not reported      10
Not applicable             2
Name: inst_size, dtype: int64

In [None]:
# How many are institutions with 5K or above (that's 3 categories):
sizeFilt = tmp['inst_size'].isin(['5,000-9,999', '10,000-19,999', '20,000 and above'])

In [45]:
tmp[sizeFilt].groupby(['state_name','year'])['number_enrolled_total'].count()

state_name  year
Arizona     2005     2
            2006     2
            2007     7
            2008     1
            2009     3
            2010     5
            2011     2
            2012     4
            2013     6
            2014     6
            2015     6
            2016     6
            2017     6
California  2005    41
            2006    41
            2007    41
            2008    27
            2009    42
            2010    41
            2011    45
            2012    44
            2013    46
            2014    48
            2015     0
            2016     0
            2017     6
Florida     2005     2
            2006     4
            2007    17
            2008    15
                    ..
Texas       2013    35
            2014    36
            2015    36
            2016    37
            2017    37
Virginia    2005    13
            2007    13
            2008    14
            2009    12
            2010    14
            2011    14
            2012 

In [30]:
tmp[sizeFilt].pivot(index='unitid',columns='year',values='number_enrolled_total')

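| unitid | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 103644 | NaN | NaN | NaN | NaN | NaN | 661.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 103893 | NaN | NaN | NaN | NaN | 4676.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104151 | 7706.0 | 7894.0 | 7894.0 | 9274.0 | 9707.0 | 9344.0 | 9544.0 | 9254.0 | 7171.0 | 7647.0 | 8348.0 | 8230.0 | 7874.0 |
| 104160 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104179 | 5974.0 | 6009.0 | 8482.0 | NaN | 6966.0 | 7032.0 | 7032.0 | 7300.0 | 7401.0 | 7744.0 | 8037.0 | 7753.0 | 7360.0 |
| 104346 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104425 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104577 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104708 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104717 | NaN | NaN | 1009.0 | NaN | NaN | 2273.0 | NaN | 2818.0 | 3190.0 | 2980.0 | 3459.0 | 3318.0 | 3515.0 |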


In [None]:
# ------ REDUCE ROWS TO SUIT QUESTIONS OF INTEREST ------
criteria = {'inst_category': 'Degree-granting, primarily baccalaureate or above',
            'inst_size': ['20,000 and above', '10,000-19,999']}

# Keep rows that match the category and fall in any of the listed size bands
univData = tmp[(tmp['inst_category'] == criteria['inst_category'])
               & (tmp['inst_size'].isin(criteria['inst_size']))].copy()
print("Reduced data to suit questions-of-interest:")
print(f"ORIG: {tmp.shape[0]} records from {tmp.unitid.nunique()} institutions\nto")
print(f"FILTERED: {univData.shape[0]} records from {univData.unitid.nunique()} institutions")
print(f"Data from {univData.state_name.nunique()} states")
print("\nThese were the criteria applied:")
print(criteria)

In [None]:
# Which institutions have at least 8 years of data? => Generate a list of schools (unitid)
# Enrollment data, e.g. "number_enrolled_total"
nYrs_ts = 8

# using pivot -- I don't think I did this correctly ...

# df_AvgEnrollment = pd.pivot_table(univData,index=['state_name','unitid'],values='number_enrolled_total',\
#                aggfunc=({np.mean,'count'})).sort_values(['state_name','mean'],ascending=False)
# df_AvgEnrollment = df_AvgEnrollment[df_AvgEnrollment['count']>=nYrs_ts]

# NOTE: nYrs_ts is not applied here yet, so every institution in univData is kept
Use_These_Univ = univData.unitid.unique()
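
One way the "at least N years of data" filter might be written is with a `groupby` instead of a pivot. The sketch below runs on a toy frame, not the real `univData`, and assumes a year counts only when `number_enrolled_total` is non-null; `min_years` plays the role of `nYrs_ts`.

```python
import pandas as pd

# Toy stand-in for univData: school 1 has 3 usable years, school 2 has only 1
toy = pd.DataFrame({
    "unitid": [1, 1, 1, 2, 2],
    "year": [2005, 2006, 2007, 2005, 2006],
    "number_enrolled_total": [100.0, 110.0, 120.0, 50.0, None],
})
min_years = 3  # plays the role of nYrs_ts

# Count distinct years with a non-null enrollment value for each institution
years_with_data = (
    toy.dropna(subset=["number_enrolled_total"])
       .groupby("unitid")["year"]
       .nunique()
)

# Keep only the institutions that reach the threshold
keep_ids = years_with_data[years_with_data >= min_years].index.to_numpy()
print(keep_ids)
```

Counting distinct years rather than rows guards against an institution appearing more than once in the same year.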

In [None]:
# Keep columns that are missing fewer than 80% of their values
# (dataBigUniv is never defined in this notebook; univData, the filtered
#  institution table, is assumed to be the intended frame)
keepThresh = .80
keepCols = [c for c in univData.columns if univData[c].isnull().mean() < keepThresh]

print(f'Dropped these columns for missing at least {keepThresh:.0%} of their values:')
print([col for col in univData.columns if col not in keepCols])
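
The direction of the threshold is easy to get backwards, so here is a small sketch on made-up columns that contrasts the two readings: dropping a column only when at least 80% of it is missing, versus requiring at least 80% non-null values. The column names and the variables `max_missing` and `min_present` are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full":   [1.0, 2.0, 3.0, 4.0],           # 0% missing
    "sparse": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
    "empty":  [np.nan] * 4,                   # 100% missing
})

# Fraction of missing values in each column
missing_frac = toy.isnull().mean()

# Lenient reading: drop a column only if at least 80% of it is missing
max_missing = 0.80
keep = missing_frac[missing_frac < max_missing].index.tolist()

# Strict reading: keep a column only if at least 80% of it is non-null
min_present = 0.80
strict_keep = missing_frac[(1 - missing_frac) >= min_present].index.tolist()

print(keep)
print(strict_keep)
```

Under the lenient rule the 75%-missing column survives; under the strict rule it does not, so the choice changes which variables reach the later merge.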

In [None]:
# Filter all data files to only include institutions that meet criteria
# (.copy() gives independent frames, so later edits do not touch tmp, tmp_los, tmp_yae)
dataInst = tmp[tmp['unitid'].isin(Use_These_Univ)].copy()
dataLOS = tmp_los[tmp_los['unitid'].isin(Use_These_Univ)].copy()
dataYAE = tmp_yae[tmp_yae['unitid'].isin(Use_These_Univ)].copy()


In [None]:
# Merge data sets  (would definitely create SQL tables for this)
data = pd.merge(dataInst, dataYAE, how ='left', on =['unitid','year','inst_name','state_name'])
data = pd.merge(data, dataLOS, how ='left', on =['unitid','year','inst_name','state_name'])
data.describe(include='all')
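
Because the breakdown tables can hold several rows per institution and year, a left merge can multiply rows. A quick sanity check is sketched below on toy frames; the column values are invented, and `indicator` and `validate` are standard `pd.merge` options used here to show where rows matched and to assert that the left-hand keys are unique.

```python
import pandas as pd

keys = ["unitid", "year"]

# One row per institution-year (stand-in for dataInst)
inst = pd.DataFrame({"unitid": [1, 2], "year": [2005, 2005],
                     "inst_size": ["20,000 and above", "10,000-19,999"]})

# Several rows per institution-year, and nothing for unitid 2 (stand-in for dataYAE)
yae = pd.DataFrame({"unitid": [1, 1], "year": [2005, 2005],
                    "years_after_entry": [6, 10],
                    "earnings_mean": [30000.0, 40000.0]})

# validate raises MergeError if the left keys are not unique;
# indicator adds a _merge column saying where each row came from
merged = pd.merge(inst, yae, how="left", on=keys,
                  validate="one_to_many", indicator=True)

print(len(inst), "->", len(merged))
print(merged["_merge"].value_counts())
```

Comparing row counts before and after each merge is a cheap way to spot unintended duplication before exporting the result.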


In [1]:
# KEEP A CSV FILE (OPENABLE IN EXCEL) WITH ALL DATA
# why? I need to double-check that what I'm doing in Python is correct

data.to_csv('../data/processed/'+'Merged_Univ_DataMore3.csv')



In [None]:
# CHANGE SOME DATATYPES:
# Change dtype 'year' to DATETIME64
# Change dtype 'unitid', 'inst_name', 'state_name' to CATEGORY

dataInst['year'] = pd.to_datetime(dataInst['year'], format='%Y')
for col in ['unitid', 'inst_name', 'state_name']:
    dataInst[col] = dataInst[col].astype('category')



In [None]:
# Add columns for AdmissionRate and EnrollmentRate
univData['admission_rate']=univData['number_admitted']/univData['number_applied']
univData['enrollment_rate']=univData['number_enrolled_total']/univData['number_admitted']
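
If any institution reports zero applicants or zero admits, these ratios come out as `inf` (or `NaN` for 0/0) and can distort later means and plots. A minimal sketch of one way to guard against that, on invented numbers, is to treat a zero denominator as missing:

```python
import numpy as np
import pandas as pd

# Toy rows; the middle one mimics a reporting glitch with zero applicants and admits
toy = pd.DataFrame({
    "number_applied": [1000, 0, 500],
    "number_admitted": [400, 0, 250],
    "number_enrolled_total": [200, 10, 100],
})

# Replace zero denominators with NaN so the ratio becomes NaN instead of inf
toy["admission_rate"] = (
    toy["number_admitted"] / toy["number_applied"].replace(0, np.nan))
toy["enrollment_rate"] = (
    toy["number_enrolled_total"] / toy["number_admitted"].replace(0, np.nan))

print(toy[["admission_rate", "enrollment_rate"]])
```

Whether such rows should become missing values or be dropped outright is a judgment call that depends on how the source data encodes non-reporting.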


In [None]:
pathDir = '../data/processed/'
filename = 'EducationDataPortal_HigherEdMore.xlsx' #This started as an empty excel file with data dictionary


# Read excel sheet
whichSheet = 'Data_by_Institution'
df = pd.read_excel(pathDir+filename, sheet_name=whichSheet, 
                   usecols=keepCols, na_values=np.nan, verbose=True,
                   dtype={'unitid':'category', 'inst_name': 'category', 'state_name': 'category'},
                   parse_dates=['year'])

whichSheet = 'Breakdown_years_after_entry'
df_yae = pd.read_excel(pathDir+filename, sheet_name=whichSheet, 
                   na_values=np.nan, verbose=True,
                   dtype={'unitid':'category', 'inst_name': 'category', 'state_name': 'category'},
                   parse_dates=['year'])

whichSheet = 'Breakdown_level_of_study'
df_los = pd.read_excel(pathDir+filename, sheet_name=whichSheet, 
                   na_values=np.nan, verbose=True,
                   dtype={'unitid':'category', 'inst_name': 'category', 'state_name': 'category'},
                   parse_dates=['year'])


In [None]:
# Merge data sets  (would definitely create SQL tables for this)
data = pd.merge(df, df_yae, how ='inner', on =['unitid','year','inst_name','state_name'])
data = pd.merge(data, df_los, how ='inner', on =['unitid','year','inst_name','state_name'])
data.to_csv('../data/processed/'+'Merged_Univ_Data.csv')
data.describe(include='all')


### -3- Compute some stats, Group data

In [None]:
# Add column for admission_rate
data['admission_rate'] = data['number_admitted'] /  data['number_applied']


In [None]:
# Enrollment & Admissions Trends
# Only numeric columns go in cols; the grouping keys come back via reset_index()
cols = ['number_applied','number_admitted','number_enrolled_total','admission_rate']
df_Enrollment = data.groupby(['state_name','inst_name','year'], observed=True)[cols].mean().reset_index()


In [None]:
# Graduation Trends
# Only numeric columns go in cols; the grouping keys come back via reset_index()
cols = ['completers_150pct','completion_rate_150pct']
df_Grad = data.groupby(['state_name','inst_name','year'], observed=True)[cols].mean().reset_index()


In [None]:
# Funding Trends, Revenue
# Only numeric columns go in cols; the grouping keys come back via reset_index()
cols = ['rev_tuition_fees_net','rev_tuition_fees_gross']
df_Rev = data.groupby(['state_name','inst_name','year'], observed=True)[cols].mean().reset_index()

In [None]:
# Funding Trends, Expenditures
# Only numeric columns go in cols; the grouping keys come back via reset_index()
cols = ['exp_total_current','exp_total_salaries','exp_total_benefits']
df_Exp = data.groupby(['state_name','inst_name','year'], observed=True)[cols].mean().reset_index()


In [None]:
# Earnings Trends
# cols = ['inst_name','state_name','earnings_mean','count_working','count_not_working']
# df_Earnings = pd.DataFrame(data.groupby(['state_name','inst_name','year'])[cols].mean().to_records()) 

### -4- Have a look at the data, visually and export table

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

figpathDir = '../src/visualization/'

In [None]:
# PAIR-PLOT for ENROLLMENT TRENDS: 
DF = df_Enrollment
sns.set_style('white')

pair_plot = sns.pairplot(DF, hue='year');
pair_plot.savefig(figpathDir+'Enrollment_pair_plot_by_YEAR.png')  

pair_plot = sns.pairplot(DF, hue='state_name');
pair_plot.savefig(figpathDir+'Enrollment_pair_plot_by_STATE.png')  


In [None]:
# PAIR-PLOT for GRADUATION TRENDS: 
DF = df_Grad
sns.set_style('white')

pair_plot = sns.pairplot(DF, hue='year');
pair_plot.savefig(figpathDir+'Grad_pair_plot_by_YEAR.png')  

pair_plot = sns.pairplot(DF, hue='state_name');
pair_plot.savefig(figpathDir+'Grad_pair_plot_by_STATE.png')  


In [None]:
# PAIR-PLOT for Funding-Revenue TRENDS: 
DF = df_Rev
sns.set_style('white')

pair_plot = sns.pairplot(DF, hue='year');
pair_plot.savefig(figpathDir+'Fund_Rev_pair_plot_by_YEAR.png')  

pair_plot = sns.pairplot(DF, hue='state_name');
pair_plot.savefig(figpathDir+'Fund_Rev_pair_plot_by_STATE.png')  


In [None]:
# PAIR-PLOT for Funding-Expenditures TRENDS: 
DF = df_Exp
sns.set_style('white')

pair_plot = sns.pairplot(DF, hue='year');
pair_plot.savefig(figpathDir+'Fund_Exp_pair_plot_by_YEAR.png')  

pair_plot = sns.pairplot(DF, hue='state_name');
pair_plot.savefig(figpathDir+'Fund_Exp_pair_plot_by_STATE.png')  


#### Time series of admission_rate

In [None]:
sns.set()
# Assumed layout: one row per year, one column per institution
pvDf_admission_rate = df_Enrollment.pivot_table(index='year', columns='inst_name', values='admission_rate')
ad_ts = pvDf_admission_rate.plot(figsize=(12,8),lw=2,title='Admission Rate at Public Universities over Time')
ad_ts.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.ylabel('number admitted / number applied');

In [None]:
pd.pivot(df_Enrollment

In [None]:
#### Boxplots
sns.boxplot()