### Ingest EDFacts Graduation Rate Data for School Years 2010-11 to 2018-19

Data sources:  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2018-19-wide.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2017-18.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2016-17.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2015-16.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-release2-sch-sy2014-15.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2013-14.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2012-13.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2011-12.csv  
https://www2.ed.gov/about/inits/ed/edfacts/data-files/acgr-sch-sy2010-11.csv  

## What are the features in this data?
* Each instance is a school.

## Graduation rate is given by two features--Cohort and Rate--for each subpopulation below.
* Cohort - Number of students in that subpopulation
* Rate - Percentage (or percentage range) of students in the cohort who graduate with a high school diploma within 4 years
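
Because Rate can be a single percentage or a range, it helps to turn each value into numeric bounds before computing statistics. The sketch below is a minimal illustration, not the parser used by `ingest.gr`. It assumes the rate strings look like `87`, `85-89`, `GE95`, `LT50`, or `PS` (suppressed); the function name `parse_rate` is hypothetical, and the formats should be checked against the EDFacts file documentation.

```python
import re


def parse_rate(value):
    """Return (low, high) percentage bounds, or None when the rate is suppressed or missing.

    Assumed formats: "87", "85-89", "GE95", "GT95", "LE20", "LT50", "PS".
    """
    text = str(value).strip().upper()
    if text in {"PS", "", "NAN", "."}:
        return None  # suppressed or missing value
    match = re.fullmatch(r"(\d+)-(\d+)", text)
    if match:
        return float(match.group(1)), float(match.group(2))
    match = re.fullmatch(r"(GE|GT|LE|LT)(\d+)", text)
    if match:
        bound = float(match.group(2))
        # Simplification: strict and non-strict bounds are treated the same way.
        if match.group(1) in {"GE", "GT"}:
            return bound, 100.0
        return 0.0, bound
    return float(text), float(text)  # plain percentage


print(parse_rate("85-89"))
print(parse_rate("GE95"))
print(parse_rate("PS"))
```

Keeping both bounds, rather than collapsing a range to its midpoint, leaves the choice of summary statistic to the later analysis.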

## School identifiers
* STNAM  - State name
* FIPST  - 2 digit code for the state
* LEANM  - School district name
* LEAID  - 7 digit code for school district 
* SCHNAM - School name
* NCESSCH - 12 digit school id (only unique identifier for a school)
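
The identifier columns are fixed-width codes, so leading zeros are lost if pandas reads them as integers. The sketch below shows one way to restore them. It assumes the columns hold whole numbers with no missing values, and the helper name `pad_ids` and the sample row are made up for illustration. It also relies on my understanding that the 12-digit school id begins with the 7-digit district id, which is worth verifying against the NCES documentation.

```python
import pandas as pd

# Widths taken from the identifier list above.
ID_WIDTHS = {"FIPST": 2, "LEAID": 7, "NCESSCH": 12}


def pad_ids(frame):
    """Return a copy with identifier columns as zero-padded strings."""
    out = frame.copy()
    for col, width in ID_WIDTHS.items():
        if col in out.columns:
            # Assumes integer-like values; NaN or floats would need extra handling.
            out[col] = out[col].astype(str).str.zfill(width)
    return out


# Made-up example row: the integers have lost their leading zeros.
demo = pd.DataFrame({"FIPST": [1], "LEAID": [100005], "NCESSCH": [10000500870]})
padded = pad_ids(demo)
print(padded)
```

Passing `dtype=str` for these columns to `pd.read_csv` avoids the problem at load time.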


## Subpopulations
* ALL 	= All students in the school
* MAM 	= American Indian/Alaska Native students
* MAS 	= Asian/Pacific Islander students
* MHI 	= Hispanic students
* MBL 	= Black students
* MWH 	= White students
* MTR 	= Two or More Races
* CWD 	= Children with Disabilities (IDEA)
* ECD 	= Economically Disadvantaged students
* LEP 	= Limited English Proficient students
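
Each subpopulation contributes a Cohort and a Rate column per school year. The cells below use names such as `ALL_COHORT_1314` and `ALL_RATE_1314`. The sketch here generates the full set of names on the assumption that every subpopulation follows that same `<SUBPOP>_<FEATURE>_<YYYY>` pattern; the helper names are hypothetical.

```python
SUBPOPULATIONS = ["ALL", "MAM", "MAS", "MHI", "MBL", "MWH", "MTR", "CWD", "ECD", "LEP"]


def school_year_suffix(start_year):
    """Map a starting year to the two-year suffix, e.g. 2013 -> '1314'."""
    return f"{start_year % 100:02d}{(start_year + 1) % 100:02d}"


def feature_columns(start_year, subpops=SUBPOPULATIONS):
    """List the assumed COHORT and RATE column names for one school year."""
    suffix = school_year_suffix(start_year)
    return [f"{sp}_{feat}_{suffix}" for sp in subpops for feat in ("COHORT", "RATE")]


print(feature_columns(2013)[:4])
```

A list like this can be compared with `df.columns` to spot columns that are missing for a given year.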


In [11]:
import pandas as pd
import numpy as np
import seaborn as sns
import sys
sys.path.append("..")
from ingest import gr
# local_dir = "~/datasets/grad_rate/"
# !ls $local_dir 

In [12]:
# The first call takes about 2 minutes: the raw data is downloaded and
# stored in gr_dfs.dat, gr_dfs.bak, and gr_dfs.dir.
# Later calls are much faster.
df = gr.make_total_cohort_frame(2013)
# Set the number formatting for the pandas dataframe
pd.set_option('display.float_format', str)
# Note: style.format returns a Styler that is discarded here, so it does not change how df displays
df.style.format("{:.1f}")
df

HTTPError: HTTP Error 504: Gateway Timeout
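
The file trio `gr_dfs.dat`, `gr_dfs.bak`, and `gr_dfs.dir` looks like what Python's `shelve` module writes when it falls back to the `dbm.dumb` backend. Whether `ingest.gr` really caches this way is an assumption. The sketch below only illustrates the download-once, reuse-later pattern described in the cell comments; `cached_frame` and `slow_build` are hypothetical names, and a small dict stands in for the downloaded dataframe.

```python
import os
import shelve
import tempfile


def cached_frame(key, build, cache_path="gr_dfs"):
    """Return the cached value for key, calling build() only on a cache miss."""
    with shelve.open(cache_path) as cache:
        if key not in cache:
            cache[key] = build()  # slow path: runs once per key
        return cache[key]


calls = []


def slow_build():
    """Stand-in for the download step; records each invocation."""
    calls.append(1)
    return {"year": 2013, "rows": 3}


# Use a temporary directory so the demo does not touch any real cache files.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_cache")
first = cached_frame("2013", slow_build, demo_path)
second = cached_frame("2013", slow_build, demo_path)
print(len(calls))
```

With a cache like this, a failed download such as the gateway timeout above only needs to be retried for the years that are not yet stored.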

In [9]:
print(df.ALL_COHORT_1314.describe())
df[df.ALL_COHORT_1314 > 4000]


In [10]:
print(df.ALL_RATE_1314.describe())


### Questions:
* How many schools are there?  
* How many schools per state?  
* How many schools in the dataset had 5 or fewer students in the cohort?  
* What is the missing value distribution?
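
As a quick illustration of how these questions map to pandas calls, the sketch below answers each one on a three-row toy frame. The rows are made up for the example and are not EDFacts data. The column names mirror the `ALL_COHORT_1314` and `ALL_RATE_1314` columns used in this notebook.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "STNAM": ["ALABAMA", "ALABAMA", "ALASKA"],
    "ALL_COHORT_1314": [120, 4, None],
    "ALL_RATE_1314": ["85-89", "PS", None],
})

n_schools = len(toy)                              # how many schools
per_state = toy["STNAM"].value_counts()           # schools per state
n_small = int((toy["ALL_COHORT_1314"] <= 5).sum())  # cohorts of 5 or fewer (NaN is not counted)
missing_share = toy.isna().mean()                 # fraction of missing values per column

print(n_schools, n_small)
print(per_state)
print(missing_share)
```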

In [None]:
# Number of schools per state
df.STNAM.value_counts()

In [None]:
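# Number of schools with 5 or fewer students in cohort, one count per year.
# Assumes 'PS' in an ALL_RATE column marks a rate suppressed because the cohort is that small.
# Counts recorded from an earlier run, in loop order:
"""
1468
1293
1304
1456
1265
1553
1350
1353
1475
"""
# Disabled by default because it takes about a minute
if False:
    for year in range(2010, 2019):
        # Use a separate name so the main df is not overwritten
        raw = gr.make_raw_gr_frame(year)
        for col in raw.columns:
            if col.startswith('ALL_RATE'):
                print((raw[col] == 'PS').sum())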

In [None]:
# Visualize missing values
sns.heatmap(df.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})

In [None]:
# School names are not unique: compare the row count with the number of distinct names
print(len(df.SCHNAM) == len(df.SCHNAM.unique()))
print(len(df.SCHNAM), len(df.SCHNAM.unique()))

In [None]:
# NCESSCH 12-digit identifier is unique
print(len(df.NCESSCH), len(df.NCESSCH.unique()))

In [None]:
# Is each (state, school name) pair unique? Counts above 1 are repeats.
df[['STNAM','SCHNAM']].value_counts()

In [None]:
# Is each (LEAID, school name) pair unique? Counts above 1 are repeats.
df[['LEAID','SCHNAM']].value_counts()
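
`value_counts` shows how often each pair occurs. To inspect the repeated rows themselves, `DataFrame.duplicated` with `keep=False` flags every member of a duplicated group. The sketch below uses a made-up toy frame (the names and ids are not real schools), and `duplicate_rows` is a hypothetical helper.

```python
import pandas as pd


def duplicate_rows(frame, key_cols):
    """Return all rows whose key_cols combination appears more than once."""
    mask = frame.duplicated(subset=key_cols, keep=False)
    return frame[mask].sort_values(key_cols)


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "STNAM": ["OHIO", "OHIO", "TEXAS"],
    "LEAID": ["3900001", "3900002", "4800001"],
    "SCHNAM": ["LINCOLN HIGH", "LINCOLN HIGH", "LINCOLN HIGH"],
})

dups_by_state = duplicate_rows(toy, ["STNAM", "SCHNAM"])
dups_by_lea = duplicate_rows(toy, ["LEAID", "SCHNAM"])
print(len(dups_by_state), len(dups_by_lea))
```

In this toy case the name repeats within a state but not within a district, which is the kind of difference the two checks above are meant to reveal.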