## What are the features in this data?
* Each instance is a school.

## Graduation rate is given by two features, Cohort and Rate, for each subpopulation listed below.
* Cohort - Number of students in that subpopulation
* Rate - Percentage (or range of percentage) of students in the cohort graduating with a high school diploma within 4 years

## School identifiers
* STNAM  - State name
* FIPST  - 2 digit code for the state
* LEANM  - School district name
* LEAID  - 7 digit code for school district 
* SCHNAM - School name
* NCESSCH - 12 digit school ID (the only unique identifier for a school)
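
Because these identifiers have a fixed number of digits, they lose their leading zeros if they are parsed as numbers. The sketch below shows one possible way to restore them as zero-padded strings. The toy values and the helper `pad_id` are hypothetical, and the approach assumes the columns contain no missing values.

```python
import pandas as pd

# Toy rows mimicking identifiers that were parsed as floats (made-up values).
toy = pd.DataFrame({
    "FIPST": [1.0, 48.0],
    "LEAID": [100005.0, 4800001.0],
    "NCESSCH": [10000500871.0, 480000100001.0],
})


def pad_id(series: pd.Series, width: int) -> pd.Series:
    """Convert a numeric ID column to zero-padded strings of a fixed width.

    Hypothetical helper; assumes there are no missing values in the column.
    """
    return series.astype("int64").astype(str).str.zfill(width)


toy["FIPST"] = pad_id(toy["FIPST"], 2)      # 2 digit state code
toy["LEAID"] = pad_id(toy["LEAID"], 7)      # 7 digit district code
toy["NCESSCH"] = pad_id(toy["NCESSCH"], 12)  # 12 digit school id
print(toy)
```

Reading the identifier columns as strings at load time would avoid the problem altogether, if the loader allows it.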


## Subpopulations
* ALL 	= All students in the school
* MAM 	= American Indian/Alaska Native students
* MAS 	= Asian/Pacific Islander students
* MHI 	= Hispanic students
* MBL 	= Black students
* MWH 	= White students
* MTR 	= Two or More Races
* CWD 	= Children with Disabilities (IDEA)
* ECD 	= Economically Disadvantaged students
* LEP 	= Limited English Proficient students


In [178]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from swampy import structshape as ss
import sys
sys.path.append("..")
from ingest import gr


In [179]:
dfs = [gr.make_raw_gr_frame(year=y) for y in range(2010, 2019)]
years = [gr.year_string(y) for y in range(2010, 2019)]

In [180]:
# Note that many of the column names have the school year in them
print(dfs[0].columns)

Index(['STNAM', 'FIPST', 'LEAID', 'LEANM', 'NCESSCH', 'SCHNAM',
       'ALL_COHORT_1011', 'ALL_RATE_1011', 'MAM_COHORT_1011', 'MAM_RATE_1011',
       'MAS_COHORT_1011', 'MAS_RATE_1011', 'MBL_COHORT_1011', 'MBL_RATE_1011',
       'MHI_COHORT_1011', 'MHI_RATE_1011', 'MTR_COHORT_1011', 'MTR_RATE_1011',
       'MWH_COHORT_1011', 'MWH_RATE_1011', 'CWD_COHORT_1011', 'CWD_RATE_1011',
       'ECD_COHORT_1011', 'ECD_RATE_1011', 'LEP_COHORT_1011', 'LEP_RATE_1011',
       'DATE_CUR'],
      dtype='object')


Check the shape of each year's data, since we plan to combine the years into one frame.

In [181]:
shape_data = [(school_year, df.shape) for school_year, df in zip(years, dfs)]
shape = pd.DataFrame(shape_data, columns=('school_year', 'shape'))
shape


Unnamed: 0,school_year,shape
0,1011,"(21335, 27)"
1,1112,"(21244, 27)"
2,1213,"(22077, 27)"
3,1314,"(22385, 28)"
4,1415,"(22167, 27)"
5,1516,"(23090, 27)"
6,1617,"(23129, 29)"
7,1718,"(23240, 33)"
8,1819,"(22900, 33)"


In [182]:
# Inspect features that appear in some years but not in all of them.
# Start by removing the school year from the column names.
# cols_wo_year => column names without the year, one list per dataframe
cols_wo_year = [list(map(lambda x: x.replace(y, ""), df.columns))
               for y, df in zip(years, dfs)]
print(set(cols_wo_year[3]) - set(cols_wo_year[0]))
print(set(cols_wo_year[6]) - set(cols_wo_year[0]))
print(set(cols_wo_year[7]) - set(cols_wo_year[0]))
print(set(cols_wo_year[8]) - set(cols_wo_year[0]))

{'INSERT_DATE'}
{'ST_LEAID', 'ST_SCHID'}
{'HOM_COHORT_', 'ST_SCHID', 'FCS_COHORT_', 'HOM_RATE_', 'FCS_RATE_', 'ST_LEAID'}
{'HOM_COHORT_', 'ST_SCHID', 'FCS_COHORT_', 'HOM_RATE_', 'FCS_RATE_', 'ST_LEAID'}
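
The comparison above picks the years with extra columns by hand. A more general variant, sketched below on made-up column lists, compares every year against the set of columns common to all years. `toy_cols` is a hypothetical stand-in for `cols_wo_year`, and comparing against the common set (rather than against the first year) is a design choice of this sketch, not what the notebook does.

```python
# Hypothetical year-free column lists, keyed by school year.
toy_cols = {
    "1011": ["STNAM", "ALL_COHORT_", "ALL_RATE_"],
    "1314": ["STNAM", "ALL_COHORT_", "ALL_RATE_", "INSERT_DATE"],
    "1718": ["STNAM", "ALL_COHORT_", "ALL_RATE_", "ST_LEAID", "HOM_RATE_"],
}

# Columns present in every year.
common = set.intersection(*(set(cols) for cols in toy_cols.values()))

# Columns each year has beyond the common set (sorted for stable output).
extras = {year: sorted(set(cols) - common) for year, cols in toy_cols.items()}

for year, extra in extras.items():
    print(year, extra)
```

This also catches columns that are missing from the first year only, which a comparison against year 0 would not flag.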


In [183]:
# INSERT_DATE records when the data was inserted and is not relevant to our study.
dfs[3].drop(['INSERT_DATE'], axis=1, inplace=True)
# ST_SCHID and ST_LEAID are state-assigned IDs that are not present in the earlier years.
# From the data, they appear to have been introduced in 2016. If we need another
# geographical grouping mechanism in the future, we can look into them.
dfs[6].drop(['ST_LEAID', 'ST_SCHID'], axis=1, inplace=True)
# HOM_* and FCS_* (cohort and rate) cover homeless and foster care students,
# subpopulations that were not tracked before school year 2017-2018.
idx7_sr = shape.school_year[7]
idx8_sr = shape.school_year[8]
dfs[7].drop(['ST_LEAID', 'ST_SCHID', 'FCS_RATE_'+idx7_sr, 'FCS_COHORT_'+idx7_sr,
            'HOM_RATE_'+idx7_sr, 'HOM_COHORT_'+idx7_sr], axis=1, inplace=True)
dfs[8].drop(['ST_LEAID', 'ST_SCHID', 'FCS_RATE_'+idx8_sr, 'FCS_COHORT_'+idx8_sr,
            'HOM_RATE_'+idx8_sr, 'HOM_COHORT_'+idx8_sr], axis=1, inplace=True)


In [184]:
# Verify that all dataframes have the same columns (minus the school year) before we combine. 
cols_wo_year = [list(map(lambda x: x.replace(y, ""), df.columns))
               for y, df in zip(years, dfs)]
# Compare each year's column list with the next year's
for prev_cols, next_cols in zip(cols_wo_year, cols_wo_year[1:]):
    assert prev_cols == next_cols

In [185]:
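big_df = pd.DataFrame()
print("big_df_columns", big_df.columns)
frames = []
for idx, df in enumerate(dfs):
    # Use the year-free column names so that every frame lines up
    df.columns = cols_wo_year[0]
    df['Year'] = years[idx]
    # Put Year first, followed by the original column order
    frames.append(df[['Year'] + cols_wo_year[0]])
# Stack the yearly frames row-wise (axis=0); axis=1 would place them side by side
big_df = pd.concat(frames, axis=0)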


big_df_columns Index([], dtype='object')
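
As a small illustration of the combining step, the sketch below strips the year from the column names of two toy frames, tags each row with its school year, and stacks the frames row-wise. The frames, their values, and the names `toy_frames` and `toy_years` are made up for illustration; `ignore_index=True` is an optional choice of this sketch that renumbers the rows.

```python
import pandas as pd

# Two made-up yearly frames with the school year embedded in a column name.
toy_frames = [
    pd.DataFrame({"SCHNAM": ["A", "B"], "ALL_RATE_1011": ["80", "70-79"]}),
    pd.DataFrame({"SCHNAM": ["A", "C"], "ALL_RATE_1112": ["85", "GE95"]}),
]
toy_years = ["1011", "1112"]

tagged = []
for year, frame in zip(toy_years, toy_frames):
    # Drop the year from the column names so the frames share one schema.
    renamed = frame.rename(columns=lambda c: c.replace(year, ""))
    # Record which school year each row came from, as the first column.
    renamed.insert(0, "Year", year)
    tagged.append(renamed)

# axis=0 appends rows; the result has one set of columns and all rows.
combined = pd.concat(tagged, axis=0, ignore_index=True)
print(combined)
```

Collecting the frames in a list and calling `pd.concat` once avoids re-copying the growing frame on every loop iteration.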


In [187]:
big_df.head(n=3)

Unnamed: 0,Year,STNAM,FIPST,LEAID,LEANM,NCESSCH,SCHNAM,ALL_COHORT_,ALL_RATE_,MAM_COHORT_,...,MTR_RATE_,MWH_COHORT_,MWH_RATE_,CWD_COHORT_,CWD_RATE_,ECD_COHORT_,ECD_RATE_,LEP_COHORT_,LEP_RATE_,DATE_CUR
0,1011,ALABAMA,1.0,100005.0,Albertville City,10000500000.0,Albertville High Sch,252,80,.,...,PS,175.0,GE95,19.0,GE80,114.0,90-94,67.0,75-79,24JUL20
1,1011,ALABAMA,1.0,100006.0,Marshall County,10000600000.0,Asbury Sch,57,70-79,.,...,,31.0,GE90,7.0,GE50,15.0,GE50,3.0,PS,24JUL20
2,1011,ALABAMA,1.0,100006.0,Marshall County,10000600000.0,Douglas High Sch,125,65-69,2,...,,85.0,90-94,9.0,GE50,99.0,90-94,7.0,GE50,24JUL20
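
The Rate columns in the preview mix exact percentages (`80`), ranges (`90-94`), bounded values (`GE95`), and non-numeric codes (`PS`, `.`). One possible way to make them comparable is to map each value to a lower and upper bound, as sketched below. The helper `rate_bounds` is hypothetical. It assumes that `GE` means "greater than or equal to", that `GT`, `LE`, and `LT` prefixes may also occur with the analogous meanings, and that anything else (such as `PS` or `.`) should be treated as missing. These assumptions should be checked against the data documentation.

```python
import re
from typing import Optional, Tuple


def rate_bounds(value) -> Optional[Tuple[int, int]]:
    """Map a Rate value to (low, high) percentage bounds, or None if unparseable.

    Hypothetical helper; the prefix meanings are assumptions, not documented facts.
    """
    text = str(value).strip().upper()

    # Exact percentage, e.g. "80"
    if re.fullmatch(r"\d+", text):
        n = int(text)
        return (n, n)

    # Explicit range, e.g. "90-94"
    m = re.fullmatch(r"(\d+)-(\d+)", text)
    if m:
        return (int(m.group(1)), int(m.group(2)))

    # Bounded value, e.g. "GE95" (assumed: at least 95 percent)
    m = re.fullmatch(r"(GE|GT|LE|LT)(\d+)", text)
    if m:
        op, n = m.group(1), int(m.group(2))
        if op == "GE":
            return (n, 100)
        if op == "GT":
            return (n + 1, 100)
        if op == "LE":
            return (0, n)
        return (0, n - 1)  # LT

    # Anything else (e.g. "PS", ".", missing) is treated as unknown.
    return None


for raw in ["80", "90-94", "GE95", "PS", "."]:
    print(raw, rate_bounds(raw))
```

Keeping both bounds, rather than collapsing each range to a midpoint, preserves the uncertainty that the ranges encode.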
