Load ELA Excel Data
======================
For some reason, the Open Data Portal does not provide the latest
grades 3-8 test data through its live URL. The data is only available as an
Excel file to download from the page that should have the API link.
These Excel files have one sheet for each population tested:

- all students
- ethnicity
- gender
- SWD status
- econ status
- ELL status

This notebook loads the sheets from the Excel workbook into a single `DataFrame`.
The Excel file has been modified by hand to delete the school name column,
delete sheets that don't contain data, and rename the column headers.

The core school data set (school demographic data) can now be loaded with
a single function call from the custom `schools.py` module.


In [32]:
# these Jupyter Notebook "magic" commands tell the notebook to reload the modules
# every time so that we can work on schools.py and not have to restart the kernel
%load_ext autoreload
%autoreload 2

# import pandas under its usual alias
import pandas as pd
# our custom school data functions
import schools

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Load from Excel
===============
The Excel workbook has a different sheet for each type of data:

- ethnic breaks
- students with disabilities
- ELL
- poverty

Each one of these sheets has the same column headers:

- dbn
- grade
- year
- category
- number_tested
- mean_scale_score
- level_1
- level_1_pct
- level_2
- level_2_pct
- level_3
- level_3_pct
- level_4
- level_4_pct
- level_3_4
- level_3_4_pct

Because they have the same columns, we can load every sheet and then
concatenate them into one dataframe (each sheet's rows are added to the bottom of the dataset).
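
As a rough illustration of what concatenation does, here is a minimal sketch with two toy frames that share the same columns. The frame names and the values are made up for the example; only `pd.concat` and `ignore_index=True` come from the code in this notebook.

```python
import pandas as pd

# Two toy "sheets" with identical columns (values are made up for illustration)
all_students = pd.DataFrame({
    "dbn": ["01M015", "01M015"],
    "category": ["All Students", "All Students"],
    "number_tested": [27, 18],
})
ell = pd.DataFrame({
    "dbn": ["01M015"],
    "category": ["ELL"],
    "number_tested": [5],
})

# Stack the rows of the second frame under the first.
# ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own index.
stacked = pd.concat([all_students, ell], ignore_index=True)
print(stacked)
```

Without `ignore_index=True`, each sheet would keep its own row numbers and the combined index would contain duplicates.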

In the following code, I use a **list comprehension**. This isn't specific to
`pandas` -- it's a built-in feature of the Python language. You might not
be familiar with it, though, so you can read more here:

- https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
- https://www.w3schools.com/python/python_lists_comprehension.asp

In line 6, I use a list comprehension to create a new list based on the
`sheet_names` list. We could do this with a `for` loop, but the list
comprehension is more concise and more "Pythonic". Semantically, it does the following:

1. read each string in the `sheet_names` list
2. load that sheet with the `pd.read_excel()` function
3. append the result of `pd.read_excel()` (a `DataFrame`) to the new list `data`
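
To make the comparison with a `for` loop concrete, here is a small sketch of both forms side by side. The `load_sheet()` function is a hypothetical stand-in for `pd.read_excel()` so the example runs without the workbook; it just returns a string.

```python
# Hypothetical stand-in for pd.read_excel(): returns a label instead of a DataFrame
def load_sheet(name):
    return f"data from {name}"

sheet_names = ["all", "swd", "ell"]

# for-loop version: start with an empty list and append each result
data_loop = []
for sheet in sheet_names:
    data_loop.append(load_sheet(sheet))

# list-comprehension version: the same three steps in one expression
data_comp = [load_sheet(sheet) for sheet in sheet_names]

print(data_loop == data_comp)  # True
```

Both versions build the same list in the same order; the comprehension just avoids the empty list and the explicit `append()`.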

**Note:** _there's a lot of data in the Excel file, so the cell that reads it takes a couple of minutes to run, even on a fast computer_.

In [33]:
sheet_names = ["all", "swd", "ethnicity", "gender", "econ_status", "ell"]

# open the Excel workbook
xls = pd.ExcelFile('ela.xlsx')
# read each sheet into a list of DataFrames
data = [pd.read_excel(xls, sheet) for sheet in sheet_names]
# combine them into a single dataframe
ela_df = pd.concat(data, ignore_index=True)

ela_df

Unnamed: 0,dbn,grade,year,category,number_tested,mean_scale_score,level_1,level_1_pct,level_2,level_2_pct,level_3,level_3_pct,level_4,level_4_pct,level_3_4,level_3_4_pct
0,01M015,3,2013,All Students,27,289.296295,14,51.851852,11,40.740742,2,7.407407,0,0,2,7.407407
1,01M015,3,2014,All Students,18,285.111114,10,55.555557,8,44.444443,0,0,0,0,0,0
2,01M015,3,2015,All Students,16,281.8125,9,56.25,5,31.25,2,12.5,0,0,2,12.5
3,01M015,3,2016,All Students,20,292.5,10,50,6,30,4,20,0,0,4,20
4,01M015,3,2017,All Students,27,302.370361,10,37.037037,8,29.629629,7,25.925926,2,7.407407,9,33.333332
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
424671,32K562,All Grades,2015,Never ELL,205,281.073181,103,50.243904,85,41.463413,15,7.317073,2,0.97561,17,8.292683
424672,32K562,All Grades,2016,Never ELL,159,290.32074,50,31.446541,93,58.490566,14,8.805032,2,1.257862,16,10.062893
424673,32K562,All Grades,2017,Never ELL,169,291.183441,47,27.810652,105,62.130177,12,7.100592,5,2.95858,17,10.059172
424674,32K562,All Grades,2018,Never ELL,204,594.754883,61,29.90196,91,44.607841,43,21.078432,9,4.411765,52,25.490196


### Clean and save
As we've seen, the DOE inserts an `s` if too few students are represented
by a single value (e.g. 3 ELL students sat for the ELA exam in third grade at a school).

Rather than having mixed data types, we will convert all of the cells with `s` to
the special `numpy` `NaN` (not a number).

Later we can filter these out or search for them by calling `isnull()` (example below).

After cleaning the data, we save it to a .csv file so that we don't have to run
the slow conversion each time we want to work with this data.

In [35]:
# convert all of these columns to numbers
numeric_cols = [
    "mean_scale_score",
    "level_1",
    "level_1_pct",
    "level_2",
    "level_2_pct",
    "level_3",
    "level_3_pct",
    "level_4",
    "level_4_pct",
    "level_3_4",
    "level_3_4_pct"
]

for col in numeric_cols:
    ela_df[col] = pd.to_numeric(ela_df[col], errors='coerce')

# save the data to a .csv text file
ela_df.to_csv("ela-combined.csv", index=False)
"file saved"

'file saved'
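
Here is a minimal sketch of the `s` to `NaN` conversion and the `isnull()` filtering mentioned above, run on a toy frame rather than the real data. The `toy` frame and its values are made up; `pd.to_numeric(..., errors="coerce")`, `isnull()`, and `notnull()` are the real `pandas` calls.

```python
import pandas as pd

# Toy data: the middle row uses "s" in place of a suppressed score (values are made up)
toy = pd.DataFrame({
    "category": ["All Students", "ELL", "Never ELL"],
    "mean_scale_score": ["289.3", "s", "302.4"],
})

# errors="coerce" turns anything that can't be parsed as a number into NaN
toy["mean_scale_score"] = pd.to_numeric(toy["mean_scale_score"], errors="coerce")

# rows where the score was suppressed
suppressed = toy[toy["mean_scale_score"].isnull()]
# rows with a usable score
reported = toy[toy["mean_scale_score"].notnull()]

print(suppressed)
print(reported)
```

After the conversion the column holds floats, so aggregate functions such as `mean()` skip the `NaN` rows by default.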

Merge ELA and School Demographics
===============================

This last code block demonstrates the new `schools` module
and how to use the `load_school_demographics()` function.

Here we:
1. load the demographics data
2. `merge` it with the ELA test data
3. report on some results

We'll use some aggregate functions to make a couple of quick tables.


In [37]:
df = schools.load_school_demographics()

combined = df.merge(ela_df, how="inner", on=["dbn", "year"])


cols = ['dbn',
        'year',
        'district',
        'boro',
        'school_name', 
        'total_enrollment',
        'asian_1', 
        'black_1', 
        'hispanic_1', 
        'multi_racial_1', 
        'native_american_1', 
        'white_1', 
        'students_with_disabilities_1', 
        'english_language_learners_1',  
        'poverty_1',
        'economic_need_index']



# display the merged data
combined


Unnamed: 0,dbn,school_name,year,total_enrollment,grade_3k_pk_half_day_full,grade_k,grade_1,grade_2,grade_3,grade_4,...,level_1,level_1_pct,level_2,level_2_pct,level_3,level_3_pct,level_4,level_4_pct,level_3_4,level_3_4_pct
0,01M015,P.S. 015 Roberto Clemente,2016,178,17,28,33,27,31,24,...,10.0,50.000000,6.0,30.000000,4.0,20.000000,0.0,0.000000,4.0,20.000000
1,01M015,P.S. 015 Roberto Clemente,2016,178,17,28,33,27,31,24,...,5.0,33.333332,7.0,46.666668,3.0,20.000000,0.0,0.000000,3.0,20.000000
2,01M015,P.S. 015 Roberto Clemente,2016,178,17,28,33,27,31,24,...,8.0,50.000000,5.0,31.250000,3.0,18.750000,0.0,0.000000,3.0,18.750000
3,01M015,P.S. 015 Roberto Clemente,2016,178,17,28,33,27,31,24,...,23.0,45.098038,18.0,35.294117,10.0,19.607843,0.0,0.000000,10.0,19.607843
4,01M015,P.S. 015 Roberto Clemente,2016,178,17,28,33,27,31,24,...,3.0,27.272728,4.0,36.363636,4.0,36.363636,0.0,0.000000,4.0,36.363636
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255256,32K562,Evergreen Middle School for Urban Exploration,2019,380,0,0,0,0,0,0,...,10.0,16.393442,28.0,45.901638,14.0,22.950819,9.0,14.754098,23.0,37.704918
255257,32K562,Evergreen Middle School for Urban Exploration,2019,380,0,0,0,0,0,0,...,30.0,45.454544,15.0,22.727272,15.0,22.727272,6.0,9.090909,21.0,31.818182
255258,32K562,Evergreen Middle School for Urban Exploration,2019,380,0,0,0,0,0,0,...,28.0,35.443039,28.0,35.443039,19.0,24.050632,4.0,5.063291,23.0,29.113924
255259,32K562,Evergreen Middle School for Urban Exploration,2019,380,0,0,0,0,0,0,...,2.0,3.125000,33.0,51.562500,25.0,39.062500,4.0,6.250000,29.0,45.312500
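
As a rough sketch of the merge and of the kind of quick aggregate table mentioned above, here is the same pattern on two toy frames. The frames `demo` and `tests` and all of their values are made up for illustration; the column names `dbn`, `year`, `boro`, `category`, and `level_3_4_pct` are the ones used in this notebook.

```python
import pandas as pd

# Toy demographics: one row per school per year (values are made up)
demo = pd.DataFrame({
    "dbn": ["01M015", "01M015", "32K562"],
    "year": [2016, 2017, 2016],
    "boro": ["Manhattan", "Manhattan", "Brooklyn"],
})

# Toy test results: several rows per school per year, one for each category
tests = pd.DataFrame({
    "dbn": ["01M015", "01M015", "32K562", "32K562"],
    "year": [2016, 2016, 2016, 2016],
    "category": ["All Students", "ELL", "All Students", "ELL"],
    "level_3_4_pct": [20.0, 10.0, 40.0, 30.0],
})

# Inner merge keeps only dbn/year pairs found in both frames.
# 01M015/2017 has no test rows, so it is dropped.
merged = demo.merge(tests, how="inner", on=["dbn", "year"])

# Quick table: mean percent at levels 3 and 4 by borough, "All Students" rows only
all_students = merged[merged["category"] == "All Students"]
by_boro = all_students.groupby("boro")["level_3_4_pct"].mean()
print(by_boro)
```

Because the test data has one row per category, the demographic columns repeat on every matching row, which is why the same school and enrollment values appear many times in the output above. Filtering to a single category before aggregating avoids counting a school more than once.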
