Reshaping Data
==============
This notebook uses `pandas` to convert our "[long](https://www.statology.org/long-vs-wide-data/)"
dataframe to a wide dataframe. This is called re-shaping.

Our default NYSED test scores are in a long format. There is a column called `category`
that repeats the same 27 unique values (_e.g._ "All Students", "Asian", "Never ELL", ...).
In this long format, we have 424,676 rows of data. Each row is one observation for a
school (dbn), academic year, grade level, and demographic category. The test results
are then reported in the following columns:
- number_tested
- mean_scale_score
- level_1
- level_1_pct
- level_2
- level_2_pct
- level_3
- level_3_pct
- level_4
- level_4_pct
- level_3_4
- level_3_4_pct

There is nothing wrong with this long format and it's the best
format for some types of analysis and visualizations. However, some types of analysis
are easier with a "wide" data format, where we re-shape the categories into columns.

For example, let's say that we want to rank schools on closing the achievement gap between
ELL students and "never ELL" students. We want a new column called `ell_delta`. The long format
makes this very challenging because the two scores for a school sit in different rows.
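
To make the difficulty concrete, here is a minimal sketch on a made-up long table (the school codes and scores are invented for illustration, not real NYSED results). In long format the two scores have to be split apart and merged back together before they can be subtracted:

```python
import pandas as pd

# toy long-format data: invented values, not real NYSED results
long_df = pd.DataFrame({
    "dbn": ["01M001", "01M001", "01M002", "01M002"],
    "category": ["Current ELL", "Never ELL", "Current ELL", "Never ELL"],
    "mean_scale_score": [590.0, 600.0, 580.0, 605.0],
})

# the two scores for a school live in different rows, so we split the
# table by category and merge the two pieces back together on the school
current = long_df[long_df["category"] == "Current ELL"]
never = long_df[long_df["category"] == "Never ELL"]
merged = never.merge(current, on="dbn", suffixes=("_never_ell", "_current_ell"))

# only now can the two scores be subtracted row by row
merged["ell_delta"] = (
    merged["mean_scale_score_never_ell"] - merged["mean_scale_score_current_ell"]
)
gaps = merged[["dbn", "ell_delta"]]
print(gaps)
```

With real data the merge would also need to match on grade, year, and exam, which is part of why the wide format is more convenient here.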

To simplify this working example, we are going to ignore all of the columns that begin with `level`.
We have the following 27 categories:
1. All Students
2. American Indian or Alaska Native
3. Asian
4. Asian or Pacific Islander
5. Black
6. Current ELL
7. Econ Disadv
8. Female
9. Hispanic
10. Homeless
11. In Foster Care
12. Limited English Proficient
13. Male
14. Migrant
15. Multiracial
16. Never ELL
17. Not Econ Disadv
18. Not English Language Learner
19. Not Homeless
20. Not Limited English Proficient
21. Not Migrant
22. Not SWD
23. Not in Foster Care
24. Parent Not in Armed Forces
25. Parent in Armed Forces
26. SWD
27. White

We are going to reshape our data so that each category gets 2 columns, one for `number_tested` and one for `mean_scale_score`:
`number_tested_all_students`, `number_tested_asian`, `mean_scale_score_all_students`, `mean_scale_score_asian`, etc.
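
As a rough sketch of that naming scheme, the toy example below pivots a small invented table (two schools, two categories) and flattens the resulting column names. The data values are made up; only the column naming mirrors what is described above.

```python
import pandas as pd

# toy long-format data: invented values, not real NYSED results
long_df = pd.DataFrame({
    "dbn": ["01M001", "01M001", "01M002", "01M002"],
    "category": ["All Students", "Asian", "All Students", "Asian"],
    "number_tested": [100, 20, 80, 10],
    "mean_scale_score": [600.0, 610.0, 595.0, 605.0],
})

# one row per school, one column per (value, category) pair
wide = long_df.pivot(
    index="dbn",
    columns="category",
    values=["number_tested", "mean_scale_score"],
)

# flatten ("number_tested", "All Students") into "number_tested_all_students"
wide.columns = ["_".join(col).lower().replace(" ", "_") for col in wide.columns]
print(wide.columns.tolist())
```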

In [5]:
import pandas as pd
from nycschools import nysed

# load the data from the csv file
df = nysed.load_nyc_nysed()

# print a numbered, alphabetical list of the unique categories
cats = sorted(df["category"].unique())
for i, c in enumerate(cats, start=1):
    print(f"{i}. {c}")

1. All Students
2. American Indian or Alaska Native
3. Asian
4. Asian or Pacific Islander
5. Black
6. Current ELL
7. Econ Disadv
8. Female
9. Hispanic
10. Homeless
11. In Foster Care
12. Limited English Proficient
13. Male
14. Migrant
15. Multiracial
16. Never ELL
17. Not Econ Disadv
18. Not English Language Learner
19. Not Homeless
20. Not Limited English Proficient
21. Not Migrant
22. Not SWD
23. Not in Foster Care
24. Parent Not in Armed Forces
25. Parent in Armed Forces
26. SWD
27. White


In [2]:

# get only the columns we're interested in
df = df[["school_name", "beds", "category", "exam",
         "grade", "test_year", "mean_scale_score"]]

# drop the rows with NaN (where the pop is too small to report)
df = df[df["mean_scale_score"].notnull()]

df

,school_name,beds,category,exam,grade,test_year,mean_scale_score
190143,PS 15 ROBERTO CLEMENTE,310100010015,All Students,ela,3,2021,613.0
190146,PS 15 ROBERTO CLEMENTE,310100010015,Not SWD,ela,3,2021,613.0
190156,PS 15 ROBERTO CLEMENTE,310100010015,Not in Foster Care,ela,3,2021,613.0
190159,PS 15 ROBERTO CLEMENTE,310100010015,Not Migrant,ela,3,2021,613.0
190160,PS 15 ROBERTO CLEMENTE,310100010015,Parent Not in Armed Forces,ela,3,2021,613.0
...,...,...,...,...,...,...,...
2262158,LOIS AND RICHARD NICOTRA EARLY COLLEGE CHARTER...,353100861136,Not Econ Disadv,ela,8,2019,596.0
2262159,LOIS AND RICHARD NICOTRA EARLY COLLEGE CHARTER...,353100861136,Not Migrant,ela,8,2019,588.0
2262160,LOIS AND RICHARD NICOTRA EARLY COLLEGE CHARTER...,353100861136,Not Homeless,ela,8,2019,588.0
2262161,LOIS AND RICHARD NICOTRA EARLY COLLEGE CHARTER...,353100861136,Not in Foster Care,ela,8,2019,588.0
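
The `notnull()` filter used above can also be written with `dropna`. The sketch below, on an invented three-row table, suggests the two spellings keep the same rows:

```python
import numpy as np
import pandas as pd

# toy data: invented values, with one unreported (NaN) score
toy = pd.DataFrame({
    "category": ["All Students", "Asian", "Migrant"],
    "mean_scale_score": [600.0, np.nan, 590.0],
})

# boolean-mask version, as in the cell above
kept_mask = toy[toy["mean_scale_score"].notnull()]

# equivalent dropna version, limited to the one column we care about
kept_dropna = toy.dropna(subset=["mean_scale_score"])

print(kept_mask.equals(kept_dropna))
```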


In [2]:
# here we pivot the category column and then
# rename the columns to make them all lowercase with underscores instead of spaces
# note: this assumes df has the columns dbn, grade, year, number_tested and
# mean_scale_score, with a single row for each (dbn, grade, year, category)
df = pd.pivot(df, index=['dbn', 'grade', 'year'], columns='category',
              values=['number_tested', 'mean_scale_score'])

# the pivoted columns are (value, category) tuples; join each into one name
df.columns = ["_".join(col) for col in df.columns]
df.columns = df.columns.str.lower().str.replace(" ", "_")

# now our dataframe looks like this:
df


dbn,grade,year,number_tested_all_students,number_tested_asian,number_tested_black,number_tested_current_ell,number_tested_econ_disadv,number_tested_ever_ell,number_tested_female,number_tested_hispanic,number_tested_male,number_tested_never_ell,...,mean_scale_score_econ_disadv,mean_scale_score_ever_ell,mean_scale_score_female,mean_scale_score_hispanic,mean_scale_score_male,mean_scale_score_never_ell,mean_scale_score_not_econ_disadv,mean_scale_score_not_swd,mean_scale_score_swd,mean_scale_score_white
01M015,3,2013,27.0,,,,,,12.0,15.0,15.0,,...,,,285.333344,285.266663,292.466675,,,287.157898,294.375000,
01M015,3,2014,18.0,,10.0,,18.0,,,,,,...,285.111114,,,,,,,290.083344,275.166656,
01M015,3,2015,16.0,,9.0,,16.0,,7.0,7.0,9.0,12.0,...,281.812500,,288.714294,280.428558,276.444458,284.833344,,285.125000,278.500000,
01M015,3,2016,20.0,,,,,,,13.0,,16.0,...,,,,291.230774,,289.937500,,304.454559,277.888886,
01M015,3,2017,27.0,,,,,,13.0,18.0,14.0,24.0,...,,,308.153839,300.333344,297.000000,303.500000,,310.904755,272.500000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32K562,All Grades,2015,309.0,,84.0,71.0,276.0,33.0,148.0,212.0,161.0,205.0,...,278.355072,285.212128,280.162170,275.382080,273.633545,281.073181,263.424255,279.796814,263.620697,
32K562,All Grades,2016,266.0,,64.0,70.0,248.0,37.0,125.0,194.0,141.0,159.0,...,284.592743,292.162170,285.888000,282.025787,281.865234,290.320740,272.222229,287.457550,269.222229,
32K562,All Grades,2017,282.0,,71.0,56.0,247.0,57.0,138.0,204.0,144.0,169.0,...,287.542511,298.140350,289.456512,286.147064,283.631958,291.183441,279.000000,290.506928,273.046143,
32K562,All Grades,2018,317.0,,76.0,51.0,290.0,62.0,171.0,232.0,146.0,204.0,...,592.510315,598.306457,595.736816,593.405151,588.842468,594.754883,593.111084,594.584961,584.562500,


In [3]:
# next, flatten the columns by resetting the index
df = df.reset_index()
df

,dbn,grade,year,number_tested_all_students,number_tested_asian,number_tested_black,number_tested_current_ell,number_tested_econ_disadv,number_tested_ever_ell,number_tested_female,...,mean_scale_score_econ_disadv,mean_scale_score_ever_ell,mean_scale_score_female,mean_scale_score_hispanic,mean_scale_score_male,mean_scale_score_never_ell,mean_scale_score_not_econ_disadv,mean_scale_score_not_swd,mean_scale_score_swd,mean_scale_score_white
0,01M015,3,2013,27.0,,,,,,12.0,...,,,285.333344,285.266663,292.466675,,,287.157898,294.375000,
1,01M015,3,2014,18.0,,10.0,,18.0,,,...,285.111114,,,,,,,290.083344,275.166656,
2,01M015,3,2015,16.0,,9.0,,16.0,,7.0,...,281.812500,,288.714294,280.428558,276.444458,284.833344,,285.125000,278.500000,
3,01M015,3,2016,20.0,,,,,,,...,,,,291.230774,,289.937500,,304.454559,277.888886,
4,01M015,3,2017,27.0,,,,,,13.0,...,,,308.153839,300.333344,297.000000,303.500000,,310.904755,272.500000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32760,32K562,All Grades,2015,309.0,,84.0,71.0,276.0,33.0,148.0,...,278.355072,285.212128,280.162170,275.382080,273.633545,281.073181,263.424255,279.796814,263.620697,
32761,32K562,All Grades,2016,266.0,,64.0,70.0,248.0,37.0,125.0,...,284.592743,292.162170,285.888000,282.025787,281.865234,290.320740,272.222229,287.457550,269.222229,
32762,32K562,All Grades,2017,282.0,,71.0,56.0,247.0,57.0,138.0,...,287.542511,298.140350,289.456512,286.147064,283.631958,291.183441,279.000000,290.506928,273.046143,
32763,32K562,All Grades,2018,317.0,,76.0,51.0,290.0,62.0,171.0,...,592.510315,598.306457,595.736816,593.405151,588.842468,594.754883,593.111084,594.584961,584.562500,
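
To illustrate what `reset_index` does after a pivot, here is a small sketch on invented data. After the pivot, the key fields sit in the row index rather than in ordinary columns; resetting the index turns them back into columns and restores a plain 0..n-1 row index.

```python
import pandas as pd

# toy long-format data: invented values, not real NYSED results
long_df = pd.DataFrame({
    "dbn": ["01M001", "01M001", "01M002", "01M002"],
    "year": [2019, 2019, 2019, 2019],
    "category": ["Current ELL", "Never ELL", "Current ELL", "Never ELL"],
    "mean_scale_score": [590.0, 600.0, 580.0, 605.0],
})

wide = long_df.pivot(index=["dbn", "year"], columns="category",
                     values="mean_scale_score")

# after the pivot, dbn and year are index levels, not columns
keys_before = list(wide.index.names)

# reset_index moves the index levels back into regular columns
flat = wide.reset_index()
print(keys_before)
print(flat.columns.tolist())
```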


In [4]:
df.columns

Index(['dbn', 'grade', 'year', 'number_tested_all_students',
       'number_tested_asian', 'number_tested_black',
       'number_tested_current_ell', 'number_tested_econ_disadv',
       'number_tested_ever_ell', 'number_tested_female',
       'number_tested_hispanic', 'number_tested_male',
       'number_tested_never_ell', 'number_tested_not_econ_disadv',
       'number_tested_not_swd', 'number_tested_swd', 'number_tested_white',
       'mean_scale_score_all_students', 'mean_scale_score_asian',
       'mean_scale_score_black', 'mean_scale_score_current_ell',
       'mean_scale_score_econ_disadv', 'mean_scale_score_ever_ell',
       'mean_scale_score_female', 'mean_scale_score_hispanic',
       'mean_scale_score_male', 'mean_scale_score_never_ell',
       'mean_scale_score_not_econ_disadv', 'mean_scale_score_not_swd',
       'mean_scale_score_swd', 'mean_scale_score_white'],
      dtype='object')

Quick analysis with wide data
-------------------------------------------

### ELL vs Never ELL
Which schools have the best ELA test scores for ELL students
compared to never-ELL students?

Let's compute a new column called `ell_delta`, the difference in mean test scores
between the Never ELL group and the Current ELL group. A positive value means Never ELL
students scored higher; a negative value means Current ELL students scored higher.

In [5]:
# keep only the rows that report a mean score for both groups
ell = df[df["mean_scale_score_never_ell"].notnull() & df["mean_scale_score_current_ell"].notnull()].copy()

# positive delta: Never ELL scored higher; negative delta: Current ELL scored higher
ell["ell_delta"] = ell["mean_scale_score_never_ell"] - ell["mean_scale_score_current_ell"]
ell[["dbn", "mean_scale_score_never_ell", "mean_scale_score_current_ell", "ell_delta"]].sort_values(by="ell_delta")

,dbn,mean_scale_score_never_ell,mean_scale_score_current_ell,ell_delta
6488,07X359,335.414642,351.166656,-15.752014
407,01M188,291.631592,306.500000,-14.868408
10518,10X360,279.279999,292.714294,-13.434296
6515,07X369,267.465118,280.577789,-13.112671
29729,30Q111,266.000000,276.428558,-10.428558
...,...,...,...,...
31669,31R054,313.670319,222.000000,91.670319
17477,18K235,322.792999,231.000000,91.792999
19316,20K187,347.045868,254.875000,92.170868
20988,22K052,306.537048,212.500000,94.037048
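
The sign convention in the table above can be checked on a tiny invented example. In this sketch school "B" has Current ELL students scoring higher than Never ELL students, so its `ell_delta` is negative and it sorts to the top; school "C" is dropped because one of the two scores is missing. The school labels and scores are made up.

```python
import numpy as np
import pandas as pd

# toy wide-format data: invented values, not real NYSED results
wide = pd.DataFrame({
    "dbn": ["A", "B", "C"],
    "mean_scale_score_never_ell": [600.0, 590.0, np.nan],
    "mean_scale_score_current_ell": [585.0, 595.0, 580.0],
})

# keep only schools that report both scores
score_cols = ["mean_scale_score_never_ell", "mean_scale_score_current_ell"]
both = wide.dropna(subset=score_cols).copy()

# Never ELL minus Current ELL, as in the cell above
both["ell_delta"] = (
    both["mean_scale_score_never_ell"] - both["mean_scale_score_current_ell"]
)

# ascending sort puts the schools where Current ELL students did best first
ranked = both.sort_values(by="ell_delta")
print(ranked[["dbn", "ell_delta"]])
```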
