# Read Excel Files

We must use the PASSNYC data, but it is outdated: it covers the 2016/2017 school year. Data from 2017/2018 would be better, since it is more recent and, more importantly, matches the SHSAT applications table that came from the New York Times.

There are three main files we want to parse:

- School Demographics Snapshot (this contains information such as the % of Hispanic students, ELL, etc.)
- Results from the NYS Mathematics test
- Results from the NYS ELA test

## School Demographics Snapshot

This file is straightforward to read: the `School` sheet has a single header row.

In [2]:
import pandas as pd

df = pd.read_excel('../data/raw/demographicsnapshot201314to201718public_final.xlsx', sheet_name='School', header=0)
df.head()

Unnamed: 0,DBN,School Name,Year,Total Enrollment,Grade PK (Half Day & Full Day),Grade K,Grade 1,Grade 2,Grade 3,Grade 4,...,% Multiple Race Categories Not Represented,# White,% White,# Students with Disabilities,% Students with Disabilities,# English Language Learners,% English Language Learners,# Poverty,% Poverty,Economic Need Index
0,01M015,P.S. 015 Roberto Clemente,2013-14,190,26,39,39,21,16,26,...,0.010526,3,0.015789,65,0.342105,19,0.1,171,0.9,
1,01M015,P.S. 015 Roberto Clemente,2014-15,183,18,27,47,31,19,17,...,0.005464,2,0.010929,64,0.349727,17,0.092896,169,0.923497,0.934525
2,01M015,P.S. 015 Roberto Clemente,2015-16,176,14,32,33,39,23,17,...,0.017045,2,0.011364,60,0.340909,16,0.090909,149,0.846591,0.895551
3,01M015,P.S. 015 Roberto Clemente,2016-17,178,17,28,33,27,31,24,...,0.022472,4,0.022472,51,0.286517,12,0.067416,152,0.853933,0.891916
4,01M015,P.S. 015 Roberto Clemente,2017-18,190,17,28,32,33,23,31,...,0.010526,6,0.031579,45,0.236842,8,0.042105,161,0.847368,0.889605


In [7]:
df.tail()

Unnamed: 0,DBN,School Name,Year,Total Enrollment,Grade PK (Half Day & Full Day),Grade K,Grade 1,Grade 2,Grade 3,Grade 4,...,% Multiple Race Categories Not Represented,# White,% White,# Students with Disabilities,% Students with Disabilities,# English Language Learners,% English Language Learners,# Poverty,% Poverty,Economic Need Index
8967,84X730,Bronx Charter School for the Arts,2013-14,319,0,52,54,56,56,53,...,0.018809,3,0.009404,57,0.178683,27,0.084639,222,0.695925,
8968,84X730,Bronx Charter School for the Arts,2014-15,316,0,53,55,56,53,47,...,0.015823,2,0.006329,53,0.167722,34,0.107595,284,0.898734,0.822158
8969,84X730,Bronx Charter School for the Arts,2015-16,323,0,51,57,55,55,57,...,0.018576,5,0.01548,61,0.188854,49,0.151703,268,0.829721,0.80557
8970,84X730,Bronx Charter School for the Arts,2016-17,320,0,53,53,55,52,53,...,0.009375,3,0.009375,67,0.209375,51,0.159375,235,0.734375,0.834672
8971,84X730,Bronx Charter School for the Arts,2017-18,314,0,50,51,54,51,52,...,0.009554,1,0.003185,63,0.200637,57,0.181529,253,0.805732,0.883169


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8972 entries, 0 to 8971
Data columns (total 39 columns):
DBN                                           8972 non-null object
School Name                                   8972 non-null object
Year                                          8972 non-null object
Total Enrollment                              8972 non-null int64
Grade PK (Half Day & Full Day)                8972 non-null int64
Grade K                                       8972 non-null int64
Grade 1                                       8972 non-null int64
Grade 2                                       8972 non-null int64
Grade 3                                       8972 non-null int64
Grade 4                                       8972 non-null int64
Grade 5                                       8972 non-null int64
Grade 6                                       8972 non-null int64
Grade 7                                       8972 non-null int64
Grade 8                         

**All fine!**
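
Since the 2017/2018 school year is the one we care about, the snapshot can be narrowed down to that year. This is a minimal sketch on a toy frame that reuses a few values from the output above; the names `demo` and `latest` are illustrative only, and it assumes `Year` is stored as a string such as `'2017-18'`.

```python
import pandas as pd

# Toy stand-in for the "School" sheet, built from rows shown in the output above
demo = pd.DataFrame({
    'DBN': ['01M015', '01M015', '84X730', '84X730'],
    'Year': ['2016-17', '2017-18', '2016-17', '2017-18'],
    'Total Enrollment': [178, 190, 320, 314],
})

# Keep only the most recent school year and renumber the rows
latest = demo[demo['Year'] == '2017-18'].reset_index(drop=True)
```

The same boolean mask should work on the full `df` loaded above, provided the `Year` values follow the same pattern.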

## NYS Math Test

The data that was present in the PASSNYC dataset is spread across several sheets in this file; we will focus on just one of them, `All Students`.

In [33]:
df = pd.read_excel('../data/raw/school-math-results-2013-2017-public.xlsx', sheet_name='All Students', skiprows=7)
df.head()

Unnamed: 0.1,Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,#,%,#.1,%.1,#.2,%.2,#.3,%.3,#.4,%.4
0,01M01532013All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2013,All Students,27,277.778,16,59.2593,11,40.7407,0,0.0,0,0.0,0,0.0
1,01M01532014All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2014,All Students,18,286.389,6,33.3333,9,50.0,2,11.1111,1,5.55556,3,16.6667
2,01M01532015All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2015,All Students,17,279.588,10,58.8235,4,23.5294,2,11.7647,1,5.88235,3,17.6471
3,01M01532016All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2016,All Students,21,274.81,13,61.9048,4,19.0476,4,19.0476,0,0.0,4,19.0476
4,01M01532017All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2017,All Students,29,301.552,8,27.5862,9,31.0345,7,24.1379,5,17.2414,12,41.3793


In [38]:
df = df.iloc[:,1:]  # drop the first column (a concatenated DBN/grade/year/category key)
df.columns = [
    'DBN',
    'School Name',
    'Grade',
    'Year',
    'Category',
    'Number Tested',
    'Mean Scale Score',
    '# Level 1',
    '% Level 1',
    '# Level 2',
    '% Level 2',
    '# Level 3',
    '% Level 3',
    '# Level 4',
    '% Level 4',
    '# Level 3+4',
    '% Level 3+4'
]

df.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,PS 015 ROBERTO CLEMENTE,3,2013,All Students,27,277.778,16,59.2593,11,40.7407,0,0.0,0,0.0,0,0.0
1,01M015,PS 015 ROBERTO CLEMENTE,3,2014,All Students,18,286.389,6,33.3333,9,50.0,2,11.1111,1,5.55556,3,16.6667
2,01M015,PS 015 ROBERTO CLEMENTE,3,2015,All Students,17,279.588,10,58.8235,4,23.5294,2,11.7647,1,5.88235,3,17.6471
3,01M015,PS 015 ROBERTO CLEMENTE,3,2016,All Students,21,274.81,13,61.9048,4,19.0476,4,19.0476,0,0.0,4,19.0476
4,01M015,PS 015 ROBERTO CLEMENTE,3,2017,All Students,29,301.552,8,27.5862,9,31.0345,7,24.1379,5,17.2414,12,41.3793


In [39]:
df.tail()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
23891,32K562,EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION,All Grades,2013,All Students,145,266.297,107,73.7931,35,24.1379,3,2.06897,0,0.0,3,2.06897
23892,32K562,EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION,All Grades,2014,All Students,231,269.372,162,70.1299,65,28.1385,4,1.7316,0,0.0,4,1.7316
23893,32K562,EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION,All Grades,2015,All Students,324,267.944,249,76.8518,70,21.6049,4,1.23457,1,0.308642,5,1.54321
23894,32K562,EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION,All Grades,2016,All Students,269,270.234,176,65.4275,85,31.5985,6,2.23048,2,0.743494,8,2.97398
23895,32K562,EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION,All Grades,2017,All Students,294,268.721,206,70.068,74,25.1701,9,3.06122,5,1.70068,14,4.7619


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23896 entries, 0 to 23895
Data columns (total 17 columns):
DBN                 23896 non-null object
School Name         23896 non-null object
Grade               23896 non-null object
Year                23896 non-null int64
Category            23896 non-null object
Number Tested       23896 non-null int64
Mean Scale Score    23896 non-null object
# Level 1           23896 non-null object
% Level 1           23896 non-null object
# Level 2           23896 non-null object
% Level 2           23896 non-null object
# Level 3           23896 non-null object
% Level 3           23896 non-null object
# Level 4           23896 non-null object
% Level 4           23896 non-null object
# Level 3+4         23896 non-null object
% Level 3+4         23896 non-null object
dtypes: int64(2), object(15)
memory usage: 3.1+ MB


This is fine. In rows with 5 or fewer students, the values have been replaced with an "s", which is why the score columns show up as `object` rather than numeric. Later on, we shall replace these with NaN.
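
One possible way to handle the "s" placeholders is `pd.to_numeric` with `errors='coerce'`, which turns any value that cannot be parsed as a number into NaN. The frame below is a toy example, not the real sheet, and `results` and `score_cols` are names made up for this sketch.

```python
import pandas as pd

# Toy frame mimicking the results sheet; "s" marks a suppressed value
results = pd.DataFrame({
    'DBN': ['01M015', '01M015'],
    'Number Tested': [27, 4],
    'Mean Scale Score': [277.778, 's'],
    '# Level 1': [16, 's'],
    '% Level 1': [59.2593, 's'],
})

score_cols = ['Mean Scale Score', '# Level 1', '% Level 1']

# errors='coerce' converts anything non-numeric (here, "s") to NaN,
# so each column ends up with a float dtype
for col in score_cols:
    results[col] = pd.to_numeric(results[col], errors='coerce')
```

Note that coercion also hides any other unexpected strings, so it is worth checking that "s" is the only non-numeric value before applying it to the real data.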

**Note: charter schools are not present in this dataset.**

## NYS ELA Test

The ELA file has the same layout as the NYS Math file, so the same steps apply.

In [47]:
df = pd.read_excel('../data/raw/school-ela-results-2013-2017-public.xlsx', sheet_name='All Students', skiprows=7)
df.head()

Unnamed: 0.1,Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,#,%,#.1,%.1,#.2,%.2,#.3,%.3,#.4,%.4
0,01M01532013All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2013,All Students,27,289.296,14,51.8518,11,40.7407,2,7.40741,0,0.0,2,7.40741
1,01M01532014All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2014,All Students,18,285.111,10,55.5556,8,44.4444,0,0.0,0,0.0,0,0.0
2,01M01532015All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2015,All Students,16,281.812,9,56.25,5,31.25,2,12.5,0,0.0,2,12.5
3,01M01532016All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2016,All Students,20,292.5,10,50.0,6,30.0,4,20.0,0,0.0,4,20.0
4,01M01532017All Students,01M015,PS 015 ROBERTO CLEMENTE,3,2017,All Students,27,302.37,10,37.037,8,29.6296,7,25.9259,2,7.40741,9,33.3333


In [48]:
df = df.iloc[:,1:]  # drop the first column (a concatenated DBN/grade/year/category key)
df.columns = [
    'DBN',
    'School Name',
    'Grade',
    'Year',
    'Category',
    'Number Tested',
    'Mean Scale Score',
    '# Level 1',
    '% Level 1',
    '# Level 2',
    '% Level 2',
    '# Level 3',
    '% Level 3',
    '# Level 4',
    '% Level 4',
    '# Level 3+4',
    '% Level 3+4'
]

df.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,PS 015 ROBERTO CLEMENTE,3,2013,All Students,27,289.296,14,51.8518,11,40.7407,2,7.40741,0,0.0,2,7.40741
1,01M015,PS 015 ROBERTO CLEMENTE,3,2014,All Students,18,285.111,10,55.5556,8,44.4444,0,0.0,0,0.0,0,0.0
2,01M015,PS 015 ROBERTO CLEMENTE,3,2015,All Students,16,281.812,9,56.25,5,31.25,2,12.5,0,0.0,2,12.5
3,01M015,PS 015 ROBERTO CLEMENTE,3,2016,All Students,20,292.5,10,50.0,6,30.0,4,20.0,0,0.0,4,20.0
4,01M015,PS 015 ROBERTO CLEMENTE,3,2017,All Students,27,302.37,10,37.037,8,29.6296,7,25.9259,2,7.40741,9,33.3333


## Charter Schools

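Charter school results are in a separate file, with ELA and Math on separate sheets. Unlike the public-school files, these sheets have no extra leading column, so nothing needs to be dropped.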

In [52]:
dfs = pd.read_excel('../data/raw/charter-school-results-2013-2017-public.xlsx', sheet_name=['ELA', 'Math'], skiprows=7)
dfs['ELA'].head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,#,%,#.1,%.1,#.2,%.2,#.3,%.3,#.4,%.4
0,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2015,All Students,43,306.419,9,20.9302,20,46.5116,13,30.2326,1,2.32558,14,32.5581
1,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2016,All Students,52,319.115,4,7.69231,20,38.4615,25,48.0769,3,5.76923,28,53.8462
2,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2017,All Students,50,323.02,2,4.0,16,32.0,29,58.0,3,6.0,32,64.0
3,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,4,2016,All Students,40,315.275,4,10.0,18,45.0,14,35.0,4,10.0,18,45.0
4,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,4,2017,All Students,52,319.615,4,7.69231,18,34.6154,21,40.3846,9,17.3077,30,57.6923


In [53]:
df = dfs['ELA']
df.columns = [
    'DBN',
    'School Name',
    'Grade',
    'Year',
    'Category',
    'Number Tested',
    'Mean Scale Score',
    '# Level 1',
    '% Level 1',
    '# Level 2',
    '% Level 2',
    '# Level 3',
    '% Level 3',
    '# Level 4',
    '% Level 4',
    '# Level 3+4',
    '% Level 3+4'
]

df.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2015,All Students,43,306.419,9,20.9302,20,46.5116,13,30.2326,1,2.32558,14,32.5581
1,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2016,All Students,52,319.115,4,7.69231,20,38.4615,25,48.0769,3,5.76923,28,53.8462
2,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2017,All Students,50,323.02,2,4.0,16,32.0,29,58.0,3,6.0,32,64.0
3,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,4,2016,All Students,40,315.275,4,10.0,18,45.0,14,35.0,4,10.0,18,45.0
4,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,4,2017,All Students,52,319.615,4,7.69231,18,34.6154,21,40.3846,9,17.3077,30,57.6923


In [54]:
df = dfs['Math']
df.columns = [
    'DBN',
    'School Name',
    'Grade',
    'Year',
    'Category',
    'Number Tested',
    'Mean Scale Score',
    '# Level 1',
    '% Level 1',
    '# Level 2',
    '% Level 2',
    '# Level 3',
    '% Level 3',
    '# Level 4',
    '% Level 4',
    '# Level 3+4',
    '% Level 3+4'
]

df.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2015,All Students,44,320.864,4,9.09091,14,31.8182,13,29.5454,13,29.5454,26,59.0909
1,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2016,All Students,52,330.212,2,3.84615,10,19.2308,19,36.5385,21,40.3846,40,76.9231
2,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,3,2017,All Students,50,321.66,2,4.0,17,34.0,19,38.0,12,24.0,31,62.0
3,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,4,2016,All Students,40,334.575,1,2.5,6,15.0,12,30.0,21,52.5,33,82.5
4,84K037,BEGINNING WITH CHILDREN CHARTER SCHOOL II,4,2017,All Students,52,327.327,5,9.61538,12,23.0769,14,26.9231,21,40.3846,35,67.3077
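
The column-renaming step is repeated for every results sheet above, so it could be factored into a small helper. The sketch below is one way to do that; `RESULT_COLUMNS` and `name_result_columns` are hypothetical names, and the helper assumes a sheet has either exactly the 17 columns listed or one extra leading column, as in the public-school files. The toy row reuses values from the Math output above.

```python
import pandas as pd

RESULT_COLUMNS = [
    'DBN', 'School Name', 'Grade', 'Year', 'Category',
    'Number Tested', 'Mean Scale Score',
    '# Level 1', '% Level 1', '# Level 2', '% Level 2',
    '# Level 3', '% Level 3', '# Level 4', '% Level 4',
    '# Level 3+4', '% Level 3+4',
]

def name_result_columns(df):
    """Return a copy of a results sheet with descriptive column names."""
    # Assumption: at most one extra leading column precedes the 17 known ones
    extra = df.shape[1] - len(RESULT_COLUMNS)
    if extra not in (0, 1):
        raise ValueError(f'unexpected number of columns: {df.shape[1]}')
    out = df.iloc[:, extra:].copy()  # drop the leading column if present
    out.columns = RESULT_COLUMNS
    return out

# Toy sheet with the raw headers as pandas reads them
raw_cols = ['Unnamed: 0', 'DBN', 'School Name', 'Grade', 'Year', 'Category',
            'Number Tested', 'Mean Scale Score',
            '#', '%', '#.1', '%.1', '#.2', '%.2', '#.3', '%.3', '#.4', '%.4']
raw = pd.DataFrame([[
    '01M01532013All Students', '01M015', 'PS 015 ROBERTO CLEMENTE', 3, 2013,
    'All Students', 27, 277.778, 16, 59.2593, 11, 40.7407, 0, 0.0, 0, 0.0, 0, 0.0,
]], columns=raw_cols)

tidy = name_result_columns(raw)                      # 18 columns: key is dropped
tidy_charter = name_result_columns(raw.iloc[:, 1:])  # 17 columns: nothing dropped
```

Because `pd.read_excel` returns a dict of DataFrames when `sheet_name` is a list, the charter sheets could then be handled with something like `{name: name_result_columns(sheet) for name, sheet in dfs.items()}`.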
