Load Excel Data
======================
For some reason, the Open Data Portal does not provide the latest
grades 3-8 test data through its live URL. The data is only available as
Excel files that have to be downloaded from the portal pages. Each Excel
file has one sheet for each population tested:

- all students
- ethnicity
- SWD status
- econ status
- ELL status

The Excel files have been modified by hand: the school name column was deleted,
sheets that don't contain data were removed, and the column headers were renamed.


In [1]:
# load the demographic data and import pandas as pd
import pandas as pd

demo_url = "https://data.cityofnewyork.us/resource/vmmu-wj3w.csv?$limit=1000000"


In [2]:
# load and prep the school demographics data
# generic function to take percents as strings and convert to a real number
# expects data to look like '84.33%', 'Above 95%', 'Below 5%'
# `enroll_col` is kept so existing callers still work; the original
# `n * .96 / n` simplifies to `.96`, so the enrollment isn't needed
def str_pct(row, pct_col, enroll_col):
    # strip the trailing '%' sign
    pct = str(row[pct_col]).rstrip("%")
    try:
        return float(pct) / 100
    except ValueError:
        # censored values: treat 'Above 95%' as 96% and 'Below 5%' as 4%
        if "Above" in pct:
            return .96
        if "Below" in pct:
            return .04
        # anything else can't be parsed, so mark it as missing
        return float("nan")

# if the score isn't a float, set it to -1 to remove it or filter it easily
def clean_score(score):
    try:
        return float(score)
    except (TypeError, ValueError):
        return -1


# because `str_pct` is a generic function, we need to wrap it
# with another function, we can do this with `def` or `lambda`
# we'll do both, here
def pct_pov(row): return str_pct(row, "poverty_1", "total_enrollment")


# add district and boro info
boros = {"K":"Brooklyn", "X":"Bronx", "M": "Manhattan", "Q": "Queens", "R": "Staten Island"}

def district(dbn): return int(dbn[:2])
def boro(dbn): return boros[dbn[2]]

# convert the string format `year` to match the int format academic year
# used in the test score data
def ay(year): return int(year.split("-")[0])


# get the data and clean it a little bit
df = pd.read_csv(demo_url)

df["year"] = df["year"].apply(ay)
df["district"] = df["dbn"].apply(district)
df["boro"] = df["dbn"].apply(boro)
df["poverty_1"] = df.apply(pct_pov, axis = 1)
df["economic_need_index"] = df.apply(lambda row: str_pct(row, "economic_need_index", "total_enrollment"), axis = 1)


# drop the districts that aren't geographic districts b/c we don't have test data for them
df = df[df["district"] < 33]

# (optional) get just the columns we need, to make it more manageable

cols = ['dbn',
        'year',
        'district',
        'boro',
        'school_name', 
        'total_enrollment',
        'female_1',
        'male_1',
        'asian_1', 
        'black_1', 
        'hispanic_1', 
        'multi_racial_1', 
        'native_american_1', 
        'white_1', 
        'students_with_disabilities_1', 
        'english_language_learners_1',  
        'poverty_1',
        'economic_need_index']

df = df[cols]
"loaded demographic data"

'loaded demographic data'
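
To make the percent parsing concrete, here is a small standalone sketch of the same idea on plain strings. `parse_pct` is a hypothetical helper written for this illustration, not the notebook's `str_pct`. It assumes the values look like the three formats named in the comments above and reuses the 96% / 4% stand-ins for censored values; the sample strings are made up.

```python
import math

def parse_pct(value):
    """Convert '84.33%', 'Above 95%' or 'Below 5%' to a fraction (hypothetical helper)."""
    text = str(value).strip().rstrip("%")
    try:
        return float(text) / 100
    except ValueError:
        # censored values get the same stand-ins used in the notebook
        if "Above" in text:
            return 0.96
        if "Below" in text:
            return 0.04
        # anything else is treated as missing
        return math.nan

# made-up sample values covering each branch
samples = ["84.33%", "Above 95%", "Below 5%", "n/a"]
parsed = [parse_pct(s) for s in samples]
```

Returning `NaN` for unparseable values is one possible design choice; raising an error instead would surface unexpected formats sooner.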

Load from Excel
===============
The Excel workbook has a different sheet for each type of data:

- ethnic breaks
- students with disabilities
- ELL
- poverty

Each one of these sheets has the same column headers:

- dbn
- grade
- year
- category
- number_tested
- mean_scale_score
- level_1
- level_1_pct
- level_2
- level_2_pct
- level_3
- level_3_pct
- level_4
- level_4_pct
- level_3_4
- level_3_4_pct

Because the sheets share the same columns, we can load each one and then
concatenate them into a single dataframe (each sheet's rows are stacked at the bottom of the dataset).

In the following code, I use a **list comprehension**. This isn't specific to
`pandas` -- it's a built-in feature of the Python language. You might not
be familiar with it, though, so you can read more here:

- https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
- https://www.w3schools.com/python/python_lists_comprehension.asp

In line 6, I use a list comprehension to create a new list based on the
`sheet_names` list. We could do this with a `for` loop, but the list
comprehension is more concise and more "Pythonic". Semantically, it does the following:

1. read each string in the `sheet_names` list
2. load that sheet's data with the `pd.read_excel()` function
3. append the result of `pd.read_excel()` (a `DataFrame`) to the new list `data`
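
As a toy illustration of the three steps above, separate from the real notebook cell, this sketch builds the same list twice: once with a list comprehension and once with a `for` loop. `load_sheet` is a hypothetical stand-in for `pd.read_excel()` that returns fake rows, so the example runs without the workbook.

```python
sheet_names = ["all", "swd", "ethnicity"]

def load_sheet(name):
    # hypothetical stand-in for pd.read_excel(xls, name): returns fake rows
    return [{"category": name, "number_tested": 0}]

# list comprehension: read each name, load it, collect the results
data = [load_sheet(sheet) for sheet in sheet_names]

# the equivalent for loop, spelled out step by step
data_loop = []
for sheet in sheet_names:
    data_loop.append(load_sheet(sheet))
```

Both versions produce the same list; the comprehension just folds the loop and the `append` into one expression.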

**Note:** _there's a lot of data in the Excel file, and the next block takes a couple of minutes to run even on my fast computer_.

In [3]:
sheet_names = ["all", "swd", "ethnicity", "gender", "econ_status", "ell"]

# open the Excel workbook
xls = pd.ExcelFile('ela.xlsx')
# read each sheet into a list of DataFrames
data = [pd.read_excel(xls, sheet) for sheet in sheet_names]
# combine them into a single dataframe
ela_df = pd.concat(data, ignore_index=True)


ela_df

FileNotFoundError: [Errno 2] No such file or directory: 'ela.xlsx'

In [24]:
ela_df["mean_scale_score"] = ela_df["mean_scale_score"].apply(clean_score)

# ela_df.set_index(["dbn", "grade", "year", "category"])



ela_df.dtypes



dbn                 object
grade               object
year                 int64
category            object
number_tested        int64
mean_scale_score     int64
level_1             object
level_1_pct         object
level_2             object
level_2_pct         object
level_3             object
level_3_pct         object
level_4             object
level_4_pct         object
level_3_4           object
level_3_4_pct       object
dtype: object

In [7]:
t = ela_df.convert_dtypes()
t.dtypes

dbn                 string
grade               object
year                 Int64
category            string
number_tested        Int64
mean_scale_score    object
level_1             object
level_1_pct         object
level_2             object
level_2_pct         object
level_3             object
level_3_pct         object
level_4             object
level_4_pct         object
level_3_4           object
level_3_4_pct       object
dtype: object

In [8]:
t["level_1"].count()

# pd.to_numeric(t["level_1"])

t["level_1"].astype(int)

ValueError: invalid literal for int() with base 10: 's'
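
The `ValueError` above shows that `level_1` contains at least one non-numeric marker, `'s'`. One possible way around it, sketched here on made-up values, is `pd.to_numeric` with `errors="coerce"`, which turns entries it can't parse into `NaN`. What `'s'` stands for in the real data is not confirmed here, and the sample series is invented for illustration.

```python
import pandas as pd

# made-up sample: counts stored as strings, with 's' as a non-numeric marker
sample = pd.Series(["12", "s", "7"])

# errors="coerce" replaces anything that can't be parsed with NaN
numeric = pd.to_numeric(sample, errors="coerce")

# the nullable Int64 dtype keeps whole numbers while allowing missing values
as_int = numeric.astype("Int64")
```

The same call could be tried on `t["level_1"]`, though whether dropping the markers is appropriate depends on how those rows should be treated in the analysis.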

In [20]:
# t = ela_df.reset_index()

ela_df.to_feather("ela.feather")

ArrowInvalid: ("Could not convert 'All Grades' with type str: tried to convert to int64", 'Conversion failed for column grade with type object')
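
The error message suggests the `grade` column mixes integers with the string `'All Grades'`, which cannot be stored as a single `int64` column. A minimal sketch of one possible fix, on a made-up frame, is to cast the column to strings before writing. That `to_feather` then succeeds on the real data is an assumption and is not exercised here.

```python
import pandas as pd

# made-up sample: grade mixes integers with the string 'All Grades'
sample = pd.DataFrame({"grade": [3, 4, "All Grades"]})

# cast every value to str so the column holds a single type
sample["grade"] = sample["grade"].astype(str)

# collect the Python types now present in the column
kinds = {type(v) for v in sample["grade"]}
```

After the cast, `kinds` contains only `str`. Other object columns with mixed values would likely need the same treatment.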

In [12]:
t = pd.read_feather("ela.feather")
t

Unnamed: 0,dbn,year,district,boro,school_name,total_enrollment,female_1,male_1,asian_1,black_1,hispanic_1,multi_racial_1,native_american_1,white_1,students_with_disabilities_1,english_language_learners_1,poverty_1,economic_need_index
0,01M015,2016,1,Manhattan,P.S. 015 Roberto Clemente,178,0.466,0.534,0.079,0.287,0.590,0.017,0.006,0.022,0.287,0.067,0.854,0.882
1,01M015,2017,1,Manhattan,P.S. 015 Roberto Clemente,190,0.521,0.479,0.105,0.274,0.579,0.005,0.005,0.032,0.258,0.042,0.847,0.890
2,01M015,2018,1,Manhattan,P.S. 015 Roberto Clemente,174,0.489,0.511,0.138,0.276,0.546,0.000,0.006,0.034,0.224,0.046,0.845,0.888
3,01M015,2019,1,Manhattan,P.S. 015 Roberto Clemente,190,0.495,0.505,0.142,0.295,0.505,0.000,0.011,0.047,0.242,0.089,0.816,0.867
4,01M015,2020,1,Manhattan,P.S. 015 Roberto Clemente,193,0.523,0.477,0.135,0.275,0.528,0.005,0.000,0.057,0.223,0.109,0.819,0.856
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7642,32K564,2016,32,Brooklyn,Bushwick Community High School,236,0.479,0.521,0.000,0.284,0.682,0.000,0.004,0.030,0.237,0.042,0.614,0.715
7643,32K564,2017,32,Brooklyn,Bushwick Community High School,263,0.479,0.521,0.000,0.308,0.673,0.000,0.000,0.019,0.323,0.072,0.859,0.907
7644,32K564,2018,32,Brooklyn,Bushwick Community High School,196,0.398,0.602,0.000,0.311,0.679,0.000,0.000,0.010,0.362,0.046,0.832,0.881
7645,32K564,2019,32,Brooklyn,Bushwick Community High School,214,0.416,0.584,0.000,0.229,0.766,0.000,0.000,0.005,0.327,0.065,0.883,0.904
