New York City School Data Basics
=============================

This notebook loads the NYC School Demographic data.
Working with this dataset, we look at some basic
[Pandas](https://pandas.pydata.org/) operations.

We assume that you have a basic understanding of Python and
Jupyter notebooks. This is a good start if you are new to
Pandas and data science in Python.

In particular, we cover:

- loading .csv data from the portal into a `DataFrame`
- using `head()`, `tail()`, and `Series` (columns) to understand the data
- accessing a column `Series` by name using index notation
- using `unique()`, `min()`, `max()`, `sum()`, and `mean()` to understand `Series` data



In [1]:
# import schools from the nycschools package
from nycschools import schools

# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()
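
The `nycschools` loader hides the details of reading the file. As a rough sketch of the general idea only (not the package's actual implementation), pandas can parse CSV text directly with `read_csv`. The sample rows are copied from the output shown later in this notebook; the names `SAMPLE_CSV` and `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# A few rows taken from the notebook output; the real data set has many
# more rows and columns. SAMPLE_CSV is a hypothetical stand-in for a file.
SAMPLE_CSV = """dbn,school_name,year,total_enrollment
01M015,P.S. 015 Roberto Clemente,2016-17,178
01M015,P.S. 015 Roberto Clemente,2017-18,190
84X730,Bronx Charter School for the Arts,2016-17,320
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(type(sample_df).__name__)  # the result is a DataFrame
print(sample_df.shape)           # (rows, columns)
```

Passing a file path or URL instead of the `StringIO` object works the same way.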


In the next section we describe the data.
If we display `df`, the notebook shows us some of the data from
the start of the data set and some from the end.

If we call `df.head()` we get the start of the data; `df.tail()` shows us the end of the data.

Comment/uncomment the different options to see how they work.

In [3]:
df

# df.head()
# df.tail()

| | dbn | beds | district | geo_district | boro | school_name | short_name | ay | year | total_enrollment | ... | missing_race_ethnicity_data_pct | swd_n | swd_pct | ell_n | ell_pct | poverty_n | poverty_pct | eni_pct | clean_name | zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.000000 | 51 | 0.287000 | 12 | 0.067 | 152 | 0.854 | 0.882 | roberto clemente | 10009 |
| 1 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2017 | 2017-18 | 190 | ... | 0.000000 | 49 | 0.258000 | 8 | 0.042 | 161 | 0.847 | 0.890 | roberto clemente | 10009 |
| 2 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2018 | 2018-19 | 174 | ... | 0.000000 | 39 | 0.224000 | 8 | 0.046 | 147 | 0.845 | 0.888 | roberto clemente | 10009 |
| 3 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2019 | 2019-20 | 190 | ... | 0.000000 | 46 | 0.242000 | 17 | 0.089 | 155 | 0.816 | 0.867 | roberto clemente | 10009 |
| 4 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2020 | 2020-21 | 193 | ... | 0.000000 | 43 | 0.223000 | 21 | 0.109 | 158 | 0.819 | 0.856 | roberto clemente | 10009 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9996 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2016 | 2016-17 | 320 | ... | 0.000000 | 67 | 0.209375 | 51 | 0.159 | 235 | 0.734 | 0.840 | bronx charter school for the arts | 10474 |
| 9997 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2017 | 2017-18 | 314 | ... | 0.000000 | 68 | 0.216561 | 57 | 0.182 | 258 | 0.822 | 0.891 | bronx charter school for the arts | 10474 |
| 9998 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2018 | 2018-19 | 430 | ... | 0.000000 | 103 | 0.239535 | 71 | 0.165 | 363 | 0.844 | 0.888 | bronx charter school for the arts | 10474 |
| 9999 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | MS 730 | 2019 | 2019-20 | 523 | ... | 0.000000 | 117 | 0.223709 | 69 | 0.132 | 453 | 0.866 | 0.892 | bronx charter school for the arts | 10474 |
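
To see `head()` and `tail()` in isolation, here is a minimal sketch on a small made-up `DataFrame` (the name `toy` and its columns are hypothetical, not part of the school data). Both methods take an optional row count and return 5 rows when called without one.

```python
import pandas as pd

# Hypothetical toy frame with ten rows so that head and tail clearly differ
toy = pd.DataFrame({"row": range(10), "square": [n * n for n in range(10)]})

first_three = toy.head(3)  # the first 3 rows
last_two = toy.tail(2)     # the last 2 rows
default_head = toy.head()  # no argument: the first 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```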


In [5]:
# The `columns` property shows us the names of the columns in our `df`.
df.columns

Index(['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name',
       'short_name', 'ay', 'year', 'total_enrollment',
       'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n',
       'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct',
       'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct',
       'native_american_n', 'native_american_pct', 'white_n', 'white_pct',
       'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct',
       'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct',
       'eni_pct', 'clean_name', 'zip'],
      dtype='object')
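
Column names can differ between versions of a data set, so it may help to check that a name exists before indexing with it. The sketch below uses a one-row made-up frame with values taken from the first row of the output above; `has_column` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# One-row toy frame; values copied from the first row of the notebook output
toy = pd.DataFrame({"dbn": ["01M015"], "poverty_n": [152], "poverty_pct": [0.854]})


def has_column(frame, name):
    """Return True if `name` is one of the frame's column labels."""
    return name in frame.columns


# Find every column whose name starts with a given prefix
poverty_cols = [col for col in toy.columns if col.startswith("poverty")]

print(has_column(toy, "poverty_pct"))
print(has_column(toy, "poverty"))
print(poverty_cols)
```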

We can access just one column using index notation:
`df["poverty"]` gives us just that column. We can then display or sort the data using either
the Python built-in function `sorted()` or the pandas `Series` method `sort_values()`.
If we call `unique()` we get an array of the unique values in the `Series`. In the case
of `poverty`, that lets us see that the column contains string data rather than raw numbers,
because values such as `Above 95%` and `Below 5%` are used when the poverty level is very high or very low.

In [4]:
# get just the poverty column
poverty = df["poverty"]
poverty = poverty.sort_values()
print("Note these are sorted as strings, not number value...")
print(poverty.unique()) 

sorted(poverty.unique(), reverse=True)

Note these are sorted as strings, not number value...
['10' '100' '1000' ... '997' 'Above 95%' 'Below 5%']


['Below 5%',
 'Above 95%',
 '997',
 '995',
 '994',
 '993',
 '99',
 '987',
 '983',
 '980',
 '98',
 '979',
 '978',
 '977',
 '976',
 '975',
 '974',
 '971',
 '970',
 '97',
 '967',
 '966',
 '965',
 '964',
 '963',
 '962',
 '961',
 '960',
 '96',
 '959',
 '958',
 '957',
 '956',
 '955',
 '953',
 '952',
 '951',
 '95',
 '949',
 '948',
 '947',
 '946',
 '944',
 '943',
 '941',
 '940',
 '94',
 '938',
 '937',
 '936',
 '935',
 '934',
 '933',
 '932',
 '931',
 '930',
 '93',
 '929',
 '928',
 '927',
 '925',
 '923',
 '922',
 '921',
 '92',
 '918',
 '917',
 '916',
 '914',
 '912',
 '911',
 '910',
 '91',
 '909',
 '907',
 '906',
 '905',
 '904',
 '903',
 '902',
 '901',
 '900',
 '90',
 '898',
 '896',
 '895',
 '894',
 '893',
 '892',
 '891',
 '89',
 '889',
 '888',
 '886',
 '885',
 '884',
 '883',
 '882',
 '881',
 '880',
 '88',
 '878',
 '877',
 '876',
 '874',
 '873',
 '872',
 '871',
 '870',
 '87',
 '869',
 '868',
 '867',
 '866',
 '864',
 '863',
 '862',
 '861',
 '860',
 '86',
 '859',
 '858',
 '857',
 '856',
 '855',
 ...
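
The output above is ordered character by character, which is why `'100'` sorts before `'99'`. One possible way to get a numeric ordering, sketched here on a small made-up `Series` rather than the real column, is `pd.to_numeric` with `errors="coerce"`, which turns values that cannot be parsed into `NaN`. The names `raw`, `numeric`, and `non_numeric` are illustrative only.

```python
import pandas as pd

# Made-up sample that mixes digit strings with text labels, like the column above
raw = pd.Series(["10", "100", "997", "99", "Above 95%", "Below 5%"])

# Sorting strings compares them character by character
string_sorted = sorted(raw.unique())

# Convert to numbers; anything that is not a number becomes NaN
numeric = pd.to_numeric(raw, errors="coerce")
numeric_sorted = numeric.dropna().sort_values().tolist()

# The rows that failed to convert are the text labels
non_numeric = raw[numeric.isna()].tolist()

print(string_sorted)
print(numeric_sorted)
print(non_numeric)
```

Whether to drop the text labels or map them to a number depends on the analysis, so this sketch only separates them out.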

We can get a subset of the data by using index notation with
a list of column names:

`df[["dbn", "school_name", "total_enrollment", "poverty_1"]]` returns a `DataFrame`
with 4 columns.

In [5]:
df[["dbn", "school_name", "total_enrollment", "poverty_1"]]

| | dbn | school_name | total_enrollment | poverty_1 |
|---|---|---|---|---|
| 0 | 01M015 | P.S. 015 Roberto Clemente | 178 | 85.4% |
| 1 | 01M015 | P.S. 015 Roberto Clemente | 190 | 84.7% |
| 2 | 01M015 | P.S. 015 Roberto Clemente | 174 | 84.5% |
| 3 | 01M015 | P.S. 015 Roberto Clemente | 190 | 81.6% |
| 4 | 01M015 | P.S. 015 Roberto Clemente | 193 | 81.9% |
| ... | ... | ... | ... | ... |
| 9164 | 84X730 | Bronx Charter School for the Arts | 320 | 73.4% |
| 9165 | 84X730 | Bronx Charter School for the Arts | 314 | 82.2% |
| 9166 | 84X730 | Bronx Charter School for the Arts | 430 | 84.4% |
| 9167 | 84X730 | Bronx Charter School for the Arts | 523 | 86.6% |
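
The difference between one pair of brackets and two is easy to miss: a single column name returns a `Series`, while a list of names returns a `DataFrame`, even if the list holds only one name. A minimal sketch on a made-up two-row frame (values copied from the output above; `toy` is a hypothetical name):

```python
import pandas as pd

# Two-row toy frame built from values shown in the notebook output
toy = pd.DataFrame({
    "dbn": ["01M015", "84X730"],
    "school_name": ["P.S. 015 Roberto Clemente", "Bronx Charter School for the Arts"],
    "total_enrollment": [178, 320],
})

one_col = toy["total_enrollment"]            # single name -> Series
subset = toy[["dbn", "total_enrollment"]]    # list of names -> DataFrame
one_col_frame = toy[["total_enrollment"]]    # one-item list -> still a DataFrame

print(type(one_col).__name__)
print(type(subset).__name__, list(subset.columns))
print(type(one_col_frame).__name__)
```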


We can run some basic statistics on a single `Series` to get a sense of the type of data it contains.

In [6]:
# get just the total_enrollment column, called a Series in pandas
enrollment = df["total_enrollment"]
print("The largest school:", enrollment.max())

print("The smallest school:", enrollment.min())
print("Avg (mean) school size:", enrollment.mean())
print("Avg (mode, can return multiple values) percent poverty:", list(df['poverty_1'].mode()))
print("School years included in data:", df["year"].unique())

The largest school: 6040
The smallest school: 7
Avg (mean) school size: 585.9556112989421
Avg (mode, can return multiple values) percent poverty: ['Above 95%']
School years included in data: ['2016-17' '2017-18' '2018-19' '2019-20' '2020-21']
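
The same methods can be checked by hand on a small `Series`. This sketch uses the five enrollment values shown for P.S. 015 in the output earlier in the notebook and also includes `sum()`, which the cell above does not call. The variable names are illustrative.

```python
import pandas as pd

# Enrollment for one school across five years, copied from the notebook output
enrollment = pd.Series([178, 190, 174, 190, 193], name="total_enrollment")

smallest = enrollment.min()
largest = enrollment.max()
total = enrollment.sum()
average = enrollment.mean()
# mode() returns a Series because more than one value can be tied for most frequent
most_common = enrollment.mode().tolist()

print(smallest, largest, total, average, most_common)
```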
