Pandas Basics
=============

This notebook shows some basic techniques for using
`pandas` to load and work with open data from the NYC data portal.

It covers:

- loading .csv data from the portal into a `DataFrame`
- using `head()`, `tail()`, and `columns` to understand the data
- accessing a column as a `Series` by name using index notation
- using `unique()`, `min()`, `max()`, `sum()`, and `mean()` to understand series data



In [1]:
# import the pandas library and call it `pd`
# pd is the convention for pandas and will be used
# in online examples and official docs
import pandas as pd

In the next cell we load the school demographic data snapshot from:

https://data.cityofnewyork.us/Education/2020-2021-Demographic-Snapshot-School/vmmu-wj3w

The URL used in the code below is copied from the API tab on that page,
with the data type set to CSV.
We append `?$limit=1000000` to the URL to request up to 1,000,000 rows of data;
without it, we won't get the whole dataset for larger data sets.

We call `pd.read_csv` to load the networked data and convert it into a `DataFrame`.
We name the variable `df` -- which is the convention for a dataframe in many docs.

If we have more than one dataframe, we give each one a meaningful name.

In [2]:
url = "https://data.cityofnewyork.us/resource/vmmu-wj3w.csv?$limit=1000000"
df = pd.read_csv(url)
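
`pd.read_csv` also accepts a file path or a file-like object, so the same call can be tried without a network connection. The sketch below is only an illustration: `sample_csv` and `sample_df` are names made up here, and the two rows are copied from the snapshot and trimmed to four columns. Checking `shape` after a real download is one way to see how many rows came back.

```python
import io
import pandas as pd

# two rows copied from the snapshot, trimmed to four columns (illustrative sample)
sample_csv = io.StringIO(
    "dbn,school_name,year,total_enrollment\n"
    "01M015,P.S. 015 Roberto Clemente,2016-17,178\n"
    "01M015,P.S. 015 Roberto Clemente,2017-18,190\n"
)

# read_csv treats the first line as the header row
sample_df = pd.read_csv(sample_csv)

# shape is (number of rows, number of columns)
print(sample_df.shape)
```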

In the next section we describe the data.
If we display `df`, the notebook shows us some of the data from
the start of the data set and some from the end.

The `columns` property shows us the names of the columns in our `df`.

If we call `df.head()` we get the start of the data. `df.tail()` shows us the end of the data.


In [3]:
display(df)
display(df.columns)
display(df.head())
display(df.tail())


Unnamed: 0,dbn,school_name,year,total_enrollment,grade_3k_pk_half_day_full,grade_k,grade_1,grade_2,grade_3,grade_4,...,white_1,missing_race_ethnicity_data,missing_race_ethnicity_data_1,students_with_disabilities,students_with_disabilities_1,english_language_learners,english_language_learners_1,poverty,poverty_1,economic_need_index
0,01M015,P.S. 015 Roberto Clemente,2016-17,178,17,28,33,27,31,24,...,0.022000,0,0.000000,51,0.287000,12,0.067,152,85.4%,88.2%
1,01M015,P.S. 015 Roberto Clemente,2017-18,190,17,28,32,33,23,31,...,0.032000,0,0.000000,49,0.258000,8,0.042,161,84.7%,89.0%
2,01M015,P.S. 015 Roberto Clemente,2018-19,174,13,20,33,30,30,20,...,0.034000,0,0.000000,39,0.224000,8,0.046,147,84.5%,88.8%
3,01M015,P.S. 015 Roberto Clemente,2019-20,190,14,29,28,38,33,29,...,0.047000,0,0.000000,46,0.242000,17,0.089,155,81.6%,86.7%
4,01M015,P.S. 015 Roberto Clemente,2020-21,193,17,29,29,27,30,32,...,0.057000,0,0.000000,43,0.223000,21,0.109,158,81.9%,85.6%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9164,84X730,Bronx Charter School for the Arts,2016-17,320,0,53,53,55,52,53,...,0.009375,0,0.000000,67,0.209375,51,0.159,235,73.4%,84.0%
9165,84X730,Bronx Charter School for the Arts,2017-18,314,0,50,51,54,51,52,...,0.003185,0,0.000000,68,0.216561,57,0.182,258,82.2%,89.1%
9166,84X730,Bronx Charter School for the Arts,2018-19,430,0,49,54,49,54,51,...,0.006977,0,0.000000,103,0.239535,71,0.165,363,84.4%,88.8%
9167,84X730,Bronx Charter School for the Arts,2019-20,523,0,51,50,53,52,55,...,0.009560,0,0.000000,117,0.223709,69,0.132,453,86.6%,89.2%


Index(['dbn', 'school_name', 'year', 'total_enrollment',
       'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'female', 'female_1', 'male',
       'male_1', 'asian', 'asian_1', 'black', 'black_1', 'hispanic',
       'hispanic_1', 'multi_racial', 'multi_racial_1', 'native_american',
       'native_american_1', 'white', 'white_1', 'missing_race_ethnicity_data',
       'missing_race_ethnicity_data_1', 'students_with_disabilities',
       'students_with_disabilities_1', 'english_language_learners',
       'english_language_learners_1', 'poverty', 'poverty_1',
       'economic_need_index'],
      dtype='object')

Unnamed: 0,dbn,school_name,year,total_enrollment,grade_3k_pk_half_day_full,grade_k,grade_1,grade_2,grade_3,grade_4,...,white_1,missing_race_ethnicity_data,missing_race_ethnicity_data_1,students_with_disabilities,students_with_disabilities_1,english_language_learners,english_language_learners_1,poverty,poverty_1,economic_need_index
0,01M015,P.S. 015 Roberto Clemente,2016-17,178,17,28,33,27,31,24,...,0.022,0,0.0,51,0.287,12,0.067,152,85.4%,88.2%
1,01M015,P.S. 015 Roberto Clemente,2017-18,190,17,28,32,33,23,31,...,0.032,0,0.0,49,0.258,8,0.042,161,84.7%,89.0%
2,01M015,P.S. 015 Roberto Clemente,2018-19,174,13,20,33,30,30,20,...,0.034,0,0.0,39,0.224,8,0.046,147,84.5%,88.8%
3,01M015,P.S. 015 Roberto Clemente,2019-20,190,14,29,28,38,33,29,...,0.047,0,0.0,46,0.242,17,0.089,155,81.6%,86.7%
4,01M015,P.S. 015 Roberto Clemente,2020-21,193,17,29,29,27,30,32,...,0.057,0,0.0,43,0.223,21,0.109,158,81.9%,85.6%


Unnamed: 0,dbn,school_name,year,total_enrollment,grade_3k_pk_half_day_full,grade_k,grade_1,grade_2,grade_3,grade_4,...,white_1,missing_race_ethnicity_data,missing_race_ethnicity_data_1,students_with_disabilities,students_with_disabilities_1,english_language_learners,english_language_learners_1,poverty,poverty_1,economic_need_index
9164,84X730,Bronx Charter School for the Arts,2016-17,320,0,53,53,55,52,53,...,0.009375,0,0.0,67,0.209375,51,0.159,235,73.4%,84.0%
9165,84X730,Bronx Charter School for the Arts,2017-18,314,0,50,51,54,51,52,...,0.003185,0,0.0,68,0.216561,57,0.182,258,82.2%,89.1%
9166,84X730,Bronx Charter School for the Arts,2018-19,430,0,49,54,49,54,51,...,0.006977,0,0.0,103,0.239535,71,0.165,363,84.4%,88.8%
9167,84X730,Bronx Charter School for the Arts,2019-20,523,0,51,50,53,52,55,...,0.00956,0,0.0,117,0.223709,69,0.132,453,86.6%,89.2%
9168,84X730,Bronx Charter School for the Arts,2020-21,626,0,38,52,53,55,55,...,0.00639,1,0.001597,153,0.244409,78,0.125,541,86.4%,88.2%
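
Both `head()` and `tail()` take an optional row count (the default is 5). Here is a small sketch on a hand-built frame; `sample_df` is a stand-in made up for this example, using the P.S. 015 enrollment values shown above.

```python
import pandas as pd

# hand-built stand-in for the real data (values copied from the output above)
sample_df = pd.DataFrame({
    "dbn": ["01M015"] * 5,
    "year": ["2016-17", "2017-18", "2018-19", "2019-20", "2020-21"],
    "total_enrollment": [178, 190, 174, 190, 193],
})

# columns is an Index; list() turns it into a plain Python list
column_names = list(sample_df.columns)

first_two = sample_df.head(2)  # first 2 rows
last_two = sample_df.tail(2)   # last 2 rows

print(column_names)
print(first_two)
print(last_two)
```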


Data Dictionary
=============

From the `df.columns` call, we see all of the columns for our data. Here
are the columns with some information on the type of data they contain. For more
detailed information on these fields, see the first tab in the [Demographic Data
Spreadsheet](https://infohub.nyced.org/docs/default-source/default-document-library/demographic-snapshot-2016-17-to-2020-21---public.xlsx)

- **dbn:** `text` data with the unique id of the school used by NYC DOE. _dbn_
  is in the format [district][boro][school number], for example '01M015'
  represents "district 1" in "Manhattan", "number 15." 
  - _X_ - Bronx
  - _K_ - Brooklyn
  - _M_ - Manhattan
  - _Q_ - Queens
  - _R_ - Staten Island
  
  Districts 1-32 represent geographic school districts in the city. [District 75](https://www.schools.nyc.gov/learning/special-education/school-settings/district-75) supports students with highly specialized needs that cannot be met in the regular school special education program. District 79 is the alternative school district for older students, students with interrupted education, court-involved youth, etc. District 84 designates public charter schools operating within the DOE.
- **school_name:** `text` the full name of the school, e.g. 'P.S. 015 Roberto
  Clemente'. Elementary schools are usually called PS 15, PS 143, etc. They
  often have a descriptive name, too, like Roberto Clemente. Middle schools are
  _usually_ MS 915, but sometimes they are called IS 915.
- **year:** `text`, the academic year in the format yyyy-yy, e.g. '2020-21'
- **total_enrollment:** `integer` the total number of students in the school
- **grade_3k_pk_half_day_full:** `integer` the total number of students in the school in early childhood "3K" or "pre-k"
- **[grade_k..grade_12:]** `integer` each of these columns is the number of students in each grade at the school
- **female:** `integer` total female students at the school
- **female_1:** `real` the percent of female students as a real number between 0 and 1
- **male:** `integer` total male students at the school
- **male_1:** `real` the percent of male students as a real number between 0 and 1
- **asian:** `integer` total Asian students at the school
- **asian_1:** `real` the percent of Asian students as a real number between 0 and 1
- **black:** `integer` total Black students at the school
- **black_1:** `real` the percent of Black students as a real number between 0 and 1
- **hispanic:** `integer` total Latinx students at the school
- **hispanic_1:** `real` the percent of Latinx students as a real number between 0 and 1
- **multi_racial:** `integer` total multi-racial students at the school
- **multi_racial_1:** `real` the percent of multi-racial students as a real number between 0 and 1
- **native_american:** `integer` total Native American students at the school
- **native_american_1:** `real` the percent of Native American students as a real number between 0 and 1
- **white:** `integer` total White students at the school
- **white_1:** `real` the percent of White students as a real number between 0 and 1
- **missing_race_ethnicity_data:** `integer` total number of students with missing race/ethnic data
- **missing_race_ethnicity_data_1:** `real` the percent of students with missing race/ethnic data as a real number between 0 and 1
- **students_with_disabilities:** `integer` total number of students with
  disabilities (sometimes written SWD) in the schools. This counts the number of
  students with an IEP (individualized education plan) in special education at
  the school. ([more info about special ed and IEPs](https://www.schools.nyc.gov/learning/special-education/preschool-to-age-21/special-education-in-nyc))
- **students_with_disabilities_1:** `real` the percent of SWDs as a real number between 0 and 1
- **english_language_learners:** `integer` total number of students who are
  characterized as English Language Learners (sometimes written ELL, but also
  ENL or ESL students) in the schools. This counts the number of students who
  receive modified instruction either through English as a New Language
  instruction and/or bilingual education. ([more info on how NYC identifies ELLs](https://www.schools.nyc.gov/learning/multilingual-learners/english-language-learners))
- **english_language_learners_1:** `real` the percent of ENL students as a real number between 0 and 1
- **poverty:** `text` the number of students who qualify for free or reduced
  lunch or HRA benefits. The raw data for this field is `text` data because the
  actual number for schools with poverty rates above 95% is not
  returned; instead the field holds a string such as 'Above 95%'.
- **poverty_1:** `text` the percent of students in poverty. The poverty
  percentage is a string percentage rounded to one decimal place (e.g. '94.4%'),
  except the highest and lowest values are changed to 'Below 5%' and 'Above
  95%'.
- **economic_need_index:** `text` Economic Need Index (ENI) estimates the
  percentage of students in the school living in poverty. The ENI is a string
  percentage rounded to one decimal place (e.g. '94.4%'), except the highest and
  lowest values are changed to 'Below 5%' and 'Above 95%'.
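
The _dbn_ format described above can be pulled apart with string slicing. This is a minimal sketch, assuming the district is always the first two characters and the borough is a single letter, as in the sample values; `parse_dbn` and `BOROUGHS` are names made up for this example.

```python
# borough letters as listed in the data dictionary
BOROUGHS = {
    "X": "Bronx",
    "K": "Brooklyn",
    "M": "Manhattan",
    "Q": "Queens",
    "R": "Staten Island",
}


def parse_dbn(dbn):
    """Split a dbn like '01M015' into (district, borough, school number)."""
    district = int(dbn[:2])     # assumed: two-digit district
    borough = BOROUGHS[dbn[2]]  # assumed: one borough letter
    school_number = dbn[3:]     # kept as text so leading zeros survive
    return district, borough, school_number


print(parse_dbn("01M015"))
print(parse_dbn("84X730"))
```

On a `DataFrame`, the same slices could be applied to the whole column with the `.str` accessor, for example `df["dbn"].str[2]` for the borough letter.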

We can access just one column using index notation --
`df["poverty"]` gives us just that column. We can then display or sort the data using either
the python built-in function `sorted()` or the pandas `Series` function `sort_values()`.
If we call `unique()` we get the unique values in the `Series`. For `poverty`,
that shows us the column holds strings rather than raw numbers, because the count
is replaced by text when the poverty level is very high or very low.

In [4]:
# get just the poverty column
poverty = df["poverty"]
poverty = poverty.sort_values()
print("Note these are sorted as strings, not number value...")
print(poverty.unique()) 

sorted(poverty.unique(), reverse=True)

Note these are sorted as strings, not number value...
['10' '100' '1000' ... '997' 'Above 95%' 'Below 5%']


['Below 5%',
 'Above 95%',
 '997',
 '995',
 '994',
 '993',
 '99',
 '987',
 '983',
 '980',
 '98',
 '979',
 '978',
 '977',
 '976',
 '975',
 '974',
 '971',
 '970',
 '97',
 '967',
 '966',
 '965',
 '964',
 '963',
 '962',
 '961',
 '960',
 '96',
 '959',
 '958',
 '957',
 '956',
 '955',
 '953',
 '952',
 '951',
 '95',
 '949',
 '948',
 '947',
 '946',
 '944',
 '943',
 '941',
 '940',
 '94',
 '938',
 '937',
 '936',
 '935',
 '934',
 '933',
 '932',
 '931',
 '930',
 '93',
 '929',
 '928',
 '927',
 '925',
 '923',
 '922',
 '921',
 '92',
 '918',
 '917',
 '916',
 '914',
 '912',
 '911',
 '910',
 '91',
 '909',
 '907',
 '906',
 '905',
 '904',
 '903',
 '902',
 '901',
 '900',
 '90',
 '898',
 '896',
 '895',
 '894',
 '893',
 '892',
 '891',
 '89',
 '889',
 '888',
 '886',
 '885',
 '884',
 '883',
 '882',
 '881',
 '880',
 '88',
 '878',
 '877',
 '876',
 '874',
 '873',
 '872',
 '871',
 '870',
 '87',
 '869',
 '868',
 '867',
 '866',
 '864',
 '863',
 '862',
 '861',
 '860',
 '86',
 '859',
 '858',
 '857',
 '856',
 '855',
 ...
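
Because the values are strings, '1000' sorts before '152'. One possible way to get a numeric ordering is `pd.to_numeric` with `errors="coerce"`, which turns anything that is not a number into `NaN`. This is a sketch on a few sample values, not a step the notebook takes; note that it drops the 'Above 95%' and 'Below 5%' rows from the numeric result.

```python
import pandas as pd

# a few sample values like those in the poverty column
poverty = pd.Series(["152", "98", "1000", "Above 95%", "Below 5%"])

# sorting the raw strings compares them character by character
as_text = list(poverty.sort_values())

# errors="coerce" turns non-numeric strings into NaN instead of raising
as_numbers = pd.to_numeric(poverty, errors="coerce")

# drop the NaN rows, then sort by numeric value
numeric_sorted = list(as_numbers.dropna().sort_values())

print(as_text)
print(numeric_sorted)
```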

We can get a subset of the data by using index notation with
a list of column names:

`df[ ["dbn", "school_name", "total_enrollment", "poverty_1" ] ]` returns a `DataFrame`
with 4 columns.

In [5]:
df[ ["dbn", "school_name", "total_enrollment", "poverty_1" ] ]

Unnamed: 0,dbn,school_name,total_enrollment,poverty_1
0,01M015,P.S. 015 Roberto Clemente,178,85.4%
1,01M015,P.S. 015 Roberto Clemente,190,84.7%
2,01M015,P.S. 015 Roberto Clemente,174,84.5%
3,01M015,P.S. 015 Roberto Clemente,190,81.6%
4,01M015,P.S. 015 Roberto Clemente,193,81.9%
...,...,...,...,...
9164,84X730,Bronx Charter School for the Arts,320,73.4%
9165,84X730,Bronx Charter School for the Arts,314,82.2%
9166,84X730,Bronx Charter School for the Arts,430,84.4%
9167,84X730,Bronx Charter School for the Arts,523,86.6%
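
The difference between the two kinds of index notation is the type that comes back: a single name gives a `Series`, while a list of names gives a `DataFrame`. A small sketch, using a hand-built `sample_df` (a stand-in made up for this example) with two rows from the output above:

```python
import pandas as pd

# hand-built stand-in with two rows copied from the output above
sample_df = pd.DataFrame({
    "dbn": ["01M015", "84X730"],
    "school_name": ["P.S. 015 Roberto Clemente", "Bronx Charter School for the Arts"],
    "total_enrollment": [178, 320],
    "poverty_1": ["85.4%", "73.4%"],
})

one_column = sample_df["poverty_1"]             # a single name returns a Series
two_columns = sample_df[["dbn", "poverty_1"]]   # a list of names returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
```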


We can run some basic statistics on a single series, to get a sense of the type of data it contains.

In [6]:
# get just the total_enrollment column, called a Series in pandas
enrollment = df["total_enrollment"]
print("The largest school:", enrollment.max())

print("The smallest school:", enrollment.min())
print("Avg (mean) school size:", enrollment.mean())
print("Avg (mode, can return multiple values) percent poverty:", list(df['poverty_1'].mode()))
print("School years included in data:", df["year"].unique())

The largest school: 6040
The smallest school: 7
Avg (mean) school size: 585.9556112989421
Avg (mode, can return multiple values) percent poverty: ['Above 95%']
School years included in data: ['2016-17' '2017-18' '2018-19' '2019-20' '2020-21']
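
To check these functions by hand, it helps to run them on a series small enough to add up mentally. The sketch below uses the five P.S. 015 enrollment values from the output earlier in the notebook and also shows `sum()`, which the cell above does not call.

```python
import pandas as pd

# the five total_enrollment values for P.S. 015, 2016-17 through 2020-21
enrollment = pd.Series([178, 190, 174, 190, 193], name="total_enrollment")

smallest = enrollment.min()
largest = enrollment.max()
total = enrollment.sum()
average = enrollment.mean()

# mode() returns a Series because more than one value can tie
most_common = list(enrollment.mode())

print(smallest, largest, total, average, most_common)
```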
