Loading Data
============
This notebook loads the NYC school demographic data and uses it
to walk through some basic
[Pandas](https://pandas.pydata.org/) operations.

We assume that you have a basic understanding of Python and
Jupyter notebooks. This is a good start if you are new to
Pandas and data science in Python.

In particular, we will:

- load data from the `nycschools` package into a `DataFrame`
- use `head()`, `tail()`, and `Series` (columns) to understand the data
- access column `Series` by name using index notation
- use `unique()`, `min()`, `max()`, `sum()`, and `mean()` to understand series data



In [1]:
# import schools from the nycschools package
from nycschools import schools
# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()


Displaying data tables
-----------------------
If we display `df`, the notebook shows us some of the data from
the start of the data set and some from the end.

If we call `df.head()` we get the start of the data. `df.tail()` shows us the end of the data.

Comment/uncomment the different options to see how they work.

In [2]:
df

# df.head()
# df.tail()

,dbn,beds,district,geo_district,boro,school_name,short_name,ay,year,school_type,...,missing_race_ethnicity_data_pct,swd_n,swd_pct,ell_n,ell_pct,poverty_n,poverty_pct,eni,clean_name,zip
0,01M015,310100010015,1,1,Manhattan,P.S. 015 Roberto Clemente,PS 15,2016,2016-17,community,...,0.000000,51,0.287000,12,0.067000,152,0.854000,0.882000,roberto clemente,10009
1,01M019,310100010019,1,1,Manhattan,P.S. 019 Asher Levy,PS 19,2016,2016-17,community,...,0.000000,88,0.325000,9,0.033000,207,0.764000,0.578000,asher levy,10003
2,01M020,310100010020,1,1,Manhattan,P.S. 020 Anna Silver,PS 20,2016,2016-17,community,...,0.000000,116,0.215000,93,0.172000,315,0.583000,0.677000,anna silver,10002
3,01M034,310100010034,1,1,Manhattan,P.S. 034 Franklin D. Roosevelt,PS 34,2016,2016-17,community,...,0.003000,130,0.371000,27,0.077000,336,0.960000,0.862000,franklin d roosevelt,10009
4,01M063,310100010063,1,1,Manhattan,The STAR Academy - P.S.63,PS 63,2016,2016-17,community,...,0.000000,66,0.330000,5,0.025000,165,0.825000,0.690000,the star academy - ps63,10009
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11115,84X730,320800860846,84,8,Bronx,Bronx Charter School for the Arts,PS 730,2018,2018-19,charter,...,0.000000,103,0.239535,71,0.165116,363,0.844186,0.887860,bronx charter school for the arts,10474
11116,84X730,320800860846,84,8,Bronx,Bronx Charter School for the Arts,MS 730,2019,2019-20,charter,...,0.000000,117,0.223709,69,0.131931,453,0.866157,0.892342,bronx charter school for the arts,10474
11117,84X730,320800860846,84,8,Bronx,Bronx Charter School for the Arts,MS 730,2020,2020-21,charter,...,0.001597,152,0.242812,78,0.124601,547,0.873802,0.888728,bronx charter school for the arts,10474
11118,84X730,320800860846,84,8,Bronx,Bronx Charter School for the Arts,MS 730,2021,2021-22,charter,...,0.003344,135,0.225753,79,0.132107,540,0.903010,0.903701,bronx charter school for the arts,10474
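To try `head()` and `tail()` without loading the full data set, here is a minimal sketch. It assumes a small hand-built `DataFrame` named `sample` (a hypothetical stand-in for `df`) that holds a few values copied from the first five rows shown above. Both methods return 5 rows by default and accept the number of rows as an argument.

```python
import pandas as pd

# Hypothetical stand-in for `df`: values copied from the first five rows above.
sample = pd.DataFrame({
    "dbn": ["01M015", "01M019", "01M020", "01M034", "01M063"],
    "short_name": ["PS 15", "PS 19", "PS 20", "PS 34", "PS 63"],
    "poverty_pct": [0.854, 0.764, 0.583, 0.960, 0.825],
})

first_two = sample.head(2)  # the first two rows
last_two = sample.tail(2)   # the last two rows

print(first_two)
print(last_two)
```

The row labels (the index) are kept, so `last_two` still shows labels 3 and 4 rather than starting again at 0.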


In [3]:
# The `columns` property shows us the names of the columns in our `df`.
df.columns

Index(['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name',
       'short_name', 'ay', 'year', 'school_type', 'total_enrollment',
       'grade_pk', 'grade_k', 'grade_1', 'grade_2', 'grade_3', 'grade_4',
       'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9', 'grade_10',
       'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n', 'male_pct',
       'asian_n', 'asian_pct', 'black_n', 'black_pct', 'hispanic_n',
       'hispanic_pct', 'multi_racial_n', 'multi_racial_pct',
       'native_american_n', 'native_american_pct', 'white_n', 'white_pct',
       'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct',
       'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct',
       'eni', 'clean_name', 'zip'],
      dtype='object')
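Because `columns` is an `Index`, we can also turn it into a plain list or check whether a name is present before using it. The sketch below assumes a hypothetical two-row `DataFrame` named `sample` with a handful of the column names listed above; it is not the real data set.

```python
import pandas as pd

# Hypothetical stand-in for `df` with only four of the columns listed above.
sample = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "school_name": ["P.S. 015 Roberto Clemente", "P.S. 019 Asher Levy"],
    "poverty_n": [152, 207],
    "poverty_pct": [0.854, 0.764],
})

column_names = list(sample.columns)               # plain Python list of names
has_poverty_pct = "poverty_pct" in sample.columns  # True: the column exists
has_poverty = "poverty" in sample.columns          # False: no column with that exact name

print(column_names)
print(has_poverty_pct, has_poverty)
```

A check like this is a quick way to avoid a `KeyError` when a column name is misspelled or abbreviated.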

We can access just one column using index notation:
`df["poverty_pct"]` gives us just that column as a `Series`. We can then display or sort the data using either
the Python built-in function `sorted()` or the pandas `Series` method `sort_values()`.
If we call `unique()` we get an array of the unique values in the `Series`. In the case
of `poverty_pct`, that lets us check whether the column contains raw numbers or string data
(for example, a label used when the poverty level is too high or too low to report).
The output below shows only numbers.

In [4]:
# get just the poverty_pct column
poverty = df["poverty_pct"]

# pandas also supports "dot notation" for access to columns,
# but you can't use dot notation if the column name is not a valid identifier,
# is a Python keyword, or is the name of a property or method of the DataFrame

poverty = df.poverty_pct # same as the index notation above

poverty = poverty.sort_values()
print("Note: percentages are displayed as real numbers between 0 and 1")
print(poverty.unique())


Note: percentages are displayed as real numbers between 0 and 1
[0.04       0.05294118 0.05947137 ... 0.94999999 0.95       0.96      ]
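The other `Series` methods mentioned at the start of this notebook work the same way. As a minimal sketch, assume two small hand-built `Series` (hypothetical stand-ins for `df["boro"]` and `df["poverty_n"]`) holding values copied from six of the rows displayed earlier:

```python
import pandas as pd

# Hypothetical stand-ins for two columns of `df`, six rows each.
boro = pd.Series(
    ["Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Bronx"],
    name="boro",
)
poverty_n = pd.Series([152, 207, 315, 336, 165, 363], name="poverty_n")

unique_boros = list(boro.unique())  # distinct values, in order of first appearance
lowest = poverty_n.min()            # smallest value
highest = poverty_n.max()           # largest value
total = poverty_n.sum()             # sum of all values
average = poverty_n.mean()          # arithmetic mean

# sorted() returns a plain Python list; sort_values() returns a new Series
as_list = sorted(poverty_n)
as_series = poverty_n.sort_values()

print(unique_boros)
print(lowest, highest, total, average)
print(as_list)
print(as_series)
```

Note that `sort_values()` keeps each value's original index label, which makes it easy to trace a sorted value back to its row in the `DataFrame`.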


We can get a subset of the data by using index notation with
a list of column names:

`df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]` returns a `DataFrame`
with 4 columns.

In [5]:
df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]

,dbn,school_name,total_enrollment,poverty_n
0,01M015,P.S. 015 Roberto Clemente,178,152
1,01M019,P.S. 019 Asher Levy,271,207
2,01M020,P.S. 020 Anna Silver,540,315
3,01M034,P.S. 034 Franklin D. Roosevelt,350,336
4,01M063,The STAR Academy - P.S.63,200,165
...,...,...,...,...
11115,84X730,Bronx Charter School for the Arts,430,363
11116,84X730,Bronx Charter School for the Arts,523,453
11117,84X730,Bronx Charter School for the Arts,626,547
11118,84X730,Bronx Charter School for the Arts,598,540
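The difference between single and double brackets is worth trying out: a single column name returns a `Series`, while a list of names returns a `DataFrame`, even when the list has only one name. The sketch below assumes a hypothetical `DataFrame` named `sample` built from the first five rows shown above. As an illustration of column arithmetic, it also divides `poverty_n` by `total_enrollment`; for these five rows the rounded result agrees with the `poverty_pct` values displayed earlier, though that is an observation about these rows, not a statement of how the package computes the column.

```python
import pandas as pd

# Hypothetical stand-in for `df`: values copied from the first five rows above.
sample = pd.DataFrame({
    "dbn": ["01M015", "01M019", "01M020", "01M034", "01M063"],
    "total_enrollment": [178, 271, 540, 350, 200],
    "poverty_n": [152, 207, 315, 336, 165],
})

one_column = sample["poverty_n"]                         # Series
one_column_frame = sample[["poverty_n"]]                 # DataFrame with one column
subset = sample[["dbn", "total_enrollment", "poverty_n"]]  # DataFrame with three columns

# Element-wise division of two columns, rounded to three decimal places
share = (sample["poverty_n"] / sample["total_enrollment"]).round(3)

print(type(one_column).__name__, type(one_column_frame).__name__)
print(subset.shape)
print(share.tolist())
```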
