Querying Data
============

In this Notebook we go over some of the ways of querying data in a dataframe.

- `query()` method in `DataFrame`
- `[]` slice notation for subsets of data


In [7]:
# import schools from the nycschools package
from nycschools import schools

# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()

In [8]:
# we already saw how to get a subset of columns
# let's get just these columns from our data
cols = ["dbn", "ay", "school_name", "district", "poverty_pct", "ell_pct", "swd_pct"]
df = df[cols]
df

Unnamed: 0,dbn,ay,school_name,district,poverty_pct,ell_pct,swd_pct
0,01M015,2016,P.S. 015 Roberto Clemente,1,0.854,0.067,0.287000
1,01M015,2017,P.S. 015 Roberto Clemente,1,0.847,0.042,0.258000
2,01M015,2018,P.S. 015 Roberto Clemente,1,0.845,0.046,0.224000
3,01M015,2019,P.S. 015 Roberto Clemente,1,0.816,0.089,0.242000
4,01M015,2020,P.S. 015 Roberto Clemente,1,0.819,0.109,0.223000
...,...,...,...,...,...,...,...
9996,84X730,2016,Bronx Charter School for the Arts,84,0.734,0.159,0.209375
9997,84X730,2017,Bronx Charter School for the Arts,84,0.822,0.182,0.216561
9998,84X730,2018,Bronx Charter School for the Arts,84,0.844,0.165,0.239535
9999,84X730,2019,Bronx Charter School for the Arts,84,0.866,0.132,0.223709


Selecting just one year
---------------------------------
We can see that our data contains multiple years of data for each school.
We can filter or query the data to just get a single year.

To do this, we use slice notation as we did above. However,
instead of a list of columns, we pass a Boolean expression built from
one of the columns in our data set.

`df[df.ay == 2020]` returns only the rows where `ay` equals `2020`
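
To see why this works, it can help to look at the Boolean expression on its own. The sketch below uses a small made-up `DataFrame` (the `toy` frame and its values are hypothetical, not taken from the school data). It shows that the comparison produces one `True`/`False` value per row, and that passing those values inside `[]` keeps only the `True` rows.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "dbn": ["A", "A", "B", "B"],
    "ay": [2019, 2020, 2019, 2020],
})

# comparing a column to a value gives one True/False per row
mask = toy.ay == 2020

# using the mask inside [] keeps the rows where the mask is True
toy_2020 = toy[mask]

print(mask.tolist())
print(toy_2020)
```

The filtered rows keep their original index labels, so the index of the result is not renumbered from 0.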


In [9]:
ay_2020 = df[df.ay == 2020]
ay_2020.head()

Unnamed: 0,dbn,ay,school_name,district,poverty_pct,ell_pct,swd_pct
4,01M015,2020,P.S. 015 Roberto Clemente,1,0.819,0.109,0.223
9,01M019,2020,P.S. 019 Asher Levy,1,0.712,0.042,0.392
14,01M020,2020,P.S. 020 Anna Silver,1,0.709,0.119,0.218
19,01M034,2020,P.S. 034 Franklin D. Roosevelt,1,0.96,0.062,0.392
24,01M063,2020,The STAR Academy - P.S.63,1,0.769,0.014,0.279


Compound expressions
----------------------------------
We can use Boolean operators to make more complex queries.
`pandas` uses the `&` operator for Boolean **and**
and the `|` operator for Boolean **or**. Note that you should wrap
each comparison in parentheses, because `&` and `|` bind more
tightly than comparison operators such as `==`.

To keep queries readable and less error prone, we recommend not mixing
**and** (`&`) and **or** (`|`) clauses in the same query.

Below are two examples:

- `df[(df.district == 13) & (df.ay == 2020)]`
   find data for schools in district 13 _and_ academic year 2020-21
- `df[(df.poverty_pct > .9) | (df.ell_pct >= .4)]`
   find school data where the school's poverty percent is greater than 90% _or_
   the percent of students classified as English Language Learners is greater
   than or equal to 40%
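
The parentheses matter because of Python's operator precedence. The sketch below, again on a hypothetical `toy` frame with made-up values, shows the parenthesized form working and the unparenthesized form raising a `ValueError`, since Python evaluates `13 & toy.ay` before the comparisons.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "district": [13, 13, 14],
    "ay": [2019, 2020, 2020],
})

# with parentheses: each comparison is evaluated first, then combined
both = toy[(toy.district == 13) & (toy.ay == 2020)]
print(both)

# without parentheses: Python evaluates 13 & toy.ay first, and the
# resulting chained comparison cannot be reduced to a single True/False
try:
    toy[toy.district == 13 & toy.ay == 2020]
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```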


In [11]:
df[(df.district == 13) & (df.ay == 2020)].head()

Unnamed: 0,dbn,ay,school_name,district,poverty_pct,ell_pct,swd_pct
1556,13K869,2020,District 13 PRE-K Center,13,0.076,0.0,0.015
2129,13K915,2020,I.S. 915,13,0.309,0.023,0.201
2130,13K915,2020,I.S. 915,13,0.309,0.023,0.201
2131,13K915,2020,I.S. 915,13,0.309,0.023,0.201
2132,13K915,2020,I.S. 915,13,0.309,0.023,0.201


In [12]:
df[(df.poverty_pct > .9) | (df.ell_pct >= .4)].head()

Unnamed: 0,dbn,ay,school_name,district,poverty_pct,ell_pct,swd_pct
15,01M034,2016,P.S. 034 Franklin D. Roosevelt,1,0.96,0.077,0.371
16,01M034,2017,P.S. 034 Franklin D. Roosevelt,1,0.96,0.075,0.372
17,01M034,2018,P.S. 034 Franklin D. Roosevelt,1,0.96,0.072,0.384
18,01M034,2019,P.S. 034 Franklin D. Roosevelt,1,0.96,0.057,0.395
19,01M034,2020,P.S. 034 Franklin D. Roosevelt,1,0.96,0.062,0.392


In [16]:
# this works, but it is confusing and error prone
df[(df.ay == 2020) & ((df.poverty_pct > .9) | (df.ell_pct >= .4))].head()


Unnamed: 0,dbn,ay,school_name,district,poverty_pct,ell_pct,swd_pct
19,01M034,2020,P.S. 034 Franklin D. Roosevelt,1,0.96,0.062,0.392
29,01M064,2020,P.S. 064 Robert Simon,1,0.914,0.018,0.264
39,01M134,2020,P.S. 134 Henrietta Szold,1,0.949,0.055,0.419
44,01M140,2020,P.S. 140 Nathan Straus,1,0.95,0.073,0.372
49,01M142,2020,P.S. 142 Amalia Castro,1,0.946,0.066,0.284


In [14]:
# rather than mixing | and &, we can write our code more clearly in two lines
data = df[df.ay == 2020]
data = data[(data.poverty_pct > .9) | (data.ell_pct >= .4)]
data.head()

Unnamed: 0,dbn,ay,school_name,district,poverty_pct,ell_pct,swd_pct
19,01M034,2020,P.S. 034 Franklin D. Roosevelt,1,0.96,0.062,0.392
29,01M064,2020,P.S. 064 Robert Simon,1,0.914,0.018,0.264
39,01M134,2020,P.S. 134 Henrietta Szold,1,0.949,0.055,0.419
44,01M140,2020,P.S. 140 Nathan Straus,1,0.95,0.073,0.372
49,01M142,2020,P.S. 142 Amalia Castro,1,0.946,0.066,0.284
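
The same two-step filter can also be written with the `query()` method mentioned at the top of this notebook. This is a minimal sketch on a hypothetical `toy` frame (the values are made up for illustration). It chains two `query()` calls so that the **and** and **or** conditions stay separate, then checks that the result matches the slice-notation version.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "ay": [2019, 2020, 2020, 2020],
    "poverty_pct": [0.95, 0.95, 0.50, 0.50],
    "ell_pct": [0.10, 0.10, 0.45, 0.10],
})

# slice notation, in two steps
sliced = toy[toy.ay == 2020]
sliced = sliced[(sliced.poverty_pct > 0.9) | (sliced.ell_pct >= 0.4)]

# query() takes the condition as a string and refers to columns by name
queried = toy.query("ay == 2020").query("poverty_pct > 0.9 or ell_pct >= 0.4")

# both approaches select the same rows
print(queried)
print(sliced.equals(queried))
```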
