Cleaning and Querying Pandas Data
==============================

This notebook shows us how to **modify** and **create new** columns in
a pandas `DataFrame`. This is an important part of cleaning data to
prepare it for analysis and visualization.

We also look at some techniques for _searching data_ in order
to get subsets of data (rows) that match certain criteria.

Finally, we learn how to save and load local copies of the data so that
we can easily use the cleaned data.

Some relevant resources:
- <https://www.statology.org/create-column-based-on-condition-pandas/>
- <https://datascientyst.com/select-rows-column-value-pandas/>

In [2]:
# import pandas and load the school demographic data set into df
import pandas as pd
url = "https://data.cityofnewyork.us/resource/vmmu-wj3w.csv?$limit=1000000"
df = pd.read_csv(url)

Selecting rows with boolean indexing
===============================

Pandas has several ways to select data -- to get a subset of data from a `DataFrame`.
One of the easiest ways is to use "boolean indexing". This approach lets us use the
Python comparison operators to find rows that match our criteria. We can then use
the "index" of these rows to get a subset from a dataframe.

In this example we use the "dbn" field to look for a match with a specific string. We
capture the results in a variable, `ps9`, that holds the new `DataFrame`.

In [57]:
# let's find PS 9 in Brooklyn, District 13
ps9 = df[ df["dbn"] == "13K009"]
# note: "prek" is the derived column created in the "Creating new columns" section below
ps9[ ["dbn", "school_name", "year", "total_enrollment", "grade_3k_pk_half_day_full", "prek"] ]

Unnamed: 0,dbn,school_name,year,total_enrollment,grade_3k_pk_half_day_full,prek
3247,13K009,Public School 9 The Sarah Smith Garnet School,2016-17,866,90,True
3248,13K009,Public School 9 The Sarah Smith Garnet School,2017-18,912,132,True
3249,13K009,Public School 9 The Sarah Smith Garnet School,2018-19,942,130,True
3250,13K009,Public School 9 The Sarah Smith Garnet School,2019-20,937,126,True
3251,13K009,Public School 9 The Sarah Smith Garnet School,2020-21,852,111,True
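
To see what boolean indexing does step by step, here is a minimal sketch on a tiny hand-built dataframe. The name `toy` and its three rows are stand-ins for illustration, not the full data set. The comparison alone produces a `Series` of `True`/`False` values, and passing that `Series` inside the square brackets keeps only the `True` rows.

```python
import pandas as pd

# toy stand-in for the school data (illustrative rows only)
toy = pd.DataFrame({
    "dbn": ["13K009", "13K009", "01M015"],
    "year": ["2019-20", "2020-21", "2020-21"],
    "total_enrollment": [937, 852, 193],
})

# the comparison on its own returns one True/False value per row
mask = toy["dbn"] == "13K009"
print(mask.tolist())  # [True, True, False]

# using that Series inside [] keeps only the rows where it is True
subset = toy[mask]
print(len(subset))  # 2
```

Writing `toy[toy["dbn"] == "13K009"]` in one line, as in the cell above, does the same thing without naming the intermediate mask.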


Sometimes, we want to filter our main dataframe. To make the rest of our examples easier, let's
filter our dataframe so that we only have the data from the most recent academic year.

In [61]:
df = df[df["year"] == "2020-21"]

We can use boolean indexing with more complex examples, too, by combining boolean expressions.
For this, we use the `&` (**and**) operator and the `|` (**or**) operator.

Here we find schools where more than half the students are Asian and more than half the students are classified as ELL.

In [62]:
asian_enl = df[ (df["asian_1"] > .5) & (df["english_language_learners_1"] > .5) ]
# df[ df["english_language_learners_1"] > .5 ]
asian_enl[["dbn", "school_name", "asian_1", "english_language_learners_1"]]

Unnamed: 0,dbn,school_name,asian_1,english_language_learners_1
3686,15K094,P.S. 094 The Henry Longfellow,0.703,0.554
4660,20K069,P.S. 69 Vincent D. Grippo School,0.891,0.501
4675,20K105,P.S. 105 The Blythebourne,0.926,0.515
4690,20K160,P.S. 160 William T. Sampson,0.885,0.632
5850,25Q244,The Active Learning Elementary School,0.889,0.546


Here we use the `|` operator to find schools that are either >70% Black or >70% Latinx.

In [68]:
black_or_latinx = df[ (df["black_1"] > .7) | (df["hispanic_1"] > .7) ]
print(f"""{len(black_or_latinx)} of {len(df)} schools in our data set have more than 70% Black or Latinx student populations.""")


564 of 1878 schools in our data set have more than 70% Black or Latinx student populations.
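
Two details of combined conditions are easy to trip over, so here is a small sketch on made-up data (the `toy` dataframe and school names "A" to "D" are hypothetical). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>`, and Python's plain `and` / `or` keywords do not work on a whole `Series`.

```python
import pandas as pd

# made-up percentages for four hypothetical schools
toy = pd.DataFrame({
    "school_name": ["A", "B", "C", "D"],
    "black_1": [0.80, 0.10, 0.40, 0.75],
    "hispanic_1": [0.10, 0.85, 0.40, 0.72],
})

# | keeps rows where at least one condition is True
either = toy[(toy["black_1"] > .7) | (toy["hispanic_1"] > .7)]
# & keeps rows where both conditions are True
both = toy[(toy["black_1"] > .7) & (toy["hispanic_1"] > .7)]

print(either["school_name"].tolist())  # ['A', 'B', 'D']
print(both["school_name"].tolist())    # ['D']

# the keyword "or" asks for a single True/False answer from a whole Series,
# which pandas refuses to give, so this raises a ValueError
raised = False
try:
    toy[(toy["black_1"] > .7) or (toy["hispanic_1"] > .7)]
except ValueError:
    raised = True
print(raised)  # True
```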


Creating new columns
===================
Create a new column based on data from other columns
---------------------------------------------------------------------------------

One of the most straightforward ways to modify a `DataFrame` is to create
a new column based on data in an existing column. Let's do that with our data.

In New York City some elementary schools offer "universal pre-k" -- school for 3 year olds
and 4 year olds. We can infer whether a school has pre-k based on the
number of students enrolled in prek-3 or prek-4, indicated by 'grade_3k_pk_half_day_full'.

Let's start simple and add a new column called `prek` which will hold a `Boolean`
value: `True` if the school has more than zero students in pre-k and `False` if it doesn't.

In [25]:
df["prek"] = df["grade_3k_pk_half_day_full"] > 0

# let's see some cols in our df to see if it worked
df[["dbn", "school_name", "prek"]]


Unnamed: 0,dbn,school_name,prek
0,01M015,P.S. 015 Roberto Clemente,True
1,01M015,P.S. 015 Roberto Clemente,True
2,01M015,P.S. 015 Roberto Clemente,True
3,01M015,P.S. 015 Roberto Clemente,True
4,01M015,P.S. 015 Roberto Clemente,True
...,...,...,...
9164,84X730,Bronx Charter School for the Arts,False
9165,84X730,Bronx Charter School for the Arts,False
9166,84X730,Bronx Charter School for the Arts,False
9167,84X730,Bronx Charter School for the Arts,False


In the code cell above, the left-hand side of the assignment, `df["prek"]`, names the new column.
The right-hand side is a Boolean expression that applies the `> 0` comparison operator to
the field "grade_3k_pk_half_day_full". Once that expression is evaluated for every row,
the resulting `True`/`False` values are assigned to the new "prek" field.

Now let's do a basic query to find how many schools have prek and how many don't.


In [39]:
schools_with_prek = df[ df["prek"] == True]
schools_without_prek = df[ df["prek"] == False]

print(f"""{len(schools_with_prek["dbn"].unique())} offer pre-k""")
print(f"""{len(schools_without_prek["dbn"].unique())} don't offer pre-k""")

743 offer pre-k
1227 don't offer pre-k
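
As a side note, a Boolean column can serve as the mask directly, without `== True`, and `~` inverts it. The sketch below shows this on a made-up `toy` dataframe (the rows are illustrative, not the real counts) and also one possible way to get both counts in a single step with `groupby()` and `nunique()`.

```python
import pandas as pd

# made-up rows: the same school can appear once per academic year
toy = pd.DataFrame({
    "dbn": ["01M015", "01M015", "01M019", "84X730", "84X730"],
    "prek": [True, True, True, False, False],
})

# a Boolean column is already a mask; ~ flips True and False
with_prek = toy[toy["prek"]]
without_prek = toy[~toy["prek"]]

print(with_prek["dbn"].nunique())     # 2
print(without_prek["dbn"].nunique())  # 1

# count the distinct schools in each group in one step
counts = toy.groupby("prek")["dbn"].nunique()
print(counts.to_dict())  # {False: 1, True: 2}
```

`nunique()` gives the same number as `len(... .unique())` when the column has no missing values.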


We can use this approach to make more complex columns, too. Let's add
2 new columns, `black_hispanic` with the combined enrollment of Black and Latinx
students, and `black_hispanic_1` with the percent of Black and Latinx.

First we use the `+` operator to add the values of two columns, then the `/` operator to divide two columns.

In [76]:
df["black_hispanic"] = df["black"] + df["hispanic"]

df["black_hispanic_1"] = df["black_hispanic"] / df["total_enrollment"]
df[["dbn", "school_name", "total_enrollment", "black", "hispanic", "black_hispanic", "black_hispanic_1"]]

Unnamed: 0,dbn,school_name,total_enrollment,black,hispanic,black_hispanic,black_hispanic_1
4,01M015,P.S. 015 Roberto Clemente,193,53,102,155,0.803109
9,01M019,P.S. 019 Asher Levy,212,41,130,171,0.806604
14,01M020,P.S. 020 Anna Silver,412,55,215,270,0.655340
19,01M034,P.S. 034 Franklin D. Roosevelt,273,104,152,256,0.937729
24,01M063,The STAR Academy - P.S.63,208,40,132,172,0.826923
...,...,...,...,...,...,...,...
9148,84X705,Family Life Academy Charter School,416,88,323,411,0.987981
9153,84X706,Harriet Tubman Charter School,647,399,232,631,0.975270
9158,84X717,Icahn Charter School,328,164,157,321,0.978659
9163,84X718,Bronx Charter School for Better Learning,570,482,57,539,0.945614
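
Column arithmetic works row by row: each value is combined with the value in the same row of the other column. The sketch below repeats the calculation on three rows copied from the output above, so the numbers can be checked by hand. The `.round(3)` call is an optional extra, assuming you want the new percent column to look like the three-decimal `_1` columns already in the data.

```python
import pandas as pd

# three rows copied from the output above
toy = pd.DataFrame({
    "dbn": ["01M015", "01M019", "01M020"],
    "total_enrollment": [193, 212, 412],
    "black": [53, 41, 55],
    "hispanic": [102, 130, 215],
})

# row-by-row addition: 53 + 102, 41 + 130, 55 + 215
toy["black_hispanic"] = toy["black"] + toy["hispanic"]
print(toy["black_hispanic"].tolist())  # [155, 171, 270]

# row-by-row division, rounded to three decimal places
toy["black_hispanic_1"] = (toy["black_hispanic"] / toy["total_enrollment"]).round(3)
print(toy["black_hispanic_1"].tolist())  # [0.803, 0.807, 0.655]
```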


Create a new column with a function using `apply()`
--------------------------------------------------------------------------
Sometimes the column we want can't be easily calculated using arithmetic. Pandas
lets us use a function to calculate values with more complex logic.
You can create a new `Series` by calling the `apply()` method on a column from your dataframe.

In this example we parse the dbn to create a new field called `district`.

In [77]:
def parse_district(dbn):
    # use Python string "slice" notation to get the district part of the DBN
    # we know it's always the first two characters
    return int(dbn[:2])

df["district"] = df["dbn"].apply(parse_district)
df[["dbn", "district", "school_name"]]

Unnamed: 0,dbn,district,school_name
4,01M015,1,P.S. 015 Roberto Clemente
9,01M019,1,P.S. 019 Asher Levy
14,01M020,1,P.S. 020 Anna Silver
19,01M034,1,P.S. 034 Franklin D. Roosevelt
24,01M063,1,The STAR Academy - P.S.63
...,...,...,...
9148,84X705,84,Family Life Academy Charter School
9153,84X706,84,Harriet Tubman Charter School
9158,84X717,84,Icahn Charter School
9163,84X718,84,Bronx Charter School for Better Learning
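
For simple string operations like this slice, pandas also offers the `.str` accessor, which applies the operation to every value in a column without a separate function. This is a sketch of an alternative on a three-row `toy` dataframe (a stand-in for illustration), not a replacement for `apply()`, which is still the tool to reach for when the logic is more involved.

```python
import pandas as pd

toy = pd.DataFrame({"dbn": ["01M015", "13K009", "84X730"]})

def parse_district(dbn):
    # the district is the first two characters of the DBN
    return int(dbn[:2])

# apply() calls parse_district once for each value in the column
via_apply = toy["dbn"].apply(parse_district)

# .str[:2] slices every string; astype(int) converts the results to integers
via_str = toy["dbn"].str[:2].astype(int)

print(via_apply.tolist())  # [1, 13, 84]
print(via_str.tolist())    # [1, 13, 84]
```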


**Using lambda**

You can always define a function and use `apply()` the way I do in the example.
Often, though, professional programmers will use lambda functions as a shortcut
to do this. You can read about lambdas here:
<https://www.freecodecamp.org/news/lambda-expressions-in-python/>
    
Basically, a lambda is an anonymous function (it has no name and only exists briefly) which
takes arguments and returns a value from a single expression. You are not required to use lambdas,
but there's a good chance that you will see them being used in examples online.

We could rewrite our example above more concisely using a lambda. Note that this version skips
the `int()` conversion, so the district stays a string such as "01".

In [78]:
df["district_lambda"] = df["dbn"].apply(lambda dbn: dbn[:2])
df[["dbn", "district_lambda", "school_name"]]

Unnamed: 0,dbn,district_lambda,school_name
4,01M015,01,P.S. 015 Roberto Clemente
9,01M019,01,P.S. 019 Asher Levy
14,01M020,01,P.S. 020 Anna Silver
19,01M034,01,P.S. 034 Franklin D. Roosevelt
24,01M063,01,The STAR Academy - P.S.63
...,...,...,...
9148,84X705,84,Family Life Academy Charter School
9153,84X706,84,Harriet Tubman Charter School
9158,84X717,84,Icahn Charter School
9163,84X718,84,Bronx Charter School for Better Learning
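
The lambda version keeps the leading zero because it returns the sliced text as is. If you want the same integer result as `parse_district`, the conversion can go inside the lambda. A minimal sketch on a three-row `toy` dataframe (a stand-in for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"dbn": ["01M015", "13K009", "84X730"]})

# slice only: the result is text, so "01" keeps its leading zero
as_text = toy["dbn"].apply(lambda dbn: dbn[:2])

# slice and convert: the result is an integer, like parse_district
as_int = toy["dbn"].apply(lambda dbn: int(dbn[:2]))

print(as_text.tolist())  # ['01', '13', '84']
print(as_int.tolist())   # [1, 13, 84]
```

Which one to keep depends on how the column will be used: text sorts and displays with the leading zero, while integers are easier to compare numerically.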


Now that we've made these changes to our core data, we can export it so that we can either
load it into a different notebook file, or another program (maybe for visualization, or a web server).
Pandas makes this _very_ easy. Here we save `df` as a .csv file (you can open it in Excel).
It will save it into the same folder as our notebook file.

In [79]:
df.to_csv("school-demographics-ay2020.csv")
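
Loading the saved file back is the mirror image of saving it. The sketch below round-trips a two-row `toy` dataframe through a temporary file; the file name and the temporary folder are assumptions for the example, and in your own notebook you would pass the name used in `to_csv()` above. Two options are worth knowing: `index_col=0` reads the saved row labels back as the index, and `dtype` keeps a text column such as `district_lambda` from being parsed as numbers.

```python
import os
import tempfile
import pandas as pd

# toy stand-in with non-default row labels, like our filtered df
toy = pd.DataFrame(
    {"dbn": ["01M015", "13K009"], "district_lambda": ["01", "13"]},
    index=[4, 3251],
)

# hypothetical file name inside a temporary folder
path = os.path.join(tempfile.mkdtemp(), "toy-demographics.csv")
toy.to_csv(path)

# index_col=0 restores the row labels instead of adding an "Unnamed: 0" column
# dtype keeps "01" as text; without it the column would be read as integers
loaded = pd.read_csv(path, index_col=0, dtype={"district_lambda": str})

print(loaded.index.tolist())               # [4, 3251]
print(loaded["district_lambda"].tolist())  # ['01', '13']
```

If you don't need the row labels at all, passing `index=False` to `to_csv()` leaves them out of the file.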