Calculating New Columns
=======================

`pandas` makes it easy to calculate new fields based on
existing data. This notebook looks at the _easy_ cases
and then takes on some more advanced cases using our
school demographics data set.

Topics in this notebook:

- calculating columns with _vectorization_
- calculating columns with `apply()` on a series
- lambda functions
- use `copy()` to make a copy of a data frame (a deep copy by default)

In [1]:
from nycschools import schools

df = schools.load_school_demographics()


In [2]:
# with "vectorization"
# ---------------------

# calculate a new field for the number of students in each school who are black or hispanic
df["black_hispanic_n"] = df.black_n + df.hispanic_n

# now calculate that as a pct of the total enrollment
df["black_hispanic_pct"] = df.black_hispanic_n / df.total_enrollment

df[ ["dbn","school_name","total_enrollment", "black_hispanic_n", "black_hispanic_pct"] ]

Unnamed: 0,dbn,school_name,total_enrollment,black_hispanic_n,black_hispanic_pct
0,01M015,P.S. 015 Roberto Clemente,178,156,0.876404
1,01M019,P.S. 019 Asher Levy,271,231,0.852399
2,01M020,P.S. 020 Anna Silver,540,316,0.585185
3,01M034,P.S. 034 Franklin D. Roosevelt,350,318,0.908571
4,01M063,The STAR Academy - P.S.63,200,166,0.830000
...,...,...,...,...,...
11115,84X730,Bronx Charter School for the Arts,430,419,0.974419
11116,84X730,Bronx Charter School for the Arts,523,509,0.973231
11117,84X730,Bronx Charter School for the Arts,626,609,0.972843
11118,84X730,Bronx Charter School for the Arts,598,581,0.971572
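
To see what "vectorization" buys us, here is a minimal sketch on a tiny, made-up data frame (the three rows below are illustrative values, not records from the data set). The column arithmetic gives the same result as an explicit row-by-row loop, without writing the loop.

```python
import pandas as pd

# toy data: these values are made up for illustration
toy = pd.DataFrame({
    "black_n": [50, 30, 0],
    "hispanic_n": [100, 20, 10],
    "total_enrollment": [200, 100, 40],
})

# vectorized: each expression operates on whole columns at once
toy["black_hispanic_n"] = toy.black_n + toy.hispanic_n
toy["black_hispanic_pct"] = toy.black_hispanic_n / toy.total_enrollment

# the same calculation written as an explicit loop over the rows
looped = [
    (row.black_n + row.hispanic_n) / row.total_enrollment
    for row in toy.itertuples()
]

print(toy.black_hispanic_pct.tolist())  # [0.75, 0.5, 0.25]
print(looped == toy.black_hispanic_pct.tolist())  # True
```

Note that the result is a fraction between 0 and 1, not a number between 0 and 100, even though the column name ends in `_pct`.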


In [3]:
# we can also use boolean expressions -- let's mark all of the schools not in districts 1-32 as "special_district"
df["special_district"] = df.district > 32
df[["dbn", "district", "special_district"]]

Unnamed: 0,dbn,district,special_district
0,01M015,1,False
1,01M019,1,False
2,01M020,1,False
3,01M034,1,False
4,01M063,1,False
...,...,...,...
11115,84X730,84,True
11116,84X730,84,True
11117,84X730,84,True
11118,84X730,84,True
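
A comparison on a column produces a boolean Series, which can be stored as a column and also used to filter rows. The sketch below uses made-up district numbers to show both uses; the `between()` version is only equivalent under the assumption that no district number is below 1.

```python
import pandas as pd

# toy data: district numbers chosen for illustration
toy = pd.DataFrame({"district": [1, 32, 75, 84]})

# the comparison is evaluated for every row at once
toy["special_district"] = toy.district > 32
print(toy.special_district.tolist())  # [False, False, True, True]

# an equivalent way to say "not in districts 1-32"
# (assumes district numbers start at 1)
outside = ~toy.district.between(1, 32)
print((toy.special_district == outside).all())  # True

# a boolean column can be used directly as a row filter
special = toy[toy.special_district]
print(special.district.tolist())  # [75, 84]
```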


In [4]:
# vectorization is usually the fastest way to create cols based on calculations,
# but it can't easily express arbitrary per-value logic
# here we use apply() to format total enrollment to make it easier to read
# we'll call the new field total_enrollment_pp -- pp: pretty print

# create a function that we will "apply" to each value in that column
def fmt_enroll(n):
    return f"{n:,}"

df["total_enrollment_pp"] = df.total_enrollment.apply(fmt_enroll)
big_schools = df.sort_values(by="total_enrollment", ascending=False)[0:20]
big_schools[["dbn","total_enrollment", "total_enrollment_pp"]].head()
    

Unnamed: 0,dbn,total_enrollment,total_enrollment_pp
5153,13K430,6040,6040
5155,13K430,5958,5958
5156,13K430,5942,5942
5152,13K430,5937,5937
5154,13K430,5921,5921
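
As a small illustration of what `apply()` does with a function like `fmt_enroll`, the sketch below runs it on a made-up Series (the three enrollment numbers are illustrative). It also shows why the cell above sorts on the numeric `total_enrollment` column rather than the formatted one: the formatted values are strings, and strings sort alphabetically.

```python
import pandas as pd

def fmt_enroll(n):
    # the "," format spec inserts thousands separators
    return f"{n:,}"

# toy data: made-up enrollment numbers
enrollment = pd.Series([178, 6040, 12500])

# apply() calls fmt_enroll once per value and collects the results
pretty = enrollment.apply(fmt_enroll)
print(pretty.tolist())  # ['178', '6,040', '12,500']

# the formatted values are strings, so they sort alphabetically
print(pretty.sort_values().tolist())  # ['12,500', '178', '6,040']

# the numeric values sort as expected
print(enrollment.sort_values().tolist())  # [178, 6040, 12500]
```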


In [5]:
# since our function is so simple, it's a good candidate for a lambda function
# lambdas in python are anonymous functions that have a reduced syntax
# see examples here:
# https://www.freecodecamp.org/news/python-lambda-function-explained/

# make a copy of our data
data = df.copy()

# use lambda to format the percentages
# we use the f-string syntax to format the number as a string with 2 decimal places
data["black_pct_pp"] = data.black_pct.apply(lambda x: f"{x*100:.02f}%")
data["black_pct_pp"]

0        28.70%
1        18.80%
2         9.60%
3        29.10%
4        18.50%
          ...  
11115    22.79%
11116    25.05%
11117    26.84%
11118    27.26%
11119    25.72%
Name: black_pct_pp, Length: 11120, dtype: object
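
The sketch below, on a made-up data frame, shows two things from this cell: a lambda behaves the same as the equivalent named function, and changes made to a `copy()` do not show up in the original. The names `df_toy` and `fmt_pct` are hypothetical, introduced only for this example.

```python
import pandas as pd

# toy data: fractions made up for illustration
df_toy = pd.DataFrame({"black_pct": [0.287, 0.188, 0.096]})

# a named function...
def fmt_pct(x):
    return f"{x*100:.02f}%"

# ...and the same logic written inline as a lambda
data = df_toy.copy()
data["black_pct_pp"] = data.black_pct.apply(lambda x: f"{x*100:.02f}%")

print(data.black_pct_pp.tolist())  # ['28.70%', '18.80%', '9.60%']
print(data.black_pct_pp.tolist() == df_toy.black_pct.apply(fmt_pct).tolist())  # True

# the new column exists only on the copy, not on the original
print("black_pct_pp" in df_toy.columns)  # False
```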

In [6]:
# we can put the whole thing in a loop, too
# here we replace the original value with the formatted value
data = df.copy()
for c in data.columns:
    if c.endswith("_pct"):
        data[c] = data[c].apply(lambda x: f"{x*100:.02f}%")


data[["dbn", "asian_pct", "black_pct", "hispanic_pct", "white_pct"]].head()

Unnamed: 0,dbn,asian_pct,black_pct,hispanic_pct,white_pct
0,01M015,7.90%,28.70%,59.00%,2.20%
1,01M019,8.90%,18.80%,66.40%,5.50%
2,01M020,32.40%,9.60%,48.90%,5.00%
3,01M034,5.40%,29.10%,61.70%,3.10%
4,01M063,4.00%,18.50%,64.50%,10.50%
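
One consequence of replacing the original values is worth spelling out: after the loop, the `_pct` columns hold strings, so they can no longer be used in calculations. The sketch below uses a made-up data frame (the `dbn` codes and fractions are illustrative) to show the same loop and why working on a copy keeps the numeric originals available.

```python
import pandas as pd

# toy data: made-up school codes and fractions
toy = pd.DataFrame({
    "dbn": ["00X000", "00X001"],
    "asian_pct": [0.079, 0.324],
    "black_pct": [0.287, 0.096],
})

# pick out the columns to format by their name suffix
pct_cols = [c for c in toy.columns if c.endswith("_pct")]
print(pct_cols)  # ['asian_pct', 'black_pct']

# format on a copy so the numeric originals stay intact
formatted = toy.copy()
for c in pct_cols:
    formatted[c] = formatted[c].apply(lambda x: f"{x*100:.02f}%")

print(formatted.asian_pct.tolist())  # ['7.90%', '32.40%']

# the formatted columns now contain strings, not numbers
print(all(isinstance(v, str) for v in formatted.asian_pct))  # True

# the original data frame is still numeric and usable for calculations
print(toy.asian_pct.max())  # 0.324
```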
