<a href="https://colab.research.google.com/github/afeld/python-public-policy/blob/main/lecture_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NYU Wagner - Python Coding for Public Policy**

# Class 2: Manipulating/combining DataFrames and writing functions

# LECTURE

## Operator precedence

a.k.a. order of operations

```python
answer = "No"

answer == "Yes" or "yes"
```

What will this evaluate to?

[Evaluation order](https://docs.python.org/3/reference/expressions.html#evaluation-order) and [operator precedence](https://docs.python.org/3/reference/expressions.html#operator-precedence)

Does Python follow [PEMDAS](https://en.wikipedia.org/wiki/Order_of_operations#Mnemonics)?

```python
answer = "No"


result =  answer == "Yes"  or "yes"
#           ↓
result =   "No"  == "Yes"  or "yes"
```

```python
#                 ↓
result = ( "No"  == "Yes") or "yes"
# `==` has higher precedence than `or`
```

```python
#                 ↓
result =        False      or "yes"
```

```python
#                          ↓
result =                 "yes"
```

In [1]:
answer = "No"

answer == "Yes" or "yes"

'yes'
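
To get the comparison that was probably intended, each value has to be compared against `answer` explicitly. Here is a minimal sketch of a few ways to write it; the variable names (`buggy`, `fixed_or`, etc.) are made up for illustration.

```python
answer = "No"

# Parsed as (answer == "Yes") or "yes", so this is the string "yes" (truthy!)
buggy = answer == "Yes" or "yes"

# Compare against each value explicitly
fixed_or = answer == "Yes" or answer == "yes"

# Or check membership in a tuple of accepted values
fixed_in = answer in ("Yes", "yes")

# Or normalize the case first
fixed_lower = answer.lower() == "yes"

print(buggy, fixed_or, fixed_in, fixed_lower)
```

Only the first version treats `"No"` as a yes, since any non-empty string counts as true in an `if` statement.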

## **Today's goal**: Which Community Districts have the most 311 requests? Why might that be?

### What's a Community District?

- 59 local governance districts each run by an appointed Community Board
- Community boards advise on land use and zoning, participate in the city budget process, and address service delivery in their district.
- Community boards are each composed of up to 50 volunteer members appointed by the local borough president, half from nominations by the local City Council members.

[More info](https://en.wikipedia.org/wiki/Community_boards_of_New_York_City)

![Map of community districts from Wikipedia](https://upload.wikimedia.org/wikipedia/commons/4/41/New_York_City_community_districts.svg)

## Setup

In [2]:
import pandas as pd

In [3]:
# Display more rows and columns in the DataFrames
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

## Read our cleaned 311 Service Requests dataset

In [4]:
url = 'https://storage.googleapis.com/python-public-policy/data/311_Service_Requests_2018-19_clean.csv.zip'
df = pd.read_csv(url)


## Dealing with dtypes

More data cleaning!

![Minion character vacuuming](https://impulsecreative.com/hs-fs/hubfs/cleaning-minion-gif.gif?width=490&name=cleaning-minion-gif.gif)

```
DtypeWarning: Columns (9,18,21,32,35) have mixed types.
```

In [5]:
df["Incident Zip"].unique().tolist()

['10029',
 '10031',
 '11216',
 '10010',
 '11413',
 '11211',
 '10033',
 '10022',
 '11213',
 '10040',
 nan,
 '10456',
 '11235',
 '11378',
 '11221',
 '11693',
 '10457',
 '10474',
 '10460',
 '11232',
 '11219',
 '10032',
 '11106',
 '10465',
 '11217',
 '11432',
 '11367',
 '11206',
 '10023',
 '11249',
 '11215',
 '11373',
 '10467',
 '10459',
 '10036',
 '10016',
 '10035',
 '11223',
 '10003',
 '11101',
 '10025',
 '10128',
 '11362',
 '10014',
 '10461',
 '11234',
 '11436',
 '11207',
 '10039',
 '10024',
 '11358',
 '11368',
 '10314',
 '11214',
 '10305',
 '10011',
 '11417',
 '11435',
 '11230',
 '10002',
 '11208',
 '10451',
 '10462',
 '11426',
 '10468',
 '11419',
 '11238',
 '11434',
 '10304',
 '11220',
 '11385',
 '11237',
 '10001',
 '10452',
 '10009',
 '10469',
 '11433',
 '10472',
 '11228',
 '10463',
 '10026',
 '11225',
 '11421',
 '11212',
 '10306',
 '10027',
 '11204',
 '10458',
 '10470',
 '11694',
 '10466',
 '10019',
 '11429',
 '11428',
 '11226',
 '10453',
 '11369',
 '10454',
 '10473',
 '11004',
 '10

ZIP codes _look_ numeric, but aren't really.

Read the ZIP codes in as strings.

In [6]:
df2 = pd.read_csv(url, dtype={"Incident Zip": object})


We fixed the dtype warning for column 9 (`Incident Zip`).
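
To see why the dtype matters, here is a small sketch using a made-up two-row CSV (not the real 311 data). When pandas guesses the type, a ZIP code with a leading zero becomes an integer and loses that zero; passing `dtype` keeps it intact.

```python
import io

import pandas as pd

# Hypothetical sample data, just for illustration
csv_text = "Incident Zip\n07114\n10029\n"

# Let pandas guess the dtype: the values become integers
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Tell pandas to keep the column as strings
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"Incident Zip": object})

print(as_numbers["Incident Zip"].tolist())  # leading zero is gone
print(as_strings["Incident Zip"].tolist())  # leading zero is preserved
```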

In [7]:
df2["Incident Zip"].unique().tolist()

['10029',
 '10031',
 '11216',
 '10010',
 '11413',
 '11211',
 '10033',
 '10022',
 '11213',
 '10040',
 nan,
 '10456',
 '11235',
 '11378',
 '11221',
 '11693',
 '10457',
 '10474',
 '10460',
 '11232',
 '11219',
 '10032',
 '11106',
 '10465',
 '11217',
 '11432',
 '11367',
 '11206',
 '10023',
 '11249',
 '11215',
 '11373',
 '10467',
 '10459',
 '10036',
 '10016',
 '10035',
 '11223',
 '10003',
 '11101',
 '10025',
 '10128',
 '11362',
 '10014',
 '10461',
 '11234',
 '11436',
 '11207',
 '10039',
 '10024',
 '11358',
 '11368',
 '10314',
 '11214',
 '10305',
 '10011',
 '11417',
 '11435',
 '11230',
 '10002',
 '11208',
 '10451',
 '10462',
 '11426',
 '10468',
 '11419',
 '11238',
 '11434',
 '10304',
 '11220',
 '11385',
 '11237',
 '10001',
 '10452',
 '10009',
 '10469',
 '11433',
 '10472',
 '11228',
 '10463',
 '10026',
 '11225',
 '11421',
 '11212',
 '10306',
 '10027',
 '11204',
 '10458',
 '10470',
 '11694',
 '10466',
 '10019',
 '11429',
 '11428',
 '11226',
 '10453',
 '11369',
 '10454',
 '10473',
 '11004',
 '10

### Find invalid ZIP codes

Use a [regular expression (regex)](https://regexone.com/):

```
^\d{5}(?:-\d{4})?$
│ │ │  │        │└─ end of string
│ │ │  │        └─ optional
│ │ │  └─ non-capturing group
│ │ └─ count
│ └─ numeric/digit character
└─ start of string
```
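
It can help to try a pattern on a few values before running it on the whole column. This sketch uses Python's built-in `re` module; the sample values are picked by hand to resemble what shows up in the data, and `ZIP_PATTERN` and `results` are names made up for the example.

```python
import re

ZIP_PATTERN = re.compile(r'^\d{5}(?:-\d{4})?$')

samples = ["10029", "10029-1234", "0000", "NJ 07114", "14614-195"]

# True means the value looks like a 5-digit ZIP or a ZIP+4
results = {value: bool(ZIP_PATTERN.match(value)) for value in samples}

for value, is_valid in results.items():
    print(value, is_valid)
```

Only the first two samples match: the others are too short, have extra characters, or have an incomplete +4 suffix.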

In [8]:
invalid_zips = df2["Incident Zip"].str.contains(r'^\d{5}(?:-\d{4})?$') == False
df2[invalid_zips]["Incident Zip"]

70618            0000
96525       NOT GIVEN
146742        UNKNOWN
164830           1451
179244           1178
249718        NJ07114
315767     HARRISBURG
327505         NEWARK
333729         N5X3A6
345002           0000
390257           NONE
430639        UNKNOWN
445201      NOT KNOWN
463948         100000
522146              0
562129           ANON
569620           ????
641525              0
644336             00
689861            IDK
706912           1801
762810              0
786024           0765
822817       NJ 07310
842565           0000
912126       NJ 07114
912278         000000
994251      14614-195
1077000    JFK AIRPOR
1099747        979113
1153423           100
1156231           LE4
1186303          8682
1241526        000000
1266817             0
1355194        000000
1427637       UNKNOWN
1476022        921008
1633955         NJ070
1810106           UNK
1840968    98046 1548
1847360        000000
1861435      NJ 07114
1880071            W1
1913353     601488479
2141590   

Clear any invalid ZIP codes:

In [9]:
import numpy as np

df2.loc[invalid_zips, "Incident Zip"] = np.nan
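
The same pattern can be tried on a tiny DataFrame to check that it does what we expect. This is a sketch with made-up values; `zips` and `invalid` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one valid ZIP, one junk value, one missing, one too short
zips = pd.DataFrame({"Incident Zip": ["10029", "UNKNOWN", np.nan, "0000"]}, dtype=object)

# Boolean mask that is True where the value doesn't look like a ZIP code
invalid = zips["Incident Zip"].str.contains(r'^\d{5}(?:-\d{4})?$') == False

# .loc[rows, column] selects the flagged rows in that column and overwrites them
zips.loc[invalid, "Incident Zip"] = np.nan

print(zips)
```

Afterwards only the valid ZIP code is left; everything else is missing (`NaN`).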

General data cleaning tips:

- Hard part is finding what needs to be done
- Will be specific to your use case
- Document what you did, since it will affect your results

## View the contents of the `Community Board` column in our 311 data

In [10]:
df["Community Board"].unique()

array(['11 MANHATTAN', '09 MANHATTAN', '08 BROOKLYN', '05 MANHATTAN',
       '12 QUEENS', '01 BROOKLYN', '12 MANHATTAN', '06 MANHATTAN',
       '0 Unspecified', '04 BRONX', '15 BROOKLYN', '05 QUEENS',
       '03 BROOKLYN', '14 QUEENS', '02 BRONX', '05 BRONX', '06 BRONX',
       '12 BROOKLYN', '11 BROOKLYN', '01 QUEENS', '10 BRONX',
       '06 BROOKLYN', '08 QUEENS', '07 MANHATTAN', '07 BROOKLYN',
       '04 QUEENS', '07 BRONX', '03 MANHATTAN', '03 BRONX',
       '08 MANHATTAN', '11 QUEENS', '02 MANHATTAN', '11 BRONX',
       '18 BROOKLYN', '05 BROOKLYN', '10 MANHATTAN', '02 STATEN ISLAND',
       '10 BROOKLYN', '01 STATEN ISLAND', '04 MANHATTAN', '10 QUEENS',
       '12 BRONX', '01 BRONX', '09 BRONX', '13 QUEENS', '08 BRONX',
       '09 QUEENS', 'Unspecified BROOKLYN', '04 BROOKLYN',
       'Unspecified MANHATTAN', 'Unspecified BRONX', '09 BROOKLYN',
       '16 BROOKLYN', '03 STATEN ISLAND', '14 BROOKLYN', '03 QUEENS',
       '07 QUEENS', '02 BROOKLYN', '06 QUEENS', 'Unspecified QUEENS

## Get the count of 311 requests per Community District

In [11]:
cb_counts = df.groupby('Community Board').size().reset_index(name='count_of_311_requests')
cb_counts = cb_counts.sort_values('count_of_311_requests', ascending=False)
cb_counts

Unnamed: 0,Community Board,count_of_311_requests
50,12 MANHATTAN,81402
23,05 QUEENS,71506
51,12 QUEENS,70361
2,01 BROOKLYN,68101
12,03 BROOKLYN,66360
5,01 STATEN ISLAND,65145
31,07 QUEENS,63634
21,05 BROOKLYN,61836
16,04 BRONX,61086
4,01 QUEENS,60425
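
To see what each step of that chain does, here is a sketch on a made-up six-row DataFrame (`toy` and `counts` are hypothetical names). `groupby(...).size()` counts the rows in each group, `reset_index(name=...)` turns the result back into a DataFrame with a named count column, and `sort_values()` puts the biggest counts first.

```python
import pandas as pd

# Hypothetical sample: one row per 311 request
toy = pd.DataFrame({
    "Community Board": [
        "12 MANHATTAN", "05 QUEENS", "12 MANHATTAN",
        "12 MANHATTAN", "05 QUEENS", "01 BROOKLYN",
    ]
})

# Count rows per group, then convert the result back to a DataFrame
counts = toy.groupby("Community Board").size().reset_index(name="count_of_311_requests")

# Largest counts first
counts = counts.sort_values("count_of_311_requests", ascending=False)

print(counts)
```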


## **Research Question:** What may account for the variance in count of requests per community district?

## **Hypothesis:** Population size may help explain the variance.

We can combine the counts per community district dataset with population data for each community district.

## Let's load the population dataset and check out its contents

[Data source for population by Community District](https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Community-Districts/xi7c-iiu2/data)

In [12]:
population = pd.read_csv('https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv')
population.head()

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200


## In order to join the two dataframes, we need to create a common ID in each.

[`BORO CODE`](https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/pluto_datadictionary.pdf#page=38) (a.k.a. `BoroCode`, `borocd`, and `boro_cd`) is a commonly used unique ID for community districts. Let's write functions that create that unique ID in each of our datasets.

**BoroCD** is a 3-digit code that captures the borough and district number. The borough is represented by the first digit. The district number is padded with zeros so it's always two digits long.

Boroughs are recoded into the following numbers:
- 1: Manhattan
- 2: Bronx
- 3: Brooklyn
- 4: Queens
- 5: Staten Island

Ex: 
- Manhattan 12 --> 112
- Brooklyn 6 --> 306
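
The rule above can be sketched as a small helper. `BOROUGH_CODES` and `make_borocd` are names made up for this illustration; they aren't part of any library. The `:02d` in the f-string pads the district number with a zero when it's only one digit.

```python
# Borough name -> first digit of the BoroCD
BOROUGH_CODES = {
    "Manhattan": 1,
    "Bronx": 2,
    "Brooklyn": 3,
    "Queens": 4,
    "Staten Island": 5,
}


def make_borocd(borough, district_number):
    """Combine a borough name and district number into a 3-character BoroCD."""
    return f"{BOROUGH_CODES[borough]}{district_number:02d}"


print(make_borocd("Manhattan", 12))
print(make_borocd("Brooklyn", 6))
```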


### First, let's create a `borocd` column in `cb_counts` dataframe

In [13]:
cb_counts.head()

Unnamed: 0,Community Board,count_of_311_requests
50,12 MANHATTAN,81402
23,05 QUEENS,71506
51,12 QUEENS,70361
2,01 BROOKLYN,68101
12,03 BROOKLYN,66360


Let's create a function called `recode_borocd_counts` that converts the `Community Board` value into a `borocd` value.

In [14]:
def recode_borocd_counts(row):
  if 'MANHATTAN' in row["Community Board"]:
    return '1' + row["Community Board"][0:2]
    # [0:2] provides the first 2 characters, i.e. characters at indexes 0 and 1.
    # you could also use [:2] without the zero.
  elif 'BRONX' in row["Community Board"]:
    return '2' + row["Community Board"][0:2]
  elif 'BROOKLYN' in row["Community Board"]:
    return '3' + row["Community Board"][0:2]
  elif 'QUEENS' in row["Community Board"]:
    return '4' + row["Community Board"][0:2]
  elif 'STATEN ISLAND' in row["Community Board"]:
    return '5' + row["Community Board"][0:2]
  else:
    return 'Invalid BoroCD'

As we can see, `def` creates functions:

In [15]:
type(recode_borocd_counts)

function

**Note:** The order of steps is important. You have to define a function before you can apply it to a dataframe.

In [16]:
cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)

- `axis=1` specifies that the function should be applied to each row (the default, `axis=0`, applies it to each column)
- `cb_counts['borocd'] = ...` stores the results in a column called `borocd`, creating it if it doesn't exist yet
- `cb_counts.apply(function, axis=1)` calls the function we defined once per row and collects the return values
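
Here is the same pattern on a tiny made-up DataFrame, as a sketch. The column names and the `combine` function are hypothetical; the point is that the function receives one row at a time and can read that row's columns by name.

```python
import pandas as pd

toy = pd.DataFrame({"borough_digit": ["1", "4"], "district": ["12", "05"]})


def combine(row):
    # `row` is a single row of the DataFrame; look up values by column name
    return row["borough_digit"] + row["district"]


# Call `combine` once per row and store the results in a new column
toy["borocd"] = toy.apply(combine, axis=1)

print(toy)
```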

In [17]:
cb_counts

Unnamed: 0,Community Board,count_of_311_requests,borocd
50,12 MANHATTAN,81402,112
23,05 QUEENS,71506,405
51,12 QUEENS,70361,412
2,01 BROOKLYN,68101,301
12,03 BROOKLYN,66360,303
5,01 STATEN ISLAND,65145,501
31,07 QUEENS,63634,407
21,05 BROOKLYN,61836,305
16,04 BRONX,61086,204
4,01 QUEENS,60425,401


Uh oh, there are some unexpected `Unspecified` values in here - how can we get around them?

Let's only recode records that don't start with "U".

In [18]:
def recode_borocd_counts(row):
  if 'MANHATTAN' in row["Community Board"] and row["Community Board"][0] != 'U':
      return '1' + row["Community Board"][:2]
  elif 'BRONX' in row["Community Board"] and row["Community Board"][0] != 'U':
      return '2' + row["Community Board"][:2]
  elif 'BROOKLYN' in row["Community Board"] and row["Community Board"][0] != 'U':
      return '3' + row["Community Board"][:2]
  elif 'QUEENS' in row["Community Board"] and row["Community Board"][0] != 'U':
      return '4' + row["Community Board"][:2]
  elif 'STATEN ISLAND' in row["Community Board"] and row["Community Board"][0] != 'U':
      return '5' + row["Community Board"][:2]
  else:
    return 'Invalid BoroCD'

cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

Unnamed: 0,Community Board,count_of_311_requests,borocd
50,12 MANHATTAN,81402,112
23,05 QUEENS,71506,405
51,12 QUEENS,70361,412
2,01 BROOKLYN,68101,301
12,03 BROOKLYN,66360,303
5,01 STATEN ISLAND,65145,501
31,07 QUEENS,63634,407
21,05 BROOKLYN,61836,305
16,04 BRONX,61086,204
4,01 QUEENS,60425,401


We can make this function easier to read by isolating the logic that applies to all the conditions. This is called "refactoring".

In [19]:
def recode_borocd_counts(row):
    board = row["Community Board"]

    if board[0] != 'U':
        num = board[0:2]
        
        if 'MANHATTAN' in board:
            return '1' + num
        elif 'BRONX' in board:
            return '2' + num
        elif 'BROOKLYN' in board:
            return '3' + num
        elif 'QUEENS' in board:
            return '4' + num
        elif 'STATEN ISLAND' in board:
            return '5' + num
    # Anything that reaches this point (e.g. "0 Unspecified" or "Unspecified QUEENS") is invalid
    return 'Invalid BoroCD'

In [20]:
cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

Unnamed: 0,Community Board,count_of_311_requests,borocd
50,12 MANHATTAN,81402,112
23,05 QUEENS,71506,405
51,12 QUEENS,70361,412
2,01 BROOKLYN,68101,301
12,03 BROOKLYN,66360,303
5,01 STATEN ISLAND,65145,501
31,07 QUEENS,63634,407
21,05 BROOKLYN,61836,305
16,04 BRONX,61086,204
4,01 QUEENS,60425,401
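
Refactoring can go further. As one possible sketch (not the only way to do it), the chain of `if`/`elif` checks could be swapped for a dictionary lookup. `BOROUGH_DIGITS` and `recode_borocd_lookup` are names made up for this example, and it assumes valid values always look like a two-digit number, a space, then the borough name.

```python
BOROUGH_DIGITS = {
    "MANHATTAN": "1",
    "BRONX": "2",
    "BROOKLYN": "3",
    "QUEENS": "4",
    "STATEN ISLAND": "5",
}


def recode_borocd_lookup(row):
    board = row["Community Board"]

    # Split at the first space: "01 STATEN ISLAND" -> "01" and "STATEN ISLAND"
    num, _, borough = board.partition(" ")

    if len(num) == 2 and num.isdigit() and borough in BOROUGH_DIGITS:
        return BOROUGH_DIGITS[borough] + num

    return 'Invalid BoroCD'


# Rows behave like dictionaries here, so plain dicts work for a quick check
print(recode_borocd_lookup({"Community Board": "01 STATEN ISLAND"}))
print(recode_borocd_lookup({"Community Board": "Unspecified BROOKLYN"}))
```

The function takes a row just like the earlier versions, so it could be passed to `.apply(..., axis=1)` in the same way.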


### Next, let's create the `borocd` column in the population dataset

In [21]:
population.head()

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200


Create a function `recode_borocd_pop` that combines and recodes the Borough and CD Number values to create a BoroCD unique ID.

In [22]:
def recode_borocd_pop(row):
  if row.Borough == 'Manhattan':
    return str(100 + row['CD Number'])
  elif row.Borough == 'Bronx':
    return str(200 + row['CD Number'])
  elif row.Borough == 'Brooklyn':
    return str(300 + row['CD Number'])
  elif row.Borough == 'Queens':
    return str(400 + row['CD Number'])
  elif row.Borough == 'Staten Island':
    return str(500 + row['CD Number'])
  else:
    return 'Invalid BoroCD'

In [23]:
population['borocd'] = population.apply(recode_borocd_pop, axis=1)
population

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population,borocd
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497,201
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246,202
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762,203
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441,204
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200,205
5,Bronx,6,"East Tremont, Belmont",114137,65016,68061,75688,83268,206
6,Bronx,7,"Bedford Park, Norwood, Fordham",113764,116827,128588,141411,139286,207
7,Bronx,8,"Riverdale, Kingsbridge, Marble Hill",103543,98275,97030,101332,101731,208
8,Bronx,9,"Soundview, Parkchester",166442,167627,155970,167859,172298,209
9,Bronx,10,"Throgs Nk., Co-op City, Pelham Bay",84948,106516,108093,115948,120392,210


## Join the population data onto the counts data after creating shared `borocd` unique ID

To join dataframes together, we will use the [pandas `.merge()` function](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html#join-tables-using-a-common-identifier).

![merge diagram](https://pandas.pydata.org/pandas-docs/stable/_images/08_merge_left.svg)

In [24]:
merged_data = pd.merge(left=cb_counts, right=population, left_on='borocd', right_on='borocd')
merged_data

Unnamed: 0,Community Board,count_of_311_requests,borocd,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,12 MANHATTAN,81402,112,Manhattan,12,"Washington Heights, Inwood",180561,179941,198192,208414,190020
1,05 QUEENS,71506,405,Queens,5,"Ridgewood, Glendale, Maspeth",161022,150142,149126,165911,169190
2,12 QUEENS,70361,412,Queens,12,"Jamaica, St. Albans, Hollis",206639,189383,201293,223602,225919
3,01 BROOKLYN,68101,301,Brooklyn,1,"Williamsburg, Greenpoint",179390,142942,155972,160338,173083
4,03 BROOKLYN,66360,303,Brooklyn,3,Bedford Stuyvesant,203380,133379,138696,143867,152985
5,01 STATEN ISLAND,65145,501,Staten Island,1,"Stapleton, Port Richmond",135875,138489,137806,162609,175756
6,07 QUEENS,63634,407,Queens,7,"Flushing, Bay Terrace",207589,204785,220508,242952,247354
7,05 BROOKLYN,61836,305,Brooklyn,5,"East New York, Starrett City",170791,154931,161350,173198,182896
8,04 BRONX,61086,204,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
9,01 QUEENS,60425,401,Queens,1,"Astoria, Long Island City",185925,185198,188549,211220,191105
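
Note that `pd.merge()` defaults to an inner join (`how='inner'`), so rows whose `borocd` has no match in the other DataFrame are dropped, including any `Invalid BoroCD` rows. The sketch below uses made-up miniature versions of the two DataFrames to compare that with a left join; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
counts = pd.DataFrame({
    "borocd": ["112", "405", "Invalid BoroCD"],
    "count_of_311_requests": [3, 2, 1],
})
pop = pd.DataFrame({
    "borocd": ["112", "405", "301"],
    "2010 Population": [190020, 169190, 173083],
})

# Default (inner) join: only rows with a borocd in both DataFrames survive
inner = pd.merge(left=counts, right=pop, on="borocd")

# Left join: keep every row of `counts`, and flag the ones without a match
left = pd.merge(left=counts, right=pop, on="borocd", how="left", indicator=True)

print(inner)
print(left)
```

Checking for `left_only` rows is a quick way to find out which records would silently disappear in an inner join.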


## Calculate 311 requests per capita

In [25]:
# divide request count by 2010 population to get requests per capita

merged_data['request_per_capita'] = merged_data.count_of_311_requests / merged_data['2010 Population']

merged_data.head()

Unnamed: 0,Community Board,count_of_311_requests,borocd,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population,request_per_capita
0,12 MANHATTAN,81402,112,Manhattan,12,"Washington Heights, Inwood",180561,179941,198192,208414,190020,0.428386
1,05 QUEENS,71506,405,Queens,5,"Ridgewood, Glendale, Maspeth",161022,150142,149126,165911,169190,0.422637
2,12 QUEENS,70361,412,Queens,12,"Jamaica, St. Albans, Hollis",206639,189383,201293,223602,225919,0.311443
3,01 BROOKLYN,68101,301,Brooklyn,1,"Williamsburg, Greenpoint",179390,142942,155972,160338,173083,0.393459
4,03 BROOKLYN,66360,303,Brooklyn,3,Bedford Stuyvesant,203380,133379,138696,143867,152985,0.433768
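
Dividing one column by another works row by row: each count is divided by the population on the same row. As a sketch, here are the first two rows from the table above rebuilt by hand (the `toy` name is made up).

```python
import pandas as pd

toy = pd.DataFrame({
    "count_of_311_requests": [81402, 71506],
    "2010 Population": [190020, 169190],
})

# Element-wise division: row 0 is 81402 / 190020, row 1 is 71506 / 169190
toy["request_per_capita"] = toy["count_of_311_requests"] / toy["2010 Population"]

print(toy.round(3))
```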


Let's create a simplified new dataframe that only includes the columns we care about, in a better order.

In [26]:
columns = ['borocd', 'Borough', 'CD Name', '2010 Population', 'count_of_311_requests', 'request_per_capita']
cd_data = merged_data[columns]

cd_data

Unnamed: 0,borocd,Borough,CD Name,2010 Population,count_of_311_requests,request_per_capita
0,112,Manhattan,"Washington Heights, Inwood",190020,81402,0.428386
1,405,Queens,"Ridgewood, Glendale, Maspeth",169190,71506,0.422637
2,412,Queens,"Jamaica, St. Albans, Hollis",225919,70361,0.311443
3,301,Brooklyn,"Williamsburg, Greenpoint",173083,68101,0.393459
4,303,Brooklyn,Bedford Stuyvesant,152985,66360,0.433768
5,501,Staten Island,"Stapleton, Port Richmond",175756,65145,0.370656
6,407,Queens,"Flushing, Bay Terrace",247354,63634,0.257259
7,305,Brooklyn,"East New York, Starrett City",182896,61836,0.338094
8,204,Bronx,"Highbridge, Concourse Village",146441,61086,0.417137
9,401,Queens,"Astoria, Long Island City",191105,60425,0.316187


Let's check out which Community Districts have the highest complaints per capita

In [27]:
cd_data.sort_values('request_per_capita', ascending=False).head(10)

Unnamed: 0,borocd,Borough,CD Name,2010 Population,count_of_311_requests,request_per_capita
40,105,Manhattan,Midtown Business District,51673,37466,0.72506
31,308,Brooklyn,Crown Heights North,96317,43872,0.455496
29,302,Brooklyn,"Brooklyn Heights, Fort Greene",99617,44061,0.442304
21,304,Brooklyn,Bushwick,112634,49552,0.439938
4,303,Brooklyn,Bedford Stuyvesant,152985,66360,0.433768
32,309,Brooklyn,"Crown Heights South, Wingate",98429,42655,0.433358
20,110,Manhattan,Central Harlem,115723,50024,0.432274
0,112,Manhattan,"Washington Heights, Inwood",190020,81402,0.428386
1,405,Queens,"Ridgewood, Glendale, Maspeth",169190,71506,0.422637
8,204,Bronx,"Highbridge, Concourse Village",146441,61086,0.417137


While Washington Heights/Inwood (112) had the highest number of complaints, it ranks further down the list for requests per capita. Midtown may also be an outlier, based on its low residential population.

## Next class we'll produce charts and maps to better visualize the differences in magnitude of the 311 requests per capita values.

# Homework 2

1. Open [the notebook](https://colab.research.google.com/github/afeld/python-public-policy/blob/main/hw_2.ipynb)
1. Save a copy in Drive