# **NYU Wagner - Python Coding for Public Policy**

# Class 2: Manipulating/combining DataFrames and writing functions

# LECTURE

## Feeling overwhelmed?

Reminder: learning to code is like learning a spoken language. It's not obvious, and people pick up different parts at different speeds. Try:

- Taking notes in [the lecture notebooks](https://padmgp-4506-spring.rcnyu.org/user-redirect/tree/class_materials/)
- Using [another Python/pandas learning resource](https://github.com/afeld/python-public-policy#resources)
   - Hear things explained another way
- [Comment-driven development](https://www.sitepoint.com/comment-driven-development/)
   - Otherwise, you're trying to do two steps in your head at once:
      1. Figuring out the logic
      1. Figuring out the syntax

## Operator precedence

a.k.a. order of operations, like [PEMDAS](https://en.wikipedia.org/wiki/Order_of_operations#Mnemonics) from math

```python
answer = "No"

answer == "Yes" or "yes"
```

What will this evaluate to?

[Evaluation order](https://docs.python.org/3/reference/expressions.html#evaluation-order) and [operator precedence](https://docs.python.org/3/reference/expressions.html#operator-precedence)

```python
answer = "No"


result =  answer == "Yes"  or "yes"
#           ↓
result =   "No"  == "Yes"  or "yes"
```

```python
#                 ↓
result = ( "No"  == "Yes") or "yes"
# `==` has higher precedence than `or`
```

```python
#                 ↓
result =        False      or "yes"
```

```python
#                          ↓
result =                 "yes"
```

In [1]:
answer = "No"

result = answer == "Yes" or "yes"
result

'yes'

**Takeaway:** Code is better when readable. Use parentheses so the reader doesn't have to think!
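
As a minimal sketch of that takeaway, the original expression is compared below with two rewrites that state the intent explicitly. The variable names (`looks_like_yes`, `is_yes`, `is_yes_any_case`) are made up for illustration.

```python
answer = "No"

# Original: parsed as (answer == "Yes") or "yes", so the non-empty string
# "yes" is returned whenever the comparison is False.
looks_like_yes = answer == "Yes" or "yes"

# Probably what was meant: check the value against each accepted spelling.
is_yes = answer in ("Yes", "yes")

# Alternative: normalize the case first, then compare once.
is_yes_any_case = answer.lower() == "yes"

print(looks_like_yes)   # yes
print(is_yes)           # False
print(is_yes_any_case)  # False
```

Because `"yes"` is a non-empty string, the original expression is always truthy, which is rarely what the author of such a line intends.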

## **Today's goal**: Which Community Districts have the most 311 requests? Why might that be?

### What's a Community District?

- 59 local governance districts each run by an appointed [Community Board](https://en.wikipedia.org/wiki/Community_boards_of_New_York_City)
- Community boards advise on land use and zoning, participate in the city budget process, and address service delivery in their district.
- Community boards are each composed of up to 50 volunteer members appointed by the local borough president, half from nominations by the local City Council members.

![Map of community districts from Wikipedia](https://upload.wikimedia.org/wikipedia/commons/4/41/New_York_City_community_districts.svg)

## Setup

In [2]:
import pandas as pd

In [3]:
# Display more rows and columns in the DataFrames
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

## Read our cleaned 311 Service Requests dataset

In [5]:
url = 'https://storage.googleapis.com/python-public-policy/data/311_requests_2018-19_sample_clean.csv.zip'
requests = pd.read_csv(url)


## Dealing with dtypes

More data cleaning!

![Minion character vacuuming](https://impulsecreative.com/hs-fs/hubfs/cleaning-minion-gif.gif?width=490&name=cleaning-minion-gif.gif)

```
DtypeWarning: Columns (8,17,20,31,34) have mixed types.
```

In [6]:
requests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499958 entries, 0 to 499957
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      499958 non-null  int64  
 1   Created Date                    499958 non-null  object 
 2   Closed Date                     476140 non-null  object 
 3   Agency                          499958 non-null  object 
 4   Agency Name                     499958 non-null  object 
 5   Complaint Type                  499958 non-null  object 
 6   Descriptor                      492496 non-null  object 
 7   Location Type                   392573 non-null  object 
 8   Incident Zip                    480394 non-null  object 
 9   Incident Address                434529 non-null  object 
 10  Street Name                     434504 non-null  object 
 11  Cross Street 1                  300825 non-null  object 
 12  Cross Street 2  

In [7]:
requests["Incident Zip"].unique().tolist()

['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 nan,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10

ZIP codes _look_ numeric, but aren't really.

[Read the ZIP codes in as strings.](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-data-types)

In [8]:
df2 = pd.read_csv(url, dtype={"Incident Zip": "string"})


We fixed the dtype warning for column 8 (`Incident Zip`).
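
A small illustration of why the string dtype matters for ZIP codes, beyond silencing the warning. The two-row CSV and the names `csv_text`, `as_numbers`, and `as_strings` are hypothetical; only `pd.read_csv()` and its `dtype` parameter come from the lecture.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the first ZIP code has a leading zero.
csv_text = "Incident Zip\n07114\n10014\n"

# Default behavior: pandas infers integers and the leading zero is lost.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype: values are kept exactly as written in the file.
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"Incident Zip": "string"})

print(as_numbers["Incident Zip"].tolist())  # [7114, 10014]
print(as_strings["Incident Zip"].tolist())  # ['07114', '10014']
```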

In [9]:
list(df2["Incident Zip"].unique())

['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 <NA>,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '1

### Find invalid ZIP codes

Use a [regular expression (regex)](https://regexone.com/) to [find strings that match a pattern](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#testing-for-strings-that-match-or-contain-a-pattern):

```
^\d{5}(?:-\d{4})?$
│ │ │  │        │└─ end of string
│ │ │  │        └─ optional
│ │ │  └─ non-capturing group
│ │ └─ count
│ └─ numeric/digit character
└─ start of string
```
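
To get a feel for the pattern before applying it to a whole column, it can be tried on a few strings with Python's built-in `re` module. This is only a sketch: the sample values and the names `ZIP_PATTERN`, `samples`, and `results` are chosen for illustration.

```python
import re

ZIP_PATTERN = re.compile(r"^\d{5}(?:-\d{4})?$")

# Two valid forms (5 digits, and ZIP+4), then three that should be rejected.
samples = ["10014", "10014-1234", "1801", "NJ 07114", "14614-195"]

# Map each sample to whether the whole string fits the pattern.
results = {value: ZIP_PATTERN.match(value) is not None for value in samples}

for value, is_valid in results.items():
    print(f"{value!r}: {is_valid}")
```

Only the first two samples should come out as valid; the others are too short, contain letters, or have an incomplete `-dddd` suffix.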

In [10]:
valid_zips = df2["Incident Zip"].str.contains(r'^\d{5}(?:-\d{4})?$')
invalid_zips = valid_zips == False
df2[invalid_zips]["Incident Zip"]

55017     HARRISBURG
58100         N5X3A6
80798         100000
120304           IDK
123304          1801
173518     14614-195
192034        979113
201463           100
207158          8682
216745        000000
325071      NJ 07114
425985          1101
441166         DID N
Name: Incident Zip, dtype: string

Clear any invalid ZIP codes:

In [11]:
import numpy as np

df2.loc[invalid_zips, "Incident Zip"] = np.nan

General data cleaning tips:

- Hard part is finding what needs to be done
- Will be specific to your use case
- Document what you did, since it will affect your results

## View the contents of the `Community Board` column in our 311 data

In [12]:
requests["Community Board"].unique()

array(['15 BROOKLYN', '03 BROOKLYN', '14 QUEENS', '10 BRONX', '08 QUEENS',
       '02 BRONX', '01 QUEENS', '11 QUEENS', '02 MANHATTAN',
       '18 BROOKLYN', '12 QUEENS', '01 STATEN ISLAND', '12 BRONX',
       '05 BROOKLYN', '01 BRONX', '09 QUEENS', '04 BROOKLYN',
       '10 BROOKLYN', '02 STATEN ISLAND', '05 QUEENS', '04 MANHATTAN',
       '11 BRONX', 'Unspecified BROOKLYN', '09 BRONX', '12 MANHATTAN',
       '09 BROOKLYN', '14 BROOKLYN', '06 MANHATTAN', '10 MANHATTAN',
       'Unspecified QUEENS', '01 MANHATTAN', '03 MANHATTAN', '05 BRONX',
       '08 BROOKLYN', '02 QUEENS', '12 BROOKLYN', '01 BROOKLYN',
       '16 BROOKLYN', '13 BROOKLYN', '06 QUEENS', '07 MANHATTAN',
       '11 BROOKLYN', 'Unspecified BRONX', '08 MANHATTAN',
       '03 STATEN ISLAND', '06 BROOKLYN', '03 BRONX', '05 MANHATTAN',
       '07 QUEENS', '13 QUEENS', '17 BROOKLYN', '06 BRONX', '02 BROOKLYN',
       '10 QUEENS', 'Unspecified MANHATTAN', '03 QUEENS', '04 BRONX',
       '11 MANHATTAN', '08 BRONX', '07 BROOKLY

## Get the count of 311 requests per Community District

In [13]:
cb_counts = requests.groupby('Community Board').size().reset_index(name='num_311_requests')
cb_counts = cb_counts.sort_values('num_311_requests', ascending=False)
cb_counts

| | Community Board | num_311_requests |
|---|---|---|
| 50 | 12 MANHATTAN | 14110 |
| 23 | 05 QUEENS | 12487 |
| 51 | 12 QUEENS | 12228 |
| 2 | 01 BROOKLYN | 11863 |
| 12 | 03 BROOKLYN | 11615 |
| 5 | 01 STATEN ISLAND | 11438 |
| 31 | 07 QUEENS | 11210 |
| 21 | 05 BROOKLYN | 10862 |
| 16 | 04 BRONX | 10628 |
| 4 | 01 QUEENS | 10410 |
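
The same chain of steps can be followed on a tiny table, which makes each step easier to see. The data and the `toy_` names below are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature version of the requests table: one row per request.
toy_requests = pd.DataFrame(
    {
        "Community Board": [
            "12 MANHATTAN",
            "05 QUEENS",
            "12 MANHATTAN",
            "12 MANHATTAN",
            "05 QUEENS",
            "01 BRONX",
        ]
    }
)

toy_counts = (
    toy_requests.groupby("Community Board")
    .size()  # number of rows in each group
    .reset_index(name="num_311_requests")  # Series -> DataFrame with a named count column
    .sort_values("num_311_requests", ascending=False)
)
print(toy_counts)
```

The index labels are assigned by `reset_index()` before sorting, which is why the leftmost numbers in the lecture's output look shuffled after `sort_values()`.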


## **Research Question:** What may account for the variance in count of requests per community district?

## **Hypothesis:** Population size may help explain the variance.

We can combine the counts per community district dataset with population data for each community district.

We'll use [pandas' `.merge()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging), comparable to:

- [SQL `JOIN`](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#join)
- [Spreadsheet `VLOOKUP`](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_spreadsheets.html#merging)

In general, called ["record linkage" or "entity resolution"](https://en.wikipedia.org/wiki/Record_linkage).

## Let's load the population dataset and check out its contents

[Data source for population by Community District](https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Community-Districts/xi7c-iiu2/data)

In [14]:
population = pd.read_csv('https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv')
population.head()

## In order to join the two dataframes, we need to create a common ID in each.

[`BORO CODE`](https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/pluto_datadictionary.pdf#page=38) (a.k.a. `BoroCode`, `borocd`, and `boro_cd`) is a commonly used unique ID for community districts. Let's write functions that create that unique ID in our datasets.

**BoroCD** is a 3-digit integer that captures the borough and district number. The borough is represented by the first digit. The district number is padded with zeros so it's always two digits long.

Boroughs are recoded into the following numbers:
- 1: Manhattan
- 2: Bronx
- 3: Brooklyn
- 4: Queens
- 5: Staten Island

Ex: 
- Manhattan 12 --> 112
- Brooklyn 6 --> 306
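
The encoding described above can be sketched in a few lines. The names `BOROUGH_DIGITS` and `make_borocd` are hypothetical, and this version returns an integer, while the lecture's own functions build the ID as a string.

```python
# Borough digits, as listed above.
BOROUGH_DIGITS = {
    "Manhattan": 1,
    "Bronx": 2,
    "Brooklyn": 3,
    "Queens": 4,
    "Staten Island": 5,
}


def make_borocd(borough, district_number):
    """Combine a borough name and a district number into a 3-digit BoroCD."""
    # Multiplying the borough digit by 100 leaves the last two digits
    # free for the (implicitly zero-padded) district number.
    return BOROUGH_DIGITS[borough] * 100 + district_number


print(make_borocd("Manhattan", 12))  # 112
print(make_borocd("Brooklyn", 6))    # 306
```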


### First, let's create a `borocd` column in the `cb_counts` DataFrame

In [15]:
cb_counts.head()

| | Community Board | num_311_requests |
|---|---|---|
| 50 | 12 MANHATTAN | 14110 |
| 23 | 05 QUEENS | 12487 |
| 51 | 12 QUEENS | 12228 |
| 2 | 01 BROOKLYN | 11863 |
| 12 | 03 BROOKLYN | 11615 |


[`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) can be used for transforming data with a custom function. How does it work?

```python
def my_function(row):
    # do stuff
    return some_value

new_values = requests.apply(my_function, axis=1)
```

Let's create a function called `recode_borocd_counts` that takes a `row` and converts the `Community Board` value into a `borocd` value.

In [16]:
def recode_borocd_counts(row):
  if 'MANHATTAN' in row["Community Board"]:
    return '1' + row["Community Board"][0:2]
    # [0:2] provides the first 2 characters, i.e. characters at indexes 0 and 1.
    # you could also use [:2] without the zero.
  elif 'BRONX' in row["Community Board"]:
    return '2' + row["Community Board"][0:2]
  elif 'BROOKLYN' in row["Community Board"]:
    return '3' + row["Community Board"][0:2]
  elif 'QUEENS' in row["Community Board"]:
    return '4' + row["Community Board"][0:2]
  elif 'STATEN ISLAND' in row["Community Board"]:
    return '5' + row["Community Board"][0:2]
  else:
    return 'Invalid BoroCD'

Let's test out that function in isolation. We'll grab one of the rows and pass it into the function.

In [17]:
sample_row = cb_counts.iloc[0]
sample_row

Community Board     12 MANHATTAN
num_311_requests           14110
Name: 50, dtype: object

In [18]:
recode_borocd_counts(sample_row)

'112'

Now we use `apply()` to do that across _all_ the rows.

In [None]:
cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)

- `apply()` (the way we're using it) takes a function and runs it against each row of a DataFrame, returning the results as a Series
- `axis=1` specifies that you want to apply the function across the rows instead of columns
- `cb_counts['borocd'] = …` creates a new column in the DataFrame called `borocd`
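
To see those three pieces working together outside the 311 data, here is a minimal sketch on a two-row DataFrame. The table `toy` and the function `describe` are made up for illustration.

```python
import pandas as pd

# Hypothetical data, just to show the mechanics of apply().
toy = pd.DataFrame({"borough": ["Queens", "Bronx"], "district": [5, 1]})


def describe(row):
    # With axis=1, `row` is a Series holding one row, indexed by column name.
    return f"{row['borough']} CD {row['district']}"


# Run describe() once per row and store the results in a new column.
toy["label"] = toy.apply(describe, axis=1)
print(toy["label"].tolist())  # ['Queens CD 5', 'Bronx CD 1']
```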

In [None]:
cb_counts

Uh oh, there are some unexpected `Unspecified` values in here - how can we get around them?

Let's only recode records that don't start with "U".

In [None]:
def recode_borocd_counts(row):
  if 'MANHATTAN' in row["Community Board"] and row["Community Board"][0] != 'U':
    return '1' + row["Community Board"][:2]
  elif 'BRONX' in row["Community Board"] and row["Community Board"][0] != 'U':
    return '2' + row["Community Board"][:2]
  elif 'BROOKLYN' in row["Community Board"] and row["Community Board"][0] != 'U':
    return '3' + row["Community Board"][:2]
  elif 'QUEENS' in row["Community Board"] and row["Community Board"][0] != 'U':
    return '4' + row["Community Board"][:2]
  elif 'STATEN ISLAND' in row["Community Board"] and row["Community Board"][0] != 'U':
    return '5' + row["Community Board"][:2]
  else:
    return 'Invalid BoroCD'

cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

We can make this function easier to read by isolating the logic that applies to all the conditions. This is called "refactoring".

In [None]:
def recode_borocd_counts(row):
    board = row["Community Board"]

    if board[0] != 'U':
        num = board[0:2]

        if 'MANHATTAN' in board:
            return '1' + num
        elif 'BRONX' in board:
            return '2' + num
        elif 'BROOKLYN' in board:
            return '3' + num
        elif 'QUEENS' in board:
            return '4' + num
        elif 'STATEN ISLAND' in board:
            return '5' + num

    # Reached for "Unspecified" boards and for any value that matches no borough.
    # Without this final return, an unmatched borough would silently return None.
    return 'Invalid BoroCD'

In [None]:
cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

### Next, let's create the `borocd` column in the population dataset

In [None]:
population.head()

In [None]:
population.info()

Create a function `recode_borocd_pop` that combines and recodes the Borough and CD Number values to create a BoroCD unique ID.

In [None]:
def recode_borocd_pop(row):
  if row.Borough == 'Manhattan':
    return str(100 + row['CD Number'])
  elif row.Borough == 'Bronx':
    return str(200 + row['CD Number'])
  elif row.Borough == 'Brooklyn':
    return str(300 + row['CD Number'])
  elif row.Borough == 'Queens':
    return str(400 + row['CD Number'])
  elif row.Borough == 'Staten Island':
    return str(500 + row['CD Number'])
  else:
    return 'Invalid BoroCD'

In [None]:
population['borocd'] = population.apply(recode_borocd_pop, axis=1)
population

## Join the population data onto the counts data after creating shared `borocd` unique ID

To join dataframes together, we will use the [pandas `.merge()` function](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html#join-tables-using-a-common-identifier).

![merge diagram](https://pandas.pydata.org/pandas-docs/stable/_images/08_merge_left.svg)

In [None]:
merged_data = pd.merge(left=cb_counts, right=population, left_on='borocd', right_on='borocd')
merged_data

[Different types of merges](https://pandas.pydata.org/docs/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra)
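
A minimal sketch of how two of those merge types differ, using made-up miniature tables (the `toy_` names and population figures are hypothetical). When `how` is not passed, as in the `pd.merge()` call above, pandas performs an inner merge, so rows labeled `Invalid BoroCD` drop out because they have no match in the population data.

```python
import pandas as pd

# Hypothetical miniature tables; the population figures are made up.
toy_counts = pd.DataFrame(
    {"borocd": ["112", "405", "Invalid BoroCD"], "num_311_requests": [3, 2, 1]}
)
toy_population = pd.DataFrame(
    {"borocd": ["112", "405", "501"], "population": [1000, 2000, 3000]}
)

# Inner merge (the default): keep only keys found in both tables.
inner = pd.merge(toy_counts, toy_population, on="borocd")

# Left merge: keep every row of the left table; unmatched rows get NaN.
# indicator=True adds a `_merge` column saying where each row came from.
left = pd.merge(toy_counts, toy_population, on="borocd", how="left", indicator=True)

print(inner)
print(left)
```

Checking the `_merge` column after a left merge is one way to spot rows that failed to match before they silently disappear from an inner merge.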

## Calculate 311 requests per capita

Divide the request count by the 2010 population to get requests per capita.

In [None]:
merged_data['request_per_capita'] = merged_data['num_311_requests'] / merged_data['2010 Population']

merged_data.head()

Let's create a simplified new DataFrame that only includes the columns we care about, in a better order.

In [None]:
columns = ['borocd', 'Borough', 'CD Name', '2010 Population', 'num_311_requests', 'request_per_capita']
cd_data = merged_data[columns]

cd_data

Let's check out which Community Districts have the most requests per capita:

In [None]:
cd_data.sort_values('request_per_capita', ascending=False).head(10)

While Inwood (112) had the highest number of complaints, it ranks further down on the list for requests per capita. Midtown may also be an outlier, based on its low residential population.

## Next class we'll produce charts and maps to better visualize the differences in magnitude of the 311 requests per capita values.

# [Homework 2](https://padmgp-4506-spring.rcnyu.org/user-redirect/notebooks/class_materials/hw_2.ipynb)

## Automated testing

We tested `recode_borocd_counts()` above by calling it with an arbitrary row and seeing if the result was what we expected. We can do the same with code!

Setup code:

In [19]:
import ipytest
ipytest.autoconfig()

In [21]:
assert 2 == 2

In [20]:
assert 1 == 2

AssertionError: assert 1 == 2

In [22]:
a = 2
assert a == 1

AssertionError: assert 2 == 1

In [26]:
%%ipytest -qq

def test_boroughs():
    boroughs = requests['Borough'].unique()
    assert len(boroughs) == 5

F                                                                                            [100%]
__________________________________________ test_boroughs ___________________________________________

    def test_boroughs():
        boroughs = df['Borough'].unique()
>       assert len(boroughs) == 5
E       AssertionError: assert 6 == 5
E        +  where 6 = len(array(['BROOKLYN', 'QUEENS', 'BRONX', 'MANHATTAN', 'STATEN ISLAND',\n       'Unspecified'], dtype=object))

/var/folders/kg/1ys0dccx4237f5wsd_w10dt80000gn/T/ipykernel_84679/521626110.py:3: AssertionError
FAILED tmpd2bma_nm.py::test_boroughs - AssertionError: assert 6 == 5
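
The manual check of `recode_borocd_counts()` from earlier could be written as a test along the same lines. This is a sketch under two assumptions: the function is repeated in compact form so the example runs on its own, and a plain dict stands in for a DataFrame row, which works because the function only looks up `row["Community Board"]`. The test name `test_recode_borocd_counts` is made up.

```python
def recode_borocd_counts(row):
    # Compact copy of the refactored function from the lecture,
    # with a single final return for everything that isn't recognized.
    board = row["Community Board"]
    if board[0] != "U":
        num = board[:2]
        if "MANHATTAN" in board:
            return "1" + num
        elif "BRONX" in board:
            return "2" + num
        elif "BROOKLYN" in board:
            return "3" + num
        elif "QUEENS" in board:
            return "4" + num
        elif "STATEN ISLAND" in board:
            return "5" + num
    return "Invalid BoroCD"


def test_recode_borocd_counts():
    # Each assert encodes one expectation we previously checked by eye.
    assert recode_borocd_counts({"Community Board": "12 MANHATTAN"}) == "112"
    assert recode_borocd_counts({"Community Board": "06 BROOKLYN"}) == "306"
    assert recode_borocd_counts({"Community Board": "Unspecified QUEENS"}) == "Invalid BoroCD"


# In a notebook, a cell starting with `%%ipytest` would collect and run the test.
# Calling it directly works anywhere and raises AssertionError on failure.
test_recode_borocd_counts()
```

If the function's logic is changed later, rerunning the test shows right away whether the known examples still produce the expected IDs.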
