<a href="https://colab.research.google.com/github/afeld/nyu-python-public-policy/blob/master/lecture_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NYU Wagner - Python Coding for Public Policy**
# Class 2 (Nov 7): Manipulating/combining DataFrames and writing functions

# LECTURE

## **Today's goal**: Which Community Districts have the most 311 requests? Why might that be?

What's a Community District?
- 59 local governance districts each run by an appointed Community Board
- Community boards advise on land use and zoning, participate in the city budget process, and address service delivery in their district.
- Each community board has up to 50 volunteer members appointed by the local borough president, half of them from nominations by the local City Council members.

[More info](https://en.wikipedia.org/wiki/Community_boards_of_New_York_City)

![Map of community districts from Wikipedia](https://upload.wikimedia.org/wikipedia/commons/4/41/New_York_City_community_districts.svg)


## Start by importing necessary packages

In [None]:
import pandas as pd
from google.colab import drive

In [None]:
# You can use pd.set_option() to make sure you see all the rows and columns in your dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
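
Showing every row can flood the notebook when a dataframe is large. As a hedged alternative, `pd.option_context` changes a display option only inside a `with` block. The toy dataframe below is invented for illustration.

```python
import pandas as pd

# toy dataframe, invented for illustration
toy = pd.DataFrame({'n': range(100)})

# the option only applies inside the "with" block
with pd.option_context('display.max_rows', 10):
    rows_inside = pd.get_option('display.max_rows')
    print(toy)  # prints a shortened view of the 100 rows

# after the block, the previous setting is back in effect
rows_after = pd.get_option('display.max_rows')
```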

## Read and save our cleaned 311 Service Requests dataset as a pandas dataframe named "df"

In [None]:
drive.mount('/content/drive')
# follow the link it generates, choose your account, and then paste in the authorization code it provides

In [None]:
df = pd.read_csv('/content/drive/My Drive/Data for Python/cleaned_311_data.csv', header='infer')

## View the contents of the `community_board` column in our 311 data

In [None]:
set(df.community_board)

{'0 Unspecified',
 '01 BRONX',
 '01 BROOKLYN',
 '01 MANHATTAN',
 '01 QUEENS',
 '01 STATEN ISLAND',
 '02 BRONX',
 '02 BROOKLYN',
 '02 MANHATTAN',
 '02 QUEENS',
 '02 STATEN ISLAND',
 '03 BRONX',
 '03 BROOKLYN',
 '03 MANHATTAN',
 '03 QUEENS',
 '03 STATEN ISLAND',
 '04 BRONX',
 '04 BROOKLYN',
 '04 MANHATTAN',
 '04 QUEENS',
 '05 BRONX',
 '05 BROOKLYN',
 '05 MANHATTAN',
 '05 QUEENS',
 '06 BRONX',
 '06 BROOKLYN',
 '06 MANHATTAN',
 '06 QUEENS',
 '07 BRONX',
 '07 BROOKLYN',
 '07 MANHATTAN',
 '07 QUEENS',
 '08 BRONX',
 '08 BROOKLYN',
 '08 MANHATTAN',
 '08 QUEENS',
 '09 BRONX',
 '09 BROOKLYN',
 '09 MANHATTAN',
 '09 QUEENS',
 '10 BRONX',
 '10 BROOKLYN',
 '10 MANHATTAN',
 '10 QUEENS',
 '11 BRONX',
 '11 BROOKLYN',
 '11 MANHATTAN',
 '11 QUEENS',
 '12 BRONX',
 '12 BROOKLYN',
 '12 MANHATTAN',
 '12 QUEENS',
 '13 BROOKLYN',
 '13 QUEENS',
 '14 BROOKLYN',
 '14 QUEENS',
 '15 BROOKLYN',
 '16 BROOKLYN',
 '17 BROOKLYN',
 '18 BROOKLYN',
 '26 BRONX',
 '27 BRONX',
 '28 BRONX',
 '55 BROOKLYN',
 '56 BROOKLYN',
 '64

## Get the count of 311 requests per Community District

In [None]:
cb_counts = df.groupby('community_board').size().reset_index(name='count_of_311_requests').sort_values('count_of_311_requests', ascending=False)

In [None]:
cb_counts

Unnamed: 0,community_board,count_of_311_requests
50,12 MANHATTAN,81402
23,05 QUEENS,71506
51,12 QUEENS,70361
2,01 BROOKLYN,68101
12,03 BROOKLYN,66360
5,01 STATEN ISLAND,65145
31,07 QUEENS,63634
21,05 BROOKLYN,61836
16,04 BRONX,61086
4,01 QUEENS,60425
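
The chained call that builds `cb_counts` does four things in one line. Here is a minimal sketch that splits it into separate steps on a toy dataframe (the five rows are made up for illustration):

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({'community_board': ['01 BRONX', '02 BRONX', '01 BRONX', '01 QUEENS', '01 BRONX']})

grouped = toy.groupby('community_board')                    # one group per distinct value
sizes = grouped.size()                                      # Series: number of rows in each group
counts = sizes.reset_index(name='count_of_311_requests')    # back to a dataframe with a named count column
counts = counts.sort_values('count_of_311_requests', ascending=False)  # largest count first

print(counts)
```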


## **Research Question:** What may account for the variance in count of requests per community district?

## **Hypothesis:** Population size may help explain the variance. We can combine the counts per community district dataset with population data for each community district.

## Let's load the population dataset and check out its contents

In [None]:
# Data source for population by Community District: https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Community-Districts/xi7c-iiu2/data
population = pd.read_csv('/content/drive/My Drive/Data for Python/New_York_City_Population_By_Community_Districts.csv', header='infer')
population.head()

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200


## In order to join the two dataframes, we need to create a common ID in each. `borocd` is commonly used as a unique ID for community districts. Let's write functions that create that unique ID in our datasets.

**BoroCD** is a 3-digit code that captures the borough and district number (the functions below store it as a string). The borough is represented by the first digit. The district number is padded with zeros so it's always two digits long.

Boroughs are recoded into the following numbers:
- 1: Manhattan
- 2: Bronx
- 3: Brooklyn
- 4: Queens
- 5: Staten Island

Ex: 
- Manhattan 12 --> 112
- Brooklyn 6 --> 306
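
The rule above can be sketched in a few lines. `make_borocd` and `BOROUGH_DIGITS` are hypothetical names used only for this illustration; the lecture builds its own functions below.

```python
# borough digits from the list above
BOROUGH_DIGITS = {'Manhattan': 1, 'Bronx': 2, 'Brooklyn': 3, 'Queens': 4, 'Staten Island': 5}

def make_borocd(borough, district):
    # hypothetical helper: borough digit followed by the district number,
    # padded with a leading zero so it is always two characters long
    return str(BOROUGH_DIGITS[borough]) + str(district).zfill(2)

print(make_borocd('Manhattan', 12))
print(make_borocd('Brooklyn', 6))
```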


### First, let's create a `borocd` column in `cb_counts` dataframe

In [None]:
cb_counts.head()

Unnamed: 0,community_board,count_of_311_requests
50,12 MANHATTAN,81402
23,05 QUEENS,71506
51,12 QUEENS,70361
2,01 BROOKLYN,68101
12,03 BROOKLYN,66360


In [None]:
# let's create a function called recode_borocd_counts that converts the community_board value into a borocd value

def recode_borocd_counts(row):
  if 'MANHATTAN' in row.community_board:
    return '1' + row.community_board[0:2]
    # [0:2] provides the first 2 characters, i.e. characters at indexes 0 and 1.
    # you could also use [:2] without the zero.
  elif 'BRONX' in row.community_board:
    return '2' + row.community_board[0:2]
  elif 'BROOKLYN' in row.community_board:
    return '3' + row.community_board[0:2]
  elif 'QUEENS' in row.community_board:
    return '4' + row.community_board[0:2]
  elif 'STATEN ISLAND' in row.community_board:
    return '5' + row.community_board[0:2]
  else:
    return 'Invalid BoroCD'

In [None]:
type(recode_borocd_counts)

function

**Note:** The order of steps is important. You have to define a function before you can apply it to a dataframe.

In [None]:
cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1) # axis = 1 specifies that you want to apply the function across the rows instead of columns
# cb_counts['borocd'] creates a new column in the dataframe called borocd
# cb_counts.apply(function, axis) applies the function we defined across the specified axis

cb_counts

Unnamed: 0,community_board,count_of_311_requests,borocd
50,12 MANHATTAN,81402,112
23,05 QUEENS,71506,405
51,12 QUEENS,70361,412
2,01 BROOKLYN,68101,301
12,03 BROOKLYN,66360,303
5,01 STATEN ISLAND,65145,501
31,07 QUEENS,63634,407
21,05 BROOKLYN,61836,305
16,04 BRONX,61086,204
4,01 QUEENS,60425,401
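
It can help to see what the function receives when `axis=1` is used. In this small sketch (`describe_row` is a hypothetical function, and the two rows are copied from the table above), each call gets one row as a pandas Series, so a value can be read as `row.community_board` or `row['community_board']`.

```python
import pandas as pd

toy = pd.DataFrame({'community_board': ['12 MANHATTAN', '05 QUEENS'],
                    'count_of_311_requests': [81402, 71506]})

def describe_row(row):
    # with axis=1, "row" holds the values of a single row
    # dot access and bracket access read the same kind of value
    return row.community_board[:2] + ' / ' + str(row['count_of_311_requests'])

result = toy.apply(describe_row, axis=1)  # one return value per row
print(result)
```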


In [None]:
# uh oh, there are some unexpected "Unspecified" values in here - how can we get around them? let's only recode records that don't start with "U"
def recode_borocd_counts(row):
  if 'MANHATTAN' in row.community_board and row.community_board[0] != 'U':
    return '1' + row.community_board[:2]
  elif 'BRONX' in row.community_board and row.community_board[0] != 'U':
    return '2' + row.community_board[:2]
  elif 'BROOKLYN' in row.community_board and row.community_board[0] != 'U':
    return '3' + row.community_board[:2]
  elif 'QUEENS' in row.community_board and row.community_board[0] != 'U':
    return '4' + row.community_board[:2]
  elif 'STATEN ISLAND' in row.community_board and row.community_board[0] != 'U':
    return '5' + row.community_board[:2]
  else:
    return 'Invalid BoroCD'

cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

Unnamed: 0,community_board,count_of_311_requests,borocd
50,12 MANHATTAN,81402,112
23,05 QUEENS,71506,405
51,12 QUEENS,70361,412
2,01 BROOKLYN,68101,301
12,03 BROOKLYN,66360,303
5,01 STATEN ISLAND,65145,501
31,07 QUEENS,63634,407
21,05 BROOKLYN,61836,305
16,04 BRONX,61086,204
4,01 QUEENS,60425,401


In [None]:
# we can make this function more concise by isolating the logic that applies to all the conditions
def recode_borocd_counts(row):
  if row.community_board[0] != 'U':
    if 'MANHATTAN' in row.community_board:
      return '1' + row.community_board[:2]
    elif 'BRONX' in row.community_board:
      return '2' + row.community_board[:2]
    elif 'BROOKLYN' in row.community_board:
      return '3' + row.community_board[:2]
    elif 'QUEENS' in row.community_board:
      return '4' + row.community_board[:2]
    elif 'STATEN ISLAND' in row.community_board:
      return '5' + row.community_board[:2]
  # values that start with "U" or name no borough (e.g. '0 Unspecified') end up here
  return 'Invalid BoroCD'

cb_counts['borocd'] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

Unnamed: 0,community_board,count_of_311_requests,borocd
50,12 MANHATTAN,81402,112
23,05 QUEENS,71506,405
51,12 QUEENS,70361,412
2,01 BROOKLYN,68101,301
12,03 BROOKLYN,66360,303
5,01 STATEN ISLAND,65145,501
31,07 QUEENS,63634,407
21,05 BROOKLYN,61836,305
16,04 BRONX,61086,204
4,01 QUEENS,60425,401
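
Another way to shorten the function, shown here only as a sketch and not as the lecture's method, is to keep the borough digits in a dictionary and loop over it. `recode_borocd_lookup` and `BOROUGH_DIGITS` are hypothetical names, and the value `'Unspecified BRONX'` is a made-up test case.

```python
import pandas as pd

BOROUGH_DIGITS = {'MANHATTAN': '1', 'BRONX': '2', 'BROOKLYN': '3', 'QUEENS': '4', 'STATEN ISLAND': '5'}

def recode_borocd_lookup(row):
    # hypothetical alternative to the if/elif chain
    value = row.community_board
    if value[0] != 'U':
        for borough, digit in BOROUGH_DIGITS.items():
            if borough in value:
                return digit + value[:2]
    # starts with "U" or names no borough
    return 'Invalid BoroCD'

# toy rows; 'Unspecified BRONX' is invented to exercise the "U" check
toy = pd.DataFrame({'community_board': ['12 MANHATTAN', '01 STATEN ISLAND', '0 Unspecified', 'Unspecified BRONX']})
toy['borocd'] = toy.apply(recode_borocd_lookup, axis=1)
print(toy)
```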


### Next, let's create the `borocd` column in the population dataset

In [None]:
population.head()

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200


In [None]:
# Create a function recode_borocd_pop that combines and recodes the Borough and CD Number values to create a BoroCD unique ID

def recode_borocd_pop(row):
  if row.Borough == 'Manhattan':
    return str(100 + row['CD Number'])
  elif row.Borough == 'Bronx':
    return str(200 + row['CD Number'])
  elif row.Borough == 'Brooklyn':
    return str(300 + row['CD Number'])
  elif row.Borough == 'Queens':
    return str(400 + row['CD Number'])
  elif row.Borough == 'Staten Island':
    return str(500 + row['CD Number'])
  else:
    return 'Invalid BoroCD'

population['borocd'] = population.apply(recode_borocd_pop, axis=1)
population

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population,borocd
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497,201
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246,202
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762,203
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441,204
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200,205
5,Bronx,6,"East Tremont, Belmont",114137,65016,68061,75688,83268,206
6,Bronx,7,"Bedford Park, Norwood, Fordham",113764,116827,128588,141411,139286,207
7,Bronx,8,"Riverdale, Kingsbridge, Marble Hill",103543,98275,97030,101332,101731,208
8,Bronx,9,"Soundview, Parkchester",166442,167627,155970,167859,172298,209
9,Bronx,10,"Throgs Nk., Co-op City, Pelham Bay",84948,106516,108093,115948,120392,210
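
The two recode functions build the ID differently: one glues text together, the other adds numbers and converts the result to a string. This small sketch (both helper names are hypothetical) checks that the two routes give the same string, which is what the join in the next step relies on.

```python
def borocd_from_text(digit, community_board):
    # the approach used for cb_counts: borough digit + first two characters
    return digit + community_board[:2]

def borocd_from_number(base, cd_number):
    # the approach used for population: adding to a multiple of 100
    # takes care of the zero padding for single-digit districts
    return str(base + cd_number)

print(borocd_from_text('3', '06 BROOKLYN'))
print(borocd_from_number(300, 6))
```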


## Join the population data onto the counts data after creating shared `borocd` unique ID

To join dataframes together, we will use the pandas merge function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

![Types of joins](https://i.stack.imgur.com/hMKKt.jpg)

In [None]:
# now let's join this population data onto our CD dataset
# the default setting for merge is an inner join

merged_data = pd.merge(left=cb_counts, right=population, left_on='borocd', right_on='borocd')
merged_data

Unnamed: 0,community_board,count_of_311_requests,borocd,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,12 MANHATTAN,81402,112,Manhattan,12,"Washington Heights, Inwood",180561,179941,198192,208414,190020
1,05 QUEENS,71506,405,Queens,5,"Ridgewood, Glendale, Maspeth",161022,150142,149126,165911,169190
2,12 QUEENS,70361,412,Queens,12,"Jamaica, St. Albans, Hollis",206639,189383,201293,223602,225919
3,01 BROOKLYN,68101,301,Brooklyn,1,"Williamsburg, Greenpoint",179390,142942,155972,160338,173083
4,03 BROOKLYN,66360,303,Brooklyn,3,Bedford Stuyvesant,203380,133379,138696,143867,152985
5,01 STATEN ISLAND,65145,501,Staten Island,1,"Stapleton, Port Richmond",135875,138489,137806,162609,175756
6,07 QUEENS,63634,407,Queens,7,"Flushing, Bay Terrace",207589,204785,220508,242952,247354
7,05 BROOKLYN,61836,305,Brooklyn,5,"East New York, Starrett City",170791,154931,161350,173198,182896
8,04 BRONX,61086,204,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
9,01 QUEENS,60425,401,Queens,1,"Astoria, Long Island City",185925,185198,188549,211220,191105
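
To see how the join type changes the result, here is a hedged sketch on two toy tables. The `'Invalid BoroCD'` count and the column name `population_2010` are made up for illustration; `how` and `indicator` are parameters of `pd.merge`.

```python
import pandas as pd

# toy tables: the third row on the left has no match on the right, and vice versa
left = pd.DataFrame({'borocd': ['112', '405', 'Invalid BoroCD'],
                     'count_of_311_requests': [81402, 71506, 10]})
right = pd.DataFrame({'borocd': ['112', '405', '204'],
                      'population_2010': [190020, 169190, 146441]})

inner = pd.merge(left, right, on='borocd')                                   # only IDs found in both
left_join = pd.merge(left, right, on='borocd', how='left', indicator=True)   # every row from the left table
outer = pd.merge(left, right, on='borocd', how='outer', indicator=True)      # every row from both tables

# the _merge column added by indicator=True says where each row came from
print(left_join)
```

With an inner join, rows such as `'Invalid BoroCD'` drop out silently, so comparing `len(merged_data)` with `len(cb_counts)` is a quick way to notice how many rows were lost.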


## Calculate 311 requests per capita

In [None]:
# divide request count by 2010 population to get request per capita

merged_data['request_per_capita'] = merged_data.count_of_311_requests / merged_data['2010 Population']

merged_data.head()

Unnamed: 0,community_board,count_of_311_requests,borocd,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population,request_per_capita
0,12 MANHATTAN,81402,112,Manhattan,12,"Washington Heights, Inwood",180561,179941,198192,208414,190020,0.428386
1,05 QUEENS,71506,405,Queens,5,"Ridgewood, Glendale, Maspeth",161022,150142,149126,165911,169190,0.422637
2,12 QUEENS,70361,412,Queens,12,"Jamaica, St. Albans, Hollis",206639,189383,201293,223602,225919,0.311443
3,01 BROOKLYN,68101,301,Brooklyn,1,"Williamsburg, Greenpoint",179390,142942,155972,160338,173083,0.393459
4,03 BROOKLYN,66360,303,Brooklyn,3,Bedford Stuyvesant,203380,133379,138696,143867,152985,0.433768
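
As a quick check of the arithmetic, the first row of the table can be recomputed by hand. The per-1,000 figure is an extra way of reading the same ratio, not something the lecture computes.

```python
requests = 81402          # count_of_311_requests for Manhattan 12 in the table above
population_2010 = 190020  # its 2010 Population

per_capita = requests / population_2010
per_1000 = per_capita * 1000   # the same ratio expressed per 1,000 residents

print(round(per_capita, 6))
print(round(per_1000, 1))
```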


In [None]:
# let's create a simplified new dataframe that only includes the columns we care about, in a better order

cd_data = merged_data[['borocd', 'Borough', 'CD Name', '2010 Population', 'count_of_311_requests', 'request_per_capita']]

cd_data

Unnamed: 0,borocd,Borough,CD Name,2010 Population,count_of_311_requests,request_per_capita
0,112,Manhattan,"Washington Heights, Inwood",190020,81402,0.428386
1,405,Queens,"Ridgewood, Glendale, Maspeth",169190,71506,0.422637
2,412,Queens,"Jamaica, St. Albans, Hollis",225919,70361,0.311443
3,301,Brooklyn,"Williamsburg, Greenpoint",173083,68101,0.393459
4,303,Brooklyn,Bedford Stuyvesant,152985,66360,0.433768
5,501,Staten Island,"Stapleton, Port Richmond",175756,65145,0.370656
6,407,Queens,"Flushing, Bay Terrace",247354,63634,0.257259
7,305,Brooklyn,"East New York, Starrett City",182896,61836,0.338094
8,204,Bronx,"Highbridge, Concourse Village",146441,61086,0.417137
9,401,Queens,"Astoria, Long Island City",191105,60425,0.316187


Let's check out which Community Districts have the most requests per capita

In [None]:
cd_data.sort_values('request_per_capita', ascending=False).head(20)

Unnamed: 0,borocd,Borough,CD Name,2010 Population,count_of_311_requests,request_per_capita
40,105,Manhattan,Midtown Business District,51673,37466,0.72506
31,308,Brooklyn,Crown Heights North,96317,43872,0.455496
29,302,Brooklyn,"Brooklyn Heights, Fort Greene",99617,44061,0.442304
21,304,Brooklyn,Bushwick,112634,49552,0.439938
4,303,Brooklyn,Bedford Stuyvesant,152985,66360,0.433768
32,309,Brooklyn,"Crown Heights South, Wingate",98429,42655,0.433358
20,110,Manhattan,Central Harlem,115723,50024,0.432274
0,112,Manhattan,"Washington Heights, Inwood",190020,81402,0.428386
1,405,Queens,"Ridgewood, Glendale, Maspeth",169190,71506,0.422637
8,204,Bronx,"Highbridge, Concourse Village",146441,61086,0.417137


While Washington Heights/Inwood (112) had the highest number of requests, it ranks further down the list for requests per capita. Midtown may be an outlier because of its low residential population.

## Next class we'll produce charts and maps to better visualize the differences in magnitude of the 311 requests per capita values.

# HOMEWORK 2 Coding: Using keywords to categorize 311 requests

Create a new Google Colab notebook called "HW2". 

**Problem Statement:** When you read through the `descriptor` column in the 311 data, you will see that complaints related to graffiti are actually scattered throughout multiple `complaint_type` categories. We want to identify all complaints related to graffiti and see which community districts have the most instances of graffiti.


To help make this assignment easier, I have created a smaller subset of the 311 data for you to use. It's named `cleaned_311_data_hw2.csv` and it's shared with you on Google Drive. This smaller dataset only contains ~65,000 records from relevant complaint type categories. You can load it with `df = pd.read_csv('/content/drive/My Drive/Data for Python/cleaned_311_data_hw2.csv', header='infer')`.


- **Step 1**. Create a function that checks each row in the 311 dataframe to see if the word "graffiti" is present in either the `complaint_type` value or `descriptor` value. Both columns may contain the word, so you should check both. If the word "graffiti" is found, the function should return the boolean value True. If "graffiti" is not found, the function should return the boolean value False.
  - Hint 1: The same way that we checked if the string "BRONX" existed in the `community_board` value in the `cb_counts` table during the lecture, you can check if "graffiti" exists in the `descriptor` or `complaint_type` values.
  - Hint 2: The `descriptor` column contains some null/NaN values. You'll have trouble running a function on these. Before your function checks if "graffiti" is present in `descriptor`, the function should make sure that a `descriptor` value exists using `pd.notnull(row.descriptor)`
  - Hint 3: Capitalization may be inconsistent. It could help to use `.lower()` to convert strings to lowercase.

- **Step 2**. Apply the function created in Step 1 to the 311 dataframe and create a new column called `graffiti_flag` that captures the output from the function.
  - Tip: There are two checks you can use to confirm that the function worked as expected.
    - Group by `graffiti_flag` to make sure there are records tagged as True.
    - Using only the records where `graffiti_flag` is True (`df[df.graffiti_flag]`), group by `complaint_type` to make sure that more than one `complaint_type` is included.

- **Step 3**. Create another dataframe `df_graffiti` that only contains records where `graffiti_flag` is True. `df_graffiti = df[df.graffiti_flag]`

- **Step 4**. Group your dataframe `df_graffiti` to get the count of requests per `community_board`. Use `.nlargest()` to identify which Community District has the highest count.
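
The tools named in Hints 2 and 3 and in Step 4 can be tried out on their own first. The sketch below uses a different keyword and made-up values on purpose; `mentions_loud` is a hypothetical helper, not the homework answer.

```python
import pandas as pd

# toy values, invented for illustration; the middle one is missing
descriptors = pd.Series(['Loud Music/Party', None, 'LOUD TALKING'])

def mentions_loud(value):
    # check that a value exists before calling string methods on it
    if pd.notnull(value):
        return 'loud' in value.lower()   # lowercase first so capitalization doesn't matter
    return False

flags = descriptors.apply(mentions_loud)
print(flags)

# nlargest returns the rows with the biggest values, largest first
counts = pd.Series({'01 BRONX': 3, '02 BRONX': 7, '01 QUEENS': 5})
top = counts.nlargest(1)
print(top)
```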

Upload your Google Colab "HW2" notebook to your GitHub repository. You don't need to send me anything or submit anything via NYU Classes; I'll find your submission on GitHub.

# HOMEWORK 2 Tutorial: Intro to making charts in Python

- [Intro to producing charts using Python packages](https://colab.research.google.com/notebooks/charts.ipynb). You don't have to work through every one of these examples; just review to get familiar with what types of charts are possible and which packages are used.

## Examples to help answer HW0 questions

In [None]:
# this works but you don't need to use double square brackets
df_filtered[['created_date']].min()

created_date    01/01/2019 01:06:59 PM
dtype: object

In [None]:
# you only need one set of square brackets
df_filtered['created_date'].min()

'01/01/2019 01:06:59 PM'
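
The difference between the two outputs above comes from the type each selection returns. A minimal sketch on a toy dataframe (the second date and the `unique_key` values are made up):

```python
import pandas as pd

toy = pd.DataFrame({'created_date': ['01/02/2019 09:00:00 AM', '01/01/2019 01:06:59 PM'],
                    'unique_key': [1, 2]})

as_frame = toy[['created_date']]   # double brackets: a DataFrame with one column
as_series = toy['created_date']    # single brackets: a Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_frame.min())    # a Series with one entry per column
print(as_series.min())   # a single value
```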

In [None]:
# selecting the created_date column where the created_date values are greater than '12/31'
df_filtered['created_date'][df_filtered['created_date'] > '12/31']

1153906    12/31/2018 02:55:42 AM
1154126    12/31/2018 06:02:33 AM
1154286    12/31/2018 07:41:42 AM
1154501    12/31/2018 08:29:47 AM
1154513    12/31/2018 08:31:36 AM
1154534    12/31/2018 08:37:55 AM
1154554    12/31/2018 08:40:38 AM
1154565    12/31/2018 08:42:15 AM
1154566    12/31/2018 08:42:47 AM
1154637    12/31/2018 08:56:28 AM
1154690    12/31/2018 09:07:29 AM
1154696    12/31/2018 09:08:20 AM
1154705    12/31/2018 09:08:48 AM
1154712    12/31/2018 09:09:31 AM
1154717    12/31/2018 09:10:17 AM
1154720    12/31/2018 09:10:30 AM
1154802    12/31/2018 09:23:37 AM
1154957    12/31/2018 09:47:16 AM
1155100    12/31/2018 10:00:51 AM
1155131    12/31/2018 10:07:44 AM
1155161    12/31/2018 10:11:58 AM
1155256    12/31/2018 10:21:04 AM
1155275    12/31/2018 10:24:38 AM
1155306    12/31/2018 10:29:15 AM
1155329    12/31/2018 10:31:30 AM
1155330    12/31/2018 10:31:32 AM
1155402    12/31/2018 10:38:45 AM
1155516    12/31/2018 10:49:49 AM
1155521    12/31/2018 10:50:07 AM
1155530    12/
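
One thing to keep in mind: `created_date` holds text here, so `>` compares the strings character by character rather than as dates. The sketch below, which uses two dates that appear in the outputs above, contrasts that with parsing the column first. The format string is an assumption based on how those dates are written.

```python
import pandas as pd

dates = pd.Series(['12/31/2018 02:55:42 AM', '01/01/2019 01:06:59 PM'])

# text comparison: '01/...' sorts before '12/31', even though it is the later date
text_mask = dates > '12/31'

# parse into real datetimes, then compare against a date
parsed = pd.to_datetime(dates, format='%m/%d/%Y %I:%M:%S %p')
date_mask = parsed >= pd.Timestamp('2018-12-31')

print(text_mask.tolist())
print(date_mask.tolist())
```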

In [None]:
# if you want to select more than one column, you have to provide the column names as a list
# so you will have two sets of square brackets df_filtered[['created_date','unique_key']]
df_filtered[['created_date','unique_key']][df_filtered['created_date'] > '12/31']

In [None]:
# it's also possible to select the rows first then columns, but it's best practice to select columns first then rows
df_filtered[df_filtered.created_date > '12/31'][['created_date','unique_key']]

Unnamed: 0,created_date,unique_key
1153906,12/31/2018 02:55:42 AM,41314681
1154126,12/31/2018 06:02:33 AM,41314737
1154286,12/31/2018 07:41:42 AM,41307887
1154501,12/31/2018 08:29:47 AM,41309747
1154513,12/31/2018 08:31:36 AM,41307590
1154534,12/31/2018 08:37:55 AM,41309141
1154554,12/31/2018 08:40:38 AM,41313808
1154565,12/31/2018 08:42:15 AM,41312845
1154566,12/31/2018 08:42:47 AM,41306554
1154637,12/31/2018 08:56:28 AM,41313608
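
A third option, not used in the lecture, is `.loc`, which takes the row condition and the column list in one step. This is a sketch on a toy dataframe. The second row's `unique_key` and the `other_column` values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'created_date': ['12/31/2018 02:55:42 AM', '01/01/2019 01:06:59 PM'],
    'unique_key': [41314681, 0],     # second key is a placeholder
    'other_column': ['a', 'b'],      # hypothetical extra column
})

# rows first (the condition), then columns (the list), in a single call
selected = toy.loc[toy['created_date'] > '12/31', ['created_date', 'unique_key']]
print(selected)
```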
