<a href="https://colab.research.google.com/github/afeld/python-public-policy/blob/main/lecture_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NYU Wagner - Python Coding for Public Policy**
# Class 5: APIs

## Ways to get data

Method | How it happens | Pros | Cons
--- | :--- | :--- | :---
**Bulk** | Download, someone hands you a flash drive, etc. | Fast, one-time transfer | Can be large
**Scraping** | Data only available through a web site, PDF, or doc | You can turn anything into data | Tedious; fragile
**APIs** | If organization makes one available | Usually allows some filtering; can always pull latest-and-greatest | Requires network connection for every pull; higher barrier to entry (reading documentation); subject to availability and performance of API

### Scraping

Common tools:

- [Beautiful Soup package](https://realpython.com/beautiful-soup-web-scraper-python/)
- [pandas' `read_html()`](https://pandas.pydata.org/docs/user_guide/io.html#html)

In [1]:
import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', match='Rank')
pop = tables[0]
pop

Unnamed: 0,Rank,Country(or dependent territory),Population,% of world,Date,Source(official or United Nations)
0,1,China[b],1407765560,,2 May 2021,National population clock[3]
1,2,India[c],1376439313,,2 May 2021,National population clock[4]
2,3,United States[d],331449281,,1 Apr 2020,2020 census result[5]
3,4,Indonesia,271350000,,31 Dec 2020,National annual estimate[6]
4,5,Pakistan[e],225200000,,1 Jul 2021,UN projection[2]
...,...,...,...,...,...,...
237,–,Tokelau (NZ),1501,,1 Jul 2021,National annual projection[92]
238,195,Vatican City[ab],825,,1 Feb 2019,Monthly national estimate[197]
239,–,Cocos (Keeling) Islands (Australia),573,,30 Jun 2020,National annual estimate[196]
240,–,Pitcairn Islands (UK),40,,1 Jan 2021,National annual estimate[198]


## Data is only available if it's available

## API calls in the wild

1. Go to [Candidates page on fec.gov](https://www.fec.gov/data/candidates/?has_raised_funds=true&is_active_candidate=true).
1. Right click and `Inspect`.
   - [More info about opening Developer Tools in various browsers.](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools#how_to_open_the_devtools_in_your_browser)
1. Go to the `Network` tab and reload.
1. Filter to `XHR`.
1. Click the API call.

### Parts of a URL

![URL structure](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL/mdn-url-all.png)

[source](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL#basics_anatomy_of_a_url)

For APIs:

- Often split into "base URL" + "endpoint"
- Anchors aren't relevant
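
As a rough illustration of the anatomy above, Python's standard-library `urllib.parse` can split an API URL into its parts. The URL below is assembled from the FEC address and parameters used elsewhere in this lecture; nothing is sent over the network.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://api.open.fec.gov/v1/candidates/?api_key=DEMO_KEY&q=Jimmy+McMillan&sort=-first_file_date'

parts = urlsplit(url)
print(parts.scheme)    # protocol
print(parts.netloc)    # domain
print(parts.path)      # path, which includes the endpoint
print(parts.fragment)  # anchor (empty here)

# The query string decodes into a dictionary of parameter lists.
query = parse_qs(parts.query)
print(query)
```

Each parameter maps to a list because a URL may repeat the same parameter more than once.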

### API documentation

[FEC API](https://api.open.fec.gov/developers/)

### Try it out

1. In the Network tab's request list, right-click the API call.
1. Click `Open in New Tab`.
1. Replace the API key with `DEMO_KEY`.

## API calls from Python

Usually one of two ways:

- A software development kit (SDK)
   - Only if the API provider offers one
   - Abstracts the details away
   - May have limitations
- [The `requests` package](https://docs.python-requests.org/)

In [2]:
import requests

params = {
    'api_key': 'DEMO_KEY',
    'q': 'Jimmy McMillan',
    'sort': '-first_file_date'
}
response = requests.get('https://api.open.fec.gov/v1/candidates/', params=params)
data = response.json()
data

{'api_version': '1.0',
 'pagination': {'pages': 1, 'per_page': 20, 'count': 2, 'page': 1},
 'results': [{'party_full': 'REPUBLICAN PARTY',
   'district_number': 0,
   'federal_funds_flag': False,
   'has_raised_funds': False,
   'candidate_status': 'N',
   'election_districts': ['00'],
   'candidate_inactive': False,
   'last_file_date': '2015-10-13',
   'office_full': 'President',
   'inactive_election_years': None,
   'name': 'MCMILLAN, JIMMY "RENT IS TOO DAMN HIGH',
   'incumbent_challenge': 'O',
   'incumbent_challenge_full': 'Open seat',
   'first_file_date': '2015-10-13',
   'active_through': 2016,
   'party': 'REP',
   'cycles': [2016, 2018],
   'load_date': '2018-02-17T09:16:20+00:00',
   'state': 'US',
   'last_f2_date': '2015-10-13',
   'election_years': [2016],
   'district': '00',
   'flags': 'P60016805',
   'office': 'P',
   'candidate_id': 'P60016805'},
  {'party_full': 'REPUBLICAN PARTY',
   'district_number': 0,
   'federal_funds_flag': False,
   'has_raised_funds': Fal
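
Under the hood, `requests` URL-encodes the `params` dictionary and appends it to the URL as the query string. Here is a minimal sketch of that encoding step, using the standard library rather than `requests` itself so that nothing is sent over the network. Splitting the address into `base_url` and `endpoint` this way is an assumption made for the example.

```python
from urllib.parse import urlencode

base_url = 'https://api.open.fec.gov/v1'  # assumed "base URL"...
endpoint = '/candidates/'                 # ...plus "endpoint"
params = {
    'api_key': 'DEMO_KEY',
    'q': 'Jimmy McMillan',
    'sort': '-first_file_date'
}

# Spaces and other special characters are escaped automatically.
full_url = f'{base_url}{endpoint}?{urlencode(params)}'
print(full_url)
```

After a real call, `response.url` shows the URL that `requests` actually used, which is handy for pasting into a browser.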

### Retrieving nested data

In [3]:
data['results'][0]['name']

'MCMILLAN, JIMMY "RENT IS TOO DAMN HIGH'
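
Responses like this one are dictionaries and lists nested inside each other, so each pair of square brackets steps one level deeper. Here is a small sketch using a trimmed, hand-copied piece of the FEC response (only one record and a few of its fields are kept; `nickname` is a made-up key used to show a missing value):

```python
# Trimmed copy of the FEC response; the real one has more records and fields.
data = {
    'api_version': '1.0',
    'results': [
        {
            'name': 'MCMILLAN, JIMMY "RENT IS TOO DAMN HIGH',
            'party': 'REP',
            'office_full': 'President',
            'election_years': [2016],
        }
    ],
}

first = data['results'][0]         # dict -> list -> first record (a dict)
print(first['election_years'][0])  # a list nested inside the record

# Loop over every record rather than hard-coding an index.
names = [candidate['name'] for candidate in data['results']]
print(names)

# .get() returns a default instead of raising KeyError when a key is missing.
print(first.get('nickname', 'n/a'))
```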

### Reading into a DataFrame

In [4]:
params = {'api_key': 'DEMO_KEY'}
response = requests.get('https://api.open.fec.gov/v1/candidates/', params=params)
data = response.json()

pd.DataFrame(data['results'])

Unnamed: 0,name,party,state,party_full,office,office_full,election_years,first_file_date,inactive_election_years,active_through,...,election_districts,district,load_date,last_f2_date,candidate_id,cycles,district_number,incumbent_challenge,federal_funds_flag,last_file_date
0,"12-INCH COCK PLEASE, CAN YOU SUCK MY",PAF,US,PEACE AND FREEDOM,P,President,[2024],2021-04-28,,2024,...,[00],0,2021-04-28T22:02:34+00:00,2021-04-28,P40006033,[2022],0,,False,2021-04-28
1,"753, JO",NNE,US,NONE,P,President,[2020],2019-04-23,,2020,...,[00],0,2021-04-07T08:02:01+00:00,2019-04-23,P00011569,"[2020, 2022]",0,C,False,2019-04-23
2,"AABBATTE, MICHAEL THOMAS WITORT",IND,US,INDEPENDENT,P,President,[2004],2002-01-30,,2004,...,[00],0,2002-04-12T00:00:00+00:00,2002-01-30,P40002172,"[2002, 2004]",0,C,False,2002-01-30
3,"AADLER, TIM",REP,UT,REPUBLICAN PARTY,H,House,[2020],2020-03-24,,2020,...,[03],3,2020-05-05T21:11:57+00:00,2020-03-24,H0UT03227,[2020],3,C,False,2020-03-24
4,"AALDERS, TIM",IAP,UT,INDEPENDENT AMERICAN PARTY,H,House,[2014],,,2014,...,[04],4,2014-03-22T21:40:34+00:00,,H4UT04052,[2014],4,O,False,
5,"AALDERS, TIMOTHY NOEL",CON,UT,CONSTITUTION PARTY,S,Senate,"[2012, 2018]",2012-02-08,,2018,...,"[00, 00]",0,2019-03-27T16:02:41+00:00,2018-04-23,S2UT00229,"[2012, 2014, 2016, 2018, 2020]",0,O,False,2018-04-23
6,"AALOORI, BANGAR REDDY",REP,TX,REPUBLICAN PARTY,H,House,[2020],2019-10-17,,2020,...,[22],22,2020-03-18T21:13:37+00:00,2019-10-17,H0TX22260,[2020],22,O,False,2019-10-17
7,"AAMODT, NORMAN O.",REP,PA,REPUBLICAN PARTY,H,House,"[1976, 1978]",1976-04-12,,1978,...,"[16, 16]",16,2002-03-30T00:00:00+00:00,1978-07-05,H6PA16106,"[1976, 1978, 1980]",16,,False,1978-07-05
8,"AANESTAD, SAMUEL",REP,CA,REPUBLICAN PARTY,H,House,[2012],2012-02-22,,2012,...,[01],1,2013-04-26T09:04:30+00:00,2012-02-22,H2CA01110,"[2012, 2014, 2016]",1,C,False,2012-02-22
9,"AARESTAD, DAVID",DEM,CO,DEMOCRATIC PARTY,H,House,[2018],2017-04-26,,2018,...,[06],6,2017-08-01T20:57:28+00:00,2017-04-26,H8CO06237,[2018],6,C,False,2017-04-26


## Back to 311 data

From [NYC Open Data Portal dataset page](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/data), click `Export` -> `SODA API` -> `API Docs`.

### Most open data sites have APIs

Often built on platforms that provide them, e.g.

- [NYC Open Data Portal](https://opendata.cityofnewyork.us/) built on [Socrata](https://dev.socrata.com/)
- [data.gov built on CKAN](https://www.data.gov/developers/apis)

In [5]:
from datetime import datetime, timedelta

now = datetime.utcnow()
now

datetime.datetime(2021, 5, 2, 23, 36, 44, 594222)

In [6]:
start = now - timedelta(weeks=1)
start

datetime.datetime(2021, 4, 25, 23, 36, 44, 594222)

In [7]:
start.isoformat()

'2021-04-25T23:36:44.594222'

311 requests from the past week, using the [Socrata query language (SoQL)](https://dev.socrata.com/docs/queries/):

In [8]:
data_id = 'erm2-nwe9'
params = {
    '$where': f"created_date between '{start.isoformat()}' and '{now.isoformat()}'"
}

url = f'https://data.cityofnewyork.us/resource/{data_id}.json'
response = requests.get(url, params=params)
data = response.json()

data

[{'unique_key': '50370979',
  'created_date': '2021-04-25T23:36:51.000',
  'closed_date': '2021-04-26T04:34:31.000',
  'agency': 'NYPD',
  'agency_name': 'New York City Police Department',
  'complaint_type': 'Noise - Residential',
  'descriptor': 'Banging/Pounding',
  'location_type': 'Residential Building/House',
  'incident_zip': '10453',
  'incident_address': '1630 MACOMBS ROAD',
  'street_name': 'MACOMBS ROAD',
  'cross_street_1': 'FEATHERBED LANE',
  'cross_street_2': 'WEST  174 STREET',
  'intersection_street_1': 'FEATHERBED LANE',
  'intersection_street_2': 'WEST  174 STREET',
  'city': 'BRONX',
  'landmark': 'MACOMBS ROAD',
  'status': 'Closed',
  'resolution_description': 'The Police Department responded to the complaint but officers were unable to gain entry into the premises.',
  'resolution_action_updated_date': '2021-04-26T08:34:35.000',
  'community_board': '05 BRONX',
  'bbl': '2028660004',
  'borough': 'BRONX',
  'x_coordinate_state_plane': '1007329',
  'y_coordinate_s

In [9]:
pd.DataFrame(data)

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,latitude,longitude,location,address_type,facility_type,taxi_pick_up_location
0,50370979,2021-04-25T23:36:51.000,2021-04-26T04:34:31.000,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,10453,1630 MACOMBS ROAD,...,247819,ONLINE,Unspecified,BRONX,40.84685082902726,-73.91658293443733,"{'latitude': '40.84685082902726', 'longitude':...",,,
1,50373841,2021-04-25T23:36:59.000,2021-04-26T00:15:56.000,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10453,1600 SEDGWICK AVENUE,...,248896,ONLINE,Unspecified,BRONX,40.84981124185689,-73.9228541258539,"{'latitude': '40.84981124185689', 'longitude':...",,,
2,50374105,2021-04-25T23:37:05.000,2021-04-28T19:44:45.000,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,11209,8202 3 AVENUE,...,167443,ONLINE,Unspecified,BROOKLYN,40.62626497292973,-74.02992569952757,"{'latitude': '40.62626497292973', 'longitude':...",ADDRESS,,
3,50372976,2021-04-25T23:37:08.000,2021-04-26T00:32:43.000,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10457,EAST 181 STREET,...,249968,PHONE,Unspecified,BRONX,40.85273289849362,-73.89646316992017,"{'latitude': '40.85273289849362', 'longitude':...",,,
4,50375161,2021-04-25T23:37:27.000,2021-04-26T05:53:58.000,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10460,932 EAST 173 STREET,...,243694,MOBILE,Unspecified,BRONX,40.835504863783285,-73.88818185441696,"{'latitude': '40.835504863783285', 'longitude'...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,50393226,2021-04-26T07:39:35.000,2021-04-26T08:29:38.000,NYPD,New York City Police Department,Illegal Parking,Posted Parking Sign Violation,Street/Sidewalk,11218,615 AVENUE C,...,172807,ONLINE,Unspecified,BROOKLYN,40.64098874526124,-73.97302237265508,"{'latitude': '40.64098874526124', 'longitude':...",,,
996,50387794,2021-04-26T07:40:01.000,2021-04-30T13:21:43.000,DPR,Department of Parks and Recreation,Damaged Tree,Branch or Limb Has Fallen Down,Street,11422,131-18 242 STREET,...,185997,PHONE,Unspecified,QUEENS,40.67687849362251,-73.72988704739927,"{'latitude': '40.67687849362251', 'longitude':...",,,
997,50383863,2021-04-26T07:40:56.000,2021-04-26T08:43:01.000,NYPD,New York City Police Department,Blocked Driveway,Partial Access,Street/Sidewalk,11234,3724 QUENTIN ROAD,...,163441,ONLINE,Unspecified,BROOKLYN,40.61526629586832,-73.93582180145765,"{'latitude': '40.61526629586832', 'longitude':...",,,
998,50383936,2021-04-26T07:41:07.000,2021-04-26T08:16:37.000,NYPD,New York City Police Department,Blocked Driveway,No Access,Street/Sidewalk,10009,290 EAST FOURTH STREET,...,202284,MOBILE,Unspecified,MANHATTAN,40.72189770233355,-73.97927051772604,"{'latitude': '40.72189770233355', 'longitude':...",,,


Coincidence there were exactly 1,000 results?

## Pagination

- Most APIs limit the number of results returned.
- [Socrata defaults to 1,000.](https://dev.socrata.com/docs/queries/limit.html)
- Need to use a loop with parameters like [`$limit`](https://dev.socrata.com/docs/queries/limit.html)+[`$offset`](https://dev.socrata.com/docs/queries/offset.html) (Socrata) or `page`+`per_page` ([FEC](https://api.open.fec.gov/developers/))
   - [`pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) the pages into one DataFrame (`DataFrame.append()` was removed in pandas 2.0)
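
The loop might look roughly like the sketch below. To keep it runnable without a network connection, `fetch_page` is a hypothetical stand-in that slices a local list; against Socrata it would instead wrap something like `requests.get(url, params={'$limit': limit, '$offset': offset}).json()`. The names `fetch_all`, `fetch_page`, and `FAKE_ROWS` are made up for this example.

```python
# Pretend dataset of 2,500 rows standing in for the remote API.
FAKE_ROWS = [{'unique_key': str(i)} for i in range(2500)]


def fetch_page(limit, offset):
    """Hypothetical stand-in for one API call returning a list of records."""
    return FAKE_ROWS[offset:offset + limit]


def fetch_all(fetch_page, limit=1000):
    """Request pages until one comes back short, collecting every row."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        rows.extend(page)
        if len(page) < limit:
            # A short (or empty) page means there is nothing left to fetch.
            break
        offset += limit
    return rows


rows = fetch_all(fetch_page)
print(len(rows))
```

The combined list can then be passed to `pd.DataFrame(rows)`. Socrata's paging documentation also suggests adding an `$order` parameter so that rows come back in a consistent order between requests.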

## Things are going to differ by API

- Endpoints
- Supported parameters
- Response structure
   - [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) can help
- Quality of documentation
- Helpfulness of errors
- Size/helpfulness of community

Gotta read and experiment.
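
As one example of experimenting with response structure, the sketch below flattens a nested field with `pd.json_normalize()`. The two records are hand-trimmed copies of rows from the 311 response earlier in this lecture, keeping only `unique_key` and the nested `location`. The `longitude` values inside `location` are assumed to match the `longitude` column, since the printed output truncates them.

```python
import pandas as pd

records = [
    {
        'unique_key': '50370979',
        'location': {'latitude': '40.84685082902726',
                     'longitude': '-73.91658293443733'},
    },
    {
        'unique_key': '50373841',
        'location': {'latitude': '40.84981124185689',
                     'longitude': '-73.9228541258539'},
    },
]

# pd.DataFrame() keeps each nested dict in a single column...
nested = pd.DataFrame(records)
print(list(nested.columns))

# ...while json_normalize() spreads it into dotted column names.
flat = pd.json_normalize(records)
print(list(flat.columns))
```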

## [Homework 5](https://colab.research.google.com/github/afeld/python-public-policy/blob/main/hw_5.ipynb)

## Homework 6

- a.k.a. the final project
- Open-ended
   - Be creative, and just the right amount of ambitious
- **Goal:** Prove or disprove a hypothesis using skills learned in this class

In real/ideal world, start with specific question and find data to answer it:

![project flow](https://textbook.coleridgeinitiative.org/ChapterIntro/figures/projectflow.png)

_Source: [Big Data and Social Science](https://textbook.coleridgeinitiative.org/chap-intro.html#the-structure-of-the-book)_

Data needed often doesn't exist or is hard (or impossible) to find/access

Safer approach, for the purposes of this assignment:

1. Find a dataset that seems interesting.
   - Use at least one dataset that you aren't familiar with, from this class or elsewhere.
   - [NYC OpenData](https://opendata.cityofnewyork.us/), [data.gov](https://www.data.gov/), and [Kaggle](https://www.kaggle.com/datasets) have many, many options.
   - Finding a dataset available in CSV or JSON is recommended, though [pandas can read other formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
1. Inspect the data a bit.
1. Come up with a question that the data is capable of answering and _isn't trivial to answer_.
   - If you aren't sure, ask.
1. Come up with a hypothesis (a.k.a. a guess of the answer to the question).
1. Submit the proposal.

### Proposal

Post responses to the following as a new Conversation under [the `HW6 proposals` Discussion](https://brightspace.nyu.edu/d2l/le/82428/discussions/topics/226474/View) by next class:

- **What dataset are you going to use?**
   - Please include a link. If multiple, how are you going to merge/join them?
- **What's the question you are trying to answer?**
- **What's your hypothesis?**

- Your question/hypothesis doesn't need to be something novel; confirming something you read in the news is fine.
- You won't be graded on the scientific soundness of your work.
   - Important to think through and note assumptions/caveats of your approach

#### Simplified example

- **Dataset:** [Recycling Diversion and Capture Rates](https://data.cityofnewyork.us/Environment/Recycling-Diversion-and-Capture-Rates/gaq9-z3hz)
- **Question:** From 2016 to 2019, which community district increased its diversion (recycling) rate the most?
- **Hypothesis:** [Bushwick](https://communityprofiles.planning.nyc.gov/brooklyn/4), because it's gentrified over that time, and hipsters love to recycle.

### Once you start

- Create a new notebook to do the actual analysis; you will turn that in separately.
- Go back and find any information that's available _around_ the data, to get a better understanding of what it contains and means.
  - Might include a data dictionary
  - Might involve poking around a government agency's web site to understand their processes
  - Understand what all the different columns and values represent

### Analysis requirements

_on top of [general assignment requirements](https://github.com/afeld/python-public-policy/blob/main/syllabus.md#assignments)_

<!-- make sure edits here are reflected in scripts/hw_6_check.py -->

- **Read like a blog post**
    - Pretend you're explaining to a Wagner student who hasn't taken this class. You don't need to teach them Python, but they should be able to follow what's going on.
    - Re-state the question, hypothesis, and data source(s) with link(s)
    - Walk the reader through what you're doing in every step and what they should be taking away from it.
    - [Markdown](https://www.markdownguide.org/basic-syntax/) can be used in text cells for formatting.
    - Include any dead ends you hit.
    - Have a conclusion that speaks to your question and hypothesis.
- **Use pandas**
- **Not be trivial**, requiring:
    - At least 40 lines of code to come to a conclusion
    - Transforming data through [grouping](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), [merging](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging), and/or [reshaping](https://pandas.pydata.org/docs/user_guide/reshaping.html) of DataFrames
    - Operations that aren't easily done in a spreadsheet.
- **Have a visualization** (chart or map) of some kind
- Don't leave any sensitive information in the notebook: API keys, personally-identifiable information (PII), etc.
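
One way to keep an API key out of the notebook, sketched here as a suggestion rather than a course requirement, is to read it from an environment variable. `FEC_API_KEY` and `get_api_key` are made-up names, and the fallback is the public `DEMO_KEY` used earlier.

```python
import os


def get_api_key(env_var='FEC_API_KEY', default='DEMO_KEY'):
    """Read the key from the environment, falling back to the demo key."""
    return os.environ.get(env_var, default)


# The key never appears in the notebook itself.
params = {'api_key': get_api_key()}
```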

If you answer the first question easily, that's fine; dig into / build off of it.

### Peer assessment

The assignment will be peer-reviewed. Each student will need to review two other students' projects, evaluating them against the analysis requirements above and the [general scoring rules](https://github.com/afeld/python-public-policy/blob/main/syllabus.md#assignments). Reviews are due 72 hours after homework 6.

## No homework/resubmissions will be accepted after Thursday 5/13 at 6:45pm ET

In other words, homework 6 cannot be late.

## Lecture 6