# Lecture 5: APIs

_Please sign attendance sheet_

## APIs

- They are very powerful
- Can be used from any programming language
- Not expecting you to use them in your Final Project

## APIs, conceptually

Talk me through buying a plane ticket.

- What are the steps?
- What information do you provide?
- What do you imagine is happening behind the scenes?

![Diagram showing how online payments work: Expedia talks to Delta, Delta talks to Stripe, Stripe talks to Visa, and Visa talks to Chase](extras/img/apis_conceptually/payments.png)

![Diagram showing how notifications flow through systems](extras/img/apis_conceptually/notifications.png)

![Diagram showing relationship between human languages, programming languages, and APIs](extras/img/apis_conceptually/languages.png)

APIs are interactions between systems ↔️

## Ways to get data

Method | How it happens | Pros | Cons
--- | :--- | :--- | :---
**Bulk** | Download, someone hands you a flash drive, etc. | Fast, one-time transfer | Can be large; data gets out of date easily
**APIs** | If organization makes one available | Usually allows some filtering; can always pull latest-and-greatest | Requires network connection for every call; higher barrier to entry (reading documentation); subject to availability and performance of API
**Scraping** | Data only available through a web site, PDF, or doc | You can turn anything into data | Tedious; fragile

## Scraping

There are open source Python tools, as well as commercial services (with APIs!). Examples:

### Web pages

- [Beautiful Soup package](https://realpython.com/beautiful-soup-web-scraper-python/)
- [pandas' `read_html()`](https://pandas.pydata.org/docs/user_guide/io.html#html)
- [Playwright](https://playwright.dev/python/docs/api/class-playwright)
- No-code tools like [ParseHub](https://www.parsehub.com/)

### PDFs

- [How to Extract Document Information From a PDF in Python](https://realpython.com/pdf-python/#how-to-extract-document-information-from-a-pdf-in-python)
- [pypdf](https://pypdf.readthedocs.io/)
- [PyMuPDF](https://pymupdf.readthedocs.io/)
- [pdfplumber](https://github.com/jsvine/pdfplumber)
- [_Others_](https://www.google.com/search?q=extract+table+from+pdf)

_Please pray to the Demo Gods that these all work and there's no profanity_

### Wikipedia example

See [Wikipedia's list of countries by area](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area#Countries_and_dependencies_by_area). How would you turn this into a spreadsheet?

What happens when you want to update it?

#### Scrape the data

To comply with [Wikimedia's User-Agent Policy](https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy), we need to [override](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) the [default `User-Agent`](https://docs.python.org/3/library/urllib.request.html#urllib.request.Request).

- Feel free to ignore this part
- Blame the AI!

In [1]:
import sys

user_agent = f"PythonPublicPolicyDemo/0.0 (https://python-public-policy.afeld.me/en/nyu/lecture_5.html#scraping; alf9@nyu.edu) Python-urllib/{sys.version_info.major}.{sys.version_info.minor}"
user_agent

'PythonPublicPolicyDemo/0.0 (https://python-public-policy.afeld.me/en/nyu/lecture_5.html#scraping; alf9@nyu.edu) Python-urllib/3.12'

In [2]:
import pandas as pd

tables = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area",
    match="Country / dependency",
    storage_options={"User-Agent": user_agent}
)
countries = tables[0]
countries

|  | Unnamed: 0 | Country / dependency | Total in km2 (mi2) | Land in km2 (mi2) | Water in km2 (mi2) | % water | Unnamed: 6 |
| --- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | – | Earth | 510,072,000 (196,940,000) | 148,940,000 (57,506,000) | 361,132,000 (139,434,000) | 70.8 |  |
| 1 | 1 | Russia | 17,098,246 (6,601,667) | 16,376,870 (6,323,142) | 721,380 (278,530) | 4.2 | [b] |
| 2 | – | Antarctica | 14,200,000 (5,480,000) | 14,200,000 (5,480,000) | 0 | 0.0 | [c] |
| 3 | 2 | Canada | 9,984,670 (3,855,100) | 9,093,507 (3,511,021) | 891,163 (344,080) | 8.9 | [d] |
| 4 | 3/4 [e] | China | 9,596,960 (3,705,410) | 9,326,410 (3,600,950) | 270,550 (104,460) | 2.8 | [f] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 257 | – | Ashmore and Cartier Islands (Australia) | 5.0 (1.9) | 5.0 (1.9) | 0 | 0.0 | [q] |
| 258 | – | Coral Sea Islands (Australia) | 3.0 (1.2) | 3.0 (1.2) | 0 | 0.0 | [da] |
| 259 | – | Spratly Islands (disputed) | 2.0 (0.77) | 2.0 (0.77) | 0 | 0.0 | [54] |
| 260 | 194 | Monaco | 2.0 (0.77) | 2.0 (0.77) | 0 | 0.0 | [db] |
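
Scraped cells usually need cleanup before they behave like spreadsheet values. As a rough sketch (not part of the original demo), the hypothetical helper `parse_km2()` below pulls the km2 figure out of cells shaped like the ones in this table, using only the standard library. It assumes the km2 value always comes first in the cell, which may not hold if the page layout changes.

```python
import re


def parse_km2(cell):
    """Return the leading km2 figure from a cell like '17,098,246 (6,601,667)' as a float."""
    # Match digits and thousands separators at the start, plus an optional decimal part.
    match = re.match(r"\s*([\d,]+(?:\.\d+)?)", cell)
    if match is None:
        return None  # the cell didn't start with a number
    return float(match.group(1).replace(",", ""))


print(parse_km2("17,098,246 (6,601,667)"))  # 17098246.0
print(parse_km2("5.0 (1.9)"))  # 5.0
print(parse_km2("0"))  # 0.0
```

This is the "tedious; fragile" part of scraping: the pattern only works as long as the cells keep this shape.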


### Data is only available if it's available

## API calls in the wild

1. Go to [Candidates page on fec.gov](https://www.fec.gov/data/candidates/?has_raised_funds=true&is_active_candidate=true).
1. Right-click the page and choose `Inspect`.
   - [More info about opening Developer Tools in various browsers.](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools#how_to_open_the_devtools_in_your_browser)
1. Go to the `Network` tab and reload.
1. Filter to `XHR`.
1. Click the API call.

We only see this because the tables on [fec.gov](https://fec.gov) are [rendered client-side](https://www.solutelabs.com/blog/client-side-vs-server-side-rendering-what-to-choose-when) using their JSON API. That won't be the case for all tables on all sites.

### Parts of a URL

![URL structure](extras/img/url.png)

[source](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL#basics_anatomy_of_a_url)

For APIs:

- Often split into "base URL" + "endpoint"
- Endpoints are like function names: they represent the information you are retrieving or thing you are trying to do
- Parameters are like function arguments:
   - They allow options to be specified
   - Some are required, some are optional
   - They will differ from one endpoint/function to another
- Anchors won't be used
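
To make the function analogy concrete, here is a minimal sketch that takes an API URL apart and puts it back together with Python's standard-library `urllib.parse`. The URL is a call to the FEC API's candidates endpoint with illustrative parameter values. Where to split "base URL" from "endpoint" is a convention, so the split used here is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://api.open.fec.gov/v1/candidates/?api_key=DEMO_KEY&q=Jimmy+McMillan&sort=-first_file_date"

parts = urlsplit(url)
print(parts.scheme)  # https
print(parts.netloc)  # api.open.fec.gov
print(parts.path)  # /v1/candidates/
print(repr(parts.fragment))  # '' because there is no anchor

# Parameters behave like keyword arguments to the endpoint.
# parse_qs() returns a list of values for each parameter name.
params = parse_qs(parts.query)
print(params["q"])  # ['Jimmy McMillan']

# Going the other way: base URL + endpoint + encoded parameters.
base_url = "https://api.open.fec.gov/v1"  # assumed split point
endpoint = "/candidates/"
query = urlencode({"api_key": "DEMO_KEY", "q": "Jimmy McMillan", "sort": "-first_file_date"})
rebuilt = f"{base_url}{endpoint}?{query}"
print(rebuilt == url)  # True
```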

### API documentation

[FEC API](https://api.open.fec.gov/developers/)

## API calls from Python

Usually one of two ways:

- A software development kit (SDK) like [sodapy](https://pypi.org/project/sodapy/)
   - Abstracts the details away
   - Not available for all APIs
   - May have limitations
- [The `requests` package](https://docs.python-requests.org/) (nothing to do with 311 requests)

Get Jimmy McMillan's latest candidacy information:

In [3]:
import requests

jimmy = {
    "api_key": "DEMO_KEY",
    "q": "Jimmy McMillan",
    "sort": "-first_file_date",
}
response = requests.get("https://api.open.fec.gov/v1/candidates/", params=jimmy)
data = response.json()
data

{'api_version': '1.0',
 'pagination': {'count': 2,
  'is_count_exact': True,
  'page': 1,
  'pages': 1,
  'per_page': 20},
 'results': [{'active_through': 2016,
   'candidate_id': 'P60016805',
   'candidate_inactive': False,
   'candidate_status': 'N',
   'cycles': [2016, 2018],
   'district': '00',
   'district_number': 0,
   'election_districts': ['00'],
   'election_years': [2016],
   'federal_funds_flag': False,
   'first_file_date': '2015-10-13',
   'has_raised_funds': False,
   'inactive_election_years': None,
   'incumbent_challenge': 'O',
   'incumbent_challenge_full': 'Open seat',
   'last_f2_date': '2015-10-13',
   'last_file_date': '2015-10-13',
   'load_date': '2018-02-17T09:16:20',
   'name': 'MCMILLAN, JIMMY "RENT IS TOO DAMN HIGH',
   'office': 'P',
   'office_full': 'President',
   'party': 'REP',
   'party_full': 'REPUBLICAN PARTY',
   'state': 'US'},
  {'active_through': 2012,
   'candidate_id': 'P60003290',
   'candidate_inactive': False,
   'candidate_status': 'N',


### Retrieving nested data

In [4]:
data["results"][0]["name"]

'MCMILLAN, JIMMY "RENT IS TOO DAMN HIGH'
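
The same indexing idea extends to looping over every result. The sketch below uses a trimmed-down copy of the response above, with most fields omitted. The second entry's party is left out of the sample on purpose, to show what `.get()` does when a key is missing.

```python
# A trimmed-down stand-in for the API response (most fields omitted).
data = {
    "pagination": {"count": 2, "page": 1, "pages": 1, "per_page": 20},
    "results": [
        {"candidate_id": "P60016805", "election_years": [2016], "party_full": "REPUBLICAN PARTY"},
        {"candidate_id": "P60003290", "active_through": 2012},
    ],
}

# Dictionaries are indexed by key, lists by position; chain them to drill down.
first = data["results"][0]
print(first["election_years"][0])  # 2016

# Loop over the list to pull one field out of every result.
ids = [candidate["candidate_id"] for candidate in data["results"]]
print(ids)  # ['P60016805', 'P60003290']

# .get() returns a default instead of raising KeyError when a key is absent,
# which helps when results don't all share the same fields.
parties = [candidate.get("party_full", "unknown") for candidate in data["results"]]
print(parties)  # ['REPUBLICAN PARTY', 'unknown']
```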

### In-class exercise

Open a new notebook in [Google Colab](https://colab.research.google.com) and adapt the previous example to retrieve, via the [FEC API](https://api.open.fec.gov/developers/), Democratic candidates for President in 2024 who raised funds.

## ELT

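Extract-load-transform: pull the raw data, store it, then clean and reshape it. You'll sometimes see "ETL" (extract-transform-load), where the transformation happens before loading.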
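
As a rough illustration of the "load first, transform later" order, here is a sketch using Python's built-in `sqlite3`. The table names are made up, and the two hard-coded records stand in for what an API call would return. Keeping the raw copy means the transform step can be rerun later without calling the API again.

```python
import json
import sqlite3

# Extract: in practice this would come from an API call; here, two hard-coded records.
raw_records = [
    {"unique_key": "66488249", "agency": "HPD", "complaint_type": "UNSANITARY CONDITION"},
    {"unique_key": "66495303", "agency": "DOT", "complaint_type": "Street Light Condition"},
]

conn = sqlite3.connect(":memory:")

# Load: store each record untouched, as a JSON string.
conn.execute("CREATE TABLE raw_requests (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_requests VALUES (?)",
    [(json.dumps(record),) for record in raw_records],
)

# Transform: clean up afterwards, reading from the raw copy.
conn.execute("CREATE TABLE requests (unique_key TEXT, agency TEXT, complaint_type TEXT)")
for (payload,) in conn.execute("SELECT payload FROM raw_requests").fetchall():
    record = json.loads(payload)
    conn.execute(
        "INSERT INTO requests VALUES (?, ?, ?)",
        # Normalize the inconsistent capitalization of complaint types.
        (record["unique_key"], record["agency"], record["complaint_type"].title()),
    )

rows = conn.execute("SELECT agency, complaint_type FROM requests ORDER BY unique_key").fetchall()
print(rows)  # [('HPD', 'Unsanitary Condition'), ('DOT', 'Street Light Condition')]
```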

## Back to 311 data

From the [NYC Open Data Portal dataset page](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/explore), click `Export` -> `SODA API` -> `API Docs`.

### Most open data sites have APIs

Often built on platforms that provide them, e.g.

- [NYC Open Data Portal](https://opendata.cityofnewyork.us/) built on [Socrata](https://dev.socrata.com/)
- [data.gov built on CKAN](https://www.data.gov/developers/apis)

### Example: 311 requests from the last week

How would you do this?

_The dates shown are from the last time the code was run._

In [5]:
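from datetime import datetime, timedelta, timezone

# datetime.utcnow() is deprecated as of Python 3.12; this gives the same naive UTC timestamp
now = datetime.now(timezone.utc).replace(tzinfo=None)
now

datetime.datetime(2025, 10, 22, 15, 51, 58, 176831)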

In [6]:
start = now - timedelta(weeks=1)
start

datetime.datetime(2025, 10, 15, 15, 51, 58, 176831)

In [7]:
start.isoformat()

'2025-10-15T15:51:58.176831'

Using the [Socrata query language (SoQL)](https://dev.socrata.com/docs/queries/):

In [8]:
data_id = "erm2-nwe9"
in_past_week = {
    "$where": f"created_date > '{start.isoformat()}'",
    # just so it's not huge
    "$limit": 100,
}

url = f"https://data.cityofnewyork.us/resource/{data_id}.json"
response = requests.get(url, params=in_past_week)
data = response.json()

data

[{'unique_key': '66488249',
  'created_date': '2025-10-15T15:51:59.000',
  'closed_date': '2025-10-16T20:58:44.000',
  'agency': 'HPD',
  'agency_name': 'Department of Housing Preservation and Development',
  'complaint_type': 'UNSANITARY CONDITION',
  'descriptor': 'PESTS',
  'location_type': 'RESIDENTIAL BUILDING',
  'incident_zip': '10456',
  'incident_address': '1145 CLAY AVENUE',
  'street_name': 'CLAY AVENUE',
  'address_type': 'ADDRESS',
  'city': 'BRONX',
  'status': 'Closed',
  'resolution_description': "HPD inspected this condition so the complaint has been closed. Violations were issued. The law provides the property owner time to correct the condition(s).  Violation descriptions and the dates for the property owner to correct any violations are available at HPDONLINE.  If the owner has not corrected the condition by the date provided, you may wish to bring a case in housing court seeking the correction of these conditions.To find out more about how to start a housing court 

Like the FEC, Socrata uses its own API to populate the tables you see when browsing data on sites it powers.

**At-home exercise:** Try filtering a table on the [NYC Open Data Portal](https://data.cityofnewyork.us/), and find the API calls that the filtering triggers.

### Reading into a DataFrame

In [9]:
pd.DataFrame(data)

|  | unique_key | created_date | closed_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | ... | longitude | location | cross_street_1 | cross_street_2 | facility_type | intersection_street_1 | intersection_street_2 | landmark | vehicle_type | taxi_pick_up_location |
| --- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 66488249 | 2025-10-15T15:51:59.000 | 2025-10-16T20:58:44.000 | HPD | Department of Housing Preservation and Develop... | UNSANITARY CONDITION | PESTS | RESIDENTIAL BUILDING | 10456 | 1145 CLAY AVENUE | ... | -73.91148262087205 | {'latitude': '40.83083438527125', 'longitude':... |  |  |  |  |  |  |  |  |
| 1 | 66495801 | 2025-10-15T15:51:59.000 | 2025-10-16T20:58:44.000 | HPD | Department of Housing Preservation and Develop... | PAINT/PLASTER | CEILING | RESIDENTIAL BUILDING | 10456 | 1145 CLAY AVENUE | ... | -73.91148262087205 | {'latitude': '40.83083438527125', 'longitude':... |  |  |  |  |  |  |  |  |
| 2 | 66495911 | 2025-10-15T15:51:59.000 | 2025-10-16T20:58:44.000 | HPD | Department of Housing Preservation and Develop... | UNSANITARY CONDITION | PESTS | RESIDENTIAL BUILDING | 10456 | 1145 CLAY AVENUE | ... | -73.91148262087205 | {'latitude': '40.83083438527125', 'longitude':... |  |  |  |  |  |  |  |  |
| 3 | 66495303 | 2025-10-15T15:52:00.000 |  | DOT | Department of Transportation | Street Light Condition | Lamppost Damaged |  | 11365 | 162 STREET | ... |  |  | 161 STREET | 65 AVENUE |  |  |  |  |  |  |
| 4 | 66488892 | 2025-10-15T15:52:01.000 | 2025-10-16T00:00:00.000 | DOB | Department of Buildings | General Construction/Plumbing | Landmark Bldg - Illegal Work |  | 11385 | 60-16 70 AVENUE | ... | -73.89817999017804 | {'latitude': '40.70172444729474', 'longitude':... |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 66489211 | 2025-10-15T16:01:00.000 |  | DOT | Department of Transportation | Street Light Condition | Street Light Out |  | 10306 | 3240 AMBOY ROAD | ... | -74.13362356846557 | {'latitude': '40.56188050400172', 'longitude':... | CHESTERTON AVE | EMMET AVE |  |  |  |  |  |  |
| 96 | 66491373 | 2025-10-15T16:01:02.000 |  | HPD | Department of Housing Preservation and Develop... | PLUMBING | TOILET | RESIDENTIAL BUILDING | 10459 | 941 HOE AVENUE | ... | -73.89032552130838 | {'latitude': '40.82193704510607', 'longitude':... |  |  |  |  |  |  |  |  |
| 97 | 66494212 | 2025-10-15T16:01:02.000 | 2025-10-16T14:35:52.000 | HPD | Department of Housing Preservation and Develop... | PAINT/PLASTER | WALL | RESIDENTIAL BUILDING | 11205 | 147 SANDFORD STREET | ... | -73.95328960528542 | {'latitude': '40.69431894077128', 'longitude':... |  |  |  |  |  |  |  |  |
| 98 | 66497403 | 2025-10-15T16:01:08.000 | 2025-10-16T20:48:52.000 | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | ENTIRE BUILDING | RESIDENTIAL BUILDING | 10467 | 2768 MATTHEWS AVENUE | ... | -73.86253950618823 | {'latitude': '40.866415490080364', 'longitude'... |  |  |  |  |  |  |  |  |


### Pagination

- Most APIs limit the number of results returned.
- [Socrata defaults to 1,000.](https://dev.socrata.com/docs/queries/limit.html)
- To get everything, loop over the pages using parameters like [`$limit`](https://dev.socrata.com/docs/queries/limit.html)+[`$offset`](https://dev.socrata.com/docs/queries/offset.html) (Socrata) or `page`+`per_page` ([FEC](https://api.open.fec.gov/developers/))
   - Combine the pages into a single DataFrame with [`concat()`](https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/merging.html#concatenating-objects)
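
The loop can be sketched roughly as follows. To keep the example runnable without a network connection, the hypothetical `fake_fetch_page()` serves records from an in-memory list; `fetch_all()` and its stopping rule (a page shorter than the limit means we're done) are one reasonable approach, not the only one.

```python
def fetch_all(fetch_page, page_size):
    """Collect every record by requesting pages until one comes back short."""
    records = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        records.extend(page)
        # A page smaller than the limit (including an empty one) means we've reached the end.
        if len(page) < page_size:
            break
        offset += page_size
    return records


# Stand-in for a real API: 25 fake records served from memory.
FAKE_DATASET = [{"unique_key": str(n)} for n in range(25)]


def fake_fetch_page(limit, offset):
    return FAKE_DATASET[offset : offset + limit]


all_records = fetch_all(fake_fetch_page, page_size=10)
print(len(all_records))  # 25
```

Against Socrata, `fetch_page` could instead wrap a `requests.get()` call that passes `$limit` and `$offset` as parameters and returns `response.json()`. Socrata's paging docs also suggest adding an `$order` so results don't shift between calls. The combined list of records can then go straight into `pd.DataFrame()`.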

## Things are going to differ by API

- Endpoints
- Supported parameters
- Response structure
   - [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#normalization) can help
- Quality of documentation
- Helpfulness of errors
- Size/helpfulness of community

Gotta read and experiment.
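
To give a feel for what flattening a nested response involves, here is a hand-rolled, standard-library stand-in for the idea behind `json_normalize()`. It is not how pandas implements it. The `flatten()` helper is hypothetical, and the sample record uses made-up coordinate values.

```python
def flatten(record, parent_key="", sep="."):
    """Turn nested dictionaries into one flat dictionary with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up sample record with one nested field.
record = {
    "unique_key": "123",
    "agency": "HPD",
    "location": {"latitude": "40.0", "longitude": "-73.0"},
}
flat_record = flatten(record)
print(flat_record)
# {'unique_key': '123', 'agency': 'HPD', 'location.latitude': '40.0', 'location.longitude': '-73.0'}
```

Each flattened record maps cleanly onto one row with one column per key, which is the shape a DataFrame wants.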

## [Final Project](https://python-public-policy.afeld.me/en/nyu/final_project.html)

- You should have received feedback on your proposal.
- Reminder that it's peer-graded.
   - You should see the notebooks you need to review come through Brightspace.
   - This is an opportunity to see how different people solve different problems.
   - You will lose points if you don't complete your peer grading.
- [Submission](https://python-public-policy.afeld.me/en/nyu/final_project.html#submission)

## [Schedule](https://python-public-policy.afeld.me/en/nyu/syllabus.html#schedule)