# Lecture 5: APIs

_Please sign attendance sheet_

## APIs

- They are very powerful
- Can be used from any programming language
- Not expecting you to use them in your Final Project

## APIs, conceptually

Talk me through buying a plane ticket.

- What are the steps?
- What information do you provide?
- What do you imagine is happening behind the scenes?

![Diagram showing how online payments work: Expedia talks to Delta, Delta talks to Stripe, Stripe talks to Visa, and Visa talks to Chase](extras/img/apis_conceptually/payments.png)

![Diagram showing how notifications flow through systems](extras/img/apis_conceptually/notifications.png)

![Diagram showing relationship between human languages, programming languages, and APIs](extras/img/apis_conceptually/languages.png)

APIs are how systems interact with each other ↔️

## Ways to get data

Method | How it happens | Pros | Cons
--- | :--- | :--- | :---
**Bulk** | Download, someone hands you a flash drive, etc. | Fast, one-time transfer | Can be large; data gets out of date easily
**APIs** | If organization makes one available | Usually allows some filtering; can always pull latest-and-greatest | Requires network connection for every call; higher barrier to entry (reading documentation); subject to availability and performance of API
**Scraping** | Data only available through a web site, PDF, or doc | You can turn anything into data | Tedious; fragile

## Scraping

There are open source Python tools, as well as commercial services (with APIs!). Examples:

### Web pages

- [Beautiful Soup package](https://realpython.com/beautiful-soup-web-scraper-python/)
- [pandas' `read_html()`](https://pandas.pydata.org/docs/user_guide/io.html#html)
- [Playwright](https://playwright.dev/python/docs/api/class-playwright)
- No-code tools like [ParseHub](https://www.parsehub.com/)

### PDFs

- [How to Extract Document Information From a PDF in Python](https://realpython.com/pdf-python/#how-to-extract-document-information-from-a-pdf-in-python)
- [pypdf](https://pypdf.readthedocs.io/)
- [PyMuPDF](https://pymupdf.readthedocs.io/)
- [pdfplumber](https://github.com/jsvine/pdfplumber)
- [_Others_](https://www.google.com/search?q=extract+table+from+pdf)

_Please pray to the Demo Gods that these all work and there's no profanity_

Pull table from [Wikipedia's list of countries by area](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area#Countries_and_dependencies_by_area):

In [1]:
import pandas as pd

tables = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area",
    match="Country / dependency",
)
countries = tables[0]
countries

| | Unnamed: 0 | Country / dependency | Total in km2 (mi2) | Land in km2 (mi2) | Water in km2 (mi2) | % water | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | – | Earth | 510,072,000 (196,940,000) | 148,940,000 (57,506,000) | 361,132,000 (139,434,000) | 70.8 | |
| 1 | 1 | Russia | 17,098,246 (6,601,667) | 16,376,870 (6,323,142) | 721,380 (278,530) | 4.2 | [b] |
| 2 | – | Antarctica | 14,200,000 (5,480,000) | 14,200,000 (5,480,000) | 0 | 0.0 | [c] |
| 3 | 2 | Canada | 9,984,670 (3,855,100) | 9,093,507 (3,511,021) | 891,163 (344,080) | 8.9 | [d] |
| 4 | 3/4 [e] | China | 9,596,960 (3,705,410) | 9,326,410 (3,600,950) | 270,550 (104,460) | 2.8 | [f] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 257 | – | Ashmore and Cartier Islands (Australia) | 5.0 (1.9) | 5.0 (1.9) | 0 | 0.0 | [q] |
| 258 | – | Coral Sea Islands (Australia) | 3.0 (1.2) | 3.0 (1.2) | 0 | 0.0 | [db] |
| 259 | – | Spratly Islands (disputed) | 2.0 (0.77) | 2.0 (0.77) | 0 | 0.0 | [56] |
| 260 | 194 | Monaco | 2.0 (0.77) | 2.0 (0.77) | 0 | 0.0 | [dc] |


### Data is only available if it's available

## API calls in the wild

1. Go to [Candidates page on fec.gov](https://www.fec.gov/data/candidates/?has_raised_funds=true&is_active_candidate=true).
1. Right click and `Inspect`.
   - [More info about opening Developer Tools in various browsers.](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools#how_to_open_the_devtools_in_your_browser)
1. Go to the `Network` tab and reload.
1. Filter to `XHR`.
1. Click the API call.

We only see this because the tables on [fec.gov](https://fec.gov) are [rendered client-side](https://www.solutelabs.com/blog/client-side-vs-server-side-rendering-what-to-choose-when) using their JSON API. That won't be the case for all tables on all sites.

### Parts of a URL

![URL structure](extras/img/url.png)

[source](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL#basics_anatomy_of_a_url)

For APIs:

- Often split into "base URL" + "endpoint"
- Endpoints are like function names: they represent the information you are retrieving or thing you are trying to do
- Parameters are like function arguments:
   - They allow options to be specified
   - Some are required, some are optional
   - They will differ from one endpoint/function to another
- Anchors won't be used
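
To make those parts concrete, here is a minimal sketch that takes an FEC API URL apart using Python's standard library. Where exactly "base URL" ends and "endpoint" begins is a convention; treating the `/v1` version prefix as part of the base URL is an assumption made here for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# An FEC API URL like the ones visible in the Network tab.
url = "https://api.open.fec.gov/v1/candidates/?api_key=DEMO_KEY&q=Jimmy+McMillan"

parts = urlsplit(url)

# Assumption for illustration: the version prefix (/v1) belongs to the base URL.
base_url = f"{parts.scheme}://{parts.netloc}/v1"
endpoint = parts.path.removeprefix("/v1")

# parse_qs() returns a list of values per parameter, since a parameter can repeat.
params = parse_qs(parts.query)

print(base_url)  # https://api.open.fec.gov/v1
print(endpoint)  # /candidates/
print(params)  # {'api_key': ['DEMO_KEY'], 'q': ['Jimmy McMillan']}
print(repr(parts.fragment))  # '' (no anchor)
```

The endpoint plays the role of the function name, and `params` holds its arguments.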

### API documentation

[FEC API](https://api.open.fec.gov/developers/)

### Try it out

1. Visit https://www.fec.gov/data/candidates/
1. [Open Developer Tools](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools#how_to_open_the_devtools_in_your_browser).
1. Reload the page.
1. In the Network tab's request list:
   1. Filter to Fetch/XHR/AJAX (terminology will differ by browser).
   1. Right-click the API call row.
1. Click `Open in New Tab`. You will see an error.
1. In the URL bar, replace the `api_key` value with `DEMO_KEY`. The URL should therefore contain `api_key=DEMO_KEY`.

You should see a big wall of JSON data.
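
The same key swap can be done in code rather than by hand in the URL bar. This is only a sketch: `with_api_key` is a hypothetical helper, and the URL and `PLACEHOLDER_KEY` below are stand-ins, since the exact call you copy from the Network tab will differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(url: str, api_key: str = "DEMO_KEY") -> str:
    """Return the URL with its api_key parameter set to the given value."""
    parts = urlsplit(url)
    # Keep every parameter except any existing api_key, then add the new one.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "api_key"
    ]
    params.append(("api_key", api_key))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Made-up example: the key and the other parameters are placeholders.
copied = (
    "https://api.open.fec.gov/v1/candidates/"
    "?api_key=PLACEHOLDER_KEY&has_raised_funds=true&is_active_candidate=true"
)
fixed = with_api_key(copied)
print(fixed)
```

The printed URL ends in `api_key=DEMO_KEY` and can be pasted into the browser.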


## API calls from Python

Usually one of two ways:

- A software development kit (SDK) like [sodapy](https://pypi.org/project/sodapy/)
   - Abstracts the details away
   - Not available for all APIs
   - May have limitations
- [The `requests` package](https://docs.python-requests.org/) (nothing to do with 311 requests)

Get Jimmy McMillan's latest candidacy information:


In [2]:
import requests

jimmy = {
    "api_key": "DEMO_KEY",
    "q": "Jimmy McMillan",
    "sort": "-first_file_date",
}
response = requests.get("https://api.open.fec.gov/v1/candidates/", params=jimmy)
data = response.json()
data

{'api_version': '1.0',
 'pagination': {'count': 2,
  'is_count_exact': True,
  'page': 1,
  'pages': 1,
  'per_page': 20},
 'results': [{'active_through': 2016,
   'candidate_id': 'P60016805',
   'candidate_inactive': False,
   'candidate_status': 'N',
   'cycles': [2016, 2018],
   'district': '00',
   'district_number': 0,
   'election_districts': ['00'],
   'election_years': [2016],
   'federal_funds_flag': False,
   'first_file_date': '2015-10-13',
   'has_raised_funds': False,
   'inactive_election_years': None,
   'incumbent_challenge': 'O',
   'incumbent_challenge_full': 'Open seat',
   'last_f2_date': '2015-10-13',
   'last_file_date': '2015-10-13',
   'load_date': '2018-02-17T09:16:20',
   'name': 'MCMILLAN, JIMMY "RENT IS TOO DAMN HIGH',
   'office': 'P',
   'office_full': 'President',
   'party': 'REP',
   'party_full': 'REPUBLICAN PARTY',
   'state': 'US'},
  {'active_through': 2012,
   'candidate_id': 'P60003290',
   'candidate_inactive': False,
   'candidate_status': 'N',


### Retrieving nested data

In [3]:
data["results"][0]["name"]

'MCMILLAN, JIMMY "RENT IS TOO DAMN HIGH'
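
A few more patterns for digging into a response, sketched on a trimmed copy of the data above. `sample` keeps only some of the fields shown, and the second result is deliberately cut short to show what happens when a key is missing.

```python
# Trimmed copy of the FEC response; not the full data.
sample = {
    "api_version": "1.0",
    "pagination": {"count": 2, "page": 1, "pages": 1, "per_page": 20},
    "results": [
        {"candidate_id": "P60016805", "active_through": 2016, "cycles": [2016, 2018]},
        {"candidate_id": "P60003290", "active_through": 2012},
    ],
}

# Dictionaries are indexed by key, lists by position; chain them to drill down.
total = sample["pagination"]["count"]
first_cycle = sample["results"][0]["cycles"][0]

# Loop over the list to pull the same field out of every result.
candidate_ids = [candidate["candidate_id"] for candidate in sample["results"]]

# .get() returns a default instead of raising KeyError when a key is missing.
second_cycles = sample["results"][1].get("cycles", [])

print(total)  # 2
print(first_cycle)  # 2016
print(candidate_ids)  # ['P60016805', 'P60003290']
print(second_cycles)  # []
```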

### In-class exercise

Open a new notebook in [Google Colab](https://colab.research.google.com) and adapt the previous example to use the [FEC API](https://api.open.fec.gov/developers/) to retrieve Democratic candidates for President in 2024 who raised funds.

## ELT

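Extract-load-transform: pull the raw data, store it, then clean it up. You'll sometimes see "ETL" (extract-transform-load), where the cleanup happens before storing.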

## Back to 311 data

From the [NYC Open Data Portal dataset page](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/explore), click `Export` -> `SODA API` -> `API Docs`.

### Most open data sites have APIs

Often built on platforms that provide them, e.g.

- [NYC Open Data Portal](https://opendata.cityofnewyork.us/) built on [Socrata](https://dev.socrata.com/)
- [data.gov built on CKAN](https://www.data.gov/developers/apis)

### Example: 311 requests from the last week

How would you do this?

_The dates shown are from the last time the code was run._

In [4]:
from datetime import datetime, timedelta

now = datetime.utcnow()
now

datetime.datetime(2025, 6, 30, 3, 2, 34, 144579)

In [5]:
start = now - timedelta(weeks=1)
start

datetime.datetime(2025, 6, 23, 3, 2, 34, 144579)

In [6]:
start.isoformat()

'2025-06-23T03:02:34.144579'

Using the [Socrata query language (SoQL)](https://dev.socrata.com/docs/queries/):

In [7]:
data_id = "erm2-nwe9"
in_past_week = {
    "$where": f"created_date > '{start.isoformat()}'",
    # just so it's not huge
    "$limit": 100,
}

url = f"https://data.cityofnewyork.us/resource/{data_id}.json"
response = requests.get(url, params=in_past_week)
data = response.json()

data

[{'unique_key': '65351421',
  'created_date': '2025-06-23T03:03:52.000',
  'closed_date': '2025-06-23T06:35:17.000',
  'agency': 'NYPD',
  'agency_name': 'New York City Police Department',
  'complaint_type': 'Noise - Street/Sidewalk',
  'descriptor': 'Loud Music/Party',
  'location_type': 'Street/Sidewalk',
  'incident_zip': '10460',
  'incident_address': '1577 HOE AVENUE',
  'street_name': 'HOE AVENUE',
  'cross_street_1': 'EAST  172 STREET',
  'cross_street_2': 'EAST  173 STREET',
  'intersection_street_1': 'EAST  172 STREET',
  'intersection_street_2': 'EAST  173 STREET',
  'address_type': 'ADDRESS',
  'city': 'BRONX',
  'landmark': 'HOE AVENUE',
  'status': 'Closed',
  'resolution_description': 'The Police Department responded to the complaint and with the information available observed no evidence of the violation at that time.',
  'resolution_action_updated_date': '2025-06-23T06:35:22.000',
  'community_board': '03 BRONX',
  'borough': 'BRONX',
  'x_coordinate_state_plane': '101

Like the FEC, Socrata uses their own API to populate the tables when browsing data on sites powered by them.
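
To recognize those calls in the Network tab, it helps to see what a SoQL query looks like once it is packed into a URL. `requests` does this encoding for you when you pass `params`. The sketch below approximates it with the standard library, so the exact characters `requests` emits may differ slightly. The start time is fixed to the value from the last run so the result is reproducible.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlencode

# Fixed to the value from the last run, rather than computed from "now".
start = datetime(2025, 6, 23, 3, 2, 34, 144579)

params = {
    "$where": f"created_date > '{start.isoformat()}'",
    "$limit": 100,
}

# urlencode() percent-encodes characters like $, >, quotes, and colons.
query = urlencode(params)
url = f"https://data.cityofnewyork.us/resource/erm2-nwe9.json?{query}"
print(url)

# Decoding the query string gives back the original parameters (as strings).
print(parse_qs(query))
```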

**At-home exercise:** Try filtering a table on the [NYC Open Data Portal](https://data.cityofnewyork.us/), and find the API calls that the filtering makes.

### Reading into a DataFrame

In [8]:
pd.DataFrame(data)

| | unique_key | created_date | closed_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | ... | y_coordinate_state_plane | open_data_channel_type | park_facility_name | park_borough | latitude | longitude | location | bbl | facility_type | vehicle_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 65351421 | 2025-06-23T03:03:52.000 | 2025-06-23T06:35:17.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10460 | 1577 HOE AVENUE | ... | 243539 | MOBILE | Unspecified | BRONX | 40.835079272630594 | -73.88801633595516 | {'latitude': '40.835079272630594', 'longitude'... | | | |
| 1 | 65347190 | 2025-06-23T03:04:21.000 | 2025-06-23T06:38:21.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 11369 | 109-44 DITMARS BOULEVARD | ... | 216943 | PHONE | Unspecified | QUEENS | 40.762051181138425 | -73.86113769887103 | {'latitude': '40.762051181138425', 'longitude'... | 4016770030 | | |
| 2 | 65351000 | 2025-06-23T03:04:46.000 | 2025-06-23T03:17:27.000 | NYPD | New York City Police Department | Illegal Parking | Blocked Hydrant | Street/Sidewalk | 10030 | 672 ST NICHOLAS AVENUE | ... | 239154 | MOBILE | Unspecified | MANHATTAN | 40.82308504635165 | -73.94519976314912 | {'latitude': '40.82308504635165', 'longitude':... | 1020510039 | | |
| 3 | 65345784 | 2025-06-23T03:05:18.000 | 2025-06-23T06:36:19.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10460 | HOE AVENUE | ... | | PHONE | Unspecified | BRONX | | | | | | |
| 4 | 65348418 | 2025-06-23T03:05:52.000 | 2025-06-23T06:25:28.000 | NYPD | New York City Police Department | Blocked Driveway | No Access | Street/Sidewalk | 11372 | 37 AVENUE | ... | 212307 | PHONE | Unspecified | QUEENS | 40.749355803786514 | -73.88802689907322 | {'latitude': '40.749355803786514', 'longitude'... | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 65348622 | 2025-06-23T04:21:22.000 | 2025-06-23T06:35:51.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10460 | 1569 HOE AVENUE | ... | 243474 | PHONE | Unspecified | BRONX | 40.834900968256505 | -73.88812143532677 | {'latitude': '40.834900968256505', 'longitude'... | 2029820027 | | |
| 96 | 65348263 | 2025-06-23T04:22:42.000 | 2025-06-23T04:31:28.000 | NYPD | New York City Police Department | Illegal Parking | Blocked Hydrant | Street/Sidewalk | 11208 | 184 ETNA STREET | ... | 189241 | ONLINE | Unspecified | BROOKLYN | 40.68603366963854 | -73.87681854478991 | {'latitude': '40.68603366963854', 'longitude':... | 3041150026 | | |
| 97 | 65345085 | 2025-06-23T04:23:00.000 | 2025-06-24T23:00:00.000 | DEP | Department of Environmental Protection | Noise | Noise: Manufacturing Noise (NK1) | | 10007 | 50 MURRAY STREET | ... | 199447 | ONLINE | Unspecified | MANHATTAN | 40.71411228877783 | -74.00954829968862 | {'latitude': '40.71411228877783', 'longitude':... | 1001260027 | | |
| 98 | 65345756 | 2025-06-23T04:23:03.000 | 2025-06-23T04:53:14.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10025 | 200 WEST 94 STREET | ... | 228127 | MOBILE | Unspecified | MANHATTAN | 40.79282848310588 | -73.97151625130907 | {'latitude': '40.79282848310588', 'longitude':... | 1012410036 | | |


### Pagination

- Most APIs limit the number of results returned per request.
- [Socrata defaults to 1,000.](https://dev.socrata.com/docs/queries/limit.html)
- To get everything, loop over requests with parameters like [`$limit`](https://dev.socrata.com/docs/queries/limit.html)+[`$offset`](https://dev.socrata.com/docs/queries/offset.html) (Socrata) or `page`+`per_page` ([FEC](https://api.open.fec.gov/developers/)).
   - Use [`concat()`](https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/merging.html#concatenating-objects) to combine the pages into one DataFrame.
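
A limit/offset loop might look roughly like the sketch below. So that it runs without a network connection, the API is replaced by a stand-in function serving fake rows from memory. `fetch_all` and `fake_fetch_page` are hypothetical names, not part of any library.

```python
from typing import Callable


def fetch_all(
    fetch_page: Callable[[int, int], list[dict]], page_size: int = 1000
) -> list[dict]:
    """Collect every row by requesting pages until one comes back short."""
    rows: list[dict] = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        rows.extend(page)
        # A page shorter than the limit (including an empty one) means the end.
        if len(page) < page_size:
            break
        offset += page_size
    return rows


# Stand-in for the API: 2,500 fake rows served from memory.
# A real fetch_page for Socrata would do something like:
#   requests.get(url, params={"$limit": limit, "$offset": offset}).json()
FAKE_ROWS = [{"unique_key": str(i)} for i in range(2500)]
calls = []


def fake_fetch_page(limit: int, offset: int) -> list[dict]:
    calls.append((limit, offset))
    return FAKE_ROWS[offset : offset + limit]


all_rows = fetch_all(fake_fetch_page)
print(len(all_rows))  # 2500
print(calls)  # [(1000, 0), (1000, 1000), (1000, 2000)]
```

From there, `pd.DataFrame(all_rows)` gives a single table. With a real API, also consider sorting the results (e.g. Socrata's `$order`) so rows don't shift between pages while you loop.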

## Things are going to differ by API

- Endpoints
- Supported parameters
- Response structure
   - [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#normalization) can help
- Quality of documentation
- Helpfulness of errors
- Size/helpfulness of community

Gotta read and experiment.
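
As one example of coping with a different response structure, nested records can be flattened into dotted keys so each value gets its own column. `json_normalize()` does this (and more) for lists of records. The function below is a hand-rolled sketch of the idea, not the pandas implementation, and the record is an illustrative sample modeled on the 311 response.

```python
def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested dictionary, carrying the key path along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative record: `location` is nested, like in the 311 data.
record = {
    "unique_key": "65351421",
    "agency": "NYPD",
    "location": {"latitude": "40.835079272630594", "longitude": "-73.88801633595516"},
}

flat = flatten(record)
print(flat)
```

The nested values end up under `location.latitude` and `location.longitude`.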

## [Final Project](https://python-public-policy.afeld.me/en/{{school_slug}}/final_project.html)

- You should have received feedback on your proposal.
- Reminder that it's peer-graded.
   - You should see the notebooks you need to review come through {{lms_name}}.
   - This is an opportunity to see how different people solve different problems.
   - You will lose points if you don't complete your peer grading.
- [Submission](https://python-public-policy.afeld.me/en/{{school_slug}}/final_project.html#submission)

All together, let's make sure our notebooks are properly shared for peer grading:

1. Create a notebook for your Final Project, if you haven't already.
1. [Make it visible to your peers.](https://python-public-policy.afeld.me/en/{{school_slug}}/final_project.html#sharing)

## [Schedule](https://python-public-policy.afeld.me/en/{{school_slug}}/syllabus.html#schedule)