Lobbying data pipeline #2158

Draft
nesanders wants to merge 3 commits into codeforboston:main from nesanders:lobbying-data-pipeline

@nesanders
Collaborator

Summary

This PR introduces a new lobbying data ingestion pipeline and establishes the associated data model in Firebase. It makes no frontend changes; those will follow in a subsequent PR.

It adapts the lobbying data scraper from nesanders/MAenvironmentaldata#71

cf. #855, #1365

Checklist

  • On the frontend, I've made my strings translate-able. -- N/A
  • If I've added shared components, I've added a storybook story. -- N/A
  • I've made pages responsive and look good on mobile. -- N/A
  • If I've added new Firestore queries, I've added any new required indexes to firestore.indexes.json (Please do not only create indexes through the Firebase Web UI, even though the error messages may recommend it - indexes created this way may be obliterated by subsequent deploys) -- Added 4 composite indexes on lobbyingFilings (generalCourt + billId, chamber, entityNameNorm, clientNameNorm)

Screenshots

N/A (no frontend changes)

Known issues

  • Initial backfill required before the Cloud Function is useful. scrapeLobbying only covers current and prior year in steady state. Historical data (2005–present) must be loaded by running backfillLobbying manually first. See the test plan in the doc for the recommended sequence.
  • Bill joins only resolve for court 192+ (2021–present). MAPLE's bill collection starts around 2020; lobbying filings before that will have a valid billId field but no matching bill document. I intend to load the full historical lobbying data (back to 2005) even though bill data does not go back that far.
  • Portal scraping is slow. The MA SoS portal requires ~1s between requests. First-time processing of a year (~500 registrants × 2 disclosures) takes roughly 20–30 minutes. The function caps at 200 new disclosure pages per invocation and resumes on the next scheduled run.
  • Legacy pre-2013 filings use a different HTML layout; compensation is stored as a single entity-level total under the sentinel clientName: "total_salary" rather than broken down per client. This is handled in the scraper logic.
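
The pacing and per-invocation cap described in the portal scraping item could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `process_pending`, `fetch`, and the in-memory `processed` set are hypothetical names, and the real pipeline persists its cursor in Firestore rather than in memory.

```python
import time
from typing import Callable, Iterable, Set

REQUEST_DELAY_S = 1.0   # ~1s between portal requests, per the PR description
MAX_NEW_PAGES = 200     # cap on new disclosure pages per invocation


def process_pending(
    urls: Iterable[str],
    processed: Set[str],
    fetch: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    delay_s: float = REQUEST_DELAY_S,
    max_new: int = MAX_NEW_PAGES,
) -> int:
    """Fetch up to max_new unprocessed URLs, pausing between requests.

    Returns the number of pages fetched in this invocation. URLs already in
    `processed` are skipped, so a later run resumes where this one stopped.
    """
    fetched = 0
    for url in urls:
        if url in processed:
            continue  # handled by an earlier run
        if fetched >= max_new:
            break  # leave the rest for the next scheduled run
        if fetched > 0:
            sleep(delay_s)  # throttle between consecutive requests
        fetch(url)
        processed.add(url)
        fetched += 1
    return fetched
```

Passing `fetch` and `sleep` as parameters is only a convenience for this sketch; it lets the loop be exercised without touching the portal or waiting on real delays.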

Steps to test/reproduce

See the Incremental Test Plan in the doc for the full sequence. For a quick reviewer smoke test:

  • Run the normalization and chamber unit tests (Steps 1–2 of the test plan).
  • Run a live portal fetch with limit 1 against the current year (Steps 3–4). This verifies that the portal is reachable and that the HTML parsing returns valid data.
  • Run the backfill script against the dev project with --year 2024 --limit 3 (Step 5). This writes 3 registrants and their filings to Firestore; verify that the documents appear in the console with correct billId values for legislative rows and null for Executive rows.

nesanders and others added 2 commits June 4, 2026 07:01
Scrapes the MA Secretary of State lobbying portal (sec.state.ma.us/LobbyistPublicSearch)
and writes structured data to Firestore for joining with MAPLE bill data.

New collections:
- /lobbyingRegistrants — one doc per (registrant, year), regType Lobbyist|Employer
- /lobbyingFilings — one doc per (registrant, client, bill, court), with billId
  null for Executive/Other chambers so the join guard is type-level

Key design points:
- billId is constructed as {chamberPrefix}{integer} (e.g. H1234, SD56) to match
  Bill.id in the existing bills collection; raw integer + chamber stored separately
- Entity name normalization pipeline ported from reference implementation (10 steps:
  d/b/a stripping, legal entity words, punctuation, THE, ampersand, typo fix, etc.)
- Both raw and *Norm name fields stored for provenance and grouping
- Live Cloud Function scrapes current+prior year on a 24h schedule with a
  summaryDiscCache to avoid re-fetching summary pages in steady state
- Backfill admin script handles full 2005-present history with a Firestore
  subcollection cursor (/scrapers/lobbyingBackfill/processedUrls) that scales
  to ~50k URLs and is safely resumable
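
A minimal sketch of the billId construction described in the design points, assuming a simple lookup from chamber label to prefix. The labels in `CHAMBER_PREFIXES` and the function name `build_bill_id` are hypothetical; only the output shape (H1234, SD56, and null for Executive/Other) comes from the commit message.

```python
from typing import Optional

# Assumed mapping from portal chamber labels to bill-id prefixes; the labels
# used by the real scraper may differ.
CHAMBER_PREFIXES = {
    "House": "H",
    "Senate": "S",
    "House Docket": "HD",
    "Senate Docket": "SD",
}


def build_bill_id(chamber: str, number: int) -> Optional[str]:
    """Return an id like 'H1234', or None for non-legislative chambers."""
    prefix = CHAMBER_PREFIXES.get(chamber)
    if prefix is None:
        return None  # e.g. Executive or Other: no bill to join against
    return f"{prefix}{number}"
```

Returning `None` rather than an empty or placeholder string keeps non-legislative rows out of any join on billId by construction.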

Files:
- functions/src/lobbying/{types,normalize,portal,scrapeLobbying,index}.ts
- scripts/firebase-admin/backfillLobbying.ts
- firestore.rules + firestore.indexes.json updated
- docs/lobbying-disclosure-ingestion.md: full plan, test plan, future work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nesanders nesanders self-assigned this Jun 4, 2026
@vercel

vercel Bot commented Jun 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project    Deployment  Actions           Updated (UTC)
maple-dev  Ready       Preview, Comment  Jun 5, 2026 8:59pm

The MA SoS portal is protected by Imperva WAF, which uses TLS fingerprinting
to classify HTTP clients before examining headers. Python's requests library
produces a fingerprint that Imperva allows through; Node.js does not. A
standalone Cloud Run container (Python 3.12) is therefore used for the
scheduled ingestion instead of a Cloud Function.

lobbying-scraper/ — Cloud Run container (3 pip deps: requests, beautifulsoup4,
google-cloud-firestore):
- scrape.py: entry point with --mode weekly (incremental, fast exit if nothing
  new) and --mode backfill (full 2005-present history, resumable subcollection
  cursor). Weekly mode caches summary URL→disc URL mappings so prior-year
  registrants with no new filings require zero additional HTTP requests.
- portal.py: HTTP session management + HTML parsing for all three portal page
  levels (search POST, summary GET, disclosure GET). Handles both modern
  (>=2013) and legacy (<2013) disclosure formats.
- normalize.py: port of functions/src/lobbying/normalize.ts — 10-step entity
  name normalization pipeline, must match the TypeScript version exactly.
- writer.py: Firestore document construction and batch writes. Schema matches
  types.ts (lobbyingRegistrants, lobbyingFilings collections).
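
For orientation, an entity-name normalizer of the kind normalize.py ports might look roughly like the sketch below. It covers only a few of the step types this PR names (d/b/a stripping, ampersands, punctuation, a leading THE, legal entity words). The step order, the `LEGAL_WORDS` list, the choice to keep the text before "d/b/a", and the function name are all assumptions, not the actual 10-step pipeline.

```python
import re

# Illustrative subset of legal-entity words; the real list is not shown here.
LEGAL_WORDS = {"INC", "LLC", "LLP", "CORP", "CORPORATION", "CO", "LTD"}


def normalize_entity_name(raw: str) -> str:
    """Reduce a raw entity name to a grouping key (illustrative only)."""
    name = raw.upper()
    # Drop a "d/b/a ..." clause, keeping the text before it (an assumption).
    name = re.split(r"\bD/?B/?A\b", name, maxsplit=1)[0]
    # Spell out ampersands so "A & B" and "A AND B" group together.
    name = name.replace("&", " AND ")
    # Replace punctuation with spaces.
    name = re.sub(r"[^A-Z0-9 ]+", " ", name)
    tokens = name.split()
    # Remove a leading THE.
    if tokens and tokens[0] == "THE":
        tokens = tokens[1:]
    # Remove legal-entity words.
    tokens = [t for t in tokens if t not in LEGAL_WORDS]
    return " ".join(tokens)
```

Because the Python and TypeScript versions must agree exactly, a shared table of raw/normalized pairs checked by both test suites would be one way to catch drift between them.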

scripts/firebase-admin/backfillLobbying.ts — simplified to spawn scrape.py
as a subprocess; all HTTP and Firestore logic moved to Python.

functions/src/lobbying/http/ — thin Python HTTP helper kept for reference;
not used in the current architecture.

Note: server-side IP reputation behavior with Imperva is untested. Build and run
the container on Cloud Run with --dry-run to validate before full deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>