Lobbying data pipeline by nesanders · Pull Request #2158 · codeforboston/maple

nesanders · 2026-06-04T12:46:14Z

Summary

This PR introduces a new lobbying data ingestion pipeline and establishes the associated data model in Firebase. It makes no frontend modifications; those will follow in a subsequent PR.

It adapts the lobbying data scraper from nesanders/MAenvironmentaldata#71

cf. #855 #1365

Checklist

On the frontend, I've made my strings translate-able. -- N/A
If I've added shared components, I've added a storybook story. -- N/A
I've made pages responsive and look good on mobile. -- N/A
If I've added new Firestore queries, I've added any new required indexes to firestore.indexes.json (Please do not only create indexes through the Firebase Web UI, even though the error messages may recommend it - indexes created this way may be obliterated by subsequent deploys) -- Added 4 composite indexes on lobbyingFilings (generalCourt + billId, chamber, entityNameNorm, clientNameNorm)

Screenshots

N/A (no frontend changes)

Known issues

Initial backfill required before the Cloud Function is useful. scrapeLobbying only covers current and prior year in steady state. Historical data (2005–present) must be loaded by running backfillLobbying manually first. See the test plan in the doc for the recommended sequence.
Bill joins only resolve for court 192+ (2021–present). MAPLE's bill collection starts around 2020; lobbying filings before that will have a valid billId field but no matching bill document. It is my intention to load the full historical lobbying data (back to 2005), despite not having bill data back that far.
Portal scraping is slow. The MA SoS portal requires ~1s between requests. First-time processing of a year (~500 registrants × 2 disclosures) takes roughly 20–30 minutes. The function caps at 200 new disclosure pages per invocation and resumes on the next scheduled run.
Legacy pre-2013 filings use a different HTML layout; compensation is stored as a single entity-level total under the sentinel clientName: "total_salary" rather than broken down per client. This is handled in the scraper logic.
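
The throttling and per-invocation cap described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `process_pending`, its parameters, and the in-memory `processed` set are hypothetical stand-ins (the PR describes a Firestore-backed cursor), and only the ~1s delay and the cap of 200 come from the notes above.

```python
import time
from typing import Callable, Iterable, List, Set


def process_pending(
    urls: Iterable[str],
    processed: Set[str],
    fetch: Callable[[str], None],
    max_new: int = 200,      # cap on new disclosure pages per invocation
    delay_s: float = 1.0,    # pause between portal requests
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch up to max_new unprocessed URLs, pausing delay_s between requests."""
    done: List[str] = []
    for url in urls:
        if url in processed:
            continue  # already handled on an earlier run
        if len(done) >= max_new:
            break  # cap reached; the remainder waits for the next scheduled run
        if done:
            sleep(delay_s)  # throttle between consecutive requests
        fetch(url)
        processed.add(url)  # hypothetical stand-in for a persisted cursor
        done.append(url)
    return done
```

Because `processed` is updated after every successful fetch, a later call with the same URL list picks up where the previous one stopped, which is the resume behavior the cap relies on.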

Steps to test/reproduce

See the Incremental Test Plan in the doc for the full sequence. For a quick reviewer smoke test:

Run the normalization and chamber unit tests (Steps 1–2 of the test plan).
Run a live portal fetch with limit 1 against the current year (Steps 3–4). This verifies that the portal is reachable and that the HTML parsing returns valid data.
Run the backfill script against the dev project with --year 2024 --limit 3 (Step 5). This writes 3 registrants and their filings to Firestore; verify that the documents appear in the console with correct billId values for legislative rows and null for Executive rows.

Scrapes the MA Secretary of State lobbying portal (sec.state.ma.us/LobbyistPublicSearch) and writes structured data to Firestore for joining with MAPLE bill data.

New collections:
- /lobbyingRegistrants: one doc per (registrant, year), regType Lobbyist|Employer
- /lobbyingFilings: one doc per (registrant, client, bill, court), with billId null for Executive/Other chambers so the join guard is type-level

Key design points:
- billId is constructed as {chamberPrefix}{integer} (e.g. H1234, SD56) to match Bill.id in the existing bills collection; raw integer + chamber stored separately
- Entity name normalization pipeline ported from reference implementation (10 steps: d/b/a stripping, legal entity words, punctuation, THE, ampersand, typo fix, etc.)
- Both raw and *Norm name fields stored for provenance and grouping
- Live Cloud Function scrapes current+prior year on a 24h schedule with a summaryDiscCache to avoid re-fetching summary pages in steady state
- Backfill admin script handles full 2005-present history with a Firestore subcollection cursor (/scrapers/lobbyingBackfill/processedUrls) that scales to ~50k URLs and is safely resumable

Files:
- functions/src/lobbying/{types,normalize,portal,scrapeLobbying,index}.ts
- scripts/firebase-admin/backfillLobbying.ts
- firestore.rules + firestore.indexes.json updated
- docs/lobbying-disclosure-ingestion.md: full plan, test plan, future work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
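
To illustrate the billId rule from the commit message, here is a small sketch of how {chamberPrefix}{integer} could be built, with null (None) for non-legislative chambers. The chamber labels and the `CHAMBER_PREFIXES` mapping are assumptions made for this example; only the output shapes H1234 and SD56 and the null-for-Executive/Other behavior come from the description above.

```python
from typing import Optional

# Assumed chamber labels; the real portal values may differ.
CHAMBER_PREFIXES = {
    "House": "H",
    "Senate": "S",
    "House Docket": "HD",
    "Senate Docket": "SD",
}


def build_bill_id(chamber: str, bill_number: Optional[int]) -> Optional[str]:
    """Return an id such as 'H1234', or None when the row cannot join to a bill."""
    prefix = CHAMBER_PREFIXES.get(chamber)
    if prefix is None or bill_number is None:
        # Executive/Other rows (or rows without a number) get no billId.
        return None
    return f"{prefix}{bill_number}"
```

Returning None instead of a placeholder string means a join against the bills collection can simply skip rows without a billId, which is one way to read the "type-level join guard" mentioned above.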

vercel · 2026-06-04T12:46:21Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
maple-dev	Ready	Preview, Comment	Jun 5, 2026 8:59pm

The MA SoS portal is protected by Imperva WAF, which uses TLS fingerprinting to classify HTTP clients before examining headers. Python's requests library produces a fingerprint that Imperva allows through; Node.js does not. A standalone Cloud Run container (Python 3.12) is therefore used for the scheduled ingestion instead of a Cloud Function.

lobbying-scraper/ (Cloud Run container; 3 pip deps: requests, beautifulsoup4, google-cloud-firestore):
- scrape.py: entry point with --mode weekly (incremental, fast exit if nothing new) and --mode backfill (full 2005-present history, resumable subcollection cursor). Weekly mode caches summary URL→disc URL mappings so prior-year registrants with no new filings require zero additional HTTP requests.
- portal.py: HTTP session management + HTML parsing for all three portal page levels (search POST, summary GET, disclosure GET). Handles both modern (>=2013) and legacy (<2013) disclosure formats.
- normalize.py: port of functions/src/lobbying/normalize.ts (10-step entity name normalization pipeline); must match the TypeScript version exactly.
- writer.py: Firestore document construction and batch writes. Schema matches types.ts (lobbyingRegistrants, lobbyingFilings collections).

scripts/firebase-admin/backfillLobbying.ts: simplified to spawn scrape.py as a subprocess; all HTTP and Firestore logic moved to Python.

functions/src/lobbying/http/: thin Python HTTP helper kept for reference; not used in the current architecture.

Note: server-side IP reputation behavior with Imperva untested. Build and run the container on Cloud Run with --dry-run to validate before full deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
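
The weekly-mode caching described above might look something like the sketch below. It is an assumption-laden illustration and not the contents of scrape.py: `new_disclosure_urls`, the dict-based cache, and the `fetch_summary` callback are hypothetical, and a real implementation would presumably persist the cache and refresh it for registrants that can still gain filings.

```python
from typing import Callable, Dict, Iterable, List, Set


def new_disclosure_urls(
    summary_urls: Iterable[str],
    cache: Dict[str, List[str]],      # summary URL -> disclosure URLs
    processed: Set[str],              # disclosure URLs already ingested
    fetch_summary: Callable[[str], List[str]],
) -> List[str]:
    """Return disclosure URLs still to be scraped, fetching summaries only on cache misses."""
    pending: List[str] = []
    for summary_url in summary_urls:
        if summary_url not in cache:
            # One HTTP request, made only when the mapping is unknown.
            cache[summary_url] = fetch_summary(summary_url)
        pending.extend(u for u in cache[summary_url] if u not in processed)
    return pending
```

When every summary URL is cached and all of its disclosure URLs are already processed, the function returns an empty list without calling `fetch_summary`, which corresponds to the "fast exit if nothing new" path.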

nesanders and others added 2 commits June 4, 2026 07:01

initial plan

0348b29

nesanders self-assigned this Jun 4, 2026

vercel Bot deployed to Preview – maple-dev June 4, 2026 12:50 View deployment

vercel Bot deployed to Preview – maple-dev June 5, 2026 20:59 View deployment

nesanders wants to merge 3 commits into codeforboston:main from nesanders:lobbying-data-pipeline
