Lobbying data pipeline #2158
Draft
nesanders wants to merge 3 commits into
Scrapes the MA Secretary of State lobbying portal (sec.state.ma.us/LobbyistPublicSearch)
and writes structured data to Firestore for joining with MAPLE bill data.
New collections:
- /lobbyingRegistrants — one doc per (registrant, year), regType Lobbyist|Employer
- /lobbyingFilings — one doc per (registrant, client, bill, court), with billId
null for Executive/Other chambers so the join guard is type-level
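For reviewers, the billId convention used by /lobbyingFilings ({chamberPrefix}{integer}, null for Executive/Other chambers) can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name build_bill_id and the CHAMBER_PREFIXES table are hypothetical, and only the H1234 and SD56 shapes are confirmed by the description.

```python
from typing import Optional

# Hypothetical mapping from portal chamber labels to bill-number prefixes.
# Only the "H" and "SD" shapes appear in the PR text; the rest are assumed.
CHAMBER_PREFIXES = {
    "House": "H",
    "Senate": "S",
    "House Docket": "HD",
    "Senate Docket": "SD",
}


def build_bill_id(chamber: str, number: Optional[int]) -> Optional[str]:
    """Return an ID such as 'H1234', or None when no bill join is possible."""
    prefix = CHAMBER_PREFIXES.get(chamber)
    if prefix is None or number is None:
        # Executive/Other activity has no legislative bill to join against.
        return None
    return f"{prefix}{number}"
```

Returning None rather than a placeholder string means a consumer has to handle the missing ID explicitly before joining against the bills collection.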
Key design points:
- billId is constructed as {chamberPrefix}{integer} (e.g. H1234, SD56) to match
Bill.id in the existing bills collection; raw integer + chamber stored separately
- Entity name normalization pipeline ported from reference implementation (10 steps:
d/b/a stripping, legal entity words, punctuation, THE, ampersand, typo fix, etc.)
- Both raw and *Norm name fields stored for provenance and grouping
- Live Cloud Function scrapes current+prior year on a 24h schedule with a
summaryDiscCache to avoid re-fetching summary pages in steady state
- Backfill admin script handles full 2005-present history with a Firestore
subcollection cursor (/scrapers/lobbyingBackfill/processedUrls) that scales
to ~50k URLs and is safely resumable
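The resumable-cursor idea in the last bullet can be sketched as below, assuming one marker per processed URL. InMemoryCursor is a hypothetical stand-in for the /scrapers/lobbyingBackfill/processedUrls subcollection, and hashing the URL into a document ID is an assumption of this sketch (Firestore document IDs cannot contain "/"); the actual script may key the documents differently.

```python
import hashlib
from typing import Callable, Iterable, Set


def url_doc_id(url: str) -> str:
    """Derive a document ID without '/' characters from a URL (assumed scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class InMemoryCursor:
    """Hypothetical stand-in for the processedUrls subcollection."""

    def __init__(self) -> None:
        self._done: Set[str] = set()

    def is_done(self, url: str) -> bool:
        return url_doc_id(url) in self._done

    def mark_done(self, url: str) -> None:
        self._done.add(url_doc_id(url))


def run_backfill(
    urls: Iterable[str],
    cursor: InMemoryCursor,
    process: Callable[[str], None],
) -> int:
    """Process every URL not yet marked done; return how many were processed."""
    processed = 0
    for url in urls:
        if cursor.is_done(url):
            continue  # already handled in an earlier run
        process(url)  # if this raises, the URL stays unmarked and is retried
        cursor.mark_done(url)
        processed += 1
    return processed
```

Because a URL is marked only after process returns, an interrupted run can be restarted with the same URL list and will skip whatever already finished.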
Files:
- functions/src/lobbying/{types,normalize,portal,scrapeLobbying,index}.ts
- scripts/firebase-admin/backfillLobbying.ts
- firestore.rules + firestore.indexes.json updated
- docs/lobbying-disclosure-ingestion.md: full plan, test plan, future work
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The MA SoS portal is protected by Imperva WAF, which uses TLS fingerprinting to classify HTTP clients before examining headers. Python's requests library produces a fingerprint that Imperva allows through; Node.js does not. A standalone Cloud Run container (Python 3.12) is therefore used for the scheduled ingestion instead of a Cloud Function.

lobbying-scraper/: Cloud Run container (3 pip deps: requests, beautifulsoup4, google-cloud-firestore):
- scrape.py: entry point with --mode weekly (incremental, fast exit if nothing new) and --mode backfill (full 2005-present history, resumable subcollection cursor). Weekly mode caches summary URL→disc URL mappings so prior-year registrants with no new filings require zero additional HTTP requests.
- portal.py: HTTP session management + HTML parsing for all three portal page levels (search POST, summary GET, disclosure GET). Handles both modern (>=2013) and legacy (<2013) disclosure formats.
- normalize.py: port of functions/src/lobbying/normalize.ts, the 10-step entity name normalization pipeline; must match the TypeScript version exactly.
- writer.py: Firestore document construction and batch writes. Schema matches types.ts (lobbyingRegistrants, lobbyingFilings collections).

scripts/firebase-admin/backfillLobbying.ts: simplified to spawn scrape.py as a subprocess; all HTTP and Firestore logic moved to Python.

functions/src/lobbying/http/: thin Python HTTP helper kept for reference; not used in the current architecture.

Note: server-side IP reputation behavior with Imperva untested. Build and run the container on Cloud Run with --dry-run to validate before full deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
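To give a feel for the kind of entity name normalization that normalize.py and normalize.ts have to agree on, here is a minimal sketch covering a few of the step types named in the description (d/b/a stripping, ampersand, punctuation, legal entity words, THE). It is not the 10-step pipeline from this PR: the step order, the LEGAL_WORDS list, and the choice to keep the text before the d/b/a marker are all assumptions made for illustration.

```python
import re

# Assumed list of legal-entity words; the real pipeline's list may differ.
LEGAL_WORDS = {"INC", "LLC", "LLP", "CORP", "CORPORATION", "CO", "LTD"}


def normalize_entity_name(raw: str) -> str:
    """Illustrative normalization: a subset of steps in an assumed order."""
    name = raw.upper()
    # Keep only the text before a "d/b/a" (or "dba") marker (assumed behavior).
    name = re.split(r"\bD/?B/?A\b", name, maxsplit=1)[0]
    # Spell out ampersands so "A & B" and "A AND B" group together.
    name = name.replace("&", " AND ")
    # Replace punctuation with spaces.
    name = re.sub(r"[^A-Z0-9 ]+", " ", name)
    # Drop legal-entity words, then a leading THE.
    words = [w for w in name.split() if w not in LEGAL_WORDS]
    if words and words[0] == "THE":
        words = words[1:]
    return " ".join(words)
```

Since the Python and TypeScript versions must produce identical output, a shared table of raw and expected names checked by both test suites would be one way to catch drift between the two.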
Summary
This PR introduces a new lobbying data ingestion pipeline and establishes the associated data model in Firebase. It makes no frontend changes; those will follow in a subsequent PR.
It adapts the lobbying data scraper from nesanders/MAenvironmentaldata#71
cf. #855, #1365
Checklist
firestore.indexes.json (Please do not only create indexes through the Firebase Web UI, even though the error messages may recommend it; indexes created this way may be obliterated by subsequent deploys): Added 4 composite indexes on lobbyingFilings (generalCourt + billId, chamber, entityNameNorm, clientNameNorm)
Screenshots
N/A (no frontend changes)
Known issues
Steps to test/reproduce
See the Incremental Test Plan in the doc for the full sequence. For a quick reviewer smoke test: