feat(filings): Wave 1 PR B — Form 4 insider transactions ingest #230
Merged
Wave 1 PR B of the institutional data-revamp arc. Expands EDGAR
coverage beyond the existing 8-K + 10-K/Q with Form 4 (Section 16
insider transactions). 13F (institutional positioning) + the longer-
form text filings (14A / S-1 / 13D-G) defer to PR B.1+.
Form 4 must be filed by every officer, director, and 10%+ beneficial
owner within 2 business days of each transaction. Net insider
activity is a signal institutional investors track: cluster buying,
CEO/CFO purchases vs RSU vests, sales after earnings, 10b5-1 plan
disclosures.
New module: rag/pipelines/ingest_form4.py
Form4Transaction (frozen dataclass) — per-transaction row shape
Identity: ticker, issuer_cik, accession_number, filed_date,
schema_version
Reporting owner: name, cik, is_director, is_officer,
is_ten_percent_owner, officer_title
Transaction: date, code (A/D/M/S/P/F/G/J), security_title,
shares, price_per_share, acquired_disposed_code (A/D),
value_usd (computed), shares_owned_after,
direct_or_indirect, is_derivative
Provenance: fetched_at
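For orientation, a minimal sketch of what a frozen row shape like this could look like. It is abbreviated to a subset of the fields listed above, the attribute names (`owner_name`, `transaction_code`) are guesses rather than the module's actual names, and computing `value_usd` as shares times price is an assumption about what "computed" means here. The sample values are placeholders.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass
from typing import Optional

SCHEMA_VERSION = 1  # the PR's tests pin this to 1


def compute_value_usd(shares: Optional[float], price: Optional[float]) -> Optional[float]:
    """Return shares * price, or None when either input is missing (assumed rule)."""
    if shares is None or price is None:
        return None
    return shares * price


@dataclass(frozen=True)
class Form4TransactionSketch:
    # Hypothetical, abbreviated stand-in for Form4Transaction.
    ticker: str
    accession_number: str
    filed_date: str
    owner_name: str
    transaction_code: str        # one of A/D/M/S/P/F/G/J
    acquired_disposed_code: str  # "A" or "D"
    shares: Optional[float] = None
    price_per_share: Optional[float] = None
    value_usd: Optional[float] = None
    schema_version: int = SCHEMA_VERSION


# Placeholder values, not real filing data.
row = Form4TransactionSketch(
    ticker="AAPL",
    accession_number="0000000000-00-000000",
    filed_date="2025-01-02",
    owner_name="Example Owner",
    transaction_code="S",
    acquired_disposed_code="D",
    shares=100.0,
    price_per_share=2.5,
    value_usd=compute_value_usd(100.0, 2.5),
)
```

Freezing the dataclass means a parsed row cannot be mutated after construction; assigning to `row.shares` raises `FrozenInstanceError`.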
parse_form4_xml(xml, *, accession_number, filed_date) — pure parser
- Handles both non-derivative AND derivative transaction tables
- Multi-transaction filings emit multiple rows
- Tolerates direct-text vs <value>-wrapped transactionCode (both
forms occur in actual EDGAR XML)
- Returns [] on parse failure (logged) — caller skips, doesn't crash
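The parser behaviours above could be sketched roughly as follows. This is not the module's code: the function names are hypothetical, only the transaction code is extracted, and the element paths (`nonDerivativeTransaction`, `derivativeTransaction`, `transactionCoding/transactionCode`) are assumed from the usual EDGAR ownership XML layout.

```python
import logging
import xml.etree.ElementTree as ET
from typing import Optional

logger = logging.getLogger(__name__)


def unwrap_text(node: Optional[ET.Element]) -> Optional[str]:
    """Read a field that may be direct text or wrapped in a <value> child."""
    if node is None:
        return None
    wrapped = node.find("value")
    raw = wrapped.text if wrapped is not None else node.text
    raw = (raw or "").strip()
    return raw or None


def parse_transaction_codes(xml: str) -> list:
    """Return one dict per transaction; [] when the XML cannot be parsed."""
    try:
        root = ET.fromstring(xml)
    except ET.ParseError:
        logger.warning("Form 4 XML failed to parse; skipping filing")
        return []
    rows = []
    # Walk the non-derivative table first, then the derivative table.
    for tag, is_derivative in (
        ("nonDerivativeTransaction", False),
        ("derivativeTransaction", True),
    ):
        for txn in root.iter(tag):
            code = unwrap_text(txn.find("transactionCoding/transactionCode"))
            rows.append({"code": code, "is_derivative": is_derivative})
    return rows
```

Returning `[]` instead of raising keeps the parser pure and lets the caller treat a bad filing as "no rows" without a try/except at every call site.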
_search_form4_filings(ticker, lookback_days, http) — EDGAR
discovery. Uses the shared submissions API + CIK lookup pattern
from ingest_8k_filings.py (rate-limited at ~8 req/sec under
EDGAR's 10/sec ceiling).
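How the shared helper throttles requests is not shown in this PR. One simple way to stay at roughly 8 requests per second is a minimum-interval limiter, sketched below; the class name and the injectable `clock`/`sleep` parameters are hypothetical and exist only to make the sketch testable.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 1/max_per_sec seconds apart."""

    def __init__(self, max_per_sec: float = 8.0, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_sec
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request slot; return the seconds slept."""
        now = self._clock()
        delay = max(self._next_allowed - now, 0.0)
        if delay > 0:
            self._sleep(delay)
        # Reserve the slot after whichever is later: now or the previous slot.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

Calling `limiter.wait()` before each HTTP request keeps the long-run rate at or below `max_per_sec`, leaving headroom under a 10/sec ceiling.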
write_form4_parquet(transactions, *, filed_date, s3_client, bucket,
prefix) — S3 parquet writer at
`s3://alpha-engine-research/data/insider_transactions/
{filed_date}.parquet`. One parquet per filed_date holds all
insider transactions across all tickers filed that day.
ingest_for_tickers(tickers, *, lookback_days, s3_client, ...):
end-to-end orchestrator returning a stats dict. Groups
transactions by filed_date so there is one parquet write per day,
not one per filing.
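The one-write-per-day grouping can be illustrated with plain dicts. The helper names and the default prefix argument below are hypothetical; the key layout follows the path given above, and treating `filed_date` as a `YYYY-MM-DD` string is an assumption.

```python
from collections import defaultdict


def group_by_filed_date(rows: list) -> dict:
    """Bucket transaction rows by their filed_date value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["filed_date"]].append(row)
    return dict(grouped)


def parquet_key(filed_date: str, prefix: str = "data/insider_transactions") -> str:
    """Build the object key for one day's parquet file."""
    return f"{prefix.strip('/')}/{filed_date}.parquet"


rows = [
    {"ticker": "AAPL", "filed_date": "2025-01-02"},
    {"ticker": "MSFT", "filed_date": "2025-01-02"},
    {"ticker": "AAPL", "filed_date": "2025-01-03"},
]
# Three rows across two filing days yield two keys, so two writes.
keys = [parquet_key(day) for day in sorted(group_by_filed_date(rows))]
```

Grouping before writing means a day with many filings across many tickers still produces a single object.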
CLI: `python -m rag.pipelines.ingest_form4 --tickers AAPL,MSFT
--lookback-days 90`
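A CLI with this shape could be wired up with `argparse` along these lines. Only the two flags shown in the command above are taken from the PR; the function name, the `required` setting, the default of 90, and the upper-casing of tickers are assumptions.

```python
import argparse


def split_tickers(raw: str) -> list:
    """Turn 'AAPL,MSFT' into ['AAPL', 'MSFT'], dropping empty entries."""
    return [t.strip().upper() for t in raw.split(",") if t.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest_form4")
    parser.add_argument("--tickers", required=True, type=split_tickers,
                        help="comma-separated ticker list")
    parser.add_argument("--lookback-days", type=int, default=90,
                        help="how far back to search for filings (default is assumed)")
    return parser


args = build_parser().parse_args(["--tickers", "AAPL,MSFT", "--lookback-days", "90"])
```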
+16 unit tests:
- XML parsing: single sale / multi-transaction (2 non-deriv + 1
deriv) / missing optional fields → None / malformed XML returns []
/ empty filing returns []
- transactionCode unwrapping (direct text + <value>-wrapped)
- DataFrame: empty list → empty df with schema / round-trip
preserves all fields
- S3 writer: key format / write-read round-trip via in-memory mock
/ empty list writes empty-schema parquet
- Orchestrator: end-to-end discovery → download → parse → write /
non-Form-4 filings skipped at discovery / dry-run skips writes /
one bad download isolated, batch continues
- SCHEMA_VERSION pinned to 1 + present on every row
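The orchestrator behaviours these tests exercise (dry-run skipping writes, one bad download not aborting the batch) can be sketched as below. This is an illustration under stated assumptions, not the module's code: the function name, the injected `download`/`parse`/`write` callables, and the stats keys are all hypothetical.

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)


def ingest_sketch(filings, download, parse, write, dry_run=False):
    """Download and parse each filing, then write one batch per filed_date."""
    stats = {"filings": 0, "download_errors": 0, "transactions": 0, "parquets_written": 0}
    by_day = defaultdict(list)
    for filing in filings:
        stats["filings"] += 1
        try:
            xml = download(filing)
        except Exception:
            # Isolate the failure: count it and move on to the next filing.
            logger.warning("download failed for %s; skipping", filing)
            stats["download_errors"] += 1
            continue
        rows = parse(xml)
        if rows:
            stats["transactions"] += len(rows)
            by_day[filing["filed_date"]].extend(rows)
    if not dry_run:
        for filed_date, rows in by_day.items():
            write(filed_date, rows)
            stats["parquets_written"] += 1
    return stats
```

Passing the download, parse, and write steps in as callables is what makes it straightforward to test these paths with in-memory fakes.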
What's deferred to PR B.1+:
- 13F (institutional positioning) — quarterly XML schema with
holdings tables; bigger XML payloads + ticker_to_cik reverse
lookup
- 14A / DEF 14A (proxy statements) — RAG-only text ingest
mirroring 10-K/Q pattern
- S-1 / S-4 (registration statements) — RAG-only
- 13D / 13G (5%+ ownership disclosures) — structured XML, similar
shape to Form 4
Suite: 945 passing (1 skipped).
Composes with:
- ingest_8k_filings.py — pattern source for the EDGAR discovery +
CIK helpers
- PR α (lib v0.15.0) — FilingDocument shape exists for future
text-based filings; Form 4 is structured-only so doesn't use it
- data-revamp-260513.md plan doc — full arc context
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Wave 1 PR B of the institutional data-revamp arc. Expands EDGAR coverage beyond existing 8-K + 10-K/Q with Form 4 (Section 16 insider transactions) — the biggest institutional alpha signal still missing from the substrate.
Form 4 is filed by every officer, director, and 10%+ beneficial owner within 2 business days of each transaction. Net insider activity is a real institutional signal:
- cluster buying
- CEO/CFO purchases vs RSU vests
- sales after earnings
- 10b5-1 plan disclosures
13F + longer-form text filings (14A, S-1, 13D-G) defer to PR B.1+.
What's in
New module
`rag/pipelines/ingest_form4.py`:
- `Form4Transaction` row shape (frozen dataclass, 20 fields)
- `parse_form4_xml()`: pure XML parser
  - Handles both non-derivative and derivative transaction tables
  - Multi-transaction filings emit multiple rows
  - Tolerates direct-text vs `<value>`-wrapped `transactionCode` (both forms occur in actual EDGAR XML)
  - Returns `[]` on parse failure (logged); caller skips, doesn't crash
- `_search_form4_filings()`: EDGAR discovery
  - Shared CIK lookup + submissions API pattern from `ingest_8k_filings.py`. Rate-limited at ~8 req/sec under EDGAR's 10/sec ceiling.
- `write_form4_parquet()`: S3 writer
  - `s3://alpha-engine-research/data/insider_transactions/{filed_date}.parquet`
- `ingest_for_tickers()`: orchestrator
  - End-to-end: discovery → download → parse → write. Returns stats dict. Groups transactions by filed_date so there is one parquet write per day, not one per filing.
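CLI
`python -m rag.pipelines.ingest_form4 --tickers AAPL,MSFT --lookback-days 90`
What's deferred to PR B.1+
- 13F (institutional positioning)
- 14A / DEF 14A (proxy statements)
- S-1 / S-4 (registration statements)
- 13D / 13G (5%+ ownership disclosures)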
Test plan
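- 16 unit tests (`tests/test_ingest_form4.py`):
  - XML parsing: single sale / multi-transaction (2 non-deriv + 1 deriv) / missing optional fields → None / malformed XML returns `[]` / empty filing returns `[]`
  - `transactionCode` unwrapping (direct text + `<value>`-wrapped)
  - DataFrame: empty list → empty df with schema / round-trip preserves all fields
  - S3 writer: key format / write-read round-trip via in-memory mock / empty list writes empty-schema parquet
  - Orchestrator: end-to-end discovery → download → parse → write / non-Form-4 filings skipped at discovery / dry-run skips writes / one bad download isolated, batch continues
  - `SCHEMA_VERSION` pinned to 1 + present on every row

🤖 Generated with Claude Code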