
feat(filings): Wave 1 PR B — Form 4 insider transactions ingest#230

Merged
cipher813 merged 1 commit into main from feat/wave1-form4-substrate
May 13, 2026

Conversation

@cipher813
Owner

Summary

Wave 1 PR B of the institutional data-revamp arc. Expands EDGAR coverage beyond existing 8-K + 10-K/Q with Form 4 (Section 16 insider transactions) — the biggest institutional alpha signal still missing from the substrate.

Form 4 is filed by every officer, director, and 10%+ beneficial owner within 2 business days of each transaction. Net insider activity is a real institutional signal:

  • Cluster buying by multiple insiders → strong bullish
  • CEO/CFO open-market buys vs RSU vests → discretionary purchases are distinguished from forced exercises
  • Sales after earnings → may signal weakness
  • 10b5-1 plan disclosures → reduce noise from scheduled sales

13F, the longer-form text filings (14A, S-1), and 13D/13G are deferred to PR B.1+.

What's in

New module rag/pipelines/ingest_form4.py:

Form4Transaction row shape (frozen dataclass, 22 fields)

| Group | Fields |
| --- | --- |
| Identity | ticker, issuer_cik, accession_number, filed_date, schema_version |
| Reporting owner | name, cik, is_director, is_officer, is_ten_percent_owner, officer_title |
| Transaction | date, code (A/D/M/S/P/F/G/J), security_title, shares, price_per_share, acquired_disposed_code, value_usd (computed), shares_owned_after, direct_or_indirect, is_derivative |
| Provenance | fetched_at |
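
For readers who want to picture the row shape listed above, a frozen dataclass along these lines would fit. This is a sketch only: the field types, the `owner_` / `transaction_` prefixes, the `compute_value_usd` helper, and the example values are assumptions for illustration and are not taken from the module. Only the field groups and `SCHEMA_VERSION = 1` come from this PR.

```python
from dataclasses import dataclass
from typing import Optional

SCHEMA_VERSION = 1  # the PR pins this to 1


def compute_value_usd(shares: Optional[float], price: Optional[float]) -> Optional[float]:
    """Assumed derivation: shares * price, or None when either input is missing."""
    if shares is None or price is None:
        return None
    return shares * price


@dataclass(frozen=True)
class Form4Transaction:
    # Identity
    ticker: str
    issuer_cik: str
    accession_number: str
    filed_date: str
    # Reporting owner (the owner_ prefix is hypothetical)
    owner_name: str
    owner_cik: str
    is_director: bool
    is_officer: bool
    is_ten_percent_owner: bool
    officer_title: Optional[str]
    # Transaction (the transaction_ prefix is hypothetical)
    transaction_date: str
    transaction_code: str
    security_title: str
    shares: Optional[float]
    price_per_share: Optional[float]
    acquired_disposed_code: Optional[str]
    value_usd: Optional[float]
    shares_owned_after: Optional[float]
    direct_or_indirect: Optional[str]
    is_derivative: bool
    # Provenance
    fetched_at: str
    schema_version: int = SCHEMA_VERSION


# Illustrative row; every value below is made up.
example = Form4Transaction(
    ticker="AAPL", issuer_cik="0000000000", accession_number="0000000000-26-000001",
    filed_date="2026-05-12", owner_name="DOE JANE", owner_cik="0000000001",
    is_director=False, is_officer=True, is_ten_percent_owner=False,
    officer_title="CFO", transaction_date="2026-05-11", transaction_code="S",
    security_title="Common Stock", shares=100.0, price_per_share=10.5,
    acquired_disposed_code="D", value_usd=compute_value_usd(100.0, 10.5),
    shares_owned_after=900.0, direct_or_indirect="D", is_derivative=False,
    fetched_at="2026-05-13T00:00:00Z",
)
```

Freezing the dataclass means a parsed row cannot be mutated between parse and write, which keeps the DataFrame round-trip easy to reason about.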

parse_form4_xml() — pure XML parser

  • Handles both non-derivative AND derivative transaction tables
  • Multi-transaction filings emit multiple rows
  • Tolerates direct-text vs <value>-wrapped transactionCode (both forms occur in actual EDGAR XML)
  • Returns [] on parse failure (logged) — caller skips, doesn't crash
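
The parsing behaviour in the bullets above can be sketched roughly as follows. This is not the module's code: `parse_form4_sketch`, `_text`, and `_num` are hypothetical names, rows are plain dicts rather than `Form4Transaction` instances, and only a handful of fields are extracted. The element names follow the EDGAR ownership XML layout as commonly seen, so treat them as an assumption to check against real filings.

```python
import logging
import xml.etree.ElementTree as ET
from typing import Optional

logger = logging.getLogger(__name__)


def _text(node: Optional[ET.Element]) -> Optional[str]:
    """Read a field whether it is direct text or wrapped in a <value> child."""
    if node is None:
        return None
    wrapped = node.find("value")
    raw = wrapped.text if wrapped is not None else node.text
    raw = (raw or "").strip()
    return raw or None


def _num(node: Optional[ET.Element]) -> Optional[float]:
    """Parse a numeric field; missing or non-numeric values become None."""
    raw = _text(node)
    try:
        return float(raw) if raw is not None else None
    except ValueError:
        return None


def parse_form4_sketch(xml: str) -> list[dict]:
    try:
        root = ET.fromstring(xml)
    except ET.ParseError:
        # Log and return an empty list so the caller can skip this filing.
        logger.warning("Form 4 XML failed to parse; skipping filing")
        return []

    rows: list[dict] = []
    tables = (
        ("nonDerivativeTable/nonDerivativeTransaction", False),
        ("derivativeTable/derivativeTransaction", True),
    )
    for path, is_derivative in tables:
        for txn in root.findall(path):  # one row per transaction element
            shares = _num(txn.find("transactionAmounts/transactionShares"))
            price = _num(txn.find("transactionAmounts/transactionPricePerShare"))
            has_both = shares is not None and price is not None
            rows.append({
                "date": _text(txn.find("transactionDate")),
                "code": _text(txn.find("transactionCoding/transactionCode")),
                "security_title": _text(txn.find("securityTitle")),
                "shares": shares,
                "price_per_share": price,
                "value_usd": shares * price if has_both else None,
                "is_derivative": is_derivative,
            })
    return rows
```

Keeping the parser free of I/O is what lets the unit tests feed it literal XML strings.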

_search_form4_filings() — EDGAR discovery

Reuses the CIK lookup + submissions API pattern from ingest_8k_filings.py. Requests are rate-limited to ~8 req/sec, under EDGAR's 10 req/sec ceiling.
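
A minimal sketch of the two ideas in the discovery step, spacing requests and filtering the submissions payload down to Form 4. `MinIntervalLimiter` and `select_form4` are hypothetical names, not helpers from `ingest_8k_filings.py`. The payload shape (parallel arrays under `filings.recent`) is how the EDGAR submissions JSON is generally laid out, and is treated here as an assumption.

```python
import time
from datetime import date, timedelta


class MinIntervalLimiter:
    """Spaces calls at least 1/rate seconds apart (8/sec leaves headroom under 10/sec)."""

    def __init__(self, rate_per_sec: float = 8.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def select_form4(submissions: dict, *, lookback_days: int, today: date) -> list[dict]:
    """Keep only form type "4" filed inside the lookback window."""
    recent = submissions.get("filings", {}).get("recent", {})
    cutoff = (today - timedelta(days=lookback_days)).isoformat()
    hits = []
    for form, accession, filed in zip(
        recent.get("form", []),
        recent.get("accessionNumber", []),
        recent.get("filingDate", []),
    ):
        # ISO dates compare correctly as strings. Amendments ("4/A") are
        # excluded here; whether the real module includes them is not stated.
        if form == "4" and filed >= cutoff:
            hits.append({"accession_number": accession, "filed_date": filed})
    return hits
```

A caller would invoke `limiter.wait()` before each HTTP request and pass the decoded submissions JSON to `select_form4`.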

write_form4_parquet() — S3 writer

  • s3://alpha-engine-research/data/insider_transactions/{filed_date}.parquet
  • One parquet per filed_date holds all insider transactions across all tickers filed that day
  • Idempotent overwrite
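
The key layout and the idempotent overwrite can be illustrated with a small sketch. The names `form4_key`, `rows_to_parquet_bytes`, `write_day`, and `InMemoryS3` are hypothetical, and the default prefix is simply read off the S3 path above. `put_object` is the standard boto3 S3 client call; serialisation assumes pandas with a parquet engine such as pyarrow is installed.

```python
import io


def form4_key(filed_date: str, prefix: str = "data/insider_transactions") -> str:
    """One object per filing day: {prefix}/{filed_date}.parquet."""
    return f"{prefix.rstrip('/')}/{filed_date}.parquet"


def rows_to_parquet_bytes(rows: list[dict]) -> bytes:
    # Imported lazily so the key logic stays usable without pandas installed.
    import pandas as pd

    buf = io.BytesIO()
    # Note: an empty list gives a column-less frame here. The PR's tests
    # describe an empty-schema parquet instead, which this sketch does not reproduce.
    pd.DataFrame(rows).to_parquet(buf, index=False)
    return buf.getvalue()


def write_day(rows, *, filed_date, s3_client, bucket,
              prefix="data/insider_transactions", to_bytes=rows_to_parquet_bytes) -> str:
    key = form4_key(filed_date, prefix)
    # Same date -> same key, so a re-run replaces the object rather than duplicating it.
    s3_client.put_object(Bucket=bucket, Key=key, Body=to_bytes(rows))
    return key


class InMemoryS3:
    """Tiny stand-in for a boto3 client, in the spirit of the in-memory mock in the tests."""

    def __init__(self):
        self.objects = {}

    def put_object(self, *, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body
```

Because the key depends only on `filed_date`, re-ingesting a day is safe as long as the caller passes that day's complete set of transactions.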

ingest_for_tickers() — orchestrator

End-to-end: discovery → download → parse → write. Returns a stats dict. Transactions are grouped by filed_date, so there is one parquet write per day rather than one per filing.
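
The grouping and failure isolation could look roughly like the sketch below. `ingest_sketch`, its callable parameters, and the stats keys are all hypothetical; the real `ingest_for_tickers()` signature and stats dict are not shown in this PR description. Skipping filings that parse to zero rows is likewise an assumption.

```python
from collections import defaultdict


def ingest_sketch(filings, *, download, parse, write, dry_run=False) -> dict:
    """filings: iterable of dicts that each carry a 'filed_date' key."""
    stats = {"filings_seen": 0, "download_errors": 0, "transactions": 0, "days_written": 0}
    by_day = defaultdict(list)

    for filing in filings:
        stats["filings_seen"] += 1
        try:
            xml = download(filing)
        except Exception:
            # One bad download is counted and skipped; the batch continues.
            stats["download_errors"] += 1
            continue
        rows = parse(xml)
        if rows:  # filings that parse to nothing add no rows
            by_day[filing["filed_date"]].extend(rows)
            stats["transactions"] += len(rows)

    # One write per filing day, covering every ticker filed that day.
    for filed_date, rows in sorted(by_day.items()):
        if not dry_run:
            write(rows, filed_date)
            stats["days_written"] += 1
    return stats
```

Deferring writes until all filings are grouped matters because the writer overwrites per day: writing per filing would leave only the last filing's rows in that day's parquet.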

CLI

python -m rag.pipelines.ingest_form4 --tickers AAPL,MSFT --lookback-days 90
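
For reference, argument handling for the two flags shown above might be wired up along these lines. This is a guess at the shape rather than the module's actual parser: `parse_args` and `_split_tickers` are hypothetical names, and making `--tickers` required, upper-casing symbols, and defaulting `--lookback-days` to 90 are assumptions.

```python
import argparse


def _split_tickers(raw: str) -> list[str]:
    """Turn 'AAPL,MSFT' into ['AAPL', 'MSFT'], ignoring blanks and stray spaces."""
    return [t.strip().upper() for t in raw.split(",") if t.strip()]


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="rag.pipelines.ingest_form4")
    parser.add_argument("--tickers", required=True, type=_split_tickers,
                        help="comma-separated ticker symbols")
    parser.add_argument("--lookback-days", type=int, default=90,
                        help="how many days back to search for filings")
    return parser.parse_args(argv)
```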

What's deferred to PR B.1+

  • 13F (institutional positioning) — quarterly XML schema with holdings tables; bigger payloads + ticker_to_cik reverse lookup
  • 14A / DEF 14A (proxy statements) — RAG-only text ingest mirroring 10-K/Q pattern
  • S-1 / S-4 (registration statements) — RAG-only
  • 13D / 13G (5%+ ownership disclosures) — structured XML, similar shape to Form 4

Test plan

  • +16 unit tests (tests/test_ingest_form4.py):
    • XML parsing: single sale / multi-transaction (2 non-deriv + 1 deriv) / missing optional fields → None / malformed XML returns [] / empty filing returns []
    • transactionCode unwrapping (direct text + <value>-wrapped)
    • DataFrame: empty list → empty df with schema / round-trip preserves all fields
    • S3 writer: key format / write-read round-trip via in-memory mock / empty list writes empty-schema parquet
    • Orchestrator: end-to-end discovery → download → parse → write / non-Form-4 filings skipped at discovery / dry-run skips writes / one bad download isolated (batch continues)
    • SCHEMA_VERSION pinned to 1 + present on every row
  • Full data suite: 945 passing (1 skipped) in 4s

🤖 Generated with Claude Code

Wave 1 PR B of the institutional data-revamp arc. Expands EDGAR
coverage beyond the existing 8-K + 10-K/Q with Form 4 (Section 16
insider transactions). 13F (institutional positioning) + the longer-
form text filings (14A / S-1 / 13D-G) defer to PR B.1+.

Form 4 is filed by every officer, director, and 10%+ beneficial
owner within 2 business days of each transaction. Net insider
activity is a real institutional alpha signal — cluster buying,
CEO/CFO purchases vs RSU vests, sales after earnings, 10b5-1 plan
disclosures.

New module: rag/pipelines/ingest_form4.py

  Form4Transaction (frozen dataclass) — per-transaction row shape
    Identity: ticker, issuer_cik, accession_number, filed_date,
              schema_version
    Reporting owner: name, cik, is_director, is_officer,
              is_ten_percent_owner, officer_title
    Transaction: date, code (A/D/M/S/P/F/G/J), security_title,
              shares, price_per_share, acquired_disposed_code (A/D),
              value_usd (computed), shares_owned_after,
              direct_or_indirect, is_derivative
    Provenance: fetched_at

  parse_form4_xml(xml, *, accession_number, filed_date) — pure parser
    - Handles both non-derivative AND derivative transaction tables
    - Multi-transaction filings emit multiple rows
    - Tolerates direct-text vs <value>-wrapped transactionCode (both
      forms occur in actual EDGAR XML)
    - Returns [] on parse failure (logged) — caller skips, doesn't crash

  _search_form4_filings(ticker, lookback_days, http) — EDGAR
    discovery. Uses the shared submissions API + CIK lookup pattern
    from ingest_8k_filings.py (rate-limited at ~8 req/sec under
    EDGAR's 10/sec ceiling).

  write_form4_parquet(transactions, *, filed_date, s3_client, bucket,
    prefix) — S3 parquet writer at
    `s3://alpha-engine-research/data/insider_transactions/
    {filed_date}.parquet`. One parquet per filed_date holds all
    insider transactions across all tickers filed that day.

  ingest_for_tickers(tickers, *, lookback_days, s3_client, ...)
    end-to-end orchestrator returning a stats dict. Groups
    transactions by filed_date so one parquet write per day, not
    per filing.

  CLI: `python -m rag.pipelines.ingest_form4 --tickers AAPL,MSFT
       --lookback-days 90`

+16 unit tests:
  - XML parsing: single sale / multi-transaction (2 non-deriv + 1
    deriv) / missing optional fields → None / malformed XML returns []
    / empty filing returns []
  - transactionCode unwrapping (direct text + <value>-wrapped)
  - DataFrame: empty list → empty df with schema / round-trip
    preserves all fields
  - S3 writer: key format / write-read round-trip via in-memory mock
    / empty list writes empty-schema parquet
  - Orchestrator: end-to-end discovery → download → parse → write /
    non-Form-4 filings skipped at discovery / dry-run skips writes /
    one bad download isolated, batch continues
  - SCHEMA_VERSION pinned to 1 + present on every row

What's deferred to PR B.1+:
  - 13F (institutional positioning) — quarterly XML schema with
    holdings tables; bigger XML payloads + ticker_to_cik reverse
    lookup
  - 14A / DEF 14A (proxy statements) — RAG-only text ingest
    mirroring 10-K/Q pattern
  - S-1 / S-4 (registration statements) — RAG-only
  - 13D / 13G (5%+ ownership disclosures) — structured XML, similar
    shape to Form 4

Suite: 945 passing (1 skipped).

Composes with:
  - ingest_8k_filings.py — pattern source for the EDGAR discovery +
    CIK helpers
  - PR α (lib v0.15.0) — FilingDocument shape exists for future
    text-based filings; Form 4 is structured-only so doesn't use it
  - data-revamp-260513.md plan doc — full arc context

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit e23624d into main May 13, 2026
1 check passed
@cipher813 cipher813 deleted the feat/wave1-form4-substrate branch May 13, 2026 18:28