v5.32.0

@dgunning released this 28 May 14:31
· 113 commits to main since this release

Added

  • xbrl.calculation_linkbase() DataFrame — exposes the per-filing calculation linkbase as one row per parent→child arc, with signed weight, role URI, taxonomy attribution (us-gaap vs filer extension), and SEC menucat classification. Enables external pipelines (e.g., bank revenue disaggregation, REIT rental income rollups) to build per-filer concept hierarchies without re-parsing _cal.xml. Layer 1 of the GH #766 implementation plan; the parser was already producing this data on CalculationTree/CalculationNode, this is a DataFrame projection over existing output. (#766)

  • Statement.extension_arcs() — surfaces filer-authored concepts that participate in a statement's calculation linkbase but are absent from its presentation tree, i.e. concepts that silently drop from render() output today. Opt-in via Statement.extension_arcs(include_values=False); default mode returns one ExtensionArc per concept (structural), include_values=True emits one per (concept, context) with the instance value attached. The existing render() path is untouched. Layer 2 of GH #766. Ground-truth verified on JPM FY2023 10-K cash flow (jpm:NetChangeInAdvancesToandInvestmentsInSubsidiaries, jpm:NetBorrowingsFromSubsidiaries — both calc-present, presentation-absent). (#766)

  • Section.markdown() accessor — closes the gap between Section.text() (item-aware but flattens tables and bullet lists) and Filing.markdown() (preserves structure but whole-document only). Per-item chunkers / RAG pipelines can now get structure-preserving markdown scoped to a single section. Pattern/heading-detected sections render the cached node tree via MarkdownRenderer; TOC-detected sections currently fall back to Section.text() to avoid corrupting adjacent-section markup (full TOC support tracked as a follow-up). Real-filing regression on AAPL 8-K Item 9.01 exhibit table locks in the pipe-table contract. (#833, contributor @HonzaCuhel)
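To make the two XBRL additions concrete, here is a minimal sketch of what a downstream pipeline might do with one-row-per-arc data: build a parent-to-children map, roll values up using the signed weights, and spot concepts that appear in the calculation linkbase but not in the presentation tree. It uses plain dicts rather than the library's DataFrame; the key names (parent, child, weight), the ex: concepts, and the numbers are all illustrative assumptions, not the actual calculation_linkbase() schema or real filing data.

```python
from collections import defaultdict

# Hypothetical arc rows; key names and "ex:" concepts are illustrative only.
arcs = [
    {"parent": "ex:NetCashFinancing", "child": "ex:DebtIssued", "weight": 1.0},
    {"parent": "ex:NetCashFinancing", "child": "ex:DividendsPaid", "weight": -1.0},
    {"parent": "ex:NetCashFinancing", "child": "ex:BorrowingsFromSubsidiaries", "weight": 1.0},
]

# Hypothetical set of concepts that the presentation tree renders.
presentation_concepts = {"ex:NetCashFinancing", "ex:DebtIssued", "ex:DividendsPaid"}


def children_by_parent(arcs):
    """Group arcs into {parent: [(child, weight), ...]}."""
    tree = defaultdict(list)
    for arc in arcs:
        tree[arc["parent"]].append((arc["child"], arc["weight"]))
    return dict(tree)


def calc_only_concepts(arcs, presentation_concepts):
    """Concepts present in the calculation arcs but absent from presentation."""
    in_calc = {arc["parent"] for arc in arcs} | {arc["child"] for arc in arcs}
    return sorted(in_calc - presentation_concepts)


def rollup(parent, values, tree):
    """Signed, weighted sum of a parent's direct children."""
    return sum(weight * values[child] for child, weight in tree[parent])


tree = children_by_parent(arcs)
values = {
    "ex:DebtIssued": 100.0,
    "ex:DividendsPaid": 40.0,
    "ex:BorrowingsFromSubsidiaries": 25.0,
}
total = rollup("ex:NetCashFinancing", values, tree)  # 100 - 40 + 25
missing = calc_only_concepts(arcs, presentation_concepts)
```

In this toy data, ex:BorrowingsFromSubsidiaries contributes to the rollup but would not be rendered, which is the kind of gap Statement.extension_arcs() is described as surfacing.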

Fixed

  • StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs on large filings: for SEC filings crossing the 10 MB streaming threshold (so most ~30–110 MB 10-Ks/20-Fs), filing.text() silently returned output 20%+ shorter than the non-streaming path. Two compounding bugs in the iterparse loop: elem.clear() ran on every event (both start and end), and ran on every element regardless of whether an enclosing structural element (<p>, <h1> through <h6>, <section>) had finished reading its children. Since SEC filings wrap virtually every word in <span style="…">, the inner <span>'s end event cleared .text/.tail before the enclosing <p> could read them, so paragraphs came out empty, with no warning. Clearing now runs only on end events and is gated on a new _content_depth counter (mirroring the existing _table_depth gate). A separate gate prevents <p>/<h*>/<section> inside <td> from being emitted twice. (#830, contributor @kevinchiu)

  • HTTP_MGR had no default timeout — stalled requests could block workers indefinitely — the internal httpx client was constructed without a timeout, so a stalled upstream or slow TLS handshake could pin a worker on an uninterruptible socket read syscall. Downstream users observed processes running 50+ minutes past their job budget on a single request. get_http_mgr() now sets Timeout(30.0, connect=10.0) by default; EDGAR_HTTP_TIMEOUT (seconds) configures it statically and the existing configure_http(timeout=...) runtime API still works. Callers that need unbounded waits can opt out explicitly. (#831, contributor @kevinchiu)

  • 13F-HR holdings merged Put/Call positions into the underlying equity row: ThirteenF.holdings grouped by CUSIP alone, so Put/Call rows aggregated into the same security's equity row and the PutCall column was lost on the merged result. Categories also used uppercase PUT/CALL while SEC XML emits title-case Put/Call, so the categorical conversion silently dropped those values too. Group key now includes PutCall when the column exists; category labels match SEC XML. Regression verified on SG Capital Management 13F-HR/A (3 distinct Put positions preserved in the aggregated view). (#824)

  • import edgar emitted DeprecationWarning on every startup — the legacy HTML modules (edgar.files.html_documents, edgar.files.html, edgar.files.htmltools) emitted warnings at module top, and edgartools' own startup cascade imports them, so the warnings fired on every fresh import. Downstream test suites running under -W error (a recommended pytest setup) had to install warning filters just to let import edgar succeed. The deprecation signal moved from module top to per-class __init__, so internal callers don't trip the warning while user-instantiated legacy classes still do. (#832, contributor @kevinchiu)

  • Filing.search() / Filing.grep() returned nothing on pre-2002 plain-text filings: Filing.search() raised AssertionError and Filing.grep() returned 0 matches on plain-text filings (e.g. PCG's 1999 10-K). Both relied on attachment iteration that finds nothing because SGML decomposition emits empty shells for text-only filings. sections() now falls back to chunking filing.text() on <PAGE> markers or blank lines when html() is None, and grep() falls back to filing.text() when no attachment yields usable text. (#819)

  • TOC analyzer fabricated phantom Items on 10-Q filings: TOCAnalyzer had three 10-K-shaped heuristics that fired regardless of form: it accepted any bare number 1–15 as an item identifier in preceding-<td> siblings (so a page-number cell like <td>8</td> became "Item 8"); it mapped any "financial statements" link to "Item 8" (correct for 10-K, wrong for 10-Q where Financial Statements is Part I, Item 1); and it sorted using a 10-K-shaped section-order table. All three heuristics are now form-guarded. (#827, contributor @HonzaCuhel)

  • SearchResults panel labels conflated BM25 rank with section index: SearchResults.__rich__ used the enumeration rank of the sorted display as the panel title, so the same numeric label meant different things in the BM25 and regex paths (BM25 sorts by score, regex preserves original order). "0" in BM25 output was the top-scoring section while "0" in regex output was the first section that matched, and the two were rarely the same. Panels now display DocSection.loc (the section's index in filing.sections()) consistently across search methods, so callers can index back into the corpus regardless of search mode. (#765)
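The StreamingParser fix listed above is easier to follow with a small reproduction. The sketch below is not the library's parser; it is a stdlib-only illustration of the same idea, using xml.etree.ElementTree.iterparse. The function name stream_paragraphs, the gate_clearing flag, and the CONTENT_TAGS set are hypothetical names chosen for this example. It clears elements only on end events, and only once no enclosing content element is still open; the separate <td> gate mentioned in the note is left out.

```python
import io
import xml.etree.ElementTree as ET

# Tags whose text must be read as a whole before any child is cleared.
CONTENT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "section"}


def stream_paragraphs(source, gate_clearing=True):
    """Yield the text of each outermost content element in a stream.

    gate_clearing=False clears every element on its end event, which
    reproduces the empty-paragraph symptom described in the release note.
    """
    content_depth = 0  # number of content elements currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag in CONTENT_TAGS:
                content_depth += 1
            continue  # never clear on start events

        if elem.tag in CONTENT_TAGS:
            content_depth -= 1
            if content_depth == 0:
                # All children are still intact here, so the text is complete.
                yield "".join(elem.itertext()).strip()

        # Free memory only when no enclosing content element still needs
        # this element's .text/.tail.
        if not gate_clearing or content_depth == 0:
            elem.clear()


doc = b"<doc><p><span>Net revenue</span> <span>rose</span></p></doc>"

gated = list(stream_paragraphs(io.BytesIO(doc)))
ungated = list(stream_paragraphs(io.BytesIO(doc), gate_clearing=False))
# gated   == ["Net revenue rose"]
# ungated == [""]  (each <span> was cleared before <p> could read it)
```

The depth counter keeps the memory benefit of clearing (whole paragraphs are still released as soon as they are emitted) while guaranteeing that inner elements survive until their enclosing paragraph has been read.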

Documentation

  • calculation_linkbase() and Statement.extension_arcs() documented alongside Phase 1 and Phase 2 of the GH #766 implementation, including the difference from presentation linkbase and worked examples on real filings. (#766, Phase 3)

Contributors: @HonzaCuhel, @kevinchiu, @0ywfe

Full changelog: v5.31.5...v5.32.0