v5.32.0
Added
- xbrl.calculation_linkbase() DataFrame: exposes the per-filing calculation linkbase as one row per parent→child arc, with signed weight, role URI, taxonomy attribution (us-gaap vs. filer extension), and SEC menucat classification. Enables external pipelines (e.g., bank revenue disaggregation, REIT rental income rollups) to build per-filer concept hierarchies without re-parsing _cal.xml. Layer 1 of the GH #766 implementation plan; the parser was already producing this data on CalculationTree/CalculationNode, so this is a DataFrame projection over existing output. (#766)
- Statement.extension_arcs(): surfaces filer-authored concepts that participate in a statement's calculation linkbase but are absent from its presentation tree, i.e. concepts that silently drop from render() output today. Opt-in via Statement.extension_arcs(include_values=False); the default mode returns one ExtensionArc per concept (structural), while include_values=True emits one per (concept, context) with the instance value attached. The existing render() path is untouched. Layer 2 of GH #766. Ground-truth verified on the JPM FY2023 10-K cash flow statement (jpm:NetChangeInAdvancesToandInvestmentsInSubsidiaries, jpm:NetBorrowingsFromSubsidiaries; both calc-present, presentation-absent). (#766)
- Section.markdown() accessor: closes the gap between Section.text() (item-aware but flattens tables and bullet lists) and Filing.markdown() (preserves structure but whole-document only). Per-item chunkers and RAG pipelines can now get structure-preserving markdown scoped to a single section. Pattern/heading-detected sections render the cached node tree via MarkdownRenderer; TOC-detected sections currently fall back to Section.text() to avoid corrupting adjacent-section markup (full TOC support is tracked as a follow-up). A real-filing regression test on the AAPL 8-K Item 9.01 exhibit table locks in the pipe-table contract. (#833, contributor @HonzaCuhel)
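
To illustrate what a table of parent→child arcs with signed weights makes possible, here is a minimal sketch that rolls child values up into a parent total. It does not call edgartools: the tuple layout (parent, child, weight), the `ex:` concept names, and the values are all made up for illustration, and the actual DataFrame columns may be named differently.

```python
from collections import defaultdict

# Hypothetical arc rows: (parent concept, child concept, signed weight).
# Layout, concept names, and values are illustrative assumptions, not the
# real output of xbrl.calculation_linkbase().
ARCS = [
    ("ex:NetRevenue", "ex:InterestIncome", 1.0),
    ("ex:NetRevenue", "ex:FeeIncome", 1.0),
    ("ex:NetRevenue", "ex:InterestExpense", -1.0),
]

# Hypothetical reported values for the leaf concepts.
VALUES = {
    "ex:InterestIncome": 120.0,
    "ex:FeeIncome": 45.0,
    "ex:InterestExpense": 30.0,
}


def build_children(arcs):
    """Map each parent concept to its list of (child, weight) pairs."""
    children = defaultdict(list)
    for parent, child, weight in arcs:
        children[parent].append((child, weight))
    return dict(children)


def roll_up(concept, children, values):
    """Return a leaf's reported value, or the weighted sum of its children."""
    if concept not in children:
        return values[concept]
    return sum(
        weight * roll_up(child, children, values)
        for child, weight in children[concept]
    )


children = build_children(ARCS)
total = roll_up("ex:NetRevenue", children, VALUES)
print(total)  # 120.0 + 45.0 - 30.0
```

A weight of -1.0 subtracts the child from its parent, which is why the sign has to travel with each arc rather than being inferred from the concept name.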
Fixed
- StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs on large filings: for SEC filings crossing the 10 MB streaming threshold (so most ~30–110 MB 10-Ks/20-Fs), filing.text() silently returned output 20%+ shorter than the non-streaming path. There were two compounding bugs in the iterparse loop: elem.clear() ran on every event (both start and end), and it ran on every element regardless of whether an enclosing structural element (<p>, <h1>–<h6>, <section>) had finished reading its children. Since SEC filings wrap virtually every word in <span style="…">, the inner <span>'s end event cleared .text/.tail before the enclosing <p> could read them, so paragraphs came out empty, with no warning. Clearing now runs only on end events and is gated on a new _content_depth counter (mirroring the existing _table_depth gate). A separate gate prevents <p>/<h*>/<section> inside <td> from being emitted twice. (#830, contributor @kevinchiu)
- HTTP_MGR had no default timeout, so stalled requests could block workers indefinitely: the internal httpx client was constructed without a timeout, so a stalled upstream or slow TLS handshake could pin a worker on an uninterruptible socket read syscall. Downstream users observed processes running 50+ minutes past their job budget on a single request. get_http_mgr() now sets Timeout(30.0, connect=10.0) by default; EDGAR_HTTP_TIMEOUT (seconds) configures it statically, and the existing configure_http(timeout=...) runtime API still works. Callers that need unbounded waits can opt out explicitly. (#831, contributor @kevinchiu)
- 13F-HR holdings merged Put/Call positions into the underlying equity row: ThirteenF.holdings grouped by CUSIP alone, so Put/Call rows aggregated into the same security's equity row and the PutCall column was lost on the merged result. Categories also used uppercase PUT/CALL while SEC XML emits title-case Put/Call, so the categorical conversion silently dropped those values too. The group key now includes PutCall when the column exists, and category labels match SEC XML. Regression verified on SG Capital Management 13F-HR/A (3 distinct Put positions preserved in the aggregated view). (#824)
- import edgar emitted DeprecationWarning on every startup: the legacy HTML modules (edgar.files.html_documents, edgar.files.html, edgar.files.htmltools) emitted warnings at module top, and edgartools' own startup cascade imports them, so the warnings fired on every fresh import. Downstream test suites running under -W error (a recommended pytest setup) had to install warning filters just to let import edgar succeed. The deprecation signal moved from module top to per-class __init__, so internal callers don't trip the warning while user-instantiated legacy classes still do. (#832, contributor @kevinchiu)
- Filing.search() / Filing.grep() returned nothing on pre-2002 plain-text filings: Filing.search() raised AssertionError and Filing.grep() returned 0 matches on plain-text filings (e.g. PCG's 1999 10-K). Both relied on attachment iteration that finds nothing because SGML decomposition emits empty shells for text-only filings. sections() now falls back to chunking filing.text() on <PAGE> markers or blank lines when html() is None, and grep() falls back to filing.text() when no attachment yields usable text. (#819)
- TOC analyzer fabricated phantom Items on 10-Q filings: TOCAnalyzer had three 10-K-shaped heuristics that fired regardless of form. It accepted any bare number 1–15 as an item identifier in preceding-<td> siblings (so a page-number cell like <td>8</td> became "Item 8"); it mapped any "financial statements" link to "Item 8" (correct for 10-K, wrong for 10-Q, where Financial Statements is Part I, Item 1); and it sorted using a 10-K-shaped section-order table. All three heuristics are now form-guarded. (#827, contributor @HonzaCuhel)
- SearchResults panel labels conflated BM25 rank with section index: SearchResults.__rich__ used the enumeration rank of the sorted display as the panel title, so the same numeric label meant different things in the BM25 and regex paths (BM25 sorts by score, regex preserves original order). "0" in BM25 output was the top-scoring section while "0" in regex output was the first section that matched, and the two were rarely the same. Panels now display DocSection.loc, the section's index in filing.sections(), consistently across search methods, so callers can index back into the corpus regardless of search mode. (#765)
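
The StreamingParser fix can be illustrated with a small, self-contained sketch. It uses the standard library's xml.etree.ElementTree.iterparse, not edgartools' actual parser; the `paragraphs` function, the `gated` flag, and the sample markup are hypothetical, and `content_depth` only stands in for the `_content_depth` counter named above.

```python
import io
import xml.etree.ElementTree as ET

# Tags treated as structural containers in this sketch.
STRUCTURAL = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "section"}

SAMPLE = (
    b"<html><body>"
    b"<p><span>Net income</span> <span>rose</span></p>"
    b"<p><span>Cash fell</span></p>"
    b"</body></html>"
)


def paragraphs(data, gated):
    """Collect the text of each outermost structural element.

    gated=False clears every element on its end event, so <span> text is
    wiped before the enclosing <p> is read (the behaviour described above).
    gated=True clears only while no structural element is still open.
    """
    content_depth = 0
    out = []
    events = ET.iterparse(io.BytesIO(data), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            if elem.tag in STRUCTURAL:
                content_depth += 1
            continue  # never clear on start events
        if elem.tag in STRUCTURAL:
            content_depth -= 1
            if content_depth == 0:
                out.append("".join(elem.itertext()))
        if not gated or content_depth == 0:
            elem.clear()  # free memory once nothing still needs this node
    return out


fixed = paragraphs(SAMPLE, gated=True)
naive = paragraphs(SAMPLE, gated=False)
print(fixed)
print(naive)
```

With the gate, the inner spans survive until the enclosing paragraph has been read; without it, each paragraph comes back as an empty string and nothing signals the loss.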
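
The effect of the 13F-HR group-key change can be sketched in plain Python, without pandas or edgartools. The rows, the CUSIP, the field names, and the `aggregate` helper below are all hypothetical; the point is only that adding PutCall to the key keeps option positions apart from the equity row.

```python
from collections import defaultdict

# Hypothetical holdings rows; the CUSIP and values are made up.
# PutCall is None for a plain equity position.
ROWS = [
    {"Cusip": "000000AA1", "PutCall": None, "Value": 500},
    {"Cusip": "000000AA1", "PutCall": "Put", "Value": 40},
    {"Cusip": "000000AA1", "PutCall": "Put", "Value": 10},
    {"Cusip": "000000AA1", "PutCall": "Call", "Value": 25},
]


def aggregate(rows, key_fields):
    """Sum Value per group, where the group key is built from key_fields."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        totals[key] += row["Value"]
    return dict(totals)


# Grouping by CUSIP alone folds the options into the equity row.
by_cusip = aggregate(ROWS, ["Cusip"])

# Including PutCall in the key keeps equity, puts, and calls separate.
by_cusip_putcall = aggregate(ROWS, ["Cusip", "PutCall"])

print(by_cusip)
print(by_cusip_putcall)
```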
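
Moving a deprecation signal from module top to a class's __init__, as described for the import edgar fix, follows a common pattern that can be shown with the standard warnings module. `LegacyDocument` and `internal_startup` are hypothetical stand-ins, not edgartools classes or functions.

```python
import warnings


class LegacyDocument:
    """Hypothetical stand-in for a deprecated legacy class."""

    def __init__(self):
        # The warning fires on instantiation, not when the module is imported.
        warnings.warn(
            "LegacyDocument is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )


def internal_startup():
    """Simulate a startup path that references the class without using it."""
    return LegacyDocument


with warnings.catch_warnings():
    # Turn warnings into exceptions, similar to running Python with -W error.
    warnings.simplefilter("error")

    internal_startup()  # no instantiation, so nothing is raised

    try:
        LegacyDocument()
        raised = False
    except DeprecationWarning:
        raised = True

print(raised)
```

Referencing or importing the class stays silent, while code that actually constructs it still sees the deprecation.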
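
The plain-text fallback for sections() splits on <PAGE> markers or blank lines. A rough sketch of that kind of chunking is shown here; `chunk_plain_text` is a hypothetical helper, and the real fallback may differ, for example in how it combines the two rules.

```python
import re


def chunk_plain_text(text):
    """Split plain filing text into chunks.

    Assumption for this sketch: prefer <PAGE> markers when any are present,
    otherwise split on blank lines. Empty chunks are dropped.
    """
    if "<PAGE>" in text:
        parts = text.split("<PAGE>")
    else:
        parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


paged = chunk_plain_text("COVER<PAGE>\nITEM 1\n\nBusiness<PAGE>ITEM 2")
unpaged = chunk_plain_text("ITEM 1\n\nBusiness\n \nITEM 2")
print(paged)
print(unpaged)
```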
Documentation
- calculation_linkbase() and Statement.extension_arcs() documented alongside Phase 1 and Phase 2 of the GH #766 implementation, including the difference from the presentation linkbase and worked examples on real filings. (#766, Phase 3)
Contributors: @HonzaCuhel, @kevinchiu, @0ywfe
Full changelog: v5.31.5...v5.32.0