
👌 Fix quadratic complexity in fragments_join / text_join#389

Merged
chrisjsewell merged 2 commits into executablebooks:master from petricevich:fragments_join_worst_case_n_squared_fix
May 6, 2026
@petricevich petricevich commented May 4, 2026

Optimize adjacent-token joining in both inline cleanup stages by replacing repeated pairwise string concatenation with a single "".join(...) over each contiguous run.

Details

  • fragments_join merges adjacent text tokens left behind after emphasis/strikethrough post-processing and recalculates token levels
  • text_join converts text_special tokens to text and performs the final adjacent-text merge in the inline token stream

Both rules previously rebuilt growing strings incrementally, which can become quadratic for long runs.
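As a rough illustration of why incremental rebuilding can go quadratic, here is a small cost model that counts characters written under each strategy. It is only a sketch: the function names are hypothetical, and it models string copying rather than measuring the parser itself.

```python
def chars_copied_pairwise(fragments: list[str]) -> int:
    """Model `acc = acc + frag`: every step writes a new string
    holding the whole prefix plus the next fragment."""
    copied = 0
    prefix_len = 0
    for i, frag in enumerate(fragments):
        prefix_len += len(frag)
        if i > 0:
            copied += prefix_len  # the new string has prefix_len characters
    return copied


def chars_copied_join(fragments: list[str]) -> int:
    """Model `"".join(fragments)`: each character is written once."""
    return sum(len(frag) for frag in fragments)


fragments = ["ab"] * 4
print(chars_copied_pairwise(fragments))  # 18 (4 + 6 + 8)
print(chars_copied_join(fragments))      # 8
```

With k equal-sized fragments the pairwise count grows roughly with k squared, while the join count grows linearly with the total length.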

Why

Tested on an adversarial ~190 KB document with ~30k intraword underscores on a single line. With tracemalloc running:

| | render time | peak Python alloc |
|---|---|---|
| before | 2.2s | 4476 MB |
| after | 0.6s | 23 MB |

It's not just a contrived attack input - this kind of thing also shows up naturally in Markdown produced by OCR pipelines, where tables of identifiers / references can easily contain very long runs of underscores or other delimiter characters.
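A measurement of this kind could be reproduced with something like the harness below. The exact test document is not included in this PR, so `build_adversarial_line` is only a guess at its shape (one line, about 30k intraword underscores, roughly 180 KB), and both helper names are hypothetical.

```python
import time
import tracemalloc
from typing import Callable


def build_adversarial_line(n_underscores: int = 30_000, word: str = "abcde") -> str:
    """One long line of short words separated by intraword underscores.
    Assumed shape only; the PR does not publish its actual input."""
    return "_".join([word] * (n_underscores + 1))


def measure(render: Callable[[str], str], source: str) -> tuple[float, int]:
    """Return (elapsed seconds, peak traced bytes) for one render call."""
    tracemalloc.start()
    try:
        start = time.perf_counter()
        render(source)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak


doc = build_adversarial_line()
# With markdown-it-py installed, the renderer could be passed in like this:
#   from markdown_it import MarkdownIt
#   elapsed, peak = measure(MarkdownIt().render, doc)
```

Note that tracing allocations slows rendering down, so timings taken with `tracemalloc` running are best compared only against each other.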

Tests

Added focused tests for both rules:

  • fragments_join: verifies raw adjacent text fragments remain when both join stages are disabled, and that fragments_join alone collapses them when text_join is disabled
  • text_join: verifies escaped characters remain as multiple text_special tokens when text_join is disabled, and are converted and merged into a single text token when enabled

Result

No behavioral change in parser output, with less unnecessary work when joining long runs of adjacent tokens.

When emphasis/strikethrough postprocessing leaves a long run of adjacent
text tokens (e.g. unmatched intraword `_` delimiters), fragments_join
merged them via pairwise `a + b` concatenation. Each step rebuilds the
growing prefix, costing O(L*k) for a run of k tokens with total content
length L.

Walk the whole run once, collect content into a list, and "".join into
the last token, making the work O(L). The kept token is still the last
in the run so its non-content attributes (markup, etc.) are preserved.
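The single-pass approach described in this commit message can be sketched roughly as follows. This is not the library's code: `Tok` is a hypothetical stand-in for markdown-it-py's token class, `join_text_runs` is an illustrative name, and the level recalculation that fragments_join also performs is left out.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical minimal token; the real Token has many more fields."""
    type: str
    content: str = ""
    markup: str = ""


def join_text_runs(tokens: list[Tok]) -> list[Tok]:
    """Collapse each run of adjacent text tokens into the run's last token."""
    out: list[Tok] = []
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i].type != "text":
            out.append(tokens[i])
            i += 1
            continue
        # Find the last token of the contiguous text run starting at i.
        j = i
        while j + 1 < n and tokens[j + 1].type == "text":
            j += 1
        if j > i:
            # One join over the whole run instead of repeated a + b.
            tokens[j].content = "".join(t.content for t in tokens[i : j + 1])
        out.append(tokens[j])  # keep the last token, so its markup survives
        i = j + 1
    return out


stream = [Tok("text", "a"), Tok("text", "_"), Tok("text", "b", markup="x"),
          Tok("em_open"), Tok("text", "c")]
merged = join_text_runs(stream)
print([(t.type, t.content) for t in merged])
# [('text', 'a_b'), ('em_open', ''), ('text', 'c')]
```

Keeping the last token of the run, rather than the first, mirrors the choice described above for preserving non-content attributes.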
codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.83%. Comparing base (8933147) to head (4a89f1d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #389      +/-   ##
==========================================
+ Coverage   95.80%   95.83%   +0.02%     
==========================================
  Files          64       64              
  Lines        3457     3481      +24     
==========================================
+ Hits         3312     3336      +24     
  Misses        145      145              
| Flag | Coverage Δ | |
|---|---|---|
| pytests | 95.83% <100.00%> | (+0.02%) ⬆️ |

@chrisjsewell
Member

Thanks, will double check soon, but sounds good in principle

Avoid quadratic string concatenation in text_join by collapsing runs of
adjacent text-like tokens with a single "".join(...), matching the fix
applied to fragments_join.

Add tests that distinguish the responsibilities of the two rules:

- fragments_join is an inline post-processing rule that runs after
  emphasis/strikethrough resolution. It merges adjacent text tokens left
  behind by delimiter processing and recalculates token nesting levels.
- text_join is a later core rule that converts text_special tokens to
  text and performs a final adjacent-text merge across inline children.

The new tests verify both behaviors independently:
- disabling both fragments_join and text_join preserves the raw text
  fragments produced by emphasis delimiter handling
- disabling only text_join shows fragments_join collapsing those text
  fragments
- disabling text_join preserves multiple text_special tokens from escape
  handling
- enabling text_join converts and merges those text_special tokens into
  a single text token

This keeps the token stream compact in both stages without changing
observable parsing behavior.
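For the text_join side, the two steps described in this commit message (convert `text_special` to `text`, then merge adjacent text) might look roughly like this. Again a sketch under assumptions: `Tok` is a hypothetical stand-in token and `text_join_sketch` is an illustrative name, not the rule as implemented in the library.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Tok:
    """Hypothetical minimal token; the real Token has many more fields."""
    type: str
    content: str = ""


def text_join_sketch(children: list[Tok]) -> list[Tok]:
    """Turn text_special into text, then merge each adjacent text run."""
    # Step 1: text_special tokens become ordinary text tokens.
    for tok in children:
        if tok.type == "text_special":
            tok.type = "text"

    # Step 2: merge every contiguous text run with a single join.
    out: list[Tok] = []
    for is_text, group in groupby(children, key=lambda t: t.type == "text"):
        run = list(group)
        if is_text:
            last = run[-1]
            last.content = "".join(t.content for t in run)
            out.append(last)
        else:
            out.extend(run)
    return out


children = [Tok("text_special", "*"), Tok("text_special", "_"), Tok("text", "x"),
            Tok("softbreak"), Tok("text_special", "#")]
print([(t.type, t.content) for t in text_join_sketch(children)])
# [('text', '*_x'), ('softbreak', ''), ('text', '#')]
```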
@chrisjsewell chrisjsewell changed the title 👌 fix quadratic complexity in fragments_join 👌 fix quadratic complexity in fragments_join / text_join May 6, 2026
@chrisjsewell
Member

I added an extra commit 4a89f1d and updated the PR title / description, hope that's all good 😅

@chrisjsewell chrisjsewell changed the title 👌 fix quadratic complexity in fragments_join / text_join 👌 Fix quadratic complexity in fragments_join / text_join May 6, 2026
@chrisjsewell chrisjsewell merged commit d4ea0ca into executablebooks:master May 6, 2026
15 checks passed