👌 Fix quadratic complexity in `fragments_join` / `text_join` by petricevich · Pull Request #389 · executablebooks/markdown-it-py

petricevich · 2026-05-04T13:16:31Z

Optimize adjacent-token joining in both inline cleanup stages by replacing repeated pairwise string concatenation with a single "".join(...) over each contiguous run.

Details

- `fragments_join` merges adjacent text tokens left behind after emphasis/strikethrough post-processing and recalculates token levels.
- `text_join` converts `text_special` tokens to `text` and performs the final adjacent-text merge in the inline token stream.

Both rules previously rebuilt growing strings incrementally, which can become quadratic for long runs.
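As a rough cost estimate for the claim above, assume a run of $k$ adjacent text tokens with content lengths $\ell_1, \dots, \ell_k$ and total length $L = \sum_i \ell_i$ (the symbol $\ell_i$ is introduced here only for illustration). Step $i$ of a pairwise merge copies the first $i$ fragments, while a single join copies each character once:

$$
\underbrace{\sum_{i=2}^{k} \sum_{j=1}^{i} \ell_j}_{\text{pairwise}} \;\le\; (k-1)\,L \;=\; O(L \cdot k),
\qquad
\underbrace{\sum_{i=1}^{k} \ell_i}_{\text{single join}} \;=\; L \;=\; O(L).
$$

With equal-length fragments ($\ell_i = L/k$) the pairwise total is $\frac{L}{k}\left(\frac{k(k+1)}{2} - 1\right) \approx \frac{Lk}{2}$, so the bound is tight up to a constant factor.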

Why

Tested on an adversarial ~190 KB document with ~30k intraword underscores on a single line. With tracemalloc running:

| | render time | peak Python alloc |
|---|---|---|
| before | 2.2 s | 4476 MB |
| after | 0.6 s | 23 MB |
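The shape of this measurement can be sketched with the standard library alone. The snippet below does not call markdown-it-py; `Tok`, `make_run`, `merge_pairwise`, `merge_join`, and `measure` are hypothetical stand-ins that only model the two merge strategies. It assumes that the pairwise strategy leaves the intermediate tokens (and their growing prefixes) alive until the merge finishes, which would be one plausible explanation for the high peak allocation.

```python
import time
import tracemalloc
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for an inline token; only `content` matters here."""
    content: str


def make_run(k: int) -> list[Tok]:
    # Alternate word fragments and lone underscores, like unmatched intraword `_`.
    return [Tok("_" if i % 2 else "word") for i in range(k)]


def merge_pairwise(tokens: list[Tok]) -> str:
    # Each step stores prefix + next fragment in the next token. Earlier tokens
    # stay in the list, so every intermediate prefix is kept alive.
    for i in range(len(tokens) - 1):
        tokens[i + 1].content = tokens[i].content + tokens[i + 1].content
    return tokens[-1].content


def merge_join(tokens: list[Tok]) -> str:
    # Collect all fragments once and build the final string in a single pass.
    tokens[-1].content = "".join(t.content for t in tokens)
    return tokens[-1].content


def measure(merge, k: int) -> tuple[str, float, int]:
    """Return (merged string, elapsed seconds, peak traced bytes) for one merge."""
    tokens = make_run(k)
    tracemalloc.start()
    start = time.perf_counter()
    result = merge(tokens)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    for merge in (merge_pairwise, merge_join):
        _, elapsed, peak = measure(merge, 2000)
        print(f"{merge.__name__}: {elapsed:.4f}s, peak {peak / 1024:.0f} KiB")
```

Absolute numbers from a toy like this will not match the table; only the growth pattern (peak memory and time rising much faster for the pairwise variant as `k` grows) is the point.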

It's not just a contrived attack input - this kind of thing also shows up naturally in Markdown produced by OCR pipelines, where tables of identifiers / references can easily contain very long runs of underscores or other delimiter characters.

Tests

Added focused tests for both rules:

- `fragments_join`: verifies raw adjacent text fragments remain when both join stages are disabled, and that `fragments_join` alone collapses them when `text_join` is disabled.
- `text_join`: verifies escaped characters remain as multiple `text_special` tokens when `text_join` is disabled, and are converted and merged into a single `text` token when enabled.

Result

No behavioral change in parser output, with less unnecessary work when joining long runs of adjacent tokens.

When emphasis/strikethrough post-processing leaves a long run of adjacent text tokens (e.g. unmatched intraword `_` delimiters), fragments_join merged them via pairwise `a + b` concatenation. Each step rebuilds the growing prefix, costing O(L*k) for a run of k tokens with total content length L. Instead, walk the whole run once, collect the content into a list, and `"".join` it into the last token, making the work O(L). The kept token is still the last in the run, so its non-content attributes (markup, etc.) are preserved.
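A minimal sketch of the single-pass approach described in this commit message, not the actual diff from the PR. `Tok` is a hypothetical, stripped-down token type and `join_fragments` is an illustrative name; the real rule operates on the library's own token and state objects.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical minimal token; the real token class has more fields."""
    type: str
    content: str = ""
    markup: str = ""
    nesting: int = 0  # 1 = opening tag, -1 = closing tag, 0 = self-contained
    level: int = 0


def join_fragments(tokens: list[Tok]) -> None:
    """Merge each run of adjacent text tokens in place and recompute levels."""
    level = 0
    out: list[Tok] = []
    i, n = 0, len(tokens)
    while i < n:
        tok = tokens[i]
        if tok.type == "text":
            # Find the end of the contiguous run of text tokens.
            j = i
            while j + 1 < n and tokens[j + 1].type == "text":
                j += 1
            if j > i:
                # Keep the LAST token of the run so its other attributes survive,
                # and build its content with one join instead of repeated `+`.
                keep = tokens[j]
                keep.content = "".join(t.content for t in tokens[i : j + 1])
                tok = keep
            i = j
        # Recalculate nesting levels as tokens are emitted.
        if tok.nesting < 0:
            level -= 1
        tok.level = level
        if tok.nesting > 0:
            level += 1
        out.append(tok)
        i += 1
    tokens[:] = out


if __name__ == "__main__":
    stream = [
        Tok("text", "foo"),
        Tok("text", "_"),
        Tok("text", "bar"),
        Tok("em_open", markup="*", nesting=1),
        Tok("text", "x"),
        Tok("em_close", markup="*", nesting=-1),
    ]
    join_fragments(stream)
    for t in stream:
        print(t.type, repr(t.content), t.level)
```

Keeping the last token rather than the first mirrors the note above about preserving non-content attributes; a sketch that kept the first token would need to copy those attributes over explicitly.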

codecov · 2026-05-04T13:40:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.83%. Comparing base (8933147) to head (4a89f1d).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #389      +/-   ##
==========================================
+ Coverage   95.80%   95.83%   +0.02%     
==========================================
  Files          64       64              
  Lines        3457     3481      +24     
==========================================
+ Hits         3312     3336      +24     
  Misses        145      145

| Flag | Coverage Δ | |
|---|---|---|
| pytests | `95.83% <100.00%> (+0.02%)` | ⬆️ |


chrisjsewell · 2026-05-04T20:40:23Z

Thanks, will double check soon, but sounds good in principle

Avoid quadratic string concatenation in text_join by collapsing runs of adjacent text-like tokens with a single "".join(...), matching the fix applied to fragments_join.

Add tests that distinguish the responsibilities of the two rules:

- fragments_join is an inline post-processing rule that runs after emphasis/strikethrough resolution. It merges adjacent text tokens left behind by delimiter processing and recalculates token nesting levels.
- text_join is a later core rule that converts text_special tokens to text and performs a final adjacent-text merge across inline children.

The new tests verify both behaviors independently:

- disabling both fragments_join and text_join preserves the raw text fragments produced by emphasis delimiter handling
- disabling only text_join shows fragments_join collapsing those text fragments
- disabling text_join preserves multiple text_special tokens from escape handling
- enabling text_join converts and merges those text_special tokens into a single text token

This keeps the token stream compact in both stages without changing observable parsing behavior.
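The two steps attributed to text_join in this commit message (convert `text_special` to `text`, then merge adjacent text) can be sketched as follows. This is an illustration under simplifying assumptions, not the code from the PR: `Tok`, `join_text`, and `_flush` are hypothetical names, and the function works on a plain list of children rather than the library's state object.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical minimal inline child token."""
    type: str
    content: str = ""


def _flush(run: list[Tok], out: list[Tok]) -> None:
    # Emit one token for the pending run, joining its contents exactly once.
    if run:
        keep = run[-1]
        if len(run) > 1:
            keep.content = "".join(t.content for t in run)
        out.append(keep)
        run.clear()


def join_text(children: list[Tok]) -> list[Tok]:
    """Convert text_special tokens to text, then merge adjacent text tokens."""
    # Step 1: escape handling leaves text_special tokens; treat them as text.
    for tok in children:
        if tok.type == "text_special":
            tok.type = "text"

    # Step 2: buffer each contiguous run of text tokens and flush it when a
    # non-text token (or the end of the stream) is reached.
    out: list[Tok] = []
    run: list[Tok] = []
    for tok in children:
        if tok.type == "text":
            run.append(tok)
        else:
            _flush(run, out)
            out.append(tok)
    _flush(run, out)
    return out


if __name__ == "__main__":
    children = [
        Tok("text_special", "*"),
        Tok("text_special", "_"),
        Tok("text", "a"),
        Tok("softbreak"),
        Tok("text", "b"),
    ]
    for t in join_text(children):
        print(t.type, repr(t.content))
```

Buffering the run and flushing it at a boundary is one of several equivalent ways to reach a single join per run; scanning ahead for the run's end index would work just as well.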

chrisjsewell · 2026-05-06T14:51:04Z

I added an extra commit 4a89f1d and updated the PR title / description, hope that's all good 😅

chrisjsewell changed the title ~~👌 fix quadratic complexity in fragments_join~~ 👌 fix quadratic complexity in fragments_join / text_join May 6, 2026

chrisjsewell changed the title ~~👌 fix quadratic complexity in fragments_join / text_join~~ 👌 Fix quadratic complexity in fragments_join / text_join May 6, 2026

chrisjsewell approved these changes May 6, 2026


chrisjsewell merged commit d4ea0ca into executablebooks:master May 6, 2026
15 checks passed

chrisjsewell merged 2 commits into executablebooks:master from petricevich:fragments_join_worst_case_n_squared_fix
