Skip to content

fixed bug where MarkdownHeaderSplitter's split result missed the first direct parent's header in the metadata and added lark to pyproject.toml#11042

Merged
sjrl merged 17 commits intodeepset-ai:mainfrom
MechaCritter:bugfix/markdown_header_splitter
Apr 9, 2026
Merged

fixed bug where MarkdownHeaderSplitter's split result missed the first direct parent's header in the metadata and added lark to pyproject.toml#11042
sjrl merged 17 commits intodeepset-ai:mainfrom
MechaCritter:bugfix/markdown_header_splitter

Conversation

@MechaCritter
Copy link
Copy Markdown
Contributor

@MechaCritter MechaCritter commented Apr 6, 2026

Related Issues

Proposed Changes:

The error that occured in the issue #11041 happens when a parent header has its own content chunk before the first
child. In that path the code clears active_parents, so the first child loses its
ancestor while later siblings inherit it correctly.

The fix is to derive parent_headers from the current header_stack instead of
carrying a separate mutable “active parents” list that gets out of sync after contentful
headers.

How did you test it?

Tested locally using:

hatch run test:unit    # 13 failed, 4926 passed, 4 skipped, 356 deselected, 10 warnings
hatch run test:type    # Success: no issues found in 334 source files
hatch run fmt    #All checks passed!

Notes for the reviewer

You can copy the haystack/components/preprocessors/markdown_header_splitter.py file and rerun the code snippet provided in issue #11041. The bug does not occur anymore afterward.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

…t direct parent's header in the metadata and added lark to pyproject.toml
@MechaCritter MechaCritter requested a review from a team as a code owner April 6, 2026 14:09
@MechaCritter MechaCritter requested review from sjrl and removed request for a team April 6, 2026 14:09
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 6, 2026

CLA assistant check
All committers have signed the CLA.

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 7, 2026

@MechaCritter is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@sjrl
Copy link
Copy Markdown
Contributor

sjrl commented Apr 9, 2026

@MechaCritter thanks for opening a PR! I see you are missing tests for the change. Could you add a test based on your example showing how your changes fix the issue.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Apr 9, 2026
@MechaCritter
Copy link
Copy Markdown
Contributor Author

MechaCritter commented Apr 9, 2026

@MechaCritter thanks for opening a PR! I see you are missing tests for the change. Could you add a test based on your example showing how your changes fix the issue.

I added the corresponding test. Here are the corresponding log files as proof that my fix works:

Success run on bugfix branch:
bugfix_branch.log

Failed run on current main branch:
main_branch.log

I ran all other tests again for the markdown_header_splitter (test_markdown_header_splitter.py). Everything passed, so no break changes were introduced.
all_test_bugbranch.log

@MechaCritter MechaCritter requested a review from sjrl April 9, 2026 12:25
MechaCritter and others added 3 commits April 9, 2026 14:32
…derSplitter-b5db96e19011b6b9.yaml


written description in past tense

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…derSplitter-b5db96e19011b6b9.yaml


written description in past tense

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…derSplitter-b5db96e19011b6b9.yaml


written description in past tense

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
@sjrl
Copy link
Copy Markdown
Contributor

sjrl commented Apr 9, 2026

@MechaCritter a few more minor comments but otherwise looks good! The logic is much easier to understand the previous as well!

MechaCritter and others added 4 commits April 9, 2026 15:49
…derSplitter-b5db96e19011b6b9.yaml

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
removed the "secondary_split="word"" argument as error happens regardless of if secondary split is present

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
broke down text in unit test for readability

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
added sanity checking (reconstructed text is equal original text)

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@sjrl sjrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
haystack-docs Ignored Ignored Apr 9, 2026 1:58pm

Request Review

@MechaCritter
Copy link
Copy Markdown
Contributor Author

I'm actually a bit confused: the CI pipeline failed a mypy test, which has nothing to do with my changes. @sjrl could you help me out?

@sjrl
Copy link
Copy Markdown
Contributor

sjrl commented Apr 9, 2026

Hey @MechaCritter yeah it was related to a new release of sentence transformers. Fix incoming here #11066 once merged we can update your branch

@sjrl sjrl enabled auto-merge (squash) April 9, 2026 14:30
@coveralls
Copy link
Copy Markdown
Collaborator

Coverage Report for CI Build 24196334786

Coverage decreased (-0.002%) to 92.841%

Details

  • Coverage decreased (-0.002%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • 6 coverage regressions across 1 file.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

6 previously-covered lines in 1 file lost coverage.

File Lines Losing Coverage Coverage
components/preprocessors/markdown_header_splitter.py 6 96.3%

Coverage Stats

Coverage Status
Relevant Lines: 17154
Covered Lines: 15926
Line Coverage: 92.84%
Coverage Strength: 0.93 hits per line

💛 - Coveralls

@sjrl sjrl merged commit a545c6a into deepset-ai:main Apr 9, 2026
20 of 21 checks passed
@MechaCritter MechaCritter deleted the bugfix/markdown_header_splitter branch April 9, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: MarkdownHeaderSplitter's split result misses the direct parents header in the parent_headers` key in the metadata of the chunk

4 participants