docs: Add MarkdownHeaderSplitter docs #10562

Merged: julian-risch merged 5 commits into main from add-markdownheadersplitter-docs on Feb 11, 2026

@julian-risch (Member) commented Feb 11, 2026

Related Issues

Proposed Changes:

  • Add new docs page for MarkdownHeaderSplitter component
  • Fix GitHub link for HierarchicalDocumentSplitter by pointing to main branch instead of commit hash
  • Add new row for MarkdownHeaderSplitter in docs-website/docs/pipeline-components/preprocessors.mdx
  • Add MarkdownHeaderSplitter to preprocessors' init

How did you test it?

  • Ran code examples locally

Notes for the reviewer

I addressed most of the Vale comments. I chose to leave the remaining comments on the use of parentheses as is.
Similar to the docs PR by Stefano here, I will need to copy the docs updates to the version-2.24-unstable docs folder.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@vercel (bot) commented Feb 11, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| haystack-docs | Ready | Preview, Comment | Feb 11, 2026 11:32am |

@julian-risch added the ignore-for-release-notes label ("PRs with this flag won't be included in the release notes.") Feb 11, 2026

title: "MarkdownHeaderSplitter"
id: markdownheadersplitter
slug: "/markdownheadersplitter"
description: "Split documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata."

📝 [vale] reported by reviewdog 🐶
[Google.Parens] Use parentheses judiciously.

The `MarkdownHeaderSplitter` processes text documents by:

- Splitting them into chunks at ATX-style Markdown headers (`#`, `##`, …, `######`), preserving header hierarchy as metadata.

📝 [vale] reported by reviewdog 🐶
[Google.Parens] Use parentheses judiciously.

| [DocumentPreprocessor](preprocessors/documentpreprocessor.mdx) | Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning. |
| [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. |
| [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. |
| [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. |

📝 [vale] reported by reviewdog 🐶
[Google.Parens] Use parentheses judiciously.

@julian-risch marked this pull request as ready for review February 11, 2026 10:43
@julian-risch requested a review from a team as a code owner February 11, 2026 10:43
@julian-risch requested review from davidsbatista and removed request for a team February 11, 2026 10:43
@julian-risch added this to the 2.24.0 milestone Feb 11, 2026

# MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (`#`, `##`, and so on), with optional secondary splitting. Header hierarchy is preserved as metadata on each chunk.

📝 [vale] reported by reviewdog 🐶
[Google.Parens] Use parentheses judiciously.

The `MarkdownHeaderSplitter` processes text documents by:

- Splitting them into chunks at ATX-style Markdown headers (`#`, `##`, …, `######`), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk using Haystack's [`DocumentSplitter`](documentsplitter.mdx).

📝 [vale] reported by reviewdog 🐶
[Google.Parens] Use parentheses judiciously.

- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.

Only ATX-style headers are recognized (e.g. `# Title`). Setext-style headers (`Underline with ===`) aren't supported.

📝 [vale] reported by reviewdog 🐶
[Google.Parens] Use parentheses judiciously.

🚫 [vale] reported by reviewdog 🐶
[Google.Latin] Use 'for example' instead of 'e.g. '.
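
For readers of this thread, the header-based splitting described in the docs excerpt above can be sketched in plain Python. This is a minimal illustration, not Haystack's implementation: the function name `split_at_atx_headers` and the metadata keys `header` and `parent_headers` are hypothetical, and the sketch ignores details such as `#` lines inside fenced code blocks and secondary splitting.

```python
import re

# ATX header: 1 to 6 '#' characters, at least one space or tab, then the title.
ATX_HEADER = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")


def split_at_atx_headers(text):
    """Split Markdown text into chunks at ATX headers (illustrative only).

    Each chunk is a dict with hypothetical keys:
    "content", "header", and "parent_headers".
    """
    chunks = []
    stack = []      # (level, title) pairs for the headers currently in scope
    header = None   # title of the header that owns the current chunk
    parents = []    # titles of the headers above the current one
    body = []       # lines collected for the current chunk

    def flush():
        # Emit the collected lines as one chunk, skipping empty chunks.
        content = "\n".join(body).strip()
        if content:
            chunks.append(
                {"content": content, "header": header, "parent_headers": list(parents)}
            )

    for line in text.splitlines():
        match = ATX_HEADER.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        body.clear()
        level, title = len(match.group(1)), match.group(2)
        # Drop headers at the same or a deeper level; what remains are the parents.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parents = [t for _, t in stack]
        stack.append((level, title))
        header = title
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Guide\nIntro text.\n## Install\nRun pip.\nTitle\n===\n"
    for chunk in split_at_atx_headers(sample):
        print(chunk)
```

In this sketch the Setext-style `Title` / `===` pair is not treated as a header and stays inside the preceding chunk, which mirrors the limitation stated in the docs text.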

@anakin87 (Member) commented

If this is meant to be included in 2.24, we should also add the doc page to version-2.24-unstable docs folder.

See #10556 (comment) and the recently updated release guide for an explanation.

@anakin87 (Member) left a comment

I left two minor comments

@anakin87 (Member) left a comment

Looks good!

@julian-risch enabled auto-merge (squash) February 11, 2026 11:31

@davidsbatista (Contributor) commented

I think the preprocessors.mdx in version-2.24-unstable also needs to be updated, no?

@anakin87 (Member) commented

> I think the preprocessors.mdx in version-2.24-unstable also needs to be updated, no?

It was done in #10561

@julian-risch merged commit c1b1239 into main Feb 11, 2026
23 checks passed
@julian-risch deleted the add-markdownheadersplitter-docs branch February 11, 2026 11:40

@julian-risch (Member, Author) commented

> I think the preprocessors.mdx in version-2.24-unstable also needs to be updated, no?

#10563

julian-risch added a commit that referenced this pull request Feb 11, 2026
* add markdownheadersplitter docs

* add to preprocessors init

* vale

* add copy to 2.24-unstable

* removed document cleaner
kacperlukawski pushed a commit that referenced this pull request Feb 12, 2026
* add markdownheadersplitter docs

* add to preprocessors init

* vale

* add copy to 2.24-unstable

* removed document cleaner
Development

Successfully merging this pull request may close these issues.

Add documentation page for new MarkdownHeaderSplitter
