docs: Add MarkdownHeaderSplitter docs #10562
title: "MarkdownHeaderSplitter"
id: markdownheadersplitter
slug: "/markdownheadersplitter"
description: "Split documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata."
📝 [vale] reported by reviewdog 🐶
[Google.Parens] Use parentheses judiciously.
docs-website/docs/pipeline-components/preprocessors/markdownheadersplitter.mdx
| [DocumentPreprocessor](preprocessors/documentpreprocessor.mdx) | Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning. |
| [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. |
| [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. |
| [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. |
# MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (`#`, `##`, and so on), with optional secondary splitting. Header hierarchy is preserved as metadata on each chunk.

The `MarkdownHeaderSplitter` processes text documents by:

- Splitting them into chunks at ATX-style Markdown headers (`#`, `##`, …, `######`), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk using Haystack's [`DocumentSplitter`](documentsplitter.mdx).
- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.

Only ATX-style headers are recognized (e.g. `# Title`). Setext-style headers (`Underline with ===`) aren't supported.
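For readers of this thread, here is a minimal standard-library sketch of what header-based splitting with hierarchy metadata can look like. It is not the Haystack implementation and does not use the component's API. The function name `split_at_atx_headers` and the keys `header`, `parent_headers`, and `content` are hypothetical and chosen only for illustration. The sketch ignores text before the first header and does not treat fenced code blocks specially.

```python
import re

# ATX header: one to six '#' characters, whitespace, then the title text.
ATX_HEADER = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_at_atx_headers(text):
    """Split Markdown text into chunks at ATX headers and record parent headers.

    Illustrative only: the returned dict keys are hypothetical, not Haystack's.
    """
    chunks = []
    stack = []  # (level, title) pairs for the headers currently in scope
    current = None
    for line in text.splitlines():
        match = ATX_HEADER.match(line)
        if match:
            if current is not None:
                chunks.append(current)
            level, title = len(match.group(1)), match.group(2)
            # Pop headers at the same or a deeper level; the rest are parents.
            while stack and stack[-1][0] >= level:
                stack.pop()
            current = {
                "header": title,
                "parent_headers": [t for _, t in stack],
                "content": [],
            }
            stack.append((level, title))
        elif current is not None:
            current["content"].append(line)
        # Lines before the first header are dropped in this sketch.
    if current is not None:
        chunks.append(current)
    for chunk in chunks:
        chunk["content"] = "\n".join(chunk["content"]).strip()
    return chunks


sample = (
    "# Guide\nIntro text.\n"
    "## Install\nRun pip.\n"
    "### Linux\nUse apt.\n"
    "## Usage\nCall it.\n"
)
for chunk in split_at_atx_headers(sample):
    print(chunk["parent_headers"], chunk["header"], repr(chunk["content"]))
```

Because the pattern only matches lines that start with `#`, a Setext-style title (text underlined with `===`) produces no chunk here, which mirrors the limitation stated above.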
🚫 [vale] reported by reviewdog 🐶
[Google.Latin] Use 'for example' instead of 'e.g. '.
If this is meant to be included in 2.24, we should also add the doc page to See #10556 (comment) and the recently updated release guide for an explanation.
...oned_docs/version-2.24-unstable/pipeline-components/preprocessors/markdownheadersplitter.mdx
I think the
It was done in #10561
* add markdownheadersplitter docs
* add to preprocessors init
* vale
* add copy to 2.24-unstable
* removed document cleaner
Related Issues
Proposed Changes:
How did you test it?
Notes for the reviewer
I addressed most of the vale comments. I chose to leave the remaining comments on the use of parentheses as is.
Similar to the docs PR by Stefano here, I will need to copy the docs updates to the version-2.24-unstable docs folder.
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.