#### How to split by HTML Header

HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and attaches metadata for each header that applies to a given chunk. It can return chunks element by element or combine elements that share the same metadata, with two objectives: (a) keeping related text grouped (more or less) semantically and (b) preserving the context-rich information encoded in the document structure. It can be used with other text splitters as part of a chunking pipeline.
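To make the idea of header-derived metadata concrete, the following is a minimal, standard-library-only sketch of the technique. It is not LangChain's implementation: the names `HeaderTracker` and `merge_same_metadata` are hypothetical, and the sketch assumes flat markup in which every header tag has the form `h1` to `h6` and text sits directly inside its element. It records which headers are "in scope" for each piece of text, then optionally merges adjacent pieces that share the same metadata.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Hypothetical helper: collect (metadata, text) pairs, one per text node."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}.
        self.header_names = dict(headers_to_split_on)
        self.active = {}          # header tag -> header text currently in scope
        self.current_tag = None   # innermost tag most recently opened
        self.chunks = []          # list of (metadata dict, text) tuples

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between elements
        tag = self.current_tag
        if tag in self.header_names:
            # A new header drops headers at the same or a deeper level.
            # Assumption: header tags look like "h1".."h6".
            level = int(tag[1])
            self.active = {
                t: v for t, v in self.active.items() if int(t[1]) < level
            }
            self.active[tag] = text
        metadata = {
            self.header_names[t]: v for t, v in sorted(self.active.items())
        }
        self.chunks.append((metadata, text))


def merge_same_metadata(chunks):
    """Join adjacent chunks whose metadata dictionaries are equal."""
    merged = []
    for metadata, text in chunks:
        if merged and merged[-1][0] == metadata:
            merged[-1] = (metadata, merged[-1][1] + "\n" + text)
        else:
            merged.append((metadata, text))
    return merged


sample_html = (
    "<h1>Foo</h1><p>Intro to Foo.</p>"
    "<h2>Bar</h2><p>About Bar.</p><p>More about Bar.</p>"
    "<h2>Baz</h2><p>About Baz.</p>"
)

tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2")])
tracker.feed(sample_html)

element_chunks = tracker.chunks                      # one chunk per element
merged_chunks = merge_same_metadata(element_chunks)  # grouped by metadata

for metadata, text in merged_chunks:
    print(metadata, repr(text))
```

In this sketch, the text under "Baz" carries `Header 1: Foo` and `Header 2: Baz` but no trace of "Bar", because a new `h2` replaces the previous one. The real splitter parses the HTML properly and handles nesting and inline markup, which this sketch does not attempt.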

In [1]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main sections</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo.</p>
        </div>
    </body>
</html>
"""

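# (tag, metadata key) pairs: the header tags to split on and the key
# under which each header's text is stored in a chunk's metadata.
headers_to_split_on = [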
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main sections'}, page_content='Bar main sections'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main sections'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main sections', 'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main sections', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main sections', 'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main sections', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'

In [4]:
url = "https://artificialanalysis.ai/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
# Fetch the page at `url` and split its HTML by the headers above.
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits


[Document(metadata={}, page_content='$! /$ $ /$ $! /$  \nFollow us on Twitter or LinkedIn to stay up to date with future analysis  \nArtificial Analysis  \nInsights Login  \nArtificial Analysis  \nLanguage Models  \nSpeech, Image, Video  \nHardware  \nLeaderboards  \nAI Trends  \nMicroEvals  \nBeta  \nArenas  \nArticles  \nAbout  \nInsights Login'),
 Document(metadata={'Header 2': 'analysis of AI'}, page_content='analysis of AI'),
 Document(metadata={}, page_content="Independent  \nUnderstand the AI landscape to choose the best model and provider for your use case  \nQ1 2025 State of AI Report  \nAnalysis of the AI landscape and the key trends shaping AI  \nAccess Report  \n🇨🇳 State of AI: China Report  \nAccess  \nHighlights  \n$! /$ $! /$  \nIntelligence  \nArtificial Analysis Intelligence Index; Higher is better  \nSpeed  \nOutput Tokens per Second; Higher is better  \nPrice  \nUSD per 1M Tokens; Lower is better  \nHow do OpenAI, Google, Meta & DeepSeek's new models compare?  \nRece