# **HTMLHeaderTextSplitter**

- Author: [ChangJun Lee](https://www.linkedin.com/in/cjleeno1/)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)
  
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/07-HTMLHeaderTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/07-HTMLHeaderTextSplitter.ipynb)


## Overview

> Conceptually similar to `MarkdownHeaderTextSplitter`, <br>
> `HTMLHeaderTextSplitter` is a "structure-aware" chunk generator that splits text at the element level and attaches metadata to each chunk for every header that applies to it.

`HTMLHeaderTextSplitter` can either return one chunk per element or combine elements that share the same metadata. It aims to:

- (a) keep semantically related text grouped together (roughly), and
- (b) preserve the contextual information encoded in the document structure.
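
The "combine elements with the same metadata" mode can be pictured with a small standard-library sketch. This is only an illustration of the idea, not the library's code: the `elements` list and the `combine_by_metadata` helper are hypothetical, and chunks are plain `(text, metadata)` tuples rather than LangChain `Document` objects. In recent versions of `langchain_text_splitters`, the `return_each_element` constructor argument selects between the two modes.

```python
from itertools import groupby

# Hypothetical element-level output: (text, metadata) pairs in document order.
elements = [
    ("Some intro text about Bar.", {"Header 1": "Foo", "Header 2": "Bar"}),
    ("More text about Bar.", {"Header 1": "Foo", "Header 2": "Bar"}),
    ("Some text about Baz.", {"Header 1": "Foo", "Header 2": "Baz"}),
]


def combine_by_metadata(items):
    """Merge consecutive elements whose metadata dicts are equal."""
    combined = []
    for metadata, group in groupby(items, key=lambda item: item[1]):
        merged_text = "\n".join(piece for piece, _ in group)
        combined.append((merged_text, metadata))
    return combined


chunks = combine_by_metadata(elements)
for text, metadata in chunks:
    print(repr(text), metadata)
```

The two "Bar" elements end up in one chunk, while the "Baz" element stays separate because its metadata differs.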

## Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Using HTML String](#using-html-string)
- [RecursiveCharacterTextSplitter](#recursivecharactertextsplitter)
- [Limitations](#limitations)
---

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.


In [25]:
%%capture --no-stderr
%pip install langchain-opentutorial
%pip install lxml


**[Note]** The splitter itself is configured later in this tutorial, in two steps:
- Specify the header tags to split on, together with a name for each, as tuples in the `headers_to_split_on` list.
- Create an `HTMLHeaderTextSplitter` object by passing that list to the `headers_to_split_on` parameter.

In [26]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ]
)

Current environment: windows-py3.11
Release type or date: stable
Installing packages: langsmith, langchain, langchain_core, langchain_community, langchain_text_splitters, langchain_openai...
Successfully installed: langsmith, langchain==0.3.13, langchain_core, langchain_community, langchain_text_splitters, langchain_openai


In [17]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        # "OPENAI_API_KEY": "",
        # "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "HTMLHeaderTextSplitter",
    }
)

Environment variables have been set successfully.


Alternatively, you can set and load `OPENAI_API_KEY` from a `.env` file.

**[Note]** This is only necessary if you haven't already set `OPENAI_API_KEY` in previous steps.

In [27]:
from dotenv import load_dotenv

load_dotenv()

True

## Using HTML String
`HTMLHeaderTextSplitter` is a text splitter that divides HTML documents based on header tags.

**Key Features:**
- Splits text while preserving the `HTML` document structure
- Splits based on header tags like `h1`, `h2`, `h3`, etc.
- Each split section includes header information as metadata

**Use Cases:**
- Useful for processing long `HTML` documents into meaningful sections
- Ideal when you need to chunk text while preserving structural information
- Suitable for header-based document summarization and analysis
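
To make the idea of header metadata concrete, here is a minimal standard-library sketch of header tracking. It is not how `HTMLHeaderTextSplitter` is implemented: the `HeaderTracker` class is hypothetical, it walks the markup in document order with `html.parser`, and it ignores element nesting, so it only approximates the library's behavior on simple, flat HTML. It also assumes `headers_to_split_on` lists tags from the highest level to the lowest.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Collect (text, metadata) pairs, tagging text with the headers seen so far."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.levels = list(self.names)  # tags, assumed ordered from highest level
        self.active = {}  # header name -> text of the header currently in effect
        self.current_tag = None  # header tag being read, if any
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag:
            # A new header replaces its own level and clears all deeper levels.
            index = self.levels.index(self.current_tag)
            for deeper in self.levels[index:]:
                self.active.pop(self.names[deeper], None)
            self.active[self.names[self.current_tag]] = text
        else:
            # Ordinary text: record it with a copy of the active headers.
            self.chunks.append((text, dict(self.active)))


html = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>About Bar.</p><h2>Baz</h2><p>About Baz.</p>"
tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2")])
tracker.feed(html)
tracker.close()
chunks = tracker.chunks

for text, metadata in chunks:
    print(text, metadata)
```

Because the sketch follows document order only, text that appears after a nested section closes would still inherit that section's headers; handling such cases requires looking at the element tree, which is what the real splitter attempts.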




In [19]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),  # Specify the header tag and its name to split on
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# Create an HTMLHeaderTextSplitter object that splits HTML text based on specified headers
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Split the HTML string and store the results in html_header_splits variable
html_header_splits = html_splitter.split_text(html_string)
# Print each chunk's text, followed by its header metadata
for doc in html_header_splits:
    print(f"{doc.page_content}")
    print(f"{doc.metadata}", end="\n=====================\n")


Foo
{}
=====================
Some intro text about Foo.  
Bar main section Bar subsection 1 Bar subsection 2
{'Header 1': 'Foo'}
=====================
Some intro text about Bar.
{'Header 1': 'Foo', 'Header 2': 'Bar main section'}
=====================
Some text about the first subtopic of Bar.
{'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}
=====================
Some text about the second subtopic of Bar.
{'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}
=====================
Baz
{'Header 1': 'Foo'}
=====================
Some text about Baz
{'Header 1': 'Foo', 'Header 2': 'Baz'}
=====================
Some concluding text about Foo
{'Header 1': 'Foo'}
=====================


## RecursiveCharacterTextSplitter
`RecursiveCharacterTextSplitter` is a text splitter that divides text recursively. In this section it is applied to the output of `HTMLHeaderTextSplitter`, so that long sections are cut down to a bounded size while each resulting chunk keeps its header metadata.

This splitter has the following characteristics:
- Splits text using a list of separators (defaults to `["\n\n", "\n", " ", ""]`)
- Tries the separators in order, moving to the next one only for pieces that are still larger than the chunk size
- Produces chunks that do not exceed the maximum character count specified by `chunk_size`
- Lets consecutive chunks overlap by up to the number of characters specified by `chunk_overlap`

This is particularly useful for splitting long texts into meaningful chunks, especially when preparing text for LLM processing by breaking it down into manageable sizes.
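
The separator fallback described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the `RecursiveCharacterTextSplitter` implementation: the `recursive_split` function is hypothetical, it measures size in characters, and it leaves out `chunk_overlap` entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text so that every piece has at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(separator):
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
chunks = recursive_split(sample, chunk_size=8)
print(chunks)
```

With `chunk_size=8`, neither paragraph fits, so each one falls through `"\n"` to the space separator, where words are merged back together as long as they stay within the limit.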






In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

url = "https://plato.stanford.edu/entries/goedel/"  # Specify the URL of the text to split

headers_to_split_on = [  # Specify HTML header tags and their names to split on
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

# Create an HTMLHeaderTextSplitter object that splits text based on HTML headers
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Fetch text from URL and split based on HTML headers
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 500  # Specify the size of chunks to split the text into
chunk_overlap = 30  # Specify the number of overlapping characters between split chunks
text_splitter = RecursiveCharacterTextSplitter(  # Create a RecursiveCharacterTextSplitter object that splits text recursively
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split the text that was split by HTML headers again according to chunk size
splits = text_splitter.split_documents(html_header_splits)

# Print five of the resulting chunks (indices 80 through 84)
for doc in splits[80:85]:
    print(f"{doc.page_content}")
    print(f"{doc.metadata}", end="\n=====================\n")


We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth
{'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}
=====================
means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.
{'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}
=====================
This account of Gödel’s discovery was t
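
The chunks printed above share the same header metadata because the second-stage splitter copies each parent document's metadata onto every sub-chunk it produces. A minimal sketch of that propagation follows. The `Chunk` dataclass and the `split_keep_metadata` helper are hypothetical stand-ins for LangChain's `Document` and `split_documents`, and fixed-width slicing replaces the separator logic to keep the example short.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Chunk:  # hypothetical stand-in for a LangChain Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_keep_metadata(docs, chunk_size):
    """Cut each document into fixed-width pieces that inherit its metadata."""
    pieces = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            content = doc.page_content[start:start + chunk_size]
            # Each piece gets its own copy, so later edits do not leak between pieces.
            pieces.append(Chunk(content, deepcopy(doc.metadata)))
    return pieces


section = Chunk("0123456789abcdef", {"Header 2": "2.2 The Incompleteness Theorems"})
pieces = split_keep_metadata([section], chunk_size=10)
for piece in pieces:
    print(piece.page_content, piece.metadata)
```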

## Limitations
While `HTMLHeaderTextSplitter` attempts to handle structural differences between HTML documents, **it may sometimes miss certain headers**.

> For example, the algorithm assumes that headers are always positioned hierarchically above their related text: as previous sibling nodes, as ancestor nodes, or as a combination of both.

In the news article below (as of the time of writing), the main headline is tagged `h1` but sits in a **separate subtree** from the body text it belongs to.

As a result, the `h1` header text does not appear in any chunk's metadata, while `h2` headers and their associated text are captured correctly where they exist.
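
The hierarchy assumption described above can be illustrated with the standard library alone. The sketch below is not the library's algorithm: the `headers_above` helper and both XHTML fragments are made up for illustration, and the lookup considers only headers that are earlier siblings of a node or of one of its ancestors, which is the assumption as stated.

```python
import xml.etree.ElementTree as ET

HEADER_TAGS = {"h1": "Header 1", "h2": "Header 2"}


def headers_above(node, parent_of):
    """Collect headers that are earlier siblings of `node` or of its ancestors."""
    found = {}
    while node in parent_of:
        parent = parent_of[node]
        level = {}
        for sibling in parent:
            if sibling is node:
                break  # only siblings that come before this node count
            if sibling.tag in HEADER_TAGS:
                # A later sibling overwrites an earlier one: the nearest header wins.
                level[HEADER_TAGS[sibling.tag]] = (sibling.text or "").strip()
        for name, text in level.items():
            found.setdefault(name, text)  # levels closer to the text take priority
        node = parent
    return found


def metadata_for_paragraph(root):
    """Look up the headers that apply to the first <p> element under root."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return headers_above(root.find(".//p"), parent_of)


# The headline is an earlier sibling of an ancestor of the paragraph.
nested = ET.fromstring(
    "<body><h1>Headline</h1><div><h2>Details</h2><p>Story text.</p></div></body>"
)
# The headline lives inside <header>, a subtree the paragraph does not sit under.
separate = ET.fromstring(
    "<body><header><h1>Headline</h1></header>"
    "<main><h2>Details</h2><p>Story text.</p></main></body>"
)

nested_meta = metadata_for_paragraph(nested)
separate_meta = metadata_for_paragraph(separate)
print(nested_meta)
print(separate_meta)
```

In the second fragment the `h1` is neither an ancestor nor an earlier sibling on the path from the paragraph to the root, so only the `h2` is found.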

In [28]:
# Specify the URL of the HTML page to split
url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"

headers_to_split_on = [
    ("h1", "Header 1"),  # Specify the header tag and its name to split on
    ("h2", "Header 2"),
]

# Create an HTMLHeaderTextSplitter object that splits HTML text based on specified headers
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Split the HTML page from the specified URL and store the results in html_header_splits variable
html_header_splits = html_splitter.split_text_from_url(url)

# Print the first 100 characters of each chunk, followed by its metadata
for doc in html_header_splits:
    print(f"{doc.page_content[:100]}")
    print(f"{doc.metadata}", end="\n=====================\n")

CNN values your feedback  
1. How relevant is this ad to you?  
2. Did you encounter any technical i
{}
=====================
No two El Niño winters are the same, but many have temperature and precipitation trends in common.  
{'Header 2': 'What could this winter look like?'}
=====================
Ad Feedback  
Ad Feedback  
Ad Feedback  
Ad Feedback  
Ad Feedback  
Ad Feedback  
Ad Feedback  
Ad
{}
=====================
