feat: support html tables by chonk-lain · Pull Request #500 · feyninc/chonkie

chonk-lain · 2026-02-17T19:21:21Z

Adds support for HTML tables in both the chunker and the chef.
Fixes #495.

how to use

from chonkie import TableChef

chef = TableChef()

html_content = """
<h2>Employee List</h2>
<table>
  <thead>
    <tr><th>ID</th><th>Status</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Active</td></tr>
    <tr><td>2</td><td>Pending</td></tr>
    <tr><td>3</td><td>Inactive</td></tr>
    <tr><td>4</td><td>Active</td></tr>
  </tbody>
</table>
"""

# Extract all html tables
doc = chef.process(html_content)
for i, table in enumerate(doc.tables):
    print(f"Table {i+1} content:\n{table.content}\n")

from chonkie import TableChunker

html_table = """
<table>
  <thead>
    <tr><th>ID</th><th>Status</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Active</td></tr>
    <tr><td>2</td><td>Pending</td></tr>
    <tr><td>3</td><td>Inactive</td></tr>
    <tr><td>4</td><td>Active</td></tr>
  </tbody>
</table>
"""

# HTML tables are chunked while preserving <table>, <thead>, and <tbody> tags
chunker = TableChunker(tokenizer="row", chunk_size=2)
chunks = chunker.chunk(html_table)

for chunk in chunks:
    print(f"--- HTML Chunk ---\n{chunk.text}\n")

gemini-code-assist · 2026-02-17T19:21:51Z

Summary of Changes

Hello @chonk-lain, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for HTML tables across the chonkie library. The TableChef can now accurately identify and extract HTML tables from various text inputs, while the TableChunker is capable of splitting these HTML tables into manageable chunks, maintaining their structural integrity. This enhancement significantly broadens the types of tabular data that can be processed and utilized within LLM and RAG pipelines.

Highlights

HTML Table Support for TableChef: The TableChef now supports extracting HTML tables from text, in addition to existing markdown, CSV, and Excel formats.
HTML Table Chunking for TableChunker: The TableChunker has been enhanced to chunk HTML tables by row or token, preserving <table>, <thead>, and <tbody> tags.
Documentation Updates: Documentation for both TableChef and TableChunker has been updated with examples and descriptions for HTML table support.
New Regex Patterns: New regex patterns were introduced in TableChef and TableChunker to identify and parse HTML table structures.
Expanded Test Coverage: Unit tests were added and modified to ensure correct functionality for HTML table extraction and chunking.

Changelog

docs/oss/chefs/tablechef.mdx
- Updated description to mention HTML table support.
- Modified code examples to consolidate CSV/Excel and Markdown, and added a new example for HTML tables.
- Updated "Supported File Formats" to explicitly include HTML tables.
docs/oss/chunkers/table-chunker.mdx
- Updated description to mention HTML table support.
- Modified code examples to clarify Markdown (Row-Based) and Markdown (Token-Based) and added a new example for HTML tables.
- Updated "Notes" to reflect support for HTML tables.
src/chonkie/chef/table.py
- Added html_table_pattern regex to identify HTML tables.
- Modified __init__ docstring to reflect new regex patterns.
- Updated extract_tables_from_markdown to find both markdown pipe tables and HTML tables, then sort them by their appearance order.
src/chonkie/chunker/table.py
- Added html_tag_pattern and html_row_pattern regexes for HTML table detection and row extraction.
- Renamed _split_table to _split_markdown_table and introduced _split_html_table for HTML-specific parsing.
- Refactored the chunk method to detect if the input is an HTML table or a markdown table and apply the appropriate splitting logic.
- Ensured that HTML table chunks correctly include the surrounding header and footer tags.
- Adjusted token counting for token-based chunking to account for HTML table footers.
- Updated warning messages for table requirements.
tests/chef/test_table_chef.py
- Renamed test_extract_tables_from_markdown_multiple to test_extract_tables_from_markdown_html.
- Modified the test case to include an HTML table and assert its presence alongside a markdown table.
tests/chunkers/test_table_chunker.py
- Added test_table_chunker_html_table to verify the chunking functionality for HTML tables, ensuring header/footer preservation and content integrity.
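
The extraction step described in the changelog (find markdown pipe tables and HTML tables, then order them by where they appear) can be sketched roughly as follows. Both regexes and the find_tables helper are illustrative assumptions, not the patterns or functions in src/chonkie/chef/table.py.

```python
import re

# Illustrative patterns only; the library's own patterns may differ.
HTML_TABLE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)
# A pipe table: header row, separator row, then zero or more body rows.
MARKDOWN_TABLE = re.compile(
    r"^\|.*\|[ \t]*\n\|[ \t:|-]+\|[ \t]*\n(?:\|.*\|[ \t]*(?:\n|$))*",
    re.MULTILINE,
)


def find_tables(text: str) -> list[tuple[int, str, str]]:
    """Return (start offset, kind, table text) tuples in document order."""
    found = [(m.start(), "html", m.group(0)) for m in HTML_TABLE.finditer(text)]
    found += [
        (m.start(), "markdown", m.group(0).rstrip())
        for m in MARKDOWN_TABLE.finditer(text)
    ]
    # Sorting on the start offset restores the order of appearance.
    return sorted(found)


text = (
    "| a | b |\n"
    "|---|---|\n"
    "| 1 | 2 |\n"
    "\n"
    "<table><tr><td>x</td></tr></table>\n"
)
tables = find_tables(text)
print([kind for _, kind, _ in tables])
```

Because the HTML pattern is non-greedy, a nested table would be cut at the first closing tag, which is the kind of edge case the later review comments raise.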

Activity

No human activity has occurred on this pull request yet.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

sentry · 2026-02-17T19:24:36Z

✅ All tests passed.

Copilot

Pull request overview

This pull request adds HTML table support to Chonkie's table processing capabilities, extending both the TableChunker and TableChef to handle HTML <table> elements in addition to existing markdown pipe tables. This addresses issue #495 which requested the ability to chunk HTML tables similar to markdown tables.

Changes:

Extended TableChunker to detect and chunk HTML tables while preserving table structure tags (<table>, <thead>, <tbody>)
Enhanced TableChef to extract both HTML and markdown tables from mixed content
Added documentation and examples for HTML table processing
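
As a rough illustration of row-based chunking that keeps the structure tags in every chunk, here is a minimal sketch. The chunk_html_table helper and its regexes are hypothetical and are not the _split_html_table implementation from this pull request; the sketch assumes a <tbody> is present and drops whitespace between rows.

```python
import re

TBODY_OPEN = re.compile(r"<tbody\b[^>]*>", re.IGNORECASE)
TBODY_CLOSE = re.compile(r"</tbody\s*>", re.IGNORECASE)
ROW = re.compile(r"<tr\b[^>]*>.*?</tr\s*>", re.IGNORECASE | re.DOTALL)


def chunk_html_table(table: str, rows_per_chunk: int) -> list[str]:
    """Split a table into chunks of at most rows_per_chunk body rows."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    open_match = TBODY_OPEN.search(table)
    if open_match is None:
        return [table]  # no <tbody>: this sketch leaves the table whole
    close_match = TBODY_CLOSE.search(table, open_match.end())
    body_end = close_match.start() if close_match else len(table)
    header = table[: open_match.end()]  # <table>, <thead>, opening <tbody>
    footer = table[body_end:]           # closing </tbody> and </table>
    rows = ROW.findall(table[open_match.end() : body_end])
    if not rows:
        return [table]
    # Every chunk repeats the header and footer around its own rows.
    return [
        header + "".join(rows[i : i + rows_per_chunk]) + footer
        for i in range(0, len(rows), rows_per_chunk)
    ]


table = (
    "<table><thead><tr><th>ID</th><th>Status</th></tr></thead><tbody>"
    "<tr><td>1</td><td>Active</td></tr><tr><td>2</td><td>Pending</td></tr>"
    "<tr><td>3</td><td>Inactive</td></tr><tr><td>4</td><td>Active</td></tr>"
    "</tbody></table>"
)
chunks = chunk_html_table(table, rows_per_chunk=2)
for chunk in chunks:
    print(chunk)
```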

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

File	Description
src/chonkie/chunker/table.py	Added HTML table detection, splitting logic with `_split_html_table()` method, and footer tag handling for proper HTML structure in chunks
src/chonkie/chef/table.py	Added HTML table regex pattern and extraction logic to find HTML tables alongside markdown tables, with sorting by position
tests/chunkers/test_table_chunker.py	Added test case for HTML table chunking (requires missing fixture to be added)
tests/chef/test_table_chef.py	Updated test to verify both HTML and markdown table extraction from mixed content
docs/oss/chunkers/table-chunker.mdx	Added HTML table usage example and updated description to mention HTML support
docs/oss/chefs/tablechef.mdx	Added HTML table extraction example and updated description

gemini-code-assist

Code Review

The pull request introduces support for HTML tables in both TableChef and TableChunker. This is a valuable addition that expands the library's utility for processing web-based content. The implementation correctly handles the repetition of headers in chunks and provides a unified interface for both Markdown and HTML tables. However, there are several issues related to index calculation consistency, regex robustness for nested tables, and performance inefficiencies in string handling that should be addressed to ensure the feature is production-ready.

chonk-lain · 2026-02-24T13:01:04Z

/gemini review

gemini-code-assist

Code Review

The pull request introduces support for HTML tables in both TableChef and TableChunker. This is a valuable addition for processing documents with mixed Markdown and HTML content. The implementation uses regex for extraction and a row-based or token-based approach for chunking. While the overall logic is sound, I identified several issues related to index accuracy and potential edge cases in HTML parsing that should be addressed to ensure the reliability of the chunking process.

gemini-code-assist · 2026-02-24T13:02:44Z

+        tbody_match = re.search(r"<tbody.*?>", table, re.IGNORECASE)
+        if tbody_match:
+            header = table[: tbody_match.end()]
+            body_content = table[tbody_match.end() : table.lower().find("</tbody>")]


The use of table.lower().find("</tbody>") can return -1 if the </tbody> tag is missing or if the HTML is malformed. Slicing with -1 as the end index (e.g., table[start:-1]) will incorrectly exclude the last character of the string instead of taking the whole remaining content. Consider checking if the index is valid before slicing.
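
The slicing problem is easy to reproduce in isolation. The sketch below contrasts the pattern from the diff with one possible guard; naive_body and guarded_body are hypothetical helper names, and falling back to the end of the string is an assumption about the desired behavior, not the fix that was merged.

```python
import re

TBODY_OPEN = re.compile(r"<tbody\b[^>]*>", re.IGNORECASE)
TBODY_CLOSE = re.compile(r"</tbody\s*>", re.IGNORECASE)


def naive_body(table: str) -> str:
    """Mirror the diff: find() returns -1 when </tbody> is missing."""
    start = TBODY_OPEN.search(table).end()
    return table[start : table.lower().find("</tbody>")]


def guarded_body(table: str) -> str:
    """Fall back to the end of the string when no closing tag is found."""
    open_match = TBODY_OPEN.search(table)
    if open_match is None:
        return ""
    start = open_match.end()
    close_match = TBODY_CLOSE.search(table, start)
    end = close_match.start() if close_match else len(table)
    return table[start:end]


malformed = "<table><tbody><tr><td>1</td></tr>"
print(naive_body(malformed))    # <tr><td>1</td></tr   (last character lost)
print(guarded_body(malformed))  # <tr><td>1</td></tr>
```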

gemini-code-assist · 2026-02-24T13:02:44Z

-                    len("".join(current_chunk)) - len(header)
-                    if chunks != []
-                    else len("".join(current_chunk))
+                    len("".join(current_chunk)) - len(header) if chunks != [] else len(chunk_text)


Similar to the issue in the first chunk, using len(chunk_text) for the chunk_len when chunks is empty (single-chunk case) might be problematic if the original text had trailing whitespace or characters after the closing </table> tag that were stripped. However, the more critical issue is the inconsistency with how chunk_len is calculated for subsequent chunks.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (3)

src/chonkie/chunker/table.py:142

In the row-based chunking path for HTML tables, the footer is added to chunk_text but the end_index calculation doesn't account for the footer length. Line 140 sets end_index=current_char_index + data_rows_len, but chunk_text includes the footer. This causes the end_index to be incorrect when a footer is present, potentially leading to misalignment when these chunks are used in a larger document context.

                for i in range(0, len(data_rows), self.chunk_size):
                    chunk_rows = data_rows[i : i + self.chunk_size]
                    chunk_text = header + "".join(chunk_rows) + footer
                    data_rows_len = len("".join(chunk_rows))

                    chunks.append(
                        Chunk(
                            text=chunk_text,
                            token_count=len(chunk_rows),
                            start_index=current_char_index,
                            end_index=current_char_index + data_rows_len,
                        ),
                    )
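
To make the reported mismatch concrete, the sketch below rebuilds one chunk the way the snippet does and compares the index span with the text length. The Chunk dataclass here is a hypothetical stand-in rather than chonkie's own class, and the sketch takes no position on which indexing convention is the right one.

```python
from dataclasses import dataclass


@dataclass
class Chunk:  # hypothetical stand-in, not chonkie's Chunk
    text: str
    start_index: int
    end_index: int


header = "<table><tbody>"
footer = "</tbody></table>"
chunk_rows = ["<tr><td>1</td></tr>", "<tr><td>2</td></tr>"]
current_char_index = 0

chunk = Chunk(
    text=header + "".join(chunk_rows) + footer,
    start_index=current_char_index,
    # As in the snippet, only the data rows count toward end_index.
    end_index=current_char_index + len("".join(chunk_rows)),
)

span = chunk.end_index - chunk.start_index
gap = len(chunk.text) - span
print(span, len(chunk.text), gap)  # the gap is the repeated header plus footer
```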

src/chonkie/chunker/table.py:52

The method is named _split_markdown_table but it's used for splitting markdown tables specifically (with pipe syntax). For clarity and consistency, consider renaming it to better reflect that it handles markdown pipe tables, distinguishing it from the new _split_html_table method. However, if this is an intentional choice to maintain backward compatibility or consistency with existing code, it's acceptable.

    def _split_markdown_table(self, table: str) -> tuple[str, list[str]]:

src/chonkie/chunker/table.py:47

The pattern r"<table.*?>" matches any table opening tag but doesn't validate that it's a proper HTML table. This could match malformed tags or tags within strings/comments. While this might be acceptable for the use case, consider whether stricter validation would be beneficial to avoid false positives when processing markdown content that might contain code snippets with table-like tags.

        self.html_tag_pattern = re.compile(r"<table.*?>", re.IGNORECASE)
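
One false positive of the pattern above is that it matches any tag whose name merely starts with "table". A slightly stricter variant, offered here only as a sketch and not as the pattern adopted in the later "better html table pattern" commit, adds a word boundary and restricts the attribute section:

```python
import re

loose = re.compile(r"<table.*?>", re.IGNORECASE)
# \b requires the tag name to end after "table"; [^>]* keeps the match inside one tag.
strict = re.compile(r"<table\b[^>]*>", re.IGNORECASE)

samples = ["<table>", '<TABLE class="data">', "<tablet>", "<tablefoo bar>"]
loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # all four samples
print(strict_hits)  # only the two real table tags
```

Neither pattern knows about comments, code spans, or attribute values that contain ">", so stricter validation would need a real HTML parser.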

Copilot · 2026-02-24T13:06:18Z

+def test_table_chunker_html_table(html_table: str) -> None:
+    """Test chunking an HTML table."""
+    chunker = TableChunker(tokenizer="character", chunk_size=100)
+    chunks = chunker.chunk(html_table)
+
+    assert len(chunks) > 1
+    for chunk in chunks:
+        assert "<table>" in chunk.text
+        assert "</table>" in chunk.text
+        assert "<thead>" in chunk.text
+        assert "ID" in chunk.text
+
+    # All data rows should be present across chunks
+    all_content = "".join(chunks[i].text for i in range(len(chunks)))
+    assert "Alice" in all_content
+    assert "Eve" in all_content


The test only covers HTML tables with both thead and tbody sections, using character-based chunking. It should also test: 1) HTML tables without tbody tags (just tr elements), 2) row-based chunking for HTML tables, 3) tables that fit within a single chunk, 4) edge cases like empty tbody, 5) malformed HTML. This would ensure the _split_html_table method's else branch and other edge cases are properly tested.

support html tables

6e23039

chonk-lain requested review from chonknick and Copilot February 17, 2026 19:21

Copilot started reviewing on behalf of chonk-lain February 17, 2026 19:21

mintlify Bot deployed to staging - docs February 17, 2026 19:22

fix

466d59a

Copilot AI reviewed Feb 17, 2026

lint

ad9f385

gemini-code-assist Bot reviewed Feb 17, 2026

Comment thread src/chonkie/chunker/table.py

Comment thread src/chonkie/chunker/table.py Outdated

Comment thread src/chonkie/chef/table.py Outdated

Comment thread src/chonkie/chunker/table.py

chonk-lain force-pushed the html-tables branch from 42e973b to ad9f385 February 17, 2026 19:54

chonk-lain marked this pull request as draft February 20, 2026 18:41

Update docs/oss/chunkers/table-chunker.mdx

05a6a49

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

mintlify Bot deployed to staging - docs February 24, 2026 12:42

chonk-lain marked this pull request as ready for review February 24, 2026 12:56

Merge branch 'main' of github.com:chonkie-inc/chonkie into html-tables

5b49c1e

chonk-lain requested a review from Copilot February 24, 2026 13:00

Copilot started reviewing on behalf of chonk-lain February 24, 2026 13:01

mintlify Bot deployed to staging - docs February 24, 2026 13:01

gemini-code-assist Bot reviewed Feb 24, 2026

Copilot AI reviewed Feb 24, 2026

chonk-lain and others added 7 commits February 24, 2026 14:18

Update src/chonkie/chunker/table.py

617e3c3

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

prioritize html table extraction

32c4bf4

better html table pattern

a86c009

ruff

a454c53

more tests

431c797

Merge branch 'main' into html-tables

f3a27a6

Merge branch 'main' of github.com:chonkie-inc/chonkie into html-tables

c92fd2e

mintlify Bot deployed to staging - docs February 27, 2026 23:48

ruff

5f7d2f2

chonknick merged commit ba92795 into main Feb 28, 2026
12 checks passed

chonk-lain deleted the html-tables branch February 28, 2026 20:49
