feat: support html tables #500

Merged
chonknick merged 13 commits into main from html-tables on Feb 28, 2026
Conversation

@chonk-lain

Member

Adds support for HTML tables in both TableChunker and TableChef.
Fixes: #495

How to use

from chonkie import TableChef

chef = TableChef()

html_content = """
<h2>Employee List</h2>
<table>
  <thead>
    <tr><th>ID</th><th>Status</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Active</td></tr>
    <tr><td>2</td><td>Pending</td></tr>
    <tr><td>3</td><td>Inactive</td></tr>
    <tr><td>4</td><td>Active</td></tr>
  </tbody>
</table>
"""

# Extract all html tables
doc = chef.process(html_content)
for i, table in enumerate(doc.tables):
    print(f"Table {i+1} content:\n{table.content}\n")
# Chunk an HTML table with TableChunker
from chonkie import TableChunker

html_table = """
<table>
  <thead>
    <tr><th>ID</th><th>Status</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Active</td></tr>
    <tr><td>2</td><td>Pending</td></tr>
    <tr><td>3</td><td>Inactive</td></tr>
    <tr><td>4</td><td>Active</td></tr>
  </tbody>
</table>
"""

# HTML tables are chunked while preserving <table>, <thead>, and <tbody> tags
chunker = TableChunker(tokenizer="row", chunk_size=2)
chunks = chunker.chunk(html_table)

for chunk in chunks:
    print(f"--- HTML Chunk ---\n{chunk.text}\n")

@gemini-code-assist

Contributor

Summary of Changes

Hello @chonk-lain, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for HTML tables across the chonkie library. The TableChef can now accurately identify and extract HTML tables from various text inputs, while the TableChunker is capable of splitting these HTML tables into manageable chunks, maintaining their structural integrity. This enhancement significantly broadens the types of tabular data that can be processed and utilized within LLM and RAG pipelines.

Highlights

  • HTML Table Support for TableChef: The TableChef now supports extracting HTML tables from text, in addition to existing markdown, CSV, and Excel formats.
  • HTML Table Chunking for TableChunker: The TableChunker has been enhanced to chunk HTML tables by row or token, preserving <table>, <thead>, and <tbody> tags.
  • Documentation Updates: Documentation for both TableChef and TableChunker has been updated with examples and descriptions for HTML table support.
  • New Regex Patterns: New regex patterns were introduced in TableChef and TableChunker to identify and parse HTML table structures.
  • Expanded Test Coverage: Unit tests were added and modified to ensure correct functionality for HTML table extraction and chunking.

Changelog

  • docs/oss/chefs/tablechef.mdx
    • Updated description to mention HTML table support.
    • Modified code examples to consolidate CSV/Excel and Markdown, and added a new example for HTML tables.
    • Updated "Supported File Formats" to explicitly include HTML tables.
  • docs/oss/chunkers/table-chunker.mdx
    • Updated description to mention HTML table support.
    • Modified code examples to clarify Markdown (Row-Based) and Markdown (Token-Based) and added a new example for HTML tables.
    • Updated "Notes" to reflect support for HTML <table> elements and their structural requirements.
  • src/chonkie/chef/table.py
    • Added html_table_pattern regex to identify HTML <table> elements.
    • Modified __init__ docstring to reflect new regex patterns.
    • Updated extract_tables_from_markdown to find both markdown pipe tables and HTML tables, then sort them by their appearance order.
  • src/chonkie/chunker/table.py
    • Added html_tag_pattern and html_row_pattern regexes for HTML table detection and row extraction.
    • Renamed _split_table to _split_markdown_table and introduced _split_html_table for HTML-specific parsing.
    • Refactored the chunk method to detect if the input is an HTML table or a markdown table and apply the appropriate splitting logic.
    • Ensured that HTML table chunks correctly include <table>, <thead>, and <tbody> tags as header/footer elements.
    • Adjusted token counting for token-based chunking to account for HTML table footers.
    • Updated warning messages for table requirements.
  • tests/chef/test_table_chef.py
    • Renamed test_extract_tables_from_markdown_multiple to test_extract_tables_from_markdown_html.
    • Modified the test case to include an HTML table and assert its presence alongside a markdown table.
  • tests/chunkers/test_table_chunker.py
    • Added test_table_chunker_html_table to verify the chunking functionality for HTML tables, ensuring header/footer preservation and content integrity.
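The extraction step listed for src/chonkie/chef/table.py (find markdown pipe tables and HTML tables, then sort them by appearance order) could look roughly like the sketch below. This is an illustration only, not the merged code: both regex patterns and the function name find_tables are assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration; the patterns merged in the PR may differ.
HTML_TABLE = re.compile(r"<table\b[^>]*>.*?</table>", re.IGNORECASE | re.DOTALL)
# A pipe table: a header row, a separator row, then zero or more body rows.
MD_TABLE = re.compile(
    r"^\|.*\|[ \t]*\n\|[ \t:|-]+\|[ \t]*(?:\n\|.*\|[ \t]*)*",
    re.MULTILINE,
)


def find_tables(text: str) -> list[tuple[int, str]]:
    """Return (start_offset, table_text) pairs sorted by position in text."""
    found = [(m.start(), m.group(0)) for m in HTML_TABLE.finditer(text)]
    found += [(m.start(), m.group(0)) for m in MD_TABLE.finditer(text)]
    # Sorting on the start offset restores document order across both formats.
    return sorted(found)


sample = (
    "Intro\n\n"
    "| ID | Status |\n"
    "|----|--------|\n"
    "| 1 | Active |\n\n"
    "<table><tr><th>ID</th></tr><tr><td>2</td></tr></table>\n"
)
tables = find_tables(sample)
```

Collecting matches from both patterns and sorting on the start offset keeps the result independent of the order in which the patterns are run.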
Activity

  • No human activity has occurred on this pull request yet.

@sentry

sentry Bot commented Feb 17, 2026


✅ All tests passed.

Copilot AI left a comment

Contributor

Pull request overview

This pull request adds HTML table support to Chonkie's table processing capabilities, extending both the TableChunker and TableChef to handle HTML <table> elements in addition to existing markdown pipe tables. This addresses issue #495 which requested the ability to chunk HTML tables similar to markdown tables.

Changes:

  • Extended TableChunker to detect and chunk HTML tables while preserving table structure tags (<table>, <thead>, <tbody>)
  • Enhanced TableChef to extract both HTML and markdown tables from mixed content
  • Added documentation and examples for HTML table processing

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| src/chonkie/chunker/table.py | Added HTML table detection, splitting logic with _split_html_table() method, and footer tag handling for proper HTML structure in chunks |
| src/chonkie/chef/table.py | Added HTML table regex pattern and extraction logic to find HTML tables alongside markdown tables, with sorting by position |
| tests/chunkers/test_table_chunker.py | Added test case for HTML table chunking (requires missing fixture to be added) |
| tests/chef/test_table_chef.py | Updated test to verify both HTML and markdown table extraction from mixed content |
| docs/oss/chunkers/table-chunker.mdx | Added HTML table usage example and updated description to mention HTML support |
| docs/oss/chefs/tablechef.mdx | Added HTML table extraction example and updated description |


Comment thread docs/oss/chunkers/table-chunker.mdx Outdated
Comment thread tests/chunkers/test_table_chunker.py
Comment thread src/chonkie/chunker/table.py
Comment thread src/chonkie/chunker/table.py
Comment thread tests/chunkers/test_table_chunker.py
Comment thread src/chonkie/chef/table.py Outdated
Comment thread tests/chunkers/test_table_chunker.py
Comment thread src/chonkie/chunker/table.py
Comment thread src/chonkie/chunker/table.py
Comment thread src/chonkie/chunker/table.py

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

The pull request introduces support for HTML tables in both TableChef and TableChunker. This is a valuable addition that expands the library's utility for processing web-based content. The implementation correctly handles the repetition of headers in chunks and provides a unified interface for both Markdown and HTML tables. However, there are several issues related to index calculation consistency, regex robustness for nested tables, and performance inefficiencies in string handling that should be addressed to ensure the feature is production-ready.
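The nested-table concern raised here can be illustrated with a small sketch. The pattern below is an assumed, simplified stand-in for a non-greedy table regex, not the one in the PR; it shows how such a pattern ends its match at the first closing tag it meets.

```python
import re

# Assumed, simplified pattern for illustration (not copied from the PR).
pattern = re.compile(r"<table.*?>.*?</table>", re.IGNORECASE | re.DOTALL)

nested = (
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
)

match = pattern.search(nested)
# The lazy quantifier stops at the first </table>, which closes the inner table,
# so the text captured for the outer table is cut short.
truncated = match.group(0)
```

A parser that tracks nesting depth, such as one built on the standard-library html.parser module, avoids this at the cost of more code.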

Comment thread src/chonkie/chunker/table.py
Comment thread src/chonkie/chunker/table.py Outdated
Comment thread src/chonkie/chef/table.py Outdated
Comment thread src/chonkie/chunker/table.py
@chonk-lain chonk-lain marked this pull request as draft February 20, 2026 18:41
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@chonk-lain chonk-lain requested a review from Copilot February 24, 2026 13:00
@chonk-lain

Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

The pull request introduces support for HTML tables in both TableChef and TableChunker. This is a valuable addition for processing documents with mixed Markdown and HTML content. The implementation uses regex for extraction and a row-based or token-based approach for chunking. While the overall logic is sound, I identified several issues related to index accuracy and potential edge cases in HTML parsing that should be addressed to ensure the reliability of the chunking process.

Comment thread src/chonkie/chunker/table.py Outdated
tbody_match = re.search(r"<tbody.*?>", table, re.IGNORECASE)
if tbody_match:
    header = table[: tbody_match.end()]
    body_content = table[tbody_match.end() : table.lower().find("</tbody>")]

Contributor

medium

The use of table.lower().find("</tbody>") can return -1 if the </tbody> tag is missing or if the HTML is malformed. Slicing with -1 as the end index (e.g., table[start:-1]) will incorrectly exclude the last character of the string instead of taking the whole remaining content. Consider checking if the index is valid before slicing.
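The slicing pitfall described above, and one possible guard, can be shown in a few lines. This is a minimal sketch under stated assumptions: split_tbody is a hypothetical helper written for this example, and treating everything after <tbody> as body when the closing tag is missing is only one of several reasonable fallbacks.

```python
import re


def split_tbody(table: str) -> tuple[str, str, str]:
    """Split an HTML table into (header, body, footer) around <tbody>...</tbody>.

    Hypothetical helper for illustration; it checks the result of find()
    before slicing instead of passing -1 straight into the slice.
    """
    open_match = re.search(r"<tbody\b[^>]*>", table, re.IGNORECASE)
    if open_match is None:
        return "", table, ""
    body_start = open_match.end()
    body_end = table.lower().find("</tbody>", body_start)
    if body_end == -1:
        # No closing tag: keep the whole remainder rather than dropping a character.
        body_end = len(table)
    return table[:body_start], table[body_start:body_end], table[body_end:]


malformed = "<table><tbody><tr><td>1</td></tr></table>"

# Unguarded version: find() returns -1, so the slice silently loses the final ">".
naive_end = malformed.lower().find("</tbody>")
naive_body = malformed[len("<table><tbody>"):naive_end]

header, body, footer = split_tbody(malformed)
```

With the guard in place, body keeps the trailing `</table>` intact, whereas naive_body ends in `</table` because -1 is interpreted as "one before the end".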

- len("".join(current_chunk)) - len(header)
- if chunks != []
- else len("".join(current_chunk))
+ len("".join(current_chunk)) - len(header) if chunks != [] else len(chunk_text)

Contributor

medium

Similar to the issue in the first chunk, using len(chunk_text) for the chunk_len when chunks is empty (single-chunk case) might be problematic if the original text had trailing whitespace or characters after the closing </table> tag that were stripped. However, the more critical issue is the inconsistency with how chunk_len is calculated for subsequent chunks.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (3)

src/chonkie/chunker/table.py:142

  • In the row-based chunking path for HTML tables, the footer is added to chunk_text but the end_index calculation doesn't account for the footer length. Line 140 sets end_index=current_char_index + data_rows_len, but chunk_text includes the footer. This causes the end_index to be incorrect when a footer is present, potentially leading to misalignment when these chunks are used in a larger document context.
                for i in range(0, len(data_rows), self.chunk_size):
                    chunk_rows = data_rows[i : i + self.chunk_size]
                    chunk_text = header + "".join(chunk_rows) + footer
                    data_rows_len = len("".join(chunk_rows))

                    chunks.append(
                        Chunk(
                            text=chunk_text,
                            token_count=len(chunk_rows),
                            start_index=current_char_index,
                            end_index=current_char_index + data_rows_len,
                        ),
                    )
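Whether the footer should count toward end_index depends on what the indices are meant to address. The sketch below assumes one possible convention, in which start_index and end_index are offsets of a chunk's data rows within the original table, while the header and footer are repeated in the chunk text but excluded from the span. RowChunk and chunk_rows are hypothetical names defined here so the example runs on its own; they are not the library's Chunk type or the code in the PR.

```python
from dataclasses import dataclass


@dataclass
class RowChunk:
    """Stand-in for the library's Chunk type, defined here so the sketch runs alone."""

    text: str
    start_index: int
    end_index: int


def chunk_rows(header: str, rows: list[str], footer: str, size: int) -> list[RowChunk]:
    """Group rows into chunks of `size`; indices are row offsets in header + rows + footer."""
    chunks = []
    cursor = len(header)  # the first data row starts right after the header
    for i in range(0, len(rows), size):
        body = "".join(rows[i : i + size])
        # Header and footer are repeated in the text but not counted in the span.
        chunks.append(RowChunk(header + body + footer, cursor, cursor + len(body)))
        cursor += len(body)
    return chunks


header = "<table><tbody>"
rows = ["<tr><td>1</td></tr>", "<tr><td>2</td></tr>", "<tr><td>3</td></tr>"]
footer = "</tbody></table>"
source = header + "".join(rows) + footer
chunks = chunk_rows(header, rows, footer, size=2)
```

Under this convention `source[c.start_index:c.end_index]` always returns exactly the rows held by chunk `c`, and consecutive chunks have contiguous spans. A convention that counts the footer would need the same rule applied to every chunk to stay consistent.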

src/chonkie/chunker/table.py:52

  • The method is named _split_markdown_table but it's used for splitting markdown tables specifically (with pipe syntax). For clarity and consistency, consider renaming it to better reflect that it handles markdown pipe tables, distinguishing it from the new _split_html_table method. However, if this is an intentional choice to maintain backward compatibility or consistency with existing code, it's acceptable.
    def _split_markdown_table(self, table: str) -> tuple[str, list[str]]:

src/chonkie/chunker/table.py:47

  • The pattern r"<table.*?>" matches any table opening tag but doesn't validate that it's a proper HTML table. This could match malformed tags or tags within strings/comments. While this might be acceptable for the use case, consider whether stricter validation would be beneficial to avoid false positives when processing markdown content that might contain code snippets with table-like tags.
        self.html_tag_pattern = re.compile(r"<table.*?>", re.IGNORECASE)
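One kind of false positive for the pattern quoted above is a tag whose name merely starts with "table". The sketch below compares it with a stricter alternative; the strict pattern is an assumption offered for illustration, not something proposed in the PR, and it only tightens the tag-name boundary (it still matches table tags that appear inside code snippets or comments).

```python
import re

# Pattern quoted in the review comment.
loose = re.compile(r"<table.*?>", re.IGNORECASE)
# Assumed alternative: the tag name must be followed by whitespace or ">".
strict = re.compile(r"<table(?:\s[^>]*)?>", re.IGNORECASE)

samples = ["<table>", '<TABLE class="grid">', "<tablet>"]

loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]
```

Here loose_hits is all True, because the lazy `.*?` absorbs the extra "t" in `<tablet>`, while strict_hits rejects that third sample.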


Comment thread src/chonkie/chef/table.py
Comment thread src/chonkie/chef/table.py Outdated
Comment on lines +739 to +754
def test_table_chunker_html_table(html_table: str) -> None:
    """Test chunking an HTML table."""
    chunker = TableChunker(tokenizer="character", chunk_size=100)
    chunks = chunker.chunk(html_table)

    assert len(chunks) > 1
    for chunk in chunks:
        assert "<table>" in chunk.text
        assert "</table>" in chunk.text
        assert "<thead>" in chunk.text
        assert "ID" in chunk.text

    # All data rows should be present across chunks
    all_content = "".join(chunks[i].text for i in range(len(chunks)))
    assert "Alice" in all_content
    assert "Eve" in all_content

Copilot AI Feb 24, 2026


The test only covers HTML tables with both thead and tbody sections, using character-based chunking. It should also test: 1) HTML tables without tbody tags (just tr elements), 2) row-based chunking for HTML tables, 3) tables that fit within a single chunk, 4) edge cases like empty tbody, 5) malformed HTML. This would ensure the _split_html_table method's else branch and other edge cases are properly tested.
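Some of the edge cases listed above could be phrased as small tests along the following lines. This is only a sketch of their shape: html_rows is a hypothetical regex-based helper defined for this example, so the tests exercise it rather than TableChunker or _split_html_table, whose behaviour on these inputs is not shown in this conversation.

```python
import re

# Hypothetical row pattern for illustration; not taken from the PR.
ROW = re.compile(r"<tr\b.*?</tr>", re.IGNORECASE | re.DOTALL)


def html_rows(table: str) -> list[str]:
    """Hypothetical helper: return every <tr>...</tr> block in document order."""
    return ROW.findall(table)


def test_rows_without_tbody() -> None:
    # Case 1: a table made of bare <tr> elements, with no <tbody> wrapper.
    table = "<table><tr><td>1</td></tr><tr><td>2</td></tr></table>"
    assert html_rows(table) == ["<tr><td>1</td></tr>", "<tr><td>2</td></tr>"]


def test_empty_tbody() -> None:
    # Case 4: an empty <tbody> yields no rows.
    assert html_rows("<table><tbody></tbody></table>") == []


def test_unclosed_row_is_skipped() -> None:
    # Case 5: malformed input where the second row never closes.
    table = "<table><tr><td>1</td></tr><tr><td>2</td></table>"
    assert html_rows(table) == ["<tr><td>1</td></tr>"]
```

Equivalent tests against the real chunker would replace html_rows with calls to TableChunker and assert on chunk.text.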

@chonknick chonknick merged commit ba92795 into main Feb 28, 2026
12 checks passed
@chonk-lain chonk-lain deleted the html-tables branch February 28, 2026 20:49
3 participants