feat: support html tables#500
Conversation
Summary of ChangesHello @chonk-lain, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces comprehensive support for HTML tables across the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. ChangelogActivityUsing Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes |
|
✅ All tests passed. |
There was a problem hiding this comment.
Pull request overview
This pull request adds HTML table support to Chonkie's table processing capabilities, extending both the TableChunker and TableChef to handle HTML <table> elements in addition to existing markdown pipe tables. This addresses issue #495 which requested the ability to chunk HTML tables similar to markdown tables.
Changes:
- Extended
TableChunkerto detect and chunk HTML tables while preserving table structure tags (<table>,<thead>,<tbody>) - Enhanced
TableChefto extract both HTML and markdown tables from mixed content - Added documentation and examples for HTML table processing
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| src/chonkie/chunker/table.py | Added HTML table detection, splitting logic with _split_html_table() method, and footer tag handling for proper HTML structure in chunks |
| src/chonkie/chef/table.py | Added HTML table regex pattern and extraction logic to find HTML tables alongside markdown tables, with sorting by position |
| tests/chunkers/test_table_chunker.py | Added test case for HTML table chunking (requires missing fixture to be added) |
| tests/chef/test_table_chef.py | Updated test to verify both HTML and markdown table extraction from mixed content |
| docs/oss/chunkers/table-chunker.mdx | Added HTML table usage example and updated description to mention HTML support |
| docs/oss/chefs/tablechef.mdx | Added HTML table extraction example and updated description |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
The pull request introduces support for HTML tables in both TableChef and TableChunker. This is a valuable addition that expands the library's utility for processing web-based content. The implementation correctly handles the repetition of headers in chunks and provides a unified interface for both Markdown and HTML tables. However, there are several issues related to index calculation consistency, regex robustness for nested tables, and performance inefficiencies in string handling that should be addressed to ensure the feature is production-ready.
42e973b to
ad9f385
Compare
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
The pull request introduces support for HTML tables in both TableChef and TableChunker. This is a valuable addition for processing documents with mixed Markdown and HTML content. The implementation uses regex for extraction and a row-based or token-based approach for chunking. While the overall logic is sound, I identified several issues related to index accuracy and potential edge cases in HTML parsing that should be addressed to ensure the reliability of the chunking process.
| tbody_match = re.search(r"<tbody.*?>", table, re.IGNORECASE) | ||
| if tbody_match: | ||
| header = table[: tbody_match.end()] | ||
| body_content = table[tbody_match.end() : table.lower().find("</tbody>")] |
There was a problem hiding this comment.
The use of table.lower().find("</tbody>") can return -1 if the </tbody> tag is missing or if the HTML is malformed. Slicing with -1 as the end index (e.g., table[start:-1]) will incorrectly exclude the last character of the string instead of taking the whole remaining content. Consider checking if the index is valid before slicing.
| len("".join(current_chunk)) - len(header) | ||
| if chunks != [] | ||
| else len("".join(current_chunk)) | ||
| len("".join(current_chunk)) - len(header) if chunks != [] else len(chunk_text) |
There was a problem hiding this comment.
Similar to the issue in the first chunk, using len(chunk_text) for the chunk_len when chunks is empty (single-chunk case) might be problematic if the original text had trailing whitespace or characters after the closing </table> tag that were stripped. However, the more critical issue is the inconsistency with how chunk_len is calculated for subsequent chunks.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (3)
src/chonkie/chunker/table.py:142
- In the row-based chunking path for HTML tables, the footer is added to chunk_text but the end_index calculation doesn't account for the footer length. Line 140 sets
end_index=current_char_index + data_rows_len, but chunk_text includes the footer. This causes the end_index to be incorrect when a footer is present, potentially leading to misalignment when these chunks are used in a larger document context.
for i in range(0, len(data_rows), self.chunk_size):
chunk_rows = data_rows[i : i + self.chunk_size]
chunk_text = header + "".join(chunk_rows) + footer
data_rows_len = len("".join(chunk_rows))
chunks.append(
Chunk(
text=chunk_text,
token_count=len(chunk_rows),
start_index=current_char_index,
end_index=current_char_index + data_rows_len,
),
)
src/chonkie/chunker/table.py:52
- The method is named
_split_markdown_tablebut it's used for splitting markdown tables specifically (with pipe syntax). For clarity and consistency, consider renaming it to better reflect that it handles markdown pipe tables, distinguishing it from the new_split_html_tablemethod. However, if this is an intentional choice to maintain backward compatibility or consistency with existing code, it's acceptable.
def _split_markdown_table(self, table: str) -> tuple[str, list[str]]:
src/chonkie/chunker/table.py:47
- The pattern
r"<table.*?>"matches any table opening tag but doesn't validate that it's a proper HTML table. This could match malformed tags or tags within strings/comments. While this might be acceptable for the use case, consider whether stricter validation would be beneficial to avoid false positives when processing markdown content that might contain code snippets with table-like tags.
self.html_tag_pattern = re.compile(r"<table.*?>", re.IGNORECASE)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_table_chunker_html_table(html_table: str) -> None: | ||
| """Test chunking an HTML table.""" | ||
| chunker = TableChunker(tokenizer="character", chunk_size=100) | ||
| chunks = chunker.chunk(html_table) | ||
|
|
||
| assert len(chunks) > 1 | ||
| for chunk in chunks: | ||
| assert "<table>" in chunk.text | ||
| assert "</table>" in chunk.text | ||
| assert "<thead>" in chunk.text | ||
| assert "ID" in chunk.text | ||
|
|
||
| # All data rows should be present across chunks | ||
| all_content = "".join(chunks[i].text for i in range(len(chunks))) | ||
| assert "Alice" in all_content | ||
| assert "Eve" in all_content |
There was a problem hiding this comment.
The test only covers HTML tables with both thead and tbody sections, using character-based chunking. It should also test: 1) HTML tables without tbody tags (just tr elements), 2) row-based chunking for HTML tables, 3) tables that fit within a single chunk, 4) edge cases like empty tbody, 5) malformed HTML. This would ensure the _split_html_table method's else branch and other edge cases are properly tested.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
adds support for html tables including both chunker and chef.
fixes : #495
how to use