Fix br tail text handling in HTML tables #4351

Open
dsolankii wants to merge 1 commit into Unstructured-IO:main from dsolankii:fix-html-table-br-tail-text

dsolankii commented May 12, 2026

Fixes #3899.

Summary

This PR fixes text loss when normalizing HTML tables that contain <br/> tags inside table cells.

HtmlTable.from_html_text() previously removed all element tail text during normalization. For <br/> elements, the text after the tag is stored as tail text, so content after line breaks was dropped from the normalized HTML output.
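To illustrate why the text after a line break is at risk, here is a minimal sketch of the text/tail model. It uses the standard library's xml.etree.ElementTree as a stand-in, since it exposes the same .text and .tail attributes as lxml; it is not the code from HtmlTable.from_html_text().

```python
import xml.etree.ElementTree as ET

# A single table cell with two line breaks, as in the example from this PR.
cell = ET.fromstring("<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>")

# Text before the first child element belongs to the parent's .text.
first_line = cell.text

# Text that follows each <br/> is stored on the <br/> element as .tail,
# not on the parent cell.
tails = [br.tail for br in cell.findall("br")]

print(first_line)  # This is 1st line.
print(tails)       # ['2nd line.', '3rd line.']
```

Any cleanup pass that clears .tail on every element therefore discards everything after the first <br/>.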

Problem

Given an HTML table cell like:

This is 1st line.<br/>2nd line.<br/>3rd line.

the normalized table output preserved the <br/> tags but dropped the text after them, so "2nd line." and "3rd line." were lost.

Solution

This change preserves normalized tail text for <br/> elements while continuing to remove tail text for other elements.

This keeps the existing cleanup behavior for table normalization but avoids dropping valid content that follows line break tags.
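The idea can be sketched roughly as follows. This is an illustration under assumptions, not the diff in this PR: strip_tails_except_br is a hypothetical helper name, the standard library's xml.etree.ElementTree stands in for the parser used by HtmlTable, and "normalized" is assumed to mean collapsing runs of whitespace.

```python
import xml.etree.ElementTree as ET


def strip_tails_except_br(root):
    """Hypothetical helper: clear tail text everywhere except after <br/>."""
    for el in root.iter():
        if el.tag == "br" and el.tail:
            # Assumption: normalizing means collapsing whitespace runs.
            el.tail = " ".join(el.tail.split()) or None
        else:
            # Existing cleanup behavior: drop tail text on other elements.
            el.tail = None
    return root


table = ET.fromstring(
    "<table><tr>"
    "<td>This is 1st line.<br/>2nd   line.<br/>3rd line.</td>\n"
    "<td>x</td> stray"
    "</tr></table>"
)
strip_tails_except_br(table)

br_tails = [br.tail for br in table.iter("br")]
td_tails = [td.tail for td in table.iter("td")]
print(br_tails)  # ['2nd line.', '3rd line.']
print(td_tails)  # [None, None]
```

Whitespace and stray text between cells is still removed, while the content following each line break survives.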

Changes

  • Preserve tail text for <br/> elements in HtmlTable.from_html_text().
  • Continue removing tail text for non-br elements.
  • Add regression coverage for preserving text after <br/> tags in HTML table cells.

How to test

Run the targeted regression test:

python -m pytest test_unstructured/common/test_html_table.py::DescribeHtmlTable::test_from_html_text_preserves_br_tail_text -q

Expected result:

1 passed

Validation

I reproduced the issue locally before applying the fix.

Before the fix, this input lost the text after <br/> tags:

This is 1st line.<br/>2nd line.<br/>3rd line.

After the fix, the normalized table preserves all lines:

This is 1st line.<br/>2nd line.<br/>3rd line.

The targeted regression test passes locally.

@dsolankii (Author)

@cragwolfe Would you please review?

Development

Successfully merging this pull request may close these issues.

bug/br tag tail text loss