Skip to content

Conversation

@ulan-yisaev
Copy link

Description

This PR addresses two issues with the HTML backend:

  1. Missing questions in Bootstrap accordion components: The HTML backend was not properly extracting questions from Bootstrap accordion components. These questions were in <a> tags inside <div class="panel-title"> elements, causing incomplete Q&A extraction.

  2. Unwanted extraction of hidden content: The backend was including text from elements marked as 'hidden', which polluted the extracted content with metadata and invisible elements.

Changes implemented:

  • Added div to the TAGS_FOR_NODE_ITEMS list to ensure div elements are processed
  • Added specialized handlers for Bootstrap accordion components:
    • handle_panel_title(): Extracts question text from panel titles
    • handle_panel(): Processes entire accordion panels
  • Implemented is_hidden_element() method to detect various types of hidden elements:
    • Elements with classes like "hidden", "d-none", "hide", "invisible", "collapse"
    • Elements with inline styles like "display:none" or "visibility:hidden"
    • Elements with the "hidden" attribute
  • Modified text extraction to skip hidden elements in multiple places:
    • In extract_text_recursively() to skip hidden content during text collection
    • In walk() to skip processing hidden tags entirely
    • In analyze_tag() to prevent processing hidden elements

Issue resolved by this Pull Request:
Resolves #1112

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@mergify
Copy link

mergify bot commented Mar 4, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

dolfim-ibm
dolfim-ibm previously approved these changes Mar 5, 2025
Copy link
Member

@dolfim-ibm dolfim-ibm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ulan-yisaev thanks for your contribution. looks good to me.

* fix(html-backend): improve accordion extraction and hidden content handling

   - Add specialized handlers for Bootstrap accordion components to properly extract
     questions from panel-title elements
   - Implement is_hidden_element() method to detect and skip content with hidden
     classes, styles, and attributes
   - Update walk(), analyze_tag(), and extract_text_recursively() to filter out
     hidden elements
   - Add comprehensive test suite with direct method tests and example HTML files

   This fixes two issues:
   1. Missing questions in accordion components
   2. Unwanted extraction of hidden metadata content

   Tests: tests/test_html_enhanced.py

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* + html-backend itelsd

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* run pre-commit run --all-files

---------

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
@cau-git
Copy link
Member

cau-git commented Mar 12, 2025

@ulan-yisaev Could you please try again:

poetry run pre-commit run --all-files

You may need to run it multiple times until all styling fixes are applied by the different tools.

@ulan-yisaev
Copy link
Author

@cau-git I've fixed the typing issues that MyPy was reporting. I added a helper method _has_class() to more safely check for class names, and improved the type handling throughout the code.
All pre-commit checks are now passing.

@ulan-yisaev ulan-yisaev force-pushed the fix-html-backend-accordion-hidden branch from a18e1df to 3840c55 Compare March 13, 2025 11:43
@ulan-yisaev
Copy link
Author

Currently experiencing validation issues with the HTML backend tests. Will address these test failures later.

/hold

@PeterStaar-IBM
Copy link
Member

@ulan-yisaev It is a very good change, what is your ETA for this?

@ulan-yisaev
Copy link
Author

Thanks for the feedback, Peter. I'll work on fixing the tests this week and will update you as soon as possible.

@PeterStaar-IBM
Copy link
Member

@ulan-yisaev Awesome! Looking forward to it and many thanks for the work!

@ulan-yisaev ulan-yisaev force-pushed the fix-html-backend-accordion-hidden branch 2 times, most recently from a0f5098 to 2b6fd25 Compare March 23, 2025 21:15
@ulan-yisaev
Copy link
Author

Hi team,
I want to express my frustration regarding the DCO issues I've been encountering with this PR. I've spent several hours trying to resolve the sign-off requirements, but I continue to face challenges, particularly with the commit history and the email addresses associated with my commits.
I understand the importance of maintaining compliance with DCO rules, but the process has become quite complicated, and I’m finding it difficult to navigate. I would appreciate any guidance or assistance in resolving these issues so that we can move forward.
Thank you for your understanding.

@PeterStaar-IBM
Copy link
Member

@ulan-yisaev I understand your frustration, but this is unfortunately one of the admin tasks we need to do. I have run myself into this problem a few times ...

@ulan-yisaev
Copy link
Author

Thanks for understanding, @PeterStaar-IBM . It really is a frustrating process. Given the ongoing DCO issues, I’m considering creating a new PR from a fresh fork of the docling repo. Do you think that would be a viable solution to overcome these complications?

@cau-git cau-git requested a review from ceberam March 24, 2025 08:54
@ceberam
Copy link
Member

ceberam commented Mar 24, 2025

@ulan-yisaev the issues you encountered are most probably due to the merge commits that you did to pick up the latest changes of docling's main. These are unsigned commits and therefore DCO complains.
I would strongly suggest to rebase on main and eventually amend your commit, instead of creating merge commits:

git remote add upstream git@github.com:docling-project/docling.git
git fetch upstream
git checkout fix-html-backend-accordion-hidden
git rebase upstream/main

If you want to start fresh on a new branch, I would do the following:

# on your local copy of your fork https://github.com/ulan-yisaev/docling
git checkout main
git remote add upstream git@github.com:docling-project/docling.git
git fetch upstream
git rebase upstream/main
git branch fix-html-backend-accordion-hidden-2
git checkout fix-html-backend-accordion-hidden-2
# pick-up your commit and eventually fix any conflict
git cherry-pick 4c88d4fe1434c7f0355c2fa29a57705322052e3e

You should see your commit (already properly signed) on top of the latest docling main commit.

Also, please check locally that all the checks pass...

poetry run pre-commit run --all-files

...as well as the tests. You may want to check the regression tests of the HTML backend first:

poetry run pytest tests/test_backend_html.py

and then make sure everything is fine with all the rest:

poetry run pytest tests

Please let us know if you have any issue.

@ceberam
Copy link
Member

ceberam commented Mar 28, 2025

@ulan-yisaev do you need further support from us to complete this PR?
If you give me write access to your fork, I could rebase the branch for you.

@ulan-yisaev
Copy link
Author

Hi @ceberam, yes, that would be super helpful — please go ahead 🙏
Thanks a lot for offering to help — I’ve just granted you write access to my fork.

@PeterStaar-IBM PeterStaar-IBM removed the request for review from cau-git March 31, 2025 08:12
@ceberam ceberam self-assigned this Mar 31, 2025
@ceberam
Copy link
Member

ceberam commented Mar 31, 2025

Thanks @ulan-yisaev for giving me access to your fork.
I have had a closer look while rebasing the branch to our latest main version.
Unfortunately we cannot proceed with merging this PR. The code you introduced refers to an HTML framework, Bootstrap, which requires specific CSS.
The goal of Docling's HTML backend is to provide a parser that addresses the main specifications of HTML5. Even though Bootstrap is a popular framework, we cannot assume that any HTML page will follow the CSS of this framework and the code in this PR makes assumptions, like for instance, to decide whether some text should be parsed into DoclingDocument or not. In addition, the panels are deprecated since version 4 in favor of the newly introduced cards, which would require creating version-specific parsers for the Bootstrap framework.
Our suggestion is the following: we will introduce a new feature in docling to allow the customization of document backends. Users should be able to create a new backend or extend an existing one. This should allow the parsing of specific frameworks or custom CSS in HTML. We will use your example as a use case.
For now, we will close this PR, create a new issue with the feature above and reference both #1112 and #1115 .

@ceberam ceberam changed the title fix(html-backend): improve accordion extraction and hidden content ha… feat(html-backend): improve accordion extraction and hidden content ha… Mar 31, 2025
@ceberam ceberam closed this Mar 31, 2025
@ulan-yisaev
Copy link
Author

Thanks, @ceberam. I really appreciate the detailed explanation — I don't have the same deep knowledge of HTML/CSS, so I wasn’t aware of those assumptions. Your suggestion to support custom backends sounds great and makes a lot of sense. Looking forward to seeing it in action!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTML backend fails to properly extract content from Bootstrap accordion components and incorrectly includes hidden elements in the extracted text

5 participants