feat(html-backend): improve accordion extraction and hidden content ha… #1115
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
dolfim-ibm
left a comment
@ulan-yisaev thanks for your contribution. looks good to me.
* fix(html-backend): improve accordion extraction and hidden content handling
- Add specialized handlers for Bootstrap accordion components to properly extract
questions from panel-title elements
- Implement is_hidden_element() method to detect and skip content with hidden
classes, styles, and attributes
- Update walk(), analyze_tag(), and extract_text_recursively() to filter out
hidden elements
- Add comprehensive test suite with direct method tests and example HTML files
This fixes two issues:
1. Missing questions in accordion components
2. Unwanted extraction of hidden metadata content
Tests: tests/test_html_enhanced.py
Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
* + html-backend itelsd
Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
* run pre-commit run --all-files
---------
Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
|
@ulan-yisaev Could you please try again: You may need to run it multiple times until all styling fixes are applied by the different tools. |
|
@cau-git I've fixed the typing issues that MyPy was reporting. I added a helper method |
a18e1df to
3840c55
|
Currently experiencing validation issues with the HTML backend tests. Will address these test failures later. /hold |
|
@ulan-yisaev It is a very good change, what is your ETA for this? |
|
Thanks for the feedback, Peter. I'll work on fixing the tests this week and will update you as soon as possible. |
|
@ulan-yisaev Awesome! Looking forward to it and many thanks for the work! |
a0f5098 to
2b6fd25
|
Hi team, |
|
@ulan-yisaev I understand your frustration, but this is unfortunately one of the admin tasks we need to do. I have run myself into this problem a few times ... |
|
Thanks for understanding, @PeterStaar-IBM . It really is a frustrating process. Given the ongoing DCO issues, I’m considering creating a new PR from a fresh fork of the docling repo. Do you think that would be a viable solution to overcome these complications? |
|
@ulan-yisaev the issues you encountered are most probably due to the merge commits that you did to pick up the latest changes of docling's main branch. Rebasing on upstream instead:

    git remote add upstream git@github.com:docling-project/docling.git
    git fetch upstream
    git checkout fix-html-backend-accordion-hidden
    git rebase upstream/main

If you want to start fresh on a new branch, I would do the following:

    # on your local copy of your fork https://github.com/ulan-yisaev/docling
    git checkout main
    git remote add upstream git@github.com:docling-project/docling.git
    git fetch upstream
    git rebase upstream/main
    git branch fix-html-backend-accordion-hidden-2
    git checkout fix-html-backend-accordion-hidden-2
    # pick up your commit and fix any conflicts
    git cherry-pick 4c88d4fe1434c7f0355c2fa29a57705322052e3e

You should see your commit (already properly signed) on top of the latest docling main.

Also, please check locally that all the checks pass...

    poetry run pre-commit run --all-files

...as well as the tests. You may want to check the regression tests of the HTML backend first:

    poetry run pytest tests/test_backend_html.py

and then make sure everything is fine with all the rest:

    poetry run pytest tests

Please let us know if you have any issue. |
|
@ulan-yisaev do you need further support from us to complete this PR? |
|
Hi @ceberam, yes, that would be super helpful — please go ahead 🙏 |
|
Thanks @ulan-yisaev for giving me access to your fork. |
|
Thanks, @ceberam. I really appreciate the detailed explanation — I don't have the same deep knowledge of HTML/CSS, so I wasn’t aware of those assumptions. Your suggestion to support custom backends sounds great and makes a lot of sense. Looking forward to seeing it in action! |
Description
This PR addresses two issues with the HTML backend:
1. Missing questions in Bootstrap accordion components: The HTML backend was not properly extracting questions from Bootstrap accordion components. These questions were in <a> tags inside <div class="panel-title"> elements, causing incomplete Q&A extraction.
2. Unwanted extraction of hidden content: The backend was including text from elements marked as 'hidden', which polluted the extracted content with metadata and invisible elements.
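For illustration only, the accordion case can be reproduced with a small standalone sketch. It uses Python's standard-library html.parser rather than the backend's own parsing code, and the names PanelTitleExtractor, extract_questions, and SAMPLE are hypothetical, not taken from this PR. It assumes well-formed markup in which each question is an <a> inside a <div class="panel-title">.

```python
from html.parser import HTMLParser


class PanelTitleExtractor(HTMLParser):
    """Collect the text of <a> tags nested inside <div class="panel-title">.

    Hypothetical helper for illustration; not the PR's handle_panel_title().
    """

    def __init__(self) -> None:
        super().__init__()
        self.questions: list[str] = []
        self._div_depth = 0  # <div> nesting depth inside a panel-title
        self._in_link = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Start counting at the panel-title div, then track nested divs.
            if self._div_depth or "panel-title" in classes:
                self._div_depth += 1
        elif tag == "a" and self._div_depth:
            self._in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "a" and self._in_link:
            self._in_link = False
            # Normalize whitespace before storing the question text.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.questions.append(text)

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)


def extract_questions(html: str) -> list[str]:
    parser = PanelTitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.questions


# Assumed sample markup in the style of a Bootstrap 3 accordion panel.
SAMPLE = """
<div class="panel panel-default">
  <div class="panel-heading">
    <div class="panel-title"><a href="#q1">How do I reset my password?</a></div>
  </div>
  <div class="panel-body">Use the reset link on the sign-in page.</div>
</div>
"""

print(extract_questions(SAMPLE))  # ['How do I reset my password?']
```

The answer text in panel-body is deliberately ignored here; the sketch only shows where the previously missing question text lives.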
Changes implemented:
- Added div to the TAGS_FOR_NODE_ITEMS list to ensure div elements are processed
- Added specialized handlers for Bootstrap accordion components:
  - handle_panel_title(): Extracts question text from panel titles
  - handle_panel(): Processes entire accordion panels
- Implemented is_hidden_element() method to detect various types of hidden elements: those with hidden classes, styles, or attributes
- Updated extract_text_recursively() to skip hidden content during text collection
- Updated walk() to skip processing hidden tags entirely
- Updated analyze_tag() to prevent processing hidden elements
Issue resolved by this Pull Request:
Resolves #1112
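As a rough sketch of the hidden-content filtering described in this PR, the check below treats an element as hidden based on its classes, inline style, or attributes, and skips everything nested inside it. This is not the PR's is_hidden_element(): the class names in HIDDEN_CLASSES, the style patterns, and the helper names (is_hidden, VisibleTextCollector, visible_text) are assumptions chosen for illustration, and the parser is the standard-library html.parser rather than the backend's own.

```python
import re
from html.parser import HTMLParser

# Assumed class names; the PR's actual list may differ.
HIDDEN_CLASSES = {"hidden", "d-none", "sr-only", "invisible"}
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE
)
# Void elements never get an end tag, so they must not affect depth tracking.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def is_hidden(attrs: dict) -> bool:
    """Return True if the attributes mark an element as hidden."""
    classes = set((attrs.get("class") or "").split())
    if classes & HIDDEN_CLASSES:
        return True
    if HIDDEN_STYLE.search(attrs.get("style") or ""):
        return True
    if "hidden" in attrs:  # boolean HTML attribute
        return True
    return attrs.get("aria-hidden") == "true"


class VisibleTextCollector(HTMLParser):
    """Collect text, skipping hidden elements and all their descendants.

    Assumes well-formed HTML where every non-void start tag is closed.
    """

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden subtree, every nested tag deepens it.
        if self._hidden_depth or is_hidden(dict(attrs)):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> list[str]:
    parser = VisibleTextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For example, visible_text('<div><p>Visible</p><div class="hidden"><span>meta</span></div><p hidden>y</p></div>') returns ['Visible'], since both the hidden div (with its nested span) and the paragraph carrying the hidden attribute are skipped.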
Checklist: