feat(backend): add generic options support and HTML image handling modes #2011
✅ DCO Check Passed: Thanks @Leg0shii, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates: Wonderful, this rule succeeded. When test data is updated, we require two reviewers
Are documentation and examples necessary? If so, where would I add them?
Hello @Leg0shii In terms of design:
Other technical aspects:
Further improvements, out of the scope of this task, and besides those that you already listed:
Please see the comment above.
Besides docstrings, no further documentation is needed in this PR.
Hello, thank you for the feedback, I really appreciate it!
@Leg0shii I will take it from there this week. Wish you a good recovery!
I'm sorry for the late reply. I have invited you @ceberam
Waiting for this PR to be merged @ceberam @dolfim-ibm
@irajank Thanks for your interest in this feature. We had to do an extensive refactoring of the initial implementation in this PR, but we are in the testing phase at the moment, so hopefully we will release it within this week.
Hey, thanks for the speedy response. Looking forward to a quick merge.
@ceberam @dolfim-ibm Any idea when we can expect this PR to be merged?
@punit1108 We expect to have it merged by the end of today
The refactoring of this PR is done and ready for review. Please @cau-git @dolfim-ibm @PeterStaar-IBM have a look at it.
@ceberam the tests seem to be failing, can you have a quick look?
lgtm
lgtm!
@ceberam great progress. I posted a few remarks below.
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…ove HTML image handling Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…are set correctly Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…le paths Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Backend options for DeclarativeDocumentBackend classes and only when necessary. Refactor caption parsing in 'img' elements and remove dummy text. Replace deprecated annotations from Typing library with native types. Replace typing annotations according to pydantic guidelines. Some documentation with pydantic annotations. Fix diff issue with test files. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Move backend option classes to their own module within the datamodel package. Rename 'source_location' to 'source_uri' in HTMLBackendOptions. Rename 'image_fetch' to 'fetch_images' in HTMLBackendOptions. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Description
This PR enhances the DeclarativeDocumentBackend to support configurable backend options and significantly improves HTML image handling capabilities in Docling.
Key Changes:
Made DeclarativeDocumentBackend generic and configurable:
… TBackendOptions to support backend-specific options
… FormatOption with automatic defaults
Introduced configurable image handling for HTML:
Enhanced image source support:
… (//example.com)
Improved test infrastructure:
Usage Example:
Breaking Changes:
None - backward compatible with optional backend options.
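As an illustration of the generic, optional-options design described under Key Changes, here is a minimal self-contained sketch. It does not import Docling and is not the PR's implementation: the class names `BaseOptionsSketch`, `HTMLOptionsSketch`, `DeclarativeBackendSketch`, and `HTMLBackendSketch`, and the method `describe_image`, are hypothetical stand-ins. Only the type-variable name `TBackendOptions` and the field names `fetch_images` and `source_uri` come from this PR; their types and defaults here are assumptions.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar


@dataclass
class BaseOptionsSketch:
    """Hypothetical base class for backend-specific options."""


@dataclass
class HTMLOptionsSketch(BaseOptionsSketch):
    # Field names follow the renames in this PR's commits;
    # the types and default values are assumptions.
    fetch_images: bool = False
    source_uri: Optional[str] = None


TBackendOptions = TypeVar("TBackendOptions", bound=BaseOptionsSketch)


class DeclarativeBackendSketch(Generic[TBackendOptions]):
    """Stand-in for a backend parameterized by its options type."""

    def __init__(self, options: Optional[TBackendOptions] = None) -> None:
        # Options are optional: callers that pass nothing get defaults,
        # which is what keeps existing call sites working.
        self.options = options if options is not None else self.default_options()

    @classmethod
    def default_options(cls) -> TBackendOptions:
        raise NotImplementedError


class HTMLBackendSketch(DeclarativeBackendSketch[HTMLOptionsSketch]):
    @classmethod
    def default_options(cls) -> HTMLOptionsSketch:
        return HTMLOptionsSketch()

    def describe_image(self, src: str) -> str:
        # Branch on the configured image-handling behaviour.
        if self.options.fetch_images:
            return f"fetch {src}"
        return f"placeholder for {src}"


print(HTMLBackendSketch().describe_image("a.png"))
print(HTMLBackendSketch(HTMLOptionsSketch(fetch_images=True)).describe_image("a.png"))
```

Binding the type variable to a common base class lets a type checker reject options objects that belong to a different backend, while the `None` default preserves the old constructor call.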
Remaining Issues
resolve_source_to_stream returns only the end portion of the URL (e.g., "about") from full URLs like https://www.website.com/section/about. This prevents the HTML backend from properly resolving relative image paths, since the full base URL is needed for correct image downloading.
<!-- 🖼️❌ Image not available. Please use PdfPipelineOptions(generate_picture_images=True) --> (This might occur for relative-path images, SVG images, or other cases where an image can't be opened.)
403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy
I believe that these issues fall outside the scope of this PR and should be handled in future PRs.
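To make the first and last of these issues concrete, the following standard-library sketch shows why a relative image path needs the full page URL, and how a request can carry an explicit User-Agent header. The helpers `resolve_image_url` and `build_image_request` and the User-Agent string are hypothetical and are not part of Docling; this is only one possible way to approach a fix.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse
from urllib.request import Request


def resolve_image_url(base_url: str, src: str) -> Optional[str]:
    """Resolve an <img> src against the page URL.

    Returns None when the result is not an absolute http(s) URL,
    i.e. when there is nothing that could be downloaded.
    """
    resolved = urljoin(base_url, src)
    parsed = urlparse(resolved)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return resolved
    return None


def build_image_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a request with an explicit User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


full_base = "https://www.website.com/section/about"

# With the full base URL, relative and protocol-relative sources resolve.
print(resolve_image_url(full_base, "images/logo.png"))
# → https://www.website.com/section/images/logo.png
print(resolve_image_url(full_base, "//example.com/pic.png"))
# → https://example.com/pic.png

# With only the trailing segment as the base, the host is lost.
print(resolve_image_url("about", "images/logo.png"))
# → None

# Hypothetical User-Agent value; a real one should identify the client.
req = build_image_request("https://example.com/pic.png", "example-bot/0.1")
print(req.get_header("User-agent"))
# → example-bot/0.1
```

Note that `urllib.request.Request` stores header names capitalized as `User-agent`, which is why the lookup uses that spelling.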
Issue resolved by this Pull Request:
Resolves #1963
Checklist: