Leg0shii
Contributor

@Leg0shii Leg0shii commented Jul 29, 2025

Description

This PR enhances the DeclarativeDocumentBackend to support configurable backend options and significantly improves HTML image handling capabilities in Docling.

Key Changes:

  1. Made DeclarativeDocumentBackend generic and configurable:

    • Added generic type parameter TBackendOptions to support backend-specific options
    • Integrated backend options into FormatOption with automatic defaults
  2. Introduced configurable image handling for HTML:

    • ImageOptions.NONE: Images as placeholders only
    • ImageOptions.REFERENCED: Images with URI references
    • ImageOptions.EMBEDDED: Images embedded as base64 data
  3. Enhanced image source support:

    • HTTP/HTTPS URLs, data URLs, local files, protocol-relative URLs (//example.com)
    • Relative path resolution for both web and local contexts
  4. Improved test infrastructure:

    • Tests now validate all three image handling modes
    • Reference data organized by image option type
    • Maintains portable relative paths in test outputs
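
As a rough illustration of the three image handling modes listed above, the sketch below shows what would be kept for a single image in each case. `ImageMode` and `picture_payload` are hypothetical names made up for this example and are not the PR's implementation; only the three mode names mirror the description.

```python
import base64
from enum import Enum
from typing import Optional


class ImageMode(str, Enum):
    """Hypothetical stand-in for the three ImageOptions values."""

    NONE = "none"
    REFERENCED = "referenced"
    EMBEDDED = "embedded"


def picture_payload(
    mode: ImageMode, uri: str, data: bytes, mime: str = "image/png"
) -> Optional[str]:
    """Return what would be stored for one image under the given mode."""
    if mode is ImageMode.NONE:
        return None  # placeholder only, no image content is kept
    if mode is ImageMode.REFERENCED:
        return uri  # keep just the URI reference
    # EMBEDDED: inline the bytes as a base64 data URL
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

For instance, `picture_payload(ImageMode.EMBEDDED, "a.png", b"abc")` yields a `data:image/png;base64,...` string, while the `NONE` mode yields `None`.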

Usage Example:

from docling.datamodel.base_models import InputFormat
from docling.backend.html_backend import HTMLBackendOptions, ImageOptions
from docling.document_converter import DocumentConverter, HTMLFormatOption

# Configure HTML backend with embedded images
converter = DocumentConverter(
    format_options={
        InputFormat.HTML: HTMLFormatOption(
            backend_options=HTMLBackendOptions(
                image_options=ImageOptions.EMBEDDED
            )
        )
    }
)

Breaking Changes:

None - backward compatible with optional backend options.

Remaining Issues

  • resolve_source_to_stream returns only the end portion of the URL (e.g., "about") from full URLs like https://www.website.com/section/about. This prevents the HTML backend from properly resolving relative image paths since the full base URL is needed for correct image downloading.
  • SVG images are not supported: PIL/Pillow cannot open SVG files, so they are skipped.
  • The error message for failed image embedding is incorrect for HTML documents. It shows: <!-- 🖼️❌ Image not available. Please use PdfPipelineOptions(generate_picture_images=True) --> (This can occur for images with relative paths, SVG images, or any other case where an image cannot be opened.)
  • Some images cannot be loaded from Wikipedia due to: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy
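
To make the first issue above concrete, here is a small sketch using only the standard library's `urllib.parse`. It does not call `resolve_source_to_stream`; it only shows why the full page URL is needed as the base when resolving image paths. The image paths are made-up examples.

```python
from urllib.parse import urljoin, urlparse

page_url = "https://www.website.com/section/about"

# With the full page URL as base, relative and protocol-relative sources resolve.
print(urljoin(page_url, "images/logo.png"))
print(urljoin(page_url, "//example.com/pic.png"))

# With only the trailing portion ("about") as base, the result has no scheme
# or host, so there is nothing to download from.
truncated = urljoin("about", "images/logo.png")
print(urlparse(truncated).scheme == "", urlparse(truncated).netloc == "")
```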

I believe that these issues fall outside the scope of this PR and should be handled in future PRs.

Issue resolved by this Pull Request:
Resolves #1963

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@github-actions
Contributor

github-actions bot commented Jul 29, 2025

DCO Check Passed

Thanks @Leg0shii, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Jul 29, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@Leg0shii
Contributor Author

Are documentation and examples necessary? If so, where would I add them?

@ceberam ceberam requested review from ceberam and dolfim-ibm July 29, 2025 21:54
@ceberam ceberam self-assigned this Jul 29, 2025
@ceberam ceberam added the labels enhancement (New feature or request) and html (issue related to html backend) Jul 29, 2025
@ceberam
Contributor

ceberam commented Jul 29, 2025

Hello @Leg0shii
Thanks again for your willingness to improve Docling, we really appreciate your effort.
This PR is connected to the feature of image handling in HTML, but the design of this feature extends beyond this backend parser, so we need to check the implications of all the suggested changes very carefully.
We have some feedback that should be addressed.

In terms of design:

  • By default, Docling should never access external sites without explicit instructions from the user for several reasons, including security and user privacy. The same applies to local resources. Otherwise, converting HTML pages with malicious relative paths could expose sensitive data through the parsing and the export functions.
  • We suggest introducing 2 new flag variables in the AbstractDocumentBackend class:
    • enable_remote_fetch, to allow fetching remote images (or other resources), and
    • enable_local_fetch, to allow fetching local images (or other resources).
  • These flag variables should default to False and they would be used by the backend implementations. The idea is similar to the option enable_remote_services, which explicitly opts in to communicating with external services. Check Using remote services in the documentation for more details.
  • The HTMLDocumentBackend should never pull images by default (remotely or locally). The backend should check if the options enable_remote_fetch and enable_local_fetch have been set to True to enable that functionality. If the backend attempts to fetch images without an explicit option, an OperationNotAllowed exception should be thrown.
  • Turning DeclarativeDocumentBackend into a generic model is a good idea for handling backend options. For the type variable, we prefer a naming convention like BackendOptionsT instead of TBackendOptions. These options should be optional in the backend constructors.
  • The class BackendOptions should have a string field (e.g., kind) to distinguish the subclasses such as HTMLBackendOptions. We foresee the use of unions in type annotations and therefore having discriminated unions with str discriminators will be more efficient. You can find the same approach in BaseVlmOptions or BaseAsrOptions.
  • The class HTMLBackendOptions should have the field image_fetch (instead of image_options) of type boolean. If False (default), the backend will not access remote or local resources to fetch images. If True, the backend will try to fetch those resources and embed them in DoclingDocument. The first case corresponds to ImageOptions.NONE and the second to ImageOptions.EMBEDDED in your suggested implementation. We therefore drop the 3rd scenario (ImageOptions.REFERENCED), since we believe that DoclingDocument should be self-contained and keeping just image references should be the task of the serializers.
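
The opt-in behavior described in the list above could look roughly like the following sketch. It is not Docling code: `FetchPolicy` and `check_fetch_allowed` are hypothetical names, `OperationNotAllowed` is redefined locally as a stand-in for the exception named in the review, and letting data URLs through without a flag is an assumption of this example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


class OperationNotAllowed(Exception):
    """Local stand-in for the exception named in the review."""


@dataclass
class FetchPolicy:
    """Hypothetical container for the two opt-in flags; both default to False."""

    enable_remote_fetch: bool = False
    enable_local_fetch: bool = False


def check_fetch_allowed(src: str, policy: FetchPolicy) -> None:
    """Raise unless the user explicitly opted in to this kind of fetch."""
    scheme = urlparse(src).scheme.lower()
    if scheme == "data":
        # Assumption: inline data URLs need no fetch, so no flag is required.
        return
    # Protocol-relative sources (//host/path) count as remote.
    is_remote = scheme in ("http", "https") or src.startswith("//")
    if is_remote and not policy.enable_remote_fetch:
        raise OperationNotAllowed(f"remote fetch not enabled: {src}")
    if not is_remote and not policy.enable_local_fetch:
        raise OperationNotAllowed(f"local fetch not enabled: {src}")
```

With the defaults, both `https://example.com/a.png` and `images/a.png` are rejected; each is accepted only once the matching flag is set to True.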

Other technical aspects:

  • Check your development environment since many modules (including some that are not related to this PR) show as full diff changes on git (e.g., test_backend_jats.py). We cannot provide a proper PR review in this situation.
  • Ensure backwards compatibility (the new backend options should be optional). Therefore, all the test modules of the declarative backends (except test_backend_html.py) should not be modified.
  • Even though it is not enforced by the pre-commit hooks, please try to add docstrings to the new classes and functions, following the Google docstring convention.
  • In particular, provide some documentation on the HTMLBackendOptions fields through pydantic's Field function and its description argument.
  • Avoid remote calls in regression tests, since we do not want to put extra burden on our CI/CD pipelines. Consider using unittest.mock for the embedded image option of the HTMLDocumentBackend.
  • Rebase the PR on the main branch and resolve the conflicts, since we merged some commits today.
  • You may want to add @vaaale as co-author in some commits, since the HTML image handling is based on their initial implementation.
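
For the point about avoiding remote calls in regression tests, a test along these lines would keep the network out of CI. This is only a sketch: `fetch_image_bytes` is a hypothetical helper standing in for whatever the backend uses to download an image, and it assumes `urllib.request` as the HTTP client, which may differ from the actual implementation.

```python
import unittest
import urllib.request
from unittest import mock


def fetch_image_bytes(url: str) -> bytes:
    """Hypothetical helper that downloads an image; stands in for backend code."""
    with urllib.request.urlopen(url) as response:
        return response.read()


class FetchImageTest(unittest.TestCase):
    def test_fetch_without_network(self) -> None:
        fake_response = mock.MagicMock()
        fake_response.read.return_value = b"fake-png-bytes"
        # urlopen is used as a context manager, so __enter__ must return the fake.
        fake_response.__enter__.return_value = fake_response
        with mock.patch(
            "urllib.request.urlopen", return_value=fake_response
        ) as opened:
            data = fetch_image_bytes("https://example.com/pic.png")
        # The helper returned the canned bytes and no real request was made.
        self.assertEqual(data, b"fake-png-bytes")
        opened.assert_called_once_with("https://example.com/pic.png")
```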

Further improvements, out of the scope of this task, and besides those that you already listed:

  • Enable backend options in the CLI
  • Backend options should be allowed to be extended with HTTP request headers (like User-Agent) to comply with remote service policies (e.g., User-Agent policy to avoid the 403 error messages that you pointed out).

Contributor

@ceberam ceberam left a comment

Please, see the comment above

@ceberam
Contributor

ceberam commented Jul 29, 2025

> Are documentation and examples necessary? If so, where would I add them?

Besides docstrings, no further documentation is needed in this PR.

@Leg0shii
Contributor Author

Leg0shii commented Aug 3, 2025

Hello, thank you for the feedback, I really appreciate it!
I already took a look at the suggested changes, but sadly I got sick and cannot implement them at the moment.
Maybe someone else can take over from here?

@ceberam
Contributor

ceberam commented Aug 18, 2025

> Hello, thank you for the feedback, I really appreciate it! I already took a look at the suggested changes, but sadly I got sick and cannot implement them at the moment. Maybe someone else can take over from here?

@Leg0shii I will take it from there this week. Wish you a good recovery!
If you give me permissions to push to your fork, that would be helpful.

@Leg0shii
Contributor Author

I'm sorry for the late reply. I have invited you, @ceberam.

@irajank

irajank commented Sep 23, 2025

Waiting for this PR to be Merged @ceberam @dolfim-ibm
Please <3

@ceberam
Contributor

ceberam commented Sep 23, 2025

> Waiting for this PR to be Merged @ceberam @dolfim-ibm Please <3

@irajank Thanks for your interest in this feature. We had to do an extensive refactoring of the initial implementation in this PR, but we are in the testing phase at the moment, so hopefully we will release it within this week.

@irajank

irajank commented Sep 23, 2025

> > Waiting for this PR to be Merged @ceberam @dolfim-ibm Please <3
>
> @irajank Thanks for your interest in this feature. We had to do an extensive refactoring of the initial implementation in this PR, but we are in the testing phase at the moment, so hopefully we will release it within this week.

Hey, thanks for the speedy response. Looking forward to the merge as soon as possible.
Thanks again.

@punit1108

@ceberam @dolfim-ibm Any idea when we can expect this PR to be merged?

@ceberam
Contributor

ceberam commented Sep 29, 2025

> @ceberam @dolfim-ibm Any idea when we can expect this PR to be merged?

@punit1108 We expect to have it merged by the end of today

@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch from c9d3efc to 25f8bc3 Compare October 10, 2025 15:43
@ceberam ceberam marked this pull request as draft October 10, 2025 15:56
@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch from 25f8bc3 to 2826b8c Compare October 10, 2025 16:01
@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch 4 times, most recently from a15c4fd to 6b57b88 Compare October 16, 2025 11:17
@ceberam
Contributor

ceberam commented Oct 17, 2025

The refactoring of this PR is done and ready for review. Please @cau-git @dolfim-ibm @PeterStaar-IBM have a look at it.

  • It takes into account the discussions above. The main goal is to parse HTML pages with links to images by optionally fetching those images (either remotely or locally) and embedding them into the DoclingDocument.
  • It introduces backend options for DeclarativeDocumentBackend classes only (i.e., the standard PDF pipeline does not change). The use of this artifact is optional and therefore nothing changes for most backends. In this PR only the HTML and the Markdown backends are altered (the latter since it uses HTML in certain cases).
  • The use of backend options is only possible programmatically. The CLI is not yet enabled, but I would do that later, once we agree on the proposed design and implementation in this PR.
  • It refactors the caption in img elements: the alt attribute is taken instead of the dummy text Image Hyperlink.
  • Other small improvements to:
    • Replace deprecated annotations from the typing library with native types (set, tuple, list, dict)
    • Replace some typing annotations according to pydantic guidelines (e.g., return types of model validators)
    • Fix type annotation flaws in datamodel/document.py.
    • Some documentation with pydantic annotations

@ceberam ceberam marked this pull request as ready for review October 17, 2025 16:02
@PeterStaar-IBM
Contributor

@ceberam the tests seem to be failing, can you have a quick look?

@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch from 45cd339 to 06bcc9f Compare October 20, 2025 07:03
ceberam
ceberam previously approved these changes Oct 20, 2025
@ceberam ceberam requested a review from cau-git October 20, 2025 08:41
dolfim-ibm
dolfim-ibm previously approved these changes Oct 20, 2025
Contributor

@dolfim-ibm dolfim-ibm left a comment

lgtm

PeterStaar-IBM
PeterStaar-IBM previously approved these changes Oct 20, 2025
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

lgtm!

Contributor

@cau-git cau-git left a comment

@ceberam great progress. I posted a few remarks below.

Leg0shii and others added 10 commits October 21, 2025 10:22
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…ove HTML image handling

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…are set correctly

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…le paths

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Backend options for DeclarativeDocumentBackend classes and only when necessary.
Refactor caption parsing in 'img' elements and remove dummy text.
Replace deprecated annotations from Typing library with native types.
Replace typing annotations according to pydantic guidelines.
Some documentation with pydantic annotations.
Fix diff issue with test files.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Move backend option classes to its own module within datamodel package.
Rename 'source_location' with 'source_uri' in HTMLBackendOptions.
Rename 'image_fetch' with 'fetch_images' in HTMLBackendOptions.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
@ceberam ceberam dismissed stale reviews from PeterStaar-IBM, dolfim-ibm, and themself via cdb0230 October 21, 2025 08:32
@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch from 06bcc9f to cdb0230 Compare October 21, 2025 08:32
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch from d5c1433 to 24c3e99 Compare October 21, 2025 09:56
@dolfim-ibm dolfim-ibm merged commit a30e6a7 into docling-project:main Oct 21, 2025
23 checks passed
@dosubot

dosubot bot commented Oct 21, 2025

Documentation Updates

Checked 3 published document(s). No updates required.

Labels

enhancement (New feature or request), html (issue related to html backend)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow parameters in HTML backend or any DeclarativeDocumentBackend implementation

7 participants