@jkwatson (Collaborator) commented Jul 8, 2025

No description provided.

@Copilot Copilot AI review requested due to automatic review settings July 8, 2025 18:21
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR updates the docling integration by bumping its version and replacing the existing hierarchical chunker with a new hybrid chunker implementation in the document reader.

  • Bump docling dependency to >=2.40.0 and add docling-ibm-models override
  • Swap HierarchicalChunker for HybridChunker and remove manual serialization
  • Clean up related imports

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| llm-service/pyproject.toml | Upgraded docling version and added docling-ibm-models override |
| llm-service/app/ai/indexing/readers/docling_reader.py | Switched to HybridChunker, removed per-item serialization, updated imports |

Comments suppressed due to low confidence (5)

llm-service/app/ai/indexing/readers/docling_reader.py:45

  • Duplicate import of BaseChunk detected on line 47 as well. Remove one of these to avoid confusion.
from docling_core.transforms.chunker import BaseChunk

llm-service/app/ai/indexing/readers/docling_reader.py:49

  • SerializationResult is no longer used after removing manual serialization. Consider removing this import.
from docling_core.transforms.serializer.base import SerializationResult

llm-service/app/ai/indexing/readers/docling_reader.py:50

  • MarkdownDocSerializer import is unused since serialization was removed. You can delete this import.
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer
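
The three import comments above (one duplicate, two unused) are the kind of issue that can be caught mechanically. The following is a minimal sketch using only the standard-library `ast` module; `audit_imports` and `SAMPLE` are hypothetical names for illustration, the sample reuses the import lines quoted in the review, and the function below them is made up so that `BaseChunk` counts as used. It is not how Copilot or any particular linter works.

```python
import ast
from collections import Counter


def audit_imports(source: str) -> tuple[list[str], list[str]]:
    """Return (duplicate_names, unused_names) for the imports found in source."""
    tree = ast.parse(source)

    # Collect every name bound by an import statement.
    imported: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                if alias.name == "*":
                    continue  # star imports bind unknown names; skip them
                # "import a.b" binds "a"; "from m import x as y" binds "y".
                imported.append(alias.asname or alias.name.split(".")[0])

    # Every identifier that is read anywhere in the module.
    used = {
        n.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }

    counts = Counter(imported)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    unused = sorted(name for name in counts if name not in used)
    return duplicates, unused


# Hypothetical excerpt: the imports are quoted from the review, the function is invented.
SAMPLE = """
from docling_core.transforms.chunker import BaseChunk
from docling_core.transforms.chunker import BaseChunk
from docling_core.transforms.serializer.base import SerializationResult
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

def first(chunks: list[BaseChunk]) -> BaseChunk:
    return chunks[0]
"""

duplicates, unused = audit_imports(SAMPLE)
print(duplicates)  # ['BaseChunk']
print(unused)      # ['MarkdownDocSerializer', 'SerializationResult']
```

The source is only parsed, never imported, so the check runs without docling installed. A real linter also handles `__all__`, re-exports, and string annotations, which this sketch ignores.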

llm-service/app/ai/indexing/readers/docling_reader.py:72

  • MarkdownSerializerProvider is not imported, so this will raise a NameError. Add from docling_core.transforms.serializer.markdown import MarkdownSerializerProvider (or correct provider) to the imports.
        chunky_chunks = HybridChunker(serializer_provider=MarkdownSerializerProvider()).chunk(docling_doc.document)
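
A missing import like this can pass an import-time smoke test, because Python looks up global names only when the line executes. The sketch below shows that behavior with a hypothetical, deliberately undefined name (`MissingProvider`) standing in for the provider class; it does not use docling at all.

```python
def build_chunker():
    # MissingProvider is a hypothetical name that is never imported or defined.
    return MissingProvider()  # noqa: F821


# Defining the function above succeeds; the failure appears only when it is called.
try:
    build_chunker()
    outcome = "no error"
except NameError as exc:
    outcome = f"NameError: {exc}"

print(outcome)
```

In practice this means the error would surface the first time a document is chunked, not when the module is loaded.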

llm-service/app/ai/indexing/readers/docling_reader.py:84

  • Variable document is not defined in this scope. Did you mean to use docling_doc.document.metadata or pull metadata from docling_doc?
            node.metadata["file_name"] = document.metadata["file_name"]
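
Both this comment and the one about `MarkdownSerializerProvider` describe names that are read but never bound. As a rough illustration, the sketch below flags such names with the standard-library `ast` module. `unbound_names` and `SAMPLE` are hypothetical, the sample function is modeled on the two reviewed lines rather than copied from the file, and the check deliberately ignores scoping: a name bound anywhere in the module counts as bound.

```python
import ast
import builtins


def unbound_names(source: str) -> list[str]:
    """List names that are read in source but never bound anywhere in it."""
    tree = ast.parse(source)
    bound = set(dir(builtins))

    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            bound.update(a.asname or a.name.split(".")[0] for a in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)  # function parameters
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)  # "except E as name"
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)  # assignments, loop targets, etc.

    loaded = {
        n.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }
    return sorted(loaded - bound)


# Hypothetical function built around the two lines quoted in the review.
SAMPLE = """
from docling.chunking import HybridChunker

def to_nodes(docling_doc, nodes):
    chunks = HybridChunker(serializer_provider=MarkdownSerializerProvider()).chunk(docling_doc.document)
    for node in nodes:
        node.metadata["file_name"] = document.metadata["file_name"]
    return chunks
"""

print(unbound_names(SAMPLE))  # ['MarkdownSerializerProvider', 'document']
```

Which value should replace `document` is not something the review settles; the sketch only shows that the name has no binding in scope.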

@jkwatson jkwatson merged commit d04759d into main Jul 8, 2025
3 checks passed
@jkwatson jkwatson deleted the mob/cherrypicking branch July 8, 2025 18:27
2 participants