Is your feature request related to a problem? Please describe.
We expect structural information to improve the quality of search results and to allow predicted answers to be mapped back to the original documents (e.g. PDF pages).
Plan
How to add structural information like headlines as metadata to Documents?
Problem:
File Converters return a single Document object containing the whole text. This Document is only split at the PreProcessor, which doesn't access the original PDF/markdown files.
One possibility:
Add a metadata field to the Document containing the headlines + the spans for which they apply for the whole file. This metadata field needs to be adapted in the PreProcessor when the Document is split, so that the spans still match the text of each resulting Document.
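
A minimal sketch of what this could look like, assuming a plain dict stands in for the Document and that headlines are stored as character spans. The `headlines` field layout (`headline`, `start`, `end`), the function name `split_with_headlines`, and the fixed-length character split are all hypothetical and only illustrate the idea; they are not the existing PreProcessor API.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a Document: {"content": str, "meta": {...}}.
Document = Dict[str, Any]


def split_with_headlines(doc: Document, split_length: int) -> List[Document]:
    """Split the text into chunks of `split_length` characters and rebase
    each headline span so it is relative to the chunk it falls into."""
    text = doc["content"]
    headlines = doc["meta"].get("headlines", [])
    chunks: List[Document] = []
    for chunk_start in range(0, len(text), split_length):
        chunk_end = min(chunk_start + split_length, len(text))
        chunk_headlines = []
        for h in headlines:
            # Keep a headline only if its span overlaps this chunk.
            if h["start"] < chunk_end and h["end"] > chunk_start:
                chunk_headlines.append({
                    "headline": h["headline"],
                    # Clip the span to the chunk, then shift to chunk-local offsets.
                    "start": max(h["start"], chunk_start) - chunk_start,
                    "end": min(h["end"], chunk_end) - chunk_start,
                })
        chunks.append({
            "content": text[chunk_start:chunk_end],
            "meta": {"headlines": chunk_headlines},
        })
    return chunks


# What a file converter might return: one Document with whole-file spans.
text = "Intro\nHello world.\nUsage\nRun the tool."
usage_start = text.index("Usage")
doc = {
    "content": text,
    "meta": {
        "headlines": [
            {"headline": "Intro", "start": 0, "end": usage_start},
            {"headline": "Usage", "start": usage_start, "end": len(text)},
        ]
    },
}

chunks = split_with_headlines(doc, split_length=25)
for chunk in chunks:
    print(repr(chunk["content"]), chunk["meta"]["headlines"])
```

With spans stored this way, each split Document carries only the headlines that apply to its own text, so an answer found in a chunk can be traced back to the section it came from.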
Assess usage of LayoutLM for extracting structural elements of PDFs #3058