Extract structure from PDF and markdown files #2809

masci · 2022-07-14T10:17:32Z

Is your feature request related to a problem? Please describe.
We expect structural information to improve the quality of search results and to allow predicted answers to be mapped back to their location in the original documents (e.g. PDF pages).

Plan

How can we add structural information, such as headlines, to Documents as metadata?
Problem:
File Converters return a single Document object containing the whole text. This Document is only split later, in the PreProcessor, which has no access to the original PDF/markdown files.
One possibility:
Add a metadata field to the Document that holds the headlines of the whole file together with the span of text each headline applies to. The PreProcessor then needs to adapt this field whenever it splits the Document.
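
To make the idea concrete, here is a minimal sketch of how it could work for markdown. It is not the Haystack API. The function names (`extract_headlines`, `split_with_headlines`), the metadata keys (`headline`, `level`, `start`, `end`), and the fixed-size character split are all assumptions chosen for illustration, and a headline is assumed to apply until the next headline starts.

```python
import re
from typing import Dict, List

# ATX-style markdown headlines only ("# Title" up to "###### Title").
HEADLINE_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_headlines(text: str) -> List[Dict]:
    """Converter side: collect headlines with the character span they cover.

    Assumption: a headline applies from its own start up to the start of
    the next headline (or the end of the text).
    """
    matches = list(HEADLINE_RE.finditer(text))
    headlines = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        headlines.append(
            {"headline": m.group(2), "level": len(m.group(1)),
             "start": m.start(), "end": end}
        )
    return headlines


def split_with_headlines(text: str, headlines: List[Dict], chunk_size: int) -> List[Dict]:
    """Splitter side: cut the text into chunks and adapt the headline spans.

    Each chunk keeps only the headlines whose span overlaps it, with the
    span clipped to the chunk and re-based to chunk-local offsets.
    """
    chunks = []
    for chunk_start in range(0, len(text), chunk_size):
        chunk_end = min(chunk_start + chunk_size, len(text))
        local = [
            {"headline": h["headline"], "level": h["level"],
             "start": max(h["start"], chunk_start) - chunk_start,
             "end": min(h["end"], chunk_end) - chunk_start}
            for h in headlines
            if h["start"] < chunk_end and h["end"] > chunk_start
        ]
        chunks.append({"content": text[chunk_start:chunk_end],
                       "meta": {"headlines": local}})
    return chunks
```

Because the spans are computed once on the full text, the splitting step never needs the original file. It only needs to clip and shift offsets, which is the adaptation described above.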

masci · 2022-09-05T11:52:15Z

Let's remove #3058 from the scope of this epic; we'll tackle that separately.

masci added the epic label Jul 14, 2022

danielbichuetti mentioned this issue Jul 19, 2022

SQL based Datastores fail when document metadata has a list #2792

Closed

masci assigned bogdankostic and vblagoje Jul 20, 2022

masci added the epic:idle (Epic not yet started) label Jul 28, 2022

masci added the epic:in-progress (Epic is in progress) label and removed the epic:idle (Epic not yet started) label Sep 1, 2022

masci closed this as completed Nov 2, 2022
