
Extract structure from PDF and markdown files #2809

Closed
2 of 3 tasks
masci opened this issue Jul 14, 2022 · 1 comment


masci commented Jul 14, 2022

Is your feature request related to a problem? Please describe.
We expect structural information to improve the quality of search results, so that predicted answers can be mapped back to the original documents (e.g. PDF pages).

Plan

How to add structural information like headlines as metadata to Documents?
Problem:
File Converters return a single Document object containing the whole text. This Document is only split later, in the PreProcessor, which has no access to the original PDF/markdown files.
One possibility:
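Add a metadata field to the Document that contains the headlines and the spans to which they apply across the whole file. The PreProcessor then needs to adapt this metadata field when it splits the Document, so that each resulting chunk keeps only the headlines relevant to it.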
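A minimal sketch of this idea in plain Python, to make the two steps concrete. It is not Haystack's actual API: `Doc`, `convert_markdown`, `split_doc`, the `headlines` metadata key, and the fixed-length character split are all hypothetical names and simplifications chosen for illustration. The converter records each markdown headline with the character span it covers, and the splitter clips and rebases those spans for every chunk.

```python
import re
from dataclasses import dataclass, field
from typing import List

# ATX-style markdown headings only ("# Title" ... "###### Title"); an assumption of this sketch.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus free-form metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def convert_markdown(text: str) -> Doc:
    """Converter step: keep the whole text, record headlines and their spans."""
    matches = list(HEADING_RE.finditer(text))
    headlines = []
    for i, m in enumerate(matches):
        # A headline applies from its own start up to the next headline (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        headlines.append(
            {"headline": m.group(2), "level": len(m.group(1)), "start": m.start(), "end": end}
        )
    return Doc(content=text, meta={"headlines": headlines})


def split_doc(doc: Doc, split_length: int) -> List[Doc]:
    """Splitter step: cut by character count and adapt the headline spans per chunk."""
    chunks = []
    for offset in range(0, len(doc.content), split_length):
        chunk_end = offset + split_length
        local = []
        for h in doc.meta.get("headlines", []):
            # Keep a headline if its span overlaps this chunk at all.
            if h["start"] < chunk_end and h["end"] > offset:
                local.append(
                    {
                        "headline": h["headline"],
                        "level": h["level"],
                        # Clip to the chunk and rebase to chunk-local offsets.
                        "start": max(h["start"], offset) - offset,
                        "end": min(h["end"], chunk_end) - offset,
                    }
                )
        chunks.append(Doc(content=doc.content[offset:chunk_end], meta={"headlines": local}))
    return chunks
```

With this layout, a chunk that starts in the middle of a section still carries the headline that applies to it, even though the headline text itself sits in an earlier chunk. A real splitter would cut on words, sentences, or passages rather than raw characters, but the span clipping would work the same way.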

@masci masci added the epic label Jul 14, 2022
@masci masci added the epic:idle Epic not yet started label Jul 28, 2022
@masci masci added epic:in-progress Epic is in progress and removed epic:idle Epic not yet started labels Sep 1, 2022

masci commented Sep 5, 2022

Let's remove #3058 from the scope of this epic, we'll tackle that separately.

@masci masci closed this as completed Nov 2, 2022