Adding splitting information in the metadata of DocumentSplitter output #7389
Comments
@tradicio are you using this information in your application?
Hi @anakin87, thanks for your message! You're correct, I am using _split_id and _split_overlap in my application to keep track of the order of the chunks produced by PreProcessor(). This lets me display the text chunks in the same order as they are stored in the DocumentStore(), which is very useful for checking how a long text is divided and ordered. Now that I am introducing Haystack v2 in the application, I would like to keep this functionality so I can continue to display the sorted output of the DocumentSplitter() class. Let me know if you need further clarification, I am really glad to contribute!
Ok, I understand! Let's involve @julian-risch for an opinion since he worked on these components.
@tradicio Thanks for the feedback. Yes, I agree we should add the advanced _split_id and _split_overlap metadata in a future iteration of DocumentSplitter. 👍
@tradicio If you feel like it, go ahead and try to create a PR. |
Sure, I'll try to create a PR in the next few days! Thanks for all your support, I am really glad to contribute.
Is your feature request related to a problem? Please describe.
When splitting a document in Haystack v1, the _create_docs_from_splits() method of the PreProcessor class saves relevant metadata such as _split_id and _split_overlap. So far, the DocumentSplitter class in Haystack v2 does not do the same. I think it would be useful to include this metadata in the final output of DocumentSplitter.
Describe the solution you'd like
In the run() method, one solution could be to reproduce an adapted version of _create_docs_from_splits() from the PreProcessor class in Haystack v1.
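To make the idea concrete, here is a minimal, library-free sketch of what attaching split metadata to each chunk could look like. It is not the Haystack implementation: the Chunk class and the split_with_metadata() function are hypothetical stand-ins, the splitting is word-based only, and _split_overlap is simplified to the number of words shared with the previous chunk (Haystack v1 may record overlap in a different form).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a Haystack Document (content plus meta)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def split_with_metadata(text: str, split_length: int, split_overlap: int) -> List[Chunk]:
    """Split text into word windows and record the position of each window.

    Assumption: _split_overlap is stored as the count of words shared with
    the previous chunk, which is a simplification made for this sketch.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks: List[Chunk] = []

    for split_id, start in enumerate(range(0, len(words), step)):
        window = words[start:start + split_length]
        # The first chunk has no predecessor, so it shares no words.
        overlap = 0 if split_id == 0 else min(split_overlap, len(window))
        chunks.append(
            Chunk(
                content=" ".join(window),
                meta={"_split_id": split_id, "_split_overlap": overlap},
            )
        )
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break

    return chunks
```

With this metadata in place, chunks retrieved in arbitrary order can be put back in document order with sorted(chunks, key=lambda c: c.meta["_split_id"]).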
Additional context
To check the differences between the outputs of the two Haystack versions, build an indexing pipeline in Haystack v2 and compare its output with that of the PreProcessor in Haystack v1.