Adding splitting information in the metadata of DocumentSplitter output #7389

tradicio · 2024-03-20T15:42:43Z

Is your feature request related to a problem? Please describe.
When splitting a document in Haystack v1, the method _create_docs_from_splits() of the PreProcessor class saves relevant metadata such as _split_id and _split_overlap. So far, the DocumentSplitter class in Haystack v2 does not do the same. I think it would be useful to include this metadata in the final output of DocumentSplitter.

Describe the solution you'd like
In the run() method, one solution could be to reproduce an adapted version of _create_docs_from_splits() from the PreProcessor class in Haystack v1.
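To make the request concrete, here is a minimal sketch in plain Python of what attaching this metadata while splitting could look like. It is not the Haystack implementation: the function name split_with_metadata and the dict-based chunk layout are hypothetical, and _split_overlap is simplified to a count of words shared with the previous chunk, which may differ from what Haystack v1 stores.

```python
from typing import Dict, List


def split_with_metadata(text: str, split_length: int, split_overlap: int) -> List[Dict]:
    """Split `text` into overlapping word windows and record each chunk's position.

    Hypothetical helper for illustration only; not part of Haystack.
    """
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks: List[Dict] = []
    for split_id, start in enumerate(range(0, len(words), step)):
        window = words[start:start + split_length]
        chunks.append(
            {
                "content": " ".join(window),
                "meta": {
                    # position of the chunk within the source text
                    "_split_id": split_id,
                    # assumption: number of words shared with the previous chunk
                    "_split_overlap": split_overlap if split_id > 0 else 0,
                },
            }
        )
        # stop once the window has reached the end of the text
        if start + split_length >= len(words):
            break
    return chunks
```

Calling split_with_metadata("a b c d e f g", split_length=3, split_overlap=1) yields three chunks whose _split_id values run from 0 to 2, so the original order can be recovered later.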

Additional context
To check the differences between the outputs of the two Haystack versions, you need to build an indexing pipeline in Haystack v2 and compare its output with that of the PreProcessor in Haystack v1.

anakin87 · 2024-03-22T15:25:09Z

@tradicio are you using this information in your application?
Please explain your use case to understand how we can support it, possibly reintroducing this information.

tradicio · 2024-03-27T14:53:35Z

Hi @anakin87, thanks for your message!

You're correct, I am using _split_id and _split_overlap in my application to keep track of the order of the chunks produced by PreProcessor(). This allows text chunks stored in the DocumentStore to be displayed in their original order, and it is very useful for checking how the information from a long text is divided and ordered.

With the introduction of Haystack v2 in the application, I would like to keep this functionality so that I can continue to display the sorted output of the DocumentSplitter class.
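As a rough illustration of this use case, the sketch below restores chunk order from the metadata after the chunks have been retrieved in arbitrary order. It assumes each chunk is a plain dict carrying a _split_id and a source_id in its meta; the helper name order_chunks and the source_id key are assumptions made for the example, not a confirmed Haystack interface.

```python
from itertools import groupby
from typing import Dict, List


def order_chunks(chunks: List[Dict]) -> Dict[str, List[str]]:
    """Group chunks by source document and sort each group by _split_id.

    Hypothetical helper; assumes meta contains "source_id" and "_split_id".
    """
    def source(chunk: Dict) -> str:
        return chunk["meta"]["source_id"]

    # sort by source first so groupby sees each document's chunks contiguously
    ordered = sorted(chunks, key=lambda c: (source(c), c["meta"]["_split_id"]))
    return {
        src: [c["content"] for c in group]
        for src, group in groupby(ordered, key=source)
    }


# chunks as they might come back from a store, in no particular order
retrieved = [
    {"content": "second part", "meta": {"source_id": "doc1", "_split_id": 1}},
    {"content": "only part", "meta": {"source_id": "doc2", "_split_id": 0}},
    {"content": "first part", "meta": {"source_id": "doc1", "_split_id": 0}},
]
in_order = order_chunks(retrieved)
```

Without a per-chunk index such as _split_id, this ordering cannot be reconstructed from the chunk contents alone, which is the gap described above.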

Let me know if you need further clarification, I am really glad to contribute!

anakin87 · 2024-03-27T15:01:46Z

Ok, I understand!

Let's involve @julian-risch for an opinion since he worked on these components.

julian-risch · 2024-03-27T15:04:44Z

@tradicio Thanks for the feedback. I agree we should add the _split_id and _split_overlap metadata to the next iteration of DocumentSplitter. 👍

anakin87 · 2024-03-27T15:36:53Z

@tradicio If you feel like it, go ahead and try to create a PR.

tradicio · 2024-03-28T09:01:23Z

> @tradicio If you feel like it, go ahead and try to create a PR.

Sure, I'll try to create a PR in the next few days! Thanks for all your support, I am really glad to contribute.

anakin87 added the topic:preprocessing and type:feature labels Mar 20, 2024

anakin87 added the community-triage label Mar 22, 2024

anakin87 added the 2.x label Mar 27, 2024

github-actions bot added the stale label May 10, 2024

github-actions bot closed this as not planned May 22, 2024

anakin87 reopened this May 22, 2024

anakin87 removed the stale and community-triage labels May 22, 2024

anakin87 mentioned this issue Jun 18, 2024

Implement Sentence-Window Retrieval for benchmark evaluation #7843

Closed

davidsbatista self-assigned this Jun 24, 2024

davidsbatista mentioned this issue Jun 26, 2024

feat : adding split_id and split_overlap to DocumentSplitter #7933

Merged

davidsbatista closed this as completed in #7933 Jun 27, 2024
