[MEDI] Clarify the IngestionChunkWriter.WriteAsync contract around documents #6970

@roji

Description

VectorStoreWriter.WriteAsync currently assumes that all chunks passed to a single WriteAsync invocation belong to the same document: preExistingKeys is initialized only once, for the first chunk (see code). This assumption should either be encoded more strongly in the documentation and API, or possibly revisited.

If we want to keep this behavior, where IngestionChunkWriter.WriteAsync is called once for all the chunks of a single document, we should probably:

  • Document it as such
  • Consider validating it (remember the document ID of the first chunk, throw if any subsequent chunk in the loop has a different one)
  • Consider renaming the API from WriteAsync to something like WriteDocumentAsync, or WriteDocumentChunksAsync
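
To make the validation option concrete, here is a minimal sketch of the "remember the first document ID, throw on mismatch" check. It is not the actual MEDI code: the `Chunk` record and the `SingleDocumentGuard.EnsureSingleDocument` helper are hypothetical names made up for illustration, and the real chunk type may expose its document identity differently.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real chunk type; only the document ID matters here.
public sealed record Chunk(string DocumentId, string Content);

public static class SingleDocumentGuard
{
    // Passes chunks through unchanged, but throws as soon as a chunk's
    // document ID differs from the document ID of the first chunk.
    public static async IAsyncEnumerable<Chunk> EnsureSingleDocument(
        IAsyncEnumerable<Chunk> chunks,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        string? firstDocumentId = null;

        await foreach (Chunk chunk in chunks.WithCancellation(cancellationToken))
        {
            // Remember the document ID of the first chunk only.
            firstDocumentId ??= chunk.DocumentId;

            if (!string.Equals(chunk.DocumentId, firstDocumentId, StringComparison.Ordinal))
            {
                throw new InvalidOperationException(
                    $"All chunks in one call must belong to document '{firstDocumentId}', " +
                    $"but a chunk from document '{chunk.DocumentId}' was found.");
            }

            yield return chunk;
        }
    }
}
```

A writer could wrap its input with this helper before the existing loop, so the check costs one string comparison per chunk and fails on the first offending chunk rather than silently leaving stale chunks behind.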

Another option is to relax this, and allow chunks from multiple documents in the same WriteAsync invocation; whether this makes sense depends on the larger architecture of an MEDI pipeline. Allowing this would mean revisiting how (and possibly when) we delete the previous chunks of a document that's being newly ingested (overwritten).
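
For the relaxed option, one possible shape is to track pre-existing keys per document rather than per call. The sketch below is only an illustration under assumptions: `DocChunk`, `InMemoryChunkStore`, and `MultiDocumentWriter` are hypothetical names, the store is a plain in-memory dictionary rather than a real vector store, and deferring all deletions to the end of the stream is just one answer to the "when" question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical chunk shape: a storage key plus the ID of the owning document.
public sealed record DocChunk(string Key, string DocumentId, string Content);

// Hypothetical in-memory store standing in for a vector store collection.
public sealed class InMemoryChunkStore
{
    private readonly Dictionary<string, DocChunk> _byKey = new();

    public List<string> AllKeys() =>
        _byKey.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList();

    public List<string> GetKeysForDocument(string documentId) =>
        _byKey.Values.Where(c => c.DocumentId == documentId).Select(c => c.Key).ToList();

    public void Upsert(DocChunk chunk) => _byKey[chunk.Key] = chunk;

    public void Delete(IEnumerable<string> keys)
    {
        foreach (string key in keys) _byKey.Remove(key);
    }
}

public static class MultiDocumentWriter
{
    public static async Task WriteAsync(InMemoryChunkStore store, IAsyncEnumerable<DocChunk> chunks)
    {
        // Pre-existing keys are looked up once per document, not once per call.
        var staleKeysByDocument = new Dictionary<string, HashSet<string>>();

        await foreach (DocChunk chunk in chunks)
        {
            if (!staleKeysByDocument.TryGetValue(chunk.DocumentId, out var staleKeys))
            {
                staleKeys = new HashSet<string>(store.GetKeysForDocument(chunk.DocumentId));
                staleKeysByDocument[chunk.DocumentId] = staleKeys;
            }

            store.Upsert(chunk);
            staleKeys.Remove(chunk.Key); // overwritten in place, so no longer stale
        }

        // Delete whatever was not overwritten, for every document seen in this call.
        foreach (HashSet<string> leftover in staleKeysByDocument.Values)
        {
            store.Delete(leftover);
        }
    }
}
```

With this shape, chunks from different documents can be interleaved freely, and documents that do not appear in the call are left untouched. The trade-off is that old chunks stay visible until the whole stream has been consumed, and a failure midway leaves them in place.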

I think we should figure this out before GA'ing, as a change here would be breaking.
