💫 Release v0.30.0 #1410
samsja announced in Announcements
💫 Release v0.30.0 (a.k.a. DocArray v2)
Changelog
If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.
DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.
This gives the following advantages:
You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:
For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
Changes to Document
Document has been renamed to BaseDoc.
BaseDoc cannot be used directly; it has to be extended. Each document class is therefore created through a dataclass-like interface.
BaseDoc allows a flexible schema, whereas the Document class in v1 allowed only a fixed schema with one of tensor, text, and blob, plus additional chunks and matches.
Some v1 methods (such as .load_uri_to_image_tensor()) are not supported in v2. Instead, we provide some of those methods on the typing level.
v2 provides a LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. LegacyDocument can be useful for starting to migrate your codebase from v1 to v2. However, its API is not fully compatible with the DocArray v1 Document: none of the methods associated with Document are present, and only the schema of the data is similar.
Changes to DocumentArray
DocList
The DocumentArray class from v1 has been renamed to DocList, to be more descriptive of its actual functionality, since it is a list of BaseDocs.
DocVec
v2 also introduces DocVec, which is a column-based representation of BaseDocs. Both DocVec and DocList extend AnyDocArray.
DocVec is a container of documents suited to computations that require batches of data (e.g. matrix multiplication, distance calculation, a deep learning forward pass).
DocVec has a similar interface to DocList, but its underlying implementation is column-based instead of row-based. Each field of the DocVec's schema (the .doc_type, which is a BaseDoc) is stored in a column. If the field is a tensor, the data from all documents is stored as a single doc_vec (Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor or a Union of tensor types, the .tensor_type is used to determine the type of the doc_vec column.
Parameterized DocList
A DocList does not necessarily have to be homogeneous.
To make a DocList homogeneous, you can parameterize it at initialization time.
Methods such as .from_csv() or .pull() only work with parameterized DocLists.
Access attributes of your DocumentArray
AnyDocArray exposes the same attributes as the BaseDocs it contains. Accessing an attribute returns a list of type(attribute). However, this works only if all the BaseDocs in the AnyDocArray have the same schema.
Changes to Document Store
In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity. DocArray v2 DocIndex supports Weaviate, Qdrant, ElasticSearch, and HNSWLib.
Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice.
In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.
Thank you to all of the contributors to this release.
This discussion was created from the release 💫 Release v0.30.0.