💫 Release v0.19.0

@github-actions released this 15 Nov 16:19
1409e52

Release Note (0.19.0)

Release time: 2022-11-15 15:22:16

This release contains 2 breaking changes, 11 new features, 1 performance improvement, 7 bug fixes and 7 documentation improvements.

💥 Breaking changes

  • DocumentArray now supports Qdrant versions above 0.10.1, and drops support for previous versions (#726)
  • DocumentArray now supports Weaviate versions above 1.16.0 and client 3.9.0, and drops support for previous versions (#736)

🆕 Features

Add flag to disable list-like structure and behavior (#730, #766, #768, #762)

Sometimes you do not need to use a DocumentArray as a list and access Documents by offset. This capability requires keeping an Offset2ID mapping in the store, which adds overhead.

Now, when using a DocumentArray with external storage, you can disable this behavior. This improves performance when accessing Documents by ID while disallowing some list-like behavior.

from docarray import DocumentArray

da = DocumentArray(storage='qdrant', config={'n_dim': 10, 'list_like': False})
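
To see where the overhead comes from, here is a minimal, library-independent sketch. ToyStore and its fields are hypothetical and are not DocArray's implementation; they only model a store that either maintains an offset-to-ID list or skips it.

```python
class ToyStore:
    """Hypothetical key-value store; NOT DocArray's actual implementation."""

    def __init__(self, list_like=True):
        self.list_like = list_like
        self._docs = {}        # id -> document payload
        self._offset2id = []   # extra bookkeeping, only used when list_like=True

    def add(self, doc_id, payload):
        self._docs[doc_id] = payload
        if self.list_like:
            # Every insert also has to update (and eventually persist) the mapping.
            self._offset2id.append(doc_id)

    def __getitem__(self, key):
        if isinstance(key, int):
            if not self.list_like:
                raise IndexError('offset access is disabled when list_like=False')
            return self._docs[self._offset2id[key]]
        return self._docs[key]  # access by ID never needs the mapping


store = ToyStore(list_like=False)
store.add('a', {'text': 'hello'})
print(store['a'])  # {'text': 'hello'}
```

In this toy model, access by ID keeps working with list_like=False, while access by integer offset is rejected because the mapping is never built.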

Support find by text and filter for ElasticSearch and Redis backends (#740)

The ElasticSearch and Redis document stores now support finding by text while applying a filter.

from docarray import DocumentArray, Document

da = DocumentArray(storage='elasticsearch', config={'n_dim': 32, 'columns': {'price': 'int'}, 'index_text': True})

with da:
    da.extend(
        [Document(tags={'price': i}, text=f'pizza {i}') for i in range(10)]
    )
    da.extend(
        [
            Document(tags={'price': i}, text=f'noodles {i}')
            for i in range(10)
        ]
    )

results = da.find('pizza', filter={
    'range': {
        'price': {
            'lte': 5,
        }
    }
})

assert len(results) > 0
assert all([r.tags['price'] <= 5 for r in results])  # 'lte' is inclusive
assert all(['pizza' in r.text for r in results])
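
The combined query keeps only Documents that match the text and also satisfy the filter. The plain-Python sketch below mimics that selection on dictionaries. find_by_text_with_filter is a hypothetical helper, not part of DocArray or Elasticsearch, and it uses simple substring matching rather than a real text index. Note that lte is inclusive, so a price of exactly 5 passes.

```python
def find_by_text_with_filter(docs, query, max_price):
    """Hypothetical helper: substring match on text plus an inclusive upper bound."""
    return [
        d for d in docs
        if query in d['text'] and d['price'] <= max_price  # 'lte' means <=
    ]


docs = [{'text': f'pizza {i}', 'price': i} for i in range(10)]
docs += [{'text': f'noodles {i}', 'price': i} for i in range(10)]

hits = find_by_text_with_filter(docs, 'pizza', max_price=5)
print([d['price'] for d in hits])  # [0, 1, 2, 3, 4, 5]
```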

Add 3D data handling of mesh vertices and faces (#709, #717)

DocArray now supports loading data with vertices and faces to represent 3D objects. You can visualize them using display:

from docarray import Document

doc = Document(uri='some/uri')
doc.load_uri_to_vertices_and_faces()
doc.display()


Add embed_and_evaluate method (#702, #731)

DocumentArray has a new embed_and_evaluate method that performs embedding, matching, and computation of evaluation metrics in a single call. It batches these operations to reduce the memory footprint.

import numpy as np
from docarray import Document, DocumentArray


def emb_func(da):
    for d in da:
        np.random.seed(int(d.text))
        d.embedding = np.random.random(5)


da = DocumentArray(
    [Document(text=str(i), tags={'label': i % 10}) for i in range(1_000)]
)

da.embed_and_evaluate(
    metrics=['precision_at_k'], embed_funcs=emb_func, query_sample_size=100
)

When evaluating 100 query vectors against 500,000 index vectors with 500 dimensions, peak memory usage drops from 2360.4 MiB to 1439.9 MiB:

Manual Evaluation:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    28   1130.7 MiB   1130.7 MiB           1   @profile
    29                                         def run_evaluation_old_style(queries, index, model):
    30   1133.1 MiB      2.5 MiB           1       queries.embed(model)
    31   2345.6 MiB   1212.4 MiB           1       index.embed(model)
    32   2360.4 MiB     14.8 MiB           1       queries.match(index)
    33   2360.4 MiB      0.0 MiB           1       return queries.evaluate(metrics=['reciprocal_rank'])

Evaluation with embed_and_evaluate (batch_size 100,000):

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    23   1130.6 MiB   1130.6 MiB           1   @profile
    24                                         def run_evaluation(queries, index, model, batch_size=None):
    25   1130.6 MiB      0.0 MiB           1       kwargs = {'match_batch_size':batch_size} if batch_size else {}
    26   1439.9 MiB    309.3 MiB           1       return queries.embed_and_evaluate(metrics=['reciprocal_rank'], index_data=index, embed_models=model, **kwargs)
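
As a rough illustration of the batching idea (not DocArray's actual implementation), the sketch below scores a query against an index one batch at a time and keeps only a running top-k, so only one batch of index embeddings is held at once. batched_top_k, the embed callable, and the dot-product score are assumptions chosen for brevity.

```python
import heapq


def batched_top_k(query, index_items, embed, k=3, batch_size=2):
    """Return the k best (score, item) pairs, embedding one batch at a time."""
    best = []  # min-heap holding at most k (score, item) pairs
    for start in range(0, len(index_items), batch_size):
        batch = index_items[start:start + batch_size]
        vectors = [embed(item) for item in batch]  # only this batch is embedded now
        for item, vec in zip(batch, vectors):
            score = sum(q * v for q, v in zip(query, vec))  # dot product
            if len(best) < k:
                heapq.heappush(best, (score, item))
            else:
                heapq.heappushpop(best, (score, item))  # drop the current worst
    return sorted(best, reverse=True)


# Toy embedding: map an integer to a 2-d vector.
top = batched_top_k([1.0, 0.0], list(range(10)), embed=lambda i: [float(i), 1.0])
print(top)  # [(9.0, 9), (8.0, 8), (7.0, 7)]
```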

Update Qdrant version to 0.10.1 (#726)

This release supports Qdrant versions above 0.10.1. This comes with a lot of performance improvements and bug fixes on the backend.

Add filter support for Qdrant document store (#652)

Qdrant document store now supports pure filtering:

from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
da = DocumentArray(
    storage='qdrant',
    config={'n_dim': n_dim, 'columns': {'price': 'float'}},
)
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )


max_price = 7
n_limit = 4
filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}

results = da.filter(filter=filter, limit=n_limit)

print('\nPoints with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')

This prints:

Points with "price" at most 7:
	embedding=[6. 6. 6.],	 price=6
	embedding=[7. 7. 7.],	 price=7
	embedding=[1. 1. 1.],	 price=1
	embedding=[2. 2. 2.],	 price=2
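
To make the filter semantics concrete, the following sketch evaluates the same kind of filter dictionary over plain payloads. The matches function is a hypothetical evaluator written for this note; it handles only must clauses with range conditions and is not how Qdrant or DocArray evaluate filters internally.

```python
import operator

# Comparison operators that a range condition may use.
RANGE_OPS = {
    'gt': operator.gt,
    'gte': operator.ge,
    'lt': operator.lt,
    'lte': operator.le,
}


def matches(payload, flt):
    """Hypothetical evaluator: every 'must' range condition has to hold."""
    for cond in flt.get('must', []):
        value = payload.get(cond['key'])
        if value is None:
            return False
        for op_name, bound in cond['range'].items():
            if not RANGE_OPS[op_name](value, bound):
                return False
    return True


flt = {'must': [{'key': 'price', 'range': {'lte': 7}}]}
payloads = [{'price': i} for i in range(10)]
selected = [p['price'] for p in payloads if matches(p, flt)]
print(selected)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Eight payloads satisfy the condition here; in the DocArray example above, limit=4 returns four of the matching Documents.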

Support passing search_params in find for Qdrant document store (#675)

You can now pass search_params to the find interface when using Qdrant.

results = da.find(np_query, filter=filter, limit=n_limit, search_params={"hnsw_ef": 64})

Add login and logout proxy methods to DocumentArray (#697)

DocArray offers login and logout methods to log in to and out of your Jina AI Cloud account directly from DocArray.

from docarray import login, logout
login()
# you are logged in
logout()
# you are logged out

Add docarray version to push (#710)

When pushing a DocumentArray to the cloud, the docarray version is now added as metadata.

Add args to load_uri_to_video_tensor (#663)

load_uri_to_video_tensor() now accepts the keyword arguments that are available in av.open():

from docarray import Document

doc = Document(uri='/some/uri')
doc.load_uri_to_video_tensor(timeout=5000)

Update Weaviate server to v1.16.1 and client to 3.9.0 (#736, #750)

This release adds support for Weaviate versions above v1.16.0. Make sure to use version 1.16.1 of the Weaviate backend to enjoy all Weaviate features.

🚀 Performance

Sync sub-index only when parent is synced (#719)

Previously, if you used the sub-index feature, DocArray would persist the offset2id mapping of the chunk sub-index every time you added new Documents with chunks. With this change, the sub-index's offset2id is persisted once, when the parent DocumentArray's offset2id is persisted.
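
The effect can be sketched with a toy model that counts persistence calls. ToyIndex, its sync method, and the persist_calls counter are hypothetical names used only for illustration; they do not mirror DocArray's internals.

```python
class ToyIndex:
    """Hypothetical index that counts how often its offset mapping is persisted."""

    def __init__(self, subindex=None):
        self.subindex = subindex
        self.offset2id = []
        self.persist_calls = 0

    def add(self, ids, chunk_ids=()):
        self.offset2id.extend(ids)
        if self.subindex is not None:
            # Update the sub-index mapping in memory only; do not persist yet.
            self.subindex.offset2id.extend(chunk_ids)

    def sync(self):
        # Persisting the parent also persists the sub-index, once.
        self.persist_calls += 1
        if self.subindex is not None:
            self.subindex.sync()


chunks = ToyIndex()
parent = ToyIndex(subindex=chunks)
for i in range(3):
    parent.add([f'doc{i}'], chunk_ids=[f'doc{i}-c0', f'doc{i}-c1'])
parent.sync()
print(chunks.persist_calls)  # 1
```

Three add calls result in a single persistence of the sub-index mapping, instead of one per call.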

🐞 Bug Fixes

Exception for all from generator calls on instance (#659)

Previously, calling a generator class method such as from_csv on a DocumentArray instance did not change the DocumentArray in place, which was unintuitive.

Now these methods can no longer be called on DocumentArray instances; doing so raises an AttributeError.

from docarray import DocumentArray

da = DocumentArray()
da.from_files(
    patterns='*.*',
    size=2,
)
AttributeError: Class method can't be called from a DocumentArray instance but only from the DocumentArray class.
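
One general way to get this behavior in Python is a descriptor that binds to the class but refuses instance access. The sketch below is an assumption about how such a guard could be written, not DocArray's actual code; ClassOnlyMethod, Box, and from_range are hypothetical names.

```python
import types


class ClassOnlyMethod:
    """Hypothetical descriptor: acts like classmethod but rejects instance access."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is not None:
            raise AttributeError(
                f"Class method can't be called from a {owner.__name__} instance "
                f"but only from the {owner.__name__} class."
            )
        return types.MethodType(self.func, owner)  # bind the function to the class


class Box:
    def __init__(self, items=None):
        self.items = list(items or [])

    @ClassOnlyMethod
    def from_range(cls, n):
        return cls(range(n))


print(Box.from_range(3).items)  # [0, 1, 2]
try:
    Box().from_range(3)
except AttributeError as exc:
    print(exc)
```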

Fix markup error in summary (#739)

Previously, calling summary on a Document that contains some textual patterns would raise an Exception from rich. This release uses the Text class from rich to ensure the text is properly rendered.

Convert score of search results to float (#707)

When using find or match interfaces with Redis document store, scores are now returned as float and not string.

Initialize doc with dataclass obj and kwargs (#694)

Allow initialization of a Document instance with a dataclass object as well as additional kwargs.
Previously, when a Document was initialized with both a dataclass object and kwargs, the attributes passed with the dataclass object were overridden. Now both are kept:

from docarray import dataclass, Document
from docarray.typing import Text

@dataclass
class MyDoc:
    chunk_text: Text

d = Document(MyDoc(chunk_text='chunk level text'), text='top level text')

assert d.text == 'top level text'
assert d.chunk_text.text == 'chunk level text'

Attribute error with empty list in dataclass (#674)

Allow passing an empty List as field input of a dataclass:

from typing import List

from docarray import Document, dataclass
from docarray.typing import Text


@dataclass
class A:
    img: List[Text]


Document(A(img=[]))

Propagate context enter and exit to subindices (#737)

When using DocumentArray as a context manager, subindices are now handled as context managers as well.
This makes handling subindices more robust.
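
A common pattern for this kind of propagation is contextlib.ExitStack. The sketch below shows the idea with a hypothetical ToyArray class; it is an illustration under that assumption, not DocArray's implementation.

```python
from contextlib import ExitStack


class ToyArray:
    """Hypothetical container whose context also enters and exits its subindices."""

    def __init__(self, name, subindices=()):
        self.name = name
        self.subindices = list(subindices)
        self.events = []
        self._stack = None

    def __enter__(self):
        self.events.append('enter')
        self._stack = ExitStack()
        for sub in self.subindices:
            self._stack.enter_context(sub)  # propagate __enter__ to each subindex
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stack.close()  # propagate __exit__, in reverse order of entry
        self.events.append('exit')
        return False  # do not swallow exceptions


chunks = ToyArray('chunks')
with ToyArray('parent', subindices=[chunks]) as parent:
    pass
print(chunks.events)  # ['enter', 'exit']
```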

Correct type hint for tags in DocumentData (#735)

Change the type hint for tags in docarray.document.data.DocumentData from tags: Optional[Dict[str, 'StructValueType']] to tags: Optional[Dict[str, Any]].
This stops the IDE complaining when passing nested dictionaries inside tags.

📗 Documentation Improvements

Add new benchmark page with SIFT1M dataset (#691)

Change the benchmark section of the docs to use the SIFT1M dataset. Also add QPS-Recall graphs to compare how different document stores perform in DocArray.

Other documentation improvements

  • Add Colab notebook for interactive 3D data visualization (#749)
  • Fix Finetuner links in README (#706)
  • Use URL instead of session state in version selector (#693)
  • Replace plot with display (#689)
  • Add versioned documentation (#664)
  • Complement and rewrite evaluation docs (#662)

🤟 Contributors

We would like to thank all contributors to this release: