docs: multi modalities (#1317)
* docs: add multi modalities section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: data types section add :

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: first draft image modality

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: image display with mkdocs

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: image display with mkdocs

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: image display with mkdocs

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: fix second image

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: second image

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: add empty sections and 3d mesh iframe

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: add first draft of 3d mesh section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: update image display_notebook.jpg for image section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: remove duplicate mesh display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: section header in mesh section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: add first draft of audio section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: update audio file

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: add first draft of video section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: fix video display in video section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: first draft table section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* chore: add mkdocs-video

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: move mkdocs-video from markdown-extensions to plugins section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: add header to empty sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: fix video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: use resized video

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video display

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: display video

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* feat: enable copy to clipboard in mkdocs for code snippets

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* feat: add extra.css file to change highlight color in code blocks

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: image and other sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: apply samis suggestions from code review

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: note with cmd instead of python field

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: fix audio section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: fix black docs

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: audio tensor import in docarray.typing and audiodoc documentation

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: update video section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: video doc and audio docs

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: mesh 3d section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: table section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: remove duplicates in intro sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: move indexing part in video bytes to make more readable

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* refactor: change all DocArray to DocList

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: rebase missed dash

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: mypy, add type hints

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: add emojis to headers

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: text section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: getting started sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: multimodal section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: collapse output sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: collapse sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: clean up data types section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* test: add data types section to tests

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: add books.csv to toydata

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: move apple png to toydata dir

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: apply johannes' suggestions from code review

Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai>

* fix: move apple.png

* fix: fix docstrings for predefined docs, without testing for now

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: mark missing links

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: adjust links

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: remove link placeholders

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: add missing links

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: clean up

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: apply suggestions from code review

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: apply suggestions

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* test: add csv and tsv file to toydata dir

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: docs tests

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: fix audio section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: image section

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* docs: fix tests

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* test: adjust test_docs

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: adjust paths to github files

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: doc string test for documents

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: swap docvec and anydocarray sections

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

* fix: run grammarly on .md files

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>

---------

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai>
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
anna-charlotte and JohannesMessner committed Apr 12, 2023
1 parent 7b47249 commit b8f178e
Showing 42 changed files with 4,439 additions and 518 deletions.
107 changes: 53 additions & 54 deletions docarray/array/any_array.py
@@ -121,7 +121,7 @@ def _set_data_column(
field: str,
values: Union[List, T, 'AbstractTensor'],
):
"""Set all Documents in this DocList using the passed values
"""Set all Documents in this [`DocList`][docarray.typing.DocList] using the passed values
:param field: name of the fields to extract
:values: the values to set at the DocList level
@@ -140,7 +140,7 @@ def to_protobuf(self) -> 'DocListProto':
...

def _to_node_protobuf(self) -> 'NodeProto':
"""Convert a DocList into a NodeProto protobuf message.
"""Convert a [`DocList`][docarray.typing.DocList] into a NodeProto protobuf message.
This function should be called when a DocList
is nested into another Document that need to be converted into a protobuf
@@ -157,82 +157,81 @@ def traverse_flat(
) -> Union[List[Any], 'AbstractTensor']:
"""
Return a List of the accessed objects when applying the `access_path`. If this
results in a nested list or list of DocLists, the list will be flattened
results in a nested list or list of [`DocList`s][docarray.typing.DocList], the list will be flattened
on the first level. The access path is a string that consists of attribute
names, concatenated and "__"-separated. It describes the path from the first
level to an arbitrary one, e.g. 'content__image__url'.
names, concatenated and `"__"`-separated. It describes the path from the first
level to an arbitrary one, e.g. `'content__image__url'`.
:param access_path: a string that represents the access path ("__"-separated).
:param access_path: a string that represents the access path (`"__"`-separated).
:return: list of the accessed objects, flattened if nested.
EXAMPLE USAGE
.. code-block:: python
from docarray import BaseDoc, DocList, Text
```python
from docarray import BaseDoc, DocList, Text
class Author(BaseDoc):
name: str
class Author(BaseDoc):
name: str
class Book(BaseDoc):
author: Author
content: Text
class Book(BaseDoc):
author: Author
content: Text
docs = DocList[Book](
Book(author=Author(name='Jenny'), content=Text(text=f'book_{i}'))
for i in range(10) # noqa: E501
)
docs = DocList[Book](
Book(author=Author(name='Jenny'), content=Text(text=f'book_{i}'))
for i in range(10) # noqa: E501
)
books = docs.traverse_flat(access_path='content') # list of 10 Text objs
books = docs.traverse_flat(access_path='content') # list of 10 Text objs
authors = docs.traverse_flat(access_path='author__name') # list of 10 strings
authors = docs.traverse_flat(access_path='author__name') # list of 10 strings
```
If the resulting list is a nested list, it will be flattened:
EXAMPLE USAGE
.. code-block:: python
from docarray import BaseDoc, DocList
```python
from docarray import BaseDoc, DocList
class Chapter(BaseDoc):
content: str
class Chapter(BaseDoc):
content: str
class Book(BaseDoc):
chapters: DocList[Chapter]
class Book(BaseDoc):
chapters: DocList[Chapter]
docs = DocList[Book](
Book(chapters=DocList[Chapter]([Chapter(content='some_content') for _ in range(3)]))
for _ in range(10)
)
chapters = docs.traverse_flat(access_path='chapters') # list of 30 strings
docs = DocList[Book](
Book(chapters=DocList[Chapter]([Chapter(content='some_content') for _ in range(3)]))
for _ in range(10)
)
If your DocList is in doc_vec mode, and you want to access a field of
type AnyTensor, the doc_vec tensor will be returned instead of a list:
chapters = docs.traverse_flat(access_path='chapters') # list of 30 strings
```
EXAMPLE USAGE
.. code-block:: python
class Image(BaseDoc):
tensor: TorchTensor[3, 224, 224]
If your [`DocList`][docarray.typing.DocList] is in doc_vec mode, and you want to access a field of
type [`AnyTensor`][docarray.typing.AnyTensor], the doc_vec tensor will be returned instead of a list:
```python
class Image(BaseDoc):
tensor: TorchTensor[3, 224, 224]
batch = DocList[Image](
[
Image(
tensor=torch.zeros(3, 224, 224),
)
for _ in range(2)
]
)
batch_stacked = batch.stack()
tensors = batch_stacked.traverse_flat(
access_path='tensor'
) # tensor of shape (2, 3, 224, 224)
batch = DocList[Image](
[
Image(
tensor=torch.zeros(3, 224, 224),
)
for _ in range(2)
]
)
batch_stacked = batch.stack()
tensors = batch_stacked.traverse_flat(
access_path='tensor'
) # tensor of shape (2, 3, 224, 224)
```
"""
...
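
The access-path behaviour described in the `traverse_flat` docstring can be illustrated without docarray at all. The following is only a minimal sketch on plain dataclasses: `traverse_flat_sketch` and the `Author`, `Chapter` and `Book` classes are hypothetical stand-ins, not the library's implementation, and the sketch assumes that flattening is applied whenever every reached value is a list.

```python
from dataclasses import dataclass
from typing import Any, List


def traverse_flat_sketch(items: List[Any], access_path: str) -> List[Any]:
    """Hypothetical stand-in for traverse_flat on plain Python objects."""
    current = list(items)
    for name in access_path.split('__'):
        # Follow one attribute on every object reached so far.
        current = [getattr(obj, name) for obj in current]
        # Assumption: flatten one level when every value is itself a list.
        if current and all(isinstance(value, list) for value in current):
            current = [inner for value in current for inner in value]
    return current


@dataclass
class Chapter:
    content: str


@dataclass
class Author:
    name: str


@dataclass
class Book:
    author: Author
    chapters: List[Chapter]


books = [
    Book(
        author=Author(name='Jenny'),
        chapters=[Chapter(content=f'c{i}_{j}') for j in range(3)],
    )
    for i in range(10)
]

names = traverse_flat_sketch(books, 'author__name')  # one name per book
chapters = traverse_flat_sketch(books, 'chapters')  # nested lists flattened
contents = traverse_flat_sketch(books, 'chapters__content')
```

In this sketch, `'chapters'` yields the 30 `Chapter` objects themselves, while `'chapters__content'` is the path that yields their 30 strings.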

@@ -264,7 +263,7 @@ def _flatten_one_level(sequence: List[Any]) -> List[Any]:

def summary(self):
"""
Print a summary of this DocList object and a summary of the schema of its
Print a summary of this [`DocList`][docarray.typing.DocList] object and a summary of the schema of its
Document type.
"""
DocArraySummary(self).summary()
@@ -276,13 +275,13 @@ def _batch(
show_progress: bool = False,
) -> Generator[T, None, None]:
"""
Creates a `Generator` that yields `DocList` of size `batch_size`.
Creates a `Generator` that yields [`DocList`][docarray.typing.DocList] of size `batch_size`.
Note, that the last batch might be smaller than `batch_size`.
:param batch_size: Size of each generated batch.
:param shuffle: If set, shuffle the Documents before dividing into minibatches.
:param show_progress: if set, show a progress bar when batching documents.
:yield: a Generator of `DocList`, each in the length of `batch_size`
:yield: a Generator of [`DocList`][docarray.typing.DocList], each in the length of `batch_size`
"""
from rich.progress import track
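
The batching contract in the `_batch` docstring (batches of `batch_size`, a possibly smaller last batch, optional shuffling) can be sketched in plain Python. This is an illustration under assumptions, not the body of `_batch`: `batch_sketch` is a hypothetical name, it works on any sequence rather than a `DocList`, and it omits the progress bar.

```python
import random
from typing import Generator, List, Sequence, TypeVar

T = TypeVar('T')


def batch_sketch(
    items: Sequence[T], batch_size: int, shuffle: bool = False
) -> Generator[List[T], None, None]:
    """Hypothetical helper: yield consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    indices = list(range(len(items)))
    if shuffle:
        # Shuffle the order once, before dividing into batches.
        random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        # The final slice may hold fewer than batch_size indices.
        yield [items[i] for i in indices[start : start + batch_size]]


batches = list(batch_sketch(list(range(10)), batch_size=4))
shuffled = list(batch_sketch(list(range(10)), batch_size=4, shuffle=True))
```

With ten items and `batch_size=4`, the sketch yields two full batches followed by one batch of two items.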

76 changes: 39 additions & 37 deletions docarray/array/doc_list/doc_list.py
@@ -68,9 +68,8 @@ class DocList(
homogeneous and follow the same schema. To precise this schema you can use
the `DocList[MyDocument]` syntax where MyDocument is a Document class
(i.e. schema). This creates a DocList that can only contains Documents of
the type 'MyDocument'.
the type `MyDocument`.
---
```python
from docarray import BaseDoc, DocList
@@ -86,36 +85,39 @@ class Image(BaseDoc):
docs = DocList[Image](
Image(url='http://url.com/foo.png') for _ in range(10)
) # noqa: E510
```
---
# If your DocList is homogeneous (i.e. follows the same schema), you can access
# fields at the DocList level (for example `docs.tensor` or `docs.url`).
print(docs.url)
# [ImageUrl('http://url.com/foo.png', host_type='domain'), ...]
If your DocList is homogeneous (i.e. follows the same schema), you can access
fields at the DocList level (for example `docs.tensor` or `docs.url`).
You can also set fields, with `docs.tensor = np.random.random([10, 100])`:
# You can also set fields, with `docs.tensor = np.random.random([10, 100])`:
print(docs.url)
# [ImageUrl('http://url.com/foo.png', host_type='domain'), ...]
import numpy as np
import numpy as np
docs.tensor = np.random.random([10, 100])
print(docs.tensor)
# [NdArray([0.11299577, 0.47206767, 0.481723 , 0.34754724, 0.15016037,
# 0.88861321, 0.88317666, 0.93845579, 0.60486676, ... ]), ...]
docs.tensor = np.random.random([10, 100])
You can index into a DocList like a numpy doc_list or torch tensor:
print(docs.tensor)
# [NdArray([0.11299577, 0.47206767, 0.481723 , 0.34754724, 0.15016037,
# 0.88861321, 0.88317666, 0.93845579, 0.60486676, ... ]), ...]
docs[0] # index by position
docs[0:5:2] # index by slice
docs[[0, 2, 3]] # index by list of indices
docs[True, False, True, True, ...] # index by boolean mask
# You can index into a DocList like a numpy doc_list or torch tensor:
You can delete items from a DocList like a Python List
docs[0] # index by position
docs[0:5:2] # index by slice
docs[[0, 2, 3]] # index by list of indices
docs[True, False, True, True, ...] # index by boolean mask
del docs[0] # remove first element from DocList
del docs[0:5] # remove elements for 0 to 5 from DocList
# You can delete items from a DocList like a Python List
del docs[0] # remove first element from DocList
del docs[0:5] # remove elements for 0 to 5 from DocList
```
:param docs: iterable of Document
@@ -135,10 +137,10 @@ def construct(
docs: Sequence[T_doc],
) -> T:
"""
Create a DocList without validation any data. The data must come from a
Create a `DocList` without validation any data. The data must come from a
trusted source
:param docs: a Sequence (list) of Document with the same schema
:return:
:return: a `DocList` object
"""
new_docs = cls.__new__(cls)
new_docs._data = docs if isinstance(docs, list) else list(docs)
@@ -154,13 +156,13 @@ def __eq__(self, other: Any) -> bool:

def _validate_docs(self, docs: Iterable[T_doc]) -> Iterable[T_doc]:
"""
Validate if an Iterable of Document are compatible with this DocList
Validate if an Iterable of Document are compatible with this `DocList`
"""
for doc in docs:
yield self._validate_one_doc(doc)

def _validate_one_doc(self, doc: T_doc) -> T_doc:
"""Validate if a Document is compatible with this DocList"""
"""Validate if a Document is compatible with this `DocList`"""
if not issubclass(self.doc_type, AnyDoc) and not isinstance(doc, self.doc_type):
raise ValueError(f'{doc} is not a {self.doc_type}')
return doc
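
The two hunks above contrast a validating path (`_validate_one_doc`) with a trusting one (`construct`). The same pattern can be sketched on a plain typed list. Everything here is hypothetical and simplified: `TypedListSketch` is not a docarray class, and it uses a plain `isinstance` check instead of the `AnyDoc` special case shown in the diff.

```python
from typing import Generic, Iterable, List, Type, TypeVar

T = TypeVar('T')


class TypedListSketch(Generic[T]):
    """Hypothetical list that only accepts instances of one type."""

    def __init__(self, item_type: Type[T], items: Iterable[T] = ()):
        self.item_type = item_type
        self._data: List[T] = [self._validate_one(item) for item in items]

    def _validate_one(self, item: T) -> T:
        # Reject anything that is not an instance of the declared type.
        if not isinstance(item, self.item_type):
            raise ValueError(f'{item!r} is not a {self.item_type.__name__}')
        return item

    def append(self, item: T) -> None:
        self._data.append(self._validate_one(item))

    @classmethod
    def construct(cls, item_type: Type[T], items: Iterable[T]) -> 'TypedListSketch[T]':
        # Skip validation entirely; the caller vouches for the data.
        new = cls.__new__(cls)
        new.item_type = item_type
        new._data = list(items)
        return new

    def __len__(self) -> int:
        return len(self._data)


numbers = TypedListSketch(int, [1, 2])
numbers.append(3)

try:
    numbers.append('four')
    rejected = False
except ValueError:
    rejected = True

# construct() accepts the mismatched item because nothing is checked.
unchecked = TypedListSketch.construct(int, [1, 'two'])
```

The trade-off is the one the `construct` docstring names: skipping validation avoids the per-item check, so it is only appropriate for data from a trusted source.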
@@ -178,25 +180,25 @@ def __bytes__(self) -> bytes:

def append(self, doc: T_doc):
"""
Append a Document to the DocList. The Document must be from the same class
as the doc_type of this DocList otherwise it will fail.
Append a Document to the `DocList`. The Document must be from the same class
as the `.doc_type` of this `DocList` otherwise it will fail.
:param doc: A Document
"""
self._data.append(self._validate_one_doc(doc))

def extend(self, docs: Iterable[T_doc]):
"""
Extend a DocList with an Iterable of Document. The Documents must be from
the same class as the doc_type of this DocList otherwise it will
Extend a `DocList` with an Iterable of Document. The Documents must be from
the same class as the `.doc_type` of this `DocList` otherwise it will
fail.
:param docs: Iterable of Documents
"""
self._data.extend(self._validate_docs(docs))

def insert(self, i: int, doc: T_doc):
"""
Insert a Document to the DocList. The Document must be from the same
class as the doc_type of this DocList otherwise it will fail.
Insert a Document to the `DocList`. The Document must be from the same
class as the doc_type of this `DocList` otherwise it will fail.
:param i: index to insert
:param doc: A Document
"""
@@ -238,10 +240,10 @@ def _set_data_column(
field: str,
values: Union[List, T, 'AbstractTensor'],
):
"""Set all Documents in this DocList using the passed values
"""Set all Documents in this `DocList` using the passed values
:param field: name of the fields to set
:values: the values to set at the DocList level
:values: the values to set at the `DocList` level
"""
...

@@ -253,11 +255,11 @@ def stack(
tensor_type: Type['AbstractTensor'] = NdArray,
) -> 'DocVec':
"""
Convert the DocList into a DocVec. `Self` cannot be used
Convert the `DocList` into a `DocVec`. `Self` cannot be used
afterwards
:param tensor_type: Tensor Class used to wrap the doc_vec tensors. This is useful
if the BaseDoc has some undefined tensor type like AnyTensor or Union of NdArray and TorchTensor
:return: A DocVec of the same document type as self
:return: A `DocVec` of the same document type as self
"""
from docarray.array.doc_vec.doc_vec import DocVec

@@ -291,7 +293,7 @@ def traverse_flat(
@classmethod
def from_protobuf(cls: Type[T], pb_msg: 'DocListProto') -> T:
"""create a Document from a protobuf message
:param pb_msg: The protobuf message from where to construct the DocList
:param pb_msg: The protobuf message from where to construct the `DocList`
"""
return super().from_protobuf(pb_msg)
