speed up setting docs by traversal paths #89

Closed
alaeddine-13 opened this issue Feb 1, 2022 · 1 comment

alaeddine-13 (Member) commented Feb 1, 2022

The current implementation of setting docs by traversal paths is inefficient because we have to retrieve the root documents from the storage backend.
A better implementation would traverse the documents from the storage backend and yield the root document together with each child. The child is then already referenced by its parent, so we only need to modify the child and persist the parent.
New, non-debugged code:

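import re
from collections.abc import Generator
from typing import Callable, Optional


def _gen_children_parent_pairs(docs_gen, parent):
    # Pair every item produced by the inner traversal with the given parent.
    for children in docs_gen:
        yield children, parent


def _gen_children_from_pairs(pair_gen):
    # Strip the parents again and yield only the children.
    if isinstance(pair_gen, Generator):
        for children, _ in pair_gen:
            # Flatten the recursive result instead of yielding a generator object.
            yield from _gen_children_from_pairs(children)
    elif isinstance(pair_gen, tuple):
        children, _ = pair_gen
        yield children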

class TraverseMixin:
    # Proposed replacement for the existing static method on the mixin.
    @staticmethod
    def _traverse(
        docs: 'T',
        path: str,
        filter_fn: Optional[Callable[['Document'], bool]] = None,
        parent: Optional['Document'] = None
    ):
        path = re.sub(r'\s+', '', path)
        if path:
            cur_loc, cur_slice, _left = _parse_path_string(path)
            if cur_loc == 'r':
                yield from _gen_children_parent_pairs(TraverseMixin._traverse(
                    docs[cur_slice], _left, filter_fn=filter_fn,
                ), parent)
            elif cur_loc == 'm':
                for d in docs:
                    yield from _gen_children_parent_pairs(TraverseMixin._traverse(
                        d.matches[cur_slice], _left, filter_fn=filter_fn, parent=d
                    ), parent)
            elif cur_loc == 'c':
                for d in docs:
                    yield from _gen_children_parent_pairs(TraverseMixin._traverse(
                        d.chunks[cur_slice], _left, filter_fn=filter_fn, parent=d
                    ), parent)
            else:
                raise ValueError(
                    f'`path`:{path} is invalid, please refer to https://docarray.jina.ai/fundamentals/documentarray/access-elements/#index-by-nested-structure'
                )
        elif filter_fn is None:
            yield docs, parent
        else:
            from .. import DocumentArray

            yield DocumentArray(list(filter(filter_fn, docs))), parent
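
To make the idea above concrete, here is a minimal, self-contained sketch that does not use DocArray at all. `Doc`, `ToyStore`, `iter_with_root` and `set_by_path` are hypothetical names invented for this illustration, and the path syntax is simplified to a string of `c` (chunks) and `m` (matches) letters with no `r` prefix and no slices. It only shows the pattern of yielding each child together with its root, mutating the child, and persisting each root once; it is not the proposed `_traverse` change itself.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a Document with nested chunks and matches."""
    id: str
    text: str = ''
    chunks: List['Doc'] = field(default_factory=list)
    matches: List['Doc'] = field(default_factory=list)


class ToyStore:
    """Hypothetical backend: load/save copy the root, mimicking (de)serialization."""

    def __init__(self, roots: List[Doc]):
        self._data: Dict[str, Doc] = {d.id: copy.deepcopy(d) for d in roots}
        self.loads = 0
        self.saves = 0

    def ids(self) -> List[str]:
        return list(self._data)

    def load(self, doc_id: str) -> Doc:
        self.loads += 1
        return copy.deepcopy(self._data[doc_id])

    def save(self, root: Doc) -> None:
        self.saves += 1
        self._data[root.id] = copy.deepcopy(root)


def _descend(docs: List[Doc], path: str) -> Iterator[Doc]:
    """Follow a simplified path made of 'c' (chunks) and 'm' (matches) letters."""
    if not path:
        yield from docs
        return
    attr = {'c': 'chunks', 'm': 'matches'}[path[0]]
    for d in docs:
        yield from _descend(getattr(d, attr), path[1:])


def iter_with_root(store: ToyStore, path: str) -> Iterator[Tuple[Doc, Doc]]:
    """Yield (child, root) pairs; each root is loaded from the store only once."""
    for doc_id in store.ids():
        root = store.load(doc_id)
        for child in _descend([root], path):
            yield child, root


def set_by_path(store: ToyStore, path: str, update: Callable[[Doc], None]) -> None:
    """Mutate every doc reached by `path`, then persist each touched root once."""
    touched: Dict[str, Doc] = {}
    for child, root in iter_with_root(store, path):
        update(child)  # the child is referenced by root, so root sees the change
        touched[root.id] = root
    for root in touched.values():
        store.save(root)


store = ToyStore([
    Doc('a', chunks=[Doc('a1'), Doc('a2')]),
    Doc('b', chunks=[Doc('b1')]),
])
set_by_path(store, 'c', lambda d: setattr(d, 'text', 'updated'))
print(store.loads, store.saves)  # prints "2 2"
```

In this sketch three chunks are updated with two loads and two saves, one per root, because the traversal hands back the root that already holds a reference to each child. Whether the same saving carries over to a real storage backend depends on how that backend serializes nested documents.
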

JoanFM (Member) commented Jun 21, 2022

Closing until a community or team member raises this again.

JoanFM closed this as completed Jun 21, 2022