
Using the Elastic DSL iterate with an Index #2998

@AndTheDaysGoBy

Search.scan was previously the way to use the Scroll API via elasticsearch-dsl. However, for many use cases the Scroll API has been deprecated in favor of the search_after approach. To support this, elasticsearch-dsl has a Search.iterate method which handles pagination for the user with default settings; in particular, you can't set the page size.

Now, suppose you have a Search object. You can set the index via Search(index='some-index') or with Search().index('some-index'). Either way, you have a Search object bound to an index, and you can then iterate over the documents in that index.

for x in Search(index='some-index').iterate():
    pass  # do something with x

However, this does not behave as I expect it to. In iterate,

def iterate(self, keep_alive: str = "1m") -> Iterator[_R]:

a point-in-time (PIT) is opened up, which makes sense to avoid the data changing under you.

However, my issue lies within the point_in_time method.

def point_in_time(self, keep_alive: str = "1m") -> Iterator[Self]:

It opens the point in time with the appropriate index. Next, however, notice how it takes self, i.e. the current Search object, clears the index, and then yields that search object. This might make sense in the situation the docstring describes, where you are constructing a point in time for multiple queries, e.g.

with s.point_in_time() as neo_s:
    neo_s.index('a').execute()
    neo_s.index('b').execute()

However, in the context of iterate, this causes problems because the index is never set again. Thus, each /search query issued by iterate will be against all indices, which could fail if the user doesn't have permission to read from all indices.
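To make the concern concrete, here is a toy model of the control flow as I read it. It is only a sketch: ToySearch, its log attribute, and the "<all indices>" label are hypothetical stand-ins, not the elasticsearch-dsl implementation, and nothing here talks to a cluster.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional


class ToySearch:
    """Hypothetical stand-in for a Search object; not the elasticsearch-dsl class."""

    def __init__(self, index: Optional[str] = None, log: Optional[List[str]] = None) -> None:
        self._index = index
        # Shared log of the index each simulated search request targets.
        self.log: List[str] = log if log is not None else []

    def index(self, name: Optional[str] = None) -> "ToySearch":
        # Returns a copy; in this model, calling it with no argument clears the index.
        return ToySearch(index=name, log=self.log)

    def execute(self) -> None:
        # Record the target instead of sending a request.
        self.log.append(self._index or "<all indices>")

    @contextmanager
    def point_in_time(self) -> Iterator["ToySearch"]:
        # The PIT would be opened against self._index here,
        # but the yielded search object has its index cleared.
        yield self.index()

    def iterate(self, pages: int = 2) -> None:
        with self.point_in_time() as s:
            for _ in range(pages):
                s.execute()  # the index is never set again


search = ToySearch(index="some-index")
search.iterate()
print(search.log)
```

In this model every simulated request is logged without an index, even though the search started out bound to some-index, which is the behavior I believe I am seeing.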

Please correct me if I'm wrong; this is just what seemed to be the issue when I tried to iterate on an index as a user with fixed read permissions.

As an aside, since search_after uses the sort values of the last hit in a response (per https://www.elastic.co/guide/en/elasticsearch/reference/8.18/paginate-search-results.html#search-after ), I am a bit confused as to why the search_after call in iterate is using the s object from the context manager, as opposed to r from the response, i.e. why it isn't r.search_after(), per

def search_after(self) -> "SearchBase[_R]":
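For reference, the search_after mechanism from the linked docs can be sketched in memory as follows. This is a minimal illustration under assumptions: DOCS, search, and iterate are hypothetical names, the "index" is a plain list, and each hit carries a single sort value. It only shows that the cursor for the next page is derived from the last hit of the previous response.

```python
from typing import Any, Dict, Iterator, List, Optional

# Hypothetical in-memory "index": each document has a unique sort key.
DOCS: List[Dict[str, Any]] = [{"id": i} for i in range(1, 8)]


def search(size: int, search_after: Optional[List[int]] = None) -> List[Dict[str, Any]]:
    """Return one page of hits sorted by id, each carrying its sort values."""
    ordered = sorted(DOCS, key=lambda d: d["id"])
    if search_after is not None:
        # Keep only documents strictly after the cursor.
        ordered = [d for d in ordered if d["id"] > search_after[0]]
    return [{"_source": d, "sort": [d["id"]]} for d in ordered[:size]]


def iterate(size: int = 3) -> Iterator[Dict[str, Any]]:
    after: Optional[List[int]] = None
    while True:
        hits = search(size, after)
        if not hits:
            break
        yield from (h["_source"] for h in hits)
        # The cursor for the next page comes from the last hit of this response.
        after = hits[-1]["sort"]


ids = [doc["id"] for doc in iterate()]
print(ids)
```

Here the state needed for the next page lives entirely in the response, which is why I would have expected the next search to be built from r.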
