
Avoid the fetch phase when retrieving only _id #17159

Closed · gcampbell-epiq opened this issue Mar 17, 2016 · 12 comments
Labels: discuss, :Search/Search (Search-related issues that do not fall into other categories)

Comments

@gcampbell-epiq

Is it possible to avoid the fetch phase for a search and return only document IDs? Is _id available at the end of the query phase such that fetch would be redundant when all I want is the _id? Can this be done through the current API?

Ultimately, I'm hoping to retrieve doc IDs from a search much faster than what I've seen so far. I've tried all documented approaches in all sorts of permutations and found no satisfactory results. The best I've achieved was only a 25% speed improvement by querying each of 5 shards individually, in parallel. An acceptable speed would be 90% faster. It would help a lot to understand whether this is reasonable and, if it is not, why. It's very difficult to understand why I can be given (a) the first 100 results, (b) the total count, and (c) have them sorted so quickly, yet retrieving the full set of results is very slow.

@gcampbell-epiq

Also, is there any possibility of improving performance for this (IDs-only) scenario by developing a plugin? Are there any other options, documented or not, that can reduce overhead?

Just to stress the importance of this, it would be crucial to our implementation and likely a deciding factor for our adoption of Elastic to replace our current massive persistence layer.

@clintongormley added the discuss and :Search/Search labels Mar 17, 2016

jpountz commented Mar 18, 2016

How many ids are you retrieving per request? If only a few, then I am surprised that the fetch phase is taking so long; if many, then I'm afraid Elasticsearch is not the right tool for the job: this is something that regular databases are better at.

@gcampbell-epiq

Returning a few IDs is very fast. Returning 10k and up is slow. I'd like to understand why; can you explain this? I'd also like to explore options for getting better performance. Could you provide some guidance or ideas on where to look for developing performance improvements, e.g., a plugin for Elasticsearch, or using Lucene directly? Why not offer a query-only (no fetch) search type?


nik9000 commented Mar 18, 2016

> I'd like to understand why.

The query phase returns Lucene's doc ids (integers), not Elasticsearch's ids (strings). The fetch phase then looks up each doc id using Lucene's stored-fields mechanism. Stored fields are stored together in compressed chunks. Since _source is a stored field, you have to decompress a lot of _source to get to the _id field. Because the storage is chunked, you also have to decompress stored fields for docs you didn't hit.

Aggregations are fast because they use doc values, which is a non-chunked columnar structure. It is compressed, but using numeric tricks rather than a general-purpose compression algorithm. If you can retool your work as an aggregation, pushing the interesting work down to Elasticsearch, it can be orders of magnitude faster.
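
For illustration, here is a minimal sketch of that aggregation approach. It assumes a hypothetical index my_index whose documents mirror their _id into a doc-values-backed keyword field named doc_id (neither name comes from this thread); the matching IDs come back as terms buckets without ever running the fetch phase:

GET my_index/_search
{
    "size": 0,
    "query": { "match": { "title": "example query" } },
    "aggs": {
        "matching_ids": {
            "terms": { "field": "doc_id", "size": 10000 }
        }
    }
}

Setting size to 0 skips hit fetching entirely; the IDs are the bucket keys of matching_ids. Note the tradeoff: terms buckets are ordered by doc count (here, all 1) rather than by relevance score, so the query's ordering is lost.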

@gcampbell-epiq

That's a great explanation. Thank you so much for that and the idea. I will try it immediately.

From looking at the Lucene/Elasticsearch code, I had worried that the intermediate results from the query phase would not be usable. This comment appears in every(?) implementation of IndexReader in Lucene.

> For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.

But given this comment, I wonder if there is or could be an implementation of IndexReader that returns a usable ID.

@gcampbell-epiq

This is exciting. Using the aggregation method, I was able to get back 10K IDs in 16ms. Via scroll, the same results took ~6000ms. Can you help me understand what costs or tradeoffs this method entails, e.g., is memory usage much greater, or is performance degradation non-linear?


jpountz commented Oct 24, 2016

@jimferenczi I think I remember you did something about this?


jimczi commented Oct 25, 2016

@gcampbell-epiq in the upcoming 5.0 you can disable stored-fields retrieval. This should speed up the search if you need only docvalue or fieldcache fields. For instance, if you want to retrieve the _uid field you can do:

GET _search 
{
    "stored_fields": "_none_",
    "docvalue_fields": ["_uid"]
}

This will retrieve the _uid field from fielddata (this field doesn't have doc values), so the first query will be slow because it needs to build the fielddata in the heap, but from there each subsequent search should be much faster than a regular one.

@jimczi closed this as completed Oct 25, 2016
@cjbottaro

Can someone explain how to use aggregations to return document ids only and avoid the slow fetching?


nik9000 commented Dec 6, 2016

> Can someone explain how to use aggregations to return document ids only and avoid the slow fetching?

Do what @jimczi suggests above: disable stored_fields and fetch only fields with doc values. Your best bet is to use this only with fields that have doc values, like keyword fields or numbers.
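
As a concrete sketch of that advice (my_index and the doc-values-backed keyword field doc_id are hypothetical names, not from this thread):

GET my_index/_search
{
    "size": 10000,
    "stored_fields": "_none_",
    "docvalue_fields": ["doc_id"],
    "query": { "match_all": {} }
}

Each hit then carries the ID under fields.doc_id, read from the columnar doc-values storage rather than the compressed stored-fields chunks.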

@lanpay-lulu

I suggest using another field to store the uid and fetching it with doc values, which should be fast enough.
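
A minimal sketch of the index-time side of that suggestion, in current mapping syntax (index and field names are hypothetical; keyword fields carry doc values by default, and the client has to copy the ID into doc_id itself when indexing):

PUT my_index
{
    "mappings": {
        "properties": {
            "doc_id": { "type": "keyword" }
        }
    }
}

PUT my_index/_doc/42
{
    "doc_id": "42",
    "title": "example document"
}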

alexklibisz added a commit to alexklibisz/elastiknn that referenced this issue Jul 7, 2020
Addresses issue #102, as described in this discussion: elastic/elasticsearch#17159 (comment). 
This removes the `LZ4.decompress()` hotspot and reduces the L2 LSH benchmark from ~10 seconds to ~6.5 seconds (on my laptop). 
Possibly also resolves #95. Need to revisit that one to see if the decompress calls were actually decompressing vectors or the doc body.

Specific changes to the Scala client:
- Stores document ID in a doc-values field.
- Retrieves the ID from the doc-values field, and not the typical document ID.
- Uses a custom response handler to copy the ID value into its regular position in the elastic4s `SearchHit` case class, so users don't need to find it in the weakly-typed `.fields` map. 
- Better comments and more consistent naming conventions.

jtlz2 commented Sep 15, 2022

> This is exciting. Using the aggregation method, I was able to get back 10K IDs in 16ms. Via scroll, the same results took ~6000ms. Can you help me understand what costs or tradeoffs this method entails, e.g., is memory usage much greater, or is performance degradation non-linear?

@gcampbell-epiq How can I implement the "aggregation" method? (sorry - 5 years later :) )
