Slowdown on date range queries in ES 2.1 #15994
Comments
What version of ES? |
|
And what is the average time when you remove the range filter/post_filter? The post_filter and the filter are very similar in terms of implementation; the post_filter also has to scan the inverted lists. |
@jrots It seems there may be a misunderstanding of what the That said, something seems odd about how the query is executed - if the range query is the most costly, the bool query should figure that out and change the execution order to apply the terms queries first. We'll investigate. Two questions:
|
@clintongormley I guess that the number of results differs because the post-filtered documents are not taken into account in the total hits. I opened #16021 because I don't know if it's expected or not. |
@jimferenczi no - these counts should be the same whether post_filter is used or not. post_filter just affects which documents the aggs see. |
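To make the filter vs. post_filter distinction concrete, here is a minimal sketch of the two request shapes being compared. The actual queries from this thread were not preserved, so everything except the field name `birthdate` is an assumption: the `country` terms clause, the date bounds, and the helper names are hypothetical stand-ins.

```python
# Hypothetical date range; only the field name `birthdate` comes from the thread.
BIRTHDATE_RANGE = {"range": {"birthdate": {"gte": "1980-01-01", "lte": "1989-12-31"}}}

# Hypothetical stand-in for the other (cheaper) clauses of the query.
OTHER_CLAUSES = [{"terms": {"country": ["be", "nl"]}}]


def with_filter(aggs: dict) -> dict:
    """Range inside the bool filter: it restricts both the hits and the aggs."""
    return {
        "query": {"bool": {"filter": OTHER_CLAUSES + [BIRTHDATE_RANGE]}},
        "aggs": aggs,
    }


def with_post_filter(aggs: dict) -> dict:
    """Range as post_filter: it restricts the hits only, so the aggs are
    still computed over every document matched by the main query."""
    return {
        "query": {"bool": {"filter": list(OTHER_CLAUSES)}},
        "post_filter": BIRTHDATE_RANGE,
        "aggs": aggs,
    }
```

With either body the returned hits are limited to documents that match the range; only the set of documents visible to the aggregations differs.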
I was still actively indexing at that moment. That's done now, and I've also run a force merge with max_segments=1 to be sure that we're measuring the right things here.

With range on birthdate:
{"took":872,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":875,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":109,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}

If I change the range of the birthdate by one day, for example, it again takes 800ms+ on consecutive runs.

Leaving out the birthdate completely:
{"took":37,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":39062}}
{"took":32,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":39062}}
etc.

Running with the range on birthdate in post_filter, initial + consecutive runs:
{"took":401,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":54,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":53,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}

birthdate is stored as:
"birthdate" : {
"type" : "date",
"format" : "YYYY-MM-dd"
},

Wonder if I don't need time precision on my dates (like a birthdate f.e.), |
OK that explains the varying counts.
Before going there... query execution order has had a big rewrite, and it sounds like it is choosing a suboptimal order. @jimferenczi is going to dive into it to see if he can find something that can be improved. thanks for all the info |
edited version of the comment which was wrong and misleading. |
@jrots I've tried to reproduce with a 40M-document index with 5 shards. Each document contains one date field with a random date chosen between "1950-01-01" and "2000-12-31", and two long fields with different cardinalities (one big and one small). |
@jrots could you provide a more complete recreation? or would it be possible to upload your index somewhere so that we could try it out? |
OK, I'll see if I can put together a testable index that doesn't contain too much private info |
Hi guys, I think that if you have a lot of documents and are doing a wide numeric range, there can be a big slowdown. I fixed my problem by rewriting the following range query:
to:
0_<year rounded 0> => spans 10 years
I added a transform to the index mapping that generates these keywords automatically for me:
I get a 10x speedup doing it like this for my use case. |
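The original query and mapping transform from this comment were not preserved, so the following is only a rough sketch of how such a decade-bucket rewrite might look, not the poster's actual code. It assumes a hypothetical keyword field `birthdate_bucket` holding values of the form `0_<decade>` (the prefix is taken from the comment), and the function names are made up for illustration. Decades fully covered by the range become cheap term lookups, and only the partial edges keep an ordinary range clause.

```python
from datetime import date


def decade_key(d: date) -> str:
    """Bucket keyword for a date: '0_' + the year rounded down to its decade."""
    return f"0_{d.year // 10 * 10}"


def rewrite_range(start: date, end: date) -> dict:
    """Rewrite an inclusive [start, end] range on `birthdate` into decade
    terms plus range clauses for the partially covered edges."""
    # First decade that starts on or after `start`.
    first = start.year // 10 * 10
    if start > date(first, 1, 1):
        first += 10
    # Last decade that ends on or before `end`.
    last = end.year // 10 * 10
    if end < date(last + 9, 12, 31):
        last -= 10

    decades = list(range(first, last + 1, 10))
    if not decades:
        # No decade is fully covered: fall back to the plain range query.
        return {"range": {"birthdate": {"gte": start.isoformat(),
                                        "lte": end.isoformat()}}}

    # `birthdate_bucket` is a hypothetical field name (assumption).
    should = [{"terms": {"birthdate_bucket": [f"0_{y}" for y in decades]}}]
    if start < date(first, 1, 1):
        should.append({"range": {"birthdate": {"gte": start.isoformat(),
                                               "lt": f"{first}-01-01"}}})
    if end > date(last + 9, 12, 31):
        should.append({"range": {"birthdate": {"gt": f"{last + 9}-12-31",
                                               "lte": end.isoformat()}}})
    return {"bool": {"should": should, "minimum_should_match": 1}}
```

At index time each document would need its `birthdate_bucket` value computed with something like `decade_key`. Finer bucket levels (per year, for instance) could shrink the edge ranges further, but the thread does not say whether the poster used any.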
The new point field encoding coming in 5.0 should improve this situation. Closing |
Hi @jrots. Did you test your use case with ES 5.*? Can you confirm the improvement? I have the same issue with ES 5.3 and I haven't found a "native solution" in ES. Regards. |
@hwb1992 Please ask this question on our discuss forum where we can provide better support. |
On a dataset of +/- 30M documents (unoptimized, so lots of segments)
The following query executes a lot faster:
< 110ms on consecutive runs,
{"took":109,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":2454,"max_score":1.0
vs. this one, which always takes 300ms to 400ms+:
{"took":304,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":2446,"max_score"
If this is expected behaviour, please close the issue -
But I think that ES should analyse the query and know when to do post filtering instead of generating a lot of Lucene terms for date fields, which slows queries down drastically.