Query/Filter performance loss from 1.4.2 to 2.3.3 on dynamically updated index #18774
Comments
Here is the profiler output: https://gist.github.com/gpstathis/d284e2ad8b8ac3830901549c558ce36b

Because of #18693, take the numbers with a grain of salt, but one can still see that the majority of the time is spent iterating over the matches. This is corroborated by reducing the date range in the query: the shorter the date range, the faster the response times.
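For anyone wanting to reproduce output like the gist above: the search profiler (available since Elasticsearch 2.2) is enabled with a top-level `profile` flag on the request body. A minimal sketch; the `match_all` query here is a placeholder, not the query from this issue:

```python
# Sketch: enable the search profiler on an existing request body.
# The query below is a placeholder; substitute the real filtered
# query from the gist when running against a live cluster.
def with_profiling(body):
    """Return a copy of the search body with the profile flag set."""
    profiled = dict(body)
    profiled["profile"] = True
    return profiled

body = {"query": {"match_all": {}}, "size": 25}
profiled_body = with_profiling(body)
```

The copy keeps the original body untouched, which is convenient when the same body is reused for timing runs without profiling overhead.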
I don't know much about GlobalOrdinals, but it looks like they come up in this issue. Does anyone who knows more want to take a look?
Thanks @nik9000. I don't know if this applies only to aggregations, but according to https://www.elastic.co/guide/en/elasticsearch/guide/current/preload-fielddata.html#global-ordinals:

So if we update the index frequently and keep merging, we don't really give global ordinals the chance to build up. In theory then, if we extend
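The guide linked above describes building global ordinals eagerly at refresh time rather than lazily on first search, so a frequently updated index does not pay the rebuild cost at query time. A sketch of what that mapping setting looks like in 2.x; the type and field names are borrowed from the query in this issue and the index name is hypothetical:

```python
# Sketch of a 2.x mapping that eagerly loads global ordinals at
# refresh time (per the linked "preloading fielddata" guide).
# Index, type, and field names are illustrative only.
mapping = {
    "mappings": {
        "influencer": {
            "properties": {
                "tags": {
                    "type": "string",
                    "index": "not_analyzed",
                    # Build global ordinals on refresh instead of on
                    # the first query that needs them.
                    "fielddata": {"loading": "eager_global_ordinals"},
                }
            }
        }
    }
}
```

This trades slightly slower refreshes for more predictable query latency, which matters most on indices receiving constant real-time updates like the one described here.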
I could be wrong, but since documents are only allowed to have a single parent, you should be able to run this query with a single join instead of 3:

```json
{
  "from": 0,
  "size": 25,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "filter": [
            {
              "has_parent": {
                "query": {
                  "bool": {
                    "filter": [
                      {
                        "terms": {
                          "itype": ["person"]
                        }
                      },
                      {
                        "term": {
                          "tags": "bf0c5985e9644cf3:infs/acct/"
                        }
                      },
                      {
                        "term": {
                          "tags": "bf0c5985e9644cf3:infs/proj/21694/"
                        }
                      }
                    ]
                  }
                },
                "parent_type": "influencer"
              }
            },
            {
              "range": {
                "publishedDate": {
                  "from": "2015-04-29T00:00:00-0400",
                  "to": "2016-06-02T23:59:59-0400",
                  "include_lower": true,
                  "include_upper": true
                }
              }
            }
          ]
        }
      }
    }
  },
  "fields": "*",
  "sort": [
    {
      "publishedDate": {
        "order": "desc"
      }
    }
  ]
}
```

Given that the hot threads suggest that the bottleneck is the join, this should help on both versions and might also help reduce the gap.
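Since these queries are built dynamically, the single-join rewrite above can be generated programmatically. A hypothetical builder (function and parameter names are my own, not from the issue) that always collapses the parent-side filters into one `has_parent` clause:

```python
# Hypothetical builder that folds all parent-side filters into a
# single has_parent join, per the suggestion above. Names are
# illustrative; the structure mirrors the query in this issue.
def build_query(parent_filters, date_from, date_to, size=25):
    return {
        "from": 0,
        "size": size,
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "filter": [
                            {
                                "has_parent": {
                                    "parent_type": "influencer",
                                    # One join, however many parent filters.
                                    "query": {"bool": {"filter": parent_filters}},
                                }
                            },
                            {
                                "range": {
                                    "publishedDate": {
                                        "from": date_from,
                                        "to": date_to,
                                        "include_lower": True,
                                        "include_upper": True,
                                    }
                                }
                            },
                        ]
                    }
                }
            }
        },
        "fields": "*",
        "sort": [{"publishedDate": {"order": "desc"}}],
    }
```

With this shape, adding more parent-side conditions grows the inner `bool` filter list instead of adding extra `has_parent` joins.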
Thanks @jpountz. These queries are dynamically constructed, so it looks like we need to go back and take a closer look at the code that does the joins. Unfortunately, I tested it, and while the rewrite does shave about 1 second off the 13 seconds the query used to take on 2.3.3, that is still a 2x performance degradation from 1.4.2. Here are the hot threads while running it: https://gist.github.com/gpstathis/1b508d08e4e612cf9dd0f0224bfabfcd

So there is still a very large gap that we are trying to explain here.
I agree the fact that it is still more than 2x slower than 1.x is puzzling. Can you tell us:
@jpountz sent the numbers to your Gmail address.
Thanks, these numbers will help me reproduce the problem locally. Could you also share the profiler output again, this time with the new query?
I have been digging a bit; there seem to be several factors at play, but the main one seems to be how we run conjunctions. We used to have optimizations for very specific cases through FilteredQuery that got lost when we replaced it with BooleanQuery and switched to different bitset impls for the cache (roaring bit sets instead of plain bit sets; FilteredQuery/BooleanFilter had optimizations for the bitset case). The fact that parent/child runs on doc values rather than in-memory fielddata in 2.x probably also explains part of the performance hit, but it is unclear how much. I would expect its contribution to be smaller than the conjunction issue, though.
Thanks for taking a look, @jpountz.
I also opened https://issues.apache.org/jira/browse/LUCENE-7339, which I think also explains the slowdown you have been seeing. Even though the benchmark on that issue shows only a 6% improvement, it was measured with a range that is not cached, so the gain would likely be larger with a cached range (which I think could be the case in your queries, since you said the range should get reused). If you don't mind, I will now close this issue. Things will hopefully get better in 5.0.
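On the point about the range getting reused: a filter is only cacheable if repeated requests emit an identical clause, so dynamically generated ranges benefit from fixed-granularity endpoints rather than, say, millisecond-precision "now" values. A small sketch of this general technique (helper name and the hardcoded timezone offset are my own, chosen to match the query in this issue):

```python
# Sketch: build a range filter with day-granularity endpoints so that
# repeated requests over the same window produce an identical (and
# therefore cacheable) filter. This is a general caching technique,
# not something prescribed by the issue; names are illustrative.
def cacheable_range(field, start_day, end_day):
    """start_day/end_day are 'YYYY-MM-DD' strings; endpoints snap to
    whole days so the emitted clause is stable across requests."""
    return {
        "range": {
            field: {
                "gte": "%sT00:00:00-0400" % start_day,
                "lte": "%sT23:59:59-0400" % end_day,
            }
        }
    }
```

Two requests for the same day window then produce byte-identical filters, giving the query cache a chance to kick in.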
Thanks @jpountz . Is 5.0 the earliest version that will benefit from these patches? |
I expect these changes to land in Lucene 6.2, which means they should be in Elasticsearch 5.1 at worst, but likely in Elasticsearch 5.0. They won't be in any 2.x release, however.
@jpountz ok thanks |
We were in the same situation after an upgrade from 1.4.4 to 2.4.1, where we saw search performance degrade. In our case, however, the issue was with the source-filtering call (setFetchSource with a specific list of fields), which was taking time.
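For context, the Java client's setFetchSource corresponds to `_source` filtering in the request body. A sketch using the simple list form, which restricts the fetch phase to the named fields; field names are illustrative:

```python
# Sketch: request-body _source filtering, the REST equivalent of the
# Java client's setFetchSource(includes, excludes). Field names are
# illustrative; fetching only needed fields keeps the fetch phase cheap,
# though (per the comment above) the filtering itself can carry a cost.
def with_source_filter(body, includes):
    """Return a copy of the search body restricted to the given fields."""
    filtered = dict(body)
    filtered["_source"] = list(includes)
    return filtered

request = with_source_filter(
    {"query": {"match_all": {}}},
    ["publishedDate", "tags"],
)
```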
Elasticsearch version: 2.3.3
JVM version: 1.8.0_31-b13
OS version: Ubuntu 14.04.1 LTS
We are upgrading from 1.4.2 to 2.3.3 and are seeing some significant performance loss for several of our filtered queries against a dynamically updated index. For simplicity, I'll focus on just one query for this issue.
The query in question: https://gist.github.com/gpstathis/f96c048d0afe8f2bfce4753ffc9d555b
I am testing this same query on two different clusters, one running 1.4.2 and one running 2.3.3, loaded with the same data and receiving the same real-time updates (new docs arriving at about 20 to 50 per second). Both clusters use the same machine specs, and the indices are sized with the same number of shards.
On a corpus of about 1 billion docs this query/filter yields about 88k hits. It takes about 5 seconds to execute on 1.4.2 while on 2.3.3 the time has more than doubled to 13 seconds.
Here are the hot threads for the query on 1.4.2: https://gist.github.com/gpstathis/12ef17f104e3e2d7fafbae10d6eb9267
And now hot threads for the query on 2.3.3: https://gist.github.com/gpstathis/9196625858d4ca0b4311e5d288bae55e
From what I can tell, we are not benefiting from caching in either environment, so I am not clear why 2.3.3, despite the new auto-caching, would be penalized so much.
Now the real head scratcher is this: the query times and hot threads above are from production. We also have a QA 2.3.3 cluster with a corpus of similar size to prod but half the nodes, to save on cost (same number of shards). This QA cluster does not receive real-time updates like the prod cluster does. Even with half the number of nodes, the smaller 2.3.3 QA cluster performs similarly to our old 1.4.2 prod cluster and is therefore faster than the larger 2.3.3 cluster.
From the above, it seems that the real-time updates have a much more adverse effect on 2.3.3 compared to pre-2.x versions?