Query/Filter performance loss from 1.4.2 to 2.3.3 on dynamically updated index #18774

Closed
gpstathis opened this issue Jun 8, 2016 · 16 comments

@gpstathis
Contributor

gpstathis commented Jun 8, 2016

Elasticsearch version: 2.3.3

JVM version: 1.8.0_31-b13

OS version: Ubuntu 14.04.1 LTS

We are upgrading from 1.4.2 to 2.3.3 and are seeing significant performance loss on several of our filtered queries against a dynamically updated index. For simplicity, I'll focus on just one query in this issue.

The query in question: https://gist.github.com/gpstathis/f96c048d0afe8f2bfce4753ffc9d555b

I am testing this same query on two different clusters, one running 1.4.2 and one running 2.3.3, loaded with the same data and receiving the same real-time updates (new docs arriving at about 20 to 50 per second). Both clusters use the same machine specs, and the indices are sized with the same number of shards.

On a corpus of about 1 billion docs this query/filter yields about 88k hits. It takes about 5 seconds to execute on 1.4.2 while on 2.3.3 the time has more than doubled to 13 seconds.

Here are the hot threads for the query on 1.4.2: https://gist.github.com/gpstathis/12ef17f104e3e2d7fafbae10d6eb9267

And now hot threads for the query on 2.3.3: https://gist.github.com/gpstathis/9196625858d4ca0b4311e5d288bae55e
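
For reference, these hot threads dumps come from the standard nodes hot threads API; a minimal sketch of the call (the threads and interval parameters here are arbitrary):

GET /_nodes/hot_threads?threads=3&interval=500ms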

From what I can tell, we are not benefiting from caching in either environment, so I am not clear why 2.3.3, despite its new automatic query caching, is penalized so much.
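
As a sanity check on the caching question, here is a sketch of how cache usage can be compared across the two clusters via the stats APIs; my_index is a placeholder index name, and the returned metrics include hit, miss, and eviction counts:

GET /my_index/_stats/query_cache,request_cache
GET /_nodes/stats/indices/query_cache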

Now the real head-scratcher: the query times and hot threads above are from production. We also have a QA 2.3.3 cluster with a corpus similar in size to prod, but sized with half the nodes to save on cost (same number of shards). That QA cluster does not receive real-time updates the way the prod cluster does. Even with half the number of nodes, the smaller 2.3.3 QA cluster performs on par with our old 1.4.2 prod cluster, and is therefore faster than the larger 2.3.3 cluster.

From the above, it seems that real-time updates have a much more adverse effect on 2.3.3 than on pre-2.x versions?

@gpstathis
Contributor Author

gpstathis commented Jun 8, 2016

Here is the profiler output:

https://gist.github.com/gpstathis/d284e2ad8b8ac3830901549c558ce36b

Because of #18693, take the numbers with a grain of salt, but one can still see that the majority of the time is spent iterating over the matches. This is confirmed by shrinking the date range in the query: the shorter the date range, the faster the response times.
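
For context, the profile output was produced with the search profile API; a minimal sketch of how it is turned on, with match_all standing in for the real query:

{
  "profile": true,
  "query": {
    "match_all": {}
  }
}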

@nik9000
Member

nik9000 commented Jun 8, 2016

I don't know much about GlobalOrdinals, but it looks like they come up in this issue. Does anyone who knows more want to take a look?

@gpstathis
Contributor Author

gpstathis commented Jun 9, 2016

Thanks @nik9000. I don't know if this applies only to aggregations, but according to https://www.elastic.co/guide/en/elasticsearch/guide/current/preload-fielddata.html#global-ordinals:

Once global ordinals have been rebuilt, they will be reused until the segments in the index change: after a refresh, a flush, or a merge

So if we update the index frequently and keep merging, we never really give global ordinals a chance to be reused. In theory then, if we extend refresh_interval we should be able to see the benefits of global ordinals. I tested this by setting refresh_interval to 120s, but there was no difference in the response times.
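
For reference, a sketch of how refresh_interval can be changed dynamically for this kind of test; my_index is a placeholder index name:

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "120s"
  }
}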

@jpountz
Contributor

jpountz commented Jun 9, 2016

I could be wrong, but since documents are only allowed to have a single parent, you should be able to run this query with a single join instead of three:

{
  "from": 0,
  "size": 25,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "filter": [
            {
              "has_parent": {
                "query": {
                  "bool": {
                    "filter": [
                      {
                        "terms": {
                          "itype": [
                            "person"
                          ]
                        }
                      },
                      {
                        "term": {
                          "tags": "bf0c5985e9644cf3:infs/acct/"
                        }
                      },
                      {
                        "term": {
                          "tags": "bf0c5985e9644cf3:infs/proj/21694/"
                        }
                      }
                    ]
                  }
                },
                "parent_type": "influencer"
              }
            },
            {
              "range": {
                "publishedDate": {
                  "from": "2015-04-29T00:00:00-0400",
                  "to": "2016-06-02T23:59:59-0400",
                  "include_lower": true,
                  "include_upper": true
                }
              }
            }
          ]
        }
      }
    }
  },
  "fields": "*",
  "sort": [
    {
      "publishedDate": {
        "order": "desc"
      }
    }
  ]
}

Given that the hot threads suggest that the bottleneck is the join, this should help on both versions and might also help reduce the gap.

@gpstathis
Contributor Author

Thanks @jpountz. These queries are dynamically constructed, so it looks like we need to go back and take a closer look at the code that constructs the joins on _parent.

Unfortunately, while the rewritten query does shave about 1 second off the 13 seconds the original took on 2.3.3, it is still roughly a 2x performance degradation from 1.4.2.

Here are the hot threads while running it: https://gist.github.com/gpstathis/1b508d08e4e612cf9dd0f0224bfabfcd

So there is still a very large gap that we are striving to explain here.

@jpountz
Contributor

jpountz commented Jun 9, 2016

I agree the fact that it is still more than 2x slower than 1.x is puzzling. Can you tell us:

  • how many parent and child documents the index has
  • how many documents match each individual query (the range, the has_parent, the bool under the has_parent, and each query under the bool)
  • are there parts of the query that are often the same?

@gpstathis
Contributor Author

@jpountz, I sent the numbers to your Gmail address.

@jpountz
Contributor

jpountz commented Jun 9, 2016

Thanks, these numbers will help me reproduce the problem locally. Could you also share the profiler output again, this time with the new query?

@gpstathis
Contributor Author

@jpountz
Contributor

jpountz commented Jun 10, 2016

I have been digging a bit. There seem to be several factors at play, but the main one seems to me to be how we run conjunctions. We used to have optimizations for very specific cases through FilteredQuery that were lost when we replaced it with BooleanQuery and switched to different bitset implementations for the cache (roaring bitmaps instead of plain bit sets; FilteredQuery/BooleanFilter had optimizations for the bitset case, see e.g. RandomAccessFilterStrategy in the 1.x hot threads). I started working on some changes in Lucene (see https://issues.apache.org/jira/browse/LUCENE-7330 and https://issues.apache.org/jira/browse/LUCENE-7332), but they won't be available until elasticsearch 5.x.

The fact that parent/child runs on doc values rather than in-memory fielddata in 2.x probably also explains part of the performance hit, but it is unclear how much. I would expect the contribution to be less than the conjunction thing though.

@gpstathis
Contributor Author

Thanks for taking a look, @jpountz.

@clintongormley clintongormley added the :Search/Search Search-related issues that do not fall into other categories label Jun 13, 2016
@jpountz
Contributor

jpountz commented Jun 16, 2016

I also opened https://issues.apache.org/jira/browse/LUCENE-7339, which I think also explains the slowdown you have been seeing. Even though the benchmark on the issue shows a 6% jump, that run uses a range that is not cached, so the improvement would likely be larger with a cached range (which I think could be the case in your queries, since you said the range should get reused).

If you don't mind I will now close this issue. Things will hopefully get better in 5.0.

@jpountz jpountz closed this as completed Jun 16, 2016
@gpstathis
Contributor Author

Thanks @jpountz. Is 5.0 the earliest version that will benefit from these patches?

@jpountz
Contributor

jpountz commented Jun 17, 2016

I expect these changes to be in Lucene 6.2, which means they should be in elasticsearch 5.1 in the worst case, but likely in elasticsearch 5.0. However, they won't be in any 2.x release.

@gpstathis
Contributor Author

@jpountz ok thanks

@gauravbhattpb

gauravbhattpb commented Jan 16, 2018

We were in the same situation after an upgrade from 1.4.4 to 2.4.1, where we saw search performance degrade. In our case, however, the issue was with the source filtering call (setFetchSource with a specific list of fields), which was what was taking the time.
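
For reference, a rough sketch of the REST-body equivalent of that kind of source filtering; publishedDate comes from the query earlier in this thread, the other field names are placeholders:

{
  "_source": {
    "include": ["publishedDate", "title"],
    "exclude": ["body"]
  },
  "query": {
    "match_all": {}
  }
}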
