Query/Filter performance loss from 1.4.2 to 2.3.3 on dynamically updated index #18774

Closed
gpstathis opened this issue Jun 8, 2016 · 16 comments

@gpstathis
Contributor

gpstathis commented Jun 8, 2016

Elasticsearch version: 2.3.3

JVM version: 1.8.0_31-b13

OS version: Ubuntu 14.04.1 LTS

We are upgrading from 1.4.2 to 2.3.3 and are seeing significant performance loss on several of our filtered queries against a dynamically updated index. For simplicity, I'll focus on just one query in this issue.

The query in question: https://gist.github.com/gpstathis/f96c048d0afe8f2bfce4753ffc9d555b

I am testing this same query on two different clusters, one running 1.4.2 and one running 2.3.3, loaded with the same data and receiving the same real-time updates (new docs arriving at about 20 to 50 per second). Both clusters use the same machine specs, and the indices are sized with the same number of shards.

On a corpus of about 1 billion docs this query/filter yields about 88k hits. It takes about 5 seconds to execute on 1.4.2 while on 2.3.3 the time has more than doubled to 13 seconds.

Here are the hot threads for the query on 1.4.2: https://gist.github.com/gpstathis/12ef17f104e3e2d7fafbae10d6eb9267

And now hot threads for the query on 2.3.3: https://gist.github.com/gpstathis/9196625858d4ca0b4311e5d288bae55e
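
For reference, these hot threads dumps come from the standard nodes hot threads API; a minimal sketch of the call (the threads and interval parameters here are arbitrary):

GET /_nodes/hot_threads?threads=3&interval=500ms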

From what I can tell, we are not benefiting from caching in either environment, so I am not clear why 2.3.3, despite its new automatic query caching, is penalized so much.
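
As a sanity check on the caching question, here is a sketch of how cache usage can be compared across the two clusters via the stats APIs; my_index is a placeholder index name, and the returned metrics include hit, miss, and eviction counts:

GET /my_index/_stats/query_cache,request_cache
GET /_nodes/stats/indices/query_cache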

Now the real head-scratcher: the query times and hot threads above are from production. We also have a QA 2.3.3 cluster with a corpus similar in size to prod, but sized with half the nodes to save on cost (same number of shards). That QA cluster does not receive real-time updates the way the prod cluster does. Even with half the number of nodes, the smaller 2.3.3 QA cluster performs on par with our old 1.4.2 prod cluster, and is therefore faster than the larger 2.3.3 cluster.

From the above, it seems that real-time updates have a much more adverse effect on 2.3.3 than on pre-2.x versions?

@gpstathis
Contributor Author

gpstathis commented Jun 8, 2016

Here is the profiler output:

https://gist.github.com/gpstathis/d284e2ad8b8ac3830901549c558ce36b

Because of #18693, take the numbers with a grain of salt, but one can still see that the majority of the time is spent iterating over the matches. This is confirmed by shrinking the date range in the query: the shorter the date range, the faster the response times.
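
For context, the profile output was produced with the search profile API; a minimal sketch of how it is turned on, with match_all standing in for the real query:

{
  "profile": true,
  "query": {
    "match_all": {}
  }
}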

@nik9000
Member

nik9000 commented Jun 8, 2016

I don't know much about GlobalOrdinals, but it looks like they come up in this issue. Does anyone who knows more want to take a look?

@gpstathis
Contributor Author

gpstathis commented Jun 9, 2016

Thanks @nik9000. I don't know if this applies only to aggregations, but according to https://www.elastic.co/guide/en/elasticsearch/guide/current/preload-fielddata.html#global-ordinals:

Once global ordinals have been rebuilt, they will be reused until the segments in the index change: after a refresh, a flush, or a merge

So if we update the index frequently and keep merging, we never really give global ordinals a chance to be reused. In theory then, if we extend refresh_interval we should be able to see the benefits of global ordinals. I tested this by setting refresh_interval to 120s, but there was no difference in the response times.
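
For reference, a sketch of how refresh_interval can be changed dynamically for this kind of test; my_index is a placeholder index name:

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "120s"
  }
}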

@jpountz
Contributor

jpountz commented Jun 9, 2016

I could be wrong, but since documents are only allowed to have a single parent, you should be able to run this query with a single join instead of three:

{
  "from": 0,
  "size": 25,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "filter": [
            {
              "has_parent": {
                "query": {
                  "bool": {
                    "filter": [
                      {
                        "terms": {
                          "itype": [
                            "person"
                          ]
                        }
                      },
                      {
                        "term": {
                          "tags": "bf0c5985e9644cf3:infs/acct/"
                        }
                      },
                      {
                        "term": {
                          "tags": "bf0c5985e9644cf3:infs/proj/21694/"
                        }
                      }
                    ]
                  }
                },
                "parent_type": "influencer"
              }
            },
            {
              "range": {
                "publishedDate": {
                  "from": "2015-04-29T00:00:00-0400",
                  "to": "2016-06-02T23:59:59-0400",
                  "include_lower": true,
                  "include_upper": true
                }
              }
            }
          ]
        }
      }
    }
  },
  "fields": "*",
  "sort": [
    {
      "publishedDate": {
        "order": "desc"
      }
    }
  ]
}

Given that the hot threads suggest that the bottleneck is the join, this should help on both versions and might also help reduce the gap.

@gpstathis
Contributor Author

Thanks @jpountz. These queries are dynamically constructed, so it looks like we need to go back and take a closer look at the code that constructs the joins on _parent.

Unfortunately, while the rewritten query does shave about 1 second off the 13 seconds the original took on 2.3.3, it is still roughly a 2x performance degradation from 1.4.2.

Here are the hot threads while running it: https://gist.github.com/gpstathis/1b508d08e4e612cf9dd0f0224bfabfcd

So there is still a very large gap that we are striving to explain here.

@jpountz
Contributor

jpountz commented Jun 9, 2016

I agree the fact that it is still more than 2x slower than 1.x is puzzling. Can you tell us:

  • how many parent and child documents the index has
  • how many documents match each individual query (the range, the has_parent, the bool under the has_parent, and each query under the bool)
  • are there parts of the query that are often the same?

@gpstathis
Contributor Author

@jpountz, I sent the numbers to your Gmail address.

@jpountz
Contributor

jpountz commented Jun 9, 2016

Thanks, these numbers will help me reproduce the problem locally. Could you also share the profiler output again, this time with the new query?

@gpstathis
Contributor Author

@jpountz
Contributor

jpountz commented Jun 10, 2016

I have been digging a bit. There seem to be several factors at play, but the main one seems to me to be how we run conjunctions. We used to have optimizations for very specific cases through FilteredQuery that were lost when we replaced it with BooleanQuery and switched to different bitset implementations for the cache (roaring bitmaps instead of plain bit sets; FilteredQuery/BooleanFilter had optimizations for the bitset case, see e.g. RandomAccessFilterStrategy in the 1.x hot threads). I started working on some changes in Lucene (see https://issues.apache.org/jira/browse/LUCENE-7330 and https://issues.apache.org/jira/browse/LUCENE-7332), but they won't be available until elasticsearch 5.x.

The fact that parent/child runs on doc values rather than in-memory fielddata in 2.x probably also explains part of the performance hit, but it is unclear how much. I would expect the contribution to be less than the conjunction thing though.

@gpstathis
Contributor Author

Thanks for taking a look, @jpountz.

@clintongormley clintongormley added the :Search/Search Search-related issues that do not fall into other categories label Jun 13, 2016
@jpountz
Contributor

jpountz commented Jun 16, 2016

I also opened https://issues.apache.org/jira/browse/LUCENE-7339, which I think also explains the slowdown you have been seeing. Even though the benchmark on the issue shows a 6% jump, that run uses a range that is not cached, so the improvement would likely be larger with a cached range (which I think could be the case in your queries, since you said the range should get reused).

If you don't mind I will now close this issue. Things will hopefully get better in 5.0.

@jpountz jpountz closed this as completed Jun 16, 2016
@gpstathis
Contributor Author

Thanks @jpountz. Is 5.0 the earliest version that will benefit from these patches?

@jpountz
Contributor

jpountz commented Jun 17, 2016

I expect these changes to be in Lucene 6.2, which means they should be in elasticsearch 5.1 in the worst case, but likely in elasticsearch 5.0. However, they won't be in any 2.x release.

@gpstathis
Contributor Author

@jpountz ok thanks

@gauravbhattpb

gauravbhattpb commented Jan 16, 2018

We were in the same situation after an upgrade from 1.4.4 to 2.4.1, where we saw search performance degrade. In our case, however, the issue was with the source filtering call (setFetchSource with a specific list of fields), which was what was taking the time.
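
For reference, a rough sketch of the REST-body equivalent of that kind of source filtering; publishedDate comes from the query earlier in this thread, the other field names are placeholders:

{
  "_source": {
    "include": ["publishedDate", "title"],
    "exclude": ["body"]
  },
  "query": {
    "match_all": {}
  }
}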
