Slowdown on date range queries in ES 2.1 #15994

Closed
jrots opened this Issue Jan 14, 2016 · 16 comments

jrots commented Jan 14, 2016

On a dataset of +/- 30M documents (unoptimized, so lots of segments), the following query executes a lot faster:

curl -XGET "http://localhost:9200/users/person/_search" -d '
    {
        "from": 0,
        "size": 20,
        "_source": false,
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": [
                            {
                                "term": {
                                    "gender": "female"
                                }
                            },
                            {
                                "terms": {
                                    "location.geohash": [
                                        "u143",
                                        "u149",
                                        "u14c",
                                        "u146",
                                        "u14d",
                                        "u14f",
                                        "u147",
                                        "u14e",
                                        "u14g"
                                    ]
                                }
                            },
                            {
                                "term": {
                                    "accountstatus": "active"
                                }
                            },
                            {
                                "terms": {
                                    "country": [
                                        "be"
                                    ]
                                }
                            }
                        ]
                    }
                }
            }
        },
        "post_filter": { 
            "range": {
                "birthdate": {
                    "gte": "1981-01-13",
                    "lte": "1997-01-13"
                }
            }
        }
    }   
    '

< 110ms on consecutive runs:
{"took":109,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":2454,"max_score":1.0

vs. this one, which always takes 300ms to 400ms+:

curl -XGET "http://localhost:9200/users/person/_search" -d '
  {
        "from": 0,
        "size": 20,
        "_source": false,
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": [
                            {
                                "term": {
                                    "gender": "female"
                                }
                            },
                            {
                                "terms": {
                                    "location.geohash": [
                                        "u143",
                                        "u149",
                                        "u14c",
                                        "u146",
                                        "u14d",
                                        "u14f",
                                        "u147",
                                        "u14e",
                                        "u14g"
                                    ]
                                }
                            },
                            {
                                "term": {
                                    "accountstatus": "active"
                                }
                            },
                            {
                                "terms": {
                                    "country": [
                                        "be"
                                    ]
                                }
                            },
                            {
                                "range": {
                                    "birthdate": {
                                        "gte": "1981-01-13",
                                        "lte": "1997-01-13"
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
    '

{"took":304,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":2446,"max_score"

If this is expected behaviour, please close the issue.
But I think that ES should analyse the query and know when to do post-filtering instead of generating a lot of Lucene terms for date fields, which slows queries down drastically.

clintongormley (Member) commented Jan 14, 2016

What version of ES?
How many times did you run it?
How many segments do you have?
What hardware are you running this on? (all your search times seem very slow for such a small index)

jrots commented Jan 14, 2016

  • ES: 2.1.1 - Lucene 5.3.1, -Xms28g -Xmx28g, mlockall = true
  • 6 times consecutively for each query
  • Intel(R) Xeon(R) CPU X5650 @ 2.67GHz, 2 physical CPUs, 12 cores with hyperthreading, 24 processors in total
  • 64GB RAM, 1TB SSD
  • segments: https://gist.github.com/jrots/d7020441bf33dca4b67a , 4 shards with 30 segments on each shard
jimczi (Member) commented Jan 14, 2016

And what is the average time when you remove the range filter/post_filter? The post_filter and filter are very similar in terms of implementation; the post_filter also has to scan the inverted lists.

clintongormley (Member) commented Jan 15, 2016

@jrots It seems there may be a misunderstanding of what the post_filter does in Elasticsearch. The post filter is simply a filter that is applied to the query results AFTER aggregations have been calculated - it has nothing to do with executing the filter via the inverted index vs fielddata.

That said, something seems odd about how the query is executed - if the range query is the most costly, the bool query should figure that out and change the execution order to apply the terms queries first. We'll investigate.

Two questions:

  • Is the birthdate field mapped as a string or as a date?
  • Why does the number of results differ for the two queries? Are you actively indexing?
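To illustrate the distinction, here is a minimal sketch (reusing field names from this thread): the by_country aggregation is computed over all documents matching the query, and the post_filter then narrows only the returned hits.

```json
{
  "query": { "term": { "gender": "female" } },
  "aggs": {
    "by_country": { "terms": { "field": "country" } }
  },
  "post_filter": {
    "range": { "birthdate": { "gte": "1981-01-13", "lte": "1997-01-13" } }
  }
}
```

Moving the range into the query (or a filter clause) would instead restrict both the hits and the aggregation counts.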
jimczi (Member) commented Jan 15, 2016

Why do the number of results differ for the two queries? Are you actively indexing?

@clintongormley I guess that the number of results differs because the post-filtered documents are not taken into account in the total hits. I opened #16021 because I don't know if it's expected or not.

clintongormley (Member) commented Jan 15, 2016

@jimferenczi no - these counts should be the same whether post_filter is used or not. post_filter just affects which documents the aggs see.

jrots commented Jan 15, 2016

I was still actively indexing at that moment - that's done now, and I've also run a force merge with max_num_segments=1 to be sure that we're measuring the right things here.

with range on birthdate :

{"took":872,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":875,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":109,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}

If I change the range of the birthdate by one day, for example, it's again 800ms+ on consecutive runs.

leaving out the birthdate completely :

{"took":37,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":39062}}
{"took":32,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":39062}}

and so on, even when changing some of the params in the query.

Running with the range on birthdate in a post_filter, initial + consecutive runs:

{"took":401,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":54,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}
{"took":53,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":18892}}

birthdate is stored as :

          "birthdate" : {
            "type" : "date",
            "format" : "YYYY-MM-dd"
          },

I wonder, since I don't need time precision on my dates (a birthdate, for example), whether I'm not better off storing these dates as a fixed int number => floor((epochtime - birthdate timestamp)/60/60/24).
That way fewer terms would be generated in the Lucene index; the precision_step would also be 8 in that case, delivering faster responses.
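As a sketch of that idea (hypothetical helper, not part of the mapping above), encoding each date as a whole number of days since the Unix epoch turns the date range into a plain integer range:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def days_since_epoch(d: date) -> int:
    """Encode a date as a whole number of days since 1970-01-01."""
    return (d - EPOCH).days

# The birthdate range from the query above becomes an integer range:
lo = days_since_epoch(date(1981, 1, 13))  # → 4030
hi = days_since_epoch(date(1997, 1, 13))  # → 9874
```

The field would then be indexed as an integer, and the range filter compares ints instead of walking per-day date terms.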


clintongormley (Member) commented Jan 15, 2016

I was still actively indexing at that moment

OK that explains the varying counts.

Wonder if I don't need time precision on my dates (like a birthdate f.e.), if I'm not better off storing these dates as a fixed int number => floor(( (epochtime) - (timestamp birthdate) )/60/60/24)

Before going there... query execution order has had a big rewrite, and it sounds like it is choosing a suboptimal order. @jimferenczi is going to dive into it to see if he can find something that can be improved.

Thanks for all the info.

jimczi (Member) commented Jan 15, 2016

(Edited: the previous version of this comment was wrong and misleading.)
@jrots I'll try to reproduce.


jimczi (Member) commented Jan 18, 2016

@jrots I've tried to reproduce with a 40M-document index, 5 shards; each document contains one date field with a random date chosen between "1950-01-01" and "2000-12-31", and two long fields with different cardinalities (one big and one small).
I compared the post_filter + boolean query vs one big boolean query, and the results are always the same; there is no gain with the post_filter. I checked with different types of range queries (big ranges like 1950 => 2000 and small ones like your example).
When you test performance you can see some variation depending on the activity of your machine in terms of CPU and RAM. Are you sure that you stopped the indexing and the queries when you checked the performance of each query? Can you try to re-run your tests with more than 3 trials per query?


clintongormley (Member) commented Jan 18, 2016

@jrots could you provide a more complete recreation? or would it be possible to upload your index somewhere so that we could try it out?


jrots commented Jan 19, 2016

OK, I'll have a look at putting together a testable index that doesn't contain too much private info.


jrots commented Feb 1, 2016

Hi guys,
sorry for the late reply. I retried the post_filter and the filter on date ranges, and they do indeed give the same results (both quite slow, to be honest). It must've been the indexing I was doing at the time that caused the fluctuation in search response times.

I think that if you have a lot of documents and are doing a wide numeric range, there can be a big slowdown.

I fixed my problem by rewriting the following range query :

                    "range": {
                                "birthdate": {
                                    "gte": "1967-01-28",
                                    "lte": "1996-01-28"
                                }
                            }

to :

"filter": {
                "bool": {
                    "should": [
                        {
                            "range": {
                                "birthdate": {
                                    "gte": "1967-01-28",
                                    "lte": "1967-12-31"
                                }
                            }
                        },
                        {
                            "range": {
                                "birthdate": {
                                    "gte": "1996-01-01",
                                    "lte": "1996-01-28"
                                }
                            }
                        },
                        {
                            "terms": {
                                "country_birthdate_string": [
                                    "be1-1968",
                                    "be1-1969",
                                    "be0-1970",
                                    "be0-1980",
                                    "be1-1990",
                                    "be1-1991",
                                    "be1-1992",
                                    "be1-1993",
                                    "be1-1994",
                                    "be1-1995"
                                ]
                            }
                        }

0_<year rounded down to the decade> => spans 10 years
1_<year> => spans 1 year

I added a transform to the index mapping that generates these keywords automatically for me:

    "mappings" : {
      "person" : {
        "transform" : {
          "inline" : "ctx._source[\"country_birthdate_string\"] = [ctx._source[\"country\"] + \"0_\"+(int)Math.floor(Integer.parseInt(ctx._source[\"birthdate\"].substring(0,4))/10)*10, ctx._source[\"country\"] + \"1_\"+ctx._source[\"birthdate\"].substring(0,4)];",
          "lang" : "groovy"
        },
        "_all" : {
          "enabled" : false
        },
        "properties" : {
          "birthdate" : {
            "type" : "date",
            "format" : "YYYY-MM-dd"
          },
          "country" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "country_birthdate_string" : {
            "type" : "string",
            "index" : "not_analyzed"
          },

I get a speed-up of about 10x doing it like this for my use case.
Maybe it's worth adding something like this higher up in the Lucene internals / revisiting the numeric tree lookup for large ranges.
Regards,
J
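For reference, a Python sketch of what the Groovy transform above computes per document (assuming the underscore-separated form the transform emits):

```python
def country_birthdate_terms(country: str, birthdate: str) -> list:
    """Build the two composite keywords the mapping transform stores:
    one decade-granularity term and one year-granularity term."""
    year = int(birthdate[:4])   # "1984-06-02" -> 1984
    decade = (year // 10) * 10  # 1984 -> 1980
    return [f"{country}0_{decade}", f"{country}1_{year}"]

country_birthdate_terms("be", "1984-06-02")  # → ["be0_1980", "be1_1984"]
```

A wide query range can then match fully-covered decades with a single decade term each, cover the remaining full years with per-year terms, and leave only the two partial edge years for real range clauses.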


clintongormley (Member) commented Mar 1, 2016

The new point field encoding coming in 5.0 should improve this situation. Closing.


ChristopheBoucaut commented Apr 6, 2017

Hi @jrots.

Did you test your use case with ES 5.*? Can you confirm the improvement?

I have the same issue with ES 5.3 and I haven't found a "native solution" in ES.

Regards.


tvernum (Contributor) commented Jun 28, 2017

@hwb1992 Please ask this question on our discuss forum where we can provide better support.

