incorrect pagination with date_histogram and format in composite aggregation #68963

Closed
zhnpeng opened this issue Feb 14, 2021 · 6 comments · Fixed by #73955
Labels: :Analytics/Aggregations, >bug, Team:Analytics

zhnpeng commented Feb 14, 2021

ES Version

{
  "name" : "baize-server-d7d300ed",
  "cluster_name" : "",
  "cluster_uuid" : "J8pd0v6-SLy_sYY75rVpAQ",
  "version" : {
    "number" : "7.11.0",
    "build_flavor" : "default",
    "build_type" : "",
    "build_hash" : "8ced7813d6f16d2ef30792e2fcde3e755795ee04",
    "build_date" : "2021-02-08T22:44:01.320463Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Doing a composite aggregation on a date_histogram source with the format: epoch_second parameter seems to exhibit inconsistent behavior: it can return excess values and "get stuck" on the search_after key.

My test mapping is:

{
  "mappings": {
    "properties": {
      "@timestamp": {
        "format": "yyyy-MM-dd hh:mm:ss",
        "type": "date"
      },
      "app": {
        "type": "keyword"
      },
      "count": {
        "type": "long"
      }
    }
  }
}

And the data (indexed with POST _bulk):

{ "create" : { "_index" : "test_comp_aggs", "_id" : 1 } }
{ "@timestamp" : "2021-02-14 10:00:00", "app" :  "tiktok",  "count": 1 }
{ "create" : { "_index" : "test_comp_aggs", "_id" : 2 } }
{ "@timestamp" : "2021-02-14 10:00:00", "app" :  "wechat",  "count": 1 }
{ "create" : { "_index" : "test_comp_aggs", "_id" : 3 } }
{ "@timestamp" : "2021-02-14 10:00:00", "app" :  "facebook",  "count": 1 }
{ "create" : { "_index" : "test_comp_aggs", "_id" : 4 } }
{ "@timestamp" : "2021-02-14 10:00:00", "app" :  "wechat",  "count": 1 }
{ "create" : { "_index" : "test_comp_aggs", "_id" : 5 } }
{ "@timestamp" : "2021-02-14 10:01:00", "app" :  "wechat",  "count": 2 }
{ "create" : { "_index" : "test_comp_aggs", "_id" : 6 } }
{ "@timestamp" : "2021-02-14 10:01:00", "app" :  "facebook",  "count": 2 }
{ "create" : { "_index" : "test_comp_aggs", "_id" : 7 } }
{ "@timestamp" : "2021-02-14 10:01:00", "app" :  "tiktok",  "count": 2 }

Running:

  1. composite aggregation with a large enough size (10), and format: epoch_second
{
  "size": 0,
  "aggs": {
    "results": {
      "composite": {
        "size": 10,
        "sources": [
            {
              "app": {
                "terms": {
                  "field": "app", 
                  "missing_bucket": true
                }
              }
            }, 
            {
              "ts": {
                "date_histogram": {
                  "field": "@timestamp", 
                  "fixed_interval": "1m", 
                  "time_zone": "Asia/Hong_Kong",
                  "format": "epoch_second"
                }
              }
            }
        ]
      }
    }
  }
}

will return all the correct buckets (3 in this case) and the last one as the after_key:

      "after_key" : {
        "app" : "wechat",
        "ts" : "1613260800"
      },
  2. then we use the after_key from the previous result to search again
{
  "size": 0,
  "aggs": {
    "results": {
      "composite": {
        "after" : {
          "app" : "wechat",
          "ts" : "1613260800"
        },
        "size": 10,
        "sources": [
            {
              "app": {
                "terms": {
                  "field": "app", 
                  "missing_bucket": true
                }
              }
            }, 
            {
              "ts": {
                "date_histogram": {
                  "field": "@timestamp", 
                  "fixed_interval": "1m", 
                  "time_zone": "Asia/Hong_Kong",
                  "format": "epoch_second"
                }
              }
            }
        ]
      }
    }
  }
}

will return excess values, and the after_key gets stuck:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "results" : {
      "after_key" : {
        "app" : "wechat",
        "ts" : "1613260800"
      },
      "buckets" : [
        {
          "key" : {
            "app" : "wechat",
            "ts" : "1613260800"
          },
          "doc_count" : 3
        }
      ]
    }
  }
}
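
For reference, the stuck pagination shows up immediately if you drive the composite aggregation in the usual pagination loop. Below is a minimal sketch in Python (assuming the requests library, Elasticsearch reachable at localhost:9200, and the test_comp_aggs index above); on 7.11.0 the same after_key comes back on every page, so only the iteration cap stops the loop:

import requests

SEARCH_URL = "http://localhost:9200/test_comp_aggs/_search"  # adjust for your cluster

query = {
    "size": 0,
    "aggs": {
        "results": {
            "composite": {
                "size": 10,
                "sources": [
                    {"app": {"terms": {"field": "app", "missing_bucket": True}}},
                    {"ts": {"date_histogram": {
                        "field": "@timestamp",
                        "fixed_interval": "1m",
                        "time_zone": "Asia/Hong_Kong",
                        "format": "epoch_second",
                    }}},
                ],
            }
        }
    },
}

after_key = None
for page in range(10):  # safety cap: on 7.11.0 this would otherwise never terminate
    if after_key is not None:
        query["aggs"]["results"]["composite"]["after"] = after_key
    results = requests.post(SEARCH_URL, json=query).json()["aggregations"]["results"]
    if not results["buckets"]:
        break  # normal termination: an empty page means pagination is done
    after_key = results["after_key"]
    print(page, after_key, [b["key"] for b in results["buckets"]])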
zhnpeng added the >bug and needs:triage labels Feb 14, 2021
zhnpeng commented Feb 14, 2021

Similar to #65685.

jimczi added the :Analytics/Aggregations label and removed needs:triage Feb 18, 2021
elasticmachine added the Team:Analytics label Feb 18, 2021
elasticmachine (Collaborator):

Pinging @elastic/es-analytics-geo (Team:Analytics)

not-napoleon (Member):

I did some testing with this, and I think the problem is that you're using a time zone with epoch_seconds. It's probably a bug that we even allow you to specify that, since epoch seconds are defined in terms of UTC. But if you remove the time zone in the second query, you'll correctly get no further results.
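
Concretely, that means dropping time_zone from the date_histogram source whenever the format is epoch_second. A sketch of the adjusted source, written as a Python dict to match the repro loop above (the rest of the request is unchanged):

# epoch_second is defined in terms of UTC, so "time_zone" has nothing to
# contribute here; removing it lets the second page correctly come back empty.
ts_source = {
    "ts": {
        "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "1m",
            "format": "epoch_second",
        }
    }
}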

not-napoleon self-assigned this May 25, 2021
zhnpeng commented May 28, 2021

Thanks. What should I do if I want to format my dates as epoch seconds with interval "1d" and time zone "Asia/Hong_Kong" instead of UTC?
For example:

    "ts": {
      "date_histogram": {
        "field": "@timestamp", 
        "fixed_interval": "1d", 
        "time_zone": "Asia/Hong_Kong",
        "format": "epoch_second"
      }
    }

not-napoleon (Member):

So I'm looking into fixing the bug with epoch_seconds and timezones. The good news is that it's "just" a format issue, which means the data should be correct and you can get it out in a different format. Unfortunately, I don't think you can get exactly what you want right now, but hopefully we can get you close enough for now.

My best suggestion for a workaround is to use a different format and convert to epoch seconds on the client side. If you use something like iso8601, you should still have enough information to convert to epoch seconds. I know that's not ideal, but it's the best I can think of until I can get a fix in for the format issue.
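
For what it's worth, the client-side conversion is small. A Python sketch, assuming the source is switched to "format": "iso8601" so each bucket key and after_key carries an explicit UTC offset (e.g. "2021-02-14T10:00:00.000+08:00"):

from datetime import datetime

def iso8601_to_epoch_seconds(key: str) -> int:
    # fromisoformat() on Python < 3.11 rejects a trailing "Z", so
    # normalize it to an explicit UTC offset before parsing.
    return int(datetime.fromisoformat(key.replace("Z", "+00:00")).timestamp())

print(iso8601_to_epoch_seconds("2021-02-14T10:00:00.000+08:00"))  # 1613268000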

zhnpeng commented Jun 5, 2021

Yes, your suggestion is what we are doing as a workaround: converting iso8601-formatted dates to epoch seconds on the client side.
