min_doc_count=0 doesn't work with a date_histogram with a filter #4843

Closed
cmaitchison opened this issue Jan 22, 2014 · 13 comments

@cmaitchison
commented Jan 22, 2014

I'm trying to create a date_histogram for recent events in which days with no events are still shown.

{
  "aggs": {
    "events_last_week": {
      "filter": {
        "range": {
          "@timestamp": {
            "from": "2014-01-10"
          }
        }
      },
      "aggs": {
        "events_last_week_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "format": "yyyy-MM-dd",
            "interval": "1d"
          }
        }
      }
    }
  }
}

I get a response like this:

"aggregations":  {
  "events_last_week": {
    "doc_count": 33861,
    "events_last_week_histogram": [
      {
        "key_as_string": "2014-01-10",
        "key": 1389744000000,
        "doc_count": 2120
      }, {
        "key_as_string": "2014-01-16",
        "key": 1389830400000,
        "doc_count": 3823
      }, {
        "key_as_string": "2014-01-17",
        "key": 1389916800000,
        "doc_count": 27918
      }
    ]
  }
}

The empty days are not returned. If I construct the query without the filter, the empty days are returned correctly.

There is also an issue even when the empty days are returned correctly without the filter. If, for example, today is "2014-01-22", and the latest timestamp in my data is "2014-01-17", then the 5 days between these two dates are not returned as empty buckets, though all the empty buckets prior to "2014-01-17" are returned correctly.

@uboness

Contributor

commented Jan 22, 2014

@cmaitchison

I can't reproduce it; I ran the same queries as you and I get the right responses. What ES version are you working with? We introduced min_doc_count in 1.0.0.RC1.

> There is also an issue even when the empty days are returned correctly without the filter. If, for example, today is "2014-01-22", and the latest timestamp in my data is "2014-01-17", then the 5 days between these two dates are not returned as empty buckets, though all the empty buckets prior to "2014-01-17" are returned correctly.

The gaps that are filled are based on the dates in the documents you're aggregating. The first histogram bucket is based on the earliest date in the document set and the last bucket on the latest date in the set; we then fill in all gaps between these two buckets.
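
A minimal Python sketch of the gap-filling behaviour described above, assuming buckets are keyed by epoch milliseconds and use a fixed interval. The function name `fill_between_min_and_max` and the sample counts are illustrative only, not Elasticsearch internals.

```python
def fill_between_min_and_max(doc_counts, interval_ms):
    """Return (key, count) pairs from the earliest to the latest non-empty bucket.

    doc_counts maps a bucket key (epoch ms) to its document count. Only the
    keys present in the data define the range, so nothing is emitted before
    the earliest or after the latest document.
    """
    if not doc_counts:
        return []  # no documents, so no min/max and no buckets
    first, last = min(doc_counts), max(doc_counts)
    return [(key, doc_counts.get(key, 0))
            for key in range(first, last + interval_ms, interval_ms)]


DAY_MS = 24 * 60 * 60 * 1000
jan_10 = 1389312000000  # 2014-01-10T00:00:00Z
# Hypothetical data: documents on Jan 10 and Jan 13 only.
buckets = fill_between_min_and_max({jan_10: 5, jan_10 + 3 * DAY_MS: 2}, DAY_MS)
```

In this sketch Jan 11 and Jan 12 come back as zero-count buckets, but nothing is produced after Jan 13, which matches the missing trailing days in the report.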

We can consider adding a "range" setting to the histograms that would let you define the value range (or the date range, in the case of date_histogram) over which the buckets are created. In your case, defining a range of the form "range": { "to" : "now" } along with "min_doc_count" : 0 would return all the empty buckets up to now (beyond the dates in the document set).

@uboness

Contributor

commented Jan 22, 2014

@cmaitchison scratch that... I finally managed to reproduce it (it happens when you have a single shard)... will work on a fix

@cmaitchison

Author

commented Jan 22, 2014

Wow, nice find! I would never have thought to mention that.

uboness added a commit to uboness/elasticsearch that referenced this issue Jan 23, 2014

Fixed an issue where there are sub aggregations executing on a single shard, the reduce call was not propagated properly down the agg hierarchy.

Closed: elastic#4843
@cmaitchison

Author

commented Jan 23, 2014

Also related to this title, I've found that min_doc_count=0 does not work if all of the buckets would be empty after applying the filter. I can reproduce this issue on an index with 2 shards.

{
  "aggs": {
    "filtered_events": {
      "filter": {
        "and": [
          {
            "range": {
              "@timestamp": {
                "from": 1390267500000,
                "to":   1390267560000
              }
            }
          }
        ]
      },
      "aggs": {
        "filtered_events_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "interval": "1s"
          }
        }
      }
    }
  }
}

The above query should return 60 results, 1 for each second in the minute. If any events are found in that minute then 60 results are returned. If no events are found in that minute then 0 results are returned, when you would expect 60 empty buckets.
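
As a quick check of the expected count, assuming the `to` bound is treated as exclusive and buckets are aligned to whole seconds, the one-minute window in the query above spans 60 one-second keys:

```python
from_ms = 1390267500000
to_ms = 1390267560000
interval_ms = 1000  # "1s"

# One key per second, starting at the filter's lower bound.
expected_keys = list(range(from_ms, to_ms, interval_ms))
print(len(expected_keys))  # 60
```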

My use case is zooming in on a series on a chart. The zero value results are very helpful to know where to plot the zeros on the x-axis.

@cmaitchison

Author

commented Jan 23, 2014

Another related issue I am finding is that sometimes the intervals do not go back far enough.

{
  "aggs": {
    "events_last_week": {
      "filter": {
        "and": [
          {
            "range": {
              "@timestamp": {
                "from": 1390267432894,
                "to": 1390267547037
              }
            }
          }
        ]
      },
      "aggs": {
        "events_last_week_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "interval": "second"
          }
        }
      }
    }
  }
}

returns exactly:

{
  "aggregations": {
    "events_last_week": {
      "doc_count": 1099,
      "events_last_week_histogram": [
        {
          "key": 1390267526000,
          "doc_count": 12
        },
        {
          "key": 1390267527000,
          "doc_count": 0
        },
        {
          "key": 1390267528000,
          "doc_count": 29
        },
        {
          "key": 1390267529000,
          "doc_count": 32
        },
        {
          "key": 1390267530000,
          "doc_count": 58
        },
        {
          "key": 1390267531000,
          "doc_count": 64
        },
        {
          "key": 1390267532000,
          "doc_count": 35
        },
        {
          "key": 1390267533000,
          "doc_count": 36
        },
        {
          "key": 1390267534000,
          "doc_count": 43
        },
        {
          "key": 1390267535000,
          "doc_count": 52
        },
        {
          "key": 1390267536000,
          "doc_count": 58
        },
        {
          "key": 1390267537000,
          "doc_count": 62
        },
        {
          "key": 1390267538000,
          "doc_count": 76
        },
        {
          "key": 1390267539000,
          "doc_count": 70
        },
        {
          "key": 1390267540000,
          "doc_count": 53
        },
        {
          "key": 1390267541000,
          "doc_count": 72
        },
        {
          "key": 1390267542000,
          "doc_count": 81
        },
        {
          "key": 1390267543000,
          "doc_count": 48
        },
        {
          "key": 1390267544000,
          "doc_count": 88
        },
        {
          "key": 1390267545000,
          "doc_count": 45
        },
        {
          "key": 1390267546000,
          "doc_count": 83
        },
        {
          "key": 1390267547000,
          "doc_count": 2
        }
      ]
    }
  }
}

But it is missing all of the empty buckets between 1390267432894 and 1390267526000. Again, this is with a 2-shard index on 1.0.0.RC1.

@uboness

Contributor

commented Jan 23, 2014

@cmaitchison as I mentioned above, the histogram operates on the dataset and extracts its min/max from the documents (the earliest/latest). There is no direct relation between the filter aggregation and the histogram aggregation (aggregations are unaware of other aggregations in their hierarchy). We could potentially add a range feature to histogram, but if we do, it will have to be post 1.0.

In the first example you gave, there are no documents in that minute, so there are no buckets (we can't determine the min/max values). For the second example, it may be that the earliest document in the doc set has a later timestamp than the "from" value in the filter.
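
To put a number on the second example, here is a small Python check, assuming the first expected bucket is the whole second that contains the filter's "from" value. The variable names are illustrative.

```python
filter_from_ms = 1390267432894      # "from" in the range filter
first_returned_key = 1390267526000  # earliest bucket in the response
interval_ms = 1000                  # "second"

# Round the filter bound down to the start of its one-second bucket.
first_expected_key = filter_from_ms - filter_from_ms % interval_ms

# Keys a client might expect before the earliest document, but which the
# histogram does not produce because it only sees the documents themselves.
missing_keys = range(first_expected_key, first_returned_key, interval_ms)
print(first_expected_key, len(missing_keys))
```

Under that assumption the response is short by 94 leading one-second buckets, all of which fall before the earliest matching document.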

@uboness uboness closed this in da95370 Jan 23, 2014

uboness added a commit that referenced this issue Jan 23, 2014

Fixed an issue where there are sub aggregations executing on a single shard, the reduce call was not propagated properly down the agg hierarchy.

Closes #4843

@cmaitchison

Author

commented Jan 23, 2014

Thanks, @uboness, for your help and excellent explanation. range on histogram is definitely a feature I would use. For now I can fill in the gaps on the client-side. Thanks again.
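
For anyone taking the same route, a rough Python sketch of client-side gap filling might look like the following. It assumes fixed intervals aligned to the epoch (seconds, minutes, hours, or UTC days) and bucket keys in epoch milliseconds; the helper name `pad_buckets` and the window used below are hypothetical.

```python
def pad_buckets(buckets, from_ms, to_ms, interval_ms):
    """Return buckets covering [from_ms, to_ms], adding zero-count entries.

    buckets is the list of {"key": ..., "doc_count": ...} dicts from a
    date_histogram response; keys are assumed to be aligned to interval_ms.
    """
    counts = {b["key"]: b["doc_count"] for b in buckets}
    start = from_ms - from_ms % interval_ms  # align to a bucket boundary
    return [{"key": key, "doc_count": counts.get(key, 0)}
            for key in range(start, to_ms + 1, interval_ms)]


# The last two buckets from the response earlier in this thread,
# padded out to a hypothetical window that ends three seconds later.
returned = [
    {"key": 1390267546000, "doc_count": 83},
    {"key": 1390267547000, "doc_count": 2},
]
padded = pad_buckets(returned, 1390267546000, 1390267550000, 1000)
```

This simple modulo alignment would not be correct for calendar intervals such as months, or for histograms computed in a non-UTC time zone.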

@uboness

Contributor

commented Jan 23, 2014

@cmaitchison no worries... thank you for the bug report! important one!

@erikvanzijst

commented Jan 28, 2014

I'm interested in hard range boundaries (returning empty buckets to fill gaps between from and to in the case of missing documents) as well. Is there an issue tracking this, or shall I raise one?

@deanchen

commented Apr 28, 2015

For anyone who arrived at this thread via Google, hard ranges are supported via the extended_bounds param. http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html
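
A sketch of what such a request body could look like, built as a Python dict so it can be serialized to JSON. The field, format and interval mirror the original report; the aggregation name `events_histogram`, the helper function and the chosen bounds are assumptions for illustration. Newer Elasticsearch releases replace `interval` with `calendar_interval` / `fixed_interval`, so adjust for your version.

```python
import json


def histogram_with_bounds(field, interval, min_bound, max_bound):
    """Build a date_histogram aggregation body that uses extended_bounds."""
    return {
        "aggs": {
            "events_histogram": {  # hypothetical aggregation name
                "date_histogram": {
                    "field": field,
                    "interval": interval,
                    "format": "yyyy-MM-dd",
                    "min_doc_count": 0,
                    # Ask for buckets from min to max even where no documents exist.
                    "extended_bounds": {"min": min_bound, "max": max_bound},
                }
            }
        }
    }


body = histogram_with_bounds("@timestamp", "1d", "2014-01-10", "2014-01-22")
print(json.dumps(body, indent=2))
```

Note that extended_bounds only extends the bucket range; it does not filter documents, so a range filter or query is still needed to exclude data outside the window.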

@taf2

commented Jun 18, 2015

I'm now experiencing the same issue as reported, running ES 1.6.0.

histogram = {
  intervals: {
    date_histogram: {
      field: 'called_at',
      interval: 'day',
      order: { _key: "asc" },
      min_doc_count: 0 # doesn't appear to have any impact on the final result.
    },
    aggs: stats
  }
}
@taf2

commented Jun 18, 2015

It looks like when nesting a date_histogram within a terms aggregation, there is no way for min_doc_count to auto-fill the zero results.

aggs: {
  groups: {
    terms: {
      min_doc_count: 0,
      script: '...'
    },
    aggs: {
      intervals: {
        date_histogram: {
          field: 'called_at',
          interval: 'day',
          order: { _key: "asc" },
          min_doc_count: 0 # doesn't appear to have any impact on the final result.
        },
        aggs: stats
      }
    }
  }
}
@clintongormley

Member

commented Jun 18, 2015

@taf2 please could you open an issue with a complete recreation which explains the problem?

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Fixed an issue where there are sub aggregations executing on a single shard, the reduce call was not propagated properly down the agg hierarchy.

Closes elastic#4843