
[Stack Monitoring] Logstash Overview Panel missing due to max_buckets #56461

Closed · pickypg opened this issue Jan 31, 2020 · 3 comments · Fixed by #58205

Labels: bug (Fixes for quality problems that affect the customer experience), Feature:Stack Monitoring

Comments

pickypg (Member) commented Jan 31, 2020

Given a large enough number of Logstash pipelines, I have run into a situation where the Logstash panel does not appear under the Deployment/Cluster overview, because Elasticsearch rejects the Logstash search for creating too many aggregation buckets.

I saw this while running v7.5.2 with:

  • 8 Logstash nodes
  • 97 Logstash pipelines

Workarounds

For anyone running into this situation, there are at least three workarounds:

  1. Increase the time interval, which reduces the amount of data fetched by the background query.
  2. Increase the search.max_buckets soft limit in the cluster settings of the Monitoring cluster that contains the monitoring indices. This can be done dynamically via the _cluster/settings API, and it defaults to 10000 buckets (see the example after this list). Do this with caution, because the soft limit exists to bound memory usage in Elasticsearch.
  3. Find the overview URL, which contains the cluster_uuid, using either approach above, and navigate to it directly.
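
For workaround 2, a minimal sketch of raising the limit dynamically (the 20000 value is illustrative; pick the smallest limit that unblocks the query, and prefer removing the override once it is no longer needed):

PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 20000
  }
}
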
pickypg added the bug (Fixes for quality problems that affect the customer experience) and Feature:Stack Monitoring labels on Jan 31, 2020
chrisronline (Contributor) commented

@igoristic Is this something you can look into?

igoristic self-assigned this Feb 3, 2020
simianhacker (Member) commented

@chrisronline @igoristic What if you changed the groupBy terms agg to a composite agg and then paginated through to collect the results? That would fix the max_buckets issue. You would end up with multiple round trips to ES (from the server), but it would also be more resilient for larger clusters (make it slow :D). Correct me if I'm wrong, but if you fixed this in getSeries() it would also fix it in other places too.

And by "paginate through the results", I mean I would keep the behavior of getSeries where it returns everything; I would just make the underlying implementation collect everything with the composite agg.
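
A minimal sketch of that pattern, assuming the same monitoring indices queried below (index pattern and sources abbreviated):

GET .monitoring-logstash-7-*/_search
{
  "size": 0,
  "aggs": {
    "check": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "timestamp": {
              "date_histogram": {
                "field": "logstash_stats.timestamp",
                "fixed_interval": "30s"
              }
            }
          }
        ]
      }
    }
  }
}

Each response carries aggregations.check.after_key; the server re-issues the request with "after": <after_key> inside the composite body until no buckets come back, so no single page can exceed the requested size of 1000 buckets.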

igoristic (Contributor) commented

We were accidentally fetching all the pipelines on the Overview page just to see if there is a single bucket (to decide whether to show Logstash stats). And, since this method had a bug that fetched all nodesCountMetric pipelines for every throughputMetric pipeline, we were essentially doing O(N²) work, twice.
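
To put rough, illustrative numbers on that: with the 97 pipelines reported above, N² × 2 works out to 97 × 97 × 2 ≈ 18,800 pipeline fetches, just to answer a yes/no question about a single bucket.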

@pickypg I stress tested this with 100 generator pipelines, which did not cause any max_buckets errors, and the JVM spikes seem to be significantly lower. But I would like to know how it behaves in your environment.

Thanks to @simianhacker's suggestion, I investigated using composite queries, which sped up the query by about 15%. The composite query for node count looks something like this:

GET *:.monitoring-logstash-6-*,*:.monitoring-logstash-7-*,*:monitoring-logstash-7-*,*:monitoring-logstash-8-*,.monitoring-logstash-6-*,.monitoring-logstash-7-*,monitoring-logstash-7-*,monitoring-logstash-8-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "So2SpBkMT-yvN311fn8q3A"
          }
        },
        {
          "range": {
            "logstash_stats.timestamp": {
              "format": "epoch_millis",
              "gte": 1582256420161,
              "lte": 1582260020161
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "timestamp": {
              "date_histogram": {
                "field": "logstash_stats.timestamp",
                "fixed_interval": "30s"
              }
            }
          }
        ]
      },
      "aggs": {
        "pipelines_nested": {
          "nested": {
            "path": "logstash_stats.pipelines"
          },
          "aggs": {
            "by_pipeline_id": {
              "terms": {
                "field": "logstash_stats.pipelines.id",
                "include": ["random_00", "random_01", "random_02", "random_03", "random_04"],
                "size": 1000
              },
              "aggs": {
                "to_root": {
                  "reverse_nested": {},
                  "aggs": {
                    "node_count": {
                      "cardinality": {
                        "field": "logstash_stats.logstash.uuid"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
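
To page past the first 1000 composite buckets, the response's aggregations.check.after_key is fed back in as "after" on the next request (the value below is illustrative):

  "composite": {
    "size": 1000,
    "after": { "timestamp": 1582256450000 },
    "sources": [ ... ]
  }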

However, implementing this in get_series.js was getting a bit too complex, since the histogram/series logic is tightly coupled with other metrics that use the helpers in classes.js. Though I think we should still revisit the composite approach, also mentioned here: #36358

I also played around with auto_date_histogram, but apparently it too can throw a max buckets error if the bucket budget isn't allocated properly. The calculation is roughly bucket_size * agg_size, and in reality it will be more, since terms collects up to shard_size per shard (which defaults to size * 1.5 + 10).
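
As a rough worked example with this issue's numbers: the query above spans a 1-hour window at a 30s fixed_interval, i.e. 120 date buckets; multiplied by the 97 pipelines, that is 120 × 97 = 11,640 buckets, already past the default search.max_buckets of 10,000 even before any per-shard overcounting.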
