
Super slow Kibana UI on first load .. related to /_field_stats API call. #9386

Closed
diranged opened this issue Dec 6, 2016 · 17 comments · Fixed by #11114
Labels
bug · discuss · Team:Operations

Comments

diranged commented Dec 6, 2016

Kibana version: 4.6.2
Elasticsearch version: 2.4.1
Server OS version: Ubuntu 14.04.3
Browser version: Chrome 54
Browser OS version: OSX

Description of the problem including expected versus actual behavior:
Issue #4342 introduced a new way to dynamically find indices when searching against them. This works just fine for the actual searches, but it causes problems when Kibana has to figure out which indices to search: that change uses the /_field_stats API to determine which indices cover the selected time range.

We store 30 days' worth of indices named logs-YYYY-MM-DD and events-YYYY-MM-DD in our cluster. Each day we store between 1-2 TB of raw data (unreplicated; double it for replication). In any given month we're storing ~90-100 TB of replicated data and around 30-40 billion events.

We have two clusters ingesting identical data and running the same versions of Kibana and Elasticsearch. The only difference is that on the newer cluster we used the logs-* and events-* patterns rather than the [logs-]YYYY-MM-DD and [events-]YYYY-MM-DD patterns.

A quick bit of debugging showed that the Kibana instances configured with logs-* were issuing different API calls, in particular an extremely long-running /_field_stats call:

[admin@ops-logpipeline-es-frontend-XXX:~]$ time curl --verbose -XPOST "http://localhost:9200/logs-*/_field_stats?level=indices" -d '{"fields":["timestamp"],"index_constraints":{"timestamp":{"max_value":{"gte":1481001283234,"format":"epoch_millis"},"min_value":{"lte":1481002183234,"format":"epoch_millis"}}}}'
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 9200 (#0)
> POST /logs-*/_field_stats?level=indices HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:9200
> Accept: */*
> Content-Length: 176
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 176 out of 176 bytes
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 366
< 
* Connection #0 to host localhost left intact
{"_shards":{"total":420,"successful":420,"failed":0},"indices":{"logs-2016-12-06":{"fields":{"timestamp":{"max_doc":215701193,"doc_count":215701193,"density":100,"sum_doc_freq":862804772,"sum_total_term_freq":-1,"min_value":1480982400000,"min_value_as_string":"2016-12-06T00:00:00.000Z","max_value":1481002690210,"max_value_as_string":"2016-12-06T05:38:10.210Z"}}}}}
real	0m35.944s
user	0m0.007s
sys	0m0.000s

This call takes an average of 32-35 seconds on our cluster almost every single time, so the question I'm asking here is whether this is expected behavior. Is this a normal amount of time to load this data? Is it something that isn't really well tested on large clusters? Or should we simply go back to the old model, where the daily index strategy is listed explicitly in the index pattern?
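For reference, a rough sketch of what that old model looks like from the cluster's side (the index names, field name, and time range here are purely illustrative): with a [logs-]YYYY-MM-DD pattern, Kibana derives the matching daily indices from the time picker itself and queries them directly, so no cluster-wide /_field_stats call is needed.

# illustrative only: the old model searches an explicit list of daily indices
time curl -XPOST "http://localhost:9200/logs-2016-12-05,logs-2016-12-06/_search?ignore_unavailable=true" -d '{"size":0,"query":{"range":{"timestamp":{"gte":1481001283234,"lte":1481002183234,"format":"epoch_millis"}}}}'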

diranged commented Dec 6, 2016

Note: I just found this post from back in March where someone else was reporting a similar issue: https://discuss.elastic.co/t/kibana-field-stats-request-takes-20000-ms-if-number-of-indices-and-shards-are-more/43468/15

Bargs commented Dec 7, 2016

30 seconds is quite unusual. I'd suggest seeking some tuning advice in the Elasticsearch discuss forum. There might be some changes you can make to your ES setup to improve field stats performance.

@diranged

@Bargs,
I'm not sure what we can do to tune that query -- it's not something that's well documented. Switching away from the logs-* model and moving back to [logs-]YYYY-MM-DD has given us back the performance we expect. I would be surprised if we're the only people seeing this kind of impact... but was this change actually tested from a performance perspective on "large"-ish clusters?

From a performance perspective, our cluster looks something like this:

Masters

3 x c3.large master nodes that do nothing but cluster coordination -- they take no part in queries, load balancing, etc.

Hot data nodes

These ingest the daily log/event data and hold ~5 days' worth of indices (10 indices total: logs- and events-). This farm has 18 x m4.4xl instances right now, and they handle the majority of our Kibana queries because they have the most current data.

Warm data nodes

We roll data off after 5 days onto a farm of 10 x d2.2xl instances. These machines store 25 days' worth of indices (so 50 indices * 14 shards = 700 shards over 10 nodes). They have higher and more static heap sizes because they aren't being queried as much and we're not doing active indexing on them, but ultimately they sit at a pretty low load because they are rarely queried.

Daily Indexing Volume

Each day we index roughly 1-1.2 billion events; total storage size (with replication) is about 1.8 TB per day.

Bargs commented Dec 12, 2016

The main cause of slowness in field_stats I've seen is the number of shards. field_stats needs to run per shard, so reducing the number of shards per index might help. Beyond that I don't have a ton of tuning advice myself. Did you try posting in the Elasticsearch Discuss forum? Since the issue here is field_stats performance, we first need to diagnose why field_stats is slow in this particular case. If you post in Discuss and don't get any good advice, I can try to bug some ES team members.
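As a rough starting point for checking how much work a field_stats call has to do, something like the commands below works (the index pattern, template name, and shard count are illustrative, not recommendations). Since the shard count of an existing index is fixed, reducing it means changing the index template used for new daily indices.

# count the shards a logs-* field_stats call has to visit
curl -s "http://localhost:9200/_cat/shards/logs-*" | wc -l

# set a lower shard count for future daily indices via an index template (ES 2.x syntax)
curl -XPUT "http://localhost:9200/_template/logs_template" -d '{"template":"logs-*","settings":{"number_of_shards":4}}'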

@diranged

@Bargs,
So we have ~1700 shards in our cluster (primaries + replicas), and roughly 1400 of them are on these warm nodes. That said, I think it's a poor UI choice to explicitly recommend the <index>-* pattern when there is such an obvious performance issue on large clusters.

@Bargs added the Team:Operations and :Discovery labels and removed the :Discovery label on Dec 19, 2016
@ersushantsood

@diranged can you please share any input you have received on this issue? We are still facing it.

@ersushantsood

@diranged we are also using the logstash-* pattern for the field_stats API call.

s1monw commented Jan 9, 2017

@diranged @ersushantsood can you try to bisect the issue a bit and run field_stats against only a subset of the indices (see the sketch after this list)? What would be interesting is:

  • hot vs. cold indices (to see if the nodes make a difference)
  • if you reduce the number of indices, does the time taken go down too? i.e. is it related to the number of indices?
  • if you run the same request twice does it get faster on the second execution?
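A rough way to run that bisection with the same kind of call as in the original report (the index names, dates, and timing wrapper are just examples):

# hot data only (a single recent daily index)
time curl -s -XPOST "http://localhost:9200/logs-2016-12-06/_field_stats?level=indices" -d '{"fields":["timestamp"]}' > /dev/null

# warm data only (older daily indices)
time curl -s -XPOST "http://localhost:9200/logs-2016-11-*/_field_stats?level=indices" -d '{"fields":["timestamp"]}' > /dev/null

# run either command twice in a row to see whether the second execution is faster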

@cdahlqvist

@diranged Have you tried isolating the field stats lookup test to just hot and/or warm nodes by altering the pattern? That should prove whether it is indeed the warm nodes that are causing the slowdown or not.

laf-rge commented Feb 15, 2017

I am experiencing the same issue. Even when I look at just one index, the _field_stats query takes 8 seconds on a 5+1 shard, 800 GB index.

@alexfrancoeur

Work being done in #11014 should resolve this issue.

s1monw commented Apr 10, 2017

oh this makes my day! 👍

Bargs commented Apr 10, 2017

The new field_caps API won't help us on this issue. This issue is caused by the field_stats call that figures out which indices to query based on the selected time range. To fix this, we need to consider #6644.

s1monw commented Apr 10, 2017

@Bargs we will get rid of field_stats, so I wonder why you still need it?

Bargs commented Apr 10, 2017

I don't think we still need it; we can probably just remove our use of it in this scenario. I just wanted to point out that switching to the field_caps API to get searchable/aggregatable status won't affect this issue, because this is a separate problem.
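For context, the two calls look roughly like this (the index and field names are just examples). field_caps reports per-field searchable/aggregatable flags, while the index-expansion logic depends on the per-index min_value/max_value that only field_stats returns.

# field_caps (ES 5.4+): capabilities only, no min/max values
curl -XGET "http://localhost:9200/logs-*/_field_caps?fields=timestamp"

# field_stats: per-index stats, including the min_value/max_value used for time-range expansion
curl -XPOST "http://localhost:9200/logs-*/_field_stats?level=indices" -d '{"fields":["timestamp"]}'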

epixa commented May 18, 2017

The solution here is to not use the field_stats behavior for index patterns. Elasticsearch has since improved its performance when querying across time-based indices, so the "expand" behavior for index patterns that we added as the default early in 4.x, which leveraged the field_stats API, is no longer necessary and can even be harmful. Starting in 5.4.0 we changed the default for all new index patterns to not use the expanding behavior, and we will soon be deprecating that option entirely. I'd recommend that anyone on 5.4.0 or later uncheck the "Expand index pattern when searching" option for all of their index patterns, and this problem should go away.
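Concretely, with the expand option unchecked, Kibana can search the wildcard pattern directly with the time filter inside the query (the field name and timestamps below are illustrative), and Elasticsearch can cheaply skip shards whose data falls entirely outside the range:

curl -XPOST "http://localhost:9200/logs-*/_search?ignore_unavailable=true" -d '{"size":0,"query":{"range":{"timestamp":{"gte":1481001283234,"lte":1481002183234,"format":"epoch_millis"}}}}'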

epixa closed this as completed on May 18, 2017
s1monw commented May 18, 2017

w00t
