Index creation causes cluster health to turn red momentarily #9106

Closed
ppf2 opened this issue Dec 30, 2014 · 12 comments
Labels
discuss, :Distributed/Distributed, >enhancement

Comments

@ppf2
Member

ppf2 commented Dec 30, 2014

It is not uncommon for admins in the field to set up alerts against the cluster health (red/yellow/green). Currently, index creation can cause the cluster health to go red momentarily until its primary shards are allocated (expected). It would be a nice enhancement to have a way to create an index without causing the cluster health to go red (even for a short, sub-second duration).

@jpountz
Contributor

jpountz commented Jan 9, 2015

Could this be related to #9126? If the index creation API starts waiting for yellow by default then maybe the health status could only take into account the newly created index once the index creation request terminates (including timeouts)?

@barakcoh

+1

Marvel is causing us a sub-second red status every day at midnight and it's quite annoying to constantly see it in the Shard Allocation section. We also have the above alert in place. If it happens to query the cluster at that exact time, people will get a Twilio call in the middle of the night, which is less than ideal.

@clintongormley added the :Data Management/Stats label Aug 24, 2015
@jhansen-tt

+1
What can I do to help fix this?

@ppf2
Member Author

ppf2 commented Oct 19, 2015

+1 Use case: Indexing a ton of data via Logstash hourly indices and seeing red every hour ..

@mikemccand
Contributor

+1, the cluster should never go red unless data loss has occurred ... this is a nasty bug in our cluster health.

It's like the smoke alarms that go off in my house when it's too dusty or we are cooking something "unusual".

#9126 seems very much related.

@bleskes
Contributor

bleskes commented Oct 20, 2015

@mikemccand it is related, though slightly different. Even if we hold back the create index response until the index is green/yellow etc., independent monitoring of cluster health will still report its status.

I'm +100 on solving this but, to date, I couldn't come up with a proper solution. When we create an index we add unassigned primaries + replicas to the routing table. We try to assign the primaries immediately (which may fail because of throttling) and publish the cluster state to the nodes for the primaries to initialize. Here lies the problem: a cluster state with initializing primaries is technically red. Only once the shards are started do we move to yellow. One could say that cluster health should ignore initializing/unassigned shards which are guaranteed to not contain data, but then what happens when those primaries cannot be assigned (because of allocation filtering or whatever)? We should still communicate that somehow, as the situation is wrong. I'd love to hear an elegant suggestion here...
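
To make the state transitions described above concrete, here is a minimal Python sketch of a health roll-up that behaves the way this comment describes. It is an illustration only, not Elasticsearch's actual code: `ShardState`, `Shard`, `index_health` and `cluster_health` are hypothetical names, and the rule "any primary not started means red" is taken from the description in this comment.

```python
from dataclasses import dataclass
from enum import Enum


class ShardState(Enum):
    UNASSIGNED = "unassigned"
    INITIALIZING = "initializing"
    STARTED = "started"


@dataclass
class Shard:
    primary: bool
    state: ShardState


# Higher number means worse health.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def index_health(shards):
    """Health of one index: red if any primary is not started,
    yellow if all primaries are started but some replica is not."""
    if any(s.primary and s.state is not ShardState.STARTED for s in shards):
        return "red"
    if any(not s.primary and s.state is not ShardState.STARTED for s in shards):
        return "yellow"
    return "green"


def cluster_health(indices):
    """Cluster health is the worst health among all indices."""
    statuses = (index_health(shards) for shards in indices.values())
    return max(statuses, key=SEVERITY.__getitem__, default="green")


# A freshly created index: primary is initializing, replica is unassigned.
new_index = [Shard(True, ShardState.INITIALIZING), Shard(False, ShardState.UNASSIGNED)]
old_index = [Shard(True, ShardState.STARTED), Shard(False, ShardState.STARTED)]
```

Under these assumptions, adding `new_index` to an otherwise healthy cluster makes the roll-up red until the primary starts, which is the momentary flap reported in this issue.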

@nik9000
Member

nik9000 commented Oct 20, 2015

I had this trouble too: every time I did an online mapping change I had to rebuild the index and stream one index to another, and I had 1600 indexes to do. Icinga generally thought Elasticsearch was flapping at that time because it was.

Maybe ignore indexes less than 60 seconds old in overall cluster state. The index itself should be red, but maybe not the whole cluster.

Any solution to this is going to break a whole lot of tests somewhere but is probably worth it.
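
The 60-second idea above could be sketched roughly as follows. This is a hypothetical illustration in Python, not a proposed patch: the function name `cluster_status`, the `(status, created_at)` input shape, and the `grace_seconds` parameter are all assumptions made for the example. Each index keeps its own status, and only the cluster-level roll-up skips indexes younger than the grace period.

```python
def cluster_status(indices, now, grace_seconds=60.0):
    """Roll up per-index statuses into one cluster status.

    indices: iterable of (status, created_at) pairs, where status is
    'green', 'yellow' or 'red' and created_at is a timestamp in seconds.
    Indexes younger than grace_seconds are skipped in the roll-up.
    """
    severity = {"green": 0, "yellow": 1, "red": 2}
    worst = "green"
    for status, created_at in indices:
        if now - created_at < grace_seconds:
            continue  # too new to count toward the cluster-level status
        if severity[status] > severity[worst]:
            worst = status
    return worst
```

With this rule an index whose primaries can never be assigned still turns the cluster red once it is older than the grace period, so a real allocation problem is delayed in the roll-up rather than hidden.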

@felipegs

+1

@clintongormley added >enhancement :Cluster and removed :Data Management/Stats labels Dec 3, 2015
@bashok001

+1 Very much needed. Our alarms are going off every couple of days. I worry that continuing the practice of waiting it out will cost us dearly one day when there is a real problem.

@jeffkirk1

+1 I'm experiencing this issue daily as well, coincidental with Marvel index reloads. Elasticsearch 2.2.0. Temporarily disabled Marvel refreshes to compensate but obviously that's not a great long term solution.

@majormoses
Contributor

@nik9000

Maybe ignore indexes less than 60 seconds old in overall cluster state. The index itself should be red, but maybe not the whole cluster.

This makes sense to me

@clintongormley

Fixed by #18737

@clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018