Closing an index right after it has been created leaves it in an unopenable state #3313

Closed
nik9000 opened this Issue Jul 10, 2013 · 9 comments

Contributor

nik9000 commented Jul 10, 2013

Closing an index right after it has been created leaves it in an unopenable state. Specifically, opening the index results in a bunch of shards being unassigned and never getting assigned automatically. Reproduction curl commands ready to paste into a shell: https://gist.github.com/nik9000/5970277
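The gist above has the exact commands; a minimal sketch of the same shape (assuming a local 0.90-era node on localhost:9200 and an index named test, both placeholders) would be:

```shell
# Create an index and close it immediately, before any primary has been allocated.
curl -XPUT 'localhost:9200/test' -d '{"settings": {"number_of_shards": 5}}'
curl -XPOST 'localhost:9200/test/_close'

# Reopening now leaves the shards unassigned and the index stuck red.
curl -XPOST 'localhost:9200/test/_open'
curl 'localhost:9200/_cluster/health/test?pretty'
```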


Member

imotov commented Jul 14, 2013

Confirmed, this is indeed the case. There is a simple workaround for this issue though - just make sure that the index gets to at least yellow state before trying to close it. I am curious: is there a particular use case that led to this being a problem?
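Scripted, the workaround amounts to blocking on cluster health before issuing the close (a sketch; the node address and index name are placeholders):

```shell
# Wait until the index reaches at least yellow (all primaries assigned);
# only then is it safe to close it.
curl 'localhost:9200/_cluster/health/test?wait_for_status=yellow&timeout=30s'
curl -XPOST 'localhost:9200/test/_close'
```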


Contributor

nik9000 commented Jul 14, 2013

On Sun, Jul 14, 2013 at 1:18 PM, Igor Motov wrote:

> Confirmed, this is indeed the case. There is a simple workaround for this issue though - just make sure that the index gets to at least yellow state before trying to close it. I am curious: is there a particular use case that led to this being a problem?

I was building a script that can build an index from scratch as well as update it when portions of its configuration are out of date, and I was being lazy about how I implemented it, figuring that an empty index would be cheap to open and close. After I stopped being lazy and specified the analysers during index creation, my problem went away. It would probably have been good enough to add a note to http://www.elasticsearch.org/guide/reference/api/admin-indices-open-close/ and I wouldn't have tried it.

Are there other non-index-creation cases that put the index in this state? If so, it might be worth implementing something that stops the close action.

Nik


Member

kimchy commented Jul 14, 2013

@imotov I believe, as you mentioned, that we can really open an index only after all the primary shards have been allocated at least once. This is because we can't recreate the primaryAllocatedPostApi flag (we could potentially, but it's not supported now).

I suggest that a simple fix for now is to reject a close index request if one of its index shard routing infos has primaryAllocatedPostApi set to false.


Contributor

nik9000 commented Jul 18, 2013

While working on a patch for the issue kimchy mentioned, I noticed a look-alike issue: https://gist.github.com/nik9000/6028478

I'll post another GitHub issue after some more investigation.


Member

imotov commented Jul 20, 2013

Hi @nik9000. Thanks for the PR. I am just thinking maybe we can remove assertRed(); and assert false; from the test. This way we are still testing fast closing, but even if closing is not fast enough the test wouldn't fail. I also feel that 128 shards might be a bit excessive. Maybe reduce it to 50 or even 20?

I looked at the related problem with quorum as well. I'm not really sure what we should do about it. Should we even allow creation of an index with 3 replicas and index.recovery.initial_shards=quorum on a single node? On the other hand, even if we have 4 nodes, it's not always obvious whether we can fulfill the index.recovery.initial_shards requirements or not. So, we could store a flag in the index metadata indicating that the index hasn't been fully allocated at least once yet. If this flag is set, LocalGatewayAllocator would ignore requiredAllocation or, as in the case of a prematurely closed index, create missing shards as needed. We could even use this flag to block any operations on such an index, so it would be really obvious that the index is not in a proper state. @kimchy what do you think?
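For context, the problematic combination being discussed is set per index at creation time. A sketch of it (node address and index name are illustrative placeholders):

```shell
# On a single-node cluster, a quorum of 4 shard copies (1 primary + 3
# replicas) can never be satisfied during gateway recovery, so an index
# closed in this state may never reopen.
curl -XPUT 'localhost:9200/test' -d '{
  "settings": {
    "number_of_replicas": 3,
    "index.recovery.initial_shards": "quorum"
  }
}'
```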


Contributor

nik9000 commented Jul 20, 2013

On Fri, Jul 19, 2013 at 8:14 PM, Igor Motov wrote:

> Hi @nik9000. Thanks for the PR.

Thanks for taking the time to read it!

> I am just thinking maybe we can remove assertRed(); and assert false; from the test. This way we are still testing fast closing, but even if closing is not fast enough the test wouldn't fail. I also feel that 128 shards might be a bit excessive. Maybe reduce it to 50 or even 20?

I'm not sure it'd be a good test if it sometimes didn't verify anything. If we were in JUnit I'd say we could use the Assume API, but I'm not really sure what the right thing is in TestNG.

As to the 128 shards, it was just a number that seemed to trigger the behavior. IIRC 20 wouldn't have consistently triggered the problem on my laptop. 50 probably would, but I didn't want to risk someone having a faster machine than mine and getting an unexpectedly useless/failing test.

> I looked at the related problem with quorum as well. Not really sure what we should do about it. Should we even allow creation of an index with 3 replicas and index.recovery.initial_shards=quorum on a single node? On the other side, even if we have 4 nodes, it's not always obvious if we can fulfill index.recovery.initial_shards requirements or not. So, we could store some flag in index metadata that would indicate that this index wasn't fully allocated at least once yet. And if this flag is set, LocalGatewayAllocator would ignore requiredAllocation or as in case of prematurely closed index create missing shards as needed. We could even use this flag to block any operations on such index, so it would be really obvious that this index is not in a proper state. @kimchy what do you think?

In my book, stopping people when they ask for a configuration that just isn't going to fully allocate sounds like the right thing to do. It'd probably make sense to have a force flag that gets the unchecked behavior, but with a warning that things might not work properly if you don't bring those nodes online.

What about the case where everything makes sense and allocates properly when you create the index, but then you lose a node? Without that node you can close the index, but it won't open again until you bring that node back online. At least, that is what I saw when I was playing with #3354.


Contributor

nik9000 commented Jul 20, 2013

I could rework the test so it used a small number of shards and just tried again if it didn't hit the problem. That wouldn't be too tough. I'll have a look at that sometime in the next few days.



Member

imotov commented Jul 20, 2013

Interesting, I was able to consistently reproduce it with 10 shards (and even 5 in most cases), hence the suggested number. I was thinking about retry logic as well, but you would still need to remove assertRed() to remove the race condition between checking index health and closing the index, and then clean up any index that failed to trigger the issue.


Contributor

nik9000 commented Jul 20, 2013

I've updated the pull request with retry logic, and it looks like I can reproduce it with two shards! I suppose I should have tried ratcheting down the number rather than watching my logs. Anyway, I feel better with the retry logic making sure the test actually does something, while running more quickly when it can get away with it.

Nik


imotov added a commit to imotov/elasticsearch that referenced this issue Jul 24, 2013

@imotov imotov closed this in 6b8123d Jul 26, 2013

imotov added a commit that referenced this issue Jul 26, 2013

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
