An inactive shard is activated by triggered synced flush #13802

Closed

bleskes wants to merge 3 commits into master from bleskes:indexing_memory_controler
Conversation

bleskes (Member) commented Sep 25, 2015

When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController into thinking that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush, and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking whether a shard became active. This comes at the price of potentially missing indexing operations that are immediately followed by a flush. This is acceptable: if no more index operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

I think we need to push this all the way to 1.7.3 ...
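To make the fix concrete, here is a minimal, self-contained sketch of the kind of check described above. All names (ActivityCheckSketch, ShardStatus, TranslogView, becameActive) are invented for illustration; this is not the actual IndexingMemoryController code.

```java
// Illustrative sketch only -- all names here are invented, not the real
// IndexingMemoryController internals.
class ActivityCheckSketch {

    static class ShardStatus {
        long lastTranslogGeneration; // generation seen at the previous check
    }

    static class TranslogView {
        long generation;     // current translog generation
        int operationCount;  // operations written to the current generation
    }

    /** Returns true only if real indexing happened since the last check. */
    static boolean becameActive(ShardStatus status, TranslogView translog) {
        boolean newGeneration = translog.generation != status.lastTranslogGeneration;
        // Before the fix: a new generation alone marked the shard active, so the
        // empty generation rolled by a synced flush re-activated an idle shard.
        // After the fix: an empty translog is ignored, breaking the
        // inactive -> synced flush -> "active" -> inactive loop.
        return newGeneration && translog.operationCount > 0;
    }
}
```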

Internal: an inactive shard is temporarily activated by triggered synced flush


@bleskes bleskes changed the title from An inactive shard is temporarily activated by triggered synced flush to An inactive shard is activated by triggered synced flush Sep 25, 2015

@bleskes bleskes added the review label Sep 26, 2015

@@ -144,7 +151,7 @@ public IndexingMemoryController(Settings settings, ThreadPool threadPool, Indice
                 translogBuffer = maxTranslogBuffer;
             }
         } else {
-            translogBuffer = ByteSizeValue.parseBytesSizeValue(translogBufferSetting, null);
+            translogBuffer = ByteSizeValue.parseBytesSizeValue(translogBufferSetting, TRANSLOG_BUFFER_SIZE_SETTING);
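A note on why this one-line change matters (my reading, hedged): passing the setting's name instead of null lets a parse failure report which setting held the bad value. A usage sketch, with a made-up configured value:

```java
// Hypothetical usage; the second argument names the setting so that a
// malformed value yields an error message pointing at that setting
// rather than a bare parse failure.
ByteSizeValue translogBuffer = ByteSizeValue.parseBytesSizeValue(
        "64mb",                         // example configured value (made up)
        TRANSLOG_BUFFER_SIZE_SETTING);  // the setting's name, for error reporting
```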

mikemccand (Contributor) commented Sep 26, 2015

Nice catch.

@@ -154,18 +161,21 @@ public IndexingMemoryController(Settings settings, ThreadPool threadPool, Indice
// we need to have this relatively small to move a shard from inactive to active fast (enough)

mikemccand (Contributor) commented Sep 26, 2015

Pre-existing issue: I think the 30s default is too large, because an index that was inactive and suddenly becomes active is stuck with a 512 KB indexing buffer for up to 30s of heavy indexing, writing many, many segments. It would be better if the first indexing op to arrive at an inactive shard could force the IMC to wake up and re-evaluate.

I'll open a separate issue about this ...

mikemccand (Contributor) commented Sep 26, 2015

This was a great catch @bleskes, and the cutover from IT to unit test is wonderful.

I left a bunch of small comments, but otherwise LGTM.

I still find that massive if statement in updateShardStatuses confusing, but I can't quickly see an easy way to improve it...

bleskes (Member) commented Sep 26, 2015

@mikemccand thx for the review. I pushed another round and tried to simplify the 'big if'.

mikemccand (Contributor) commented Sep 26, 2015

LGTM, thanks @bleskes!

@bleskes bleskes closed this in 148265b Sep 27, 2015

bleskes (Member) commented Sep 27, 2015

Pushed this to master. Will give CI some time before pushing it all the way back to 1.7...

@bleskes bleskes deleted the bleskes:indexing_memory_controler branch Sep 27, 2015

bleskes (Member) commented Sep 27, 2015

> Pre-existing issue: I think 30s default is too large, because an index that was inactive and suddenly becomes active with a 512 KB indexing

I was thinking about this some more (morning run, he he) - I think we should fold all the decisions and state about being active or not into the IndexShard. Then it will be easy to reset it on every index operation (immediately). We also already have the notion of an inactivity listener - we can extend it to fire an active event as well. We can also add a checkIfInactive method on IndexShard which checks whether there were any indexing ops (we already have stats) since the last time it was checked. To avoid making IndexShard more complex, we can push this functionality into ShardIndexingService - which is already in charge of stats.
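A rough, self-contained sketch of that proposal (it also covers the earlier suggestion that the first indexing op should immediately re-activate the shard). Every name here (ShardActivityTracker, ActivityListener, markActive, checkIfInactive) is invented for illustration and is not actual Elasticsearch code:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// Invented sketch of per-shard activity tracking; not Elasticsearch code.
class ShardActivityTracker {

    interface ActivityListener {
        void onShardActive();
        void onShardInactive();
    }

    private final AtomicLong opsSinceLastCheck = new AtomicLong();
    private final List<ActivityListener> listeners = new CopyOnWriteArrayList<>();
    private volatile boolean active = true;

    /** Called on every index operation: re-activates the shard immediately. */
    void markActive() {
        opsSinceLastCheck.incrementAndGet();
        if (!active) {
            // Best-effort flip; duplicate notifications are possible under
            // concurrency and assumed acceptable for this sketch.
            active = true;
            listeners.forEach(ActivityListener::onShardActive);
        }
    }

    /** Called periodically (e.g. by the memory controller): deactivates an idle shard. */
    boolean checkIfInactive() {
        if (active && opsSinceLastCheck.getAndSet(0) == 0) {
            active = false;
            listeners.forEach(ActivityListener::onShardInactive);
        }
        return !active;
    }
}
```

The check-and-reset via getAndSet(0) is what would replace the translog-generation heuristic: the shard is deactivated only when no ops arrived during the whole interval, so a synced flush alone can never flip it back to active.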

mikemccand (Contributor) commented Sep 27, 2015

> I think we should fold all the decisions and state about being active or not into the IndexShard.

+1, this would be cleaner. E.g. I don't like how IndexShard.updateBufferSize "infers" that it's going inactive by checking whether the new size is 512 KB ... it would be better if it were more directly aware that it's going inactive.

I can try to tackle this.

> To avoid making IndexShard more complex, we can push this functionality into ShardIndexingService - which is already in charge of stats.

Or, why not absorb ShardIndexingService into IndexShard? All it does is stats gathering and calling listeners (which I think are only used by the percolator?). It's also confusing how it has postCreate and postCreateUnderLock. Seems like maybe an excessive abstraction at this point...

bleskes (Member) commented Sep 27, 2015

I personally don't mind folding it into IndexShard, though having a dedicated stats/active component has benefits (a clearer testing suite). I know Simon has an opinion on this.

bleskes added a commit that referenced this pull request Sep 28, 2015

Internal: an inactive shard is activated by triggered synced flush

Closes #13802

bleskes added a commit that referenced this pull request Sep 28, 2015

Internal: an inactive shard is activated by triggered synced flush

Closes #13802

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 28, 2015

Internal: an inactive shard is activated by triggered synced flush

Relates #13802
Includes a backport of #13784

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 28, 2015

Internal: an inactive shard is activated by triggered synced flush

Relates #13802
Includes a backport of #13784

Closes #13824

@clintongormley clintongormley added v2.0.0-rc1 and removed v2.0.0 labels Oct 7, 2015
