
Incrementally transfer chunks per token to improve handover #1764

Open: rfratto wants to merge 17 commits into master from rfratto:incremental-chunk-transfers
Conversation

rfratto commented Oct 28, 2019

Design document: https://docs.google.com/document/d/1y2TdfEQ9ZKh6CpBVB4o6BYjCr-plNRL9jGD6fJ9bMW0/edit#

This PR introduces two incremental chunk transfer processes, used by the lifecycler to reduce spillover and enable dynamic scaling of ingesters. The incremental transfer processes take precedence over the old handover mechanism.

To migrate a cluster to use incremental transfers, two rollouts must be done:

  1. Rollout ingesters with -ingester.leave-incremental-transfer=true
  2. Rollout ingesters with -ingester.join-incremental-transfer=true
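For illustration, here is a rough sketch (not the PR's actual code) of how these two lifecycler options could be wired up as Go flags; the field names, YAML keys, and flag names follow the PR description and commit messages, while the defaults and help text are my own wording.

```go
// Hypothetical sketch only: field and flag names follow the PR description;
// defaults and help strings are illustrative.
package ring

import "flag"

type LifecyclerConfig struct {
	JoinIncrementalTransfer  bool `yaml:"join_incremental_transfer"`
	LeaveIncrementalTransfer bool `yaml:"leave_incremental_transfer"`
	// ... existing lifecycler fields elided ...
}

func (cfg *LifecyclerConfig) RegisterFlags(f *flag.FlagSet) {
	f.BoolVar(&cfg.JoinIncrementalTransfer, "ingester.join-incremental-transfer", false,
		"Incrementally request chunk ranges from neighbouring ingesters when joining the ring.")
	f.BoolVar(&cfg.LeaveIncrementalTransfer, "ingester.leave-incremental-transfer", false,
		"Incrementally send chunk ranges to neighbouring ingesters when leaving the ring; disables the old handover.")
}
```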

I recognize this is a large PR and I have attempted (to the best of my ability) to split it into smaller, independent commits. It's not perfect, but hopefully the commits I have make it easier to review.

Fixes #1277.

/cc @gouthamve @pstibrany @tomwilkie

@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from 3db8e47 to 0737ed0 Oct 28, 2019
pstibrany commented Oct 30, 2019

There are multiple test failures and reported race conditions. Would be nice to fix those.

@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch 7 times, most recently from f67be16 to b7b9ee6 Oct 30, 2019

pstibrany left a comment

Comments after reviewing the first commit (thanks for splitting your work into logical steps!)

Review comments on: pkg/ring/model.go, pkg/ring/ring.go, pkg/ring/replication_strategy.go, pkg/ring/ring.proto
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from 5370565 to 03b4dc9 Oct 31, 2019

pstibrany left a comment

Another round of comments. I still need to better understand and review the real meat of the PR (like the entire incremental_transfer.go).

My initial impression is that this is a very complex piece of code that will be hard to find and fix bugs in :-(

Review comments on: docs/arguments.md, pkg/ring/lifecycler.go, pkg/ring/model.go, pkg/util/test/poll.go, pkg/ring/incremental_transfer.go
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from 875afd7 to cbc9007 Oct 31, 2019
@pstibrany pstibrany mentioned this pull request Nov 1, 2019

pstibrany left a comment

Another pass, this time mostly around TokenChecker.

Review comments on: pkg/ring/token_checker.go, pkg/ingester/ingester.go, pkg/ingester/incremental_transfer.go, pkg/ring/lifecycler.go
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from ce1e47e to a364ceb Nov 1, 2019

rfratto commented Nov 1, 2019

I've addressed most of the review feedback so far. I want to rebase against latest and fix the merge conflicts before I continue addressing feedback. This may take a little bit of time; both the TSDB blocks work and the gossip work are going to change bits and pieces of the current implementation.

@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch 3 times, most recently from d9009f2 to dd5ac36 Nov 1, 2019
This commit adds state to each token alongside the overall ingester
state. The ingester's state is kept for backwards compatibility and
to make it easier to see what the ingester is doing (i.e., in health
checks or on the ring status web page). Ingesters also store a boolean
indicating whether they have stateful tokens.

The Get method on the ring has been modified to use the token state
rather than the ingester state. When reading the ring, if an ingester
doesn't have stateful tokens enabled, token states are set to
match the ingester's state.

Currently, the state of the tokens is forced to match the state of
the ingester. Further changes will separate out the two concepts.

As old gRPC clients still need to access normalised tokens, the
protobuf could not be updated to change the type of the tokens field.
Instead, an extra "inactiveTokens" field has been added to IngesterDesc.
This field is a map of all non-ACTIVE tokens to the state they are
currently in.

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
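As a rough, self-contained illustration of the scheme this commit message describes (not the actual generated code), resolving a single token's state could look roughly like this; the type and field names (State, IngesterDesc, StatefulTokens, InactiveTokens) are assumptions based on the description above.

```go
// Illustrative sketch; names follow the commit message, not the real ring.proto output.
package main

import "fmt"

type State int

const (
	ACTIVE State = iota
	LEAVING
	JOINING
)

type IngesterDesc struct {
	State          State
	StatefulTokens bool
	InactiveTokens map[uint32]State // only non-ACTIVE tokens are listed here
}

// tokenState returns the state of a single token, falling back to the
// ingester's own state when stateful tokens are not enabled.
func tokenState(ing IngesterDesc, token uint32) State {
	if !ing.StatefulTokens {
		return ing.State // backwards compatibility with old ingesters
	}
	if s, ok := ing.InactiveTokens[token]; ok {
		return s
	}
	return ACTIVE // tokens absent from the map are ACTIVE
}

func main() {
	ing := IngesterDesc{
		State:          ACTIVE,
		StatefulTokens: true,
		InactiveTokens: map[uint32]State{42: JOINING},
	}
	fmt.Println(tokenState(ing, 42) == JOINING) // true
	fmt.Println(tokenState(ing, 7) == ACTIVE)   // true
}
```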

pstibrany left a comment

Yet another round of comments.

The main issue I have is with the requirement that we need to move all the replicated data around, not just the primary data for a token. This is further complicated by adjacent tokens belonging to the same ingester, dealing with unhealthy ingesters, and token states.

Review comments on: pkg/ingester/client/cortex.proto, pkg/ingester/incremental_transfer.go, pkg/ring/ring_test.go, pkg/ring/model_test.go, pkg/ring/incremental_transfer.go

rfratto commented Nov 5, 2019

The main issue I have is with the requirement that we need to move all the replicated data around, not just the primary data for a token. This is further complicated by adjacent tokens belonging to the same ingester, dealing with unhealthy ingesters, and token states.

Adjacent tokens belonging to the same ingester, I find, is the easier half of the problem. The main complexities with dealing with the ring are when tokens belonging to the same ingester are near, but not next to, each other. For example, when dealing with a ring A1 B A2 C, we have to keep the tokens unmerged to know which subset of ranges A will be handling. Dealing with tokens that are next to each other is handled implicitly through dealing with tokens that are near each other.

Another issue with simplifying how we deal with the ring is the risk of introducing minor differences between what the distributor does and what the ingesters do when moving data around. If this were to happen, then incrementally joining and leaving would stop working properly: spillover may be introduced if the ingester doesn't request data from the proper ingester, and chunks may go untransferred and be flushed before their capacity is reached.

I don't think we can loosen the requirement of moving all the replicated data around. If we didn't do this, every time an ingester leaves, we would lose 1/replicationFactor of our data. If you did a complete rollout, you would lose all replicas outside of the original owner. This would subtly break a lot of things, including querying, which relies on a quorum of replicas to return correct results.
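To make the "A1 B A2 C" example above concrete, here is a small self-contained sketch (token values invented, replication ignored) of how a key maps to the first token clockwise from its hash, which is why A's two tokens cover two separate, non-adjacent ranges.

```go
// Illustrative only: token values are made up and replication is ignored.
package main

import (
	"fmt"
	"sort"
)

type entry struct {
	token    uint32
	ingester string
}

// ring is sorted by token: A1=100, B=200, A2=300, C=400.
var ring = []entry{{100, "A"}, {200, "B"}, {300, "A"}, {400, "C"}}

// owner returns the ingester owning the first token >= key, wrapping around.
func owner(key uint32) string {
	i := sort.Search(len(ring), func(i int) bool { return ring[i].token >= key })
	if i == len(ring) {
		i = 0 // wrap: keys past the last token belong to the first token
	}
	return ring[i].ingester
}

func main() {
	fmt.Println(owner(150)) // "B":  (100, 200] belongs to B
	fmt.Println(owner(250)) // "A":  (200, 300] belongs to A's second token
	fmt.Println(owner(450)) // "A":  wraps around to A's first token
}
```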


pstibrany commented Nov 5, 2019

I don't think we can loosen the requirement of moving all the replicated data around. If we didn't do this, every time an ingester leaves, we would lose 1/replicationFactor of our data. If you did a complete rollout, you would lose all replicas outside of the original owner. This would subtly break a lot of things, including querying, which relies on a quorum of replicas to return correct results.

Can you elaborate on how we would lose data? In the complete rollout scenario, we can use the same mechanism as we do today. Perhaps the solution could be to warn the admin about not leaving too many ingesters at once?


rfratto commented Nov 5, 2019

Can you elaborate on how we would lose data? In the complete rollout scenario, we can use the same mechanism as we do today. Perhaps the solution could be to warn the admin about not leaving too many ingesters at once?

We would lose data because each ingester holds data for replicationFactor total ingesters, including itself. My fallback is to flush anything that didn't get transferred, but we generally have queriers configured to only query the store for data that isn't in memory anymore. That means there will be some period where an ingester gets a query for data that it should be a replica of, but it never received that data during the transfer.

rfratto added 2 commits Oct 2, 2019
This commit introduces managing incremental transfers between ingesters
when a lifecycler joins a ring and when it leaves the ring. The
implementation of the IncrementalTransferer interface will be done in a
future commit.

The LifecyclerConfig has been updated with JoinIncrementalTransfer and
LeaveIncrementalTransfer, available as join_incremental_transfer and
leave_incremental_transfer using the YAML config, and
join-incremental-transfer and leave-incremental-transfer using command
line flags.

When JoinIncrementalTransfer is used, the lifecycler will join the ring
immediately. Tokens will be inserted into the ring one by one, first in
the JOINING state and then moved to ACTIVE after requesting chunks for
the token ranges they should have data for from neighboring ingesters
in the ring.

When LeaveIncrementalTransfer is used, the lifecycler will incrementally
move tokens into the LEAVING state after sending their ranges to the
neighboring ingesters that should now have the data. Enabling
LeaveIncrementalTransfer disables the old handoff process; flushing any
non-transferred data always happens at the end.

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
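A rough sketch of the join path this commit message describes, written as illustrative Go; the interface and function names here are invented, and the real logic lives in the lifecycler and pkg/ring/incremental_transfer.go in this PR.

```go
// Illustrative sketch; none of these names are the PR's actual identifiers.
package sketch

type tokenState string

const (
	joining tokenState = "JOINING"
	active  tokenState = "ACTIVE"
)

// ringClient stands in for whatever the lifecycler uses to update the ring
// and to pull chunk ranges from neighbouring ingesters.
type ringClient interface {
	AddToken(token uint32, s tokenState) error
	RequestChunkRanges(token uint32) error
	SetTokenState(token uint32, s tokenState) error
}

func joinIncrementally(rc ringClient, tokens []uint32) error {
	for _, tok := range tokens {
		// 1. Advertise the token as JOINING so writes are not yet routed to it.
		if err := rc.AddToken(tok, joining); err != nil {
			return err
		}
		// 2. Request chunks for the ranges this token will own from the
		//    neighbours that currently hold them.
		if err := rc.RequestChunkRanges(tok); err != nil {
			return err
		}
		// 3. Flip the token to ACTIVE; it now receives writes for its ranges.
		if err := rc.SetTokenState(tok, active); err != nil {
			return err
		}
	}
	return nil
}
```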
This commit modifies the ingesters to be aware of the shard token used by
the distributors to send traffic to ingesters. This is a requirement for
incremental transfers, where the shard token is used to determine which
memory series need to be moved.

This assumes that all distributors are using the same sharding mechanism
and always use the same token for a specific series. If the memory
series is appended to with a different token from the one it was created
with, a warning will be logged and the new token will be used.

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
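A minimal, self-contained sketch of the behaviour described above; the type and field names here are invented for illustration and are not the ingester's actual memory-series code.

```go
// Illustrative only: the real series types and logging differ.
package main

import "log"

type memorySeries struct {
	shardToken uint32
	samples    []float64
}

// append records a sample and tracks the shard token the distributor used.
// If a later append arrives with a different token, warn and adopt the new one.
func (s *memorySeries) append(token uint32, value float64) {
	if len(s.samples) > 0 && s.shardToken != token {
		log.Printf("series appended with a different shard token: old=%d new=%d", s.shardToken, token)
	}
	s.shardToken = token
	s.samples = append(s.samples, value)
}

func main() {
	s := &memorySeries{}
	s.append(1234, 1.0)
	s.append(1234, 2.0)
	s.append(9999, 3.0) // logs a warning and switches the stored token
}
```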

pstibrany commented Nov 5, 2019

My understanding is that queriers ask all ingesters, so any ingester with data will reply.

rfratto commented Nov 5, 2019

My understanding is that queriers ask all ingesters, so any ingester with data will reply.

I may be slightly wrong, but I think queriers ask all ingesters and stop once they receive responses from a quorum of those ingesters. If none of the quorum had any data, then the query results would show no data. Again, unsure, but I believe this is how it works.
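A tiny sketch of the quorum arithmetic behind this concern, assuming the usual majority rule of replicationFactor/2 + 1; the querier's exact stopping behaviour may differ from this simplification.

```go
// Simplified illustration of read quorum; not the querier's actual code.
package main

import "fmt"

func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	// With a replication factor of 3 a querier could stop after 2 responses.
	// If neither of those two ingesters received the un-transferred replica,
	// the merged result is missing data even though a third ingester has it.
	fmt.Println(quorum(3)) // 2
}
```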

rfratto added 2 commits Nov 13, 2019
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
Desc.search has been renamed to Desc.findToken. This distinguishes it
from Ring.search, which searches for the token owning a key rather than
for a specific token.

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from a639978 to 92cd4ec Nov 14, 2019
@tomwilkie tomwilkie added this to Review in progress in TSDB Support Nov 15, 2019
@tomwilkie tomwilkie removed this from Review in progress in TSDB Support Nov 15, 2019

pstibrany left a comment

I like the improvements since the last review: the refactoring of ringIterator, which simplifies Successor/Predecessors a lot, and the improved examples in TestPredecessors / TestSuccessor -- they are very helpful for understanding what these methods actually do.

(I'm not done with review and will continue next week. I think I'm starting to understand more and more of this now, which is good. If we could get rid of stateful tokens as well, that would be perfect... not sure who to convince about that :))

Review comments on: pkg/ingester/ingester.go, pkg/ring/token_checker.go, pkg/ingester/incremental_transfer.go, docs/arguments.md, pkg/ring/model.go, pkg/ring/replication_strategy.go, pkg/ring/ring.proto
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from 6ad2e01 to fc9c7e2 Nov 15, 2019
Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from fc9c7e2 to 801dabf Nov 15, 2019
Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
@jtlisi jtlisi self-assigned this Nov 18, 2019
rfratto added 2 commits Nov 19, 2019
This commit reverts the addition of stateful tokens. Everything works in
my local environment, and all the tests pass, but there may be more
opportunities to remove code that stateful tokens added. Work for that
will be done in a separate commit.

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from 79166d7 to 49930c5 Nov 20, 2019

pstibrany left a comment

I think this is getting into a pretty good shape, great work Robert! I will give it one more (final?) review after feedback is addressed. I still cannot quite figure out findTransferWorkloadForToken (see comments), but I can now follow most of the code.

If we want to remove extra lines from this PR, there is still room for that by 1) not changing what doesn't need changing (e.g. formatting in the ring.proto file), 2) not introducing extra structs to pass three or four parameters, which then take multiple lines to initialize at call sites, and 3) using empty responses in gRPC calls (there is a lot of generated code for custom empty structs).

Review comments on: pkg/ring/incremental_transfer.go, pkg/ring/lifecycler.go, pkg/ring/util.go, pkg/ring/model.go
Signed-off-by: Robert Fratto <robert.fratto@grafana.com>

pstibrany left a comment

I have a few small comments, and one bigger one... I think the ring structure and code should be split into two types: one that is stored in the KV store, and another that is used at runtime. I realize it's quite late in the review to be asking for such a change, but this PR seems like a good opportunity to do it, because it adds many new methods for working with the "runtime" ring that operate on the Tokens field we want to get rid of.

Review comments on: pkg/ring/model.go, pkg/ring/incremental_transfer.go, pkg/ring/token_checker.go
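As a hypothetical illustration of the split suggested above: Desc would stay the protobuf-backed structure written to the KV store, while a separate runtime view (called TokenNavigator here, matching the later commits) is derived from it for token lookups. All type and field names in this sketch are assumptions, not the PR's final code.

```go
// Illustrative only; the real Desc is generated from ring.proto.
package sketch

import "sort"

type IngesterDesc struct {
	Addr   string
	Tokens []uint32
}

// Desc is the structure persisted in the KV store.
type Desc struct {
	Ingesters map[string]IngesterDesc
}

type TokenInfo struct {
	Token    uint32
	Ingester string
}

// TokenNavigator is a flat, sorted, runtime-only view of every token in the
// ring, built from Desc and never written back to the KV store.
type TokenNavigator []TokenInfo

func NewTokenNavigator(d Desc) TokenNavigator {
	var nav TokenNavigator
	for name, ing := range d.Ingesters {
		for _, tok := range ing.Tokens {
			nav = append(nav, TokenInfo{Token: tok, Ingester: name})
		}
	}
	sort.Slice(nav, func(i, j int) bool { return nav[i].Token < nav[j].Token })
	return nav
}
```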
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from f03cd20 to c15b37f Dec 6, 2019
Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from c15b37f to 5512b36 Dec 6, 2019

pstibrany commented Dec 6, 2019

Great, thanks for the latest batch of changes. Would it also be possible to

  • make migrateRing return a TokenNavigator
  • not store the new value to d.Tokens in places like (*Lifecycler).updateLastRing and (*Ring).loop (only two places), and use the new TokenNavigator only?

I'm still concerned about calling SetIngesterTokens with the normalise = false parameter, and would like to get rid of those calls. Ideally, such a method would only exist on TokenNavigator instances. The goal is to remove Desc.Tokens completely, so adding new places where we rely on it is not good.


rfratto commented Dec 6, 2019

I can add an equivalent of SetIngesterTokens into the TokenNavigator, but we still need the function to exist in ring.Desc since the ingesters need to incrementally add their tokens over time.

I'll work on using TokenNavigator in more places; the challenge right now is passing around both the navigator and the ring desc, since we need the ring desc for the ingesters' addresses.


pstibrany commented Dec 6, 2019

I can add an equivalent of SetIngesterTokens into the TokenNavigator, but we still need the function to exist in ring.Desc since the ingesters need to incrementally add their tokens over time.

That's fine, but in ring.Desc it would respect the normalisation flag. It doesn't need to (and cannot) do that in TokenNavigator.

@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch 2 times, most recently from 3ac5330 to 1755968 Dec 6, 2019
migrateRing has been renamed to NewTokenNavigator. References to
migrateRing that updated the token list of ring.Desc have been changed
to leave the ring unmodified and store a separate reference to the
TokenNavigator.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
@rfratto rfratto force-pushed the rfratto:incremental-chunk-transfers branch from 1755968 to 1775fc2 Dec 6, 2019

pstibrany left a comment

Robert, thank you VERY MUCH for your herculean effort on this PR. I've still left a few minor comments, but overall I must say that I am pretty happy with the end result!

Review comments on: pkg/ring/token_navigator.go, pkg/ring/model.go
Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
return nil
}

return i.lastRing.Clone()

pstibrany commented Dec 7, 2019 (inline comment on the code excerpt above)

Do we still need to have the Clone method? I believe we no longer modify the returned instance from here?
