Get latest counter before attempting a take to ensure take succeeds #886

Merged
merged 3 commits into from
Jan 4, 2022

Conversation

@zengyu714 zengyu714 commented Dec 23, 2021

Issue #, if available:

Description of changes:

  • Besides committing PR #765 ("Get latest counter before attempting a take to ensure take succeeds"), raised by @Renjuju, this PR also addresses the comments left by @ashwing.

Previous Changes
ashwing#94
#765

Verification

  1. When a delay is artificially introduced while fetching leases, the KCL instance attempts to refresh the lease and succeeds in stealing it
  2. The isMarkedForSteal property is updated for each lease in each lease renewal cycle

See logs here

2022-01-03 15:02:37,855 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - [Functional Test] Check isMarkedForLeaseSteal before assign true values: [false], which is shouldn't be true 
2022-01-03 15:02:37,856 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - [Functional Test] leases with isMarkedForLeaseSteal==true should be different with last cycle: [true] 
2022-01-03 15:02:37,856 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - [Functional Test] leaseKeysToStealFromLastCycleSet: [shardId-000000000014] and leaseKeysToStealFromCurrentCycleSet: [shardId-000000000010] 
2022-01-03 15:02:37,856 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - Worker mac-1 needed 9 leases but none were expired, so it will steal lease shardId-000000000010 from mac-2 
2022-01-03 15:02:37,856 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - Worker mac-1 saw 20 total leases, 0 available leases, 2 workers. Target is 10 leases, I have 1 leases, I will take 1 leases 
2022-01-03 15:02:37,856 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - [Functional Test] Updating stale leases 
2022-01-03 15:02:37,856 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - [Functional Test] Attempt to refresh the lease as expected 
2022-01-03 15:02:37,991 [LeaseCoordinator-0000] INFO  s.a.k.l.dynamodb.DynamoDBLeaseTaker [NONE] - Worker mac-1 successfully took 1 leases: shardId-000000000010 
2022-01-03 15:02:38,404 [KinesisTester-0000] INFO  s.a.kinesis.coordinator.Scheduler [NONE] - Created new shardConsumer for : ShardInfo(streamIdentifierSerOpt=Optional.empty, shardId=shardId-000000000010, concurrencyToken=cc7f5fd6-471b-4a2a-b59e-dc108f19118b, parentShardIds=[], checkpoint={SequenceNumber: 49625136119961813246747861896741176352940323142511886498,SubsequenceNumber: 0}) 
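
The numbers in the "saw 20 total leases ... I will take 1 leases" line can be reproduced with a few lines of arithmetic. The following is only a sketch of that arithmetic, not the taker's actual code: the class and method names are hypothetical, rounding the target up and capping steals at one lease per cycle are assumptions that happen to match the log above.

```java
// Hypothetical sketch; names are illustrative and not part of the KCL API.
final class LeaseTargetSketch {

    /** Per-worker target: total leases divided by worker count, rounded up (assumption). */
    static int targetLeases(int totalLeases, int workerCount) {
        return (totalLeases + workerCount - 1) / workerCount;
    }

    /** Leases still needed to reach the target; never negative. */
    static int leasesNeeded(int target, int owned) {
        return Math.max(0, target - owned);
    }

    /**
     * Leases to take this cycle: use available (expired) leases first;
     * if there are none, steal at most maxStealPerCycle (assumed cap).
     */
    static int leasesToTake(int needed, int available, int maxStealPerCycle) {
        if (needed == 0) {
            return 0;
        }
        if (available > 0) {
            return Math.min(needed, available);
        }
        return Math.min(needed, maxStealPerCycle);
    }

    public static void main(String[] args) {
        int target = targetLeases(20, 2);      // 20 leases, 2 workers
        int needed = leasesNeeded(target, 1);  // mac-1 owns 1 lease
        int take = leasesToTake(needed, 0, 1); // nothing expired, so steal
        System.out.println("target=" + target + " needed=" + needed + " take=" + take);
    }
}
```

With the values from the log (20 leases, 2 workers, 1 owned, 0 available) this prints `target=10 needed=9 take=1`.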

For all changes made for testing, see here: https://gist.github.com/zengyu714/ef2b46b051a97b4e43184948f25d7a93/revisions?diff=split

  • Note the artificial delay is introduced by adding lines before here
                if (Objects.equals(workerIdentifier, "mac-1")){
                    updateAllLeasesTotalTimeMillis += leaseRenewalIntervalMillis;
                }

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Renju Radhakrishnan and others added 2 commits December 23, 2021 13:37
if (lease.isMarkedForLeaseSteal()) {
    try {
        return leaseRefresher.getLease(lease.leaseKey());
    } catch (DependencyException | InvalidStateException | ProvisionedThroughputException e) {
Contributor

#765 (comment)

This needs to be addressed.

The log line doesn't really make sense. You can keep the original log line as is, but explain why we would run into these exceptions in the first place.

Contributor Author

I think "while refreshing the stale lease" explains why we get this exception, following the suggestion from Ashwing ("Like while we tried to update the stale leases") in #765 (comment). And the phrase "defaulting to existing lease" explains the current status. Do you want more context on the failure?

Contributor

"Unable to retrieve the current lease while refreshing the stale lease": this implies we know for a fact that the leases being refreshed are stale, when in fact they may be identical to what is in DDB. That is not accurate. You can add a code comment listing the different cases in which we run into these exceptions.

"Like while we tried to update the stale leases" <- I am not sure this is the right reason why we would run into such an exception. He might have confused it with a conditional update, whereas this is in fact just a DDB get request. I am guessing his intention is that we explain how we would run into DependencyException and InvalidStateException.

Contributor

@ashwing ashwing Dec 30, 2021

Here we are trying to update the local copy of the leases with the latest info in case the leases have gone stale. Note that we already fetched the leases from the DDB table just a while ago, but now we want to get their latest state in order to successfully steal the leases. Having a message like "Failed to fetch latest state of the lease that needs to be stolen" would help.

Contributor

Can you add the leaseKey to the message?

Contributor Author

leaseKey is added to the logline
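
To make the behavior discussed in this thread concrete, here is a minimal, self-contained sketch of "refresh the lease marked for steal, and fall back to the local copy if the read fails". It is not the code from this PR: `Lease` and `LeaseLookup` are hypothetical stand-ins for the real lease and refresher types, the log text is illustrative, and keeping the local copy when the lookup returns null is an assumption of this sketch. The motivation, going by the PR title, is that a take is more likely to succeed when it starts from the latest counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: Lease and LeaseLookup are stand-ins, not KCL types.
final class StaleLeaseRefreshSketch {

    /** Minimal stand-in for a lease: key, counter, and the steal mark. */
    static final class Lease {
        final String key;
        final long counter;
        final boolean markedForSteal;

        Lease(String key, long counter, boolean markedForSteal) {
            this.key = key;
            this.counter = counter;
            this.markedForSteal = markedForSteal;
        }
    }

    /** Stand-in for a plain table read; it may fail like any remote call. */
    interface LeaseLookup {
        Lease getLease(String key) throws Exception;
    }

    /** Re-reads leases marked for steal; on failure keeps the local copy. */
    static List<Lease> refreshMarkedLeases(List<Lease> leasesToTake, LeaseLookup lookup) {
        List<Lease> refreshed = new ArrayList<>();
        for (Lease local : leasesToTake) {
            Lease current = local;
            if (local.markedForSteal) {
                try {
                    Lease latest = lookup.getLease(local.key);
                    if (latest != null) { // assumption: null means "not found", keep local
                        current = latest;
                    }
                } catch (Exception e) {
                    System.err.println("Could not refresh lease " + local.key
                            + ", keeping the local copy: " + e.getMessage());
                }
            }
            refreshed.add(current);
        }
        return refreshed;
    }

    public static void main(String[] args) {
        LeaseLookup lookup = key -> {
            if (key.equals("shard-b")) {
                throw new IllegalStateException("table unavailable");
            }
            return new Lease(key, 7L, true); // the table has a newer counter
        };
        List<Lease> result = refreshMarkedLeases(
                Arrays.asList(new Lease("shard-a", 5L, true), new Lease("shard-b", 5L, true)),
                lookup);
        for (Lease lease : result) {
            System.out.println(lease.key + " counter=" + lease.counter);
        }
    }
}
```

In this run `shard-a` picks up the newer counter from the lookup, while `shard-b` keeps its local counter because the read failed.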

 * @param epsilonMillis Allow for some variance when calculating lease expirations
 */
public static long getRenewerTakerIntervalMillis(long leaseDurationMillis, long epsilonMillis) {
    return leaseDurationMillis / 3 - epsilonMillis;
Contributor

is this essentially veryOldLeaseDurationNanosMultiplier?

Contributor Author

yep, being extracted into a static function

Contributor

can you use veryOldLeaseDurationNanosMultiplier instead?

Contributor Author

It's not from my commit. Do you have a strong preference about it?
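
For reference, the arithmetic of the helper discussed in this thread can be checked in isolation. The wrapper class name below is hypothetical and the input values (a 10 second lease duration and a 25 ms epsilon) are illustrative assumptions; only the formula comes from the diff. Note that the division is integer division, so the result is truncated before epsilon is subtracted.

```java
// Hypothetical wrapper class; only the formula is taken from the diff above.
final class RenewerTakerIntervalSketch {

    /** One third of the lease duration, minus a small allowance for variance. */
    static long getRenewerTakerIntervalMillis(long leaseDurationMillis, long epsilonMillis) {
        return leaseDurationMillis / 3 - epsilonMillis;
    }

    public static void main(String[] args) {
        // 10000 / 3 = 3333 (integer division), 3333 - 25 = 3308
        System.out.println(getRenewerTakerIntervalMillis(10_000L, 25L));
    }
}
```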

@avahuang0429
Contributor

Also, can you check why Travis CI is complaining? It might be transient, but please double-check by rerunning.

@zengyu714
Contributor Author

Also, can you check why Travis CI is complaining? It might be transient, but please double-check by rerunning.

@avahuang0429 the new run succeeded: https://app.travis-ci.com/github/awslabs/amazon-kinesis-client/builds/243951865

try {
    return leaseRefresher.getLease(lease.leaseKey());
} catch (DependencyException | InvalidStateException | ProvisionedThroughputException e) {
    log.warn("Failed to fetch latest state of the lease {} that needs to be stolen, "
Contributor Author

added leaseKey in the logline

@ashwing
Contributor

ashwing commented Jan 3, 2022

Thanks I need to go over the PR one more time. Can you do a functional test for this change and update the CR?

Yu Zeng 10:33 AM
sure, just to confirm: does the functional test mean testing KCL locally?

Ashwin Giridharan 10:34 AM
yeah, basically you need to run two instances locally, with different worker identifiers, so that one can steal leases from the other

10:35
when you artificially introduce a delay while fetching leases, the worker should attempt to refresh the lease (as per this PR) and succeed in stealing the lease
10:36
Also, we should verify that new leases are created on every lease renewal, i.e. we should use isMarkedForSteal for only one cycle.

Yu Zeng 10:37 AM
got it, thanks!

@zengyu714
Contributor Author

zengyu714 commented Jan 4, 2022

@ashwing Verification section is added in the description with example logs: #886 (comment)

Contributor

@ashwing ashwing left a comment

lgtm

@zengyu714 zengyu714 merged commit a3e51d5 into awslabs:master Jan 4, 2022
@zengyu714 zengyu714 deleted the refresh-release branch January 4, 2022 18:34
3 participants