fix(eks): version update completes prematurely #7526

Merged
merged 10 commits into master from benisrae/eks-fix-version-update on Apr 23, 2020

@eladb eladb commented Apr 22, 2020

Commit Message

fix(eks): version update completes prematurely (#7526)

The UpdateClusterVersion operation takes a while to begin and until then, the cluster's status is still ACTIVE instead of UPDATING as expected. This causes the isComplete handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Modify how IsComplete is implemented for cluster version (and config) updates. Extract the update ID and use DescribeUpdate to monitor the status of the update. This also allows us to fix a latent bug and fail the update in case the version update failed.

The update ID is returned from OnEvent via a custom field called EksUpdateId and passed on to the subsequent IsComplete invocation. This was already supported by the custom resource provider framework but not documented or officially tested, so we've added that here as well (docs + test).
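
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `EksClient` interface and the names `onUpdateVersion`, `isUpdateComplete` and `toIsComplete` are hypothetical. The request and response shapes are assumed to mirror the EKS `UpdateClusterVersion` and `DescribeUpdate` operations, and treating `Cancelled` as a failure is also an assumption.

```typescript
// Update statuses, assumed to match those reported by the EKS DescribeUpdate API.
type UpdateStatus = 'InProgress' | 'Failed' | 'Cancelled' | 'Successful';

interface EksUpdate {
  id?: string;
  status?: UpdateStatus;
  errors?: { errorMessage?: string }[];
}

// Hypothetical client interface; a real handler would wrap the AWS SDK here.
interface EksClient {
  updateClusterVersion(req: { name: string; version: string }): Promise<{ update?: EksUpdate }>;
  describeUpdate(req: { name: string; updateId: string }): Promise<{ update?: EksUpdate }>;
}

// onEvent: start the update and return its ID in a custom field, so that the
// provider framework passes it on to the subsequent isComplete invocations.
async function onUpdateVersion(eks: EksClient, name: string, version: string) {
  const resp = await eks.updateClusterVersion({ name, version });
  if (!resp.update?.id) {
    throw new Error('UpdateClusterVersion did not return an update ID');
  }
  return { EksUpdateId: resp.update.id };
}

// Pure mapping from the update's status to an isComplete result.
function toIsComplete(update: EksUpdate): { IsComplete: boolean } {
  switch (update.status) {
    case 'InProgress':
      return { IsComplete: false };
    case 'Successful':
      return { IsComplete: true };
    case 'Failed':
    case 'Cancelled': {
      // Surface the failure instead of reporting the update as complete.
      const reasons = (update.errors ?? []).map(e => e.errorMessage ?? 'unknown error').join('; ');
      throw new Error(`cluster update ${update.status}: ${reasons}`);
    }
    default:
      throw new Error(`unexpected update status: ${String(update.status)}`);
  }
}

// isComplete: poll the specific update rather than the cluster's status.
async function isUpdateComplete(eks: EksClient, name: string, updateId: string) {
  const resp = await eks.describeUpdate({ name, updateId });
  if (!resp.update) {
    throw new Error(`DescribeUpdate returned no update for ${updateId}`);
  }
  return toIsComplete(resp.update);
}
```

Because the status belongs to one specific update, the check does not depend on when the cluster's own status flips from ACTIVE to UPDATING.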

TESTING: Added unit tests to verify the new type of update waiter and performed a manual upgrade test while examining the logs.

Fixes #7457

End Commit Message

  • Manual test

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

The `UpdateClusterVersion` operation takes a while to begin and until then, the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Add logic to the cluster version update `onEvent` method to wait up to 5 minutes until the cluster status is no longer `ACTIVE`, so that the subsequent `isComplete` query will be based on the version update operation itself.

Extended the timeout of `onEvent` to 15m to ensure it does not interrupt the operation.

TESTING: Updated unit tests to verify this retry behavior and performed a manual upgrade test while examining the logs.

Fixes #7457
@eladb eladb requested a review from a team April 22, 2020 20:58
@eladb eladb self-assigned this Apr 22, 2020
@eladb eladb added the pr/do-not-merge This PR should not be merged at this time. label Apr 22, 2020
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Apr 22, 2020
@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 97fe9d1
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@eladb eladb removed the pr/do-not-merge This PR should not be merged at this time. label Apr 23, 2020
@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 258fbe7
  • Result: FAILED
  • Build Logs (available for 30 days)

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 1ddfb78
  • Result: FAILED
  • Build Logs (available for 30 days)

Contributor

@rix0rrr rix0rrr left a comment

sleep(5) does not seem like the best way to fix a race condition.

According to the API documentation, you are supposed to use a token from the return value of the UpdateClusterVersion function and poll DescribeUpdate with that.

If that solution doesn't work for some reason, your rationale for this change should describe why that is.

@eladb
Contributor Author

eladb commented Apr 23, 2020

> sleep(5) does not seem like the best way to fix a race condition.

The sleep is not the fix for the race condition. It's basically a short backoff before querying the cluster's status again.

@eladb
Contributor Author

eladb commented Apr 23, 2020

> According to the API documentation, you are supposed to use a token from the return value of the UpdateClusterVersion function and poll DescribeUpdate with that.

I am looking at the cluster's status, which according to the documentation is expected to be UPDATING during the version update (and it is), in order to simplify the isComplete handler. It always waits for the cluster to become ACTIVE.

@rix0rrr
Contributor

rix0rrr commented Apr 23, 2020

> is to simplify the isComplete handler. It always waits for the cluster to become ACTIVE.

That is the comment I'm looking for (in the codebase): why are we deviating from the expected/recommended pattern to start and wait for a version update to complete?

The sleep-based pattern concerns me because we could miss a version update that starts and completes between two calls to DescribeCluster, while you're waiting for the transition from ACTIVE -> UPDATING. Then you never see the cluster updating, and it will be broken as well. I'd hate to trade one race condition for another, especially if there's a deterministic API available.

Now, granted... this may be unlikely because everything is probably slow as molasses. But what about a cluster without nodes in it? Won't that complete in a jiffy?
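
To make the concern concrete, here is a small simulation of the status-polling wait. It is only a sketch: the function names, the timings and the simplified two-state model are all made up for illustration, and nothing here calls EKS.

```typescript
type ClusterStatus = 'ACTIVE' | 'UPDATING';

// Simulated cluster: UPDATING only between updateStart (inclusive) and
// updateEnd (exclusive), measured in seconds; ACTIVE before and after.
function statusAt(t: number, updateStart: number, updateEnd: number): ClusterStatus {
  return t >= updateStart && t < updateEnd ? 'UPDATING' : 'ACTIVE';
}

// Poll the simulated status every pollIntervalSec seconds for up to maxWaitSec
// seconds and report whether a non-ACTIVE status was ever observed.
function observesUpdating(
  updateStart: number,
  updateEnd: number,
  pollIntervalSec: number,
  maxWaitSec: number,
): boolean {
  for (let t = 0; t <= maxWaitSec; t += pollIntervalSec) {
    if (statusAt(t, updateStart, updateEnd) !== 'ACTIVE') {
      return true;
    }
  }
  return false;
}

// Slow update (7s to 600s), polled every 5s: seen at t=10.
console.log(observesUpdating(7, 600, 5, 300)); // true

// Fast update (6s to 9s), polled every 5s: falls between the polls at t=5 and t=10.
console.log(observesUpdating(6, 9, 5, 300)); // false
```

In the second case the waiter would time out even though the update ran and finished, which is the gap that polling a specific update ID avoids.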

I'll ship it if you add the rationale to a comment in the codebase.

Elad Ben-Israel added 2 commits April 23, 2020 15:10
… isComplete

This was already supported, just add some docs and tests to make sure this continues to be supported.
@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 4f18dc3
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 074a7d2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 6463344
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Contributor

@rix0rrr rix0rrr left a comment

Thanks for humoring me <3

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 2de3eff
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 1be46b6
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@mergify
Contributor

mergify bot commented Apr 23, 2020

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: d6a0d69
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@mergify mergify bot merged commit 307c8b0 into master Apr 23, 2020
@mergify mergify bot deleted the benisrae/eks-fix-version-update branch April 23, 2020 18:24
eladb pushed a commit that referenced this pull request May 6, 2020
…ersion

Cluster version updates fail with `vendor response doesn't contain <ATTRIBUTE>` errors because, since #7526, the provider's `isComplete` response no longer includes the `Data` field with the resource attributes.

The fix is that once the update is complete, we simply delegate to `isActive` which queries the cluster and returns the attributes.

Fixes #7794
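
The delegation described in this commit message might look roughly like the sketch below. The names `ClusterInfo`, `IsCompleteResponse` and `isUpdateComplete`, and the specific attributes placed in `Data`, are assumptions for illustration and not the code from the follow-up commit.

```typescript
interface IsCompleteResponse {
  IsComplete: boolean;
  // Resource attributes returned to CloudFormation once the resource is ready.
  Data?: Record<string, string>;
}

// Hypothetical subset of what a DescribeCluster call would return.
interface ClusterInfo {
  status: string;
  arn: string;
  endpoint: string;
}

// Reports completion together with the attributes once the cluster is ACTIVE.
function isActive(cluster: ClusterInfo): IsCompleteResponse {
  if (cluster.status !== 'ACTIVE') {
    return { IsComplete: false };
  }
  return {
    IsComplete: true,
    Data: { Arn: cluster.arn, Endpoint: cluster.endpoint },
  };
}

// Once the update itself has succeeded, delegate to isActive so that the
// response carries the Data field instead of a bare { IsComplete: true }.
function isUpdateComplete(updateStatus: string, cluster: ClusterInfo): IsCompleteResponse {
  if (updateStatus === 'InProgress') {
    return { IsComplete: false };
  }
  if (updateStatus === 'Successful') {
    return isActive(cluster);
  }
  throw new Error(`cluster update did not succeed: ${updateStatus}`);
}
```

With this shape, a successful update on an ACTIVE cluster yields both `IsComplete: true` and the attributes that later lookups depend on.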
mergify bot pushed a commit that referenced this pull request May 6, 2020
…ersion (#7830)

Cluster version updates fail with `vendor response doesn't contain <ATTRIBUTE>` errors because, since #7526, the provider's `isComplete` response no longer includes the `Data` field with the resource attributes.

The fix is that once the update is complete, we simply delegate to `isActive` which queries the cluster and returns the attributes.

Fixes #7794
karupanerura pushed a commit to karupanerura/aws-cdk that referenced this pull request May 7, 2020
…ersion (aws#7830)

Cluster version updates fail with `vendor response doesn't contain <ATTRIBUTE>` errors because, since aws#7526, the provider's `isComplete` response no longer includes the `Data` field with the resource attributes.

The fix is that once the update is complete, we simply delegate to `isActive` which queries the cluster and returns the attributes.

Fixes aws#7794
Labels
contribution/core This is a PR that came from AWS.
Development

Successfully merging this pull request may close these issues.

[aws-eks] AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade
3 participants