C9DiskResize keeps failing #77

Closed
aws-ps-hobson opened this issue Feb 8, 2021 · 20 comments

Comments

@aws-ps-hobson

Failed to create resource. An error occurred (Unavailable) when calling the ModifyVolume operation (reached max retries: 4): The service is unavailable. Please try again shortly.

image

@awsimaya
Contributor

awsimaya commented Apr 6, 2021

Did this error get resolved at all?

@edwio

edwio commented Apr 27, 2021

Still a recurring problem. The Lambda times out after reaching the maximum number of retries. This will be different for each account.
thumbnail_image
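
The error reported here is an `Unavailable` response that gave up after four retries. A minimal sketch of retrying such a call with exponential backoff follows. The names `retry_with_backoff` and `TransientError` and all parameter values are hypothetical, chosen for illustration; this is not the code of the workshop's Lambda function.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error such as 'Unavailable'."""


def retry_with_backoff(call, attempts=6, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` tries are used up.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    """
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of tries: surface the last error
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting. Whether more retries would help in this case is an open question, since the underlying cause may not be transient.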

@rafaelpereyra
Contributor

Which region are you deploying the Cloud9 stack in? Can you check the CloudWatch logs for that Lambda function and post them?

@edwio

edwio commented May 14, 2021

eu-west-1. Here is the error in the log of the C9DiskResizeLambda function:

{
    "timestamp": "2021-04-26 19:54:45,192",
    "level": "DEBUG",
    "location": "crhelper.utils._send_response:19",
    "RequestType": "Create",
    "StackId": "arn:aws:cloudformation:eu-west-1:443682937418:stack/C9-Observability-Workshop/deae7fa0-a6c8-11eb-a1d8-0ad741ae72c5",
    "RequestId": "c6b84ba4-5c78-45f5-9b42-92c27540ab77",
    "LogicalResourceId": "C9DiskResize",
    "aws_request_id": "d373d7c2-aa10-4434-8e8a-d2b49d0e741e",
    "message": {
        "Status": "FAILED",
        "PhysicalResourceId": "C9-Observability-Workshop_C9DiskResize_NPBB7K5V",
        "StackId": "arn:aws:cloudformation:eu-west-1:443682937418:stack/C9-Observability-Workshop/deae7fa0-a6c8-11eb-a1d8-0ad741ae72c5",
        "RequestId": "c6b84ba4-5c78-45f5-9b42-92c27540ab77",
        "LogicalResourceId": "C9DiskResize",
        "Reason": "An error occurred (Unavailable) when calling the ModifyVolume operation (reached max retries: 4): The service is unavailable. Please try again shortly.",
        "Data": {}
    }
}
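
To pull the failure reason out of a log record like the one above, a minimal sketch could look like this. The record is abbreviated from the log in this comment, and `failure_reason` is a hypothetical helper name, not part of crhelper.

```python
import json

# Abbreviated copy of the crhelper log record posted above.
RECORD = """
{
    "level": "DEBUG",
    "location": "crhelper.utils._send_response:19",
    "RequestType": "Create",
    "LogicalResourceId": "C9DiskResize",
    "message": {
        "Status": "FAILED",
        "Reason": "An error occurred (Unavailable) when calling the ModifyVolume operation (reached max retries: 4): The service is unavailable. Please try again shortly."
    }
}
"""


def failure_reason(raw):
    """Return (resource, reason) if the record reports FAILED, else None."""
    record = json.loads(raw)
    message = record.get("message")
    if isinstance(message, dict) and message.get("Status") == "FAILED":
        return record.get("LogicalResourceId"), message.get("Reason")
    return None
```

Calling `failure_reason(RECORD)` returns the logical resource ID together with the `Reason` string, which is the part worth quoting when reporting the problem.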

Also, the manual option for deploying the lab without Cloud9 isn't working correctly. There seem to be some prerequisites, such as adding permissions to S3, and the envsetup.sh script fails because the following commands are not installed:

  • pip
  • npm
  • git

@rafaelpereyra
Contributor

Hello, looks like the C9 instance is not ready for the automation to execute.

Can you manually create a Cloud9 Instance and attach the Instance role to it via the AWS Console?

Regarding your second comment, the script is designed to run in Cloud9, where all those applications (pip, npm, git) are already installed. Are you running the script from your local machine?

@edwio

edwio commented May 14, 2021

I tried your suggestion and manually created the Cloud9 environment. Everything seemed to work until I ran the last command in the "Deploy the stack" section. Running 'cdk deploy Applications --require-approval never' fails with this error:

Received response status [FAILED] from custom resource. Message returned: Error: b'serviceaccount/petsite-sa created\nservice/service-petsite created\ndeployment.apps/petsite-deployment created\nError from server (InternalError): error when creating "/tmp/manifest.yaml": Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"\n' Logs: /aws/lambda/Applications-ApplicationsMyCluster-Handler886CB40B-P7J44HT23PXJ at invokeUserFunction (/var/task/framework.js:95:19) at process._tickCallback (internal/process/next_tick.js:68:7) (RequestId: 65c72463-1979-428a-aa9f-4dd4c328fb5e)

image

image

I have added the logs of /aws/lambda/Applications-ApplicationsMyCluster-Handler886CB40B-YWFJYW1ENTN3:

[ERROR] Exception: b'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": serviceaccounts "petsite-sa" already exists\nError from server (Invalid): error when creating "/tmp/manifest.yaml": Service "service-petsite" is invalid: spec.ports[0].nodePort: Invalid value: 30300: provided port is already allocated\nError from server (AlreadyExists): error when creating "/tmp/manifest.yaml": deployments.apps "petsite-deployment" already exists\nError from server (InternalError): error when creating "/tmp/manifest.yaml": Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"\n'
Traceback (most recent call last):
  File "/var/task/index.py", line 14, in handler
    return apply_handler(event, context)
  File "/var/task/apply/__init__.py", line 60, in apply_handler
    kubectl('create', manifest_file, *kubectl_opts)
  File "/var/task/apply/__init__.py", line 87, in kubectl
    raise Exception(output)
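
The exception above packs several kubectl errors into one byte string. A minimal sketch for grouping them by kind follows. `OUTPUT` is a shortened copy of the message from this log (the webhook line is cut after the webhook name), and `group_errors` is a hypothetical helper name.

```python
import re

# Shortened copy of the kubectl output from the Lambda log above.
OUTPUT = (
    'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": '
    'serviceaccounts "petsite-sa" already exists\n'
    'Error from server (Invalid): error when creating "/tmp/manifest.yaml": '
    'Service "service-petsite" is invalid: spec.ports[0].nodePort: Invalid value: 30300: '
    'provided port is already allocated\n'
    'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": '
    'deployments.apps "petsite-deployment" already exists\n'
    'Error from server (InternalError): error when creating "/tmp/manifest.yaml": '
    'Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws"\n'
)

# One match per line: the error kind in parentheses, then the rest of the line.
ERROR_LINE = re.compile(r"^Error from server \((\w+)\): (.*)$", re.MULTILINE)


def group_errors(output):
    """Map each kubectl error kind to the list of messages reported for it."""
    grouped = {}
    for kind, detail in ERROR_LINE.findall(output):
        grouped.setdefault(kind, []).append(detail)
    return grouped
```

Read this way, the `AlreadyExists` and `Invalid` entries look like leftovers from an earlier attempt that had already created those objects, while the `InternalError` about the webhook is the one that also appears in the first failure.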

@rafaelpereyra
Contributor

Can you check the pods running in your cluster? Looks like the Helm chart for the AWS Load Balancer controller was not deployed properly (webhook is not available).

@edwio

edwio commented May 15, 2021

It seems to be running fine on the ECS side:

Clusters:
image

Task Definitions:
image

But I don't see any pods running in EKS:
image

Am I missing something?

@rafaelpereyra
Contributor

By default, even if the role you're using is an admin of the account, you won't have enough permissions in Kubernetes RBAC to see that dashboard (hence the message).

We added some instructions for adding your role to RBAC so that you get access to the EKS console here.

You should, however, be able to list the pods using kubectl from the Cloud9 environment. Can you do that, please, and check whether the AWS Load Balancer controller is running?
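
As a rough sketch of that check, the helper below filters the table printed by `kubectl get pods -n kube-system` for the controller's pods. The function name `controller_pods`, the `aws-load-balancer` name fragment, and the sample output are assumptions made for illustration; the sample is not output from this cluster.

```python
def controller_pods(kubectl_output, name_fragment="aws-load-balancer"):
    """Return (pod_name, status) pairs for pods whose name contains the fragment.

    Expects the default table printed by `kubectl get pods -n kube-system`,
    whose columns are NAME, READY, STATUS, RESTARTS, AGE.
    """
    pods = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) >= 3 and name_fragment in columns[0]:
            pods.append((columns[0], columns[2]))
    return pods


# Hypothetical output, made up for illustration only.
SAMPLE = """
NAME                                          READY   STATUS             RESTARTS   AGE
aws-load-balancer-controller-abc123-xyz       0/1     ImagePullBackOff   0          5m
coredns-12345-abcde                           1/1     Running            0          20m
"""
```

Any status other than `Running` for the controller pod is a hint to follow up with `kubectl describe pod` on it.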

@edwio

edwio commented May 18, 2021

How do I find my value for CONSOLE_ROLE_ARN=<Enter your Role ARN>?

Regarding the EC2 load balancers, they are in the active state:

image

@rafaelpereyra
Contributor

That is the ARN of the role you use to connect to the AWS Console.
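
One way to find that value, assuming you sign in to the console by assuming a role: `aws sts get-caller-identity`, run from a shell that uses your console credentials (CloudShell, for example), prints an `Arn` of the form `arn:aws:sts::<account>:assumed-role/<role>/<session>`. The sketch below derives the IAM role ARN from it. `console_role_arn` is a hypothetical helper name, and the account ID and role name in the usage note are placeholders.

```python
import re

ASSUMED_ROLE = re.compile(
    r"^arn:(?P<partition>[\w-]+):sts::(?P<account>\d{12}):assumed-role/"
    r"(?P<role>[^/]+)/(?P<session>.+)$"
)


def console_role_arn(caller_arn):
    """Turn an STS assumed-role ARN into the matching IAM role ARN.

    Assumes the role has no IAM path; a role created with a path
    needs that path added back by hand.
    """
    match = ASSUMED_ROLE.match(caller_arn)
    if match is None:
        return caller_arn  # not an assumed-role ARN: leave it unchanged
    return "arn:{partition}:iam::{account}:role/{role}".format(**match.groupdict())
```

For example, `arn:aws:sts::123456789012:assumed-role/Admin/alice` becomes `arn:aws:iam::123456789012:role/Admin`.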

Load balancers are created by the CDK Services stack, but inside your EKS cluster there is a component deployed (the AWS Load Balancer controller) that is failing according to the log message you sent.

@edwio

edwio commented May 18, 2021

How can I fix that (AWS Load Balancer)?

Furthermore, what is the difference between envsetup.sh and envsetup_ee.sh? Which one do I need to run when using Cloud9 manually?

@rafaelpereyra
Contributor

We need to see why it is failing to help you with that.

For the bash script, just follow the instructions here:

https://observability.workshop.aws/en/installation/not_using_ee/_deploy_app.html#install-tools-and-clone-the-repository

The second script (_ee) is used for Event Engine.

@edwio

edwio commented May 19, 2021

Where can I see why it is failing?

@rafaelpereyra
Contributor

You'll see the logs in the EKS cluster using the kubectl tool.

Let's try a different approach. Can you tear down that environment completely and start from scratch? Please create just the C9 environment first from CloudShell, as explained here. Please copy and paste the whole list of commands:

curl -O https://raw.githubusercontent.com/aws-samples/one-observability-demo/main/cloud9-cfn.yaml

aws cloudformation create-stack --stack-name C9-Observability-Workshop --template-body file://cloud9-cfn.yaml --capabilities CAPABILITY_NAMED_IAM

aws cloudformation wait stack-create-complete --stack-name C9-Observability-Workshop

echo -e "Cloud9 Instance is Ready!!\n\n"

Are there any SCPs applied to your account, or is this a personal account? Does your role have admin access to the environment? (I'm looking for reasons why the C9 launch or resize would have failed in the first place.)
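
If the stack fails again, the events recorded by CloudFormation usually name the resource and the reason. A minimal sketch for filtering the JSON printed by `aws cloudformation describe-stack-events --stack-name C9-Observability-Workshop` follows. `failed_events` is a hypothetical helper name, and `SAMPLE` is a trimmed, made-up response (the `C9Role` entry is invented, and the reason text is shortened from the error at the top of this issue).

```python
import json


def failed_events(describe_stack_events_json):
    """List (logical id, status, reason) for every *_FAILED stack event.

    Input is the JSON printed by `aws cloudformation describe-stack-events`.
    """
    events = json.loads(describe_stack_events_json)["StackEvents"]
    return [
        (e["LogicalResourceId"], e["ResourceStatus"], e.get("ResourceStatusReason", ""))
        for e in events
        if e["ResourceStatus"].endswith("_FAILED")
    ]


# Hypothetical, trimmed response for illustration; real events carry more fields.
SAMPLE = json.dumps({
    "StackEvents": [
        {"LogicalResourceId": "C9DiskResize", "ResourceStatus": "CREATE_FAILED",
         "ResourceStatusReason": "Failed to create resource. An error occurred "
                                 "(Unavailable) when calling the ModifyVolume operation"},
        {"LogicalResourceId": "C9Role", "ResourceStatus": "CREATE_COMPLETE"},
    ]
})
```

An access-denied style reason in these events would point toward an SCP or missing permissions, while a service error such as `Unavailable` points elsewhere.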

@edwio

edwio commented May 19, 2021

Found the problem.
The pod running the AWS Load Balancer controller pulls its image from an ECR registry in the us-west-2 region, which is not accessible to us because of our organization's policy (eu- regions only).

Running kubectl describe pods against the other pods, I can see that all of their images are pulled from ECR in the eu- region.

Why is the AWS Load Balancer image pulled from a different region?

Also, is it possible to edit the YAML file to specify an ECR registry in the eu- region for the AWS Load Balancer?

@rafaelpereyra
Contributor

We're installing AWS Load Balancer in CDK using the project Helm chart here.

The default image is configured here.

The current policy in your organization prevents cross-region pulls, so I suggest changing the Helm chart default in your local CDK file (see the link in the first paragraph): set the value image.repository to an ECR image path that is allowed inside your organization.

The image is not available in eu-west-1, so you'll probably need to pull it from us-west-2 and push it into your own ECR.
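
The pull-and-push step can be sketched as follows. The helper names `split_image` and `mirror_commands` are hypothetical, and the target registry in the usage note is a placeholder account. The sketch only builds the docker commands; it assumes you are already logged in to both registries and that the target repository exists.

```python
def split_image(image):
    """Split 'registry/repo/path:tag' into (registry, repository path, tag)."""
    name, _, tag = image.rpartition(":")
    registry, _, repo_path = name.partition("/")
    return registry, repo_path, tag


def mirror_commands(source_image, target_registry):
    """Build the docker commands that copy an image into another registry."""
    _, repo_path, tag = split_image(source_image)
    target_image = f"{target_registry}/{repo_path}:{tag}"
    return target_image, [
        ["docker", "pull", source_image],
        ["docker", "tag", source_image, target_image],
        ["docker", "push", target_image],
    ]
```

For instance, mirroring `602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3` to the placeholder registry `111122223333.dkr.ecr.eu-west-1.amazonaws.com` keeps the repository path and tag and swaps only the registry. The part of the target image before the tag is what would go into `image.repository`.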

@edwio

edwio commented Jun 28, 2021

@rafaelpereyra I'm trying to change the image key specified in the aws-load-balancer deployment.

From:

602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3

To:

602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.1.3

I'm using the following command:

kubectl edit deployment servicesawsloadbalancercontroller2049c530-aws-load-balancer-con -n kube-system

I'm getting the following error: kubectl Edit cancelled, no changes made

@rafaelpereyra
Contributor

Hello,

Looks like an issue with the editor you're using, maybe not saving the changes. Please use this instead:

kubectl set image deployment/servicesawsloadbalancercontrollerXXXXXX-aws-load-balancer-con -n kube-system aws-load-balancer-controller=602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.2.1

@engrun

engrun commented Oct 18, 2021

Related to the original issue with C9DiskResize.

I stumbled upon this today.
The resolution, in my case, was to temporarily set default-ebs-encryption to false in the EC2 console.

(This was set to true on my account)
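
For anyone who prefers the CLI to the console, a minimal sketch of the same check is below. The helper names are hypothetical; the commands they refer to are `aws ec2 get-ebs-encryption-by-default`, `aws ec2 disable-ebs-encryption-by-default`, and `aws ec2 enable-ebs-encryption-by-default`. The setting is per region, so the region argument matters. The sketch only parses output and builds the command; it does not run anything.

```python
import json


def ebs_encryption_by_default(cli_output):
    """Read the flag from `aws ec2 get-ebs-encryption-by-default` JSON output."""
    return bool(json.loads(cli_output)["EbsEncryptionByDefault"])


def toggle_command(enable, region):
    """Build the AWS CLI command that turns the account-level default on or off."""
    action = "enable" if enable else "disable"
    return ["aws", "ec2", f"{action}-ebs-encryption-by-default", "--region", region]
```

If the flag is on, `toggle_command(False, "eu-west-1")` gives the command to switch it off before deploying the stack, and `toggle_command(True, "eu-west-1")` the one to restore it afterwards.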

5 participants