Uploading snapshot always gets stuck with no info #2667

Closed
grosser opened this issue Dec 16, 2022 · 6 comments
Labels
status/needs-info Further information is requested type/support User support related issues.

Comments

@grosser

grosser commented Dec 16, 2022

cargo make ami
[cargo-make] INFO - cargo make 0.36.3
[cargo-make] INFO - Build File: Makefile.toml
[cargo-make] INFO - Task: ami
[cargo-make] INFO - Profile: development
[cargo-make] INFO - Running Task: setup
[cargo-make] INFO - Running Task: setup-build
[cargo-make] INFO - Running Task: fetch-sources
[cargo-make] INFO - Running Task: publish-tools
[cargo-make] INFO - Running Task: ami
/Users/user/Code : decoded 2147483648 bytes
/Users/user/Code : decoded 1073741824 bytes
02:56:32 [INFO] Registering 'bottlerocket-aws-k8s-1.22-x86_64-v1.11.1-104f8e0f-dirty' in us-west-2
  Uploading snapshot  [=>                                                ] 90/4096 (5s)

and it just sits there ...
on multiple tries the number is always different 90/4096 or 62/4096

Platform I'm building on:

  • Mac M1
  • v1.11.1

What I expected to happen:
Upload to succeed or fail with an error message.

What actually happened:
Upload gets stuck.

How to reproduce the problem:
make ami
+

# Infra.toml
[aws]
regions = ["us-west-2"]
@grosser grosser added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Dec 16, 2022
@stmcginnis
Contributor

stmcginnis commented Dec 16, 2022

Hey @grosser - thanks for reporting this. I'm unable to reproduce the failure locally, so I think we'll have to try to see what might be different in your environment that could be causing this error.

Based on the output, it looks like this is sitting somewhere in the coldsnap tool. It looks like it must be in this loop:

https://github.com/awslabs/coldsnap/blob/develop/src/upload.rs#L171-L188

That's looping through each block of the snapshot and uploading to EBS.

It hasn't outright failed (yet), so I think it's in that retry back-off. If I'm reading that right, if it fails uploading a block it will retry up to 11 times, backing off (attempt * 2) each time. The call to upload_block() probably has its own internal timeout as well, so that could increase the amount of time it takes.
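To put a rough number on that back-off, here is a minimal sketch of the arithmetic. It assumes attempts are numbered 1 through 11 and that the delay after each failed attempt is attempt * 2 seconds; neither the unit nor the numbering is confirmed against the coldsnap source, and the function name is hypothetical.

```python
# Assumptions (not confirmed against the coldsnap source): attempts are
# numbered 1..11 and each failed attempt is followed by attempt * 2 seconds
# of sleep. Request timeouts inside the upload call are not counted.
MAX_ATTEMPTS = 11
BACKOFF_FACTOR_SECS = 2


def total_backoff_secs(max_attempts=MAX_ATTEMPTS, factor=BACKOFF_FACTOR_SECS):
    """Sum the sleep time for one block if every attempt fails."""
    return sum(attempt * factor for attempt in range(1, max_attempts + 1))


worst_case = total_backoff_secs()
print(f"worst-case sleep per block: {worst_case}s")
```

Under these assumptions the sleep alone adds up to 132 seconds for a single block that never succeeds, before counting any per-request timeout.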

It may be interesting to let it sit in this state for some time and see if it ever does completely time out and give any kind of useful error message.

Another option would be to run the coldsnap command directly. Unfortunately it doesn't look like there is anything like a --verbose arg that might tell us more, but it would at least isolate things.

You could also try to manually upload the img directly with EBS and see if that provides any additional information. It's possible that could fail too, and if so, provide a better message of what is happening that is causing this problem.

@stmcginnis stmcginnis added status/needs-info Further information is requested type/support User support related issues. and removed type/bug Something isn't working status/needs-triage Pending triage or re-evaluation labels Dec 16, 2022
@grosser
Author

grosser commented Dec 17, 2022

I'm not sure what the total time math comes out to, but some kind of feedback after >2-5min "Warning: retrying failed connection" would be nice :)
I let it run for >1.5h and nothing happened.
I'm not sure how to use coldsnap since I don't know what the path to the image is that it's trying to upload (a --verbose option would be helpful there too).
... I could try uploading a random file with coldsnap, is that what you are suggesting ?

@grosser
Author

grosser commented Dec 17, 2022

Ok found it ... permission error, so the bug is that it should not retry on these

Failed to put block 1551 for snapshot 'snap-0f48e9c316f6fa504': TransientError: connection closed before message completed
Failed to put block 1552 for snapshot 'snap-0f48e9c316f6fa504': AccessDeniedException: User: arn:aws:sts::589470546123:assumed-role/compute-arf/foo@bar.com is not authorized to perform: ebs:PutSnapshotBlock on resource: arn:aws:ec2:us-west-2::snapshot/snap-0f48e9c316f6fa504 because no identity-based policy allows the ebs:PutSnapshotBlock action
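The behaviour being asked for here can be sketched as a retry loop that gives up immediately on errors that will never succeed on retry. This is an illustration only, not coldsnap's code: `UploadError`, `NON_RETRYABLE`, and `upload_with_retry` are hypothetical names, and the error codes are simply the two that appear in the log above.

```python
import time

# Hypothetical classification based on the two error names in the log above.
# A real client would inspect the SDK's structured error type instead.
NON_RETRYABLE = {"AccessDeniedException"}


class UploadError(Exception):
    """Hypothetical error carrying the service's error code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def upload_with_retry(put_block, max_attempts=11, backoff_secs=2, sleep=time.sleep):
    """Call put_block() until it succeeds, retrying only retryable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return put_block()
        except UploadError as err:
            # Permission errors and the final attempt are surfaced right away.
            if err.code in NON_RETRYABLE or attempt == max_attempts:
                raise
            print(f"Warning: attempt {attempt} failed ({err}); retrying")
            sleep(attempt * backoff_secs)
```

With this shape, an `AccessDeniedException` reaches the user on the first failed block, while a `TransientError` still gets the back-off and prints a warning on each retry.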

@grosser
Author

grosser commented Dec 17, 2022

... and it should also not retry for 2 hours 🤦

@grosser
Author

grosser commented Dec 17, 2022

some documentation for "here are all the aws permissions you need" would be nice too
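As a starting point, the snapshot upload part could be covered by an identity-based policy along these lines. This is a partial sketch, not an official list: `ebs:PutSnapshotBlock` comes from the error above, `ebs:StartSnapshot` and `ebs:CompleteSnapshot` are assumed to be needed because they begin and finish an EBS direct API upload, and registering the AMI afterwards needs further EC2 permissions that are not shown.

```python
import json

# Assumption: only the EBS direct API write actions are listed. The
# ebs:PutSnapshotBlock action is taken from the error message above; the
# other two are the calls that start and complete a snapshot upload.
# Permissions for registering the AMI itself are NOT covered here.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ebs:StartSnapshot",
                "ebs:PutSnapshotBlock",
                "ebs:CompleteSnapshot",
            ],
            # Wildcard snapshot ARN; narrow the region if that suits you.
            "Resource": "arn:aws:ec2:*::snapshot/*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into an IAM policy editor and extended with whatever else the build reports as denied.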

@stmcginnis
Contributor

I've filed an issue in coldsnap with these details: awslabs/coldsnap#216

I'm only familiar with that code as much as it took to track down where it was failing, but I'll see if I can dig in there and make some of those changes.

I'm going to close this issue, as the fix will need to be in the coldsnap tool and there's nothing in bottlerocket we can change to work around it.
