Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduced timeout and log output for IMDS timeout #192

Merged
merged 2 commits into from
Aug 10, 2020

Conversation

willarmiros
Copy link
Contributor

Description of changes:
Currently, EC2Plugin.isEnabled calls the IMDS endpoint to determine whether it is operating on EC2. This is fine when we are operating on EC2, since from my tests it takes <3 milliseconds to reach IMDS on an instance.

However, this is problematic as we move toward an opt-out plugin model. In this model, we check isEnabled for each plugin regardless of environment, so both failing and passing should be fast. Currently EC2Plugin.isEnabled() takes 4 seconds to fail, because the timeout for the IMDS endpoint check is 2 seconds and it checks IMDSv1 and v2 before failing. This is far too long for non-EC2 customers using opt-out plugins.

This PR lowers the timeout quite a bit and suppresses logging if we hit it to be friendlier to non-EC2, opt-out plugin customers. Given how fast successful IMDS requests are, the lower timeout should not impact the plugin's detection ability. To be even faster, we could avoid checking the second check to IMDSv1 when the v2 request times out, but that would require a bit of weird throwing logic.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -46,7 +47,7 @@
AMI_ID,
}

private static final int TIMEOUT_MILLIS = 2000;
private static final int TIMEOUT_MILLIS = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about keeping a higher (maybe not this high) read timeout and only lowering connect?

Copy link

@awssandra awssandra Aug 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering wether there's a better detection mechanism for EC2, before we actually start up the plug-in... similar to how we detect Lambda environments from env vars. Failing fast is good too though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling it's possible to speed up the positive case but we may be stuck on the negative case. EC2 provides low level customization of everything so env cars, file system can't be relied. I found a post where the hostname will end with an Amazon suffix by default but that's only if it's not overridden...

@willarmiros Out of curiosity what JVM did you test the timeout on? Java 11+ have a completely revamped network stack and I wonder if this may have improved. I think most tools fail fast when an IP is completely unreachable, which should be what's happening with metadata service which uses a link local, not public, IP address.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awssandra as Rag said, can't rely on much other than IMDS. There's no single file to check for that exists on all instances on Windows & Linux. Hostname is overridden in VPCs which many EC2 Customers use.

@anuraaga I tested the timeout using Java 8. I can change it to have say a 100 ms Connect timeout and 1 second read timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most tools fail fast when an IP is completely unreachable

Do you mean they throw an exception before even timing out? Would this reliably happen in Java 8 & 11?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I just meant that newer network stack might through an exception right away regardless of the timeout value. Not a big deal mostly just curious :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants