Skip to content

Conversation

@thesquelched
Copy link

Add ssl.SSLError to the list of exceptions that libcloud.utils.misc.retry catches, since that SSLError can be a transient error. Also:

  • Clean up retry a bit, fixing an issue in which keyword arguments defaults are not applied
  • Enable retries in tests that are designed to test retries (previous, some tests passed trivially)

See: https://issues.apache.org/jira/browse/LIBCLOUD-728

@Kami
Copy link
Member

Kami commented Jul 30, 2015

Thanks.

Can you please provide more context on this change? Why would we want to retry on an SSL exception?

SSL exception usually indicates some deeper issue which shouldn't simply be ignored and retried.

As far as the tests go, yeah, they are currently not really testing retries (see #515 (comment)). /cc @cryptickp

@thesquelched
Copy link
Author

I see this exception somewhat frequently when using libcloud over HTTPS (I think against Rackspace DNS): ssl.SSLError: The read operation timed out. Normally, I'd expect a timeout to raise socket.timeout, but apparently this is not the case. Yes, this may also catch non-transient SSLError's, but it's the only way to get retries to work for timeouts over SSL.

@Kami
Copy link
Member

Kami commented Jul 30, 2015

I see, thanks for the clarification.

We should dig in and see from which exception it inherits. We definitely shouldn't catch SSLError directly since this exception is also thrown on invalid certificate and other certificate errors. Masking and ignoring this error is bad and open to security issues.

If we can't find a good base exceptions which doesn't have negative security implications we could allow user to pass a callable to the retry method. This callable could then match the exception error message or similar.

@thesquelched
Copy link
Author

Last I checked, there isn't a particularly rich hierarchy of SSL exceptions; in fact, it looks as though the base exception is the one being thrown.

I don't see that it exacerbates security issues. We're not masking or ignoring the exception; just retrying the call until it times out. A bad cert will always keep throwing exceptions and will eventually time out, raising the original SSLError.

@erjohnso
Copy link
Contributor

Note that there are GCE users reporting this error also. We're still investigating, but from the user's perspective, HTTP response is not coming back and after 180 seconds (default timeout), the ssl.SSLError is being thrown.

See this thread: https://groups.google.com/forum/#!topic/gce-discussion/LSPnKn9a4zk

FWIW, Google errors are not being reported via any other clients other than libcloud. Still researching...

@cryptickp
Copy link

@Kami @thesquelched This test actually test retries, got tied up with other stuff but will fix incorrect test soon.

@Kami
Copy link
Member

Kami commented Aug 2, 2015

@thesquelched

Last I checked, there isn't a particularly rich hierarchy of SSL exceptions; in fact, it looks as though the base exception is the one being thrown.

I don't see that it exacerbates security issues. We're not masking or ignoring the exception; just retrying the call until it times out. A bad cert will always keep throwing exceptions and will eventually time out, raising the original SSLError.

Yeah, the "problem" is (or it might be) that on the first call an SSL exception is thrown (e.g. invalid cert or similar), but later on when we retry a different exception is thrown (e.g. timeout or connection refused).

The retry code right now on timeout re-throws the last exception (by design) which means it could potentially "mask" the original SSL exception.

This might not be an issue, but I'm always careful when I touch security related code and it's also possible that we are missing some other potentially dangerous edge cases or scenario.

@pquerna @alex It would be great if you guys can have a look and chime in as well.

@Kami
Copy link
Member

Kami commented Aug 2, 2015

@erjohnso

Note that there are GCE users reporting this error also. We're still investigating, but from the user's perspective, HTTP response is not coming back and after 180 seconds (default timeout), the ssl.SSLError is being thrown.

See this thread: https://groups.google.com/forum/#!topic/gce-discussion/LSPnKn9a4zk

FWIW, Google errors are not being reported via any other clients other than libcloud. Still researching...

That's rather weird, indeed.

We don't do anything much different than other libraries. I'm kinda curious which version of the OpenSSL library those people are using.

@alex
Copy link
Contributor

alex commented Aug 2, 2015

Is there no way to retry SSL.Error only for certain values? (e.g. read timeout gets a retry, invalid certificate does not)

@Kami
Copy link
Member

Kami commented Aug 2, 2015

@alex Yeah, that's what I suggested above (also checking the message in addition to the exception type).

@thesquelched
Copy link
Author

@Kami I added the SSLError parsing stuff; only read timeout gets a retry

@thesquelched
Copy link
Author

Update on this?

@jimbobhickville
Copy link

I saw this mentioned in irc today as not making the release. Afaict, @thesquelched addressed all the issues brought up. If there's additional feedback, please let him have it so we can get this fixed.

@allardhoeve
Copy link
Contributor

I think this is fine, although I don't really understand the testcase. That often means it is too complex (too much mocking). I'm used to pretty heavy mocking but it took me three times to get it. Maybe some docs on how you mock the inner request but test the outer request?

@allardhoeve
Copy link
Contributor

Also, sorry about the delay

@allardhoeve
Copy link
Contributor

And finally, a big thank you for the effort. It can be disheartening when your work doesn't make it into a release, so. Thanks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this have a fallback to raise the original exception? Dunno how I missed that before.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does now

@jimbobhickville
Copy link

@Kami @allardhoeve Any chance of this getting merged before the next release? Since it's only enabled by configuration, it should be safe even if it's not 100% perfect (although I believe @thesquelched has addressed all the issues now).

@allardhoeve
Copy link
Contributor

seems fine to me

@asfgit asfgit closed this in 5733de3 Jan 10, 2016
asfgit pushed a commit that referenced this pull request Jan 10, 2016
@Kami
Copy link
Member

Kami commented Jan 10, 2016

Made some minor style changes and merged it intro trunk.

Sorry for the delay and thanks!

@Kami
Copy link
Member

Kami commented Jan 14, 2016

It looks like this issue could be related to the same underlying root cause which is described here - https://libcloud.apache.org/blog/2016/01/14/notice-for-linode-users.html

As it only happens sometimes it could mean that one of the load-balancers the request is routed to only supports TLS >= 1.1 (or similar) and that's why the exception can't consistently be reproduced.

If that's indeed the case we should revert this change - I was already very skeptical about this change from the beginning since it looks like it's just masking the symptoms and not addressing the actual root cause.

/cc @thesquelched @jimbobhickville @erjohnso

@jimbobhickville
Copy link

That's getting socket.error, which the existing retry behavior would catch already, if the user opted-in for automatic retries. We only see the SSLError that is accounted for here when the network or API service is being flaky. It's usually accompanied by other network-related errors on other requests around the same time. I understand your concerns, I really do, but the worst-case scenario here is you get a slight delay as it exhausts the retries or your application will continue to function while masking over an issue that is outside of your control. It's not like it's going to do something dangerous by re-submitting the request to a broken API. And it isn't the default behavior, you have to turn it on explicitly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants