New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LIBCLOUD-728] Add SSLError to retry decorator exceptions #556

Closed
wants to merge 1 commit into
base: trunk
from

Conversation

Projects
None yet
8 participants
@thesquelched

thesquelched commented Jul 30, 2015

Add ssl.SSLError to the list of exceptions that libcloud.utils.misc.retry catches, since that SSLError can be a transient error. Also:

  • Clean up retry a bit, fixing an issue in which keyword arguments defaults are not applied
  • Enable retries in tests that are designed to test retries (previous, some tests passed trivially)

See: https://issues.apache.org/jira/browse/LIBCLOUD-728

@Kami

This comment has been minimized.

Show comment
Hide comment
@Kami

Kami Jul 30, 2015

Member

Thanks.

Can you please provide more context on this change? Why would we want to retry on an SSL exception?

SSL exception usually indicates some deeper issue which shouldn't simply be ignored and retried.

As far as the tests go, yeah, they are currently not really testing retries (see #515 (comment)). /cc @cryptickp

Member

Kami commented Jul 30, 2015

Thanks.

Can you please provide more context on this change? Why would we want to retry on an SSL exception?

SSL exception usually indicates some deeper issue which shouldn't simply be ignored and retried.

As far as the tests go, yeah, they are currently not really testing retries (see #515 (comment)). /cc @cryptickp

@thesquelched

This comment has been minimized.

Show comment
Hide comment
@thesquelched

thesquelched Jul 30, 2015

I see this exception somewhat frequently when using libcloud over HTTPS (I think against Rackspace DNS): ssl.SSLError: The read operation timed out. Normally, I'd expect a timeout to raise socket.timeout, but apparently this is not the case. Yes, this may also catch non-transient SSLError's, but it's the only way to get retries to work for timeouts over SSL.

thesquelched commented Jul 30, 2015

I see this exception somewhat frequently when using libcloud over HTTPS (I think against Rackspace DNS): ssl.SSLError: The read operation timed out. Normally, I'd expect a timeout to raise socket.timeout, but apparently this is not the case. Yes, this may also catch non-transient SSLError's, but it's the only way to get retries to work for timeouts over SSL.

@Kami

This comment has been minimized.

Show comment
Hide comment
@Kami

Kami Jul 30, 2015

Member

I see, thanks for the clarification.

We should dig in and see from which exception it inherits. We definitely shouldn't catch SSLError directly since this exception is also thrown on invalid certificate and other certificate errors. Masking and ignoring this error is bad and open to security issues.

If we can't find a good base exceptions which doesn't have negative security implications we could allow user to pass a callable to the retry method. This callable could then match the exception error message or similar.

Member

Kami commented Jul 30, 2015

I see, thanks for the clarification.

We should dig in and see from which exception it inherits. We definitely shouldn't catch SSLError directly since this exception is also thrown on invalid certificate and other certificate errors. Masking and ignoring this error is bad and open to security issues.

If we can't find a good base exceptions which doesn't have negative security implications we could allow user to pass a callable to the retry method. This callable could then match the exception error message or similar.

@thesquelched

This comment has been minimized.

Show comment
Hide comment
@thesquelched

thesquelched Jul 30, 2015

Last I checked, there isn't a particularly rich hierarchy of SSL exceptions; in fact, it looks as though the base exception is the one being thrown.

I don't see that it exacerbates security issues. We're not masking or ignoring the exception; just retrying the call until it times out. A bad cert will always keep throwing exceptions and will eventually time out, raising the original SSLError.

thesquelched commented Jul 30, 2015

Last I checked, there isn't a particularly rich hierarchy of SSL exceptions; in fact, it looks as though the base exception is the one being thrown.

I don't see that it exacerbates security issues. We're not masking or ignoring the exception; just retrying the call until it times out. A bad cert will always keep throwing exceptions and will eventually time out, raising the original SSLError.

@erjohnso

This comment has been minimized.

Show comment
Hide comment
@erjohnso

erjohnso Jul 30, 2015

Member

Note that there are GCE users reporting this error also. We're still investigating, but from the user's perspective, HTTP response is not coming back and after 180 seconds (default timeout), the ssl.SSLError is being thrown.

See this thread: https://groups.google.com/forum/#!topic/gce-discussion/LSPnKn9a4zk

FWIW, Google errors are not being reported via any other clients other than libcloud. Still researching...

Member

erjohnso commented Jul 30, 2015

Note that there are GCE users reporting this error also. We're still investigating, but from the user's perspective, HTTP response is not coming back and after 180 seconds (default timeout), the ssl.SSLError is being thrown.

See this thread: https://groups.google.com/forum/#!topic/gce-discussion/LSPnKn9a4zk

FWIW, Google errors are not being reported via any other clients other than libcloud. Still researching...

@cryptickp

This comment has been minimized.

Show comment
Hide comment
@cryptickp

cryptickp Jul 31, 2015

@Kami @thesquelched This test actually test retries, got tied up with other stuff but will fix incorrect test soon.

cryptickp commented Jul 31, 2015

@Kami @thesquelched This test actually test retries, got tied up with other stuff but will fix incorrect test soon.

@Kami

This comment has been minimized.

Show comment
Hide comment
@Kami

Kami Aug 2, 2015

Member

@thesquelched

Last I checked, there isn't a particularly rich hierarchy of SSL exceptions; in fact, it looks as though the base exception is the one being thrown.

I don't see that it exacerbates security issues. We're not masking or ignoring the exception; just retrying the call until it times out. A bad cert will always keep throwing exceptions and will eventually time out, raising the original SSLError.

Yeah, the "problem" is (or it might be) that on the first call an SSL exception is thrown (e.g. invalid cert or similar), but later on when we retry a different exception is thrown (e.g. timeout or connection refused).

The retry code right now on timeout re-throws the last exception (by design) which means it could potentially "mask" the original SSL exception.

This might not be an issue, but I'm always careful when I touch security related code and it's also possible that we are missing some other potentially dangerous edge cases or scenario.

@pquerna @alex It would be great if you guys can have a look and chime in as well.

Member

Kami commented Aug 2, 2015

@thesquelched

Last I checked, there isn't a particularly rich hierarchy of SSL exceptions; in fact, it looks as though the base exception is the one being thrown.

I don't see that it exacerbates security issues. We're not masking or ignoring the exception; just retrying the call until it times out. A bad cert will always keep throwing exceptions and will eventually time out, raising the original SSLError.

Yeah, the "problem" is (or it might be) that on the first call an SSL exception is thrown (e.g. invalid cert or similar), but later on when we retry a different exception is thrown (e.g. timeout or connection refused).

The retry code right now on timeout re-throws the last exception (by design) which means it could potentially "mask" the original SSL exception.

This might not be an issue, but I'm always careful when I touch security related code and it's also possible that we are missing some other potentially dangerous edge cases or scenario.

@pquerna @alex It would be great if you guys can have a look and chime in as well.

@Kami

This comment has been minimized.

Show comment
Hide comment
@Kami

Kami Aug 2, 2015

Member

@erjohnso

Note that there are GCE users reporting this error also. We're still investigating, but from the user's perspective, HTTP response is not coming back and after 180 seconds (default timeout), the ssl.SSLError is being thrown.

See this thread: https://groups.google.com/forum/#!topic/gce-discussion/LSPnKn9a4zk

FWIW, Google errors are not being reported via any other clients other than libcloud. Still researching...

That's rather weird, indeed.

We don't do anything much different than other libraries. I'm kinda curious which version of the OpenSSL library those people are using.

Member

Kami commented Aug 2, 2015

@erjohnso

Note that there are GCE users reporting this error also. We're still investigating, but from the user's perspective, HTTP response is not coming back and after 180 seconds (default timeout), the ssl.SSLError is being thrown.

See this thread: https://groups.google.com/forum/#!topic/gce-discussion/LSPnKn9a4zk

FWIW, Google errors are not being reported via any other clients other than libcloud. Still researching...

That's rather weird, indeed.

We don't do anything much different than other libraries. I'm kinda curious which version of the OpenSSL library those people are using.

@alex

This comment has been minimized.

Show comment
Hide comment
@alex

alex Aug 2, 2015

Contributor

Is there no way to retry SSL.Error only for certain values? (e.g. read timeout gets a retry, invalid certificate does not)

Contributor

alex commented Aug 2, 2015

Is there no way to retry SSL.Error only for certain values? (e.g. read timeout gets a retry, invalid certificate does not)

@Kami

This comment has been minimized.

Show comment
Hide comment
@Kami

Kami Aug 2, 2015

Member

@alex Yeah, that's what I suggested above (also checking the message in addition to the exception type).

Member

Kami commented Aug 2, 2015

@alex Yeah, that's what I suggested above (also checking the message in addition to the exception type).

@thesquelched

This comment has been minimized.

Show comment
Hide comment
@thesquelched

thesquelched Aug 26, 2015

@Kami I added the SSLError parsing stuff; only read timeout gets a retry

thesquelched commented Aug 26, 2015

@Kami I added the SSLError parsing stuff; only read timeout gets a retry

@thesquelched

This comment has been minimized.

Show comment
Hide comment
@thesquelched

thesquelched Oct 6, 2015

Update on this?

thesquelched commented Oct 6, 2015

Update on this?

@jimbobhickville

This comment has been minimized.

Show comment
Hide comment
@jimbobhickville

jimbobhickville Oct 14, 2015

Contributor

I saw this mentioned in irc today as not making the release. Afaict, @thesquelched addressed all the issues brought up. If there's additional feedback, please let him have it so we can get this fixed.

Contributor

jimbobhickville commented Oct 14, 2015

I saw this mentioned in irc today as not making the release. Afaict, @thesquelched addressed all the issues brought up. If there's additional feedback, please let him have it so we can get this fixed.

@allardhoeve

This comment has been minimized.

Show comment
Hide comment
@allardhoeve

allardhoeve Nov 23, 2015

Contributor

I think this is fine, although I don't really understand the testcase. That often means it is too complex (too much mocking). I'm used to pretty heavy mocking but it took me three times to get it. Maybe some docs on how you mock the inner request but test the outer request?

Contributor

allardhoeve commented Nov 23, 2015

I think this is fine, although I don't really understand the testcase. That often means it is too complex (too much mocking). I'm used to pretty heavy mocking but it took me three times to get it. Maybe some docs on how you mock the inner request but test the outer request?

@allardhoeve

This comment has been minimized.

Show comment
Hide comment
@allardhoeve

allardhoeve Nov 23, 2015

Contributor

Also, sorry about the delay

Contributor

allardhoeve commented Nov 23, 2015

Also, sorry about the delay

@allardhoeve

This comment has been minimized.

Show comment
Hide comment
@allardhoeve

allardhoeve Nov 23, 2015

Contributor

And finally, a big thank you for the effort. It can be disheartening when your work doesn't make it into a release, so. Thanks.

Contributor

allardhoeve commented Nov 23, 2015

And finally, a big thank you for the effort. It can be disheartening when your work doesn't make it into a release, so. Thanks.

exc = sys.exc_info()[1]
if TRANSIENT_SSL_ERROR in str(exc):
raise TransientSSLError(*exc.args)

This comment has been minimized.

@jimbobhickville

jimbobhickville Dec 7, 2015

Contributor

Shouldn't this have a fallback to raise the original exception? Dunno how I missed that before.

@jimbobhickville

jimbobhickville Dec 7, 2015

Contributor

Shouldn't this have a fallback to raise the original exception? Dunno how I missed that before.

This comment has been minimized.

@thesquelched
@thesquelched
@jimbobhickville

This comment has been minimized.

Show comment
Hide comment
@jimbobhickville

jimbobhickville Jan 7, 2016

Contributor

@Kami @allardhoeve Any chance of this getting merged before the next release? Since it's only enabled by configuration, it should be safe even if it's not 100% perfect (although I believe @thesquelched has addressed all the issues now).

Contributor

jimbobhickville commented Jan 7, 2016

@Kami @allardhoeve Any chance of this getting merged before the next release? Since it's only enabled by configuration, it should be safe even if it's not 100% perfect (although I believe @thesquelched has addressed all the issues now).

@allardhoeve

This comment has been minimized.

Show comment
Hide comment
@allardhoeve

allardhoeve Jan 9, 2016

Contributor

seems fine to me

Contributor

allardhoeve commented Jan 9, 2016

seems fine to me

@asfgit asfgit closed this in 5733de3 Jan 10, 2016

asfgit pushed a commit that referenced this pull request Jan 10, 2016

@Kami

This comment has been minimized.

Show comment
Hide comment
@Kami

Kami Jan 10, 2016

Member

Made some minor style changes and merged it intro trunk.

Sorry for the delay and thanks!

Member

Kami commented Jan 10, 2016

Made some minor style changes and merged it intro trunk.

Sorry for the delay and thanks!

@Kami

This comment has been minimized.

Show comment
Hide comment
@Kami

Kami Jan 14, 2016

Member

It looks like this issue could be related to the same underlying root cause which is described here - https://libcloud.apache.org/blog/2016/01/14/notice-for-linode-users.html

As it only happens sometimes it could mean that one of the load-balancers the request is routed to only supports TLS >= 1.1 (or similar) and that's why the exception can't consistently be reproduced.

If that's indeed the case we should revert this change - I was already very skeptical about this change from the beginning since it looks like it's just masking the symptoms and not addressing the actual root cause.

/cc @thesquelched @jimbobhickville @erjohnso

Member

Kami commented Jan 14, 2016

It looks like this issue could be related to the same underlying root cause which is described here - https://libcloud.apache.org/blog/2016/01/14/notice-for-linode-users.html

As it only happens sometimes it could mean that one of the load-balancers the request is routed to only supports TLS >= 1.1 (or similar) and that's why the exception can't consistently be reproduced.

If that's indeed the case we should revert this change - I was already very skeptical about this change from the beginning since it looks like it's just masking the symptoms and not addressing the actual root cause.

/cc @thesquelched @jimbobhickville @erjohnso

@jimbobhickville

This comment has been minimized.

Show comment
Hide comment
@jimbobhickville

jimbobhickville Jan 14, 2016

Contributor

That's getting socket.error, which the existing retry behavior would catch already, if the user opted-in for automatic retries. We only see the SSLError that is accounted for here when the network or API service is being flaky. It's usually accompanied by other network-related errors on other requests around the same time. I understand your concerns, I really do, but the worst-case scenario here is you get a slight delay as it exhausts the retries or your application will continue to function while masking over an issue that is outside of your control. It's not like it's going to do something dangerous by re-submitting the request to a broken API. And it isn't the default behavior, you have to turn it on explicitly.

Contributor

jimbobhickville commented Jan 14, 2016

That's getting socket.error, which the existing retry behavior would catch already, if the user opted-in for automatic retries. We only see the SSLError that is accounted for here when the network or API service is being flaky. It's usually accompanied by other network-related errors on other requests around the same time. I understand your concerns, I really do, but the worst-case scenario here is you get a slight delay as it exhausts the retries or your application will continue to function while masking over an issue that is outside of your control. It's not like it's going to do something dangerous by re-submitting the request to a broken API. And it isn't the default behavior, you have to turn it on explicitly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment