Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(transport): Add extra socket options to the connection pool to fi… #1323

Conversation

sliwinski-milosz
Copy link

…x 'Connection reset by peer' errors (#1198)

I had to add hasattr for the options as on some systems some socket options are not available.
eg on my MacBook and Python3.7:

(Pdb) socket.SO_KEEPALIVE
8
(Pdb) socket.TCP_KEEPIDLE
*** AttributeError: module 'socket' has no attribute 'TCP_KEEPIDLE'
(Pdb) socket.TCP_KEEPINTVL
257
(Pdb) socket.TCP_KEEPCNT
258

However those are available on Ubuntu20.04:

docker run -t -i --rm ubuntu:20.04 bash
apt-get update
...
apt-get install -y python3.8
...
python3.8
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.SO_KEEPALIVE
9
>>> socket.TCP_KEEPIDLE
4
>>> socket.TCP_KEEPINTVL
5
>>> socket.TCP_KEEPCNT
6

Let me know if you have any idea how we can test this. Like I have described in the issue, I don't know the easy steps to reproduce this issue locally.
However it happens everyday on the system that I run. I have tested it there and the changes from this PR fixed the issue.

@sl0thentr0py
Copy link
Member

Hi @sliwinski-milosz thx for the PR!
We are aware of this issue and it's a bit tricky because it's so low level/hard to repro as you said.
I'll let the test suite run for now.

@sl0thentr0py
Copy link
Member

I would say exposing socket_options on the transport class is better than hardcoding those magic values ourselves?

@antonpirker
Copy link
Member

Just to get this right:

Do I get this right?

@sl0thentr0py
Copy link
Member

sl0thentr0py commented Jan 27, 2022

Because of the missing options sometimes it fails to send events to sentry backend.

I would phrase it slightly differently. It is not because of missing options but because of specific scale behaviours/traffic distributions presumably that this fails. This is fairly non-trivial to repro because it is a low-level problem that only happens on certain systems of a certain size under a certain load distribution.

Those magical values might work for certain systems but I do not know why precisely those values are selected. It is much better to leave this as a choice for the end-user than making those precise choices ourselves. If these were sane defaults, I would expect an upstream library (urllib3) to have incorporated them themselves, in which case we do not need this fix at all.

@sliwinski-milosz
Copy link
Author

sliwinski-milosz commented Jan 27, 2022

@sl0thentr0py you have mentioned that you are aware of the issue. Does that mean you have seen it happening on your configurations? If yes, could you please provide more details?

So far I was able to reproduce it on ubuntu:18.04 (Docker). As ERROR logging is disabled by default at the moment I don't know if it fails the same way on other configurations.

I am not sure if at this stage we can assume that it affects only certain systems.
The urllib3 default values for pool worked fine for me for all other tools that I used so far. This makes me think that this issue is specific for Sentry. This is why I wouldn't expect urllib3 to have different default options - as maybe those don't apply to other cases.
Also, initially I raised the issue for the frontend hoping that something can be improved on their side.

In my opinion the next steps are:

  1. Update the code to allow passing those additional socket options through parameter (configuration).
    This is a safe update. It adds some flexibility and doesn't affect current users at all.
  2. Test extra options on multiple configuration where we face the issue.
  3. Try to find the configuration which is not affected by the issue (where we cannot reproduce it).

If it fails the same way on all systems and after testing it will turn out that the fix works properly - then I would use those extra options as default (still leaving the option to override them).

Lets check also if other sdk clients use the keep-alive option or not.

Short term I think we should:

  1. Fix Sentry is not logging ERROR logs by default #1191 so people are aware if their integration is reliable.
  2. Allow passing extra socket options through configuration.
  3. Update documentation mentioning about the issue and possible solution (extra socket options through configuration).

What do you think?

Could you please also help in fixing the failing ci / lint (pull_request) check? How can this be fixed?

@sl0thentr0py
Copy link
Member

Does that mean you have seen it happening on your configurations?

No I just meant the previous issues here.

Lets check also if other sdk clients use the keep-alive option or not.

I'm not aware of any low level TCP settings in any of the other SDKs.

Could you please also help in fixing the failing ci / lint (pull_request) check? How can this be fixed?

I don't think we'll be merging this PR in its current form. I think exposing socket_options is a better option here and I can make a PR next week or so with tests etc.

Tangentially, I'm still not convinced tweaking TCP settings is the right thing to do here. Network failures can happen due to whatever reason and that's not the abstraction layer the SDK should care about. What it should do is handle network failures gracefully/retry if possible etc. I will think more about this and see what to do here.

@sliwinski-milosz
Copy link
Author

I am closing this PR as there are other ideas to approach this issue as mentioned in the previous comment.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants