
Limit TCP writes w/Tornado to 2GB #6557

Merged · 10 commits · Jun 24, 2022
Conversation

@hhuuggoo (Contributor) commented Jun 10, 2022

Closes #6556

  • Tests added / passed
  • Passes pre-commit run --all-files

jakirkham and others added 8 commits July 28, 2021 15:49
Works around the same OpenSSL issue seen for reading except this does so
for writing. As individual frames may not be this large, this may be
less of an issue. Still this is a good preventative measure to protect
users.
All chunks will be of non-trivial size. If they are trivial, the loop
would have already ended.
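
In rough terms, the approach these commit messages describe looks like the sketch below. This is illustrative only, not the code merged into distributed/comm/tcp.py: the `C_INT_MAX` constant and `write_in_chunks` helper are assumed names, and `stream` is assumed to be a Tornado `IOStream`-like object.

```python
# Minimal sketch of the chunked-write idea (illustrative, not the merged code):
# cap each Tornado stream.write() call at roughly 2GB so older OpenSSL APIs
# never see a single buffer larger than a C int can describe.
C_INT_MAX = 2**31 - 1  # 2GB minus one byte


async def write_in_chunks(stream, frame) -> None:
    """Write ``frame`` to a Tornado IOStream in slices no larger than 2GB."""
    view = memoryview(frame)
    while len(view):
        chunk, view = view[:C_INT_MAX], view[C_INT_MAX:]
        # Every chunk here is non-trivial: if it were empty, the loop
        # condition above would already have ended the iteration.
        await stream.write(chunk)
```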
@GPUtester (Collaborator)

Can one of the admins verify this patch?

@jakirkham (Member) left a comment

Thanks Hugo! 🙏

Added some suggestions below to bring this in line with other parts of the code that have since been updated, along with a simplification we can make here.

4 resolved review comments on distributed/comm/tcp.py (outdated)
@jakirkham (Member)

ok to test

@jakirkham changed the title from "Send lt 2gb bufs" to "Limit TCP writes w/Tornado to 2GB" on Jun 10, 2022
@jakirkham (Member)

Also updated the title for clarity. Hope that is ok :)

@github-actions (bot) commented Jun 10, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files (±0)   15 suites (±0)   7h 33m 53s duration (+1h 8m 0s)
2,879 tests (+23): 2,763 passed (−9), 83 skipped (+2), 29 failed (+27), 4 errored (+3)
21,330 runs (+171): 20,328 passed (+117), 964 skipped (+19), 34 failed (+32), 4 errored (+3)

For more details on these failures and errors, see this check.

Results for commit 4950d33. ± Comparison against base commit b119eb9.

♻️ This comment has been updated with latest results.

@jakirkham mentioned this pull request on Jun 10, 2022
@mrocklin (Member)

Ideally there would be a regression test for this. It looks like you have something close to an MCVE in the original issue. If you grep through tests you'll likely find gen_tls_cluster which should help to set things up for you.

@hhuuggoo (Contributor, Author)

> Ideally there would be a regression test for this. It looks like you have something close to an MCVE in the original issue. If you grep through tests you'll likely find gen_tls_cluster which should help to set things up for you.

Sounds good. I'll see if I can get a regression test together that doesn't involve making a 2GB array.

@mrocklin (Member)

mrocklin commented Jun 10, 2022 via email

@jakirkham (Member)

Wouldn't be surprised if the only way to create the issue is sending a 2GB+ object

@jakirkham (Member)

Took the liberty of committing those suggestions. Hope that is ok

@hhuuggoo (Contributor, Author)

> Wouldn't be surprised if the only way to create the issue is sending a 2GB+ object

If that's acceptable for CI then sure. I was thinking of trying to monkeypatch something to detect the condition (a rough sketch of that idea follows below).

@jakirkham - I likely won't get to this until next week - don't feel bad if you want to pick this up instead - I'm going to be a lot slower than you.
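
The monkeypatch idea mentioned above might look roughly like the following. This is a hypothetical sketch, not a test from this PR: it assumes the 2GB cap lives in a patchable module-level constant (called `C_INT_MAX` here) in distributed/comm/tcp.py, and it uses the `gen_tls_cluster` helper mrocklin pointed to.

```python
# Hypothetical regression-test sketch (not part of this PR): shrink the chunk
# cap so the write-splitting path is exercised over TLS without allocating a
# real 2GB payload. Assumes a patchable module-level constant in tcp.py.
from unittest import mock

from distributed.utils_test import gen_tls_cluster


@gen_tls_cluster(client=True)
async def test_large_tls_write_is_chunked(c, s, a, b):
    import distributed.comm.tcp as tcp

    with mock.patch.object(tcp, "C_INT_MAX", 1024):  # pretend 2GB is 1 KiB
        payload = b"x" * 10_000  # now spans several "oversized" chunks
        assert await c.submit(len, payload) == len(payload)
```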

@jakirkham (Member)

Ok I couldn't tell if this was something pressing that needed to be in the upcoming release or not. As it sounds like it can wait, happy to wait :)

@hhuuggoo (Contributor, Author)

> Ok I couldn't tell if this was something pressing that needed to be in the upcoming release or not. As it sounds like it can wait, happy to wait :)

I don't know if this is a pressing issue for anyone; for us it's not. Most of our dask usage does not use SSL, and monkey-patching is an acceptable workaround in our case. But I do think anything that increases robustness for large objects is good (even though ideally people wouldn't send objects that large).

@jakirkham (Member) commented Jun 10, 2022

Yeah, previously this was primarily seen as an issue when receiving (#4538).

While I don't think we ever saw a good reproducer at the time, my guess is that individual frames on the write side were <2GB, but the combination of them was 2GB+, which caused an issue on the read side. We tried to fix write as well (via sharding), but it appears we still need this sliding-window solution.

This issue goes away with the newer APIs in OpenSSL 1.1.1+, which Python 3.10 uses. Python 3.10 is still pretty new, though, so there are plenty of cases where this issue can still show up. It will eventually diminish as older Python versions leave circulation.
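
For anyone curious whether their own environment is affected, a quick check along the lines of the comment above (a sketch, not part of this PR; the exact CPython/OpenSSL coupling is more nuanced than two version comparisons):

```python
# Quick local check: per the comment above, the overflow matters when the
# interpreter predates CPython 3.10's use of the newer OpenSSL 1.1.1+ APIs,
# so both conditions below must hold to be unaffected.
import ssl
import sys

uses_new_apis = sys.version_info >= (3, 10) and ssl.OPENSSL_VERSION_INFO >= (1, 1, 1)

print("Python:", sys.version.split()[0])
print("OpenSSL:", ssl.OPENSSL_VERSION)
print("2GB write limit still relevant here:", not uses_new_apis)
```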

@jsignell mentioned this pull request on Jun 20, 2022
@jakirkham (Member)

It is worth noting we don't seem to have a test for the read side currently. Given that, and the fact that this would involve largish buffers that might be painful on CI, I wonder if we should skip it. Thoughts?

@jsignell (Member)

I just pushed a change to make pre-commit happy. I have no opinion about the difficulty of adding testing.

@jakirkham marked this pull request as ready for review on June 22, 2022 19:36
@jakirkham (Member)

Thanks Julia! 🙏

@jakirkham (Member)

> It is worth noting we don't seem to have a test for the read side currently. Given that, and the fact that this would involve largish buffers that might be painful on CI, I wonder if we should skip it. Thoughts?

@mrocklin, any thoughts on this?

@jsignell (Member)

As long as you are on board @jakirkham, I will plan on merging this before the release tomorrow unless there is any more feedback.

@jakirkham (Member)

SGTM. Thanks Julia 🙏

@hhuuggoo (Contributor, Author)

@jakirkham I'm sorry I still haven't done anything on this - if it's easier feel free to just take this over. I don't want to block anything.

@jsignell merged commit f129485 into dask:main on Jun 24, 2022
@jakirkham (Member)

No worries, Hugo. The changes seem fine. I think we are skipping the test for now; we can always revisit that if needed.


Successfully merging this pull request may close these issues.

SSL Overflow Errors on distributed 2022.05
5 participants