Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test/run-rbd-unit-tests.sh: increase ip-local-port-range #48018

Closed
wants to merge 3 commits into from

Conversation

tchaikov
Copy link
Contributor

@tchaikov tchaikov commented Sep 8, 2022

Signed-off-by: Kefu Chai tchaikov@gmail.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 8, 2022

jenkins test make check

@github-actions
Copy link

github-actions bot commented Sep 8, 2022

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 9, 2022

https://jenkins.ceph.com/job/ceph-pull-requests/103655/

the output of nestat is way too large, so i pasted it as a gist: https://gist.github.com/tchaikov/b3858da3afcba00eb71394dffb5899d1

to understand the root cause better

See-also: https://tracker.ceph.com/issues/57116
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
in case we are running out of available ports while performing tests.

See-also: https://tracker.ceph.com/issues/57116
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
less repeatings this way.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
@tchaikov tchaikov changed the title [DNM] just for testing test/run-rbd-unit-tests.sh: increase ip-local-port-range Sep 9, 2022
@tchaikov tchaikov removed the DNM label Sep 9, 2022
@tchaikov tchaikov marked this pull request as ready for review September 9, 2022 17:34
@tchaikov tchaikov requested a review from a team as a code owner September 9, 2022 17:34
@ljflores
Copy link
Contributor

ljflores commented Sep 9, 2022

@tchaikov made this Tracker to document your additional fix: https://tracker.ceph.com/issues/57491

Can you update the commit?

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the purpose of this change. https://gist.github.com/tchaikov/b3858da3afcba00eb71394dffb5899d1 shows over 28000 TIME_WAIT sockets to 172.21.2.2:6810. I don't think unittest_librbd is involved in that in any way -- I would expect it to generate ~20 TIME_WAIT sockets to localhost -- so I don't see any reason to touch run-rbd-unit-tests.sh.

And even if this wasn't done from run-rbd-unit-tests.sh but in some other file, why are we attempting to increase ip_local_port_range instead of tracking down what is responsible for tens of thousands of locally closed sockets?

@tchaikov
Copy link
Contributor Author

I don't understand the purpose of this change. https://gist.github.com/tchaikov/b3858da3afcba00eb71394dffb5899d1 shows over 28000 TIME_WAIT sockets to 172.21.2.2:6810. I don't think unittest_librbd is involved in that in any way -- I would expect it to generate ~20 TIME_WAIT sockets to localhost -- so I don't see any reason to touch run-rbd-unit-tests.sh.

And even if this wasn't done from run-rbd-unit-tests.sh but in some other file, why are we attempting to increase ip_local_port_range instead of tracking down what is responsible for tens of thousands of locally closed sockets?

i created this PR in the same spirit of #47962. why? because i think i understand the issue better, and want to get the "make check" completed sooner with less number of sleep() calls.

i admit that i didn't understand the root cause of "what is responsible for tens of thousands of locally closed sockets" back then, neither do i now. why did i created these PRs without understanding the root case? because i want to "workaround" this issue to alleviate the pain we are suffering in the past weeks. and because i don't have enough bandwidth working on this issue. if any of us had come up with a better fix i would not do these.

i noticed that the #48014 was created claiming

Fixes: https://tracker.ceph.com/issues/57116

that PR was approved and backported. that change didn't come with a root cause analysis explaining why we had this issue. and it reverted some changes proposed by the author made in #47962. again, i admit that #47962 also failed to root cause the issue. that's why i didn't claim that i fixed the issue.

i thought we reviewed the PRs with a consistent standard. the comment is confusing.

anyway, if this change is not following our guideline or best practice. i am closing this change for good.

@tchaikov tchaikov closed this Sep 11, 2022
@tchaikov tchaikov deleted the wip-98 branch September 11, 2022 01:12
@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 11, 2022

@tchaikov made this Tracker to document your additional fix: https://tracker.ceph.com/issues/57491

Can you update the commit?

hi Laura, thanks. i don't intend to continue working on this change or this issue anymore. so i reset the related fields in https://tracker.ceph.com/issues/57491. sorry for the noise.

@ljflores
Copy link
Contributor

No worries, Kefu!

@idryomov
Copy link
Contributor

Hi Kefu,

I don't understand the purpose of this change.

[...]

i created this PR in the same spirit of #47962. why? because i think i understand the issue better, and want to get the "make check" completed sooner with less number of sleep() calls.

This wasn't noted anywere -- neither in the PR description nor in any of the commit messages. "want to get the "make check" completed sooner with less number of sleep() calls" makes the purpose of this change and its disposition (i.e. whether it is an improvement to the workaround or an actual fix) clear. I hope you would agree that it is much more descriptive than "increase ip-local-port-range" which was all I had to go on at the time of review.

The other thing that confused me was that you added this increase just in run-rbd-unit-tests.sh and not in RGW test runner as well. This led me on a false trail of assuming that you found the root cause and it turned out to be one of RBD tests. In an attempt to reverse engineer that (false) assumption, I actually had to dig up ip_local_port_range documentation to make sure that I'm not misunderstanding what it does because it was quite puzzling...

i admit that i didn't understand the root cause of "what is responsible for tens of thousands of locally closed sockets" back then, neither do i now. why did i created these PRs without understanding the root case? because i want to "workaround" this issue to alleviate the pain we are suffering in the past weeks. and because i don't have enough bandwidth working on this issue. if any of us had come up with a better fix i would not do these.

I didn't see a single occurrence of this issue on runs that included both #47962 and #48014 and therefore thought that the workaround was good enough and that the pain was gone. I didn't realize that you had another improvement to the workaround in mind.

i noticed that the #48014 was created claiming

Fixes: https://tracker.ceph.com/issues/57116

that PR was approved and backported. that change didn't come with a root cause analysis explaining why we had this issue. and it reverted some changes proposed by the author made in #47962. again, i admit that #47962 also failed to root cause the issue. that's why i didn't claim that i fixed the issue.

The Fixes tag was copied from #47962. I backported #48014 together with #47962 (for pacific, quincy had already received #47962 as an ad-hoc backport by then) under that Fixes tag because I didn't see a single occurrence of this issue with #47962 + #48014, but did see a couple with just #47962.

i thought we reviewed the PRs with a consistent standard. the comment is confusing.

anyway, if this change is not following our guideline or best practice. i am closing this change for good.

I think our approach to test-failure issues has been fairly consistent: get as reliable as possible workaround in place, then work on an actual fix as time allows. Now that you have explained that this PR is a further improvement to the workaround, I guess the question is: is the existing #47962 + #48014 workaround good enough? If the answer is yes, we should move to root causing and working on an actual fix. If the answer is no, e.g. because some make check runs still fail on EADDRINUSE or because make check now takes significantly longer, I'm totally open to reconsidering this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants