Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable flaky DNS cancellation tests during jitstress #70089

Merged
merged 4 commits into from
Jun 2, 2022

Conversation

rzikm
Copy link
Member

@rzikm rzikm commented Jun 1, 2022

Related to #69993.

Original fix (#70044), while removing the flakyness, made the test effectively useless (see #70044 (comment)). Based on the discussion we decided that it is better to disable the test during jitstress.

This PR also reverts #70009, as the implementation of async name resolution on *nix platforms was reverted in #48666

@ghost
Copy link

ghost commented Jun 1, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Related to #69993.

Original fix (#70044), while removing the flakyness, made the test effectively useless (see #70044 (comment)). Based on the discussion we decided that it is better to disable the test during jitstress.

This PR also reverts #70009, as the implementation of async name resolution on *nix platforms was reverted.

Author: rzikm
Assignees: rzikm
Labels:

area-System.Net

Milestone: -

@rzikm rzikm requested a review from antonfirsov June 1, 2022 15:17
Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your patience cleaning this up!

@jakobbotsch
Copy link
Member

This test does not only fail in jitstress runs. As I have tried to point out in the original issue, jitstress just runs a lot of configurations and therefore runs tests a lot of times, so if they are flaky it is often exposed under jitstress.

There are at least 3 failures of this test in the runtime pipeline in the past 2 days, i.e. without any jitstress.

@JulieLeeMSFT
Copy link
Member

Agreed with @jakobbotsch. CC @karelz.

This test does not only fail in jitstress runs. As I have tried to point out in the original issue, jitstress just runs a lot of configurations and therefore runs tests a lot of times, so if they are flaky it is often exposed under jitstress.

There are at least 3 failures of this test in the runtime pipeline in the past 2 days, i.e. without any jitstress.

@rzikm
Copy link
Member Author

rzikm commented Jun 2, 2022

This test does not only fail in jitstress runs. As I have tried to point out in the original issue, jitstress just runs a lot of configurations and therefore runs tests a lot of times, so if they are flaky it is often exposed under jitstress.

There are at least 3 failures of this test in the runtime pipeline in the past 2 days, i.e. without any jitstress.

Those failures are on linux, I mistakenly thought that async name resolution has been implemented on Linux in #34633, so I enabled the test on UNIX platforms (it was disabled before). I did not notice that the PR was later reverted (the issue mentioned in the ActiveIssue attribute was still marked as closed). This PR again disables the tests on UNIX so those failures will not repeat.

There are no recent failures of this test on Windows.

@rzikm
Copy link
Member Author

rzikm commented Jun 2, 2022

The test failure seems unrelated.

@jakobbotsch
Copy link
Member

Those failures are on linux, I mistakenly thought that async name resolution has been implemented on Linux in #34633, so I enabled the test on UNIX platforms (it was disabled before). I did not notice that the PR was later reverted (the issue mentioned in the ActiveIssue attribute was still marked as closed). This PR again disables the tests on UNIX so those failures will not repeat.

Got it, but this still just seems like a short-term fix. Any future change, test mode, or even just regular background processes can affect timings in the test execution and has the potential of hitting the same problem. For example, if we wanted to run libraries tests under GCStress (which has been discussed before) that would also affect timings significantly.

With that said, I am more assured now given that the Linux failures were something else.

@rzikm rzikm merged commit f873951 into dotnet:main Jun 2, 2022
@antonfirsov
Copy link
Member

antonfirsov commented Jun 2, 2022

@jakobbotsch unfortunately there are many networking features which can not be tested at all in a manner that is not dependent on timing. We can choose to not test them at all, or to respect the fact that they need better isolation in the test infra. I'm very much for the latter, since I believe we should not leave features untested, even if it comes with a price.

For example, if we wanted to run libraries tests under GCStress (which has been discussed before) that would also affect timings significantly.

Then we will have to exclude timing-dependent networking tests from GCStress.

A long term solution could be to create some sort of marker attribute so these tests can be auto-excluded from runs where isolation is not possible.

@jakobbotsch
Copy link
Member

We can choose to not test them at all, or to respect the fact that they need better isolation in the test infra.

This does not sound like the only possibilities to me. Other possibilities could be test stubs or a test DNS server that the test controls where timings can be deterministically controlled (I realize this is a lot of work).

@wfurt
Copy link
Member

wfurt commented Jun 2, 2022

we should control the Helix environment. Part of the problem is that some networking test are time dependent e.g. they test that something does happen in given time frame. While there is always potential for improvements it is not trivial too make the 100% reliable. It is also possible that the failures in jitsress are real bugs instead of just timing differences. But the investigation is not easy and there are more important tasks for the release.

@antonfirsov
Copy link
Member

antonfirsov commented Jun 2, 2022

a test DNS server

This can solve the DNS issues but not other cases where networking code may become racy.

test stubs

This is definitely a lot of work for DNS or Sockets, not convinced it's worth it. Having a mindful approach to Helix way seems easier to me.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 2, 2022
@karelz karelz added this to the 7.0.0 milestone Jul 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants