Massive long running tests on Linux #56477

Closed
karelz opened this issue Jul 28, 2021 · 9 comments · Fixed by #56966
Labels: area-System.Net.Http, os-linux (Linux OS, any supported distro), test-run-core (Test failures in .NET Core test runs)

karelz (Member) commented Jul 28, 2021

Failures 7/8-9/6 (incl. PRs):

| Day | Run | OS | Notes | Runtime |
|---|---|---|---|---|
| 7/16 | PR #55776 - Preview 7 branch | Fedora.32.Amd64.Open | 32 tests | CoreCLR |
| 7/16 | PR #55710 | Centos.8.Amd64.Open | 37 tests | CoreCLR |
| 7/16 | PR #55666 | Centos.8.Amd64.Open | 39 tests | CoreCLR |
| 7/21 | PR #56035 - Preview 7 branch | Ubuntu.1604.Amd64.Open | 43 tests | CoreCLR |
| 7/21 | PR #55664 | Alpine.312.Amd64.Open | 38 tests | CoreCLR |
| 7/22 | PR #55683 | Fedora.32.Amd64.Open | 42 tests | CoreCLR |
| 7/25 | PR #56277 | Fedora.32.Amd64.Open | 45 tests | CoreCLR |
| 7/27 | PR #54991 | Centos.8.Amd64.Open | 43 tests | CoreCLR |
| 7/27 | PR #56413 | Debian.10.Amd64.Open | 41 tests | CoreCLR |
| 7/28 | PR #55102 | Ubuntu.1604.Amd64.Open | 36 tests | CoreCLR |
| 7/28 | PR #56488 | Fedora.32.Amd64.Open | 40 tests | CoreCLR |
| 7/29 | PR #56600 | Debian.10.Amd64.Open | 37 tests | CoreCLR |
| 7/30 | Official run | Alpine.312.Amd64.Open | 41 tests | CoreCLR |
| 7/31 | PR #56568 | Centos.8.Amd64.Open | 36 tests | CoreCLR |
| 7/31 | Official run | Fedora.34.Amd64.Open | 32 tests | CoreCLR |
| 8/3 | PR #56768 | Ubuntu.1804.Amd64.Open | 36 tests | CoreCLR |
| 8/4 | PR #55916 | SLES.15.Amd64.Open | 37 tests | CoreCLR |
| 8/4 | PR #54640 | Centos.8.Amd64.Open | 40 tests | CoreCLR |
| 8/4 | PR #56862 | Centos.8.Amd64.Open | 38 tests | CoreCLR |
| 8/5 | PR #56909 | Fedora.34.Amd64.Open | -- | CoreCLR |
| 8/11 | PR #57178 | Alpine.312.Amd64.Open | -- | CoreCLR |
| 8/11 | Attempted fix in PR #56966 (also in release/6.0 and 6.0-rc1 branches) | | | |
| 8/12 | Official run | Debian.10.Amd64.Open | -- | CoreCLR |
| 8/17 | PR #57530 (created on 8/17) | Debian.10.Amd64.Open | 38 tests | CoreCLR |
| 8/19 | PR #57741 - release/6.0 | SLES.15.Amd64.Open | 39 tests | CoreCLR |
| 8/29 | Official run (main = 7.0) | Debian.10.Amd64.Open | 36 tests | CoreCLR |
| 9/3 | Official run - release/6.0 | Ubuntu.1604.Amd64.Open | 39 tests | CoreCLR |

Addressed on 8/11 in PR #56966

Perhaps a regression introduced around 7/16? Higher frequency: ~1/day

Prior to 7/8 we do not have console outputs, but the frequency seems to be significantly lower - <1/week across all OS versions

Data from 3/29-7/8:

| Day | Run | OS | Runtime |
|---|---|---|---|
| 3/29 | Official run | SLES.12.Amd64.Open | CoreCLR |
| 4/9 | Official run | OSX.1015.Amd64.Open | CoreCLR |
| 4/14 | Official run | Windows.10.Amd64.ServerRS5.Open | CoreCLR |
| 4/26 | PR #51873 | Windows.10.Amd64.Server19H1.Open | CoreCLR |
| 5/14 | Official run | Alpine.312.Amd64.Open | CoreCLR |
| 5/18 | PR #52887 | Windows.10.Amd64.Server19H1.Open | CoreCLR |
| 5/31 | PR #53471 | Debian.10.Amd64.Open | CoreCLR |
| 6/8 | PR #53851 | Centos.8.Amd64.Open | CoreCLR |
| 6/24 | PR #54618 | Windows.10.Amd64.Android.Open | Mono |
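
The two frequency estimates above ("~1/day" and "<1/week") can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the counts are the table rows dated 7/16 through 7/27 (9 rows) and the rows in the 3/29-7/8 data (9 rows). Treating both date ranges as inclusive is an assumption, and `failures_per_day` is a name made up for illustration.

```python
from datetime import date

def failures_per_day(count, first, last):
    """Average failures per day over an inclusive date range."""
    days = (last - first).days + 1
    return count / days

# 9 rows dated 7/16 through 7/27 in the first table.
recent = failures_per_day(9, date(2021, 7, 16), date(2021, 7, 27))

# 9 rows in the 3/29-7/8 data.
earlier = failures_per_day(9, date(2021, 3, 29), date(2021, 7, 8))

print(f"7/16-7/27: {recent:.2f} failures/day")
print(f"3/29-7/8:  {earlier * 7:.2f} failures/week")
```

Under those assumptions the recent window comes out at 0.75 failures per day and the earlier one at roughly 0.6 per week, which is in line with the estimates quoted here. The earlier window has no console output, so its counts may not be strictly comparable.
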
karelz added the area-System.Net.Http and os-linux labels Jul 28, 2021
dotnet-issue-labeler (bot) added the untriaged label (New issue has not been triaged by the area owner) Jul 28, 2021

ghost commented Jul 28, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


karelz (Member, Author) commented Jul 29, 2021

Triage: We need to look into this for 6.0. Ideally we reproduce it locally. If we cannot, we can force a crash via timeouts on the tests that hang most frequently ...
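
As a language-neutral illustration of the timeout idea, here is a minimal Python sketch; it is not the .NET test code from this repo. It shows only the first half of the plan, turning a hang into an explicit, named failure. Collecting a crash dump is not shown. All names (`run_with_timeout`, `hanging_test`, `quick_test`) are hypothetical.

```python
import asyncio

async def run_with_timeout(test_name, test_body, timeout_seconds):
    """Run one async test body and report a hang as a named timeout."""
    try:
        await asyncio.wait_for(test_body(), timeout=timeout_seconds)
        return f"{test_name}: passed"
    except asyncio.TimeoutError:
        return f"{test_name}: timed out after {timeout_seconds}s"

async def hanging_test():
    # Stand-in for a test that never completes: this event is never set.
    await asyncio.Event().wait()

async def quick_test():
    # Stand-in for a healthy test.
    await asyncio.sleep(0)

async def main():
    return [
        await run_with_timeout("quick_test", quick_test, 0.2),
        await run_with_timeout("hanging_test", hanging_test, 0.2),
    ]

results = asyncio.run(main())
print(results)
```

The point of the per-test timeout is that the run reports which test hung instead of the whole work item running long with no indication of where it is stuck.
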

ManickaP (Member) commented Aug 3, 2021

Triage: we should check and fix all places where we call synchronous methods in async code paths, as noted by @wfurt in #56758
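
To make the general hazard concrete, here is a small Python analogy. It is an assumption-laden sketch, not the .NET code in question, and the mechanics in .NET differ. A synchronous blocking call inside an async function stalls everything else scheduled on the same event loop, while the awaited version does not. All names are hypothetical.

```python
import asyncio
import time

async def heartbeat(ticks):
    """Record a timestamp roughly every 10 ms while other work runs."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)

async def blocking_handler():
    # Synchronous call inside an async code path: blocks the event loop.
    time.sleep(0.2)

async def async_handler():
    # Awaited equivalent: yields to the event loop while waiting.
    await asyncio.sleep(0.2)

async def max_gap(handler):
    """Run the heartbeat alongside a handler; return the longest stall."""
    ticks = []
    await asyncio.gather(heartbeat(ticks), handler())
    return max(b - a for a, b in zip(ticks, ticks[1:]))

blocking_gap = asyncio.run(max_gap(blocking_handler))
async_gap = asyncio.run(max_gap(async_handler))
print(f"longest stall with sync call:  {blocking_gap:.3f}s")
print(f"longest stall with await call: {async_gap:.3f}s")
```

With the blocking handler, the heartbeat stalls for about the full duration of the synchronous call. With the awaited handler, it keeps ticking at close to its normal interval.
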

ghost added the in-pr label (There is an active PR which will close this issue when it is merged) Aug 6, 2021
ghost removed the in-pr label Aug 11, 2021
karelz (Member, Author) commented Aug 20, 2021

It seems it is not fully addressed yet :( ... we've had 2 hits since the fix went in, so reopening.
Looks like we will need a dump to find the root cause ...

karelz reopened this Aug 20, 2021
aik-jahoda (Contributor) commented

Are the recent hits in PRs or in main? Maybe main with the fix hasn't been merged into those PRs yet.

karelz (Member, Author) commented Aug 26, 2021

@aik-jahoda the last 3 failures in the top post already include the change (I added the attempted fix to the timeline to make it easier to reason about).

karelz (Member, Author) commented Aug 27, 2021

Weird, no occurrences in the last week (until 8/26), closing again. We can reopen if it happens again with higher frequency.
Sorry for the false alarm ...

karelz closed this as completed Aug 27, 2021
karelz (Member, Author) commented Sep 1, 2021

Another hit in main (7.0), so it is still around, but with lower frequency. I will reopen once it has a few occurrences.

wfurt (Member) commented Sep 1, 2021

Links? If the same tests are hanging, we can perhaps add some instrumentation to get a core dump.

ghost locked as resolved and limited conversation to collaborators Oct 6, 2021