Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to fix issue #63788: "NetworkChange.NetworkAddressChanged event leaks threads on Linux #63963

Merged
merged 13 commits into from Jan 27, 2022

Conversation

vdjuric
Copy link
Contributor

@vdjuric vdjuric commented Jan 18, 2022

Adding "s_socketCreateContext" counter which changes each time CreateSocket() method is called. Method LoopReadSocket can then use this information to exit thread.

Fixes #63788

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jan 18, 2022
@ghost
Copy link

ghost commented Jan 18, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Adding "s_socketCreateContext" counter which changes each time CreateSocket() method is called. Method LoopReadSocket can then use this information to exit thread.

Author: vdjuric
Assignees: -
Labels:

area-System.Net, community-contribution

Milestone: -

@dnfadmin
Copy link

dnfadmin commented Jan 18, 2022

CLA assistant check
All CLA requirements met.

@ManickaP
Copy link
Member

Do you think you could also bring the repro code as a test?

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 19, 2022

Hi,

Yes, I will try to add minimal functional test for this issue, but first I have to setup local Linux development environment in WSL, so I can rebuild .NET Core locally from fork and run unit tests. Might take some time :)

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 20, 2022

Hi,

After creating initial commit (5c33cce) and then writing functional test, I noticed issue with leaking threads was still not fixed. I investigated a little bit and it seems native code in pal_networkchange.c, method SystemNative_ReadEvents was also blocking thread because recvmsg waited indefinitely...

I then added timeout for receive method (currently set to one second) and it seems this fixed the issue. Functional test now reports 0 threads after all event subscribers were removed.

NOTE: I am not Linux and not C expert, so I suggest good review on this one :)

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!!!
Some comments for the native part.
@wfurt could you also check this out? I don't see anything wrong with this, but I'm also not fully confident I'm not missing something.

src/native/libs/System.Native/pal_networkchange.c Outdated Show resolved Hide resolved
static async Task<int> GetNumberOfNetworkAddressChangeThreadsAsync()
{
int pid = Process.GetCurrentProcess().Id;
ProcessStartInfo psi = new ProcessStartInfo("ps", $"-T -p {pid}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately calling ps fails on some of our docker images:

System.Net.NetworkInformation.Tests.NetworkChangedTests.NetworkAddressChanged_AddRemoveMultipleTimes_CheckForLeakingThreads [FAIL]
      System.ComponentModel.Win32Exception : An error occurred trying to start process 'ps' with working directory '/root/helix/work/workitem/e'. No such file or directory

I need to follow up with this with our infra, whether to condition the test or fix the images...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I noticed few tests were failing because "ps" was missing. I don't know if it is good practice to call external programs from tests. Unfortunately I don't think such test can be done with C# managed code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have some precedent for calling external process from a test and I don't think there's another (managed) way to count the threads by their names.
I found an example of a check that tests a similar thing:


And the implementation of the check leads to this:
private static Lazy<bool> s_supportsNullEncryption = new Lazy<bool>(() =>

Basically trying to call the required CLI command and if it fails disabling the tests that depends on it.

Do you think you could add something like this for this test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, issue to add ps into the docker image: dotnet/dotnet-buildtools-prereqs-docker#565

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Thank you for this. I think this should now be implemented with d893003 and e94ab24

Also, [OuterLoop()] should be probably enabled for this test as you suggested and remove TODO.


[PlatformSpecific(TestPlatforms.Linux)]
[Fact]
public async void NetworkAddressChanged_AddRemoveMultipleTimes_CheckForLeakingThreads()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we should probably put this into outerloop since it has a delay, but that can be done once we figure out the failing ps.

vdjuric and others added 2 commits January 23, 2022 11:11
ManickaP replaced tabs with spaces

Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 23, 2022

@ManickaP thank you for your suggestions and help!

Created two commits: afd9caa and 69f536f

I also added TODO comment into pal_networkchange.c to discuss if current one second timeout is good default value for setsockopt, because this means C# thread will call Interop.Sys.ReadEvents every second ...

Also, there is simple workaround for all these issues: on application startup simply register one NetworkChange.NetworkAddressChanged event and don't unsubscribe from it :)

// as described on GitHub: https://github.com/dotnet/runtime/issues/63788
// For each CreateSocket() call we increment this variable and pass it to ".NET Network Address Change" thread.
// This is then used in method LoopReadSocket as additional condition to exit thread.
private static volatile int s_socketCreateContext;
Copy link
Member

@tmds tmds Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: instead of adding s_socketCreateContext you can put the socket handle in a reference type.

class ListenerSocket { public ListenerSocket(int handle); public int Handle; }
static volatile ListenerSocket? s_socket;

While successive ListenerSockets can have the same Handle, the reference will never be the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Thank you for this suggestion! Seems more elegant solution, I changed this with following commit: e130fdc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
You can remove SocketWrapper.NotSet and use null instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I had nullable implementation, but then changed it to non-nullable one because I don't like nulls :)

Don't know what is better here ...

:)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null is better. It is a compile-time constant. It's also the default value for fields, and it's a good value to indicate not set.

{
Interop.Sys.ReadEvents(socket, &ProcessEvent);
//we can continue processing events
Interop.Sys.ReadEvents(initiallyCreatedSocket.Socket, &ProcessEvent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between the condition-check in the while loop, and passing the handle to ReadEvents the handle may be re-used for something other. That means we'll be reading for some one else's file descriptor.

Using Socket/SafeSocketHandle allows to fix that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be bigger change than @vdjuric may want to take. If we feel that would simplify the code, I think we should open separate issue for it @tmds.
Alternatively, we can possibly use SafeHandle but that would also be much bigger change IMHO.

@@ -44,6 +45,20 @@ Error SystemNative_CreateNetworkChangeListenerSocket(int32_t* retSocket)
return (Error)(SystemNative_ConvertErrorPlatformToPal(errno));
}

// Added receive timeout to prevent recvmsg method in SystemNative_ReadEvents to block thread indefinitely.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're using Socket this is not needed either.

new Thread(s => LoopReadSocket((int)s!))
s_socket = new SocketWrapper(newSocket);

new Thread(args => LoopReadSocket((SocketWrapper)args!))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and using Socket we could do async reading, instead of making a thread.

@tmds
Copy link
Member

tmds commented Jan 25, 2022

I've highlighted what we can improve further by using Socket.
@vdjuric do you want to look into it? Or otherwise, we can tackle it in a follow-up PR.

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 25, 2022

@tmds I think those Socket improvements would be better as part of another PR. Regarding async Socket reading I also think this would be better than creating new Thread.

Using null instead of SocketWrapper.NotSet can be easily rewritten, this is not a problem.

I am wondering why some tests are failing although conditional fact was added. One of the logs reports "Exception while trying to run 'ps' command for user 'helixbot'" (I added additional checkin for diagnostics to report which user is running this), another returns "Process 'ps' returned exit code 1".

Those exceptions should also happen during conditional fact check, which should return "false" --> test should be skipped.

if (OperatingSystem.IsWindows())
{
	return false;
}

// On other platforms we will try it.
try
{
	_ = ProcessUtil.GetProcessThreadsWithPsCommand(Process.GetCurrentProcess().Id);
	return true;
}
catch { return false; }

@ManickaP
Copy link
Member

The static method that is evaluated in the ConditionalFact mas be accessible within the test class. I.e.: you need to add simple static property to NetworkChangedTests like this:

private static bool SupportsNullEncryption => TestConfiguration.SupportsNullEncryption;

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 25, 2022

Hi,

I think this is added:

        [PlatformSpecific(TestPlatforms.Linux)]
        //[OuterLoop()] //TODO: add Outer Loop attribute?
        [ConditionalFact(nameof(SupportsGettingThreadsWithPsCommand))]
        public void NetworkAddressChanged_AddRemoveMultipleTimes_CheckForLeakingThreads()
        {
            for (int i = 1; i <= 10; i++)
            {
                NetworkChange.NetworkAddressChanged += _addressHandler;
                NetworkChange.NetworkAddressChanged -= _addressHandler;
            }

            Thread.Sleep(2000); //allow some time for threads to exit

            //We are searching for threads containing ".NET Network Ad"
            //because ps command trims actual thread name ".NET Network Address Change".
            //This thread is created in:
            //  src/libraries/System.Net.NetworkInformation/src/System/Net/NetworkInformation/NetworkAddressChange.Unix.cs
            int numberOfNetworkAddressChangeThreads = ProcessUtil.GetProcessThreadsWithPsCommand(Process.GetCurrentProcess().Id)
                .Where(e => e.IndexOf(".NET Network Ad") > 0).Count();

            Assert.Equal(0, numberOfNetworkAddressChangeThreads); //there should be no threads because there are no event subscribers
        }

        private static bool SupportsGettingThreadsWithPsCommand
            => TestConfiguration.SupportsGettingThreadsWithPsCommand;

For local test I also hardcoded property SupportsGettingThreadsWithPsCommand to always return false and it was skipped from local tests.

@ManickaP
Copy link
Member

As to the socket improvements, the change in this PR as-is is not making the existing situation worse so we should proceed with getting this in. The suggested changes can be tackled separately. @tmds can you file an issue for them or should I?

@ManickaP
Copy link
Member

ManickaP commented Jan 25, 2022

Got it, the function that calls the process uses yield return. Since you throw away the resulting IEnumerable and never enumerate the iterator, it will never get invoked, thus won't throw in the test for ConditionalFact.
You can just add simple ToList(), FirstOrDefault() etc. or rewrite the function to return pre-calculated array/list and it'll start working as expected.

SharpLab MoveNext is never called if the enumerator is discarded.

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 25, 2022

Wow, very very complex code :) Sorry for this! In 821a5d1 this is now replaced with more simple version (List<string>).

It seems I don't have luck, now those few tests are canceled. Linux doesn't like me :)

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 25, 2022

Please ignore my previous post, I think I understand now, will do another experimental checkin ...


public int Socket { get; }

public static readonly SocketWrapper NotSet = new SocketWrapper(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While unlikely, 0 is valid handle. We should perhaps use -1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for this!

With commit b6e944d this is now changed, we simply set SocketWrapper to null as suggested by @tmds

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 26, 2022

Mini dilemmas for b6e944d in method CloseSocket:

        private static void CloseSocket()
        {
            Debug.Assert(s_socket != null, "s_socket was null when CloseSocket was called.");
            Interop.Error result = Interop.Sys.CloseNetworkChangeListenerSocket(s_socket != null ? s_socket.Socket : -1);
            if (result != Interop.Error.SUCCESS)
            {
                string message = Interop.Sys.GetLastErrorInfo().GetErrorMessage();
                throw new NetworkInformationException(message);
            }

            s_socket = null;
        }

If s_socket would be null when calling this method, should we:

  1. leave code as is (we will pass -1 to Interop.Sys.CloseNetworkChangeListenerSocket and this should return some error)
  2. explicitly throw exception instead of line with Debug.Assert
  3. quietly ignore this after Debug.Assert and simply return from function before Interop.Sys.CloseNetworkChangeListenerSocket is called (no exceptions are thrown)

@tmds
Copy link
Member

tmds commented Jan 26, 2022

leave code as is (we will pass -1 to Interop.Sys.CloseNetworkChangeListenerSocket and this should return some error)

Yes, this can stay as is.

@ManickaP
Copy link
Member

/azp list

@azure-pipelines
Copy link

@ManickaP
Copy link
Member

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ManickaP
Copy link
Member

Interop failures unrelated: #64172

@ManickaP
Copy link
Member

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failures are unrelated, added test runs as expected.
And I think all outstanding comments are resolved or shelved, so we can merge this.

@ManickaP ManickaP merged commit 9ae484f into dotnet:main Jan 27, 2022
@ManickaP
Copy link
Member

Thanks @vdjuric for the contribution and fixing this!

@vdjuric
Copy link
Contributor Author

vdjuric commented Jan 27, 2022

Hi, thank you very much and no problem at all, you all helped me very much! It was great learning experience (first time building .NET with WSL :) ).

@karelz karelz added this to the 7.0.0 milestone Jan 27, 2022
@dotnet dotnet locked as resolved and limited conversation to collaborators Feb 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NetworkChange.NetworkAddressChanged event leaks threads on Linux
6 participants