Trying to fix issue #63788: "NetworkChange.NetworkAddressChanged event leaks threads on Linux" #63963
Tagging subscribers to this area: @dotnet/ncl
Issue Details
Adding a "s_socketCreateContext" counter which changes each time the CreateSocket() method is called. The LoopReadSocket method can then use this information to exit the thread.
Do you think you could also bring the repro code as a test?
Hi, yes, I will try to add a minimal functional test for this issue, but first I have to set up a local Linux development environment in WSL so I can rebuild .NET Core locally from my fork and run the unit tests. It might take some time :)
Hi, after creating the initial commit (5c33cce) and then writing a functional test, I noticed that the thread leak was still not fixed. I investigated a little, and it seems the native code in pal_networkchange.c, method SystemNative_ReadEvents, was also blocking the thread because recvmsg waited indefinitely... I then added a timeout for the receive call (currently set to one second), and this seems to have fixed the issue. The functional test now reports 0 threads after all event subscribers are removed. NOTE: I am not a Linux or C expert, so I suggest a thorough review of this one :)
Thanks for the contribution!!!
Some comments for the native part.
@wfurt could you also check this out? I don't see anything wrong with this, but I'm also not fully confident I'm not missing something.
static async Task<int> GetNumberOfNetworkAddressChangeThreadsAsync()
{
    int pid = Process.GetCurrentProcess().Id;
    ProcessStartInfo psi = new ProcessStartInfo("ps", $"-T -p {pid}");
Unfortunately, calling ps fails on some of our Docker images:
System.Net.NetworkInformation.Tests.NetworkChangedTests.NetworkAddressChanged_AddRemoveMultipleTimes_CheckForLeakingThreads [FAIL]
System.ComponentModel.Win32Exception : An error occurred trying to start process 'ps' with working directory '/root/helix/work/workitem/e'. No such file or directory
I need to follow up on this with our infra to decide whether to condition the test or fix the images...
Yes, I noticed a few tests were failing because "ps" was missing. I don't know whether it is good practice to call external programs from tests. Unfortunately, I don't think such a test can be done in managed C# code.
We do have some precedent for calling an external process from a test, and I don't think there's another (managed) way to count the threads by their names.
I found an example of a check that tests a similar thing:
runtime/src/libraries/System.Net.Security/tests/FunctionalTests/ServerAllowNoEncryptionTest.cs
Line 74 in 5a12420
[ConditionalFact(nameof(SupportsNullEncryption))] |
And the implementation of the check leads to this:
runtime/src/libraries/System.Net.Security/tests/FunctionalTests/TestConfiguration.cs
Line 36 in 5a9e584
private static Lazy<bool> s_supportsNullEncryption = new Lazy<bool>(() => |
Basically, it tries to call the required CLI command and, if that fails, disables the tests that depend on it.
Do you think you could add something like this for this test?
For reference, the issue to add ps into the Docker image: dotnet/dotnet-buildtools-prereqs-docker#565
[PlatformSpecific(TestPlatforms.Linux)]
[Fact]
public async void NetworkAddressChanged_AddRemoveMultipleTimes_CheckForLeakingThreads()
And we should probably put this into outerloop since it has a delay, but that can be done once we figure out the failing ps.
ManickaP replaced tabs with spaces Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
@ManickaP thank you for your suggestions and help! I created two commits: afd9caa and 69f536f. I also added a TODO comment in pal_networkchange.c to discuss whether the current one-second timeout is a good default value for setsockopt, because it means the C# thread will call Interop.Sys.ReadEvents every second... Also, there is a simple workaround for all these issues: on application startup, register one NetworkChange.NetworkAddressChanged handler and never unsubscribe from it :)
// as described on GitHub: https://github.com/dotnet/runtime/issues/63788
// For each CreateSocket() call we increment this variable and pass it to ".NET Network Address Change" thread.
// This is then used in method LoopReadSocket as additional condition to exit thread.
private static volatile int s_socketCreateContext;
Suggestion: instead of adding s_socketCreateContext, you can put the socket handle in a reference type.
class ListenerSocket { public ListenerSocket(int handle); public int Handle; }
static volatile ListenerSocket? s_socket;
While successive ListenerSockets can have the same Handle, the reference will never be the same.
Hi,
Thank you for this suggestion! It seems like a more elegant solution. I changed this in the following commit: e130fdc
Looks good.
You can remove SocketWrapper.NotSet and use null instead.
Initially I had a nullable implementation, but then I changed it to a non-nullable one because I don't like nulls :)
I don't know which is better here...
null is better. It is a compile-time constant. It's also the default value for fields, and it's a good value to indicate "not set".
{
    Interop.Sys.ReadEvents(socket, &ProcessEvent);
    //we can continue processing events
    Interop.Sys.ReadEvents(initiallyCreatedSocket.Socket, &ProcessEvent);
Between the condition check in the while loop and passing the handle to ReadEvents, the handle may be reused for something else. That means we would be reading from someone else's file descriptor.
Using Socket/SafeSocketHandle allows us to fix that.
@@ -44,6 +45,20 @@ Error SystemNative_CreateNetworkChangeListenerSocket(int32_t* retSocket)
        return (Error)(SystemNative_ConvertErrorPlatformToPal(errno));
    }

    // Added receive timeout to prevent recvmsg method in SystemNative_ReadEvents to block thread indefinitely.
If we're using Socket, this is not needed either.
new Thread(s => LoopReadSocket((int)s!))
s_socket = new SocketWrapper(newSocket);

new Thread(args => LoopReadSocket((SocketWrapper)args!))
And using Socket, we could do async reading instead of creating a thread.
I've highlighted what we can improve further by using |
@tmds I think those Socket improvements would be better as part of another PR. Regarding async Socket reading, I also think this would be better than creating a new Thread. Using null instead of SocketWrapper.NotSet is an easy rewrite, so this is not a problem.
I am wondering why some tests are failing although the conditional fact was added. One of the logs reports "Exception while trying to run 'ps' command for user 'helixbot'" (I added an additional check-in for diagnostics to report which user is running this), and another returns "Process 'ps' returned exit code 1". Those exceptions should also happen during the conditional fact check, which should then return "false", so the test should be skipped.
if (OperatingSystem.IsWindows())
{
    return false;
}
// On other platforms we will try it.
try
{
    _ = ProcessUtil.GetProcessThreadsWithPsCommand(Process.GetCurrentProcess().Id);
    return true;
}
catch { return false; }
The static method that is evaluated in the runtime/src/libraries/System.Net.Security/tests/FunctionalTests/ServerAllowNoEncryptionTest.cs Line 97 in 5a12420
Hi, I think this is already added:
[PlatformSpecific(TestPlatforms.Linux)]
//[OuterLoop()] //TODO: add Outer Loop attribute?
[ConditionalFact(nameof(SupportsGettingThreadsWithPsCommand))]
public void NetworkAddressChanged_AddRemoveMultipleTimes_CheckForLeakingThreads()
{
for (int i = 1; i <= 10; i++)
{
NetworkChange.NetworkAddressChanged += _addressHandler;
NetworkChange.NetworkAddressChanged -= _addressHandler;
}
Thread.Sleep(2000); //allow some time for threads to exit
//We are searching for threads containing ".NET Network Ad"
//because ps command trims actual thread name ".NET Network Address Change".
//This thread is created in:
// src/libraries/System.Net.NetworkInformation/src/System/Net/NetworkInformation/NetworkAddressChange.Unix.cs
int numberOfNetworkAddressChangeThreads = ProcessUtil.GetProcessThreadsWithPsCommand(Process.GetCurrentProcess().Id)
.Where(e => e.IndexOf(".NET Network Ad") > 0).Count();
Assert.Equal(0, numberOfNetworkAddressChangeThreads); //there should be no threads because there are no event subscribers
}
private static bool SupportsGettingThreadsWithPsCommand
    => TestConfiguration.SupportsGettingThreadsWithPsCommand;
For a local test I also hardcoded the property SupportsGettingThreadsWithPsCommand to always return false, and the test was skipped locally.
As to the socket improvements, the change in this PR as-is is not making the existing situation worse so we should proceed with getting this in. The suggested changes can be tackled separately. @tmds can you file an issue for them or should I? |
Got it, the function that calls the process uses SharpLab |
Wow, very very complex code :) Sorry for this! In 821a5d1 this is now replaced with more simple version ( It seems I don't have luck, now those few tests are canceled. Linux doesn't like me :) |
Please ignore my previous post, I think I understand now, will do another experimental checkin ... |
public int Socket { get; }

public static readonly SocketWrapper NotSet = new SocketWrapper(0);
While unlikely, 0 is a valid handle. We should perhaps use -1.
Mini dilemmas for b6e944d in method
private static void CloseSocket()
{
Debug.Assert(s_socket != null, "s_socket was null when CloseSocket was called.");
Interop.Error result = Interop.Sys.CloseNetworkChangeListenerSocket(s_socket != null ? s_socket.Socket : -1);
if (result != Interop.Error.SUCCESS)
{
string message = Interop.Sys.GetLastErrorInfo().GetErrorMessage();
throw new NetworkInformationException(message);
}
s_socket = null;
} If
|
Yes, this can stay as is. |
/azp list |
/azp run runtime-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
Interop failures unrelated: #64172 |
/azp run runtime-libraries-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
Failures are unrelated, and the added test runs as expected.
And I think all outstanding comments are resolved or shelved, so we can merge this.
Thanks @vdjuric for the contribution and fixing this! |
Hi, thank you very much, and no problem at all. You all helped me a lot! It was a great learning experience (first time building .NET with WSL :) ).
Adding "s_socketCreateContext" counter which changes each time CreateSocket() method is called. Method LoopReadSocket can then use this information to exit thread.
Fixes #63788