-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FreeBSD threading tests fail because of timing-values outside acceptable deltas. #4276
Comments
@josteink The ones within 20% or so are probably tweak able, but the other failures you mentioned are off by an order of magnitude. They probably warrant investigation. |
Probably it is worth mentioning that when @mmitche ran test suite on CI, only two tests were failing on FreeBSD: dotnet/dotnet-ci#20 (comment). |
While we could probably do what @kangaroo is suggesting here, I'm interested in what the backstory behind these tests are:
If it's the latter story which is the case, obviously just adjusting the deltas to fit our real-world values is a bad idea. Given the lack of documentation, I'm more inclined to believe it's the former one though and in that case, I think real-world data showing that things works as intended should trump a decade-old thin-air values. Anyone have any concrete knowledge about this? I've also see that the tests only emit the reported values if the tests fail. I'll modify the code to also have them report values in case of success and then run those tests on a Linux-host to see what the real difference between the systems actually is. |
On my Linux-laptop (much less RAM, much fewer cores), I'm getting the following results after modifying the tests to always report actual values:
That's a consistent delta of Does OSX also suffer from a lower timer-resolution? @kangaroo Would you be willing to run my modified SleepEx test2 and report back the values we get on FreeBSD's closest neighbour? You can pull it from here: josteink/coreclr@0943519 |
@josteink What is the HAVE_WORKING_CLOCK_GETTIME value detected on the FreeBSD? |
From CMakeCache.txt;
This results in config.h:
|
Can you tell us which version of FreeBSD you are using? Also, can you point at the specific code you are using to both 1) do the timing, and 2) do the timed sleep? FreeBSD's timer infrastructure changed in 10 and now uses deadline timers similar to Linux. I would expect it to exhibit much closer wakeup times as a result. |
@bsdjhb Thanks for the feedback. I'm running on this system: $ uname -a
FreeBSD freebsd-frankfurt 10.1-RELEASE-p9 FreeBSD 10.1-RELEASE-p9 #0: Tue Apr 7 01:09:46 UTC 2015 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64 The code doing the timing is found here: OldTickCount = GetTickCount();
ret = SleepEx(ChildThreadSleepTime, Alertable);
NewTickCount = GetTickCount(); To me that looks fairly basic and doesn't leave much room for things to go wrong. Edit: There are three other tests failing suffering from the same symptoms, but I figured highlighting the most basic one should be sufficient. |
On Monday, May 25, 2015 05:01:28 AM Jostein Kj�nigsen wrote:
Ok, the only remaining question I have is how are GetTickCount() and SleepEx() John Baldwin |
I'm looking into that particular detail right now. From what I can tell, it's this function here: DWORD
PALAPI
GetTickCount(
VOID)
{
DWORD retval = 0;
PERF_ENTRY(GetTickCount);
ENTRY("GetTickCount ()\n");
// Get the 64-bit count from GetTickCount64 and truncate the results.
retval = (DWORD) GetTickCount64();
LOGEXIT("GetTickCount returns DWORD %u\n", retval);
PERF_EXIT(GetTickCount);
return retval;
} The function it references is cobbled with PALAPI
ULONGLONG
GetTickCount64()
{
ULONGLONG retval = 0;
#if HAVE_CLOCK_MONOTONIC
{
struct timespec ts;
if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
{
ASSERT("clock_gettime(CLOCK_MONOTONIC) failed; errno is %d (%s)\n", errno, strerror(errno));
goto EXIT;
}
retval = (ts.tv_sec * tccSecondsToMillieSeconds)+(ts.tv_nsec / tccMillieSecondsToNanoSeconds);
}
#elif HAVE_GETHRTIME
{
retval = (ULONGLONG)(gethrtime() / tccMillieSecondsToNanoSeconds);
}
#elif HAVE_READ_REAL_TIME
{
timebasestruct_t tb;
read_real_time(&tb, TIMEBASE_SZ);
if (time_base_to_time(&tb, TIMEBASE_SZ) != 0)
{
ASSERT("time_base_to_time() failed; errno is %d (%s)\n", errno, strerror(errno));
goto EXIT;
}
retval = (tb.tb_high * tccSecondsToMillieSeconds)+(tb.tb_low / tccMillieSecondsToNanoSeconds);
}
#else
{
struct timeval tv;
if (gettimeofday(&tv, NULL) == -1)
{
ASSERT("gettimeofday() failed; errno is %d (%s)\n", errno, strerror(errno));
goto EXIT;
}
retval = (tv.tv_sec * tccSecondsToMillieSeconds) + (tv.tv_usec / tccMillieSecondsToMicroSeconds);
}
#endif // HAVE_CLOCK_MONOTONIC
EXIT:
return retval;
} For this code to be run, we depend on
So yes, it seems your assumptions are correct. Edit2: This is now verified via LLDB. We're using this implementation. With that one cleared up... Edit: As for SleepEx, that can be found here: VOID
PALAPI
Sleep(IN DWORD dwMilliseconds)
{
PERF_ENTRY(Sleep);
ENTRY("Sleep(dwMilliseconds=%u)\n", dwMilliseconds);
CPalThread * pThread = InternalGetCurrentThread();
PAL_ERROR palErr = InternalSleepEx(pThread, dwMilliseconds, FALSE);
if (NO_ERROR != palErr)
{
ERROR("Sleep(dwMilliseconds=%u) failed [error=%u]\n",
dwMilliseconds, palErr);
pThread->SetLastError(palErr);
}
LOGEXIT("Sleep returns VOID\n");
PERF_EXIT(Sleep);
} That function does indeed seem quite a bit more involved, especially the InternalSleepEx it references. This too is verified via LLDB. If we're having issues performing properly, I'm starting to guess this is the area causing problems. Edit 3: Looking even deeper into this, what we're looking at is a whole lot more than just a regular platform Thread.Sleep(). I'm going to give running this a try with PAL-tracing enabled and see just how deep the rabbit-whole goes. |
Depending on how ESX is set up (it is the hypervisor hosting Do you experience the same issues when running the tests on bare metal? On Mon, May 25, 2015 at 10:01 PM, Jostein Kjønigsen <
|
I honestly don't have any bare metal to test on, and I suspect it's a genuine code issue somewhere here, but that would be an interesting thing to test none the less. Anyone willing to give it a go? |
I downloaded FreeBSD 11.0 memstick img on my laptop from ftp://ftp.freebsd.org/pub/FreeBSD/snapshots/i386/i386/ISO-IMAGES/11.0/ (the 633MB one), but couldn't get too far with booting on bare metal. I must have done something wrong while following these steps: https://koitsu.wordpress.com/2009/11/03/writing-freebsd-memstick-img-to-a-usb-drive-in-windows/. I will try it on my desktop. Meanwhile perhaps a dedicated FreeBSD user from GitHub community can test it? @vitallium,@basarevych,@saper,@kvasdopil,@alexlehm |
@jasonwilliams200OK can you put the stuff you need to build on Windows (from |
There is no i386 CoreCLR support currently. |
Yup there is no i386 support yet as you can see https://github.com/dotnet/coreclr/wiki/Developer-Guide and https://github.com/dotnet/coreclr/issues/1078. Don't have the dev machine ready at the moment. @kangaroo, @josteink, would you guys please create a repo on GitHub and add a bin generated by I think in build.cmd here: https://github.com/dotnet/coreclr/blob/master/build.cmd#L34, there should be: :: despite of the similarities with Linux
if /i "%1" == "bsdmscorlib" (set __MscorlibOnly=1&set __BuildOS=BSD&shift&goto Arg_Loop) So we get to execute clear command: |
While true, it's not because it would be incredibly hard to implement. FreeBSD-support is currently being worked on, and amd64 was the obvious first target, since that's what most systems are running on. Unless I'm turning senile, getting i386 support implemented should be a fairly small task and mainly involves setting up the right register-mappings in PAL's context.h. But you know... It hasn't been a priority so far when basic amd64-support isn't fully landed. Edit: Checking build.sh I see there's no general support for 32-bit at all. There's @jasonwilliams200OK I can see what I can get done later on today when I get into the office and get all the grunt-work out of the way. |
True and even for Windows, the official stance: https://github.com/dotnet/coreclr/issues/1078#issuecomment-107718404, seems like only x64 is supported at the moment, while in JIT code I can see references of x86, x64, ARM32, ARM64, MIPS, MIPS64, SPARC, RISC, SuperH and some others. |
Sorry for creating confusion: in FreeBSD parlance 'i386' is 32-bit Intel CISC platform and 'amd64' is 64-bit AMD/Intel platform, so you need ftp://ftp.freebsd.org/pub/FreeBSD/snapshots/i386/i386/ISO-IMAGES/11.0/ which @jasonwilliams200OK mentioned is a 32-bit image, so it might not boot or it certainly won't run CoreCLR. I can test on native "amd64" (64-bit AMD/Intel64 platform) - even on a laptop I have here right now; the problem is I have no modern Windows with development stack anywhere, so it would be cool if somebody could just drop some binary components |
@saper I'm on it. Edit: Debug-build from http://invalid.ed.ntnu.no/~jostein/mscorlib.dll The rest of the files needed for |
@josteink Supporting x86 is not as simple as it might seem. Besides the PAL changes that should not be a rocket science, we also need a 32 bit jitter, which is not opensourced yet, a managed frames stack unwinding for 32 bit and a bunch of asm helpers. On 32 bit Windows, stack unwinding is done in a completely different way than on x64. Since the jitter needs to be compatible with the way managed frames are unwound, we will need to implement x86 unwinder on Unix that works the same way as on 32 bit Windows or change the jitter to generate say DWARF format instead and implement unwinder for that. |
@janvorli Thanks for the extensive feedback. I'll just go ahead and put "optimistic" on my resume then :) |
Thanks @josteink! (pls keep in mind that edits do not generate notifications in Github....) Which git commit hash should I try to compile? I follow https://github.com/dotnet/coreclr/blob/master/Documentation/freebsd-instructions.md with the current
which code should I be compiling? |
I used libunwind from the ports tree last time I set up a build. No patching should be needed. With the right dependencies in place, building coreclr from master should be fine. Make sure you install clang3.5 or newer (via PKG install). In the absence of that the build will attempt to use clang 3.4 even if the build is doomed to fail. |
Using clang35 from ports (it's 3.5.2 currently there). I'm on #CoreCLR on freenode, I think I sent you an invite (not to overload this bug) |
The output from the incremental build: https://gist.github.com/saper/eafc21d80b753511270d |
To not spam this thread with FreeBSD build-setup instructions, I've created a gitter-room for the occation. https://gitter.im/josteink/bsd-threading-talk @ajensenwaud You're wanted here :) |
Let me restart using a clean environment, will report back. |
So, on a bare metal I got:
https://gist.github.com/2d05ba56e1dfaed33f7c \o/ |
Alright I'd up the timeouts then |
@saper, is the josteink/runtime@0943519 change included? I'd be interested in seeing the actual values reported on bare metal, to know if this is just hypervisor load issues, or there's also remaining FreeBSD vs Linux differences. |
Nope, but I will give it a try |
This is what I got (the machine is not at all impressive):
Second time:
I just realized this is run in a jail (same OS, same architecture). |
No jail, no
This is
|
Ok so we can conclude that on proper iron, without a hypervisor, we are getting completely acceptable values here. Great job, @saper! If so the issue is a FreeBSD platform-specific on hypervisors in general and the solution should be to up the values or disable these specific tests for FreeBSD, then add the tests to the CI-build. Good to see issue getting solved :) |
The freebsd sleep test is failing intermittently on the CI host. See issue https://github.com/dotnet/coreclr/issues/1221 |
From #4255:
For FreeBSD 4 of the pal-tets are failing:
But looking at the output, all the failing tests are timing-based, and we can roughly say the tests fails because of deltas averagely about 10ms outside tolerance-values.
As you see repsectively 19, 17, 34, 27 milliseconds outside acceptable deltas. On another test-run only 3 of those tests fail for the same source.
This sort of error can easily be attributed to the specific server they run on. On another server these tests might pass just fine.
Looking even further into the Sleep2 test-case (which most others are referencing for their delta-values), we're clearly within reasonable range:
You have thread sleep for 2000ms, interrupt it after 1000ms, and verify that it didn't sleep the full 2000ms. And we're failing that test by using 1069ms instead og 1050ms.
I honestly think some of these tests should have their deltas reworked, but I'd love to have some second opinions. We're much closer to 1000 than 2000, so I'm not sure where a good tolerance-point would be.
Anyone have an opinion on what a canonical fix actually is here? The threading code comes from the platform and our ability to fine-tune it is limited. Should we adjust the deltas instead?
I'm open to anything.
cc: @akoeplinger @janvorli
The text was updated successfully, but these errors were encountered: