Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Associating threadpool queues with CPU cores. #18403

Closed
wants to merge 5 commits into from
Closed

Conversation

@VSadov
Copy link
Member

VSadov commented Jun 10, 2018

New implementation of ThreadPool work queues as lock-free queues softly associated with CPU cores.

The implementation avoids 1:1 strict mapping between work queues and working threads which is often a far greater number than the number of cores.

This implementation improves handling of cases where the number of workers needs to exceeds the number of cores. It avoids precipitous performance drops because:

  • the cost of snooping/stealing does not grow with the number of workers.
  • the local queues tend to be deeper (since there are fewer of them) and thus stealing is less frequent and less contentious with enqueuing/popping.
  • blocking or preempting a worker does not result in its entire work queue be only accessible through stealing.
  • reduced likelihood that the workitem will have to execute on a core different from where it was scheduled.

Fixes:#19088

Also may alleviate issues as described in http://labs.criteo.com/2018/10/net-threadpool-starvation-and-how-queuing-makes-it-worse/

@VSadov VSadov added the * NO MERGE * label Jun 10, 2018
@VSadov VSadov force-pushed the VSadov:queue branch 2 times, most recently from 0dc3949 to 7de443a Jun 10, 2018
@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Jun 16, 2018

@stephentoub

This comment has been minimized.

Copy link
Member

stephentoub commented Jun 16, 2018

@VSadov, what is the expected action item at this point with this PR?

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Jun 19, 2018

@stephentoub - nothing yet. I had an idea and tried to do some experiments.

The idea is that queue-per-core approach (vs. current queue-per-thread) could result in fewer cases of cross-core task stealing. Fewer queues would imply that we would not steal as much.
That may, in theory, improve overall use of core/node/socket caches.

However the change would have to permit occasionally concurrent pushes/pops of the tasks and it is a bit delicate to do that both correctly and efficiently.

@VSadov VSadov changed the title [WIP] Queue [WIP] Associate work-stealing threadpool queues with cores. Jun 20, 2018
@VSadov VSadov force-pushed the VSadov:queue branch from 703c467 to a19e8bc Jun 20, 2018
@VSadov VSadov force-pushed the VSadov:queue branch 8 times, most recently from e9f8baf to 392a802 Nov 16, 2018
@VSadov VSadov force-pushed the VSadov:queue branch 9 times, most recently from 94ef9cf to 1dec7bf Nov 24, 2018
@VSadov VSadov force-pushed the VSadov:queue branch from c812fb3 to d59fc70 Dec 3, 2018
@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Dec 5, 2018

On a small machine we are same or faster. Sometimes significantly faster.

== On 6-core HT (12 logical) Coffee Lake

New TP

Operations per second on 12 Cores
                                                                             Parallelism
QUWI Queue Local (TP)           11.863 M    21.482 M    63.511 M    65.463 M    70.178 M
- Depth    2                    19.658 M    60.078 M    74.544 M    74.749 M    79.322 M
- Depth   16                    88.052 M    89.848 M    89.070 M    88.829 M    88.211 M
- Depth   64                    91.341 M    90.969 M    91.594 M    90.574 M    90.517 M
- Depth  512                    88.284 M    82.610 M    85.012 M    85.156 M    86.239 M

Old TP (built locally from master)

Operations per second on 12 Cores
                                                                             Parallelism
QUWI Queue Local (TP)            1.627 M     6.544 M    42.433 M    59.153 M    63.583 M
- Depth    2                     5.980 M    14.980 M    53.432 M    63.621 M    65.425 M
- Depth   16                    41.942 M    52.816 M    67.832 M    71.683 M    66.542 M
- Depth   64                    63.310 M    66.954 M    74.178 M    73.454 M    68.344 M
- Depth  512                    67.548 M    72.194 M    73.018 M    73.141 M    68.743 M
@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Dec 5, 2018

Even better story on a manycore machine -

== On 24-core HT (48 logical) Ivy Bridge

New TP

Operations per second on 48 Cores
                                                                             Parallelism
QUWI Queue Local (TP)           12.372 M    17.483 M    33.804 M    33.435 M    35.208 M
- Depth    2                    19.574 M    24.258 M    36.337 M    35.919 M    36.351 M
- Depth   16                    36.980 M    38.494 M    38.410 M    38.359 M    38.524 M
- Depth   64                    38.625 M    38.216 M    38.721 M    38.443 M    38.134 M
- Depth  512                    36.077 M    36.006 M    36.221 M    36.406 M    35.633 M

Old TP (built locally from master)

Operations per second on 48 Cores
                                                                             Parallelism
QUWI Queue Local (TP)          477.418 k   494.027 k    12.201 M    28.761 M    30.615 M
- Depth    2                   494.151 k   532.484 k    21.288 M    31.291 M    31.246 M
- Depth   16                    13.096 M    20.896 M    31.108 M    33.691 M    32.046 M
- Depth   64                    26.644 M    28.623 M    32.680 M    33.431 M    32.176 M
- Depth  512                    28.531 M    29.024 M    29.314 M    29.353 M    28.808 M
@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Dec 5, 2018

These are all ideal cases - I am using @benaadams TP benchmark/stress app ( https://github.com/benaadams/ThreadPoolTaskTesting ) where tasks do not wait for anything, or do much work at all. Even GC is suspended while scenarios run.

In reality tasks do more work than that and often perform some measure of waiting (intentional or not). TP compensates by adding a few more workers.
The idea is to improve that scenario, but we want to be good on ideal/regular cases too so I use that as a measure.

@kouvel

This comment has been minimized.

Copy link
Member

kouvel commented Sep 30, 2019

There appears to be an issue with threads trampling on one another due to queue sharing. Repro:

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static class TaskTreeOrderTest
{
    private static readonly int ThreadCount = Environment.ProcessorCount * 2;
    private const byte TreeDepth = 8;
    private const byte ItemCountPerNode = 8;

    private static void Main()
    {
        ThreadPool.SetMinThreads(ThreadCount, ThreadCount);
        ThreadPool.SetMaxThreads(ThreadCount, ThreadCount);

        var threadArrayCapacity = (ThreadCount + 1) * 16;
        var taskLatencySquareSumsUs = new double[threadArrayCapacity];

        const int LatencyBucketCount = 6;
        var taskLatencyCountsBuckets = new int[LatencyBucketCount][];
        var taskLatencySumsUsBuckets = new double[LatencyBucketCount][];
        for (int b = 0; b < LatencyBucketCount; ++b)
        {
            taskLatencyCountsBuckets[b] = new int[threadArrayCapacity];
            taskLatencySumsUsBuckets[b] = new double[threadArrayCapacity];
        }

        (byte, byte) TaskCommon(object data)
        {
            var startTicksShifted = Stopwatch.GetTimestamp() << 16;
            var ulongData = (ulong)(long)data;
            var scheduleTicksShifted = (long)(ulongData & ~(ulong)0xffff);
            var latencyUs = scheduleTicksShifted == 0 ? 0 : TicksToUs((startTicksShifted - scheduleTicksShifted) >> 16);
            var depth = (byte)((ushort)ulongData >> 8);
            var itemCount = (byte)ulongData;

            if (scheduleTicksShifted != 0)
            {
                var threadArrayIndex = t_threadArrayIndex;
                if (threadArrayIndex == 0)
                    threadArrayIndex = InitializeThreadArrayIndex();

                int bucketIndex;
                if (latencyUs < 1)
                    bucketIndex = 0;
                else if (latencyUs < 10)
                    bucketIndex = 1;
                else if (latencyUs < 100)
                    bucketIndex = 2;
                else if (latencyUs < 1000)
                    bucketIndex = 3;
                else if (latencyUs < 10000)
                    bucketIndex = 4;
                else
                    bucketIndex = 5;

                ++taskLatencyCountsBuckets[bucketIndex][threadArrayIndex];
                taskLatencySumsUsBuckets[bucketIndex][threadArrayIndex] += latencyUs;
                taskLatencySquareSumsUs[threadArrayIndex] += latencyUs <= 1 ? 1 : latencyUs * latencyUs;
            }

            return (depth, itemCount);
        }

        // Simulate parallel test execution under directory tree
        Func<object, Task<bool>> searchStart = null;
        searchStart = async data =>
        {
            (var depth, var itemCount) = TaskCommon(data);
            Random rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random();

            if (depth == 0)
            {
                for (int i = 0; i < itemCount; ++i)
                {
                    // Simulate rare test failure
                    if (rng.Next(1_000_000) == 0)
                        return false;
                }
                return true;
            }

            Task<bool> processRemainingItemsTask = null;
            if (itemCount > 1)
                processRemainingItemsTask = ScheduleTask(searchStart, depth, (byte)(itemCount - 1));

            bool success = itemCount == 0 || await searchStart(PackParameters(0, (byte)(depth - 1), ItemCountPerNode));

            if (processRemainingItemsTask != null && !await processRemainingItemsTask)
                success = false;
            return success;
        };
        var scheduleTaskWorkItem = new ScheduleTaskWorkItem(searchStart);

        ScheduleGlobalWorkItem(scheduleTaskWorkItem);

        var sw = new Stopwatch();
        var sb = new StringBuilder();
        var taskLatencyCountDiffs = new int[LatencyBucketCount];
        var taskLatencySumDiffsUs = new double[LatencyBucketCount];
        var taskLatencyBucketNames = new string[]
        {
            "                <      1 us",
            "                <     10 us",
            "                <    100 us",
            "                <   1000 us",
            "                <  10000 us",
            "                >= 10000 us"
        };
        while (true)
        {
            double taskLatencySquareSumDiff = 0;
            for (int b = 0; b < LatencyBucketCount; ++b)
            {
                taskLatencyCountDiffs[b] = 0;
                taskLatencySumDiffsUs[b] = 0;
                for (int i = 16; i < threadArrayCapacity; i += 16)
                {
                    taskLatencyCountDiffs[b] -= taskLatencyCountsBuckets[b][i];
                    taskLatencySumDiffsUs[b] -= taskLatencySumsUsBuckets[b][i];
                    taskLatencySquareSumDiff -= taskLatencySquareSumsUs[i];
                }
            }

            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            for (int b = 0; b < LatencyBucketCount; ++b)
            {
                for (int i = 16; i < threadArrayCapacity; i += 16)
                {
                    taskLatencyCountDiffs[b] += taskLatencyCountsBuckets[b][i];
                    taskLatencySumDiffsUs[b] += taskLatencySumsUsBuckets[b][i];
                    taskLatencySquareSumDiff += taskLatencySquareSumsUs[i];
                }
            }

            var pendingWorkItemCount = ThreadPool.PendingWorkItemCount;

            double elapsedS = sw.Elapsed.TotalSeconds;
            for (int b = 0; b < LatencyBucketCount; ++b)
            {
                sb.AppendLine(
                    $"{taskLatencyBucketNames[b]} | " +
                    $"Count/s: {taskLatencyCountDiffs[b] / elapsedS,15:0.000} | " +
                    $"Average latency (us): {taskLatencySumDiffsUs[b] / taskLatencyCountDiffs[b],15:0.000}");
            }
            sb.AppendLine($"Biased average latency (us) | {Math.Sqrt(taskLatencySquareSumDiff / taskLatencyCountDiffs.Sum()),15:0.000}");
            sb.AppendLine($"                   Count/ms | {taskLatencyCountDiffs.Sum() / (elapsedS * 1000),15:0.000}");
            sb.AppendLine($"    Pending work item count | {pendingWorkItemCount,11}");
            Console.WriteLine(sb.ToString());
            sb.Clear();
        }
    }

    private static int s_previousThreadArrayIndex;

    [ThreadStatic]
    private static int t_threadArrayIndex;

    [ThreadStatic]
    private static Random t_rng;

    private static readonly double TicksPerUs = Stopwatch.Frequency / (double)1_000_000;
    private static double TicksToUs(long ticks) => ticks / TicksPerUs;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int InitializeThreadArrayIndex()
    {
        if (t_threadArrayIndex != 0)
            throw new InvalidOperationException();
        int i = Interlocked.Increment(ref s_previousThreadArrayIndex) * 16;
        t_threadArrayIndex = i;
        return i;
    }

    private static object PackParameters(long ticks, byte depth, byte itemCount) =>
        (ticks << 16) | (ushort)((ushort)depth << 8) | itemCount;

    private static Task<bool> ScheduleTask(Func<object, Task<bool>> func, byte depth, byte itemCount)
    {
        var t = new Task<Task<bool>>(func, PackParameters(Stopwatch.GetTimestamp(), depth, itemCount));
        t.Start();
        return t.Unwrap();
    }

    private static void ScheduleGlobalWorkItem(IThreadPoolWorkItem workItem) =>
        ThreadPool.UnsafeQueueUserWorkItem(workItem, preferLocal: false);

    private sealed class ScheduleTaskWorkItem : IThreadPoolWorkItem
    {
        private Func<object, Task<bool>> _taskStart;

        public ScheduleTaskWorkItem(Func<object, Task<bool>> taskStart) => _taskStart = taskStart;

        async void IThreadPoolWorkItem.Execute()
        {
            var startTicks = Stopwatch.GetTimestamp();
            await ScheduleTask(_taskStart, TreeDepth, ItemCountPerNode);
            var endTicks = Stopwatch.GetTimestamp();
            Console.WriteLine(
                $"-------------------------------- " +
                $"{TicksToUs(endTicks - startTicks) / 1000,15:0.000 ms} " +
                $"--------------------------------{Environment.NewLine}");

            ScheduleGlobalWorkItem(this);
        }
    }
}

I tried to sort of simulate parallel test execution with a depth-first search through simulated directories, or just a general depth-first search. Upon visiting a node in the tree (a directory), a task is scheduled to process the remaining children at that depth while the visitor processes one child in depth-first fashion. Each of those tasks would again schedule another task to process the remaining children while processing one child.

Before:

--------------------------------     4002.962 ms --------------------------------

                <      1 us | Count/s:       24744.556 | Average latency (us):           0.889
                <     10 us | Count/s:     3690695.825 | Average latency (us):           1.116
                <    100 us | Count/s:      518347.648 | Average latency (us):          20.164
                <   1000 us | Count/s:        8430.773 | Average latency (us):         648.340
                <  10000 us | Count/s:        5314.880 | Average latency (us):        4125.855
                >= 10000 us | Count/s:        1182.849 | Average latency (us):       20465.543
Biased average latency (us) |        1792.411
                   Count/ms |        4248.717
    Pending work item count |          75

                <      1 us | Count/s:       28780.910 | Average latency (us):           0.885
                <     10 us | Count/s:     3673167.601 | Average latency (us):           1.116
                <    100 us | Count/s:      515960.725 | Average latency (us):          20.128
                <   1000 us | Count/s:        8632.307 | Average latency (us):         647.784
                <  10000 us | Count/s:        5121.193 | Average latency (us):        4286.862
                >= 10000 us | Count/s:        1437.080 | Average latency (us):       25437.611
Biased average latency (us) |        1939.106
                   Count/ms |        4233.100
    Pending work item count |          74

                <      1 us | Count/s:       23358.023 | Average latency (us):           0.890
                <     10 us | Count/s:     3686806.194 | Average latency (us):           1.115
                <    100 us | Count/s:      517698.654 | Average latency (us):          20.140
                <   1000 us | Count/s:        8531.761 | Average latency (us):         656.063
                <  10000 us | Count/s:        5309.087 | Average latency (us):        4293.581
                >= 10000 us | Count/s:        1371.351 | Average latency (us):       33306.076
Biased average latency (us) |        4380.721
                   Count/ms |        4243.075
    Pending work item count |          75

                <      1 us | Count/s:       29220.008 | Average latency (us):           0.886
                <     10 us | Count/s:     3641870.549 | Average latency (us):           1.117
                <    100 us | Count/s:      511642.685 | Average latency (us):          20.150
                <   1000 us | Count/s:        8411.190 | Average latency (us):         651.754
                <  10000 us | Count/s:        5158.572 | Average latency (us):        4226.089
                >= 10000 us | Count/s:        1497.394 | Average latency (us):       25436.375
Biased average latency (us) |        2015.481
                   Count/ms |        4197.800
    Pending work item count |          75

                <      1 us | Count/s:       22742.143 | Average latency (us):           0.890
                <     10 us | Count/s:     3673435.542 | Average latency (us):           1.115
                <    100 us | Count/s:      515746.550 | Average latency (us):          20.128
                <   1000 us | Count/s:        8444.148 | Average latency (us):         660.146
                <  10000 us | Count/s:        5257.603 | Average latency (us):        4224.866
                >= 10000 us | Count/s:        1395.841 | Average latency (us):       32506.831
Biased average latency (us) |        4324.338
                   Count/ms |        4227.022
    Pending work item count |          73

                <      1 us | Count/s:       30626.793 | Average latency (us):           0.884
                <     10 us | Count/s:     3651167.078 | Average latency (us):           1.117
                <    100 us | Count/s:      513081.858 | Average latency (us):          20.141
                <   1000 us | Count/s:        8372.721 | Average latency (us):         657.792
                <  10000 us | Count/s:        5176.332 | Average latency (us):        4347.262
                >= 10000 us | Count/s:        1410.288 | Average latency (us):       28624.315
Biased average latency (us) |        2984.782
                   Count/ms |        4209.835
    Pending work item count |          79

                <      1 us | Count/s:       26408.640 | Average latency (us):           0.885
                <     10 us | Count/s:     3674182.513 | Average latency (us):           1.116
                <    100 us | Count/s:      515831.268 | Average latency (us):          20.143
                <   1000 us | Count/s:        8445.039 | Average latency (us):         658.852
                <  10000 us | Count/s:        5397.430 | Average latency (us):        4231.808
                >= 10000 us | Count/s:        1389.615 | Average latency (us):       26672.509
Biased average latency (us) |        2633.646
                   Count/ms |        4231.655
    Pending work item count |          69

                <      1 us | Count/s:       32783.746 | Average latency (us):           0.881
                <     10 us | Count/s:     3628520.909 | Average latency (us):           1.119
                <    100 us | Count/s:      509873.106 | Average latency (us):          20.153
                <   1000 us | Count/s:        8482.784 | Average latency (us):         648.593
                <  10000 us | Count/s:        5202.827 | Average latency (us):        4228.868
                >= 10000 us | Count/s:        1444.131 | Average latency (us):       23712.969
Biased average latency (us) |        2018.200
                   Count/ms |        4186.308
    Pending work item count |          61

--------------------------------     3975.624 ms --------------------------------

Highest latencies, pending work item count, and throughput are somewhat stable.

After (without prototype fixes above):

--------------------------------     6353.504 ms --------------------------------

                <      1 us | Count/s:      417426.511 | Average latency (us):           0.714
                <     10 us | Count/s:     1946215.729 | Average latency (us):           1.273
                <    100 us | Count/s:      259480.021 | Average latency (us):          20.984
                <   1000 us | Count/s:       13478.388 | Average latency (us):         336.676
                <  10000 us | Count/s:       18187.303 | Average latency (us):        4782.717
                >= 10000 us | Count/s:       75027.276 | Average latency (us):      117167.900
Biased average latency (us) |       75911.307
                   Count/ms |        2729.815
    Pending work item count |        7276

                <      1 us | Count/s:      402144.501 | Average latency (us):           0.714
                <     10 us | Count/s:     1889516.783 | Average latency (us):           1.267
                <    100 us | Count/s:      256847.026 | Average latency (us):          20.721
                <   1000 us | Count/s:        9422.161 | Average latency (us):         345.595
                <  10000 us | Count/s:        8353.205 | Average latency (us):        4336.369
                >= 10000 us | Count/s:       70433.135 | Average latency (us):      162822.743
Biased average latency (us) |       74668.565
                   Count/ms |        2636.717
    Pending work item count |       26622

                <      1 us | Count/s:      410443.218 | Average latency (us):           0.720
                <     10 us | Count/s:     1916025.762 | Average latency (us):           1.269
                <    100 us | Count/s:      259114.694 | Average latency (us):          20.735
                <   1000 us | Count/s:       10926.144 | Average latency (us):         347.645
                <  10000 us | Count/s:        9353.543 | Average latency (us):        4193.634
                >= 10000 us | Count/s:       72095.373 | Average latency (us):      313375.126
Biased average latency (us) |      138522.198
                   Count/ms |        2677.959
    Pending work item count |       36676

                <      1 us | Count/s:      412860.031 | Average latency (us):           0.712
                <     10 us | Count/s:     1937285.838 | Average latency (us):           1.268
                <    100 us | Count/s:      261069.548 | Average latency (us):          20.960
                <   1000 us | Count/s:       10783.635 | Average latency (us):         343.608
                <  10000 us | Count/s:        9698.128 | Average latency (us):        4446.942
                >= 10000 us | Count/s:       75475.525 | Average latency (us):      438130.779
Biased average latency (us) |      203218.412
                   Count/ms |        2707.173
    Pending work item count |       59381

                <      1 us | Count/s:      432722.150 | Average latency (us):           0.717
                <     10 us | Count/s:     2030056.099 | Average latency (us):           1.266
                <    100 us | Count/s:      275790.794 | Average latency (us):          21.157
                <   1000 us | Count/s:       10886.067 | Average latency (us):         381.148
                <  10000 us | Count/s:       11660.014 | Average latency (us):        4394.552
                >= 10000 us | Count/s:       70446.731 | Average latency (us):      538562.356
Biased average latency (us) |      231838.804
                   Count/ms |        2831.562
    Pending work item count |       43905

                <      1 us | Count/s:      419699.905 | Average latency (us):           0.718
                <     10 us | Count/s:     1789718.807 | Average latency (us):           1.289
                <    100 us | Count/s:      240757.521 | Average latency (us):          21.130
                <   1000 us | Count/s:       10019.027 | Average latency (us):         359.446
                <  10000 us | Count/s:       10616.826 | Average latency (us):        4630.261
                >= 10000 us | Count/s:       69069.345 | Average latency (us):      527723.084
Biased average latency (us) |      246489.303
                   Count/ms |        2539.881
    Pending work item count |       44743

                <      1 us | Count/s:      410651.545 | Average latency (us):           0.723
                <     10 us | Count/s:     1910855.099 | Average latency (us):           1.269
                <    100 us | Count/s:      257608.178 | Average latency (us):          21.061
                <   1000 us | Count/s:       10239.421 | Average latency (us):         345.401
                <  10000 us | Count/s:       10227.845 | Average latency (us):        4635.634
                >= 10000 us | Count/s:       75212.574 | Average latency (us):      655883.153
Biased average latency (us) |      312401.286
                   Count/ms |        2674.795
    Pending work item count |       44590

                <      1 us | Count/s:      460119.029 | Average latency (us):           0.719
                <     10 us | Count/s:     1791300.033 | Average latency (us):           1.297
                <    100 us | Count/s:      239890.947 | Average latency (us):          20.967
                <   1000 us | Count/s:       10326.240 | Average latency (us):         363.193
                <  10000 us | Count/s:       10762.308 | Average latency (us):        4416.084
                >= 10000 us | Count/s:       74089.679 | Average latency (us):      591813.365
Biased average latency (us) |      293369.118
                   Count/ms |        2586.488
    Pending work item count |       42336

                <      1 us | Count/s:      433782.832 | Average latency (us):           0.724
                <     10 us | Count/s:     1812622.439 | Average latency (us):           1.285
                <    100 us | Count/s:      243260.597 | Average latency (us):          21.002
                <   1000 us | Count/s:        9944.413 | Average latency (us):         343.540
                <  10000 us | Count/s:        9971.728 | Average latency (us):        4363.041
                >= 10000 us | Count/s:       75971.258 | Average latency (us):      606262.814
Biased average latency (us) |      303235.668
                   Count/ms |        2585.553
    Pending work item count |       39541

                <      1 us | Count/s:      435765.082 | Average latency (us):           0.718
                <     10 us | Count/s:     1915851.053 | Average latency (us):           1.276
                <    100 us | Count/s:      257271.279 | Average latency (us):          21.098
                <   1000 us | Count/s:       11111.863 | Average latency (us):         338.808
                <  10000 us | Count/s:       12500.606 | Average latency (us):        4467.817
                >= 10000 us | Count/s:       75599.910 | Average latency (us):      540766.188
Biased average latency (us) |      259995.255
                   Count/ms |        2708.100
    Pending work item count |       32057

                <      1 us | Count/s:      411993.816 | Average latency (us):           0.728
                <     10 us | Count/s:     1821767.465 | Average latency (us):           1.276
                <    100 us | Count/s:      241291.569 | Average latency (us):          20.905
                <   1000 us | Count/s:       10559.307 | Average latency (us):         342.365
                <  10000 us | Count/s:       13435.008 | Average latency (us):        4798.244
                >= 10000 us | Count/s:       77001.172 | Average latency (us):      419577.170
Biased average latency (us) |      241865.034
                   Count/ms |        2576.048
    Pending work item count |       22061

                <      1 us | Count/s:      338189.329 | Average latency (us):           0.713
                <     10 us | Count/s:     1924425.142 | Average latency (us):           1.252
                <    100 us | Count/s:      254636.372 | Average latency (us):          20.816
                <   1000 us | Count/s:       11922.714 | Average latency (us):         340.422
                <  10000 us | Count/s:       12791.814 | Average latency (us):        4411.685
                >= 10000 us | Count/s:       73628.780 | Average latency (us):      453434.557
Biased average latency (us) |      262407.245
                   Count/ms |        2615.594
    Pending work item count |       12040

--------------------------------     6309.281 ms --------------------------------

Probably due to TryPop giving up on contention, the pending work item counts and highest latencies are much higher, and throughput is lower.

With prototype fixes above:

--------------------------------     8338.280 ms --------------------------------

                <      1 us | Count/s:       31033.630 | Average latency (us):           0.872
                <     10 us | Count/s:     1794530.873 | Average latency (us):           1.386
                <    100 us | Count/s:      240720.058 | Average latency (us):          18.476
                <   1000 us | Count/s:       16525.944 | Average latency (us):         216.844
                <  10000 us | Count/s:        2481.911 | Average latency (us):        2582.900
                >= 10000 us | Count/s:         672.103 | Average latency (us):       30689.792
Biased average latency (us) |        1777.863
                   Count/ms |        2085.965
    Pending work item count |          44

                <      1 us | Count/s:       28052.525 | Average latency (us):           0.878
                <     10 us | Count/s:     1754760.278 | Average latency (us):           1.419
                <    100 us | Count/s:      233681.615 | Average latency (us):          18.228
                <   1000 us | Count/s:       17659.677 | Average latency (us):         202.853
                <  10000 us | Count/s:        2552.527 | Average latency (us):        2491.033
                >= 10000 us | Count/s:         767.897 | Average latency (us):       33670.568
Biased average latency (us) |        2520.193
                   Count/ms |        2037.475
    Pending work item count |          40

                <      1 us | Count/s:       25902.901 | Average latency (us):           0.867
                <     10 us | Count/s:     1572976.793 | Average latency (us):           1.463
                <    100 us | Count/s:      207367.154 | Average latency (us):          18.124
                <   1000 us | Count/s:       17092.880 | Average latency (us):         194.710
                <  10000 us | Count/s:        2332.350 | Average latency (us):        2241.505
                >= 10000 us | Count/s:         927.883 | Average latency (us):       52474.493
Biased average latency (us) |        6152.295
                   Count/ms |        1826.600
    Pending work item count |         693

                <      1 us | Count/s:       10454.743 | Average latency (us):           0.881
                <     10 us | Count/s:     1264116.532 | Average latency (us):           1.503
                <    100 us | Count/s:      164797.055 | Average latency (us):          17.822
                <   1000 us | Count/s:       14948.978 | Average latency (us):         182.101
                <  10000 us | Count/s:        2030.974 | Average latency (us):        2176.157
                >= 10000 us | Count/s:         691.271 | Average latency (us):      128948.315
Biased average latency (us) |       12681.084
                   Count/ms |        1457.040
    Pending work item count |         110

                <      1 us | Count/s:       14041.509 | Average latency (us):           0.874
                <     10 us | Count/s:     1250482.270 | Average latency (us):           1.526
                <    100 us | Count/s:      162579.544 | Average latency (us):          17.785
                <   1000 us | Count/s:       15356.745 | Average latency (us):         178.082
                <  10000 us | Count/s:        2009.821 | Average latency (us):        2070.270
                >= 10000 us | Count/s:         679.020 | Average latency (us):      105140.977
Biased average latency (us) |       13544.594
                   Count/ms |        1445.149
    Pending work item count |          92

                <      1 us | Count/s:       14060.226 | Average latency (us):           0.880
                <     10 us | Count/s:     1282035.069 | Average latency (us):           1.485
                <    100 us | Count/s:      168090.426 | Average latency (us):          17.954
                <   1000 us | Count/s:       14575.788 | Average latency (us):         187.046
                <  10000 us | Count/s:        1999.988 | Average latency (us):        2328.801
                >= 10000 us | Count/s:         678.984 | Average latency (us):       80702.822
Biased average latency (us) |       13081.498
                   Count/ms |        1481.440
    Pending work item count |          87

                <      1 us | Count/s:       18048.164 | Average latency (us):           0.869
                <     10 us | Count/s:     1263674.981 | Average latency (us):           1.512
                <    100 us | Count/s:      164876.993 | Average latency (us):          17.894
                <   1000 us | Count/s:       14956.806 | Average latency (us):         181.926
                <  10000 us | Count/s:        1988.275 | Average latency (us):        2137.143
                >= 10000 us | Count/s:         762.626 | Average latency (us):       72133.323
Biased average latency (us) |       14602.249
                   Count/ms |        1464.308
    Pending work item count |          80

                <      1 us | Count/s:       14344.843 | Average latency (us):           0.882
                <     10 us | Count/s:     1297946.140 | Average latency (us):           1.472
                <    100 us | Count/s:      170760.335 | Average latency (us):          17.997
                <   1000 us | Count/s:       14344.843 | Average latency (us):         186.100
                <  10000 us | Count/s:        1997.482 | Average latency (us):        2255.822
                >= 10000 us | Count/s:         615.809 | Average latency (us):       56198.016
Biased average latency (us) |        7941.093
                   Count/ms |        1500.009
    Pending work item count |          78

                <      1 us | Count/s:       18135.857 | Average latency (us):           0.874
                <     10 us | Count/s:     1295835.310 | Average latency (us):           1.484
                <    100 us | Count/s:      170143.565 | Average latency (us):          17.983
                <   1000 us | Count/s:       14645.274 | Average latency (us):         183.720
                <  10000 us | Count/s:        1965.156 | Average latency (us):        2245.704
                >= 10000 us | Count/s:         616.786 | Average latency (us):       85971.684
Biased average latency (us) |       17152.553
                   Count/ms |        1501.342
    Pending work item count |          72

                <      1 us | Count/s:       14835.312 | Average latency (us):           0.880
                <     10 us | Count/s:     1305744.777 | Average latency (us):           1.471
                <    100 us | Count/s:      171723.838 | Average latency (us):          18.004
                <   1000 us | Count/s:       14387.820 | Average latency (us):         184.533
                <  10000 us | Count/s:        1953.397 | Average latency (us):        2254.579
                >= 10000 us | Count/s:         690.693 | Average latency (us):       66026.951
Biased average latency (us) |       14296.471
                   Count/ms |        1509.336
    Pending work item count |          70

                <      1 us | Count/s:       15208.036 | Average latency (us):           0.874
                <     10 us | Count/s:     1275724.883 | Average latency (us):           1.506
                <    100 us | Count/s:      166591.457 | Average latency (us):          17.882
                <   1000 us | Count/s:       15058.136 | Average latency (us):         178.179
                <  10000 us | Count/s:        2001.262 | Average latency (us):        2110.332
                >= 10000 us | Count/s:         673.577 | Average latency (us):      143445.854
Biased average latency (us) |       32582.389
                   Count/ms |        1475.257
    Pending work item count |          65

                <      1 us | Count/s:        9989.831 | Average latency (us):           0.872
                <     10 us | Count/s:     1273969.859 | Average latency (us):           1.508
                <    100 us | Count/s:      165889.917 | Average latency (us):          17.829
                <   1000 us | Count/s:       15177.233 | Average latency (us):         177.097
                <  10000 us | Count/s:        2016.243 | Average latency (us):        2022.399
                >= 10000 us | Count/s:         620.233 | Average latency (us):      144251.228
Biased average latency (us) |       33408.934
                   Count/ms |        1467.663
    Pending work item count |          54

                <      1 us | Count/s:       12826.037 | Average latency (us):           0.877
                <     10 us | Count/s:     1274746.564 | Average latency (us):           1.504
                <    100 us | Count/s:      166355.066 | Average latency (us):          17.859
                <   1000 us | Count/s:       15007.826 | Average latency (us):         179.232
                <  10000 us | Count/s:        2103.937 | Average latency (us):        2414.525
                >= 10000 us | Count/s:         671.469 | Average latency (us):       93834.215
Biased average latency (us) |       25678.577
                   Count/ms |        1471.711
    Pending work item count |          53

                <      1 us | Count/s:       15289.610 | Average latency (us):           0.866
                <     10 us | Count/s:     1252234.396 | Average latency (us):           1.538
                <    100 us | Count/s:      162216.941 | Average latency (us):          17.727
                <   1000 us | Count/s:       15583.379 | Average latency (us):         172.501
                <  10000 us | Count/s:        2108.912 | Average latency (us):        2154.295
                >= 10000 us | Count/s:         704.268 | Average latency (us):      166529.897
Biased average latency (us) |       43510.814
                   Count/ms |        1448.138
    Pending work item count |          47

                <      1 us | Count/s:       11714.624 | Average latency (us):           0.882
                <     10 us | Count/s:     1254750.374 | Average latency (us):           1.532
                <    100 us | Count/s:      162777.449 | Average latency (us):          17.695
                <   1000 us | Count/s:       15639.596 | Average latency (us):         172.981
                <  10000 us | Count/s:        2086.965 | Average latency (us):        2126.463
                >= 10000 us | Count/s:         653.514 | Average latency (us):       66590.378
Biased average latency (us) |       18344.391
                   Count/ms |        1447.623
    Pending work item count |          41

                <      1 us | Count/s:       16515.245 | Average latency (us):           0.870
                <     10 us | Count/s:     1304474.830 | Average latency (us):           1.471
                <    100 us | Count/s:      171368.027 | Average latency (us):          18.024
                <   1000 us | Count/s:       14352.627 | Average latency (us):         186.433
                <  10000 us | Count/s:        1960.359 | Average latency (us):        2201.218
                >= 10000 us | Count/s:         628.171 | Average latency (us):       71403.866
Biased average latency (us) |       19161.381
                   Count/ms |        1509.299
    Pending work item count |          43

                <      1 us | Count/s:       11235.325 | Average latency (us):           0.881
                <     10 us | Count/s:     1259978.141 | Average latency (us):           1.525
                <    100 us | Count/s:      163662.728 | Average latency (us):          17.743
                <   1000 us | Count/s:       15477.531 | Average latency (us):         174.308
                <  10000 us | Count/s:        2127.917 | Average latency (us):        2182.331
                >= 10000 us | Count/s:         663.879 | Average latency (us):       45405.471
Biased average latency (us) |        3844.446
                   Count/ms |        1453.146
    Pending work item count |          44

                <      1 us | Count/s:       12325.176 | Average latency (us):           0.882
                <     10 us | Count/s:     1275652.806 | Average latency (us):           1.502
                <    100 us | Count/s:      166655.447 | Average latency (us):          17.911
                <   1000 us | Count/s:       14881.653 | Average latency (us):         179.481
                <  10000 us | Count/s:        2037.010 | Average latency (us):        2144.985
                >= 10000 us | Count/s:         589.507 | Average latency (us):       57906.353
Biased average latency (us) |        6051.593
                   Count/ms |        1472.142
    Pending work item count |          45

                <      1 us | Count/s:        9830.693 | Average latency (us):           0.877
                <     10 us | Count/s:     1253830.721 | Average latency (us):           1.535
                <    100 us | Count/s:      162351.382 | Average latency (us):          17.708
                <   1000 us | Count/s:       15667.242 | Average latency (us):         172.310
                <  10000 us | Count/s:        2118.667 | Average latency (us):        2124.627
                >= 10000 us | Count/s:         630.347 | Average latency (us):       45496.248
Biased average latency (us) |        3404.621
                   Count/ms |        1444.429
    Pending work item count |          39

                <      1 us | Count/s:       11209.796 | Average latency (us):           0.878
                <     10 us | Count/s:     1271428.619 | Average latency (us):           1.511
                <    100 us | Count/s:      165429.121 | Average latency (us):          17.800
                <   1000 us | Count/s:       15231.082 | Average latency (us):         176.338
                <  10000 us | Count/s:        2031.070 | Average latency (us):        2141.567
                >= 10000 us | Count/s:         678.969 | Average latency (us):       50736.966
Biased average latency (us) |        5856.000
                   Count/ms |        1466.009
    Pending work item count |          35

                <      1 us | Count/s:       10319.488 | Average latency (us):           0.878
                <     10 us | Count/s:     1238566.167 | Average latency (us):           1.554
                <    100 us | Count/s:      159880.073 | Average latency (us):          17.628
                <   1000 us | Count/s:       16004.545 | Average latency (us):         169.078
                <  10000 us | Count/s:        2136.274 | Average latency (us):        2085.964
                >= 10000 us | Count/s:         630.376 | Average latency (us):       41289.099
Biased average latency (us) |        3409.634
                   Count/ms |        1427.537
    Pending work item count |          34

--------------------------------    10938.968 ms --------------------------------

Pending work item count has stabilized somewhat, but still spikes frequently due to the issue (not always visible in the output). Highest latencies still worse than baseline. Throughput is lower still. I believe due to thread assignment to queues being balanced on proc ID changes, there is even more thread movement between queues and more cross-trampling of work streams.

In the baseline one thread's perspective would be one task scheduled per depth while going down the tree on one path. When a leaf is reached, the thread would pop the most recently scheduled work item (at the lowest depth) and continue processing that top-level branch of the tree in depth-first order. Meanwhile another thread would steal the oldest item (a high-depth node) and mostly independently process another top-level branch, and so on for the remaining threads.

With this change when there are more threads than the processor count, consider two threads T0 and T1 operating on one queue. Suppose T0 is going down the tree on one path and added a few work items one for each depth. T1 would initially steal an old item (a top-level high-depth node) and start processing that similarly to the baseline. However, while T0 is happily processing one top-level branch, T1 tramples on it by scheduling a bunch of work items for the other top-level branch. T0 gets detracted from its work stream and is forced to work on T1's payload. So the search is still mostly depth-first but at times it becomes a bit more breadth-first. This keeps happening from time to time, and every time it happens it causes work items that are already in their shared local queue to get backed up. I'm not sure exactly why it's slower, I had expected equivalent throughput, bad latency, and unexpected order of execution (which can show up in the UI if this was actual test execution).

The issue also exists on Linux and numbers are somewhat similar. The issue mostly disappears with the prototype fixes when ThreadCount == Environment.ProcessorCount, but then there's also hill climbing.

It's quite possible I did something wrong in the test as well, so please do double-check it, but I think the theory at least is valid.

I'm not sure there is a good way to solve that problem with a shared queue model. It may help in this particular case to always steal items in LIFO order but that would not solve the problem and I don't think that's the right solution. As far as I can tell the best way to solve this problem is to not allow threads to trample on one another by scheduling and taking work items from the same queue such that their work stream occurs in natural order and is not detracted, which would imply a thread-local queue model.

Perhaps a better model would be to use thread-local queues and to associate them with a proc ID (along with balancing thread assignments, checking proc ID frequently, etc., which I still feel is a bit unfortunate but may be necessary to solve some issues). Such that only the thread can ever add items to its own queue, but other threads associated with the same proc ID can take items from it in LIFO order before going to the global queue. Work items coming in from outside the thread pool could be spread out among processors in some uniform fashion, as inserting them at the opposite end of the queue should not trample on the thread's work stream. It feels to me like that could be implemented efficiently.

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Sep 30, 2019

You are describing a "replicating task" scenario. It generally benefits from LIFO order. How much is the benefit - depends on the actual pattern of replication. Some patterns benefit more, some less, but generally LIFO is helpful and that is why local queues are LIFO.

With shared queues the LIFO property should hold statistically, but forcing a lot of context switches would weaken it.
There should be no cliff though. The more "task mixing" you can force, the further you will get from LIFO order and from the optimal scheduling strategy for this task pattern. I think this test demonstrates exactly that.

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 1, 2019

Re: Parallel.For and thread starvation due to blocking Wait:

  • good catch on position == deqPosition in TryRemove - that can and should be handled. Proposed fix looks good.
  • scanning other segments/queues - I guess we can. Generally we will find the item in the current (and the only) segment, so the additional search path should be rarely executed.
    We should have some upper bound though on how far we search (in terms of items). Otherwise TryRemove has unbounded cost.
    Wait is expensive, but not infinitely expensive and the further we search the more likely that the task was just stolen and may already be finished.

Ultimately, the starvation issue is because there is a blocking wait on a task in a task. It is generally a dangerous thing to do since you start depending on scheduling order. It is not hard to fix in a more general way though. (I had very good results here: #26037 and that is just one direction to consider).

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 1, 2019

Actually, after thinking about this more, I think it is reasonable to search all local queues as long as task has not yet dequeued (which task can tell). With that condition it is not as "unbounded". We would have to wait until someone else dequeues it anyways, so we could as well keep scanning.

I will take proposed TryRemove changes.
Also will have to rebase since we seem to have merge conflicts with master.


In reply to: 536803228 [](ancestors = 536803228)

@kouvel

This comment has been minimized.

Copy link
Member

kouvel commented Oct 1, 2019

With shared queues the LIFO property should hold statistically, but forcing a lot of context switches would weaken it.

Technically, work is being done in LIFO order, but in effect a single thread is ending up processing somewhat random nodes in the tree or subtrees, along with work item counts in local queues getting backed up, which is also not ideal. These issues are not so much caused by context switching but because of sharing queues between two threads.

I get that the proposed model in the paper would get benefits from having a processor run a set of (related) work items as long as possible and in a reasonable order to get cache and other benefits, but it only works well when it can be guaranteed that there is only one thread at any given time that is processing work from and inserting work into a queue (except for cases where a work item stalls, but including when a stalled work item is unblocked, in that it cannot run concurrently with another thread that is processing the same queue with a work item in flight). This would be easier to do when there is full control over scheduling. The model is also hurt by processor IDs changing. I don't think the model works well when there are multiple threads operating on the same queue, or when thread-queue assignments can be frequently incorrect.

There should be no cliff though.

The perf cliff appears to be at 9-16 threads on my 4-core 8-thread machine. Not sure yet why throughput is worse after my fixes above, but I tried ensuring that a thread's current local queue does not change until the next work item and it didn't seem to help. But anyway, throughput isn't the only problem.

Ultimately, the starvation issue is because there is a blocking wait on a task in a task. It is generally a dangerous thing to do since you start depending on scheduling order.

I know, but there are comments in the Parallel implementation indicating that it is very important to be able to remove a task from the local queue, so I take from that, that it is done and perhaps not as uncommon as I would have thought. Would be nice to have Parallel.ForAsync equivalents that wouldn't have that problem.

Side note, on my 4-core 8-thread machine it looks like proc ID assignments are unstable only between 5-8 concurrently running threads. Between 1-4 threads and at 9+ threads the IDs are reasonably stable at least for 40-50 ms on average. So maybe something to do with Windows and hyperthreading.

@kouvel

This comment has been minimized.

Copy link
Member

kouvel commented Oct 1, 2019

There could also be processor cache issues due to the issue above. Although this test case doesn't explicitly cover that, when a small subtree of work items access similar memory locations, processing items in different or somewhat random orders can negatively affect the processor cache.

@VSadov VSadov force-pushed the VSadov:queue branch from c214ac5 to c66539b Oct 2, 2019
@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 2, 2019

I have pushed a change so that TryRemove scans all local queues as long as the task is still inlineable.

I get roughly the same results as with master on the Parallel.For stressing sample.

@VSadov VSadov force-pushed the VSadov:queue branch from c66539b to 9091af3 Oct 2, 2019
@kouvel

This comment has been minimized.

Copy link
Member

kouvel commented Oct 2, 2019

By now I think I've at least convinced myself that I'm not willing to go ahead with this PR (more specifically this particular design change proposal), even with fixes to the above issues, because I don't think this is the best design to tackle. The current design is not ideal and certainly can be improved with some of the ideas from this proposal and the referenced paper, and I think at least those options should be explored seriously before settling on a design choice. I think there is merit to some of the ideas here and I think they can be implemented in a way that fits better. Of course, this is only my opinion, others may disagree.

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 3, 2019

First - I am very grateful for the validation that you have done. Thanks!!!

Now - I disagree with your conclusion ... obviously :-)
I would not be pushing this design forward if I did not believe it is the most promising direction that we may take. I will try convincing you.

--
Your premise is basically that threads change cores frequently.
I believe that in the given context it so rare that it may be considered a “freaky accident”.
Tasks scheduling operations take microseconds. (measurements in this PR show that it is very possible for a task to both enque and deque in under 1us).
OS time slices are measured in milliseconds. That is 3 orders of magnitude difference. Switching just can’t happen “frequently”. And then OS will not try to shuffle threads at first opportunity.

We make assumption about core jumping being infrequent in a lot of places - in GC, in Concurrent Collections, in Array Pool, in basically all the places where we associate data with cores or with threads.
If the benchmarks had any significant memory access, it would be impacted by core changes as well.

Thread-associated data could be more tricky in this respect. When a thread happens to change its core, in core-associated case we will switch to a different piece of data associated with that core. There would be a delay due to core ID caching, but we will notice the core change soon enough and switch.
With thread-associated data we would continue using data that is now in the cache of a remote core and causes cache misses.
In practice, this is not a huge concern. Again because at microsecond level changing core is a rare event.

--
Regarding test strategies that oversubscribe to x4 x8 CoreNum threads in a compute-intense scenarios -

I welcome that! These are awesome scenarios and the pool must handle these reasonably well. However, I think we should not aggressively optimize for that. We should definitely address crazy anomalies - lock ups, crashes, excessive slowdowns..., but a gradual degrade in a compute scenario as more threads are introduced is probably ok.
Users do oversubscribe in IO-heavy cases, but I think compute-intensive scenarios generally run around CoreNum threads +/- a few.

--
Saying all this, I will look into the scenarios. At least to understand why we have the results that we have.

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 3, 2019

To reiterate the above:

. . . contention due to threads that are assigned to the same local queue often operating on different processors . . .

This should be very rare.
Two threads cannot run on the same core at the same time and thus should not share queues. The only reason for sharing would be that core has changed between core ID sampling. It should be uncommon.

In particular TryPop and Enqueue should not experience mutual contentions at any meaningful rate. (That is why spinning there is trivial)

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 3, 2019

I am not seeing the regression on x4 threads scenario.

My fast machine: https://gist.github.com/VSadov/fec967fd44eaa21598debbac1a8f630c
My older slower machine: https://gist.github.com/VSadov/fec967fd44eaa21598debbac1a8f630c

In either case

  • througput is higher with the new queue and
  • average latency is lower.

The tests are pretty noisy though. It looks like results depend a lot on what is going on on the machine, whether CPU is thermal throttled, etc. Took some time to get decent readings.

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 3, 2019

There is a latency regression on x1 threads scenario though.

Throughput is still higher with the new queue, but average latency is definitely worse.

My fast machine: https://gist.github.com/VSadov/3961a0de1dcaf8dcf17b044b240409ae
My older slower machine: https://gist.github.com/VSadov/f1cd310aa63c040a63923c5b2544d7bb

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 3, 2019

The results are puzzling in two ways:

  • x1 thread case should actually be the "easier" case, if core changing is a suspect.
  • the distribution of latencies for the new queue has a peculiar shape.

In master we have a tapered shape with odds of higher latencies gradually decreasing:

                <      1 us | Count/s:    10138807.968 | Average latency (us):           0.437
                <     10 us | Count/s:     2676057.714 | Average latency (us):           3.002
                <    100 us | Count/s:        1030.866 | Average latency (us):          27.271
                <   1000 us | Count/s:         576.248 | Average latency (us):         298.603
                <  10000 us | Count/s:           5.982 | Average latency (us):        3478.655
                >= 10000 us | Count/s:           0.000 | Average latency (us):             NaN

In the case of the new queue we have a "thick tail" and then sudden drop.

                <      1 us | Count/s:    10353321.091 | Average latency (us):           0.421
                <     10 us | Count/s:     2707408.416 | Average latency (us):           2.942
                <    100 us | Count/s:        3025.338 | Average latency (us):          34.336
                <   1000 us | Count/s:        6230.850 | Average latency (us):         482.556
                <  10000 us | Count/s:        2300.683 | Average latency (us):        1588.220
                >= 10000 us | Count/s:           0.000 | Average latency (us):             NaN
@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 3, 2019

In particular in x1 case associating queues with threads or cores should not differ much.
There should be roughly the same N-to-N association of threads and queues.

Indeed, from the results, most tasks are dispatched in about the same time, but there is a fraction that seem to be "neglected" like if sometimes we have a queue that noone works with for about 1000us.

I suspect what we are seeing is not the high rate of core changing, but the core sampling being too coarse.
Basically - while cores are not changing, which is common, we see the same 1-10 us latencies as in the baseline. Once two threads exchange cores, it is possible that one thread notices the change sooner than another and as a result its queue gets "neglected".

The duration of "neglect" is capped by the core ID caching delay, which appears to be way too conservative.
I will try reducing that and see if that proves my theory.

@VSadov

This comment has been minimized.

Copy link
Member Author

VSadov commented Oct 3, 2019

Reducing sampling delay of core ID to 1/50 calls results in a dramatic latency improvement.

https://gist.github.com/VSadov/7e5e165b3176f7e4981a54c2d91eb1ee

So the culprit is the core ID caching being too conservative.
Knowing this fact we can probably do even better. I just picked a random number.

internal const int MaxSegmentLength = 1024 * 1024;

/// <summary>
/// Lock used to protect cross-segment operations"/>

This comment has been minimized.

Copy link
@lpereira

lpereira Oct 14, 2019

Member

Stray "/>?

This comment has been minimized.

Copy link
@VSadov

VSadov Nov 1, 2019

Author Member

The comment is also too vague. Thanks!

VSadov added 4 commits Apr 18, 2019
…ly associated with CPU cores.

    The implementation avoids 1:1 strict mapping between work queues and working threads which is often a far greater number than the number of cores.

    This implementation improves handling of cases where the number of workers needs to exceeds the number of cores and causes precipitous performance drops because:
    - the cost of snooping/stealing does not grow with the number of workers.
    - the local queues tend to be deeper (since there are fewer of them) and thus stealing is less frequent and less contentious with enqueuing/popping.
    - reduced likelihood that the workitem will have to execute on a core different from where it was scheduled.
…ueue to ensure that continuously nonempty global queue does not delay retirement of old segments.

This is not a very common scenario, generally happening only at start up, but mitigation is also simple.
Moving to a different core while dispatching is a rare occurence, but when it does happen we do not want to neglect our local queue for too long since that can result in a batch of high-latency outliers.
@VSadov VSadov force-pushed the VSadov:queue branch from c6c9430 to 991c2eb Nov 1, 2019
return true;
}

for (int i = 0; i < _localQueues.Length; ++i)

This comment has been minimized.

Copy link
@benaadams

benaadams Nov 1, 2019

Collaborator

Take _localQueues to local to avoid bounds check in loop? #27510

This comment has been minimized.

Copy link
@VSadov

VSadov Nov 1, 2019

Author Member

We generally do not make this far. The task is either in the local queue or stolen and executed.
However, there is code in existence that relies on task inlining (or more specifically on task.Wait not blocking), so we check other queues if task did not run yet.

Storing queues in a temp is still better. Thanks!

@maryamariyan

This comment has been minimized.

Copy link
Member

maryamariyan commented Nov 6, 2019

Thank you for your contribution. As announced in #27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>
@maryamariyan

This comment has been minimized.

Copy link
Member

maryamariyan commented Dec 2, 2019

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

@maryamariyan maryamariyan removed this from the Future milestone Dec 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.