@wesm
Member

wesm commented Nov 15, 2020

This issue was pointed out to me by some folks at UWisc, and I'm just now getting around to trying out a fix. It produces a substantial performance benefit in numerical comparisons with gcc8 (with sse4.2 only):

                               benchmark            baseline           contender  change %                                      counters
5         GreaterArrayArrayInt64/32768/0  414.735m items/sec    1.525b items/sec   267.679     {'iterations': 8792, 'null_percent': 0.0}
2         GreaterArrayArrayInt64/32768/1  411.275m items/sec    1.505b items/sec   266.053   {'iterations': 8676, 'null_percent': 100.0}
16      GreaterArrayArrayInt64/32768/100  406.580m items/sec    1.487b items/sec   265.846     {'iterations': 8639, 'null_percent': 1.0}
20        GreaterArrayArrayInt64/32768/2  411.188m items/sec    1.498b items/sec   264.320    {'iterations': 8700, 'null_percent': 50.0}
6     GreaterArrayArrayInt64/32768/10000  407.791m items/sec    1.458b items/sec   257.431    {'iterations': 8777, 'null_percent': 0.01}
17       GreaterArrayArrayInt64/32768/10  410.239m items/sec    1.438b items/sec   250.545    {'iterations': 8663, 'null_percent': 10.0}
11       GreaterArrayScalarInt64/32768/0  638.903m items/sec    1.882b items/sec   194.540    {'iterations': 13433, 'null_percent': 0.0}
14       GreaterArrayScalarInt64/32768/2  635.268m items/sec    1.866b items/sec   193.770   {'iterations': 13468, 'null_percent': 50.0}
9      GreaterArrayScalarInt64/32768/100  629.207m items/sec    1.838b items/sec   192.130    {'iterations': 13268, 'null_percent': 1.0}
15      GreaterArrayScalarInt64/32768/10  632.650m items/sec    1.848b items/sec   192.115   {'iterations': 13321, 'null_percent': 10.0}
3        GreaterArrayScalarInt64/32768/1  636.101m items/sec    1.858b items/sec   192.093  {'iterations': 13465, 'null_percent': 100.0}
19   GreaterArrayScalarInt64/32768/10000  630.470m items/sec    1.835b items/sec   191.010   {'iterations': 13336, 'null_percent': 0.01}
12       GreaterArrayArrayString/32768/1  295.949m items/sec  302.807m items/sec     2.317   {'iterations': 6323, 'null_percent': 100.0}
21      GreaterArrayScalarString/32768/2  274.259m items/sec  278.502m items/sec     1.547    {'iterations': 5847, 'null_percent': 50.0}
22      GreaterArrayScalarString/32768/1    1.007b items/sec    1.020b items/sec     1.244  {'iterations': 21309, 'null_percent': 100.0}
8        GreaterArrayArrayString/32768/2  148.096m items/sec  146.908m items/sec    -0.802    {'iterations': 3172, 'null_percent': 50.0}
10     GreaterArrayArrayString/32768/100  104.447m items/sec  103.111m items/sec    -1.280     {'iterations': 2231, 'null_percent': 1.0}
13      GreaterArrayArrayString/32768/10  105.158m items/sec  103.790m items/sec    -1.300    {'iterations': 2246, 'null_percent': 10.0}
7    GreaterArrayArrayString/32768/10000  104.408m items/sec  102.958m items/sec    -1.388    {'iterations': 2225, 'null_percent': 0.01}
4        GreaterArrayArrayString/32768/0  105.920m items/sec  103.817m items/sec    -1.985     {'iterations': 2263, 'null_percent': 0.0}
18     GreaterArrayScalarString/32768/10  645.713m items/sec  614.437m items/sec    -4.844   {'iterations': 13876, 'null_percent': 10.0}
1   GreaterArrayScalarString/32768/10000  977.632m items/sec  871.382m items/sec   -10.868   {'iterations': 20892, 'null_percent': 0.01}
0       GreaterArrayScalarString/32768/0  999.585m items/sec  877.238m items/sec   -12.240    {'iterations': 21302, 'null_percent': 0.0}
23    GreaterArrayScalarString/32768/100  934.922m items/sec  814.268m items/sec   -12.905    {'iterations': 19892, 'null_percent': 1.0}

The difference is less pronounced with clang-8, but still present in the array-array Int64 cases:

                               benchmark            baseline           contender  change %                                      counters
18        GreaterArrayArrayInt64/32768/2  928.755m items/sec    1.655b items/sec    78.196   {'iterations': 19762, 'null_percent': 50.0}
6         GreaterArrayArrayInt64/32768/1  938.947m items/sec    1.666b items/sec    77.386  {'iterations': 19962, 'null_percent': 100.0}
4         GreaterArrayArrayInt64/32768/0  945.091m items/sec    1.658b items/sec    75.386    {'iterations': 20198, 'null_percent': 0.0}
0        GreaterArrayArrayInt64/32768/10  923.876m items/sec    1.610b items/sec    74.241   {'iterations': 19667, 'null_percent': 10.0}
23      GreaterArrayArrayInt64/32768/100  918.441m items/sec    1.598b items/sec    73.962    {'iterations': 19602, 'null_percent': 1.0}
2     GreaterArrayArrayInt64/32768/10000  914.122m items/sec    1.573b items/sec    72.085   {'iterations': 19437, 'null_percent': 0.01}
21      GreaterArrayScalarString/32768/2  243.973m items/sec  246.087m items/sec     0.866    {'iterations': 5224, 'null_percent': 50.0}
17       GreaterArrayScalarInt64/32768/1    2.736b items/sec    2.638b items/sec    -3.590  {'iterations': 58090, 'null_percent': 100.0}
9        GreaterArrayArrayString/32768/0  136.733m items/sec  130.784m items/sec    -4.351     {'iterations': 2905, 'null_percent': 0.0}
22     GreaterArrayArrayString/32768/100  136.434m items/sec  127.740m items/sec    -6.372     {'iterations': 2918, 'null_percent': 1.0}
19   GreaterArrayArrayString/32768/10000  136.639m items/sec  127.704m items/sec    -6.539    {'iterations': 2916, 'null_percent': 0.01}
15   GreaterArrayScalarInt64/32768/10000    2.693b items/sec    2.512b items/sec    -6.742   {'iterations': 58669, 'null_percent': 0.01}
3        GreaterArrayScalarInt64/32768/0    2.795b items/sec    2.592b items/sec    -7.248    {'iterations': 58721, 'null_percent': 0.0}
13      GreaterArrayArrayString/32768/10  139.225m items/sec  128.587m items/sec    -7.641    {'iterations': 2980, 'null_percent': 10.0}
11       GreaterArrayScalarInt64/32768/2    2.759b items/sec    2.537b items/sec    -8.040   {'iterations': 57925, 'null_percent': 50.0}
14      GreaterArrayScalarInt64/32768/10    2.767b items/sec    2.517b items/sec    -9.045   {'iterations': 58391, 'null_percent': 10.0}
1      GreaterArrayScalarInt64/32768/100    2.738b items/sec    2.485b items/sec    -9.246    {'iterations': 58094, 'null_percent': 1.0}
16       GreaterArrayArrayString/32768/1  794.894m items/sec  659.684m items/sec   -17.010  {'iterations': 17020, 'null_percent': 100.0}
12       GreaterArrayArrayString/32768/2  182.007m items/sec  150.059m items/sec   -17.553    {'iterations': 3905, 'null_percent': 50.0}
7       GreaterArrayScalarString/32768/1    1.093b items/sec  772.486m items/sec   -29.297  {'iterations': 23230, 'null_percent': 100.0}
5      GreaterArrayScalarString/32768/10  691.609m items/sec  473.463m items/sec   -31.542   {'iterations': 14691, 'null_percent': 10.0}
20    GreaterArrayScalarString/32768/100    1.002b items/sec  603.447m items/sec   -39.761    {'iterations': 21340, 'null_percent': 1.0}
10      GreaterArrayScalarString/32768/0    1.054b items/sec  626.234m items/sec   -40.574    {'iterations': 22461, 'null_percent': 0.0}
8   GreaterArrayScalarString/32768/10000    1.078b items/sec  605.930m items/sec   -43.796   {'iterations': 23251, 'null_percent': 0.01}

I don't know why this seems to cause performance to be worse in the string comparison cases.

There might be some other things to try here like bit-packing in larger batches.
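As a rough illustration of the "larger batches" idea (this is not code from the PR), here is a minimal sketch that evaluates 64 results into a temporary byte array before packing them into the output bitmap. The name `GenerateBitsBatched`, the batch size of 64, and the LSB-first bit order are all assumptions made for the example; the generator `g` is assumed to return `bool`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not taken from the PR: writes `length` bits produced
// by `g` into `out` (LSB-first within each byte). Full batches of 64 results
// are first stored as 0/1 bytes and only then packed, so the generator loop
// and the packing loop are kept separate.
// Assumes `out` has room for (length + 7) / 8 bytes and `g()` returns bool.
template <typename Generator>
void GenerateBitsBatched(uint8_t* out, std::size_t length, Generator&& g) {
  constexpr std::size_t kBatch = 64;
  uint8_t tmp[kBatch];
  std::size_t i = 0;
  for (; i + kBatch <= length; i += kBatch) {
    // Step 1: evaluate the generator into plain bytes.
    for (std::size_t j = 0; j < kBatch; ++j) {
      tmp[j] = g() ? 1 : 0;
    }
    // Step 2: pack each group of 8 bytes into one output byte.
    for (std::size_t b = 0; b < kBatch / 8; ++b) {
      uint8_t byte = 0;
      for (std::size_t k = 0; k < 8; ++k) {
        byte = static_cast<uint8_t>(byte | (tmp[b * 8 + k] << k));
      }
      *out++ = byte;
    }
  }
  // Tail: fewer than 64 results remain, so fall back to bit-at-a-time writes.
  for (std::size_t k = 0; i < length; ++i, ++k) {
    if (k % 8 == 0) out[k / 8] = 0;
    if (g()) out[k / 8] |= static_cast<uint8_t>(1u << (k % 8));
  }
}
```

Whether a larger batch actually helps would have to be measured with the same benchmarks as above; the sketch only shows the shape of the change.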

@wesm
Member Author

wesm commented Nov 15, 2020

@cyb70289 @pitrou @jianxind since you've all been doing some kernel performance grinding you might be interested in helping investigate this further to see if there's more performance that can be unlocked. It might be better to let the compiler decide whether to unroll the loop-of-8, too.
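The diff itself is not part of this conversation, so the following is only a sketch of the two alternatives being weighed here: a hand-unrolled "compute eight results, then pack" routine versus the same logic written as plain loops that the compiler may or may not unroll. The function names are hypothetical and the generator is assumed to return `bool`.

```cpp
#include <cstdint>

// Hand-unrolled variant: eight results are stored first, then packed into
// a single byte (LSB-first). Hypothetical name, not from the PR.
template <typename Generator>
uint8_t PackByteUnrolled(Generator&& g) {
  uint8_t r[8];
  r[0] = g();
  r[1] = g();
  r[2] = g();
  r[3] = g();
  r[4] = g();
  r[5] = g();
  r[6] = g();
  r[7] = g();
  return static_cast<uint8_t>(r[0] | r[1] << 1 | r[2] << 2 | r[3] << 3 |
                              r[4] << 4 | r[5] << 5 | r[6] << 6 | r[7] << 7);
}

// Loop variant: same result, but unrolling is left to the compiler.
template <typename Generator>
uint8_t PackByteLoop(Generator&& g) {
  uint8_t r[8];
  for (int i = 0; i < 8; ++i) {
    r[i] = g();
  }
  uint8_t byte = 0;
  for (int i = 0; i < 8; ++i) {
    byte = static_cast<uint8_t>(byte | (r[i] << i));
  }
  return byte;
}
```

Both functions produce the same byte for the same generator, so they can be swapped in a benchmark to see which form a given compiler handles better.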

Member Author

This presumes that g() returns either a bool or 0/1. We might want to enforce this (at compile time), but I wasn't sure of the best way to do that.

Contributor

Maybe out_results[0] = !!g(); ?
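To make the concern concrete, here is a small sketch (with hypothetical function names, not code from the PR) of what goes wrong when a generator returns a truthy value other than 1, and how `!!` normalizes it before packing. Only two bits are packed to keep the example short.

```cpp
#include <cstdint>

// Packs two generator results into bits 0 and 1 WITHOUT normalization.
// If g() returns, say, 2, the value leaks into the wrong bit position.
template <typename Generator>
uint8_t PackTwoRaw(Generator&& g) {
  uint8_t r0 = static_cast<uint8_t>(g());
  uint8_t r1 = static_cast<uint8_t>(g());
  return static_cast<uint8_t>(r0 | (r1 << 1));
}

// Same packing, but each result is first collapsed to 0 or 1 with `!!`.
template <typename Generator>
uint8_t PackTwoNormalized(Generator&& g) {
  uint8_t r0 = !!g();
  uint8_t r1 = !!g();
  return static_cast<uint8_t>(r0 | (r1 << 1));
}
```

With a generator that yields `2` and then `0`, the raw version sets bit 1 instead of bit 0, while the normalized version sets bit 0 as intended. The runtime normalization is cheap, but it silently accepts any truthy type rather than rejecting it.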

Member Author

I'm adding a compile-time assertion
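The assertion that was actually added is not shown in this conversation. One possible form, sketched here under the assumption that the generator should return exactly `bool`, is a `static_assert` on the type of `g()`. The function name `PackByteChecked` is hypothetical.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical sketch: reject generators whose call result is not `bool`
// at compile time, then pack eight results into one byte (LSB-first).
template <typename Generator>
uint8_t PackByteChecked(Generator&& g) {
  static_assert(std::is_same<decltype(g()), bool>::value,
                "generator must return bool");
  uint8_t byte = 0;
  for (int i = 0; i < 8; ++i) {
    byte = static_cast<uint8_t>(byte | (static_cast<uint8_t>(g()) << i));
  }
  return byte;
}
```

A generator such as `[&] { return x > y; }` compiles, whereas one returning `int` fails with the message above, so the 0/1 assumption is checked without any runtime cost.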

@cyb70289
Contributor

Tested on Xeon(R) Gold 5218 CPU @ 2.30GHz (skylake).

gcc-7.5 shows a big improvement.

                               benchmark            baseline           contender  change %
7        GreaterArrayScalarInt64/32768/1  584.172m items/sec    1.455b items/sec   149.105  {'null_percent': 100.0}
16       GreaterArrayScalarInt64/32768/0  585.510m items/sec    1.457b items/sec   148.920  {'null_percent': 0.0}
5        GreaterArrayScalarInt64/32768/2  583.826m items/sec    1.451b items/sec   148.611  {'null_percent': 50.0}
6       GreaterArrayScalarInt64/32768/10  583.846m items/sec    1.449b items/sec   148.178  {'null_percent': 10.0}
2    GreaterArrayScalarInt64/32768/10000  584.545m items/sec    1.448b items/sec   147.702  {'null_percent': 0.01}
15     GreaterArrayScalarInt64/32768/100  585.013m items/sec    1.446b items/sec   147.107  {'null_percent': 1.0}
0         GreaterArrayArrayInt64/32768/0  396.250m items/sec  921.516m items/sec   132.559  {'null_percent': 0.0}
8         GreaterArrayArrayInt64/32768/1  395.204m items/sec  917.636m items/sec   132.193  {'null_percent': 100.0}
19      GreaterArrayArrayInt64/32768/100  394.544m items/sec  913.442m items/sec   131.518  {'null_percent': 1.0}
20    GreaterArrayArrayInt64/32768/10000  392.995m items/sec  908.788m items/sec   131.247  {'null_percent': 0.01}
14        GreaterArrayArrayInt64/32768/2  394.104m items/sec  909.476m items/sec   130.771  {'null_percent': 50.0}
11       GreaterArrayArrayInt64/32768/10  394.966m items/sec  898.748m items/sec   127.551  {'null_percent': 10.0}
10      GreaterArrayScalarString/32768/1  884.277m items/sec  956.126m items/sec     8.125  {'null_percent': 100.0}
1        GreaterArrayArrayString/32768/1  305.843m items/sec  302.862m items/sec    -0.975  {'null_percent': 100.0}
21      GreaterArrayScalarString/32768/2  278.355m items/sec  273.676m items/sec    -1.681  {'null_percent': 50.0}
22       GreaterArrayArrayString/32768/2  140.984m items/sec  137.974m items/sec    -2.135  {'null_percent': 50.0}
12     GreaterArrayArrayString/32768/100  103.022m items/sec   98.360m items/sec    -4.525  {'null_percent': 1.0}
4       GreaterArrayArrayString/32768/10  104.032m items/sec   99.117m items/sec    -4.724  {'null_percent': 10.0}
13       GreaterArrayArrayString/32768/0  103.551m items/sec   98.652m items/sec    -4.731  {'null_percent': 0.0}
3    GreaterArrayArrayString/32768/10000  103.255m items/sec   97.749m items/sec    -5.332  {'null_percent': 0.01}
23  GreaterArrayScalarString/32768/10000  964.903m items/sec  912.461m items/sec    -5.435  {'null_percent': 0.01}
18    GreaterArrayScalarString/32768/100  924.999m items/sec  874.072m items/sec    -5.506  {'null_percent': 1.0}
17      GreaterArrayScalarString/32768/0  973.155m items/sec  917.459m items/sec    -5.723  {'null_percent': 0.0}
9      GreaterArrayScalarString/32768/10  701.792m items/sec  630.090m items/sec   -10.217  {'null_percent': 10.0}

clang-9 shows no benefit.

                               benchmark            baseline           contender  change %
3        GreaterArrayScalarInt64/32768/1    2.863b items/sec    2.891b items/sec     0.977  {'null_percent': 100.0}
12      GreaterArrayScalarInt64/32768/10    2.845b items/sec    2.860b items/sec     0.548  {'null_percent': 10.0}
9         GreaterArrayArrayInt64/32768/1    1.959b items/sec    1.967b items/sec     0.381  {'null_percent': 100.0}
15       GreaterArrayArrayString/32768/0  118.703m items/sec  118.738m items/sec     0.030  {'null_percent': 0.0}
20   GreaterArrayArrayString/32768/10000  118.455m items/sec  118.476m items/sec     0.018  {'null_percent': 0.01}
8        GreaterArrayScalarInt64/32768/2    2.873b items/sec    2.870b items/sec    -0.089  {'null_percent': 50.0}
14        GreaterArrayArrayInt64/32768/2    1.943b items/sec    1.941b items/sec    -0.130  {'null_percent': 50.0}
6         GreaterArrayArrayInt64/32768/0    1.980b items/sec    1.977b items/sec    -0.145  {'null_percent': 0.0}
7      GreaterArrayArrayString/32768/100  119.003m items/sec  118.708m items/sec    -0.248  {'null_percent': 1.0}
17      GreaterArrayArrayInt64/32768/100    1.961b items/sec    1.947b items/sec    -0.683  {'null_percent': 1.0}
16       GreaterArrayScalarInt64/32768/0    2.901b items/sec    2.881b items/sec    -0.701  {'null_percent': 0.0}
5        GreaterArrayArrayInt64/32768/10    1.961b items/sec    1.945b items/sec    -0.834  {'null_percent': 10.0}
21    GreaterArrayArrayInt64/32768/10000    1.967b items/sec    1.939b items/sec    -1.399  {'null_percent': 0.01}
1      GreaterArrayScalarInt64/32768/100    2.854b items/sec    2.813b items/sec    -1.430  {'null_percent': 1.0}
2    GreaterArrayScalarInt64/32768/10000    2.869b items/sec    2.806b items/sec    -2.201  {'null_percent': 0.01}
23      GreaterArrayScalarString/32768/2  231.294m items/sec  223.778m items/sec    -3.250  {'null_percent': 50.0}
22      GreaterArrayArrayString/32768/10  124.178m items/sec  119.522m items/sec    -3.750  {'null_percent': 10.0}
4        GreaterArrayArrayString/32768/2  163.400m items/sec  139.325m items/sec   -14.734  {'null_percent': 50.0}
0        GreaterArrayArrayString/32768/1  815.032m items/sec  688.068m items/sec   -15.578  {'null_percent': 100.0}
13      GreaterArrayScalarString/32768/1    1.100b items/sec  858.483m items/sec   -21.922  {'null_percent': 100.0}
11     GreaterArrayScalarString/32768/10  655.258m items/sec  490.804m items/sec   -25.098  {'null_percent': 10.0}
18    GreaterArrayScalarString/32768/100  976.617m items/sec  626.209m items/sec   -35.880  {'null_percent': 1.0}
10      GreaterArrayScalarString/32768/0    1.039b items/sec  650.698m items/sec   -37.376  {'null_percent': 0.0}
19  GreaterArrayScalarString/32768/10000    1.033b items/sec  646.860m items/sec   -37.397  {'null_percent': 0.01}

@pitrou
Member

pitrou commented Nov 16, 2020

The PR looks reasonable to me.

wesm force-pushed the generate-bits-post-bitpack branch from c0b08ee to 50221b3 on November 18, 2020 02:22
@wesm
Member Author

wesm commented Nov 19, 2020

+1, thanks all

wesm closed this in 8b9f6b9 on Nov 19, 2020
wesm deleted the generate-bits-post-bitpack branch on November 19, 2020 23:22