Transform for multicore CPUs #626

jszuppe · 2016-07-06T17:33:02Z

This improves transform() (copy_on_device()) performance for multicore CPUs.
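For reference, the saxpy operation timed below (y = a * x + y) can be written with the standard library roughly like this. It is only a minimal sketch of the "saxpy with stl" baseline: the function name saxpy_stl is hypothetical and is not taken from the benchmark source.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: computes y[i] = a * x[i] + y[i] in place.
// Assumes y has at least as many elements as x.
inline void saxpy_stl(float a, const std::vector<float>& x, std::vector<float>& y)
{
    // Binary std::transform reads x[i] and y[i] and writes the result back to y[i].
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

The compute variants in the tables run the same element-wise operation through transform() on an OpenCL device.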

=== saxpy with stl ===
size,time (ms)
2,0.000160
4,0.000173
8,0.000200
16,0.000214
32,0.000225
64,0.000260
128,0.000944
256,0.001346
512,0.002352
1024,0.004416
2048,0.008575
4096,0.012101
8192,0.023922
16384,0.048170
524288,0.962686
1048576,1.929100
2097152,3.858070
4194304,7.719810
8388608,15.428000
16777216,30.873400
33554432,61.704100

[master]

=== saxpy with compute (CPU, Intel i5-6600K, 4/4 cores/threads) ===
size,time (ms)
2,0.024186
4,0.024106
8,0.024319
16,0.023895
32,0.024225
64,0.024235
128,0.024309
256,0.027449
512,0.027538
1024,0.028694
2048,0.029796
4096,0.031381
8192,0.028765
16384,0.039954
32768,0.045763
65536,0.063319
131072,0.087354
262144,0.130503
524288,0.292780
1048576,0.643207
2097152,1.271910
4194304,2.321300
8388608,4.728590
16777216,9.394120
33554432,18.629600

[pr_transform_cpu]

=== saxpy with compute (CPU, Intel i5-6600K, 4/4 cores/threads) ===
size,time (ms)
2,0.029375
4,0.028846
8,0.029399
16,0.028959
32,0.029304
64,0.029015
128,0.028879
256,0.029418
512,0.029383
1024,0.029599
2048,0.029605
4096,0.029974
8192,0.030838
16384,0.032885
32768,0.036190
65536,0.044674
131072,0.053115
262144,0.073374
524288,0.222535
1048576,0.468357
2097152,1.182020
4194304,2.110340
8388608,4.113700
16777216,8.209680
33554432,16.529000 
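
To make the comparison between the two compute tables easier to read, the sketch below computes the speedup of pr_transform_cpu over master for a few rows copied from them. The struct and function names are hypothetical; only the timings come from the tables.

```cpp
#include <cstddef>
#include <vector>

// One benchmark row: problem size and time in milliseconds on each branch.
struct timing_row {
    std::size_t size;
    double master_ms;
    double branch_ms;
};

// A few rows copied from the [master] and [pr_transform_cpu] tables.
inline std::vector<timing_row> sample_rows()
{
    return {
        {1024,     0.028694,  0.029599},
        {262144,   0.130503,  0.073374},
        {1048576,  0.643207,  0.468357},
        {33554432, 18.629600, 16.529000},
    };
}

// Speedup of the branch over master; values above 1 mean the branch is faster.
inline double speedup(const timing_row& row)
{
    return row.master_ms / row.branch_ms;
}
```

For these rows the branch is roughly 1.8x faster at 262144 elements and roughly 1.1x faster at 33554432 elements, while at 1024 elements it is slightly slower than master.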

No difference in GPU performance.

This optimization is disabled on the Apple OpenCL platform because its CPU compiler does not work correctly and cannot compile the kernel in copy_on_device_cpu(); see https://travis-ci.org/boostorg/compute/jobs/142560405. In other words, performance on Apple is unchanged.

coveralls · 2016-07-06T20:33:55Z

Coverage decreased (-0.3%) to 80.277% when pulling f58352f on haahh:pr_transform_cpu into a3f72e6 on boostorg:develop.

kylelutz · 2016-07-08T04:32:03Z

include/boost/compute/algorithm/detail/copy_on_device.hpp

-            "}\n";
-
-        m_count = detail::iterator_range_size(first, last);
+#ifndef __APPLE__


Hmm, I wonder if it would be better to use run-time dispatching (something like device.platform().name() == "Apple") rather than compile-time dispatching? It's probably unlikely, but something compiled on an Apple host may be executed on a non-Apple compute device, or vice versa.

Either way, could you add a comment explaining why we have this special case for Apple?
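
A minimal sketch of the run-time check suggested above, kept free of OpenCL so it stands alone. The helper name use_cpu_optimized_copy and its parameters are hypothetical; the platform name is assumed to come from something like device.platform().name(), and this is not necessarily how the final patch implements the check.

```cpp
#include <string>

// Hypothetical helper: decide at run time whether the CPU-optimized
// copy/transform path may be used for a given device.
// platform_name is assumed to be the OpenCL platform name reported by the
// device's platform (e.g. "Apple" on Apple's implementation).
inline bool use_cpu_optimized_copy(bool is_cpu_device, const std::string& platform_name)
{
    // Special case: Apple's CPU compiler fails to build the CPU kernel,
    // so that platform falls back to the generic path.
    const bool is_apple_platform = (platform_name == "Apple");
    return is_cpu_device && !is_apple_platform;
}
```

Unlike an #ifndef __APPLE__ guard, this decision depends on the platform of the device actually in use rather than on the host the code was compiled on.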

device.platform().name() == "Apple"

I can do this. I was just worried about using that kind of condition in code; __APPLE__ seemed more reliable.

Either way, could you add a comment explaining why we have this special case for Apple?

Sure. The Apple OpenCL platform (at least its CPU compiler) has a bug that makes some kernels impossible to compile. I've figured out that the bug shows up when two conditions hold: (1) the loop condition compares two variables (e.g. index < last_index; with index < 10000 everything is fine), and (2) inside that loop a constant or the result of a function is written into a buffer (when copying a value from one buffer to another, e.g. buf1[idx] = buf2[idx];, the bug does not appear). That's why some tests work and some don't.

Examples: _buf0[i]=42;, _buf0[i]=ret42();.

The same bug is a problem for vexcl - ddemidov/vexcl#92.
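
The two conditions described above can be illustrated with a pair of kernel sources, held here as C++ raw strings. This is only a sketch of the described patterns: the kernel names and parameter lists are hypothetical, and neither source has been verified against Apple's compiler.

```cpp
#include <string>

// Pattern described as failing: the loop bound is a variable (i < last_index)
// and a constant is written into the buffer.
inline std::string constant_write_kernel_source()
{
    return R"CLC(
__kernel void fill_range(__global int* _buf0, const uint first_index, const uint last_index)
{
    for(uint i = first_index; i < last_index; i++) {
        _buf0[i] = 42;
    }
}
)CLC";
}

// Pattern described as working: same loop shape, but the value is copied
// from one buffer to another instead of being a constant.
inline std::string buffer_copy_kernel_source()
{
    return R"CLC(
__kernel void copy_range(__global int* buf1, __global const int* buf2,
                         const uint first_index, const uint last_index)
{
    for(uint idx = first_index; idx < last_index; idx++) {
        buf1[idx] = buf2[idx];
    }
}
)CLC";
}
```

Building both sources on an affected platform would be one way to check whether a given compiler version still shows the problem.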

jszuppe · 2016-07-09T12:17:53Z

I wonder if there is a way to mock a device for tests, so that it reports device::gpu as its type while actually being a CPU device. That way we would be able to run both the CPU-specific and the GPU-specific algorithms and increase the coverage. It should work perfectly fine on all OpenCL platforms except the Apple platform.

coveralls · 2016-07-10T14:05:04Z

Coverage decreased (-0.2%) to 80.394% when pulling 2ce959a on haahh:pr_transform_cpu into a3f72e6 on boostorg:develop.

coveralls · 2016-07-10T15:29:35Z

Coverage decreased (-0.3%) to 80.287% when pulling a10e7d3 on haahh:pr_transform_cpu into a3f72e6 on boostorg:develop.

jszuppe added 2 commits July 5, 2016 13:40

Remove unused function

c6123c4

Transform/copy on device optimized for CPUs

34c476c

kylelutz reviewed Jul 8, 2016

jszuppe force-pushed the pr_transform_cpu branch from f58352f to 2ce959a Compare July 10, 2016 12:19

Disable CPU-optimized transform/copy_on_device() on Apple

a10e7d3

Yet another bug on Apple OpenCL Platform.

jszuppe force-pushed the pr_transform_cpu branch from 2ce959a to a10e7d3 Compare July 10, 2016 14:18

kylelutz merged commit d303097 into boostorg:develop Jul 12, 2016
