Improve performance of inner_product #69
Conversation
Implemented simple kernel to perform inner_product, 10x performance gain
```cpp
type_name<product_type>() + " *c)\n"
"{\n"
"    const uint i = get_global_id(0);\n"
"    c[i] = a[i] * b[i];\n"
```
Shouldn't this change decrease performance? It looks like the original code accumulated the transformed sequence on the fly, while the new version stores the intermediate result in global memory.
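To make the trade-off concrete, here is a host-side C++ analogy (a sketch only, not the Boost.Compute implementation): the fused version reads `a` and `b` once and never materializes the product sequence, while the kernel-based version writes the products to an intermediate buffer `c` and then reduces it, costing an extra write and read of `c`.

```cpp
#include <numeric>
#include <vector>
#include <cstddef>

// Fused: accumulate the products on the fly; no intermediate storage.
int inner_product_fused(const std::vector<int>& a, const std::vector<int>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}

// Two-pass: materialize c[i] = a[i] * b[i] first (extra global-memory
// traffic in the OpenCL case), then reduce c.
int inner_product_two_pass(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] * b[i];                       // corresponds to the kernel above
    return std::accumulate(c.begin(), c.end(), 0); // corresponds to reduce()
}
```

Both compute the same value; on a GPU the difference is the extra round trip through global memory, not the arithmetic.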
There is an optimized accumulate()/reduce() implementation which is used when the input iterators are buffer_iterators and the function is commutative. This is most likely the cause of the performance increase.
I see. So the best way to increase the performance of inner_product (and probably of other operations, as well as user code) would be to optimize the implementation of accumulate for transform iterators. accumulate should only read its input once, so I don't see why the input being a buffer is important here?
Accumulating the transformed sequence on the fly leads to a serial accumulation. As @kylelutz said, using a buffer eventually leads to reduce() or reduce_on_gpu() which is much faster.
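The distinction matters because a serial accumulation takes n sequential steps, while a commutative reduction can be organized as a tree needing only about log2(n) parallel passes. A minimal host-side sketch of the pairwise (tree) reduction, assuming a commutative operation such as +:

```cpp
#include <vector>
#include <cstddef>

// Pairwise tree reduction: each pass halves the number of elements,
// mirroring how a GPU reduce() lets work-items combine pairs in parallel.
int tree_reduce(std::vector<int> v) {
    if (v.empty()) return 0;
    while (v.size() > 1) {
        std::size_t half = v.size() / 2;
        for (std::size_t i = 0; i < half; ++i)
            v[i] = v[i] + v[i + half];  // combine pairs; one parallel pass
        if (v.size() % 2 == 1) {        // carry the odd element forward
            v[half] = v.back();
            ++half;
        }
        v.resize(half);
    }
    return v[0];
}
```

On the host the loop above is still serial, but each inner pass has no cross-iteration dependency, which is exactly what allows it to run as one parallel step per pass on a GPU.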
Yes, but the proposed variant uses four unnecessary global memory I/O operations (writing a, b, and c, and reading c back). I am sure this is the reason why it is still slower than the STL version. It should be easy to extend the parallel version of accumulate to work with generic random access iterators.
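To illustrate what "generic random access iterators" buys here, the following is a hedged sketch (the names `product_iterator` and `inner_product_lazy` are hypothetical, not from the library): a lazy iterator computes a[i] * b[i] on dereference, so any reduction that reads through it consumes the products without ever allocating the intermediate buffer c.

```cpp
#include <cstddef>
#include <vector>

// Minimal lazy "product iterator" (hypothetical): dereferencing computes
// a[i] * b[i] on demand, so a reduction reading through it needs no buffer c.
struct product_iterator {
    const int* a;
    const int* b;
    std::size_t i;

    int operator*() const { return a[i] * b[i]; }
    product_iterator& operator++() { ++i; return *this; }
    bool operator!=(const product_iterator& o) const { return i != o.i; }
};

int inner_product_lazy(const std::vector<int>& a, const std::vector<int>& b) {
    product_iterator first{a.data(), b.data(), 0};
    product_iterator last{a.data(), b.data(), a.size()};
    int sum = 0;
    for (; first != last; ++first)
        sum += *first;  // reads a and b exactly once; writes nothing
    return sum;
}
```

This is the same fusion idea Boost.Compute's transform_iterator expresses on the device side: the transformation becomes part of the read, not a separate kernel.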
Yeah, the proper long-term fix is to optimize accumulate/reduce for all input iterators, not just buffer_iterator.
I've added an issue (#73) to fix the performance of accumulate() with InputIterators. I'll look into making the fix, which should greatly improve the performance of the current inner_product() implementation.
Better fixed by issue #73.
I've updated the code. @roshanr95: Can you try out the code from develop?
Working great!! ~160x performance increase from the original.