Improve performance of inner_product #69
Conversation
Implemented simple kernel to perform inner_product, 10x performance gain
```cpp
type_name<product_type>() + " *c)\n"
"{\n"
"    const uint i = get_global_id(0);\n"
"    c[i] = a[i] * b[i];\n"
```
Shouldn't this change decrease performance? It looks like the original code accumulated the transformed sequence on the fly, while the new version stores the intermediate result in global memory.
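To make the trade-off concrete, here is a host-side C++ analogy (a sketch only, not the Boost.Compute implementation): the fused version reads `a` and `b` once and never materializes the product sequence, while the kernel-based version writes the products to an intermediate buffer `c` and then reduces it, costing an extra write and read of `c`.

```cpp
#include <numeric>
#include <vector>
#include <cstddef>

// Fused: accumulate the products on the fly; no intermediate storage.
int inner_product_fused(const std::vector<int>& a, const std::vector<int>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}

// Two-pass: materialize c[i] = a[i] * b[i] first (extra global-memory
// traffic in the OpenCL case), then reduce c.
int inner_product_two_pass(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] * b[i];                       // corresponds to the kernel above
    return std::accumulate(c.begin(), c.end(), 0); // corresponds to reduce()
}
```

Both compute the same value; on a GPU the difference is the extra round trip through global memory, not the arithmetic.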
There is an optimized accumulate()/reduce() implementation which is used when the input iterators are buffer_iterators and the function is commutative. This is most likely the cause of the performance increase.
I see. So the best way to increase the performance of inner_product (and probably of other operations, as well as user code) would be to optimize the implementation of accumulate for transform iterators. accumulate should only read its input once, so I don't see why the input being a buffer is important here?
Accumulating the transformed sequence on the fly leads to a serial accumulation. As @kylelutz said, using a buffer eventually leads to reduce() or reduce_on_gpu() which is much faster.
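The distinction matters because a serial accumulation takes n sequential steps, while a commutative reduction can be organized as a tree needing only about log2(n) parallel passes. A minimal host-side sketch of the pairwise (tree) reduction, assuming a commutative operation such as +:

```cpp
#include <vector>
#include <cstddef>

// Pairwise tree reduction: each pass halves the number of elements,
// mirroring how a GPU reduce() lets work-items combine pairs in parallel.
int tree_reduce(std::vector<int> v) {
    if (v.empty()) return 0;
    while (v.size() > 1) {
        std::size_t half = v.size() / 2;
        for (std::size_t i = 0; i < half; ++i)
            v[i] = v[i] + v[i + half];  // combine pairs; one parallel pass
        if (v.size() % 2 == 1) {        // carry the odd element forward
            v[half] = v.back();
            ++half;
        }
        v.resize(half);
    }
    return v[0];
}
```

On the host the loop above is still serial, but each inner pass has no cross-iteration dependency, which is exactly what allows it to run as one parallel step per pass on a GPU.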
Yes, but the proposed variant uses four unnecessary global memory I/O operations (writing a, b, and c, and reading c back). I am sure this is the reason why it is still slower than the STL version. It should be easy to extend the parallel version of accumulate to work with generic random access iterators.
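To illustrate what "generic random access iterators" buys here, the following is a hedged sketch (the names `product_iterator` and `inner_product_lazy` are hypothetical, not from the library): a lazy iterator computes a[i] * b[i] on dereference, so any reduction that reads through it consumes the products without ever allocating the intermediate buffer c.

```cpp
#include <cstddef>
#include <vector>

// Minimal lazy "product iterator" (hypothetical): dereferencing computes
// a[i] * b[i] on demand, so a reduction reading through it needs no buffer c.
struct product_iterator {
    const int* a;
    const int* b;
    std::size_t i;

    int operator*() const { return a[i] * b[i]; }
    product_iterator& operator++() { ++i; return *this; }
    bool operator!=(const product_iterator& o) const { return i != o.i; }
};

int inner_product_lazy(const std::vector<int>& a, const std::vector<int>& b) {
    product_iterator first{a.data(), b.data(), 0};
    product_iterator last{a.data(), b.data(), a.size()};
    int sum = 0;
    for (; first != last; ++first)
        sum += *first;  // reads a and b exactly once; writes nothing
    return sum;
}
```

This is the same fusion idea Boost.Compute's transform_iterator expresses on the device side: the transformation becomes part of the read, not a separate kernel.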
Yeah, the proper long-term fix is to optimize accumulate/reduce for all input iterators, not just buffer_iterator.
I've added an issue (#73) to fix the performance of accumulate() with InputIterators. I'll look into making the fix, which should greatly improve the performance of the current inner_product() implementation.
Better fixed by issue #73.
I've updated the code. @roshanr95: Can you try out the code from develop?
Working great!! ~160x performance increase from the original.