Improve performance of inner_product #69

Closed
roshanr95 wants to merge 1 commit into kylelutz:develop from roshanr95:inner_product

Conversation

roshanr95
Contributor

Implemented simple kernel to perform inner_product, 10x performance gain

@coveralls

Coverage Status

Coverage remained the same when pulling ab03b78 on roshanr95:inner_product into fdadf85 on kylelutz:develop.

type_name<product_type>() + " *c)\n"
"{\n"
" const uint i = get_global_id(0);\n"
" c[i] = a[i] * b[i];\n"
Contributor

Shouldn't this change decrease performance? It looks like the original code accumulated the transformed sequence on the fly, whereas the new version stores the intermediate result in global memory.

Collaborator

There is an optimized accumulate()/reduce() implementation which is used when the input iterators are buffer_iterators and the function is commutative. This is most likely the cause of the performance increase.
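For illustration only, the buffer-based path described above amounts to something like the sketch below; the helper name inner_product_via_buffer is hypothetical and is not code from this pull request.

    #include <boost/compute.hpp>

    namespace compute = boost::compute;

    // Hypothetical helper: materialize the element-wise products in a device
    // vector, then let reduce() take the optimized path it has for buffer
    // iterators combined with a commutative function.
    float inner_product_via_buffer(const compute::vector<float> &a,
                                   const compute::vector<float> &b,
                                   compute::command_queue &queue)
    {
        // temporary buffer for the intermediate products
        // (this is the extra global memory traffic discussed below)
        compute::vector<float> c(a.size(), queue.get_context());

        // c[i] = a[i] * b[i]
        compute::transform(a.begin(), a.end(), b.begin(), c.begin(),
                           compute::multiplies<float>(), queue);

        // parallel reduction over the buffer
        float sum = 0;
        compute::reduce(c.begin(), c.end(), &sum, compute::plus<float>(), queue);
        return sum;
    }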

Contributor

I see. So the best way to improve the performance of inner_product (and probably of other operations, as well as user code) would be to optimize the implementation of accumulate for transform iterators. accumulate should only read its input once, so I don't see why the input being a buffer matters here.

Contributor Author

Accumulating the transformed sequence on the fly leads to a serial accumulation. As @kylelutz said, using a buffer eventually leads to reduce() or reduce_on_gpu() which is much faster.

Contributor

Yes, but the proposed variant uses four unnecessary global memory I/O operations (write a, b, c; read c). I am sure this is the reason why it is still slower than the STL version. It should be easy to extend the parallel version of accumulate to work with generic random access iterators.
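For reference, the host-side baseline being compared against here is presumably just the single-pass standard library call, roughly along these lines (the function name cpu_inner_product is made up for illustration):

    #include <numeric>
    #include <vector>

    // Plain CPU reference: one pass over the data, no device transfers and
    // no intermediate product buffer.
    float cpu_inner_product(const std::vector<float> &a,
                            const std::vector<float> &b)
    {
        return std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
    }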

Collaborator

Yeah, the proper long-term fix is to optimize accumulate/reduce for all input iterators, not just buffer_iterator.

Collaborator

I've added an issue (#73) to fix the performance of accumulate() with InputIterators. I'll look into making the fix, which should greatly improve the performance of the current inner_product() implementation.

@roshanr95
Contributor Author

Better fixed by issue #73.

roshanr95 closed this Mar 24, 2014
@kylelutz
Collaborator

I've updated reduce() to work with generic iterators which drastically speeds up the perf_inner_product benchmark.

@roshanr95: Can you try out the code from develop?
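For anyone reproducing this, a minimal end-to-end check along the lines below should exercise the updated reduce() path. This is a sketch, not the repository's perf_inner_product benchmark; the size and values are arbitrary.

    #include <iostream>
    #include <numeric>
    #include <vector>

    #include <boost/compute.hpp>

    namespace compute = boost::compute;

    int main()
    {
        const size_t n = 1 << 20;
        std::vector<float> host_a(n, 1.0f), host_b(n, 2.0f);

        compute::device device = compute::system::default_device();
        compute::context context(device);
        compute::command_queue queue(context, device);

        // copy the inputs to the device
        compute::vector<float> a(host_a.begin(), host_a.end(), queue);
        compute::vector<float> b(host_b.begin(), host_b.end(), queue);

        // device-side inner product
        float result = compute::inner_product(a.begin(), a.end(),
                                              b.begin(), 0.0f, queue);

        std::cout << "device: " << result << "\n"
                  << "host:   "
                  << std::inner_product(host_a.begin(), host_a.end(),
                                        host_b.begin(), 0.0f)
                  << std::endl;
        return 0;
    }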

roshanr95 deleted the inner_product branch April 12, 2014 11:04
@roshanr95
Contributor Author

Working great!! ~160x performance increase from original.
