
Re-enable CUDA texture object based access in regions kernel #1903

Merged: 2 commits merged into v3.5 on Aug 29, 2017

Conversation

9prady9 (member) commented Aug 9, 2017

There is no significant improvement except for some specific sizes. Note that I have run the benchmark only on square sizes that are multiples of 128.

Size          TextureObject   Plain Old Memory
128 x 128     0.00247927      0.00299974
256 x 256     0.00671363      0.006602
384 x 384     0.00620648      0.00795144
512 x 512     0.00989444      0.0123784
640 x 640     0.0136365       0.0155043
768 x 768     0.0186786       0.0196094
896 x 896     0.0253983       0.0266321
1024 x 1024   0.0322667       0.0337688
1152 x 1152   0.0411638       0.0431613
1280 x 1280   0.0529367       0.0560853
1408 x 1408   0.0612112       0.0643147
1536 x 1536   0.0872647       0.0769113
1664 x 1664   0.105343        0.0999272
1792 x 1792   0.12006         0.112325
1920 x 1920   0.112999        0.118628
2048 x 2048   0.129985        0.136978

[chart: plot of the benchmark timings above]

[skip arrayfire ci]

//#if (__CUDA_ARCH__ >= 300)
#if 0
// Kepler bindless texture objects
#if __CUDA_ARCH__ >= 300
Member: Can you actually drop this? We don't support pre-3.0 compute anymore.

cudaDeviceProp prop = getDeviceProp(getActiveDeviceId());

//Use texture objects with compute 3.0 or higher
if (prop.major>=3 && prop.minor>=0 && !std::is_same<T,double>::value) {
Member: Same as above.

9prady9 (author) commented Aug 10, 2017

@arrayfire/core-devel This is ready for merge.

}

return 0;
}
Member: Do we need to check this in?

@@ -23,8 +23,9 @@
#include <thrust/sort.h>
#include <thrust/transform_scan.h>

#if __CUDACC__
#include <type_traits>
Member: Where is this used?

9prady9 (author): Not anymore; it was used in the first commit, but I have removed that usage since. Will remove this include.


#if __CUDACC__
Member: Is this necessary?

9prady9 (author): Not really. It wasn't added in this change, so I didn't notice it.

cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
CUDA_CHECK(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));
Member: You are not releasing the texture.
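
A minimal sketch of the pairing the reviewer is asking for: the texture object is destroyed with cudaDestroyTextureObject once the kernel that uses it has finished. The function name, the float element type, and the d_in/count parameters are illustrative assumptions rather than the PR's actual code, and plain cudaError_t returns stand in for the codebase's CUDA_CHECK macro.

#include <cuda_runtime.h>
#include <cstring>

// Create a linear texture object over existing device memory, use it, then release it.
cudaError_t useLinearTexture(const float *d_in, size_t count)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = const_cast<float *>(d_in);
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaError_t err = cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    if (err != cudaSuccess) return err;

    // ... launch the kernel that reads through tex here ...

    return cudaDestroyTextureObject(tex);  // release the texture object when done
}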

pavanky (member) commented Aug 11, 2017

@9prady9 did you re-run the benchmarks after adding the texture free? The difference earlier was fairly minimal; I wonder if this change affects the overall performance.

9prady9 (author) commented Aug 11, 2017

@pavanky The performance looks good.
[chart 1: updated benchmark timings]

9prady9 requested a review from pavanky on August 11, 2017 at 06:32
9prady9 (author) commented Aug 17, 2017

@pavanky If there is no additional feedback, I think this is ready for merge.

umar456 closed this on Aug 23, 2017
pavanky changed the base branch from hotfix-3.5.1 to master on August 23, 2017 at 14:35
pavanky reopened this on Aug 23, 2017
pavanky changed the base branch from master to v3.5 on August 23, 2017 at 14:59
pavanky added this to the v3.5.1 milestone on Aug 23, 2017
pavanky changed the base branch from v3.5 to devel on August 23, 2017 at 15:02
pavanky changed the base branch from devel to v3.5 on August 23, 2017 at 15:02
pavanky (member) commented Aug 23, 2017

@9prady9 can you close this PR and create a new one targeting v3.5 by cherry picking the appropriate commits?

9prady9 (author) commented Aug 23, 2017

@umar456 / @mlloreda please kill the CI jobs for this task.

9prady9 closed this on Aug 23, 2017
pavanky (member) commented Aug 23, 2017

@9prady9 Never mind, the problem is with the base branch. I am fixing v3.5.

pavanky (member) commented Aug 23, 2017

@9prady9 fixed now.

umar456 (member) commented Aug 25, 2017

The performance of this function depends on the input image. What input image did you use? Can you test the performance on inputs where more iterations are required, and how does it compare to the performance when only a few iterations are needed?

Also, I am having trouble seeing how this access pattern takes advantage of the texture objects. It looks like you are loading into shared memory, so you are not benefiting from the additional texture cache, and you are using linear textures, so you aren't getting the spatially local accesses the texture cache is designed for. I don't see a huge advantage to this approach; maybe it will be more pronounced on inputs that need more iterations.
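
For illustration, the access pattern being questioned looks roughly like the device-side sketch below: each thread fetches through a linear texture object by a flat 1D index and stages the value in shared memory, so the 2D layout of the texture cache is not exploited. The kernel name, the 16x16 tile size, and the template parameter are illustrative assumptions, not code from this PR.

// Sketch of the questioned pattern: fetch via a linear texture object, stage in shared memory.
// Assumes a 16x16 thread block.
template<typename T>
__global__ void stageTile(cudaTextureObject_t tex, T *out, int width, int height)
{
    __shared__ T tile[16][16];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        // Linear textures are fetched by a 1D index, so the 2D spatial locality
        // that cudaArray-backed textures provide is not used here.
        tile[threadIdx.y][threadIdx.x] = tex1Dfetch<T>(tex, y * width + x);
    }
    __syncthreads();

    // ... neighborhood work would read from tile here ...
    if (x < width && y < height) out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}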

pavanky (member) commented Aug 25, 2017

Also we should probably be using cudaArray instead of cudaTextureObject.

9prady9 (author) commented Aug 25, 2017

@umar456 There is an advantage to using texture objects in this section of code because it does 2D spatial access (each thread reads more than one element; I am not talking about the per-thread load), and textures do have an advantage over global memory in that respect.

Does the additional layer of shared memory reduce performance? That depends on whether shared memory has lower latency than texture reads. I think shared-memory latency is better than texture-access latency, but I am not certain of it.

I had committed a benchmark sample using randomized input earlier, but removed it later and force-pushed. Do you have any suggestion for the kind of input you were talking about, the one that will cause regions to run more iterations?

@pavanky Is cudaArray similar to textures in performance? I have seen a couple of examples online where a cudaArray_t is itself wrapped in a cudaTextureObject_t, so I am not sure whether it is equivalent to textures in performance.
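
For reference, the pattern mentioned above, a cudaArray wrapped by a texture object (which is what routes reads through the 2D-optimized texture cache), looks roughly like this sketch. The function name, the int element type, and the tightly packed input layout are illustrative assumptions, not code from this PR; error checks are omitted for brevity.

#include <cuda_runtime.h>
#include <cstring>

// Back a texture object with a cudaArray so 2D reads go through the texture cache.
// The caller later releases both with cudaDestroyTextureObject and cudaFreeArray.
cudaTextureObject_t makeArrayTexture(const int *d_img, size_t width, size_t height,
                                     cudaArray_t *outArr)
{
    cudaChannelFormatDesc chDesc = cudaCreateChannelDesc<int>();
    cudaMallocArray(outArr, &chDesc, width, height);

    // Copy the tightly packed device image into the array (pitch = width * sizeof(int)).
    cudaMemcpy2DToArray(*outArr, 0, 0, d_img, width * sizeof(int),
                        width * sizeof(int), height, cudaMemcpyDeviceToDevice);

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType         = cudaResourceTypeArray;
    resDesc.res.array.array = *outArr;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;  // kernels then read via tex2D<int>(tex, x, y)

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}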

pavanky (member) commented Aug 25, 2017

The benefits coming from texture memory seem to be fairly minimal at the moment.

Commit: Remove pre-3.0-compute checks as we don't support 2.0 compute capability anymore
9prady9 (author) commented Aug 26, 2017

@pavanky @umar456 I have cleaned up more of it. I have tested the changes with two actual images (using timeit), and the improvement is around 20+% (a rough timing sketch follows after the images below).

One of them has a lot of small regions (an image with a lot of text) and the other has fewer but much larger regions.

  • 21% improvement for the image with many individual components but smaller regions (took 8 iterations)
  • 26% improvement for the image with fewer individual components but big regions (took 22 iterations)

[images: wtext and sciie, the two test inputs]
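
A timing sketch along the lines described above, assuming af::timeit and af::regions from the ArrayFire C++ API; the image path and threshold are hypothetical, and this is not the exact benchmark behind the numbers quoted in this thread.

#include <arrayfire.h>
#include <cstdio>

static af::array img;   // shared with the timed function below

static void runRegions() {
    af::array labels = af::regions(img);  // connected-component labeling
    labels.eval();                        // force the kernel to execute
}

int main() {
    // Hypothetical input: a grayscale image binarized to b8, e.g. one with lots of text.
    img = af::loadImage("text_image.png", false) > 127.f;

    double t = af::timeit(runRegions);    // seconds
    printf("regions: %g s\n", t);
    return 0;
}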

pavanky merged commit b7bd543 into v3.5 on Aug 29, 2017
pavanky deleted the regions_fix branch on August 29, 2017
mlloreda mentioned this pull request on Sep 15, 2017