
Re-enable CUDA texture object based access in regions kernel #1903

Merged: 2 commits merged into v3.5 on Aug 29, 2017

Conversation

9prady9 (member) commented Aug 9, 2017

There is no significant improvement except for some specific sizes. Note that I have run the benchmark only on square sizes that are multiples of 128.

Size          TextureObject   Plain Old Memory
128 x 128     0.00247927      0.00299974
256 x 256     0.00671363      0.006602
384 x 384     0.00620648      0.00795144
512 x 512     0.00989444      0.0123784
640 x 640     0.0136365       0.0155043
768 x 768     0.0186786       0.0196094
896 x 896     0.0253983       0.0266321
1024 x 1024   0.0322667       0.0337688
1152 x 1152   0.0411638       0.0431613
1280 x 1280   0.0529367       0.0560853
1408 x 1408   0.0612112       0.0643147
1536 x 1536   0.0872647       0.0769113
1664 x 1664   0.105343        0.0999272
1792 x 1792   0.12006         0.112325
1920 x 1920   0.112999        0.118628
2048 x 2048   0.129985        0.136978

[chart: plot of the benchmark timings above]

[skip arrayfire ci]

//#if (__CUDA_ARCH__ >= 300)
#if 0
// Kepler bindless texture objects
#if __CUDA_ARCH__ >= 300
Member: Can you actually drop this? We don't support pre-3.0 compute anymore.

cudaDeviceProp prop = getDeviceProp(getActiveDeviceId());

//Use texture objects with compute 3.0 or higher
if (prop.major>=3 && prop.minor>=0 && !std::is_same<T,double>::value) {
Member: Same as above.

9prady9 (author) commented Aug 10, 2017

@arrayfire/core-devel This is ready for merge.

}

return 0;
}
Member: Do we need to check this in?

@@ -23,8 +23,9 @@
#include <thrust/sort.h>
#include <thrust/transform_scan.h>

#if __CUDACC__
#include <type_traits>
Member: Where is this used?

9prady9 (author): Not anymore; it was used in the first commit, but I have removed that usage since. Will remove this include.


#if __CUDACC__
Member: Is this necessary?

9prady9 (author): Not really. It wasn't added in this change, so I didn't notice it.

cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
CUDA_CHECK(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));
Member: You are not releasing the texture.
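
A minimal sketch of the pairing the reviewer is asking for: the texture object is destroyed with cudaDestroyTextureObject once the kernel that uses it has finished. The function name, the float element type, and the d_in/count parameters are illustrative assumptions rather than the PR's actual code, and plain cudaError_t returns stand in for the codebase's CUDA_CHECK macro.

#include <cuda_runtime.h>
#include <cstring>

// Create a linear texture object over existing device memory, use it, then release it.
cudaError_t useLinearTexture(const float *d_in, size_t count)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = const_cast<float *>(d_in);
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaError_t err = cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    if (err != cudaSuccess) return err;

    // ... launch the kernel that reads through tex here ...

    return cudaDestroyTextureObject(tex);  // release the texture object when done
}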

pavanky (member) commented Aug 11, 2017

@9prady9 did you re-run the benchmarks after adding the texture free? The difference earlier was fairly minimal; I wonder if this change affects the overall performance.

9prady9 (author) commented Aug 11, 2017

@pavanky The performance looks good.
[chart 1: updated benchmark timings]

9prady9 requested a review from pavanky on August 11, 2017 at 06:32
9prady9 (author) commented Aug 17, 2017

@pavanky If there is no additional feedback, I think this is ready for merge.

umar456 closed this on Aug 23, 2017
pavanky changed the base branch from hotfix-3.5.1 to master on August 23, 2017 at 14:35
pavanky reopened this on Aug 23, 2017
pavanky changed the base branch from master to v3.5 on August 23, 2017 at 14:59
pavanky added this to the v3.5.1 milestone on Aug 23, 2017
pavanky changed the base branch from v3.5 to devel on August 23, 2017 at 15:02
pavanky changed the base branch from devel to v3.5 on August 23, 2017 at 15:02
pavanky (member) commented Aug 23, 2017

@9prady9 can you close this PR and create a new one targeting v3.5 by cherry picking the appropriate commits?

9prady9 (author) commented Aug 23, 2017

@umar456 / @mlloreda please kill the CI jobs for this task.

9prady9 closed this on Aug 23, 2017
pavanky (member) commented Aug 23, 2017

@9prady9 Never mind, the problem is with the base branch. I am fixing v3.5.

pavanky (member) commented Aug 23, 2017

@9prady9 fixed now.

umar456 (member) commented Aug 25, 2017

The performance of this function depends on the input image. What input image did you use? Can you test the performance on inputs where more iterations are required, and how does it compare to the performance when only a few iterations are needed?

Also, I am having trouble seeing how this access pattern takes advantage of the texture objects. It looks like you are loading into shared memory, so you are not benefiting from the additional texture cache, and you are using linear textures, so you aren't getting the spatially local accesses the texture cache is designed for. I don't see a huge advantage to this approach; maybe it will be more pronounced on inputs that need more iterations.
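
For illustration, the access pattern being questioned looks roughly like the device-side sketch below: each thread fetches through a linear texture object by a flat 1D index and stages the value in shared memory, so the 2D layout of the texture cache is not exploited. The kernel name, the 16x16 tile size, and the template parameter are illustrative assumptions, not code from this PR.

// Sketch of the questioned pattern: fetch via a linear texture object, stage in shared memory.
// Assumes a 16x16 thread block.
template<typename T>
__global__ void stageTile(cudaTextureObject_t tex, T *out, int width, int height)
{
    __shared__ T tile[16][16];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        // Linear textures are fetched by a 1D index, so the 2D spatial locality
        // that cudaArray-backed textures provide is not used here.
        tile[threadIdx.y][threadIdx.x] = tex1Dfetch<T>(tex, y * width + x);
    }
    __syncthreads();

    // ... neighborhood work would read from tile here ...
    if (x < width && y < height) out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}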

pavanky (member) commented Aug 25, 2017

Also we should probably be using cudaArray instead of cudaTextureObject.

9prady9 (author) commented Aug 25, 2017

@umar456 There is an advantage to using texture objects in this section of code because it does 2D spatial access (each thread reads more than one element; I am not talking about the per-thread load), and textures do have an advantage over global memory in that respect.

Does the additional layer of shared memory reduce performance? That depends on whether shared memory has lower latency than texture reads. I think shared-memory latency is better than texture-access latency, but I am not certain of it.

I had committed a benchmark sample using randomized input earlier, but removed it later and force-pushed. Do you have any suggestion for the kind of input you were talking about, the one that will cause regions to run more iterations?

@pavanky Is cudaArray similar to textures in performance? I have seen a couple of examples online where a cudaArray_t is itself wrapped in a cudaTextureObject_t, so I am not sure whether it is equivalent to textures in performance.
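
For reference, the pattern mentioned above, a cudaArray wrapped by a texture object (which is what routes reads through the 2D-optimized texture cache), looks roughly like this sketch. The function name, the int element type, and the tightly packed input layout are illustrative assumptions, not code from this PR; error checks are omitted for brevity.

#include <cuda_runtime.h>
#include <cstring>

// Back a texture object with a cudaArray so 2D reads go through the texture cache.
// The caller later releases both with cudaDestroyTextureObject and cudaFreeArray.
cudaTextureObject_t makeArrayTexture(const int *d_img, size_t width, size_t height,
                                     cudaArray_t *outArr)
{
    cudaChannelFormatDesc chDesc = cudaCreateChannelDesc<int>();
    cudaMallocArray(outArr, &chDesc, width, height);

    // Copy the tightly packed device image into the array (pitch = width * sizeof(int)).
    cudaMemcpy2DToArray(*outArr, 0, 0, d_img, width * sizeof(int),
                        width * sizeof(int), height, cudaMemcpyDeviceToDevice);

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType         = cudaResourceTypeArray;
    resDesc.res.array.array = *outArr;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;  // kernels then read via tex2D<int>(tex, x, y)

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}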

pavanky (member) commented Aug 25, 2017

The benefits coming from texture memory seem to be fairly minimal at the moment.

Commit: Remove pre-3.0-compute checks as we don't support 2.0 compute capability anymore
9prady9 (author) commented Aug 26, 2017

@pavanky @umar456 I have cleaned up more of it. I have tested the changes with two actual images (using timeit), and the improvement is around 20+% (a rough timing sketch follows after the images below).

One of them has a lot of small regions (an image with a lot of text) and the other has fewer but much larger regions.

  • 21% improvement for the image with many individual components but smaller regions (took 8 iterations)
  • 26% improvement for the image with fewer individual components but big regions (took 22 iterations)

[images: wtext and sciie, the two test inputs]
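
A timing sketch along the lines described above, assuming af::timeit and af::regions from the ArrayFire C++ API; the image path and threshold are hypothetical, and this is not the exact benchmark behind the numbers quoted in this thread.

#include <arrayfire.h>
#include <cstdio>

static af::array img;   // shared with the timed function below

static void runRegions() {
    af::array labels = af::regions(img);  // connected-component labeling
    labels.eval();                        // force the kernel to execute
}

int main() {
    // Hypothetical input: a grayscale image binarized to b8, e.g. one with lots of text.
    img = af::loadImage("text_image.png", false) > 127.f;

    double t = af::timeit(runRegions);    // seconds
    printf("regions: %g s\n", t);
    return 0;
}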

pavanky merged commit b7bd543 into v3.5 on Aug 29, 2017
pavanky deleted the regions_fix branch on August 29, 2017
mlloreda mentioned this pull request on Sep 15, 2017