
Conversation

9prady9
Member

@9prady9 9prady9 commented Feb 15, 2019

  • Documentation for new functions (src/backend/cuda/nvrtc/cache.hpp)
  • Add checks for compute versions not supported in CUDA 10

Moved the following functions to use runtime compilation while refining the API defined in cache.hpp

  • convolve
  • scan
  • scan_by_key
  • separable convolution
  • transpose
  • where

All required headers (listed below) for runtime compilation will be embedded into the built library. Therefore, developers writing kernels just have to #include any required files as usual inside the .cuh kernel file. Check the transpose function to get an idea of how it is done.

  • backend.hpp
  • complex.hpp
  • jit.cuh
  • math.hpp
  • ops.hpp
  • optypes.hpp
  • Param.hpp
  • shared.hpp
  • types.hpp
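
To sketch how this can work (hypothetical names and placeholder contents; the real mechanism lives in cache.hpp and forwards the sources to NVRTC, whose nvrtcCreateProgram accepts parallel arrays of in-memory header sources and header names), the embedded headers amount to a name-to-source registry that is consulted when a .cuh file writes a plain #include:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch: each required header is embedded into the library
// as a string constant at build time (names mirror the list above; the
// contents here are placeholders, not the real ArrayFire sources).
static const std::unordered_map<std::string, std::string>& embeddedHeaders() {
    static const std::unordered_map<std::string, std::string> headers = {
        {"math.hpp",  "/* embedded contents of math.hpp */"},
        {"Param.hpp", "/* embedded contents of Param.hpp */"},
        {"types.hpp", "/* embedded contents of types.hpp */"},
    };
    return headers;
}

// When compiling a .cuh kernel at runtime, the header names and sources
// are forwarded together so that `#include "math.hpp"` resolves in memory
// rather than on disk; returns nullptr for names that were not embedded.
inline const std::string* lookupEmbeddedHeader(const std::string& name) {
    const auto& reg = embeddedHeaders();
    auto it = reg.find(name);
    return it == reg.end() ? nullptr : &it->second;
}
```

This is only a model of the lookup side; the actual cache.hpp API also handles compilation and per-device kernel caching.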

The eventual goal is to use math.hpp even inside JIT kernels and remove
the code path controlled by the isJIT parameter of the buildKernel function, but that has been deferred to another PR.

Notes:

  1. This change alone brought down the afcuda.so file size by 200 MB for a single compute version (61). Hopefully, we will see a drastic reduction in our final binary once all feasible functions are ported to runtime compilation.
  2. CUB can't be included into this framework since it includes some system headers; there is an open issue regarding this on the corresponding repository.
  3. Functions using thrust can only be ported if thrust calls and raw kernels are cleanly separated. Thrust APIs that involve CUDA runtime constructs are the main blockers.
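
The separation note 3 asks for can be illustrated with a small sketch (hypothetical names; plain C++ standing in for the thrust-dependent translation unit): thrust-based calls must sit behind an ordinary function boundary so that the raw kernels around them can move to runtime compilation independently.

```cpp
#include <numeric>
#include <vector>

// Hypothetical sketch of the separation described in note 3: this function
// models a thrust-dependent translation unit (in the real backend it would
// be compiled by nvcc and call thrust::exclusive_scan), exposed through a
// plain function signature with no CUDA runtime constructs in the interface.
// Raw kernels elsewhere can then be ported to NVRTC without touching it.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    // std::exclusive_scan stands in for thrust::exclusive_scan so the
    // sketch stays self-contained and host-only.
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
    return out;
}
```

When thrust calls and raw kernels are interleaved in one translation unit, no such boundary exists, which is why those functions remain blocked.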

@9prady9
Member Author

9prady9 commented Feb 15, 2019

Some other CUDA tests failed, possibly the ones that use JIT. I am looking into those failures.

@pavanky
Member

pavanky commented Feb 20, 2019

This is nice. It would be good to use this for full-fledged half-precision support.

@9prady9
Member Author

9prady9 commented Feb 20, 2019

@arrayfire/core-devel I am still debugging some failures on the linux-cuda CI job, but I think it is ready for review.

Member

@umar456 umar456 left a comment


Minor comments. I have suggested an API change which may be more manageable and avoids creating a map on load.
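
A minimal sketch of the kind of API change being suggested (hypothetical names; not the actual cache.hpp interface): keeping the kernel cache as a function-local static means it is constructed lazily on the first kernel request rather than when the library is loaded.

```cpp
#include <string>
#include <unordered_map>

// Placeholder for a compiled module/function handle.
struct Kernel {
    std::string ptx;
};

// Hypothetical illustration of the review suggestion: instead of a global
// map constructed at library load time, the cache lives as a function-local
// static, so it is created on first use and never pays a startup cost.
Kernel& getKernel(const std::string& nameAndInstantiation) {
    static std::unordered_map<std::string, Kernel> cache;  // built lazily
    auto it = cache.find(nameAndInstantiation);
    if (it == cache.end()) {
        // In the real backend this is where NVRTC compilation would run;
        // here we just record a placeholder entry.
        it = cache.emplace(nameAndInstantiation,
                           Kernel{"<compiled " + nameAndInstantiation + ">"})
                 .first;
    }
    return it->second;  // unordered_map references stay valid across rehash
}
```

Repeated lookups for the same instantiation return the same cached entry, so each kernel is compiled at most once per process.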

@9prady9 9prady9 dismissed umar456’s stale review March 5, 2019 16:07

Addressed all feedback

@9prady9 9prady9 requested a review from umar456 March 5, 2019 16:07
@9prady9
Member Author

9prady9 commented Mar 5, 2019

That's an odd error on Linux; it didn't happen prior to the rebase I did today. Trying a fresh build on my machine.

Update: was able to reproduce this on a fresh build.

@9prady9
Member Author

9prady9 commented Mar 5, 2019

I have noticed __int64 type failures on Windows; will debug them soon.

umar456
umar456 previously requested changes Mar 5, 2019
Member

@umar456 umar456 left a comment


A couple of small things here and there. Great comments. We need to strive to document more of our internal functions.

@umar456
Member

umar456 commented Mar 5, 2019

We need to do this, but I am worried that there are going to be a few combinations of template parameters that are valid but will not be tested. We need to be vigilant and test all type and parameter combinations.

Added documentation for nvrtc cache mechanism

Moved the following functions in the CUDA backend to use runtime compilation
* Transpose (In place transpose hasn't been ported yet)
* Convolutions
* Scan and Scan by Key

The eventual goal is to use math.hpp even inside jit kernels and remove
the code-path controlled by isJIT parameter of compileKernel function.
@9prady9
Member Author

9prady9 commented Mar 6, 2019

I have rebased/squashed all CUDA work. Will push once OpenCL changes are ready.

@umar456
Member

umar456 commented Mar 6, 2019

You should run Bloaty McBloatface on this before and after, if it's not too difficult. It would be interesting to see.

@9prady9
Member Author

9prady9 commented Mar 6, 2019

bloaty output

nvrtc ./src/backend/cuda/libafcuda.so.3.7.0
     VM SIZE                  FILE SIZE
 --------------            --------------
 100.0%   119Mi TOTAL       529Mi 100.0%

master ./src/backend/cuda/libafcuda.so
     VM SIZE                  FILE SIZE
 --------------            --------------
 100.0%   180Mi TOTAL       640Mi 100.0%

@9prady9 9prady9 dismissed umar456’s stale review March 12, 2019 20:13

Addressed feedback

@9prady9 9prady9 requested a review from umar456 March 12, 2019 21:28
@umar456 umar456 merged commit 7797d01 into arrayfire:master Mar 12, 2019
@9prady9 9prady9 deleted the nvrtc branch March 12, 2019 21:41
9prady9 added a commit to 9prady9/arrayfire that referenced this pull request Apr 8, 2019
* Add CUDA runtime compilation support using nvrtc

Moved the following functions in CUDA backend to use runtime compilation
* Transpose (In place transpose hasn't been ported yet)
* Convolutions
* Scan and Scan by Key

The eventual goal is to use math.hpp even inside jit kernels and remove
the code-path controlled by isJIT parameter of compileKernel function.

(cherry picked from commit 7797d01)
umar456 pushed a commit to 9prady9/arrayfire that referenced this pull request Apr 17, 2019
umar456 pushed a commit to 9prady9/arrayfire that referenced this pull request Apr 17, 2019
umar456 pushed a commit to 9prady9/arrayfire that referenced this pull request Apr 17, 2019
umar456 pushed a commit that referenced this pull request Apr 17, 2019