[ORO-0] Update hiprtc dlls #43

Merged
merged 3 commits into from
Dec 14, 2022

Conversation

@jammm jammm commented Dec 9, 2022

No description provided.

@takahiroharada takahiroharada merged commit 57929ee into feature/ORO-0-sync Dec 14, 2022
takahiroharada added a commit that referenced this pull request Jan 27, 2023
* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.
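The commits above describe an 8-bit radix sort that was later extended to wider keys. The following is a minimal CPU-side sketch of that idea, not the Orochi kernel itself. It assumes a least-significant-digit scheme with one count, scan, and scatter pass per 8-bit digit; the function name `radix_sort_u32` and its parameters are hypothetical.

```python
def radix_sort_u32(keys, key_bits=32, bits_per_pass=8):
    """LSD radix sort for non-negative integers below 2**key_bits.

    Hypothetical helper for illustration; not part of Orochi.
    """
    n_bins = 1 << bits_per_pass
    mask = n_bins - 1
    src = list(keys)
    for shift in range(0, key_bits, bits_per_pass):
        # Count: histogram of the current digit.
        counts = [0] * n_bins
        for k in src:
            counts[(k >> shift) & mask] += 1
        # Scan: an exclusive prefix sum turns counts into start offsets.
        offsets = [0] * n_bins
        running = 0
        for b in range(n_bins):
            offsets[b] = running
            running += counts[b]
        # Scatter: stable placement preserves the order of earlier passes.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src
```

With `bits_per_pass=8` a single pass only orders keys below 256; handling wider keys means repeating the pass for each higher digit, which is what the loop over `shift` does.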

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation produced wrong results when the wavefront (warp) size was not equal to the size of a workgroup (block).
Because not all threads run simultaneously, the previous algorithm fails for input arrays larger than the wavefront size:
it performs the scan in place on the input array, so the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).
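The hazard described here can be reproduced with a simplified CPU model. The sketch below is an assumption-laden illustration, not the actual kernel: it models a Hillis-Steele inclusive scan in which wavefronts of one workgroup execute one after another, and compares an in-place step with a step that reads from a separate source buffer. All function names are hypothetical.

```python
def scan_step_in_place(data, offset, wave_size):
    # Model: wavefronts run one after another, so a later wavefront may
    # read values that an earlier wavefront already updated in this step.
    for wave_start in range(0, len(data), wave_size):
        lanes = range(wave_start, min(wave_start + wave_size, len(data)))
        # Within a single wavefront, all lanes read before any lane writes.
        reads = [data[i - offset] if i >= offset else 0 for i in lanes]
        for i, r in zip(lanes, reads):
            data[i] += r


def scan_step_buffered(src, dst, offset):
    # Reads come only from src, so execution order cannot change the result.
    for i in range(len(src)):
        dst[i] = src[i] + (src[i - offset] if i >= offset else 0)


def inclusive_scan(values, wave_size, in_place):
    data = list(values)
    offset = 1
    while offset < len(data):
        if in_place:
            scan_step_in_place(data, offset, wave_size)
        else:
            tmp = [0] * len(data)
            scan_step_buffered(data, tmp, offset)
            data = tmp
        offset *= 2
    return data
```

In this model, scanning eight ones in place is correct when one wavefront covers the whole array (`wave_size=8`) and wrong when two wavefronts share it (`wave_size=4`), while the buffered variant is correct in both cases.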

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and supports an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720
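The reported occupancy figures are consistent with a register-limited estimate. The sketch below is a back-of-the-envelope check, not a tool from this repository. The budget of 1024 VGPRs per lane, the allocation granularity of 8, and the cap of 16 waves are assumptions chosen because they reproduce the numbers above; real hardware limits vary by architecture, and LDS usage is ignored here.

```python
def occupancy_from_vgprs(vgprs, vgpr_budget=1024, granularity=8, max_waves=16):
    """Estimate waves per SIMD when only VGPR usage limits occupancy.

    All default values are assumptions, not figures from the Orochi sources.
    """
    # Round the register count up to the allocation granularity.
    allocated = -(-vgprs // granularity) * granularity
    return min(max_waves, vgpr_budget // allocated)
```

Under these assumptions, `occupancy_from_vgprs(128)` gives 8 and `occupancy_from_vgprs(106)` gives 9, matching the SortKernel and SortKernel1 figures.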

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.
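The loop change described in this commit can be illustrated with a small CPU-side sketch. The original kernel code is not shown on this page, so both functions below are hypothetical stand-ins: the first tests the bound on every iteration, the second computes the upper bound once before entering the loop.

```python
def sum_checked(items, begin, count):
    total = 0
    for i in range(count):
        idx = begin + i
        if idx < len(items):  # bounds check repeated on every iteration
            total += items[idx]
    return total


def sum_hoisted(items, begin, count):
    # The upper bound is clamped once, so the loop body needs no check.
    end = min(begin + count, len(items))
    total = 0
    for idx in range(begin, end):
        total += items[idx]
    return total
```

Both variants return the same result; the second simply moves the comparison out of the inner loop.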


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.
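A rough sketch of how such a calculation might look, assuming the number of resident workgroups per WGP is the smaller of an LDS limit and a thread limit. The page does not show the actual formula, so the function name, parameters, and the numbers in the usage note are all hypothetical.

```python
def workgroups_per_wgp(lds_per_wg, wg_size, lds_per_wgp, max_threads_per_wgp):
    """Resident workgroups per WGP, limited by LDS and by thread count.

    Hypothetical helper; the real Orochi computation may differ.
    """
    by_threads = max_threads_per_wgp // wg_size
    if lds_per_wg == 0:
        # A kernel that uses no LDS is limited by threads only.
        return by_threads
    by_lds = lds_per_wgp // lds_per_wg
    return min(by_lds, by_threads)
```

For example, with 8192 bytes of LDS per workgroup, 256 threads per workgroup, 65536 bytes of LDS per WGP, and 1024 threads per WGP, the thread limit (4) is tighter than the LDS limit (8), so the result is 4.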


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan
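`std::exclusive_scan` is a C++17 algorithm that older GCC standard libraries do not provide, so a plain loop is a common fallback. The sketch below shows the equivalent logic in Python for brevity; it illustrates the operation and is not the code from this commit.

```python
def exclusive_scan(values, init=0):
    """Return prefix sums where element i excludes values[i] itself."""
    out = []
    running = init
    for v in values:
        out.append(running)  # emit the sum of everything before v
        running += v
    return out
```

For the input `[3, 1, 4, 1]` this yields `[0, 3, 4, 8]`: each output is the sum of all earlier inputs.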

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid.
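One way to avoid returning a stale entry on a name match alone is to key the cache on everything that identifies the compiled function. The page does not show how Orochi resolved this, so the sketch below is only an assumption: the `KernelCache` class, its composite key, and the `compile_fn` callback are hypothetical.

```python
class KernelCache:
    """Cache compiled functions by (source path, function name, options)."""

    def __init__(self, compile_fn):
        # compile_fn is a hypothetical callback that builds a function handle.
        self._compile = compile_fn
        self._entries = {}

    def get(self, source_path, function_name, options=()):
        key = (source_path, function_name, tuple(options))
        if key not in self._entries:
            self._entries[key] = self._compile(
                source_path, function_name, tuple(options)
            )
        return self._entries[key]
```

With this key, two kernels that share a name but come from different sources or build options never alias each other. A handle invalidated for other reasons, such as a destroyed context, would still need separate handling.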

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Fix linux loading of libhiprtc.so (#49)

* [ORO-0] Update test scripts (#50)

* [ORO-0] Update scripts for linux (#51)

* [ORO-0] Add new scripts (#52)

* [ORO-0] Add new scripts

* [ORO-0] Add execute permissions to scripts

* Fix Unit Test: getErrorString (#54)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Support hiprtc0504 (#55)

* [ORO-0] Update hiprtc and orortc error codes (#57)

* [ORO-0] Update test scripts to delete cache before running (#58)

* [ORO-0] Update hiprtc dlls

* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation

* Fix apt python installation (#63)

Update checkout version


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] OrochiUtils update. (#61)

* [ORO-0] Add WMMA test (#62)

* [ORO-0] Add WMMA test

* [ORO-0] Add a comment for WMMA

* [ORO-0] Cleanup

* [ORO-0] Add a couple more comments

* [ORO-0] Remove hip_runtime include

* [ORO-0] Cleanup

* [ORO-0] Fix comment

* [ORO-0] Add Copyright notice

* [ORO-0] Load binary from the directory where DLL is.

* [ORO-0] Fix for linux.

---------

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: PixelClear <pariku@amd.com>
PixelClear pushed a commit that referenced this pull request Jul 12, 2023
* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation produced wrong results when the wavefront (warp) size was not equal to the size of a workgroup (block).
Because not all threads run simultaneously, the previous algorithm fails for input arrays larger than the wavefront size:
it performs the scan in place on the input array, so the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and supports an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Allow usage of libhiprtc64.so if exists

* [ORO-0] Fix linux loading of libhiprtc.so

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: PixelClear <pariku@amd.com>
RichardGe added a commit that referenced this pull request Mar 29, 2024
* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
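
The hazard described in the LDS scan fix above can be reproduced on the CPU. The following is only a sketch under stated assumptions, not the Orochi kernel: it models a Hillis-Steele style scan in which the lanes of one wavefront run in lock-step while wavefronts run one after another. The function names `scanInPlace` and `scanDoubleBuffered` and the execution model are hypothetical, and double buffering is shown as one possible remedy rather than the fix the commit actually applied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of an in-place inclusive scan.
// Lanes of one wavefront read together and then write together;
// wavefronts execute sequentially, so a later wavefront can read
// values that an earlier wavefront has already updated in this step.
std::vector<int> scanInPlace(std::vector<int> data, std::size_t waveSize)
{
    assert(waveSize > 0);
    const std::size_t n = data.size();
    for (std::size_t d = 1; d < n; d *= 2)
    {
        for (std::size_t base = 0; base < n; base += waveSize)
        {
            const std::size_t end = std::min(base + waveSize, n);
            std::vector<int> read(end - base, 0);
            for (std::size_t i = base; i < end; ++i)
                if (i >= d) read[i - base] = data[i - d]; // lock-step reads
            for (std::size_t i = base; i < end; ++i)
                data[i] += read[i - base];                // lock-step writes
        }
    }
    return data;
}

// Same scan, but every step reads from `src` and writes to `dst`,
// so the order in which wavefronts run no longer matters.
std::vector<int> scanDoubleBuffered(std::vector<int> src, std::size_t waveSize)
{
    assert(waveSize > 0);
    const std::size_t n = src.size();
    std::vector<int> dst(n);
    for (std::size_t d = 1; d < n; d *= 2)
    {
        for (std::size_t base = 0; base < n; base += waveSize)
        {
            const std::size_t end = std::min(base + waveSize, n);
            for (std::size_t i = base; i < end; ++i)
                dst[i] = src[i] + (i >= d ? src[i - d] : 0);
        }
        std::swap(src, dst); // stands in for a workgroup barrier plus buffer swap
    }
    return src;
}
```

With eight input values of 1, `scanInPlace` matches the expected prefix sums only when `waveSize` covers the whole array; with `waveSize` set to 4 the second wavefront picks up partially updated values, while `scanDoubleBuffered` is correct for either setting.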

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
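
A rough sketch of the kind of calculation the commit title above describes: the number of workgroups that can be resident on one WGP is bounded both by LDS capacity and by the resident-thread limit. The helper name `workgroupsPerWGP`, its parameters, and the handling of kernels without LDS are assumptions made for illustration; the commit does not show the actual formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many workgroups fit on one WGP at the same time.
// ldsPerWGP        - LDS bytes available per WGP (assumed to be queried elsewhere)
// ldsPerWG         - LDS bytes one workgroup of the kernel needs
// maxThreadsPerWGP - resident-thread limit per WGP
// wgSize           - threads per workgroup
std::size_t workgroupsPerWGP(std::size_t ldsPerWGP, std::size_t ldsPerWG,
                             std::size_t maxThreadsPerWGP, std::size_t wgSize)
{
    assert(wgSize > 0);
    const std::size_t byThreads = maxThreadsPerWGP / wgSize;
    if (ldsPerWG == 0)
        return byThreads; // kernel uses no LDS, only the thread limit applies
    return std::min(ldsPerWGP / ldsPerWG, byThreads);
}
```

Multiplying the result by the number of WGPs on the device would then give a launch size that keeps every WGP occupied without oversubscribing it.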

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
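
For toolchains whose standard library lacks `std::exclusive_scan` (a C++17 addition to `<numeric>`), a hand-written loop is a common replacement. The sketch below is one such fallback; the function name `exclusiveScan` is hypothetical and this is not necessarily the code the commit introduced.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum without std::exclusive_scan:
// out[i] = init + in[0] + ... + in[i - 1], so out[0] == init.
template <typename T>
std::vector<T> exclusiveScan(const std::vector<T>& in, T init = T{})
{
    std::vector<T> out(in.size());
    T running = init;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = running;  // sum of everything before element i
        running += in[i];
    }
    return out;
}
```

For the input `{3, 1, 4, 1, 5}` this yields `{0, 3, 4, 8, 9}`, which is the shape of result a radix sort needs when it turns bin counts into bin offsets.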

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
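
One way to avoid returning a stale handle for a name that merely matches is to make the cache key more specific than the kernel name. The sketch below assumes a key built from source path, kernel name, and compile options; `KernelHandle` and `KernelCache` are hypothetical stand-ins and do not reflect the actual OrochiUtils implementation or the fix made in this commit.

```cpp
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for a compiled kernel handle such as oroFunction.
struct KernelHandle
{
    int id;
};

// Hypothetical cache: the key combines source path, kernel name and
// compile options, so two kernels that only share a name never collide.
class KernelCache
{
public:
    using Key = std::tuple<std::string, std::string, std::string>;

    // Returns nullptr when no kernel was cached for this exact key.
    const KernelHandle* find(const std::string& path, const std::string& name,
                             const std::string& options) const
    {
        const auto it = m_map.find(Key{path, name, options});
        return it == m_map.end() ? nullptr : &it->second;
    }

    void insert(const std::string& path, const std::string& name,
                const std::string& options, KernelHandle handle)
    {
        m_map[Key{path, name, options}] = handle;
    }

private:
    std::map<Key, KernelHandle> m_map;
};
```

A lookup for a kernel named `SortKernel` from a different source file then misses the cache and triggers a fresh compile instead of reusing a handle that belongs to another module.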

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* remove space after -I (#33)

* Feature/oro 0 gpuopen merge 2 (#32)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Allow usage of libhiprtc64.so if exists

* [ORO-0] Fix linux loading of libhiprtc.so

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: PixelClear <pariku@amd.com>

* Feature/oro 0 radix sort stream (#34)

* Initial commit

* Streams to the configuration

* Mutex in OrochiUtils

* Feature/oro 0 radix sort mutex baking (#36)

* Locking other methods in OrochiUtils

* Removing mutex from static methods

* Making mutex and map static

* Removing static from OrochiUtils

* Removing static from OrochiUtils

* Support Precompiled Kernels in Orochi (#37)

* Add bitcode support: getFunctionFromPrecompiledBinary

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add bitcode and the script to generate it.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* rewrite OROASSERT. Fix include file order.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Use string instead of const char*


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Rename the option from bitcode to precompiled


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Add bitcode script for nvidia fatbin

* [ORO-0] CUDA - hipfb->fatbin rename

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>

* Feature/oro 0 resource limits (#38)

* Adding limit functions

* Removing enum

* Removing enum

* Limit enum

* char string Windows API (#39)

* [ORO-0] Update precompiled radix sort kernels to use -ffast-math (#42)

* [ORO-0] Update precompiled radix sort kernels to use -ffast-math

* [ORO-0] Update RadixSort fatbin for NVIDIA and use fast math

* [ORO-0] Function pointer test. (#40)

* [ORO-0] Function pointer test.

* [ORO-0] launch2d.

* [ORO-0] Event, OroStopwatch.

* Implement GpuMemory to handle device memory operations.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Fix linux loading of libhiprtc.so (#49)

* [ORO-0] Update test scripts (#50)

* [ORO-0] Update scripts for linux (#51)

* [ORO-0] Add new scripts (#52)

* [ORO-0] Add new scripts

* [ORO-0] Add execute permissions to scripts

* Fix Unit Test: getErrorString (#54)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Support hiprtc0504 (#55)

* [ORO-0] Update hiprtc and orortc error codes (#57)

* [ORO-0] Update test scripts to delete cache before running (#58)

* [ORO-0] Update hiprtc dlls

* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation

* Fix apt python installation (#63)

Update checkout version


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] OrochiUtils update. (#61)

* [ORO-0] Add WMMA test (#62)

* [ORO-0] Add WMMA test

* [ORO-0] Add a comment for WMMA

* [ORO-0] Cleanup

* [ORO-0] Add a couple more comments

* [ORO-0] Remove hip_runtime include

* [ORO-0] Cleanup

* [ORO-0] Fix comment

* [ORO-0] Add Copyright notice

* [ORO-0] Load binary from the directory where DLL is.

* [ORO-0] Fix for linux.

---------

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] Remove unnecessary template.

* [ORO-0] Clean up. Added python script kernelCompile.py for compilation. (#46)

* [ORO-0] Clean up. Added python script kernelCompile.py for compilation.

* [ORO-0] hipsdk should be next to orochi dir.

* Update ParallelPrimitives/RadixSortKernels.h

Remove commented line

---------

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] add automatic arch selection (#47)

* [ORO-0] add automatic arch selection

* [ORO-0] Refactor and error output when it cannot find llc.

---------

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Feature/oro 0 flexible rtc error handling cherrypick (#48)

* add a handler for RTC load failure case on cuda.

* [ORO-0] add a handler for RTC load failure case on hip.

* [ORO-0] add cuda 12.0 sdk in nvrtc path

* [ORO-0] Remove non bundled bitcode tests. Clean up.

* [ORO-0] Clean up.

* [ORO-0] Add hiprtcGetBitcodeSize back.

* Update Orochi.cpp

* Update Orochi.cpp

* [ORO-0] Fix for multi-GPU/iGPU

* [HIPSDK-0] compute-22.40-osdb/36/

* [ORO-0] compute-23.10-osdb/9/

* [ORO-0] Update dll names

* [ORO-0] implement new test for managed memory, enable managed memory api, fix all warnings and cleanup

* [ORO-0] fix compile issues

* [ORO-0] fix declaration of oroManagedMalloc

* [ORO-0] change streaming kernel

* [ORO-0] enable it on windows too

* [ORO-0] add more asserts

* [ORO-0] update kernel

* [ORO-0] add host copy times

* [ORO-0] add malloc times

* Refactor Count

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Refactor Radix Sort class:

- Now the tmp buffer is allocated internally.
- All GPU memory buffers are changed to the GpuMemory class
- `configure` will now calculate the total number of GPU blocks for the count and the scan kernel
- The client does not need to call configure explicitly
- Refactor function parameters
- Remove count reference kernel



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add `const`




Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* This commit does the following:

- Support setting the number of threads per block (a.k.a. block size) dynamically
- Refactor `exclusiveScanCpu`
- Extend `printKernelInfo`.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* The 1st working example for the radix sort optimization


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support configuring dynamic "NUM_WARPS_PER_BLOCK" in the sort kernel

Compute the optimal number of inputs for each block to handle.

Refactor the usage of stopwatch

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] add hiprtc future dll names in hiprtc path

* Add linux paths and dll names (#66)

* [ORO-0] Change path and rtc dll names

* [ORO-0] Make scripts executable

* [ORO-0] Add hiprtc path

* [ORO-0] Remove ParallelPrimitives, test/radix sort

* [ORO-0] Edit premake

---------

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>
Co-authored-by: Richard Geslot <richard.geslot@amd.com>
Co-authored-by: Atsushi Yoshimura <51312299+AtsushiYoshimura0302@users.noreply.github.com>
Co-authored-by: Atsushi.Yoshimura <Atsushi.Yoshimura@amd.com>