
[CUDA] Allow dynamic shmem of size > 48K in runtime #11478

Merged (4 commits) on May 27, 2022

Conversation

masahi (Member) commented May 26, 2022

Currently, we have functioning dynamic shared memory support on CUDA, but we haven't actually explored allocating more than 48 KB of dynamic shmem.

This PR updates the CUDA runtime to support launching a kernel that wants to use dynamic shmem of size > 48 KB. This is already useful for manually rewritten schedules, but integrating this feature into tuning requires more work (see the discussion on `VerifyGPUCode` below).

I'll add a test that actually uses a big dynamic shmem allocation in the next PR (I need to fix one bug in the software pipelining transform first).

Reference in the CUTLASS code:
https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L479-L482

@vinx13 @junrushao1994 @tqchen @yzh119 @Hzfengsy

if (fcache_[device_id] == nullptr) {
  fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  if (wl.dyn_shmem_size >= (48 << 10)) {
Member

If the dynamic shared memory is too large, will it pass the `VerifyGPUCode` check?

masahi (Member Author)

I haven't tested it, but yeah, it seems `VerifyGPUCode` checks the static alloc size against `max_shared_memory_per_block`, which would fail if `dyn_shmem_size >= (48 << 10)`.

} else if (storage_scope.rank == runtime::StorageRank::kShared) {
  size_t size = static_cast<size_t>(op->ConstantAllocationSize());

masahi (Member Author)

Can we defer this issue until later? I need this to demonstrate that a multi-stage pipeline with depth > 2 works on a semi-realistic CUDA schedule.

Member

Yeah, let's defer this particular issue.

@masahi masahi changed the title [CUDA] Allow dynamic shmem of size > 48K [CUDA] Allow dynamic shmem of size > 48K in runtime May 26, 2022
if (fcache_[device_id] == nullptr) {
  fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  if (wl.dyn_shmem_size >= (48 << 10)) {
    // Assumption: dyn_shmem_size doesn't change across different invocations of
    // fcache_[device_id]
masahi (Member Author)

This assumption could be controversial, but it should be mostly OK in practice. To support a kernel that uses different big shmem sizes depending on the input, we would need to call `cuFuncSetAttribute` on every invocation.

junrushao (Member)
Thanks @masahi @vinx13 @Hzfengsy, it's merged!

driazati pushed a commit to driazati/tvm that referenced this pull request May 27, 2022

juda pushed a commit to juda/tvm that referenced this pull request Jun 21, 2022
4 participants