
[Runtime][Contrib] Support cudnn softmax #5214

Merged: 6 commits into apache:master from softmax-cudnn, Apr 6, 2020

Conversation

icemelon (Member) commented Apr 2, 2020

Using cuDNN can improve softmax performance on NVIDIA GPUs.

@yzhliu @Laurawly @ZihengJiang
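
For context, here is a minimal sketch of what the host-side cuDNN offload looks like; the function name softmax_cudnn, the 2-D (batch, channels) shape, and the omitted error checking are illustrative assumptions, not the exact contrib code:

    #include <cudnn.h>

    // Sketch: softmax over the last axis of a (batch, channels) float tensor
    // via cuDNN. Error checking elided; names are illustrative.
    void softmax_cudnn(cudnnHandle_t handle, const float* x, float* y,
                       int batch, int channels) {
      cudnnTensorDescriptor_t desc;
      cudnnCreateTensorDescriptor(&desc);
      // Treat the input as (N, C, 1, 1); MODE_INSTANCE normalizes over C per instance.
      cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                 batch, channels, 1, 1);
      const float alpha = 1.0f, beta = 0.0f;
      cudnnSoftmaxForward(handle, CUDNN_SOFTMAX_ACCURATE, CUDNN_SOFTMAX_MODE_INSTANCE,
                          &alpha, desc, x, &beta, desc, y);
      cudnnDestroyTensorDescriptor(desc);
    }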


Review thread on this hunk of the patch:

    // Set mode and shape descriptor
    if (axis == ndim - 1) {
      int64_t N = 1;

Member: I'm confused why we need int64_t here but later cast to int.

icemelon (Member, Author): It's because DLTensor defines the shape in int64_t. There will be a cast anyway.
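
A short sketch of the point under discussion, assuming the surrounding loop and descriptor call from the patch (desc, shape, and ndim as in the hunk above): DLTensor::shape is an int64_t array, so the product is accumulated in int64_t and narrowed only where cuDNN's int-typed descriptor API forces the cast.

    // shape comes from DLTensor::shape (int64_t*), ndim from DLTensor::ndim.
    int64_t N = 1;
    for (int i = 0; i < ndim - 1; ++i) {
      N *= shape[i];  // accumulate in int64_t to match DLTensor
    }
    // cudnnSetTensor4dDescriptor takes int, so the cast happens here anyway.
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               static_cast<int>(N),
                               static_cast<int>(shape[ndim - 1]), 1, 1);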

tqchen (Member) commented Apr 2, 2020

As a matter of principle, it would be great if we could look into making the native op as fast.

icemelon (Member, Author) commented Apr 3, 2020

@tqchen Yes, I understand that. But the latency difference between the TVM schedule and cuDNN can be 10x for input shapes like [100, 1024] on a V100. I guess achieving such performance requires fusion across multiple stages of the reduction, which seems not easy to implement in TIR.

tqchen (Member) commented Apr 3, 2020

OK, I am not trying to block the PR, merely saying it would be great to have such an investigation :)

tqchen merged commit 799ff35 into apache:master on Apr 6, 2020

tqchen (Member) commented Apr 6, 2020

@wpan11nv @yongfeng-nv can you suggest a bit about possible optimizations that can be done?

yongfeng-nv (Contributor) replied: We don't know the details, but we will look into it.

icemelon added a commit to icemelon/tvm that referenced this pull request Apr 14, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020

wpan11nv (Contributor) commented Apr 20, 2020

Replying to the question above: The CUDA schedule emits 4 kernels, which causes a lot of I/O overhead. Ideally, we could emit a single kernel for small reduction sizes (e.g. reduction dim n <= 1024).
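
To make the single-kernel idea concrete, here is a rough CUDA sketch, assuming float32 rows and one warp per row, that fuses the max, sum, and normalize stages into one launch using warp shuffles (the direction later taken in #5600); the kernel and helper names are illustrative, not the TVM schedule:

    #include <cfloat>

    __inline__ __device__ float warp_reduce_max(float v) {
      for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_down_sync(0xffffffff, v, offset));
      return v;  // result valid in lane 0
    }

    __inline__ __device__ float warp_reduce_sum(float v) {
      for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
      return v;  // result valid in lane 0
    }

    // One warp (32 threads) per row; intended for small n (e.g. n <= 1024).
    __global__ void softmax_rows(const float* x, float* y, int n) {
      const float* row = x + (size_t)blockIdx.x * n;
      float* out = y + (size_t)blockIdx.x * n;

      // Stage 1: row max, kept in registers instead of a separate kernel.
      float m = -FLT_MAX;
      for (int i = threadIdx.x; i < n; i += 32) m = fmaxf(m, row[i]);
      m = __shfl_sync(0xffffffff, warp_reduce_max(m), 0);  // broadcast lane 0

      // Stage 2: sum of exp(x - max).
      float s = 0.0f;
      for (int i = threadIdx.x; i < n; i += 32) s += __expf(row[i] - m);
      s = __shfl_sync(0xffffffff, warp_reduce_sum(s), 0);

      // Stage 3: normalize, still within the same launch.
      for (int i = threadIdx.x; i < n; i += 32) out[i] = __expf(row[i] - m) / s;
    }

Launched as softmax_rows<<<rows, 32>>>(d_x, d_y, n), all three reduction stages stay in one kernel, avoiding the intermediate global-memory round trips of the 4-kernel schedule.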

dpankratz pushed a commit to dpankratz/incubator-tvm that referenced this pull request Apr 24, 2020

tqchen (Member) commented Jun 5, 2020

See #5600 for improving softmax with warp shuffle.

icemelon deleted the softmax-cudnn branch on July 21, 2020, 22:53