
Cudnn support #16

Closed · wants to merge 29 commits into from

Conversation

@M1ngXU (Contributor) commented Nov 8, 2022

Currently this implements only Tensor4D, Activation, and Convolution, both forward and backward.

@M1ngXU marked this pull request as ready for review November 12, 2022 15:17
@M1ngXU (Contributor, Author) commented Dec 13, 2022

Should the tensor data struct be changed to something like `CudaSlice`?

@coreylowman mentioned this pull request Dec 14, 2022
/// # See also
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorDescriptor_t>
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyTensorDescriptor>
pub struct TensorDescriptor<T, const N: usize, const C: usize, const H: usize, const W: usize> {
coreylowman (Owner):

Yeah, this can be changed to hold runtime values instead of const generics.

M1ngXU (Contributor, Author):

OK, but should this panic at runtime (via asserts) or just be `unsafe`?

M1ngXU (Contributor, Author):

(This most likely affects every module in the cudnn wrapper 😅)

coreylowman (Owner):

Runtime panics on creation, IMO.
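A minimal sketch of what that direction could look like (the type and method names here are hypothetical, not from the PR): the dims move from const generics into a runtime field, and invalid shapes panic in the constructor rather than requiring an `unsafe` API.

```rust
/// Hypothetical runtime-shape descriptor: dims are values, not const
/// generics, and validation panics on creation as suggested above.
pub struct TensorDescriptor4D {
    dims: [usize; 4], // [N, C, H, W]
}

impl TensorDescriptor4D {
    pub fn new(n: usize, c: usize, h: usize, w: usize) -> Self {
        // runtime panic instead of an `unsafe` constructor
        assert!(n > 0 && c > 0 && h > 0 && w > 0, "all dims must be non-zero");
        Self { dims: [n, c, h, w] }
    }

    pub fn numel(&self) -> usize {
        self.dims.iter().product()
    }
}
```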

@@ -0,0 +1,55 @@
// every function must end with `f32` and only accept `float`s; then the same function with `f64` and `double`s will be generated

extern "C" __global__ void recip_with_scale_f32(float *out, const float *a, const float *a_scale, size_t numel)
coreylowman (Owner):

Whoa, these are awesome! Especially that compile_custom_kernels.sh file. 🚀

M1ngXU (Contributor, Author):

Thanks :)
(Though this doesn't work on Windows for me, lol.)
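The f32-to-f64 generation rule described in the code comment above can be implemented as a simple textual pass. A sketch under that assumption (this is not the PR's actual build script, and `generate_f64_variant` is a name I made up):

```python
import re

def generate_f64_variant(kernel_src: str) -> str:
    """Given an f32 kernel whose name ends in `_f32` and whose floating
    point parameters are all `float`, emit the matching f64 kernel."""
    # swap the name suffix first, then the scalar type
    src = kernel_src.replace("_f32", "_f64")
    # word-boundary replace so identifiers merely containing "float" survive
    return re.sub(r"\bfloat\b", "double", src)

f32_kernel = (
    'extern "C" __global__ void recip_with_scale_f32('
    "float *out, const float *a, const float *a_scale, size_t numel)"
)
f64_kernel = generate_f64_variant(f32_kernel)
```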

@coreylowman (Owner):

This does seem to fit more at the device level in dfdx, especially since we'll need a layer there regardless to conform to the kernel traits over there.

@M1ngXU (Contributor, Author) commented Dec 14, 2022

@coreylowman can you take over these changes? I am only available tomorrow for a few hours, and maybe a bit on Friday, but I could also do them in 3.5 weeks.

@M1ngXU (Contributor, Author) commented Dec 14, 2022

Hmm, lots of merge conflicts.

src/cudarc.rs (outdated)
- block_dim: (n, 1, 1),
+ // round up
+ grid_dim: ((n + 1023) / 1024, 1, 1),
+ block_dim: (n.min(1024), 1, 1),
coreylowman (Owner):

Nice, we should port this `min` over to main in a separate PR.
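The launch-config arithmetic in that diff can be checked on its own; a quick sketch of the same computation (the function name is mine):

```rust
/// Mirrors the diff above: one thread per element, blocks capped at
/// 1024 threads, and the grid rounded up to cover all `n` elements.
fn launch_config(n: u32) -> (u32, u32) {
    let grid_dim = (n + 1023) / 1024; // round up
    let block_dim = n.min(1024);
    (grid_dim, block_dim)
}
```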

Comment on lines +34 to +37
impl_activation_mode!(Sigmoid: CUDNN_ACTIVATION_SIGMOID);
impl_activation_mode!(Relu: CUDNN_ACTIVATION_RELU);
impl_activation_mode!(Tanh: CUDNN_ACTIVATION_TANH);
impl_activation_mode!(Elu: CUDNN_ACTIVATION_ELU);
coreylowman (Owner):

I'm thinking we don't need to add this activation forward pass for dfdx, since there are so few activations in cudnn. We can just write custom kernels for them (they are really easy to write).

use crate::prelude::*;

/// This does the softmax activation per image.
pub struct Softmax;
coreylowman (Owner) commented Jan 9, 2023:

cudnn softmax only supports a single axis, right? We don't necessarily need a softmax impl for dfdx, since dfdx implements it from lower-level primitives, which also supports softmax over any axis.
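To illustrate the lower-level-primitives point: softmax over one axis is just max-shift, exp, sum, and divide. A CPU sketch over a 1-d slice (not dfdx's actual implementation):

```rust
/// Numerically stable softmax built from the primitive ops
/// (max, sub, exp, sum, div) that dfdx composes.
fn softmax(xs: &[f32]) -> Vec<f32> {
    // subtract the max first so exp never overflows
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = xs.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```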

use crate::prelude::*;

/// Uses per image (after conv2d) normalization.
pub type BatchNormalizationPerImage<
coreylowman (Owner):

BatchNorm in dfdx is implemented on top of lower-level primitives, so there wouldn't be a great way to use this.


/// A struct that holds all the data to calculate `dx` by `y`, the filter and
/// `dy`.
pub struct Convolution2DBackward<
coreylowman (Owner):

Yeah, we should probably use this for dfdx; there's a similar workspace-based approach in the dfdx CPU conv2d kernel, so I think this will fit in nicely.

We can probably drop the const generics here, though, to simplify the API.
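The workspace pattern mentioned here, in a plain-Rust sketch (the struct and method names are hypothetical, and the actual cudnn workspace-size query is elided): scratch memory is sized once per configuration and reused across backward calls instead of being reallocated each time.

```rust
/// Hypothetical workspace holder: the scratch buffer is allocated once
/// for a given configuration and handed out on every backward call.
struct ConvBackwardWorkspace {
    workspace: Vec<u8>,
}

impl ConvBackwardWorkspace {
    fn new(workspace_size: usize) -> Self {
        // `workspace_size` would come from cudnn's workspace-size query
        Self { workspace: vec![0u8; workspace_size] }
    }

    fn scratch(&mut self) -> &mut [u8] {
        // reused across calls instead of reallocating each time
        &mut self.workspace
    }
}
```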

Comment on lines +16 to +19
pub struct Filter<T, const C_OUT: usize, const C_IN: usize, const H: usize, const W: usize> {
descriptor: Rc<FilterDescriptor<T, C_OUT, C_IN, H, W>>,
data: Tensor4DData<T, C_OUT, C_IN, H, W>,
}
coreylowman (Owner):

It's weird that they have a separate descriptor for filters; I wonder why they don't just use a tensor descriptor.

Comment on lines +63 to +71
impl_reduce_op!(ReduceOperationAdd: CUDNN_REDUCE_TENSOR_ADD);
impl_reduce_op!(ReduceOperationMul: CUDNN_REDUCE_TENSOR_MUL);
impl_reduce_op!(ReduceOperationMin: CUDNN_REDUCE_TENSOR_MIN);
impl_reduce_op!(ReduceOperationMax: CUDNN_REDUCE_TENSOR_MAX);
impl_reduce_op!(ReduceOperationAMax: CUDNN_REDUCE_TENSOR_AMAX);
impl_reduce_op!(ReduceOperationAvg: CUDNN_REDUCE_TENSOR_AVG);
impl_reduce_op!(ReduceOperationNorm1: CUDNN_REDUCE_TENSOR_NORM1);
impl_reduce_op!(ReduceOperationNorm2: CUDNN_REDUCE_TENSOR_NORM2);
impl_reduce_op!(ReduceOperationMulNoZeros: CUDNN_REDUCE_TENSOR_MUL_NO_ZEROS);
coreylowman (Owner):

Yeah, this will all be super useful. Let's move to runtime shapes, though, as mentioned elsewhere; I can help with this.

/// # See also
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorDescriptor_t>
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyTensorDescriptor>
pub struct TensorDescriptor<T, const N: usize, const C: usize, const H: usize, const W: usize> {
coreylowman (Owner):

We also need to support up to 6-d tensors, which it looks like you can do pretty easily with the tensor descriptors; this can be tracked at runtime like everything else 😄

M1ngXU (Contributor, Author):

So `TensorNd` = `TensorDescriptorNd` (holding the axes) + `CudaSlice`?
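A rough sketch of that pairing (all names and fields here are hypothetical; in the real thing the buffer would be a cudarc `CudaSlice`): an N-d descriptor holding runtime dims and strides, paired with a flat buffer.

```rust
/// Hypothetical N-d descriptor: runtime dims and strides instead of
/// const generics, covering the up-to-6-d requirement discussed above.
struct TensorDescriptorNd {
    dims: Vec<usize>,
    strides: Vec<usize>,
}

/// The proposed pairing: a descriptor plus a flat buffer (generic here;
/// on the device this would be a `CudaSlice<f32>`).
struct TensorNd<Buf> {
    descriptor: TensorDescriptorNd,
    data: Buf,
}

impl TensorDescriptorNd {
    /// Fully packed row-major layout; panics on creation for
    /// unsupported ranks, per the runtime-panic decision above.
    fn packed(dims: Vec<usize>) -> Self {
        assert!((1..=6).contains(&dims.len()), "rank must be 1..=6");
        let mut strides = vec![1; dims.len()];
        for i in (0..dims.len() - 1).rev() {
            strides[i] = strides[i + 1] * dims[i + 1];
        }
        Self { dims, strides }
    }
}
```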

@M1ngXU closed this Mar 21, 2023