Cudnn support #16
Conversation
Should the tensor data struct be changed to something like `CudaSlice`?
```rust
/// # See also
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorDescriptor_t>
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyTensorDescriptor>
pub struct TensorDescriptor<T, const N: usize, const C: usize, const H: usize, const W: usize> {
```
Yeah this can be changed to hold runtime values instead of const generics
OK, but should this panic at runtime (asserts) or just be `unsafe`?
(this is most likely affecting every module in cudnn 😅)
Runtime panics on creation, IMO.
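A minimal sketch of the direction being discussed, with hypothetical names (`TensorDescriptor4D`, `new`) that are not the PR's actual API: runtime dims instead of const generics, validated with asserts at construction so later use needs no `unsafe`.

```rust
use std::marker::PhantomData;

/// Illustrative only: a 4D descriptor carrying its shape at runtime.
pub struct TensorDescriptor4D<T> {
    dims: [usize; 4], // (N, C, H, W)
    _marker: PhantomData<T>,
}

impl<T> TensorDescriptor4D<T> {
    /// Panics on invalid shapes at creation, so later calls need no `unsafe`.
    pub fn new(n: usize, c: usize, h: usize, w: usize) -> Self {
        assert!(
            n > 0 && c > 0 && h > 0 && w > 0,
            "tensor dimensions must be non-zero"
        );
        Self { dims: [n, c, h, w], _marker: PhantomData }
    }

    pub fn dims(&self) -> [usize; 4] {
        self.dims
    }
}
```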
```cuda
// every function must end with `f32` and only accept `float`s; then the same
// function with `f64` and `double`s will be generated

extern "C" __global__ void recip_with_scale_f32(float *out, const float *a, const float *a_scale, size_t numel)
```
Whoa, these are awesome! Especially that `compile_custom_kernels.sh` file. 🚀
Thanks :)
(though this doesn't work on Windows for me, lol)
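To illustrate the convention that comment describes, here is a hedged sketch of the full kernel pair. Only the `_f32` signature appears in the diff, so the kernel body and the generated `_f64` twin shown here are assumptions:

```cuda
// Hypothetical body for the f32 kernel; the diff only shows its signature.
extern "C" __global__ void recip_with_scale_f32(float *out, const float *a,
                                                const float *a_scale, size_t numel) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numel) {
        // scaled reciprocal: a_scale * (1 / a[i])
        out[i] = *a_scale / a[i];
    }
}

// The build script would then emit the same function with an `_f64` suffix
// and `double` in place of every `float`:
extern "C" __global__ void recip_with_scale_f64(double *out, const double *a,
                                                const double *a_scale, size_t numel) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numel) {
        out[i] = *a_scale / a[i];
    }
}
```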
This does seem to fit better at the device level in dfdx, especially since we'll need a layer there regardless to conform to the kernel traits over there.
@coreylowman can you take over these changes? I am only available tomorrow for a few hours and maybe a bit on Friday, but I could also do them in 3.5 weeks.
Hmm, lots of merge conflicts.
src/cudarc.rs (outdated)
```diff
-block_dim: (n, 1, 1),
+// round up
+grid_dim: ((n + 1023) / 1024, 1, 1),
+block_dim: (n.min(1024), 1, 1),
```
Nice, we should port this `min` over to main in a separate PR.
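For reference, a small sketch of the launch-config math in that diff (the helper name `launch_cfg` is illustrative, not an API from this PR): the block size is capped at CUDA's 1024-threads-per-block limit, and the grid size is rounded up with a ceiling division so every element is covered.

```rust
/// Illustrative helper mirroring the diff above; not part of this PR.
fn launch_cfg(n: u32) -> ((u32, u32, u32), (u32, u32, u32)) {
    // ceiling division: enough blocks to cover all `n` elements
    let grid_dim = ((n + 1023) / 1024, 1, 1);
    // never request more threads per block than CUDA's 1024 limit
    let block_dim = (n.min(1024), 1, 1);
    (grid_dim, block_dim)
}
```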
```rust
impl_activation_mode!(Sigmoid: CUDNN_ACTIVATION_SIGMOID);
impl_activation_mode!(Relu: CUDNN_ACTIVATION_RELU);
impl_activation_mode!(Tanh: CUDNN_ACTIVATION_TANH);
impl_activation_mode!(Elu: CUDNN_ACTIVATION_ELU);
```
I'm thinking we don't need to add this activation forward for dfdx since there are so few in cuDNN. We can just write custom kernels for them (they are really easy to write; see the sketch below).
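As a hedged example of how small such a custom kernel would be, following the `_f32` naming convention from the kernels file (this kernel itself is not part of the PR):

```cuda
// Illustrative custom activation kernel; an `_f64` twin would be generated
// by the same script that handles the other kernels.
extern "C" __global__ void relu_f32(float *out, const float *inp, size_t numel) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numel) {
        out[i] = fmaxf(inp[i], 0.0f);
    }
}
```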
```rust
use crate::prelude::*;

/// This does the softmax activation per image.
pub struct Softmax;
```
cuDNN softmax only supports a single axis, right? We don't necessarily need a softmax impl for dfdx since dfdx implements it with lower-level primitives, and it also supports softmax over any axis.
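For context, softmax reduces to exactly the kind of lower-level primitives mentioned there (max, sub, exp, sum, div). A minimal single-axis sketch on a host slice, purely for illustration:

```rust
/// Illustrative only: softmax over one axis, decomposed into max/sub/exp/sum/div.
fn softmax_in_place(row: &mut [f32]) {
    // subtract the max first for numerical stability
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}
```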
```rust
use crate::prelude::*;

/// Uses per image (after conv2d) normalization.
pub type BatchNormalizationPerImage<
```
BatchNorm in dfdx is implemented on top of lower-level primitives, so there wouldn't be a great way to use this.
```rust
/// A struct that holds all the data to calculate `dx` by `y`, the filter and
/// `dy`.
pub struct Convolution2DBackward<
```
Yeah, we should probably use this for dfdx; there's a similar approach in the dfdx CPU conv2d kernel with workspaces, so I think this will fit in nicely.
We can probably move to no const generics here though, to simplify the API.
```rust
pub struct Filter<T, const C_OUT: usize, const C_IN: usize, const H: usize, const W: usize> {
    descriptor: Rc<FilterDescriptor<T, C_OUT, C_IN, H, W>>,
    data: Tensor4DData<T, C_OUT, C_IN, H, W>,
}
```
It's weird how they have a separate descriptor for filters; I wonder why they don't just use a tensor descriptor?
```rust
impl_reduce_op!(ReduceOperationAdd: CUDNN_REDUCE_TENSOR_ADD);
impl_reduce_op!(ReduceOperationMul: CUDNN_REDUCE_TENSOR_MUL);
impl_reduce_op!(ReduceOperationMin: CUDNN_REDUCE_TENSOR_MIN);
impl_reduce_op!(ReduceOperationMax: CUDNN_REDUCE_TENSOR_MAX);
impl_reduce_op!(ReduceOperationAMax: CUDNN_REDUCE_TENSOR_AMAX);
impl_reduce_op!(ReduceOperationAvg: CUDNN_REDUCE_TENSOR_AVG);
impl_reduce_op!(ReduceOperationNorm1: CUDNN_REDUCE_TENSOR_NORM1);
impl_reduce_op!(ReduceOperationNorm2: CUDNN_REDUCE_TENSOR_NORM2);
impl_reduce_op!(ReduceOperationMulNoZeros: CUDNN_REDUCE_TENSOR_MUL_NO_ZEROS);
```
Yeah, this will all be super useful. Let's move to runtime shapes though, as mentioned in other places; I can help with this.
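For readers unfamiliar with the pattern, here is a sketch of what one such invocation might expand to. The PR's actual macro may differ, and the enum below is a hand-written stand-in for the generated cuDNN binding:

```rust
// Stand-in for the cuDNN sys binding; the two shown values match cuDNN's enum.
#[allow(non_camel_case_types, dead_code)]
#[repr(u32)]
enum cudnnReduceTensorOp_t {
    CUDNN_REDUCE_TENSOR_ADD = 0,
    CUDNN_REDUCE_TENSOR_MUL = 1,
    // ... remaining variants elided
}

/// Maps a zero-sized marker type to the raw cuDNN reduce op.
trait ReduceOperation {
    fn raw() -> cudnnReduceTensorOp_t;
}

/// What `impl_reduce_op!(ReduceOperationAdd: CUDNN_REDUCE_TENSOR_ADD)`
/// could plausibly expand to.
struct ReduceOperationAdd;

impl ReduceOperation for ReduceOperationAdd {
    fn raw() -> cudnnReduceTensorOp_t {
        cudnnReduceTensorOp_t::CUDNN_REDUCE_TENSOR_ADD
    }
}
```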
```rust
/// # See also
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorDescriptor_t>
/// <https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnDestroyTensorDescriptor>
pub struct TensorDescriptor<T, const N: usize, const C: usize, const H: usize, const W: usize> {
```
We also need to support up to 6D tensors, which it looks like you can do pretty easily with the tensor descriptors; this can be tracked at runtime like everything else 😄
So `TensorNd` = `TensorDescriptorNd([Axes])` + `CudaSlice`?
Currently only `Tensor4D`, `Activation` and `Convolution`, both forward and backward.
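A rough sketch of that proposed composition, with all names hypothetical (the PR only has `Tensor4D` so far) and a host-side stand-in for the device buffer type:

```rust
use std::marker::PhantomData;
use std::rc::Rc;

// Stand-in for the device buffer type discussed above (e.g. cudarc's
// `CudaSlice`); a real impl would hold a device pointer, not a host Vec.
pub struct CudaSlice<T>(Vec<T>);

/// Illustrative Nd descriptor: per-axis dims and strides tracked at runtime
/// (the thread above mentions supporting up to 6D).
pub struct TensorDescriptorNd<T> {
    dims: Vec<i32>,
    strides: Vec<i32>,
    _marker: PhantomData<T>,
}

/// Illustrative `TensorNd` = descriptor + data, as asked in the question above.
pub struct TensorNd<T> {
    descriptor: Rc<TensorDescriptorNd<T>>,
    data: CudaSlice<T>,
}
```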