- Kernel for masked_matrix_multiplication (both forward and backward)
- Kernel for sparse_softmax (both forward and backward)
- Kernel for vector-shape spmm (both forward and backward; naive reference semantics for all three are sketched below)
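
To pin down the semantics these kernels need to implement, here is a minimal NumPy sketch of the forward passes, assuming the graph is stored as COO edge arrays `src`/`dst`; the function and argument names are illustrative only, not DGL's API.

```python
import numpy as np

def masked_mm(src, dst, K, Q):
    # Per-edge scores: score[e] = <Q[dst[e]], K[src[e]]>, i.e. Q @ K.T
    # evaluated only at the (dst, src) pairs that are edges of the graph.
    return np.einsum("ed,ed->e", Q[dst], K[src])

def sparse_softmax(dst, scores, num_nodes):
    # Softmax of edge scores, normalized over the incoming edges of each
    # destination node.
    out = np.empty_like(scores)
    for v in range(num_nodes):
        idx = np.where(dst == v)[0]
        if idx.size:
            e = np.exp(scores[idx] - scores[idx].max())
            out[idx] = e / e.sum()
    return out

def vector_spmm(src, dst, attn, V, num_nodes):
    # "Vector-shape" spmm: out[v] = sum over edges e with dst[e] == v of
    # attn[e] * V[src[e]], where each nonzero multiplies a feature vector.
    out = np.zeros((num_nodes, V.shape[1]), dtype=V.dtype)
    np.add.at(out, dst, attn[:, None] * V[src])
    return out
```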
The current self-attention implementation in DGL is inefficient and uses too much GPU memory.
Custom op support is required to accelerate graph operations such as masked_mm and sparse_softmax, which are used in the self-attention module.
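
For context, a hedged sketch of how these primitives would compose into one scaled-dot-product attention head on a graph, reusing the reference functions above (`graph_self_attention` is a hypothetical name, not an existing DGL API):

```python
def graph_self_attention(src, dst, Q, K, V, num_nodes):
    # Scaled dot-product attention restricted to the edges of the graph.
    scores = masked_mm(src, dst, K, Q) / np.sqrt(Q.shape[1])  # one score per edge
    attn = sparse_softmax(dst, scores, num_nodes)             # normalize over incoming edges
    return vector_spmm(src, dst, attn, V, num_nodes)          # weighted sum of neighbor values
```

Keeping the scores and attention weights per edge is the point of the custom kernels: a dense implementation materializes the full N x N attention matrix even though only the masked-in entries are ever used.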
More elegant solutions may emerge in the future, but for now we write custom ops for these operations ourselves.
You may find my preliminary custom op implementations here (private repo); note that I have not covered MXNet yet, and I hope team members familiar with MXNet can help.