[Feat] Distributed Sparse Backend #36

Seventeen17 opened this issue May 24, 2023 · 0 comments

🚀 The feature, motivation and pitch

Background

GNN convolutions generate a lot of memory expansion during message passing, and the sparsity of graphs makes it hard to use parallel resources optimally, which results in poor computational performance and high peak memory.
geSpMM and geSDDMM fuse the graph operator, matrix computation, and reduce operator into a single sparse kernel, reducing kernel launches and memory usage and thereby improving performance.
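
As a point of reference, the unfused semantics of these two kernels can be written in a few lines of dense PyTorch. This is purely an illustration of what an SpMM/SDDMM kernel computes, not the fused implementations themselves; the dense intermediate in the SDDMM case is exactly the memory expansion a fused kernel is meant to avoid.

```python
import torch

# Reference (unfused) semantics, written with dense PyTorch ops purely for
# illustration -- real geSpMM/geSDDMM kernels fuse these steps and never
# materialize the dense intermediates.

def spmm_reference(adj: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """SpMM: aggregate neighbor features, out[i] = sum_j adj[i, j] * feat[j]."""
    return adj @ feat  # (N, N) x (N, F) -> (N, F)

def sddmm_reference(adj_mask: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """SDDMM: compute a[i] . b[j] only where an edge (i, j) exists."""
    scores = a @ b.T          # dense (N, N) intermediate -- the memory
    return scores * adj_mask  # expansion a fused kernel avoids

N, F = 4, 8
adj = (torch.rand(N, N) < 0.5).float()
x = torch.randn(N, F)
out = spmm_reference(adj, x)              # aggregated features, shape (N, F)
edge_scores = sddmm_reference(adj, x, x)  # per-edge scores, zero off-edges
```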

Objective

Build a Distributed Sparse Backend that uses sparse matrix multiplication to express GNN convolutions, replacing the commonly used message passing paradigm and supporting highly efficient distributed sparse convolution.

Moreover, we can optimize the parallel implementation of the kernel based on the sparsity and feature dimensions of the input data.
When the graph data or model is too large, we can use data parallelism, model parallelism, and pipeline parallelism for distributed optimization.

Tasks

This work includes the following major tasks; we will break each one down into detailed subtasks.

Phase 1: Implementations

  • Sparse matrix representation: convert GNN graph data into sparse matrix formats for efficient matrix computations such as multiplication and softmax (see the sketch after this list).
  • Sparse matrix computation kernels: e.g. geSpMM, geSDDMM, EdgeSoftmax.
  • GNN models: implement basic GNN models and LLM-GNN models with the sparse kernels to improve computational efficiency and reduce peak memory.
  • Distributed sparse modules: for commonly used GNN models, use data, model, and pipeline parallelism to implement efficient distributed sparse convolutions, similar to Megatron.
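
As an illustration of the representation and model items above, a minimal sketch of a GCN-style convolution expressed as a single SpMM over a sparse adjacency (instead of explicit per-edge message passing) could look as follows. The function name and the `edge_index` layout follow common PyTorch/PyG conventions and are not this backend's API.

```python
import torch

# A minimal sketch: GCN-style aggregation as one SpMM over a sparse adjacency.
# `edge_index` uses the usual 2 x E layout (source row, target row).

def gcn_conv_spmm(edge_index: torch.Tensor,
                  x: torch.Tensor,
                  weight: torch.Tensor) -> torch.Tensor:
    num_nodes = x.size(0)
    values = torch.ones(edge_index.size(1), device=x.device)
    adj = torch.sparse_coo_tensor(edge_index, values,
                                  (num_nodes, num_nodes)).coalesce()

    # Symmetric normalization D^{-1/2} A D^{-1/2}, applied to the edge values.
    deg = torch.sparse.sum(adj, dim=1).to_dense().clamp(min=1)
    d_inv_sqrt = deg.pow(-0.5)
    row, col = adj.indices()
    norm_vals = d_inv_sqrt[row] * adj.values() * d_inv_sqrt[col]
    adj_norm = torch.sparse_coo_tensor(adj.indices(), norm_vals,
                                       (num_nodes, num_nodes))

    # Dense feature transform, then one SpMM for the whole aggregation step.
    return torch.sparse.mm(adj_norm, x @ weight)

edge_index = torch.tensor([[0, 1, 2, 2], [1, 0, 0, 1]])
x = torch.randn(3, 16)
w = torch.randn(16, 32)
out = gcn_conv_spmm(edge_index, x, w)  # shape (3, 32)
```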

Phase 2: Performance optimizations

  • Kernel optimization: optimize kernel parallelization for different workloads, and support half-precision and mixed-precision.
  • Computation graph capture and compilation optimization: use TorchDynamo or similar techniques to capture GNN operators and dynamic sparse shapes, extend HLO to support lowering the sparse kernels above, and optimize based on the input graph (see the sketch after this list).
  • Memory optimization: use techniques such as CPU offload and ZeRO.
  • Distributed optimization: more efficient parallelism, caching, etc.
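
For the graph-capture item, a minimal sketch of capturing an SpMM-based layer with TorchDynamo via `torch.compile` is shown below. Whether the captured graph actually lowers the sparse ops to fused kernels depends on the compiler backend; today Dynamo may simply fall back to eager execution for unsupported sparse ops.

```python
import torch

# Illustrative only: capture an SpMM-based layer with TorchDynamo.
# dynamic=True hints that shapes (e.g. nnz, feature dims) may vary between
# calls; lowering the sparse ops to fused kernels is up to the backend.

@torch.compile(dynamic=True)
def spmm_layer(adj: torch.Tensor, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return torch.sparse.mm(adj, x @ w)

adj = torch.eye(4).to_sparse()       # toy sparse adjacency
x, w = torch.randn(4, 8), torch.randn(8, 16)
out = spmm_layer(adj, x, w)          # first call triggers graph capture
```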

Alternatives

No response

Additional context

No response
