Computational graph grows with the sequence length #467

Open
PABannier opened this issue Aug 20, 2023 · 5 comments
Labels
question Further information is requested

Comments

@PABannier
Contributor

Hello everyone!

I'm currently improving the GGML implementation of an LSTM network.
My main focus is avoiding scalability issues with the computational graph.

Currently I'm setting GGML_MAX_NODES to a very high value (100k): https://github.com/PABannier/bark.cpp/blob/main/ggml.h#L206C32-L206C38

This is because, for the LSTM network, the number of nodes in the computational graph grows with the sequence length: https://github.com/PABannier/bark.cpp/blob/main/encodec.cpp#L81C25-L81C36 .

I wanted a quick fix in order to have a first POC of bark.cpp. Now that we want to clean things up, I'm wondering what the best solution is.

I was wondering whether we should create one ggml context and computational graph per time point to avoid these scalability issues in the graph. One subgraph would be created per time point, the forward pass would be run, and the result of one cell obtained.
This feels hacky and quite costly, considering the overhead of building the graph, copying tensors from one context to another, etc.

What do you think would be the best solution?
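To make the scalability issue concrete, here is a minimal sketch (not ggml code; the function names and node counts are illustrative assumptions) of why an unrolled graph grows linearly with the sequence length, while a per-timestep graph stays constant in size:

```cpp
#include <cstddef>

// Illustrative sketch: if each LSTM cell unrolls to a fixed number of graph
// nodes, appending one sub-graph per time step makes the total node count
// grow linearly with sequence length.
std::size_t unrolled_nodes(std::size_t nodes_per_cell, std::size_t seq_len) {
    return nodes_per_cell * seq_len;  // one sub-graph appended per time step
}

// With a per-timestep graph, the same small graph is re-run each step.
std::size_t per_step_nodes(std::size_t nodes_per_cell, std::size_t /*seq_len*/) {
    return nodes_per_cell;
}
```

For example, assuming roughly 20 nodes per cell and a 10,000-step sequence, the unrolled graph needs 200,000 nodes, which already exceeds a 100k `GGML_MAX_NODES` limit.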

@lin72h

lin72h commented Aug 21, 2023

Great work! Have you considered rwkv.cpp, which is also a form of RNN? From my limited knowledge, I think it's superior to LSTM.

@ggerganov
Owner

Maybe we should refactor ggml_cgraph to allow dynamically allocating nodes. Something like this:

    struct ggml_cgraph {
        int n_nodes;
        int n_leafs;

        struct ggml_tensor * nodes_stack[GGML_MAX_NODES];
        struct ggml_tensor * grads_stack[GGML_MAX_NODES];
        struct ggml_tensor * leafs_stack[GGML_MAX_NODES];

        // by default we work on the stack
        struct ggml_tensor ** nodes = nodes_stack;
        struct ggml_tensor ** grads = grads_stack;
        struct ggml_tensor ** leafs = leafs_stack;

        void * visited_hash_table[GGML_GRAPH_HASHTABLE_SIZE];

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;
    };

When we allocate the graph on the stack, we initialize nodes, grads and leafs to point to the stack arrays.
When we allocate it on the heap, we allow passing an arbitrary number of nodes, which we malloc, and redirect nodes, grads and leafs to them.

The reason I am thinking about something like this is that I want to keep the option to have the graph fully allocated on the stack when it fits in GGML_MAX_NODES.
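The stack-by-default / heap-on-demand idea could be sketched roughly as follows (a minimal self-contained illustration; the names `cgraph_init`, `cgraph_reserve` and the struct layout are hypothetical, not the real ggml API):

```cpp
#include <cstdlib>
#include <cstring>

#define MAX_NODES_STACK 64

struct tensor;  // opaque placeholder standing in for ggml_tensor

struct cgraph {
    int n_nodes;
    int capacity;
    struct tensor *  nodes_stack[MAX_NODES_STACK];  // fixed in-struct storage
    struct tensor ** nodes;                         // points at nodes_stack by default
};

void cgraph_init(struct cgraph * g) {
    g->n_nodes  = 0;
    g->capacity = MAX_NODES_STACK;
    g->nodes    = g->nodes_stack;  // by default we work on the stack
}

// Redirect to heap storage only when the fixed array is too small.
void cgraph_reserve(struct cgraph * g, int n) {
    if (n <= g->capacity) return;
    struct tensor ** heap = (struct tensor **) malloc(n * sizeof(*heap));
    memcpy(heap, g->nodes, g->n_nodes * sizeof(*heap));
    if (g->nodes != g->nodes_stack) free(g->nodes);
    g->nodes    = heap;
    g->capacity = n;
}
```

Small graphs pay no allocation cost at all, and only graphs exceeding the fixed limit incur a malloc, which matches the goal of keeping the fully stack-allocated fast path.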

Any other suggestions? cc @slaren

@ggerganov ggerganov added the question Further information is requested label Aug 22, 2023
@slaren
Collaborator

slaren commented Aug 22, 2023

This is probably the best we can do for now. An improvement would be to move the storage to the end of the struct, so that the space reserved for the stack can still be used when allocated on the heap. I.e.:

    struct ggml_cgraph {
        int n_nodes;
        int n_leafs;

        // by default we work on the stack
        struct ggml_tensor ** nodes = storage + 0;
        struct ggml_tensor ** grads = storage + GGML_MAX_NODES;
        struct ggml_tensor ** leafs = storage + GGML_MAX_NODES * 2;

    void ** visited_hash_table = (void **) (storage + GGML_MAX_NODES * 3);

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;

        // may be larger than this when allocated on the heap
        struct ggml_tensor * storage[GGML_MAX_NODES * 3 + GGML_GRAPH_HASHTABLE_SIZE];
    };

The hash table must also grow to hold all the nodes and leafs, and its size should be a prime number; otherwise the number of collisions will be huge and performance will suffer.
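Growing the table to a prime size could be done with a helper along these lines (a hedged sketch; `is_prime` and `next_prime` are illustrative names, not ggml functions). A prime modulus spreads pointer keys, which tend to share low bits due to allocator alignment, more evenly across buckets:

```cpp
// Trial-division primality test; fine for one-off table sizing.
bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; (long long) d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime >= min, used to size the visited hash table.
int next_prime(int min) {
    int n = min;
    while (!is_prime(n)) ++n;
    return n;
}
```

For example, when the node count doubles, the table size would be recomputed as `next_prime(2 * old_capacity)` rather than simply doubled, avoiding power-of-two sizes where aligned pointers collide heavily.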

In the long term, I hope we will be able to remove the node limits from the computation graphs and calculate the required memory automatically, but that's probably going to require a larger redesign.

@PABannier
Contributor Author

Looking at the rwkv repo, I think the easiest solution for now is to move the computational graph of Encodec onto the heap. I have skimmed through the code, and it seems like they build a computational graph per time point.

@datduonguva

You can take a look at my work.
Basically, the computation graph (a GRU in this case) includes only the forward pass of one cell. To loop through the entire sequence, at each timestep we reset the input and call ggml_graph_compute_with_ctx().

https://github.com/datduonguva/ggml-experiments/blob/master/rnn_text_gen/rnn_text_generation.cpp

Hope this helps! (I am very new to the project so my approach might not be correct).
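The per-timestep approach described above can be sketched without ggml as follows (a toy illustration, assuming a made-up cell with constant weights; the real implementation uses ggml tensors and `ggml_graph_compute_with_ctx()`). The key point is that the "graph", here just the forward pass of one cell, has a fixed size, and only the hidden state is carried across timesteps:

```cpp
#include <array>
#include <cmath>

constexpr int H = 4;               // hidden state size
using State = std::array<float, H>;

// One cell step: h' = tanh(a * h + b * x), a stand-in for a real GRU cell.
State cell_forward(const State & h, float x) {
    State out{};
    for (int i = 0; i < H; ++i)
        out[i] = std::tanh(0.5f * h[i] + 0.1f * x);
    return out;
}

// Re-run the same fixed-size "graph" once per timestep, feeding the
// previous hidden state back in. Memory use is independent of seq_len.
State run_sequence(const float * xs, int seq_len) {
    State h{};  // zero-initialized hidden state
    for (int t = 0; t < seq_len; ++t)
        h = cell_forward(h, xs[t]);
    return h;
}
```

The trade-off, as noted earlier in the thread, is the per-step overhead of resetting inputs and dispatching a compute call, in exchange for a node count that no longer grows with the sequence length.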
