Computational graph grows with the sequence length #467
Great work! Have you considered rwkv.cpp, which is also a form of RNN? From my limited knowledge, I think it's superior to an LSTM.
Maybe we should refactor `struct ggml_cgraph` like this:

```c
struct ggml_cgraph {
    int n_nodes;
    int n_leafs;

    struct ggml_tensor * nodes_stack[GGML_MAX_NODES];
    struct ggml_tensor * grads_stack[GGML_MAX_NODES];
    struct ggml_tensor * leafs_stack[GGML_MAX_NODES];

    // by default we work on the stack
    struct ggml_tensor ** nodes = nodes_stack;
    struct ggml_tensor ** grads = grads_stack;
    struct ggml_tensor ** leafs = leafs_stack;

    void * visited_hash_table[GGML_GRAPH_HASHTABLE_SIZE];

    // performance
    int perf_runs;
    int64_t perf_cycles;
    int64_t perf_time_us;
};
```

When we allocate the graph on the stack, the `nodes`, `grads` and `leafs` pointers are initialized to the `*_stack` arrays; when the graph is allocated on the heap, they can point to larger heap buffers instead. The reason I am thinking about something like this is that I want to keep the option of having the graph fully allocated on the stack when it fits.

Any other suggestions? cc @slaren
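The stack-by-default, heap-when-needed pattern described above can be sketched outside of ggml. Everything here (`struct graph`, `graph_push`, the `STACK_CAP` constant) is hypothetical illustration code, not the actual ggml API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STACK_CAP 8  /* stands in for GGML_MAX_NODES */

/* A graph that stores its node pointers inline by default,
 * but falls back to a heap buffer when it outgrows the inline storage. */
struct graph {
    int    n_nodes;
    int    capacity;
    void **nodes;                 /* points at nodes_stack or a heap buffer */
    void  *nodes_stack[STACK_CAP]; /* inline storage, no allocation needed */
};

static void graph_init(struct graph *g) {
    g->n_nodes  = 0;
    g->capacity = STACK_CAP;
    g->nodes    = g->nodes_stack; /* by default we work on the stack */
}

static void graph_push(struct graph *g, void *node) {
    if (g->n_nodes == g->capacity) {
        int new_cap = g->capacity * 2;
        if (g->nodes == g->nodes_stack) {
            /* first overflow: move the inline contents to the heap */
            void **heap = malloc(new_cap * sizeof(void *));
            memcpy(heap, g->nodes_stack, sizeof(g->nodes_stack));
            g->nodes = heap;
        } else {
            g->nodes = realloc(g->nodes, new_cap * sizeof(void *));
        }
        g->capacity = new_cap;
    }
    g->nodes[g->n_nodes++] = node;
}

static void graph_free(struct graph *g) {
    if (g->nodes != g->nodes_stack) free(g->nodes);
}
```

The common case (small graphs) then never touches the allocator, while large graphs degrade gracefully to heap storage.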
This is probably the best we can do for now. An improvement would be to move the storage to the end of the struct, so that the space reserved for the stack can still be used when the graph is allocated on the heap. I.e.:

```c
struct ggml_cgraph {
    int n_nodes;
    int n_leafs;

    // by default we work on the stack
    struct ggml_tensor ** nodes = storage + 0;
    struct ggml_tensor ** grads = storage + GGML_MAX_NODES;
    struct ggml_tensor ** leafs = storage + GGML_MAX_NODES * 2;
    void ** visited_hash_table  = (void **) (storage + GGML_MAX_NODES * 3);

    // performance
    int perf_runs;
    int64_t perf_cycles;
    int64_t perf_time_us;

    // may be larger than this when allocated on the heap
    struct ggml_tensor * storage[GGML_MAX_NODES * 3 + GGML_GRAPH_HASHTABLE_SIZE];
};
```

The hash table must also grow to hold all the nodes and leafs, and its size should be a prime number; otherwise the number of collisions will be huge and performance will suffer. In the long term, I hope we will be able to remove the node limits from the computation graphs and calculate the required memory automatically, but that will probably require a larger redesign.
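On the sizing remark: if the hash table has to grow with the graph, its size can be derived at runtime instead of hard-coding a prime constant. A minimal sketch (`is_prime` and `next_prime` are illustrative helpers, not ggml functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Trial division; fast enough for one-off table sizing. */
static bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; (long)d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

/* Smallest prime >= n: a cheap way to pick a hash table size that
 * keeps collisions low with the usual (key % size) probing scheme. */
static int next_prime(int n) {
    while (!is_prime(n)) ++n;
    return n;
}
```

For example, a table holding up to `n` nodes could be sized as `next_prime(2 * n)` to keep the load factor low.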
Looking at the …
You can take a look at my work: https://github.com/datduonguva/ggml-experiments/blob/master/rnn_text_gen/rnn_text_generation.cpp Hope this helps! (I am very new to the project, so my approach might not be correct.)
Hello everyone!
I'm currently enhancing the GGML implementation of an LSTM network.
My main focus is to avoid scalability issues with the computational graph.
Currently I'm setting GGML_MAX_NODES to a very high value (100k): https://github.com/PABannier/bark.cpp/blob/main/ggml.h#L206C32-L206C38
This is because, for the LSTM network, the number of nodes in the computational graph grows with the sequence length: https://github.com/PABannier/bark.cpp/blob/main/encodec.cpp#L81C25-L81C36
I wanted a quick fix in order to have a first POC of bark.cpp. Now that we want to clean things up, I'm wondering what the best solution is.
I was wondering whether we should create one ggml context and computational graph per time step to avoid these scalability issues: one subgraph per time step would run the forward pass and produce the output of one cell.
This feels hacky and quite costly, considering the overhead of building the graph, copying tensors from one context to another, etc.
What do you think would be the best solution?