How embeddings clearing works on embedding storage set to GPU #1076
Hello @pommedeterresautee, yes that is correct. However, Flair embeddings are static by default, so they are kept in GPU memory as well. The reason for this is that fine-tuning is disabled by default for FlairEmbeddings: across our experiments it worked much better to freeze the weights of the LM. This means all the models we distribute have non-dynamic Flair embeddings. So if you select 'gpu', this will slowly fill up your GPU memory, since Flair embeddings are quite large.
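For illustration, "freezing" an LM in PyTorch just means disabling gradients on its parameters. A minimal sketch (the `nn.LSTM` and its sizes are stand-ins, not Flair's actual character-LM code):

```python
import torch.nn as nn

# Stand-in for the character language model inside FlairEmbeddings
# (hypothetical architecture and sizes, for illustration only).
char_lm = nn.LSTM(input_size=100, hidden_size=512)

# Freezing: disable gradients so fine-tuning never updates these weights.
for p in char_lm.parameters():
    p.requires_grad = False

print(all(not p.requires_grad for p in char_lm.parameters()))  # True
```

With all parameters frozen, the embeddings produced by the LM never change during training, which is why it is safe (and useful) to cache them.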
Ok, I understand now. So a follow-up question: Flair embeddings are specific to each sentence, and keeping them on GPU should not bring any speed improvement. Why would someone want to keep the Flair embeddings on GPU? The same reason as #1070? Can you tell me where the FastText embedding matrix is stored with the "none" option? I think in general computer RAM, and I am wondering if it would make sense to have this kind of embedding stay in GPU RAM (because they never change) and have the Flair embeddings cleared after each batch.
It makes sense to keep them in memory during training, since you do many epochs over the same training dataset. If the embeddings for all sentences are already generated, you can always reuse them in the next epoch. This is why 'gpu' is the fastest option for training, and 'cpu' is in most cases faster than 'none', since generating embeddings is often slower than moving the tensors from CPU memory to GPU.
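The trade-off described above can be sketched as a simple cache: with storage enabled, each sentence's embedding is computed once and reused in every later epoch; with 'none', it is recomputed every time. This is a hedged toy model, not Flair's implementation; `compute_embedding` stands in for an expensive embedding forward pass:

```python
# Toy sketch of embedding storage across epochs (illustrative names only).
calls = 0

def compute_embedding(sentence):
    """Stand-in for an expensive embedding forward pass."""
    global calls
    calls += 1
    return [len(w) for w in sentence.split()]  # placeholder "embedding"

cache = {}

def embed(sentence, storage_mode):
    if storage_mode == "none":
        return compute_embedding(sentence)          # recomputed every epoch
    if sentence not in cache:
        cache[sentence] = compute_embedding(sentence)  # computed once, kept
    return cache[sentence]

corpus = ["the cat sat", "flair embeddings are large"]
for epoch in range(3):
    for s in corpus:
        embed(s, storage_mode="gpu")

print(calls)  # 2 -- one forward pass per sentence, regardless of epoch count
```

With `storage_mode="none"` the same loop would perform 6 forward passes, which is why 'none' is the slowest option for training but the safest for memory.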
Thanks a lot @alanakbik, I focused so much on inference that I forgot to take training into account!
My understanding is that after each batch, all dynamic embeddings are cleared from GPU memory. That makes sense, as they are specific to each sentence and there is no obvious reason to keep them in memory.
However, when I run GPU storage mode on my computer I get an OOM exception on the 72nd batch. My understanding is that the longest sentences are processed first (for padding optimization).
I use FastText + Flair embeddings.
Did I get this error only because more and more FastText words are kept in GPU memory?
It's quite surprising, as the full FastText matrix takes 1.2 GB when serialized on the hard drive, and I am wondering if I am not missing something regarding Flair embeddings.
Looking at the memory consumption, it increases linearly with batches. I was not expecting such behaviour because of Zipf's law.
I suspect that this line is the cause of what I see: https://github.com/zalandoresearch/flair/blob/ddba219c1deea9c7d12725741cf8d041b68ae738/flair/training_utils.py#L354 (in inference there is no gradient, if I am right), so dynamic embeddings stay in memory.
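The suspicion above is about autograd references: as long as a stored embedding tensor still carries its `grad_fn`, PyTorch keeps the whole computation graph (and its intermediate tensors) alive, so GPU memory grows batch after batch. A minimal PyTorch sketch of the difference, under the assumption that storage without `.detach()` is the culprit (this is not Flair's actual code):

```python
import torch

# Stand-in for an input that flows through a trainable model.
x = torch.randn(4, 8, requires_grad=True)
emb = (x * 2.0).sum(dim=1)       # "dynamic" embedding, attached to the graph

# While grad_fn is set, the backward graph is retained in memory.
print(emb.grad_fn is not None)   # True

stored = emb.detach()            # drops the graph reference; safe to cache
print(stored.grad_fn is None)    # True
print(stored.requires_grad)      # False
```

Detaching (and optionally moving to CPU with `.cpu()`) before storing is the usual way to cache activations without retaining the autograd graph.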