How embeddings clearing works on embedding storage set to GPU #1076

Closed
pommedeterresautee opened this issue Sep 6, 2019 · 4 comments
Labels: question (Further information is requested)

Comments

@pommedeterresautee (Contributor) commented Sep 6, 2019

My understanding is that after each batch, all dynamic embeddings are cleared from GPU memory. That makes sense, as they are specific to each sentence and there is no obvious reason to keep them in memory.

However, when I run the GPU storage mode on my machine I get an OOM exception on the 72nd batch. My understanding is that the longest sentences are processed first (for padding optimization).
I use FastText + Flair embeddings.

Do I get this error only because more and more FastText word embeddings are kept in GPU memory?
That would be quite surprising, as the full FastText matrix takes 1.2 GB when serialized on disk, so I wonder whether I am missing something regarding the Flair embeddings.

Looking at memory consumption, it increases linearly with the number of batches; because of Zipf's law (new word types should appear less and less often), I was not expecting that.

I suspect this line is the cause of what I see: https://github.com/zalandoresearch/flair/blob/ddba219c1deea9c7d12725741cf8d041b68ae738/flair/training_utils.py#L354 (during inference there is no gradient, if I am right), so the dynamic embeddings stay in memory.
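To illustrate what I mean, here is a minimal, hypothetical sketch (the `Sentence` class, the `clear_embeddings` helper and the embedding names are illustrative stand-ins, not Flair's actual implementation): wrapping inference in `torch.no_grad()` means the per-sentence tensors carry no autograd graph, so dropping them after the batch actually releases GPU memory.

```python
import torch

class Sentence:
    """Toy stand-in for a sentence object holding named embedding tensors."""
    def __init__(self, text):
        self.text = text
        self.embeddings = {}  # name -> tensor

    def clear_embeddings(self, names=None):
        # Delete all embeddings, or only the named ("dynamic") ones.
        if names is None:
            self.embeddings.clear()
        else:
            for name in names:
                self.embeddings.pop(name, None)

def embed_batch(sentences, device="cpu"):
    # Stand-in for a character-LM forward pass producing per-sentence embeddings.
    with torch.no_grad():  # inference: no gradients, so the tensors hold no graph
        for s in sentences:
            s.embeddings["flair-lm"] = torch.randn(len(s.text.split()), 2048, device=device)

batch = [Sentence("a rather long sentence"), Sentence("short one")]
embed_batch(batch)
for s in batch:
    s.clear_embeddings(names=["flair-lm"])  # drop the dynamic tensors after the batch
```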

pommedeterresautee added the question label on Sep 6, 2019
@alanakbik (Collaborator) commented:

Hello @pommedeterresautee, yes, that is correct. However, the Flair embeddings are static by default, so they are kept in GPU memory as well. The reason is that fine-tuning is disabled by default for FlairEmbeddings, since across our experiments it worked much better to freeze the weights of the LM. This means all models that we distribute have non-dynamic Flair embeddings. So if you select 'gpu', this will slowly fill up your GPU memory, since Flair embeddings are quite large.
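As a rough back-of-envelope illustration of why 'gpu' storage fills up memory (the numbers below are assumptions, not measurements: a 4096-dimensional stacked forward/backward Flair embedding stored as float32, about 25 tokens per sentence, 100k sentences kept on the GPU):

```python
# Assumed numbers; adjust to your corpus and embedding setup.
dim = 4096                   # stacked forward + backward Flair embedding
bytes_per_value = 4          # float32
tokens_per_sentence = 25
num_sentences = 100_000

gib = dim * bytes_per_value * tokens_per_sentence * num_sentences / 1024**3
print(f"~{gib:.1f} GiB just to store the embeddings")  # ~38.1 GiB
```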

@pommedeterresautee (Contributor, Author) commented:

OK, I understand now. I had read "dynamic" (from the comment `# else delete only dynamic embeddings (otherwise autograd will keep everything in memory)`) as referring to the embeddings that are "dynamically" generated for each sentence (something the Flair LM does, but not FastText).

So, follow-up question: Flair embeddings are specific to each sentence, and keeping them on the GPU should not bring any speed improvement. Why would someone want to keep the Flair embeddings on the GPU? Same reason as #1070?

Can you tell me where the FastText embedding matrix is stored with the "none" option? I think it is in the machine's main RAM, and I wonder whether it would make sense to keep this kind of embedding in GPU RAM (because it never changes) while clearing the Flair embeddings after each batch.
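To make the idea concrete, here is an illustrative PyTorch sketch (not Flair's actual mechanism; the sizes are assumptions): a static lookup table like FastText has a fixed footprint determined by the vocabulary, so it could live on the GPU permanently, while per-sentence LM outputs grow with the corpus and can be dropped after each batch.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Static lookup table (FastText-like): size fixed by the vocabulary.
# 1,000,000 words x 300 dims x 4 bytes ≈ 1.2 GB, which could stay on the GPU.
vocab_size, word_dim = 1_000_000, 300
static_table = nn.Embedding(vocab_size, word_dim).to(device)

with torch.no_grad():
    token_ids = torch.randint(0, vocab_size, (32, 40), device=device)   # a batch of 32 sentences
    word_vecs = static_table(token_ids)                    # cheap lookup, constant memory
    contextual = torch.randn(32, 40, 4096, device=device)  # stand-in for per-sentence Flair output
    # ... run the model on the batch ...
    del word_vecs, contextual                              # per-batch tensors freed; the table stays
```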

@alanakbik (Collaborator) commented:

It makes sense to keep them in memory during training, since you do many epochs over the same training dataset. If the embeddings for all sentences are already generated, you can reuse them in the next epoch. This is why 'gpu' is the fastest option for training, and why 'cpu' is in most cases faster than 'none': generating embeddings is often slower than moving the tensors from CPU memory to the GPU.
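A simplified sketch of that trade-off (illustrative only, not Flair internals; the cache, tensor shapes and function name are assumptions): 'gpu' keeps computed embeddings on the GPU for reuse across epochs, 'cpu' keeps them in CPU RAM and copies them to the GPU per batch, 'none' recomputes them every epoch.

```python
import torch

cache = {}  # sentence_id -> stored embedding tensor

def get_embedding(sentence_id, storage_mode, device):
    if storage_mode != "none" and sentence_id in cache:
        return cache[sentence_id].to(device)   # reuse; .to() is a no-op if already on that device
    emb = torch.randn(30, 4096)                # stand-in for an expensive LM forward pass
    if storage_mode == "gpu":
        cache[sentence_id] = emb.to(device)    # fastest reuse, but slowly fills GPU memory
    elif storage_mode == "cpu":
        cache[sentence_id] = emb               # stays in CPU RAM, copied to the GPU per batch
    return emb.to(device)
```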

@pommedeterresautee (Contributor, Author) commented:

Thanks a lot @alanakbik, I focused so much on inference that I forgot to take training into account!
I will probably come up with new questions soon :-)
