
Reduce inference time by 30% when using Flair embeddings #1074

Merged (7 commits, Sep 10, 2019)

Conversation

@pommedeterresautee (Contributor) commented on Sep 5, 2019

The profiler (screenshots on #1070) showed that Tensor.clone() was one of the slowest single operations during inference.
Looking at the code, it is always called, even when gradients are disabled.
My understanding is that when gradients are disabled, the embeddings can't change, so there is no need to clone the tensor to prevent it from being updated. That's what this commit does.

I have tested inference but not LM training; I don't think it changes anything there, but let me know if I missed something.

Timing on a 2080 Ti with embedding storage set to None on my French dataset:

  • before 36 secs
  • after 25 secs

FYI, I previously measured with storage set to "cpu", which is why the "before" time dropped a lot compared to PR #1068 (43 secs after optimization). Details on #1070.
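For intuition about why clone() is so costly per token, here is a small hypothetical micro-benchmark (not from the PR; shapes and iteration counts are arbitrary) comparing a per-token clone() against detach(), which only creates a view and copies nothing:

import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# stand-in for the LM hidden states: (characters, batch, hidden_size)
hidden = torch.randn(512, 32, 2048, device=device)

def bench(fn, n=10_000):
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

# clone() launches a real copy for every call; detach() only creates a view
t_clone = bench(lambda: hidden[10, 5, :].clone())
t_detach = bench(lambda: hidden[10, 5, :].detach())
print(f"clone: {t_clone:.3f}s  detach: {t_detach:.3f}s")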

Add a warning when cpu option is used with a GPU, plus update the documentation.
fix #1070

All evaluations were still using embedding storage set to cpu; I moved them to none.

Review comment on flair/trainers/trainer.py (outdated, resolved)
@pommedeterresautee (Contributor, Author)

@alanakbik regarding this part:

                    if self.fine_tune:
                        embedding = embedding.clone()
                    else:
                        embedding = embedding.detach()

Does detach make sense here? (because there is no gradient)

@alanakbik (Collaborator)

@pommedeterresautee sorry for the delay - I'm planning to look into #1078 and review this PR today!

@alanakbik (Collaborator)

Hello @pommedeterresautee, thanks for the PR! Prediction speed is much improved when setting storage mode to 'cpu'.

However, when setting storage mode to 'gpu', I now get OOM errors. For instance:

from flair.datasets import WNUT_17
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger

corpus = WNUT_17(in_memory=True)

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# embeddings
embeddings = StackedEmbeddings([
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
    WordEmbeddings('glove'),
])

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        )

# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train(
    f'resources/taggers/local-test',
    max_epochs=5,
    train_with_dev=True,
    embeddings_storage_mode='gpu',
)

This previously trained well with GPU storage mode, but now throws memory errors. It may be that without the .clone() the memory from the language model is somehow not released. That's strange, because a .detach() seems like it should be enough to let the hidden state tensor go out of scope, but somehow this isn't happening.

@alanakbik (Collaborator)

@pommedeterresautee quick update: I think the problem lies in the line https://github.com/zalandoresearch/flair/blob/e0437d577be28cf7d099df588ad3a997e542df60/flair/embeddings.py#L1856

where, out of the hidden states of all characters in the sentence, we select only the states at one position to become the word embedding. Essentially, what we need to do here is copy these states over to a new tensor so that we can discard the rest of the huge tensor all_hidden_states_in_lm. As discussed here and elsewhere, the preferred way to do this in PyTorch is via clone(). After the copy, no references to the big tensor remain and it gets cleaned up.

With your PR, we don't do the clone(), and it still works if the embedding storage mode is 'cpu' or 'none', because these storage modes cause either the embedding to be deleted ('none') or a new tensor to be created on cpu (essentially another copy operation), thus freeing up all_hidden_states_in_lm. This means we save the clone() operation, giving us a significant speed-up.

But if the storage mode is 'gpu', the embedding tensor is preserved, and with it all_hidden_states_in_lm, which quickly blows up the memory. So the PR as it is speeds up 'none' and 'cpu' but breaks 'gpu'.
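A toy sketch of the effect described above (made-up shapes, not the actual flair internals): keeping only a slice of a large tensor keeps the whole underlying storage alive, while clone() gives the slice its own small storage so the large tensor can be collected.

import torch

def take_word_embedding(use_clone: bool) -> torch.Tensor:
    # stand-in for all_hidden_states_in_lm (about 80 MB of float32)
    big = torch.randn(10_000, 2048)
    word = big[42, :]            # a view: shares big's storage
    if use_clone:
        word = word.clone()      # real copy: big's storage can now be freed
    return word                  # without the clone, big's full storage stays referenced

view_emb = take_word_embedding(use_clone=False)
copy_emb = take_word_embedding(use_clone=True)

# the view still pins the full 10_000 * 2048 element storage, the clone only 2048
print(view_emb.storage().size(), copy_emb.storage().size())  # 20480000 2048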

@pommedeterresautee (Contributor, Author) commented on Sep 9, 2019

I came to the same conclusion as you.
It's as if the new tensor is a kind of view over a part of the large matrix.
It would make sense to proceed that way, to avoid copies and to share the data layer between the tensors and the large matrix (each tensor keeping a simple reference to the matrix's data layer).
clone() would perform a real copy onto a brand new data layer.
The problem is that if this were the case, each token of the batch would hold a reference to the same matrix, so the memory should not raise an exception.

I agree that this has no effect on none, but it increases memory consumption on cpu.
Moreover, inside the language model I don't know which storage mode has been chosen for the embeddings, so I can't perform a simple if storage is not "none" then tensor.clone()...

Things that don't work: narrow (to keep the same under-the-hood data layer) and split (to split into a list of tensors); see the quick check below.

Complicated...
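A quick check (mine, not from the PR) that seems to confirm why narrow and split don't help here: they also return views over the same storage, so of these only clone() actually allocates new memory.

import torch

big = torch.randn(100, 16)

# narrow() returns a view over big's storage
n = big.narrow(0, 10, 1)
print(n.storage().data_ptr() == big.storage().data_ptr())                      # True

# split() returns a tuple of views, all still backed by big's storage
parts = torch.split(big, 25, dim=0)
print(all(p.storage().data_ptr() == big.storage().data_ptr() for p in parts))  # True

# clone() is the only one of these that allocates fresh storage
c = n.clone()
print(c.storage().data_ptr() == big.storage().data_ptr())                      # False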

@alanakbik (Collaborator)

One potential solution would be to define a global variable. We currently only have two: the flair.cache_root and the flair.device, both defined in the __init__.py:
https://github.com/zalandoresearch/flair/blob/e0437d577be28cf7d099df588ad3a997e542df60/flair/__init__.py#L8

We could add an optimization_settings variable here that initializes as "default". When someone sets the embedding storage mode to 'cpu', we change this variable. In the FlairEmbeddings class we then do the check: if optimization_settings is set to default, we always do the .clone(); otherwise we don't.

The drawback is that this adds another global variable, but if it is only for non-essential optimization settings, maybe that would be OK in this case. What do you think? I could add this in a separate PR.
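A hypothetical sketch of what that could look like (none of this exists in flair; optimization_settings and the two helpers below are made-up names used only for illustration):

import torch

# would live in flair/__init__.py next to flair.cache_root and flair.device
optimization_settings: str = "default"

def apply_embeddings_storage_mode(mode: str) -> None:
    """Hypothetical hook the trainer could call when the user picks a storage mode."""
    global optimization_settings
    optimization_settings = "skip_clone" if mode in ("cpu", "none") else "default"

def extract_word_embedding(all_hidden_states_in_lm: torch.Tensor, offset: int, i: int) -> torch.Tensor:
    """Sketch of the check that FlairEmbeddings could perform."""
    embedding = all_hidden_states_in_lm[offset, i, :]
    if optimization_settings == "default":
        embedding = embedding.clone()   # safe for 'gpu' storage: releases the big tensor
    else:
        embedding = embedding.detach()  # fast path: 'cpu'/'none' free it anyway
    return embedding

The same flag could then be consulted by any other embedding class that slices into cached hidden states.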

@pommedeterresautee (Contributor, Author)

Thanks @alanakbik.
For now I have reverted the optimization, so you can merge this PR and #1022 if you have time.
After that, I will push some other optimizations I have worked on (hoping they don't come with their own problems :-) ); that will give me some time to work on clone() in parallel.
What do you think @alanakbik?
We can go the global-variable route once we are sure there is no other way to get the benefit of removing clone()?

@pommedeterresautee (Contributor, Author) commented on Sep 10, 2019

I think I'm beginning to understand: the view theory may be the right one, they share storage until clone() is called. What I still don't understand is why it raises OOM; it should not consume more memory if they share the storage...

[image]

Calling .data_ptr() on the tensors provides the same insight.
.is_set_to() seemed like the easiest way to check whether they share the same data layer (storage), but as the output below shows it returns False for the sliced view; comparing .storage().data_ptr() is what actually reveals the shared storage.
This explains why detach() is not enough.

A little experiment:

embedding = all_hidden_states_in_lm[offset, i, :]
print(embedding.is_set_to(all_hidden_states_in_lm))
print(embedding.data_ptr(), all_hidden_states_in_lm.data_ptr())
print(embedding.storage().data_ptr(), all_hidden_states_in_lm.storage().data_ptr())
print(embedding.storage()._weak_ref(), all_hidden_states_in_lm.storage()._weak_ref())
print(embedding.storage_offset(), all_hidden_states_in_lm.storage_offset())

It prints:

False
139949060255744 139949059866624
139949059866624 139949059866624
2788225280 2788225280
97280 0

The _weak_ref values are the same and the .storage().data_ptr() values are the same, but the .data_ptr() values are not:
the embedding stores an offset into the storage of the large matrix.
Interesting link: http://blog.christianperone.com/2018/03/pytorch-internal-architecture-tour/#zero-copy
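Running the same kind of checks after a clone() (my own continuation of the experiment, with dummy data) shows the sharing disappear: the cloned embedding gets its own storage and a zero offset.

import torch

all_hidden_states_in_lm = torch.randn(100, 8, 64)
embedding = all_hidden_states_in_lm[5, 3, :]

# the slice shares the matrix's storage, at a non-zero offset (5*512 + 3*64 = 2752)
print(embedding.storage().data_ptr() == all_hidden_states_in_lm.storage().data_ptr())  # True
print(embedding.storage_offset())                                                      # 2752

# after clone(), the embedding has its own storage and offset 0
cloned = embedding.clone()
print(cloned.storage().data_ptr() == all_hidden_states_in_lm.storage().data_ptr())     # False
print(cloned.storage_offset())                                                          # 0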

@alanakbik (Collaborator)

Hi @pommedeterresautee yes that sounds good - I'll merge shortly!

Thanks for sharing your findings! I guess because they share the storage, the all_hidden_states_in_lm tensor is prevented from going out of scope and being garbage collected.

@alanakbik (Collaborator)

👍

@adizdari

👍

Successfully merging this pull request may close these issues.

CPU storing option is slower than None storing during inference on GPU