Reduce the number of concatenations for a 10% inference time reduction #1093
Instead of concatenating token representations at the token level (row approach) and then building the sentence tensor, we make the process lazy: we retrieve each token representation before concatenation, reorganize the representations into nested lists, and perform the concatenation per column (all word embeddings are concatenated in one operation, then all LM embeddings in another), after which the columns (each column being one kind of representation) are concatenated together.
The idea is that since there are many more tokens per sentence than kinds of representation per token, far fewer concatenation operations are performed overall; a sketch of both approaches follows.
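To illustrate (this is a minimal sketch, not the actual PR diff; tensor names and dimensions are made up), here are the row approach and the column approach side by side in PyTorch, producing identical results:

```python
import torch

n_tokens = 128           # tokens in a sentence (hypothetical)
word_dim, lm_dim = 100, 2048  # hypothetical embedding sizes

# One embedding vector per token, per representation type
word_embs = [torch.randn(word_dim) for _ in range(n_tokens)]
lm_embs = [torch.randn(lm_dim) for _ in range(n_tokens)]

# Row approach: one torch.cat per token -> n_tokens concat operations
rows = [torch.cat([w, l]) for w, l in zip(word_embs, lm_embs)]
row_tensor = torch.stack(rows)

# Column approach: one stack per representation type, then a single
# concat of the columns -> a handful of ops regardless of n_tokens
word_col = torch.stack(word_embs)   # (n_tokens, word_dim)
lm_col = torch.stack(lm_embs)       # (n_tokens, lm_dim)
col_tensor = torch.cat([word_col, lm_col], dim=1)

assert torch.equal(row_tensor, col_tensor)
```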
On CoNLL-2003, inference time goes from 40s to 36s.
GPU utilization stays above 70% the whole time (when it reaches 100% there may be some improvements remaining, but the main bottleneck will be the model itself).
Let me know if your measurements match :-)
FWIW, on a French dataset it's a 20% improvement; I have stopped trying to guess why.
N.B.: I downloaded CoNLL-2003 from https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003