
GH-387: modify FlairEmbeddings to handle large texts #444

Merged (3 commits, Feb 2, 2019)

Conversation

alanakbik (Collaborator) commented Feb 1, 2019

closes #387

This PR adds a fix for retrieving embeddings from a character LM for long sequences. The idea is to chop long sequences into chunks and push each chunk through the LM, while always carrying over the last hidden state of one chunk as the initial hidden state of the next. This lowers memory requirements (shorter sequences per forward pass) but increases runtime (more calls to the RNN).

In detail:

  • LanguageModel.get_representation() and FlairEmbeddings now have a chars_per_chunk parameter that defaults to 512. Lowering this parameter reduces memory usage but increases runtime.

  • LanguageModelTrainer can now shuffle sentences in each split

  • Removed the deprecated DocumentMeanEmbeddings, as well as most mentions of the deprecated CharLMEmbeddings

  • Removed slow unit tests
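The chunked-forward idea can be illustrated with a toy sketch (this is hypothetical example code, not Flair's actual implementation; `step` stands in for one recurrent step of the character LM). The key property is that carrying the last hidden state of each chunk into the next chunk makes chunked processing exact: the per-character outputs are identical to a single full-length pass, only peak memory per forward pass shrinks.

```python
# Toy character "language model": the hidden state is a running value
# updated once per character; get_representation returns one output
# per character. Names here are illustrative, not Flair's API.

def step(hidden, ch):
    # toy recurrence combining the previous hidden state with the char code
    return (hidden * 31 + ord(ch)) % 10_007

def get_representation(text, hidden=0, chars_per_chunk=512):
    """Process `text` in chunks, carrying the last hidden state of each
    chunk over as the initial hidden state of the next chunk."""
    outputs = []
    for start in range(0, len(text), chars_per_chunk):
        chunk = text[start:start + chars_per_chunk]
        # in the real model this loop is one batched forward pass per chunk
        for ch in chunk:
            hidden = step(hidden, ch)
            outputs.append(hidden)
    return outputs

# Chunked processing matches a single full-length pass exactly.
long_text = "flair embeddings for long sequences " * 100
assert get_representation(long_text, chars_per_chunk=64) == \
       get_representation(long_text, chars_per_chunk=len(long_text))
```

Smaller chunks trade runtime for memory: each chunk requires a separate call into the RNN, which is why lowering chars_per_chunk slows embedding extraction down.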

@alanakbik alanakbik merged commit 1876f1f into master Feb 2, 2019
@alanakbik alanakbik deleted the GH-387-long-flair-embeddings branch February 6, 2019 16:56