Using Pre-trained ELMo Representations for Many Languages in Flair #438
Comments
Hello @mauryaland, this looks very interesting - multilingual ELMo would definitely be a great addition to Flair. Are you planning an installable pip package?
@alanakbik I will ask them if they are planning to, or if I can create it. I will let you know about that.
Cool, thanks!
@alanakbik I got an answer from the author on the following issue; the project is too unstable so far, so wait and see. I will follow the topic. On another note, I have recently discovered some great embeddings: subword embeddings for many languages, called BPEmb. They could be interesting to use, and they are available on PyPI. Things are moving really fast in NLP these days!
Wow, these look interesting - perhaps we can integrate them.
Would be great if they had used the official ELMo training code 😂 From my experience, the Transformer ELMo model is a good alternative to the default ELMo model, and training is a lot faster. BPE embeddings could be interesting: they would work easily for text classification, because you only have to BPE-encode the sentences and train a baseline model; if you also want to use a language model, then you need to train it on a BPE-encoded corpus. For sequence tagging, I ran some experiments with SentencePiece (SentencePiece word embeddings + a SentencePiece language model + a CoNLL NER dataset also converted to SentencePiece), but the results weren't very promising (it's an open question how to tag the pieces...). But for text classification I think these BPEmb embeddings are worth trying. E.g. there's a subword variant with SentencePiece in combination with the ULMFiT model on Polish, see the paper here.
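For anyone who wants to try this, here is a minimal sketch of BPE-encoding sentences with the `bpemb` package from PyPI (the vocabulary size and embedding dimension are illustrative choices, not recommendations):

```python
# pip install bpemb
from bpemb import BPEmb

# Pre-trained English byte-pair embeddings; model files are downloaded on first use.
# vs = BPE vocabulary size, dim = embedding dimension (illustrative values).
bpemb_en = BPEmb(lang="en", vs=10000, dim=100)

sentence = "Multilingual ELMo would be a great addition"
pieces = bpemb_en.encode(sentence)    # list of subword pieces
vectors = bpemb_en.embed(sentence)    # numpy array of shape (num_pieces, 100)
print(pieces)
print(vectors.shape)
```

The resulting piece embeddings can then be fed to a text classification baseline in place of word embeddings.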
Thanks for the details. Indeed, how to tag the pieces for sequence tagging seems tricky. I appreciate the paper on SentencePiece in combination with the ULMFiT model - really good results!
About tagging pieces: Converting token-based tags to subword-based tags is not necessary. Instead, after having run your encoder (LSTM, ELMo, BERT...) on the subword sequence, you simply pick one encoder state for each token, e.g. the state corresponding to the first subword in each token.
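To make that concrete, here is a small illustrative sketch (plain PyTorch, no particular library assumed) of picking the first-subword state for each token, given an already-computed subword-to-token alignment:

```python
import torch

def first_subword_states(hidden_states: torch.Tensor, word_ids):
    """Pick one encoder state per token: the state of the token's first subword.

    hidden_states: (num_subwords, hidden_dim) encoder output
    word_ids: list mapping each subword position to its token index
              (None for special tokens); the format is illustrative
    """
    picked, seen = [], set()
    for pos, tok_idx in enumerate(word_ids):
        if tok_idx is not None and tok_idx not in seen:
            seen.add(tok_idx)
            picked.append(hidden_states[pos])
    return torch.stack(picked)  # (num_tokens, hidden_dim)

# Example: a 2-token sentence split into 3 subwords, where subwords 1 and 2
# both belong to the second token.
states = torch.randn(3, 8)            # dummy encoder output for 3 subwords
token_states = first_subword_states(states, word_ids=[0, 1, 1])
print(token_states.shape)             # torch.Size([2, 8])
```

The token-level states can then go into a standard tagging head, so the CoNLL-style labels never have to be converted to subword labels.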
Thanks for that hint, @bheinzerling. Your link also includes a very nice reference to another discussion about the NER results in the BERT paper (and document vs. sentence context) 👍
@bheinzerling I just had a look at your BPEmb paper - this looks really interesting and could allow us to reduce model size (as you noted, fastText embeddings are huge). So we'll definitely take a look at integrating your embeddings!
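Purely as an illustration of what such an integration could look like, here is a sketch of a custom Flair embedding class wrapping BPEmb. The class name `BPEmbEmbeddings` and the mean/max pooling over subword vectors are assumptions made for this sketch, not an existing Flair API; it only relies on Flair's documented pattern of subclassing `TokenEmbeddings`:

```python
import numpy as np
import torch
from bpemb import BPEmb
from flair.embeddings import TokenEmbeddings


class BPEmbEmbeddings(TokenEmbeddings):
    """Hypothetical wrapper exposing BPEmb subword embeddings as Flair token embeddings."""

    def __init__(self, lang: str = "en", vs: int = 10000, dim: int = 100):
        super().__init__()
        self.name = f"bpemb-{lang}-{vs}-{dim}"
        self.bpemb = BPEmb(lang=lang, vs=vs, dim=dim)
        self.static_embeddings = True
        # mean pooling + max pooling over a token's subword vectors -> 2 * dim
        self.__embedding_length = 2 * dim

    @property
    def embedding_length(self) -> int:
        return self.__embedding_length

    def _add_embeddings_internal(self, sentences):
        for sentence in sentences:
            for token in sentence:
                vecs = self.bpemb.embed(token.text)        # (num_subwords, dim)
                if vecs.shape[0] == 0:                     # rare: no subwords produced
                    vecs = np.zeros((1, self.bpemb.dim), dtype=np.float32)
                pooled = torch.cat([
                    torch.tensor(vecs.mean(axis=0)),
                    torch.tensor(vecs.max(axis=0)),
                ])
                token.set_embedding(self.name, pooled)
        return sentences
```

Because the BPEmb vectors are small compared to fastText, stacking these with character or Flair embeddings should keep the overall model much more compact.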
@bheinzerling is there a way to train BPEmb on our own data? Thanks in advance! Update: after reading your BPEmb paper, I figured out a way of doing so. First use SentencePiece to train a BPE model, and then use GloVe or word2vec to train the BPEmb-style embeddings on the encoded corpus.
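For completeness, a rough sketch of that recipe (assuming a recent `sentencepiece` and gensim 4.x; the corpus path and hyperparameters are placeholders):

```python
# pip install sentencepiece gensim
import sentencepiece as spm
from gensim.models import Word2Vec

# 1. Train a BPE model on your own corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=10000, model_type="bpe"
)

# 2. Encode the corpus into subword pieces.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
with open("corpus.txt", encoding="utf-8") as f:
    encoded = [sp.encode(line.strip(), out_type=str) for line in f]

# 3. Train word2vec on the piece sequences to obtain the subword embeddings
#    (GloVe would work the same way on the encoded corpus).
w2v = Word2Vec(sentences=encoded, vector_size=100, window=5, min_count=1)
w2v.save("bpemb_custom.w2v")
```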
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello,
First of all, thanks for the great work. This library is very useful and I am following the many improvements with interest!
I wonder if it would be possible to implement the ELMo embeddings from the ELMoForManyLangs repository?
Thank you in advance for your answer, and I am available to help if the answer is positive.
Amaury
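For context, ELMoForManyLangs exposes an `Embedder` class for its pre-trained models; the sketch below follows the usage described in that repository's README (the model path is a placeholder, and the embedding dimension is whatever the downloaded model provides):

```python
# pip install elmoformanylangs   (or install from the ELMoForManyLangs repository)
from elmoformanylangs import Embedder

# Path to a downloaded pre-trained model, e.g. the French one (placeholder path).
embedder = Embedder("/path/to/fr.model")

# sents2elmo takes a list of tokenized sentences and returns one
# numpy array of shape (num_tokens, embedding_dim) per sentence.
sentences = [["Bonjour", "le", "monde"], ["Flair", "est", "génial"]]
vectors = embedder.sents2elmo(sentences)
print(vectors[0].shape)
```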