
Using Pre-trained ELMo Representations for Many Languages in Flair #438

Closed
mauryaland opened this issue Jan 31, 2019 · 12 comments
Labels
question Further information is requested wontfix This will not be worked on

Comments

@mauryaland
Contributor

Hello,

First of all, thanks for the great work. This library is very useful, and I am following the many improvements with great interest!

I wonder whether it would be possible to implement ELMo embeddings from the ELMoForManyLangs repository?
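
For reference, that library can apparently be used standalone along these lines (a rough sketch only; the model path is a placeholder and the exact call signatures and output shape should be checked against their README):

```python
# Rough sketch of using ELMoForManyLangs on its own (not a Flair integration);
# the model directory is a placeholder for one of the pre-trained language
# models downloadable from that repository.
from elmoformanylangs import Embedder

# load a pre-trained model, e.g. the French one, from a local directory
embedder = Embedder('/path/to/french_elmo_model')

# sents2elmo expects pre-tokenized sentences and returns one
# (num_tokens x 1024) numpy array per sentence
sentences = [['Bonjour', ',', 'comment', 'allez', '-', 'vous', '?']]
embeddings = embedder.sents2elmo(sentences)
print(embeddings[0].shape)
```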

Thank you in advance for your answer; I am happy to help if the answer is yes.

Amaury

@mauryaland mauryaland added the question Further information is requested label Jan 31, 2019
@alanakbik
Collaborator

Hello @mauryaland, this looks very interesting - multilingual ELMo would definitely be a great addition to Flair. Are you planning an installable pip package?

@mauryaland
Contributor Author

@alanakbik I will ask them whether they are planning to do so, or whether I can create it myself. I will let you know.

@alanakbik
Collaborator

Cool, thanks!

@mauryaland
Contributor Author

@alanakbik I received an answer from the author in the following issue: the project is still too unstable for now, so it is wait and see. I will keep following the topic.

On another note, I recently discovered some great embeddings: subword embeddings for many languages, called BPEmb. They could be interesting to use, and they are available on PyPI. Things are moving really fast in NLP these days!
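
For example, a minimal sketch of the PyPI package (`pip install bpemb`), assuming the pre-trained English model; the package downloads the SentencePiece model and embeddings on first use:

```python
# Minimal sketch of the BPEmb package; model choice and dimensions are
# illustrative (several vocabulary sizes and dimensions are available).
from bpemb import BPEmb

# English subword embeddings, 50k merge operations, 100-dim vectors
bpemb_en = BPEmb(lang="en", vs=50000, dim=100)

print(bpemb_en.encode("subword embeddings"))        # list of subword pieces
print(bpemb_en.embed("subword embeddings").shape)   # (num_pieces, 100)
```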

@alanakbik
Collaborator

Wow these look interesting - perhaps we can integrate them.

@stefan-it
Member

It would be great if they had used the official ELMo training code 😂 From my experience, the Transformer ELMo model is a good alternative to the default ELMo model, and training is a lot faster.

BPE embeddings could be interesting. They would work easily for text classification, because you only have to BPE-encode the sentences and train a baseline model; if you also want to use a language model, then you need to train it on a BPE-encoded corpus.

For sequence tagging: I ran some experiments with SentencePiece (SentencePiece word embeddings + a SentencePiece language model + converting a CoNLL NER dataset to SentencePiece as well), but the results weren't very promising (it's an open question how to tag the pieces...).

But for text classification I think it is worth trying these BPEmb embeddings. E.g. there's a subword variant with SentencePiece in combination with the ULMFiT model for Polish; see the paper here.

@mauryaland
Contributor Author

Thanks for the details.

Indeed, how to tag the pieces for sequence tagging seems tricky.

I appreciate the paper on SentencePiece in combination with the ULMFiT model; really good results!

@bheinzerling

bheinzerling commented Feb 6, 2019

About tagging pieces: Converting token-based tags to subword-based tags is not necessary.

Instead, after having run your encoder (LSTM, ELMo, BERT...) on the subword sequence, you simply pick one encoder state for each token, e.g. the state corresponding to the first subword in each token.
This is described in some detail with example code here.
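
To illustrate the idea (this is my own minimal PyTorch sketch, not the code from the linked post; the helper name, shapes, and pieces are made up for illustration):

```python
# Sketch of "pick one encoder state per token": the encoder runs over the
# subword sequence, and we keep only the state of each token's first subword.
import torch

def select_first_subword_states(encoder_states, token_start_positions):
    """Keep only the encoder state of the first subword of each token.

    encoder_states: (num_subwords, hidden_dim) tensor
    token_start_positions: indices of subwords that begin a new token
    """
    index = torch.tensor(token_start_positions, dtype=torch.long)
    return encoder_states.index_select(0, index)  # (num_tokens, hidden_dim)

# e.g. "Heidelberg is nice" -> pieces ["▁Heidel", "berg", "▁is", "▁nice"],
# where tokens start at subword positions 0, 2 and 3
states = torch.randn(4, 768)
token_states = select_first_subword_states(states, [0, 2, 3])
print(token_states.shape)  # torch.Size([3, 768])
```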

@stefan-it
Member

stefan-it commented Feb 6, 2019

Thanks for that hint, @bheinzerling. Your link also includes a very nice reference to another discussion about the NER results in the BERT paper (and document vs. sentence context) 👍

@alanakbik
Collaborator

@bheinzerling just had a look at your BPEmb paper - this looks really interesting and could allow us to reduce model size (as you noted, fastText embeddings are huge). So we'll definitely take a look at integrating your embeddings!

@gccome

gccome commented Feb 7, 2019

@bheinzerling is there a way to train BPEmb with our own data? Thanks in advance!

Update: by reading your BPEmb paper, I figured out a way of doing so. First use SentencePiece to train a BPE model, and then use GloVe or Word2Vec to train the BPEmb-style embeddings.
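
A rough sketch of that recipe, with illustrative file names and hyperparameters (GloVe would slot in similarly in the last step):

```python
# 1) learn a BPE vocabulary with SentencePiece, 2) segment the corpus into
# subword pieces, 3) train word2vec-style vectors over the pieces.
import sentencepiece as spm
from gensim.models import Word2Vec

# 1. learn a BPE model on a raw-text corpus (file name is a placeholder)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=10000, model_type="bpe"
)

# 2. encode the corpus into subword pieces
sp = spm.SentencePieceProcessor(model_file="bpe.model")
with open("corpus.txt", encoding="utf-8") as f:
    piece_sentences = [sp.encode(line.strip(), out_type=str) for line in f]

# 3. train embeddings over the subword pieces
model = Word2Vec(sentences=piece_sentences, vector_size=100, window=5, min_count=1)
model.wv.save_word2vec_format("bpemb_like.vec")
```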

alanakbik pushed a commit that referenced this issue Feb 8, 2019
alanakbik pushed a commit that referenced this issue Feb 9, 2019
@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 30, 2020
@stale stale bot closed this as completed May 7, 2020