
Using Pre-trained ELMo Representations for Many Languages in Flair #438

Closed
mauryaland opened this issue Jan 31, 2019 · 12 comments
Labels
question Further information is requested wontfix This will not be worked on

Comments

@mauryaland
Contributor

Hello,

First of all, thanks for the great work. This library is very useful, and I am following the many improvements with great interest!

I wonder whether it would be possible to implement ELMo embeddings from the ELMoForManyLangs repository?
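
For reference, that library can apparently be used standalone along these lines (a rough sketch only; the model path is a placeholder and the exact call signatures and output shape should be checked against their README):

```python
# Rough sketch of using ELMoForManyLangs on its own (not a Flair integration);
# the model directory is a placeholder for one of the pre-trained language
# models downloadable from that repository.
from elmoformanylangs import Embedder

# load a pre-trained model, e.g. the French one, from a local directory
embedder = Embedder('/path/to/french_elmo_model')

# sents2elmo expects pre-tokenized sentences and returns one
# (num_tokens x 1024) numpy array per sentence
sentences = [['Bonjour', ',', 'comment', 'allez', '-', 'vous', '?']]
embeddings = embedder.sents2elmo(sentences)
print(embeddings[0].shape)
```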

Thank you in advance for your answer; I am happy to help if the answer is yes.

Amaury

@mauryaland mauryaland added the question Further information is requested label Jan 31, 2019
@alanakbik
Collaborator

Hello @mauryaland, this looks very interesting - multilingual ELMo would definitely be a great addition to Flair. Are you planning an installable pip package?

@mauryaland
Contributor Author

@alanakbik I will ask them whether they are planning to do so, or whether I can create it myself. I will let you know.

@alanakbik
Collaborator

Cool, thanks!

@mauryaland
Contributor Author

@alanakbik I received an answer from the author in the following issue: the project is still too unstable for now, so it is wait and see. I will keep following the topic.

On another note, I recently discovered some great embeddings: subword embeddings for many languages, called BPEmb. They could be interesting to use, and they are available on PyPI. Things are moving really fast in NLP these days!
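
For example, a minimal sketch of the PyPI package (`pip install bpemb`), assuming the pre-trained English model; the package downloads the SentencePiece model and embeddings on first use:

```python
# Minimal sketch of the BPEmb package; model choice and dimensions are
# illustrative (several vocabulary sizes and dimensions are available).
from bpemb import BPEmb

# English subword embeddings, 50k merge operations, 100-dim vectors
bpemb_en = BPEmb(lang="en", vs=50000, dim=100)

print(bpemb_en.encode("subword embeddings"))        # list of subword pieces
print(bpemb_en.embed("subword embeddings").shape)   # (num_pieces, 100)
```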

@alanakbik
Collaborator

Wow these look interesting - perhaps we can integrate them.

@stefan-it
Member

It would be great if they had used the official ELMo training code 😂 From my experience, the Transformer ELMo model is a good alternative to the default ELMo model, and training is a lot faster.

BPE embeddings could be interesting. They would work easily for text classification, because you only have to BPE-encode the sentences and train a baseline model; if you also want to use a language model, then you need to train it on a BPE-encoded corpus.

For sequence tagging: I ran some experiments with SentencePiece (SentencePiece word embeddings + a SentencePiece language model + converting a CoNLL NER dataset to SentencePiece as well), but the results weren't very promising (it's an open question how to tag the pieces...).

But for text classification I think it is worth trying these BPEmb embeddings. E.g. there's a subword variant with SentencePiece in combination with the ULMFiT model for Polish; see the paper here.

@mauryaland
Contributor Author

Thanks for the details.

Indeed, how to tag the pieces for sequence tagging seems tricky.

I appreciate the paper on SentencePiece in combination with the ULMFiT model; really good results!

@bheinzerling

bheinzerling commented Feb 6, 2019

About tagging pieces: Converting token-based tags to subword-based tags is not necessary.

Instead, after having run your encoder (LSTM, ELMo, BERT...) on the subword sequence, you simply pick one encoder state for each token, e.g. the state corresponding to the first subword in each token.
This is described in some detail with example code here.
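
To illustrate the idea (this is my own minimal PyTorch sketch, not the code from the linked post; the helper name, shapes, and pieces are made up for illustration):

```python
# Sketch of "pick one encoder state per token": the encoder runs over the
# subword sequence, and we keep only the state of each token's first subword.
import torch

def select_first_subword_states(encoder_states, token_start_positions):
    """Keep only the encoder state of the first subword of each token.

    encoder_states: (num_subwords, hidden_dim) tensor
    token_start_positions: indices of subwords that begin a new token
    """
    index = torch.tensor(token_start_positions, dtype=torch.long)
    return encoder_states.index_select(0, index)  # (num_tokens, hidden_dim)

# e.g. "Heidelberg is nice" -> pieces ["▁Heidel", "berg", "▁is", "▁nice"],
# where tokens start at subword positions 0, 2 and 3
states = torch.randn(4, 768)
token_states = select_first_subword_states(states, [0, 2, 3])
print(token_states.shape)  # torch.Size([3, 768])
```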

@stefan-it
Member

stefan-it commented Feb 6, 2019

Thanks for that hint, @bheinzerling. Your link also includes a very nice reference to another discussion about the NER results in the BERT paper (and document vs. sentence context) 👍

@alanakbik
Collaborator

@bheinzerling just had a look at your BPEmb paper - this looks really interesting and could allow us to reduce model size (as you noted, fastText embeddings are huge). So we'll definitely take a look at integrating your embeddings!

@gccome

gccome commented Feb 7, 2019

@bheinzerling is there a way to train BPEmb with our own data? Thanks in advance!

Update: by reading your BPEmb paper, I figured out a way of doing so. First use SentencePiece to train a BPE model, and then use GloVe or Word2Vec to train the BPEmb-style embeddings.
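
A rough sketch of that recipe, with illustrative file names and hyperparameters (GloVe would slot in similarly in the last step):

```python
# 1) learn a BPE vocabulary with SentencePiece, 2) segment the corpus into
# subword pieces, 3) train word2vec-style vectors over the pieces.
import sentencepiece as spm
from gensim.models import Word2Vec

# 1. learn a BPE model on a raw-text corpus (file name is a placeholder)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=10000, model_type="bpe"
)

# 2. encode the corpus into subword pieces
sp = spm.SentencePieceProcessor(model_file="bpe.model")
with open("corpus.txt", encoding="utf-8") as f:
    piece_sentences = [sp.encode(line.strip(), out_type=str) for line in f]

# 3. train embeddings over the subword pieces
model = Word2Vec(sentences=piece_sentences, vector_size=100, window=5, min_count=1)
model.wv.save_word2vec_format("bpemb_like.vec")
```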

alanakbik pushed a commit that referenced this issue Feb 8, 2019
alanakbik pushed a commit that referenced this issue Feb 9, 2019
@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 30, 2020
@stale stale bot closed this as completed May 7, 2020