
transformer models for language model training and tag prediction instead of LSTM's #68

Closed
mittalsuraj18 opened this issue Aug 15, 2018 · 26 comments
Labels
feature A new feature help wanted Extra attention is needed wontfix This will not be worked on

Comments

@mittalsuraj18
Contributor

I recently read OpenAI's generative pre-training paper.
According to the benchmarks, fine-tuning the OpenAI model on a custom dataset takes much less time than an LSTM-based approach.
The model has also been shown to improve the state of the art on many tasks.
So I was wondering whether it would be possible to replace the current pipeline with the transformer-based model implemented by OpenAI.

@mittalsuraj18 mittalsuraj18 changed the title transformer models for language model training and tag prediction instead if LSTM's transformer models for language model training and tag prediction instead of LSTM's Aug 15, 2018
@alanakbik
Collaborator

Great idea - we've been discussing this internally and really want to try it out, and compare the two approaches! Any help / pointers are appreciated :)

@mittalsuraj18
Contributor Author

mittalsuraj18 commented Aug 15, 2018

https://github.com/huggingface/pytorch-openai-transformer-lm has an implementation of the transformer model in PyTorch and scripts to load the OpenAI transformer weights.
I'll take a look at it this weekend and check the feasibility of the implementation.

@alanakbik
Collaborator

Great, thanks! Perhaps this code can be the basis of new transformer-based LanguageModel and LanguageModelTrainer classes!
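
For illustration, here is a minimal sketch of what a transformer-based LanguageModel core could look like, built on PyTorch's own nn.TransformerEncoder. All names are hypothetical and this is not flair's actual API, just a starting point for discussion:

```python
import math
import torch
import torch.nn as nn


class SketchTransformerLM(nn.Module):
    """Hypothetical causal-LM core; not flair's actual LanguageModel class."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6,
                 max_len=512, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           dropout=dropout)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (seq_len, batch) of token ids
        seq_len = tokens.size(0)
        positions = torch.arange(seq_len, device=tokens.device).unsqueeze(1)
        x = self.token_embed(tokens) * math.sqrt(self.d_model) + self.pos_embed(positions)
        # causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        hidden = self.encoder(x, mask=mask)
        return self.decoder(hidden)  # next-token logits, (seq_len, batch, vocab)
```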

@stefan-it
Member

stefan-it commented Aug 16, 2018

A deep Transformer model now also achieves state-of-the-art results in language modeling, see this paper. So I think integrating such an architecture into flair would be awesome ❤️

But don't look at the evaluation section of the paper mentioned above ;) training took more than 7 days on a single Cloud TPU 😱

@mittalsuraj18
Contributor Author

64 layers, wow...
I don't think implementing such a huge network would be feasible, since it would slow down the training of further models in the pipeline quite considerably. However, their 12-layer network also yielded decent results.
The concept of auxiliary losses is interesting; I'll have to test it out and see how it works.
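
For reference, the auxiliary-loss idea from that paper is, roughly, to add an LM prediction loss at intermediate layers as well and sum the terms. A hedged sketch (the function name and weighting here are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def lm_loss_with_aux(layer_outputs, targets, proj, aux_weight=0.5):
    """layer_outputs: list of (seq_len, batch, d_model) tensors, one per layer.
    targets: (seq_len, batch) next-token ids.
    proj: a shared nn.Linear(d_model, vocab_size) projection head."""
    # main loss from the top layer
    main = F.cross_entropy(proj(layer_outputs[-1]).flatten(0, 1), targets.flatten())
    # auxiliary losses: the same prediction task applied to every intermediate layer
    aux = sum(
        F.cross_entropy(proj(h).flatten(0, 1), targets.flatten())
        for h in layer_outputs[:-1]
    )
    return main + aux_weight * aux / max(len(layer_outputs) - 1, 1)
```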

@tabergma
Collaborator

Small update: We are going to add BERT embeddings (see #251) to flair in the next release. They are based on transformers.

We are still thinking about adding our own transformer model at some point, but not in the near future.

@tabergma tabergma added help wanted Extra attention is needed feature A new feature labels Dec 13, 2018
@mittalsuraj18
Contributor Author

alright 👍

@stefan-it
Member

stefan-it commented Jan 10, 2019

@alanakbik and @tabergma : Here's another great paper about a Transformer-based LM:

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Yesterday they released both a TensorFlow and a PyTorch implementation of the model. I'm going to play with the implementation now; maybe I can find a way to get embeddings for a sentence (as is done with FlairEmbeddings).

@alanakbik
Collaborator

Wow this looks really interesting!

@stefan-it
Member

stefan-it commented Jan 28, 2019

Two PRs from the pytorch-pretrained-BERT repository are very interesting:

Once they're merged I would like to add them to flair :)

Training a Transformer-XL model is possible, but on a single GPU I had to use a smaller Transformer model (I'm currently running some experiments with it...)

@alanakbik
Collaborator

Yeah that would be great! :) Also, we'd be very interested to hear about your experiments with Transformer-XL!

@stefan-it
Member

stefan-it commented Feb 11, 2019

Version 0.5.0 is out now: https://github.com/huggingface/pytorch-pretrained-BERT/releases/tag/v0.5.0

I'll check the integration of OpenAI GPT and the Transformer-XL now :)

@alanakbik
Collaborator

Wow awesome!

@gccome

gccome commented Feb 12, 2019

Wow, this is awesome. Really looking forward to transformer-based models and fine-tuning-based models.

@stefan-it
Member

Two current caveats:

  • OpenAI GPT needs two libraries that are not covered by pytorch-pretrained-BERT's dependency management: ftfy and spacy. For spacy you also need to manually install the English model with python -m spacy download en. After that it works fine; I was able to get embeddings for a sentence (see the sketch after this list).
  • Transformer-XL: I wasn't able to get proper embeddings; a "nan" tensor was returned. But I opened an issue, see here :)
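
For anyone who wants to try the GPT part, it can be reproduced roughly like the sketch below, written against the pytorch-pretrained-BERT 0.5.x API as I understand it (the exact signatures and return values may differ slightly; the example sentence is arbitrary):

```python
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel

# Extra requirements: pip install ftfy spacy && python -m spacy download en
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTModel.from_pretrained("openai-gpt")
model.eval()

text = "Berlin is a city in Germany ."
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))])
with torch.no_grad():
    # expected: last-layer hidden states, roughly (1, num_subword_pieces, 768)
    hidden_states = model(ids)
print(hidden_states.shape)
```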

@alanakbik
Collaborator

Ah, thanks for the update - do you know why OpenAI GPT requires spacy, and why the English model? Only for tokenization?

@thomwolf

Hi guys, I've made some updates and a new release for this: https://github.com/huggingface/pytorch-pretrained-BERT/releases/tag/v0.5.1

Keep up the good work on flair.

@stefan-it
Member

I've implemented an early draft of TransformerXLEmbeddings and I'm currently training on the CoNLL 2003 dataset. I'll report the results here soon :)
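
For illustration only, this is roughly the shape such an embeddings class could take in flair. It is not the actual draft: the class name is made up, the hidden size is an assumed value, and the token alignment is deliberately simplified.

```python
import torch
from flair.embeddings import TokenEmbeddings
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel


class SketchTransformerXLEmbeddings(TokenEmbeddings):
    """Illustrative sketch only; not the real TransformerXLEmbeddings draft."""

    def __init__(self, model_name: str = "transfo-xl-wt103"):
        self.name = model_name
        self.static_embeddings = True
        self.tokenizer = TransfoXLTokenizer.from_pretrained(model_name)
        self.model = TransfoXLModel.from_pretrained(model_name)
        self.model.eval()
        super().__init__()

    @property
    def embedding_length(self) -> int:
        return 1024  # assumed hidden size of transfo-xl-wt103

    def _add_embeddings_internal(self, sentences):
        for sentence in sentences:
            # wt103 is word-level, so we assume one subtoken per flair token here;
            # a real implementation needs proper word/subword alignment.
            ids = self.tokenizer.convert_tokens_to_ids(
                [token.text for token in sentence]
            )
            with torch.no_grad():
                # assumed to return (batch, seq_len, hidden) plus the new memory
                hidden, _ = self.model(torch.tensor([ids]))
            for token, vector in zip(sentence, hidden[0]):
                token.set_embedding(self.name, vector)
        return sentences
```

A real implementation would also need batching and proper handling of out-of-vocabulary words.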

@stefan-it
Member

Btw: the second version of GPT is out: https://github.com/openai/gpt-2/blob/master/README.md

@gccome

gccome commented Feb 15, 2019

@stefan-it In my understanding, TransformerXLEmbeddings supports variable sentence lengths, so it won't run into the out-of-index issue of BertEmbeddings, because BERT has a fixed maximum length of 512 tokens. Is that correct?
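
For reference, a common workaround for the 512-subtoken limit is to window the input before embedding. A rough sketch, assuming pytorch-pretrained-BERT's BertTokenizer and BertModel; the checkpoint name is only an example:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
model.eval()

def embed_long_text(text, max_pieces=510):
    """Split subtokens into chunks so each fits BERT's 512-position limit."""
    pieces = tokenizer.tokenize(text)
    chunks = [pieces[i:i + max_pieces] for i in range(0, len(pieces), max_pieces)]
    hidden = []
    for chunk in chunks:
        ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + chunk + ["[SEP]"])
        with torch.no_grad():
            layers, _ = model(torch.tensor([ids]))
        hidden.append(layers[-1][0, 1:-1])  # last layer, drop [CLS]/[SEP]
    return torch.cat(hidden, dim=0)  # one vector per subtoken piece
```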

@alanakbik
Collaborator

@stefan-it @thomwolf wow that's great - really looking forward to seeing this in action! And very interested to hear how well it does on CoNLL 03 and other tasks.

@stefan-it
Member

Here's another Transformer-based architecture that uses a new approach to pretraining (a cloze-style token reconstruction task is used during training):

https://arxiv.org/abs/1903.07785

It also achieves new SOTA on CoNLL-2003 NER: 93.5% (compared to flair: 93.18%)

@alanakbik
Collaborator

Very impressive results - look forward to taking a closer look at this!

@stefan-it
Member

stefan-it commented Mar 26, 2019

One major drawback is the enormous amount of training data required 🤣 Unfortunately, there's currently no implementation or model available.

@stefan-it
Member

stefan-it commented Mar 26, 2019

I just asked @michaelauli whether they plan to release the code and model :) [I could imagine that it will be integrated into fairseq, but this is just speculation]

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 30, 2020
@stale stale bot closed this as completed May 7, 2020