This project is still under construction and will continue to see breaking changes.
Ali Zaidi
In this project we examine the use of generative pre-training with language modeling objectives across a variety of languages to improve language understanding. Particular attention is paid to transfer learning for low-resource languages, where labeled data is scarce.
Pre-training and generative modeling for improved language understanding in NLP remain a challenging but interesting area of research and experimentation. Currently, most SOTA NLP results are obtained by training end-to-end architectures for each language task. We examine how transformer models relying solely on attention modules, as well as convolution-based modules such as the QRNN and those described in QANet, can provide rich representations that are learned through generative language modeling and then fine-tuned for text classification as well as general multi-task problems. Of particular interest is multi-label and hierarchical/structured-output label classification, where graph convolutional and value networks are more effective than binary categorical cross-entropy networks.
In the past few months, a number of techniques built on pre-training, generative modeling, multi-task architectures, data augmentation via back-translation, and efficiency improvements in language modeling have enabled faster training and a greater scope for transfer learning. In particular, the five papers below tackle the problem of generative pre-training and multi-task learning in NLP and achieve SOTA results.
- OpenAI: Improving Language Understanding by Generative Pre-Training
  - code
  - tldr: Train an unsupervised language model using a transformer architecture, then fine-tune it on task-specific datasets.
- fastAI: Universal Language Model Fine-tuning for Text Classification
  - tldr: Pre-train a language model on a generic English corpus (e.g., Wikipedia). Use it to initialize a new language model on your unlabeled domain-specific corpus. Fine-tune / initialize a new domain-specific architecture for text classification.
- AllenAI: Deep Contextualized Word Representations
  - code
  - tldr: Train a generic language model using a bidirectional LSTM.
- Salesforce Research: The Natural Language Decathlon
  - code: github.com/salesforce/decaNLP
  - tldr: A challenge consisting of ten NLP tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, relation extraction, goal-oriented dialogue, database query generation, and pronoun resolution. Proposes MQAN (multi-task question answering network), which uses a bidirectional LSTM to encode both the question and the context document, applies dual coattention, and compresses further with another two BiLSTMs + self-attention + two more BiLSTMs to obtain the final representations.
- Trieu H. Trinh & Quoc Le: A Simple Method for Commonsense Reasoning
  - tldr: Solve difficult multiple-choice questions from the Pronoun Disambiguation Problems and the Winograd Schema Challenge by pre-training many language models (diversity helps!), using coreference resolution to substitute the question pronoun with each answer choice, and picking the choice with the highest likelihood (lowest perplexity) under the ensemble of language models; see the sketch after this list.
  - Language modeling can naturally capture commonsense knowledge.
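To make the substitute-and-score idea from the last paper concrete, here is a minimal sketch. The `score_with_lm` function is a toy stand-in (a bigram counter) for a real pre-trained language model or ensemble; the names and corpus are illustrative only, not the authors' code.

```python
# Minimal sketch of the substitute-and-score procedure from
# "A Simple Method for Commonsense Reasoning". `score_with_lm` is a toy
# stand-in for a real pre-trained LM (or an ensemble), implemented as a
# bigram counter only so the example runs end to end.
import math
from collections import Counter


def score_with_lm(sentence, bigram_counts):
    """Toy log-likelihood proxy: sum of log(1 + count) over the sentence's bigrams."""
    tokens = sentence.lower().split()
    return sum(math.log1p(bigram_counts[(a, b)]) for a, b in zip(tokens, tokens[1:]))


def substitute(sentence, pronoun, candidate):
    """Replace the first occurrence of the pronoun token with the candidate phrase."""
    tokens = sentence.split()
    idx = tokens.index(pronoun)
    return " ".join(tokens[:idx] + candidate.split() + tokens[idx + 1:])


def resolve_pronoun(sentence, pronoun, candidates, bigram_counts):
    """Score each substituted sentence and keep the highest-likelihood candidate."""
    scored = [(score_with_lm(substitute(sentence, pronoun, c), bigram_counts), c)
              for c in candidates]
    return max(scored)[1]


if __name__ == "__main__":
    # Tiny "corpus" for the toy LM; the paper instead ensembles many large,
    # diverse pre-trained language models.
    corpus = "the trophy does not fit in the suitcase because the trophy is too big"
    toks = corpus.split()
    bigrams = Counter(zip(toks, toks[1:]))

    sentence = "the trophy does not fit in the suitcase because it is too big"
    print(resolve_pronoun(sentence, "it", ["the trophy", "the suitcase"], bigrams))
    # -> "the trophy"
```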
We describe our experiments with generative modeling and transfer learning for improved language understanding, summarize our results, examine the issues we faced, and then discuss future directions.
Our experiments start with training language models on a variety of datasets. Our overall approach is similar across languages, so we discuss the English implementation first.
We utilized the WikiText-103 long-term dependency dataset (derived from Wikipedia) curated by Stephen Merity, which contains roughly 103 million tokens in its training set. We used the Tensor2Tensor library to train this model; the details are summarized in wikitext103-lm.
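A rough sketch of the data-preparation side using the tensor2tensor Python API is shown below. The problem name `languagemodel_wikitext103` and the directory paths are assumptions that may differ across T2T versions; training itself is typically launched with the standard `t2t-trainer` entry point using something like `--problem=languagemodel_wikitext103 --model=transformer --hparams_set=... --data_dir=... --output_dir=...`.

```python
# Sketch of preparing the WikiText-103 LM problem with tensor2tensor.
# The problem name "languagemodel_wikitext103" is an assumption and may
# differ across T2T versions; check problems.available() for the exact name.
import os
from tensor2tensor import problems

DATA_DIR = os.path.expanduser("~/t2t_data")
TMP_DIR = os.path.expanduser("~/t2t_tmp")
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(TMP_DIR, exist_ok=True)

# List registered problems to confirm the exact name on your T2T version.
print([p for p in problems.available() if "wikitext" in p])

# Download and encode the corpus into TFRecords for the trainer.
wikitext_lm = problems.problem("languagemodel_wikitext103")
wikitext_lm.generate_data(DATA_DIR, TMP_DIR)
```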
Training on TPUs can provide significant benefits in terms of training speed. The Transformer model is devoid of any significant recurrent operations, so there is an optimized implementation in the tensor2tensor library that can utilize TPUs. Other types of language models, such as bidirectional LSTMs with attention, rely on ops that are not yet available on TPUs.
TPUs do not yet support cloud_ml-based hyperparameter search, so you'll have to revert to GPUs for that. Training a single model across multiple TPUs is also not supported.
It took 12 hours to train to 20K steps, reaching a perplexity of 53.2, very close to the SOTA reported perplexity for this dataset.
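For reference, perplexity here is just the exponentiated average per-token negative log-likelihood, so a perplexity of 53.2 corresponds to a cross-entropy of roughly 3.97 nats per token. The snippet below is only a worked check of that relationship, not part of the training code.

```python
# Perplexity is the exponentiated per-token cross-entropy (in nats).
import math

def perplexity(per_token_nll_nats):
    return math.exp(per_token_nll_nats)

print(perplexity(3.974))  # ~53.2
print(math.log(53.2))     # ~3.97 nats per token
```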
TODO: Try out the Universal Transformer.
Here we replicate the paper Unsupervised Machine Translation Using Monolingual Corpora Only using OpenNMT-tf. This implementation did not work on TPUs, so we instead used 4 V100s for training.
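One ingredient of that paper worth highlighting is the noise model used for its denoising auto-encoding objective: words are randomly dropped and locally shuffled before the model is asked to reconstruct the original sentence. Below is our own minimal sketch of that noise model with illustrative hyperparameters; it is not the OpenNMT-tf implementation.

```python
# Sketch of the noise model used for denoising auto-encoding in
# "Unsupervised Machine Translation Using Monolingual Corpora Only":
# drop words at random and apply a bounded local shuffle. The drop
# probability and shuffle window are illustrative choices.
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3, rng=random):
    # Word dropout: remove each token independently with probability drop_prob
    # (keep at least one token so the sentence is never empty).
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # Local shuffle: sort by (index + uniform noise) so tokens move at most
    # about shuffle_window positions from their original place.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

if __name__ == "__main__":
    rng = random.Random(0)
    sent = "the model reconstructs the original sentence from its noisy version".split()
    print(add_noise(sent, rng=rng))
```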
The task-specific dataset we will examine is a corpus of scientific articles from PubMed, collected and distributed by the NLM and the BioASQ challenge.
- 14 million abstracts, titles, and MeSH (Medical Subject Headings) labels
- a hierarchy of parent-to-child label headings (see the sketch below)
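Because the MeSH labels live in a parent-to-child hierarchy, a simple baseline is to expand each abstract's labels upward along the hierarchy and train a flat multi-label (sigmoid / binary cross-entropy) classifier on the expanded set. The mini-hierarchy below is a made-up placeholder, not the real MeSH tree.

```python
# Sketch of expanding hierarchical MeSH-style labels to their ancestors
# before multi-label training. The tiny hierarchy here is illustrative only.

# child -> parent edges (each heading has at most one parent in this toy tree).
PARENT = {
    "Neoplasms by Site": "Neoplasms",
    "Lung Neoplasms": "Neoplasms by Site",
}

def expand_labels(labels, parent=PARENT):
    """Add every ancestor of each assigned heading."""
    expanded = set(labels)
    for label in labels:
        while label in parent:
            label = parent[label]
            expanded.add(label)
    return expanded

def to_multi_hot(labels, vocab):
    """Binary target vector for a sigmoid / binary cross-entropy classifier."""
    return [1.0 if heading in labels else 0.0 for heading in vocab]

if __name__ == "__main__":
    vocab = ["Neoplasms", "Neoplasms by Site", "Lung Neoplasms"]
    doc_labels = expand_labels({"Lung Neoplasms"})
    print(sorted(doc_labels))
    print(to_multi_hot(doc_labels, vocab))
```

The eventual goal, as noted in the introduction, is to move beyond this flat binary cross-entropy baseline toward graph-convolutional / structured-output classifiers over the MeSH graph.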
This was supported by Deep Learning Camp Jeju 2018, which was organized by the TensorFlow Korea User Group. I would also like to thank my wonderful mentor, Minjoon Seo, for his advice and inspiration. Lastly, lots of thanks to all the awesome participants for making this a super fun experience!