Deep Learning for NLP resources
State of the art resources for NLP sequence modeling tasks such as machine translation, image captioning, and dialog.
Deep Learning for NLP
A Primer on Neural Network Models for Natural Language Processing
Yoav Goldberg. Submitted 9/2015, published 11/16. 75 page summary of state of the art.
Resources about word vectors, aka word embeddings, and distributed representations for words.
Word vectors are numeric representations of words where similar words have similar vectors. Word vectors are often used as input to deep learning systems. This process is sometimes called pretraining.
A neural probabilistic language model.
Bengio 2003. Seminal paper on word vectors.
Efficient Estimation of Word Representations in Vector Space
Mikolov et al. 2013. Word2Vec generates word vectors in an unsupervised way by attempting to predict words from a corpus. Describes Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models for learning word vectors.
Skip-gram takes center word and predict outside words. Skip-gram is better for large datasets.
CBOW - takes outside words and predict the center word. CBOW is better for smaller datasets.
Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al. 2013. Learns vectors for phrases such as "New York Times." Includes optimizations for skip-gram: heirachical softmax, and negative sampling. Subsampling frequent words. (i.e. frequent words like "the" are skipped periodically to speed things up and improve vector for less frequently used words)
Linguistic Regularities in Continuous Space Word Representations
Mikolov et al. 2013. Performs well on word similarity and analogy task. Expands on famous example: King – Man + Woman = Queen
Word2Vec source code
Word2Vec tutorial in TensorFlow
word2vec Parameter Learning Explained
GloVe: Global vectors for word representation
Pennington, Socher, Manning. 2014. Creates word vectors and relates word2vec to matrix factorizations. Evalutaion section led to controversy by Yoav Goldberg
Glove source code and training data
Advances in Pre-Training Distributed Word Representations
T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin 2017
FastText library includes English word vectors
Thought vectors are numeric representations for sentences, paragraphs, and documents. This concept is used for many text classification tasks such as sentiment analysis.
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Socher et al. 2013. Introduces Recursive Neural Tensor Network and dataset: "sentiment treebank." Includes demo site. Uses a parse tree.
Distributed Representations of Sentences and Documents
Le, Mikolov. 2014. Introduces Paragraph Vector. Concatenates and averages pretrained, fixed word vectors to create vectors for sentences, paragraphs and documents. Also known as paragraph2vec. Doesn't use a parse tree.
Implemented in gensim. See doc2vec tutorial
Deep Recursive Neural Networks for Compositionality in Language
Irsoy & Cardie. 2014. Uses Deep Recursive Neural Networks. Uses a parse tree.
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Tai et al. 2015 Introduces Tree LSTM. Uses a parse tree.
Semi-supervised Sequence Learning
Dai, Le 2015
Approach: "We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder..."
Result: "With pretraining, we are able to train long short term memory recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups."
Bag of Tricks for Efficient Text Classification
Joulin, Grave, Bojanowski, Mikolov 2016 Facebook AI Research.
"Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation."
Neural Machine Translation
In 2014, neural machine translation (NMT) performance became comprable to state of the art statistical machine translation(SMT).
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (abstract)
Cho et al. 2014 Breakthrough deep learning paper on machine translation. Introduces basic sequence to sequence model which includes two rnns, an encoder for input and a decoder for output.
Neural Machine Translation by jointly learning to align and translate (abstract)
Bahdanau, Cho, Bengio 2014.
Implements attention mechanism. "Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated"
Result: "comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation."
English to French Demo
On Using Very Large Target Vocabulary for Neural Machine Translation
Jean, Cho, Memisevic, Bengio 2014.
"we try replacing each [UNK] token with the aligned source word or its most likely translation determined by another word alignment model."
Result: English -> German bleu score = 21.59 (target vocabulary of 50,000)
Sequence to Sequence Learning with Neural Networks
Sutskever, Vinyals, Le 2014. (nips presentation). Uses seq2seq to generate translations.
Result: English -> French bleu score = 34.8 (WMT’14 dataset)
A key contribution is improvements from reversing the source sentences.
seq2seq tutorial in TensorFlow.
Addressing the Rare Word Problem in Neural Machine Translation (abstract)
Luong, Sutskever, Le, Vinyals, Zaremba 2014
Replace UNK words with dictionary lookup.
Result: English -> French BLEU score = 37.5.
Effective Approaches to Attention-based Neural Machine Translation
Luong, Pham, Manning. 2015
2 models of attention: global and local.
Result: English -> German 25.9 BLEU points
Context-Dependent Word Representation for Neural
Choi, Cho, Bengio 2016
"we propose to contextualize the word embedding vectors using a nonlinear bag-of-words representation of the source sentence."
"we propose to represent special tokens (such as numbers, proper nouns and acronyms) with typed symbols to facilitate translating those words that are not well-suited to be translated via continuous vectors."
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Wu et al. 2016
"WMT’14 English-to-French, our single model scores 38.95 BLEU"
"WMT’14 English-to-German, our single model scores 24.17 BLEU"
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
Johnson et al. 2016
Translations between untrained language pairs.
Convolutional Sequence to Sequence Learning
Gehring et al. 2017 Facebook AI research blog post
Architecture: Convolutional sequence to sequence. ConvS2s.
Results: "We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU."
Transformer: A Novel Neural Network Architecture for Language Understanding
Arcitecture: Transformer, a T2T model introduced by Google in Attention is all you need
Results: "we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks."
T2T Source code
T2T blog post
8/15/18 Google increases BLEU score by 1 and uses universal tranformer in other domains besides translation.
DeepL Translator claims to outperform competitors but doesn't disclose their architecture. "Specific details of our network architecture will not be published at this time. DeepL Translator is based on a single, non-ensemble model."
Conversation modeling / Dialog
Neural Responding Machine for Short-Text Conversation
Shang et al. 2015 Uses Neural Responding Machine. Trained on Weibo dataset. Achieves one round conversations with 75% appropriate responses.
A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Sordoni et al. 2015. Generates responses to tweets.
Uses Recurrent Neural Network Language Model (RLM) architecture of (Mikolov et al., 2010). source code: RNNLM Toolkit
Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Serban, Sordoni, Bengio et al. 2015. Extends hierarchical recurrent encoder-decoder neural network (HRED).
Attention with Intention for a Neural Network Conversation Model
Yao et al. 2015 Architecture is three recurrent networks: an encoder, an intention network and a decoder.
A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
Serban, Sordoni, Lowe, Charlin, Pineau, Courville, Bengio 2016
Proposes novel architecture: VHRED. Latent Variable Hierarchical Recurrent Encoder-Decoder
Compares favorably against LSTM and HRED.
A Neural Conversation Model
Vinyals, Le 2015. Uses LSTM RNNs to generate conversational responses. Uses seq2seq framework. Seq2Seq was originally designed for machine translation and it "translates" a single sentence, up to around 79 words, to a single sentence response, and has no memory of previous dialog exchanges. Used in Google Smart Reply feature for Inbox
Incorporating Copying Mechanism in Sequence-to-Sequence Learning
Gu et al. 2016 Proposes CopyNet, builds on seq2seq.
A Persona-Based Neural Conversation Model
Li et al. 2016 Proposes persona-based models for handling the issue of speaker consistency in neural response generation. Builds on seq2seq.
Deep Reinforcement Learning for Dialogue Generation
Li et al. 2016. Uses reinforcement learing to generate diverse responses. Trains 2 agents to chat with each other. Builds on seq2seq.
Deep learning for chatbots
Article summary of state of the art, and challenges for chatbots.
Deep learning for chatbots. part 2
Implements a retrieval based dialog agent using dual encoder lstm with TensorFlow, based on the Ubuntu dataset [paper] includes source code
Memory and Attention Models
Attention mechanisms allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector. - Attention and Memory in Deep Learning and NLP
Memory Networks Weston et. al 2014, and
End-To-End Memory Networks Sukhbaatar et. al 2015.
Memory networks are implemented in MemNN. Attempts to solve task of reason attention and memory.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Weston 2015. Classifies QA tasks like single factoid, yes/no etc. Extends memory networks.
Evaluating prerequisite qualities for learning end to end dialog systems
Dodge et. al 2015. Tests Memory Networks on 4 tasks including reddit dialog task.
See Jason Weston lecture on MemNN
Neural Turing Machines
Graves, Wayne, Danihelka 2014.
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-toend, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples. Olah and Carter blog on NTM
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets
Joulin, Mikolov 2015. Stack RNN source code and blog post