# Introduction

1. NLP research advances in 2020 are still dominated by **large pre-trained language models, and specifically transformers.** Many interesting **updates** introduced this year have **made the transformer architecture more efficient and applicable to long documents.**

1. Another **hot topic** relates to **the evaluation of NLP models** in different applications. We still lack evaluation approaches that clearly show where a model fails and how to fix it.

1. Also, with the growing capabilities of language models such as **GPT-3, conversational AI** is enjoying a new wave of interest. Chatbots are improving, with several impressive bots like Meena and Blender introduced this year by top technology companies.

To help you stay up to date with the latest NLP research breakthroughs, we’ve curated and summarized the key research papers in natural language processing from 2020. The papers cover the **leading language models, updates to the transformer architecture, novel evaluation approaches, and major advances in conversational AI.**




# Language Models: Updates to the Transformer Architecture

## T5: Text-to-Text Transformer (Oct 2019)

### Paper's Abstract

[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. **By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.** To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

### Our Summary 

The **Google research team** suggests a unified approach to transfer learning in NLP with the goal of setting a new state of the art in the field. To this end, they propose **treating each NLP problem as a “text-to-text” problem.** Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a **Text-to-Text Transfer Transformer (T5)** and train it on a large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.

![image](https://user-images.githubusercontent.com/28102493/109564398-4ba0ed00-7ae1-11eb-9816-e16949e92118.png)


### What’s the core idea of this paper?
The paper has several important contributions:

1. Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing techniques.

1. Introducing a new approach to transfer learning in NLP by proposing to treat every NLP problem as a **text-to-text task:**

    1. The model understands which task should be performed thanks to the task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”).

1. Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text, the **Colossal Clean Crawled Corpus (C4).**

1. Training a large (up to 11B parameters) model, called **Text-to-Text Transfer Transformer (T5)** on the C4 dataset.
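
The text-to-text framing in the list above can be illustrated with a minimal sketch. The two prefix strings are the ones quoted in this section; the helper name `to_text_to_text` and the task keys are hypothetical and are not part of the released T5 code.

```python
# Illustrative only: the prefixes come from the examples quoted above,
# while the dictionary keys and function name are assumptions.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Prepend the task prefix so a single model can serve every task."""
    return TASK_PREFIXES[task] + text


inputs = [
    to_text_to_text("translate_en_de", "That is good."),
    to_text_to_text("summarize", "A long news article goes here."),
]
for line in inputs:
    print(line)
```

Because every task is reduced to "string in, string out", the same model, loss, and decoding procedure can be reused across tasks; only the prefix changes.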


### What’s the key achievement?
1. The T5 model with 11 billion parameters achieved state-of-the-art performance on 17 out of 24 tasks considered, including:

    1. the [GLUE](https://gluebenchmark.com/leaderboard) score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI tasks;
    1. the Exact Match score of 90.06 on [SQuAD dataset](https://paperswithcode.com/sota/question-answering-on-squad11-dev);
    1. the [SuperGLUE](https://super.gluebenchmark.com/leaderboard) score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6) and very close to human performance (89.8);
    1. the ROUGE-2-F score of 21.55 on [CNN/Daily Mail](https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail) abstractive summarization task.

### What are future research areas?

1. Researching the methods to achieve **stronger performance with cheaper models.**
1. Exploring more efficient knowledge extraction techniques.
1. Further investigating the **language-agnostic models.**

### What are possible business applications?

Even though the introduced model has billions of parameters and may be too heavy to deploy in a business setting, the presented ideas can be used to improve performance on different NLP tasks, including summarization, question answering, and sentiment analysis.

### Resources

The pretrained models together with the dataset and code are released on [GitHub](https://github.com/google-research/text-to-text-transfer-transformer).

## Reformer: The Efficient Transformer (Jan 2020)

### Paper's Abstract
[Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451)

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, **we replace dot-product attention by one that uses locality-sensitive hashing,** **changing its complexity from O(L²) to O(L log L),** where L is the length of the sequence. Furthermore, we **use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers.** *The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*

### Our Summary 
The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the **Google Research team** introduces several techniques that improve the efficiency of Transformers. In particular, they suggest **(1) using reversible layers to allow storing the activations only once instead of for each layer,** and **(2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention.** Experiments on several text tasks demonstrate that the introduced **Reformer model matches the performance of the full Transformer but runs much faster and with much better memory efficiency.**
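
To make the hashing idea more concrete, here is a simplified sketch of angular locality-sensitive hashing in plain Python. It assumes random Gaussian projections and assigns each vector to the bucket with the largest score among the projections and their negations. The function names are hypothetical, and the sorting and chunking steps of the full Reformer attention are left out.

```python
import random


def lsh_bucket(vec, projections):
    """Angular LSH: index of the largest entry in [xR; -xR]."""
    scores = [sum(v * r for v, r in zip(vec, row)) for row in projections]
    scores += [-s for s in scores]
    return max(range(len(scores)), key=scores.__getitem__)


def bucket_positions(vectors, n_buckets, seed=0):
    """Group sequence positions by hash bucket (n_buckets should be even)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    projections = [[rng.gauss(0, 1) for _ in range(dim)]
                   for _ in range(n_buckets // 2)]
    buckets = {}
    for pos, vec in enumerate(vectors):
        buckets.setdefault(lsh_bucket(vec, projections), []).append(pos)
    return buckets


rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(64)]
buckets = bucket_positions(vectors, n_buckets=8)

# Full attention scores every (query, key) pair; bucketed attention
# only scores pairs that landed in the same bucket.
full_pairs = len(vectors) ** 2
lsh_pairs = sum(len(b) ** 2 for b in buckets.values())
print(full_pairs, lsh_pairs)
```

Vectors that point in similar directions tend to share a bucket, so restricting attention to within-bucket pairs keeps the most relevant keys while skipping most of the pairwise comparisons.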

![image](https://user-images.githubusercontent.com/28102493/109565556-eea63680-7ae2-11eb-8736-4c5cec8aea86.png)


### What’s the core idea of this paper?

1. The leading Transformer models require huge computational resources because of the very high number of parameters and several other factors:

    1. The **activations of every layer need to be stored for back-propagation.**
    1. The intermediate feed-forward layers account for a large fraction of memory use since their depth is often much larger than the depth of attention activations.
    1. The **complexity of attention on a sequence of length L is O(L²).**

1. To address these problems, the research team introduces the Reformer model with the following improvements:

    1. using **reversible layers to store only a single copy of activations;**
    1. splitting activations inside the feed-forward layers and processing them in chunks;
    1. approximating attention computation based on **locality-sensitive hashing.**
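
The reversible-layer idea from the list above can be sketched in a few lines. The block computes `y1 = x1 + F(x2)` and `y2 = x2 + G(y1)`, so the inputs can be recomputed from the outputs instead of being stored. Here `f` and `g` are toy stand-ins (an assumption) for the attention and feed-forward sub-layers.

```python
import math


def f(x):
    """Toy stand-in for the attention sub-layer (assumption)."""
    return [math.tanh(v) for v in x]


def g(x):
    """Toy stand-in for the feed-forward sub-layer (assumption)."""
    return [0.5 * v * v for v in x]


def rev_forward(x1, x2):
    """y1 = x1 + F(x2); y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2):
    """Recover the layer inputs from its outputs, with no stored activations."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [0.1, -0.2], [0.3, 0.4]
h1, h2 = x1, x2
for _ in range(3):          # forward through three reversible layers
    h1, h2 = rev_forward(h1, h2)
for _ in range(3):          # walk back, recomputing inputs layer by layer
    h1, h2 = rev_inverse(h1, h2)
max_error = max(abs(a - b) for a, b in zip(h1 + h2, x1 + x2))
print(max_error)
```

During back-propagation, only the output of the last layer needs to be kept; each earlier activation is reconstructed on the fly, up to floating-point rounding.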

### What’s the key achievement?

1. By analyzing the introduced techniques one by one, the authors show that model accuracy is not sacrificed by:
    
    1. switching to locality-sensitive hashing attention;
    1. using reversible layers.

1. Reformer performs on par with the full Transformer model while demonstrating much higher speed and memory efficiency. For example, on the newstest2014 task for machine translation from English to German, the Reformer base model gets a BLEU score of 27.6 compared to the BLEU score of 27.3 reported by [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762).

### What does the AI community think?

The paper was selected for [oral presentation](https://iclr.cc/virtual_2020/poster_rkgNKkHtvB.html) at ICLR 2020, the leading conference in deep learning.

### What are possible business applications?
The suggested efficiency improvements enable more widespread Transformer application, especially for the tasks that depend on large-context data, such as:

1. text generation;
1. visual content generation;
1. music generation;
1. time-series forecasting.

### Resources

1. The official code implementation from Google is publicly available on [GitHub](https://github.com/google/trax/tree/master/trax/models/reformer).

1. The PyTorch implementation of Reformer is also available on [GitHub](https://github.com/lucidrains/reformer-pytorch).


## ELECTRA (Mar 2020)

### Paper's Abstract 

[ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more **sample-efficient pre-training task called replaced token detection.** *Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network.* Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this **new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.** **As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute.** The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

### Our Summary 

The pre-training task for popular language models like BERT and XLNet involves **masking a small subset of the unlabeled input** and then training the network to recover the original input. Even though it works quite well, **this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%).** As an alternative, the **researchers from Stanford University and Google Brain** propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, a discriminator is pre-trained to predict whether each token is an original or a replacement. **As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient.** The experiments confirm that the introduced approach leads to significantly **faster training and higher accuracy on downstream NLP tasks.**

![image](https://user-images.githubusercontent.com/28102493/109560992-daf7d180-7adc-11eb-9c77-91bb32adcbca.png)

### What’s the core idea of this paper?

1. **Pre-training methods that are based on masked language modeling are computationally inefficient as they use only a small fraction of tokens for learning.**

1. Researchers propose a new pre-training task called **replaced token detection**, where:

    1. some tokens are replaced by samples from a small generator network; 
    1. a model is pre-trained as a discriminator to distinguish between original and replaced tokens.

1. The introduced approach, called **ELECTRA** (Efficiently Learning an Encoder that Classifies Token Replacements Accurately):
    1. enables the model to learn from all input tokens instead of the small masked-out subset;
    1. is not adversarial, despite the similarity to GAN, as the generator producing tokens for replacement is trained with maximum likelihood.
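
The replaced token detection task described above can be sketched as a small labeling routine. The generator is not modeled here: its proposals are passed in as a plain dictionary, and the function name is hypothetical. Following the paper, a position where the generator happens to reproduce the original token is labeled as original.

```python
def replaced_token_detection(original, samples):
    """Build the discriminator input and its per-token labels.

    `samples` maps a masked position to the token proposed by the generator.
    Label 1 means "replaced", label 0 means "original".
    """
    corrupted = [samples.get(i, tok) for i, tok in enumerate(original)]
    labels = [int(c != o) for c, o in zip(corrupted, original)]
    return corrupted, labels


original = ["the", "chef", "cooked", "the", "meal"]
samples = {0: "the", 2: "ate"}  # generator proposals for two masked positions

corrupted, labels = replaced_token_detection(original, samples)
print(corrupted)  # ['the', 'chef', 'ate', 'the', 'meal']
print(labels)     # [0, 0, 1, 0, 0]

# Masked language modeling gets a training signal only at masked positions,
# whereas the discriminator is trained on every position.
print(len(samples), len(labels))  # 2 5
```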

### What’s the key achievement?

1. Demonstrating that the discriminative task of distinguishing between real data and challenging negative samples is more efficient than existing generative methods for language representation learning.

1. Introducing a model that substantially outperforms state-of-the-art approaches while requiring less pre-training compute:

    1. ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT model with a score of 75.1 and a much larger GPT model with a score of 78.8.

    1. An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25% of their pre-training compute.

    1. ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and SQuAD benchmarks while still requiring less pre-training compute.

### What does the AI community think?

The paper was selected for presentation at [ICLR 2020](https://iclr.cc/virtual_2020/poster_r1xMH1BtvB.html), the leading conference in deep learning.

### What are possible business applications?

Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.

### Resources

The original TensorFlow implementation and pre-trained weights are released on [GitHub](https://github.com/google-research/electra).


## Longformer: The Long-Document Transformer (Apr 2020)

### Paper's Abstract

[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)

**Transformer-based models** are unable to process long sequences due to their **self-attention operation, which scales quadratically with the sequence length.** To address this limitation, we introduce the **Longformer** with an **attention mechanism** that **scales linearly with sequence length**, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

### Our Summary 

Self-attention is one of the key factors behind the success of the Transformer architecture. However, it also makes transformer-based models hard to apply to long documents. The existing techniques usually divide the long input into a number of chunks and then use complex architectures to combine information across these chunks. The research team from the **Allen Institute for Artificial Intelligence** introduces a more elegant solution to this problem. ***The suggested Longformer model employs an attention pattern that combines local windowed attention with task-motivated global attention.*** This attention mechanism **scales linearly with the sequence length and enables processing of documents with thousands of tokens.** The experiments demonstrate that Longformer achieves state-of-the-art results on character-level language modeling tasks, and when **pre-trained, consistently outperforms RoBERTa on long-document tasks.**

![image](https://user-images.githubusercontent.com/28102493/109558380-a0406a00-7ad9-11eb-8b0a-a17cbd0459bd.png)


### What’s the core idea of this paper?

1. The computational requirements of **self-attention grow quadratically with sequence length,** making long sequences hard to process on current hardware.
    
1. To address this issue, the researchers present **Longformer, a modified version of Transformer architecture** that:

    1. allows **memory usage to scale linearly**, and not quadratically, with the sequence length;
    1. includes **an attention mechanism that combines**:
        1. a **windowed local-context self-attention** to build contextual representations;
        1. an **end-task-motivated global attention to encode inductive bias about the task and build a full-sequence representation.**

1. Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication that is not supported in the existing deep learning libraries like PyTorch and TensorFlow, the authors also introduce a custom CUDA kernel for implementing these attention operations.
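
The combined attention pattern described above can be sketched as a set of allowed (query, key) pairs. This is an illustration in plain Python, not the banded CUDA implementation; the function name, window size, and choice of global position are assumptions.

```python
def longformer_pairs(seq_len, window, global_positions=()):
    """Return the (query, key) pairs allowed by local plus global attention."""
    half = window // 2
    pairs = set()
    # Local attention: each token sees `half` neighbors on either side.
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        pairs.update((i, j) for j in range(lo, hi))
    # Global attention: selected tokens attend everywhere and are attended by all.
    for g in global_positions:
        for j in range(seq_len):
            pairs.add((g, j))
            pairs.add((j, g))
    return pairs


for seq_len in (256, 512, 1024):
    allowed = len(longformer_pairs(seq_len, window=32, global_positions=[0]))
    # Compare against the seq_len * seq_len pairs of full self-attention.
    print(seq_len, allowed, seq_len * seq_len)
```

With a fixed window and a fixed number of global tokens, the number of allowed pairs grows in proportion to the sequence length, while full self-attention grows with its square.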


### What’s the key achievement?
1. The Longformer model achieves a new state of the art on character-level language modeling tasks:

    1. BPC of 1.10 on text8;
    1. BPC of 1.00 on enwik8.

1. **After pre-training and fine-tuning for six tasks**, including classification, question answering, and coreference resolution, **the Longformer-base consistently outperforms the RoBERTa-base with:**

    1. accuracy of 75.0 vs. 72.4 on WikiHop;
    1. F1 score of 75.2 vs. 74.2 on TriviaQA;
    1. joint F1 score of 64.4 vs. 63.5 on HotpotQA;
    1. average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task;
    1. accuracy of 95.7 vs. 95.3 on the IMDB classification task;
    1. F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task.

*The performance gains are especially remarkable for the tasks that require a long context (i.e., WikiHop and Hyperpartisan).*

### What are future research areas?

1. Exploring other attention patterns that are more efficient due to dynamic adaptation to the input. 
1. Applying Longformer to other relevant long document tasks such as summarization.

### What are possible business applications?

The Longformer architecture can be very advantageous for the downstream NLP tasks that often require processing of long documents:

1. document classification;
1. question answering;
1. coreference resolution;
1. summarization;
1. semantic search.

### Resources

The code implementation of Longformer is open-sourced on [GitHub](https://github.com/allenai/longformer).


# Resources

1. [GPT-3 & Beyond: 10 NLP Research Papers You Should Read](https://www.topbots.com/nlp-research-papers-2020/)