# i. Overview
---
---


The following picture gives an overview of the **Natural Language Processing** world.

![Images](./imgs/nlp.png)

## i.a Pre-training
Pre-training means running a corpus through the BERT architecture, where **masked language modeling** (MLM) and **next sentence prediction** (NSP) are used to derive the weights (some models, following the RoBERTa paper, no longer use NSP). You can do this either:

- from scratch with your own vocabulary and randomly initialized weights;
- using the pre-trained BERT vocab/weights (so you are in effect “pre-training a pre-trained model.”)
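
To make the MLM objective concrete, here is a minimal sketch of how a token sequence can be corrupted before it is fed to the model. The 15% selection rate and the 80/10/10 replacement split follow the original BERT recipe; the function name `mask_tokens`, the toy vocabulary and the use of `None` for "no loss at this position" are assumptions made for illustration only.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at every selected position and
    None elsewhere, so the loss is only computed on selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the token unchanged
        else:
            corrupted.append(token)
            labels.append(None)
    return corrupted, labels


# Hypothetical toy input; a real pipeline would work on sub-word ids.
corrupted, labels = mask_tokens("the cat sat on the mat".split(), ["dog", "ran", "hat"], seed=0)
```

The model is then trained to predict the original token at each position where `labels` is not `None`.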

## i.b Fine Tuning
In general, **Fine Tuning** means training the model again. In this context there are two ways of doing fine-tuning:

- **pre-train** fine tuning
- **downstream task** fine tuning

The first option (`pre-train fine tuning`) consists of fine tuning the model using the model itself and the loss the model has been pre-trained with.
The second option (`downstream task fine tuning`) consists of adding a "layer" (it can take any form, including a decoder or something custom) to the BERT architecture in order to test the architecture's performance on some **downstream task** (such as classification, summarization, and more). In this case the loss we fine-tune with has to be task-specific.
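
As an illustration of the second option, the sketch below puts a linear classification "layer" on top of a pooled encoder output and computes a task-specific loss (softmax cross-entropy). It is a dependency-free toy: the encoder is not shown, and the names `classification_head` and `cross_entropy` as well as the example numbers are assumptions, not part of any particular library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(pooled, weights, bias):
    """Linear layer: one logit per class from a pooled encoder output."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


def cross_entropy(logits, label):
    """Task-specific loss for single-label classification."""
    return -math.log(softmax(logits)[label])


# Hypothetical pooled output of the encoder (3 dimensions, 2 classes).
pooled = [0.5, -1.0, 2.0]
weights = [[0.1, 0.0, 0.3], [-0.2, 0.4, 0.0]]
bias = [0.0, 0.0]
loss = cross_entropy(classification_head(pooled, weights, bias), label=0)
```

During downstream fine-tuning, the gradient of this loss would update both the new layer and (optionally) the encoder weights.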

## i.c NLP tasks

The following tasks go by various names (NLP **fine-tuning** tasks, **probing** tasks, **downstream** tasks).
The typical ones are:

### i.c.1 Neural Machine Translation (NMT)
---
    - {textLang1 - textLang2}
### i.c.2 Question Answering (QA)
---
    - {text - question - index start/end}
    
### i.c.3 Natural Language Inference (NLI) 
---
    - {dataset: SNLI, MultiNLI, SciTail} 
    - {premise - hypothesis - relation} 
    
### i.c.4 Language Modeling (LM)
---

#### Semantic Text Similarities (STS)
    - {text1 - text2 - similarities (cosine/other)}
        
#### Keyphrase Extraction/Generation (KPE/KPG)
    - {text - keyphrases}
        
#### Summarization (summ) 
    - {text - summary}
    
#### Simplification (simpl) 
    - {text - simple text};

#### Text Classification (GLUE)
    - {text1 - class};
    - {text1 - text2 - class};
    
The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not. It is a dataset of sentences labeled as grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)


## Our objective

The topics we are interested in are the following:

- Text Classification (classif);
- Sentence Embedding (sent-emb);
- Semantic Text Similarities (STS);
- Keyphrase Extraction/Generation (KPE/KPG);
- Summarization (summ);
- Simplification (simpl);


# 0 Approaches
---
---


## Supervised or semi-supervised
---
The most successful universal sentence embedding methods are pretrained on the (human-labelled) natural language inference (NLI) datasets Stanford NLI (**SNLI**) and **MultiNLI**. 
NLI is the task of classifying a pair of sentences (denoted the “hypothesis” and the “premise”) into one of three relationships: entailment, contradiction or neutral. 
The effectiveness of NLI for training universal sentence encoders was demonstrated by the supervised
method InferSent.


## Unsupervised
---
**Skip-thought** and **FastSent** are two popular unsupervised techniques that learn sentence embeddings by using an encoding of a sentence to predict words in neighbouring sentences. 
However, in addition to being computationally expensive, this generative objective forces the model to reconstruct the surface form of a sentence, which may produce representations that capture aspects irrelevant to the meaning of a sentence.
QuickThoughts addresses both of these shortcomings with a simple discriminative objective; given a sentence and its context (adjacent sentences), it learns sentence representations by training a classifier to distinguish context sentences
from non-context sentences.
The unifying theme of these unsupervised approaches is that they exploit the distributional hypothesis, namely that the meaning of a word (and by extension, a sentence) is characterized by the word-context in which it appears.


## Sentence-Transformers
---
Our overall approach is most similar to Sentence Transformers – we pretrain a transformer-based language model to produce useful sentence embeddings – but our proposed objective is self-supervised, allowing us to exploit the vast amount of unlabelled text on the web, without being restricted to languages or domains where labelled data exist. Our objective most closely resembles QuickThoughts, with three distinctions: 

- we relax our sampling to textual segments (rather than natural sentences),
- we sample one or more positive segments per anchor (rather than strictly one), 
- we allow these segments to be adjacent, overlapping or subsuming (rather than strictly adjacent; see Figure 1, B).

![Images](./imgs/DeCLUTRfig1.png)


Figure 1: Overview of the self-supervised contrastive objective. 

(A) For each document `d` in a minibatch of size `N`, we sample `A` anchor spans per document and `P` positive spans per anchor. For simplicity, we illustrate the case where `A = P = 1` and denote the anchor-positive span pair as `s_i`, `s_j`. Both spans are fed through the same encoder `f(·)` and pooler `g(·)` to produce the corresponding embeddings `e_i = g(f(s_i))`, `e_j = g(f(s_j))`. The encoder and pooler are trained to minimize the distance between embeddings via a contrastive prediction task (where the other embeddings in a minibatch are treated as negatives, omitted here for simplicity). 

(B) Positive spans can overlap with, be adjacent to or be subsumed by the sampled anchor span. 

(C) The lengths of anchors and positives are randomly sampled from beta distributions, skewed toward longer and shorter spans, respectively.
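
The contrastive prediction task and the span-length sampling described in the caption can be sketched as follows. This is an illustrative, dependency-free approximation, not the authors' implementation: the use of cosine similarity, the temperature value, the function names and the beta parameters in the usage lines are all assumptions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one anchor.

    The loss is low when the positive embedding is more similar to the
    anchor than every negative embedding in the minibatch.
    """
    candidates = [positive] + list(negatives)
    logits = [cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def sample_span_length(max_len, alpha, beta, min_len=1, rng=random):
    """Draw a span length in [min_len, max_len] from a beta distribution."""
    fraction = rng.betavariate(alpha, beta)
    return min_len + int(fraction * (max_len - min_len))


# Assumed parameters: alpha > beta skews toward longer spans (anchors),
# alpha < beta skews toward shorter spans (positives).
anchor_len = sample_span_length(512, alpha=4, beta=2)
positive_len = sample_span_length(512, alpha=2, beta=4)
```

Minimizing `contrastive_loss` pulls `e_i` and `e_j` together while pushing the anchor away from the other embeddings in the minibatch.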

# 1. Datasets
---
---


In this notebook we analyse the datasets we are going to use.

There are many datasets for Keyphrase Extraction and Generation, but few of them are large. The datasets that can be used for training are reported below, together with the subsets that are actually selected for testing. These datasets are the benchmarks for the keyphrase tasks.


## 1.1 Keyphrase
---


SOTA: [keyphrase generation](https://arxiv.org/pdf/1704.06879.pdf).

- **Inspec** [(Hulth, 2003)](https://www.aclweb.org/anthology/W03-1028.pdf): This dataset provides *2,000 paper abstracts*. We adopt the *500 testing* papers and their corresponding uncontrolled keyphrases for evaluation, and the remaining *1,500 papers* are used for *training* the supervised baseline models.

- **Krapivin** [(Krapivin et al., 2008)](http://eprints.biblio.unitn.it/1671/1/disi09055-krapivin-autayeu-marchese.pdf): This dataset provides *2,304 papers with full-text* and *author-assigned keyphrases*. However, the authors did not mention how to split the testing data, so we selected the first *400 papers in alphabetical order as the testing data*, and the *remaining* papers are used to *train* the supervised baselines.

- **NUS** [(Nguyen and Kan, 2007)](https://www.comp.nus.edu.sg/~kanmy/papers/icadl2007.pdf): We use both author-assigned and reader-assigned keyphrases and treat *all 211 papers as the testing data*. Since the NUS dataset did not specifically mention the ways of splitting training and testing data, the results of the supervised baseline models are obtained through a *five-fold cross-validation*.

- **SemEval-2010** [(Kim et al., 2010)](https://www.aclweb.org/anthology/S10-1004.pdf): 288 articles were collected from the ACM Digital Library. 100 articles were used for testing and the rest were used for training supervised baselines.

- **KP20k dataset** [(Meng et al., 2018)](https://arxiv.org/abs/1704.06879): They built a new testing dataset that contains the *titles, abstracts, and keyphrases* of *20,000 scientific articles* in computer science. They were *randomly selected from their obtained 567,830 articles*. Thus they took the 20,000 articles in the validation set to train the supervised baselines.

- **MagKP-CS** (from OpenNMT-py and [OpenNMT-kpg-release](https://github.com/memray/OpenNMT-kpg-release)): this dataset is available for download.

- **STACKEX** (from [StackExchange](https://archive.org/details/stackexchange)) has been constructed from the computer science forums (CS/AI) at StackExchange using “title” + “body” as source text and “tags” as the target keyphrases. After removing questions without valid tags, they collected 330,965 questions. They randomly selected *16,000 for validation*, and another *16,000 as test set*. Note that some questions in StackExchange forums contain large blocks of code, resulting in long texts (sometimes more than 10,000 tokens after tokenization), which are difficult for most neural models to handle. Consequently, the texts have been truncated to 300 tokens and 1,000 tokens for training and evaluation splits respectively.
 
 
| **Dataset** | **#Train** | **#Valid** | **#Test** | **#KP** | **#PreDoc** | **#PreKP** | **#AbsDoc** | **#AbsKP** |
| :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **KP20k** | 514k | 19,992 | 19,987 | 105,181 | 19,059 | 66,746 | 16,325 | 38,435 |
| **MagKP** | 2.7m | -- | -- | 34.8m | -- | -- | -- | -- |
| **Inspec** | -- | 1,500 | 500 | 4,913 | 497 | 3,921 | 363 | 992 |
| **Krapivin** | -- | 1,844 | 460 | 2,641 | 438 | 1,492 | 416 | 1,149 |
| **NUS** | -- | -- | 211 | 2,461 | 207 | 1,260 | 195 | 1,201 |
| **Semeval** | -- | 144 | 100 | 1,507 | 100 | 673 | 99 | 834 |
| **StackEx** | 298k | 16,000 | 16,000 | 43,131 | 13,498 | 24,864 | 10,967 | 18,267 |
| **DUC** | -- | -- | 308 | 2,484 | 308 | 2,421 | 38 | 63 |




## 1.2 Sentence embedding
---


SOTA: [sBERT](https://arxiv.org/abs/1908.10084)

- **SNLI** [(Bowman et al., 2015)](https://arxiv.org/abs/1508.05326) is a collection of *570,000 sentence pairs* annotated with the *labels contradiction, entailment, and neutral*.

- **MultiNLI** [(Williams et al., 2018)](https://arxiv.org/abs/1704.05426) contains *430,000 sentence pairs* and covers a *range of genres of spoken and written text*.

- **SciTail** [(allenai)](http://ai2-website.s3.amazonaws.com/publications/scitail-aaai-2018_cameraready.pdf) is an entailment dataset of 27k examples. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist “in the wild”. Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.

## 1.3 Generic NLP tasks
---

- **S2ORC** [(Lo et al., 2020)](https://github.com/allenai/s2orc) is a large corpus of *81.1M English-language academic papers* spanning many academic disciplines. The corpus consists of *rich metadata, paper abstracts, resolved bibliographic references*, as well as *structured full text for 8.1M open access papers*. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, they aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. Built for text mining over academic text.

- **OAG** [(Tang et al., 2008)](http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD08-Tang-et-al-ArnetMiner.pdf)  is a large knowledge graph unifying *two billion-scale academic graphs*: Microsoft Academic Graph (**MAG**) and **AMiner**. In mid 2017, they published OAG v1, which contains *166,192,182 papers from MAG and 154,771,162 papers from AMiner* and generated *64,639,608 linking (matching) relations between the two graphs*. This time, in OAG v2, author, venue and newer publication data and the corresponding matchings are available.

# 2. Models
---
---


## 2.1 Keyphrase 
---


### 2.1.1 Non-Neural (Parametric)
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.

#### TfIdf

#### TextRank

#### KEA

#### Maui

#### SIF

#### uSIF

#### GEM



### 2.1.2 Neural (Non-Parametric)
Non-parametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features.

#### CatSeq

#### CatSeqD

#### CopyRNN



### 2.1.3 Evaluation

#### Greedy search

#### Beam search


### 2.1.4 Models

#### One2One

#### One2Seq







## 2.2 STS
---

### 2.2.1 BERT


### 2.2.2 sBERT

### 2.2.3 SciBERT 

### 2.2.4 USE

### 2.2.5 BERT-wk

### 2.2.6 XL-Net

### 2.2.7 inferSent

### 2.2.8 skip-thought

### 2.2.9 ? P2V ? rBERT

### 2.2.10 Doc2Vec

### 2.2.11 GloVe

### 2.2.12 ELMo

### 2.2.13 LongFormer

### 2.2.14 SPECTER

# 3. Evaluation
---
---

The evaluation tasks differ depending on the main goal. For Sentence-BERT models, evaluation is done on the common **Semantic Textual Similarity** tasks and on the **SentEval** transfer tasks. There are also some ablation studies.
An **ablation study** typically refers to removing some “feature” of the model or algorithm and seeing how that affects performance.


## 3.1 STS
---
Semantic textual similarity deals with determining how similar two pieces of text are. This can take the form of assigning a score from 1 to 5. Related tasks are paraphrase or duplicate identification. The evaluation criterion is Pearson correlation. Several sites, such as [(nlp-progress)](http://nlpprogress.com/english/semantic_textual_similarity.html), track the progress in Natural Language Processing (NLP).
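
A common way to apply this criterion to sentence embeddings is to score each sentence pair with the cosine similarity of its two embeddings and then correlate those scores with the human annotations. The sketch below shows both steps in plain Python; the embeddings and gold scores are made-up toy values, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: (embedding of sentence 1, embedding of sentence 2) and gold scores.
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 2.5, 0.2]
predicted = [cosine(u, v) for u, v in pairs]
correlation = pearson(predicted, gold)
```

Because Pearson correlation is invariant to linear rescaling, the cosine scores do not need to be mapped onto the annotation scale first.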


### 3.1.1 Unsupervised STS
Evaluate the performance of a model for STS without using any STS specific training data.
Datasets are:

- **STS tasks** 2012 - 2016 [(Agirre et al., 2012, 2013, 2014, 2015, 2016)](https://www.aclweb.org/anthology/S12-1051.pdf), 
- **STS benchmark** [(Cer et al., 2017)](https://www.aclweb.org/anthology/S17-2001.pdf),
- **SICK-Relatedness** dataset [(Marelli et al., 2014)](http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf). 


### 3.1.2 Supervised STS
The STS benchmark (STSb) [(Cer et al., 2017)](https://www.aclweb.org/anthology/S17-2001.pdf) is a popular dataset to evaluate supervised STS systems. The data includes *8,628 sentence pairs* from the *three categories captions, news, and forums*. It is divided into *train (5,749), dev (1,500) and test (1,379)*.


### 3.1.3 Argument Facet Similarity
The Argument Facet Similarity (AFS) corpus was introduced by [Misra et al. (2016)](https://www.aclweb.org/anthology/W16-3636.pdf). It contains *6,000 annotated sentential argument pairs* from social media dialogs on *three controversial topics: gun control, gay marriage, and death penalty*. The data was annotated on a scale from 0 (“different topic”) to 5 (“completely equivalent”). The similarity notion in the *AFS corpus is fairly different* from the similarity notion in the *STS datasets from SemEval*. 
**STS** data is usually *descriptive*, while **AFS** data are *argumentative excerpts from dialogs*. To be considered similar, arguments must not only make similar claims, but also provide a similar reasoning. Further, *the lexical gap between the sentences in AFS is much larger*.


### 3.1.4 Wikipedia Sections Distinction
Dor et al. (2018) use Wikipedia to create a thematically fine-grained train, dev and test set for sentence embedding methods. Wikipedia articles are separated into distinct sections focusing on certain aspects. Dor et al. (2018) assume that sentences in the same section are thematically closer than sentences in different sections. They use this to create a large dataset of weakly labeled sentence triplets: 

- the **anchor** and the **positive** example come from the same section;
- the **negative** example comes from a different section of the same article.

As the evaluation metric we can use accuracy: is the positive example closer to the anchor than the negative example?
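
This triplet accuracy can be sketched in a few lines. The choice of Euclidean distance and the toy vectors are assumptions; any distance over sentence embeddings could be plugged in through the `distance` argument.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_accuracy(triplets, distance=euclidean):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if distance(anchor, positive) < distance(anchor, negative)
    )
    return correct / len(triplets)


# Toy embeddings: the first triplet is ranked correctly, the second is not.
triplets = [
    ([0.0, 0.0], [0.0, 1.0], [3.0, 3.0]),
    ([0.0, 0.0], [5.0, 5.0], [1.0, 0.0]),
]
accuracy = triplet_accuracy(triplets)
```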


## 3.2 SentEval
---
SentEval [(Conneau and Kiela, 2018)](https://arxiv.org/abs/1803.05449) is a popular toolkit to evaluate the quality of sentence embeddings. Sentence embeddings are used as features for a logistic regression classifier. The logistic regression classifier is trained on various tasks in a 10-fold cross-validation setup and the prediction accuracy is computed for the test-fold.
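
The cross-validation protocol can be sketched without any dependency as below. This is not the SentEval API: `k_fold_indices`, `cross_validated_accuracy` and `fit_predict` are hypothetical names, and a majority-class baseline stands in for the logistic regression classifier purely to keep the example self-contained.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validated_accuracy(features, labels, fit_predict, k=10):
    """Average the test-fold accuracy of `fit_predict` over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(features), k):
        if not test_idx:  # can only happen when there are fewer samples than folds
            continue
        predictions = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        hits = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)


def majority_class(train_x, train_y, test_x):
    """Stand-in classifier (NOT logistic regression): predict the most frequent label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_x)
```

In practice the samples would be shuffled before splitting, and `fit_predict` would train a logistic regression on the sentence embeddings of the training folds.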

Usually the sentence embeddings are compared with other sentence embedding methods on the following seven *SentEval transfer tasks*:

- **MR**: Sentiment prediction for movie review snippets on a five-star scale [(Pang and Lee, 2005)](http://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf).
- **CR**: Sentiment prediction of customer product reviews [(Hu and Liu, 2004)](https://www.cs.uic.edu/~liub/publications/kdd04-revSummary.pdf).
- **SUBJ**: Subjectivity prediction of sentences from movie reviews and plot summaries [(Pang and Lee, 2004)](https://www.cs.cornell.edu/home/llee/papers/cutsent.pdf).
- **MPQA**: Phrase level opinion polarity classification from newswire [(Wiebe et al., 2005)](https://www.aclweb.org/anthology/S15-2045.pdf).
- **SST**: Stanford Sentiment Treebank with binary labels [(Socher et al., 2013)](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf).
- **TREC**: Fine grained question-type classification from TREC [(Li and Roth, 2002)](https://www.aclweb.org/anthology/C02-1150.pdf).
- **MRPC**: Microsoft Research Paraphrase Corpus from parallel news sources [(Dolan et al., 2004)](https://www.aclweb.org/anthology/C04-1051.pdf).


## 3.3 Ablation study
---
Different **pooling strategies** are used:

- **MEAN**;
- **MAX**;
- **CLS**. 
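
The three strategies reduce a sequence of token embeddings to a single fixed-size vector. A minimal sketch follows, assuming the token embeddings come as a list of equally long vectors, that the `[CLS]` token sits at position 0 (the BERT convention), and that padding tokens have already been removed.

```python
def mean_pool(token_embeddings):
    """MEAN: average every dimension over all tokens."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def max_pool(token_embeddings):
    """MAX: take the maximum of every dimension over all tokens."""
    return [max(column) for column in zip(*token_embeddings)]


def cls_pool(token_embeddings):
    """CLS: use the embedding of the first ([CLS]) token."""
    return list(token_embeddings[0])


# Toy sequence of three 2-dimensional token embeddings.
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
```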

For the classification objective function, different **concatenation methods** can be evaluated. 
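
As a sketch of what such concatenation methods look like, the function below combines two sentence embeddings `u` and `v` into the feature vector that is passed to the classifier. The set of components (`u`, `v`, `|u-v|`, `u*v`) follows the variants compared in the Sentence-BERT paper; the function name and the string-based `mode` argument are assumptions for illustration.

```python
def concat_features(u, v, mode="u,v,|u-v|"):
    """Concatenate the components named in `mode` (comma-separated)."""
    parts = {
        "u": list(u),
        "v": list(v),
        "|u-v|": [abs(a - b) for a, b in zip(u, v)],  # element-wise absolute difference
        "u*v": [a * b for a, b in zip(u, v)],         # element-wise product
    }
    features = []
    for name in mode.split(","):
        features.extend(parts[name])
    return features


# Toy 2-dimensional embeddings.
features = concat_features([1.0, 2.0], [3.0, 1.0])
```

An ablation then simply trains one classifier per `mode` string and compares the resulting scores.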

For each possible configuration, we can train the model with **10 different random seeds** and average the **performances**.

The **objective function** (classification vs. regression) depends on the annotated dataset:

- for the **classification objective** function, we can train the model on the **SNLI** and **Multi-NLI** datasets;
- for the **regression objective** function, we can train on the **training set of the STS benchmark dataset**. 

Performances are measured on the **development split** of the **STS benchmark** dataset.

# 4. Metrics
---
---


## 4.1 F1 scores

### 4.1.1 F1@k

### 4.1.2 F1@O

### 4.1.3 F1@M
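
The three variants differ only in how many predictions are scored. The sketch below uses the definitions commonly found in the keyphrase generation literature, stated here as assumptions: F1@k scores the top `k` predictions, F1@O sets `k` to the number of gold keyphrases, and F1@M scores all predictions. It also assumes exact string matching on already normalized (e.g. stemmed and deduplicated) keyphrases.

```python
def f1_at_k(predicted, gold, k=None):
    """F1 between the top-k predictions and the gold keyphrases (k=None keeps all).

    `predicted` is assumed to be ranked and free of duplicates.
    """
    top = predicted if k is None else predicted[:k]
    gold_set = set(gold)
    if not top or not gold_set:
        return 0.0
    matches = len(set(top) & gold_set)
    if matches == 0:
        return 0.0
    precision = matches / len(top)
    recall = matches / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def f1_at_o(predicted, gold):
    """F1@O: cut the predictions at the number of gold keyphrases."""
    return f1_at_k(predicted, gold, k=len(set(gold)))


def f1_at_m(predicted, gold):
    """F1@M: score every prediction the model produced."""
    return f1_at_k(predicted, gold, k=None)


# Toy example with placeholder keyphrases.
predicted = ["a", "b", "c", "d"]
gold = ["a", "c", "e"]
```

Implementations differ in details such as padding when a model returns fewer than `k` predictions, so scores from different papers are not always directly comparable.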