Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fixed STS reproducibility issues. Refactored. Classifier parameter setting option.
Alexis Conneau committed Dec 26, 2017
1 parent e97d861, commit 79b19e2
Showing 21 changed files with 394 additions and 970 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,38 +1,59 @@ | ||
# SentEval | ||
# SentEval - evaluation tool for sentence embeddings | ||
|
||
SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks (more details [here](https://arxiv.org/abs/1705.02364)). Our goal is to ease the study and the development of general-purpose fixed-size sentence representations. | ||
|
||
** | ||
SentEval recent fixes (12/26): | ||
* renamed main directory: new way to import SentEval (see below) | ||
* fixed reproducibility issue for STS tasks (preprocessing and lower-casing fixed in data download file) | ||
* added option to set parameters of the classifier (nhid, optim, lr, batch size, ...) | ||
* added example of classifier setting to speed up training (x5) in prototyping phase | ||
** | ||
|
||
## Dependencies | ||
|
||
This code is written in python. The dependencies are: | ||
|
||
* Python 2.7 (with recent versions of [NumPy](http://www.numpy.org/)/[SciPy](http://www.scipy.org/)) | ||
* [Pytorch](http://pytorch.org/) >= 0.2 (**recent new pytorch version**) | ||
* Python 2.7 (with [NumPy](http://www.numpy.org/)/[SciPy](http://www.scipy.org/)) | ||
* [Pytorch](http://pytorch.org/) >= 0.2 | ||
* [scikit-learn](http://scikit-learn.org/stable/index.html)>=0.18.0 | ||
|
||
## Tasks | ||
## Transfer tasks | ||
|
||
SentEval allows you to evaluate your sentence embeddings as features for the following tasks: | ||
* Binary classification: [MR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) (movie review), [CR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) (product review), [SUBJ](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) (subjectivity status), [MPQA](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) (opinion-polarity), [SST](https://nlp.stanford.edu/sentiment/index.html) (Stanford sentiment analysis) | ||
* Multi-class classification: [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) (question-type classification), [SST](http://www.aclweb.org/anthology/P13-1045) (fine-grained Stanford sentiment analysis) | ||
* Entailment (NLI): [SNLI](https://nlp.stanford.edu/projects/snli/) (caption-based NLI), [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) (Multi-genre NLI), [SICK](http://clic.cimec.unitn.it/composes/sick.html) (Sentences Involving Compositional Knowledge, entailment) | ||
* Semantic Textual Similarity: [STS12](https://www.cs.york.ac.uk/semeval-2012/task6/), [STS13](http://ixa2.si.ehu.es/sts/) (-SMT), [STS14](http://alt.qcri.org/semeval2014/task10/), [STS15](http://alt.qcri.org/semeval2015/task2/), [STS16](http://alt.qcri.org/semeval2016/task1/) | ||
* Semantic Relatedness: [STSBenchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results), [SICK](http://clic.cimec.unitn.it/composes/sick.html) | ||
* Paraphrase detection: [MRPC](https://aclweb.org/aclwiki/Paraphrase_Identification_(State_of_the_art)) (Microsoft Research Paraphrase Corpus) | ||
* Caption-Image retrieval: [COCO](http://mscoco.org/) dataset (with ResNet-101 2048d image embeddings) | ||
|
||
[more details on the tasks](https://arxiv.org/pdf/1705.02364.pdf) | ||
| Task | Type | n train | n test | needs_training | set_classifier | | ||
|---------- |------------------------------ |--------- |-------- |---------------- |---------------- | | ||
| [MR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | movie review | 11k | 11k | yes | 1 | | ||
| [CR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | product review | 4k | 4k | yes | 1 | | ||
| [SUBJ](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | subjectivity status | 10k | 10k | yes | 1 | | ||
| [MPQA](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | opinion-polarity | 11k | 11k | yes | 1 | | ||
| [SST](https://nlp.stanford.edu/sentiment/index.html) | (binary) sentiment analysis | 67k | 1.8k | yes | 1 | | ||
| [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) | question-type classification | 6k | 500 | yes | 1 | | ||
| [SICK-E](http://clic.cimec.unitn.it/composes/sick.html) | recognizing textual entailment | 4.5k | 4.9k | yes | 1 | | ||
| [SNLI](https://nlp.stanford.edu/projects/snli/) | natural language inference | 550k | 9.8k | yes | 1 | | ||
| [STS 2012](https://www.cs.york.ac.uk/semeval-2012/task6/) | semantic textual similarity | N/A | 3.1k | no | 0 | | ||
| [STS 2013](http://ixa2.si.ehu.es/sts/) | semantic textual similarity | N/A | 1.5k | no | 0 | | ||
| [STS 2014](http://alt.qcri.org/semeval2014/task10/) | semantic textual similarity | N/A | 3.7k | no | 0 | | ||
| [STS 2015](http://alt.qcri.org/semeval2015/task2/) | semantic textual similarity | N/A | 8.5k | no | 0 | | ||
| [STS 2016](http://alt.qcri.org/semeval2016/task1/) | semantic textual similarity | N/A | 9.2k | no | 0 | | ||
| [STS B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | semantic textual similarity | 5.7k | 1.4k | yes | 0 | | ||
| [SICK-R](http://clic.cimec.unitn.it/composes/sick.html) | semantic relatedness | 4.5k | 4.9k | yes | 0 | | ||
| [MRPC](https://aclweb.org/aclwiki/Paraphrase_Identification_(State_of_the_art)) | paraphrase detection | 4.1k | 1.7k | yes | 1 | | ||
| [COCO](http://mscoco.org/) | image-caption retrieval | 567k | 5*1k | yes | 0 | | ||
|
||
Note: COCO comes with ResNet-101 2048d image embeddings. [More details on the tasks.](https://arxiv.org/pdf/1705.02364.pdf) | ||
|
||
## Download datasets | ||
To get all the transfer tasks datasets, run (in data/): | ||
```bash | ||
./get_transfer_data_ptb.bash | ||
``` | ||
This will automatically download and preprocess the datasets, and put them in data/senteval_data (warning: for MacOS users, you may have to use p7zip instead of unzip). | ||
This will automatically download and preprocess the datasets, and store them in data/senteval_data (warning: for MacOS users, you may have to use p7zip instead of unzip). Note: we provide PTB or MOSES tokenization. | ||
|
||
WARNING: Extracting the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) MSI file requires the "[cabextract](https://www.cabextract.org.uk/#install)" command-line tool (e.g. *apt-get/yum install cabextract*). | ||
|
||
## Example (average word2vec) : examples/bow.py | ||
## How to use SentEval: examples | ||
|
||
### examples/bow.py | ||
|
||
|
@@ -58,24 +79,24 @@ curl -Lo examples/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infe | |
curl -Lo examples/infersent.snli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.snli.pickle | ||
``` | ||
|
||
## How SentEval works | ||
## How to use SentEval | ||
|
||
To evaluate your own sentence embedding method, you will need to implement two functions: | ||
To evaluate your sentence embeddings, SentEval requires that you implement two functions: | ||
|
||
1. **prepare** (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc) | ||
2. **batcher** (transforms a batch of text sentences into sentence embeddings) | ||
|
||
|
||
### 1.) prepare(params, samples) (optional) | ||
|
||
batcher only sees one batch at a time while the *samples* argument of *prepare* contains all the sentences of a task. | ||
*batcher* only sees one batch at a time while the *samples* argument of *prepare* contains all the sentences of a task. | ||
|
||
``` | ||
prepare(params, samples) | ||
``` | ||
* *batch*: numpy array of text sentences | ||
* *params*: senteval parameters (note that "prepare" outputs are stored in params). | ||
* *output*: None. Any "output" computed in this function is stored in "params" and can be further used by *batcher*. | ||
* *params*: senteval parameters. | ||
* *samples*: list of all sentences from the transfer task. | ||
* *output*: No output. Arguments stored in "params" can further be used by *batcher*. | ||
|
||
*Example*: in bow.py, prepare is used to build the vocabulary of words and construct the *params.word_vec* dictionary of word vectors. | ||
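
As a rough illustration of what such a *prepare* could look like, here is a minimal sketch. It is not the code from bow.py: the toy `TOY_VECTORS` table stands in for a real word-vector file, `SimpleNamespace` stands in for the SentEval params object, the `word2id` attribute is a name chosen for this sketch, and each sample is assumed to be a list of tokens.

```python
from types import SimpleNamespace

# Hypothetical toy embedding table standing in for a real word-vector file.
TOY_VECTORS = {
    "the": [0.1, 0.2],
    "cat": [0.3, 0.4],
    "sat": [0.5, 0.6],
}


def prepare(params, samples):
    """Build the vocabulary and keep only the word vectors the task needs."""
    # Assumption: each sample is a list of tokens.
    vocab = sorted({word for sentence in samples for word in sentence})
    params.word2id = {word: idx for idx, word in enumerate(vocab)}
    # Words without a pretrained vector are simply left out of word_vec.
    params.word_vec = {w: TOY_VECTORS[w] for w in vocab if w in TOY_VECTORS}
    # No return value: everything batcher needs is stored in params.


params = SimpleNamespace()
prepare(params, [["the", "cat"], ["the", "dog", "sat"]])
print(sorted(params.word_vec))  # ['cat', 'sat', 'the']
```

Because *prepare* sees every sentence of the task, the vocabulary only has to be built once, and *batcher* can then do plain dictionary lookups.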
|
||
|
@@ -84,25 +105,34 @@ prepare(params, samples) | |
``` | ||
batcher(params, batch) | ||
``` | ||
* *params*: senteval parameters. | ||
* *batch*: numpy array of text sentences (of size params.batch_size) | ||
* *params*: senteval parameters (note that "prepare" outputs are stored in params). | ||
* *output*: numpy array of sentence embeddings (of size params.batch_size) | ||
|
||
*Example*: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences. | ||
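
The averaging described above can be sketched as follows. This is an illustrative version rather than the bow.py source: it assumes each sentence is a list of tokens, that `params.word_vec` maps words to NumPy vectors, and that a hypothetical `params.wvec_dim` attribute holds the embedding size. Falling back to a zero vector for sentences with no known word is a choice made for this sketch.

```python
import numpy as np
from types import SimpleNamespace


def batcher(params, batch):
    """Encode each sentence as the mean of its known word vectors."""
    embeddings = []
    for sentence in batch:
        # Assumption: each sentence is a list of tokens.
        vectors = [params.word_vec[w] for w in sentence if w in params.word_vec]
        if not vectors:
            # No known word: fall back to a zero vector (sketch-specific choice).
            vectors = [np.zeros(params.wvec_dim)]
        embeddings.append(np.mean(vectors, axis=0))
    # One row per sentence, in the same order as the input batch.
    return np.vstack(embeddings)


# Stand-in for the SentEval params object after prepare has run.
params = SimpleNamespace(
    wvec_dim=2,
    word_vec={"the": np.array([0.0, 2.0]), "cat": np.array([2.0, 4.0])},
)
embeddings = batcher(params, [["the", "cat"], ["unknown"]])
print(embeddings.shape)  # (2, 2)
```

To evaluate a different encoder, only the body of the loop needs to change, as long as the function still returns one embedding row per input sentence.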
|
||
### 3.) set parameters and define your classifier | ||
|
||
|
||
### 3.) evaluation on transfer tasks | ||
### 4.) evaluation on transfer tasks | ||
|
||
After having implemented the batcher and prepare functions for your own sentence encoder, | ||
|
||
1) to perform the actual evaluation, first import senteval and define a SentEval object: | ||
1) to perform the actual evaluation, first import senteval and set parameters: | ||
```python | ||
import senteval | ||
se = senteval.SentEval(params, batcher, prepare) | ||
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10} | ||
params['classifier'] = {'nhid': 0, 'optim': 'adam', | ||
'tenacity': 5, 'epoch_size': 4, | ||
'max_epoch': 200, 'dropout': 0.} | ||
``` | ||
|
||
2) Create an instance of the class SE: | ||
```python | ||
se = senteval.engine.SE(params, batcher, prepare) | ||
``` | ||
(to import senteval, you can either add senteval path to your pythonpath, use sys.path.insert or "*pip install git+https://github.com/facebookresearch/SentEval*") | ||
|
||
2) define the set of transfer tasks on which you want SentEval to perform evaluation and run the evaluation: | ||
3) define the set of transfer tasks and run the evaluation: | ||
```python | ||
transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark'] | ||
results = se.eval(transfer_tasks) | ||
|
@@ -113,23 +143,46 @@ The current list of available tasks is: | |
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval', | ||
'STS12', 'STS13', 'STS14', 'STS15', 'STS16'] | ||
``` | ||
Note that the tasks of image-caption retrieval, SICKRelatedness, STSBenchmark and SNLI require pytorch and the use of a GPU. For the other tasks, | ||
setting *usepytorch* to False will make them run on the CPU (with sklearn), which can be faster for small embeddings but slower for large embeddings. | ||
## SentEval parameters | ||
|
||
## SentEval parameters (fast version) | ||
SentEval has several parameters (only task_path is required): | ||
* **task_path** (str): path to data, generated by data/get_transfer_data.py | ||
* **seed** (int): random seed for reproducibility (default: 1111) | ||
* **usepytorch** (bool): use pytorch or scikit learn (when possible) for logistic regression (default: True). Note that sklearn is quite fast for small dimensions. Use pytorch for SNLI. | ||
* **classifier** (str): if usepytorch, choose between 'LogReg' and 'MLP' (tanh) (default: 'LogReg') | ||
* **nhid** (int): if usepytorch and classifier=='MLP' choose nb hidden units (default: 0) | ||
* **batch_size** (int): size of minibatch of text sentences provided to "batcher" (sentences are sorted by length). Note that this is not the batch_size used by pytorch logistic regression, which is fixed. | ||
* **kfold** (int): k in the kfold-validation. Set to 10 to be comparable to published results (default: 5) | ||
|
||
* ... and any parameter you want to have access to in "batcher" or "prepare" functions. | ||
Global parameters of SentEval: | ||
```bash | ||
# senteval parameters | ||
task_path # path to SentEval datasets | ||
seed # seed | ||
usepytorch # use cuda-pytorch (else scikit-learn) where possible | ||
kfold # k-fold validation for MR/CR/SUBJ/MPQA. | ||
``` | ||
|
||
Parameters of the classifier (by default, the nonlinearity is Tanh for MLP): | ||
```bash | ||
nhid: # number of hidden units (0: Logistic Regression) | ||
optim: # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..) | ||
tenacity: # how many times dev acc does not increase before stopping | ||
epoch_size: # each epoch corresponds to epoch_size passes over the train set | ||
max_epoch: # max number of epochs | ||
dropout: # dropout for MLP | ||
``` | ||
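
To make the role of *tenacity* and *max_epoch* more concrete, here is a generic patience-style stopping rule. It is only a sketch of one common reading of the description above (stop after `tenacity` consecutive evaluations without improvement on the dev set), not SentEval's actual training loop; the function name and the accuracy values are hypothetical.

```python
def train_with_tenacity(dev_accuracies, tenacity, max_epoch):
    """Return (best_accuracy, epochs_run) under a patience-style stopping rule."""
    best_acc = float("-inf")
    stalled = 0  # consecutive evaluations without improvement
    epochs_run = 0
    for acc in dev_accuracies[:max_epoch]:
        epochs_run += 1
        if acc > best_acc:
            best_acc = acc
            stalled = 0
        else:
            stalled += 1
            if stalled >= tenacity:
                break
    return best_acc, epochs_run


# Hypothetical dev accuracies, one per evaluation.
history = [70.0, 72.5, 72.0, 72.4, 72.1, 75.0]
print(train_with_tenacity(history, tenacity=3, max_epoch=200))  # (72.5, 5)
```

In this toy run the later improvement to 75.0 is never reached, which shows the trade-off: a lower tenacity saves computation but can stop before the best score.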
|
||
Note that to get a proxy of the results while **dramatically reducing computation time**, | ||
we suggest the **prototyping config**: | ||
```python | ||
params_senteval['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128, | ||
'tenacity': 3, 'epoch_size': 2} | ||
``` | ||
which will result in a 5x speedup for classification tasks. | ||
|
||
To produce results that are **comparable to the literature**, use the **default config**: | ||
```python | ||
params_senteval['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64, | ||
'tenacity': 5, 'epoch_size': 4} | ||
``` | ||
which takes longer but will produce better and comparable results. | ||
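
One way to switch between the two configs without editing dictionaries by hand is a small helper like the one below. `make_params` is a hypothetical convenience function, not part of SentEval; the classifier values are copied from the two configs in this README, and `data/senteval_data` is the download location mentioned earlier.

```python
import copy

# Classifier settings copied from the two configs in this README.
PROTOTYPING_CLASSIFIER = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                          'tenacity': 3, 'epoch_size': 2}
DEFAULT_CLASSIFIER = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                      'tenacity': 5, 'epoch_size': 4}


def make_params(task_path, prototyping=False, **overrides):
    """Hypothetical helper: build a params dict for one of the two configs."""
    base = PROTOTYPING_CLASSIFIER if prototyping else DEFAULT_CLASSIFIER
    classifier = copy.deepcopy(base)
    classifier.update(overrides)  # e.g. nhid=512 to use an MLP instead of LogReg
    return {'task_path': task_path, 'usepytorch': True, 'classifier': classifier}


params_senteval = make_params('data/senteval_data', prototyping=True)
print(params_senteval['classifier']['optim'])  # rmsprop
```

Other global parameters such as `seed` or `kfold` can be added to the returned dictionary before it is passed to SentEval.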
|
||
## References | ||
|
||
Please cite [1](https://arxiv.org/abs/1705.02364) [2](https://arxiv.org/abs/1707.06320) if using this code for evaluating sentence embedding methods. | ||
Please consider citing [[1]](https://arxiv.org/abs/1705.02364) [[2]](https://arxiv.org/abs/1707.06320) if using this code for evaluating sentence embedding methods. | ||
|
||
### Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | ||
|
||
|
@@ -145,7 +198,7 @@ Please cite [1](https://arxiv.org/abs/1705.02364) [2](https://arxiv.org/abs/1707 | |
``` | ||
|
||
### Learning Visually Grounded Sentence Representations | ||
|
||
[2] D. Kiela, A. Conneau, A. Jabri, M. Nickel, [*Learning Visually Grounded Sentence Representations*](https://arxiv.org/abs/1707.06320) | ||
|
||
``` | ||
|
Here it is said that kfold = 10 should be used to be comparable to published results. The current README has no such information; it is written: ... and the default config is kfold = 5.
What is the right value for kfold to compare to the literature?