Fixed STS reproducibility issues. Refactored. Classifier parameter setting option.
Alexis Conneau committed Dec 26, 2017
1 parent e97d861 commit 79b19e2
Showing 21 changed files with 394 additions and 970 deletions.
415 changes: 23 additions & 392 deletions LICENSE


131 changes: 92 additions & 39 deletions README.md
# SentEval - evaluation tool for sentence embeddings

SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks (more details [here](https://arxiv.org/abs/1705.02364)). Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.

**SentEval recent fixes (12/26):**
* renamed the main directory: new way to import SentEval (see below)
* fixed a reproducibility issue for the STS tasks (preprocessing and lower-casing fixed in the data download file)
* added an option to set the parameters of the classifier (nhid, optim, lr, batch size, ...)
* added an example of a classifier setting that speeds up training (5x) in the prototyping phase

## Dependencies

This code is written in Python. The dependencies are:

* Python 2.7 (with [NumPy](http://www.numpy.org/)/[SciPy](http://www.scipy.org/))
* [Pytorch](http://pytorch.org/) >= 0.2
* [scikit-learn](http://scikit-learn.org/stable/index.html) >= 0.18.0

## Transfer tasks

SentEval allows you to evaluate your sentence embeddings as features for the following tasks:
| Task | Type | n train | n test | needs_training | set_classifier |
|------|------|---------|--------|----------------|----------------|
| [MR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | movie review | 11k | 11k | yes | yes |
| [CR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | product review | 4k | 4k | yes | yes |
| [SUBJ](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | subjectivity status | 10k | 10k | yes | yes |
| [MPQA](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | opinion-polarity | 11k | 11k | yes | yes |
| [SST](https://nlp.stanford.edu/sentiment/index.html) | (binary) sentiment analysis | 67k | 1.8k | yes | yes |
| [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) | question-type classification | 6k | 500 | yes | yes |
| [SICK-E](http://clic.cimec.unitn.it/composes/sick.html) | recognizing textual entailment | 4.5k | 4.9k | yes | yes |
| [SNLI](https://nlp.stanford.edu/projects/snli/) | natural language inference | 550k | 9.8k | yes | yes |
| [STS 2012](https://www.cs.york.ac.uk/semeval-2012/task6/) | semantic textual similarity | N/A | 3.1k | no | no |
| [STS 2013](http://ixa2.si.ehu.es/sts/) | semantic textual similarity | N/A | 1.5k | no | no |
| [STS 2014](http://alt.qcri.org/semeval2014/task10/) | semantic textual similarity | N/A | 3.7k | no | no |
| [STS 2015](http://alt.qcri.org/semeval2015/task2/) | semantic textual similarity | N/A | 8.5k | no | no |
| [STS 2016](http://alt.qcri.org/semeval2016/task1/) | semantic textual similarity | N/A | 9.2k | no | no |
| [STS B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | semantic textual similarity | 5.7k | 1.4k | yes | no |
| [SICK-R](http://clic.cimec.unitn.it/composes/sick.html) | semantic relatedness | 4.5k | 4.9k | yes | no |
| [MRPC](https://aclweb.org/aclwiki/Paraphrase_Identification_(State_of_the_art)) | paraphrase detection | 4.1k | 1.7k | yes | yes |
| [COCO](http://mscoco.org/) | image-caption retrieval | 567k | 5x1k | yes | no |

Note: COCO comes with ResNet-101 2048d image embeddings. [More details on the tasks.](https://arxiv.org/pdf/1705.02364.pdf)

## Download datasets
To download all the transfer task datasets, run (in data/):
```bash
./get_transfer_data_ptb.bash
```
This will automatically download and preprocess the datasets and store them in data/senteval_data (warning: for macOS users, you may have to use p7zip instead of unzip). Note: we provide PTB or MOSES tokenization.

WARNING: Extracting the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) MSI file requires the "[cabextract](https://www.cabextract.org.uk/#install)" command line tool (e.g. *apt-get/yum install cabextract*).

## How to use SentEval: examples

### examples/bow.py

bow.py evaluates average word-vector (bag-of-words) sentence embeddings.

### examples/infersent.py

Download the pre-trained InferSent sentence encoders:

```bash
curl -Lo examples/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle
curl -Lo examples/infersent.snli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.snli.pickle
```

## How to use SentEval

To evaluate your sentence embeddings, SentEval requires that you implement two functions:

1. **prepare** (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors, etc.)
2. **batcher** (transforms a batch of text sentences into sentence embeddings)


### 1.) prepare(params, samples) (optional)

*batcher* only sees one batch at a time while the *samples* argument of *prepare* contains all the sentences of a task.

```
prepare(params, samples)
```
* *params*: senteval parameters.
* *samples*: list of all sentences from the transfer task.
* *output*: no output. Anything stored in *params* can further be used by *batcher*.

*Example*: in bow.py, *prepare* is used to build the vocabulary of words and construct the *params.word_vec* dictionary of word vectors.
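
As an illustration, here is a minimal *prepare* sketch in the spirit of bow.py (the random 300d vectors are a stand-in for real word2vec/GloVe vectors, which you would load yourself):

```python
import numpy as np

def prepare(params, samples):
    # samples contains every sentence of the task as a list of words
    vocab = {word for sentence in samples for word in sentence}
    # store a word -> vector dictionary in params for later use by batcher
    params.word_vec = {word: np.random.rand(300) for word in vocab}
    params.wvec_dim = 300
    return
```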

### 2.) batcher(params, batch)
```
batcher(params, batch)
```
* *params*: senteval parameters.
* *batch*: numpy array of text sentences (of size params.batch_size)
* *output*: numpy array of sentence embeddings (of size params.batch_size)

*Example*: in bow.py, *batcher* is used to compute the mean of the word vectors for each sentence in the batch using *params.word_vec*. Use your own encoder in that function to encode sentences.
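
A matching *batcher* sketch, averaging the word vectors built by the *prepare* sketch above:

```python
import numpy as np

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one embedding per sentence
    embeddings = []
    for sentence in batch:
        vectors = [params.word_vec[w] for w in sentence if w in params.word_vec]
        if not vectors:
            # fall back to a zero vector when no word of the sentence is known
            vectors = [np.zeros(params.wvec_dim)]
        embeddings.append(np.mean(vectors, axis=0))
    return np.vstack(embeddings)
```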

### 3.) set parameters and define your classifier

See the **SentEval parameters** section below for the global parameters and the classifier settings.


### 4.) evaluation on transfer tasks

After having implemented the *batcher* and *prepare* functions for your own sentence encoder:

1) to perform the actual evaluation, first import senteval and set parameters:
```python
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam',
                        'tenacity': 5, 'epoch_size': 4,
                        'max_epoch': 200, 'dropout': 0.}
```

2) Create an instance of the class SE:
```python
se = senteval.engine.SE(params, batcher, prepare)
```
(to import senteval, either add the SentEval path to your PYTHONPATH, use sys.path.insert, or run "*pip install git+https://github.com/facebookresearch/SentEval*")
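
For instance, with sys.path.insert (PATH_TO_SENTEVAL being a placeholder for the path of your local SentEval clone):

```python
import sys
sys.path.insert(0, PATH_TO_SENTEVAL)  # make the senteval package importable
import senteval
```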

3) define the set of transfer tasks and run the evaluation:
```python
transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)
```

The current list of available tasks is:
```python
['CR', 'MR', 'MPQA', 'SUBJ', 'SST', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16']
```
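
*results* is a dictionary indexed by task name; a simple way to inspect it (the exact metric keys depend on the task):

```python
for task in transfer_tasks:
    print(task, results[task])
```
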
Note that the tasks of image-caption retrieval, SICKRelatedness, STSBenchmark and SNLI require pytorch and the use of a GPU. For the other tasks,
setting *usepytorch* to False will make them run on the CPU (with sklearn), which can be faster for small embeddings but slower for large embeddings.
## SentEval parameters


**@theevann** commented on Feb 2, 2018:

> Here it says to set kfold = 10 to be comparable to published results.
> On the current README, there is no such information. It is written:
>
> > To produce results that are comparable to the literature, use the default config:
>
> ... and the default config is kfold = 5.
>
> What is the right value of kfold to compare to the literature?

**@aconneau** (Contributor) replied on Feb 2, 2018:

> Good catch, thank you! I didn't notice that this was ambiguous in the README.
> The right value of kfold to compare to the literature (see the InferSent paper) is 10. For prototyping, though, you could use kfold = 5 (for instance if you evaluate at each epoch) so that it is faster.
Global parameters of SentEval:
```bash
# senteval parameters
task_path # path to SentEval datasets
seed # random seed (for reproducibility)
usepytorch # use cuda-pytorch (else scikit-learn) where possible
kfold # k-fold validation for MR/CR/SUBJ/MPQA.
```
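
You can also add any parameter you want to have access to in the *batcher* or *prepare* functions. A sketch, where *my_tokenizer* is a made-up key for illustration:

```python
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10,
          'my_tokenizer': str.split}
# inside batcher/prepare, read it back as params.my_tokenizer
```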

Parameters of the classifier (by default the nonlinearity is Tanh for the MLP):
```bash
nhid: # number of hidden units (0: Logistic Regression)
optim: # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity: # number of times dev accuracy may fail to improve before early stopping
epoch_size: # each epoch corresponds to epoch_size passes on the train set
max_epoch: # max number of epochs
dropout: # dropout for MLP
```
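
For example, a one-hidden-layer MLP could be configured as follows (an illustrative setting, not a recommendation; the optim string follows the format above):

```python
params['classifier'] = {'nhid': 512, 'optim': 'sgd,lr=0.1', 'batch_size': 64,
                        'tenacity': 5, 'epoch_size': 4, 'max_epoch': 200,
                        'dropout': 0.1}
```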

Note that to get a proxy of the results while **dramatically reducing computation time**,
we suggest the **prototyping config**:
```python
params_senteval['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
'tenacity': 3, 'epoch_size': 2}
```
which results in a 5x speedup on the classification tasks.

To produce results that are **comparable to the literature**, use the **default config**:
```python
params_senteval['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
'tenacity': 5, 'epoch_size': 4}
```
which takes longer but produces better, comparable results.

## References

Please consider citing [[1]](https://arxiv.org/abs/1705.02364) [[2]](https://arxiv.org/abs/1707.06320) if using this code for evaluating sentence embedding methods.

### Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, [*Supervised Learning of Universal Sentence Representations from Natural Language Inference Data*](https://arxiv.org/abs/1705.02364)

```
@article{conneau2017supervised,
  title={Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Lo{\"\i}c and Bordes, Antoine},
  journal={arXiv preprint arXiv:1705.02364},
  year={2017}
}
```

### Learning Visually Grounded Sentence Representations

[2] D. Kiela, A. Conneau, A. Jabri, M. Nickel, [*Learning Visually Grounded Sentence Representations*](https://arxiv.org/abs/1707.06320)

```
@article{kiela2017learning,
  title={Learning Visually Grounded Sentence Representations},
  author={Kiela, Douwe and Conneau, Alexis and Jabri, Allan and Nickel, Maximilian},
  journal={arXiv preprint arXiv:1707.06320},
  year={2017}
}
```
