Fixed STS reproducibility issues. Refactored. Classifier parameter setting option.
Alexis Conneau committed Dec 26, 2017
1 parent e97d861 commit 79b19e2
Showing 21 changed files with 394 additions and 970 deletions.
415 changes: 23 additions & 392 deletions LICENSE


131 changes: 92 additions & 39 deletions README.md
# SentEval - evaluation tool for sentence embeddings

SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks (more details [here](https://arxiv.org/abs/1705.02364)). Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.

**SentEval recent fixes (12/26):**
* renamed the main directory: new way to import SentEval (see below)
* fixed a reproducibility issue for the STS tasks (preprocessing and lower-casing fixed in the data download file)
* added an option to set the parameters of the classifier (nhid, optim, lr, batch size, ...)
* added an example of a classifier setting that speeds up training (5x) in the prototyping phase

## Dependencies

This code is written in Python. The dependencies are:

* Python 2.7 (with [NumPy](http://www.numpy.org/)/[SciPy](http://www.scipy.org/))
* [Pytorch](http://pytorch.org/) >= 0.2
* [scikit-learn](http://scikit-learn.org/stable/index.html) >= 0.18.0

## Transfer tasks

SentEval allows you to evaluate your sentence embeddings as features for the following tasks:
| Task | Type | n train | n test | needs_training | set_classifier |
|------|------|---------|--------|----------------|----------------|
| [MR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | movie review | 11k | 11k | yes | yes |
| [CR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | product review | 4k | 4k | yes | yes |
| [SUBJ](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | subjectivity status | 10k | 10k | yes | yes |
| [MPQA](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | opinion-polarity | 11k | 11k | yes | yes |
| [SST](https://nlp.stanford.edu/sentiment/index.html) | (binary) sentiment analysis | 67k | 1.8k | yes | yes |
| [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) | question-type classification | 6k | 500 | yes | yes |
| [SICK-E](http://clic.cimec.unitn.it/composes/sick.html) | recognizing textual entailment | 4.5k | 4.9k | yes | yes |
| [SNLI](https://nlp.stanford.edu/projects/snli/) | natural language inference | 550k | 9.8k | yes | yes |
| [STS 2012](https://www.cs.york.ac.uk/semeval-2012/task6/) | semantic textual similarity | N/A | 3.1k | no | no |
| [STS 2013](http://ixa2.si.ehu.es/sts/) | semantic textual similarity | N/A | 1.5k | no | no |
| [STS 2014](http://alt.qcri.org/semeval2014/task10/) | semantic textual similarity | N/A | 3.7k | no | no |
| [STS 2015](http://alt.qcri.org/semeval2015/task2/) | semantic textual similarity | N/A | 8.5k | no | no |
| [STS 2016](http://alt.qcri.org/semeval2016/task1/) | semantic textual similarity | N/A | 9.2k | no | no |
| [STS B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | semantic textual similarity | 5.7k | 1.4k | yes | no |
| [SICK-R](http://clic.cimec.unitn.it/composes/sick.html) | semantic relatedness | 4.5k | 4.9k | yes | no |
| [MRPC](https://aclweb.org/aclwiki/Paraphrase_Identification_(State_of_the_art)) | paraphrase detection | 4.1k | 1.7k | yes | yes |
| [COCO](http://mscoco.org/) | image-caption retrieval | 567k | 5x1k | yes | no |

Note: COCO comes with ResNet-101 2048d image embeddings. [More details on the tasks.](https://arxiv.org/pdf/1705.02364.pdf)

## Download datasets
To download all the transfer task datasets, run (in data/):
```bash
./get_transfer_data_ptb.bash
```
This will automatically download and preprocess the datasets and store them in data/senteval_data (warning: for macOS users, you may have to use p7zip instead of unzip). Note: we provide PTB or MOSES tokenization.

WARNING: Extracting the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) MSI file requires the "[cabextract](https://www.cabextract.org.uk/#install)" command line tool (e.g. *apt-get/yum install cabextract*).

## How to use SentEval: examples

### examples/bow.py

bow.py evaluates average word-vector (bag-of-words) sentence embeddings.

### examples/infersent.py

Download the pre-trained InferSent sentence encoders:

```bash
curl -Lo examples/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle
curl -Lo examples/infersent.snli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.snli.pickle
```

## How to use SentEval

To evaluate your sentence embeddings, SentEval requires that you implement two functions:

1. **prepare** (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors, etc.)
2. **batcher** (transforms a batch of text sentences into sentence embeddings)


### 1.) prepare(params, samples) (optional)

*batcher* only sees one batch at a time while the *samples* argument of *prepare* contains all the sentences of a task.

```
prepare(params, samples)
```
* *params*: senteval parameters.
* *samples*: list of all sentences from the transfer task.
* *output*: no output. Anything stored in *params* can further be used by *batcher*.

*Example*: in bow.py, *prepare* is used to build the vocabulary of words and construct the *params.word_vec* dictionary of word vectors.
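
As an illustration, here is a minimal *prepare* sketch in the spirit of bow.py (the random 300d vectors are a stand-in for real word2vec/GloVe vectors, which you would load yourself):

```python
import numpy as np

def prepare(params, samples):
    # samples contains every sentence of the task as a list of words
    vocab = {word for sentence in samples for word in sentence}
    # store a word -> vector dictionary in params for later use by batcher
    params.word_vec = {word: np.random.rand(300) for word in vocab}
    params.wvec_dim = 300
    return
```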

### 2.) batcher(params, batch)
```
batcher(params, batch)
```
* *params*: senteval parameters.
* *batch*: numpy array of text sentences (of size params.batch_size)
* *output*: numpy array of sentence embeddings (of size params.batch_size)

*Example*: in bow.py, *batcher* is used to compute the mean of the word vectors for each sentence in the batch using *params.word_vec*. Use your own encoder in that function to encode sentences.
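
A matching *batcher* sketch, averaging the word vectors built by the *prepare* sketch above:

```python
import numpy as np

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one embedding per sentence
    embeddings = []
    for sentence in batch:
        vectors = [params.word_vec[w] for w in sentence if w in params.word_vec]
        if not vectors:
            # fall back to a zero vector when no word of the sentence is known
            vectors = [np.zeros(params.wvec_dim)]
        embeddings.append(np.mean(vectors, axis=0))
    return np.vstack(embeddings)
```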

### 3.) set parameters and define your classifier

See the **SentEval parameters** section below for the global parameters and the classifier settings.


### 4.) evaluation on transfer tasks

After having implemented the *batcher* and *prepare* functions for your own sentence encoder:

1) to perform the actual evaluation, first import senteval and set parameters:
```python
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam',
                        'tenacity': 5, 'epoch_size': 4,
                        'max_epoch': 200, 'dropout': 0.}
```

2) Create an instance of the class SE:
```python
se = senteval.engine.SE(params, batcher, prepare)
```
(to import senteval, either add the SentEval path to your PYTHONPATH, use sys.path.insert, or run "*pip install git+https://github.com/facebookresearch/SentEval*")
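
For instance, with sys.path.insert (PATH_TO_SENTEVAL being a placeholder for the path of your local SentEval clone):

```python
import sys
sys.path.insert(0, PATH_TO_SENTEVAL)  # make the senteval package importable
import senteval
```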

3) define the set of transfer tasks and run the evaluation:
```python
transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)
```

The current list of available tasks is:
```python
['CR', 'MR', 'MPQA', 'SUBJ', 'SST', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16']
```
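
*results* is a dictionary indexed by task name; a simple way to inspect it (the exact metric keys depend on the task):

```python
for task in transfer_tasks:
    print(task, results[task])
```
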
Note that the tasks of image-caption retrieval, SICKRelatedness, STSBenchmark and SNLI require pytorch and the use of a GPU. For the other tasks,
setting *usepytorch* to False will make them run on the CPU (with sklearn), which can be faster for small embeddings but slower for large embeddings.
## SentEval parameters


**@theevann** commented on Feb 2, 2018:

> Here it says to set kfold = 10 to be comparable to published results.
> On the current README, there is no such information. It is written:
>
> > To produce results that are comparable to the literature, use the default config:
>
> ... and the default config is kfold = 5.
>
> What is the right value of kfold to compare to the literature?

**@aconneau** (Contributor) replied on Feb 2, 2018:

> Good catch, thank you! I didn't notice that this was ambiguous in the README.
> The right value of kfold to compare to the literature (see the InferSent paper) is 10. For prototyping, though, you could use kfold = 5 (for instance if you evaluate at each epoch) so that it is faster.
Global parameters of SentEval:
```bash
# senteval parameters
task_path # path to SentEval datasets
seed # random seed (for reproducibility)
usepytorch # use cuda-pytorch (else scikit-learn) where possible
kfold # k-fold validation for MR/CR/SUBJ/MPQA.
```
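
You can also add any parameter you want to have access to in the *batcher* or *prepare* functions. A sketch, where *my_tokenizer* is a made-up key for illustration:

```python
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10,
          'my_tokenizer': str.split}
# inside batcher/prepare, read it back as params.my_tokenizer
```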

Parameters of the classifier (by default the nonlinearity is Tanh for the MLP):
```bash
nhid: # number of hidden units (0: Logistic Regression)
optim: # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity: # number of times dev accuracy may fail to improve before early stopping
epoch_size: # each epoch corresponds to epoch_size passes on the train set
max_epoch: # max number of epochs
dropout: # dropout for MLP
```
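
For example, a one-hidden-layer MLP could be configured as follows (an illustrative setting, not a recommendation; the optim string follows the format above):

```python
params['classifier'] = {'nhid': 512, 'optim': 'sgd,lr=0.1', 'batch_size': 64,
                        'tenacity': 5, 'epoch_size': 4, 'max_epoch': 200,
                        'dropout': 0.1}
```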

Note that to get a proxy of the results while **dramatically reducing computation time**,
we suggest the **prototyping config**:
```python
params_senteval['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
'tenacity': 3, 'epoch_size': 2}
```
which results in a 5x speedup on the classification tasks.

To produce results that are **comparable to the literature**, use the **default config**:
```python
params_senteval['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
'tenacity': 5, 'epoch_size': 4}
```
which takes longer but produces better, comparable results.

## References

Please consider citing [[1]](https://arxiv.org/abs/1705.02364) [[2]](https://arxiv.org/abs/1707.06320) if using this code for evaluating sentence embedding methods.

### Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, [*Supervised Learning of Universal Sentence Representations from Natural Language Inference Data*](https://arxiv.org/abs/1705.02364)

```
@article{conneau2017supervised,
  title={Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Lo{\"\i}c and Bordes, Antoine},
  journal={arXiv preprint arXiv:1705.02364},
  year={2017}
}
```

### Learning Visually Grounded Sentence Representations

[2] D. Kiela, A. Conneau, A. Jabri, M. Nickel, [*Learning Visually Grounded Sentence Representations*](https://arxiv.org/abs/1707.06320)

```
@article{kiela2017learning,
  title={Learning Visually Grounded Sentence Representations},
  author={Kiela, Douwe and Conneau, Alexis and Jabri, Allan and Nickel, Maximilian},
  journal={arXiv preprint arXiv:1707.06320},
  year={2017}
}
```
