
Migrate from GitHub castorini/data to UWaterloo Castor-data #103

Merged: 6 commits, May 23, 2018 (changes shown from 5 commits)

55 changes: 33 additions & 22 deletions README.md
@@ -1,40 +1,51 @@
# Castor

PyTorch deep learning models.
Deep learning for information retrieval with PyTorch.

1. [SM model](./sm_cnn/): Similarity between question and candidate answers.
## Models

### Baselines

## Setting up PyTorch

You need Python 3.6 to use the models in this repository.

As per [pytorch.org](pytorch.org),
> "[Anaconda](https://www.continuum.io/downloads) is our recommended package manager"
1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers

```conda install pytorch torchvision -c soumith```
### Deep Learning Models

Other pytorch installation modalities (e.g. via ```pip```) can be seen at [pytorch.org](pytorch.org).
1. [SM-CNN](./sm_cnn/): Ranking short text pairs with Convolutional Neural Networks
2. [Kim CNN](./kim_cnn/): Sentence classification using Convolutional Neural Networks
3. [MP-CNN](./mp_cnn/): Sentence pair modelling with Multi-Perspective Convolutional Neural Networks
4. [NCE](./nce/): Noise-Contrastive Estimation for answer selection applied on SM-CNN and MP-CNN
5. [conv-RNN](./conv_rnn): Convolutional RNN for text modelling

We also recommend [gensim](https://radimrehurek.com/gensim/). We use some gensim modules to cache word embeddings.

```conda install gensim```
## Setting up PyTorch

Copy and run the command at https://pytorch.org/ for your environment. PyTorch recommends the Anaconda environment, which we use in our lab.

PyTorch has good support for GPU computations.
The CUDA installation guide for Linux can be found [here](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
The typical installation command is

**NOTE**: Install CUDA libraries **before** installing conda and pytorch.
```bash
conda install pytorch torchvision -c pytorch
```
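
A quick sanity check after installation (optional; the exact output will vary with your setup):

```bash
# Prints the installed PyTorch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```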

## Data and Pre-Trained Models

## data for models
Data for use with this repository can be found at: https://git.uwaterloo.ca/jimmylin/Castor-data.git.

Sourcing and pre-processing of input data for each model is described in respective ```model/README.md```'s
Pre-trained models can be found at: https://github.com/castorini/models.git.

## Baselines
Your directory structure should look like
```
.
├── Castor
├── Castor-data
└── models
```

> **Member:** Why don't we just rename it to Castor-models while we're at it. I'll create this repo on UWaterloo git also.

1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers.
For example (if you use HTTPS instead of SSH):

## Tutorials
```bash
git clone https://github.com/castorini/Castor.git
git clone https://git.uwaterloo.ca/jimmylin/Castor-data.git
git clone https://github.com/castorini/models.git
```
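
A minimal check of the layout (assumes all three repositories were cloned into the same parent directory):

```bash
# Should list the three sibling directories without errors
ls -d Castor Castor-data models
```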

SM Model tutorial: [sm_cnn/tutorial.ipynb](sm_cnn/tutorial.ipynb) - a notebook that walks through the SM CNN model, good for beginners.
Sourcing and pre-processing of input data for each model is described in the respective ```model/README.md```'s.
18 changes: 3 additions & 15 deletions anserini_dependency/README.md
@@ -1,19 +1,16 @@
## Setting up RetrieveSentences and the end-to-end QA pipeline

#### 1. Clone [Anserini](https://github.com/castorini/Anserini.git), [Castor](https://github.com/castorini/Castor.git), [data](https://github.com/castorini/data.git), and [models](https://github.com/castorini/models.git):
#### 1. Assuming you've already followed the main [README](../README.md) instructions, just clone [Anserini](https://github.com/castorini/Anserini.git):
```bash
git clone https://github.com/castorini/Anserini.git
git clone https://github.com/castorini/Castor.git
git clone https://github.com/castorini/data.git
git clone https://github.com/castorini/models.git
```

Your directory structure should look like
```
.
├── Anserini
├── Castor
├── data
├── Castor-data
└── models
```

@@ -34,22 +31,13 @@ Install the dependency packages:

```
cd Castor
pip3 install -r requirements.txt
pip install -r requirements.txt
```

> **Member:** Do all our models support both py2 and py3?
>
> **Member Author:** No, it only supports Python 3. But if they followed the setup correctly, pip should be the correct one by default; if they explicitly use pip3 and pip doesn't work, something is wrong with their environment set-up.

Make sure that you have PyTorch installed. For more help, follow [these](https://github.com/castorini/Castor) steps.

#### 3. Download Dependencies
- Download the TrecQA Lucene index
- Download the Google word2vec file from [here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNWJkWExmaklYNTA?usp=sharing)

#### 4. Additional files for pipeline:
As some of the files are too large to be uploaded onto GitHub, please download the following files from
[here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNm1LdjlwUFdzQVE?usp=sharing) and place them
in the appropriate locations (see the copy sketch after this list):

- copy the contents of `word2vec` directory to `data/word2vec`
- copy `word2dfs.p` to `data/TrecQA/`
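
For example, a minimal copy sketch; the `~/Downloads` source location is an assumption, so adjust paths to match where you saved the files:

```bash
# Hypothetical source paths; run from the parent directory of the data repo
mkdir -p data/word2vec
cp -r ~/Downloads/word2vec/* data/word2vec/
cp ~/Downloads/word2dfs.p data/TrecQA/
```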

### To run RetrieveSentences:

```bash
4 changes: 2 additions & 2 deletions anserini_dependency/api.py
@@ -51,7 +51,7 @@ def get_answers(question, num_hits, k):
parser.add_argument("--scorer", help="passage scores", default="Idf")
parser.add_argument("--k", help="top-k passages to be retrieved", default=1)
parser.add_argument('--model', help="the path to the saved model file")
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')
parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
parser.add_argument('--seed', type=int, default=3435)
@@ -109,7 +109,7 @@ def get_answers(question, num_hits, k):
parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
parser.add_argument('--seed', type=int, default=3435)
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')

> **Member:** two argument parsers in a file?
>
> **Member Author:** Didn't understand this comment :(
>
> **Member:** I mean there are two parser objects in this file api.py. Is that what you want?
>
> **Member Author:** I see, that parser was already there (unchanged except for the default path argument). This PR is only about the Castor-data issue. If that is a bug, I would prefer to fix it separately in another PR.

args = parser.parse_args()
if not args.cuda:
18 changes: 9 additions & 9 deletions idf_baseline/README.md
@@ -4,19 +4,19 @@ Implements IDF baselines for QA datasets.

### Getting the data

Git clone [castorini/data](https://github.com/castorini/data) to get TrecQA and WikiQA datasets.
This assumes you have followed the instructions in the main [README](../README.md) to clone Castor-data.

Follow instructions in ``TrecQA/README.txt`` and ``WikiQA/README.txt`` to process the data into a _standard_ format.

After running the respectve scripts, you should have the following directories structure in ``castorini/data/TrecQA``
After running the respective scripts, you should have the following directory structure in ``castorini/Castor-data/TrecQA``
```
├── raw-dev
├── raw-test
├── train
└── train-all
```

and, the following directories in ``castorini/data/WikiQA``.
and the following directories in ``castorini/Castor-data/WikiQA``.
```
├── dev
├── test
@@ -138,25 +138,25 @@ eval/trec_eval.9.0/trec_eval -m map -m recip_rank <qrel-file> <run-file>

For the WikiQA dataset
```
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/WikiQA/WikiQACorpus/WikiQA-$set.ref WikiQA.$set.idfsim
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../Castor-data/WikiQA/WikiQACorpus/WikiQA-$set.ref WikiQA.$set.idfsim
```

For the TrecQA dataset
```
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/TrecQA/$set.qrel TrecQA.$set.idfsim
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../Castor-data/TrecQA/$set.qrel TrecQA.$set.idfsim
```

#### 3. IDF sum similarity using only the QA dataset to compute IDF of terms

```
python qa-data-idf-only.py ../../data/TrecQA TrecQA
python qa-data-only-idf.py ../../data/WikiQA WikiQA
python qa-data-idf-only.py ../../Castor-data/TrecQA TrecQA
python qa-data-only-idf.py ../../Castor-data/WikiQA WikiQA
```
Evaluate these using step 2.

The same script can now also be used to comput idf sum similarity based on corpus idf statistics
The same script can now also be used to compute IDF sum similarity based on corpus IDF statistics:
```
python qa-data-only-idf.py ../../data/TrecQA TrecQA --index-for-corpusIDF ../../data/indices/index.qadata.pos.docvectors.keepstopwords/
python qa-data-only-idf.py ../../Castor-data/TrecQA TrecQA --index-for-corpusIDF ../../Castor-data/indices/index.qadata.pos.docvectors.keepstopwords/
```

### Baseline results
2 changes: 1 addition & 1 deletion idf_baseline/experimental_settings.py
@@ -122,7 +122,7 @@ def list_settings(self):
ap.add_argument("--runall", help="runs all experiments in order", action="store_true")
ap.add_argument("index_path", help="required for some combination of experiments")
ap.add_argument('qa_data', help="path to the QA dataset",
choices=['../../data/TrecQA', '../../data/WikiQA'])
choices=['../../Castor-data/TrecQA', '../../Castor-data/WikiQA'])

args = ap.parse_args()

2 changes: 1 addition & 1 deletion idf_baseline/qa-data-only-idf.py
@@ -125,7 +125,7 @@ def write_out_idf_sum_similarities(qids, questions, answers, term_idfs, outfile,
ap = argparse.ArgumentParser(description="uses idf weights from the question-answer pairs only,\
and not from the whole corpus")
ap.add_argument('qa_data', help="path to the QA dataset",
choices=['../../data/TrecQA', '../../data/WikiQA'])
choices=['../../Castor-data/TrecQA', '../../Castor-data/WikiQA'])
ap.add_argument('outfile_prefix', help="output file prefix")
ap.add_argument('--ignore-test', help="does not consider test data when computing IDF of terms",
action="store_true")
19 changes: 0 additions & 19 deletions kim_cnn/README.md
@@ -16,31 +16,12 @@ Assuming you already have PyTorch, just install torchtext (`pip install torchtex

## Quick Start

Clone and create the dataset.
```
git clone https://github.com/castorini/Castor.git
```

```
.
├── Castor
├── README.md
├── baseline_results.tsv
├── idf_baseline
├── kim_cnn
├── mp_cnn
├── setup.py
├── sm_cnn
└── sm_modified_cnn
```

To get the dataset, you can run this.
```
cd kim_cnn
bash get_data.sh
```


To run the model on the SST-1 dataset in multichannel mode, just run the following command.

```
16 changes: 1 addition & 15 deletions mp_cnn/README.md
@@ -4,21 +4,7 @@ This is a PyTorch implementation of the following paper

* Hua He, Kevin Gimpel, and Jimmy Lin. [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks](http://aclweb.org/anthology/D/D15/D15-1181.pdf). *Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015)*, pages 1576-1586.

The SICK and MSRVID datasets are available in https://github.com/castorini/data, as well as the GloVe word embeddings.

Directory layout should be like this:
```
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
```
Please ensure you have followed the instructions in the main [README](../README.md) before running any further commands in this doc.

## SICK Dataset

8 changes: 4 additions & 4 deletions mp_cnn/dataset.py
@@ -31,14 +31,14 @@ class MPCNNDatasetFactory(object):
@staticmethod
def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, device, castor_dir="../", utils_trecqa="utils/trec_eval-9.0.5/trec_eval"):
if dataset_name == 'sick':
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'sick/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'sick/')
train_loader, dev_loader, test_loader = SICK.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = SICK.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
embedding.weight = nn.Parameter(SICK.TEXT_FIELD.vocab.vectors)
return SICK, embedding, train_loader, test_loader, dev_loader
elif dataset_name == 'msrvid':
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'msrvid/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'msrvid/')
dev_loader = None
train_loader, test_loader = MSRVID.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = MSRVID.TEXT_FIELD.vocab.vectors.size()
@@ -48,7 +48,7 @@ def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, d
elif dataset_name == 'trecqa':
if not os.path.exists(os.path.join(castor_dir, utils_trecqa)):
raise FileNotFoundError('TrecQA requires the trec_eval tool to run. Please run get_trec_eval.sh inside Castor/utils (as working directory) before continuing.')
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'TrecQA/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'TrecQA/')
train_loader, dev_loader, test_loader = TRECQA.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = TRECQA.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
Expand All @@ -57,7 +57,7 @@ def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, d
elif dataset_name == 'wikiqa':
if not os.path.exists(os.path.join(castor_dir, utils_trecqa)):
raise FileNotFoundError('TrecQA requires the trec_eval tool to run. Please run get_trec_eval.sh inside Castor/utils (as working directory) before continuing.')
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'WikiQA/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'WikiQA/')
train_loader, dev_loader, test_loader = WikiQA.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = WikiQA.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
2 changes: 1 addition & 1 deletion mp_cnn/main.py
@@ -18,7 +18,7 @@
parser = argparse.ArgumentParser(description='PyTorch implementation of Multi-Perspective CNN')
parser.add_argument('model_outfile', help='file to save final model')
parser.add_argument('--dataset', help='dataset to use, one of [sick, msrvid, trecqa, wikiqa]', default='sick')
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, 'data', 'GloVe'))
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, 'Castor-data', 'embeddings', 'GloVe'))
parser.add_argument('--word-vectors-file', help='word vectors filename', default='glove.840B.300d.txt')
parser.add_argument('--skip-training', help='will load pre-trained model', action='store_true')
parser.add_argument('--device', type=int, default=0, help='GPU device, -1 for CPU (default: 0)')
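
With the new default paths above, a hypothetical invocation of `mp_cnn/main.py` could look like the following; the output model filename is made up, and all other flags fall back to the defaults shown:

```bash
# Run from inside mp_cnn/; trains on SICK and saves the model to mpcnn.sick.model
python main.py mpcnn.sick.model --dataset sick
```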
17 changes: 1 addition & 16 deletions nce/nce_pairwise_mp/README.md
@@ -5,22 +5,7 @@ This is a PyTorch implementation of the following paper
* Hua He, Kevin Gimpel, and Jimmy Lin. [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks](http://aclweb.org/anthology/D/D15/D15-1181.pdf). *Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015)*, pages 1576-1586.
* Jinfeng Rao, Hua He, and Jimmy Lin. [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks.](http://dl.acm.org/citation.cfm?id=2983872) *Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM 2016)*, pages 1913-1916.


The SICK and MSRVID datasets are available in https://github.com/castorini/data, as well as the GloVe word embeddings.

Directory layout should be like this:
```
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
```
Please ensure you have followed the instructions in the main [README](../README.md) before running any further commands in this doc.

## TrecQA Dataset

2 changes: 1 addition & 1 deletion nce/nce_pairwise_mp/main.py
Expand Up @@ -18,7 +18,7 @@
parser = argparse.ArgumentParser(description='PyTorch implementation of Multi-Perspective CNN')
parser.add_argument('model_outfile', help='file to save final model')
parser.add_argument('--dataset', help='dataset to use, one of [sick, msrvid, trecqa, wikiqa]', default='sick')
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'GloVe'))
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'GloVe'))
parser.add_argument('--word-vectors-file', help='word vectors filename', default='glove.840B.300d.txt')
parser.add_argument('--skip-training', help='will load pre-trained model', action='store_true')
parser.add_argument('--device', type=int, default=0, help='GPU device, -1 for CPU (default: 0)')
2 changes: 1 addition & 1 deletion nce/nce_pairwise_sm/args.py
@@ -20,7 +20,7 @@ def get_args():
parser.add_argument('--words_dim', type=int, default=50)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--epoch_decay', type=int, default=15)
parser.add_argument('--wordvec_dir', type=str, default='../../../data/word2vec/')
parser.add_argument('--wordvec_dir', type=str, default='../../../Castor-data/embeddings/word2vec/')

> **Member:** different from the path below, word2vec/embeddings?
>
> **Member Author:** Ditto, embeddings are moved to the embeddings subdirectory now.

parser.add_argument('--vector_cache', type=str, default='word2vec.trecqa.pt')
parser.add_argument('--trained_model', type=str, default="")
parser.add_argument('--weight_decay',type=float, default=1e-5)
4 changes: 2 additions & 2 deletions nce/nce_pairwise_sm/main.py
@@ -38,10 +38,10 @@

if args.dataset == "trec":
dataset_cls = TRECQA
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'TrecQA/')
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'TrecQA/')
elif args.dataset == "wiki":
dataset_cls = WikiQA
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'WikiQA/')
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'WikiQA/')
else:
logger.info("Unsupported dataset")
exit()
4 changes: 2 additions & 2 deletions nce/nce_pairwise_sm/overlap_features.py
@@ -102,10 +102,10 @@ def compute_dfs(docs):

if __name__ == '__main__':
parser = ArgumentParser(description='create TrecQA/WikiQA dataset')
parser.add_argument('--dir', help='path to the TrecQA|WikiQA data directory', default="../../data/TrecQA")
parser.add_argument('--dir', help='path to the TrecQA|WikiQA data directory', default="../../Castor-data/TrecQA")
args = parser.parse_args()

stoplist = set([line.strip() for line in open('../../data/TrecQA/stopwords.txt', encoding='utf-8')])
stoplist = set([line.strip() for line in open('../../Castor-data/TrecQA/stopwords.txt', encoding='utf-8')])
punct = set(string.punctuation)
stoplist.update(punct)
