
Migrate from GitHub castorini/data to uWaterloo Castor-data (#103)

* Refactor main README
* Update Anserini Dependency docs
* Update idf baseline and Kim CNN docs to use Castor-data
* Update remaining READMEs to reference Castor-data
* Change default path from data to Castor-data
* Fix wrong order of embeddings path
tuzhucheng authored and lintool committed May 23, 2018
1 parent ef21aa9 commit f7a0167b81b040c9764522e431ee8937155c2664
@@ -1,40 +1,51 @@
# Castor

PyTorch deep learning models.
Deep learning for information retrieval with PyTorch.

1. [SM model](./sm_cnn/): Similarity between question and candidate answers.
## Models

### Baselines

## Setting up PyTorch

You need Python 3.6 to use the models in this repository.

As per [pytorch.org](pytorch.org),
> "[Anaconda](https://www.continuum.io/downloads) is our recommended package manager"
1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers

```conda install pytorch torchvision -c soumith```
### Deep Learning Models

Other pytorch installation modalities (e.g. via ```pip```) can be seen at [pytorch.org](pytorch.org).
1. [SM-CNN](./sm_cnn/): Ranking short text pairs with Convolutional Neural Networks
2. [Kim CNN](./kim_cnn/): Sentence classification using Convolutional Neural Networks
3. [MP-CNN](./mp_cnn/): Sentence pair modelling with Multi-Perspective Convolutional Neural Networks
4. [NCE](./nce/): Noise-Contrastive Estimation for answer selection, applied to SM-CNN and MP-CNN
5. [conv-RNN](./conv_rnn): Convolutional RNN for text modelling

We also recommend [gensim](https://radimrehurek.com/gensim/). We use some gensim modules to cache word embeddings.
```conda install gensim```
## Setting up PyTorch

Copy and run the command at https://pytorch.org/ for your environment. PyTorch recommends the Anaconda environment, which we use in our lab.

PyTorch has good support for GPU computations.
CUDA installation guide for linux can be found [here](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
The typical installation command is

**NOTE**: Install CUDA libraries **before** installing conda and pytorch.
```bash
conda install pytorch torchvision -c pytorch
```
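
This check is not part of the original README, but it is a quick way to confirm the install can see the GPU (assuming a CUDA-capable machine):

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True only if the CUDA libraries and a visible GPU are set up
```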

## Data and Pre-Trained Models

## data for models
Data associated for use with this repository can be found at: https://git.uwaterloo.ca/jimmylin/Castor-data.git.

Sourcing and pre-processing of input data for each model is described in respective ```model/README.md```'s
Pre-trained models can be found at: https://github.com/castorini/models.git.

## Baselines
Your directory structure should look like
```
.
├── Castor
├── Castor-data
└── models
```

1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers.
For example (if you use HTTPS instead of SSH):

## Tutorials
```bash
git clone https://github.com/castorini/Castor.git
git clone https://git.uwaterloo.ca/jimmylin/Castor-data.git
git clone https://github.com/castorini/models.git
```

SM Model tutorial: [sm_cnn/tutorial.ipynb](sm_cnn/tutorial.ipynb) - a notebook that walks through the SM CNN model, good for beginners.
Sourcing and pre-processing of input data for each model is described in the respective ```model/README.md```'s.
@@ -1,19 +1,16 @@
## Setup Retrieve Sentences and end2end QA pipeline

#### 1. Clone [Anserini](https://github.com/castorini/Anserini.git), [Castor](https://github.com/castorini/Castor.git), [data](https://github.com/castorini/data.git), and [models](https://github.com/castorini/models.git):
#### 1. Assuming you've already followed the main [README](../README.md) instructions, just clone [Anserini](https://github.com/castorini/Anserini.git):
```bash
git clone https://github.com/castorini/Anserini.git
git clone https://github.com/castorini/Castor.git
git clone https://github.com/castorini/data.git
git clone https://github.com/castorini/models.git
```

Your directory structure should look like
```
.
├── Anserini
├── Castor
├── data
├── Castor-data
└── models
```

@@ -34,22 +31,13 @@ Install the dependency packages:

```
cd Castor
pip3 install -r requirements.txt
pip install -r requirements.txt
```
Make sure that you have PyTorch installed. For more help, follow [these](https://github.com/castorini/Castor) steps.

#### 3. Download Dependencies
- Download the TrecQA lucene index
- Download the Google word2vec file from [here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNWJkWExmaklYNTA?usp=sharing)

#### 4. Additional files for pipeline:
As some of the files are too large to be uploaded onto GitHub, please download the following files from
[here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNm1LdjlwUFdzQVE?usp=sharing) and place them
in the appropriate locations:

- copy the contents of `word2vec` directory to `data/word2vec`
- copy `word2dfs.p` to `data/TrecQA/`

### To run RetrieveSentences:

```bash
@@ -51,7 +51,7 @@ def get_answers(question, num_hits, k):
parser.add_argument("--scorer", help="passage scores", default="Idf")
parser.add_argument("--k", help="top-k passages to be retrieved", default=1)
parser.add_argument('--model', help="the path to the saved model file")
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')
parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
parser.add_argument('--seed', type=int, default=3435)
@@ -109,7 +109,7 @@ def get_answers(question, num_hits, k):
parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
parser.add_argument('--seed', type=int, default=3435)
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')

args = parser.parse_args()
if not args.cuda:
@@ -4,19 +4,19 @@ Implements IDF baselines for QA datasets.

### Getting the data

Git clone [castorini/data](https://github.com/castorini/data) to get TrecQA and WikiQA datasets.
We assume you have already followed the instructions in the main [README](../README.md) to clone Castor-data.

Follow instructions in ``TrecQA/README.txt`` and ``WikiQA/README.txt`` to process the data into a _standard_ format.

After running the respectve scripts, you should have the following directories structure in ``castorini/data/TrecQA``
After running the respective scripts, you should have the following directory structure in ``castorini/Castor-data/TrecQA``
```
├── raw-dev
├── raw-test
├── train
└── train-all
```

and, the following directories in ``castorini/data/WikiQA``.
and the following directories in ``castorini/Castor-data/WikiQA``:
```
├── dev
├── test
@@ -138,25 +138,25 @@ eval/trec_eval.9.0/trec_eval -m map -m recip_rank <qrel-file> <run-file>

For the WikiQA dataset
```
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/WikiQA/WikiQACorpus/WikiQA-$set.ref WikiQA.$set.idfsim
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../Castor-data/WikiQA/WikiQACorpus/WikiQA-$set.ref WikiQA.$set.idfsim
```

For the TrecQA dataset
```
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/TrecQA/$set.qrel TrecQA.$set.idfsim
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../Castor-data/TrecQA/$set.qrel TrecQA.$set.idfsim
```

#### 3. IDF sum similarity using only the QA dataset to compute IDF of terms

```
python qa-data-idf-only.py ../../data/TrecQA TrecQA
python qa-data-only-idf.py ../../data/WikiQA WikiQA
python qa-data-idf-only.py ../../Castor-data/TrecQA TrecQA
python qa-data-only-idf.py ../../Castor-data/WikiQA WikiQA
```
Evaluate these using step 2.

The same script can now also be used to comput idf sum similarity based on corpus idf statistics
The same script can now also be used to compute IDF sum similarity based on corpus IDF statistics:
```
python qa-data-only-idf.py ../../data/TrecQA TrecQA --index-for-corpusIDF ../../data/indices/index.qadata.pos.docvectors.keepstopwords/
python qa-data-only-idf.py ../../Castor-data/TrecQA TrecQA --index-for-corpusIDF ../../Castor-data/indices/index.qadata.pos.docvectors.keepstopwords/
```
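
To make the idea concrete, here is a minimal, self-contained sketch of IDF sum similarity (illustrative only; the real logic lives in the scripts above): each candidate answer is scored by summing the IDF weights of the question terms it contains.

```python
import math
from collections import Counter

def idf_weights(sentences):
    # Each tokenized sentence is treated as a document when computing document frequencies.
    n = len(sentences)
    df = Counter(term for sent in sentences for term in set(sent))
    return {term: math.log(n / freq) for term, freq in df.items()}

def idf_sum_similarity(question, answer, idf):
    # Score a candidate answer by summing the IDF of question terms it shares with the answer.
    overlap = set(question) & set(answer)
    return sum(idf.get(term, 0.0) for term in overlap)

# Toy example with a hypothetical three-sentence collection.
corpus = [["who", "wrote", "hamlet"],
          ["shakespeare", "wrote", "hamlet"],
          ["the", "capital", "of", "france", "is", "paris"]]
idf = idf_weights(corpus)
print(idf_sum_similarity(["who", "wrote", "hamlet"],
                         ["shakespeare", "wrote", "hamlet"], idf))
```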

### Baseline results
@@ -122,7 +122,7 @@ def list_settings(self):
ap.add_argument("--runall", help="runs all experiments in order", action="store_true")
ap.add_argument("index_path", help="required for some combination of experiments")
ap.add_argument('qa_data', help="path to the QA dataset",
choices=['../../data/TrecQA', '../../data/WikiQA'])
choices=['../../Castor-data/TrecQA', '../../Castor-data/WikiQA'])

args = ap.parse_args()

@@ -125,7 +125,7 @@ def write_out_idf_sum_similarities(qids, questions, answers, term_idfs, outfile,
ap = argparse.ArgumentParser(description="uses idf weights from the question-answer pairs only,\
and not from the whole corpus")
ap.add_argument('qa_data', help="path to the QA dataset",
choices=['../../data/TrecQA', '../../data/WikiQA'])
choices=['../../Castor-data/TrecQA', '../../Castor-data/WikiQA'])
ap.add_argument('outfile_prefix', help="output file prefix")
ap.add_argument('--ignore-test', help="does not consider test data when computing IDF of terms",
action="store_true")
@@ -16,31 +16,12 @@ Assuming you already have PyTorch, just install torchtext (`pip install torchtex

## Quick Start

Clone and create the dataset.
```
git clone https://github.com/castorini/Castor.git
```

```
.
├── Castor
├── README.md
├── baseline_results.tsv
├── idf_baseline
├── kim_cnn
├── mp_cnn
├── setup.py
├── sm_cnn
└── sm_modified_cnn
```

To get the dataset, you can run this.
```
cd kim_cnn
bash get_data.sh
```


To run the model on SST-1 dataset on multichannel, just run the following code.

```
@@ -4,21 +4,7 @@ This is a PyTorch implementation of the following paper

* Hua He, Kevin Gimpel, and Jimmy Lin. [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks](http://aclweb.org/anthology/D/D15/D15-1181.pdf). *Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015)*, pages 1576-1586.

The SICK and MSRVID datasets are available in https://github.com/castorini/data, as well as the GloVe word embeddings.

Directory layout should be like this:
```
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
```
Please ensure you have followed the instructions in the main [README](../README.md) before running any further commands in this document.

## SICK Dataset

@@ -31,14 +31,14 @@ class MPCNNDatasetFactory(object):
@staticmethod
def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, device, castor_dir="../", utils_trecqa="utils/trec_eval-9.0.5/trec_eval"):
if dataset_name == 'sick':
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'sick/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'sick/')
train_loader, dev_loader, test_loader = SICK.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = SICK.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
embedding.weight = nn.Parameter(SICK.TEXT_FIELD.vocab.vectors)
return SICK, embedding, train_loader, test_loader, dev_loader
elif dataset_name == 'msrvid':
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'msrvid/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'msrvid/')
dev_loader = None
train_loader, test_loader = MSRVID.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = MSRVID.TEXT_FIELD.vocab.vectors.size()
@@ -48,7 +48,7 @@ def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, d
elif dataset_name == 'trecqa':
if not os.path.exists(os.path.join(castor_dir, utils_trecqa)):
raise FileNotFoundError('TrecQA requires the trec_eval tool to run. Please run get_trec_eval.sh inside Castor/utils (as working directory) before continuing.')
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'TrecQA/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'TrecQA/')
train_loader, dev_loader, test_loader = TRECQA.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = TRECQA.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
@@ -57,7 +57,7 @@ def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, d
elif dataset_name == 'wikiqa':
if not os.path.exists(os.path.join(castor_dir, utils_trecqa)):
raise FileNotFoundError('TrecQA requires the trec_eval tool to run. Please run get_trec_eval.sh inside Castor/utils (as working directory) before continuing.')
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'WikiQA/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'WikiQA/')
train_loader, dev_loader, test_loader = WikiQA.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = WikiQA.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
@@ -18,7 +18,7 @@
parser = argparse.ArgumentParser(description='PyTorch implementation of Multi-Perspective CNN')
parser.add_argument('model_outfile', help='file to save final model')
parser.add_argument('--dataset', help='dataset to use, one of [sick, msrvid, trecqa, wikiqa]', default='sick')
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, 'data', 'GloVe'))
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, 'Castor-data', 'embeddings', 'GloVe'))
parser.add_argument('--word-vectors-file', help='word vectors filename', default='glove.840B.300d.txt')
parser.add_argument('--skip-training', help='will load pre-trained model', action='store_true')
parser.add_argument('--device', type=int, default=0, help='GPU device, -1 for CPU (default: 0)')
@@ -5,22 +5,7 @@ This is a PyTorch implementation of the following paper
* Hua He, Kevin Gimpel, and Jimmy Lin. [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks](http://aclweb.org/anthology/D/D15/D15-1181.pdf). *Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015)*, pages 1576-1586.
* Jinfeng Rao, Hua He, and Jimmy Lin. [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks.](http://dl.acm.org/citation.cfm?id=2983872) *Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM 2016)*, pages 1913-1916.


The SICK and MSRVID datasets are available in https://github.com/castorini/data, as well as the GloVe word embeddings.

Directory layout should be like this:
```
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
```
Please ensure you have followed the instructions in the main [README](../README.md) before running any further commands in this document.

## TrecQA Dataset

@@ -18,7 +18,7 @@
parser = argparse.ArgumentParser(description='PyTorch implementation of Multi-Perspective CNN')
parser.add_argument('model_outfile', help='file to save final model')
parser.add_argument('--dataset', help='dataset to use, one of [sick, msrvid, trecqa, wikiqa]', default='sick')
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'GloVe'))
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'GloVe'))
parser.add_argument('--word-vectors-file', help='word vectors filename', default='glove.840B.300d.txt')
parser.add_argument('--skip-training', help='will load pre-trained model', action='store_true')
parser.add_argument('--device', type=int, default=0, help='GPU device, -1 for CPU (default: 0)')
@@ -20,7 +20,7 @@ def get_args():
parser.add_argument('--words_dim', type=int, default=50)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--epoch_decay', type=int, default=15)
parser.add_argument('--wordvec_dir', type=str, default='../../../data/word2vec/')
parser.add_argument('--wordvec_dir', type=str, default='../../../Castor-data/embeddings/word2vec/')
parser.add_argument('--vector_cache', type=str, default='word2vec.trecqa.pt')
parser.add_argument('--trained_model', type=str, default="")
parser.add_argument('--weight_decay',type=float, default=1e-5)
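
The `--wordvec_dir` and `--vector_cache` flags above suggest the word2vec vectors are read once and then cached to a `.pt` file. A hedged sketch of how such a cache could be built with gensim and torch follows; the binary filename and the tiny vocabulary are assumptions for illustration, not taken from the repo.

```python
import torch
from gensim.models import KeyedVectors

# Assumed location of the Google word2vec binary under Castor-data (see --wordvec_dir above).
kv = KeyedVectors.load_word2vec_format(
    '../../../Castor-data/embeddings/word2vec/GoogleNews-vectors-negative300.bin', binary=True)

# Hypothetical dataset vocabulary, restricted to words the embedding actually covers.
vocab = [w for w in ['who', 'wrote', 'hamlet'] if w in kv]
vectors = torch.stack([torch.as_tensor(kv[w]) for w in vocab])
torch.save((vocab, vectors), 'word2vec.trecqa.pt')  # matches the --vector_cache default above
```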
@@ -38,10 +38,10 @@

if args.dataset == "trec":
dataset_cls = TRECQA
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'TrecQA/')
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'TrecQA/')
elif args.dataset == "wiki":
dataset_cls = WikiQA
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'WikiQA/')
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'WikiQA/')
else:
logger.info("Unsupported dataset")
exit()
@@ -102,10 +102,10 @@ def compute_dfs(docs):

if __name__ == '__main__':
parser = ArgumentParser(description='create TrecQA/WikiQA dataset')
parser.add_argument('--dir', help='path to the TrecQA|WikiQA data directory', default="../../data/TrecQA")
parser.add_argument('--dir', help='path to the TrecQA|WikiQA data directory', default="../../Castor-data/TrecQA")
args = parser.parse_args()

stoplist = set([line.strip() for line in open('../../data/TrecQA/stopwords.txt', encoding='utf-8')])
stoplist = set([line.strip() for line in open('../../Castor-data/TrecQA/stopwords.txt', encoding='utf-8')])
punct = set(string.punctuation)
stoplist.update(punct)
