
Migrate from GitHub castorini/data to UWaterloo Castor-data #103

Merged: 6 commits, May 23, 2018 (changes shown from 5 commits)

55 changes: 33 additions & 22 deletions README.md
@@ -1,40 +1,51 @@
# Castor

PyTorch deep learning models.
Deep learning for information retrieval with PyTorch.

1. [SM model](./sm_cnn/): Similarity between question and candidate answers.
## Models

### Baselines

## Setting up PyTorch

You need Python 3.6 to use the models in this repository.

As per [pytorch.org](pytorch.org),
> "[Anaconda](https://www.continuum.io/downloads) is our recommended package manager"
1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers

```conda install pytorch torchvision -c soumith```
### Deep Learning Models

Other pytorch installation modalities (e.g. via ```pip```) can be seen at [pytorch.org](pytorch.org).
1. [SM-CNN](./sm_cnn/): Ranking short text pairs with Convolutional Neural Networks
2. [Kim CNN](./kim_cnn/): Sentence classification using Convolutional Neural Networks
3. [MP-CNN](./mp_cnn/): Sentence pair modelling with Multi-Perspective Convolutional Neural Networks
4. [NCE](./nce/): Noise-Contrastive Estimation for answer selection applied on SM-CNN and MP-CNN
5. [conv-RNN](./conv_rnn): Convolutional RNN for text modelling

We also recommend [gensim](https://radimrehurek.com/gensim/). We use some gensim modules to cache word embeddings.

```conda install gensim```
## Setting up PyTorch

Copy and run the command at https://pytorch.org/ for your environment. PyTorch recommends the Anaconda environment, which we use in our lab.

PyTorch has good support for GPU computations.
The CUDA installation guide for Linux can be found [here](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
The typical installation command is

**NOTE**: Install CUDA libraries **before** installing conda and pytorch.
```bash
conda install pytorch torchvision -c pytorch
```
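
A quick sanity check after installation (optional; the exact output will vary with your setup):

```bash
# Prints the installed PyTorch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```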

## Data and Pre-Trained Models

## data for models
Data for use with this repository can be found at: https://git.uwaterloo.ca/jimmylin/Castor-data.git.

Sourcing and pre-processing of input data for each model is described in respective ```model/README.md```'s
Pre-trained models can be found at: https://github.com/castorini/models.git.

## Baselines
Your directory structure should look like
```
.
├── Castor
├── Castor-data
└── models
```

> **Member:** Why don't we just rename it to Castor-models while we're at it. I'll create this repo on UWaterloo git also.

1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers.
For example (if you use HTTPS instead of SSH):

## Tutorials
```bash
git clone https://github.com/castorini/Castor.git
git clone https://git.uwaterloo.ca/jimmylin/Castor-data.git
git clone https://github.com/castorini/models.git
```
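
A minimal check of the layout (assumes all three repositories were cloned into the same parent directory):

```bash
# Should list the three sibling directories without errors
ls -d Castor Castor-data models
```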

SM Model tutorial: [sm_cnn/tutorial.ipynb](sm_cnn/tutorial.ipynb) - a notebook that walks through the SM CNN model, good for beginners.
Sourcing and pre-processing of input data for each model is described in the respective ```model/README.md```'s.
18 changes: 3 additions & 15 deletions anserini_dependency/README.md
@@ -1,19 +1,16 @@
## Setting up RetrieveSentences and the end-to-end QA pipeline

#### 1. Clone [Anserini](https://github.com/castorini/Anserini.git), [Castor](https://github.com/castorini/Castor.git), [data](https://github.com/castorini/data.git), and [models](https://github.com/castorini/models.git):
#### 1. Assuming you've already followed the main [README](../README.md) instructions, just clone [Anserini](https://github.com/castorini/Anserini.git):
```bash
git clone https://github.com/castorini/Anserini.git
git clone https://github.com/castorini/Castor.git
git clone https://github.com/castorini/data.git
git clone https://github.com/castorini/models.git
```

Your directory structure should look like
```
.
├── Anserini
├── Castor
├── data
├── Castor-data
└── models
```

@@ -34,22 +31,13 @@ Install the dependency packages:

```
cd Castor
pip3 install -r requirements.txt
pip install -r requirements.txt
```

> **Member:** Do all our models support both py2 and py3?
>
> **Member Author:** No, it only supports Python 3. But if they followed the setup correctly, pip should be the correct one by default; if they explicitly use pip3 and pip doesn't work, something is wrong with their environment set-up.

Make sure that you have PyTorch installed. For more help, follow [these](https://github.com/castorini/Castor) steps.

#### 3. Download Dependencies
- Download the TrecQA Lucene index
- Download the Google word2vec file from [here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNWJkWExmaklYNTA?usp=sharing)

#### 4. Additional files for pipeline:
As some of the files are too large to be uploaded onto GitHub, please download the following files from
[here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNm1LdjlwUFdzQVE?usp=sharing) and place them
in the appropriate locations (see the copy sketch after this list):

- copy the contents of `word2vec` directory to `data/word2vec`
- copy `word2dfs.p` to `data/TrecQA/`
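
For example, a minimal copy sketch; the `~/Downloads` source location is an assumption, so adjust paths to match where you saved the files:

```bash
# Hypothetical source paths; run from the parent directory of the data repo
mkdir -p data/word2vec
cp -r ~/Downloads/word2vec/* data/word2vec/
cp ~/Downloads/word2dfs.p data/TrecQA/
```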

### To run RetrieveSentences:

```bash
4 changes: 2 additions & 2 deletions anserini_dependency/api.py
@@ -51,7 +51,7 @@ def get_answers(question, num_hits, k):
parser.add_argument("--scorer", help="passage scores", default="Idf")
parser.add_argument("--k", help="top-k passages to be retrieved", default=1)
parser.add_argument('--model', help="the path to the saved model file")
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')
parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
parser.add_argument('--seed', type=int, default=3435)
@@ -109,7 +109,7 @@ def get_answers(question, num_hits, k):
parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
parser.add_argument('--seed', type=int, default=3435)
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')

> **Member:** two argument parsers in a file?
>
> **Member Author:** Didn't understand this comment :(
>
> **Member:** I mean there are two parser objects in this file api.py. Is that what you want?
>
> **Member Author:** I see, that parser was already there (unchanged except for the default path argument). This PR is only about the Castor-data issue. If that is a bug, I would prefer to fix it separately in another PR.

args = parser.parse_args()
if not args.cuda:
18 changes: 9 additions & 9 deletions idf_baseline/README.md
@@ -4,19 +4,19 @@ Implements IDF baselines for QA datasets.

### Getting the data

Git clone [castorini/data](https://github.com/castorini/data) to get TrecQA and WikiQA datasets.
This assumes you have followed the instructions in the main [README](../README.md) to clone Castor-data.

Follow instructions in ``TrecQA/README.txt`` and ``WikiQA/README.txt`` to process the data into a _standard_ format.

After running the respectve scripts, you should have the following directories structure in ``castorini/data/TrecQA``
After running the respective scripts, you should have the following directory structure in ``castorini/Castor-data/TrecQA``
```
├── raw-dev
├── raw-test
├── train
└── train-all
```

and, the following directories in ``castorini/data/WikiQA``.
and the following directories in ``castorini/Castor-data/WikiQA``.
```
├── dev
├── test
@@ -138,25 +138,25 @@ eval/trec_eval.9.0/trec_eval -m map -m recip_rank <qrel-file> <run-file>

For the WikiQA dataset
```
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/WikiQA/WikiQACorpus/WikiQA-$set.ref WikiQA.$set.idfsim
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../Castor-data/WikiQA/WikiQACorpus/WikiQA-$set.ref WikiQA.$set.idfsim
```

For the TrecQA dataset
```
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/TrecQA/$set.qrel TrecQA.$set.idfsim
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../Castor-data/TrecQA/$set.qrel TrecQA.$set.idfsim
```

#### 3. IDF sum similarity using only the QA dataset to compute IDF of terms

```
python qa-data-idf-only.py ../../data/TrecQA TrecQA
python qa-data-only-idf.py ../../data/WikiQA WikiQA
python qa-data-idf-only.py ../../Castor-data/TrecQA TrecQA
python qa-data-only-idf.py ../../Castor-data/WikiQA WikiQA
```
Evaluate these using step 2.

The same script can now also be used to comput idf sum similarity based on corpus idf statistics
The same script can now also be used to compute IDF sum similarity based on corpus IDF statistics:
```
python qa-data-only-idf.py ../../data/TrecQA TrecQA --index-for-corpusIDF ../../data/indices/index.qadata.pos.docvectors.keepstopwords/
python qa-data-only-idf.py ../../Castor-data/TrecQA TrecQA --index-for-corpusIDF ../../Castor-data/indices/index.qadata.pos.docvectors.keepstopwords/
```

### Baseline results
2 changes: 1 addition & 1 deletion idf_baseline/experimental_settings.py
@@ -122,7 +122,7 @@ def list_settings(self):
ap.add_argument("--runall", help="runs all experiments in order", action="store_true")
ap.add_argument("index_path", help="required for some combination of experiments")
ap.add_argument('qa_data', help="path to the QA dataset",
choices=['../../data/TrecQA', '../../data/WikiQA'])
choices=['../../Castor-data/TrecQA', '../../Castor-data/WikiQA'])

args = ap.parse_args()

2 changes: 1 addition & 1 deletion idf_baseline/qa-data-only-idf.py
@@ -125,7 +125,7 @@ def write_out_idf_sum_similarities(qids, questions, answers, term_idfs, outfile,
ap = argparse.ArgumentParser(description="uses idf weights from the question-answer pairs only,\
and not from the whole corpus")
ap.add_argument('qa_data', help="path to the QA dataset",
choices=['../../data/TrecQA', '../../data/WikiQA'])
choices=['../../Castor-data/TrecQA', '../../Castor-data/WikiQA'])
ap.add_argument('outfile_prefix', help="output file prefix")
ap.add_argument('--ignore-test', help="does not consider test data when computing IDF of terms",
action="store_true")
19 changes: 0 additions & 19 deletions kim_cnn/README.md
@@ -16,31 +16,12 @@ Assuming you already have PyTorch, just install torchtext (`pip install torchtex

## Quick Start

Clone and create the dataset.
```
git clone https://github.com/castorini/Castor.git
```

```
.
├── Castor
├── README.md
├── baseline_results.tsv
├── idf_baseline
├── kim_cnn
├── mp_cnn
├── setup.py
├── sm_cnn
└── sm_modified_cnn
```

To get the dataset, you can run this.
```
cd kim_cnn
bash get_data.sh
```


To run the model on the SST-1 dataset in multichannel mode, just run the following command.

```
16 changes: 1 addition & 15 deletions mp_cnn/README.md
@@ -4,21 +4,7 @@ This is a PyTorch implementation of the following paper

* Hua He, Kevin Gimpel, and Jimmy Lin. [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks](http://aclweb.org/anthology/D/D15/D15-1181.pdf). *Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015)*, pages 1576-1586.

The SICK and MSRVID datasets are available in https://github.com/castorini/data, as well as the GloVe word embeddings.

Directory layout should be like this:
```
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
```
Please ensure you have followed the instructions in the main [README](../README.md) before running any further commands in this doc.

## SICK Dataset

8 changes: 4 additions & 4 deletions mp_cnn/dataset.py
@@ -31,14 +31,14 @@ class MPCNNDatasetFactory(object):
@staticmethod
def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, device, castor_dir="../", utils_trecqa="utils/trec_eval-9.0.5/trec_eval"):
if dataset_name == 'sick':
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'sick/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'sick/')
train_loader, dev_loader, test_loader = SICK.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = SICK.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
embedding.weight = nn.Parameter(SICK.TEXT_FIELD.vocab.vectors)
return SICK, embedding, train_loader, test_loader, dev_loader
elif dataset_name == 'msrvid':
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'msrvid/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'msrvid/')
dev_loader = None
train_loader, test_loader = MSRVID.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = MSRVID.TEXT_FIELD.vocab.vectors.size()
@@ -48,7 +48,7 @@ def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, d
elif dataset_name == 'trecqa':
if not os.path.exists(os.path.join(castor_dir, utils_trecqa)):
raise FileNotFoundError('TrecQA requires the trec_eval tool to run. Please run get_trec_eval.sh inside Castor/utils (as working directory) before continuing.')
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'TrecQA/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'TrecQA/')
train_loader, dev_loader, test_loader = TRECQA.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = TRECQA.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
Expand All @@ -57,7 +57,7 @@ def get_dataset(dataset_name, word_vectors_dir, word_vectors_file, batch_size, d
elif dataset_name == 'wikiqa':
if not os.path.exists(os.path.join(castor_dir, utils_trecqa)):
raise FileNotFoundError('TrecQA requires the trec_eval tool to run. Please run get_trec_eval.sh inside Castor/utils (as working directory) before continuing.')
dataset_root = os.path.join(os.pardir, castor_dir, 'data', 'WikiQA/')
dataset_root = os.path.join(castor_dir, os.pardir, 'Castor-data', 'WikiQA/')
train_loader, dev_loader, test_loader = WikiQA.iters(dataset_root, word_vectors_file, word_vectors_dir, batch_size, device=device, unk_init=UnknownWordVecCache.unk)
embedding_dim = WikiQA.TEXT_FIELD.vocab.vectors.size()
embedding = nn.Embedding(embedding_dim[0], embedding_dim[1])
2 changes: 1 addition & 1 deletion mp_cnn/main.py
@@ -18,7 +18,7 @@
parser = argparse.ArgumentParser(description='PyTorch implementation of Multi-Perspective CNN')
parser.add_argument('model_outfile', help='file to save final model')
parser.add_argument('--dataset', help='dataset to use, one of [sick, msrvid, trecqa, wikiqa]', default='sick')
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, 'data', 'GloVe'))
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, 'Castor-data', 'embeddings', 'GloVe'))
parser.add_argument('--word-vectors-file', help='word vectors filename', default='glove.840B.300d.txt')
parser.add_argument('--skip-training', help='will load pre-trained model', action='store_true')
parser.add_argument('--device', type=int, default=0, help='GPU device, -1 for CPU (default: 0)')
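
With the new default paths above, a hypothetical invocation of `mp_cnn/main.py` could look like the following; the output model filename is made up, and all other flags fall back to the defaults shown:

```bash
# Run from inside mp_cnn/; trains on SICK and saves the model to mpcnn.sick.model
python main.py mpcnn.sick.model --dataset sick
```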
17 changes: 1 addition & 16 deletions nce/nce_pairwise_mp/README.md
@@ -5,22 +5,7 @@ This is a PyTorch implementation of the following paper
* Hua He, Kevin Gimpel, and Jimmy Lin. [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks](http://aclweb.org/anthology/D/D15/D15-1181.pdf). *Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015)*, pages 1576-1586.
* Jinfeng Rao, Hua He, and Jimmy Lin. [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks.](http://dl.acm.org/citation.cfm?id=2983872) *Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM 2016)*, pages 1913-1916.


The SICK and MSRVID datasets are available in https://github.com/castorini/data, as well as the GloVe word embeddings.

Directory layout should be like this:
```
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
```
Please ensure you have followed the instructions in the main [README](../README.md) before running any further commands in this doc.

## TrecQA Dataset

2 changes: 1 addition & 1 deletion nce/nce_pairwise_mp/main.py
Expand Up @@ -18,7 +18,7 @@
parser = argparse.ArgumentParser(description='PyTorch implementation of Multi-Perspective CNN')
parser.add_argument('model_outfile', help='file to save final model')
parser.add_argument('--dataset', help='dataset to use, one of [sick, msrvid, trecqa, wikiqa]', default='sick')
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'GloVe'))
parser.add_argument('--word-vectors-dir', help='word vectors directory', default=os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'GloVe'))
parser.add_argument('--word-vectors-file', help='word vectors filename', default='glove.840B.300d.txt')
parser.add_argument('--skip-training', help='will load pre-trained model', action='store_true')
parser.add_argument('--device', type=int, default=0, help='GPU device, -1 for CPU (default: 0)')
2 changes: 1 addition & 1 deletion nce/nce_pairwise_sm/args.py
@@ -20,7 +20,7 @@ def get_args():
parser.add_argument('--words_dim', type=int, default=50)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--epoch_decay', type=int, default=15)
parser.add_argument('--wordvec_dir', type=str, default='../../../data/word2vec/')
parser.add_argument('--wordvec_dir', type=str, default='../../../Castor-data/embeddings/word2vec/')

> **Member:** different from the path below, word2vec/embeddings?
>
> **Member Author:** Ditto, embeddings are moved to the embeddings subdirectory now.

parser.add_argument('--vector_cache', type=str, default='word2vec.trecqa.pt')
parser.add_argument('--trained_model', type=str, default="")
parser.add_argument('--weight_decay',type=float, default=1e-5)
4 changes: 2 additions & 2 deletions nce/nce_pairwise_sm/main.py
@@ -38,10 +38,10 @@

if args.dataset == "trec":
dataset_cls = TRECQA
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'TrecQA/')
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'TrecQA/')
elif args.dataset == "wiki":
dataset_cls = WikiQA
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'data', 'WikiQA/')
dataset_root = os.path.join(os.pardir, os.pardir, os.pardir, 'Castor-data', 'embeddings', 'WikiQA/')
else:
logger.info("Unsupported dataset")
exit()
4 changes: 2 additions & 2 deletions nce/nce_pairwise_sm/overlap_features.py
@@ -102,10 +102,10 @@ def compute_dfs(docs):

if __name__ == '__main__':
parser = ArgumentParser(description='create TrecQA/WikiQA dataset')
parser.add_argument('--dir', help='path to the TrecQA|WikiQA data directory', default="../../data/TrecQA")
parser.add_argument('--dir', help='path to the TrecQA|WikiQA data directory', default="../../Castor-data/TrecQA")
args = parser.parse_args()

stoplist = set([line.strip() for line in open('../../data/TrecQA/stopwords.txt', encoding='utf-8')])
stoplist = set([line.strip() for line in open('../../Castor-data/TrecQA/stopwords.txt', encoding='utf-8')])
punct = set(string.punctuation)
stoplist.update(punct)
