Migrate from GitHub castorini/data to uWaterloo Castor-data #103
Changes from 5 commits: 6ef6c76, 166eb9b, fcd5b11, 293a6bc, 2cd34dd, d425a49
@@ -1,40 +1,51 @@
 # Castor
 
-PyTorch deep learning models.
+Deep learning for information retrieval with PyTorch.
 
-1. [SM model](./sm_cnn/): Similarity between question and candidate answers.
+## Models
+
+### Baselines
+
+1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers
+
+### Deep Learning Models
+
+1. [SM-CNN](./sm_cnn/): Ranking short text pairs with Convolutional Neural Networks
+2. [Kim CNN](./kim_cnn/): Sentence classification using Convolutional Neural Networks
+3. [MP-CNN](./mp_cnn/): Sentence pair modelling with Multi-Perspective Convolutional Neural Networks
+4. [NCE](./nce/): Noise-Contrastive Estimation for answer selection applied on SM-CNN and MP-CNN
+5. [conv-RNN](./conv_rnn): Convolutional RNN for text modelling
 
 ## Setting up PyTorch
 
-You need Python 3.6 to use the models in this repository.
-
-As per [pytorch.org](pytorch.org),
-> "[Anaconda](https://www.continuum.io/downloads) is our recommended package manager"
-
-```conda install pytorch torchvision -c soumith```
-
-Other pytorch installation modalities (e.g. via ```pip```) can be seen at [pytorch.org](pytorch.org).
-
-We also recommend [gensim](https://radimrehurek.com/gensim/). We use some gensim modules to cache word embeddings.
-
-```conda install gensim```
-
-PyTorch has good support for GPU computations.
-CUDA installation guide for linux can be found [here](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
-
-**NOTE**: Install CUDA libraries **before** installing conda and pytorch.
+Copy and run the command at https://pytorch.org/ for your environment. PyTorch recommends the Anaconda environment, which we use in our lab.
+
+The typical installation command is
+
+```bash
+conda install pytorch torchvision -c pytorch
+```
 
-## data for models
+## Data and Pre-Trained Models
 
-Sourcing and pre-processing of input data for each model is described in respective ```model/README.md```'s
+Data associated for use with this repository can be found at: https://git.uwaterloo.ca/jimmylin/Castor-data.git.
+
+Pre-trained models can be found at: https://github.com/castorini/models.git.
+
+Your directory structure should look like
+```
+.
+├── Castor
+├── Castor-data
+└── models
+```
+
+For example (if you use HTTPS instead of SSH):
+
+```bash
+git clone https://github.com/castorini/Castor.git
+git clone https://git.uwaterloo.ca/jimmylin/Castor-data.git
+git clone https://github.com/castorini/models.git
+```
 
-## Baselines
-
-1. [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers.
-
-## Tutorials
-
-SM Model tutorial: [sm_cnn/tutorial.ipynb](sm_cnn/tutorial.ipynb) - notebook that walks through SM CNN model, good for beginnners.
+Sourcing and pre-processing of input data for each model is described in the respective ```model/README.md```'s.
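The relative defaults introduced by this PR (e.g. `../../Castor-data/TrecQA/`) rely on `Castor`, `Castor-data`, and `models` being sibling checkouts, as in the directory tree above. A minimal sketch of how such a relative default resolves; the helper name and example paths are hypothetical, not part of the repo:

```python
import os

def resolve_dataset(script_dir, dataset="TrecQA"):
    # Hypothetical helper: a model script two directories inside the
    # Castor checkout reaches the sibling Castor-data checkout via
    # the relative default used in this PR.
    return os.path.normpath(os.path.join(script_dir, "../../Castor-data", dataset))
```

For example, a script under `Castor/sm_cnn/` would resolve to `<parent>/Castor-data/TrecQA`, which only works if the clones are laid out as siblings.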
@@ -1,19 +1,16 @@
 ## Setup Retrieve Sentences and end2end QA pipeline
 
-#### 1. Clone [Anserini](https://github.com/castorini/Anserini.git), [Castor](https://github.com/castorini/Castor.git), [data](https://github.com/castorini/data.git), and [models](https://github.com/castorini/models.git):
+#### 1. Assuming you've already followed the main [README](../README.md) instructions, just clone [Anserini](https://github.com/castorini/Anserini.git):
 ```bash
 git clone https://github.com/castorini/Anserini.git
-git clone https://github.com/castorini/Castor.git
-git clone https://github.com/castorini/data.git
-git clone https://github.com/castorini/models.git
 ```
 
 Your directory structure should look like
 ```
 .
 ├── Anserini
 ├── Castor
-├── data
+├── Castor-data
 └── models
 ```

@@ -34,22 +31,13 @@ Install the dependency packages:
 
 ```
 cd Castor
-pip3 install -r requirements.txt
+pip install -r requirements.txt
 ```

> **Comment** (on the `pip install` line): Do all our models support both py2 and py3?
>
> **Reply:** No, it only supports Python 3. But if they followed the setup correctly

 Make sure that you have PyTorch installed. For more help, follow [these](https://github.com/castorini/Castor) steps.
 
-#### 3. Download Dependencies
-- Download the TrecQA lucene index
-- Download the Google word2vec file from [here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNWJkWExmaklYNTA?usp=sharing)
-
-#### 4. Additional files for pipeline:
-As some of the files are too large to be uploaded onto GitHub, please download the following files from
-[here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNm1LdjlwUFdzQVE?usp=sharing) and place them
-in the appropriate locations:
-
-- copy the contents of `word2vec` directory to `data/word2vec`
-- copy `word2dfs.p` to `data/TrecQA/`
-
 ### To run RetrieveSentences:
 
 ```bash
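Per the review exchange above, the models support Python 3 only (the README asks for Python 3.6). A small, hypothetical guard a script could use to fail fast instead of crashing later with a confusing `SyntaxError`; this helper is illustrative and not part of the repo:

```python
import sys

def check_python(min_version=(3, 6)):
    # Return True when the running interpreter meets the minimum
    # version, so callers can exit with a clear message otherwise.
    return sys.version_info[:2] >= min_version
```

Typical usage would be `if not check_python(): sys.exit("Castor requires Python 3.6+")` near the top of an entry-point script.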
@@ -51,7 +51,7 @@ def get_answers(question, num_hits, k):
 parser.add_argument("--scorer", help="passage scores", default="Idf")
 parser.add_argument("--k", help="top-k passages to be retrieved", default=1)
 parser.add_argument('--model', help="the path to the saved model file")
-parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
+parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')
 parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
 parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
 parser.add_argument('--seed', type=int, default=3435)

@@ -109,7 +109,7 @@ def get_answers(question, num_hits, k):
 parser.add_argument('--no_cuda', action='store_false', help='do not use cuda', dest='cuda')
 parser.add_argument('--gpu', type=int, default=0) # Use -1 for CPU
 parser.add_argument('--seed', type=int, default=3435)
-parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../data/TrecQA/')
+parser.add_argument('--dataset', help="the QA dataset folder {TrecQA|WikiQA}", default='../../Castor-data/TrecQA/')

> **Comment:** two argument parsers in a file?
>
> **Reply:** Didn't understand this comment :(
>
> **Comment:** I mean there are two parser objects in this file
>
> **Reply:** I see - that parser was already there (unchanged except for default path arg). This PR is only about the Castor-data issue. If that is a bug I would prefer to fix it separately in another PR.

 args = parser.parse_args()
 if not args.cuda:
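On the "two parser objects in this file" exchange: if that duplication were ever refactored (which this PR deliberately does not do), argparse's parent-parser mechanism is one common way to define shared flags like `--dataset` once. A sketch under that assumption; the `common`/`first`/`second` names are hypothetical:

```python
import argparse

# Shared flags live on a parent parser (add_help=False so the child
# parsers supply their own -h) and are inherited by both parsers.
common = argparse.ArgumentParser(add_help=False)
common.add_argument('--dataset', default='../../Castor-data/TrecQA/',
                    help='the QA dataset folder {TrecQA|WikiQA}')
common.add_argument('--seed', type=int, default=3435)

first = argparse.ArgumentParser(parents=[common])
first.add_argument('--k', type=int, default=1,
                   help='top-k passages to be retrieved')

second = argparse.ArgumentParser(parents=[common])
second.add_argument('--model', help='the path to the saved model file')
```

This keeps a default like the Castor-data path consistent across both parsers, so a future rename only touches one place.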
@@ -20,7 +20,7 @@ def get_args():
 parser.add_argument('--words_dim', type=int, default=50)
 parser.add_argument('--dropout', type=float, default=0.5)
 parser.add_argument('--epoch_decay', type=int, default=15)
-parser.add_argument('--wordvec_dir', type=str, default='../../../data/word2vec/')
+parser.add_argument('--wordvec_dir', type=str, default='../../../Castor-data/embeddings/word2vec/')

> **Comment:** different from the path below
>
> **Reply:** Ditto, embeddings are moved to the

 parser.add_argument('--vector_cache', type=str, default='word2vec.trecqa.pt')
 parser.add_argument('--trained_model', type=str, default="")
 parser.add_argument('--weight_decay',type=float, default=1e-5)
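The relocated embeddings default above can still be overridden at the command line. A reduced, hypothetical parser mirroring just the flag this hunk changes, showing the default and an override:

```python
import argparse

# Reduced parser with only the flag this hunk touches; the new
# default points into the relocated Castor-data checkout.
parser = argparse.ArgumentParser()
parser.add_argument('--wordvec_dir', type=str,
                    default='../../../Castor-data/embeddings/word2vec/')

default_args = parser.parse_args([])
custom_args = parser.parse_args(['--wordvec_dir', '/tmp/word2vec/'])
```

Users with embeddings in a non-standard location can pass `--wordvec_dir` explicitly instead of relying on the sibling-directory layout.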
> **Comment:** Why don't we just rename it to `Castor-models` while we're at it. I'll create this repo on UWaterloo git also.