diff --git a/README.md b/README.md
index 7403c5a5..bf0d0d3f 100644
--- a/README.md
+++ b/README.md
@@ -17,12 +17,18 @@ For paraphrase detection, question answering, etc.
 
 + [SM-CNN](./sm_cnn/): Siamese CNN for ranking texts [(Severyn and Moschitti, SIGIR 2015)](https://dl.acm.org/citation.cfm?id=2767738)
 + [MP-CNN](./mp_cnn/): Multi-Perspective CNN [(He et al., EMNLP 2015)](http://anthology.aclweb.org/D/D15/D15-1181.pdf)
-+ [NCE](./nce/): Noise-Contrastive Estimation for answer selection applied on SM-CNN and MP-CNN
++ [NCE](./nce/): Noise-Contrastive Estimation for answer selection applied on SM-CNN and MP-CNN [(Rao et al., CIKM 2016)](https://dl.acm.org/citation.cfm?id=2983872)
 + [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers
 
+Each model directory has a `README.md` with further details.
+
 ## Setting up PyTorch
 
-Copy and run the command at https://pytorch.org/ for your environment. PyTorch recommends the Anaconda environment, which we use in our lab.
+**If you are an internal Castor contributor and are planning to use the Data System Group's GPU machines in the lab,
+please follow the instructions [here](./docs/internal-instructions.md) instead.**
+
+Copy and run the command at [https://pytorch.org/](https://pytorch.org/) for your environment.
+PyTorch recommends the Anaconda environment, which we use in our lab.
 
 We are currently targeting PyTorch 0.4 for our codebase.
 The typical installation command is
@@ -30,18 +36,33 @@ The typical installation command is
 conda install pytorch torchvision -c pytorch
 ```
 
+Other Python packages we use can be installed via pip:
+
+```bash
+pip install -r requirements.txt
+```
+
+Please also run the following inside the `utils` directory to build the `trec_eval` tool for evaluating certain datasets.
+
+```bash
+./get_trec_eval.sh
+```
+
 ## Data and Pre-Trained Models
 
+**If you are an internal Castor contributor and are planning to use the Data System Group's GPU machines in the lab,
+please follow the instructions [here](./docs/internal-instructions.md) instead.**
+
 Data associated for use with this repository can be found at: https://git.uwaterloo.ca/jimmylin/Castor-data.git.
 
-Pre-trained models can be found at: https://github.com/castorini/models.git.
+Pre-trained models can be found at: https://git.uwaterloo.ca/jimmylin/Castor-models.
 
 Your directory structure should look like
 ```
 .
 ├── Castor
 ├── Castor-data
-└── models
+└── Castor-models
 ```
 
 For example (if you use HTTPS instead of SSH):
@@ -49,7 +70,13 @@ For example (if you use HTTPS instead of SSH):
 ```bash
 git clone https://github.com/castorini/Castor.git
 git clone https://git.uwaterloo.ca/jimmylin/Castor-data.git
-git clone https://github.com/castorini/models.git
+git clone https://git.uwaterloo.ca/jimmylin/Castor-models.git
 ```
 
-Sourcing and pre-processing of input data for each model is described in the respective ```model/README.md```'s.
+After cloning the Castor-data repo, you need to unzip embeddings and run data pre-processing scripts. You can choose
+to follow instructions under each dataset / embedding directory separately, or just run the following script in Castor-data
+to do all of the steps for you:
+
+```bash
+./setup.sh
+```
diff --git a/docs/internal-instructions.md b/docs/internal-instructions.md
new file mode 100644
index 00000000..75d61c82
--- /dev/null
+++ b/docs/internal-instructions.md
@@ -0,0 +1,49 @@
+# Instructions for DSG Castor Contributors
+
+Please follow these instructions if you are a graduate student or undergrad research assistant working with the group
+in the Data Systems Lab and want to run Castor on the lab desktop GPU machine (dragon).
+
+If you have trouble / questions with instructions on this page, ping @tuzhucheng on Slack.
+
+## PyTorch Environment
+
+We already have a multi-user Conda environment with PyTorch and all other dependencies installed, so you do not need to
+install anything yourself. However, you can create [Conda environments](https://conda.io/docs/user-guide/tasks/manage-environments.html)
+if you need to experiment with different library versions etc.
+
+The multi-user Conda environment is located at `/anaconda3/`.
+To use this multi-user environment, just add the following to your `.bashrc` or configuration file for your favourite shell.
+
+```bash
+export PATH="/anaconda3/bin:$PATH"
+export LIBRARY_PATH="/usr/lib/nvidia-375"
+```
+
+Please also ensure `/usr/local/cuda-8.0/lib64` is in the `LD_LIBRARY_PATH` environment variable **if it is not already**.
+If not, you should add it in the `.bashrc` similar to above.
+
+Please re-login or re-source your shell configuration after `.bashrc` is updated for the updated environment variables
+to take effect.
+
+## Data and Pre-Trained Models
+
+We use shared cloned versions of the Castor-data and Castor-models repositories.
+Instead of making your own cloned copies, you can just create symbolic links to the shared version instead
+in your own working directory to save disk space. Assuming you want to put `Castor`, `Castor-data`, and `Castor-models`
+under a directory called `castorini` and you are currently in the `castorini` directory, you can enter these commands:
+
+```bash
+ln -s /Castor-data Castor-data
+ln -s /Castor-models Castor-models
+```
+
+So after you clone Castor, you have a directory structure under `castorini` that looks like this:
+
+```
+.
+├── Castor
+├── Castor-data
+└── Castor-models
+```
+
+where `Castor-data` and `Castor-models` are actually symbolic links to `/Castor-data` and `/Castor-models`.
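Note on `docs/internal-instructions.md`: the new doc asks readers to add `/usr/local/cuda-8.0/lib64` to `LD_LIBRARY_PATH` only if it is not already there. A minimal sketch of one way to make that check repeatable in `.bashrc` is shown here. The helper name `add_to_ld_library_path` is hypothetical and not part of Castor; only the CUDA path comes from the doc.

```bash
# Append a directory to LD_LIBRARY_PATH only if it is not already listed.
# `add_to_ld_library_path` is a hypothetical helper name, not part of Castor.
add_to_ld_library_path() {
    local dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${dir}:"*)
            # Already present: leave the variable untouched.
            ;;
        *)
            # Add a ":" separator only when the variable is non-empty.
            export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${dir}"
            ;;
    esac
}

# Path taken from the instructions above.
add_to_ld_library_path "/usr/local/cuda-8.0/lib64"
```

Because the function checks before appending, re-sourcing `.bashrc` several times should not grow the variable with duplicate entries.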
diff --git a/idf_baseline/requirements.txt b/idf_baseline/requirements.txt
deleted file mode 100644
index 4cc717a8..00000000
--- a/idf_baseline/requirements.txt
+++ /dev/null
@@ -1,2 +0,0 @@
-nltk==3.2.1
-numpy==1.11.3
diff --git a/mp_cnn/README.md b/mp_cnn/README.md
index 326267d2..5811d719 100644
--- a/mp_cnn/README.md
+++ b/mp_cnn/README.md
@@ -5,6 +5,7 @@ This is a PyTorch implementation of the following paper
 * Hua He, Kevin Gimpel, and Jimmy Lin. [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks](http://aclweb.org/anthology/D/D15/D15-1181.pdf). *Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015)*, pages 1576-1586.
 
 Please ensure you have followed instructions in the main [README](../README.md) doc before running any further commands in this doc.
+The commands in this doc assume you are under the root directory of the Castor repo.
 
 ## Pre-Trained Models
 
diff --git a/requirements.txt b/requirements.txt
index 3c6d5e7d..79dbce5d 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,7 +1,9 @@
+Flask==0.12.1
 gensim==1.0.1
-numpy==1.12.1
+nltk==3.2.5
+numpy==1.14.0
 pandas==0.19.2
-Flask==0.12.1
-nltk==3.2.2
 pyjnius==1.1.1
--e git+https://github.com/castorini/Castor.git#egg=sm-cnn-1.0.0
\ No newline at end of file
+scikit-learn==0.19.1
+scipy==1.0.0
+torchtext==0.2.3
diff --git a/sm_cnn/requirements.txt b/sm_cnn/requirements.txt
deleted file mode 100644
index 0c28e6b7..00000000
--- a/sm_cnn/requirements.txt
+++ /dev/null
@@ -1,4 +0,0 @@
-nltk==3.2.4
-numpy==1.13.1
-pytorch==0.2.0
-torchtext==0.2.0
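Note on the requirements consolidation: before this change, the root file and the two per-model files pinned different versions of the same packages (`nltk` at 3.2.2, 3.2.1, and 3.2.4; `numpy` at 1.12.1, 1.11.3, and 1.13.1). A rough sketch for spotting such conflicts across several requirements files is shown here. `conflicting_pins` is a hypothetical helper, not a script shipped with Castor, and it assumes plain `name==version` lines; anything else (such as `-e git+...` entries) is skipped.

```bash
# List packages pinned to more than one version across the given files.
# `conflicting_pins` is a hypothetical helper, not a script shipped with Castor.
conflicting_pins() {
    awk -F'==' '
        # Consider only "name==version" lines, and count each distinct pin once.
        NF == 2 && !seen[$0]++ {
            versions[$1] = versions[$1] " " $2
            count[$1]++
        }
        END {
            # Report every package that ended up with two or more distinct versions.
            for (pkg in count)
                if (count[pkg] > 1) print pkg ":" versions[pkg]
        }
    ' "$@" | sort
}
```

On a checkout from before this change, it could be called as `conflicting_pins requirements.txt idf_baseline/requirements.txt sm_cnn/requirements.txt`. With a single consolidated `requirements.txt`, the check prints nothing unless a package is pinned twice within that file.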