Update README.md
armancohan committed Mar 22, 2019
1 parent 969ac74 commit fe8054d
Showing 1 changed file with 2 additions and 2 deletions.
README.md: 4 changes (2 additions, 2 deletions)
@@ -7,12 +7,12 @@ Two datasets of long and structured documents (scientific papers) are provided.
ArXiv dataset: [Download](https://drive.google.com/file/d/1K2kDBTNXS2ikx9xKmi2Fy0Wsc5u_Lls0/view?usp=sharing)
PubMed dataset: [Download](https://drive.google.com/file/d/1Sa3kip8IE0J1SkMivlgOwq1jBgOnzeny/view?usp=sharing)

- The datasets are rather large. You need about 5 GB of disk space to download them and about 15 GB of additional space to extract them. Each `tar` file contains four files. `train.txt`, `val.txt`, and `test.txt` correspond to the training, validation, and test sets, respectively. These are text files in which each line is a JSON object corresponding to one scientific paper from ArXiv or PubMed. Use the following script, `scripts/json_to_bin.py`, to convert these files into the TensorFlow bin files used for training. The fourth file, `vocab`, is a plaintext vocabulary file.
+ The datasets are rather large. You need about 5 GB of disk space to download them and about 15 GB of additional space to extract them. Each `tar` file contains four files. `train.txt`, `val.txt`, and `test.txt` correspond to the training, validation, and test sets, respectively. These are text files in which each line is a JSON object corresponding to one scientific paper from ArXiv or PubMed. The fourth file, `vocab`, is a plaintext vocabulary file.
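Since each line is a standalone JSON object, a few lines of Python are enough to inspect the data. The file path and the `article_id` key below are assumptions for illustration; print the keys to see what the files actually contain.

```python
import json

# Minimal sketch for peeking at the jsonlines data.
# Adjust the path to wherever you extracted the files; the field name
# "article_id" is an assumed example, not a guaranteed key.
with open("data/val.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        paper = json.loads(line)        # one JSON object per line = one paper
        print(sorted(paper.keys()))     # which fields are present
        print(paper.get("article_id"))  # assumed: a unique paper identifier
        if i == 2:                      # only look at the first few papers
            break
```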

#### Code

The code is based on the pointer-generator network code by [See et al. (2017)](https://github.com/abisee/pointer-generator). Refer to their repo for documentation about the structure of the code.
- You will need `python 3.6` and `Tensorflow 1.5` to run the code. The code might run with later versions of TensorFlow, but this has not been tested. Check the remaining dependencies in the `requirements.txt` file. To run the code, unzip the files in the `data` directory and execute the run script: `./run.sh`.
+ You will need `python 3.6` and `Tensorflow 1.5` to run the code. The code might run with later versions of TensorFlow, but this has not been tested. Check the remaining dependencies in the `requirements.txt` file. A small sample of the dataset is already provided in this repo. To run the code with the sample data, unzip the files in the `data` directory and execute the run script: `./run.sh`. To train the model on the entire dataset, first convert the jsonlines files to binary using `scripts/json_to_bin.py`, then point the training data path in the `run.sh` script to the converted files.
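For a rough picture of what the jsonlines-to-binary step involves, the sketch below follows the pointer-generator data format that this code builds on (length-prefixed, serialized `tf.Example` records with `article` and `abstract` features). It is only an illustration under those assumptions, with assumed JSON key names; use the provided `scripts/json_to_bin.py` for real training data, since it defines the exact fields the model expects.

```python
import json
import struct

from tensorflow.core.example import example_pb2


def _to_text(field):
    # The raw field may be a single string or a list of sentence strings.
    return field if isinstance(field, str) else " ".join(field)


def jsonlines_to_bin(jsonl_path, bin_path):
    """Write length-prefixed tf.Example records, pointer-generator style."""
    with open(jsonl_path, "r", encoding="utf-8") as fin, open(bin_path, "wb") as fout:
        for line in fin:
            paper = json.loads(line)
            # "article_text" and "abstract_text" are assumed key names.
            article = _to_text(paper.get("article_text", "")).encode("utf-8")
            abstract = _to_text(paper.get("abstract_text", "")).encode("utf-8")

            ex = example_pb2.Example()
            ex.features.feature["article"].bytes_list.value.extend([article])
            ex.features.feature["abstract"].bytes_list.value.extend([abstract])

            serialized = ex.SerializeToString()
            fout.write(struct.pack("q", len(serialized)))  # 8-byte length prefix
            fout.write(serialized)


jsonlines_to_bin("data/train.txt", "data/train.bin")
```

After conversion, updating the training data path in `run.sh` to the generated `.bin` file switches training from the sample data to the full dataset.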

#### References

