initial commit

armancohan · Oct 29, 2018 · c767d75
1 parent 6c54644
commit c767d75
Showing 23 changed files with 4,990 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,115 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+.DS_Store
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+.idea
diff --git a/README.md b/README.md
@@ -1,3 +1,4 @@
+This repository contains data and code for the NAACL 2018 paper ["A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"](https://arxiv.org/abs/1804.05685). Please note that the code is not actively maintained.
 
 #### Data
 
@@ -8,10 +9,23 @@ PubMed dataset: [Download](https://drive.google.com/file/d/1Sa3kip8IE0J1SkMivlgO
 
 The datasets are rather large. You need about 5 GB of disk space to download them and about 15 GB more to extract the files. Each `tar` file contains 4 files. `train.txt`, `val.txt`, and `test.txt` correspond to the training, validation, and test sets, respectively. These are text files in which each line is a JSON object corresponding to one scientific paper from arXiv or PubMed. The `vocab` file is a plaintext vocabulary file. 
 
-#### Reference
+#### Code
 
+The code is based on the pointer-generator network code by [See et al. (2017)](https://github.com/abisee/pointer-generator). Refer to their repo for documentation about the structure of the code.
+You will need Python 3.6 and TensorFlow 1.5 to run the code. It may run with later versions of TensorFlow, but this has not been tested. See `requirements.txt` for the other dependencies. To run the code, unzip the files in the `data` directory and execute the run script: `./run.sh`.  
+
+#### References
+
+If you find this paper or repo useful, please cite:
 ```
-"A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"
+"A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"  
 Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian  
 NAACL-HLT 2018
 ```
+
+Another relevant reference is the pointer-generator network by See et al. (2017):
+```
+"Get to the point: Summarization with pointer-generator networks."  
+Abigail See, Peter J. Liu, and Christopher D. Manning.  
+ACL (2017).
+``` 
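
The README's Data section describes `train.txt`, `val.txt`, and `test.txt` as text files with one JSON object per line. A minimal sketch of reading such a file follows. The helper name `iter_papers` is hypothetical and not part of this commit, and the sketch makes no assumption about which fields each JSON object contains.

```python
import json
from typing import Iterable, Iterator


def iter_papers(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line.

    `lines` can be an open text file or any iterable of strings.
    The helper name is hypothetical; it is not part of the repository.
    """
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines rather than failing on them.
            continue
        yield json.loads(line)


# Example usage (assumes train.txt was extracted into the current directory):
#
#     with open("train.txt", encoding="utf-8") as f:
#         first = next(iter_papers(f))
#         print(sorted(first.keys()))  # inspect which fields are present
```

Because the function is a generator, it reads one line at a time, which matters here: the README says the extracted files take about 15 GB.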
diff --git a/__init__.py b/__init__.py