Text Simplification System and Dataset

Contact: Wei Xu (Ohio State University)

Code, data and trained models from the following paper:

 @article{Xu-EtAl:2016:TACL,
 author = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch},
 title = {Optimizing Statistical Machine Translation for Text Simplification},
 journal = {Transactions of the Association for Computational Linguistics},
 volume = {4},
 year = {2016},
 url = {https://cocoxu.github.io/publications/tacl2016-smt-simplification.pdf},
 pages = {401--415}
 }

Data

./tacl2016-smt-simplification.pdf the paper

./data/turkcorpus/ tuning and test data

*.norm       tokenized sentences from English Wikipedia

*.simp       tokenized, corresponding sentences from Simple English Wikipedia

*.turk.0~7   8 reference simplifications by 8 different Amazon Mechanical Turkers 
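The *.norm, *.simp, and *.turk files are line-aligned: line i of each file is a version of the same sentence. A minimal sketch of loading the parallel data (the function name and file-handling details are mine, not part of the release):

```python
def load_parallel(norm_path, simp_path):
    """Read line-aligned complex/simple sentence pairs.

    Assumes one tokenized sentence per line, with line i of the .norm
    file aligned to line i of the .simp file, as in ./data/turkcorpus/.
    """
    with open(norm_path, encoding="utf-8") as f_norm, \
         open(simp_path, encoding="utf-8") as f_simp:
        return [(n.rstrip("\n"), s.rstrip("\n"))
                for n, s in zip(f_norm, f_simp)]
```

The same pattern extends to the eight *.turk.N reference files, yielding one source sentence paired with a list of references.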

./data/systemoutputs/ 4 different system outputs compared in the paper

./data/ppdb/ppdb-1.0-xl-all-simp.gz (a 3.8G file) paraphrase rules (PPDB 1.0) with added simplification-specific features

./data/ppdb/ppdb-1.0-xxxl-lexical-self-simp.gz (a 27M file) self-paraphrase lexical rules that map words to themselves, allowing input words to be copied into the output
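Since the XL rule file is 3.8G, it is best streamed rather than loaded whole. A sketch of reading it incrementally, assuming the standard PPDB plain-text layout with fields separated by " ||| " (LHS ||| source phrase ||| target phrase ||| features ||| alignment); the function name is mine:

```python
import gzip

def iter_ppdb_rules(path, limit=None):
    """Stream paraphrase rules from a gzipped PPDB file.

    Assumes one rule per line with ' ||| '-separated fields, e.g.
    [X] ||| very difficult ||| hard ||| <features> ||| <alignment>
    Yields each rule as a list of field strings.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield line.rstrip("\n").split(" ||| ")
```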

Code

./SARI.py a stand-alone Python implementation of the SARI metric for text simplification evaluation

There is also a Java implementation of SARI integrated into the Joshua codebase.
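SARI scores a simplification by comparing the system output against both the input sentence and the human references, rewarding words that are correctly added, kept, and deleted. The official implementation is SARI.py above; the following is only an illustrative, unigram-level, set-based sketch (the function name and simplifications are mine, not the paper's):

```python
def sari_unigram_sketch(source, candidate, references):
    """Toy unigram sketch of SARI: average of add F1, keep F1,
    and delete precision, computed over word sets.
    The real metric (SARI.py) uses n-grams up to n=4 and
    per-reference counts; this is only for intuition.
    """
    s = set(source.split())
    c = set(candidate.split())
    r_union = set().union(*(set(r.split()) for r in references))

    # ADD: words the system introduces, rewarded if some reference has them
    added, ref_added = c - s, r_union - s
    add_good = added & r_union
    p_add = len(add_good) / len(added) if added else 1.0
    r_add = len(add_good) / len(ref_added) if ref_added else 1.0
    f_add = 2 * p_add * r_add / (p_add + r_add) if p_add + r_add else 0.0

    # KEEP: source words retained, rewarded if references retain them too
    kept, ref_kept = c & s, s & r_union
    keep_good = kept & ref_kept
    p_keep = len(keep_good) / len(kept) if kept else 1.0
    r_keep = len(keep_good) / len(ref_kept) if ref_kept else 1.0
    f_keep = 2 * p_keep * r_keep / (p_keep + r_keep) if p_keep + r_keep else 0.0

    # DELETE: source words dropped, rewarded if references drop them (precision only)
    deleted, ref_deleted = s - c, s - r_union
    del_good = deleted & ref_deleted
    p_del = len(del_good) / len(deleted) if deleted else 1.0

    return (f_add + f_keep + p_del) / 3.0
```

A perfect match with the references scores 1.0; partially correct edits score in between, which is what makes SARI usable as a tuning objective.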

The Text Simplification System

The text simplification system is implemented on top of the Joshua MT decoder.

./ppdb-simplification-release-joshua5.0.zip (a 281M file) The experiments in our TACL 2016 paper used Joshua 5.0. Example scripts for training the simplification system are under the directory ./bin/. Note that STAR is the corpus-level version of SARI, while SARI is sentence-level. The configuration file joshua_TACL2016.config, which corresponds to the best system in our paper, is also provided. You may find the Joshua pipeline tutorial useful.