News to Medical EDU Segmentation

This repository includes:

  • a corpus of medical articles segmented into EDUs, following these guidelines (from RST-DT)
  • code for preprocessing, postprocessing, evaluation

Please cite our NAACL DISRPT Workshop paper as:

  title={From News to Medical: Cross-domain Discourse Segmentation},
  author={Ferracane, Elisa and Page, Titan and Li, Junyi Jessy and Erk, Katrin},
  booktitle={Proceedings of the 7th Workshop on Rhetorical Structure Theory and Related Formalisms},
  publisher={Association for Computational Linguistics},
  pages={in press},


Raw data is in data/raw

Gold segmented data is in data/gold



We use three publicly available RST segmenters to segment the raw data.

1. Neural EDU Segmenter

  1. get code and requirements from here:
  2. segment: python --segment --input_files <this_repo>/data/gold/*.out --result_dir <this_repo>/data/segmented/predicted_neural
  3. rename files:
cd <this_repo>/data/segmented/predicted_neural
for f in *.out; do
  mv -- "$f" "${f}.edus"
done
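
If a POSIX shell is not available, the rename step can be approximated in Python. This is a minimal sketch, not part of the repository: the function name `add_edus_suffix` and its parameters are hypothetical, and it assumes the segmenter wrote its output as `*.out` files directly inside the result directory.

```python
from pathlib import Path


def add_edus_suffix(directory, pattern="*.out", suffix=".edus", dry_run=False):
    """Append `suffix` to every file in `directory` whose name matches `pattern`.

    Hypothetical helper mirroring the shell loop: a.out becomes a.out.edus.
    Returns the list of target paths, sorted by name.
    """
    renamed = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        target = path.with_name(path.name + suffix)
        if target.exists():
            # Unlike plain `mv`, refuse to silently overwrite an existing file.
            raise FileExistsError(f"refusing to overwrite {target}")
        if not dry_run:
            path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `add_edus_suffix("<this_repo>/data/segmented/predicted_neural", dry_run=True)` lists the planned renames without touching any file. Running it a second time is a no-op, because `*.out` no longer matches the renamed files.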

2. Two-pass Feng Parser

  1. download code and requirements to <feng_parser_dir>
  2. update to newer version of Stanford Core NLP:
    1. download and unzip Stanford Core NLP to <stanford_corenlp_dir>: curl -O
    2. download and unzip Stanford parser to <stanford_parser_dir>: curl -O
    3. modify paths in <feng_parser_dir>/src/ to point to <stanford_corenlp_dir> and <stanford_parser_dir>
    4. replace in <stanford_parser_dir> with our version here
    5. compile new java file: javac -cp ".:*"
  3. run parser: python -s -g -t <this_repo>/data/segmented/predicted_feng/ -D <this_repo>/data/file_list.txt
  4. postprocess predicted EDU files: python feng ../data/segmented/predicted_feng/ ../data/segmented/postprocessed_feng/

3. DPLP Parser

  1. download code and requirements
  2. run DPLP parser:
    1. use our version of to parse the input: ./ <stanford_corenlp_dir> ../data/raw/ ../data/segmented/preprocessed_dplp/
    2. python <this_repo>/data/segmented/preprocessed_dplp/
    3. python <this_repo>/data/segmented/preprocessed_dplp/ <this_repo>/data/segmented/predicted_dplp/
  3. postprocess predicted EDU files: python dplp ../data/segmented/predicted_dplp/ ../data/segmented/postprocessed_dplp/


Evaluate the three segmenters against the gold data:

python ../data/gold/postprocessed/stanford/ ../data/gold/postprocessed/spacy/ ../data/segmented/postprocessed_dplp/ ../data/segmented/postprocessed_feng/ ../data/segmented/predicted_neural/
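
The command above takes the gold directories and the three predicted directories. To illustrate the kind of comparison involved, here is a minimal sketch of boundary-level precision, recall, and F1. It is not the repository's evaluation code, and the function names are hypothetical. It assumes each document is a list of EDU strings (one EDU per line in the file), that gold and predicted text share the same whitespace tokenization, and that every EDU boundary except the one at the end of the document is scored; the actual script may count boundaries differently.

```python
def boundaries(edus):
    """Return the token offsets at which an EDU ends, excluding the last EDU.

    Assumes tokens are separated by whitespace; empty EDUs are ignored.
    """
    edus = [edu for edu in edus if edu.strip()]
    offsets = set()
    position = 0
    for edu in edus[:-1]:  # the end-of-document boundary is trivial, so skip it
        position += len(edu.split())
        offsets.add(position)
    return offsets


def boundary_prf(gold_edus, pred_edus):
    """Compute (precision, recall, F1) over EDU boundary offsets."""
    gold = boundaries(gold_edus)
    pred = boundaries(pred_edus)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative (made-up) example: the prediction misses the second boundary.
gold = ["The patient was admitted",
        "because symptoms worsened ,",
        "and treatment began ."]
pred = ["The patient was admitted",
        "because symptoms worsened , and treatment began ."]
precision, recall, f1 = boundary_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=1.00 R=0.50 F1=0.67
```

Offsets are only comparable when both sides are tokenized identically, so a sketch like this should be run against the gold directory whose tokenization matches the segmenter being scored.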

Update segmented gold data

If the segmented gold data is updated, follow these steps to reprocess it:

  1. python edu ../data/gold/ ../data/gold/postprocessed/spacy/ spacy
  2. ./ <stanford_corenlp_dir> ../data/gold/ ../data/gold/postprocessed/stanford/
  3. python edu ../data/gold/postprocessed/stanford ../data/gold/postprocessed/stanford/ stanford
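
Before reprocessing, it can help to confirm that an edited gold file still contains exactly the text of its raw counterpart. The check below is a hedged sketch rather than part of the repository: the function names are hypothetical, and it assumes that segmentation only inserts line breaks (one EDU per line) and never changes non-whitespace characters.

```python
def strip_whitespace(text):
    """Remove all whitespace so that only the character content is compared."""
    return "".join(text.split())


def segmentation_preserves_text(raw_text, edu_lines):
    """Return True if the EDUs, joined together, spell out the raw text.

    `edu_lines` is a list of EDU strings, e.g. the lines of a gold file.
    """
    return strip_whitespace(raw_text) == strip_whitespace("".join(edu_lines))


# Illustrative (made-up) example.
raw = "Fever resolved after treatment began."
good = ["Fever resolved\n", "after treatment began.\n"]
bad = ["Fever resolved\n", "after treatment begun.\n"]  # accidental edit
print(segmentation_preserves_text(raw, good))  # True
print(segmentation_preserves_text(raw, bad))   # False
```

A failed check points to a gold file whose text was changed during annotation, which would otherwise surface later as a token mismatch.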
