
News to Medical EDU Segmentation

This repository includes:

  • a corpus of medical articles segmented into EDUs, following these guidelines (from RST-DT)
  • code for preprocessing, postprocessing, evaluation

Please cite our NAACL DISRPT Workshop paper as:

  title={From News to Medical: Cross-domain Discourse Segmentation},
  author={Ferracane, Elisa and Page, Titan and Li, Junyi Jessy and Erk, Katrin},
  booktitle={Proceedings of the 7th Workshop on Rhetorical Structure Theory and Related Formalisms},
  publisher={Association for Computational Linguistics},
  pages={in press},


Raw data is in data/raw

Gold segmented data is in data/gold



We use three publicly available RST segmenters to segment the raw data.

1. Neural EDU Segmenter

  1. get code and requirements from here:
  2. segment: python --segment --input_files <this_repo>/data/gold/*.out --result_dir <this_repo>/data/segmented/predicted_neural
  3. rename files:
cd <this_repo>/data/segmented/predicted_neural
for f in *.out; do
 mv -- "$f" "${f}.edus"
done
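
For platforms without a POSIX shell, the rename step can also be sketched in Python. This is a minimal sketch, not part of the repository: the function name add_edus_suffix is hypothetical, and it assumes the segmenter wrote its output as *.out files into the directory passed in.

```python
from pathlib import Path


def add_edus_suffix(directory):
    """Append ".edus" to every *.out file in `directory`.

    Mirrors the shell loop: "x.out" becomes "x.out.edus".
    Returns the list of new paths, sorted by name.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).glob("*.out")):
        target = path.with_name(path.name + ".edus")
        path.rename(target)
        renamed.append(target)
    return renamed


# Example call (replace the path with your own checkout):
# add_edus_suffix("data/segmented/predicted_neural")
```

Files that do not end in .out are left untouched, so running the function a second time is harmless.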

2. Two-pass Feng Parser

  1. download code and requirements to <feng_parser_dir>
  2. update to newer version of Stanford Core NLP:
    1. download and unzip Stanford Core NLP to <stanford_corenlp_dir>: curl -O
    2. download and unzip Stanford parser to <stanford_parser_dir>: curl -O
    3. modify paths in <feng_parser_dir>/src/ to point to <stanford_corenlp_dir> and <stanford_parser_dir>
    4. replace in <stanford_parser_dir> with our version here
    5. compile new java file: javac -cp ".:*"
  3. run parser: python -s -g -t <this_repo>/data/segmented/predicted_feng/ -D <this_repo>/data/file_list.txt
  4. postprocess predicted EDU files: python feng ../data/segmented/predicted_feng/ ../data/segmented/postprocessed_feng/

3. DPLP Parser

  1. download code and requirements
  2. run DPLP parser:
    1. use our version of to parse the input: ./ <stanford_corenlp_dir> ../data/raw/ ../data/segmented/preprocessed_dplp/
    2. python <this_repo>/data/segmented/preprocessed_dplp/
    3. python <this_repo>/data/segmented/preprocessed_dplp/ <this_repo>/data/segmented/predicted_dplp/
  3. postprocess predicted EDU files: python dplp ../data/segmented/predicted_dplp/ ../data/segmented/postprocessed_dplp/


Evaluate

python ../data/gold/postprocessed/stanford/ ../data/gold/postprocessed/spacy/ ../data/segmented/postprocessed_dplp/ ../data/segmented/postprocessed_feng/ ../data/segmented/predicted_neural/
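
To give a rough idea of what comparing gold and predicted segmentations involves, here is a minimal sketch of boundary-level precision, recall, and F1. It is not the repository's evaluation script. It assumes each segmentation is a list of EDU strings with whitespace-separated tokens, that gold and prediction share the same tokenization, and that the document-final boundary is ignored as trivial. The function names boundaries and boundary_prf are hypothetical.

```python
def boundaries(edus):
    """Return the token offsets at which an EDU ends, excluding the document end."""
    offsets = set()
    position = 0
    for edu in edus:
        position += len(edu.split())
        offsets.add(position)
    # The last boundary coincides with the end of the text, so drop it.
    offsets.discard(position)
    return offsets


def boundary_prf(gold_edus, predicted_edus):
    """Compute (precision, recall, F1) over EDU boundary positions."""
    gold = boundaries(gold_edus)
    pred = boundaries(predicted_edus)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["The patient was admitted", "because symptoms persisted", "despite treatment ."]
pred = ["The patient was admitted", "because symptoms persisted despite treatment ."]
print(boundary_prf(gold, pred))
```

In this example the prediction finds one of the two gold boundaries and proposes no spurious ones, so precision is 1.0 and recall is 0.5. Because offsets are counted in tokens, a comparison like this only makes sense when both sides were tokenized the same way.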

Update segmented gold data

If the segmented gold data is updated, follow these steps to reprocess:

  1. python edu ../data/gold/ ../data/gold/postprocessed/spacy/ spacy
  2. ./ <stanford_corenlp_dir> ../data/gold/ ../data/gold/postprocessed/stanford/
  3. python edu ../data/gold/postprocessed/stanford ../data/gold/postprocessed/stanford/ stanford