This repository includes:
- a corpus of medical articles segmented into EDUs, following these guidelines (from RST-DT)
- code for preprocessing, postprocessing, and evaluation
Please cite our NAACL DISRPT Workshop paper as:
@inproceedings{ferracane-etal-2019-news,
title = "From News to Medical: Cross-domain Discourse Segmentation",
author = "Ferracane, Elisa and
Page, Titan and
Li, Junyi Jessy and
Erk, Katrin",
booktitle = "Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019",
month = jun,
year = "2019",
address = "Minneapolis, MN",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W19-2704",
pages = "22--29"}
Raw data is in data/raw
Gold segmented data is in data/gold
Notes:
- data/rst-dt_tiny_file_list.txt contains the list of files used from RST-DT (we provide only samples due to licensing)
- version of gold data used in paper is in data/gold_paper
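For reference, the segmented files store one EDU per line (the RST-DT convention assumed here); an invented example, not actual corpus text:

```
Although the trial enrolled 120 patients,
only 90 completed the follow-up,
which limits the conclusions.
```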
We use three publicly available RST segmenters to segment the raw data: NeuralEDUSeg, the Feng and Hirst two-pass parser, and DPLP.
- NeuralEDUSeg: get the code and requirements from https://github.com/PKU-TANGENT/NeuralEDUSeg
- segment:
python run.py --segment --input_files <this_repo>/data/gold/*.out --result_dir <this_repo>/data/segmented/predicted_neural
- rename files:
cd <this_repo>/data/segmented/predicted_neural
for f in *.out; do
mv -- "$f" "${f}.edus"
done
- Feng and Hirst two-pass parser: download the code and requirements to <feng_parser_dir>
- update to a newer version of Stanford CoreNLP:
- download and unzip Stanford Core NLP to <stanford_corenlp_dir>:
curl -O https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-17.zip
- download and unzip Stanford parser to <stanford_parser_dir>:
curl -O https://nlp.stanford.edu/software/stanford-parser-full-2018-10-17.zip
- modify paths in <feng_parser_dir>/src/paths.py to point to <stanford_corenlp_dir> and <stanford_parser_dir>
- replace ParserDemo.java in <stanford_parser_dir> with our version here
- compile new java file:
javac -cp ".:*" ParserDemo.java
- run the parser:
python parse.py -s -g -t <this_repo>/data/segmented/predicted_feng/ -D <this_repo>/data/file_list.txt
- postprocess predicted EDU files:
python parse_predicted.py feng ../data/segmented/predicted_feng/ ../data/segmented/postprocessed_feng/
- DPLP: download the code and requirements
- run DPLP parser:
- use our version of the CoreNLP wrapper script (corenlp_raw.sh) to parse the input:
./corenlp_raw.sh <stanford_corenlp_dir> ../data/raw/ ../data/segmented/preprocessed_dplp/
python convert.py <this_repo>/data/segmented/preprocessed_dplp/
python segmenter.py <this_repo>/data/segmented/preprocessed_dplp/ <this_repo>/data/segmented/predicted_dplp/
- postprocess predicted EDU files:
python parse_predicted.py dplp ../data/segmented/predicted_dplp/ ../data/segmented/postprocessed_dplp/
- evaluate all segmenters against both gold references:
python evaluate_segmentation.py ../data/gold/postprocessed/stanford/ ../data/gold/postprocessed/spacy/ ../data/segmented/postprocessed_dplp/ ../data/segmented/postprocessed_feng/ ../data/segmented/predicted_neural/
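evaluate_segmentation.py scores each segmenter's output against the gold references. As a rough illustration of boundary-level precision/recall/F1 (a hypothetical sketch with invented helper names, not the repo's actual script):

```python
# Hypothetical sketch of boundary-level segmentation scoring (not the
# repo's evaluate_segmentation.py). A document is a list of EDU strings;
# a boundary is the cumulative character offset where one EDU ends and
# the next begins.

def boundaries(edus):
    """Return the set of internal boundary offsets for a list of EDUs."""
    offsets, pos = set(), 0
    for edu in edus[:-1]:  # the end of the final EDU is not an internal boundary
        pos += len(edu)
        offsets.add(pos)
    return offsets

def boundary_f1(gold_edus, pred_edus):
    """F1 over predicted vs. gold internal EDU boundaries."""
    gold, pred = boundaries(gold_edus), boundaries(pred_edus)
    if not gold and not pred:
        return 1.0  # single-EDU document segmented identically
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction that keeps one gold boundary but adds a spurious one scores precision 0.5, recall 1.0, F1 2/3.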
If the segmented gold data is updated, follow these steps to reprocess:
python parse_gold.py edu ../data/gold/ ../data/gold/postprocessed/spacy/ spacy
./corenlp_segmented.sh <stanford_corenlp_dir> ../data/gold/ ../data/gold/postprocessed/stanford/
python parse_gold.py edu ../data/gold/postprocessed/stanford ../data/gold/postprocessed/stanford/ stanford
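After reprocessing, it can be worth confirming that the spacy and stanford postprocessed directories cover the same set of documents. A hypothetical helper (not part of this repo; the .edus suffix is an assumption):

```python
# Hypothetical sanity check: verify two postprocessed gold directories
# contain the same set of segmented files. Not part of the repository.
from pathlib import Path

def matching_files(dir_a, dir_b, suffix=".edus"):
    """Return (True/False, set of filenames present in only one directory)."""
    a = {p.name for p in Path(dir_a).glob(f"*{suffix}")}
    b = {p.name for p in Path(dir_b).glob(f"*{suffix}")}
    return a == b, a ^ b  # equal?, symmetric difference of filenames
```

Usage: `matching_files("../data/gold/postprocessed/spacy/", "../data/gold/postprocessed/stanford/")` returns `(True, set())` when both directories hold the same documents.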