News to Medical EDU Segmentation

This repository includes:

  • a corpus of medical articles segmented into EDUs, following these guidelines (from RST-DT)
  • code for preprocessing, postprocessing, and evaluation
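To illustrate what segmentation into EDUs means (a constructed toy example, not a sentence from the corpus), a segmenter splits a sentence into clause-like Elementary Discourse Units:

```python
# Toy illustration of EDU (Elementary Discourse Unit) segmentation.
# The sentence and split point are invented for illustration; the real
# corpus follows the RST-DT segmentation guidelines referenced above.
sentence = ("The drug lowered blood pressure in most patients, "
            "although some reported mild side effects.")

# An RST segmenter emits roughly one EDU per clause, e.g.:
edus = [
    "The drug lowered blood pressure in most patients,",
    "although some reported mild side effects.",
]

# Re-joining the EDUs recovers the original sentence.
assert " ".join(edus) == sentence
```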

Please cite our NAACL DISRPT Workshop paper as:

@inproceedings{ferracane-etal-2019-news,
    title = "From News to Medical: Cross-domain Discourse Segmentation",
    author = "Ferracane, Elisa  and
      Page, Titan  and
      Li, Junyi Jessy  and
      Erk, Katrin",
    booktitle = "Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019",
    month = jun,
    year = "2019",
    address = "Minneapolis, MN",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-2704",
    pages = "22--29"}

Corpus:

Raw data is in data/raw

Gold segmented data is in data/gold

Notes:

Segmentation:

We use three publicly available RST segmenters to segment the raw data.

1. Neural EDU Segmenter

  1. get code and requirements from here: https://github.com/PKU-TANGENT/NeuralEDUSeg
  2. segment: python run.py --segment --input_files <this_repo>/data/gold/*.out --result_dir <this_repo>/data/segmented/predicted_neural
  3. rename files:
cd <this_repo>/data/segmented/predicted_neural  
for f in *.out; do 
 mv -- "$f" "${f}.edus" 
done

2. Two-pass Feng Parser

  1. download code and requirements to <feng_parser_dir>
  2. update to newer version of Stanford Core NLP:
    1. download and unzip Stanford Core NLP to <stanford_corenlp_dir>: curl -O https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-17.zip
    2. download and unzip Stanford parser to <stanford_parser_dir>: curl -O https://nlp.stanford.edu/software/stanford-parser-full-2018-10-17.zip
    3. modify paths in <feng_parser_dir>/src/paths.py to point to <stanford_corenlp_dir> and <stanford_parser_dir>
    4. replace ParserDemo.java in <stanford_parser_dir> with our version here
    5. compile new java file: javac -cp ".:*" ParserDemo.java
  3. run parser: python parse.py -s -g -t <this_repo>/data/segmented/predicted_feng/ -D <this_repo>/data/file_list.txt
  4. postprocess predicted EDU files: python parse_predicted.py feng ../data/segmented/predicted_feng/ ../data/segmented/postprocessed_feng/

3. DPLP Parser

  1. download code and requirements
  2. run DPLP parser:
    1. use our version of corenlp.sh to parse the input: ./corenlp_raw.sh <stanford_corenlp_dir> ../data/raw/ ../data/segmented/preprocessed_dplp/
    2. python convert.py <this_repo>/data/segmented/preprocessed_dplp/
    3. python segmenter.py <this_repo>/data/segmented/preprocessed_dplp/ <this_repo>/data/segmented/predicted_dplp/
  3. postprocess predicted EDU files: python parse_predicted.py dplp ../data/segmented/predicted_dplp/ ../data/segmented/postprocessed_dplp/

Evaluation

python evaluate_segmentation.py ../data/gold/postprocessed/stanford/ ../data/gold/postprocessed/spacy/ ../data/segmented/postprocessed_dplp/ ../data/segmented/postprocessed_feng/ ../data/segmented/predicted_neural/
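evaluate_segmentation.py compares each system's predicted EDUs against the gold segmentation. A minimal sketch of the kind of boundary-level precision/recall/F1 such a comparison computes (the function name and the token-index representation of boundaries are our assumptions, not the script's actual interface):

```python
def boundary_f1(gold_boundaries: set[int],
                pred_boundaries: set[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over EDU boundary positions (token indices)."""
    if not gold_boundaries or not pred_boundaries:
        return 0.0, 0.0, 0.0
    tp = len(gold_boundaries & pred_boundaries)   # correctly predicted boundaries
    precision = tp / len(pred_boundaries)
    recall = tp / len(gold_boundaries)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Example: gold EDUs end at tokens 5, 12, 20; a system predicts 5, 13, 20.
p, r, f = boundary_f1({5, 12, 20}, {5, 13, 20})
```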

Update segmented gold data

If the segmented gold data is updated, follow these steps to reprocess:

  1. python parse_gold.py edu ../data/gold/ ../data/gold/postprocessed/spacy/ spacy
  2. ./corenlp_segmented.sh <stanford_corenlp_dir> ../data/gold/ ../data/gold/postprocessed/stanford/
  3. python parse_gold.py edu ../data/gold/postprocessed/stanford ../data/gold/postprocessed/stanford/ stanford
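The three steps above can be wrapped in a small driver. This sketch only assembles the exact commands listed above (spelled from the repo root rather than with the relative `../` paths); it does not reimplement any of the scripts:

```python
def gold_reprocess_commands(repo: str, corenlp_dir: str) -> list[list[str]]:
    """Return the three gold-reprocessing commands from the steps above, in order."""
    gold = f"{repo}/data/gold"
    return [
        # 1. spaCy postprocessing of the gold EDUs
        ["python", "parse_gold.py", "edu", f"{gold}/",
         f"{gold}/postprocessed/spacy/", "spacy"],
        # 2. Stanford CoreNLP pass over the segmented gold data
        ["./corenlp_segmented.sh", corenlp_dir, f"{gold}/",
         f"{gold}/postprocessed/stanford/"],
        # 3. Stanford postprocessing of the CoreNLP output
        ["python", "parse_gold.py", "edu", f"{gold}/postprocessed/stanford",
         f"{gold}/postprocessed/stanford/", "stanford"],
    ]

# Each command can then be run in order with subprocess.run(cmd, check=True).
```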
