This repository contains YAML data files for the entire ACL Anthology, augmented with data from Semantic Scholar.
IMPORTANT: The ACL Anthology requires Python 3.7 or later; it will not work with Python 3.6.
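If you are unsure which interpreter your `python3` command resolves to, here is a quick check (a minimal sketch, not part of the repository's scripts):

```python
import sys

# Fail fast on interpreters older than 3.7.
assert sys.version_info >= (3, 7), "Python 3.7+ required, found " + sys.version
```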
Run the build_acl_data.sh script:

```sh
sh scripts/build_acl_data.sh
```

OR
To do this step by step, first clone the ACL Anthology repository containing the raw XML data:
```sh
git clone https://github.com/acl-org/acl-anthology.git
```

Next, navigate to the acl-anthology folder and install its dependencies:
```sh
cd acl-anthology
pip3 install -r bin/requirements.txt
```

Create the data export directory:
```sh
mkdir -p build/data
```

Modify the YAML generation script to generate the abstracts without HTML tags:
```sh
sed -i 's/data\[\"abstract_html\"\] = paper.get_abstract(\"html\")/data["abstract"] = paper.get_abstract("plain")/g' \
    bin/create_hugo_yaml.py
```
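After this edit, the abstract-exporting line in `bin/create_hugo_yaml.py` should read as follows (this is simply the replacement side of the `sed` command above):

```python
data["abstract"] = paper.get_abstract("plain")
```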
Generate the cleaned YAML data:

```sh
python3 bin/create_hugo_yaml.py
```

The generated ACL files can now be found in `acl-anthology/build/data/`.
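As a quick sanity check, you can load one of the generated files with PyYAML. The path below is a hypothetical example; point it at whichever YAML file `create_hugo_yaml.py` actually produced under `build/data/`:

```python
import yaml  # pip3 install pyyaml

# Hypothetical path: substitute any YAML file that the export produced.
with open("build/data/papers/P19.yaml") as f:
    papers = yaml.safe_load(f)

# Print a few entries to confirm the abstracts are plain text.
for paper_id, paper in list(papers.items())[:3]:
    print(paper_id, "->", paper.get("abstract", "")[:80])
```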
Next, augment the ACL data with Semantic Scholar. First, download the latest Semantic Scholar open corpus using the AWS CLI; this requires ~120GB of hard disk space. Replace LATEST-CORPUS-DATE with the release you want and SEMANTIC_SCHOLAR_PATH with a local target directory:
```sh
aws s3 cp --no-sign-request --recursive s3://ai2-s2-research-public/open-corpus/LATEST-CORPUS-DATE/ SEMANTIC_SCHOLAR_PATH
```

Unzip all of the compressed files into JSON files using the following script; this requires ~300GB of additional hard disk space.
```sh
# Run inside SEMANTIC_SCHOLAR_PATH. The ${file/.gz*/.json} substitution
# requires bash (it is not POSIX sh).
for file in *.gz; do
    gunzip -c "$file" > "${file/.gz*/.json}"
done
```

Next, filter the Semantic Scholar corpus down to the papers from venues that appear in the ACL Anthology:
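Each extracted corpus file is JSON Lines, i.e. one paper object per line. Here is a minimal sketch for peeking at a few records, assuming field names such as "title" and "venue" from the open corpus format (the file name is a placeholder; adjust both if your release differs):

```python
import json

# Peek at the first few records of one extracted corpus file.
with open("s2-corpus-000.json") as f:  # placeholder file name
    for i, line in enumerate(f):
        paper = json.loads(line)
        print(paper.get("venue"), "|", paper.get("title"))
        if i >= 4:
            break
```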
```sh
python3 scripts/semantic_scholar/filter_acl.py --semantic_scholar_path SEMANTIC_SCHOLAR_PATH --acl_data_path ACL_DATA_PATH
```
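The authoritative logic lives in `scripts/semantic_scholar/filter_acl.py`; conceptually, the venue-based filtering amounts to something like the sketch below. Every name here (the function, the venue set, the output path) is illustrative, not the script's actual API:

```python
import json
import os

def filter_to_acl_venues(semantic_scholar_path, acl_venues, output_path):
    """Keep only Semantic Scholar records whose venue matches a venue
    name drawn from the ACL Anthology data (illustrative sketch)."""
    acl_venues = {v.lower() for v in acl_venues}
    with open(output_path, "w") as out:
        for name in sorted(os.listdir(semantic_scholar_path)):
            if not name.endswith(".json"):
                continue
            with open(os.path.join(semantic_scholar_path, name)) as f:
                for line in f:
                    paper = json.loads(line)
                    if paper.get("venue", "").lower() in acl_venues:
                        out.write(line)
```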