This repository contains YAML data files for the entire ACL Anthology, augmented with data from Semantic Scholar.
IMPORTANT: The ACL Anthology requires Python 3.7 or later; it will not work with Python 3.6.
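If you are unsure which interpreter your `python3` command resolves to, here is a quick check (a minimal sketch, not part of the repository's scripts):

```python
import sys

# Fail fast on interpreters older than 3.7.
assert sys.version_info >= (3, 7), "Python 3.7+ required, found " + sys.version
```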
Run the build_acl_data.sh script:

```sh
sh scripts/build_acl_data.sh
```

OR
To do this step by step, first clone the ACL Anthology repository containing the raw XML data:
```sh
git clone https://github.com/acl-org/acl-anthology.git
```

Next, navigate to the acl-anthology folder and install its dependencies:
```sh
cd acl-anthology
pip3 install -r bin/requirements.txt
```

Create the data export directory:
```sh
mkdir -p build/data
```

Modify the YAML generation script to generate the abstracts without HTML tags:
```sh
sed -i 's/data\[\"abstract_html\"\] = paper.get_abstract(\"html\")/data["abstract"] = paper.get_abstract("plain")/g' \
    bin/create_hugo_yaml.py
```
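After this edit, the abstract-exporting line in `bin/create_hugo_yaml.py` should read as follows (this is simply the replacement side of the `sed` command above):

```python
data["abstract"] = paper.get_abstract("plain")
```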
Generate the cleaned YAML data:

```sh
python3 bin/create_hugo_yaml.py
```

The generated ACL files can now be found in `acl-anthology/build/data/`.
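As a quick sanity check, you can load one of the generated files with PyYAML. The path below is a hypothetical example; point it at whichever YAML file `create_hugo_yaml.py` actually produced under `build/data/`:

```python
import yaml  # pip3 install pyyaml

# Hypothetical path: substitute any YAML file that the export produced.
with open("build/data/papers/P19.yaml") as f:
    papers = yaml.safe_load(f)

# Print a few entries to confirm the abstracts are plain text.
for paper_id, paper in list(papers.items())[:3]:
    print(paper_id, "->", paper.get("abstract", "")[:80])
```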
Next, augment the ACL data with Semantic Scholar. First, download the latest Semantic Scholar open corpus using the AWS CLI; this requires ~120GB of hard disk space. Replace LATEST-CORPUS-DATE with the release you want and SEMANTIC_SCHOLAR_PATH with a local target directory:
```sh
aws s3 cp --no-sign-request --recursive s3://ai2-s2-research-public/open-corpus/LATEST-CORPUS-DATE/ SEMANTIC_SCHOLAR_PATH
```

Unzip all of the compressed files into JSON files using the following script; this requires ~300GB of additional hard disk space.
```sh
# Run inside SEMANTIC_SCHOLAR_PATH. The ${file/.gz*/.json} substitution
# requires bash (it is not POSIX sh).
for file in *.gz; do
    gunzip -c "$file" > "${file/.gz*/.json}"
done
```

Next, filter the Semantic Scholar corpus down to the papers from venues that appear in the ACL Anthology:
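Each extracted corpus file is JSON Lines, i.e. one paper object per line. Here is a minimal sketch for peeking at a few records, assuming field names such as "title" and "venue" from the open corpus format (the file name is a placeholder; adjust both if your release differs):

```python
import json

# Peek at the first few records of one extracted corpus file.
with open("s2-corpus-000.json") as f:  # placeholder file name
    for i, line in enumerate(f):
        paper = json.loads(line)
        print(paper.get("venue"), "|", paper.get("title"))
        if i >= 4:
            break
```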
```sh
python3 scripts/semantic_scholar/filter_acl.py --semantic_scholar_path SEMANTIC_SCHOLAR_PATH --acl_data_path ACL_DATA_PATH
```
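The authoritative logic lives in `scripts/semantic_scholar/filter_acl.py`; conceptually, the venue-based filtering amounts to something like the sketch below. Every name here (the function, the venue set, the output path) is illustrative, not the script's actual API:

```python
import json
import os

def filter_to_acl_venues(semantic_scholar_path, acl_venues, output_path):
    """Keep only Semantic Scholar records whose venue matches a venue
    name drawn from the ACL Anthology data (illustrative sketch)."""
    acl_venues = {v.lower() for v in acl_venues}
    with open(output_path, "w") as out:
        for name in sorted(os.listdir(semantic_scholar_path)):
            if not name.endswith(".json"):
                continue
            with open(os.path.join(semantic_scholar_path, name)) as f:
                for line in f:
                    paper = json.loads(line)
                    if paper.get("venue", "").lower() in acl_venues:
                        out.write(line)
```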