ClassCastException when indexing ACL Anthology #2069

Closed
ygorg opened this issue Mar 1, 2023 · 2 comments

Comments


ygorg commented Mar 1, 2023

When following the "Indexing the ACL Anthology with Anserini" tutorial, the indexing step raises the following exception (see AclAnthology.java:158):

java.lang.ClassCastException: class com.fasterxml.jackson.databind.node.TextNode cannot be cast to class com.fasterxml.jackson.databind.node.ArrayNode (com.fasterxml.jackson.databind.node.TextNode and com.fasterxml.jackson.databind.node.ArrayNode are in unnamed module of loader 'app')
	at io.anserini.collection.AclAnthology$Document.<init>(AclAnthology.java:158) ~[anserini-0.20.0-fatjar.jar:?]
	at io.anserini.collection.AclAnthology$Segment.readNext(AclAnthology.java:115) ~[anserini-0.20.0-fatjar.jar:?]
	at io.anserini.collection.FileSegment$1.hasNext(FileSegment.java:136) ~[anserini-0.20.0-fatjar.jar:?]
	at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:298) [anserini-0.20.0-fatjar.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]

There seems to have been a change in how venues are processed in acl-org/acl-anthology, which breaks Anserini's io.anserini.collection.AclAnthology.

My use case is running castorini/covidex on ACL documents. As a workaround I indexed the ACL Anthology BibTeX instead, but that text contains a lot of brackets and LaTeX markup, so I'd rather use the approach from the tutorial.

Steps to reproduce:

git clone https://github.com/acl-org/acl-anthology
conda create -n acl_anth python=3.8
conda activate acl_anth
cd acl-anthology
pip install -r bin/requirements.txt
python bin/create_hugo_yaml.py 

pip install pyserini
python -m pyserini.index -collection AclAnthology -generator AclAnthologyGenerator -threads 8 -input build/data/ -index index/lucene-index-acl-paragraph -storePositions -storeDocvectors -storeContents -storeRaw -optimize

However, everything works when the acl-anthology checkout is close in date to the creation of the "Indexing the ACL Anthology with Anserini" tutorial:

git clone https://github.com/acl-org/acl-anthology
cd acl-anthology
git checkout -b same_date 9b3f001d2e705d6751118046643de71075836379
# 16/04/2020 acl-anthology commit 9b3f001d2e705d6751118046643de71075836379
# 07/04/2020 creation of tutorial in anserini https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md

ygorg commented Mar 3, 2023

It seems that the venue names are now stored in volume.get("venue"), and that volume.get("venues") is a YAML pointer (?) to volume.get("venue").
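The exception above says that a field Anserini casts to an array now arrives as plain text. As a rough illustration of a reader that tolerates both shapes, here is a minimal Python sketch. It is not the Anserini code (the real fix would belong in AclAnthology.java), the helper names as_list and venue_names are hypothetical, and the assumption that "venue" should be preferred over "venues" rests only on the observation above.

```python
from typing import Any, List, Mapping


def as_list(value: Any) -> List[str]:
    """Normalize a field that may be missing, a single string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is the case that a blind cast to an array rejects.
        return [value]
    return [str(item) for item in value]


def venue_names(volume: Mapping[str, Any]) -> List[str]:
    """Return venue names, trying "venue" first and then "venues".

    The key order is an assumption based on the observation in this issue,
    not on the acl-anthology schema documentation.
    """
    for key in ("venue", "venues"):
        names = as_list(volume.get(key))
        if names:
            return names
    return []
```

With this, a volume parsed from either the old or the new YAML layout yields a list of strings, so downstream code never has to assume one shape.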


lintool commented Mar 6, 2023

Hi @ygorg, thanks for your interest in Anserini, and apologies for the late reply on this. If you've figured out the fix, can you perhaps send a PR?
