Add clueweb12 diversity task datasets #198
* Add dataset clueweb12/trec-web-2013/diversity
* Add dataset clueweb12/trec-web-2014/diversity
This looks great to me! I added tests for the datasets that have been implemented and added the diversity qrels (for all 6 years) to irds-mirror: seanmacavaney/irds-mirror@c607cf8. It looks fine to go ahead and add the remaining years. I'll hold off until the end to generate the documentation. I also added you to the list of contributors in the readme. Please make sure that the name and affiliation are correct. Thanks again for your help!
850dbac to 9534ef1
Hi! I added the TrecWeb 2009-2012 diversity tasks and corrected the affiliation (I have to update my GitHub :D). I did not manage to run the tests locally, so I did not write tests for the new datasets. With some pointers, however, I could write a "Running tests locally" section in the readme and add the missing tests.
Sorry for the delay in the review. It looks great, I just had to add an import. The trick for running tests locally is to invoke the test as a module, like so: `python -m test.integration.clueweb12`. There's probably a better way. Feel free to create a PR for a "Running tests locally" section in the README :D.
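For reference, the module-style invocation mentioned above could be wrapped roughly as follows. This is only a minimal sketch: it assumes the command is run from the repository root, and the helper names `integration_test_command` and `run_integration_tests` are hypothetical, not part of ir_datasets.

```python
import subprocess
import sys
from typing import List


def integration_test_command(topid: str) -> List[str]:
    """Build the command that runs one integration test module by name."""
    # Hypothetical helper: the module path mirrors `test.integration.clueweb12`.
    return [sys.executable, "-m", f"test.integration.{topid}"]


def run_integration_tests(topid: str) -> int:
    """Run the test module as a subprocess and return its exit code."""
    # Assumes the current working directory is the repository root,
    # so that `test.integration.<topid>` is importable as a module.
    return subprocess.run(integration_test_command(topid)).returncode
```

Calling `run_integration_tests("clueweb12")` would then be equivalent to typing the command by hand.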
Following issue #197.
Question: do we correct the type of `query_id` and `subtopic_id`? (I'd be glad to do it.)
Dataset Information:
ClueWeb09/12 (see #1) diversity tracks
The ClueWeb datasets and the corresponding trec-web tasks are already implemented in ir_datasets, but the diversity track is not.
I began an implementation of the trec-web-2013/2014 diversity tasks (see fork).
Links to Resources:
Dataset ID(s) & supported entities:
The diversity track relies on the already implemented queries and documents. However, the standard qrels are not suitable, since here the qrels are also relative to subtopics.
This is why, in the implementation, I created a new entity, `TrecSubQrel` (for TREC subtopic query relevance).
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
* (`ir_datasets/datasets/[topid].py`)
* (`tests/integration/[topid].py`)
* (`ir_datasets generate_metadata` command, should appear in `ir_datasets/etc/metadata.json`)
* (`ir_datasets/etc/[topid].yaml`)
* (`ir_datasets/etc/downloads.json`)
* (`.github/workflows/verify_downloads.yml`). Only one needed per `topid`.
* `downloads.json`.
Additional comments/concerns/ideas/etc.
This is a first draft of this proposal; I'd be glad to hear any ideas or suggestions!
Once we reach a suitable structure, I can also handle the TREC Web diversity tracks related to ClueWeb09 (2009-2012).
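To make the `TrecSubQrel` idea described above more concrete, here is a rough sketch of what such an entity and a parser for it could look like. The field names, the string-typed ids, and the column order (`query_id subtopic_id doc_id relevance`) are assumptions made for illustration and may not match the actual implementation in the fork.

```python
from typing import Iterable, Iterator, NamedTuple


class TrecSubQrel(NamedTuple):
    # Field names and types are assumptions for illustration only.
    query_id: str
    doc_id: str
    relevance: int
    subtopic_id: str


def parse_sub_qrels(lines: Iterable[str]) -> Iterator[TrecSubQrel]:
    """Parse lines assumed to hold 'query_id subtopic_id doc_id relevance'."""
    for line in lines:
        cols = line.split()
        if not cols:
            continue  # skip blank lines
        if len(cols) != 4:
            raise ValueError(f"expected 4 columns, got {len(cols)}: {line!r}")
        query_id, subtopic_id, doc_id, relevance = cols
        # Ids stay as strings; only the relevance judgment is converted to int.
        yield TrecSubQrel(query_id, doc_id, int(relevance), subtopic_id)
```

A quick usage example with made-up document ids:

```python
sample = ["201 1 doc-a 2", "201 2 doc-a 0"]
for qrel in parse_sub_qrels(sample):
    print(qrel.query_id, qrel.subtopic_id, qrel.doc_id, qrel.relevance)
```

Keeping `query_id` and `subtopic_id` as strings here is one possible answer to the type question raised above, not a settled decision.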