Add clueweb12 diversity task datasets #198

Merged
merged 10 commits into from
Jun 20, 2022

Conversation

grodino
Contributor

@grodino commented Jun 10, 2022

Follows issue #197.

  • Add dataset clueweb12/trec-web-2013/diversity
  • Add dataset clueweb12/trec-web-2014/diversity

Question: should we correct the types of query_id and subtopic_id? (I'd be glad to do it.)

Dataset Information: ClueWeb09/12 (see #1) diversity tracks

The ClueWeb datasets and the corresponding trec-web tasks are already implemented in ir_datasets, but the diversity track is not.
I have begun an implementation of the trec-web-2013/2014 diversity tasks (see fork).

Links to Resources:

Dataset ID(s) & supported entities:

The diversity track relies on the already implemented queries and documents. However, the standard qrels are not suitable, because here each relevance judgment is also tied to a subtopic.
This is why the implementation adds a new entity, TrecSubQrel (for TREC subtopic query relevance).
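
To illustrate what a subtopic-aware qrel could look like, here is a minimal sketch. It is not the code from this PR: the class name SubtopicQrel, its field names, and the assumed column order of the qrels file (topic, subtopic, document, judgment) are all assumptions, and the sample document IDs are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, NamedTuple, Set


class SubtopicQrel(NamedTuple):
    """Hypothetical stand-in for TrecSubQrel; field names are assumptions."""
    query_id: str
    doc_id: str
    relevance: int
    subtopic_id: str


def parse_diversity_qrels(lines: Iterable[str]) -> Iterator[SubtopicQrel]:
    """Parse whitespace-separated lines, assumed to be: topic subtopic doc judgment."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_id, subtopic_id, doc_id, relevance = line.split()
        yield SubtopicQrel(query_id, doc_id, int(relevance), subtopic_id)


def relevant_docs_by_subtopic(qrels: Iterable[SubtopicQrel]) -> Dict[tuple, Set[str]]:
    """Group documents judged relevant (relevance > 0) by (query_id, subtopic_id)."""
    grouped: Dict[tuple, Set[str]] = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


# Invented sample data, for illustration only.
SAMPLE = """\
201 1 doc-A 1
201 2 doc-A 0
201 2 doc-B 2
"""

if __name__ == "__main__":
    parsed = list(parse_diversity_qrels(SAMPLE.splitlines()))
    print(relevant_docs_by_subtopic(parsed))
```

Keeping the IDs as strings in this sketch sidesteps the type question raised above; whether the real entity should use strings or integers is left open here.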

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json)
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

This is a first draft of the proposal; I'd be glad to hear any ideas or suggestions!
Once we settle on a suitable structure, I can also handle the TREC Web diversity tracks for clueweb09 (2009-2012).

Augustin Godinot and others added 3 commits June 10, 2022 10:21
* Add dataset clueweb12/trec-web-2013/diversity
* Add dataset clueweb12/trec-web-2014/diversity
@seanmacavaney
Collaborator

This looks great to me!

I added tests for the datasets that have been implemented and added the diversity qrels (for all 6 years) to irds-mirror: seanmacavaney/irds-mirror@c607cf8

It looks fine to go ahead and add the remaining years. I'll hold off until the end to generate the documentation.

I also added you to the list of contributors in the readme. Please make sure that the name and affiliation are correct.

Thanks again for your help!

Augustin Godinot and others added 4 commits June 15, 2022 10:48
* Add dataset clueweb12/trec-web-2013/diversity
* Add dataset clueweb12/trec-web-2014/diversity
@grodino
Contributor Author

grodino commented Jun 15, 2022

Hi!
Thanks for adding the tests. I tried running them myself but kept getting relative import errors. Is there a specific way to run the tests locally?

I added the TrecWeb 2009-2012 diversity tasks and corrected the affiliation (I have to update my GitHub profile :D). Because I did not manage to run the tests locally, I did not write tests for the new datasets. With some pointers, however, I could write a "Running tests locally" section in the readme and add the missing tests.

@seanmacavaney
Collaborator

Sorry for the delay in the review. It looks great -- just had to import TrecSubQrel in the cw12 test.

The trick for running tests locally is to invoke it as a module like so:

python -m test.integration.clueweb12

There's probably a better way. Feel free to create a PR for a "Running tests locally" section in the README :D.
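
For context, a common cause of relative import errors is running a file that lives inside a package directly as a script, so that Python does not know its parent package. The sketch below reproduces that in isolation. It builds a throwaway package (demo_pkg, helper.py and check.py are hypothetical names, unrelated to ir_datasets) and runs the same file both ways; whether this is exactly what went wrong in this thread is an assumption.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run one module of a temporary package as a file and then with -m."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "helper.py").write_text("VALUE = 42\n")
        # check.py relies on a relative import, as test modules often do.
        (pkg / "check.py").write_text("from .helper import VALUE\nprint(VALUE)\n")

        # 1) Run as a plain script: the file has no known parent package.
        as_file = subprocess.run(
            [sys.executable, str(pkg / "check.py")],
            capture_output=True, text=True, cwd=tmp,
        )
        # 2) Run as a module from the directory that contains the package.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.check"],
            capture_output=True, text=True, cwd=tmp,
        )
        return as_file, as_module


if __name__ == "__main__":
    as_file, as_module = run_both_ways()
    print("as file  :", as_file.returncode, as_file.stderr.strip().splitlines()[-1:])
    print("as module:", as_module.returncode, as_module.stdout.strip())
```

With -m, the interpreter imports the file under its dotted package path, so the relative import has a package to resolve against. The command has to be run from the directory that contains the top-level package.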

seanmacavaney added a commit to seanmacavaney/ir-datasets.com that referenced this pull request Jun 20, 2022
@seanmacavaney seanmacavaney merged commit cd5e32a into allenai:master Jun 20, 2022