Anchor Text for msmarco-document and msmarco-document-v2 #154

Closed
8 tasks done
mam10eks opened this issue Jan 18, 2022 · 7 comments · Fixed by #155


@mam10eks
Contributor

mam10eks commented Jan 18, 2022

Dataset Information:

We have extracted anchor text pointing to documents in MS MARCO (version 1 and version 2) from several Common Crawl snapshots that can be used as additional retrieval features or for the training of models (e.g., in a distant supervision style like DeepCT).

Links to Resources:

Dataset ID(s) & supported entities:

  • Dataset ID: msmarco-document/anchor-text and msmarco-document-v2/anchor-text

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json)
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good Dataset ID would be: it could make sense to integrate them as a subset under the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give them independent IDs.

@mam10eks
Contributor Author

It looks like I have the basic integration working. In the current version, the anchor texts pointing to a document are concatenated to produce the text of a GenericDoc. It might also be useful to build a second representation in which a document has a list of anchors (i.e., without concatenating them), since this could help when training models like DeepCT.
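
For readers following along, here is a minimal sketch of the concatenation step described above. It is not the code from the actual integration: `GenericDocSketch` is a local stand-in with only a `doc_id` and a `text` field, `concat_anchors` is a hypothetical helper, and the input format (pairs of target document ID and anchor text) and the single-space separator are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class GenericDocSketch(NamedTuple):
    # Local stand-in for illustration only; not imported from ir_datasets.
    doc_id: str
    text: str


def concat_anchors(links: Iterable[Tuple[str, str]]) -> List[GenericDocSketch]:
    """Group (target_doc_id, anchor_text) pairs and join each group's anchors."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, anchor in links:
        anchor = anchor.strip()
        if anchor:  # drop anchors that are empty after stripping whitespace
            grouped[doc_id].append(anchor)
    # Assumption: anchors are joined with a single space, in input order.
    return [
        GenericDocSketch(doc_id, " ".join(anchors))
        for doc_id, anchors in grouped.items()
    ]


# Made-up example data: two links point at D1, one at D2.
links = [("D1", "ms marco"), ("D2", "anchor text"), ("D1", "passage ranking")]
docs = concat_anchors(links)
```

Once the anchors are joined, the boundaries between them are lost, which is the motivation for keeping a separate list representation as well.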

@seanmacavaney
Collaborator

Thanks for contributing @mam10eks!

There's no strong rule about where datasets belong in the hierarchy. But I think I lean towards putting them under msmarco-document/anchor-text and msmarco-document-v2/anchor-text because it feels natural there. There's precedent for something similar, too: cord19 includes titles and abstracts for documents, while cord19/fulltext provides the full text of the articles (which is auxiliary and comes from other files).

I agree that both formats would be useful. This comes down to two central use cases identified in #72: cases where the user just wants the text as unstructured as possible (e.g., easy for indexing, re-ranking, etc.) and those where they want all possible information the dataset exposes (e.g., for your case about a particular way to train DeepCT). We have a plan to address this, but in the meantime, the general approach we've been going with is to provide both. So the doc object could provide: doc_id, text (as str, concat'd version of the anchors), and anchors (as a List[str] individually), + any other fields your dataset provides.
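
As a rough illustration of such a doc object, a sketch along these lines could work. The class name `AnchorTextDoc` and the helper `make_doc` are hypothetical and only mirror the three fields listed above; the single-space join is likewise an assumption.

```python
from typing import List, NamedTuple


class AnchorTextDoc(NamedTuple):
    # Hypothetical record type; the field names follow the suggestion above.
    doc_id: str
    text: str            # all anchors concatenated into one string
    anchors: List[str]   # the individual anchors, kept separate


def make_doc(doc_id: str, anchors: List[str]) -> AnchorTextDoc:
    """Build both representations from the same list of anchors."""
    # Assumption: the concatenated text joins anchors with a single space.
    return AnchorTextDoc(doc_id=doc_id, text=" ".join(anchors), anchors=list(anchors))


doc = make_doc("D1", ["ms marco", "passage ranking"])

# Use case 1: unstructured text, e.g. as input to an indexer or re-ranker.
index_input = doc.text

# Use case 2: structured access, e.g. one training item per individual anchor.
training_items = [(doc.doc_id, anchor) for anchor in doc.anchors]
```

Deriving `text` from `anchors` in one place keeps the two fields consistent, so users who only need the plain text never have to look at the list.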

Let me know if you have any other questions or need help adding it.

@mam10eks
Contributor Author

Thanks @seanmacavaney for the feedback!

I have changed the implementation accordingly so that the anchor-text-documents now provide the doc_id, text, and anchors.

The main parts are done, but two things are still missing:

  • I have not generated the Metadata because it looks like this needs to download all other datasets as well
  • I was able to generate the Documentation, but I had to make a small change to the associated script, since it otherwise failed for some Optional datatypes from other datasets. I would rather not push that change because it is only a workaround. I did check that the generated documentation for the anchor-text looks as expected:
    Screenshot_20220119_083055

Would it be OK if we merge the current state and you help with generating the metadata and documentation?

@seanmacavaney
Collaborator

Awesome, thanks 🤘! This looks great. I opened a PR for this, and I'll take care of the metadata and documentation.

@mam10eks
Contributor Author

Nice, thanks! Please let me know if I can help further (I already saw that the automated checks failed, but this seems to be caused by the missing metadata).

@seanmacavaney
Collaborator

@mam10eks -- can you accept the PR here with the metadata when you get a chance? mam10eks#1

@seanmacavaney
Collaborator

Excellent, thanks again @mam10eks!
