Anchor Text for msmarco-document and msmarco-document-v2 #154
Comments
It looks like I have the basic integration working. In the current version, the anchor texts pointing to a document are concatenated to produce the text of a GenericDoc. I think it would also be useful to build a second representation in which a document has a list of anchors (i.e., not concatenated), because this could help when training models like DeepCT.
Thanks for contributing @mam10eks! There's no strong rule about where datasets belong in the hierarchy, but I think I lean towards putting them under […]. I agree that both formats would be useful. This comes down to two central use cases identified in #72: cases where the user just wants the text as unstructured as possible (e.g., easy for indexing, re-ranking, etc.) and those where they want all the information the dataset exposes (e.g., your case of a particular way to train DeepCT). We have a plan to address this, but in the meantime, the general approach we've been taking is to provide both. So the doc object could provide: […] Let me know if you have any other questions or need help adding it.
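A minimal sketch of a document record that offers both views discussed above. The class name `AnchorTextDoc`, the helper `make_anchor_doc`, and the field names `text` and `anchors` are assumptions made for illustration; the field list from the original comment did not survive here, so this is not necessarily what the library exposes.

```python
from typing import List, NamedTuple


class AnchorTextDoc(NamedTuple):
    """Hypothetical record; field names are assumptions, not a confirmed API."""
    doc_id: str
    text: str            # all anchors joined into one unstructured string
    anchors: List[str]   # the same anchors kept as separate items


def make_anchor_doc(doc_id: str, anchors: List[str]) -> AnchorTextDoc:
    # Drop empty anchors, then derive the concatenated view from the list.
    cleaned = [a.strip() for a in anchors if a.strip()]
    return AnchorTextDoc(doc_id=doc_id, text=" ".join(cleaned), anchors=cleaned)


doc = make_anchor_doc("D1", ["MS MARCO", " ", "passage ranking dataset"])
print(doc.text)     # unstructured view, convenient for indexing or re-ranking
print(doc.anchors)  # structured view, one entry per anchor
```

Deriving `text` from `anchors` keeps the two views consistent, so a consumer can pick whichever one fits the use case.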
Thanks @seanmacavaney for the feedback! I have changed the implementation accordingly so that the anchor-text documents now provide the […]. The main parts are done, but two things are still missing:
Would it be OK if we merge the current state and you help with generating the metadata and documentation?
Awesome, thanks 🤘! This looks great. I opened a PR for this, and I'll take care of the metadata and documentation.
Nice, thanks! Please let me know if I can help further (I already saw that the automated checks failed, but this seems to be caused by the missing metadata).
@mam10eks, can you accept the PR here with the metadata when you get a chance? mam10eks#1
Excellent, thanks again @mam10eks!
Dataset Information:
We have extracted anchor text pointing to documents in MS MARCO (version 1 and version 2) from several Common Crawl snapshots. The anchor text can be used as additional retrieval features or for training models (e.g., in a distant-supervision style like DeepCT).
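To illustrate the distant-supervision idea, here is a rough sketch that derives per-term weights from anchors, loosely in the spirit of DeepCT's term-recall targets. Treating anchors as stand-ins for queries is an assumption made here, and the function names are hypothetical; this is not the authors' training pipeline.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Simple lowercase alphanumeric tokenizer, enough for a sketch.
    return re.findall(r"[a-z0-9]+", text.lower())


def anchor_term_recall(doc_text: str, anchors: List[str]) -> Dict[str, float]:
    """For each distinct document term, the fraction of anchors containing it."""
    doc_terms = set(tokenize(doc_text))
    if not anchors:
        return {term: 0.0 for term in doc_terms}
    hits: Counter = Counter()
    for anchor in anchors:
        # Count each term at most once per anchor.
        hits.update(set(tokenize(anchor)) & doc_terms)
    return {term: hits[term] / len(anchors) for term in doc_terms}


weights = anchor_term_recall(
    "MS MARCO is a large-scale ranking dataset",
    ["MS MARCO dataset", "the MARCO ranking data"],
)
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

Note that this needs the anchors as separate items: a single concatenated string would lose the per-anchor counts.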
Links to Resources:
Dataset ID(s) & supported entities:
msmarco-document/anchor-text and msmarco-document-v2/anchor-text
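Assuming the IDs above are registered as proposed, the documents could be read through the usual `ir_datasets.load` and `docs_iter()` calls. The helper names `first_docs` and `load_anchor_text` below are hypothetical, and the fields of the returned documents are not confirmed in this thread.

```python
from itertools import islice
from typing import Any, List


def first_docs(dataset: Any, n: int = 3) -> List[Any]:
    """Return the first n documents from any object exposing docs_iter()."""
    return list(islice(dataset.docs_iter(), n))


def load_anchor_text(dataset_id: str = "msmarco-document/anchor-text") -> Any:
    # Imported lazily so first_docs stays usable without the package installed.
    import ir_datasets
    return ir_datasets.load(dataset_id)


# Usage (triggers a download on first access):
#   for doc in first_docs(load_anchor_text()):
#       print(doc)
```

Passing "msmarco-document-v2/anchor-text" to `load_anchor_text` would select the version 2 variant instead.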
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
- ir_datasets/datasets/[topid].py
- tests/integration/[topid].py
- ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json
- ir_datasets/etc/[topid].yaml
- ir_datasets/etc/downloads.json
- .github/workflows/verify_downloads.yml. Only one needed per topid.
- downloads.json
Additional comments/concerns/ideas/etc.
I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good Dataset ID would be. It could make sense to integrate it as a subset of the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give it independent IDs.