Anchor Text for msmarco-document and msmarco-document-v2 #154

mam10eks · 2022-01-18T07:40:25Z

Dataset Information:

We have extracted anchor text pointing to documents in MS MARCO (version 1 and version 2) from several Common Crawl snapshots that can be used as additional retrieval features or for the training of models (e.g., in a distant supervision style like DeepCT).

Links to Resources:

Dataset ID(s) & supported entities:

Dataset ID: msmarco-document/anchor-text and msmarco-document-v2/anchor-text

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

- Dataset definition (in ir_datasets/datasets/[topid].py)
- Tests (in tests/integration/[topid].py)
- Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
- Documentation (in ir_datasets/etc/[topid].yaml)
  - Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- Downloadable content (in ir_datasets/etc/downloads.json)
  - Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
  - Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good Dataset ID would be: it could make sense to integrate it as a subset of the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give it independent IDs.


mam10eks · 2022-01-18T10:44:05Z

It looks like I have the basic integration working. In the current version, the anchor texts pointing to a document are concatenated to produce the text of a GenericDoc. I think it might also be helpful to build a second representation where a document has a list of anchors (i.e., not concatenating them) because this might be helpful for training models like DeepCT.

seanmacavaney · 2022-01-18T11:05:33Z

Thanks for contributing @mam10eks!

There's no strong rule about where datasets belong in the hierarchy. But I think I lean towards putting them under msmarco-document/anchor-text and msmarco-document-v2/anchor-text because it feels natural there. There's precedent for something similar to this, too, e.g., cord19 includes titles and abstracts for documents, while cord19/fulltext provides the article full text content (which is auxiliary and comes from other files).

I agree that both formats would be useful. This comes down to two central use cases identified in #72 -- cases where the user just wants the text as unstructured as possible (e.g., easy for indexing, re-ranking, etc.) and those where they want all possible information the dataset exposes (e.g., for your case about a particular way to train DeepCT). We have a plan to address this, but in the meantime, the general approach we've been taking is to provide both. So the doc object could provide: doc_id, text (as str, concat'd version of the anchors), and anchors (as a List[str] individually), + any other fields your dataset provides.
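A rough sketch of what such a doc object could look like is below. The class name `AnchorTextDoc` and the helper `make_doc` are hypothetical, and joining the anchors with a single space is an assumption for illustration, not necessarily how ir_datasets builds the text field.

```python
from typing import List, NamedTuple


class AnchorTextDoc(NamedTuple):
    """Hypothetical doc type; the name is illustrative, not taken from ir_datasets."""
    doc_id: str
    text: str            # all anchors concatenated, convenient for indexing or re-ranking
    anchors: List[str]   # the individual anchor texts, kept separate


def make_doc(doc_id: str, anchors: List[str]) -> AnchorTextDoc:
    """Build a doc that exposes both the concatenated and the per-anchor view."""
    # Assumption: anchors are joined with a single space.
    return AnchorTextDoc(doc_id=doc_id, text=" ".join(anchors), anchors=list(anchors))


# Made-up example values, not real MS MARCO data.
doc = make_doc("D123", ["MS MARCO", "document ranking dataset"])
print(doc.text)          # unstructured view
print(len(doc.anchors))  # structured view, one entry per anchor
```

Keeping both fields means users who only need plain text can read `text`, while training setups that need individual anchors can iterate over `anchors`.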

Let me know if you have any other questions or need help adding it.

mam10eks · 2022-01-19T07:54:58Z

Thanks @seanmacavaney for the feedback!

I have changed the implementation accordingly so that the anchor-text-documents now provide the doc_id, text, and anchors.

The main parts are done, but two things are still missing:

- I have not generated the metadata because it looks like this requires downloading all other datasets as well.
- I was able to generate the documentation, but I had to change a small thing in the associated script, since otherwise the script failed for some Optional datatypes from other datasets. I would rather not push my changes to the script since they are only a workaround. But I checked that the generated documentation for the anchor-text looks as expected.

Would it be OK if we merge the current state and you help with generating the metadata and documentation?

seanmacavaney · 2022-01-19T13:57:19Z

Awesome, thanks 🤘! This looks great. I opened a PR for this, and I'll take care of the metadata and documentation.

mam10eks · 2022-01-19T15:00:17Z

Nice, thanks! Please let me know if I can help further (I already saw that the automated checks failed, but this seems to be caused by the missing metadata).

seanmacavaney · 2022-01-20T19:10:52Z

@mam10eks -- can you accept the PR here with the metadata when you get a chance? mam10eks#1

seanmacavaney · 2022-01-20T19:58:42Z

Excellent, thanks again @mam10eks!

mam10eks added the add-dataset label Jan 18, 2022

mam10eks added a commit to mam10eks/ir_datasets that referenced this issue Jan 18, 2022

Start to integrate the anchor-text dataset for MS-Marco (allenai#154)

eb0358a

seanmacavaney mentioned this issue Jan 19, 2022

Anchor Text for msmarco-document and msmarco-document-v2 #155

Merged

seanmacavaney closed this as completed in #155 Jan 20, 2022
