Direct access to all doc_ids #184

@ArthurCamara

I expected this to be straightforward, or at least better documented in the API, but it doesn't seem to be.
Say I want to gather all doc_ids from a given corpus (for instance, to use a random negative sampler at run time).
Currently, this is what I do:

import ir_datasets

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())

This works, but as far as I can tell it triggers an iteration over all docs in the collection (and reaching through `_handler` is not very intuitive).
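For comparison, here is a minimal sketch of the same thing through the public `docs_iter()` iterator, plus the kind of random negative sampler this list would feed. It is no faster, since it still walks the whole collection once; it only avoids the private attributes. `collect_doc_ids` and `sample_negatives` are hypothetical helper names, not part of ir_datasets, and the sampler assumes the doc_ids are unique.

```python
import random
from typing import List, Sequence, Set


def collect_doc_ids(dataset) -> List[str]:
    """Collect every doc_id with one full pass over the corpus."""
    # docs_iter() is the public ir_datasets iterator; each doc exposes .doc_id.
    return [doc.doc_id for doc in dataset.docs_iter()]


def sample_negatives(
    all_doc_ids: Sequence[str],
    positives: Set[str],
    k: int,
    rng: random.Random,
) -> List[str]:
    """Hypothetical helper: draw k distinct doc_ids that are not positives."""
    # Conservative guard (assumes unique ids) so the loop below always ends.
    if k > len(all_doc_ids) - len(positives):
        raise ValueError("not enough non-positive documents to sample from")
    excluded = set(positives)
    negatives: List[str] = []
    while len(negatives) < k:
        candidate = rng.choice(all_doc_ids)
        if candidate not in excluded:
            excluded.add(candidate)  # keep the sampled negatives distinct
            negatives.append(candidate)
    return negatives
```

With the real library this would be called as `collect_doc_ids(ir_datasets.load("msmarco-document/train"))`. Saving the resulting list to disk once would at least avoid repeating the full pass on every run.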

Is there a better way to achieve this?
