# Adding a Dataset

To add a new dataset to MTEB, you need to do three things:

  1. Implement a task with the desired dataset, by subclassing an abstract task
  2. Add metadata to the task
  3. Submit the edits to the MTEB repository

If you have any questions regarding this process, feel free to open a discussion thread.

Note: when we mention adding a dataset, we mean adding a subclass of one of the AbsTask classes.

## 1) Creating a new subclass

### A Simple Example

To add a new task, you need to implement a new class that inherits from the AbsTask associated with the task type (e.g. AbsTaskReranking for reranking tasks). You can find the supported task types here.

```python
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer
from mteb.abstasks.TaskMetadata import TaskMetadata

class SciDocsReranking(AbsTaskReranking):
    metadata = TaskMetadata(
        name="SciDocsRR",
        description="Ranking of related scientific papers based on their title.",
        reference="https://allenai.org/data/scidocs",
        type="Reranking",
        category="s2s",
        eval_splits=["test"],
        eval_langs=["eng-Latn"],
        main_score="map",
        dataset={
            "path": "mteb/scidocs-reranking",
            "revision": "d3c5e1fc0b855ab6097bf1cda04dd73947d7caab",
        },
        date=("2000-01-01", "2020-12-31"), # best guess
        form="written",
        domains=["Academic", "Non-fiction"],
        task_subtypes=["Scientific Reranking"],
        license="cc-by-4.0",
        socioeconomic_status="high",
        annotations_creators="derived",
        dialect=[],
        text_creation="found",
        n_samples={"test": 19599},
        avg_character_length={"test": 69.0},
        bibtex_citation="""
@inproceedings{cohan-etal-2020-specter,
    title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers",
    author = "Cohan, Arman  and
      Feldman, Sergey  and
      Beltagy, Iz  and
      Downey, Doug  and
      Weld, Daniel",
    editor = "Jurafsky, Dan  and
      Chai, Joyce  and
      Schluter, Natalie  and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.207",
    doi = "10.18653/v1/2020.acl-main.207",
    pages = "2270--2282",
    abstract = "Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose SPECTER, a new method to generate document-level embedding of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, Specter can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. We show that Specter outperforms a variety of competitive baselines on the benchmark.",
}
""",
    )

# testing the task with a model:
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[SciDocsReranking()])
evaluation.run(model)
```

Note: for multilingual or cross-lingual tasks, make sure your class also inherits from the MultilingualTask class, as in this example.
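As a rough illustration, a multilingual task usually combines the relevant AbsTask with MultilingualTask and declares eval_langs per Hugging Face subset. The class name and subset below are placeholders, and the exact import paths and the shape of eval_langs may differ slightly between mteb versions:

```python
from mteb.abstasks import MultilingualTask
from mteb.abstasks.AbsTaskBitextMining import AbsTaskBitextMining


class MyMultilingualTask(AbsTaskBitextMining, MultilingualTask):
    # In the metadata, eval_langs then maps each Hugging Face subset to the
    # languages it contains, e.g. eval_langs={"nb-nn": ["nob-Latn", "nno-Latn"]}.
    metadata = ...  # TaskMetadata built as in the examples above
```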

### A Detailed Example

Often the dataset on Hugging Face is not in the format expected by MTEB. To resolve this, you can either change the format on Hugging Face or add a dataset_transform method to your task that transforms the dataset into the right format on the fly. Here is an example along with some design considerations:

```python
import random

import datasets

from mteb.abstasks.AbsTaskClustering import AbsTaskClustering
from mteb.abstasks.TaskMetadata import TaskMetadata


class VGClustering(AbsTaskClustering):
    metadata = TaskMetadata(
        name="VGClustering",
        description="Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.",
        reference="https://huggingface.co/datasets/navjordj/VG_summarization",
        type="Clustering",
        category="p2p",
        eval_splits=["test"],
        eval_langs=["nob-Latn"],
        main_score="v_measure",
        dataset={
            "path": "navjordj/VG_summarization",
            "revision": "d4c5a8ba10ae71224752c727094ac4c46947fa29",
        },
        date=("2012-01-01", "2020-01-01"),
        form="written",
        domains=["News", "Non-fiction"],
        task_subtypes=["Thematic clustering"],
        license="cc-by-nc",
        socioeconomic_status="high",
        annotations_creators="derived",
        dialect=[],
        text_creation="found",
        bibtex_citation=...,  # removed for brevity
    )

    def dataset_transform(self):
        splits = self.metadata.eval_splits

        documents: list = []
        labels: list = []
        label_col = "classes"

        ds = {}
        for split in splits:
            ds_split = self.dataset[split]

            _label = self.normalize_labels(ds_split[label_col])
            documents.extend(ds_split["title"])
            labels.extend(_label)

            documents.extend(ds_split["ingress"])
            labels.extend(_label)

            documents.extend(ds_split["article"])
            labels.extend(_label)

            assert len(documents) == len(labels)

            rng = random.Random(1111)  # local RNG only; leaves the global seed untouched
            pairs = list(zip(documents, labels))
            rng.shuffle(pairs)
            documents, labels = [list(collection) for collection in zip(*pairs)]

            # To get a more robust estimate, we create batches of size 512; this choice can vary depending on the dataset.
            documents_batched = list(batched(documents, 512))
            labels_batched = list(batched(labels, 512))

            # Reduce the size of the dataset, as we obtain consistent scores (even if we change the seed)
            # with only 512x4 samples.
            documents_batched = documents_batched[:4]
            labels_batched = labels_batched[:4]


            ds[split] = datasets.Dataset.from_dict(
                {
                    "sentences": documents_batched,
                    "labels": labels_batched,
                }
            )

        self.dataset = datasets.DatasetDict(ds)
```
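Note that the transform above relies on a batched helper to chunk the documents and labels. On Python 3.12+ this is available as itertools.batched; on older interpreters you can define a small equivalent yourself, for example:

```python
from itertools import islice


def batched(iterable, n):
    # Yield successive n-sized tuples from iterable (mirrors itertools.batched, Python 3.12+).
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch
```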

## 2) Creating the metadata object

Along with the task itself, MTEB requires metadata about it. If some of the metadata isn't available, please provide your best guess or leave the field as None.

To get an overview of the fields in the metadata object, you can look at the TaskMetadata class.
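If you just want a quick list of the available fields, one option is to inspect the class directly. This sketch assumes the installed version of mteb defines TaskMetadata as a pydantic (v2) model, which may not hold for every release:

```python
from mteb.abstasks.TaskMetadata import TaskMetadata

# Print every metadata field and whether it is required or can be left as None.
for name, field in TaskMetadata.model_fields.items():
    print(f"{name}: {'required' if field.is_required() else 'optional'}")
```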

Note that these fields can be left blank if the information is not available and can be extended if necessary. We do not include any machine-translated (without verification) datasets in the benchmark.

### Domains

The domains follow the categories used in the Universal Dependencies project, though we updated them where deemed appropriate. These do not have to be mutually exclusive.

| Domain | Description |
| --- | --- |
| Academic | Academic writing |
| Religious | Religious texts, e.g. bibles |
| Blog | Blog posts, weblogs, etc. |
| Fiction | Works of fiction |
| Government | Governmental communication, websites, or similar |
| Legal | Legal documents, laws, etc. |
| Medical | Doctors' notes, medical procedures, or similar |
| News | News articles, tabloids, etc. |
| Reviews | Reviews, e.g. user reviews of products |
| Non-fiction | Non-fiction writing |
| Poetry | Poems, epics, etc. |
| Social | Social media content |
| Spoken | Spoken dialogues |
| Encyclopaedic | Encyclopaedic entries, e.g. Wikipedia |
| Web | Web content |

### Task Subtypes

These task subtypes were introduced in the Scandinavian Embedding Benchmark and are intended to be extended as needed.

| Formalization | Task | Description |
| --- | --- | --- |
| Retrieval | | Retrieval focuses on locating and providing relevant information or documents based on a query. |
| | Question answering | Finding answers to queries in a dataset, focusing on exact answers or relevant passages. |
| | Article retrieval | Identifying and retrieving full articles that are relevant to a given query. |
| Bitext Mining | | Bitext mining involves identifying parallel texts across languages or dialects for translation or analysis. |
| | Dialect pairing | Identifying pairs of text that are translations of each other across different dialects. |
| Classification | | Classification is the process of categorizing text into predefined groups or classes based on their content. |
| | Political | Categorizing text according to political orientation or content. |
| | Language Identification | Determining the language in which a given piece of text is written. |
| | Linguistic Acceptability | Assessing whether a sentence is grammatically correct according to linguistic norms. |
| | Sentiment/Hate Speech | Detecting the sentiment of text or identifying hate speech within the content. |
| | Dialog Systems | Creating or evaluating systems capable of conversing with humans in a natural manner. |
| Clustering | | Clustering involves grouping sets of texts together based on their similarity without pre-defined labels. |
| | Thematic Clustering | Grouping texts based on their thematic similarity without prior labeling. |
| Reranking | | Reranking adjusts the order of items in a list to improve relevance or accuracy according to specific criteria. |
| Pair Classification | | Pair classification assesses relationships between pairs of items, such as texts, to classify their connection. |
| STS | | Semantic Textual Similarity measures the degree of semantic equivalence between two pieces of text. |

## Submit a PR

Once you are finished, create a PR to the MTEB repository. If you haven't created a PR before, please refer to the GitHub documentation.

The PR will be reviewed by one of the organizers or contributors, who might ask you to change things. Once the PR is approved, the dataset will be added to the main repository.

Before submitting, here is a checklist you should consider completing:

- I have tested that the dataset runs with the mteb package.

An easy way to test it is using:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=[YourNewTask()])
evaluation.run(model)
```
- I have run the following models on the task (adding the results to the PR). These can be run using the `mteb -m {model_name} -t {task_name}` command; a Python alternative is sketched after this list.
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- I have checked that the performance is neither trivial (both models achieve near-perfect scores) nor random (both models score close to random).
- I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks).
- Run tests locally to make sure nothing is broken, using `make test`.
- Run the formatter to format the code, using `make lint`.
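If you prefer to run the two reference models from Python rather than the CLI, a rough equivalent looks like the following (the output_folder layout here is just an assumption; use whatever structure you attach to the PR):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=[YourNewTask()])
    # Write one result folder per model so the scores can be attached to the PR.
    evaluation.run(model, output_folder=f"results/{model_name}")
```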