NEDMed Dataset

This dataset has been released alongside "Improving Few-Shot Domain Transfer for Named Entity Disambiguation with Pattern Exploitation" by BasisTech.

The dataset is split into NEDMed-train and NEDMed-dev. Each file contains a single JSON document, which has been annotated with entities. The documentMetadata field contains various pieces of metadata relating to the source document and/or the annotators/adjudicators.

Within the nedmed-train and nedmed-dev folders, there are two subfolders:

raw contains the annotated data (containing entities and their Wikidata QIDs), without any redactions
with-candidates contains the annotated data along with the list of candidates used in "Improving Few-Shot Domain Transfer for Named Entity Disambiguation with Pattern Exploitation". This is intended for the evaluation of disambiguation-only models.

Here is an example of an entity:

        {
          "mentions": [
            {
              "startOffset": 85,
              "endOffset": 101
            }
          ],
          "type": "LOC",
          "entityId": "Q16572",
          "candidates": [
            {
              "enwikiId": "en_12537",
              "enwikiPage": "https://en.wikipedia.org/?curid=12537",
              "qid": "Q16572",
              "title": "Guangzhou"
            }
          ],
          "goldIdx": 0
        },

mentions[0] indicates the location in the document's data field of the mention.

candidates and goldIdx are only found inside of the with-candidates subfolders. Each candidate contains the English Wikipedia page ID (as of January 2022) and Wikidata QID. The goldIdx field indicates which entry in the candidate list is the zero-indexed correct answer (which can also be deduced from the QID in the entityId field), or -1 if the candidate is not in the list.

Data Sources

The raw (pre-annotated) data consists of mental health news articles collected from Wikinews and The Conversation.

License Information

The raw data (found in the data field of each JSON object) which has been annotated is released under CC BY 2.5 (for data from Wikinews) and CC BY-ND 4.0 (for data from The Conversation) licenses. Each JSON file contains the license of the raw data in the documentMetadata.license field. The annotations collected by BasisTech are released under the CC BY-SA 4.0, as specified in the LICENSE file.

Citing

@inproceedings{blair-bar-2022-improving-few-shot,
    title = "Improving Few-Shot Domain Transfer for Named Entity Disambiguation with Pattern Exploitation",
    editor = "Blair, Philip  and
      Bar, Kfir",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NEDMed Dataset

Data Sources

License Information

Citing

About

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
nedmed-dev		nedmed-dev
nedmed-train		nedmed-train
LICENSE		LICENSE
README.md		README.md

License

basis-technology-corp/NEDMed

Folders and files

Latest commit

History

Repository files navigation

NEDMed Dataset

Data Sources

License Information

Citing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages