This section lists publicly available Danish NLP datasets.
Dataset | Task | Words | Sents | License | DaNLP |
---|---|---|---|---|---|
OpenSubtitles2018 | Translation | 206,700,000 | 30,178,452 | None | ❌ |
EU Bookshop | Translation | 208,175,843 | 8,650,537 | - | ❌ |
Europarl7 | Translation | 47,761,381 | 2,323,099 | None | ❌ |
ParaCrawl5 | Translation | - | - | CC0 | ❌ |
WikiANN | NER | 832,901 | 95,924 | ODC-BY 1.0 | ✔️ |
UD-DDT (DaNE) | DEP, POS, NER | 100,733 | 5,512 | CC BY-SA 4.0 | ✔️ |
LCC Sentiment | Sentiment | 10,588 | 499 | CC BY | ✔️ |
Europarl Sentiment1 | Sentiment | 3,359 | 184 | None | ✔️ |
Europarl Sentiment2 | Sentiment | - | 957 | CC BY-SA 4.0 | ✔️ |
Wikipedia | Raw | - | - | CC BY-SA 3.0 | ❌ |
WordSim-353 | Word Similarity | 353 | - | CC BY 4.0 | ✔️ |
Danish Similarity Dataset | Word Similarity | 99 | - | CC BY 4.0 | ✔️ |
Twitter Sentiment | Sentiment | - | train: 1215, test: 512 | Twitter privacy policy applies | ✔️ |
Dacoref | Coreference resolution | 64,076 (tokens) | 3,403 | GNU Public License version 2 | ✔️ |
DanNet | Wordnet | 66,308 (concepts) | - | license | ✔️ |
It is also recommended to check out Finn Årup Nielsen's dasem GitHub repository, which provides scripts for loading different Danish corpora.
The Danish UD treebank (UD-DDT; Johannsen et al. 2015) is a conversion of the Danish Dependency Treebank (Buch-Kromann et al. 2003), which is based on texts from Parole (Keson, 1998). UD-DDT has annotations for dependency parsing and part-of-speech (POS) tagging. The Alexandra Institute added named-entity annotations for PER, ORG and LOC, released as the DaNE dataset (Hvingelby et al. 2020). To read more about how the dataset was annotated with POS and DEP tags, see the Universal Dependencies page. The dataset can be loaded with the DaNLP package:
from danlp.datasets import DDT
ddt = DDT()
spacy_corpus = ddt.load_with_spacy()
flair_corpus = ddt.load_with_flair()
conllu_format = ddt.load_as_conllu()
The dataset can also be downloaded directly in CoNLL-U format.
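If you download the CoNLL-U file directly, it can be read without any extra dependencies. The following is a minimal sketch, assuming the standard ten tab-separated CoNLL-U columns; the function name `parse_conllu` and the two-token sample sentence are made up for illustration and are not taken from the dataset or from DaNLP.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        else:
            current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample, written for illustration only.
SAMPLE = (
    "# text = Hun sover\n"
    "1\tHun\thun\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsover\tsove\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

This sketch ignores multiword-token ranges and comment metadata; a dedicated CoNLL-U library is a better choice when those matter.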
This Danish coreference annotation contains parts of the Copenhagen Dependency Treebank (Kromann and Lynge, 2004). It was originally annotated as part of the Copenhagen Dependency Treebank (CDT) project, but the annotation was never finished. This resource extends the annotation by using different mapping techniques and by augmenting it with Qcodes from Wiktionary. The work was conducted by Maria Jung Barrett. Read more about it in the dedicated dacoref docs.
The dataset can be loaded with the DaNLP package:
from danlp.datasets import Dacoref
dacoref = Dacoref()
# With predefined_splits=True the corpus is returned as a list in the order train, dev, test;
# with predefined_splits=False it is loaded without splitting
corpus = dacoref.load_as_conllu(predefined_splits=True)
The dataset can also be downloaded directly.
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset can be loaded with the DaNLP package:
from danlp.datasets import WikiAnn
wikiann = WikiAnn()
spacy_corpus = wikiann.load_with_spacy()
flair_corpus = wikiann.load_with_flair()
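NER annotations of this kind are commonly stored as one tag per token in the BIO scheme (`B-PER`, `I-PER`, `O`, ...). The sketch below assumes that scheme and shows how token-level tags can be grouped into entity spans; `bio_to_spans` is a hypothetical helper, not part of DaNLP, and the example sentence is invented.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and label == tag[2:]:
            # Continuation of the entity currently being built.
            words.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if words:
            spans.append((" ".join(words), label))
            words, label = [], None
        if tag != "O":
            # "B-X" starts a new entity; a stray "I-X" is treated the same way.
            words, label = [token], tag[2:]
    if words:
        spans.append((" ".join(words), label))
    return spans


# Invented example sentence with PER and LOC entities.
tokens = ["Niels", "Bohr", "blev", "født", "i", "København"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```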
The WordSim-353 dataset (Finkelstein et al. 2002) contains word pairs annotated with a similarity score (0-10). It is commonly used for intrinsic evaluation of word embeddings, testing for syntactic or semantic relationships between words. The dataset has been translated to Danish by Finn Årup Nielsen.
The Danish Similarity Dataset consists of 99 word pairs annotated by 38 annotators with a similarity score (1-6). It is constructed with frequently used Danish words.
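A common way to use word-similarity datasets such as these is to compare the human scores with the cosine similarity of the corresponding word vectors, using Spearman's rank correlation. The sketch below illustrates that idea with the standard library only; the two-dimensional vectors and the human scores are toy values invented for the example, not taken from either dataset.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy embeddings and human scores, invented for illustration.
embeddings = {
    "hund": [1.0, 0.1],
    "kat": [0.9, 0.2],
    "bil": [0.1, 1.0],
    "vej": [0.4, 0.8],
}
pairs = [("hund", "kat", 8.5), ("bil", "vej", 6.0), ("hund", "bil", 1.5)]

human = [score for _, _, score in pairs]
model = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
print("Spearman correlation:", spearman(human, model))
```

A higher correlation means the embedding space orders the word pairs more like the human annotators did.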
The Twitter Sentiment dataset is a small dataset manually annotated by the Alexandra Institute. It contains tags in two sentiment dimensions: analytic: ['subjective', 'objective'] and polarity: ['positive', 'neutral', 'negative']. It is split into a train part and a test part. Due to Twitter's privacy policy, only the tweet ID may be distributed, not the actual text; this allows people to delete their tweets. Downloading the actual tweet text therefore requires a Twitter developer account and a generated set of login keys (see Twitter's developer documentation for how to get started). The dataset can then be loaded with the DaNLP package after setting the following environment variables for the keys:
TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_SECRET
from danlp.datasets import TwitterSent
twitSent = TwitterSent()
df_test, df_train = twitSent.load_with_pandas()
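Before loading the dataset, it can help to confirm that all four variables are actually set. This is a small sketch using only the standard library; `missing_twitter_keys` is a hypothetical helper and not part of DaNLP.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def missing_twitter_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_twitter_keys()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("All four Twitter keys are set.")
```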
The dataset can also be downloaded directly with the labels and tweet IDs.
The Europarl Sentiment1 dataset contains sentences from the Europarl corpus which have been manually annotated by Finn Årup Nielsen. Each sentence is annotated with a sentiment polarity score from -5 to 5. The score can be converted to positive (>0), neutral (=0) and negative (<0). The dataset can be loaded with the DaNLP package:
from danlp.datasets import EuroparlSentiment1
eurosent = EuroparlSentiment1()
df = eurosent.load_with_pandas()
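The conversion from a polarity score to a three-way label described above can be sketched as a small function. The name `polarity_label` is hypothetical, and the scores in the example are made up rather than taken from the dataset.

```python
def polarity_label(score: int) -> str:
    """Map a polarity score in [-5, 5] to 'positive', 'neutral' or 'negative'."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up scores for illustration.
example_scores = [-5, 0, 3]
print([polarity_label(s) for s in example_scores])
# ['negative', 'neutral', 'positive']
```

The same function can be applied to whichever DataFrame column holds the score after loading the data with pandas.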
The Europarl Sentiment2 dataset consists of 957 sentences from Europarl, manually annotated by the Alexandra Institute. It contains tags in two sentiment dimensions: analytic: ['subjective', 'objective'] and polarity: ['positive', 'neutral', 'negative']. The dataset can be loaded with the DaNLP package:
from danlp.datasets import EuroparlSentiment2
eurosent = EuroparlSentiment2()
df = eurosent.load_with_pandas()
The LCC Sentiment dataset contains sentences from the Leipzig Corpora Collection (Quasthoff et al. 2006)
which have been manually annotated by Finn Årup Nielsen.
Each sentence is annotated with a sentiment polarity score from -5 to 5.
The score can be converted to positive (>0), neutral (=0) and negative (<0).
The dataset can be loaded with the DaNLP package:
from danlp.datasets import LccSentiment
lccsent = LccSentiment()
df = lccsent.load_with_pandas()
DanNet is a lexical database similar to WordNet. It was created by the "Center for sprogteknologi" at the University of Copenhagen; more details can be found in Pedersen et al. (2009).
DanNet depicts the relations between words in Danish (mostly nouns, verbs and adjectives). As in WordNet, the main relation among words is synonymy.
The dataset consists of 4 databases:
* words
* word senses
* relations
* synsets
DanNet uses the concept of a synset to link words together. Every word in the database is part of one or more synsets. A synset is a set of synonyms (words which have the same meaning).
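To make the synset idea concrete, here is a toy sketch of how words can be linked through shared synsets. The synset ids and their contents are invented for illustration and do not reflect the actual DanNet tables or the DaNLP implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Toy data: synset id -> set of words sharing one meaning (ids are made up).
SYNSETS: Dict[int, Set[str]] = {
    1: {"myre", "tissemyre"},
    2: {"bil", "automobil"},
    3: {"bil", "vogn"},
}


def build_index(synsets: Dict[int, Set[str]]) -> Dict[str, Set[int]]:
    """Invert the mapping: word -> ids of all synsets that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].add(synset_id)
    return index


def synonyms(word: str, synsets: Dict[int, Set[str]],
             index: Dict[str, Set[int]]) -> List[str]:
    """Collect every other word that shares at least one synset with `word`."""
    found: Set[str] = set()
    for synset_id in index.get(word, ()):
        found |= synsets[synset_id]
    found.discard(word)
    return sorted(found)


index = build_index(SYNSETS)
print(synonyms("myre", SYNSETS, index))  # ['tissemyre']
print(synonyms("bil", SYNSETS, index))   # ['automobil', 'vogn']
```

A word that belongs to several synsets, like "bil" in this toy data, picks up synonyms from each of them.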
For downloading DanNet through DaNLP, you can do:
from danlp.datasets import DanNet
dannet = DanNet()
# you can load the databases if you want to look into the databases by yourself
words, wordsenses, relations, synsets = dannet.load_with_pandas()
We also provide helper functions to search for synonyms, hypernyms and hyponyms through the databases. Once you have downloaded DanNet through the wrapper, you can use the following features:
word = "myre"
# synonyms
dannet.synonyms(word)
""" ['tissemyre'] """
# hypernyms
dannet.hypernyms(word)
""" ['årevingede insekter'] """
# hyponyms
dannet.hyponyms(word)
""" ['hærmyre', 'skovmyre', 'pissemyre', 'tissemyre'] """
# meanings
dannet.meanings(word)
""" ['ca. 1 cm langt, årevinget insekt med en kraftig in ... (Brug: "Myrer på terrassen, og andre steder udendørs, kan hurtigt blive meget generende")'] """
# to help you dive into the databases
# we also provide the following functions:
# part-of-speech (returns a list whose items are 'Noun', 'Verb' or 'Adjective')
dannet.pos(word)
# wordnet relations (EUROWORDNET or WORDNETOWL)
dannet.wordnet_relations(word, eurowordnet=True)
# word ids
dannet._word_ids(word)
# synset ids
dannet._synset_ids(word)
# word from id
dannet._word_from_id(11034863)
# synset from id
dannet._synset_from_id(3514)
- Johannsen, Anders, Martínez Alonso, Héctor and Plank, Barbara. “Universal Dependencies for Danish”. TLT14, 2015.
- Keson, Britt (1998). Documentation of The Danish Morpho-syntactically Tagged PAROLE Corpus. Technical report, DSL.
- Matthias T. Buch-Kromann, Line Mikkelsen, and Stine Kern Lynge. 2003. "Danish dependency treebank". In TLT.
- Rasmus Hvingelby, Amalie B. Pauli, Maria Barrett, Christina Rosted, Lasse M. Lidegaard and Anders Søgaard. 2020. DaNE: A Named Entity Resource for Danish. In LREC.
- Pedersen, Bolette S., Sanni Nimb, Jørg Asmussen, Nicolai H. Sørensen, Lars Trap-Jensen and Henrik Lorentzen (2009). DanNet – the challenge of compiling a WordNet for Danish by reusing a monolingual dictionary. Lang Resources & Evaluation 43:269–299.
- Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight and Heng Ji. 2017. Cross-lingual Name Tagging and Linking for 282 Languages. In ACL.
- Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing Search in Context: The Concept Revisited. In ACM TOIS.
- Uwe Quasthoff, Matthias Richter and Christian Biemann. 2006. Corpus Portal for Search in Monolingual Corpora. In LREC.
- M.T. Kromann and S.K. Lynge. 2004. Danish Dependency Treebank v. 1.0. Department of Computational Linguistics, Copenhagen Business School. https://github.com/mbkromann/copenhagen-dependency-treebank