Commit

Update dataset documentation with details
hvingelby committed Sep 22, 2019
1 parent dee3924 commit b00ebaf
Showing 2 changed files with 38 additions and 9 deletions.
39 changes: 34 additions & 5 deletions docs/datasets.md
@@ -8,10 +8,37 @@ This section keeps a list of Danish NLP datasets publicly available.
| [EU Bookshop](http://opus.nlpl.eu/EUbookshop-v2.php) | Translation | 208,175,843 | 8,650,537 | - |
| [EuroParl7](http://opus.nlpl.eu/Europarl.php) | Translation | 47,761,381 | 2,323,099 | [None](http://www.statmt.org/europarl/)|
| [ParaCrawl5](https://paracrawl.eu/) | Translation | - | - | [CC0](https://paracrawl.eu/releases.html) |
| WikiANN | NER | 832,901 | 95,924 |[ODC-BY 1.0](http://nlp.cs.rpi.edu/wikiann/)|
| [Danish Dependency Treebank](<https://github.com/UniversalDependencies/UD_Danish-DDT/tree/master>) | POS, NER | 100,733 | 5,512 | [CC BY-SA 4.0](https://github.com/UniversalDependencies/UD_Danish-DDT/blob/master/README.md) |
| [Wikipedia](<https://dumps.wikimedia.org/dawiki/latest/>) | Raw | 0.3GB* | - | [CC BY-SA 3.0](https://dumps.wikimedia.org/legal.html) |
| [WordSim-353-da](https://github.com/fnielsen/dasem/tree/master/dasem/data/wordsim353-da) | Word Similarity | 353 | - | [CC BY 4.0](https://github.com/fnielsen/dasem/blob/master/dasem/data/wordsim353-da/LICENSE)|
| [WikiANN](https://github.com/alexandrainst/danlp/blob/add-ner/docs/datasets.md#wikiann)| NER | 832,901 | 95,924 |[ODC-BY 1.0](http://nlp.cs.rpi.edu/wikiann/)|
| [Danish Dependency Treebank](https://github.com/alexandrainst/danlp/blob/add-ner/docs/datasets.md#danish-dependency-treebank) | DEP, POS, NER | 100,733 | 5,512 | [CC BY-SA 4.0](https://github.com/UniversalDependencies/UD_Danish-DDT/blob/master/README.md) |
| [Wikipedia](https://dumps.wikimedia.org/dawiki/latest/) | Raw | - | - | [CC BY-SA 3.0](https://dumps.wikimedia.org/legal.html) |
| [WordSim-353](https://github.com/alexandrainst/danlp/blob/add-ner/docs/datasets.md#wordsim-353) | Word Similarity | - | 353 | [CC BY 4.0](https://github.com/fnielsen/dasem/blob/master/dasem/data/wordsim353-da/LICENSE)|

#### Danish Dependency Treebank
The Danish Dependency Treebank (DDT) (Buch-Kromann et al. 2003) has annotations for dependency parsing (DEP), part-of-speech tagging (POS) and named entity recognition (NER).
The NER annotations for **PER**, **ORG** and **LOC** were added by the Alexandra Institute.
For details on how the dataset was annotated with POS and DEP tags, see the
[Universal Dependencies](https://github.com/UniversalDependencies/UD_Danish-DDT/blob/master/README.md) page.
The dataset can be loaded with the DaNLP package:

```python
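from danlp.datasets import DDT

ddt = DDT()

# Load the same treebank in the representation the downstream tool expects
spacy_corpus = ddt.load_with_spacy()   # for use with spaCy
flair_corpus = ddt.load_with_flair()   # for use with Flair
conllu_format = ddt.load_as_conllu()   # CoNLL-U representation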
```
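
To give a feel for the CoNLL-U representation mentioned above, here is a minimal, hedged sketch of a reader for the standard ten-column CoNLL-U layout. It is not part of DaNLP and does not reflect what `load_as_conllu()` returns; `parse_conllu`, `CONLLU_FIELDS` and the two-token sample sentence are hypothetical names and data written only for illustration.

```python
# Minimal CoNLL-U reader (illustrative sketch; not part of DaNLP).
# The ten column names follow the CoNLL-U format definition.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hypothetical two-token sentence, hand-written for this example only.
SAMPLE = (
    "# text = Hun sover\n"
    "1\tHun\thun\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsover\tsove\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):
            # Sentence-level comment or metadata; skipped in this sketch.
            continue
        else:
            values = line.split("\t")
            if len(values) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(values)}")
            tokens.append(dict(zip(CONLLU_FIELDS, values)))
    if tokens:
        # Handle input that does not end with a blank line.
        sentences.append(tokens)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

Each token carries its POS tag (`upos`) and its dependency arc (`head`, `deprel`), which are the POS and DEP layers described above. Where the NER tags are stored in the DDT files is not shown here.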

#### WikiANN
The WikiANN dataset [(Pan et al. 2017)](https://aclweb.org/anthology/P17-1178) provides NER annotations
for **PER**, **ORG** and **LOC**. It was constructed from the linked entities in Wikipedia pages and covers 282
languages, including Danish.
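
As a rough illustration of how token-level **PER**, **ORG** and **LOC** annotations are typically consumed, the sketch below groups tags into entity spans. It assumes an IOB2-style encoding (`B-PER`, `I-PER`, `O`), which is a common NER convention but is not stated above for WikiANN; the function name `iob2_to_spans` and the example sentence are hypothetical.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 tags into a list of (entity_type, text) spans."""
    spans = []
    current = None  # [entity_type, [tokens]] for the entity being built

    def close(entity):
        # Store a finished entity as (type, space-joined surface text).
        if entity is not None:
            spans.append((entity[0], " ".join(entity[1])))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            # Be lenient: an I- tag without a matching open entity starts one.
            prefix == "I" and (current is None or current[0] != label)
        )
        if starts_new:
            close(current)
            current = [label, [token]]
        elif prefix == "I":
            current[1].append(token)
        else:
            # "O" (outside): any open entity ends here.
            close(current)
            current = None
    close(current)
    return spans


# Hypothetical sentence and tags, written for this example only.
tokens = ["Mette", "Jensen", "arbejder", "hos", "Alexandra", "Instituttet", "i", "Aarhus"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]

entities = iob2_to_spans(tokens, tags)
print(entities)
```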

#### WordSim-353
The WordSim-353 dataset [(Finkelstein et al. 2002)](http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf)
contains word pairs annotated with a human-assigned similarity score (0-10). It is commonly used for intrinsic evaluation
of word embeddings, testing how well they capture syntactic or semantic relationships between words. The dataset has been
[translated to Danish](https://github.com/fnielsen/dasem/tree/master/dasem/data/wordsim353-da) by Finn Aarup Nielsen.
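
The intrinsic evaluation described above is usually reported as a rank correlation between the human scores and the cosine similarities of the embeddings. The following is a minimal sketch of that procedure in plain Python, not DaNLP code: the 2-dimensional vectors, the word pairs and the scores are all hypothetical toy values, and pairs with out-of-vocabulary words are simply skipped.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy embeddings and human scores (not real WordSim-353 data).
embeddings = {
    "kat": [1.0, 0.1], "hund": [0.9, 0.2],
    "bil": [0.1, 1.0], "tog": [0.3, 0.9],
    "kaffe": [0.6, 0.6],
}
word_pairs = [
    ("kat", "hund", 8.5), ("bil", "tog", 7.0),
    ("kat", "bil", 1.5), ("hund", "kaffe", 3.0),
    ("kat", "sol", 2.0),  # "sol" is out of vocabulary and gets skipped
]

human, model = [], []
for w1, w2, score in word_pairs:
    if w1 in embeddings and w2 in embeddings:
        human.append(score)
        model.append(cosine(embeddings[w1], embeddings[w2]))

rho = spearman(human, model)
print(f"pairs used: {len(human)}, Spearman rho: {rho:.3f}")
```

In this toy setup the embeddings rank the four covered pairs in the same order as the hypothetical human scores, so the correlation is 1.0 up to floating-point rounding. With real data, the number of skipped pairs should be reported alongside the correlation.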

### Get started

@@ -24,4 +51,6 @@ At the moment, the following options are supported: `--euparl`, `--wiki` and  `-


## 🎓 References
- Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. [Placing Search in Context: The Concept Revisited](http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf). In **ACM TOIS**.
- Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight and Heng Ji. 2017. [Cross-lingual Name Tagging and Linking for 282 Languages](https://aclweb.org/anthology/P17-1178). In **ACL**.
- Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. [Placing Search in Context: The Concept Revisited](http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf). In **ACM TOIS**.
- Matthias T. Buch-Kromann, Line Mikkelsen, and Stine Kern Lynge. 2003. "Danish dependency treebank". In **TLT**.
8 changes: 4 additions & 4 deletions docs/models/ner.md
@@ -8,11 +8,11 @@ This repository keeps a list of pretrained NER models publicly available in Dani
| [daner](https://github.com/ITUnlp/daner) | [Derczynski et al. (2014)](https://www.aclweb.org/anthology/E14-2016) | [ITU NLP](https://nlp.itu.dk/) | PER, ORG, LOC |
| Multilingual BERT | | [MIPT](https://mipt.ru/english/) |



## Get started

## 📈 Benchmarks
The benchmarks have been performed on the test part of the
[Danish Dependency Treebank](https://github.com/alexandrainst/danlp/blob/add-ner/docs/datasets.md#danish-dependency-treebank).
The treebank was annotated by the Alexandra Institute with the **LOC**, **ORG** and **PER** entity tags.


| Model | LOC | ORG | PER | AVG |
|-------|-----|-----|-----|-----|