Standardize the documentation
ophelielacroix committed Nov 5, 2020
1 parent 066bcb9 commit 6ca5d68
Showing 7 changed files with 31 additions and 30 deletions.
19 changes: 9 additions & 10 deletions docs/datasets.md
@@ -1,5 +1,6 @@
Datasets
========

This section keeps a list of Danish NLP datasets publicly available.

| Dataset | Task | Words | Sents | License | DaNLP |
@@ -20,7 +21,7 @@ This section keeps a list of Danish NLP datasets publicly available.

It is also recommended to check out Finn Årup Nielsen's [dasem github](https://github.com/fnielsen/dasem), which also provides scripts for loading different Danish corpora.

-#### Danish Dependency Treebank (DaNE)
+### Danish Dependency Treebank (DaNE)

The Danish UD treebank (Johannsen et al., 2015, UD-DDT) is a
conversion of the Danish Dependency Treebank (Buch-Kromann et
@@ -45,7 +46,7 @@ The dataset can also be downloaded directly in CoNLL-U format.

[Download DDT](https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip)
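For reference, CoNLL-U is a plain-text format with one token per line and ten tab-separated fields. A minimal parsing sketch (illustrative only, not the DaNLP loader):

```python
# Minimal CoNLL-U parsing sketch (not the DaNLP loader): each token line has
# ten tab-separated fields; here we keep ID, FORM, UPOS and HEAD.
sample = """# sent_id = 1
1\tHan\than\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tkommer\tkomme\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse_conllu(text):
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            # comment or blank line ends the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append({"id": fields[0], "form": fields[1],
                        "upos": fields[3], "head": fields[6]})
    if current:
        sentences.append(current)
    return sentences

for sent in parse_conllu(sample):
    print([tok["form"] for tok in sent])
```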

-#### WikiANN
+### WikiANN
The WikiANN dataset [(Pan et al. 2017)](https://aclweb.org/anthology/P17-1178) is a dataset with NER annotations
for **PER**, **ORG** and **LOC**. It has been constructed using the linked entities in Wikipedia pages for 282 different
languages including Danish. The dataset can be loaded with the DaNLP package:
@@ -58,18 +59,18 @@ spacy_corpus = wikiann.load_with_spacy()
flair_corpus = wikiann.load_with_flair()
```

-#### WordSim-353
+### WordSim-353
The WordSim-353 dataset [(Finkelstein et al. 2002)](http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf)
contains word pairs annotated with a similarity score (1-10). It is commonly used for intrinsic evaluation
of word embeddings, testing for syntactic or semantic relationships between words. The dataset has been
[translated to Danish](https://github.com/fnielsen/dasem/tree/master/dasem/data/wordsim353-da) by Finn Årup Nielsen.
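Such an intrinsic evaluation typically reports Spearman's rank correlation between the human similarity scores and the similarities produced by an embedding. A self-contained sketch with toy numbers (the helpers below are illustrative, pure-Python stand-ins for e.g. `scipy.stats.spearmanr`):

```python
def rankdata(values):
    """Assign average ranks (1-based), handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mean = (n + 1) / 2
    num = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    den = (sum((a - mean) ** 2 for a in rx) *
           sum((b - mean) ** 2 for b in ry)) ** 0.5
    return num / den

# Toy data: human similarity scores vs. cosine similarities from an embedding
human = [9.8, 7.4, 1.2, 5.0]
model = [0.91, 0.63, 0.05, 0.55]
print(spearman(human, model))  # 1.0 here, since the two rankings agree perfectly
```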

-#### Danish Similarity Dataset
+### Danish Similarity Dataset
The [Danish Similarity Dataset](https://github.com/kuhumcst/Danish-Similarity-Dataset)
consists of 99 word pairs annotated by 38 annotators with a similarity score (1-6).
It is constructed with frequently used Danish words.

-#### Twitter Sentiment
+### Twitter Sentiment

The Twitter sentiment is a small dataset manually annotated by the Alexandra Institute. It contains tags in two sentiment dimensions: analytic ('subjective', 'objective') and polarity ('positive', 'neutral', 'negative'). It is split into a train and a test part. Due to Twitter's privacy policy, only the tweet IDs may be displayed, not the actual text; this allows people to delete their tweets. To download the actual tweet text you therefore need a Twitter developer account and a set of login keys; read how to get started [here](https://python-twitter.readthedocs.io/en/latest/getting_started.html). The dataset can then be loaded with the DaNLP package by setting the following environment variables for the keys:
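As a sketch, the keys can be provided as environment variables before loading the dataset. The variable names below are an assumption for illustration; check the DaNLP documentation for the exact names expected by the loader:

```python
import os

# Hypothetical key names -- check the DaNLP documentation for the exact
# environment variables expected by the Twitter dataset loader.
os.environ["TWITTER_CONSUMER_KEY"] = "xxx"
os.environ["TWITTER_CONSUMER_SECRET"] = "xxx"
os.environ["TWITTER_ACCESS_TOKEN"] = "xxx"
os.environ["TWITTER_ACCESS_SECRET"] = "xxx"
```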

@@ -86,7 +87,7 @@ The dataset can also be downloaded directly with the labels and tweet id:

[Download TwitterSent](https://danlp.alexandra.dk/304bd159d5de/datasets/twitter.sentiment.zip)

-#### Europarl Sentiment1
+### Europarl Sentiment1

The [Europarl Sentiment1](https://github.com/fnielsen/europarl-da-sentiment) dataset contains sentences from
the [Europarl](http://www.statmt.org/europarl/) corpus, which have been manually annotated by Finn Årup Nielsen.
@@ -101,7 +102,7 @@ eurosent = EuroparlSentiment1()
df = eurosent.load_with_pandas()
```

-#### Europarl Sentiment2
+### Europarl Sentiment2

The dataset consists of 957 sentences from Europarl manually annotated by the Alexandra Institute. It contains tags in two sentiment dimensions: analytic ('subjective', 'objective') and polarity ('positive', 'neutral', 'negative').
The dataset can be loaded with the DaNLP package:
@@ -113,9 +114,7 @@ eurosent = EuroparlSentiment2()
df = eurosent.load_with_pandas()
```

-####
-
-#### LCC Sentiment
+### LCC Sentiment

The [LCC Sentiment](https://github.com/fnielsen/lcc-sentiment) dataset contains sentences from the Leipzig Corpora Collection [(Quasthoff et al. 2006)](https://www.aclweb.org/anthology/L06-1396/)
which have been manually annotated by Finn Årup Nielsen.
2 changes: 1 addition & 1 deletion docs/models/dependency.md
@@ -31,7 +31,7 @@ We provide a convertion function -- from dependencies to NP-chunks -- thus depen



-## :wrench:SpaCy
+## 🔧 SpaCy

Read more about the SpaCy model in the dedicated [SpaCy docs](<https://github.com/alexandrainst/danlp/blob/master/docs/spacy.md>); it has also been trained using the [Danish Dependency Treebank](<https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane>) dataset.

1 change: 1 addition & 0 deletions docs/models/embeddings.md
@@ -1,5 +1,6 @@
Pretrained Danish embeddings
============================

This repository keeps a list of pretrained word embeddings publicly available in Danish. The `download_embeddings.py`
and `load_embeddings.py` modules provide functions for downloading the embeddings as well as preparing them for use in
popular NLP frameworks.
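As a sketch of what loading embeddings involves: in the common word2vec text format each line maps a word to a dense vector, and word similarity is usually measured with cosine similarity. Toy vectors below; the function names are illustrative, not the DaNLP API:

```python
import numpy as np

def parse_word2vec_text(text):
    """Parse the word2vec text format: one 'word v1 v2 ...' entry per line."""
    vectors = {}
    for line in text.strip().splitlines():
        parts = line.split()
        vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings for illustration only
vecs = parse_word2vec_text("""
hund 1.0 0.2 0.0
kat 0.9 0.3 0.1
bil 0.0 0.1 1.0
""")
print(cosine_similarity(vecs["hund"], vecs["kat"]))  # high: related words
```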
6 changes: 3 additions & 3 deletions docs/models/ner.md
@@ -16,7 +16,7 @@ and made available through the DaNLP library.
| [Polyglot](https://polyglot.readthedocs.io/en/latest/POS.html/#) | Wikipedia | Polyglot | PER, ORG, LOC ||
| [daner](https://github.com/ITUnlp/daner) | [Derczynski et al. (2014)](https://www.aclweb.org/anthology/E14-2016) | [ITU NLP](https://nlp.itu.dk/) | PER, ORG, LOC ||

-#### BERT
+#### 🔧 BERT
The BERT [(Devlin et al. 2019)](https://www.aclweb.org/anthology/N19-1423/) NER model is based on the pretrained [Danish BERT](https://github.com/botxo/nordic_bert) representations by BotXO, which
have been finetuned on the [DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane)
dataset [(Hvingelby et al. 2020)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.565.pdf). The finetuning has been done using the [Transformers](https://github.com/huggingface/transformers) library from HuggingFace.
@@ -32,7 +32,7 @@ print(" ".join(["{}/{}".format(tok,lbl) for tok,lbl in zip(tokens,labels)]))
```


-#### Flair
+#### 🔧 Flair
The Flair [(Akbik et al. 2018)](https://www.aclweb.org/anthology/C18-1139/) NER model
uses pretrained [Flair embeddings](https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md#-training-details-for-flair-embeddings)
in combination with fastText word embeddings. The model is trained using the [Flair](https://github.com/flairNLP/flair)
@@ -52,7 +52,7 @@ flair_model.predict(sentence)
print(sentence.to_tagged_string())
```

-#### spaCy
+#### 🔧 spaCy
The [spaCy](https://spacy.io/) model is trained for several NLP tasks [(read more here)](https://github.com/alexandrainst/danlp/blob/master/docs/spacy.md) using the [DDT and DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane) annotations.
The spaCy model can be loaded with DaNLP to do NER predictions in the following way.
```python
# a minimal sketch, assuming the DaNLP spaCy loader
from danlp.models import load_spacy_model

nlp = load_spacy_model()
doc = nlp('Jens Peter Hansen kommer fra Danmark')
for ent in doc.ents:
    print(ent.text, ent.label_)
```
4 changes: 2 additions & 2 deletions docs/models/pos.md
@@ -16,7 +16,7 @@ A medium blog using Part of Speech tagging on Danish, can be found [here](<http

![](../imgs/postag_eksempel.gif)

-##### :wrench:Flair
+##### 🔧 Flair

This project provides a trained part-of-speech tagging model for Danish using the [Flair](<https://github.com/flairNLP/flair>) framework from Zalando, based on the paper [Akbik et al. (2018)](<https://alanakbik.github.io/papers/coling2018.pdf>). The model is trained on the [Danish Dependency Treebank](<https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane>) using FastText word embeddings and Flair contextual word embeddings, trained in this project on data from Wikipedia and the EuroParl corpus; see [here](<https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md>).

@@ -45,7 +45,7 @@ print(sentence.to_tagged_string())



-##### :wrench:SpaCy
+##### 🔧 SpaCy

Read more about the spaCy model in the dedicated [spaCy docs](<https://github.com/alexandrainst/danlp/blob/master/docs/spacy.md>); it has also been trained using the [Danish Dependency Treebank](<https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane>) data.

22 changes: 11 additions & 11 deletions docs/models/sentiment_analysis.md
@@ -1,5 +1,5 @@
Sentiment Analysis
-============================
+==================

Sentiment analysis is a broad term for a set of tasks with the purpose of identifying an emotion or opinion in a text.

@@ -9,8 +9,8 @@ In this repository we provide an overview of open sentiment analysis models and
| ------------------------------------------------------------ | -------- | ------------------------------------------------------------ | --------------------------------------------------------- | ------------------ | ------------------------------------------------------------ | ----- |
| [AFINN](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#afinn) | Wordlist | [Apache 2.0](https://github.com/fnielsen/afinn/blob/master/LICENSE) | Finn Årup Nielsen | Polarity | Score (integers) ||
| [Sentida](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#sentida) | Wordlist | [GPL-3.0](https://github.com/esbenkc/emma/blob/master/LICENSE) | Jacob Dalsgaard, Lars Kjartan Svenden og Gustav Lauridsen | Polarity | Score (continuous) ||
-| [Bert Emotion](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-emotion) | BERT | CC-BY_4.0 | Alexandra Institute | Emotions | glæde/sindsro, forventning/interesse, tillid/accept, overraskelse/forundring, vrede/irritation, foragt/modvilje, sorg/skuffelse, frygt/bekymring, No emotion | ✔️ |
-| [Bert Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) (beta) | BERT | CC-BY_4.0 | Alexandra Institute | Polarity, Analytic | ['postive', 'neutral', 'negative'] and ['subjective', 'objective] | ✔️ |
+| [BERT Emotion](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-emotion) | BERT | CC-BY_4.0 | Alexandra Institute | Emotions | glæde/sindsro, forventning/interesse, tillid/accept, overraskelse/forundring, vrede/irritation, foragt/modvilje, sorg/skuffelse, frygt/bekymring, No emotion | ✔️ |
+| [BERT Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) (beta) | BERT | CC-BY_4.0 | Alexandra Institute | Polarity, Analytic | ['positive', 'neutral', 'negative'] and ['subjective', 'objective'] | ✔️ |
| [SpaCy Sentiment](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrench-spacy-sentiment) (beta) | spaCy | MIT | Alexandra Institute | Polarity | 'positive', 'neutral', 'negative' | ✔️ |


@@ -25,11 +25,11 @@ The tool scores texts with an integer where scores <0 are negative, =0 are neutr
The tool Sentida [(Lauridsen et al. 2019)](https://tidsskrift.dk/lwo/article/view/115711)
uses a lexicon-based approach to sentiment analysis and scores texts with a continuous value. There exist both an R version and a Python implementation. In this documentation we evaluate the Python version from [sentida](https://github.com/guscode/sentida).
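The wordlist approach that both tools share can be sketched in a few lines (a toy lexicon for illustration, not the actual AFINN or Sentida scores):

```python
# Toy sketch of a wordlist-based polarity scorer in the style of AFINN:
# each known word carries an integer valence, and the text score is the sum.
toy_lexicon = {"god": 3, "glad": 3, "dårlig": -3, "ked": -2}

def score(text):
    # unknown words contribute 0
    return sum(toy_lexicon.get(tok, 0) for tok in text.lower().split())

print(score("en god og glad dag"))   # positive
print(score("en dårlig dag"))        # negative
```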

-#### :wrench:Bert Emotion
+#### 🔧 BERT Emotion

The emotion classifier is developed in collaboration with Danmarks Radio, which has granted access to a set of social media data. The data has first been manually annotated to distinguish between the binary problem of emotion versus no emotion, and afterwards tagged with 8 emotions. The BERT [(Devlin et al. 2019)](https://www.aclweb.org/anthology/N19-1423/) emotion model is finetuned on this data using the [Transformers](https://github.com/huggingface/transformers) library from HuggingFace, and it is based on the pretrained [Danish BERT](https://github.com/botxo/nordic_bert) representations by BotXO. The model classifying the eight emotions achieves an accuracy of 0.65 and a macro-F1 of 0.64 on the social media test set from DR's Facebook, containing 999 examples. We do not have permission to distribute the data.

-Below is a small snippet for getting started using the Bert Emotion model. Please notice that the BERT model can maximum take 512 tokens as input, however the code allows for overfloating tokens and will therefore not give an error but just a warning.
+Below is a small snippet for getting started using the BERT Emotion model. Please note that the BERT model can take at most 512 tokens as input; the code handles overflowing tokens and will therefore give a warning instead of an error.

```python
from danlp.models import load_bert_emotion_model
```
@@ -50,11 +50,11 @@ classifier._classes()



-#### :wrench:Bert Tone
+#### 🔧 BERT Tone

The tone analyzer consists of two BERT [(Devlin et al. 2019)](https://www.aclweb.org/anthology/N19-1423/) classification models: the first recognizes the tags positive, neutral and negative, and the second the tags subjective and objective. This is a first version of the models, and work remains to improve performance. Both models are finetuned on annotated Twitter data using the [Transformers](https://github.com/huggingface/transformers) library from HuggingFace, and are based on the pretrained [Danish BERT](https://github.com/botxo/nordic_bert) representations by BotXO. The data used is manually annotated data from Twitter Sentiment (train part) ([see here](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#twitter-sentiment)) and EuroParl sentiment 2 ([see here](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#europarl-sentiment2)); both datasets can be loaded with the DaNLP package.

-Below is a small snippet for getting started using the Bert Tone model. Please notice that the BERT model can maximum take 512 tokens as input, however the code allows for overfloating tokens and will therefore not give an error but just a warning.
+Below is a small snippet for getting started using the BERT Tone model. Please note that the BERT model can take at most 512 tokens as input; the code handles overflowing tokens and will therefore give a warning instead of an error.

```python
from danlp.models import load_bert_tone_model
```
@@ -73,11 +73,11 @@ classifier._clases()



-#### :wrench: SpaCy Sentiment
+#### 🔧 SpaCy Sentiment

SpaCy Sentiment is a text classification model trained using spaCy's built-in command line interface. It uses the CoNLL2017 word vectors; read about them [here](https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md).

-The model is trained using hard distil of the [Bert Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) (beta) - Meaning, the Bert Tone model is used to make predictions on 50.000 sentences from Twitter and 50.000 sentences from [Europarl7](http://www.statmt.org/europarl/). These data is then used to trained a spacy model. Notice the dataset has first been balanced between the classes by oversampling. The model recognizes the classses: 'positiv', 'neutral' and 'negative'.
+The model is trained using hard distillation of the [BERT Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) model (beta): the BERT Tone model is used to make predictions on 50,000 sentences from Twitter and 50,000 sentences from [Europarl7](http://www.statmt.org/europarl/), and these predictions are then used to train a spaCy model. Note that the dataset has first been balanced between the classes by oversampling. The model recognizes the classes 'positive', 'neutral' and 'negative'.

It is a first version.
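The class balancing by oversampling mentioned above can be sketched as follows (illustrative only, not the actual training script):

```python
import random
from collections import Counter

def oversample(examples):
    """Balance classes by duplicating minority-class examples at random."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        # draw extra copies until this class reaches the majority-class size
        balanced.extend(random.choices(items, k=target - len(items)))
    return balanced

data = [("fin dag", "positive")] * 3 + [("øv", "negative")] * 1
counts = Counter(label for _, label in oversample(data))
print(counts)  # both classes now equally represented
```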

@@ -124,7 +124,7 @@ In the table we consider the accuracy and macro-f1 in brackets, but to get the s
| ---- | ------------------ | ------------- | ---- |
| AFINN | 0.68 (0.68) | 0.66 (0.61) | 0.48 (0.46) |
| Sentida (version 0.5.0) | 0.67 (0.65) | 0.58 (0.55) | 0.44 (0.44) |
-| Bert Tone (polarity, version 0.0.1) | **0.79** (0.78) | **0.74** (0.67) | **0.73** (0.70) |
+| BERT Tone (polarity, version 0.0.1) | **0.79** (0.78) | **0.74** (0.67) | **0.73** (0.70) |
| spaCy sentiment (version 0.0.1) | 0.74 (0.73) | 0.66 (0.61) | 0.66 (0.60) |

**Benchmark of subjective versus objective classification**
@@ -137,7 +137,7 @@ The script for the benchmarks can be found [here](https://github.com/alexandrain

| Model | Twitter sentiment (analytic) |
| ----------------------------------- | ---------------------------- |
-| Bert Tone (analytic, version 0.0.1) | 0.90 (0.77) |
+| BERT Tone (analytic, version 0.0.1) | 0.90 (0.77) |



7 changes: 4 additions & 3 deletions docs/spacy.md
@@ -1,4 +1,5 @@
-# SpaCy model in Danish
+SpaCy model in Danish
+=====================

SpaCy is an industrial-strength open-source framework for NLP; you can read more about it on their [homepage](https://spacy.io/) or [GitHub](https://github.com/explosion/spaCy).

@@ -32,7 +33,7 @@ The following lists the performance scores of the spaCy model provided in DaNLP



-## :hatching_chick: Getting started with the spaCy model
+### 🐣 Getting started with the spaCy model

Below are some small snippets to get started using the spaCy model within the DaNLP package. More information about using spaCy can be found on spaCy's own [page](https://spacy.io/).

Expand Down Expand Up @@ -104,7 +105,7 @@ Alexandra ORG
Instituttet ORG
```

-## :hatching_chick: Start training your own text classification model
+### 🐣 Start training your own text classification model

The spaCy framework provides an easy command line tool for training an existing model, for example by adding a text classifier. This short example shows how to do so using your own annotated data. It is also possible to use any of the static embeddings provided in the DaNLP wrapper.
