Standardize the documentation
ophelielacroix committed Nov 5, 2020
1 parent 066bcb9 commit 6ca5d68
Showing 7 changed files with 31 additions and 30 deletions.
19 changes: 9 additions & 10 deletions docs/datasets.md
@@ -1,5 +1,6 @@
Datasets
========

This section keeps a list of Danish NLP datasets publicly available.

| Dataset | Task | Words | Sents | License | DaNLP |
@@ -20,7 +21,7 @@ This section keeps a list of Danish NLP datasets publicly available.

It is also recommended to check out Finn Årup Nielsen's [dasem github](https://github.com/fnielsen/dasem), which also provides scripts for loading different Danish corpora.

-#### Danish Dependency Treebank (DaNE)
+### Danish Dependency Treebank (DaNE)

The Danish UD treebank (Johannsen et al., 2015, UD-DDT) is a
conversion of the Danish Dependency Treebank (Buch-Kromann et
@@ -45,7 +46,7 @@ The dataset can also be downloaded directly in CoNLL-U format.

[Download DDT](https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip)
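For reference, CoNLL-U is a plain-text format with one token per line and ten tab-separated fields. A minimal parsing sketch (illustrative only, not the DaNLP loader):

```python
# Minimal CoNLL-U parsing sketch (not the DaNLP loader): each token line has
# ten tab-separated fields; here we keep ID, FORM, UPOS and HEAD.
sample = """# sent_id = 1
1\tHan\than\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tkommer\tkomme\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse_conllu(text):
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            # comment or blank line ends the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append({"id": fields[0], "form": fields[1],
                        "upos": fields[3], "head": fields[6]})
    if current:
        sentences.append(current)
    return sentences

for sent in parse_conllu(sample):
    print([tok["form"] for tok in sent])
```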

-#### WikiANN
+### WikiANN
The WikiANN dataset [(Pan et al. 2017)](https://aclweb.org/anthology/P17-1178) is a dataset with NER annotations
for **PER**, **ORG** and **LOC**. It has been constructed using the linked entities in Wikipedia pages for 282 different
languages including Danish. The dataset can be loaded with the DaNLP package:
@@ -58,18 +59,18 @@ spacy_corpus = wikiann.load_with_spacy()
flair_corpus = wikiann.load_with_flair()
```

-#### WordSim-353
+### WordSim-353
The WordSim-353 dataset [(Finkelstein et al. 2002)](http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf)
contains word pairs annotated with a similarity score (1-10). It is commonly used for intrinsic evaluation
of word embeddings, testing for syntactic or semantic relationships between words. The dataset has been
[translated to Danish](https://github.com/fnielsen/dasem/tree/master/dasem/data/wordsim353-da) by Finn Årup Nielsen.
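Such an intrinsic evaluation typically reports Spearman's rank correlation between the human similarity scores and the similarities produced by an embedding. A self-contained sketch with toy numbers (the helpers below are illustrative, pure-Python stand-ins for e.g. `scipy.stats.spearmanr`):

```python
def rankdata(values):
    """Assign average ranks (1-based), handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mean = (n + 1) / 2
    num = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    den = (sum((a - mean) ** 2 for a in rx) *
           sum((b - mean) ** 2 for b in ry)) ** 0.5
    return num / den

# Toy data: human similarity scores vs. cosine similarities from an embedding
human = [9.8, 7.4, 1.2, 5.0]
model = [0.91, 0.63, 0.05, 0.55]
print(spearman(human, model))  # 1.0 here, since the two rankings agree perfectly
```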

-#### Danish Similarity Dataset
+### Danish Similarity Dataset
The [Danish Similarity Dataset](https://github.com/kuhumcst/Danish-Similarity-Dataset)
consists of 99 word pairs annotated by 38 annotators with a similarity score (1-6).
It is constructed with frequently used Danish words.

-#### Twitter Sentiment
+### Twitter Sentiment

The Twitter sentiment is a small dataset manually annotated by the Alexandra Institute. It contains tags in two sentiment dimensions: analytic ('subjective', 'objective') and polarity ('positive', 'neutral', 'negative'). It is split into a train and a test part. Due to Twitter's privacy policy, only the tweet IDs may be displayed, not the actual text; this allows people to delete their tweets. To download the actual tweet text you therefore need a Twitter developer account and a set of login keys; read how to get started [here](https://python-twitter.readthedocs.io/en/latest/getting_started.html). The dataset can then be loaded with the DaNLP package by setting the following environment variables for the keys:
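As a sketch, the keys can be provided as environment variables before loading the dataset. The variable names below are an assumption for illustration; check the DaNLP documentation for the exact names expected by the loader:

```python
import os

# Hypothetical key names -- check the DaNLP documentation for the exact
# environment variables expected by the Twitter dataset loader.
os.environ["TWITTER_CONSUMER_KEY"] = "xxx"
os.environ["TWITTER_CONSUMER_SECRET"] = "xxx"
os.environ["TWITTER_ACCESS_TOKEN"] = "xxx"
os.environ["TWITTER_ACCESS_SECRET"] = "xxx"
```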

@@ -86,7 +87,7 @@ The dataset can also be downloaded directly with the labels and tweet id:

[Download TwitterSent](https://danlp.alexandra.dk/304bd159d5de/datasets/twitter.sentiment.zip)

-#### Europarl Sentiment1
+### Europarl Sentiment1

The [Europarl Sentiment1](https://github.com/fnielsen/europarl-da-sentiment) dataset contains sentences from
the [Europarl](http://www.statmt.org/europarl/) corpus, which have been manually annotated by Finn Årup Nielsen.
@@ -101,7 +102,7 @@ eurosent = EuroparlSentiment1()
df = eurosent.load_with_pandas()
```

-#### Europarl Sentiment2
+### Europarl Sentiment2

The dataset consists of 957 sentences from Europarl manually annotated by the Alexandra Institute. It contains tags in two sentiment dimensions: analytic ('subjective', 'objective') and polarity ('positive', 'neutral', 'negative').
The dataset can be loaded with the DaNLP package:
@@ -113,9 +114,7 @@ eurosent = EuroparlSentiment2()
df = eurosent.load_with_pandas()
```

-####
-
-#### LCC Sentiment
+### LCC Sentiment

The [LCC Sentiment](https://github.com/fnielsen/lcc-sentiment) dataset contains sentences from the Leipzig Corpora Collection [(Quasthoff et al. 2006)](https://www.aclweb.org/anthology/L06-1396/)
which have been manually annotated by Finn Årup Nielsen.
2 changes: 1 addition & 1 deletion docs/models/dependency.md
@@ -31,7 +31,7 @@ We provide a convertion function -- from dependencies to NP-chunks -- thus depen



-## :wrench:SpaCy
+## 🔧 SpaCy

Read more about the SpaCy model in the dedicated [SpaCy docs](<https://github.com/alexandrainst/danlp/blob/master/docs/spacy.md>); it has also been trained using the [Danish Dependency Treebank](<https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane>) dataset.

1 change: 1 addition & 0 deletions docs/models/embeddings.md
@@ -1,5 +1,6 @@
Pretrained Danish embeddings
============================

This repository keeps a list of pretrained word embeddings publicly available in Danish. The `download_embeddings.py`
and `load_embeddings.py` modules provide functions for downloading the embeddings as well as preparing them for use in
popular NLP frameworks.
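As a sketch of what loading embeddings involves: in the common word2vec text format each line maps a word to a dense vector, and word similarity is usually measured with cosine similarity. Toy vectors below; the function names are illustrative, not the DaNLP API:

```python
import numpy as np

def parse_word2vec_text(text):
    """Parse the word2vec text format: one 'word v1 v2 ...' entry per line."""
    vectors = {}
    for line in text.strip().splitlines():
        parts = line.split()
        vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings for illustration only
vecs = parse_word2vec_text("""
hund 1.0 0.2 0.0
kat 0.9 0.3 0.1
bil 0.0 0.1 1.0
""")
print(cosine_similarity(vecs["hund"], vecs["kat"]))  # high: related words
```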
6 changes: 3 additions & 3 deletions docs/models/ner.md
@@ -16,7 +16,7 @@ and made available through the DaNLP library.
| [Polyglot](https://polyglot.readthedocs.io/en/latest/POS.html/#) | Wikipedia | Polyglot | PER, ORG, LOC ||
| [daner](https://github.com/ITUnlp/daner) | [Derczynski et al. (2014)](https://www.aclweb.org/anthology/E14-2016) | [ITU NLP](https://nlp.itu.dk/) | PER, ORG, LOC ||

-#### BERT
+#### 🔧 BERT
The BERT [(Devlin et al. 2019)](https://www.aclweb.org/anthology/N19-1423/) NER model is based on the pretrained [Danish BERT](https://github.com/botxo/nordic_bert) representations by BotXO, which
have been finetuned on the [DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane)
dataset [(Hvingelby et al. 2020)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.565.pdf). The finetuning has been done using the [Transformers](https://github.com/huggingface/transformers) library from HuggingFace.
@@ -32,7 +32,7 @@ print(" ".join(["{}/{}".format(tok,lbl) for tok,lbl in zip(tokens,labels)]))
```


-#### Flair
+#### 🔧 Flair
The Flair [(Akbik et al. 2018)](https://www.aclweb.org/anthology/C18-1139/) NER model
uses pretrained [Flair embeddings](https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md#-training-details-for-flair-embeddings)
in combination with fastText word embeddings. The model is trained using the [Flair](https://github.com/flairNLP/flair)
@@ -52,7 +52,7 @@ flair_model.predict(sentence)
print(sentence.to_tagged_string())
```

-#### spaCy
+#### 🔧 spaCy
The [spaCy](https://spacy.io/) model is trained for several NLP tasks [(read more here)](https://github.com/alexandrainst/danlp/blob/master/docs/spacy.md) using the [DDT and DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane) annotations.
The spaCy model can be loaded with DaNLP to do NER predictions in the following way.
```python
# a minimal sketch, assuming the DaNLP spaCy loader
from danlp.models import load_spacy_model

nlp = load_spacy_model()
doc = nlp('Jens Peter Hansen kommer fra Danmark')
for ent in doc.ents:
    print(ent.text, ent.label_)
```
4 changes: 2 additions & 2 deletions docs/models/pos.md
@@ -16,7 +16,7 @@ A medium blog using Part of Speech tagging on Danish, can be found [here](<http

![](../imgs/postag_eksempel.gif)

-##### :wrench:Flair
+##### 🔧 Flair

This project provides a trained part-of-speech tagging model for Danish using the [Flair](<https://github.com/flairNLP/flair>) framework from Zalando, based on the paper [Akbik et al. (2018)](<https://alanakbik.github.io/papers/coling2018.pdf>). The model is trained on the [Danish Dependency Treebank](<https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane>) using FastText word embeddings and Flair contextual word embeddings, trained in this project on data from Wikipedia and the EuroParl corpus; see [here](<https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md>).

@@ -45,7 +45,7 @@ print(sentence.to_tagged_string())



-##### :wrench:SpaCy
+##### 🔧 SpaCy

Read more about the spaCy model in the dedicated [spaCy docs](<https://github.com/alexandrainst/danlp/blob/master/docs/spacy.md>); it has also been trained using the [Danish Dependency Treebank](<https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane>) data.

22 changes: 11 additions & 11 deletions docs/models/sentiment_analysis.md
@@ -1,5 +1,5 @@
Sentiment Analysis
-============================
+==================

Sentiment analysis is a broad term for a set of tasks with the purpose of identifying an emotion or opinion in a text.

@@ -9,8 +9,8 @@ In this repository we provide an overview of open sentiment analysis models and
| ------------------------------------------------------------ | -------- | ------------------------------------------------------------ | --------------------------------------------------------- | ------------------ | ------------------------------------------------------------ | ----- |
| [AFINN](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#afinn) | Wordlist | [Apache 2.0](https://github.com/fnielsen/afinn/blob/master/LICENSE) | Finn Årup Nielsen | Polarity | Score (integers) ||
| [Sentida](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#sentida) | Wordlist | [GPL-3.0](https://github.com/esbenkc/emma/blob/master/LICENSE) | Jacob Dalsgaard, Lars Kjartan Svenden og Gustav Lauridsen | Polarity | Score (continuous) ||
-| [Bert Emotion](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-emotion) | BERT | CC-BY_4.0 | Alexandra Institute | Emotions | glæde/sindsro, forventning/interesse, tillid/accept, overraskelse/forundring, vrede/irritation, foragt/modvilje, sorg/skuffelse, frygt/bekymring, No emotion | ✔️ |
-| [Bert Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) (beta) | BERT | CC-BY_4.0 | Alexandra Institute | Polarity, Analytic | ['postive', 'neutral', 'negative'] and ['subjective', 'objective] | ✔️ |
+| [BERT Emotion](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-emotion) | BERT | CC-BY_4.0 | Alexandra Institute | Emotions | glæde/sindsro, forventning/interesse, tillid/accept, overraskelse/forundring, vrede/irritation, foragt/modvilje, sorg/skuffelse, frygt/bekymring, No emotion | ✔️ |
+| [BERT Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) (beta) | BERT | CC-BY_4.0 | Alexandra Institute | Polarity, Analytic | ['positive', 'neutral', 'negative'] and ['subjective', 'objective'] | ✔️ |
| [SpaCy Sentiment](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrench-spacy-sentiment) (beta) | spaCy | MIT | Alexandra Institute | Polarity | 'positive', 'neutral', 'negative' | ✔️ |


@@ -25,11 +25,11 @@ The tool scores texts with an integer where scores <0 are negative, =0 are neutr
The tool Sentida [(Lauridsen et al. 2019)](https://tidsskrift.dk/lwo/article/view/115711)
uses a lexicon-based approach to sentiment analysis and scores texts with a continuous value. There exist both an R version and a Python implementation. In this documentation we evaluate the Python version from [sentida](https://github.com/guscode/sentida).
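The wordlist approach that both tools share can be sketched in a few lines (a toy lexicon for illustration, not the actual AFINN or Sentida scores):

```python
# Toy sketch of a wordlist-based polarity scorer in the style of AFINN:
# each known word carries an integer valence, and the text score is the sum.
toy_lexicon = {"god": 3, "glad": 3, "dårlig": -3, "ked": -2}

def score(text):
    # unknown words contribute 0
    return sum(toy_lexicon.get(tok, 0) for tok in text.lower().split())

print(score("en god og glad dag"))   # positive
print(score("en dårlig dag"))        # negative
```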

-#### :wrench:Bert Emotion
+#### 🔧 BERT Emotion

The emotion classifier is developed in collaboration with Danmarks Radio, which has granted access to a set of social media data. The data has first been manually annotated to distinguish between the binary problem of emotion versus no emotion, and afterwards tagged with 8 emotions. The BERT [(Devlin et al. 2019)](https://www.aclweb.org/anthology/N19-1423/) emotion model is finetuned on this data using the [Transformers](https://github.com/huggingface/transformers) library from HuggingFace, and it is based on the pretrained [Danish BERT](https://github.com/botxo/nordic_bert) representations by BotXO. The model classifying the eight emotions achieves an accuracy of 0.65 and a macro-F1 of 0.64 on the social media test set from DR's Facebook, containing 999 examples. We do not have permission to distribute the data.

-Below is a small snippet for getting started using the Bert Emotion model. Please notice that the BERT model can maximum take 512 tokens as input, however the code allows for overfloating tokens and will therefore not give an error but just a warning.
+Below is a small snippet for getting started using the BERT Emotion model. Please note that the BERT model can take at most 512 tokens as input; the code handles overflowing tokens and will therefore give a warning instead of an error.

```python
from danlp.models import load_bert_emotion_model
```
@@ -50,11 +50,11 @@ classifier._classes()



-#### :wrench:Bert Tone
+#### 🔧 BERT Tone

The tone analyzer consists of two BERT [(Devlin et al. 2019)](https://www.aclweb.org/anthology/N19-1423/) classification models: the first recognizes the tags positive, neutral and negative, and the second the tags subjective and objective. This is a first version of the models, and work remains to improve performance. Both models are finetuned on annotated Twitter data using the [Transformers](https://github.com/huggingface/transformers) library from HuggingFace, and are based on the pretrained [Danish BERT](https://github.com/botxo/nordic_bert) representations by BotXO. The data used is manually annotated data from Twitter Sentiment (train part) ([see here](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#twitter-sentiment)) and EuroParl sentiment 2 ([see here](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#europarl-sentiment2)); both datasets can be loaded with the DaNLP package.

-Below is a small snippet for getting started using the Bert Tone model. Please notice that the BERT model can maximum take 512 tokens as input, however the code allows for overfloating tokens and will therefore not give an error but just a warning.
+Below is a small snippet for getting started using the BERT Tone model. Please note that the BERT model can take at most 512 tokens as input; the code handles overflowing tokens and will therefore give a warning instead of an error.

```python
from danlp.models import load_bert_tone_model
```
@@ -73,11 +73,11 @@ classifier._clases()



-#### :wrench: SpaCy Sentiment
+#### 🔧 SpaCy Sentiment

SpaCy Sentiment is a text classification model trained using spaCy's built-in command line interface. It uses the CoNLL2017 word vectors; read about them [here](https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md).

-The model is trained using hard distil of the [Bert Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) (beta) - Meaning, the Bert Tone model is used to make predictions on 50.000 sentences from Twitter and 50.000 sentences from [Europarl7](http://www.statmt.org/europarl/). These data is then used to trained a spacy model. Notice the dataset has first been balanced between the classes by oversampling. The model recognizes the classses: 'positiv', 'neutral' and 'negative'.
+The model is trained using hard distillation of the [BERT Tone](https://github.com/alexandrainst/danlp/blob/master/docs/models/sentiment_analysis.md#wrenchbert-tone) model (beta): the BERT Tone model is used to make predictions on 50,000 sentences from Twitter and 50,000 sentences from [Europarl7](http://www.statmt.org/europarl/), and these predictions are then used to train a spaCy model. Note that the dataset has first been balanced between the classes by oversampling. The model recognizes the classes 'positive', 'neutral' and 'negative'.

It is a first version.
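The class balancing by oversampling mentioned above can be sketched as follows (illustrative only, not the actual training script):

```python
import random
from collections import Counter

def oversample(examples):
    """Balance classes by duplicating minority-class examples at random."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        # draw extra copies until this class reaches the majority-class size
        balanced.extend(random.choices(items, k=target - len(items)))
    return balanced

data = [("fin dag", "positive")] * 3 + [("øv", "negative")] * 1
counts = Counter(label for _, label in oversample(data))
print(counts)  # both classes now equally represented
```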

@@ -124,7 +124,7 @@ In the table we consider the accuracy and macro-f1 in brackets, but to get the s
| ---- | ------------------ | ------------- | ---- |
| AFINN | 0.68 (0.68) | 0.66 (0.61) | 0.48 (0.46) |
| Sentida (version 0.5.0) | 0.67 (0.65) | 0.58 (0.55) | 0.44 (0.44) |
-| Bert Tone (polarity, version 0.0.1) | **0.79** (0.78) | **0.74** (0.67) | **0.73** (0.70) |
+| BERT Tone (polarity, version 0.0.1) | **0.79** (0.78) | **0.74** (0.67) | **0.73** (0.70) |
| spaCy sentiment (version 0.0.1) | 0.74 (0.73) | 0.66 (0.61) | 0.66 (0.60) |

**Benchmark of subjective versus objective classification**
@@ -137,7 +137,7 @@ The script for the benchmarks can be found [here](https://github.com/alexandrain

| Model | Twitter sentiment (analytic) |
| ----------------------------------- | ---------------------------- |
-| Bert Tone (analytic, version 0.0.1) | 0.90 (0.77) |
+| BERT Tone (analytic, version 0.0.1) | 0.90 (0.77) |



7 changes: 4 additions & 3 deletions docs/spacy.md
@@ -1,4 +1,5 @@
-# SpaCy model in Danish
+SpaCy model in Danish
+=====================

SpaCy is an industrial-strength open-source framework for NLP; you can read more about it on their [homepage](https://spacy.io/) or [GitHub](https://github.com/explosion/spaCy).

@@ -32,7 +33,7 @@ The following lists the performance scores of the spaCy model provided in DaNLP



-## :hatching_chick: Getting started with the spaCy model
+### 🐣 Getting started with the spaCy model

Below are some small snippets to get started using the spaCy model within the DaNLP package. More information about using spaCy can be found on spaCy's own [page](https://spacy.io/).

Expand Down Expand Up @@ -104,7 +105,7 @@ Alexandra ORG
Instituttet ORG
```

-## :hatching_chick: Start training your own text classification model
+### 🐣 Start training your own text classification model

The spaCy framework provides an easy command line tool for training an existing model, for example by adding a text classifier. This short example shows how to do so using your own annotated data. It is also possible to use any of the static embeddings provided in the DaNLP wrapper.
