# Tutorial: Danish Part of Speech Taggers

This tutorial shows how to get started with two Danish part-of-speech (POS) taggers: Polyglot's open source POS tagger and a POS tagger trained with the Flair framework (Zalando). The Flair POS tagger is trained by this project and is included in the DaNLP package.  

#### Credits:  
Polyglot: https://polyglot.readthedocs.io/en/latest/POS.html  
Flair: https://github.com/zalandoresearch/flair  
Data from UD_Danish: https://github.com/UniversalDependencies/UD_Danish-DDT/tree/master

### Let's get started

#### Installation
To run all the examples in this notebook, the following Python packages are required. They can be installed using pip:

```pip install danlp``` Read more about the package on the front page of this GitHub repository. 

```pip install polyglot==16.7.4``` 
Read more about Polyglot installation [here](https://polyglot.readthedocs.io/en/latest/Installation.html). Note that Polyglot requires you to install other dependencies, such as pyicu, pycld2 and Morfessor. Note also that the Polyglot package is fairly old.

```pip install flair==0.4.2 ```
Read more about Flair installation [here](https://pypi.org/project/flair/)

It is recommended to install the packages in a virtual environment, e.g. a pip-based virtual environment. Read more about it [here](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
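
Before running the cells below, it can help to check that the packages are visible to the interpreter you are using. The following is a minimal sketch using only the standard library. The module names in `REQUIRED_MODULES` are assumed to match the pip package names above, and `missing_modules` is a hypothetical helper, not part of DaNLP.

```python
import importlib.util
import sys

# Top-level module names, assumed to match the pip package names above.
REQUIRED_MODULES = ["danlp", "polyglot", "flair"]


def missing_modules(names):
    """Return the module names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# In a virtual environment, sys.prefix differs from sys.base_prefix.
print("Inside a virtual environment:", sys.prefix != sys.base_prefix)

missing = missing_modules(REQUIRED_MODULES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only confirms that the modules can be located. It does not verify the pinned versions.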



#### First load and import
1) Import Polyglot and download the Polyglot part-of-speech tagging model through Polyglot

2) Import flair and download the Flair part of speech tagging model trained in DaNLP

In [1]:
# POLYGLOT import libraries and load model
from polyglot.text import Text
# download the Danish part of speech tagger from Polyglot
from polyglot.downloader import downloader
downloader.download("embeddings2.da")
downloader.download("pos2.da")

[polyglot_data] Downloading package embeddings2.da to
[polyglot_data]     /root/polyglot_data...
[polyglot_data] Downloading package pos2.da to /root/polyglot_data...


True

In [2]:
# FLAIR import libraries and load model
from danlp.models.pos_taggers import load_pos_tagger_with_flair
from flair.data import Sentence, Token
from flair.data_fetcher import NLPTaskDataFetcher
from segtok.segmenter import split_single

# Load the POS tagger using the DaNLP wrapper
flair_model = load_pos_tagger_with_flair()

2019-07-10 09:20:00,967 loading file /root/.danlp/flair.pos.pt


### Simple usage
This shows basic use of the two frameworks on a single sentence.

In [9]:
# Give a sentence to try it on
example='jeg hopper på en bil som er rød sammen med Jens-Peter E. Hansen'

# The Flair model
print('\x1b[1;30m'+'The flair Model:' +'\x1b[0m')
sentence = Sentence(example) 
flair_model.predict(sentence) 
print(sentence.to_tagged_string())
print()

# The Polyglot model
print('\x1b[1;30m'+'The Polyglot model: ' +'\x1b[0m')
text = Text(example, hint_language_code='da')
print(text.pos_tags)


The flair Model:
jeg <PRON> hopper <VERB> på <ADP> en <DET> bil <NOUN> som <ADP> er <AUX> rød <ADJ> sammen <ADV> med <ADP> Jens-Peter <PROPN> E. <PROPN> Hansen <PROPN>

The Polyglot model: 
[('jeg', 'PRON'), ('hopper', 'VERB'), ('på', 'ADP'), ('en', 'PRON'), ('bil', 'NOUN'), ('som', 'PART'), ('er', 'VERB'), ('rød', 'ADJ'), ('sammen', 'ADV'), ('med', 'ADP'), ('Jens', 'PROPN'), ('-', 'PUNCT'), ('Peter', 'PROPN'), ('E', 'PROPN'), ('.', 'PUNCT'), ('Hansen', 'PROPN')]


In the example above you may have noticed that the two models are not aligned. Polyglot, for example, splits the name 'Jens-Peter' into three tokens, whereas the Flair model keeps it as one. Furthermore, the Flair model appears to use some tags, such as 'AUX', that the Polyglot model does not use.  

The Flair model is trained on the Danish Dependency Treebank (UD_Danish) dataset. The Polyglot documentation states the same, but Polyglot may have been trained on an older version with a slightly different tokenization, and perhaps 14 tags instead of 17. 
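
To make the differences concrete, the two outputs printed above can be compared directly. The sketch below copies both outputs as literals, so it runs without either model. `parse_tagged_string` is a hypothetical helper written for this illustration. The comparison only covers the tags that occur in this one sentence, not the full tag sets of the models.

```python
# Outputs copied from the example above.
flair_output = (
    "jeg <PRON> hopper <VERB> på <ADP> en <DET> bil <NOUN> som <ADP> "
    "er <AUX> rød <ADJ> sammen <ADV> med <ADP> Jens-Peter <PROPN> "
    "E. <PROPN> Hansen <PROPN>"
)
polyglot_output = [
    ('jeg', 'PRON'), ('hopper', 'VERB'), ('på', 'ADP'), ('en', 'PRON'),
    ('bil', 'NOUN'), ('som', 'PART'), ('er', 'VERB'), ('rød', 'ADJ'),
    ('sammen', 'ADV'), ('med', 'ADP'), ('Jens', 'PROPN'), ('-', 'PUNCT'),
    ('Peter', 'PROPN'), ('E', 'PROPN'), ('.', 'PUNCT'), ('Hansen', 'PROPN'),
]


def parse_tagged_string(tagged):
    """Turn 'word <TAG> word <TAG> ...' into a list of (word, tag) pairs."""
    parts = tagged.split()
    return [(word, tag.strip('<>')) for word, tag in zip(parts[0::2], parts[1::2])]


flair_pairs = parse_tagged_string(flair_output)
flair_tags = {tag for _, tag in flair_pairs}
polyglot_tags = {tag for _, tag in polyglot_output}

print("Tags only used by Flair:   ", sorted(flair_tags - polyglot_tags))
print("Tags only used by Polyglot:", sorted(polyglot_tags - flair_tags))
print("Number of tokens (Flair, Polyglot):", len(flair_pairs), len(polyglot_output))
```

The differing token counts reflect the tokenization difference described above: Polyglot splits 'Jens-Peter' and 'E.' into several tokens.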

You can load the test (or training) data in the following way and examine the ground truth. 


In [7]:
# Download the Danish UD treebank (if you want to run the next cell)
!git clone  https://github.com/UniversalDependencies/UD_Danish-DDT.git

Cloning into 'UD_Danish-DDT'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 185 (delta 1), reused 4 (delta 1), pack-reused 180
Receiving objects: 100% (185/185), 4.07 MiB | 6.12 MiB/s, done.
Resolving deltas: 100% (106/106), done.


In [8]:
# read the test dataset from the Danish UD treebank 
sentences=NLPTaskDataFetcher.read_conll_ud('UD_Danish-DDT/da_ddt-ud-test.conllu') 

# Example sentence
eksemple=sentences[97].to_plain_string()
print(eksemple)

# the ground truth 
print()
print('\x1b[1;30m'+'The ground truth from the annotated data'+'\x1b[0m')
print(sentences[97].to_tagged_string('upos'))

# Polyglot
print()
print('\x1b[1;30m'+'Polyglots predictions:'+'\x1b[0m')
poly_sentence = Text(eksemple, hint_language_code='da')
print(poly_sentence.pos_tags)

#Flair
print()
print('\x1b[1;30m'+'Flairs predictions:'+'\x1b[0m')
flair_sentence = Sentence(eksemple)
flair_model.predict(flair_sentence)
print(flair_sentence.to_tagged_string())

  


To spanske TV-folk blev gennembanket af Sovjet-soldater , fordi de havde filmet en såret litauer , der blev hjulpet væk fra slagmarken .

The ground truth from the annotated data
To <NUM> spanske <ADJ> TV-folk <NOUN> blev <AUX> gennembanket <VERB> af <ADP> Sovjet-soldater <NOUN> , <PUNCT> fordi <SCONJ> de <PRON> havde <AUX> filmet <VERB> en <DET> såret <VERB> litauer <NOUN> , <PUNCT> der <PRON> blev <AUX> hjulpet <VERB> væk <ADV> fra <ADP> slagmarken <NOUN> . <PUNCT>

Polyglots predictions:
[('To', 'NUM'), ('spanske', 'ADJ'), ('TV', 'NOUN'), ('-', 'PUNCT'), ('folk', 'NOUN'), ('blev', 'VERB'), ('gennembanket', 'ADJ'), ('af', 'ADP'), ('Sovjet', 'PROPN'), ('-', 'PUNCT'), ('soldater', 'NOUN'), (',', 'PUNCT'), ('fordi', 'SCONJ'), ('de', 'PRON'), ('havde', 'VERB'), ('filmet', 'VERB'), ('en', 'PRON'), ('såret', 'ADJ'), ('litauer', 'ADV'), (',', 'PUNCT'), ('der', 'PART'), ('blev', 'VERB'), ('hjulpet', 'VERB'), ('væk', 'ADV'), ('fra', 'ADP'), ('slagmarken', 'NOUN'), ('.', 
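
Because Polyglot splits tokens such as 'TV-folk' that the treebank keeps whole, its predictions cannot be compared with the ground truth position by position. One possible approach, sketched below, is to group Polyglot's tokens under the gold token they were split from. `align_to_gold` is a hypothetical helper, and it assumes that both tokenizations spell out the same characters in the same order. The data is the start of the sentence shown above.

```python
def align_to_gold(gold_tokens, predicted):
    """Group predicted (token, tag) pairs under the gold token they came from.

    Assumes both tokenizations contain the same characters in the same order.
    """
    aligned = []
    i = 0
    for gold in gold_tokens:
        pieces, tags = "", []
        # Consume predicted tokens until they spell out the gold token.
        while len(pieces) < len(gold) and i < len(predicted):
            token, tag = predicted[i]
            pieces += token
            tags.append(tag)
            i += 1
        if pieces != gold:
            raise ValueError(f"Cannot align {pieces!r} with {gold!r}")
        aligned.append((gold, tags))
    return aligned


# First four gold tokens and the matching Polyglot predictions from above.
gold_tokens = ["To", "spanske", "TV-folk", "blev"]
polyglot_pairs = [
    ('To', 'NUM'), ('spanske', 'ADJ'), ('TV', 'NOUN'),
    ('-', 'PUNCT'), ('folk', 'NOUN'), ('blev', 'VERB'),
]

for gold, tags in align_to_gold(gold_tokens, polyglot_pairs):
    print(gold, tags)
```

How to reduce several tags to one (for example, taking the tag of the last piece) is a design choice that this sketch leaves open.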

### Let's play a bit

In [10]:
# create dictionaries mapping each tag to a color and to an explanation

color_dict = {
    'ADJ': "\033[1;30;43m",
    'ADP': "\033[1;30;45m",
    'ADV': "\033[1;30;105m",
    'AUX': "\033[1;30;101m",
    'CONJ': "\033[1;30;103m",
    'CCONJ': "\033[1;30;103m",  # UD v2 name for coordinating conjunction
    'DET': "\033[1;30;46m",
    'INTJ': "\033[1;37;46m",
    'NOUN': "\033[1;30;42m",
    'NUM': "\033[1;37;100m",
    'PART':  "\033[1;30;47m",
    'PRON': "\033[1;30;104m",
    'PROPN': "\033[1;30;44m",
    'PUNCT':"\033[1;30;102m",
    'SCONJ': "\033[1;37;103m",
    'SYM': "\033[1;37;102m",
    'VERB': "\033[1;30;41m",
    'X': "\033[1;37;47m"
}

explanation_dict = {
    'ADJ': 'Adjective',
    'ADP': 'Adposition',
    'ADV': 'Adverb',
    'AUX': 'Auxiliary verb',
    'CONJ': 'Coordinating conjunction',
    'CCONJ': 'Coordinating conjunction',  # UD v2 name for the same class
    'DET': 'Determiner',
    'INTJ': 'Interjection',
    'NOUN': 'Noun',
    'NUM': 'Numeral',
    'PART':  'Particle',
    'PRON': 'Pronoun',
    'PROPN': 'Proper noun',
    'PUNCT':'Punctuation',
    'SCONJ': 'Subordinating conjunction',
    'SYM': 'Symbol',
    'VERB': 'Verb',
    'X': 'Other'
}

In [11]:
def show_the_tags(sentence, color_dict, explanation_dict, flair_model):
    """Print a sentence with each token colored by its predicted POS tag."""
    # Flair prediction
    flair_sentence = Sentence(sentence)
    flair_model.predict(flair_sentence)

    # the predicted tag of each token (the text before the first space of the label)
    tag_list = [str(token.tags['upos']).split(' ')[0] for token in flair_sentence]

    # show the tokens in color; tags missing from color_dict are left uncolored
    print('\033[1;30;m\n The tagged sentence:\n')
    pos_color_list = [color_dict.get(tag, '') + token.text
                      for tag, token in zip(tag_list, flair_sentence)]
    print(' '.join(pos_color_list))

    # get the explanations in color; unknown tags fall back to the tag itself
    exp_color_list = [color_dict.get(tag, '') + explanation_dict.get(tag, tag) for tag in tag_list]

    # remove duplicate explanations but keep the order of first appearance
    exp_color_list = list(dict.fromkeys(exp_color_list))
    print('\033[1;30;m\n Explanations:\n')
    # print the explanations
    print('\033[1;30;m '.join(exp_color_list))





In [12]:
show_the_tags('vis mig alle ordklasserne', color_dict, explanation_dict, flair_model)


 The tagged sentence:

vis mig alle ordklasserne

 Explanations:

Verb Pronoun Adjective Noun
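
The function above switches colors but never switches them off, so the last color can carry over into later output in some terminals. A small sketch of one way to avoid this is to wrap each piece of text and append the ANSI reset code `\033[0m`, which the earlier cells already use. `colorize` is a hypothetical helper, and the two color codes are taken from `color_dict`.

```python
RESET = "\033[0m"  # ANSI code that restores the default style


def colorize(text, color_code):
    """Wrap text in an ANSI color code and reset the style afterwards."""
    return f"{color_code}{text}{RESET}"


# Two codes taken from color_dict above (VERB and PRON).
print(colorize("vis", "\033[1;30;41m"), colorize("mig", "\033[1;30;104m"))
```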


## Surface variables

Count the POS tags in a text. The counts can be used as features.

In [13]:
# See also the tutorials on Flair's GitHub page:
# https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md

# many sentences
text = "Dette her er en sætning. Detter her er en anden. Find hvormange af hver ordklasserne der er i denne tekst."
# or load the text from a file, for example

# split the text into sentences for the Flair framework
sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(text)]

# predict the pos tags for the list of sentences
flair_model.predict(sentences) # for large texts, set the size of the mini-batches
all_tags=[str(token.tags['upos']).split(' ')[0] for sent in sentences for token in sent]

# initialize a dictionary for counting
count_dict = {
    'ADJ': 0,
    'ADP': 0,
    'ADV': 0,
    'AUX': 0,
    'CONJ': 0,
    'DET': 0,
    'INTJ': 0,
    'NOUN': 0,
    'NUM': 0,
    'PART':  0,
    'PRON': 0,
    'PROPN': 0,
    'PUNCT':0,
    'SCONJ': 0,
    'SYM': 0,
    'VERB': 0,
    'X': 0
}
# count the tags and save the counts in the dictionary
count_dict = {k: all_tags.count(k) for k in count_dict}
count_dict

{'ADJ': 0,
 'ADP': 2,
 'ADV': 5,
 'AUX': 2,
 'CONJ': 0,
 'DET': 2,
 'INTJ': 0,
 'NOUN': 3,
 'NUM': 0,
 'PART': 0,
 'PRON': 4,
 'PROPN': 0,
 'PUNCT': 3,
 'SCONJ': 0,
 'SYM': 0,
 'VERB': 2,
 'X': 0}
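
Raw counts depend on the length of the text, so when the counts are used as features it is common to normalize them. The sketch below uses `collections.Counter` from the standard library and returns relative frequencies for a fixed list of tags. `tag_features` is a hypothetical helper, and `example_tags` is a made-up tag sequence standing in for `all_tags` from the cell above.

```python
from collections import Counter


def tag_features(tags, tag_set):
    """Return the relative frequency of each tag in tag_set among tags."""
    counts = Counter(tags)
    total = len(tags)
    # Counter returns 0 for tags that never occur; guard against empty input.
    return {tag: (counts[tag] / total if total else 0.0) for tag in tag_set}


# Made-up tag sequence; in the notebook this would be `all_tags`.
example_tags = ["PRON", "ADV", "AUX", "DET", "NOUN", "PUNCT", "PRON", "ADV"]
features = tag_features(example_tags, ["ADV", "NOUN", "PRON", "VERB"])
print(features)
```

Passing a fixed `tag_set` keeps the feature vector the same length for every text, including tags that do not occur in it.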