# Getting started with DaNLP

This tutorial provides code for getting started with the DaNLP package, covering each of the tasks it supports. 
More information on each model and dataset can be found in the docs folder. 
The code snippets from the documentation are reused here as minimal examples.

Overview: 

1. Models
2. Datasets


## 1. Models

    1.1. Word embeddings
    1.2. Part-of-speech tagging
    1.3. Named entity recognition
    1.4. Dependency Parsing & Noun Phrase Chunking
    1.5. Sentiment Analysis

### 1.1. Word embeddings

You can choose between using static or dynamic word embeddings. 

Below is an example of how to download and load pretrained static word embeddings with gensim or spaCy. 

In [None]:
from danlp.models.embeddings import load_wv_with_gensim

# Load with gensim
word_embeddings = load_wv_with_gensim('conll17.da.wv')


In [None]:
# Test the embeddings; each result is printed so that every call is shown, not only the last one
print(word_embeddings.most_similar(positive=['københavn', 'england'], negative=['danmark'], topn=1))
print(word_embeddings.doesnt_match("vand sodavand brød vin juice".split()))
print(word_embeddings.similarity('københavn', 'århus'))
print(word_embeddings.similarity('københavn', 'esbjerg'))
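
In gensim, `similarity` returns the cosine similarity between the two word vectors. As a minimal sketch of what that measures, the snippet below computes it by hand. The 3-dimensional vectors are invented toy values (they are not taken from `conll17.da.wv`), and the helper name `cosine_similarity` is only illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
toy_vectors = {
    'københavn': [0.9, 0.1, 0.3],
    'århus': [0.8, 0.2, 0.4],
    'brød': [0.1, 0.9, 0.0],
}

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(toy_vectors['københavn'], toy_vectors['århus']))
# Vectors pointing in different directions score closer to 0
print(cosine_similarity(toy_vectors['københavn'], toy_vectors['brød']))
```

The score depends only on the direction of the vectors, not on their length, which is why it is a common choice for comparing embeddings.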

In [None]:
from danlp.models.embeddings import load_wv_with_spacy

# Load with spaCy
word_embeddings = load_wv_with_spacy('conll17.da.wv')

Here is an example of how to load the pretrained dynamic flair embeddings.

In [None]:
from danlp.models.embeddings import load_context_embeddings_with_flair


# Use the wrapper from DaNLP to download and load embeddings with Flair
# You can combine it with one of the static embeddings
stacked_embeddings = load_context_embeddings_with_flair(word_embeddings='wiki.da.wv')


In [None]:
from flair.data import Sentence
# test 

# Embed two different sentences
sentence1 = Sentence('Han fik bank')
sentence2 = Sentence('Han fik en ny bank')
stacked_embeddings.embed(sentence1)
stacked_embeddings.embed(sentence2)

# Show that the embeddings are contextual: 'bank' gets a different embedding in each sentence
print('{} dimensions out of {} are equal'.format(int(sum(sentence2[4].embedding==sentence1[2].embedding)), len(sentence1[2].embedding)))

Here is an example of how to use BERT for embedding tokens and sentences.

In [None]:
from danlp.models import load_bert_base_model
model = load_bert_base_model()

In [None]:
vecs_embedding, sentence_embedding, tokenized_text = model.embed_text('Han sælger frugt')

### 1.2. Part-of-speech tagging

We provide two models for part-of-speech (POS) tagging. Depending on your needs, you might want to use the flair model (better accuracy) or the spaCy model (higher speed). 

The following snippet shows how to load and use the flair model.


In [None]:
from danlp.models import load_flair_pos_model

# Load the POS tagger using the DaNLP wrapper
tagger = load_flair_pos_model()

In [None]:
from flair.data import Sentence

# Using the flair POS tagger
sentence = Sentence('Jeg hopper på en bil , som er rød sammen med Niels .') 
tagger.predict(sentence) 
print(sentence.to_tagged_string())

The following snippet shows how to load and use the spaCy model.


In [None]:
from danlp.models import load_spacy_model

# Load the POS tagger using the DaNLP wrapper
nlp = load_spacy_model()

In [None]:
# Using the spaCy POS tagger
doc = nlp('Jeg hopper på en bil, som er rød sammen med Niels.')
# Format each token as "text <POS>" and join them into one line
pred = ' '.join('{} <{}>'.format(token.text, token.pos_) for token in doc)
print(pred)

### 1.3. Named entity recognition

We provide three models for Named Entity Recognition (NER): BERT, flair and spaCy. 

Here is an example of how to use the BERT NER model. 

In [None]:
# load BERT NER
from danlp.models import load_bert_ner_model
bert = load_bert_ner_model()

In [None]:
# Get lists of tokens and labels in BIO format
tokens, labels = bert.predict("Jens Peter Hansen kommer fra Danmark")
print(" ".join(["{}/{}".format(tok,lbl) for tok,lbl in zip(tokens,labels)]))
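
In BIO format, a `B-` label marks the first token of an entity, `I-` marks a token that continues it, and `O` marks a token outside any entity. The sketch below shows one way to group such labels into entity spans. The labels are hand-written to illustrate the format and are not actual model output, and the helper name `bio_to_spans` is hypothetical (it is not part of DaNLP).

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labelled tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        starts_new = label.startswith('B-')
        # An I- label only continues the open entity if the types match
        continues = label.startswith('I-') and label[2:] == current_type
        # Close the open entity unless this token continues it
        if current_tokens and not continues:
            spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if starts_new or (label.startswith('I-') and not continues):
            # B- opens an entity; a stray I- is treated as an opening too
            current_tokens, current_type = [token], label[2:]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))
    return spans

# Hand-written labels, for illustration only
tokens = ['Jens', 'Peter', 'Hansen', 'kommer', 'fra', 'Danmark']
labels = ['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'B-LOC']
print(bio_to_spans(tokens, labels))
# [('Jens Peter Hansen', 'PER'), ('Danmark', 'LOC')]
```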

In [None]:
# To control the tokenization, pass BERT a list of tokens instead of a string
# (spaCy, for example, can be used for tokenization)
# With this option, the output can also be chosen to be a dict with tags and positions instead of BIO format
tekst_tokenized = ['Han', 'hedder', 'Anders', 'And', 'Andersen', 'og', 'bor', 'i', 'Århus', 'C']
bert.predict(tekst_tokenized, IOBformat=False)

Below is an example of using the flair NER tagger.

In [None]:
from danlp.models import load_flair_ner_model

# Load the NER tagger using the DaNLP wrapper
flair_model = load_flair_ner_model()

In [None]:
from flair.data import Sentence

# Using the flair NER tagger
sentence = Sentence('Jens Peter Hansen kommer fra Danmark') 
flair_model.predict(sentence) 
print(sentence.to_tagged_string())

Here is an example of NER with spaCy. 

In [None]:
# load the model
from danlp.models import load_spacy_model

nlp = load_spacy_model()

In [None]:
# use spaCy for NER
doc = nlp('Jens Peter Hansen kommer fra Danmark') 
for tok in doc:
    print("{} {}".format(tok,tok.ent_type_))

### 1.4. Dependency Parsing & Noun Phrase Chunking

We provide dependency parsing with our spaCy model, as well as a wrapper for deducing NP chunks from the dependencies. 

In [None]:
# load the model
from danlp.models import load_spacy_model

nlp = load_spacy_model()

In [None]:
# use the spaCy model for dependency parsing only

text = 'Et syntagme er en gruppe af ord, der hænger sammen'

doc = nlp(text)

In [None]:
# and/or use our wrapper for deducing NP-chunks
from danlp.models import load_spacy_chunking_model

# Load the chunker using the DaNLP wrapper
chunker = load_spacy_chunking_model(nlp)

# Using the chunker to predict BIO tags
np_chunks = chunker.predict(text)


In [None]:
# print dependency and chunk features for each token

syntactic_features = ['Id', 'Text', 'Head', 'Dep', 'NP-chunk']
head_format = "\033[1m{!s:>11}\033[0m" * len(syntactic_features)
row_format = "{!s:>11}" * len(syntactic_features)

print(head_format.format(*syntactic_features))
for token, nc in zip(doc, np_chunks):
    print(row_format.format(token.i, token.text, token.head.i, token.dep_, nc))

### 1.5. Sentiment Analysis

With the DaNLP package, we provide two BERT models, one for detecting emotions and one for detecting tone in texts, and a spaCy model for predicting the polarity of a sentence. 

Below is some code for using BERT for detecting emotions. 

In [None]:
# load the model
from danlp.models import load_bert_emotion_model
classifier = load_bert_emotion_model()

In [None]:
# using the classifier
print(classifier.predict('bilen er flot'))
print(classifier.predict('jeg ejer en rød bil og det er en god bil'))
print(classifier.predict('jeg ejer en rød bil men den er gået i stykker'))

In [None]:
# get probabilities and matching class names
proba = classifier.predict_proba('jeg ejer en rød bil men den er gået i stykker', no_emotion=False)[0]
classes = classifier._classes()[0]
for cl, pb in zip(classes, proba):
    print(cl,'\t', pb)

Here is an example of using BERT for tone detection.

In [None]:
# load the model
from danlp.models import load_bert_tone_model
classifier = load_bert_tone_model()

In [None]:
# using the classifier
print(classifier.predict('Analysen viser, at økonomien bliver forfærdelig dårlig'))
print(classifier.predict('Jeg tror alligvel, det bliver godt'))

In [None]:
# get probabilities and matching class names
proba = classifier.predict_proba('Analysen viser, at økonomien bliver forfærdelig dårlig')[0]
classes = classifier._classes()[0]
for cl, pb in zip(classes, proba):
    print(cl,'\t', pb)

Here is how to use spaCy for sentiment analysis.

In [None]:
# load the model
from danlp.models import load_spacy_model

# vectorError=True works around an error saying that da.vectors is not found
nlp = load_spacy_model(textcat='sentiment', vectorError=True)

In [None]:
import operator
# use the model for predicting the polarity of a sentence
doc = nlp("Vi er glade for spacy!")
max(doc.cats.items(), key=operator.itemgetter(1))[0]
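
In spaCy, `doc.cats` is a dictionary that maps each category label to a score, so the expression above picks the label with the highest score. The sketch below shows the same idea on a plain dictionary. The labels and scores are made up for illustration and are not output from the DaNLP model.

```python
import operator

# Hypothetical scores, shaped like spaCy's doc.cats (label -> score)
cats = {'positive': 0.82, 'neutral': 0.15, 'negative': 0.03}

# items() yields (label, score) pairs; itemgetter(1) compares them by score
best_label, best_score = max(cats.items(), key=operator.itemgetter(1))
print(best_label, best_score)

# Equivalent shorthand when only the label is needed
print(max(cats, key=cats.get))
```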

## 2. Datasets

    2.1. Danish Dependency Treebank (DaNE)
    2.2. Dacoref
    2.3. WikiANN
    2.4. Sentiment datasets
    2.5. Word similarity datasets
    2.6. DanNet

### 2.1. Danish Dependency Treebank (DaNE)

The DaNE dataset contains annotations for POS tagging, Named Entity Recognition and Dependency Parsing.

In [None]:
from danlp.datasets import DDT
ddt = DDT()

spacy_corpus = ddt.load_with_spacy()
flair_corpus = ddt.load_with_flair()
conllu_format = ddt.load_as_conllu()
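
CoNLL-U is a plain-text format with one token per line and ten tab-separated columns; lines starting with `#` are comments and a blank line ends a sentence. As a rough illustration of the format itself (not of the object that `load_as_conllu` returns), the sketch below parses a hand-written toy sentence. The sentence is not taken from DDT, and the helper name `parse_conllu` is hypothetical.

```python
# The ten CoNLL-U columns, in order
FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos',
          'feats', 'head', 'deprel', 'deps', 'misc']

# Hand-written toy sentence (not taken from DDT)
rows = [
    ['1', 'Hunden', 'hund', 'NOUN', '_', '_', '2', 'nsubj', '_', '_'],
    ['2', 'sover', 'sove', 'VERB', '_', '_', '0', 'root', '_', '_'],
    ['3', '.', '.', 'PUNCT', '_', '_', '2', 'punct', '_', '_'],
]
conllu_text = ('# text = Hunden sover.\n'
               + '\n'.join('\t'.join(row) for row in rows)
               + '\n\n')

def parse_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith('#'):
            # Comment lines carry sentence-level metadata; skipped here
            continue
        else:
            current.append(dict(zip(FIELDS, line.split('\t'))))
    if current:
        sentences.append(current)
    return sentences

sentences = parse_conllu(conllu_text)
for token in sentences[0]:
    print(token['id'], token['form'], token['upos'], token['head'], token['deprel'])
```

For real work, a dedicated CoNLL-U library is usually a safer choice than a hand-rolled parser, since it also handles multiword tokens and empty nodes.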

### 2.2. Dacoref

Dacoref can be used for training and testing models for coreference resolution.

In [None]:
from danlp.datasets import Dacoref
dacoref = Dacoref()
# The corpus can be loaded as a whole or, with predefined_splits=True, as a list of the train, dev and test splits in that order
corpus = dacoref.load_as_conllu(predefined_splits=True) 

### 2.3. WikiANN

WikiANN is annotated with named entity tags.

In [None]:
from danlp.datasets import WikiAnn
wikiann = WikiAnn()

spacy_corpus = wikiann.load_with_spacy()
flair_corpus = wikiann.load_with_flair()

### 2.4. Sentiment datasets

Europarl Sentiment 1 is annotated with polarity scores (from -5 to 5), while Europarl Sentiment 2 is annotated with polarity tags (‘positive’, ‘neutral’, ‘negative’) and analytics (‘subjective’, ‘objective’). 

In [None]:
from danlp.datasets import EuroparlSentiment1
eurosent = EuroparlSentiment1()

df = eurosent.load_with_pandas()

In [None]:
from danlp.datasets import EuroparlSentiment2
eurosent = EuroparlSentiment2()

df = eurosent.load_with_pandas()

Like Europarl Sentiment 1, LCC Sentiment is annotated with polarity scores (from -5 to 5).

In [None]:
from danlp.datasets import LccSentiment
lccsent = LccSentiment()

df = lccsent.load_with_pandas()

### 2.5. Word similarity datasets

The word similarity datasets contain lists of words annotated with similarity scores (from 1 to 10). They can be used for evaluating word embeddings.

In [None]:
from danlp.datasets import DSD

dsd = DSD()
dsd.load_with_pandas()

In [None]:
from danlp.datasets import WordSim353Da

ws353 = WordSim353Da()
ws353.load_with_pandas()
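
A common way to use such a dataset for evaluation is to compare the human similarity scores with the similarities an embedding model assigns to the same word pairs, using Spearman's rank correlation. The sketch below implements that correlation in plain Python. The two score lists are made-up numbers, not values from DSD or WordSim353Da, and the helper names are only illustrative.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up scores for four word pairs, for illustration only
human_scores = [9.1, 7.5, 3.2, 1.0]      # annotators, scale 1 to 10
model_scores = [0.83, 0.61, 0.40, 0.12]  # e.g. cosine similarities

print(spearman(human_scores, model_scores))
```

Because only the ranks matter, the human scale (1 to 10) and the model scale do not need to match.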

### 2.6. DanNet

DanNet is a lexical database similar to WordNet. 
You can download the database or use our wrapper for finding synonyms and other types of relations between words in Danish. 

In [None]:
from danlp.datasets import DanNet

dannet = DanNet()

# you can load the underlying tables if you want to explore the database yourself
words, wordsenses, relations, synsets = dannet.load_with_pandas()

In [None]:
# or use our functions to search for synonyms, hypernyms, hyponyms and domains

word = "myre"
print(word)
print("synonyms : ", dannet.synonyms(word))
print("hypernyms : ", dannet.hypernyms(word))
print("hyponyms : ", dannet.hyponyms(word))
print("domains : ", dannet.domains(word))
print("meanings : ", dannet.meanings(word))

# to help you dive into the databases
# we also provide the following functions: 

print("part-of-speech : ", dannet.pos(word))
print("wordnet relations : ", dannet.wordnet_relations(word, eurowordnet=True))
print("word ids : ", dannet._word_ids(word))
print("synset ids : ", dannet._synset_ids(word))
i = 11034863
print("word from id =",i, ":", dannet._word_from_id(i))
i = 3514
print("synset from id =", i, ":", dannet._synset_from_id(i))