# Introduction to DaCy and SpaCy

----
Before we start, we assume you have installed DaCy and SpaCy. If not, you can run the following:


In [1]:
# !pip install git+https://github.com/KennethEnevoldsen/DaCy


----

Let's start off by loading DaCy as well as the smaller of the two models:

In [2]:
import dacy

# to see available models
for model in dacy.models():
    print(model)

# loading the smallest model
nlp = dacy.load("da_dacy_medium_tft-0.0.0")

da_dacy_medium_tft-0.0.0
da_dacy_large_tft-0.0.0


# Examining SpaCy's Classes

In [3]:
print(type(nlp))

doc = nlp("EU-landene Frankrig, Italien, Spanien og Tyskland har indgået vaccine-aftale med Rusland")

print(type(doc))

print(type(doc[0]))


<class 'spacy.lang.da.Danish'>
<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.token.Token'>


In [4]:
# what can we do with the token class 
print(dir(doc[0]))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_dep', 'has_extension', 'has_head', 'has_morph', 'has_vector', 'head', 'i', 'idx', 'iob_strings', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex', 'lex_id', 'like_email', 'like

In [18]:
# Extracting things from the document and the token class.

for token in doc:
    print(f"{token}: \n\tPOS-tag: {token.tag_}, \n\tNER: {token.ent_type_} - {token.ent_type_}")

# Why the underscore '_'? Hint: Efficient data structures

# you can also extract things directly from the document class:
doc.ents

EU-landene: 
	POS-tag: NOUN, 
	NER: MISC - MISC
Frankrig: 
	POS-tag: PROPN, 
	NER: LOC - LOC
,: 
	POS-tag: PUNCT, 
	NER:  - 
Italien: 
	POS-tag: PROPN, 
	NER: LOC - LOC
,: 
	POS-tag: PUNCT, 
	NER:  - 
Spanien: 
	POS-tag: PROPN, 
	NER: LOC - LOC
og: 
	POS-tag: CCONJ, 
	NER:  - 
Tyskland: 
	POS-tag: PROPN, 
	NER: LOC - LOC
har: 
	POS-tag: AUX, 
	NER:  - 
indgået: 
	POS-tag: VERB, 
	NER:  - 
vaccine-aftale: 
	POS-tag: NOUN, 
	NER:  - 
med: 
	POS-tag: ADP, 
	NER:  - 
Rusland: 
	POS-tag: PROPN, 
	NER: LOC - LOC


(EU-landene, Frankrig, Italien, Spanien, Tyskland, Rusland)
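
On the underscore hint above: in spaCy, `token.tag` returns an integer ID while `token.tag_` returns the readable string, because strings are stored once and referred to by ID. The sketch below illustrates that idea with a toy class. `ToyStringStore` is a made-up name for this example and is not how spaCy implements its string store (spaCy uses hash values rather than running counters).

```python
# Toy illustration only: this is NOT spaCy's actual string store.
class ToyStringStore:
    """Map each distinct string to a small integer ID and back."""

    def __init__(self):
        self._id_by_string = {}
        self._string_by_id = []

    def add(self, string):
        # Reuse the existing ID if the string was seen before.
        if string not in self._id_by_string:
            self._id_by_string[string] = len(self._string_by_id)
            self._string_by_id.append(string)
        return self._id_by_string[string]

    def lookup(self, string_id):
        return self._string_by_id[string_id]


store = ToyStringStore()
tags = ["NOUN", "PROPN", "PUNCT", "PROPN", "PUNCT", "PROPN"]

# Analogous to `token.tag`: compact integers, cheap to store and compare.
tag_ids = [store.add(tag) for tag in tags]
# Analogous to `token.tag_`: the readable string, recovered on demand.
tag_strings = [store.lookup(tag_id) for tag_id in tag_ids]

print(tag_ids)
print(tag_strings)
```

Each repeated tag costs only one integer per token, and the string itself is kept in a single place.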

# Visualization of Predictions

In [6]:
from spacy import displacy

displacy.render(doc, style="ent")

In [7]:
displacy.render(doc, style="dep")

# Expanding SpaCy
---
We will now briefly examine how to expand upon SpaCy for our own goals. We will do two things:

- 1) add a readability measure, and
- 2) build a NER and dependency-based task using DaCy.

To do this we will first need some data. For this we will use the speeches by Mette Frederiksen:


In [8]:
import pandas as pd

In [9]:
df = pd.read_csv("../data/speeches.csv")

speeches = df[df["person"] == "Mette Frederiksen"]["text"].tolist()

In [74]:
print(speeches[3][:300])
print("---")
print(speeches[5][:300])

Deres MajestætKære formand, overborgmester og borgmester.Kære alle sammen.Kære København, hovedstad af Danmark. Kæmpe stort tillykke med i dag. Så kom dagen. Efter mere end næsten 10 år med byggerod. Cityringen står klar. Det største anlægsprojekt i København siden Christian den Fjerde. Jeg har lige
---
Kære kongres. Tak for invitationen. Som jeg har forstået det, er det første gang en politiker får lov til at stå her på talerstolen. Jeg er stolt over, at det blev mig. Jeg har glædet mig til at komme. Vi har mange fælles sager. Og en af de absolut vigtigste gælder vores velfærdssamfund – det dyreba


In [80]:
# a nice bonus of using SpaCy is you get a lot of "free stuff"
doc = nlp(speeches[3][:300])

for sent in doc.sents:
    print(sent)

Deres MajestætKære formand, overborgmester og borgmester.
Kære alle sammen.
Kære København, hovedstad af Danmark.
Kæmpe stort tillykke med i dag.
Så kom dagen.
Efter mere end næsten 10 år med byggerod.
Cityringen står klar.
Det største anlægsprojekt i København siden Christian den Fjerde.
Jeg har lige


## Measuring readability
In Danish, a simple measure of readability is LIX. It is by no means the best, but it is a good heuristic.

LIX is given as follows:

$$
LIX = \frac{O}{P} + \frac{L \cdot 100}{O}
$$

where:

$O$: Number of words

$P$: Number of full stops (I will use the number of sentences instead)

$L$: Number of long words (more than 6 characters)


In [89]:
from spacy.tokens import Doc

O = len(doc)
P = len(list(doc.sents))
L = len([t for t in doc if len(t)>6])

LIX = O/P + L*100/O
LIX

31.22222222222222
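
Note that the computation above counts every SpaCy token, punctuation included, as a word. For comparison, here is a minimal sketch in plain Python that only counts runs of word characters as words. The helper name `lix` and the regular expressions are assumptions made for this illustration and are not part of DaCy or SpaCy.

```python
import re


def lix(text, long_word_length=6):
    """Compute LIX for a plain string (hypothetical helper, not part of DaCy)."""
    # Runs of letters, digits or underscores; punctuation is skipped.
    words = re.findall(r"\w+", text)
    n_words = len(words)
    if n_words == 0:
        return 0.0
    # Count runs of sentence-ending punctuation; fall back to 1 for fragments.
    n_sentences = len(re.findall(r"[.!?]+", text)) or 1
    n_long = sum(1 for word in words if len(word) > long_word_length)
    return n_words / n_sentences + n_long * 100 / n_words


text = "Cityringen står klar. Det største anlægsprojekt i København siden Christian den Fjerde."
print(round(lix(text), 2))
```

For this text there are 12 words, 2 sentences and 5 long words, so LIX is 12/2 + 500/12, roughly 47.67. Whether punctuation should count as words is a design choice; excluding it gives a higher share of long words.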

We naturally don't want to run this every time we need it. Thus it might be ideal to add a getter.

Why a getter and not a function? Well the getter is a function ;), but more than that the getter only runs the function when the variable is needed, which makes it very efficient for simple tasks such as this. If you want to add more explicit variables you might want to add a pipe instead.

In [96]:
# adding it to the doc:

def LIX_getter(doc):
    """
    extract LIX
    """
    O = len(doc)
    P = len(list(doc.sents))
    L = len([t for t in doc if len(t)>6])

    LIX = O/P + L*100/O
    return LIX

Doc.set_extension("LIX", getter=LIX_getter)

In [97]:
# testing it out on a doc
doc = nlp(speeches[0])

doc._.LIX

23.03928960991741
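
One consequence of using a getter is that the function runs again on every access, since nothing is cached. The sketch below shows this with a plain Python `property` as an analogy. `ToyDoc` is a made-up stand-in and not a SpaCy class.

```python
class ToyDoc:
    """Stand-in for a document; not a SpaCy class."""

    def __init__(self, words):
        self.words = words
        self.getter_calls = 0  # counts how often the getter body runs

    @property
    def n_long_words(self):
        self.getter_calls += 1
        return sum(1 for word in self.words if len(word) > 6)


toy = ToyDoc(["Cityringen", "står", "klar"])
print(toy.getter_calls)  # nothing is computed when the object is created

toy.n_long_words
toy.n_long_words
print(toy.getter_calls)  # the body ran once per access
```

For a cheap measure such as LIX this is fine. For an expensive value that is read many times, storing the result in a variable or computing it once in a pipeline component is likely the better choice.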


## Using NER and Dependency Parsing
To start this off, let us first look at which entities Mette Frederiksen mentions in her speeches:

In [None]:
docs = nlp.pipe(speeches)  # nlp.pipe is meant for large numbers of documents (more than we have here)

for doc in docs:
    print(doc.ents)
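
Printing the raw entities makes it hard to see which ones recur. A minimal sketch of counting them is given below. The nested list `entities_per_speech` is hypothetical sample data standing in for `[ent.text for ent in doc.ents]` collected per speech, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical entity texts, one inner list per speech.
entities_per_speech = [
    ["Danmark", "København", "Danmark"],
    ["Danmark", "EU"],
    ["København"],
]

# Lowercase the texts so that "Danmark" and "danmark" are counted together.
counts = Counter(
    text.lower() for entities in entities_per_speech for text in entities
)

for entity, n in counts.most_common(3):
    print(entity, n)
```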

Oh well, look at that: `Danmark` seems quite popular. Given that, let us examine how Mette describes Denmark. Let's first make a simple example:

In [90]:
doc = nlp("velkommen til skønne Danmark")

displacy.render(doc)


Notice how Danmark is described using the adjective *'skønne'* and that this is captured by the dependency label *amod*. This can be extracted quite easily as follows:

In [92]:
[t for t in doc[3].subtree if t.dep_ == "amod"] # doc[3] corresponds to Danmark

[skønne]

Similarly to before, we can now add a getter for doing this for all docs. Notice that this function is only called when you extract the variable, so it does not run before you need it.

In [69]:

def ent_desc_getter(doc, entity="danmark"):
    """
    return words which describes the entity
    
    assumes entity is length 1
    """
    for ent in doc.ents:
        if ent.text.lower() == entity:
            out = [t for t in doc[ent.start].subtree if t.dep_ == "amod"]
            if out:
                for i in out:
                    yield i

Doc.set_extension("dk_desc", getter=ent_desc_getter)


In [70]:
# Testing it out on one speech
doc = nlp(speeches[0])
list(doc._.dk_desc)

[hele]

In [73]:
# testing it out on all the speeches
docs = nlp.pipe(speeches)

for doc in docs:
    print(list(doc._.dk_desc))

[hele]
[hele, hele]
[]
[grønnere]
[grønnere, hele, grønt]
[solidarisk, store]
[]
[]
[]
[hele]
[hele]
[Hele, mange]
[hele, alle, hele]
[grønt]
[]
[hele]
[]
[]


Naturally, one could extend this. One might wish to filter by the tag as well, e.g. by only showing adjectives. Similarly, this approach does not catch even simple cases such as *"Danmark er det skønneste land"*, in which case you can parse the tree further and/or use coreference resolution.
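
To give an idea of what parsing the tree further could look like for *"Danmark er det skønneste land"*, here is a sketch on a toy tree. The parse is written by hand and is an assumption; the output from DaCy may differ. `ToyToken` and `predicate_descriptions` are made-up names for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    text: str
    dep: str   # dependency label
    head: int  # index of the head token (the root points to itself)


# Hand-written, assumed parse of "Danmark er det skønneste land".
tokens = [
    ToyToken("Danmark", "nsubj", 4),
    ToyToken("er", "cop", 4),
    ToyToken("det", "det", 4),
    ToyToken("skønneste", "amod", 4),
    ToyToken("land", "ROOT", 4),
]


def predicate_descriptions(tokens, entity="danmark"):
    """Collect amod modifiers of a nominal predicate whose subject is the entity."""
    out = []
    for token in tokens:
        if token.text.lower() != entity or token.dep != "nsubj":
            continue
        head = token.head
        # Only follow the head if a copula ("er") links it to the subject.
        has_copula = any(t.dep == "cop" and t.head == head for t in tokens)
        if has_copula:
            out.extend(t.text for t in tokens if t.dep == "amod" and t.head == head)
    return out


print(predicate_descriptions(tokens))
```

With a real `Doc`, the same walk would use `token.head`, `token.children` and `token.dep_` instead of the toy fields.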

This concludes the tutorial. If you wish to work more on Danish NLP and DaCy, feel free to contribute to its development.

# How to Contribute
---

DaCy is by no means perfect and there are still some notable limitations:
- Lemmatization: It currently uses a lookup table for lemmatization based on the training corpus. A more viable solution is to use the `lemmy` package, which was written for SpaCy v2 and needs to be updated.
- POS-tags: Currently POS-tags are assigned to the `tag_` attribute, not the `pos_` attribute. This needs to be fixed.
- DaCy is trained on a fairly small training corpus. Any data augmentation and/or increase in training data will likely result in improved performance.
- DaCy notably does not include a sentiment analysis component. There are multiple reasons for this, the primary being that DaNE is not tagged for sentiment and that sentiment analysis still lacks a clear definition.

If you make progress in any of these (or something else which you find relevant), please feel free to reach out.
