# Getting started

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/KennethEnevoldsen/DaCy/blob/master/docs/tutorials/basic.ipynb)

DaCy is built on [SpaCy] and uses the same pipeline structure. If you are familiar with SpaCy,
using DaCy should be easy, and you can combine DaCy with other SpaCy models and components.
If you are not familiar with SpaCy, don't worry: DaCy is still easy to use.


Before we start, we assume you have installed DaCy and SpaCy. If not, please check out the [installation] page.

To use DaCy, you first have to download a small, medium, or large model. To see a list
of all available models:

[spacy]: https://spacy.io/
[installation]: https://centre-for-humanities-computing.github.io/DaCy/installation.html

In [1]:
import dacy
for model in dacy.models():
    print(model)

da_dacy_small_trf-0.2.0
da_dacy_medium_trf-0.2.0
da_dacy_large_trf-0.2.0
small
medium
large
da_dacy_small_ner_fine_grained-0.1.0
da_dacy_medium_ner_fine_grained-0.1.0
da_dacy_large_ner_fine_grained-0.1.0


```{note}
The full model name indicates the language (`da`), framework (`dacy`), model size (e.g.
`small`), model type (`trf`), and model version (e.g. `0.1.0`). Using a larger model
will typically increase accuracy, but it will also increase the memory and
time needed to run the model.
```
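
The naming scheme described in the note can be made concrete with a small helper. This is only a sketch: `parse_model_name` and `ModelName` are hypothetical names and not part of DaCy, and the code assumes every full model name follows the pattern seen in the list above.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    language: str
    framework: str
    size: str
    model_type: str
    version: str


def parse_model_name(name: str) -> ModelName:
    """Split a full model name into the parts described in the note."""
    # the version follows the last hyphen, e.g. "...-0.2.0"
    stem, version = name.rsplit("-", 1)
    # the first three underscore-separated fields are language, framework and size;
    # whatever remains is treated as the model type (e.g. "trf" or "ner_fine_grained")
    language, framework, size, model_type = stem.split("_", 3)
    return ModelName(language, framework, size, model_type, version)


print(parse_model_name("da_dacy_small_trf-0.2.0"))
```

The short aliases (`small`, `medium`, `large`) carry no version or type and are not handled by this sketch; passing one raises a `ValueError`.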

From here we can now download a model using:

In [2]:
# get the latest small model:
nlp = dacy.load("small")

This will download and install the model. If the model is already downloaded, it will simply be loaded. Once loaded, DaCy works exactly like any other SpaCy model.

Using this we can now apply DaCy to text with conventional SpaCy syntax where we pass the text through all the components of the `nlp` pipeline.


```{seealso}
DaCy is built using SpaCy, hence you will be able to find a lot of the required documentation for
using the pipeline in their very well written [documentation](https://spacy.io).
```

In [3]:
doc = nlp("DaCy-pakken er en hurtig og effektiv pipeline til dansk sprogprocessering.")

# Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying named entities in a text. A named entity is a “real-world object” that's assigned a name - for example, a person, a country, a product or a book title. 
DaCy can recognize organizations, persons, and locations, as well as other miscellaneous entities.

In [4]:
for entity in doc.ents:
    print(entity, ":", entity.label_)

DaCy-pakken : MISC
dansk : MISC
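
For texts with many entities it can be convenient to group them by label. The helper below is a minimal sketch: `entities_by_label` is a hypothetical name, and the function relies only on the `ents`, `label_`, and `text` attributes that SpaCy documents expose.

```python
from collections import defaultdict


def entities_by_label(doc):
    """Group the entity texts of a SpaCy Doc by their entity label."""
    grouped = defaultdict(list)
    for entity in doc.ents:
        grouped[entity.label_].append(entity.text)
    return dict(grouped)
```

Calling `entities_by_label(doc)` on the `doc` above should place both entities in the output shown under the single key `MISC`.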


We can also plot these using:

In [5]:
from spacy import displacy

displacy.render(doc, style="ent")

While DaCy achieved state-of-the-art performance at the time of its release, it has since been outperformed by the [NER model](https://huggingface.co/saattrupdan/nbailab-base-ner-scandi) by Dan Nielsen. To give users access to the best model for their use case, DaCy lets you easily
switch the NER component to obtain a state-of-the-art model.

To do this you can simply load the model using:

In [6]:
# load the small dacy model excluding the NER component
nlp = dacy.load("small", exclude=["ner"])
# or use an empty spacy model if you only want to do NER
# nlp = spacy.blank("da")

# add the ner component from the state-of-the-art model
nlp.add_pipe("dacy/ner")

<spacy_wrap.pipeline_component_tok_clf.TokenClassificationTransformer at 0x1758db040>

In [7]:
doc = nlp("Denne NER model er trænet af Dan fra Alexandra Instituttet")

displacy.render(doc, style="ent")

```{warning}
Note that this will add an additional model to your pipeline, which will slow down inference.
```

# Named Entity Linking
As you probably already saw, the named entities are annotated with a unique identifier. This is because DaCy also supports named entity linking.

[Named entity linking](https://en.wikipedia.org/wiki/Entity_linking) is the task of linking a named entity to a knowledge base. This is done by assigning a unique identifier to each entity, which allows us to link entities to other entities and extract information from the knowledge base. For example, we can link the entity "Barack Obama" to the Wikipedia or Wikidata page about Barack Obama. Named entity linking is also known as named entity disambiguation, though this term can also refer to the task of distinguishing between entities with the same name without linking to a knowledge base.

```{admonition} Beta feature
Named entity linking is currently in beta and is not yet fully tested. If you find any bugs, please report them on [GitHub](https://github.com/centre-for-humanities-computing/DaCy/issues).
We are working on expanding the knowledge base as well as correcting the annotations, which currently annotate unknown persons using the QIDs for the corresponding names. For instance, in the sentence *Rutechef Ivan Madsen: "Jeg ved ikke hvorfor...* the name *Ivan Madsen* is annotated using two QIDs, Q830350 (Ivan, male name) and Q16876242 (Madsen, family name), which we believe is incorrect, as the text does not refer to the family name *Madsen* but to the person with the full name *Ivan Madsen*.
The knowledge base is also currently limited, so while the links you do obtain are often correct, the model will often be unable to link all entities to the knowledge base.
```

In DaCy, the `small`, `medium`, and `large` models have a named entity linking component. This component uses neural entity linking to match the entity to a specific entry in the knowledge base. The knowledge base DaCy uses is currently a combination of Danish and English Wikidata.

In [8]:
from wikidata.client import Client

nlp = dacy.load("small")
text = "Danmarks dronning bor i København"

doc = nlp(text)

displacy.render(doc, style="ent")


client = Client() # start wikidata client
for entity in doc.ents:
    print(entity, ":", entity.kb_id_)

    # print the short description derived from wikidata
    wikidata_entry  = client.get(entity.kb_id_, load=True)
    print(wikidata_entry.description.get("en"))
    print(wikidata_entry.description.get("da"))
    print(" ")
        


Danmarks : Q35
country in Northern Europe
nordeuropæisk land
 
København : Q1748
capital city of Denmark
Danmarks hovedstad
 


You can also extract further information from the knowledge base, such as images and the associated Wikipedia article.
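
Because not every entity can be linked, it is often useful to keep only the linked ones and turn their QIDs into Wikidata page URLs. The following is a minimal sketch: `linked_entities` is a hypothetical helper, and it assumes that an unlinked entity carries either an empty `kb_id_` or SpaCy's `"NIL"` placeholder.

```python
WIKIDATA_URL = "https://www.wikidata.org/wiki/{qid}"


def linked_entities(doc):
    """Return (text, QID, URL) for every entity that has a knowledge base link."""
    links = []
    for entity in doc.ents:
        qid = entity.kb_id_
        # assumption: unlinked entities have an empty ID or the "NIL" placeholder
        if not qid or qid == "NIL":
            continue
        links.append((entity.text, qid, WIKIDATA_URL.format(qid=qid)))
    return links
```

For the example above, `linked_entities(doc)` would pair `Danmarks` with `https://www.wikidata.org/wiki/Q35` and `København` with `https://www.wikidata.org/wiki/Q1748`.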

## Fine-grained NER

DaCy also features models with a more fine-grained Named Entity Recognition component.
This component has been trained on the [DANSK](https://huggingface.co/datasets/chcaa/DANSK) dataset.
It allows for the detection of 18 classes, namely the following named entities:

| Tag          | Description                                          |
| ------------ | ---------------------------------------------------- |
| PERSON       | People, including fictional                          |
| NORP         | Nationalities or religious or political groups       |
| FACILITY     | Buildings, airports, highways, bridges, etc.         |
| ORGANIZATION | Companies, agencies, institutions, etc.              |
| GPE          | Countries, cities, states                            |
| LOCATION     | Non-GPE locations, mountain ranges, bodies of water  |
| PRODUCT      | Vehicles, weapons, foods, etc. (not services)        |
| EVENT        | Named hurricanes, battles, wars, sports events, etc. |
| WORK OF ART  | Titles of books, songs, etc.                         |
| LAW          | Named documents made into laws                       |
| LANGUAGE     | Any named language                                   |

As well as annotation for the following concepts:

| Tag      | Description                                  |
| -------- | -------------------------------------------- |
| DATE     | Absolute or relative dates or periods        |
| TIME     | Times smaller than a day                     |
| PERCENT  | Percentage (including "%")                   |
| MONEY    | Monetary values, including unit              |
| QUANTITY | Measurements, as of weight or distance       |
| ORDINAL  | "first", "second"                            |
| CARDINAL | Numerals that do not fall under another type |
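
The two tables can be mirrored in code to separate named entities from the numeric and temporal concepts. This is a sketch under assumptions: the tag strings are copied from the tables and may differ from the exact labels a given model emits, and `split_entities` is a hypothetical helper.

```python
# tags copied from the two tables above (11 named entity tags + 7 concept tags)
NAMED_ENTITY_TAGS = {
    "PERSON", "NORP", "FACILITY", "ORGANIZATION", "GPE", "LOCATION",
    "PRODUCT", "EVENT", "WORK OF ART", "LAW", "LANGUAGE",
}
CONCEPT_TAGS = {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"}


def split_entities(doc):
    """Split the entities of a SpaCy Doc into named entities and concept annotations."""
    named, concepts = [], []
    for entity in doc.ents:
        if entity.label_ in CONCEPT_TAGS:
            concepts.append(entity.text)
        else:
            named.append(entity.text)
    return named, concepts
```

Any label outside `CONCEPT_TAGS` is treated as a named entity here, so unexpected labels end up in the first list rather than being dropped.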

The fine-grained NER component may be utilized in an existing pipeline in the following fashion:


In [9]:
# load the small dacy model excluding the NER component
nlp = dacy.load("small", exclude=["ner"])

# add the ner component from the state-of-the-art fine-grained model
nlp.add_pipe("dacy/ner-fine-grained", config={"size": "small"})
# or if you only want to do NER
# nlp = dacy.load("da_dacy_small_ner_fine_grained-0.1.0")


<spacy.pipeline.ner.EntityRecognizer at 0x177f79d90>

In [10]:
doc = nlp("Denne model samt 3 andre blev trænet d. 7. marts af Center for Humantities Computing i Aarhus kommune")

displacy.render(doc, style="ent")

# Parts-of-speech Tagging
[Part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) (POS) is the task of assigning a part of speech to each word in a text. The part of speech is the grammatical role of a word in a sentence. For example, the word “run” is a verb, and the word “book” is a noun. 

After tokenization, DaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

In [11]:
print("Token POS-tag")
for token in doc:
    print(f"{token}:\t {token.pos_}")



Token POS-tag
Denne:	 DET
model:	 NOUN
samt:	 CCONJ
3:	 NUM
andre:	 PRON
blev:	 AUX
trænet:	 VERB
d.:	 ADV
7.:	 ADJ
marts:	 NOUN
af:	 ADP
Center:	 NOUN
for:	 ADP
Humantities:	 PROPN
Computing:	 PROPN
i:	 ADP
Aarhus:	 PROPN
kommune:	 NOUN
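
A quick way to summarise the tags of a document is to count them. The sketch below uses only the `pos_` attribute shown above; `pos_counts` is a hypothetical helper name.

```python
from collections import Counter


def pos_counts(doc):
    """Count how often each part-of-speech tag occurs in a SpaCy Doc."""
    return Counter(token.pos_ for token in doc)
```

For the output above, `pos_counts(doc)` would report four `NOUN` tags and three each of `ADP` and `PROPN`; `pos_counts(doc).most_common(3)` lists the most frequent tags first.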


```{seealso}
For more on Part-of-speech tagging see SpaCy's [documentation](https://spacy.io/usage/linguistic-features#pos-tagging).
```


# Dependency Parsing

[Dependency parsing](https://en.wikipedia.org/wiki/Dependency_grammar) is the task of assigning syntactic dependencies between tokens, i.e. identifying the head word of each token and the relation between the head and that token. For example, in the sentence “The quick brown fox jumps over the lazy dog”, the word “jumps” is the head of “fox” (the head noun of the phrase “The quick brown fox”), and the relation between them is “nsubj” (nominal subject).

DaCy features a fast and accurate syntactic dependency parser. In DaCy this dependency parsing is also
used for sentence segmentation and detecting noun chunks.

```{seealso}
For more on Dependency parsing see SpaCy's [documentation](https://spacy.io/usage/linguistic-features#dependency-parse).
```

You can see the dependency tree using:

In [12]:
doc = nlp("DaCy er en effektiv pipeline til dansk fritekst.")
from spacy import displacy

displacy.render(doc)
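
Besides the visualisation, the head and relation of each token can be read directly from the token attributes `dep_` and `head`. The helper below is a minimal sketch with a hypothetical name, `dependency_triples`.

```python
def dependency_triples(doc):
    """List (token, relation, head) for every token in a parsed SpaCy Doc."""
    return [(token.text, token.dep_, token.head.text) for token in doc]
```

In SpaCy the root of a sentence is its own head, so the root token appears with itself in the third position.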

# Sentence Segmentation
Sentence segmentation is the task of splitting a text into sentences. In DaCy this is done using the dependency parser. This makes it very accurate and allows for the detection of sentences that are not separated by punctuation.

In [13]:
doc = nlp("Sætnings segmentering er en vigtig del af sprogprocessering - Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning.")

for sent in doc.sents:
    print(sent)

Sætnings segmentering er en vigtig del af sprogprocessering
- Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning.
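
As the example text says, sentence boundaries can be used to split long texts into smaller pieces without cutting a sentence in half. The following is a minimal sketch of one way to do this, a greedy packing of consecutive sentences; `chunk_sentences` and `max_chars` are hypothetical names chosen for illustration.

```python
def chunk_sentences(sentences, max_chars):
    """Greedily pack consecutive sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty strings
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # adding this sentence would exceed the limit, so start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With a DaCy document, the input could be built as `[sent.text for sent in doc.sents]`.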


# Noun Chunks
[Noun chunks](https://en.wikipedia.org/wiki/Noun_phrase) are "base noun phrases" – flat phrases that have a noun as their head. For example, "the big yellow taxi" and "the quick brown fox" are noun chunks. A noun chunk can be thought of as a "noun-like" head word (a noun, a pronoun, or a proper noun) together with the words that describe it.

Noun chunks are for example used for information extraction, and for finding subjects and objects of verbs.

In [14]:
doc = nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering.")

for nc in doc.noun_chunks:
    print(nc)

DaCy
en hurtig og effektiv pipeline
dansk sprogprocessering
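
To find subjects and objects, one option is to look at the dependency relation of each chunk's head word, available as `chunk.root.dep_`. This sketch assumes the parser emits Universal Dependencies style labels (`nsubj`, `obj`, `iobj`); `subjects_and_objects` is a hypothetical helper.

```python
def subjects_and_objects(doc):
    """Sort the noun chunks of a SpaCy Doc into subjects and objects."""
    subjects, objects = [], []
    for chunk in doc.noun_chunks:
        relation = chunk.root.dep_
        # assumption: Universal Dependencies labels such as "nsubj" and "obj"
        if relation.startswith("nsubj"):
            subjects.append(chunk.text)
        elif relation in {"obj", "iobj"}:
            objects.append(chunk.text)
    return subjects, objects
```

Noun chunks with any other relation, for example those inside a prepositional phrase, are ignored by this sketch.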


# Lemmatization

[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the task of grouping together the inflected forms of a word so they can be analysed as a single item. For example, the inflected forms “runs”, “running”, and “ran” all share the base form (lemma) “run”.

Lemmatization is for example used for text normalization before training a machine learning model to reduce the number of unique tokens in the training data.

In [15]:
doc = nlp("Normalisering af tekst kan være en god idé.")

for token in doc:
    print(token, token.lemma_)

Normalisering Normalisering
af af
tekst tekst
kan kunne
være være
en en
god god
idé idé
. .
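
The reduction in unique tokens mentioned above can be measured by comparing the number of distinct surface forms with the number of distinct lemmas. This is a minimal sketch; `vocabulary_sizes` is a hypothetical helper, and lowercasing both sides is a choice made here for illustration.

```python
def vocabulary_sizes(doc):
    """Return (unique surface forms, unique lemmas) for a SpaCy Doc, ignoring case."""
    forms = {token.text.lower() for token in doc}
    lemmas = {token.lemma_.lower() for token in doc}
    return len(forms), len(lemmas)
```

On a short sentence like the one above the two numbers will be equal or nearly so; the difference becomes visible on larger collections of text.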


# Coreference Resolution

[Coreference resolution](https://en.wikipedia.org/wiki/Coreference) is the task of finding all expressions that refer to the same entity in a text. For example, in the sentence “The dog chased the ball because it was shiny”, “it” is referring to the “ball”.


Coreference resolution is for example used for question answering, summarization, conversational agents/chatbots and information extraction where such resolved references can lead to a better semantic representation.

```{admonition} Beta feature
Coreference resolution is currently an experimental feature from spaCy. This is thus only a beta feature in DaCy. We are currently working on improving the performance of the model.
```

In [16]:
text = "Den 4. november 2020 fik minkavler Henning Christensen og hele familien et chok. Efter et pressemøde, fik han at vide at alle mink i Danmark skulle aflives. Dermed fik han fjernet hans livsgrundlag"
doc = nlp(text)
print("Coreference clusters:")
print(doc.spans)


Coreference clusters:
{'coref_clusters_1': [minkavler Henning Christensen, han, han, hans]}
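
The clusters are stored as span groups in `doc.spans`, so they can be converted to plain strings for further processing. The sketch below assumes the cluster keys start with `coref_clusters`, as in the output above; `cluster_mentions` is a hypothetical helper.

```python
def cluster_mentions(doc, prefix="coref_clusters"):
    """Map each coreference cluster key to the texts of its mentions."""
    return {
        key: [span.text for span in spans]
        for key, spans in doc.spans.items()
        # assumption: coreference clusters are stored under keys with this prefix
        if key.startswith(prefix)
    }
```

The first mention of each cluster (here *minkavler Henning Christensen*) is often a reasonable label for the entity that the remaining mentions refer to.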
