# Getting started

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/KennethEnevoldsen/DaCy/blob/master/docs/tutorials/basic.ipynb)

DaCy is built on [SpaCy] and uses the same pipeline structure. This means that if you are familiar with SpaCy,
using DaCy should be easy. It also allows you to use other SpaCy models and components with DaCy.
If you are not familiar with SpaCy, don't worry: using DaCy is still easy.


We assume that you have installed DaCy and SpaCy. If not, please check out the [installation] page.

To use DaCy you first have to download a small, medium, or large model. To see a list
of all available models:

[spacy]: https://spacy.io/
[installation]: https://centre-for-humanities-computing.github.io/DaCy/installation.html

In [1]:
import dacy
for model in dacy.models():
    print(model)

da_dacy_small_tft-0.0.0
da_dacy_medium_tft-0.0.0
da_dacy_large_tft-0.0.0
da_dacy_small_trf-0.1.0
da_dacy_medium_trf-0.1.0
da_dacy_large_trf-0.1.0
da_dacy_small_trf-latest
da_dacy_medium_trf-latest
da_dacy_large_trf-latest


```{note}
The model name indicates the language (`da`), framework (`dacy`), model size (e.g.
`small`), model type (`trf`), and model version (e.g. `0.1.0`). Using a larger model
will generally increase accuracy, but also increase the memory and
time needed to run the model.
```
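
To make the naming scheme concrete, here is a minimal sketch that splits a model name into the parts listed in the note. `ModelName` and `parse_model_name` are hypothetical helpers written for this illustration and are not part of DaCy; the sketch assumes names always follow the `language_framework_size_type-version` pattern shown in the list above.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Parts of a DaCy model name (illustrative helper, not a DaCy class)."""

    language: str
    framework: str
    size: str
    model_type: str
    version: str


def parse_model_name(name: str) -> ModelName:
    """Split a name such as 'da_dacy_small_trf-0.1.0' into its parts."""
    # The version follows the last hyphen; the rest is underscore-separated.
    stem, _, version = name.rpartition("-")
    language, framework, size, model_type = stem.split("_")
    return ModelName(language, framework, size, model_type, version)


print(parse_model_name("da_dacy_small_trf-0.1.0"))
```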

From here we can now download a model using:

In [12]:
# get the latest small model:
nlp = dacy.load("small")


This will download and install the model. If the model is already downloaded, it will simply be loaded. Once loaded, DaCy works exactly like any other SpaCy model.

Using this we can now apply DaCy to text with conventional SpaCy syntax, where we pass the text through all the components of the `nlp` pipeline.


```{seealso}
DaCy is built using SpaCy, hence you will be able to find a lot of the required documentation for
using the pipeline in their very well written [documentation](https://spacy.io).
```

In [13]:
doc = nlp(
    "DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering."
)

# Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying named entities in a text. A named entity is a “real-world object” that is assigned a name, for example a person, a country, a product, or a book title.
DaCy can recognize organizations, persons, and locations, as well as other miscellaneous entities.

In [14]:
for entity in doc.ents:
    print(entity, ":", entity.label_)

DaCy : PER
dansk : MISC
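
A common next step is to group the recognized entities by label. The sketch below assumes only that each entity exposes `.text` and `.label_`, as the spans in `doc.ents` do. `group_by_label` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mirror the output above so the example runs without downloading a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_label(ents):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-ins mirroring the output above; with DaCy you would pass doc.ents.
example_ents = [
    SimpleNamespace(text="DaCy", label_="PER"),
    SimpleNamespace(text="dansk", label_="MISC"),
]
print(group_by_label(example_ents))
```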


We can also plot these using:

In [9]:
from spacy import displacy

displacy.render(doc, style="ent")

While DaCy achieved state-of-the-art performance at the time of its release, it has since been outperformed by the [NER model](https://huggingface.co/saattrupdan/nbailab-base-ner-scandi) by Dan Nielsen. To give users access to the best model for their use case, DaCy lets you easily
switch the NER component to obtain a state-of-the-art model.

To do this you can simply load the model using:

In [17]:
# load the small dacy model excluding the NER component
nlp = dacy.load("da_dacy_small_trf-latest", exclude=["ner"])
# or use an empty spacy model if you only want to do NER
# nlp = spacy.blank("da")

# add the ner component from the state-of-the-art model
nlp.add_pipe("dacy/ner")

doc = nlp("Denne NER model er trænet af Dan fra Alexandra Instituttet")

displacy.render(doc, style="ent")


```{warning}
Note that this will add an additional model to your pipeline, which will slow down inference.
```

# Part-of-speech Tagging
[Part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) (POS) is the task of assigning a part of speech to each word in a text. The part of speech is the grammatical role of a word in a sentence. For example, the word “run” is a verb, and the word “book” is a noun.

After tokenization, DaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable DaCy to predict which tag or label most likely applies in this context. A trained component includes data that is produced by showing a system enough examples for it to make predictions that generalize across the language. For example, a word following “the” in English is most likely a noun.

In [18]:
print("Token POS-tag")
for token in doc:
    print(f"{token}: {token.pos_}")



Token POS-tag
Denne: DET
NER: PROPN
model: NOUN
er: AUX
trænet: VERB
af: ADP
Dan: PROPN
fra: ADP
Alexandra: PROPN
Instituttet: NOUN


```{seealso}
For more on part-of-speech tagging, see SpaCy's [documentation](https://spacy.io/usage/linguistic-features#pos-tagging).
```
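
Once tokens are tagged, the tags can be counted or used as a filter. The following is a minimal sketch, assuming only that each token exposes `.text` and `.pos_` as in the loop above. `count_pos` is a hypothetical helper, and the stand-in tokens copy the tags printed above so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace


def count_pos(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(token.pos_ for token in tokens)


# Stand-ins copied from the output above; with DaCy you would pass doc.
tagged = [
    ("Denne", "DET"), ("NER", "PROPN"), ("model", "NOUN"), ("er", "AUX"),
    ("trænet", "VERB"), ("af", "ADP"), ("Dan", "PROPN"), ("fra", "ADP"),
    ("Alexandra", "PROPN"), ("Instituttet", "NOUN"),
]
example_tokens = [SimpleNamespace(text=t, pos_=p) for t, p in tagged]

pos_counts = count_pos(example_tokens)
# Keep only the common nouns.
nouns = [token.text for token in example_tokens if token.pos_ == "NOUN"]

print(pos_counts.most_common())
print(nouns)
```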


# Dependency Parsing

[Dependency parsing](https://en.wikipedia.org/wiki/Dependency_grammar) is the task of assigning syntactic dependencies between tokens, i.e. identifying the head of each word and the relation between the word and its head. For example, in the sentence “The quick brown fox jumps over the lazy dog”, the word “jumps” is the head of “fox”, and the relation between them is “nsubj” (nominal subject): “fox” is the head noun of the subject phrase “The quick brown fox”.
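
The head-and-relation structure can be sketched as plain data. The parse below is hand-annotated for this illustration in a Universal Dependencies style; it is not DaCy output, and a real parser may choose different labels. `root_of` and `dependents` are hypothetical helpers.

```python
# Each entry is (word, index of its head, relation); -1 marks the root.
PARSE = [
    ("The", 3, "det"),
    ("quick", 3, "amod"),
    ("brown", 3, "amod"),
    ("fox", 4, "nsubj"),
    ("jumps", -1, "ROOT"),
    ("over", 8, "case"),
    ("the", 8, "det"),
    ("lazy", 8, "amod"),
    ("dog", 4, "obl"),
]


def root_of(parse):
    """Return the word that has no head."""
    return next(word for word, head, _ in parse if head == -1)


def dependents(parse, head_index):
    """Return (word, relation) pairs for tokens attached to head_index."""
    return [(word, rel) for word, head, rel in parse if head == head_index]


print(root_of(PARSE))
print(dependents(PARSE, 4))  # tokens attached directly to "jumps"
print(dependents(PARSE, 3))  # tokens attached directly to "fox"
```

With a parsed SpaCy `Doc`, the same information is available per token through `token.head` and `token.dep_`.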

DaCy features a fast and accurate syntactic dependency parser. In DaCy this dependency parsing is also
used for sentence segmentation and detecting noun chunks.

```{seealso}
For more on dependency parsing, see SpaCy's [documentation](https://spacy.io/usage/linguistic-features#dependency-parse).
```

You can see the dependency tree using:

In [19]:
from spacy import displacy

doc = nlp("DaCy er en effektiv pipeline til dansk fritekst.")

# with no style given, displacy renders the dependency tree
displacy.render(doc)



# Sentence Segmentation
Sentence segmentation is the task of splitting a text into sentences. In DaCy this is done using the dependency parser. This makes it very accurate and allows it to detect sentence boundaries that are not marked by punctuation.

In [24]:
doc = nlp("Sætnings segmentering er en vigtig del af sprogprocessering - Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning.")

for sent in doc.sents:
    print(sent)

Sætnings segmentering er en vigtig del af sprogprocessering
- Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning.
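
The example sentence mentions one use of segmentation: splitting long texts into smaller pieces without cutting through a sentence. The following is a minimal sketch of that idea. `chunk_sentences` is a hypothetical helper, and the short placeholder strings stand in for `[sent.text for sent in doc.sents]`.

```python
def chunk_sentences(sentences, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being split.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Placeholder sentences; with DaCy use [sent.text for sent in doc.sents].
example_sentences = ["Aa bb.", "Cc dd.", "Ee ff."]
print(chunk_sentences(example_sentences, max_chars=13))
```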


# Noun Chunks
[Noun chunks](https://en.wikipedia.org/wiki/Noun_phrase) are "base noun phrases": flat phrases that have a noun as their head. For example, "the big yellow taxi" and "the quick brown fox" are noun chunks. Each chunk consists of a noun-like head, such as a noun, a pronoun, or a proper noun, together with the words that describe it.

Noun chunks are used, for example, for information extraction and for finding the subjects and objects of verbs.

In [26]:
doc = nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering.")

for nc in doc.noun_chunks:
    print(nc)

DaCy
en hurtig og effektiv pipeline
dansk sprogprocessering


# Lemmatization

[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the task of grouping together the inflected forms of a word so they can be analysed as a single item. For example, the forms “runs”, “running”, and “ran” all share the base form (lemma) “run”.

Lemmatization is used, for example, for text normalization before training a machine learning model, as it reduces the number of unique tokens in the training data.

In [30]:
doc = nlp("Normalisering af tekst kan være en god idé.")

for token in doc:
    print(token, token.lemma_)

Normalisering Normalisering
af af
tekst tekst
kan kunne
være være
en en
god god
idé idé
. .
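
To illustrate how lemmatization reduces the number of unique tokens, here is a minimal sketch that compares vocabulary size before and after mapping tokens to their lemmas. `vocabulary_sizes` is a hypothetical helper, and the English pairs are hand-written for illustration. With DaCy you would build the pairs as `[(token.text, token.lemma_) for token in doc]`.

```python
def vocabulary_sizes(pairs):
    """Return (unique surface forms, unique lemmas) for (token, lemma) pairs."""
    surface_forms = {token.lower() for token, _ in pairs}
    lemmas = {lemma.lower() for _, lemma in pairs}
    return len(surface_forms), len(lemmas)


# Hand-written (token, lemma) pairs, not model output.
example_pairs = [
    ("run", "run"),
    ("runs", "run"),
    ("ran", "run"),
    ("running", "run"),
    ("book", "book"),
    ("books", "book"),
]

n_tokens, n_lemmas = vocabulary_sizes(example_pairs)
print(f"{n_tokens} unique tokens reduce to {n_lemmas} unique lemmas")
```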
