# Introduction to spaCy/DaCy and Named Entity Recognition (NER)
This notebook is an introduction to the spaCy/DaCy universe and covers some of its basic functionality, such as extracting named entities (e.g. people, organizations, locations) from text.

In this notebook we will focus on small examples, but the methods carry over to the data that we have prepared for you on UCloud. 

This notebook is mainly meant as a reference to consult when working on the next task. Read through it, run the code, and try to think of ways these methods might be useful for gaining insights from the web/Twitter data. The next notebook contains exercises specific to your data.

## Why NER?

Named Entity Recognition (NER) is the task of identifying named entities in a text. A named entity is a “real-world object” that’s assigned a name - for example, a person, a country, a product or a book title.
NER is useful for a wide range of tasks:
1. Anonymising documents (replacing named entities with a pseudonym)
2. Information extraction: finding important actors/entities within a document. This could be used to e.g. automatically link to an employee's profile or to create location tags.
3. Categorizing documents based on the occurrence of certain entities

## NER in Danish

There are multiple tools for Danish NER. The fastest uses the [spaCy](https://spacy.io/) library, and the most accurate uses [DaCy](https://github.com/centre-for-humanities-computing/DaCy), which was developed here at CHC.


### Getting started

First off, we need to import the libraries that we intend to use. Let's load spaCy and DaCy.

In [1]:
import dacy
import spacy


To use the language models, we have to load them. Let's see an example with spaCy.

In [2]:
# load a Danish spacy model
nlp = spacy.load("da_core_news_lg")

Using the model is as simple as supplying a text to the `nlp` object.

In [3]:
text = "Joe Biden omtalte Kinas præsident som diktator: 'Nu kommer den rigtige test af, om de vil forbedre forholdet'" 
doc = nlp(text)

We can now use spaCy/DaCy to analyse the text. For instance, we can look at the individual sentences in the text:

In [4]:
for sentence in doc.sents:
    print(sentence.text)

Joe Biden omtalte Kinas præsident som diktator: '
Nu kommer den rigtige test af, om de vil forbedre forholdet'


Or find the named entities:

In [5]:
for named_entity in doc.ents:
    print(named_entity.text, named_entity.label_)

Joe Biden PER
Kinas LOC
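
Each entity in `doc.ents` also exposes its character offsets through `ent.start_char` and `ent.end_char`, which is what makes the anonymisation use case from the list above practical. The following is a minimal sketch rather than anything built into spaCy or DaCy: the helper name `pseudonymise` and the `[LABEL]` placeholder format are assumptions made for illustration, and the spans are written by hand to mirror the output above so that the snippet runs without loading a model.

```python
def pseudonymise(text, spans):
    """Replace each (start_char, end_char, label) span in text with "[label]".

    Hypothetical helper: with a spaCy doc, the spans could be built as
    [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents].
    """
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Hand-written spans mirroring the entities found above (Joe Biden, Kinas).
example = "Joe Biden omtalte Kinas præsident som diktator"
example_spans = [(0, 9, "PER"), (18, 23, "LOC")]
print(pseudonymise(example, example_spans))
```

Replacing from the end of the string backwards avoids having to recompute offsets after each substitution.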


We can also visualize the named entities using the built-in visualizer.

In [6]:
from spacy import displacy

displacy.render(doc, style="ent")

### Fine-grained NER with DaCy
While spaCy's named entity recognition model is fast, it only distinguishes people (PER), locations (LOC), organizations (ORG), and a miscellaneous category (MISC). The DaCy model supports more fine-grained entity types, as shown in the table below:

| Tag          | Description                                          |
| ------------ | ---------------------------------------------------- |
| PERSON       | People, including fictional                          |
| NORP         | Nationalities or religious or political groups       |
| FACILITY     | Buildings, airports, highways, bridges, etc.         |
| ORGANIZATION | Companies, agencies, institutions, etc.              |
| GPE          | Countries, cities, states                            |
| LOCATION     | Non-GPE locations, mountain ranges, bodies of water  |
| PRODUCT      | Vehicles, weapons, foods, etc. (not services)        |
| EVENT        | Named hurricanes, battles, wars, sports events, etc. |
| WORK OF ART  | Titles of books, songs, etc.                         |
| LAW          | Named documents made into laws                       |
| LANGUAGE     | Any named language                                   |
     
As well as annotations for the following concepts: 


| Tag      | Description                                  |
| -------- | -------------------------------------------- |
| DATE     | Absolute or relative dates or periods        |
| TIME     | Times smaller than a day                     |
| PERCENT  | Percentage, including "%"                    |
| MONEY    | Monetary values, including unit              |
| QUANTITY | Measurements, as of weight or distance       |
| ORDINAL  | "first", "second", etc.                      |
| CARDINAL | Numerals that do not fall under another type |


Let's load the fine-grained model.

In [7]:
# load the small dacy model excluding the NER component
dacy_nlp = dacy.load("small", exclude=["ner"])

# add the ner component from the fine-grained model
dacy_nlp.add_pipe("dacy/ner-fine-grained", config={"size": "small"})

Defaulting to user installation because normal site-packages is not writeable
Collecting da-dacy-small-trf==any
  Downloading https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl (101.3 MB)
Installing collected packages: da-dacy-small-trf
Successfully installed da-dacy-small-trf-0.2.0
Defaulting to user installation because normal site-packages is not writeable
Collecting da-dacy-small-ner-fine-grained==any
  Downloading https://huggingface.co/chcaa/da_dacy_small_ner_fine_grained/resolve/43fedc5a1b1c1d193f461d13225f217f2ced507d/da_dacy_small_ner_fine_grained-any-py3-none-any.whl (82.7 MB)
Installing collected packages: da-dacy-small-ner-fine-grained
Successfully installed da-dacy-small-ner-fine-grained-0.1.0


<spacy.pipeline.ner.EntityRecognizer at 0x7f3b6b18f990>

Let's give it a try!

In [8]:
doc = dacy_nlp("Denne model samt 3 andre blev trænet d. 7. marts af Center for Humanities Computing i Aarhus kommune")

displacy.render(doc, style="ent")

### Running spaCy/DaCy on multiple texts
To analyze multiple texts at once, you can use the `nlp.pipe` method. Here's an example.

In [9]:
texts = [
    "Her er det 1. tekststykke. Det er kort, og uden personer.",
    "Her er det 2. tekststykke. Det er lidt længere, og indeholder en person. Vi kunne kalde ham Kristoffer.",
]

docs = dacy_nlp.pipe(texts)
# iterate over the documents one by one, numbering them from 1
for i, doc in enumerate(docs, start=1):
    print(f"Document {i}:")
    displacy.render(doc, style="ent")

Document 1:


Document 2:
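
Note that `nlp.pipe` returns a generator, so the documents can only be iterated over once unless you wrap the call in `list(...)`. When working with many texts it is often more useful to aggregate the entities than to render each document. The following is a minimal sketch of that idea: `count_entity_labels` is a hypothetical helper, and the stand-in objects are hand-made (not model output) so that the snippet runs without loading a model.

```python
from collections import Counter
from types import SimpleNamespace


def count_entity_labels(docs):
    """Count entity labels over an iterable of doc-like objects.

    Only `doc.ents` and `ent.label_` are used, so real spaCy docs work too.
    """
    counts = Counter()
    for doc in docs:
        counts.update(ent.label_ for ent in doc.ents)
    return counts


# Stand-in objects (not real model output): one doc without entities,
# one doc with a single person.
fake_docs = [
    SimpleNamespace(ents=[]),
    SimpleNamespace(ents=[SimpleNamespace(text="Kristoffer", label_="PERSON")]),
]
label_counts = count_entity_labels(fake_docs)
print(label_counts)
```

With the pipeline loaded above, the equivalent call would be `count_entity_labels(dacy_nlp.pipe(texts))`.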


### Lemmatization

Lemmatization reduces the inflected forms of a word to a shared base form (the lemma) so they can be analysed as a single item. For example, the English verb forms “runs”, “running”, and “ran” all have the lemma “run”.

A common use is text normalization: lemmatizing before training a machine learning model reduces the number of unique tokens in the training data. Let's see an example.

In [10]:
doc = nlp("Normalisering af tekst kan være en god idé.")

for token in doc:
    print(token, token.lemma_)

Normalisering Normalisering
af af
tekst tekst
kan kunne
være være
en en
god god
idé idé
. .
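
To make the effect on vocabulary size concrete, here is a small sketch that compares the number of unique surface forms with the number of unique lemmas. The helper `vocabulary_sizes` is hypothetical, and the pairs are written by hand for illustration (only `kan` → `kunne` appears in the output above); with a real document they could be built as `[(t.text, t.lemma_) for t in doc]`.

```python
def vocabulary_sizes(token_lemma_pairs):
    """Return (unique surface forms, unique lemmas) for (token, lemma) pairs."""
    forms = {token.lower() for token, _ in token_lemma_pairs}
    lemmas = {lemma.lower() for _, lemma in token_lemma_pairs}
    return len(forms), len(lemmas)


# Hand-written pairs for illustration, not model output.
pairs = [
    ("kan", "kunne"),
    ("kunne", "kunne"),
    ("kunnet", "kunne"),
    ("idé", "idé"),
    ("idéer", "idé"),
]
n_forms, n_lemmas = vocabulary_sizes(pairs)
print(n_forms, n_lemmas)
```

Five distinct word forms collapse into two lemmas here, which is the reduction in unique tokens described above.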


### Other linguistic features

spaCy/DaCy is not limited to named entity recognition and lemmatization. You can also perform many other linguistic analyses, such as inspecting part-of-speech tags or the dependency relations between words. The example below reuses the `doc` from the lemmatization example.

In [11]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Normalisering NOUN nsubj idé
af ADP case tekst
tekst NOUN nmod Normalisering
kan AUX aux idé
være AUX cop idé
en DET det idé
god ADJ amod idé
idé NOUN ROOT idé
. PUNCT punct idé
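
Each row above links a word to its head, so together the rows describe a tree. As a rough sketch of how to read it, the snippet below groups the words under their heads and picks out the root and the subject. The rows are copied from the output above, so no model is needed. Keying by word text is a simplification that only works because no word repeats in this sentence.

```python
from collections import defaultdict

# (text, pos, dep, head) rows copied from the output above.
rows = [
    ("Normalisering", "NOUN", "nsubj", "idé"),
    ("af", "ADP", "case", "tekst"),
    ("tekst", "NOUN", "nmod", "Normalisering"),
    ("kan", "AUX", "aux", "idé"),
    ("være", "AUX", "cop", "idé"),
    ("en", "DET", "det", "idé"),
    ("god", "ADJ", "amod", "idé"),
    ("idé", "NOUN", "ROOT", "idé"),
    (".", "PUNCT", "punct", "idé"),
]

# Group each word under its head; the root points at itself, so skip it.
dependents = defaultdict(list)
for text, pos, dep, head in rows:
    if dep != "ROOT":
        dependents[head].append((text, dep))

root = next(text for text, _, dep, _ in rows if dep == "ROOT")
subjects = [text for text, _, dep, _ in rows if dep == "nsubj"]
print(root, subjects, dependents[root])
```

On real spaCy tokens the same navigation is available directly through `token.head` and `token.children`, which avoids the ambiguity of matching on text.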


For more examples, check out the [DaCy tutorials](https://centre-for-humanities-computing.github.io/DaCy/tutorials.html) or the [spaCy 101](https://spacy.io/usage/spacy-101).