# Information Extraction
We often need to extract structured data from documents.
For example, most documents contain dates and the names of people or places.
We can collect this information for statistics or for further processing.


## Named Entity Recognition
We can use Named Entity Recognition (NER) to extract names of people, companies, places and other entities.
Most NER systems also extract numeric expressions such as dates, quantities and monetary amounts.
We will use the NER module in [spaCy](https://spacy.io/).
spaCy is a library for Natural Language Processing in Python.

First, we import the spaCy library and its visualization module, displaCy.

In [None]:
import spacy
from spacy import displacy

spaCy provides trained models for many different languages.
Before we use NER in spaCy for the first time, we must download a model for the English language; here we use the small model `en_core_web_sm`.

In [None]:
!python -m spacy download en_core_web_sm

We load a short text document to run the Named Entity Recognizer on.

In [None]:
filename = 'LO-NTF-v-Norway.txt'
with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

We load the English NLP model:

In [None]:
nlp = spacy.load("en_core_web_sm")

Next, we process the text with the NLP model.

In [None]:
document = nlp(text)

We can extract the entities and their labels:

In [None]:
entities = [(ent.text, ent.label_) for ent in document.ents]

Let's look at the data:

In [None]:
for entity in entities:
    print(f"Entity: {entity[0]}, Label: {entity[1]}")
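Since one use of extracted entities is statistics, a minimal sketch of counting entities per label may help. The list `sample_entities` below is made-up illustration data in the same `(text, label)` shape as `entities`; it is not the output for the document loaded above.

```python
from collections import Counter

# Made-up sample in the same (text, label) shape as `entities`.
sample_entities = [
    ("Norway", "GPE"),
    ("LO", "ORG"),
    ("NTF", "ORG"),
    ("2021", "DATE"),
    ("Norway", "GPE"),
]

# Count how many entities were found for each label.
label_counts = Counter(label for _, label in sample_entities)

# Print the labels, most frequent first.
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

In the notebook, the same counting should work on the real data by passing `entities` instead of `sample_entities`.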

We can also collect the distinct entity types:

In [None]:
entity_types = {ent.label_ for ent in document.ents}
print(f"Entity types: {entity_types}")
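Label names such as `GPE` are not self-explanatory. spaCy's `spacy.explain` function returns a short description for the labels it knows. The sketch below assumes a hand-picked list of labels that the English models commonly produce; in the notebook, the set of entity types printed above could be passed in its place.

```python
import spacy

# Hand-picked labels for illustration; the labels found in a given
# document may differ.
labels = ["PERSON", "ORG", "GPE", "DATE"]

# spacy.explain looks up a human-readable description for each label.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```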

Finally, we can display the tagged text.

In [None]:
# Visualize text with named entities as tags
displacy.render(document, style="ent", jupyter=True)
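Outside a notebook, the same visualization can be saved as HTML. The sketch below assumes a hand-annotated example sentence and the output filename `entities.html`, both chosen for illustration. It uses displaCy's manual mode so that no trained model is needed.

```python
from spacy import displacy

# Example sentence with hand-marked entities (manual mode), so this
# sketch runs without downloading a model.
sentence = "Oslo is the capital of Norway."
manual_doc = {
    "text": sentence,
    "ents": [
        {"start": 0, "end": 4, "label": "GPE"},
        {"start": sentence.find("Norway"),
         "end": sentence.find("Norway") + len("Norway"),
         "label": "GPE"},
    ],
    "title": None,
}

# With jupyter=False, render() returns the HTML markup as a string
# instead of displaying it in the notebook.
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)

# Save the markup so that it can be opened in a browser.
with open("entities.html", "w", encoding="utf-8") as file:
    file.write(html)
```

For the processed text in this notebook, passing `document` without `manual=True` should produce the markup for the real entities.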