# NLP Processing With spaCy

spaCy is a Python library for natural language processing (NLP) that supports many languages. It is an essential toolbox for the computational analysis of text corpora.

This course is an introduction to its main features, including:

- tokenization
- lemmatization
- named-entity recognition and linking

Some reading material:

- https://spacy.io/
- Advanced NLP with spaCy: https://course.spacy.io/en/
- for NER : https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/12-Named-Entity-Recognition.html
- for Text-Mining (basics) : https://github.com/mchesterkadwell/named-entity-recognition/blob/main/1-basic-text-mining-concepts.ipynb 

This course includes parts of the excellent tutorial "Natural Language Processing With spaCy in Python": 
https://realpython.com/natural-language-processing-spacy-python

In [None]:
# Install and import Spacy
#!pip install -U spacy

In [None]:
import spacy

## Language Processing Pipelines & Trained Models

[https://spacy.io/usage/processing-pipelines](https://spacy.io/usage/processing-pipelines)

When you call `nlp` on a text, spaCy first tokenizes the text to produce a [Doc object](https://spacy.io/api/doc). The Doc is then processed in several different steps (i.e. the processing pipeline). Each pipeline component returns the processed Doc, which is then passed on to the next component.

![spacy_pipeline](img/spacy_pipeline.svg)

The capabilities of a processing pipeline always depend on the components, their models and how they were trained. For example, a pipeline for named entity recognition needs to include a trained named entity recognizer component.
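
As a minimal sketch of this dependency, the snippet below builds a blank French pipeline with `spacy.blank("fr")` (tokenizer only, no trained model) and adds the rule-based `sentencizer` component. The sample text and the variable name `nlp_blank` are ours; the point is only that a pipeline without an `ner` component returns no entities.

```python
import spacy

# A blank pipeline contains only the language-specific tokenizer.
nlp_blank = spacy.blank("fr")
print(nlp_blank.pipe_names)

# Components are added by their registered name.
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

# No "ner" component, hence no named entities in the processed Doc.
doc = nlp_blank("Zola écrit au président. La lettre paraît dans un journal.")
print(nlp_blank.has_pipe("ner"), len(doc.ents))
```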

See the documentation of the available models for details: [https://spacy.io/models](https://spacy.io/models)



**Models available for French:**

|name|genre|size|use|components|
|----|-----|----|---|----------|
|[fr_core_news_sm](https://spacy.io/models/fr#fr_core_news_sm)|written text (news, media)|15 MB|CPU|tok2vec, morphologizer, parser, senter, attribute_ruler, lemmatizer, ner|
|[fr_core_news_md](https://spacy.io/models/fr#fr_core_news_md)|written text (news, media)|43 MB|CPU|tok2vec, morphologizer, parser, senter, ner, attribute_ruler, lemmatizer|
|[fr_core_news_lg](https://spacy.io/models/fr#fr_core_news_lg)|written text (news, media)|545 MB|CPU|tok2vec, morphologizer, parser, senter, ner, attribute_ruler, lemmatizer|
|[fr_dep_news_trf](https://spacy.io/models/fr#fr_dep_news_trf)|written text (news, media)|382 MB|GPU (camembert-base)|transformer, morphologizer, parser, attribute_ruler, lemmatizer|

## Install and import trained pipeline

In [None]:
#!conda install -c conda-forge spacy-model-fr_core_news_md

The load() function returns a [Language callable object](https://spacy.io/api/language), which is commonly assigned to a variable called nlp.

In [None]:
import fr_core_news_md
nlp = fr_core_news_md.load()
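
A pipeline can also be loaded by name with `spacy.load()`. The sketch below checks first whether the package is installed, using `spacy.util.is_package()`, and falls back to a blank French pipeline otherwise. The fallback and the names `MODEL_NAME` and `nlp_loaded` are our own choices for illustration: a blank pipeline tokenizes but has no trained components.

```python
import spacy

MODEL_NAME = "fr_core_news_md"

# spacy.util.is_package() reports whether the model is installed as a package.
if spacy.util.is_package(MODEL_NAME):
    nlp_loaded = spacy.load(MODEL_NAME)
else:
    # Fallback for illustration only: tokenizer without trained components.
    nlp_loaded = spacy.blank("fr")

print(nlp_loaded.lang, nlp_loaded.pipe_names)
```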

In [None]:
type(nlp)

In [None]:
print(nlp.pipe_names)

## Tokens

To start processing your input, you construct a [Doc object](https://spacy.io/api/doc). 

A Doc object is a sequence of [Token](https://spacy.io/api/token) objects, each representing a lexical token (~ a word). Each token carries different features describing it (lemma, morpho-syntactic label, etc.).

**You can instantiate a Doc object by calling the Language object with the input string as an argument**:

In [None]:
quote = 'Quant aux gens que j’accuse, je ne les connais pas, je ne les ai jamais vus, je n’ai contre eux ni rancune ni haine. Ils ne sont pour moi que des entités, des esprits de malfaisance sociale. Et l’acte que j’accomplis ici n’est qu’un moyen révolutionnaire pour hâter l’explosion de la vérité et de la justice.\nJe n’ai qu’une passion, celle de la lumière, au nom de l’humanité qui a tant souffert et qui a droit au bonheur. Ma protestation enflammée n’est que le cri de mon âme. Qu’on ose donc me traduire en cour d’assises et que l’enquête ait lieu au grand jour !\nJ’attends.'
document = nlp(quote)

Each token has many easily accessible [attributes](https://spacy.io/api/token#attributes).

In [None]:
document_tokens = []
for token in document:
    document_tokens.append(token.text)
print(document_tokens)

Or, more simply, by using a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp):

In [None]:
[token.text for token in document]

In [None]:
'''
for token in document:
    print(token.text, token.lemma_, token.pos_, token.is_punct, token.is_stop, token.sent[:4])
'''
print(
    f"{'Index':9}"
    f"{'Text':15}"
    f"{'Lemma':15}"
    f"{'POS':10}"
    f"{'Punct?':10}"
    f"{'Stop Word?':15}"
    f"{'Sentence beginning'}"
)
for token in document[71:92]:
    print(
        f"{str(token.i):9}"
        f"{str(token.text):15}"
        f"{str(token.lemma_):15}"
        f"{str(token.pos_):10}"
        f"{str(token.is_punct):10}"
        f"{str(token.is_stop):15}"
        f"{str(token.sent[:6])+'…'}"
    )
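
Not all of these attributes come from the same place. As a rough sketch, the snippet below runs a blank French pipeline (`spacy.blank("fr")`, no trained model) on a sample sentence of our own: positional and lexical attributes such as `i`, `idx` and `is_punct` are filled in by the tokenizer and vocabulary, whereas `pos_` stays empty because no tagger has run.

```python
import spacy

nlp_blank = spacy.blank("fr")
doc = nlp_blank("Zola écrit une lettre.")

for token in doc:
    # token.i: position in the Doc; token.idx: character offset in the text
    print(token.i, token.idx, token.text, token.is_punct, repr(token.pos_))
```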


You may need to store each token and its attributes in a dataframe:

In [None]:
# method 2: store the token features in a DataFrame
import pandas as pd
token_atts = []
for token in document:
    token_atts.append(
        [token.text, token.lemma_, token.pos_, token.is_punct, token.is_stop, f"{str(token.sent[:6])+'…'}"]
    )
token_atts_df = pd.DataFrame(token_atts)
token_atts_df.columns = ['Text', 'Lemma', 'POS', 'Is_Punct', 'Is_Stop_Word', 'Sentence_Begin']
token_atts_df.iloc[71:92]


You can customize the tokenizer by defining your own segmentation rules:

- https://spacy.io/usage/linguistic-features#native-tokenizers
- https://realpython.com/natural-language-processing-spacy-python/#tokens-in-spacy
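
One simple form of customization is a tokenizer special case. The sketch below, on a blank French pipeline and a sample sentence of our own, registers the abbreviation `MM.` with `tokenizer.add_special_case()` so that it is always kept as a single token; the first `print` shows the default behavior for comparison.

```python
import spacy
from spacy.symbols import ORTH

nlp_blank = spacy.blank("fr")
text = "MM. Zola et Clemenceau publient la lettre."

print([token.text for token in nlp_blank(text)])

# Register "MM." as a special case so that it is always kept as one token.
nlp_blank.tokenizer.add_special_case("MM.", [{ORTH: "MM."}])
tokens_after = [token.text for token in nlp_blank(text)]
print(tokens_after)
```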

More often than not, we need to load the textual content of a file to instantiate a Doc object.  
The [pathlib](https://docs.python.org/3/library/pathlib.html) module offers classes representing filesystem paths with semantics appropriate for different operating systems.

Let's try loading Zola's famous text:

In [None]:
import pathlib

file_path = './data/zola_accuse_fr.txt'
zola_doc = nlp(pathlib.Path(file_path).read_text(encoding="utf-8"))
print([token.text for token in zola_doc])

## Sentence Detection

The following examples are taken from the Real Python introduction: https://realpython.com/natural-language-processing-spacy-python/

Sentence detection is the process of locating where sentences start and end in a given text. This allows you to divide a text into linguistically meaningful units.

In spaCy, the `.sents` property is used to extract sentences from the Doc object. Here’s how you would extract the total number of sentences and the sentences themselves for a given input:

In [None]:
# counting sentences
sentences = list(zola_doc.sents)
len(sentences)

There's no built-in sentence index: `.sents` can only be iterated over. That is why the cell above uses the [list()](https://www.w3schools.com/python/ref_func_list.asp) function to create a list object: the built-in `len()` function then returns the number of elements in the list, and each sentence can be accessed by its index:

In [None]:
# display the 10th sentence
sentences[9]

In [None]:
# display all sentences
for i, sentence in enumerate(sentences):
    print(f'{i}: {sentence[:10]}…')

**You can also customize sentence detection behavior** by using custom delimiters (https://spacy.io/usage/linguistic-features#sbd-custom), for example to deal with Zola's particular use of the exclamation mark.

In the next example, the `@Language.component("set_custom_boundaries")` decorator registers a new function that takes a Doc object as an argument. The job of this function is to identify tokens that must not begin a sentence (here, those following `!`, `’` or `«`) and to set their `.is_sent_start` attribute to False. Once done, the function must return the Doc object.

Then, you can add the custom boundary function to the Language object by using the `.add_pipe()` method.  
Parsing text with this modified Language object will not treat the exclamation mark as an end-of-sentence marker.

In [None]:
from spacy.language import Language
@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in ('!', '’', '«'):
            doc[token.i+1].is_sent_start = False
    return doc

custom_nlp = fr_core_news_md.load()
custom_nlp.add_pipe("set_custom_boundaries", before="parser")
zola_doc = custom_nlp(pathlib.Path(file_path).read_text(encoding="utf-8"))
for sentence in zola_doc.sents:
    print(f'{sentence[:10]}//')
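
For comparison, here is a model-free sketch of the same idea using spaCy's rule-based `sentencizer` component, whose `punct_chars` setting lists the characters that may end a sentence. The sample text and the pipeline names are ours, and this is an alternative technique rather than the parser-based approach used in the cell above.

```python
import spacy

text = "Quelle surprise ! Il est venu. Elle aussi."

# Default sentencizer: "!" ends a sentence.
nlp_default = spacy.blank("fr")
nlp_default.add_pipe("sentencizer")
default_sents = [sent.text for sent in nlp_default(text).sents]

# Restricted punctuation: only "." and "?" end a sentence.
nlp_no_excl = spacy.blank("fr")
nlp_no_excl.add_pipe("sentencizer", config={"punct_chars": [".", "?"]})
custom_sents = [sent.text for sent in nlp_no_excl(text).sents]

print(default_sents)
print(custom_sents)
```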

## Stopwords

Stop words are typically defined as the most common words in a language.

In NLP, stop words are often removed because they carry little meaning on their own and heavily distort word frequency analyses, for example in **topic modeling**. But this is not always the case: computational methods of **text attribution** (automatic author identification) rely largely on the analysis of stop words. In that case, you may wish to keep only the stop words and delete all other words.
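
To give a rough idea of what "analysis of stop words" means for attribution, the sketch below computes the relative frequency of a few function words in a token list. The function name `function_word_profile`, the word list and the sample tokens are all hypothetical; real attribution methods compare such profiles across much longer texts.

```python
from collections import Counter

def function_word_profile(tokens, function_words):
    """Relative frequency of each function word among all tokens."""
    lowered = [token.lower() for token in tokens]
    counts = Counter(word for word in lowered if word in function_words)
    total = len(lowered) or 1  # avoid dividing by zero on empty input
    return {word: counts[word] / total for word in sorted(function_words)}

# Hypothetical toy inputs, far too short for real attribution work.
function_words = {"de", "la", "et", "que"}
sample_tokens = "la vérité est en marche et rien ne l' arrêtera".split()

profile = function_word_profile(sample_tokens, function_words)
print(profile)
```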

SpaCy stores a list of stop words for the different languages.  
See : https://machinelearningknowledge.ai/tutorial-for-stopwords-in-spacy/

In [None]:
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
print(fr_stop)

In [None]:
# use sorted() to sort the set
print(sorted(list(fr_stop))[:20])

This list is not the best… You may need to modify it: add/remove stop words:

In [None]:
# 'plupart', 'aucuns' and 'tantôt'  are not part of the default list; they are added.
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop_custom

fr_stop_custom.add('plupart')
fr_stop_custom |= {'aucuns','tantôt'}

In [None]:
# Delete word(s) from the list: ouias, parler, specifique
# .discard(), unlike .remove(), does not fail if the cell is run twice
fr_stop_custom.discard('ouias')
fr_stop_custom -= {'parler', 'specifique'}
len(fr_stop_custom)
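
Note that `import ... as` only binds a new name to the same set object; it does not copy it. The sketch below uses a small stand-in set (not spaCy's actual list) to show the difference between an alias and an independent copy made with `set()`, together with the set operations used in the cells above.

```python
default_stop = {"le", "la", "de", "ouais"}   # stand-in for spaCy's STOP_WORDS

alias = default_stop          # same object: changes are visible through both names
custom = set(default_stop)    # independent copy

custom.add("plupart")
custom |= {"aucuns", "tantôt"}

custom.discard("ouais")       # removes the word
custom.discard("absent")      # no error; .remove("absent") would raise KeyError

print(alias is default_stop, custom is default_stop)
print(sorted(default_stop))
print(sorted(custom))
```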

In [None]:
# You can easily keep only the stop words by making use of the .is_stop attribute of each token:
print([token for token in zola_doc[:500] if token.is_stop])

In [None]:
# Conversely, it's just as easy to keep only the content words that carry the semantics of the text:
print([token for token in zola_doc[:500] if not token.is_stop and not token.is_punct and not token.is_space])

## Word Frequency

From this, we can calculate word frequency lists.

In [None]:
from collections import Counter

words = [
    token.text
    for token in zola_doc
    if not token.is_punct and not token.is_space
]

print(Counter(words).most_common(20))

In [None]:
# !pip install matplotlib

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

words = [
    token.text
    for token in zola_doc
    if not token.is_punct and not token.is_space
]

donnees = Counter(words).most_common(20)

# Extract labels and values from tuples
etiquettes = [x[0] for x in donnees]
valeurs = [x[1] for x in donnees]

# x positions of the bars
indices = range(len(donnees))

# Draw bar chart
plt.bar(indices, valeurs)

# Add labels
plt.xticks(indices, etiquettes)

# Display
plt.show()

It doesn't tell us much... 'de' is the most frequent word in the French language, and the frequency distribution follows [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).  
That's why we need stopwords!
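
Zipf's law states that a word's frequency is roughly inversely proportional to its rank, so the product rank × count stays roughly constant. The sketch below illustrates this with a hypothetical helper `rank_frequency` and synthetic counts chosen to follow the ideal distribution; real corpora only approximate it.

```python
from collections import Counter

def rank_frequency(counter):
    """Return (rank, word, count, rank * count) rows, most frequent first."""
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counter.most_common(), start=1)
    ]

# Synthetic counts chosen to follow an ideal Zipf distribution (count = 120 / rank).
synthetic = Counter({"de": 120, "la": 60, "le": 40, "et": 30})
rows = rank_frequency(synthetic)
for row in rows:
    print(row)
```

The same function can be applied to `Counter(words)` from the cells above to see how closely Zola's text follows the law.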


In [None]:
words = [
    token.text
    for token in zola_doc
    if not token.is_stop and not token.is_punct and not token.is_space
]

print(Counter(words).most_common(20))

It's easy to see what the text is about, and a historian would no doubt identify it from this list alone. But inflection (conjugation, plurals) still splits the counts of a single word across several forms. We need to lemmatize.

## Lemmatization

Lemmatization is the process of reducing the inflected forms of a word to a single base form. This base form (the dictionary entry) is called a lemma.

For example, loves, loved and loving are all forms of the lemma 'love'. The inflection of a word allows you to express different grammatical categories, like tense (loved vs love), number (lover vs lovers), and so on. Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.

spaCy puts a `lemma_` attribute on the Token class. This attribute holds the lemmatized form of the token:

In [None]:
for token in zola_doc[36:150]:
    if str(token) != str(token.lemma_):
        print(f"{str(token):>20} : {str(token.lemma_)}")

**Let's do our sums.**

[Counter](https://docs.python.org/3/library/collections.html#collections.Counter) is a subclass of dict that's specially designed for counting hashable objects in Python. It's a dictionary that stores objects as keys and counts as values.

Just pass to the Counter the list of words to count, and then call the [.most_common()](https://docs.python.org/3/library/collections.html#collections.Counter.most_common) method and that's it!
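
A tiny illustration on a hand-made word list (the list and the name `tally` are ours), showing the dictionary-like access and `.most_common()`:

```python
from collections import Counter

toy_words = ["justice", "vérité", "justice", "crime", "vérité", "justice"]
tally = Counter(toy_words)

print(tally["justice"])        # count for one key
print(tally["absent"])         # missing keys count as 0, no KeyError
print(tally.most_common(2))    # the two most frequent items with their counts
```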

In [None]:
words = [
    token.lemma_
    for token in zola_doc
    if not token.is_stop and not token.is_punct and not token.is_space
]

print(Counter(words).most_common(20))

Note, for example, that all conjugations of "vouloir" are reduced to its lemma. If you don't lemmatize the text, 'veux' and 'voulais' will be counted as different words, even though they both refer to the same concept. By lemmatizing, you can avoid duplicate words that may overlap conceptually.
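
The effect on the counts can be sketched without any model, using a hand-written lookup table in place of a lemmatizer. The table `toy_lemmas` and the list of forms are invented for illustration; spaCy's lemmatizer is not involved here.

```python
from collections import Counter

# Hand-written lookup table for illustration; spaCy's lemmatizer is not involved.
toy_lemmas = {"veux": "vouloir", "voulais": "vouloir", "veut": "vouloir"}
forms = ["veux", "voulais", "veut", "vérité", "veux"]

by_form = Counter(forms)
by_lemma = Counter(toy_lemmas.get(form, form) for form in forms)

print(by_form.most_common())
print(by_lemma.most_common())
```

Counted by surface form, the verb is spread over three entries; counted by lemma, it becomes the single most frequent item.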

This model for French is far from perfect: the counts would make more sense if the lemmatization were better. But still, it's a useful first indication of the topics addressed by the text: we understand that it's about Dreyfus, justice and truth.

With lemmatization, we can usually also recover morphosyntactic labels, which allow us to filter counts according to word (grammatical) category. For instance, what is the most frequent common noun?

## Part-of-Speech Tagging

Part-of-speech (POS) tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.

In spaCy, POS tags are available as an attribute (`pos_`) on the Token object:

In [None]:
print(
    f"{'text':15}"
    f"{'lemma':15}"
    f"{'pos':10}"
    f"{'pos_explanation':10}"
)

for token in zola_doc[:20]:
    if not token.is_space:
        print(
            f"{str(token.text):15}"
            f"{str(token.lemma_):15}"
            f"{str(token.pos_):10}"
            f"{str(spacy.explain(token.pos_)):10}"
        )

In [None]:
nouns = [
    token.lemma_
    for token in zola_doc
    if token.pos_ == "NOUN"
]
print(Counter(nouns).most_common(20))

Not bad. But let's try to format our output better, by storing the counts in a dataframe that we can easily manipulate later:

In [None]:
import pandas as pd
nouns_tally = Counter(nouns)
nouns_df = pd.DataFrame(nouns_tally.most_common(), columns=['nouns', 'count'])
nouns_df.iloc[0:20]

Note that [displacy](https://spacy.io/usage/visualizers) makes it easy to build diagrams, which can look very serious in your thesis or article... 🙄

For instance the dependency visualizer, [dep](https://spacy.io/usage/visualizers#dep), shows part-of-speech tags and syntactic dependencies:

In [None]:
from spacy import displacy

In [None]:
# displaCy options : https://spacy.io/api/top-level#displacy_options

s=20 # 0-based index of the sentence to display (here the 21st)
i=0
for sentence in zola_doc.sents:
    if i==s:
        displacy.render(
            sentence,
            style="dep",
            jupyter=True,
            options={'distance': 100, 'compact':False}
        )
    elif i>s:
        break
    i+=1

## Named-Entity Recognition

We've seen that counting common nouns is a good indicator of the topics. What if we could count the people or places mentioned?

Named-entity recognition (NER) is the process of locating named entities and then classifying them into predefined categories, such as person names, locations, organizations.

Let's see if the NER helps us to better understand the meaning of our text.  
spaCy has the property `.ents` on Doc objects. You can use it to extract named entities:

In [None]:
for ent in zola_doc[2100:2200].ents:
    print(
        f"""
        {ent.text = }
        {ent.label_ = }
        type = {spacy.explain(ent.label_)}"""
    )

In the above example, `ent` is a [Span object](https://spacy.io/api/span) with various attributes:

- `.text` gives the Unicode text representation (the string) of the entity.
- `.label_` gives the label of the entity.
- `.start_char` denotes the character offset for the start of the entity.
- `.end_char` denotes the character offset for the end of the entity.

`spacy.explain()` gives descriptive details about each entity label.
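
To see how these attributes relate to the underlying text, the sketch below annotates one entity by hand on a blank French pipeline, so no trained NER model is needed. The sentence and the `PER` label are our own annotation, not a model prediction.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("fr")
doc = nlp_blank("Le colonel Picquart a fait son devoir.")

# Hand-made annotation: token 2 ("Picquart") labelled as a person.
doc.ents = [Span(doc, 2, 3, label="PER")]

ent = doc.ents[0]
print(ent.text, ent.label_, ent.start_char, ent.end_char)

# The character offsets slice the entity back out of the raw text.
print(doc.text[ent.start_char:ent.end_char])
```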

Counting people: for a better understanding, we propose two methods:

- The first uses a `for` loop, to understand how to read the list of entities.
- The second (list comprehension) is more Pythonic.

In [None]:
# method 1:
people = []
for named_entity in zola_doc.ents:
    if named_entity.label_ == "PER":
        people.append(named_entity.text)

print(Counter(people).most_common(10))

In [None]:
# method 2 (pythonic):
people = [
    entity.text
    for entity in zola_doc.ents
    if entity.label_ == 'PER'
]
people_df = pd.DataFrame(Counter(people).most_common(20), columns=['character', 'count'])
people_df.iloc[0:10]

You can also use displaCy to visualize these entities. Here, we're only visualizing a few sentences (`list(zola_doc.sents)[116:120]`), but it is of course possible to annotate the entire text (`zola_doc`).

In [None]:
from spacy import displacy
displacy.render(list(zola_doc.sents)[116:120], style='ent')

## Named-Entity Linking

Named-entity recognition is very useful: it enables us to extract entities. But it does not identify them. We know that a person is named 'Picquart', but how do we know who he is? How do we resolve the inevitable ambiguities? Namesakes are common…

This is the purpose of linking: to try to assign a shared identifier to the entity (e.g. Wikidata). So we learn that 'Picquart' is 'Marie-Georges Picquart (1854-1914)', a key player in the Dreyfus affair. Thanks to this linkage, we can also automatically retrieve information about the person, via APIs!
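
As a hedged sketch of such a retrieval, the functions below build the URL of Wikidata's `Special:EntityData` JSON export for a given QID and read a label from it with the standard library only. The helper names, the `User-Agent` string and the QID `Q42` are placeholders of ours; `fetch_label` needs network access and is defined but not called here.

```python
import json
import re
import urllib.request

def wikidata_entity_url(qid):
    """Build the Special:EntityData JSON URL for a Wikidata QID such as 'Q42'."""
    if not re.fullmatch(r"Q\d+", qid):
        raise ValueError(f"not a Wikidata QID: {qid!r}")
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

def fetch_label(qid, lang="fr"):
    """Download the entity record and return its label in the given language."""
    request = urllib.request.Request(
        wikidata_entity_url(qid),
        headers={"User-Agent": "spacy-course-example/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        record = json.load(response)
    return record["entities"][qid]["labels"][lang]["value"]

# "Q42" is a placeholder QID, not the identifier of anyone in Zola's text.
print(wikidata_entity_url("Q42"))
```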

Entity Linking is a difficult task, and there are many different strategies. We present here [spaCy fishing](https://github.com/Lucaterre/spacyfishing), a spaCy wrapper for entity-fishing, a tool for named entity recognition, linking and disambiguation against Wikidata.

In [None]:
#!pip install spacyfishing

### French

In [None]:
nlp_fr = spacy.load("fr_core_news_md")

In [None]:
# default (but the service is often down...)
nlp_fr.add_pipe("entityfishing", config={'language':'fr'})

In [None]:
# same, using huma-num instance:
'''
nlp_fr.add_pipe("entityfishing", config={
    'language':'fr',
    'api_ef_base': 'http://nerd.huma-num.fr/nerd/service'
})
'''

In [None]:
print(nlp_fr.pipe_names)

In [None]:
# import text
import pathlib
file_path = './data/zola_accuse_fr.txt'
zola_doc_fr = nlp_fr(pathlib.Path(file_path).read_text(encoding="utf-8"))

In [None]:
# check service (the Doc must have been processed first)
zola_doc_fr._.metadata

In [None]:
for ent in zola_doc_fr.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))

spaCy fishing can also link entities to several knowledge bases by [collecting extra information from Wikidata](https://github.com/Lucaterre/spacyfishing#get-extra-information-from-wikidata):

In [None]:
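nlp_fr = spacy.load("fr_core_news_md")
nlp_fr.add_pipe("entityfishing", config={
    'language':'fr',
    'extra_info': True
})
# re-process the text with the new pipeline so that the extra information is attached to the entities
zola_doc_fr = nlp_fr(pathlib.Path(file_path).read_text(encoding="utf-8"))
for ent in zola_doc_fr.ents:
    if ent.label_ == 'PER':
        print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score, ent._.other_ids))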

Last but not least, you can display the result in a very convenient way:

In [None]:
options = {
    "ents": ["MISC", "LOC", "PER"],
    "colors": {"LOC": "#82e0aa", "PER": "#85c1e9", "MISC": "#f0b27a"}
}

params = {
    "text": zola_doc_fr.text,
    "ents": [
        {
            "start": ent.start_char,
            "end": ent.end_char,
            "label": ent.label_,
            "kb_id": ent._.kb_qid,
            "kb_url": ent._.url_wikidata
        }
        for ent in zola_doc_fr.ents
    ],
    "title": None
}

spacy.displacy.render(params, style="ent", manual=True, options=options, jupyter=True)

### English

In [None]:
#!conda install -c conda-forge spacy-model-en_core_web_md

In [None]:
nlp_en = spacy.load("en_core_web_md")
nlp_en.add_pipe("entityfishing")

In [None]:
import pathlib
file_path = './data/zola_accuse_en.txt'
zola_doc_en = nlp_en(pathlib.Path(file_path).read_text(encoding="utf-8"))

In [None]:
for ent in zola_doc_en.ents:
    if ent.label_ == 'PERSON':
        print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))

## Preprocessing Functions

To bring your text into a format ideal for analysis, you can write preprocessing functions to encapsulate your cleaning process. For example, in this section, you’ll create a preprocessor that applies the following operations:

- Lowercases the text
- Lemmatizes each token
- Removes punctuation symbols
- Removes stop words

A preprocessing function converts text to an analyzable format. It’s typical for most NLP tasks.

In [None]:
# Preprocessing
def preprocess_lemma(token):
    return token.lemma_.strip().lower()

# Filter: a function that returns True or False for a token according to certain criteria
def is_token_allowed(token):
    return bool(
        token
        and str(token).strip()
        and not token.is_stop
        and not token.is_punct
    )

filtered_zola_lemmas = [
    preprocess_lemma(token)
    for token in zola_doc
    if is_token_allowed(token)
]

print(filtered_zola_lemmas)

In [None]:
print(
    f"{'text':15}"
    f"{'is_allowed':15}"
)

for token in zola_doc[:10]:
    print(
        f"{str(token.text):15}"
        f"{str(is_token_allowed(token)):15}"
    )
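
The two helpers above can also be folded into a single function that takes a Doc and returns the cleaned list. The version below is only a sketch: the name `preprocess` and the fallback to the lowercased text when no lemma is available are our own choices, made so that it also runs on a blank pipeline without a trained model.

```python
import spacy

def preprocess(doc):
    """Return normalized, lowercased forms of the content tokens of a Doc."""
    return [
        # Fall back to the token text when the pipeline has no lemmatizer.
        (token.lemma_ or token.text).strip().lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]

# Blank pipeline used here so the sketch runs without a trained model.
nlp_blank = spacy.blank("fr")
cleaned = preprocess(nlp_blank("La vérité est en marche, et rien ne l'arrêtera !"))
print(cleaned)
```

With a trained pipeline, `preprocess(zola_doc)` would return lemmas instead of lowercased word forms.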

## Appendix. Import

In [None]:
# all imports
'''
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy import displacy
import fr_core_news_md
import pandas as pd
import pathlib
from collections import Counter
'''

In [None]:
#https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#sharing-an-environment
#!conda env export