# Processing texts using spaCy

*Content adapted from [Tuomo Hiipala's](https://www.mv.helsinki.fi/home/thiippal/) lectures*

This section introduces you to basic tasks in natural language processing and how they can be performed using a Python library named spaCy.

After this section, you should:

 - know some of the key concepts and tasks in natural language processing
 - know how to perform simple natural language processing tasks using the spaCy library

In [None]:
# import the spaCy library (if it is missing, install it first, e.g. with 'pip install spacy')
import spacy

To perform natural language processing tasks for a given language, we must load a _language model_ that has been trained to perform these tasks for the language in question. 

spaCy supports [many languages](https://spacy.io/usage/models#languages), but provides pre-trained [language models](https://spacy.io/models/) for fewer languages.

These language models come in different sizes and flavours. We will explore these models and their differences later. 

To get acquainted with basic tasks in natural language processing, we will start with a small language model for the English language.

Language models are loaded using spaCy's `load()` function, which takes the name of the model as input.

In [None]:
# load the small language model for English and assign it to the variable 'nlp'
nlp = spacy.load('en_core_web_sm')

# Call the variable to examine the object
nlp

### What is a language model?

Most modern language models are based on *statistics* instead of human-defined rules. 

Statistical language models are based on probabilities, e.g.: 

 - What is the probability of a given sentence occurring in a language? 
 - How likely is a given word to occur in a sequence of words?

Consider the following sentences:

> From financial exchanges in `HIDDEN` Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day.

> Security precautions were being taken around the `HIDDEN` as the deadline for Iraq to withdraw from Kuwait neared.

You can probably make informed guesses on the `HIDDEN` words based on your knowledge of the English language and the world in general.

Similarly, creating a statistical language model involves observing the occurrence of words in large corpora and calculating their probabilities of occurrence in a given context. The language model can then be trained by making predictions and adjusting the model based on the errors made during prediction.
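
As a rough illustration of the counting step described above, the sketch below estimates how likely one word is to follow another in a tiny invented corpus. The corpus and the helper name `next_word_probability` are made up for this example. The models shipped with spaCy are trained on far larger corpora and do not rely on raw counts like these.

```python
from collections import Counter

# A tiny invented corpus; a real model would observe millions of sentences
corpus = [
    "the deadline for iraq neared",
    "the deadline for the treaty neared",
    "the summit was held",
]

# Count single words and pairs of adjacent words (bigrams)
unigram_counts = Counter()
bigram_counts = Counter()

for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))


def next_word_probability(previous, word):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / unigram_counts[previous]


print(next_word_probability("the", "deadline"))
print(next_word_probability("the", "treaty"))
```

In this toy corpus, *deadline* follows *the* twice as often as *treaty* does, so it receives the higher estimate.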

### How are language models trained?

The small language model for English, for instance, is trained on a corpus called [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), which features texts from different *genres* such as newswire text, broadcast news, broadcast and telephone conversations and blogs.

This allows the corpus to cover linguistic variation in both written and spoken English.

The OntoNotes 5.0 corpus consists of more than just *plain text*: the annotations include *part-of-speech tags*, *syntactic dependencies* and *co-references* between words.

This allows modelling not just the occurrence of particular words or their sequences, but their grammatical features as well.

In [None]:
# let's load the text again
with open('data/treaty_of_lisbon.txt', 'r', encoding='UTF-8') as f:
    text = f.read()[335:577]  # keep only a short excerpt so that processing is quick

In [None]:
print(text)

In [None]:
# now, we feed the 'text' to the language object under 'nlp' and
# store the result under the variable 'doc'
doc = nlp(text)

In [None]:
print(doc)

### Tokenization

What takes place first is a task known as *tokenization*, which breaks the text down into analytical units in need of further processing. 

In most cases, a *token* corresponds to a word delimited by whitespace, but punctuation marks are also treated as independent tokens. Because computers treat words as sequences of characters, assigning punctuation marks to their own tokens prevents trailing punctuation from attaching to the words that precede it.
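
To see why punctuation needs its own tokens, the following sketch compares a plain whitespace split with a simple regular expression. This is only an illustration using Python's standard library: spaCy's own tokenizer applies more elaborate, language-specific rules, and the example sentence is invented.

```python
import re

text = "The sailor dogs the hatch, then rests."

# Splitting on whitespace leaves punctuation glued to the preceding word
whitespace_tokens = text.split()

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With the whitespace split, `hatch,` and `hatch` would count as two different words, whereas the second approach separates the comma and the full stop into tokens of their own.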

The diagram below outlines the tasks that spaCy can perform after a text has been tokenised, such as *part-of-speech tagging*, *syntactic parsing* and *named entity recognition*.

![The spaCy pipeline from https://spacy.io/usage/linguistic-features#section-tokenization](data/spacy_pipeline.png)

A spaCy _Doc_ object consists of a sequence of *Token* objects, which store the results of various natural language processing tasks.

Let's print out each *Token* object stored in the _Doc_ object `doc`.

In [None]:
# looping over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:  
    
    # Print each token
    print(token)  

### Part-of-speech tagging

Part-of-speech (POS) tagging is the task of determining the word class of a token. This is crucial for *disambiguation*, because different parts of speech may have similar forms.

Consider the example: *The sailor dogs the hatch*.

The third-person singular present-tense form of the verb *dog* (to fasten something with something) is identical to the plural form of the noun *dog*: *dogs*.

To identify the correct word class, we must examine the context in which the word appears.

*spaCy* provides two types of part-of-speech tags, *coarse* and *fine-grained*, which are stored under the attributes `pos_` and `tag_`, respectively.

We can access the attributes of a Python object by inserting the *attribute* after the *object* and separating them with a full stop, e.g. `token.pos_`.

To access the results of POS tagging, let's loop over the *Doc* object `doc` and print each *Token* and its coarse and fine-grained part-of-speech tags.

In [None]:
# looping again over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:
    
    # Print the token and the POS tags
    print(token, token.pos_, token.tag_)

The coarse part-of-speech tags available under the `pos_` attribute are based on the [Universal Dependencies](https://universaldependencies.org/u/pos/all.html) tag set.

The fine-grained part-of-speech tags under `tag_`, in turn, are based on the OntoNotes 5.0 corpus introduced above.

In contrast to coarse part-of-speech tags, the fine-grained tags also encode [grammatical information](https://spacy.io/api/annotation#pos-en). The tags for verbs, for example, are distinguished by aspect and tense. 
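
The relationship between the two tag sets can be sketched with a small lookup table. The table below is a hand-written subset chosen for illustration, not spaCy's internal mapping, and the coarse tag assigned in practice can depend on context (auxiliary verbs, for instance, may receive a different coarse tag).

```python
# Hand-written subset of fine-grained tags and the coarse tag
# that each one typically corresponds to
FINE_TO_COARSE = {
    "NN": "NOUN",    # noun, singular or mass
    "NNS": "NOUN",   # noun, plural
    "VB": "VERB",    # verb, base form
    "VBD": "VERB",   # verb, past tense
    "VBG": "VERB",   # verb, gerund or present participle
    "VBZ": "VERB",   # verb, 3rd person singular present
}

# Group the fine-grained tags by their coarse tag
coarse_groups = {}
for fine, coarse in FINE_TO_COARSE.items():
    coarse_groups.setdefault(coarse, []).append(fine)

for coarse, fine_tags in coarse_groups.items():
    print(coarse, "<-", ", ".join(fine_tags))
```

Several fine-grained tags collapse into a single coarse tag, which is why `tag_` can tell a past-tense verb from a present-tense one while `pos_` cannot.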

### Morphological analysis

Morphemes constitute the smallest grammatical units that carry meaning. Two types of morphemes are generally recognised: *free* morphemes, which consist of words that can stand on their own, and *bound* morphemes, which inflect other morphemes. For the English language, bound morphemes include suffixes such as _-s_, which is used to indicate the plural form of a noun.

Put differently, morphemes shape the external *form* of a word, and these forms are associated with given grammatical *functions*.

spaCy performs morphological analysis automatically and stores the result under the attribute `morph` of a _Token_ object.

In [None]:
# looping once more over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:

    # Print the token and the results of morphological analysis
    print(f'{token} --> {token.morph}')

As the output shows, not every _Token_ carries morphological information: some consist only of free morphemes.

To retrieve morphological information from a _Token_ object, we must use the `get()` method of the `morph` attribute.

We can use the brackets `[]` to access items in the _Doc_ object.

The following line retrieves morphological information about aspect for the _Token_ at index 16 of the _Doc_ object, that is, the 17th _Token_, because indexing starts from zero.

In [None]:
# retrieving morphological information 
doc[16].morph.get('Aspect')

In [None]:
# token without morphological information
doc[10].morph.get('Aspect')
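
The morphological analyses printed earlier are written as `Feature=Value` pairs separated by `|`. The sketch below parses such a string with plain Python to show what a lookup like `get('Aspect')` amounts to. The helper `parse_features` and the feature string are hypothetical examples written for this illustration, not part of spaCy.

```python
def parse_features(feature_string):
    """Turn 'Key=Value|Key=Value' into a dict mapping keys to lists of values."""
    features = {}
    if not feature_string:
        return features
    for pair in feature_string.split("|"):
        key, _, values = pair.partition("=")
        features[key] = values.split(",")
    return features


# A hand-written feature string for a past participle
participle = parse_features("Aspect=Perf|Tense=Past|VerbForm=Part")

# Looking up a present feature returns its values, an absent one an empty list
print(participle.get("Aspect", []))
print(participle.get("Number", []))
```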

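### Syntactic parsing

Syntactic parsing (or *dependency parsing*) is the task of defining the syntactic dependencies that hold between tokens. The dependencies are available under the `dep_` attribute of a _Token_ object.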

In [None]:
# looping over items in the Doc object again, using the variable 'token' to refer to items in the list
for token in doc:
    
    # Print the token and its dependency tag
    print(token, token.dep_)

Unlike part-of-speech tags that are associated with a single _Token_, dependency tags indicate a relation that holds between two *Tokens*.

To better understand the syntactic relations captured by dependency parsing, let's use some of the additional attributes available for each *Token*:

 1. `i`: the position of the *Token* in the *Doc*
 2. `token`: the *Token* itself
 3. `dep_`: a tag for the syntactic relation
 4. `head` and `i`: the *Token* that governs the current *Token* and its index
 
This illustrates how Python attributes can be used in a flexible manner: the attribute `head` points to another *Token*, which naturally has the attribute `i` that contains its index or position in the *Doc*. We can combine these two attributes to retrieve this information for any token by referring to `.head.i`.
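
To make the chained lookup concrete without loading a language model, here is a minimal sketch that uses a hypothetical `ToyToken` class in place of spaCy's *Token*. The analysis of the sentence is annotated by hand for this example and is not the output of a parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyToken:
    """A hypothetical stand-in for a spaCy Token, holding only what we need."""
    i: int
    text: str
    dep_: str
    head: Optional["ToyToken"] = None


# Hand-annotated analysis of 'The sailor dogs the hatch'
the1 = ToyToken(0, "The", "det")
sailor = ToyToken(1, "sailor", "nsubj")
dogs = ToyToken(2, "dogs", "ROOT")
the2 = ToyToken(3, "the", "det")
hatch = ToyToken(4, "hatch", "dobj")

# Link each token to the token that governs it; the root points to itself
the1.head = sailor
sailor.head = dogs
dogs.head = dogs
the2.head = hatch
hatch.head = dogs

for token in [the1, sailor, dogs, the2, hatch]:
    # token.head is another ToyToken, so its index is reachable as token.head.i
    print(token.i, token.text, token.dep_, token.head.i, token.head.text)
```

Because `head` holds another object of the same kind, attributes can be chained further, e.g. `the2.head.head.i` gives the index of the token that governs *hatch*.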

In [None]:
for token in doc:
    
    # Print the index of current token, the token itself, the dependency, the head and its index
    print(token.i, token, token.dep_, token.head.i, token.head)

Although the output above helps to clarify the syntactic dependencies between tokens, they are generally much easier to perceive using diagrams.

spaCy provides a [visualisation tool](https://spacy.io/usage/visualizers) for visualising dependencies. This component of the spaCy library, _displacy_, can be imported using the following command.

In [None]:
from spacy import displacy

The `displacy` module has a function named `render()`, which takes a _Doc_ object as input.

To draw a dependency tree, we provide the _Doc_ object `doc` to the `render()` function with two arguments:

 1. `style`: The value `dep` instructs _displacy_ to draw a visualisation for syntactic dependencies.
 2. `options`: This argument takes a Python dictionary as input. We provide a dictionary with the key `compact` and Boolean value `True` to instruct _displacy_ to draw a compact tree diagram. Additional options for formatting the visualisation can be found in spaCy [documentation](https://spacy.io/api/top-level#displacy_options).

In [None]:
displacy.render(doc, style='dep', options={'compact': True})

In [None]:
doc_test = nlp('Last night we planned on meeting for Mexican food to celebrate a friends birthday.')
displacy.render(doc_test, style='dep', options={'compact': True})

The syntactic dependencies are visualised using lines that lead from the **head** *Token* to the *Token* governed by that head.
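
The direction of the lines can also be reproduced in plain Python by grouping tokens under the head that governs them. The rows below are a hand-annotated analysis of *The sailor dogs the hatch*, assumed for illustration rather than taken from a parser.

```python
from collections import defaultdict

# Hand-annotated (index, token, dependency, head index) rows;
# the root token is its own head
rows = [
    (0, "The", "det", 1),
    (1, "sailor", "nsubj", 2),
    (2, "dogs", "ROOT", 2),
    (3, "the", "det", 4),
    (4, "hatch", "dobj", 2),
]

# Collect the dependents of each head, skipping the root's self-reference
dependents = defaultdict(list)
for i, text, dep, head_i in rows:
    if i != head_i:
        dependents[head_i].append(text)

# Print one line per head, mirroring the lines drawn in the diagram
for head_i, children in sorted(dependents.items()):
    print(rows[head_i][1], "->", children)
```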

The dependency tags are based on [Universal Dependencies](https://universaldependencies.org/), a framework for describing morphological and syntactic features across languages (for a theoretical discussion of Universal Dependencies, see de Marneffe et al. [2021](https://doi.org/10.1162/coli_a_00402)).

If you don't know what a particular tag means, spaCy provides a function for explaining the tags, `explain()`, which takes a tag as input (note that the tags are case-sensitive).

In [None]:
spacy.explain('npadvmod')

### Lemmatization

A lemma is the base form of a word. Keep in mind that unless explicitly instructed, computers cannot tell the difference between singular and plural forms of words, but treat them as distinct tokens, because their forms differ.

If one wants to count the occurrences of words, for instance, a process known as lemmatization is needed to group together the different forms of the same word.
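
The effect of lemmatization on word counts can be sketched with a small lookup table. The table `LEMMAS` is hand-written for this example and merely stands in for a real lemmatizer, which covers far more forms and takes part of speech into account.

```python
from collections import Counter

# A hand-written lookup table standing in for a real lemmatizer
LEMMAS = {"treaties": "treaty", "states": "state", "stated": "state"}

words = ["treaty", "treaties", "state", "states", "stated"]

# Counting surface forms treats every inflected form as a separate item
form_counts = Counter(words)

# Mapping each form to its lemma first groups the forms together
lemma_counts = Counter(LEMMAS.get(word, word) for word in words)

print(form_counts)
print(lemma_counts)
```

Counting the surface forms yields five items that each occur once, whereas counting lemmas yields two items with counts of two and three.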

Lemmas are available for each *Token* under the attribute `lemma_`.

In [None]:
for token in doc:
    
    # Print the token and its lemma
    print(token, token.lemma_)

### Exercises

1. Load the `treaty_of_lisbon.txt` file  
2. Extract `Article 8 A` from there  
3. Feed text to the spaCy language model
4. Capture the verbs and the nouns
5. Find the frequency of the verbs and nouns (using lemmatization) in the article

### Solution

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

with open('data/treaty_of_lisbon.txt', 'r', encoding='UTF-8') as f:
    text = f.read()
    
ind_begin = text.find('Article 8 A')
ind_end = text.find('Article 8 B')

text = text[ind_begin:ind_end]

doc = nlp(text)

In [None]:
for token in doc:
    print(token, token.pos_, token.tag_)

In [None]:
nouns_and_verbs_dict = {'nouns':[], 'verbs':[]}

for token in doc:
    
    if token.pos_ == 'NOUN':
        nouns_and_verbs_dict['nouns'].append(token.lemma_)
    if token.pos_ == 'VERB':
        nouns_and_verbs_dict['verbs'].append(token.lemma_)
        
print(nouns_and_verbs_dict)

In [None]:
from collections import Counter
print(Counter(nouns_and_verbs_dict['nouns']))
print(Counter(nouns_and_verbs_dict['verbs']))