# Introduction to spaCy

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

Whereas NLTK contains separate modules that need to be called individually, we now look at another toolkit, *spaCy*, that hides these steps in one processing module. Basically, *spaCy* takes plain text as input, applies a whole series of NLP modules to the text and outputs a complex object in which the output of these modules is stacked on top of the elements of the text.

So whereas we had to create a pipeline ourselves in NLTK, [spaCy](https://spacy.io/) lets you call a prebuilt pipeline that performs tokenization, POS tagging, stop word recognition, morphological analysis, lemmatization, sentence splitting, dependency parsing and Named Entity Recognition (NER), among others. The advantage of spaCy is that it is fast and reasonably accurate. In addition, it supports multiple languages: https://spacy.io/models.

In this notebook, we will mainly show how to call this pipeline and access the result of processing a text. If you want to learn more, please visit spaCy's website; it has extensive documentation and provides excellent user guides. 

**At the end of this notebook, you will be able to get the output from spaCy after it applied the following NLP pipeline**:

* **Sentence splitting**: attribute **sents** of a `Doc` (of type *spacy.tokens.doc.Doc*)
* **Tokenization**: `Doc` contains a sequence of `Token` objects (of type *spacy.tokens.token.Token*)
* **Part-of-speech (POS) tagging**: attributes **pos_** and **tag_** of `Token`
* **Stop word recognition**: attribute **is_stop** of `Token`
* **Lemmatization**: attribute **lemma_** of `Token`
* **Dependency parsing**: attributes **dep_** and **head** of `Token`
* **Named Entity Recognition (NER)**: attribute **ents** of `Doc`, a sequence of entity spans (each of type *spacy.tokens.span.Span*)


## Installing and loading spaCy

To install spaCy, check out the instructions [here](https://spacy.io/usage). They explain exactly how to install spaCy for your operating system, package manager and desired languages. Simply run the suggested commands in your terminal ([Anaconda Prompt](https://docs.anaconda.com/anaconda/user-guide/getting-started/) or cmd). Alternatively, you can also try running the following cells in this notebook:

**Tip**: comment out the next two commands after using them. You can comment out commands by putting a "#" in front.

In [1]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.7.5-cp310-cp310-macosx_10_9_x86_64.whl (6.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 35.6 MB/s eta 0:00:00
Collecting spacy-legacy<3.1.0,>=3.0.11
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB)
Collecting wasabi<1.2.0,>=0.9.1
  Downloading wasabi-1.1.3-py3-none-any.whl (27 kB)
Collecting weasel<0.5.0,>=0.1.0
  Downloading weasel-0.4.1-py3-none-any.whl (50 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.3/50.3 kB 1.6 MB/s eta 0:00:00
Collecting cymem<2.1.0,>=2.0.2
  Using cached cymem-2.0.8-cp310-cp310-macosx_10_9_x86_64.whl (41 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Using cached catalogue-2.0.10-py3-none-any.whl (17 kB)
Collecting typer<1.0.0,>=0.3.0
  Downloading typer-0.12.3-py3-none-any.whl (47 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.2/47.2 kB 1.4 MB/s eta

Instead of installing with *pip*, you can also try to install spaCy with *conda*:

In [2]:
%%bash
conda install -c conda-forge spacy

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



If there are no errors, you installed the software on your local machine. Congratulations! But this is not enough. You also need to have the language resources to feed the software: trained models, grammars, lexicons, etc. spaCy comes with language resources for many different languages and the list is growing. Maybe one day you may even contribute a module for your own language.

In this notebook, we are going to download the English language resources. The download command from the command line is shown below. Note that the shortcut `en` is deprecated as of spaCy v3.0; the full package name `en_core_web_sm` is preferred.

In [2]:
!python -m spacy download en

⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 15.7 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1

[notice] A new release of pip is available: 23.0.1 -> 24.1.2
[notice] To update, run: pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In the messages, you see that there is a download operation followed by an install operation. Older versions of spaCy (v2) followed this with a linking operation, which could produce errors; we try to solve those below.

Now, let's first import spaCy in the notebook and check if we can load the English pipeline.

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')  # other languages use their own packages, e.g. 'nl_core_news_sm' or 'de_core_news_sm'

If there is no message, you can assume that everything went well and you can skip the next part on errors.

You may want to check the models that are installed for spaCy for their compatibility with the spaCy version that you are running.
The next cell shows you how to do this on the command line:

In [3]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation:
/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs/venv/lib/python3.10/site-packages/spacy

NAME             SPACY            VERSION
en_core_web_sm   >=3.7.2,<3.8.0   3.7.1   ✔



If there is no model listed, something went wrong with downloading. If spaCy lists a compatibility problem, it suggests how to fix it. Below are some more possible fixes. Note that these toolkits are improved rapidly and a new release may change things.
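If you prefer to check from within Python, the following is a minimal sketch that only asks Python whether the two packages can be found. It relies on the standard library and on the fact that spaCy pipelines such as `en_core_web_sm` are installed as ordinary Python packages; the helper name `is_installed` is made up for this illustration. It does not check version compatibility, which is what `spacy validate` is for.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# spaCy itself and the English pipeline are both ordinary Python packages
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "NOT found"
    print(f"{name}: {status}")
```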

### Possible error when downloading language models

You might get a variation of the following error:
```
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
```

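This means that spaCy cannot find a model under the name `en`. In spaCy v3, shortcut names such as `en` are no longer supported, so the first thing to try is loading the model by its full package name: `spacy.load('en_core_web_sm')`. In older spaCy versions (v2), the error means that there is a problem with linking the language model.

In that case, you might need to invest some time to make sure that the linking was successful, i.e., that you can load spaCy with `spacy.load('en')`. Here is some more information on how to fix this.

**Troubleshooting (optional)**

*Cause*: The Anaconda Prompt does not have enough privileges to execute the linking part of `python -m spacy download en`. The same is true for any other `python -m spacy [...]` command.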

*Solution:* 

1. Create the link manually

The prompt should display something along the lines of:

```
<Data Downloaded>
You do not have sufficient privilege to perform this operation.

    Linking successful
    <Anaconda dir>\lib\site-packages\en_core_web_sm -->
    <Anaconda dir>\lib\site-packages\spacy\data\en

    You can now load the model via spacy.load('en')
```

Use the following command:

```
mklink /D <Anaconda>\lib\site-packages\spacy\data\en <Anaconda>\lib\site-packages\en_core_web_sm
```

Note that the first argument of `mklink` is the link to be created and the second argument is the existing target that the link points to.

2. Give Anaconda permission to create the link. Using "runas ... python -m spacy ..." may not suffice.

3. More details: https://github.com/explosion/spaCy/issues/1283

**If none of this works (can happen on Windows)**, you might want to install the model manually from spaCy's GitHub through pip. Adapt the version in the URL to one that is compatible with your spaCy installation:

```
    pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz
```

Still having problems? Come to the VU feedback sessions or schedule a Zoom meeting with the teachers to fix it.

## Using spaCy

If you successfully loaded the English model (or a model for another language) with the above command, you created a spaCy object `nlp`.

An 'object'? So what is an object?

An object is an instance of a class of something. An iPhone13 is a class of things. Your iPhone is an object which is an instance of the class iPhone13. My phone is an instance of a Samsung Galaxy S23. Both our phones are objects but they have different properties and can do different things defined by the classes iPhone13 and Samsung Galaxy S23.

In programming this works the same way. I can create as many instances of spaCy modules in my notebook as I want: nlp1, nlp2, nlp3.  So 'nlp' is an instance of a spaCy object loaded with the models for the English language. This means that 'nlp' has all the functions that the spaCy developers defined.
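The phone analogy can be written down in a few lines of plain Python. This is only an illustrative sketch: the class names, the `describe` method and the owners "Alex", "Sam" and "Kim" are made up for this example and have nothing to do with spaCy itself.

```python
class IPhone13:
    """A class: it describes what every iPhone13 has and can do."""

    def __init__(self, owner):
        self.owner = owner  # attribute: differs from instance to instance

    def describe(self):  # method: available to every instance of the class
        return f"{self.owner}'s iPhone13"


class SamsungGalaxyS23:
    """A different class with its own definition of the same method."""

    def __init__(self, owner):
        self.owner = owner

    def describe(self):
        return f"{self.owner}'s Samsung Galaxy S23"


# Three objects (instances); two of them belong to the same class
your_phone = IPhone13("Alex")
my_phone = SamsungGalaxyS23("Sam")
another_phone = IPhone13("Kim")

print(your_phone.describe())
print(my_phone.describe())
print(type(your_phone) is type(another_phone))  # same class, different objects
print(your_phone is another_phone)
```

In the same way, `nlp` is one object whose class was defined by the spaCy developers, and you can create more objects like it.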

If you download the corresponding models to your local machine, you can also create instances by loading models for other languages (or a newer version of an English language model when it is released), e.g.:

```
nlp_nl = spacy.load('nl_core_news_sm')

nlp_it = spacy.load('it_core_news_sm')
```

This would give us two more spaCy instances in addition to the English variant that we created before. All three instances `nlp`, `nlp_nl` and `nlp_it` are objects that share spaCy's general `Language` class. This means all general functions of spaCy are available to all three, but there can be language-specific differences in how these functions are implemented.

Below we will use `nlp` to process text through a predefined pipeline of modules and store the result in another variable for later access. To process a text, you simply pass the string to the *nlp* object as input; we assign the result to a variable *doc* (lower case). The variable *doc* will be of the spaCy class **Doc** (upper case). Let's check that.

In [6]:
doc = nlp("I have an awesome cat. She follows me through the house everywhere I go. Her name is Shadow")
type(doc)

spacy.tokens.doc.Doc

The result of processing a text with spaCy is another spaCy object instance of the class 'Doc'.

`Doc` objects are complex: they represent the interpretation of a text according to the standard functions in the spaCy **pipeline**. This pipeline does all kinds of processing of the input text. spaCy provides functions that let you access the output of the different analyses that have been applied to the input text. In a `Doc` object you can access the tokens that make up the text, their lemmas, their PoS, the sentences, chunks, named entities, and much more. Below, we will look at the different types of analyses in more detail.

So whereas we had to call all these specific functions in the right order in the case of NLTK, i.e. sentence splitting, tokenization, part-of-speech tagging, parsing and named entity recognition, spaCy did all of that when processing a text. The only thing you need to do is to access the output, which is what we will do next.

### Doc, Token and Span objects

At this point, there are three important classes of output objects:

* A `Doc` is a sequence of `Token` objects.
* A `Token` object represents an individual token — i.e. a word, punctuation symbol, etc. It has attributes representing linguistic annotations. 
* A `Span` object is a slice from a `Doc` object and consists of a subsequence of the full list of `Token` objects.
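The relation between the three classes can be sketched as follows. To keep the example independent of any downloaded model, it uses `spacy.blank("en")`, which creates an English pipeline that only tokenizes; the variable names `blank_nlp`, `demo_doc` and `demo_span` are chosen here so that they do not overwrite `nlp` and `doc` from this notebook.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# A blank English pipeline only contains the tokenizer (no trained model needed)
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("I have an awesome cat.")

# Slicing a Doc gives a Span; indexing a Doc or a Span gives a Token
demo_span = demo_doc[2:5]
print(demo_span.text)
print(isinstance(demo_doc, Doc), isinstance(demo_span, Span), isinstance(demo_span[0], Token))

# A Span remembers where it sits in the Doc it was taken from
print(demo_span.start, demo_span.end)
```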

Since `Doc` is a sequence of `Token` objects, we can iterate over all the tokens in the text as shown below. We need a 'for-loop' to iterate over the elements in the doc object just as we would do for a list:

In [7]:
# Iterate over the tokens using a for loop
for token in doc:
    print(token)
print()


I
have
an
awesome
cat
.
She
follows
me
through
the
house
everywhere
I
go
.
Her
name
is
Shadow



Note that a `Doc` is not a Python list. It is a spaCy container that produces its `Token` objects one at a time when you iterate over it, much like a so-called 'generator'. A generator is a 'lazy iterator' in Python: it yields elements on demand instead of holding them all in memory at once:

https://realpython.com/introduction-to-python-generators/

You can, however, turn a `Doc` into a list to see its content as a list:

In [18]:
print(list(doc))

[I, have, an, awesome, cat, ., She, follows, me, through, the, house, everywhere, I, go, ., Her, name, is, Shadow]
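To make the idea of a lazy iterator more concrete, here is a small sketch in plain Python, unrelated to spaCy. The function `count_up_to` is a made-up example of a generator: it hands out one value at a time and is empty once it has been used up.

```python
def count_up_to(limit):
    """Yield the numbers 1..limit one at a time instead of building a list."""
    for number in range(1, limit + 1):
        yield number


numbers = count_up_to(3)
print(next(numbers))  # ask for the first value only
print(list(numbers))  # collect whatever is left into a list
print(list(numbers))  # the generator is now exhausted
```

This is why it can be convenient to convert a lazy iterator into a list when you want to inspect its content more than once.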


A `Doc` also supports indexing, so just as with a list we can access each token individually using its index:

In [19]:
# Select one single token by index
first_token = doc[0]
print("First token:", first_token)
second_token = doc[1]
print("Second token:", second_token)

First token: I
Second token: have


Please note that even though these tokens look like strings, they are not. `print` just shows the text representation of the token. We can also ask for the type in Python, which tells us what class of object it is:

In [20]:
for token in doc:
    print(token, "\t", type(token))

I 	 <class 'spacy.tokens.token.Token'>
have 	 <class 'spacy.tokens.token.Token'>
an 	 <class 'spacy.tokens.token.Token'>
awesome 	 <class 'spacy.tokens.token.Token'>
cat 	 <class 'spacy.tokens.token.Token'>
. 	 <class 'spacy.tokens.token.Token'>
She 	 <class 'spacy.tokens.token.Token'>
follows 	 <class 'spacy.tokens.token.Token'>
me 	 <class 'spacy.tokens.token.Token'>
through 	 <class 'spacy.tokens.token.Token'>
the 	 <class 'spacy.tokens.token.Token'>
house 	 <class 'spacy.tokens.token.Token'>
everywhere 	 <class 'spacy.tokens.token.Token'>
I 	 <class 'spacy.tokens.token.Token'>
go 	 <class 'spacy.tokens.token.Token'>
. 	 <class 'spacy.tokens.token.Token'>
Her 	 <class 'spacy.tokens.token.Token'>
name 	 <class 'spacy.tokens.token.Token'>
is 	 <class 'spacy.tokens.token.Token'>
Shadow 	 <class 'spacy.tokens.token.Token'>


These `Token` objects have many useful *methods* and *attributes*, which we can get listed by using the Python function `dir()`.

In [12]:
dir(first_token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

We haven't really talked about attributes during this course. While methods (functions that belong to an object) are operations or actions performed by an object, attributes are 'static' features or properties of objects. Methods are called using parentheses (as we have seen with `str.split()`, for instance), while attributes are accessed without parentheses. We will see some examples below. In the case of spaCy tokens, attributes typically contain *annotations* of the token in the text.
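The difference can be sketched with a single token. The example assumes a blank English pipeline (`spacy.blank("en")`), which is enough here because `text`, `is_alpha` and `is_stop` do not depend on a trained model; the names `blank_nlp`, `demo_doc` and `demo_token` are only used for this illustration.

```python
import spacy

blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("I have an awesome cat.")
demo_token = demo_doc[3]  # the token "awesome"

# Attributes: accessed without parentheses
print(demo_token.text)
print(demo_token.is_alpha)  # does the token consist of alphabetic characters?
print(demo_token.is_stop)   # is the token on the English stop word list?

# Method: called with parentheses, here to get the neighbouring token to the right
print(demo_token.nbor(1))
```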

You can find more detailed information about the token functions and attributes in the [documentation](https://spacy.io/api/token).

Notice that many attributes are listed twice, once without and once with the suffix `_`. The attributes without `_` hold numerical values that spaCy uses internally, whereas the variants with `_` hold the human-readable string. The internal numerical representations are used to store data more efficiently, whereas the readable values are only generated for rendering output. Usually, you want to use the attributes with the `_` suffix.

In [22]:
for token in doc:
    print(token.lemma, token.lemma_, token.pos, token.pos_)

4690420944186131903 I 95 PRON
14692702688101715474 have 100 VERB
15099054000809333061 an 90 DET
3240785716591152042 awesome 84 ADJ
5439657043933447811 cat 92 NOUN
12646065887601541794 . 97 PUNCT
6740321247510922449 she 95 PRON
14462500713227930305 follow 100 VERB
4690420944186131903 I 95 PRON
18216413589307435838 through 85 ADP
7425985699627899538 the 90 DET
9471806766518506264 house 92 NOUN
10957650314384693728 everywhere 86 ADV
4690420944186131903 I 95 PRON
8004577259940138793 go 100 VERB
12646065887601541794 . 97 PUNCT
4115755726172261197 her 95 PRON
18309932012808971453 name 92 NOUN
10382539506755952630 be 87 AUX
4250795299896093889 Shadow 96 PROPN
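The link between the numbers and the readable strings is the vocabulary's string store, which can be queried in both directions. The sketch below uses the attribute pair `orth`/`orth_` (the token's text) instead of `lemma`/`lemma_`, because it runs on a blank English pipeline without a lemmatizer; that substitution and the variable names are choices made for this illustration.

```python
import spacy

blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("I have an awesome cat.")
demo_token = demo_doc[4]  # the token "cat"

# The same information twice: as an internal number and as a readable string
print(demo_token.orth, demo_token.orth_)

# The string store translates between the two representations
string_store = demo_doc.vocab.strings
print(string_store[demo_token.orth])  # number -> string
print(string_store["cat"])            # string -> number
```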


You can also use spacy.explain to get a description for any label:

In [8]:
# try out some more, such as NN, ADP, PRP, VBD, VBP, VBZ, WDT, aux, nsubj, pobj, dobj, npadvmod
spacy.explain("ADP")

'adposition'

## Sentence splitting & tokenization
spaCy performs sentence splitting for you. The information is stored in the attribute **sents** of `Doc` (of type *spacy.tokens.doc.Doc*).
Each `Doc` contains a sequence of `Token` objects, i.e., this is where the output of the tokenizer is found. The string of a token can be accessed using the attribute **text**. Each `Doc` also records where sentences start and end, so that its tokens can be grouped into sentences. We can iterate over these sentences and get the tokens of each sentence in sequence. This way we access the text sentence by sentence.

In [9]:
doc = nlp("I have an awesome cat. She follows me through the house everywhere I go. Her name is Shadow")

In [10]:
sentences = doc.sents
for sentence in sentences:
    print('NEXT SENTENCE')
    print(sentence)
    for token in sentence:
        print(token.text)

NEXT SENTENCE
I have an awesome cat.
I
have
an
awesome
cat
.
NEXT SENTENCE
She follows me through the house everywhere I go.
She
follows
me
through
the
house
everywhere
I
go
.
NEXT SENTENCE
Her name is Shadow
Her
name
is
Shadow
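Because **sents** produces the sentences one by one, you need to convert it to a list if you want to count the sentences or pick one by index. The following sketch shows this. As an assumption made for this example, it uses a blank English pipeline with spaCy's rule-based `sentencizer` component, so that it runs without a trained model; with `en_core_web_sm` the sentence boundaries come from the trained pipeline instead and may differ for harder texts.

```python
import spacy

# Blank pipeline plus a rule-based sentence splitter (splits on punctuation)
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

demo_doc = blank_nlp("I have an awesome cat. She follows me through the house everywhere I go. Her name is Shadow")

# Convert the sentences to a list so that we can count and index them
demo_sentences = list(demo_doc.sents)
print(len(demo_sentences))
print(demo_sentences[0].text)
print(demo_sentences[-1].text)

# Each sentence is a Span, so it has a length in tokens
print([len(sentence) for sentence in demo_sentences])
```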


## Lemmatization
The output from the lemmatizer is stored in the attribute **lemma_** of each `Token` object.

In [11]:
doc = nlp("I have awesome cats")

In [12]:
cat_token = doc[3]
print(cat_token.text, cat_token.lemma_)

cats cat


## Part of speech tagging

The output from the part of speech tagger is stored:
* in the attribute **pos_** of each `Token` object: The simple part-of-speech tag
* in the attribute **tag_** of each `Token` object: The detailed part-of-speech tag ([Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))

In [13]:
doc = nlp("I have awesome cats")

In [14]:
cat_token = doc[3]
print(cat_token.text, cat_token.pos_, cat_token.tag_)
spacy.explain(cat_token.tag_)

cats NOUN NNS


'noun, plural'

## Dependency parsing
The output of the dependency parser can only be accessed by combining the information from multiple attributes. Let's look at an example:

In [15]:
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

spaCy has a built-in visualizer, *displacy*, to display structures. We first need to import it and can then apply it to the *doc* object. There are various parameters that can be set, but we only show a few. You can read more in the spaCy documentation: https://spacy.io/usage/visualizers. Here, we choose the *style* 'dep' for dependency.

In [17]:
from spacy import displacy
displacy.render(doc, jupyter=True, style='dep')

We observe that each token has a dependency relation with at least one other token. For example:
* **cars** has an **amod** relation with **autonomous**
* the main verb **shift** has an **nsubj** relation with **cars**

If you want to know what these relations mean, you can use **spacy.explain**

In [18]:
spacy.explain('amod')

'adjectival modifier'

spaCy makes use of the terms **child** and **head** in its dependency parsing output.
* a relation is always in one direction from a **child** to a **head**, e.g., *autonomous* is the child of *cars*
* a head of a phrase can be the child of another token, e.g., *cars* is the child of *shift*
* the root of a sentence (often the main verb) has no other token as its head; in spaCy its **head** is the token itself and its **dep_** is *ROOT*

The following attributes are needed to access this information:
* **dep_** provides the syntactic relation, e.g., *nsubj*
* **head** provides the **head** of a `Token`, e.g., in the case of *autonomous* the head would be *cars*

In [19]:
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

In [20]:
autonomous_token = doc[0]
print(autonomous_token, autonomous_token.dep_, autonomous_token.head)

Autonomous amod cars


In [21]:
cars_token = doc[1]
print(cars_token, cars_token.dep_, cars_token.head)

cars nsubj shift
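Instead of inspecting tokens one by one, you can loop over the whole `Doc` and collect the relation, the head and the children of every token. The sketch below does this with two small helper functions whose names are made up for this example. So that it runs without a trained model, the `Doc` is annotated by hand: the first two relations follow the output above, while the remaining heads and labels are an assumption about what the parser would produce. The helper functions work the same way on a `Doc` returned by `nlp(...)`.

```python
import spacy
from spacy.tokens import Doc


def describe_dependencies(doc):
    """Return one (token, relation, head, children) tuple per token."""
    rows = []
    for token in doc:
        children = [child.text for child in token.children]
        rows.append((token.text, token.dep_, token.head.text, children))
    return rows


def find_roots(doc):
    """Return the tokens that are their own head, i.e. the sentence roots."""
    return [token for token in doc if token.head.i == token.i]


# Hand-annotated stand-in for nlp(...), so that no trained model is required
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
heads = [1, 2, 2, 4, 2, 2, 5]  # index of each token's head; the root points to itself
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
demo_doc = Doc(spacy.blank("en").vocab, words=words, heads=heads, deps=deps)

for row in describe_dependencies(demo_doc):
    print(row)
print("Root:", find_roots(demo_doc))
```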


### Save the tree structure to an SVG image

In [22]:
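# With jupyter=False, render() returns the SVG markup as a string instead of displaying it
tree_structure = displacy.render(doc, jupyter=False, style='dep')

output_path = 'spacy_tree_structure.svg'
# SVG is text (XML), so write it with an explicit UTF-8 encoding
with open(output_path, 'w', encoding='utf-8') as outfile:
    outfile.write(tree_structure)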

## Named Entity Recognition
The output from the Named Entity Recognizer is stored in the attribute **ents** of `Doc`.
The attribute **label_** of each entity in **ents** (of type *spacy.tokens.span.Span*) contains the named entity type.

In [23]:
text = """But Google is starting from behind. The company made a late push into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption."""
doc = nlp(text)

In [24]:
displacy.render(doc, jupyter=True, style='ent')

In [25]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
Apple ORG
Siri PERSON
iPhones ORG
Amazon ORG
Alexa ORG
Echo LOC
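Once the entities are available, a common next step is to group them by type. The sketch below shows one way to do this; the helper `group_entities_by_label` is a name made up for this example. To keep it runnable without a trained model, the entities of a short made-up sentence are labelled by hand (these labels are an assumption, not the output of `en_core_web_sm`). The helper can be applied unchanged to a `Doc` returned by `nlp(text)`.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span


def group_entities_by_label(doc):
    """Map each entity label to the list of entity texts that carry it."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Hand-labelled stand-in for the output of a trained NER component
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Google trails Amazon, whose Alexa software runs on Echo devices.")
hand_labels = {"Google": "ORG", "Amazon": "ORG", "Echo": "PRODUCT"}
demo_doc.ents = [
    Span(demo_doc, token.i, token.i + 1, label=hand_labels[token.text])
    for token in demo_doc
    if token.text in hand_labels
]

print(group_entities_by_label(demo_doc))
```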


# End of this notebook