# spaCy Tutorial

**(C) 2018 by [Damir Cavar](http://damir.cavar.me/)**

**Version:** 1.1, February 2018

This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Fall 2018 at Indiana University.

## Introduction to spaCy

Follow the instructions on the [spaCy homepage](https://spacy.io/usage/) to install the module and the language models. Your local spaCy module is installed correctly if the following command succeeds:

In [1]:
import spacy

We can load the English NLP pipeline in the following way:

In [2]:
nlp = spacy.load('en')

### Tokenization

In [3]:
doc = nlp(u'John was wondering, if Peter knew that Dr. Smith bought a new car for her older son.')
for token in doc:
    print(token.text)

John
was
wondering
,
if
Peter
knew
that
Dr.
Smith
bought
a
new
car
for
her
older
son
.


### Part-of-Speech Tagging

We can tokenize a sentence and print the part-of-speech tag, along with several other attributes, for each individual token using the following code:

In [4]:
doc = nlp(u'John said yesterday that Mary bought a new car for her older son.')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

John john PROPN NNP nsubj Xxxx True False
said say VERB VBD ROOT xxxx True False
yesterday yesterday NOUN NN npadvmod xxxx True False
that that ADP IN mark xxxx True True
Mary mary PROPN NNP nsubj Xxxx True False
bought buy VERB VBD ccomp xxxx True False
a a DET DT det x True True
new new ADJ JJ amod xxx True False
car car NOUN NN dobj xxx True False
for for ADP IN prep xxx True True
her -PRON- ADJ PRP$ poss xxx True True
older old ADJ JJR amod xxxx True False
son son NOUN NN pobj xxx True False
. . PUNCT . punct . False False


For every token, the output above contains one line with the token itself, the lemma, the coarse-grained Part-of-Speech tag, the fine-grained tag, the dependency label, the orthographic shape (upper and lower case characters as X or x respectively), a boolean indicating whether the token consists of alphabetic characters only, and a boolean indicating whether it is a *stopword*.
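
The shape column can be approximated in a few lines of plain Python. The following is a minimal sketch, not spaCy's actual implementation: the function name `word_shape` and the cap of four identical shape characters in a row (`max_run`) are assumptions chosen to reproduce the shapes in the output above, for example `xxxx` for *yesterday*.

```python
def word_shape(text, max_run=4):
    """Approximate an orthographic shape: X = upper case, x = lower case, d = digit.

    max_run is an assumed cap on how many identical shape characters are
    kept in a row; it reproduces 'xxxx' for 'yesterday' in the output above.
    """
    shape = []
    last = None
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and other symbols are kept as-is
        # Count how many times the same shape character has occurred in a row.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["John", "yesterday", "a", "."]:
    print(word, word_shape(word))
```

Run on the tokens of the example sentence, this sketch yields the same shapes as the sixth column of the output above.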

### Dependency Parse

Using the same approach as above for PoS-tags, we can print the Dependency Parse relations:

In [5]:
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

John nsubj said VERB []
said ROOT said VERB [John, yesterday, bought, .]
yesterday npadvmod said VERB []
that mark bought VERB []
Mary nsubj bought VERB []
bought ccomp said VERB [that, Mary, car, for]
a det car NOUN []
new amod car NOUN []
car dobj bought VERB [a, new]
for prep bought VERB [son]
her poss son NOUN []
older amod son NOUN []
son pobj for ADP [her, older]
. punct said VERB []


As specified in the code, each line represents one token. The token is printed in the first column, followed by its dependency relation to its head, the head token itself in the third column, the Part-of-Speech tag of the head, and the list of the token's children, that is, its direct dependents.
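
The children column is simply the inverse of the head relation. As a minimal sketch in plain Python, with the `(text, dep, head_index)` triples copied by hand from the output above (the variable names are illustrative and not part of spaCy), the children lists can be rebuilt as follows:

```python
# (token text, dependency label, index of the head token), copied from the
# output above. The ROOT token points to itself.
tokens = [
    ("John", "nsubj", 1), ("said", "ROOT", 1), ("yesterday", "npadvmod", 1),
    ("that", "mark", 5), ("Mary", "nsubj", 5), ("bought", "ccomp", 1),
    ("a", "det", 8), ("new", "amod", 8), ("car", "dobj", 5),
    ("for", "prep", 5), ("her", "poss", 12), ("older", "amod", 12),
    ("son", "pobj", 9), (".", "punct", 1),
]

# Invert the head pointers: every non-root token becomes a child of its head.
children = {i: [] for i in range(len(tokens))}
for i, (text, dep, head) in enumerate(tokens):
    if head != i:
        children[head].append(text)

for i, (text, dep, head) in enumerate(tokens):
    print(text, dep, tokens[head][0], children[i])
```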

### Named Entity Recognition

Similarly to PoS-tags and Dependency Parse Relations, we can print out Named Entity labels:

In [6]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

John 0 4 PERSON
yesterday 10 19 DATE
Mary 25 29 PERSON
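
The two numbers in each line are character offsets into the original text, so slicing the input string with them recovers the entity text. A small check in plain Python, with the offsets and labels copied from the output above:

```python
text = "John said yesterday that Mary bought a new car for her older son."

# (start_char, end_char, label) as printed above; end_char is exclusive.
entities = [(0, 4, "PERSON"), (10, 19, "DATE"), (25, 29, "PERSON")]

recovered = [(text[start:end], label) for start, end, label in entities]
for span_text, label in recovered:
    print(span_text, label)
```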


We can extend the input with some more entities:

In [7]:
doc = nlp(u'John Smith said that Apple Inc. will buy Google in May 2018.')

The corresponding NE-labels are:

In [8]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

John Smith 0 10 PERSON
Apple Inc. 21 31 ORG
Google 41 47 ORG
May 2018 51 59 DATE


### Pattern Matching in spaCy

In [9]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)
print(matches)

[(15578876784678163569, 0, 3)]
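
Each match is a tuple consisting of a match ID (the hash of the pattern name `HelloWorld`) and the start and end token indices of the matched span, with the end index exclusive. Only the first *Hello, world* matches, because the pattern requires a punctuation token between the two words. To make the matching logic concrete, here is a simplified sketch in plain Python. The helpers `token_matches` and `find_matches` are hypothetical, cover only the two attributes used above, and operate on a hand-tokenized list; they are not spaCy's implementation.

```python
import string

# Hand-tokenized version of 'Hello, world! Hello world!' (assumed tokenization).
tokens = ["Hello", ",", "world", "!", "Hello", "world", "!"]
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]


def token_matches(token, spec):
    """Check one token against one pattern entry (LOWER and IS_PUNCT only)."""
    for key, value in spec.items():
        if key == "LOWER" and token.lower() != value:
            return False
        if key == "IS_PUNCT":
            is_punct = all(ch in string.punctuation for ch in token)
            if is_punct != value:
                return False
    return True


def find_matches(tokens, pattern):
    """Return (start, end) index pairs of all spans that match the pattern."""
    n = len(pattern)
    return [
        (start, start + n)
        for start in range(len(tokens) - n + 1)
        if all(token_matches(t, p) for t, p in zip(tokens[start:start + n], pattern))
    ]


matches = find_matches(tokens, pattern)
print(matches)
```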


### What spaCy is Missing

From the linguistic standpoint, when looking at the analytical output of the NLP pipeline in spaCy, there are some important components missing:

- Clause boundary detection
- Constituent structure trees (scope relations over constituents and phrases)
- Anaphora resolution
- Coreference analysis
- Temporal reference resolution
- ...

#### Clause Boundary Detection

Complex sentences consist of clauses. For precise processing of semantic properties of natural language utterances we need to segment the sentences into clauses. The following sentence:

*The man said that the woman claimed that the child broke the toy.*

can be broken into the following clauses:

- Matrix clause: [ *the man said* ]
- Embedded clause: [ *that the woman claimed* ]
- Embedded clause: [ *that the child broke the toy* ]

These clauses do not form an ordered list or flat sequence; they are in fact hierarchically organized. The matrix clause verb selects as its complement an embedded finite clause with the complementizer *that*. The embedded predicate *claimed* selects the same kind of clausal complement. We express this hierarchical relation in the form of embedding in tree representations:

[ *the man said* [ *that the woman claimed* [ *that the child broke the toy* ] ] ]

Or, using a graphical representation in the form of a tree:

<img src="Embedded_Clauses_1.png" width="70%" height="70%">

The hierarchical relation of sub-clauses is relevant when it comes to semantics. The clause *John sold his car* can be interpreted as an assertion that describes an event with *John* as the agent, and *the car* as the object of a *selling* event in the past. If the clause is embedded under a matrix clause that contains a sentential negation, the proposition is no longer asserted and cannot be assumed to be true: [ *Mary did not say that* [ *John sold his car* ] ]

It is possible with additional effort to translate the Dependency Trees into clauses and reconstruct the clause hierarchy into a relevant form or data structure. SpaCy does not offer a direct data output of such relations.
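
As a rough illustration of that additional effort, the following sketch groups the tokens of the sentence *John said yesterday that Mary bought a new car for her older son.* into clauses and records which clause embeds which. It is a simplification under stated assumptions: the dependency triples are copied by hand from the parse output shown earlier, and the set `CLAUSE_DEPS` of labels that are taken to open a new clause is an assumption made for this example, not a spaCy feature.

```python
# Dependency labels assumed to introduce a new clause (illustrative choice).
CLAUSE_DEPS = {"ROOT", "ccomp", "advcl"}

# (token text, dependency label, index of the head token); ROOT points to itself.
tokens = [
    ("John", "nsubj", 1), ("said", "ROOT", 1), ("yesterday", "npadvmod", 1),
    ("that", "mark", 5), ("Mary", "nsubj", 5), ("bought", "ccomp", 1),
    ("a", "det", 8), ("new", "amod", 8), ("car", "dobj", 5),
    ("for", "prep", 5), ("her", "poss", 12), ("older", "amod", 12),
    ("son", "pobj", 9), (".", "punct", 1),
]


def clause_head(i):
    """Climb the head pointers until a clause-opening token is reached."""
    while tokens[i][1] not in CLAUSE_DEPS:
        i = tokens[i][2]
    return i


# Group every token under the verb that heads its clause.
clauses = {}
for i, (text, dep, head) in enumerate(tokens):
    clauses.setdefault(clause_head(i), []).append(text)

# The parent of a clause is the clause that contains the head of its verb.
parent = {}
for h in clauses:
    dep, head = tokens[h][1], tokens[h][2]
    parent[h] = None if dep == "ROOT" else clause_head(head)

for h, words in clauses.items():
    embedded_in = None if parent[h] is None else tokens[parent[h]][0]
    print(tokens[h][0], words, "embedded in:", embedded_in)
```

The result is a small tree of clauses keyed by their head verbs. Discontinuous structures are not handled by this sketch.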

One problem still remains, namely *clausal discontinuities*. None of the common NLP pipelines, spaCy included, can deal with discontinuities in any reasonable way. Discontinuities can be observed when syntactic structures are split over the clause or sentence, or when elements occur in a non-canonical position, as in the following example:

*Which car did John claim that Mary took?*

The embedded clause consists of the sequence [ *Mary took which car* ]. One part of the sequence appears dislocated and precedes the matrix clause in the above example. Simple Dependency Parsers cannot generate any reasonable output that makes it easy to identify and reconstruct the relations of clausal elements in these structures.

#### Constituent Structure Trees

Dependency Parse trees are a simplification of relations of elements in the clause. They ignore structural and hierarchical relations in a sentence or clause, as shown in the examples above. Instead the Dependency Parse trees show simple functional relations in the sense of sentential functions like *subject* or *object* of a verb.

SpaCy does not output any kind of constituent structure, nor any more detailed relational properties of phrases and more complex structural units in a sentence or clause.

Since many semantic properties are defined or determined in terms of structural relations and hierarchies, that is, *scope relations*, they are more complicated to reconstruct or map from the Dependency Parse trees.

#### Anaphora Resolution

SpaCy does not offer any anaphora resolution annotation. That is, the referent of a pronoun, as in the following examples, is not annotated in the resulting linguistic data structure:

- *John saw **him**.*
- *John said that **he** saw the house.*
- *Tim sold **his** house. **He** moved to Paris.*
- *John saw **himself** in the mirror.*

Knowing the restrictions of pronominal binding (in English, for example), we can partially generate the potential or most likely anaphor-antecedent relations. This, however, is not part of the spaCy output.

One problem, however, is that spaCy does not provide parse trees of the *constituent structure* and *clausal hierarchies*, which are crucial for the correct analysis of pronominal anaphoric relations.

#### Coreference Analysis

Some NLP pipelines are capable of providing coreference analyses for constituents in clauses. For example, the following three sentences should be analyzed as talking about the same entity:

*The CEO of Apple, Tim Cook, decided to apply for a job at Google. Cook said that he is not satisfied with the quality of the iPhones anymore. He prefers the Pixel 2.*

The constituents [ *the CEO of Apple, Tim Cook* ] in the first sentence, [ *Cook* ] in the second sentence, and [ *he* ] in the third, should all be tagged as referencing the same entity, that is the one mentioned in the first sentence. SpaCy does not provide such a level of analysis or annotation.

#### Temporal Reference

For various analysis levels it is essential to identify the time references in a sentence or utterance, for example the time the utterance is made or the time the described event happened.

Certain tenses are expressed as periphrastic constructions, including auxiliaries and main verbs. SpaCy does not provide the relevant information to identify these constructions and tenses.

## Using the Dependency Parse Visualizer

This section takes a closer look at Dependency Parse trees using spaCy's built-in visualizer.

In [10]:
import spacy

We can load the visualizer:

In [11]:
from spacy import displacy

Loading the English NLP pipeline:

In [12]:
nlp = spacy.load('en')

Process an input sentence:

In [13]:
doc = nlp(u'John said yesterday that Mary bought a new car for her older son.')

Visualizing the Dependency Parse tree can be achieved by running the following server code and opening up a new tab on the URL [http://localhost:5000/](http://localhost:5000/). You can shut down the server by clicking on the stop button at the top in the notebook toolbar.

In [None]:
displacy.serve(doc, style='dep')

Instead of serving the graph, one can render it directly into a Jupyter Notebook:

In [14]:
displacy.render(doc, style='dep', jupyter=True, options={"distance": 140})

In addition to the visualization of the Dependency Trees, we can visualize named entity annotations:

In [15]:
text = """Apple decided to fire Tim Cook and hire somebody called John Doe as the new CEO.
They also discussed a merger with Google. On the long run it seems more likely that Apple
will merge with Amazon and Microsoft with Google. The companies will all relocate to
Austin in Texas before the end of the century."""

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

## Vectors

To use vectors in spaCy, you might consider installing the larger models for the particular language. The common module and language packages only come with the small models. The larger models can be installed as described on the [spaCy vectors page](https://spacy.io/usage/vectors-similarity):

    python -m spacy download en_core_web_lg

The large model *en_core_web_lg* contains more than 1 million unique vectors.

Let us restart all necessary modules again, in particular spaCy:

In [16]:
import spacy

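We can now load the English NLP pipeline to process a word list. Since the small models in spaCy only include context-sensitive tensors, the downloaded large model should be used for better word vectors. The large model is loaded with the commented-out line below; the cell as run here loads the small model, which is what the outputs in this section reflect:

In [17]:
# Uncomment to load the large model with real word vectors:
# nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en')  # small model, context-sensitive tensors only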

We can process a list of words with the pipeline using the *nlp* object:

In [18]:
tokens = nlp(u'dog cat banana')

As described in the spaCy chapter *[Word Vectors and Semantic Similarity](https://spacy.io/usage/vectors-similarity)*, *Doc*, *Span*, and *Token* objects provide a method *similarity()*, which returns the similarity between two such objects, here between words:

In [19]:
for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))

dog dog 1.0
dog cat 0.5390696
dog banana 0.28760988
cat dog 0.5390696
cat cat 1.0
cat banana 0.48752153
banana dog 0.28760988
banana cat 0.48752153
banana banana 1.0
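
The *similarity()* scores are cosine similarities between the underlying vectors, which is why every word has a similarity of 1.0 with itself and why the table is symmetric. The sketch below computes this measure in plain Python; the three-dimensional vectors are made-up toy values for illustration only, not vectors from any spaCy model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, chosen only to make the example readable).
toy_vectors = {
    "dog": [1.0, 0.8, 0.1],
    "cat": [0.9, 1.0, 0.2],
    "banana": [0.1, 0.2, 1.0],
}

for word1, vec1 in toy_vectors.items():
    for word2, vec2 in toy_vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```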


We can access the *vectors* of these objects using the *vector* attribute, and inspect related attributes such as *has_vector*, *vector_norm*, and *is_oov*:

In [20]:
tokens = nlp(u'dog cat banana sasquatch')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 23.92024 True
cat True 24.228516 True
banana True 25.35453 True
sasquatch True 26.209084 True


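The attribute *has_vector* returns a boolean depending on whether the token has a vector in the model or not. With the large model *en_core_web_lg*, the token *sasquatch* is expected to have no vector: it is out-of-vocabulary (OOV) and thus has a norm of $0$, that is, a length of $0$. The output above, however, was produced with the small model loaded via `spacy.load('en')`, which only provides context-sensitive tensors. There, every token reports a vector with a non-zero norm, and every token is flagged as OOV in the fourth column.

With the large model, the token vector has a length of $300$; with the small model used for the output below, it has a length of $384$. We can print out the vector for a token: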

In [22]:
n = 0
print(tokens[n].text, len(tokens[n].vector), tokens[n].vector)

dog 384 [ 8.27200770e-01  2.36963582e+00 -6.35798633e-01  4.51421201e-01
  2.03428909e-01  1.73726356e+00 -3.18652272e+00  8.14928174e-01
  1.90902579e+00  2.81861591e+00  2.24422216e+00 -1.73021841e+00
  1.79004085e+00  3.29744518e-02 -1.84130037e+00  8.92891705e-01
 -2.34007502e+00 -6.58327699e-01 -2.56982803e+00  1.81837606e+00
 -2.24640161e-01  1.19199407e+00 -1.03678751e+00  1.85581863e+00
  9.48346257e-02 -1.62571692e+00 -5.23630440e-01  1.61878800e+00
 -2.62793928e-01 -2.29376721e+00 -6.65396869e-01 -7.22711563e-01
 -3.73787642e-01  1.11173570e-01 -8.51480961e-02 -1.27650201e+00
  1.60682821e+00 -5.60200214e-01  2.31330538e+00 -1.79506028e+00
 -1.91947556e+00 -2.31478238e+00  1.07934499e+00 -2.57284474e+00
 -2.47225070e+00 -6.94101095e-01 -1.99404633e+00 -5.84194660e-01
 -1.05473995e-01 -1.13228750e+00  3.32133532e+00 -1.98626065e+00
 -2.27126360e+00  3.23185134e+00  3.57697129e-01 -2.88535762e+00
  3.46697450e+00  3.08543921e+00  1.69311810e+00  6.86959505e-01
 -8.70782137e-03 

Here is another example of similarities, for some well-known words:

In [23]:
tokens = nlp(u'queen king chef')

for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))

queen queen 1.0
queen king 0.34783703
queen chef 0.2586036
king queen 0.34783703
king king 1.0
king chef 0.47207302
chef queen 0.2586036
chef king 0.47207302
chef chef 1.0


### Similarities in Context

In spaCy, the parsing, tagging, and NER models make use of vector representations of contexts that represent the *meaning of words*. A text *meaning representation* is an array of floats, i.e. a tensor, computed during the NLP pipeline processing. With this approach, words that have not been seen before can be typed or classified. SpaCy uses a 4-layer convolutional network for the computation of these tensors. In this approach, the tensors model a context of four words to the left and right of any given word.
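
The figure of four words on either side follows from stacking the layers. If each convolutional layer looks at one neighbouring word on each side (an assumption about the window size made for this illustration), every additional layer widens the visible context by one word per side. A small sketch of this arithmetic, with the hypothetical helper `receptive_field`:

```python
def receptive_field(layers, window=1):
    """Context visible after stacking `layers` convolutions.

    `window` is the assumed number of neighbours each layer sees per side.
    Returns the number of context words per side and the total span in words.
    """
    per_side = layers * window
    total = 2 * per_side + 1  # context on both sides plus the word itself
    return per_side, total


per_side, total = receptive_field(layers=4)
print(per_side, total)
```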

Let us use the example from the spaCy documentation and check the word *labrador*:

In [24]:
import spacy
nlp = spacy.load('en')

tokens = nlp(u'labrador')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

labrador True 23.063505 True


We can now test for the context:

In [25]:
doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"the labrador people live in canada.")

dog = nlp(u"dog")  # processed once, outside the loop

for count, doc in enumerate([doc1, doc2, doc3], start=1):
    lab = doc[1]  # the token "labrador" in each sentence
    print(str(count) + ":", lab.similarity(dog))

1: 0.3551335059008647
2: 0.21606158966020875
3: 0.2074718583991242


Using this strategy we can compute document or text similarities as well:

In [26]:
docs = ( nlp(u"Paris is the largest city in France."),
        nlp(u"Vilnius is the capital of Lithuania."),
        nlp(u"An emu is a large bird.") )

for x in range(len(docs)):
    for y in range(len(docs)):
        print(x, y, docs[x].similarity(docs[y]))

0 0 1.0
0 1 0.8139621420526477
0 2 0.6578787369563981
1 0 0.8139621420526477
1 1 1.0
1 2 0.6000087099931554
2 0 0.6578787369563981
2 1 0.6000087099931554
2 2 1.0
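
Document similarity compares one vector per document. In spaCy, a document vector defaults to the average of its token vectors. The sketch below mimics this averaging in plain Python; the toy vectors and the helper `doc_vector` are illustrative assumptions, not taken from any model. It also shows that a plain average of fixed word vectors ignores word order, so reordered sentences receive identical vectors. Context-sensitive representations, such as the tensors of the small model, do not have this property.

```python
# Toy word vectors (assumed values for illustration only).
toy_vectors = {
    "dog": [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man": [2.0, 1.0, 0.0],
}


def doc_vector(text):
    """Average the toy vectors of the whitespace-separated words in `text`."""
    words = text.split()
    dims = len(toy_vectors[words[0]])
    return [sum(toy_vectors[w][d] for w in words) / len(words) for d in range(dims)]


v1 = doc_vector("dog bites man")
v2 = doc_vector("man bites dog")
v3 = doc_vector("dog bites")

print(v1)
print(v1 == v2)  # same words, different order
print(v1 == v3)  # different set of words
```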


We can vary the word order in sentences and compare them:

In [27]:
docs = [nlp(u"dog bites man"), nlp(u"man bites dog"),
        nlp(u"man dog bites"), nlp(u"dog man bites")]

for doc in docs:
    for other_doc in docs:
        print('"' + doc.text + '"', '"' + other_doc.text + '"', doc.similarity(other_doc))

"dog bites man" "dog bites man" 1.0
"dog bites man" "man bites dog" 0.941871368221926
"dog bites man" "man dog bites" 0.9062079104027668
"dog bites man" "dog man bites" 0.9328819114282291
"man bites dog" "dog bites man" 0.941871368221926
"man bites dog" "man bites dog" 1.0
"man bites dog" "man dog bites" 0.91031258826218
"man bites dog" "dog man bites" 0.9005242840640686
"man dog bites" "dog bites man" 0.9062079104027668
"man dog bites" "man bites dog" 0.91031258826218
"man dog bites" "man dog bites" 1.0
"man dog bites" "dog man bites" 0.9483532486752623
"dog man bites" "dog bites man" 0.9328819114282291
"dog man bites" "man bites dog" 0.9005242840640686
"dog man bites" "man dog bites" 0.9483532486752623
"dog man bites" "dog man bites" 1.0


### Custom Models

#### Optimization

In [None]:
nlp = spacy.load('en_core_web_lg')