<strong>SpaCy</strong>

SpaCy is a Python library for advanced NLP. It helps pre-process text for deep learning and can be used for information extraction, dependency parsing, and lemmatization, among other things.

In this notebook, I demonstrate a few features of SpaCy and its visualizer, DisplaCy.

In [26]:
import spacy
from spacy import displacy

print(spacy.__version__)

2.3.5


In [38]:
# I use the medium-sized model, which has a larger vocabulary than the small model and includes 20k unique vectors.
nlp = spacy.load("en_core_web_md")

<strong>1. Linguistic Annotation</strong>
    
Tokenization and Part-of-Speech Tagging

Full documentation: https://spacy.io/api/token

In [28]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# Coarse-grained part-of-speech tags and syntactic dependency relations
for token in doc:
    print(token.text, token.pos_, token.dep_)

Gene PROPN compound
Munster PROPN nsubj
told VERB ROOT
CNBC PROPN dobj
that SCONJ mark
Apple PROPN nsubj
has AUX ccomp
a DET det
path NOUN dobj
to ADP prep
$ SYM nmod
200 NUM pobj
per ADP prep
share NOUN pobj
. PUNCT punct


In [29]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# Index and lemma of each token
for token in doc:
    print(token.i, token.lemma_)

0 Gene
1 Munster
2 tell
3 CNBC
4 that
5 Apple
6 have
7 a
8 path
9 to
10 $
11 200
12 per
13 share
14 .


In [30]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# Find tokens written entirely in upper case
for token in doc:
    print(token.i, token.is_upper)

0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False


In [31]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# Find tokens that are currency symbols
for token in doc:
    print(token.i, token.is_currency)

0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 False
12 False
13 False
14 False
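
Only the "$" token is flagged, not the amount "200". The sketch below is a plain-Python approximation of such a check, assuming that a currency token is one whose characters all belong to the Unicode category "Sc" (currency symbol). The helper name `looks_like_currency` is hypothetical and is not part of SpaCy.

```python
import unicodedata


def looks_like_currency(text):
    """Return True if every character is a Unicode currency symbol.

    Hypothetical helper for illustration; an approximation, not SpaCy's code.
    """
    # An empty string is treated as "not a currency" in this sketch.
    return bool(text) and all(unicodedata.category(ch) == "Sc" for ch in text)


for word in ["$", "200", "per"]:
    print(word, looks_like_currency(word))
```

Under this assumption, a spelled-out code such as "USD" would not be flagged, because its characters are letters rather than currency symbols.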


In [32]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# Fine-grained part-of-speech tag, orthographic shape,
# whether the token is alphabetic, and whether it is on the stop list
for token in doc:
    print(token.text, token.tag_, token.shape_, token.is_alpha, token.is_stop)

Gene NNP Xxxx True False
Munster NNP Xxxxx True False
told VBD xxxx True False
CNBC NNP XXXX True False
that IN xxxx True True
Apple NNP Xxxxx True False
has VBZ xxx True True
a DT x True True
path NN xxxx True False
to IN xx True True
$ $ $ False False
200 CD ddd False False
per IN xxx True True
share NN xxxx True False
. . . False False
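
The shape column can be reproduced by hand, which helps explain why "Munster" becomes "Xxxxx" rather than "Xxxxxxx". The sketch below is an approximation that assumes upper-case letters map to "X", lower-case letters to "x", digits to "d", and that a run of the same shape character is cut off after four. The function name `word_shape` is chosen for illustration only.

```python
def word_shape(text):
    """Approximate an orthographic shape such as 'Xxxxx' or 'ddd'.

    Illustrative sketch; not taken from the SpaCy source.
    """
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        # Count how long the current run of identical shape characters is.
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        # Keep at most four characters of any run.
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)


for word in ["Gene", "Munster", "CNBC", "$", "200"]:
    print(word, word_shape(word))
```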


In [33]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# I visualize the syntactic dependency relation of the tokens.
options = {"bg": "MediumSeaGreen", "color": "white", "font": "Source Sans Pro",
           "arrow_stroke": 5, "arrow_width": 10}

displacy.render(doc, style="dep", options=options)

<strong>2. Named Entity Recognition</strong>

Full list of named entities: https://spacy.io/api/annotation#named-entities

In [34]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# For each named entity: text, label, start character offset, end character offset
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

Gene Munster PERSON 0 12
CNBC ORG 18 22
Apple ORG 28 33
200 MONEY 49 52
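
The two numbers are character offsets into the original string, with the end offset exclusive, so they can be used directly as a Python slice. The sketch below uses plain Python and copies the offsets from the output above; the variable names are my own.

```python
text = "Gene Munster told CNBC that Apple has a path to $200 per share."

# (label, start_char, end_char) copied from the output above
entities = [
    ("PERSON", 0, 12),
    ("ORG", 18, 22),
    ("ORG", 28, 33),
    ("MONEY", 49, 52),
]

# Slicing the raw text with the offsets recovers each entity's text.
spans = [(label, text[start:end]) for label, start, end in entities]

for label, span_text in spans:
    print(label, span_text)
```

The MONEY span starts at offset 49, one character after the "$" at offset 48, which is why the entity text is "200" rather than "$200".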


In [35]:
doc = nlp("Gene Munster told CNBC that Apple has a path to $200 per share.")

# I visualize the named entities.
displacy.render(doc, jupyter=True, style="ent")

<strong>3. Word Vectors</strong>

The similarity of two words can be estimated by comparing their word vectors (also called word embeddings), which are multi-dimensional numeric representations of word meaning.

In [39]:
tokens = nlp("cat dog apple ytutru")

# For each token: whether it has a word vector,
# and the L2 norm of that vector
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm)

cat True 6.6808186
dog True 7.0336733
apple True 7.1346846
ytutru False 0.0
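
The L2 norm is the Euclidean length of a vector: the square root of the sum of its squared components. The sketch below computes it in plain Python on toy vectors that are invented for illustration, not taken from the model. A norm of 0.0, as for "ytutru" above, is what an all-zero vector would give.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


# Toy vectors, invented for illustration
print(l2_norm([3.0, 4.0]))       # 5.0
print(l2_norm([0.0, 0.0, 0.0]))  # 0.0
```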


<strong>4. Similarity</strong>

I compare each pair of tokens and compute a similarity score for the pair. Such scores can be used in recommendation models and duplicate detection.

In [37]:
tokens = nlp("tiger leopard car")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

tiger tiger 1.0
tiger leopard 0.9999999
tiger car 0.16612656
leopard tiger 0.9999999
leopard leopard 1.0
leopard car 0.16612656
car tiger 0.16612656
car leopard 0.16612656
car car 1.0
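
By default, SpaCy computes similarity as the cosine similarity of the two vectors. The sketch below implements cosine similarity in plain Python on toy vectors that are invented for illustration. A score as close to 1.0 as the one for "tiger" and "leopard" would be consistent with both words mapping to the same or nearly the same vector, but that is an assumption the output above does not confirm.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Define the similarity with a zero vector as 0.0 to avoid dividing by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration
same_a = [1.0, 2.0, 3.0]
same_b = [1.0, 2.0, 3.0]
different = [3.0, -1.0, 0.5]

print(cosine_similarity(same_a, same_b))     # identical vectors: approximately 1.0
print(cosine_similarity(same_a, different))  # unrelated vectors: a lower score
```

Because the cosine only depends on the angle between the vectors, scaling either vector leaves the score unchanged.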
