## spaCy

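+ A Python library for common NLP tasks: tokenization, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition
+ Trained models are available for many languages (English, German, French, Greek, etc.), and more are being added
+ Usage guide: https://spacy.io/usage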

+ Installation:
```
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")

### Functionality

#### Part-of-speech tagging

Documentation: https://spacy.io/usage/linguistic-features

In [3]:
d = nlp("Colorless green ideas sleep furiously")
for token in d:
    print(token.text, token.pos_)

Colorless ADJ
green ADJ
ideas NOUN
sleep VERB
furiously ADV
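
The tag names printed above are abbreviations. As a small sketch, `spacy.explain` can be used to look up a short description of each label; it needs no trained model. The list of labels is simply copied from the output above.

```python
import spacy

# Coarse-grained POS labels that appear in the output above
labels = ["ADJ", "NOUN", "VERB", "ADV"]

# spacy.explain returns a short description, or None for an unknown label
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(label, "->", description)
```

The same call also accepts dependency labels such as `nsubj` and entity labels such as `PERSON`.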


#### Lemmatization and tokenization

In [5]:
d = nlp("Colorless green ideas sleep furiously")
for token in d:
    print(token.text, token.lemma_)

Colorless colorless
green green
ideas idea
sleep sleep
furiously furiously


In [4]:
d = nlp("To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer")
for token in d:
    print(token.text, token.lemma_)

To to
be be
, ,
or or
not not
to to
be be
, ,
that that
is be
the the
question question
: :
Whether whether
' '
tis tis
nobler nobler
in in
the the
mind mind
to to
suffer suffer
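
Tokenization runs before every other step and is rule-based, so it can be tried without a trained model. The following is a minimal sketch, assuming spaCy v3: `spacy.blank("en")` creates a pipeline that contains only the English tokenizer. The name `blank_nlp` is chosen here so that the `nlp` object loaded earlier is not overwritten.

```python
import spacy

# A blank pipeline has a tokenizer but no tagger, parser, or lemmatizer
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)  # no trained components

tokens = [token.text for token in blank_nlp("I'm glad I'm a man, and so is Lola.")]
print(tokens)
```

Contractions such as *I'm* are split into two tokens, and punctuation becomes a token of its own. Lemmas and POS tags are not available from this blank pipeline.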


#### Syntax: Dependency Parsing

In [6]:
d = nlp("Colorless green ideas sleep furiously")
for token in d:
    print(token.text, token.lemma_, token.pos_, token.dep_)

Colorless colorless ADJ nmod
green green ADJ amod
ideas idea NOUN nsubj
sleep sleep VERB ROOT
furiously furiously ADV advmod


#### Navigating the tree: heads and children

In [7]:
for token in d:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Colorless nmod ideas NOUN []
green amod ideas NOUN []
ideas nsubj sleep VERB [Colorless, green]
sleep ROOT sleep VERB [ideas, furiously]
furiously advmod sleep VERB []
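
The navigation attributes depend only on the head indices and dependency labels stored in the `Doc`, not on the parser itself. As a sketch, assuming spaCy v3 (where `Doc` accepts `heads` as absolute token indices), the tree above can be rebuilt by hand and navigated in the same way. The heads and labels below are copied from the output above; `manual_doc` is an illustrative name.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Colorless", "green", "ideas", "sleep", "furiously"]
# Absolute index of each token's head; the root ("sleep") points to itself
heads = [2, 2, 3, 3, 3]
deps = ["nmod", "amod", "nsubj", "ROOT", "advmod"]

manual_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)
for token in manual_doc:
    print(token.text, token.dep_, token.head.text,
          [child.text for child in token.children])
```

Building a `Doc` this way is handy for testing code that walks a tree, because the structure is fixed and does not change with the model version.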


In [8]:
d = nlp("If a farmer owns a donkey, he beats it.")
for token in d:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

If mark owns VERB []
a det farmer NOUN []
farmer nsubj owns VERB [a]
owns advcl beats VERB [If, farmer, donkey]
a det donkey NOUN []
donkey dobj owns VERB [a]
, punct beats VERB []
he nsubj beats VERB []
beats ROOT beats VERB [owns, ,, he, it, .]
it dobj beats VERB []
. punct beats VERB []
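
Conceptually, the children lists are just the inverse of the head pointers. The plain-Python sketch below illustrates that idea with the head indices read off the output above; it is not how spaCy implements `token.children` internally.

```python
words = ["If", "a", "farmer", "owns", "a", "donkey", ",", "he", "beats", "it", "."]
# Index of each token's head, copied from the output above; the root points to itself
heads = [3, 2, 3, 8, 5, 3, 8, 8, 8, 8, 8]

# Invert the head pointers: every non-root token is a child of its head
children = {i: [] for i in range(len(words))}
for i, head in enumerate(heads):
    if head != i:
        children[head].append(i)

for i, word in enumerate(words):
    print(word, [words[c] for c in children[i]])
```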


Let's find the children of *beats*:

In [9]:
d = nlp("If a farmer owns a donkey, he beats it.")
print([token.text for token in d[8].lefts]) 
print([token.text for token in d[8].rights])  
print(d[8].n_lefts)  
print(d[8].n_rights)  

['owns', ',', 'he']
['it', '.']
3
2


Let's walk the subtree of the root's first left child and print each token's dependency type, its number of left and right children, and its ancestors:

In [10]:
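# the root is the only token that is its own head
root = [token for token in d if token.head == token][0]
# first left child of the root; in this sentence it is "owns", the head of the if-clause
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    # every token in the subtree is either the clause head itself or one of its descendants
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors])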

If mark 0 0 ['owns', 'beats']
a det 0 0 ['farmer', 'owns', 'beats']
farmer nsubj 1 0 ['owns', 'beats']
owns advcl 2 1 ['beats']
a det 0 0 ['donkey', 'owns', 'beats']
donkey dobj 1 0 ['owns', 'beats']
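
The ancestor lists above can be understood as repeatedly following the head pointer until the root is reached. Here is a plain-Python sketch of that walk, using head indices copied from the earlier output; the helper `ancestors` is hypothetical and only mirrors what `token.ancestors` reports.

```python
words = ["If", "a", "farmer", "owns", "a", "donkey", ",", "he", "beats", "it", "."]
# Index of each token's head; the root ("beats") points to itself
heads = [3, 2, 3, 8, 5, 3, 8, 8, 8, 8, 8]

def ancestors(i):
    """Follow head pointers from token i up to the root, collecting the words."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(words[i])
    return chain

# Tokens 0..5 form the subtree of "owns"
for i in range(6):
    print(words[i], ancestors(i))
```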


Let's print each token's POS tag, dependency type, and head:

In [11]:
d = nlp("If a farmer owns a donkey, he beats it.")
for token in d:
    print(token.text, token.pos_, token.dep_, token.head.text)

If SCONJ mark owns
a DET det farmer
farmer NOUN nsubj owns
owns VERB advcl beats
a DET det donkey
donkey NOUN dobj owns
, PUNCT punct beats
he PRON nsubj beats
beats VERB ROOT beats
it PRON dobj beats
. PUNCT punct beats


Visualizing the tree:

In [5]:
from spacy import displacy

In [6]:
d = nlp("If a farmer owns a donkey, he beats it.")

displacy.render(d, style='dep')
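
In a notebook, `displacy.render` draws the tree inline. Outside a notebook the same function can return the SVG markup as a string when `jupyter=False` is passed. The sketch below uses displaCy's manual input format (`manual=True`) so that it runs without a trained pipeline; the words and arcs are typed in by hand from the parse of the first example sentence. With a parsed `Doc` such as `d`, one would pass `d` and drop `manual=True`.

```python
from spacy import displacy

# Hand-written parse in displaCy's manual format (copied from the earlier output)
parse = {
    "words": [
        {"text": "Colorless", "tag": "ADJ"},
        {"text": "green", "tag": "ADJ"},
        {"text": "ideas", "tag": "NOUN"},
        {"text": "sleep", "tag": "VERB"},
        {"text": "furiously", "tag": "ADV"},
    ],
    "arcs": [
        # "dir" is "left" when the dependent precedes its head, "right" otherwise
        {"start": 0, "end": 2, "label": "nmod", "dir": "left"},
        {"start": 1, "end": 2, "label": "amod", "dir": "left"},
        {"start": 2, "end": 3, "label": "nsubj", "dir": "left"},
        {"start": 3, "end": 4, "label": "advmod", "dir": "right"},
    ],
}

# jupyter=False makes render return the markup instead of displaying it
svg = displacy.render(parse, style="dep", manual=True, jupyter=False)
print(svg[:60])

# To save the picture, write the string to a file, for example:
# from pathlib import Path
# Path("tree.svg").write_text(svg, encoding="utf-8")
```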

#### Testing

0. Let's test spaCy a little bit!
1. How does spaCy handle syntactic ambiguity?
    + John saw the man on the mountain with a telescope.
    + I'm glad I'm a man, and so is Lola.
    + More examples: https://en.wikipedia.org/wiki/Syntactic_ambiguity
2. What about islands?

    + What_i does Sarah believe that Susan thinks that John bought \_i?
    + Complex NPs

        - \*Who_i did Mary see the report that was about \_i?
        (cf. Mary saw the report that was about the senator.)

        - ?\*Which sportscar_i did the color of \_i delight the baseball player?
        (cf. The color of the sportscar delighted the baseball player.)

    + Complements of manner-of-speaking verbs
        - ??What_i did Mary whisper that Bill liked \_i?
        (cf. Mary whispered that Bill liked wine.)

    + Complex NPs (both noun complements and relative clauses)
        - \*The senator who_i Mary saw the report that was about \_i bothered many people on the committee.

    + Subject islands
        - ?\*The sportscar which_i the color of \_i delighted the baseball player…

    + Complements of manner-of-speaking verbs
        - ?The wine which_i Mary whispered that Bill liked \_i was a Cabernet.

3. What about ellipsis?
    + John can play the guitar; Mary can, too.

In [7]:
d = nlp("I'm glad I'm a man, and so is Lola.")

displacy.render(d, style='dep')

In [8]:
d = nlp("Who did Mary see the report that was about?")

displacy.render(d, style='dep')

In [9]:
d = nlp("John can play the guitar; Mary can, too.")

displacy.render(d, style='dep')
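
In dependency terms, an attachment ambiguity means that the same words admit two trees that differ only in the head of one phrase. The sketch below, assuming spaCy v3, builds both readings of a shortened telescope sentence by hand and compares where the preposition attaches. The heads and labels are written manually for illustration, and `pp_attachments` is a hypothetical helper; the parser itself returns only one of the two trees.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["John", "saw", "the", "man", "with", "a", "telescope"]
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "pobj"]
vocab = Vocab()

# Reading 1: "with a telescope" attaches to the verb (John used the telescope)
verb_attach = Doc(vocab, words=words, heads=[1, 1, 3, 1, 1, 6, 4], deps=deps)
# Reading 2: it attaches to the noun (the man has the telescope)
noun_attach = Doc(vocab, words=words, heads=[1, 1, 3, 1, 3, 6, 4], deps=deps)

def pp_attachments(doc):
    """Return (preposition, head) pairs for every token labelled 'prep'."""
    return [(token.text, token.head.text) for token in doc if token.dep_ == "prep"]

print(pp_attachments(verb_attach))
print(pp_attachments(noun_attach))
```

Calling `pp_attachments(nlp(...))` on a parsed sentence shows which of the readings the model picked.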

#### Named Entity Recognition (NER)

In [22]:
doc = nlp("Twitter permanently suspends President Donald Trump")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Donald Trump 39 51 PERSON


In [23]:
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("Twitter permanently suspends President Donald Trump")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "President" as an entity

# create a Span over token 3 ("President") for the new entity; PE = political entity
president_ent = Span(doc, 3, 4, label="PE")
doc.ents = list(doc.ents) + [president_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# "President" is now listed alongside the PERSON entity

Before [('Donald Trump', 39, 51, 'PERSON')]
After [('President', 29, 38, 'PE'), ('Donald Trump', 39, 51, 'PERSON')]
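
Adding a `Span` by hand works for a single document. For a recurring pattern, a rule-based component is an alternative. The following is a minimal sketch, assuming spaCy v3, that uses the built-in `entity_ruler` component on a blank English pipeline so that it runs without a trained model. The names `rule_nlp` and `ruled_doc` are illustrative, and the `PE` label is the same ad hoc label as above.

```python
import spacy

# Blank pipeline: tokenizer only, so every entity below comes from the rule
rule_nlp = spacy.blank("en")
ruler = rule_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PE", "pattern": "President"}])

ruled_doc = rule_nlp("Twitter permanently suspends President Donald Trump")
rule_ents = [(e.text, e.start_char, e.end_char, e.label_) for e in ruled_doc.ents]
print(rule_ents)
```

In a trained pipeline the ruler would be added next to the statistical `ner` component, for example with `nlp.add_pipe("entity_ruler", before="ner")`, so that both kinds of entities end up in `doc.ents`.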


#### What about other languages? 

Let's load a model for German:

In [None]:
! python -m spacy download de_core_news_sm

In [28]:
nlp = spacy.load("de_core_news_sm")
doc = nlp("schöne rote Äpfel auf dem Baum")
print([token.text for token in doc[2].lefts])  # ['schöne', 'rote']
print([token.text for token in doc[2].rights])  # ['auf']

['schöne', 'rote']
['auf']
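
Each language needs its own downloaded model, and `spacy.load` fails if the package is missing. As a small sketch, `spacy.util.is_package` can be used to check for an installed model first. The helper `load_pipeline` is hypothetical and only wraps the check with a hint about the download command.

```python
import spacy
from spacy.util import is_package

def load_pipeline(name):
    """Load an installed pipeline package, or fail with the download command."""
    if not is_package(name):
        raise OSError(
            f"Pipeline '{name}' is not installed; run: python -m spacy download {name}"
        )
    return spacy.load(name)

# Report which of the models used in this notebook are installed
for name in ["en_core_web_sm", "de_core_news_sm", "ru_core_news_sm"]:
    print(name, is_package(name))
```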


What about Russian?

In [None]:
! python -m spacy download ru_core_news_sm

In [12]:
from spacy.lang.ru.examples import sentences 

nlp = spacy.load("ru_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд
Apple PROPN nsubj
рассматривает VERB ROOT
возможность NOUN obj
покупки NOUN nmod
стартапа NOUN nmod
из ADP case
Соединённого ADJ amod
Королевства PROPN nmod
за ADP case
$ SYM nmod
1 NUM appos
млрд NOUN punct
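
In the output above, *млрд* is tagged `NOUN` but receives the dependency label `punct`, which looks like a parser error. A quick consistency check can flag such cases. The plain-Python sketch below runs on the rows copied from the output; the heuristic (a `punct` label should go with a `PUNCT` tag, and vice versa) is an assumption made here, not a spaCy feature.

```python
# (text, POS tag, dependency label) rows copied from the output above
rows = [
    ("Apple", "PROPN", "nsubj"),
    ("рассматривает", "VERB", "ROOT"),
    ("возможность", "NOUN", "obj"),
    ("покупки", "NOUN", "nmod"),
    ("стартапа", "NOUN", "nmod"),
    ("из", "ADP", "case"),
    ("Соединённого", "ADJ", "amod"),
    ("Королевства", "PROPN", "nmod"),
    ("за", "ADP", "case"),
    ("$", "SYM", "nmod"),
    ("1", "NUM", "appos"),
    ("млрд", "NOUN", "punct"),
]

# Flag rows where exactly one of the two annotations says "punctuation"
suspicious = [row for row in rows if (row[2] == "punct") != (row[1] == "PUNCT")]
print(suspicious)  # [('млрд', 'NOUN', 'punct')]
```

On a live `Doc`, the same filter can be written over `(token.text, token.pos_, token.dep_)` for each token.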


In [13]:
r = nlp("Какой прекрасный пень!")  # "What a wonderful stump!"

displacy.render(r, style='dep')

In [14]:
r = nlp("Семантику я люблю, а вот синтаксис не очень.")  # "I love semantics, but syntax not so much."

displacy.render(r, style='dep')

In [15]:
r = nlp("Блины бабушка очень уж вкусные печет.")  # "Grandma bakes really delicious pancakes." (scrambled word order)

displacy.render(r, style='dep')

In [16]:
r = nlp("Ваня нарисовал чудесную картину.")  # "Vanya painted a wonderful picture."

displacy.render(r, style='dep')