# spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library for parsing and analyzing large volumes of text. Separate trained models are available for individual languages (English, French, German, etc.).

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [4]:
# token.pos is the integer ID of the coarse part-of-speech tag; token.pos_ is its string label.
print(f"{'token':{15}} {'POS id':>{10}} {'POS':>{25}} {'dependency':>{35}}")

for token in doc:
    print(f"{token.text:{15}} {token.pos:>{10}} {token.pos_:>{25}} {token.dep_:>{35}}")

token               POS id                       POS                          dependency
Tesla                   96                     PROPN                               nsubj
is                      87                       AUX                                 aux
looking                100                      VERB                                ROOT
at                      85                       ADP                                prep
buying                 100                      VERB                               pcomp
U.S.                    96                     PROPN                            compound
startup                 92                      NOUN                                dobj
for                     85                       ADP                                prep
$                       99                       SYM                            quantmod
6                       93                       NUM                            compound
million                 93                       NUM                                pobj

**Universal POS tags** - https://universaldependencies.org/u/pos/

**Token attributes** - https://spacy.io/api/token#attributes

___
# Pipeline
When we run `nlp` on a text, the text enters a *processing pipeline* that first tokenizes it and then performs a series of operations to tag, parse, and describe the data.
<img src="../images/spacy-processing-pipeline.svg" width="600">
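To make the idea of a pipeline concrete, here is a minimal sketch in plain Python. It does not use spaCy: `tokenize`, `title_flagger`, `length_counter`, and `run` are hypothetical names invented for illustration, and the "doc" is just a dictionary. It only mirrors the general pattern of tokenizing first and then passing the result through an ordered list of named components.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-ins (hypothetical names, not spaCy classes): a "doc" is a dict of
# annotation lists, and a component is any function that adds to that dict.
Annotations = Dict[str, list]
Component = Callable[[Annotations], Annotations]


def tokenize(text: str) -> Annotations:
    """Step 0: split raw text into words (plain whitespace splitting)."""
    return {"words": text.split()}


def title_flagger(doc: Annotations) -> Annotations:
    """Mark title-cased words (first letter uppercase, the rest lowercase)."""
    doc["is_title"] = [word.istitle() for word in doc["words"]]
    return doc


def length_counter(doc: Annotations) -> Annotations:
    """Record the length of every word."""
    doc["lengths"] = [len(word) for word in doc["words"]]
    return doc


# Ordered (name, component) pairs, the same shape as the list in nlp.pipeline.
pipeline: List[Tuple[str, Component]] = [
    ("title_flagger", title_flagger),
    ("length_counter", length_counter),
]


def run(text: str) -> Annotations:
    doc = tokenize(text)              # tokenization always happens first
    for _name, component in pipeline:
        doc = component(doc)          # each component adds its annotations
    return doc


result = run("Tesla is looking at startups")
print([name for name, _ in pipeline])  # ['title_flagger', 'length_counter']
print(result["is_title"])              # [True, False, False, False, False]
print(result["lengths"])               # [5, 2, 7, 2, 8]
```

Because each component receives the output of the previous one, order matters: a component can rely on annotations written earlier in the list.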

In [5]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x119eaa930>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x119eaa5d0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x119db8f20>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x11a065f50>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x11a0534d0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x119db8f90>)]

In [6]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Tesla Tesla PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.S. U.S. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
6 6 NUM CD compound d False False
million million NUM CD pobj xxxx True False


- **Text**: The original word text.
- **Lemma**: The base form of the word.
- **POS**: The simple UPOS part-of-speech tag.
- **Tag**: The detailed part-of-speech tag.
- **Dep**: Syntactic dependency, i.e. the relation between tokens.
- **Shape**: The word shape: capitalization, punctuation, digits.
- **is alpha**: Does the token consist of alphabetic characters?
- **is stop**: Is the token part of a stop list, i.e. the most common words of the language?
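The **Shape** values in the output above (`Xxxxx`, `xxxx`, `X.X.`, `d`) follow a simple pattern that can be approximated in a few lines of plain Python. The function below is a sketch written for illustration, not spaCy's own code; the name `word_shape` and the cap of four repeated characters are assumptions chosen to reproduce the values shown above.

```python
def word_shape(text: str) -> str:
    """Approximate a word shape: X = uppercase, x = lowercase, d = digit.

    Any other character is kept as-is, and a run of the same shape
    character is cut off after four repetitions.
    """
    shape = []
    last = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            run_length += 1
        else:
            run_length = 0
            last = shape_char
        if run_length < 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Tesla", "looking", "U.S.", "$", "6"]:
    print(word, word_shape(word))
# Tesla Xxxxx
# looking xxxx
# U.S. X.X.
# $ $
# 6 d
```

Truncating long runs means that words of different lengths, such as "looking" and "million", share the shape `xxxx`, which makes shape useful as a coarse feature rather than an exact fingerprint.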

The names of the pipeline components are also available as a plain list through `nlp.pipe_names`:

In [7]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [8]:
doc2 = nlp(u"Tesla isn't looking into startups anymore.")

# As before: token.pos is the integer ID, token.pos_ is the string label.
print(f"{'token':{15}} {'POS id':>{10}} {'POS':>{25}} {'dependency':>{35}}")

for token in doc2:
    print(f"{token.text:{15}} {token.pos:>{10}} {token.pos_:>{25}} {token.dep_:>{35}}")

token               POS id                       POS                          dependency
Tesla                   96                     PROPN                               nsubj
is                      87                       AUX                                 aux
n't                     94                      PART                                 neg
looking                100                      VERB                                ROOT
into                    85                       ADP                                prep
startups                92                      NOUN                                pobj
anymore                 86                       ADV                              advmod
.                       97                     PUNCT                               punct


In [9]:
doc3 = nlp(u"green shirt")

for token in doc3:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

green green PROPN NNP amod xxxx True False
shirt shirt NOUN NN ROOT xxxx True False


___
## Spans
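Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object, written in the form `Doc[start:stop]`. As with ordinary Python slices, `start` is inclusive and `stop` is exclusive, and both are token indices rather than character offsets.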
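The slicing behaviour can be sketched without spaCy. In the toy model below, `ToyDoc` and `ToySpan` are hypothetical classes made up for illustration: the span stores only a reference to its parent document plus a start and end token index, so it acts as a view onto the document rather than a copy of it.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ToySpan:
    """A view onto part of a ToyDoc: it stores offsets, not copied tokens."""
    doc: "ToyDoc"
    start: int  # index of the first token in the span
    end: int    # index one past the last token (exclusive)

    def __len__(self) -> int:
        return self.end - self.start

    @property
    def text(self) -> str:
        return " ".join(self.doc.words[self.start:self.end])


class ToyDoc:
    """Hypothetical stand-in for a Doc: just a list of word strings."""

    def __init__(self, words: List[str]) -> None:
        self.words = words

    def __len__(self) -> int:
        return len(self.words)

    def __getitem__(self, key: Union[int, slice]):
        if isinstance(key, slice):
            # Resolve None/negative bounds the same way list slicing does.
            start, stop, _ = key.indices(len(self.words))
            return ToySpan(self, start, stop)
        return self.words[key]


toy_doc = ToyDoc("Life is what happens to us while we are making other plans".split())
toy_span = toy_doc[3:6]
print(toy_span.text)                 # happens to us
print(toy_span.start, toy_span.end)  # 3 6
print(len(toy_span))                 # 3
print(toy_span.doc is toy_doc)       # True
```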

In [10]:
doc4 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [11]:
life_quote = doc4[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [12]:
type(life_quote)

spacy.tokens.span.Span

In [13]:
type(doc4)

spacy.tokens.doc.Doc

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`.
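The following plain-Python sketch shows how per-token "start of sentence" flags are enough to produce sentences lazily. It is a toy model rather than spaCy's implementation: the `(word, is_sent_start)` tuples and the `iter_sentences` function are hypothetical stand-ins for tokens and `Doc.sents`.

```python
from typing import Iterator, List, Tuple

# Hypothetical (word, is_sent_start) pairs standing in for tokens.
toy_tokens: List[Tuple[str, bool]] = [
    ("This", True), ("is", False), ("the", False), ("first", False),
    ("sentence", False), (".", False),
    ("This", True), ("is", False), ("another", False),
    ("sentence", False), (".", False),
]


def iter_sentences(tokens: List[Tuple[str, bool]]) -> Iterator[List[str]]:
    """Lazily yield one list of words per sentence, splitting at start flags."""
    current: List[str] = []
    for word, is_sent_start in tokens:
        if is_sent_start and current:
            yield current          # a new sentence begins: emit the previous one
            current = []
        current.append(word)
    if current:
        yield current              # emit the final sentence


toy_sents = iter_sentences(toy_tokens)
print(next(toy_sents))  # ['This', 'is', 'the', 'first', 'sentence', '.']
print(list(toy_sents))  # [['This', 'is', 'another', 'sentence', '.']]
```

`Doc.sents` is likewise a generator, so it cannot be indexed directly; wrap it in `list(...)` first if you need to access a particular sentence by position.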

In [14]:
doc5 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [15]:
for sent in doc5.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [16]:
doc4[6].is_sent_start

False