# spaCy Basics

spaCy (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

In this section we'll set up spaCy to work with Python, and then introduce some concepts related to Natural Language Processing.


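First import the library and load the English core model. The `'en'` shortcut used here works in spaCy 2.x; spaCy 3.x requires the full package name, such as `en_core_web_sm`.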

In [1]:
import spacy
nlp = spacy.load('en')

We can use spaCy to break the text into tokens. Note that the tokenizer keeps U.S. together as a single token instead of splitting it at the periods.

In [2]:
doc = nlp(u"Tesla is looking at buying a U.S. startup for $6 million.")

for token in doc:
    print(token.text)

Tesla
is
looking
at
buying
a
U.S.
startup
for
$
6
million
.
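
Tokenization is rule-based and does not depend on the trained model, so it can be tried with a blank English pipeline. The sketch below assumes only that spaCy is installed; `spacy.blank("en")` and the names `nlp_blank`, `doc_blank` and `tokens` are choices made for this illustration and do not appear in the cells above.

```python
import spacy

# A blank English pipeline has the tokenizer but no trained components.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla is looking at buying a U.S. startup for $6 million.")

tokens = [token.text for token in doc_blank]
print(tokens)

# The abbreviation stays whole, while the currency symbol and the
# sentence-final period are split off as separate tokens.
print("U.S." in tokens, "$" in tokens, tokens[-1] == ".")
```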


We can also see that spaCy assigned a part-of-speech tag to each token.

In [3]:
doc = nlp(u"Tesla is looking at buying a U.S. startup for $6 million.")

for token in doc:
    print(token.text, token.pos_)

Tesla PROPN
is VERB
looking VERB
at ADP
buying VERB
a DET
U.S. PROPN
startup NOUN
for ADP
$ SYM
6 NUM
million NUM
. PUNCT
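
The tag abbreviations can be decoded with `spacy.explain`, which looks up a short description for a label and needs no loaded model. This is a small sketch; the `tags` list and the `descriptions` dictionary are names introduced here for illustration.

```python
import spacy

# Coarse-grained tags that appear in the output above.
tags = ["PROPN", "ADP", "DET", "NUM", "PUNCT"]

# spacy.explain returns a description string, or None for unknown labels.
descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(f"{tag:<6} {description}")
```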


___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data. A diagram of the pipeline is available at https://spacy.io/usage/spacy-101#pipelines

In [4]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x9a5b548>),
 ('parser', <spacy.pipeline.DependencyParser at 0x9a43b28>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x9a5f108>)]

In [5]:
nlp.pipe_names

['tagger', 'parser', 'ner']
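
To see how the list of components relates to the pipeline, it can help to start from an empty one. The sketch below assumes spaCy 3.x, where `add_pipe` accepts a component's string name (in spaCy 2.x the component was first built with `nlp.create_pipe`). The names `nlp_blank`, `names_before` and `names_after` are illustrative.

```python
import spacy

# Assumption: spaCy 3.x. A blank pipeline starts with no components.
nlp_blank = spacy.blank("en")
names_before = list(nlp_blank.pipe_names)

# Register the built-in rule-based sentence splitter by its string name.
nlp_blank.add_pipe("sentencizer")
names_after = list(nlp_blank.pipe_names)

print(names_before)
print(names_after)
```

The tokenizer does not show up in either list: it always runs first and is not counted as a pipeline component.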

___
## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. Let's look at another example:

In [6]:
doc2 = nlp(u"Tesla isn't looking into startups anymore.")

In [7]:
for token in doc2:
    print(token.text, token.pos_)

Tesla PROPN
is VERB
n't ADV
looking VERB
into ADP
startups NOUN
anymore ADV
. PUNCT
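
Besides the part-of-speech tag, each token carries lexical attributes that are available even without a trained model. The following sketch uses a blank English pipeline, which is an assumption made for this example; `rows` is a hypothetical helper name.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no trained model
doc_blank = nlp_blank("Tesla isn't looking into startups anymore.")

# Each row: position in the Doc, text, and two boolean lexical flags.
rows = [(t.i, t.text, t.is_alpha, t.is_punct) for t in doc_blank]
for row in rows:
    print(row)
```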


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [8]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [9]:
life_quote = doc3[16:30]

In [10]:
print(life_quote)

"Life is what happens to us while we are making other plans"


In [11]:
type(doc3)

spacy.tokens.doc.Doc

In [12]:
type(life_quote)

spacy.tokens.span.Span
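
Slice indices count tokens, not characters, and a single index returns a Token rather than a Span. A minimal sketch on a shorter text, assuming a blank English pipeline (slicing needs only the tokenizer); the variable names are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")  # slicing needs only the tokenizer
doc_blank = nlp_blank("Life is what happens to us while we are making other plans")

span = doc_blank[3:6]   # a slice returns a Span
token = doc_blank[3]    # a single index returns a Token

print(span.text, span.start, span.end, len(span))
print(type(span).__name__, type(token).__name__)

# A Span is a view: it points back to the Doc it was sliced from.
print(span.doc is doc_blank)
```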

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. The tag can be read from an individual token through `Token.is_sent_start`. Later we'll write our own segmentation rules.

In [13]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [14]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [15]:
print(doc4[8].is_sent_start)

None


In [16]:
doc4[6].is_sent_start

True
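
Sentence boundaries can also be set without the dependency parser. The sketch below assumes spaCy 3.x and uses the built-in rule-based `sentencizer` component on a blank pipeline; this is a swapped-in technique for illustration, not the parser-based segmentation that produced the output above. The names `sentences` and `starts` are hypothetical.

```python
import spacy

# Assumption: spaCy 3.x, where add_pipe accepts a component name.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")  # rule-based, splits on punctuation

doc_blank = nlp_blank("This is the first sentence. This is another sentence. "
                      "This is the last sentence.")

# Doc.sents is a generator, so materialize it to count or index sentences.
sentences = list(doc_blank.sents)
print(len(sentences))

# Indices of the tokens flagged as sentence starts.
starts = [token.i for token in doc_blank if token.is_sent_start]
print(starts)
```

Token 6 is the `This` that opens the second sentence, which matches the `True` shown for `doc4[6]`, while token 8 (`another`) is not a sentence start.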

So far we've seen some of spaCy's capabilities. Given a raw string, it can:
- Split the text into tokens
- Recognize parts of speech
- Recognize named entities
- Expose token attributes
- Recognize where a sentence starts and ends