In [1]:
# import spaCy
import spacy

### Introduction & Tutorial of spaCy

#### Documents, Spans & Tokens

In [2]:
# create a blank nlp object
nlp = spacy.blank("en")

In [10]:
# create document
doc = nlp("Hello World!")

for token in doc:
    print(token.text)

Hello
World
!


In [19]:
# each token can be accessed by using the index
token1 = doc[0]
token2 = doc[1]
token3 = doc[2]

print(f'Token 1: {token1}')
print(f'Token 2: {token2}')
print(f'Token 3: {token3}')

Token 1: Hello
Token 2: World
Token 3: !


In [22]:
# a span is a slice of the doc; the end index is exclusive
span = doc[1:3]
print(span.text)

World!
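
The relationship between a doc, its tokens and a span can be checked with a few attributes. The following is a minimal sketch, assuming the same blank English pipeline as above; it shows that a span keeps its position in the parent doc and that trailing whitespace is stored on the token, which is why the span text is `World!` without a space.

```python
import spacy

# blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("Hello World!")

# slice tokens 1 and 2 (the end index 3 is exclusive)
span = doc[1:3]

# a span remembers where it starts and ends in the parent doc
print(span.start, span.end, len(span))

# each token knows its index and the whitespace that follows it
for token in doc:
    print(token.i, repr(token.text), repr(token.whitespace_))

# tokens inside the span keep their doc-level index
print([(token.i, token.text) for token in span])
```

Indexing a doc with a single integer returns a token, while slicing returns a span, so `doc[1]` and `doc[1:2]` are different kinds of objects even though they cover the same text.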


#### Lexical Attributes

The attributes *is_alpha*, *is_punct* and *like_num* indicate whether a token consists of *alphabetic characters*, is a *punctuation mark*, or *resembles a number*. They are called *lexical attributes*: they refer to the token's entry in the vocabulary and do not depend on the context in which the token appears.

Each of these attributes returns a boolean value. The following examples collect the values for every token of a document in a list:

In [23]:
# create document
doc = nlp("It costs $5.")

In [30]:
# get the index of each token in the document
index = []

for token in doc:
    index.append(token.i)

print(f'Index: {index}')

Index: [0, 1, 2, 3, 4]


In [31]:
# get the text of each token in the document
text = []

for token in doc:
    text.append(token.text)

print(f'Text: {text}')

Text: ['It', 'costs', '$', '5', '.']


In [39]:
# check which tokens in the document are alphabetic
is_alpha = []

for token in doc:
    is_alpha.append(token.is_alpha)

print(f'Text: {text}')
print(f'is_alpha: {is_alpha}')

Text: ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]


In [40]:
# check which tokens in the document resemble a number
like_num = []

for token in doc:
    like_num.append(token.like_num)

print(f'Text: {text}')
print(f'like_num: {like_num}')

Text: ['It', 'costs', '$', '5', '.']
like_num: [False, False, False, True, False]


In [41]:
# check which tokens in the document are punctuation
is_punct = []

for token in doc:
    is_punct.append(token.is_punct)

print(f'Text: {text}')
print(f'is_punct: {is_punct}')

Text: ['It', 'costs', '$', '5', '.']
is_punct: [False, False, False, False, True]
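
Because lexical attributes belong to the vocabulary entry rather than to a token in context, they can also be read directly from the vocabulary. The following is a minimal sketch, assuming a blank English pipeline; the second sentence (`"5 apples"`) is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# look up vocabulary entries (lexemes) directly, without any document
for text in ["It", "5", "."]:
    lexeme = nlp.vocab[text]
    print(lexeme.text, lexeme.is_alpha, lexeme.like_num, lexeme.is_punct)

# the same string in two different sentences (second sentence is illustrative)
doc_a = nlp("It costs $5.")
doc_b = nlp("5 apples")
five_a = doc_a[3]
five_b = doc_b[0]

# both tokens share one vocabulary entry, so the flags are identical
print(five_a.orth == five_b.orth == nlp.vocab["5"].orth)
print(five_a.like_num, five_b.like_num)
```

Context-dependent information such as part-of-speech tags is not available this way; it requires a trained pipeline.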


#### Trained Pipeline

spaCy provides a number of trained pipeline packages. For example, *en_core_web_sm* is a small English pipeline that supports all core capabilities and is trained on web text. The package provides the *binary weights* that enable spaCy to make predictions. It also includes the *vocabulary*, *meta information* and the *configuration file* used to train it. The configuration tells spaCy which language class to use and how to configure the processing pipeline.
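
The contents of a pipeline package can be inspected after loading it. The following is a minimal sketch, assuming spaCy v3; the helper `describe_pipeline` is a hypothetical name introduced here for illustration, and it returns `None` when the package is not installed instead of raising an error.

```python
import spacy
from spacy.util import is_package


def describe_pipeline(name):
    """Hypothetical helper: summarize an installed pipeline package."""
    # is_package checks whether a Python package with this name is installed
    if not is_package(name):
        return None
    nlp = spacy.load(name)
    return {
        "lang": nlp.meta["lang"],              # language code from the meta information
        "name": nlp.meta["name"],              # package name without the language prefix
        "version": nlp.meta["version"],        # version of the pipeline package
        "components": nlp.pipe_names,          # processing components, in order
        "config_lang": nlp.config["nlp"]["lang"],  # language class named in the config
    }


info = describe_pipeline("en_core_web_sm")
if info is None:
    print("en_core_web_sm is not installed")
else:
    for key, value in info.items():
        print(f"{key}: {value}")
```

If the package is missing, it can typically be installed with `python -m spacy download en_core_web_sm`.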

In [42]:
# load pipeline en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [45]:
# For each token in the document, print the text and the .pos_ attribute,
# the predicted part-of-speech tag. In spaCy, attributes that return strings
# usually end with an underscore; attributes without the underscore return
# an integer ID value.

doc = nlp("She ate the pizza")

for token in doc:
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
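
The underscore convention can be illustrated without a trained pipeline by using attributes that every token has. The following is a minimal sketch, assuming a blank English pipeline; it uses `orth`/`orth_` and `lower`/`lower_`, and the same pattern applies to `pos` and `pos_` once a trained pipeline has made its predictions.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She ate the pizza")
token = doc[0]

# with underscore: readable string; without underscore: integer ID
print(token.orth_, token.orth)
print(token.lower_, token.lower)

# the string store maps integer IDs back to strings, and strings to IDs
print(nlp.vocab.strings[token.orth])
print(nlp.vocab.strings[token.orth_] == token.orth)
```

The integer IDs are hash values of the strings, which lets spaCy store and compare them more cheaply than the strings themselves.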
