# What is spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. The table below summarizes its main features.


| NAME                       | DESCRIPTION                                                                                                      |
|----------------------------|------------------------------------------------------------------------------------------------------------------|
| Tokenization               | Segmenting text into words, punctuation marks, etc.                                                              |
| Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun.                                                              |
| Dependency Parsing         | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.|
| Lemmatization              | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.  |
| Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences.                                                                |
| Named Entity Recognition (NER) | Labeling named “real-world” objects, like persons, companies, or locations.                                 |
| Entity Linking (EL)        | Disambiguating textual entities to unique identifiers in a knowledge base.                                        |
| Similarity                 | Comparing words, text spans, and documents to measure how similar they are to each other.                         |
| Text Classification        | Assigning categories or labels to a whole document or parts of a document.                                         |
| Rule-based Matching        | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.       |
| Training                   | Updating and improving a statistical model’s predictions.                                                          |
| Serialization              | Saving objects to files or byte strings.                                                                          |
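
Three of the rows above (sentence boundary detection, rule-based matching, and serialization) can be tried without downloading a trained model. The following is a minimal sketch, assuming spaCy v3 (where `nlp.add_pipe` takes a component name); the example text and the pattern name `MONEY_AMOUNT` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank English pipeline contains only the rule-based tokenizer,
# so nothing has to be downloaded for this sketch.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # punctuation-based sentence splitter (spaCy v3 call)

doc = nlp("Apple is looking at buying a startup. The price is $1 billion.")

# Sentence boundary detection
sentences = [sent.text for sent in doc.sents]
print(sentences)

# Rule-based matching: a currency symbol, a number-like token, then "billion".
# "MONEY_AMOUNT" is a hypothetical pattern name chosen for this example.
matcher = Matcher(nlp.vocab)
matcher.add("MONEY_AMOUNT", [[{"IS_CURRENCY": True}, {"LIKE_NUM": True}, {"LOWER": "billion"}]])
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)

# Serialization: round-trip the Doc through a byte string
restored = Doc(nlp.vocab).from_bytes(doc.to_bytes())
print(restored.text == doc.text)
```

The pattern relies only on lexical attributes (`IS_CURRENCY`, `LIKE_NUM`, `LOWER`), which is why it works without a tagger or parser.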


### Linguistic Annotations

In [2]:
import spacy

# en_core_web_sm must be downloaded once before use: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


### Tokenization

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
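
The output above shows that "U.K." stays a single token while "$" is split from "1". As a small sketch of what else each token carries, the example below uses a blank English pipeline (an assumption made here so that no model download is needed; the tokenizer rules are the same) to show character offsets via `token.idx` and to check that the original text can be rebuilt from the tokens.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

tokens = [token.text for token in doc]

# token.idx is the character offset of the token in the original text
for token in doc:
    print(token.idx, repr(token.text))

# text_with_ws keeps trailing whitespace, so joining the tokens restores the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because whitespace is preserved on each token, tokenization does not lose any information from the input string.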


### Part-of-speech tags and dependencies

In [6]:
import spacy
# from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# displacy.serve(doc, style="dep")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
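
The abbreviations in the output above (`PROPN`, `VBZ`, `nsubj`, and so on) can be hard to read. A short sketch using `spacy.explain`, which looks a label up in spaCy's built-in glossary and needs no loaded model; the selection of labels is just a sample taken from the output.

```python
import spacy

# A few labels copied from the output above
labels = ["PROPN", "AUX", "VBZ", "nsubj", "pobj"]

# spacy.explain returns a short description, or None for unknown labels
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:6} {description}")
```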


### Named Entities

In [8]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Starts a local web server for the visualization and blocks until it is interrupted.
displacy.serve(doc, style="ent")

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY



Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
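
`displacy.serve` starts a web server, which is not always convenient. As an alternative sketch, `displacy.render` returns the HTML markup as a string instead. To keep the example runnable without the trained model, the three entities are set by hand from the character offsets printed above; with `en_core_web_sm` loaded, `doc.ents` would already be filled in by the pipeline.

```python
import spacy
from spacy import displacy

# Blank pipeline: no NER component, so entities are assigned manually below.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Character offsets and labels copied from the printed output above
doc.ents = [
    doc.char_span(0, 5, label="ORG"),
    doc.char_span(27, 31, label="GPE"),
    doc.char_span(44, 54, label="MONEY"),
]

# jupyter=False forces render() to return the markup instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:200])
```

The returned string can be written to an `.html` file or embedded in a page.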
