# spaCy: The Basics

![https://images.newrepublic.com/bfddd1f9b55195fa1414bfc55474db6da38ed13b.jpeg?w=1000&q=65&dpi=1&fm=pjpg&h=449](https://images.newrepublic.com/bfddd1f9b55195fa1414bfc55474db6da38ed13b.jpeg?w=1000&q=65&dpi=1&fm=pjpg&h=449)

[spaCy](https://spacy.io) is an industrial-strength natural language processing (NLP) library for Python. Its goal is to take recent advances in NLP out of research papers and put them in the hands of users building production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

- Tokenization
- Text normalization, such as lowercasing and lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy ships with built-in data and models that you can use out of the box to process general-purpose English text:
- Large English vocabulary, including stopword lists
- Token "probabilities"
- Word vectors

spaCy is written in optimized Cython, which means it's **fast**. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, which enables efficient multithreading (i.e., spaCy can release Python's global interpreter lock, the *GIL*).

![https://s3.amazonaws.com/skipgram-images/spaCy.png](https://s3.amazonaws.com/skipgram-images/spaCy.png)

# Installing spaCy

Via **Anaconda** (https://anaconda.org/spacy/spacy):
> `conda install -c spacy spacy`

or via **pip**:
> `pip install -U spacy`

Next, download spaCy's data and models. Here we download the English data:
> `python -m spacy.en.download all`

Note that the download is about 1 GB and is split into two parts: the parser models and the GloVe word vectors. You can also download them one at a time:

> `python -m spacy.en.download parser`

> `python -m spacy.en.download glove`

# Test spaCy

In [1]:
import spacy

In [2]:
nlp = spacy.load('en')

In [3]:
nlp

<spacy.en.English at 0x10e98b0f0>
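
The download commands and the `spacy.load('en')` call above appear to target an early spaCy 1.x release. As a hedged sketch for newer releases (an assumption, since this notebook does not state its spaCy version): English models are installed as packages such as `en_core_web_sm` via `python -m spacy download en_core_web_sm`, and from v3 the `'en'` shortcut is no longer accepted by `spacy.load`. The helper name `load_english` below is hypothetical and only for illustration.

```python
# Sketch: load an English pipeline on a newer spaCy release.
# Assumption: the model package was installed beforehand with
#   python -m spacy download en_core_web_sm
try:
    import spacy
except ImportError:  # spaCy itself is not installed
    spacy = None


def load_english(model_name="en_core_web_sm"):
    """Return a loaded pipeline, a blank English pipeline, or None.

    Hypothetical helper, not part of spaCy.
    """
    if spacy is None:
        return None
    try:
        return spacy.load(model_name)
    except OSError:
        # Model package is missing: fall back to a tokenizer-only pipeline.
        return spacy.blank("en")


nlp = load_english()
if nlp is not None:
    doc = nlp("this is a quick smoke test")
    print([token.text for token in doc])
```

The blank pipeline only tokenizes, so attributes such as `pos_`, `ents`, and `sents` need the full model package.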

### Word Tokenize Test

In [4]:
doc1 = nlp("this's spacy tokenize test")
doc1

this's spacy tokenize test

In [5]:
for token in doc1:
    print(token)

this
's
spacy
tokenize
test


### Sentence Segmentation Test

In [6]:
doc2 = nlp("this is spacy sentence tokenize test. this is second sent! is this the third sent? final test.")
doc2

this is spacy sentence tokenize test. this is second sent! is this the third sent? final test.

In [7]:
for sent in doc2.sents:
    print(sent)

this is spacy sentence tokenize test.
this is second sent!
is this the third sent? final test.


### Lemmatize Test

In [8]:
doc3 = nlp("this is spacy lemmatize testing test. programme programming book books are more better than others. mouse mice. goose geese.")
doc3

this is spacy lemmatize testing test. programme programming book books are more better than others. mouse mice. goose geese.

In [9]:
for token in doc3:
    print('token: %16s | token.lemma: %8s | token.lemma_: %s' % (token, token.lemma, token.lemma_))

token:             this | token.lemma:      552 | token.lemma_: this
token:               is | token.lemma:      536 | token.lemma_: be
token:            spacy | token.lemma:   776980 | token.lemma_: spacy
token:        lemmatize | token.lemma:   776982 | token.lemma_: lemmatize
token:          testing | token.lemma:     4191 | token.lemma_: testing
token:             test | token.lemma:     1877 | token.lemma_: test
token:                . | token.lemma:      453 | token.lemma_: .
token:        programme | token.lemma:   203054 | token.lemma_: programme
token:      programming | token.lemma:     2171 | token.lemma_: programming
token:             book | token.lemma:     1300 | token.lemma_: book
token:            books | token.lemma:     1300 | token.lemma_: book
token:              are | token.lemma:      536 | token.lemma_: be
token:             more | token.lemma:      597 | token.lemma_: more
token:           better | token.lemma:      761 | token.lemma_: better
...

### Part-of-Speech (POS) Tagging Test

In [10]:
doc4 = nlp("This is pos tagger test for spacy pos tagger")
doc4

This is pos tagger test for spacy pos tagger

In [11]:
for token in doc4:
    print('token: %16s | token.pos: %8s | token.pos_: %s' % (token, token.pos, token.pos_))

token:             This | token.pos:       88 | token.pos_: DET
token:               is | token.pos:       98 | token.pos_: VERB
token:              pos | token.pos:       82 | token.pos_: ADJ
token:           tagger | token.pos:       90 | token.pos_: NOUN
token:             test | token.pos:       90 | token.pos_: NOUN
token:              for | token.pos:       83 | token.pos_: ADP
token:            spacy | token.pos:       90 | token.pos_: NOUN
token:              pos | token.pos:       90 | token.pos_: NOUN
token:           tagger | token.pos:       90 | token.pos_: NOUN
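
The paired attributes printed above (`token.lemma` / `token.lemma_`, `token.pos` / `token.pos_`) follow a spaCy naming convention: the plain attribute is an integer ID and the underscore-suffixed attribute is the readable string. The toy class below is a minimal sketch of that idea using a two-way lookup table. `ToyStringStore` is a hypothetical name and is not spaCy's implementation; the real IDs come from spaCy's own vocabulary and symbol tables.

```python
class ToyStringStore:
    """Hypothetical stand-in for a string <-> integer ID table."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def add(self, text):
        """Return the ID for text, assigning a new one on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, key):
        """Return the string stored under an integer ID."""
        return self._strings[key]


store = ToyStringStore()
noun_id = store.add("NOUN")
verb_id = store.add("VERB")

# The same string always maps to the same ID, which is why every NOUN
# row in the output above shows the same integer.
same_id = store.add("NOUN")
print(noun_id == same_id, store.lookup(verb_id))
```

Comparing integers is cheaper than comparing strings, which is one plausible reason to keep both forms available.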


### Named Entity Recognizer (NER) Test

[Entity Types](https://spacy.io/docs/usage/entity-recognition#entity-types)

In [12]:
doc5 = nlp("Rami Eid is studying at Stony Brook University in New York")
doc5

Rami Eid is studying at Stony Brook University in New York

In [13]:
for ent in doc5.ents:
    print('ent: %25s | ent.label: %8s | ent.label_: %s' % (ent, ent.label, ent.label_))

ent:                  Rami Eid | ent.label:      377 | ent.label_: PERSON
ent:    Stony Brook University | ent.label:      380 | ent.label_: ORG


### Noun Chunk Test

In [14]:
doc6 = nlp("Natural language processing (NLP) deals with the application of computational models to text or speech data.")
doc6

Natural language processing (NLP) deals with the application of computational models to text or speech data.

In [15]:
for np in doc6.noun_chunks:
    print(np)

Natural language processing (NLP) deals
the application
computational models
text or speech data


### Word Vectors Test

In [16]:
doc7 = nlp("Apples and oranges are similar. Boots and hippos aren't.")
doc7

Apples and oranges are similar. Boots and hippos aren't.

In [17]:
for idx, token in enumerate(doc7):
    print('%2d   %s' % (idx, token))

 0   Apples
 1   and
 2   oranges
 3   are
 4   similar
 5   .
 6   Boots
 7   and
 8   hippos
 9   are
10   n't
11   .


In [18]:
apples = doc7[0]

In [19]:
oranges = doc7[2]

In [20]:
boots = doc7[6]

In [21]:
hippos = doc7[8]

In [24]:
hippos.vector

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0

In [22]:
# Find the similarity between "apples" and "oranges"
apples.similarity(oranges)

0.0

In [23]:
# Find the similarity between "boots" and "hippos"
boots.similarity(hippos)

0.0
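
Both `0.0` scores are consistent with the all-zero `hippos.vector` shown earlier, which suggests the word vectors (the `glove` download) were not loaded in this session. The sketch below assumes `similarity` is a cosine similarity with a guard for zero-length vectors; `cosine_similarity` is an illustrative function written here, not spaCy's code, and the vector sizes are arbitrary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero norm, since the angle is
    undefined in that case (an assumed convention for this sketch).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Missing word vectors show up as all zeros, so every comparison is 0.0.
zero_vector = [0.0] * 8
missing = cosine_similarity(zero_vector, zero_vector)

# Vectors pointing in the same direction score (approximately) 1.0.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

print(missing, parallel)
```

If the scores stay at `0.0` after the vectors are installed, checking that the token's vector contains nonzero values is a quick first diagnostic.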

# More Resources

- [NLTK vs. spaCy: Natural Language Processing in Python](http://blog.thedataincubator.com/2016/04/nltk-vs-spacy-natural-language-processing-in-python/)
- [More spaCy Tutorials](https://spacy.io/docs/usage/tutorials)
- [Advanced spaCy Stuff on a Million Yelp Reviews](http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb)