In [1]:
import spacy

Basic tokenization using the smallest CPU-optimized English model:  
(download a model first with `python -m spacy download {model_name}`)

In [3]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print((token.text, token.pos_, token.dep_))

('Apple', 'PROPN', 'nsubj')
('is', 'AUX', 'aux')
('looking', 'VERB', 'ROOT')
('at', 'ADP', 'prep')
('buying', 'VERB', 'pcomp')
('U.K.', 'PROPN', 'dobj')
('startup', 'NOUN', 'dep')
('for', 'ADP', 'prep')
('$', 'SYM', 'quantmod')
('1', 'NUM', 'compound')
('billion', 'NUM', 'pobj')


In [4]:
for token in doc:
    print(f"text: {token.text}, lemma: {token.lemma_}, pos: {token.pos_}, tag: {token.tag_}, dep: {token.dep_}, shape: {token.shape_}, is_alpha: {token.is_alpha}, is_stop: {token.is_stop}")

text: Apple, lemma: Apple, pos: PROPN, tag: NNP, dep: nsubj, shape: Xxxxx, is_alpha: True, is_stop: False
text: is, lemma: be, pos: AUX, tag: VBZ, dep: aux, shape: xx, is_alpha: True, is_stop: True
text: looking, lemma: look, pos: VERB, tag: VBG, dep: ROOT, shape: xxxx, is_alpha: True, is_stop: False
text: at, lemma: at, pos: ADP, tag: IN, dep: prep, shape: xx, is_alpha: True, is_stop: True
text: buying, lemma: buy, pos: VERB, tag: VBG, dep: pcomp, shape: xxxx, is_alpha: True, is_stop: False
text: U.K., lemma: U.K., pos: PROPN, tag: NNP, dep: dobj, shape: X.X., is_alpha: False, is_stop: False
text: startup, lemma: startup, pos: NOUN, tag: NN, dep: dep, shape: xxxx, is_alpha: True, is_stop: False
text: for, lemma: for, pos: ADP, tag: IN, dep: prep, shape: xxx, is_alpha: True, is_stop: True
text: $, lemma: $, pos: SYM, tag: $, dep: quantmod, shape: $, is_alpha: False, is_stop: False
text: 1, lemma: 1, pos: NUM, tag: CD, dep: compound, shape: d, is_alpha: False, is_stop: False
text: billi

Dependency parse tree:

In [5]:
spacy.displacy.render(doc, style="dep")

Named entities are exposed through their own property, doc.ents:  
(GPE = Geopolitical entity)

In [6]:
for ent in doc.ents:
    print((ent.text, ent.start_char, ent.end_char, ent.label_))

('Apple', 0, 5, 'ORG')
('U.K.', 27, 31, 'GPE')
('$1 billion', 44, 54, 'MONEY')
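The start_char and end_char values above are character offsets into the original text, so slicing the raw string with them should give back each entity. A minimal sketch in plain Python, reusing the offsets printed above (the `entities` list is copied by hand for illustration, not produced by spaCy here):

```python
# The sentence passed to nlp(...) in the cells above.
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the printed output.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# end_char is exclusive, so an ordinary Python slice recovers the entity text.
spans = [(text[start:end], label) for start, end, label in entities]
for span_text, label in spans:
    print(span_text, label)
```

This is handy when entity offsets need to be mapped back onto the source string, for example to highlight them.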


In [7]:
spacy.displacy.render(doc, style="ent")

Word vectors (L2 norm shown):

In [8]:
nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print((token.text, token.has_vector, token.vector_norm, token.is_oov))

('dog', True, 75.254234, False)
('cat', True, 63.188496, False)
('banana', True, 31.620354, False)
('afskfsd', False, 0.0, True)
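For reference, the L2 norm in the third column is the square root of the sum of squared vector components. A small sketch with made-up vectors (the helper name `l2_norm` and the numbers are illustrative only, not spaCy's vectors):

```python
import math

def l2_norm(vector):
    """Return the Euclidean (L2) length of a vector given as a list of floats."""
    return math.sqrt(sum(component * component for component in vector))

known_word = [3.0, 4.0]         # toy vector for a word that has a vector
unknown_word = [0.0, 0.0]       # an all-zero vector, as for an out-of-vocabulary word

print(l2_norm(known_word))      # 5.0
print(l2_norm(unknown_word))    # 0.0
```

An all-zero vector has norm 0.0, which matches what the output above shows for the out-of-vocabulary token.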


Similarity: A .similarity method is provided for Doc, Span, Token, and Lexeme objects.

In [9]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761


In [10]:
salty_fries = doc1[2:4]  # Span covering "salty fries"
burgers = doc1[5]        # Token "hamburgers"
print(salty_fries, "<->", burgers, salty_fries.similarity(burgers))

salty fries <-> hamburgers 0.6938489675521851


In [13]:
doc1[4].similarity(doc2[3])

0.3279081881046295

In [14]:
doc1 = nlp("If you're going through hell, keep going")
doc2 = nlp("Be yourself; everyone else is already taken.")
least_similar = (1.0, ('', ''))
for token1 in doc1:
    for token2 in doc2:
        sim = token1.similarity(token2)
        if sim < least_similar[0]:
            least_similar = (sim, (token1.text, token2.text))
print(least_similar)

(-0.31665900349617004, ('you', ';'))


In [18]:
print(f"{doc1[1].text} {doc2[4].text} {doc1[1].similarity(doc2[4])}")
print(f"{doc1[1].lex.similarity(doc2[4].lex)}")

you else 0.4331720471382141
0.4331720471382141


For Doc and Span objects, the similarity score is the cosine similarity of the averaged token vectors, so word order does not matter. It therefore scores higher for similar words than it does for similar meanings.  
Consider [sense2vec](https://github.com/explosion/sense2vec) for more semantic and context sensitivity: [Demo blogpost](https://explosion.ai/blog/sense2vec-reloaded)
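To see why word order drops out, here is a rough sketch of averaging vectors and comparing the averages with cosine similarity. It uses plain Python and hypothetical 3-dimensional toy vectors (`toy_vectors`, `average_vector` and `cosine` are made up for illustration and are not spaCy's implementation):

```python
import math

# Hypothetical toy vectors; real word vectors have hundreds of dimensions.
toy_vectors = {
    "dog": [1.0, 0.2, 0.0],
    "bites": [0.1, 1.0, 0.3],
    "man": [0.8, 0.1, 0.5],
}

def average_vector(words):
    """Average the toy vectors of the given words, component by component."""
    vectors = [toy_vectors[word] for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same words, opposite meaning: the averages are identical, so the score is maximal.
dog_bites_man = average_vector(["dog", "bites", "man"])
man_bites_dog = average_vector(["man", "bites", "dog"])
order_similarity = cosine(dog_bites_man, man_bites_dog)
print(order_similarity)
```

The two sentences mean different things, yet the printed score is 1.0 up to floating-point rounding, because averaging discards the order of the words.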

To save space, each string is stored once in the vocab object (shared across documents) and referenced everywhere else by a unique hash value derived from that string.
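As a mental model only, the two-way lookup can be sketched with a toy class in plain Python. `ToyStringStore` is hypothetical and uses `hashlib.blake2b` as a stand-in hash, so its hash values will not match spaCy's:

```python
import hashlib

class ToyStringStore:
    """Simplified, hypothetical stand-in for a string store (not spaCy's code)."""

    def __init__(self):
        self._by_hash = {}  # hash -> string, filled only by add()

    @staticmethod
    def hash_of(text):
        # Stand-in 64-bit hash; an assumption made for this sketch.
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.hash_of(key)  # string -> hash needs no stored state
        return self._by_hash[key]     # hash -> string raises KeyError if never added

store = ToyStringStore()
coffee_hash = store["coffee"]

try:
    store[coffee_hash]
    resolved_before_add = True
except KeyError:
    resolved_before_add = False

store.add("coffee")
resolved_after_add = store[coffee_hash]
print(resolved_before_add, resolved_after_add)  # False coffee
```

The point of the sketch is the asymmetry: a hash can always be computed from a string, but going from a hash back to a string only works once that string has been stored.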

In [20]:
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])
print(doc.vocab.strings[3197928453018144401])

3197928453018144401
coffee


In [21]:
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_, lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
coffee 3197928453018144401 xxxx c fee True False False en


A hash can only be resolved back to its string if the vocab has already seen that word:  
(To restart a process with partial progress, you'll have to save the vocab and load it again, for example from a pickle file.)

In [23]:
from spacy.tokens import Doc
from spacy.vocab import Vocab

empty_doc = Doc(Vocab())
print(empty_doc.vocab.strings[3197928453018144401]) # Errors

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the `Vocab` or `StringStore`."

In [25]:
empty_doc.vocab.strings.add("coffee")
print(empty_doc.vocab.strings[3197928453018144401])

coffee


As long as Doc objects share a vocab, a new doc can resolve hash values for strings it has never seen itself:

In [26]:
new_doc_shared = Doc(doc.vocab)
print(new_doc_shared.vocab.strings[3197928453018144401])

coffee


Use nlp.to_disk and nlp.from_disk to save the pipeline (including its vocab) to a directory and load it again, so you can pick up where you left off.
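A minimal sketch of such a round trip, assuming spaCy v3 is installed. It uses a blank English pipeline so no model download is needed, and reloads with `spacy.load` on the saved directory; the directory name `my_pipeline` is made up for illustration:

```python
import tempfile
from pathlib import Path

import spacy

# Blank English pipeline: tokenizer and vocab only, no trained components.
nlp = spacy.blank("en")
coffee_hash = nlp.vocab.strings.add("coffee")  # add() returns the string's hash

with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_dir = Path(tmp_dir) / "my_pipeline"  # hypothetical directory name
    nlp.to_disk(pipeline_dir)                     # writes config, tokenizer and vocab
    restored = spacy.load(pipeline_dir)           # rebuilds the pipeline from disk

# The reloaded vocab can resolve the hash without having processed any text.
restored_text = restored.vocab.strings[coffee_hash]
print(restored_text)
```

In a real project you would save to a permanent directory rather than a temporary one; the temporary directory only keeps the sketch self-cleaning.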

spaCy provides a framework for [training custom models](https://spacy.io/usage/training).

Contributing: Easier-to-solve bugs are listed under the [help wanted (easy)](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted+%28easy%29%22) label on GitHub.  
Documentation and typo fixes are also encouraged.

Social Media tags and badges:  
Twitter - tag [@spacy_io](https://x.com/spacy_io)  
Badges - [![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://spacy.io)  
         [![Made with love and spaCy](https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg)](https://spacy.io)  
You can also submit your work to the [spaCy universe](https://spacy.io/universe/category/conversational).