In [1]:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a sentence.')

In [4]:
doc.print_tree

<function Doc.print_tree>

## Features


- Tokenization: Segmenting text into words, punctuation marks etc.
- Part-of-speech (POS) Tagging: Assigning word types to tokens, like verb or noun.
- Dependency Parsing: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- Lemmatization: Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
- Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
- Named Entity Recognition (NER): Labelling named "real-world" objects, like persons, companies or locations.
- Similarity: Comparing words, text spans and documents and how similar they are to each other.
- Text Classification: Assigning categories or labels to a whole document, or parts of a document.
- Rule-based Matching: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
- Training: Updating and improving a statistical model's predictions.
- Serialization: Saving objects to files or byte strings.

### Statistical models
While some of spaCy's features work independently, others require statistical models to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun. spaCy currently offers statistical models for 8 languages, which can be installed as individual Python modules. Models can differ in size, speed, memory usage, accuracy and the data they include. The model you choose always depends on your use case and the texts you're working with. For a general-purpose use case, the small, default models are always a good start. They typically include the following components:



-  **Binary weights** for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.




### Linguistic annotations
spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you're analysing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether "google" is used as a verb, or refers to the website or company in a specific context.

In [6]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

In [14]:
print('token.text\t token.pos_\t token.dep_')
for token in doc:
    print("{}\t\t{}\t\t{}".format(token.text, token.pos_, token.dep_))

token.text	 token.pos_	 token.dep_
Apple		PROPN		nsubj
is		VERB		aux
looking		VERB		ROOT
at		ADP		prep
buying		VERB		pcomp
U.K.		PROPN		compound
startup		NOUN		dobj
for		ADP		prep
$		SYM		quantmod
1		NUM		compound
billion		NUM		pobj


## Tokenization
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

In [15]:
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
2. Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.


```
Tokenizer exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.
Prefix: Character(s) at the beginning, e.g. $, (, “, ¿.
Suffix: Character(s) at the end, e.g. km, ), ”, !.
Infix: Character(s) in between, e.g. -, --, /, ….
```
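
The loop described above can be sketched in plain Python. This is only a toy illustration, not spaCy's actual tokenizer: the rule tables `EXCEPTIONS`, `PREFIXES` and `SUFFIXES` and the function `toy_tokenize` are hypothetical names made up for this example, the rules cover just a handful of characters, and infix splitting is left out entirely.

```python
# Toy rule tables (assumptions for illustration; spaCy's real rules are
# language-specific and far more extensive).
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = {"$", "(", '"'}
SUFFIXES = {")", '"', "!", ".", ","}


def toy_tokenize(text):
    """Split on whitespace, then peel off exceptions, prefixes and suffixes."""
    tokens = []
    for substring in text.split():
        suffixes = []  # collected right-to-left, re-attached in reverse order
        while substring:
            if substring in EXCEPTIONS:
                # Check 1: a special-case rule decides how to split (or not).
                tokens.extend(EXCEPTIONS[substring])
                substring = ""
            elif substring[0] in PREFIXES:
                # Check 2a: split off one leading character and loop again.
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring[-1] in SUFFIXES:
                # Check 2b: split off one trailing character and loop again.
                suffixes.append(substring[-1])
                substring = substring[:-1]
            else:
                # Nothing left to split: keep the remainder as one token.
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    return tokens


print(toy_tokenize("(don't visit the U.K.!)"))
# ['(', 'do', "n't", 'visit', 'the', 'U.K.', '!', ')']
```

Because the exception check runs again after every split, "U.K." is recognised as a single token once the trailing "!" and ")" have been peeled off.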

![](https://spacy.io/assets/img/tokenization.svg)

## Part-of-speech tags and dependencies

After tokenization, spaCy can **parse** and **tag** a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name, for example token.pos_ instead of token.pos:

In [23]:
print("TEXT\t LEMMA\t POS\t TAG\t DEP\t SHAPE\t ALPHA\t STOP")
for token in doc:
    print("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}".format(token.text, token.lemma_, token.pos_, 
                                                  token.tag_, token.dep_,token.shape_, 
                                                  token.is_alpha, token.is_stop))

TEXT	 LEMMA	 POS	 TAG	 DEP	 SHAPE	 ALPHA	 STOP
Apple	apple	PROPN	NNP	nsubj	Xxxxx	True	False
is	be	VERB	VBZ	aux	xx	True	True
looking	look	VERB	VBG	ROOT	xxxx	True	False
at	at	ADP	IN	prep	xx	True	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False
U.K.	u.k.	PROPN	NNP	compound	X.X.	False	False
startup	startup	NOUN	NN	dobj	xxxx	True	False
for	for	ADP	IN	prep	xxx	True	True
$	$	SYM	$	quantmod	$	False	False
1	1	NUM	CD	compound	d	False	False
billion	billion	NUM	CD	pobj	xxxx	True	False


## Named Entities 
A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:


```
Text: The original entity text.
Start: Character offset of the start of the entity in the Doc's text.
End: Character offset of the end of the entity in the Doc's text.
Label: Entity label, i.e. type.
```

In [32]:
for ent in doc.ents:
    print('{}|\t{}|\t   {}|\t{}'.format(ent.text, ent.start_char, ent.end_char, ent.label_))

Apple|	0|	   5|	ORG
U.K.|	27|	   31|	GPE
$1 billion|	44|	   54|	MONEY
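
The start and end values printed above are character offsets into the original string, so slicing the text with them should give back each entity's text. A minimal check in plain Python, using only the sentence and the offsets shown in the output (the variable names are made up for this example):

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char) pairs copied from the output above
offsets = [(0, 5), (27, 31), (44, 54)]

# The end offset is exclusive, exactly like a Python slice.
entity_texts = [text[start:end] for start, end in offsets]
print(entity_texts)
# ['Apple', 'U.K.', '$1 billion']
```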


### Word vectors and similarity
spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course, similarity is always subjective – whether "dog" and "cat" are similar really depends on how you're looking at it. spaCy's similarity model usually assumes a pretty general-purpose definition of similarity.

In [47]:
import spacy

nlp = spacy.load('en_core_web_lg')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print('{}\t{}\t{:1.2f}'.format(token1.text, token2.text, token1.similarity(token2)))

dog	dog	1.00
dog	cat	0.80
dog	banana	0.24
cat	dog	0.80
cat	cat	1.00
cat	banana	0.28
banana	dog	0.24
banana	cat	0.28
banana	banana	1.00
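
To give a feel for where scores like these come from: spaCy's default similarity estimate is the cosine similarity of the two objects' vectors. The sketch below computes it in plain Python. The two-dimensional vectors are invented for illustration (real model vectors have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper, not a spaCy function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, chosen only so that "dog" and "cat" point in a
# similar direction and "banana" points elsewhere.
toy_vectors = {
    "dog": [0.9, 0.1],
    "cat": [0.8, 0.3],
    "banana": [0.1, 0.9],
}

for word1, vec1 in toy_vectors.items():
    for word2, vec2 in toy_vectors.items():
        print('{}\t{}\t{:1.2f}'.format(word1, word2, cosine_similarity(vec1, vec2)))
```

As in the table above, every word scores 1.00 against itself, and the matrix is symmetric because the measure does not depend on the order of its arguments.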


---
Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalise vectors.

```
Text: The original token text.
has vector: Does the token have a vector representation?
Vector norm: The L2 norm of the token's vector (the square root of the sum of the values squared)
OOV: Out-of-vocabulary
```

In [56]:
tokens = nlp(u'dog cat banana afskfsd')
print('text\tvector\tvec_norm\tis_oov')
for token in tokens:
    print('{}\t{}\t{:1.2f}\t\t{}'.format(token.text, token.has_vector, token.vector_norm, token.is_oov))

text	vector	vec_norm	is_oov
dog	True	7.03		False
cat	True	6.68		False
banana	True	6.70		False
afskfsd	False	0.00		True


The words "dog", "cat" and "banana" are all pretty common in English, so they're part of the model's vocabulary, and come with a vector. The word "afskfsd" on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it's practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models or loading in a full vector package, for example, en_vectors_web_lg, which includes over 1 million unique vectors.
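
The vector arithmetic mentioned in this section (the L2 norm, the all-zero vector of an out-of-vocabulary word, and a document vector as the average of its token vectors) can be illustrated without a model. This is a sketch under stated assumptions: the three-dimensional vectors are invented, and `l2_norm`, `has_nonzero_vector` and `average_vector` are hypothetical helpers rather than spaCy APIs.

```python
import math


def l2_norm(vector):
    """Square root of the sum of the squared values."""
    return math.sqrt(sum(x * x for x in vector))


def has_nonzero_vector(vector):
    """Toy stand-in for a 'has vector' check: any non-zero component."""
    return any(x != 0.0 for x in vector)


def average_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Invented toy vectors; the unknown word gets all zeros.
toy_vectors = {
    "dog": [3.0, 4.0, 0.0],
    "cat": [1.0, 2.0, 2.0],
    "afskfsd": [0.0, 0.0, 0.0],
}

for word, vector in toy_vectors.items():
    print('{}\t{}\t{:1.2f}'.format(word, has_nonzero_vector(vector), l2_norm(vector)))

# Averaging the token vectors gives one vector for the whole sequence.
print(average_vector([toy_vectors["dog"], toy_vectors["cat"]]))
```

Dividing a vector by its L2 norm yields a unit-length vector, which is what normalising means here; that is not possible for the all-zero vector, whose norm is 0.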

## Pipelines

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

![](https://spacy.io/assets/img/pipeline.svg)



![](https://i.imgur.com/p8nlUeU.png)


The processing pipeline always depends on the statistical model and its capabilities. For example, a pipeline can only include an entity recognizer component if the model includes data to make predictions of entity labels. This is why each model will specify the pipeline to use in its meta data, as a simple list containing the component names, for example ["tagger", "parser", "ner"] for a pipeline made up of a tagger, a parser and an entity recognizer.
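
The hand-off between components can be sketched in plain Python. This is a conceptual illustration only, not how spaCy implements its pipeline: the dict-based "doc", the three placeholder components and `run_pipeline` are all hypothetical, and the components merely record that they ran instead of predicting anything.

```python
def make_doc(text):
    """Toy 'tokenizer': produce a doc-like dict from raw text."""
    return {"text": text, "tokens": text.split(), "processed_by": []}


def tagger(doc):
    doc["processed_by"].append("tagger")  # a real tagger would assign tags
    return doc


def parser(doc):
    doc["processed_by"].append("parser")  # a real parser would assign dependencies
    return doc


def ner(doc):
    doc["processed_by"].append("ner")  # a real recognizer would assign entities
    return doc


# Component names paired with callables, applied in order.
PIPELINE = [("tagger", tagger), ("parser", parser), ("ner", ner)]


def run_pipeline(text):
    doc = make_doc(text)
    for name, component in PIPELINE:
        # Each component receives the doc and returns it for the next one.
        doc = component(doc)
    return doc


doc = run_pipeline("This is a sentence.")
print([name for name, _ in PIPELINE])
print(doc["processed_by"])
```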

## Vocab, hashes and lexemes

Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents. To save memory, spaCy also encodes all strings to hash values – in this case for example, "coffee" has the hash 3197928453018144401. Entity labels like "ORG" and part-of-speech tags like "VERB" are also encoded. Internally, spaCy only "speaks" in hash values.


![](https://spacy.io/assets/img/vocab_stringstore.svg)


If you process lots of documents containing the word "coffee" in all kinds of different contexts, storing the exact string "coffee" every time would take up way too much space. So instead, spaCy hashes the string and stores it in the StringStore. You can think of the StringStore as a lookup table that works in both directions – you can look up a string to get its hash, or a hash to get its string:

In [66]:
doc = nlp(u'I love coffee')
vocab_string = doc.vocab.strings['coffee']

print(vocab_string)

3197928453018144401


In [69]:
doc.vocab.strings[vocab_string]

'coffee'
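
The two-way lookup can be mimicked with a small class. This is a rough sketch, not spaCy's StringStore: `ToyStringStore` is a hypothetical name, and the hash function (the first 8 bytes of a SHA-256 digest) is an arbitrary stand-in chosen for the example, so its values will not match the hashes spaCy prints.

```python
import hashlib


class ToyStringStore:
    """Maps strings to 64-bit integer hashes and hashes back to strings."""

    def __init__(self):
        self._strings_by_hash = {}

    @staticmethod
    def _hash(string):
        # Stand-in hash function for illustration only.
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, string):
        key = self._hash(string)
        self._strings_by_hash[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String in, hash out: the hash can always be computed.
            return self._hash(key)
        # Hash in, string out: only works for strings added earlier.
        return self._strings_by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(coffee_hash)
print(store[coffee_hash])
```

The asymmetry is worth noting: a hash can be computed from any string, but going from a hash back to a string only works if that string was stored beforehand.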

Now that all strings are encoded, the entries in the vocabulary don't need to include the word text themselves. Instead, they can look it up in the StringStore via its hash value. Each entry in the vocabulary, also called Lexeme, contains the context-independent information about a word. For example, no matter if "love" is used as a verb or a noun in some context, its spelling and whether it consists of alphabetic characters won't ever change. Its hash value will also always be the same.

In [73]:
doc = nlp('I love coffee')
for word in doc:
    lexeme = doc.vocab[word.text]
    
    print('{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
          lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_))

I	4690420944186131903	X	I	I	True	False	True	en
love	3702023516439754181	xxxx	l	ove	True	False	False	en
coffee	3197928453018144401	xxxx	c	fee	True	False	False	en
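
Several of the columns above depend only on the characters of the word, which is why they can live in a shared vocabulary entry. The sketch below derives them in plain Python. It is an approximation for illustration: `word_shape`, `ToyLexeme` and `make_lexeme` are hypothetical names, and the rule of capping repeated shape characters at four is inferred from outputs such as "xxxx" for "coffee" rather than taken from spaCy's source.

```python
from dataclasses import dataclass


def word_shape(text, max_run=4):
    """Map letters to X/x and digits to d, capping runs at max_run."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and symbols are kept as-is
        if symbol == last:
            run += 1
        else:
            last = symbol
            run = 1
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)


@dataclass(frozen=True)
class ToyLexeme:
    """Context-independent facts about a word type."""
    text: str
    shape: str
    prefix: str
    suffix: str
    is_alpha: bool
    is_digit: bool
    is_title: bool


def make_lexeme(text):
    return ToyLexeme(
        text=text,
        shape=word_shape(text),
        prefix=text[:1],    # first character
        suffix=text[-3:],   # last three characters
        is_alpha=text.isalpha(),
        is_digit=text.isdigit(),
        is_title=text.istitle(),
    )


for word in "I love coffee".split():
    print(make_lexeme(word))
```

None of these values require knowing the sentence the word appears in, in contrast to attributes like the part-of-speech tag or dependency label.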
