## Statistical Models

### What are statistical models?
Statistical models enable spaCy to predict linguistic attributes in context. These usually include:
- Part-of-speech tags
- Syntactic dependencies
- Named entities

Models are trained on large datasets of labeled example texts. They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

### Model Packages

spaCy provides a number of pre-trained model packages that can be downloaded using the "spacy download" command. One such package is the "en_core_web_sm" package. This is a small English model that supports all core capabilities and is trained on text from the web.

In [14]:
#!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
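
Model packages are installed as ordinary Python packages, so one way to check whether a model is available before loading it is to ask Python's import machinery. The sketch below uses only the standard library. The helper name is_installed is made up for this example, and the printed hint assumes the "python -m spacy download" form of the download command.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level Python package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


model_name = "en_core_web_sm"

if is_installed(model_name):
    print(f"{model_name} is installed")
else:
    # Hint for the user: the download command, to be run from a shell
    print(f"{model_name} is missing; run: python -m spacy download {model_name}")
```

Checking up front like this gives a clearer message than the error raised when loading a package that is not there.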

You can use the spacy.load function to load a model package by name. The call returns a Language object, which is conventionally assigned to a variable named nlp.

In [15]:
import spacy

nlp = spacy.load('en_core_web_sm')

The package comes with a pre-trained model that contains binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

### Predicting part-of-speech tags

spaCy exposes the predicted part-of-speech tag of each token through the pos_ attribute.

In spaCy, attributes that return strings usually end with an underscore. Attributes without the underscore return an integer ID.

In [16]:
doc = nlp("She ate the pizza")

In [17]:
for token in doc:
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


### Predicting Syntactic Dependencies

The spaCy model can also predict how the words in a sentence are related.
- The dep_ attribute returns the predicted dependency label.
- The head attribute returns the syntactic head token, in other words the parent token this word is attached to.

In [18]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


To describe syntactic dependencies, spaCy uses a standardized label scheme.
- The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".
- The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".
- The determiner "the", also known as an article, is attached to the noun "pizza".

Another example

In [19]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

It PRON nsubj ’s
’s VERB compound official
official NOUN ROOT official
: PUNCT punct official
Apple PROPN nsubj is
is AUX ROOT is
the DET det company
first ADJ amod company
U.S. PROPN nmod company
public ADJ amod company
company NOUN attr is
to PART aux reach
reach VERB relcl company
a DET det value
$ SYM quantmod trillion
1 NUM compound trillion
trillion NUM nummod value
market NOUN compound value
value NOUN dobj reach


### Predicting Named Entities

Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The ents property on a document object lets you access the named entities predicted by the model.

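It returns a tuple of Span objects, so we can iterate over it and print each entity's text and label. The label attribute returns the label's integer ID, which is what the output below shows, while label_ returns the label as a string.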

In [20]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for entity in doc.ents:
    print(entity.text, entity.label)

Apple 383
U.K. 384
$1 billion 394


### Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. In the example below, the model does not recognize "iPhone X" as an entity.

In [21]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple 383
Missing entity: iPhone X


### The explain function

A quick tip: spacy.explain returns a short definition for the most common tags and labels. It works for part-of-speech tags, dependency labels and entity labels alike.

In [22]:
spacy.explain('dobj')

'direct object'

In [23]:
spacy.explain('NNP')

'noun, proper singular'
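
Several labels can be looked up in one go. The list below is just a selection of labels that appear in this section's outputs; the last call illustrates that a term missing from spaCy's glossary yields None rather than a description.

```python
import spacy

# A selection of labels that appear in the outputs of this section
labels = ["PRON", "nsubj", "dobj", "det", "NNP"]

for label in labels:
    print(label, "->", spacy.explain(label))

# A term that is not in the glossary returns None
print(spacy.explain("not-a-real-label"))
```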