# spaCy: large-scale Natural Language Processing

### Features

* __Tokenization__:	
    Segmenting text into words, punctuation marks, etc.
* __Part-of-speech (POS) Tagging__:	
    Assigning word types to tokens, like verb or noun.
* __Dependency Parsing__:	
    Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
* __Lemmatization__:	
    Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
* __Sentence Boundary Detection (SBD)__:	
    Finding and segmenting individual sentences.
* __Named Entity Recognition (NER)__:	
    Labelling named "real-world" objects, like persons, companies or locations.
* __Similarity__:	
    Comparing words, text spans and documents to measure how similar they are to each other.
* __Text Classification__:	
    Assigning categories or labels to a whole document, or parts of a document.
* __Rule-based Matching__:	
    Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
* __Training__:	
    Updating and improving a statistical model's predictions.
* __Serialization__:	
    Saving objects to files or byte strings.

In [26]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('default')

In [27]:
import spacy

In [33]:
# validation
!python -m spacy validate


    Installed models (spaCy v2.0.5)
    /home/bjpcjp/miniconda3/lib/python3.6/site-packages/spacy

    TYPE        NAME                  MODEL                 VERSION
    package     en-core-web-sm        en_core_web_sm        2.0.0    ✔
    link        en                    en_core_web_sm        2.0.0    ✔


In [34]:
!pytest

platform linux -- Python 3.6.3, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/bjpcjp/projects/nlp/spaCy, inifile:
collected 0 items



In [35]:
# load a statistical model - in this case, for English:
nlp = spacy.load('en')

# spacy.load() returns a Language object, conventionally named 'nlp'.

### Tokenization

In [36]:
# let's try a sample document.
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

In [37]:
# what tokens have been found?
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


* After tokenization, spaCy can parse and tag a Doc. The statistical model lets spaCy predict which tag or label most likely applies in a given context.
* A model consists of binary data and is built by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.
* Linguistic annotations are available as __Token attributes__. spaCy encodes all strings as hash values to reduce memory usage and improve efficiency, so the plain attribute holds an integer ID. To get the readable string, add an underscore to the attribute name: token.lemma_ rather than token.lemma.

* __Text__: The original word text.
* __Lemma__: The base form of the word.
* __POS__: The simple part-of-speech tag.
* __Tag__: The detailed part-of-speech tag.
* __Dep__: Syntactic dependency, i.e. the relation between tokens.
* __Shape__: The word shape – capitalisation, punctuation, digits.
* __is alpha__: Does the token consist of alphabetic characters?
* __is stop__: Is the token part of a stop list, i.e. the most common words of the language?
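
The hash-and-look-up idea described above can be sketched in plain Python. This is a toy illustration only: the `ToyStringStore` class and its use of `hashlib` are assumptions made for this example and are not spaCy's actual implementation. It only shows why an attribute can exist both as an integer ID and as a readable string.

```python
import hashlib


class ToyStringStore:
    """Hypothetical stand-in for a table that maps hash IDs back to strings."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        # Derive a stable 64-bit integer ID from the string's bytes.
        digest = hashlib.sha256(text.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # Resolve an integer ID back to its readable string.
        return self._by_hash[key]


store = ToyStringStore()
be_id = store.add("be")   # plays the role of an integer attribute such as token.lemma
print(be_id)              # an integer ID
print(store[be_id])       # plays the role of token.lemma_; prints: be
```

Storing each distinct string once and passing integers around is what keeps the per-token memory cost low.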

In [39]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
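
The shape column in the output above (Xxxxx, X.X., d, xxxx) can be reproduced with a few lines of plain Python. The `toy_shape` function below is a hypothetical sketch written to match those outputs; it assumes that runs of the same shape character are cut off after four, and it is not taken from spaCy's source.

```python
def toy_shape(text):
    """Map a word to a coarse shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as they are
        # Count how long the current run of identical shape characters is.
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= 4:  # assumption: runs longer than four are truncated
            shape.append(mapped)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, toy_shape(word))
```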


In [40]:
# display dependencies
from spacy import displacy

In [49]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 80})

In [42]:
displacy.render(doc, style='ent', jupyter=True)

### Named Entities

* A named entity is a "real-world object" that's assigned a name – a person, a country, a product or a book title. 
* spaCy can recognise various types of named entities in a document by asking the model for a prediction. 
* Because models are statistical and strongly depend on their training examples, this doesn't always work perfectly and might need some tuning depending on your use case.
* Named entities are available as the ents property of a Doc.

In [54]:
for ent in doc.ents:
    print(ent.text, "\t",         # original entity text
          ent.start_char, "\t",   # character offset where the entity starts
          ent.end_char, "\t",     # character offset just past the entity's end
          ent.label_, "\t")       # entity label, i.e. its type

Apple 	 0 	 5 	 ORG 	
U.K. 	 27 	 31 	 GPE 	
$1 billion 	 44 	 54 	 MONEY 	
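
The start and end values printed above are character offsets into the original text, with the end offset exclusive. A quick check in plain Python, using only the sample sentence and the numbers from the output, recovers each entity by slicing:

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above.
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing with [start:end] returns the entity text exactly.
recovered = [text[start:end] for start, end, _ in spans]

for (start, end, label), entity in zip(spans, recovered):
    print(entity, start, end, label)
```

This is handy when entity annotations need to be mapped back onto the raw string, for example to highlight them in a UI.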


### Word Vectors and Similarity

* spaCy can compare two objects and predict their similarity. This is useful for building recommendation systems or flagging duplicates. For example, you can suggest content that's similar to what a user is currently viewing, or label a support ticket as a duplicate if it's very similar to an already existing one.
* Each Doc, Span and Token comes with a __.similarity()__ method. Of course similarity is always subjective – whether "dog" and "cat" are similar really depends on how you're looking at it. spaCy's similarity model usually assumes a pretty general-purpose definition of similarity.
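
As a rough sketch of what such a comparison involves, the example below assumes that similarity is the cosine of two vectors and that a multi-word text is represented by the average of its word vectors. The three-dimensional vectors are hypothetical values invented for illustration; real models use far more dimensions, and the numbers here will not match any spaCy output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise average, used here to represent a multi-word text."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.1],
    "banana": [0.1, 0.2, 0.9],
}

for w1, v1 in toy_vectors.items():
    for w2, v2 in toy_vectors.items():
        print(w1, w2, round(cosine_similarity(v1, v2), 3))

# A two-word "document" compared against a single word.
pets = mean_vector([toy_vectors["dog"], toy_vectors["cat"]])
print(round(cosine_similarity(pets, toy_vectors["banana"]), 3))
```

Because the result is computed in floating point, comparing a vector with itself can land a hair above or below 1.0 rather than exactly on it.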

In [24]:
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    print(token1)
    for token2 in tokens:
        print(token1.similarity(token2))

dog
1.0
0.53906965
0.28761008
cat
0.53906965
1.0000001
0.48752162
banana
0.28761008
0.48752162
1.0


In [25]:
doc1 = nlp(u'the fries were gross.')
doc2 = nlp(u'worst fries ever.')
print(doc1.similarity(doc2))

0.6123773023244291
