# spaCy: large-scale Natural Language Processing

### Features

* __Tokenization__:	
    Segmenting text into words, punctuation marks, etc.
* __Part-of-speech (POS) Tagging__:	
    Assigning word types to tokens, like verb or noun.
* __Dependency Parsing__:	
    Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
* __Lemmatization__:	
    Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
* __Sentence Boundary Detection (SBD)__:	
    Finding and segmenting individual sentences.
* __Named Entity Recognition (NER)__:	
    Labelling named "real-world" objects, like persons, companies or locations.
* __Similarity__:	
    Comparing words, text spans and documents to measure how similar they are to each other.
* __Text Classification__:	
    Assigning categories or labels to a whole document, or parts of a document.
* __Rule-based Matching__:	
    Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
* __Training__:	
    Updating and improving a statistical model's predictions.
* __Serialization__:	
    Saving objects to files or byte strings.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('default')

In [2]:
import spacy

In [3]:
# validation
!python -m spacy validate


    Installed models (spaCy v2.0.5)
    /home/bjpcjp/miniconda3/lib/python3.6/site-packages/spacy

    TYPE        NAME                  MODEL                 VERSION
    package     en-core-web-sm        en_core_web_sm        2.0.0    ✔
    package     en-core-web-lg        en_core_web_lg        2.0.0    ✔
    package     de-core-news-sm       de_core_news_sm       2.0.0    ✔
    link        en_core_web_lg        en_core_web_lg        2.0.0    ✔
    link        en                    en_core_web_sm        2.0.0    ✔
    link        de                    de_core_news_sm       2.0.0    ✔


In [4]:
!pytest

platform linux -- Python 3.6.3, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/bjpcjp/projects/nlp/spaCy, inifile:
collected 0 items



In [5]:
# load a statistical model - in this case, for English:
nlp = spacy.load('en')

# spacy.load() returns a Language object, conventionally named 'nlp'.

### Tokenization

In [6]:
# let's try a sample document.
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

In [7]:
# what tokens have been found?
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


* After tokenization, spaCy can parse and tag a Doc. The statistical model enables spaCy to predict which tag or label most likely applies in this context. 
* A model consists of binary data and is built by showing a system enough examples to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.
* Linguistic annotations are available as __Token attributes__. spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. To get the readable string representation of an attribute, append an underscore to its name: token.pos_ returns the string, while token.pos returns its integer ID.

* __Text__: The original word text.
* __Lemma__: The base form of the word.
* __POS__: The simple part-of-speech tag.
* __Tag__: The detailed part-of-speech tag.
* __Dep__: Syntactic dependency, i.e. the relation between tokens.
* __Shape__: The word shape – capitalisation, punctuation, digits.
* __is alpha__: Does the token consist of alphabetic characters?
* __is stop__: Is the token part of a stop list, i.e. the most common words of the language?
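
The __Shape__ attribute can be approximated in a few lines of plain Python. This is a minimal sketch, not spaCy's own code: the function name word_shape and the cap of four repeated shape characters are assumptions, chosen so that the results agree with shapes such as Xxxxx and X.X. for this notebook's sample sentence.

```python
def word_shape(text, max_run=4):
    """Map a word to an abstract shape: X = uppercase, x = lowercase, d = digit.

    Runs of the same shape character are cut off after max_run characters
    (an assumed limit, not taken from the spaCy source).
    """
    shape = []
    last = None
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, word_shape(word))
```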

In [8]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


In [9]:
# display dependencies
from spacy import displacy

In [10]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 80})

In [11]:
displacy.render(doc, style='ent', jupyter=True)

### Named Entities

* A named entity is a "real-world object" that's assigned a name – a person, a country, a product or a book title. 
* spaCy can recognise various types of named entities in a document by asking the model for a prediction. 
* Because models are statistical and strongly depend on their training examples, this doesn't always work perfectly and might need some tuning depending on your use case.
* Named entities are available as the ents property of a Doc.

In [12]:
for ent in doc.ents:
    print(ent.text, "\t",         # original entity text
          ent.start_char, "\t",   # character offset where the entity starts
          ent.end_char, "\t",     # character offset where the entity ends (exclusive)
          ent.label_, "\t")       # entity label, i.e. its type

Apple 	 0 	 5 	 ORG 	
U.K. 	 27 	 31 	 GPE 	
$1 billion 	 44 	 54 	 MONEY 	
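
The start and end values printed above are character offsets into the original string, with the end offset exclusive, so each entity can be recovered with an ordinary slice. The sketch below needs no model; the helper name slice_entities is made up for illustration, and the spans are copied from the output above.

```python
text = 'Apple is looking at buying U.K. startup for $1 billion'

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 5, 'ORG'), (27, 31, 'GPE'), (44, 54, 'MONEY')]


def slice_entities(text, spans):
    """Return (entity text, label) pairs by slicing the raw string."""
    return [(text[start:end], label) for start, end, label in spans]


for entity_text, label in slice_entities(text, spans):
    print(entity_text, label)
```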


### Word Vectors and Similarity

* spaCy can compare two objects & predict their similarity. This is useful for building recommendation systems or flagging duplicates. For example, you can suggest content that's similar to what a user is currently viewing, or label a support ticket as a duplicate if it's very similar to an already existing one.
* Each Doc, Span and Token comes with a __.similarity()__ method. Of course similarity is always subjective – whether "dog" and "cat" are similar really depends on how you're looking at it. spaCy's similarity model usually assumes a pretty general-purpose definition of similarity.
* Similarity is found by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.
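
Comparing two word vectors usually means taking the cosine of the angle between them. The sketch below assumes cosine similarity is the measure in use and runs on three-dimensional toy vectors that are invented for illustration; real word vectors have hundreds of dimensions and come from a trained model.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up toy vectors, NOT taken from any spaCy model
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

for word, vector in toy_vectors.items():
    print(word, cosine_similarity(toy_vectors["dog"], vector))
```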

In [13]:
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    print(token1)
    for token2 in tokens:
        print(token1.similarity(token2))

dog
1.0
0.53906965
0.28761008
cat
0.53906965
1.0000001
0.48752162
banana
0.28761008
0.48752162
1.0


In [14]:
doc1 = nlp(u'the fries were gross.')
doc2 = nlp(u'worst fries ever.')
print(doc1.similarity(doc2))

0.6123773023244291


* To make them compact and fast, spaCy's small models (all packages that end in __sm__) don't ship with word vectors, and only include context-sensitive tensors. 
* This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won't be as good, and individual tokens won't have any vectors assigned. __So in order to use real word vectors, you need to download a larger model.__

In [None]:
!python -m spacy download en_core_web_lg

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB)
    3% |█▏                              | 31.9MB 494kB/s eta 0:27:405^C
Traceback (most recent call last):
  File "/home/bjpcjp/miniconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bjpcjp/miniconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bjpcjp/miniconda3/lib/python3.6/site-packages/spacy/__main__.py", line 31, in <module>
    plac.call(commands[command])
  File "/home/bjpcjp/miniconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/bjpcjp/miniconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extr

In [None]:
nlp = spacy.load('en_core_web_lg')
tokens = nlp(u'dog cat banana sasquatch')

for token in tokens:
    print(token.text,         # Text: The original token text.
          token.has_vector,   # has_vector: Does the token have a vector representation?
          token.vector_norm,  # vector_norm: The L2 norm of the token's vector (square root of the sum of its squared values)
          token.is_oov        # is_OOV: Is the word out-of-vocabulary?
         )

### Pipelines

![example](pipeline.png)

* __Tokenizer__: creates Doc; segments text into tokens.
* __Tagger__: creates Doc[i].tag; assigns part-of-speech tags.
* __Parser__: creates dependency labels (head, dep, sents, noun_chunks)
* __Ner__: creates .ents, .ent_iob, .ent_type; detects/labels named entities.
* __Textcat__: creates .cats; assigns document labels.
* __...__: assigns custom attributes/methods/properties.
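
The idea of a pipeline can be sketched without spaCy: the tokenizer builds a document, and each later component receives that document, adds its annotations, and hands it on. Everything below (the dict-based document, the toy components, run_pipeline) is hypothetical and only mirrors the order of the steps listed above.

```python
def tokenizer(text):
    """First step: turn raw text into a document (here a plain dict)."""
    return {"text": text, "tokens": text.split()}


def toy_tagger(doc):
    """Toy tagger: marks digit-only tokens as NUM and everything else as WORD."""
    doc["tags"] = ["NUM" if tok.isdigit() else "WORD" for tok in doc["tokens"]]
    return doc


def toy_counter(doc):
    """Stand-in for a custom component that adds its own attribute."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def run_pipeline(text, components):
    """Tokenize, then call each (name, component) pair in order."""
    doc = tokenizer(text)
    for name, component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("spaCy has 3 steps", [("tagger", toy_tagger), ("counter", toy_counter)])
print(doc)
```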

### Vocab, hashes & lexemes

* spaCy tries to store data in a vocabulary, the Vocab, that is shared by multiple documents.
* To save memory, spaCy also encodes all strings to hash values. Example: "coffee" = hash 3197928453018144401. 
* Entity labels like "ORG" and part-of-speech tags like "VERB" are also encoded. 
* Internally, spaCy only "speaks" in hash values.

* If you process lots of documents containing the word "coffee" in many contexts, storing the exact string "coffee" every time would take up far too much space.
* spaCy instead hashes the string and stores it in the StringStore. Think of StringStore as a 2-way lookup table – you can look up a string to get its hash, or a hash to get its string:

In [None]:
doc = nlp(u'I like coffee')

assert doc.vocab.strings[u'coffee']           == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'
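
The two-way lookup can be sketched in plain Python. This is an illustration, not spaCy's StringStore: the class TinyStringStore is hypothetical, and a truncated SHA-256 digest stands in for whatever hash function spaCy uses, so the numbers differ from the real ones. Going from string to hash is a pure computation; going from hash to string only works if that string was added to this particular store.

```python
import hashlib


class TinyStringStore:
    """Hypothetical two-way table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(text):
        # Stand-in hash: first 8 bytes of SHA-256, read as a 64-bit integer
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        """Store the string and return its hash."""
        key = self._hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self._hash(item)   # string -> hash needs no table
        return self._by_hash[item]    # hash -> string needs the table


store = TinyStringStore()
coffee_hash = store.add("coffee")
print(coffee_hash, store[coffee_hash])

other_store = TinyStringStore()
try:
    other_store[coffee_hash]
except KeyError:
    print("this store has never seen that hash")
```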

In [None]:
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text,       # original text
          lexeme.orth,       # hash value
          lexeme.shape_,     # abstract word shape
          lexeme.prefix_,    # 1st letter of word string
          lexeme.suffix_,    # last 3 letters of word string
          lexeme.is_alpha,   # consists of alpha characters?
          lexeme.is_digit,   # consists of digits?
          lexeme.is_title,   # is the word in title case?
          lexeme.lang_)      # language of the parent vocabulary

* Hashes cannot be reversed: there's no way to compute "coffee" from 3197928453018144401 alone.
* All spaCy can do is look it up in the vocabulary. That's why you always need to make sure all objects you create __have access to the same vocabulary__. If they don't, spaCy might not be able to find the strings it needs.

In [None]:
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = nlp(u'I like coffee') # original Doc
assert doc.vocab.strings[u'coffee']           == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'

empty_doc = Doc(Vocab()) # new Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] would raise an error here

empty_doc.vocab.strings.add(u'coffee') # add "coffee" and generate hash
assert empty_doc.vocab.strings[3197928453018144401] == u'coffee' # now the lookup works

new_doc = Doc(doc.vocab) # create new doc with first doc's vocab
assert new_doc.vocab.strings[3197928453018144401] == u'coffee' # shared vocab, so the lookup works

### Serialization

* If you have modified the pipeline, vocabulary, vectors or entities, or updated the model, you'll want to save your progress.
* This means translating an object's contents and structure into a format that can be saved. This process is called __serialization__.
* spaCy comes with built-in serialization methods and supports the Pickle protocol.


* __to_bytes__: returns bytes; example: nlp.to_bytes()
* __from_bytes__: returns the object; example: nlp.from_bytes(bytes)
* __to_disk__: returns nothing; example: nlp.to_disk('/path')
* __from_disk__: returns the object; example: nlp.from_disk('/path')
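
The four methods follow a simple pattern that can be sketched on a toy object. The class TinyStrings below is hypothetical and uses JSON as its byte format purely for illustration; it says nothing about the format spaCy writes. It only shows the round trip: the to_ methods export state, and the from_ methods load state into an existing object and return it.

```python
import json
import os
import tempfile


class TinyStrings:
    """Hypothetical object with the same four method names."""

    def __init__(self, strings=None):
        self.strings = list(strings or [])

    def to_bytes(self):
        return json.dumps(self.strings).encode("utf8")

    def from_bytes(self, data):
        self.strings = json.loads(data.decode("utf8"))
        return self  # return the object so calls can be chained

    def to_disk(self, path):
        with open(path, "wb") as f:
            f.write(self.to_bytes())

    def from_disk(self, path):
        with open(path, "rb") as f:
            return self.from_bytes(f.read())


original = TinyStrings(["coffee", "beer"])

# Round trip through a byte string
restored = TinyStrings().from_bytes(original.to_bytes())

# Round trip through a file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "strings.json")
    original.to_disk(path)
    reloaded = TinyStrings().from_disk(path)

print(restored.strings, reloaded.strings)
```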

In [None]:
# read the sample text; the with-block closes the file afterwards
with open('3quotes.txt', 'r') as f:
    three_quotes = f.read()
doc = nlp(three_quotes)
#doc.to_disk('/3quotes.bin')
doc

### Training
* spaCy models are statistical. Every "decision" they make is a prediction that is based on the examples the model has seen during training.
* To train a model, you need training data – examples of text, and corresponding labels.
* The model is then shown the unlabelled text and makes a prediction. Because we know the correct answer, we can give the model feedback in the form of an __error gradient__ of the loss function.
* The loss function calculates the difference between the model's prediction and the expected output. The greater the difference, the more significant the gradient and the updates to our model.
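
The feedback loop can be illustrated with a one-weight model in plain Python. This is a sketch under strong simplifying assumptions (a single feature, a logistic prediction, log loss, plain gradient descent) and is unrelated to spaCy's actual networks; all names are made up. It shows the point made above: the further the prediction is from the label, the larger the gradient and the larger the update.

```python
import math


def predict(weight, feature):
    """Probability that the label is 1, from a one-weight logistic model."""
    return 1.0 / (1.0 + math.exp(-weight * feature))


def gradient(weight, feature, label):
    """Gradient of the log loss with respect to the weight."""
    return (predict(weight, feature) - label) * feature


def train(examples, weight=0.0, learn_rate=0.5, epochs=10):
    """Repeatedly nudge the weight against the gradient."""
    for _ in range(epochs):
        for feature, label in examples:
            weight -= learn_rate * gradient(weight, feature, label)
    return weight


# A badly wrong model gets a larger correction than a nearly right one
print(abs(gradient(-3.0, 1.0, 1.0)), abs(gradient(3.0, 1.0, 1.0)))

trained = train([(1.0, 1.0)])
print(trained, predict(trained, 1.0))
```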

![training](training.png)

### Language Data

* Every language is full of exceptions and special cases, especially amongst the most common words. Some exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. 
* The __lang__ module contains all language-specific data in simple Python files. This makes the data easy to update and extend.
* The shared language data in the directory root includes rules that can be generalised across languages (basic punctuation, emoji, emoticons, single-letter abbreviations and norms for equivalent tokens with different spellings, like " and ”, etc.) This helps the models make more accurate predictions. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German.
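
The split between shared and language-specific data can be pictured as lookup tables that are merged when a language is assembled. The tables and helper names below are hypothetical and the entries are invented examples; the sketch only shows the layering, where shared rules are loaded first and a language's own rules are added on top.

```python
# Hypothetical shared data: norms that would apply to every language
BASE_NORMS = {"\u201c": '"', "\u201d": '"'}

# Hypothetical English-only data
ENGLISH_NORMS = {"colour": "color"}


def combine_norms(*tables):
    """Merge tables left to right; later tables override earlier ones."""
    merged = {}
    for table in tables:
        merged.update(table)
    return merged


def normalise(token_text, norms):
    """Return the norm for a token, or the token itself if none is listed."""
    return norms.get(token_text, token_text)


english_norms = combine_norms(BASE_NORMS, ENGLISH_NORMS)

for token_text in ["\u201d", "colour", "rat"]:
    print(token_text, normalise(token_text, english_norms))
```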

![language](language.png)

### Lightning Tour:
* Install models & process text:

In [None]:
!python -m spacy download en
!python -m spacy download de

In [None]:
import spacy
nlp = spacy.load('en')
doc = nlp(u'Hello, world. Here are two sentences.')

nlp_de = spacy.load('de')
doc_de = nlp_de(u'Ich bin ein Berliner.')

* Get tokens, noun chunks & sentences

In [None]:
doc = nlp(u"Peach emoji is where it has always been. Peach is the superior "
          u"emoji. It's outranking eggplant 🍑 ")

assert doc[0].text == u'Peach'
assert doc[1].text == u'emoji'
assert doc[-1].text == u'🍑'
assert doc[17:19].text == u'outranking eggplant'
assert list(doc.noun_chunks)[0].text == u'Peach emoji'

sentences = list(doc.sents)
assert len(sentences) == 3
assert sentences[1].text == u'Peach is the superior emoji.'

* Get part-of-speech tags & flags

In [None]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
apple = doc[0]

#assert [apple.pos_, apple.pos] == [u'PROPN', 17049293600679659579]
#assert [apple.tag_, apple.tag] == [u'NNP', 15794550382381185553]
#assert [apple.shape_, apple.shape] == [u'Xxxxx', 16072095006890171862]
assert apple.is_alpha == True
assert apple.is_punct == False

billion = doc[10]
assert billion.is_digit == False
assert billion.like_num == True
assert billion.like_email == False

* Use hash values for any string

In [None]:
doc = nlp(u'I love coffee')

coffee_hash = nlp.vocab.strings[u'coffee'] # 3197928453018144401
coffee_text = nlp.vocab.strings[coffee_hash] # 'coffee'

assert doc[2].orth == coffee_hash == 3197928453018144401
assert doc[2].text == coffee_text == u'coffee'

beer_hash = doc.vocab.strings.add(u'beer') # 3073001599257881079
beer_text = doc.vocab.strings[beer_hash] # 'beer'

unicorn_hash = doc.vocab.strings.add(u'🦄 ') # 18234233413267120783
unicorn_text = doc.vocab.strings[unicorn_hash] # '🦄 '

* Recognize named entities

In [None]:
doc = nlp(u'San Francisco considers banning sidewalk delivery robots')
ents = [(
    ent.text, 
    ent.start_char, 
    ent.end_char, 
    ent.label_
) for ent in doc.ents]

assert ents == [(u'San Francisco', 0, 13, u'GPE')]

from spacy.tokens import Span
doc = nlp(u'Netflix is hiring a new VP of global policy')

doc.ents = [Span(
    doc, 0, 1, 
    label=doc.vocab.strings[u'ORG'])]

ents = [(
    ent.start_char, 
    ent.end_char, 
    ent.label_
) for ent in doc.ents]

assert ents == [(0, 7, u'ORG')]

* Train neural net models

In [None]:
import spacy
import random

nlp = spacy.load('en')
train_data = [
    ("Uber blew through $1 million", 
     {'entities': [(0, 4, 'ORG')]})]

with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
# save outside the with-block, once the disabled pipes have been restored
nlp.to_disk('/model')