In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_lg

### Predicting Part-of-speech Tags

In [3]:
import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp("The doctor finished her work")

for token in doc:
    print(token.text, token.pos_)

The DET
doctor NOUN
finished VERB
her PRON
work NOUN


### Predicting Syntactic Dependencies

In [4]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

The DET det doctor
doctor NOUN nsubj finished
finished VERB ROOT finished
her PRON poss work
work NOUN dobj finished


In [5]:
doc_longer = nlp("The doctor finished her work, but the designer was not happy about it")

for token in doc_longer:
    print(token.text, token.pos_, token.dep_, token.head.text)

The DET det doctor
doctor NOUN nsubj finished
finished VERB ROOT finished
her PRON poss work
work NOUN dobj finished
, PUNCT punct finished
but CCONJ cc finished
the DET det designer
designer NOUN nsubj was
was AUX conj finished
not PART neg was
happy ADJ acomp was
about ADP prep happy
it PRON pobj about


### The explain method

In [6]:
spacy.explain('CCONJ')

'coordinating conjunction'

### Predicting named entities in context

In [7]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

Apple ORG


### Match patterns

Match exact token texts:
`[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]`

Match lexical attributes:
`[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`

Match any token attributes:
`[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`

This pattern would match phrases like "buying milk" or "bought flowers".

In [8]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

pattern = [{'ORTH':  'iPhone'}, {'ORTH': 'X'}]
matcher.add('IPHONE_PATTERN', [pattern]) # first arg: unique id

doc = nlp('New iPhone X release date leaked')

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


In [9]:
### Matching lexical attributes

pattern = [ # matches the tokens '2018 FIFA World Cup:'
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True},
] 

doc = nlp('2018 FIFA World Cup: France won!')
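The cell above only defines the pattern and the doc. As a minimal sketch of how such a pattern could be run, the following uses a blank English pipeline (an assumption made here so the example needs no trained model). That should be enough because `IS_DIGIT`, `LOWER` and `IS_PUNCT` are lexical attributes. The names `blank_nlp`, `fifa_matcher` and `FIFA_PATTERN` are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only (assumption for this sketch)
blank_nlp = spacy.blank("en")
fifa_matcher = Matcher(blank_nlp.vocab)

# Same five-token pattern as above
fifa_pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
fifa_matcher.add("FIFA_PATTERN", [fifa_pattern])

fifa_doc = blank_nlp("2018 FIFA World Cup: France won!")

# Collect the text of every matched span
fifa_matches = [fifa_doc[start:end].text for _, start, end in fifa_matcher(fifa_doc)]
print(fifa_matches)
```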

Using operators and quantifiers: 

- `{'OP': '!'}` = Negation: match 0 times
- `{'OP': '?'}` = Optional: match 0 or 1 times
- `{'OP': '+'}` = Match 1 or more times
- `{'OP': '*'}` = Match 0 or more times

In [12]:
# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher
matcher.add('GADGET', [pattern1, pattern2])
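As a small illustration of the `?` operator, the sketch below runs the "optional digit" pattern on its own. It assumes a blank English pipeline so that no trained model is needed; `blank_nlp`, `gadget_matcher` and the sample sentence are made up for this example. Note that the matcher reports both the shorter and the longer match when the optional token is present.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")  # tokenizer only (assumption for this sketch)
gadget_matcher = Matcher(blank_nlp.vocab)

# 'iphone' (case-insensitive) followed by an optional digit token
gadget_matcher.add("GADGET", [[{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]])

gadget_doc = blank_nlp("I sold my iPhone 8 and kept the other iPhone yesterday")

# Store (start token, end token, text) for each match
gadget_spans = [
    (start, end, gadget_doc[start:end].text)
    for _, start, end in gadget_matcher(gadget_doc)
]
for start, end, text in sorted(gadget_spans):
    print(start, end, text)
```

If overlapping matches are not wanted, `spacy.util.filter_spans` can be applied to the matched spans to keep only the longest one.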

### Doc and Span

In [13]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


### Check structure

In [14]:
doc = nlp("Berlin is a nice city")

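# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun and not the last token
    if token.pos_ == 'PROPN' and token.i + 1 < len(doc):
        # Check if the next token is a verb
        # (this pipeline tags "is" as AUX rather than VERB, so nothing is printed here)
        if doc[token.i + 1].pos_ == 'VERB':
            print('Found a verb after a proper noun!')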

### Check similarity
By default, spaCy computes similarity as the cosine similarity between the word vectors of the two objects.

In [15]:
doc1 = nlp("I like pizza")
doc2 = nlp("I like fast food")

print(doc1.similarity(doc2))

0.8698332283318978
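To make the measure concrete, here is a minimal pure-Python sketch of cosine similarity on hand-made vectors. The vectors and the function name `cosine_similarity` are illustrative only; this is not spaCy's internal implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_parallel = cosine_similarity(vec_a, vec_b)     # approximately 1.0
sim_orthogonal = cosine_similarity(vec_a, vec_c)   # 0.0
print(sim_parallel, sim_orthogonal)
```

Because only the direction of the vectors matters, two texts can score as highly similar even if their vectors differ in length.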


### Adding statistical predictions

In [16]:
matcher = Matcher(nlp.vocab)

matcher.add('DOG', [[{'LOWER': 'golden'}, {'LOWER': 'retriever'}]])
doc = nlp('I have a Golden Retriever')

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span: ', span.text)

    #Get the span's root token and root head token
    print('Root token: ', span.root.text)
    print('Root head token: ', span.root.head.text)

    #Get previous token and its POS tag
    print('Previous token: ', doc[start - 1].text, doc[start - 1].pos_)


Matched span:  Golden Retriever
Root token:  Retriever
Root head token:  have
Previous token:  a DET


##### Using Phrase Matcher

In [17]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
pattern = nlp('Golden Retriever')
matcher.add('DOG', [pattern])

doc = nlp('I have a Golden Retriever')

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span: ', span.text)

    #Get the span's root token and root head token
    print('Root token: ', span.root.text)
    print('Root head token: ', span.root.head.text)

    #Get previous token and its POS tag
    print('Previous token: ', doc[start - 1].text, doc[start - 1].pos_)

Matched span:  Golden Retriever
Root token:  Retriever
Root head token:  have
Previous token:  a DET


### Built-in pipeline components

The part-of-speech tagger sets the `Token.tag` attribute.

`tagger -> part-of-speech tagger -> Token.tag`


The dependency parser sets the `Token.dep` and `Token.head` attributes. It is also responsible for detecting sentences and base noun phrases, also known as noun chunks.

`parser -> dependency parser -> Token.dep | Token.head | Doc.sents | Doc.noun_chunks`


The named entity recognizer adds the detected entities to the `Doc.ents` property. It also sets entity type attributes on the tokens that indicate whether a token is part of an entity.

`ner -> named entity recognizer -> Doc.ents | Token.ent_iob | Token.ent_type`


Finally, the text classifier sets category labels that apply to the whole text and adds them to the `Doc.cats` property. Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default, but you can use it to train your own system.

`textcat -> text classifier -> Doc.cats`

In [18]:
# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7f4338e5da60>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f4338e5dad0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f4338b00550>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7f4338a9ef00>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f4338aa4f00>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7f4338b00850>)]


### Custom components

Why? 

- Make a function execute automatically when you call nlp

- Add your own metadata to documents and tokens

- Update built-in attributes like doc.ents


```
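from spacy.language import Language

@Language.component('custom_component')  # register the function under a name
def custom_component(doc):
    # do something
    return doc

nlp.add_pipe('custom_component')  # spaCy v3 adds components by registered name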
```

In spaCy v3, the first argument to `nlp.add_pipe` is the registered string name of the component (shown as `'component'` below).

Setting `last` to `True` adds the component last in the pipeline. This is the default behavior.

`last -> if true, add last -> ex: nlp.add_pipe('component', last=True)`


Setting `first` to `True` adds the component first in the pipeline, right after the tokenizer.

`first -> if true, add first -> ex: nlp.add_pipe('component', first=True)`


The `before` and `after` arguments take the name of an existing component to add the new component before or after. For example, `before='ner'` adds it before the named entity recognizer. The referenced component must already exist in the pipeline; otherwise, spaCy raises an error.

`before -> add before component -> ex: nlp.add_pipe('component', before='ner')`

`after -> add after component -> ex: nlp.add_pipe('component', after='tagger')`
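The placement arguments can be tried out without a trained model. The following sketch assumes a blank English pipeline plus the built-in `sentencizer`; the two no-op components `order_demo_first` and `order_demo_after` are hypothetical and exist only for this demonstration.

```python
import spacy
from spacy.language import Language

# Hypothetical no-op components, registered only for this demonstration
@Language.component("order_demo_first")
def order_demo_first(doc):
    return doc

@Language.component("order_demo_after")
def order_demo_after(doc):
    return doc

order_nlp = spacy.blank("en")        # blank pipeline: tokenizer only
order_nlp.add_pipe("sentencizer")    # built-in rule-based component
order_nlp.add_pipe("order_demo_first", first=True)
order_nlp.add_pipe("order_demo_after", after="sentencizer")
print(order_nlp.pipe_names)

# 'ner' is not part of this blank pipeline, so before='ner' raises an error
try:
    order_nlp.add_pipe("order_demo_first", name="order_demo_again", before="ner")
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)
```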

In [19]:
from spacy.language import Language

@Language.component('custom_component')
def custom_component(doc):
    #Print the doc's length
    print('Doc length: ', len(doc))

    #important: have to return the modified doc
    return doc

#Add the component first in the pipeline
nlp.add_pipe('custom_component', first=True)

#Print the pipeline component names
print('Pipeline: ', nlp.pipe_names)

Pipeline:  ['custom_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [20]:
doc = nlp('Hello world')

Doc length:  2


In [21]:
# Define the custom component
@Language.component('animal_component')
def animal_component(doc):
    # Create a Span for each match and assign the label 'ANIMAL'
    # and overwrite the doc.ents with the matched spans
    doc.ents = [Span(doc, start, end, label='ANIMAL')
                for match_id, start, end in matcher(doc)]
    return doc
    
# Add the component to the pipeline after the 'ner' component 
nlp.add_pipe('animal_component', after='ner')

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

Doc length:  8
[('Golden Retriever', 'ANIMAL')]


### Setting custom attributes

- Add custom metadata to documents, tokens and spans 

- Accessible via the ._ property

In [22]:
from spacy.tokens import Doc, Token, Span


Doc.set_extension('title', default=None)

# set extension on the Token with default value
Token.set_extension('is_color', default=False)
doc = nlp("The sky is blue.")

#overwrite extension attribute value
doc[3]._.is_color = True

Span.set_extension('has_color', default=False)

Doc length:  5


- Define a getter and an optional setter function 

- The getter is only called when you retrieve the attribute value

In [23]:
#define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

#set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)

doc = nlp("The sky is blue.")

print(doc[3]._.is_color, '-', doc[3].text)

Doc length:  5
True - blue
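The example above uses only a getter. As a hedged sketch of the optional setter, the following registers a Doc extension with both functions. The attribute name `demo_title`, the helper names and the use of `doc.user_data` as storage are choices made for this illustration, and a blank English pipeline is assumed so that no trained model is required.

```python
import spacy
from spacy.tokens import Doc

def get_demo_title(doc):
    # Called every time doc._.demo_title is read
    return doc.user_data.get("demo_title", "untitled")

def set_demo_title(doc, value):
    # Called every time doc._.demo_title is assigned; normalizes whitespace
    doc.user_data["demo_title"] = value.strip()

Doc.set_extension("demo_title", getter=get_demo_title, setter=set_demo_title, force=True)

title_nlp = spacy.blank("en")
title_doc = title_nlp("The sky is blue.")

title_before = title_doc._.demo_title      # getter falls back to the default
title_doc._.demo_title = "  Weather report "
title_after = title_doc._.demo_title       # getter returns the stored value
print(title_before, "|", title_after)
```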


If you want to set extension attributes on a Span, you almost always want to use a property extension with a getter. Otherwise, you'd have to update *every possible span ever* by hand to set all the values. In this example, the `get_has_color` function takes the span and returns whether the text of any of its tokens is in the list of colors. After we've processed the doc, we can check different slices of the doc, and the custom `has_color` property returns whether the span contains a color token.

In [24]:
from spacy.tokens import Span

def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

#set extension on the Span with Getter
Span.set_extension('has_color', getter=get_has_color, force=True)

doc = nlp("The sky is blue.")

print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

Doc length:  5
True - sky is blue
False - The sky


Method extensions make the extension attribute a callable method. You can then pass one or more arguments to it and compute attribute values dynamically, for example based on a certain argument or setting. In this example, the method function checks whether the doc contains a token with a given text. The first argument of the method is always the object itself, in this case the Doc. It's passed in automatically when the method is called. All other function arguments become arguments of the method extension, in this case `token_text`. Here, the custom `has_token` method returns True for the word "blue" and False for the word "cloud".

In [25]:
from spacy.tokens import Doc

#define methods with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

#set extension on the Doc with method
Doc.set_extension('has_token', method=has_token, force=True)
doc = nlp('The sky is blue.')

print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

Doc length:  5
True - blue
False - cloud


In [26]:
# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# Register the Span method extension 'to_html' with the method to_html
Span.set_extension('to_html', method=to_html, force=True)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))

Doc length:  8
<strong>Hello world</strong>


### Processing large volumes of text

- Use `nlp.pipe` method 

- Processes texts as a stream, yields Doc objects

- Much faster than calling `nlp` on each text


BAD 
```
docs = [nlp(text) for text in LOTS_OF_TEXTS]
```

GOOD
```
docs = list(nlp.pipe(LOTS_OF_TEXTS))
```


- Setting `as_tuples=True` on `nlp.pipe` lets you pass in (text, context) tuples

- Yields (doc, context) tuples

- Useful for associating metadata with the doc 

In [27]:
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

Doc length:  4
Doc length:  3
This is a text 15
And another text 16
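One way to keep the context attached to each doc is to copy it into a custom extension attribute while streaming. This is a sketch under assumptions: the extension name `page_number` and the variable names are illustrative, and a blank English pipeline stands in for the loaded model.

```python
import spacy
from spacy.tokens import Doc

# Register a Doc extension that will hold the metadata
Doc.set_extension("page_number", default=None, force=True)

meta_nlp = spacy.blank("en")  # tokenizer only (assumption for this sketch)

meta_data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

meta_docs = []
for doc, context in meta_nlp.pipe(meta_data, as_tuples=True):
    # Copy the context value onto the doc so it travels with it
    doc._.page_number = context["page_number"]
    meta_docs.append(doc)

pages = [(doc.text, doc._.page_number) for doc in meta_docs]
print(pages)
```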


### Using only the tokenizer

- Use `nlp.make_doc` to turn a text into a Doc object

Sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text. Running the whole pipeline is unnecessarily slow, because you'll be getting a bunch of predictions from the model that you don't need.

If you only need a tokenized Doc object, you can use the `nlp.make_doc` method instead, which takes a text and returns a Doc. This is also how spaCy does it behind the scenes: `nlp.make_doc` turns the text into a Doc before the pipeline components are called.

BAD
`doc = nlp("Hello world")`

GOOD
`doc = nlp.make_doc("Hello world")`
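The difference can be made visible with a component that records when it runs. In this sketch the component `call_logger` and the list `logged_texts` are hypothetical, and a blank English pipeline is assumed. Calling the pipeline triggers the component, while `make_doc` only tokenizes.

```python
import spacy
from spacy.language import Language

logged_texts = []

# Hypothetical component that records every doc it processes
@Language.component("call_logger")
def call_logger(doc):
    logged_texts.append(doc.text)
    return doc

log_nlp = spacy.blank("en")
log_nlp.add_pipe("call_logger")

full_doc = log_nlp("Hello world")                     # runs tokenizer + call_logger
tokens_only = log_nlp.make_doc("Hello again world")   # runs the tokenizer only

token_texts = [token.text for token in tokens_only]
print(logged_texts, token_texts)
```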

### Disable pipeline components

- Use `nlp.select_pipes` to temporarily disable one or more pipes (this was `nlp.disable_pipes` in spaCy v2, which v3 keeps only as a deprecated method)

```
with nlp.select_pipes(disable=['tagger', 'parser']):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)
```

- restores them after the `with` block

- only run the remaining components
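The temporary nature of the `with` block can be checked by inspecting the pipeline names before, inside and after it. This sketch uses spaCy v3's `nlp.select_pipes(disable=[...])` on a blank English pipeline with only the built-in `sentencizer`; the variable names are illustrative.

```python
import spacy

toggle_nlp = spacy.blank("en")
toggle_nlp.add_pipe("sentencizer")

names_before = list(toggle_nlp.pipe_names)

with toggle_nlp.select_pipes(disable=["sentencizer"]):
    # Inside the block the disabled component is not part of the active pipeline
    names_inside = list(toggle_nlp.pipe_names)

# After the block the component is restored
names_after = list(toggle_nlp.pipe_names)
print(names_before, names_inside, names_after)
```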

### Why update the model?

- Better results on your specific domain

- Learn classification schemes specifically for your problem

- Essential for text classification 

- Very useful for named entity recognition

- Less critical for part-of-speech tagging and dependency parsing

How? 

1. Initialize the model weights randomly with `nlp.initialize` (called `nlp.begin_training` in spaCy v2)

2. Predict a few examples with the current weights by calling `nlp.update`

3. Compare prediction with true labels 

4. Calculate how to change weights to improve predictions

5. Update weights slightly

6. Go back to 2

#### Example: Training the entity recognizer 

- The entity recognizer tags words and phrases in context

- Each token can only be part of one entity

- Examples need to come with context: 
```
("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})
```

- Texts with no entities are also important: 
```
("I need a new phone! Any tips?", {'entities': []})
```

- The goal is to teach the model to generalize 
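Because the annotations use character offsets, it can help to check that each offset pair lines up with token boundaries before training. The sketch below assumes a blank English pipeline and uses `Doc.char_span`, which returns `None` for offsets that do not align with tokens; the variable names are illustrative.

```python
import spacy

check_nlp = spacy.blank("en")
check_doc = check_nlp.make_doc("iPhone X is coming")

# (0, 8) covers exactly the tokens "iPhone" and "X"
aligned = check_doc.char_span(0, 8, label="GADGET")

# (0, 5) ends in the middle of the token "iPhone", so no span can be created
misaligned = check_doc.char_span(0, 5, label="GADGET")

print(aligned.text, aligned.label_)
print(misaligned)
```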

In [29]:
matcher_c = Matcher(nlp.vocab)

#Creating training data
TEXTS = ['How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here', 'Your iPhone goes up to 11 today', 'I need a new phone! Any tips?']

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher
matcher_c.add('GADGET', [pattern1, pattern2])

for doc in nlp.pipe(TEXTS):
    # Find the matches in the doc
    matches = matcher_c(doc)
    
    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(start, end, 'GADGET') for match_id, start, end in matches]
    print(doc.text, entities)  

Doc length:  6
Doc length:  4
Doc length:  10
Doc length:  6
Doc length:  7
Doc length:  9
How to preorder the iPhone X [(4, 6, 'GADGET'), (4, 5, 'GADGET')]
iPhone X is coming [(0, 2, 'GADGET'), (0, 1, 'GADGET')]
Should I pay $1,000 for the iPhone X? [(7, 9, 'GADGET'), (7, 8, 'GADGET')]
The iPhone 8 reviews are here [(1, 2, 'GADGET'), (1, 3, 'GADGET')]
Your iPhone goes up to 11 today [(1, 2, 'GADGET')]
I need a new phone! Any tips? []


In [30]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc with matcher_c (the GADGET matcher) and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher_c(doc)]
    # Keep only the longest of any overlapping matches: a token can only be part of one entity
    spans = spacy.util.filter_spans(spans)
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]

    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep='\n')

Doc length:  6
Doc length:  4
Doc length:  10
Doc length:  6
Doc length:  7
Doc length:  9
('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})


The steps of a training loop: 

1. Loop for a number of times.
The training loop is a series of steps that's performed to train or update a model. We usually need to perform it several times, for multiple iterations, so that the model can learn from it effectively. If we want to train for 10 iterations, we need to loop 10 times. 


2. Shuffle the training data. 
To prevent the model from getting stuck in a suboptimal solution, we randomly shuffle the data for each iteration. This is a very common strategy when doing stochastic gradient descent. 


3. Divide the data into batches. 
Next, we divide the training data into batches of several examples, also known as minibatching. This makes it easier to make a more accurate estimate of the gradient.


4. Update the model for each batch. 
Finally, we update the model for each batch, and start the loop again until we've reached the last iteration.


5. Save the updated model. 
We can then save the model to a directory and use it in spaCy.

```
# Example loop (spaCy v3: nlp.update takes a list of Example objects)
import random

import spacy
from spacy.training import Example

TRAINING_DATA = [
    ("How to preorder the iPhone X", {'entities': [(20, 28, 'GADGET')]}),
    # and many more examples...
]

# loop for 10 iterations
for i in range(10):
    # shuffle the training data
    random.shuffle(TRAINING_DATA)

    # create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        # turn each (text, annotations) pair into an Example
        examples = [Example.from_dict(nlp.make_doc(text), annotations)
                    for text, annotations in batch]

        # update the model
        nlp.update(examples)

# save the model (path_to_model is a placeholder for your output directory)
nlp.to_disk(path_to_model)
```

```
# Setting up a new pipeline from scratch (spaCy v3 API)
import random

import spacy
from spacy.training import Example

# start with a blank English model
nlp = spacy.blank('en')

# create a blank entity recognizer and add it to the pipeline
ner = nlp.add_pipe('ner')

# add the new label
ner.add_label('GADGET')

# start the training: initialize the weights and get an optimizer
optimizer = nlp.initialize()

# train for 10 iterations; `examples` is your list of (text, annotations) tuples
for itn in range(10):
    random.shuffle(examples)

    # divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        # turn each (text, annotations) pair into an Example
        batch_examples = [Example.from_dict(nlp.make_doc(text), annotations)
                          for text, annotations in batch]

        # update the model
        nlp.update(batch_examples, sgd=optimizer)
```

### Training best practices

- Problem 1: Models can "forget" things
    - Statistical models can learn lots of things – but it doesn't mean that they won't unlearn them. If you're updating an existing model with new data, especially new labels, it can overfit and adjust *too much* to the new examples. For instance, if you're only updating it with examples of "website", it may "forget" other labels it previously predicted correctly – like "person". This is also known as the catastrophic forgetting problem.
    - Existing model can overfit on new data
        - Ex: if you only update it with WEBSITE, it can "unlearn" what a PERSON is
    - Also known as "catastrophic forgetting" problem
    
- Solution 1: Mix in previously correct predictions
    - To prevent this, make sure to always mix in examples of what the model previously got correct. If you're training a new category "website", also include examples of "person". spaCy can help you with this. You can create those additional examples by running the existing model over data and extracting the entity spans you care about. You can then mix those examples in with your existing data and update the model with annotations of all labels.
    - For example, if you are training WEBSITE, also include examples of PERSON
    - Run existing spaCy model over data and extract all other relevant entities
```
# BAD
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]

# GOOD
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]}),
]
```    
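The "run the existing model and mix in its predictions" idea can be sketched as follows. To keep the example self-contained, a blank English pipeline with a rule-based `entity_ruler` stands in for the existing trained model; that substitution, the pattern and all variable names are assumptions of this sketch. With a real pipeline you would call your loaded model instead.

```python
import spacy

# Stand-in for an existing model that already recognizes PERSON (assumption)
existing_nlp = spacy.blank("en")
ruler = existing_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Obama"}])

# New annotations for the label being added
new_examples = [
    ("Reddit is a website", {"entities": [(0, 6, "WEBSITE")]}),
]

# Texts on which the existing model's predictions are extracted
revision_texts = ["Obama is a person"]

revision_examples = []
for doc in existing_nlp.pipe(revision_texts):
    # Convert predicted entities to (start char, end char, label) tuples
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    revision_examples.append((doc.text, {"entities": entities}))

# Train on both the new label and the previously correct predictions
mixed_examples = new_examples + revision_examples
print(*mixed_examples, sep="\n")
```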


- Problem 2: Models can't learn everything
    - Another common problem is that your model just won't learn what you want it to. spaCy's models make predictions based on the local context – for example, for named entities, the surrounding words are most important. If the decision is difficult to make based on the context, the model can struggle to learn it. The label scheme also needs to be consistent and not too specific.
    - spaCy's models make predictions based on local context
    - Model can struggle to learn if decision is difficult to make based on context
    - Label scheme needs to be consistent and not too specific: 
        - Ex: CLOTHING is better than ADULT_CLOTHING and CHILDRENS_CLOTHING

- Solution 2: Plan your label scheme carefully
    - Before you start training and updating models, it's worth taking a step back and planning your label scheme. Try to pick categories that are reflected in the local context and make them more generic if possible. You can always add a rule-based system later to go from generic to specific.
    - Pick categories that are reflected in local context
    - More generic is better than too specific
    - Use rules to go from generic labels to specific categories         

```
# BAD
LABELS = ['ADULT_SHOES', 'CHILDREN_SHOES', 'BANDS_I_LIKE']

# GOOD
LABELS = ['CLOTHING', 'BANDS']
```      
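Going from a generic label to a specific category with rules might look like the sketch below. The entity spans are set by hand to stand in for model predictions, and the keyword list `CHILD_WORDS` and the function `refine_clothing` are hypothetical; a real rule set would depend on your data.

```python
import spacy
from spacy.tokens import Span

rule_nlp = spacy.blank("en")
rule_doc = rule_nlp.make_doc("I bought kids sneakers and a leather jacket")

# Stand-in for model predictions that use the generic label CLOTHING
rule_doc.ents = [
    Span(rule_doc, 2, 4, label="CLOTHING"),   # "kids sneakers"
    Span(rule_doc, 6, 8, label="CLOTHING"),   # "leather jacket"
]

# Hypothetical keyword rule used to refine the generic label
CHILD_WORDS = {"kids", "children", "toddler"}

def refine_clothing(ent):
    """Map a generic CLOTHING entity to a more specific category."""
    if ent.label_ != "CLOTHING":
        return ent.label_
    if any(token.lower_ in CHILD_WORDS for token in ent):
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"

refined = [(ent.text, refine_clothing(ent)) for ent in rule_doc.ents]
print(refined)
```

Keeping the statistical label generic and the refinement in rules means the rules can be changed later without retraining the model.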