1\. spaCy pipelines
-------------------

00:00 - 00:07

Welcome! We previously learned about spaCy pipelines; let's explore them further.

2\. spaCy pipelines
-------------------

00:07 - 00:22

Recall that when we call nlp on a text, spaCy first tokenizes the text to produce a Doc container. The Doc object is then processed in several different steps, known as the processing pipeline.

• spaCy first tokenizes the text to produce a Doc object

• The Doc is then processed in several steps, known as the processing pipeline

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)
```

3\. spaCy pipelines
-------------------

00:22 - 01:36

To continue learning about spaCy pipelines, in this video we will explore how to create pipeline components and add them to an existing or blank spaCy pipeline. A pipeline is a sequence of pipes (pipeline components), or actors on data, that alter the data or extract information from it. In some cases, later pipes require the output of earlier components, while in other cases a pipe can exist entirely on its own. As an example, a named entity recognition pipeline can use three pipes: a Tokenizer, which is the first processing step in spaCy pipelines; a rule-based named entity recognizer known as the EntityRuler, which finds entities and assigns their labels; and an EntityLinker, which links each entity to an identifier in a knowledge base. Through this processing pipeline, an input text is converted to a Doc container with its corresponding annotated entities. We can use the doc-dot-ents attribute to find the entities in the input text.

• A pipeline is a sequence of pipes, or actors on data

• A spaCy NER pipeline:
  * Tokenization
  * Named entity identification and labeling
  * Named entity linking

```
[Input text] --> [Tokenizer] --> [EntityRuler] --> [EntityLinker] --> [Doc with annotated entities]
```

```python
print([ent.text for ent in doc.ents])
```
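
As a rough illustration of a pipeline being an ordered sequence of pipes, the sketch below builds a small rule-based pipeline from a blank English model and lists its components. The pattern and the sample sentence are assumptions made up for this example. Note that the tokenizer always runs first but is not listed among the pipeline components.

```python
import spacy

# Start from a blank English pipeline: it has a tokenizer but no components yet
nlp = spacy.blank("en")
print(nlp.pipe_names)  # []

# Add a rule-based entity recognizer with one illustrative pattern
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Microsoft"}])
print(nlp.pipe_names)  # ['entity_ruler']

# Calling nlp tokenizes the text, then runs each pipe in order
doc = nlp("Microsoft is based in Redmond.")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('Microsoft', 'ORG')]
```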

4\. Adding pipes
----------------

01:36 - 02:48

We often use an existing spaCy model. However, in some cases, an off-the-shelf model will not satisfy our requirements. One example is sentence segmentation of a long document with 10,000 sentences. Recall that sentence segmentation means breaking a text into its sentences. Sentencizer is the name of the spaCy pipeline component that performs sentence segmentation. Given a document with 10,000 sentences, even the smallest English model, en_core_web_sm, which is the most efficient spaCy model, can take a long time to process and separate them. The reason is that when we call an existing spaCy model on a text, the whole NLP pipeline is activated, so every pipe, from named entity recognition to dependency parsing, runs on the text. In this example, that takes roughly 100 times longer than running a sentencizer alone.

• sentencizer: spaCy pipeline component for sentence segmentation.

```python
import time
import spacy

text = " ".join(["This is a test sentence."]*10000)
en_core_sm_nlp = spacy.load("en_core_web_sm")
start_time = time.time()
doc = en_core_sm_nlp(text)
print(f"Finished processing with en_core_web_sm model in {round((time.time() - start_time)/60.0 , 5)} minutes")
```

```
>>> Finished processing with en_core_web_sm model in 0.09332 minutes
```

5\. Adding pipes
----------------

02:48 - 03:28

In this instance, we would want to create a blank spaCy English model with spacy-dot-blank("en") and add the sentencizer component to the pipeline using the dot-add_pipe method of the nlp model. By creating a blank model and simply adding a sentencizer pipe, we can considerably reduce computational time. The reason is that, for this version of the spaCy model, only the intended pipeline component (sentence segmentation) runs on the given documents.

• Create a blank model and add a sentencizer pipe:

```python
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")
start_time = time.time()
doc = blank_nlp(text)
print(f"Finished processing with blank model in {round((time.time() - start_time)/60.0 , 5)} minutes")
```

```
>>> Finished processing with blank model in 0.00091 minutes
```
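
One way to sanity-check the blank pipeline is to confirm that it contains only the sentencizer and still finds every sentence. The sketch below uses a shorter text of 100 repeated sentences (an assumption chosen to keep the run quick) rather than the 10,000 used above.

```python
import spacy

blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

# The sentencizer is the only component that runs after tokenization
print(blank_nlp.pipe_names)  # ['sentencizer']

small_text = " ".join(["This is a test sentence."] * 100)
doc = blank_nlp(small_text)

# doc.sents is available because the sentencizer set sentence boundaries
sentences = list(doc.sents)
print(len(sentences))  # 100
print(sentences[0].text)  # This is a test sentence.
```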

6\. Analyzing pipeline components
---------------------------------

03:28 - 04:25

spaCy allows us to analyze a spaCy pipeline to check whether any required attributes are not set. The nlp-dot-analyze_pipes method analyzes the components in a pipeline and outputs structured information about them, such as the attributes they set on the Doc and Token, whether they retokenize the Doc, and which scores they produce during training. It also shows warnings if components require values that are not set by previous components, for example, when the entity linker is used but no component before the EntityLinker sets named entities. When calling the nlp-dot-analyze_pipes() method, we can also set the pretty argument to True, which prints a nicely organized table with the results of analyzing the pipeline components.

• nlp.analyze_pipes() analyzes a spaCy pipeline to determine:
  * Attributes that pipeline components set
  * Scores a component produces during training
  * Presence of all required attributes

• Setting pretty to True will print a table instead of only returning the structured data.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
analysis = nlp.analyze_pipes(pretty=True)
```
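
Besides printing a table, analyze_pipes returns the analysis as a dictionary that can be inspected in code. The sketch below is a minimal example, assuming a blank English pipeline that contains only an entity_linker component so that a problem is reported. The "problems" key maps each component name to the required attributes that no earlier component sets.

```python
import spacy

nlp = spacy.blank("en")
# The entity linker needs entities and sentence boundaries from earlier pipes,
# but nothing before it in this pipeline provides them
nlp.add_pipe("entity_linker")

# pretty defaults to False, so this returns the data without printing a table
analysis = nlp.analyze_pipes()

for name, unmet in analysis["problems"].items():
    print(name, "is missing:", unmet)
```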

7\. Analyzing pipeline components
---------------------------------

04:25 - 04:47

The snapshot shows the results of the analyze_pipes method. While we won't go into the technical details of all the fields, we are familiar with some of the components and attributes in this snapshot. In this case, the result of the analysis is "No problems found".

```
=============================== Pipeline Overview ===============================

#   Component         Assigns               Requires         Scores             Retokenizes
-   ---------------   -------------------   --------------   ----------------   -----------
0   tok2vec           doc.tensor                                                False

1   tagger            token.tag                              tag_acc            False

2   parser            token.dep                              dep_uas            False
                      token.head                             dep_las
                      token.is_sent_start                    dep_las_per_type
                      doc.sents                              sents_p
                                                             sents_r
                                                             sents_f

3   attribute_ruler                                                             False

4   lemmatizer        token.lemma                            lemma_acc          False

5   ner               doc.ents                               ents_f             False
                      token.ent_iob                          ents_p
                      token.ent_type                         ents_r
                                                             ents_per_type

6   entity_linker     token.ent_kb_id       doc.ents         nel_micro_f        False
                                            doc.sents        nel_micro_r
                                            token.ent_iob    nel_micro_p
                                            token.ent_type

✓ No problems found.
```

8\. Let's practice!
-------------------

04:47 - 04:50

Let's practice our learnings.

Adding pipes in spaCy
=====================

You often use an existing spaCy model for different NLP tasks. However, in some cases, an off-the-shelf model takes a long time to produce the expected results for a single task such as sentence segmentation. In this exercise, you'll practice adding a pipeline component to a spaCy model (text processing pipeline).

You will use the first five reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the `texts` string. 

The `spaCy` package is already imported for you to use.

Instructions
------------

-   Load a blank `spaCy` English model and add a `sentencizer` component to the model.
-   Create a `Doc` container for the `texts`, create a list to store the `sentences` of the given document, and print the number of sentences.
-   Print the list of tokens in the second sentence from the `sentences` list.

```python
# Load a blank spaCy English model and add a sentencizer component
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Create Doc containers, store sentences and print its number of sentences
doc = nlp(texts)
sentences = [sent for sent in doc.sents]
print("Number of sentences: ", len(sentences), "\n")

# Print the list of tokens in the second sentence
print("Second sentence tokens: ", [token for token in sentences[1]])
```

Analyzing pipelines in spaCy
============================

`spaCy` allows you to analyze a spaCy pipeline to check whether any required attributes are not set. In this exercise, you'll practice analyzing a `spaCy` pipeline. Earlier in the video, an existing `en_core_web_sm` pipeline was analyzed and the result was `No problems found.` In this instance, you will analyze a blank `spaCy` English model with a few added components and observe the results of the analysis.

The `spaCy` package is already imported for you to use.

Instructions 1/2
----------------

-   Load a blank `spaCy` English model as `nlp`.
-   Add `tagger` and `entity_linker` pipeline components to the blank model.
-   Analyze the `nlp` pipeline.

```python
# Load a blank spaCy English model
nlp = spacy.blank("en")

# Add tagger and entity_linker pipeline components
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")

# Analyze the pipeline
analysis = nlp.analyze_pipes(pretty=True)
```

Instructions 2/2
----------------

Question
--------

The output of the `analyze_pipes()` method showed that `entity_linker requirements not met: doc.ents, doc.sents, token.ent_iob, token.ent_type`.

Which NLP components should be added before the `entity_linker` component to ensure that the `spaCy` pipeline has all the attributes required for entity linking?

### Possible answers

[/] ner, sentencizer

[] ner, lemmatizer

[] lemmatizer, sentencizer

1\. spaCy EntityRuler
---------------------

00:00 - 00:11

Welcome! Let's learn about EntityRuler, a component in spaCy that allows us to include or modify named entities using pattern matching rules.

2\. spaCy EntityRuler
---------------------

00:11 - 01:52

EntityRuler lets us add entities to Doc-dot-ents. It can be combined with EntityRecognizer, a spaCy pipeline component for named-entity recognition, to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. We can add named entities to a Doc container using an entity pattern. Entity patterns are dictionaries with two keys: "label", which specifies the label to assign to the entity if the pattern is matched, and "pattern", which specifies what to match. The entity ruler accepts two types of patterns: phrase entity patterns and token entity patterns. A phrase entity pattern is used for exact string matches. For example, to match exactly Microsoft as a named entity with the label ORG, we can use an entity pattern dictionary with "label" set to ORG and "pattern" set to "Microsoft". A token entity pattern uses one dictionary to describe one token. For example, to match lower-cased san francisco as an entity of type GPE (a location type), we can use an entity pattern dictionary with "label" set to GPE and "pattern" set to a list of two dictionaries, each with the key "LOWER": the first with the value "san" and the second with the value "francisco".

• EntityRuler adds named-entities to a Doc container

• It can be used on its own or combined with EntityRecognizer

• Phrase entity patterns for exact string matches (string):
```python
{"label": "ORG", "pattern": "Microsoft"}
```

• Token entity patterns with one dictionary describing one token (list):
```python
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
```
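
To make the difference between the two pattern types concrete, here is a small sketch on a blank English pipeline. The sample sentence is an assumption for illustration. The phrase pattern matches the exact string only, so lower-cased "microsoft" is not picked up, while the LOWER token pattern matches regardless of casing.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact, case-sensitive string match
    {"label": "ORG", "pattern": "Microsoft"},
    # Token pattern: one dictionary per token, compared on the lower-cased text
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("microsoft opened an office in SAN FRANCISCO.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('SAN FRANCISCO', 'GPE')]
```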

3\. Adding EntityRuler to spaCy pipeline
----------------------------------------

01:52 - 02:49

The EntityRuler can be added to a spaCy model using the dot-add_pipe() method by passing the name "entity_ruler". When the nlp model is called on a text, it finds matches in the Doc container and adds them as entities in doc-dot-ents, using the specified pattern label as the entity label. As an example, we load a blank spaCy model and use the dot-add_pipe("entity_ruler") method to add an EntityRuler component. Next, we define a list of patterns. Patterns can be a combination of phrase entity and token entity patterns. These patterns can be added to the EntityRuler component using the dot-add_patterns() method.

• Using .add_pipe() method 

• List of patterns can be added using .add_patterns() method

```python
nlp = spacy.blank("en")
entity_ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Microsoft"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
entity_ruler.add_patterns(patterns)
```

4\. Adding EntityRuler to spaCy pipeline
----------------------------------------

02:49 - 03:15

Next, we run the model on a given text to generate a Doc container. The nlp model uses the EntityRuler component to populate the dot-ents attribute of the Doc container. In this instance, Microsoft and San Francisco are extracted as entities with ORG and GPE entity labels respectively.

• .ents stores the results of the EntityRuler component

```python
doc = nlp("Microsoft is hiring software developer in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Microsoft', 'ORG'), ('San Francisco', 'GPE')]
```

5\. EntityRuler in action
-------------------------

03:15 - 03:38

The entity ruler is designed to integrate with spaCy's existing components and enhance the performance of the named entity recognizer. Let us look at the example "Manhattan associates is a company in the U.S.". In this case, the model is unable to accurately classify Manhattan associates as an ORG.

• Integrates with spaCy pipeline components

• Enhances the named-entity recognizer

• spaCy model without EntityRuler:

```python
nlp = spacy.load("en_core_web_sm")

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Manhattan', 'GPE'), ('U.S.', 'GPE')]
```

6\. EntityRuler in action
-------------------------

03:38 - 04:21

We can add an EntityRuler component to the current nlp pipeline. If we add the ruler after an existing ner component by setting the "after" argument of the dot-add_pipe() method to "ner", the entity ruler will only add entities to doc-dot-ents if they don't overlap with existing entities predicted by the model. In this case, the output still tags Manhattan with an incorrect GPE type, because the ruler runs after the model's existing ner (EntityRecognizer) component and its overlapping ORG match is not added.

• EntityRuler added after existing ner component:

```python
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", after='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Manhattan', 'GPE'), ('U.S.', 'GPE')]
```
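
The overlap rule can be reproduced without a statistical model. In the sketch below, a first EntityRuler named "stand_in_ner" (a hypothetical name) plays the role of the ner component by tagging Manhattan as GPE, and a second ruler added after it proposes the longer span as ORG. The overwrite_ents setting of the EntityRuler controls whether the later ruler may replace overlapping entities; by default it may not.

```python
import spacy


def run_pipeline(overwrite_ents):
    """Return the entities found when a second ruler runs after a first one."""
    nlp = spacy.blank("en")

    # Stand-in for the model's ner component (hypothetical name)
    first = nlp.add_pipe("entity_ruler", name="stand_in_ner")
    first.add_patterns([{"label": "GPE", "pattern": "Manhattan"}])

    # Second ruler, placed after the stand-in, proposes an overlapping ORG span
    second = nlp.add_pipe(
        "entity_ruler",
        after="stand_in_ner",
        config={"overwrite_ents": overwrite_ents},
    )
    second.add_patterns(
        [{"label": "ORG", "pattern": [{"LOWER": "manhattan"}, {"LOWER": "associates"}]}]
    )

    doc = nlp("Manhattan associates is a company in the U.S.")
    return [(ent.text, ent.label_) for ent in doc.ents]


print(run_pipeline(overwrite_ents=False))  # [('Manhattan', 'GPE')]
print(run_pipeline(overwrite_ents=True))   # [('Manhattan associates', 'ORG')]
```

With the default setting, the result mirrors the slide: the earlier GPE entity is kept and the overlapping ORG match is dropped.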

7\. EntityRuler in action
-------------------------

04:21 - 04:50

However, if we add an EntityRuler before the ner component by setting the "before" argument of the dot-add_pipe() method to "ner", to recognize Manhattan associates as an ORG, the entity recognizer will respect the existing entity spans and adjust its predictions based on the patterns added to the EntityRuler. This can improve model accuracy in our case.

• EntityRuler added before existing ner component:

```python
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Manhattan associates', 'ORG'), ('U.S.', 'GPE')]
```
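
The effect of the position arguments of add_pipe can be checked by inspecting nlp-dot-pipe_names. This minimal sketch uses a blank English pipeline with a sentencizer standing in for an existing component, so no trained model is needed; the choice of components is an assumption for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# before= places the new component ahead of the named one
nlp.add_pipe("entity_ruler", before="sentencizer")
print(nlp.pipe_names)  # ['entity_ruler', 'sentencizer']

# after= places the new component right behind the named one
nlp.add_pipe("tagger", after="entity_ruler")
print(nlp.pipe_names)  # ['entity_ruler', 'tagger', 'sentencizer']
```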

8\. Let's practice!
-------------------

04:50 - 04:54

Let's practice!

EntityRuler with blank spaCy model
==================================

`EntityRuler` lets you add entities to `doc.ents`. It can be combined with `EntityRecognizer`, a spaCy pipeline component for named-entity recognition, to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. In this exercise, you will practice adding an `EntityRuler` component to a blank `spaCy` English model and classify the named entities of the given `text` using purely rule-based named-entity recognition.

The `spaCy` package is already imported and a blank `spaCy` English model is ready for your use as `nlp`. A list of `patterns` to classify lower-cased `OpenAI` and `Microsoft` as `ORG` is already created for your use.

Instructions
------------

-   Create and add an `EntityRuler` component to the pipeline.
-   Add the given patterns to the `EntityRuler` component.
-   Run the model on the given `text` and create its corresponding `Doc` container.
-   Print a list of (text, type) tuples for all entities in the `Doc` container.

```python
nlp = spacy.blank("en")
patterns = [{"label": "ORG", "pattern": [{"LOWER": "openai"}]},
            {"label": "ORG", "pattern": [{"LOWER": "microsoft"}]}]
text = "OpenAI has joined forces with Microsoft."

# Add EntityRuler component to the model
entity_ruler = nlp.add_pipe("entity_ruler")

# Add given patterns to the EntityRuler component
entity_ruler.add_patterns(patterns)

# Run the model on a given text
doc = nlp(text)

# Print entities text and type for all entities in the Doc container
print([(ent.text, ent.label_) for ent in doc.ents])
```

EntityRuler for NER
===================

`EntityRuler` can be combined with the `EntityRecognizer` of an existing model to boost its accuracy. In this exercise, you will practice combining an `EntityRuler` component and the existing `NER` component of the `en_core_web_sm` model. The model is already loaded as `nlp`.

When an `EntityRuler` is added before the `NER` component, the entity recognizer will respect the existing entity spans and adjust its predictions based on the patterns added to the `EntityRuler`, which can improve the accuracy of the named entity recognition task.

Instructions
------------

-   Add an `EntityRuler` to `nlp` before the `ner` component.
-   Define a token entity pattern to classify lower-cased `new york group` as `ORG`.
-   Add the `patterns` to the `EntityRuler` component.
-   Run the model and print a list of (text, type) tuples for the entities in the `Doc` container.

```python
nlp = spacy.load("en_core_web_sm")
text = "New York Group was built in 1987."

# Add an EntityRuler to the nlp before NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define a pattern to classify lower cased new york group as ORG
# (a token pattern needs one dictionary per token)
patterns = [{"label": "ORG", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}, {"LOWER": "group"}]}]

# Add the patterns to the EntityRuler component
ruler.add_patterns(patterns)

# Run the model and print entities text and type for all the entities
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
```

EntityRuler with multi-patterns in spaCy
========================================

`EntityRuler` lets you add entities to `doc.ents` and boost named entity recognition performance. In this exercise, you will practice adding an `EntityRuler` component to an existing `nlp` pipeline to ensure multiple entities are classified correctly.

The `en_core_web_sm` model is already loaded and is available for your use as `nlp`. You can access an example text in `example_text` and use `nlp` and `doc` to access a `spaCy` model and the `Doc` container of `example_text` respectively.

Instructions
------------

-   Print a list of tuples of entity text and type in the `example_text` with the `nlp` model.
-   Define multiple patterns to match lower-cased `brother` and `sisters` to the `PERSON` label.
-   Add an `EntityRuler` component to the `nlp` pipeline and add the `patterns` to the `EntityRuler`.
-   Print a list of tuples of entity text and type for the `example_text` with the `nlp` model.

```python
nlp = spacy.load("en_core_web_sm")

# Print a list of tuples of entities text and types in the example_text
print("Before EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents], "\n")

# Define pattern to add a label PERSON for lower cased sisters and brother entities
patterns = [{"label": "PERSON", "pattern": [{"lower": "brother"}]},
            {"label": "PERSON", "pattern": [{"lower": "sisters"}]}]

# Add an EntityRuler component and add the patterns to the ruler
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Print a list of tuples of entities text and types
print("After EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents])
```