1\. spaCy pipelines
-------------------

00:00 - 00:07

Welcome! We previously learned about spaCy pipelines; let's explore them further.

2\. spaCy pipelines
-------------------

00:07 - 00:22

Recall that when we call nlp on a text, spaCy first tokenizes the text to produce a Doc container. The Doc object is then processed in several different steps, known as the processing pipeline.

• spaCy first tokenizes the text to produce a Doc object

• The Doc is then processed in several steps, known as the processing pipeline

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)
```

3\. spaCy pipelines
-------------------

00:22 - 01:36

To continue learning about spaCy pipelines, in this video we will explore how to create pipeline components and add them to an existing or blank spaCy pipeline. A pipeline is a sequence of pipes (pipeline components), or actors on data, that alter the data or extract information from it. In some cases, later pipes require the output of earlier components, while in other cases a pipe can exist entirely on its own. As an example, for a named entity recognition pipeline, three pipes can be used: a Tokenizer pipe, which is the first processing step in spaCy pipelines; a rule-based named entity recognizer known as the EntityRuler, which finds entities; and an EntityLinker pipe that identifies the type of each entity. Through this processing pipeline, an input text is converted to a Doc container with its corresponding annotated entities. We can use the doc-dot-ents attribute to access the entities found in the input text.

• A pipeline is a sequence of pipes, or actors on data

• A spaCy NER pipeline:
  * Tokenization 
  * Named entity identification
  * Named entity classification

```
[Input text] --> [Tokenizer] --> [EntityRuler] --> [EntityLinker] --> [Doc with annotated entities]
```

```python
print([ent.text for ent in doc.ents])
```
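
To make the idea of "a sequence of pipes" concrete, here is a minimal sketch that lists the pipes of a pipeline. It assumes a blank English pipeline, so no trained model is needed; the choice of the sentencizer component is only for illustration.

```python
import spacy

# A blank English pipeline only has a tokenizer, which is not listed
# among the pipes because it always runs first.
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Adding a component appends it to the sequence of pipes.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# nlp.pipeline exposes (name, component) pairs in execution order.
for name, component in nlp.pipeline:
    print(name, type(component).__name__)
```

Calling the same attributes on a loaded model such as en_core_web_sm shows the trained components in the order they run.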

4\. Adding pipes
----------------

01:36 - 02:48

We often use an existing spaCy model. However, in some cases, an off-the-shelf model will not satisfy our requirements. An example is sentence segmentation for a long document with 10,000 sentences. Recall that sentence segmentation breaks a text into its sentences, and that sentencizer is the name of the spaCy pipeline component that performs it. Given a document with 10,000 sentences, even the smallest and most efficient English model, en_core_web_sm, can take a long time to process and separate them. The reason is that calling an existing spaCy model on a text activates the whole NLP pipeline, which means that every pipe, from named entity recognition to dependency parsing, runs on the text. In this example, that makes processing roughly 100 times slower than running sentence segmentation alone.

• sentencizer: spaCy pipeline component for sentence segmentation.

```python
import time

import spacy

# Build a long text of 10,000 identical sentences
text = " ".join(["This is a test sentence."]*10000)
en_core_sm_nlp = spacy.load("en_core_web_sm")
# Time the full pipeline (every component runs on the text)
start_time = time.time()
doc = en_core_sm_nlp(text)
print(f"Finished processing with en_core_web_sm model in {round((time.time() - start_time)/60.0 , 5)} minutes")
```

```
>>> Finished processing with en_core_web_sm model in 0.09332 minutes
```

5\. Adding pipes
----------------

02:48 - 03:28

In this instance, we would want to make a blank spaCy English model by using spacy-dot-blank("en") and add the sentencizer component to the pipeline by using the dot-add_pipe method of the nlp model. By creating a blank model and simply adding a sentencizer pipe, we can considerably reduce computational time. The reason is that, for this version of the spaCy model, only the intended pipeline component (sentence segmentation) will run on the given documents.

• Create a blank model and add a sentencizer pipe:

```python
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")
start_time = time.time()
doc = blank_nlp(text)
print(f"Finished processing with blank model in {round((time.time() - start_time)/60.0 , 5)} minutes")
```

```
>>> Finished processing with blank model in 0.00091 minutes
```
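
Beyond timing, we can also check that the blank pipeline really does segment the text. The following is a minimal sketch; the helper name `time_pipeline` is hypothetical, and a shorter text of 1,000 sentences is assumed so that it runs quickly.

```python
import time

import spacy


def time_pipeline(nlp, text):
    """Run nlp on text once; return the Doc and the elapsed seconds."""
    start = time.perf_counter()
    doc = nlp(text)
    return doc, time.perf_counter() - start


# Shorter than the 10,000-sentence text above (an assumption for speed)
text = " ".join(["This is a test sentence."] * 1000)

blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

doc, seconds = time_pipeline(blank_nlp, text)
# doc.sents is available because the sentencizer sets sentence boundaries
num_sentences = len(list(doc.sents))
print(f"{num_sentences} sentences segmented in {seconds:.4f} seconds")
```

The same helper could be called with a loaded model to compare timings on the same text.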

6\. Analyzing pipeline components
---------------------------------

03:28 - 04:25

spaCy allows us to analyze a spaCy pipeline to check whether any required attributes are not set. The nlp-dot-analyze_pipes method analyzes the components in a pipeline and outputs structured information about them, like the attributes they set on the Doc and Token, whether they retokenize the Doc and which scores they produce during training. It also shows warnings if components require values that are not set by the previous components, for example, when the entity linker is used but no component before EntityLinker sets named entities. When calling the nlp-dot-analyze_pipes() method, we can also set the pretty argument to True, which will print a nicely organized table as the result of analyzing the pipeline components.

• nlp.analyze_pipes() analyzes a spaCy pipeline to determine:
  * Attributes that pipeline components set
  * Scores a component produces during training
  * Presence of all required attributes

• Setting pretty to True will print a table instead of only returning the structured data.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
analysis = nlp.analyze_pipes(pretty=True)
```

7\. Analyzing pipeline components
---------------------------------

04:25 - 04:47

The snapshot shows the results of the analyze_pipes method. While we don't go into technical details of all the fields, we are familiar with some of the components and attributes provided in this snapshot. In this case, the result of analysis is "No problems found".

```
=============================== Pipeline Overview ===============================

#   Component         Assigns              Requires        Scores             Retokenizes
-   ---------------   -------------------  -------------   ---------------    -----------
0   tok2vec          doc.tensor                                             False

1   tagger           token.tag                            tag_acc           False

2   parser           token.dep                            dep_uas           False
                     token.head                           dep_las
                     token.is_sent_start                  dep_las_per_type
                     doc.sents                            sents_p
                                                         sents_r
                                                         sents_f

3   attribute_ruler                                                         False

4   lemmatizer       token.lemma                         lemma_acc          False

5   ner              doc.ents                            ents_f             False
                     token.ent_iob                       ents_p
                     token.ent_type                      ents_r
                                                        ents_per_type

6   entity_linker    token.ent_kb_id      doc.ents       nel_micro_f        False
                                         doc.sents       nel_micro_r
                                         token.ent_iob   nel_micro_p
                                         token.ent_type

✓ No problems found.
```
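
The analysis is also returned as structured data, which is useful when we want to check for problems in code rather than read a table. The sketch below assumes a blank English pipeline with tagger and entity_linker added, chosen because it produces unmet requirements without needing a trained model.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# entity_linker needs entities and sentence boundaries that nothing sets here
nlp.add_pipe("entity_linker")

# With pretty left at its default, no table is printed and the
# structured analysis is returned as a dictionary.
analysis = nlp.analyze_pipes()

# "problems" maps each component name to the required attributes
# that no earlier component assigns.
for name, unmet in analysis["problems"].items():
    print(name, "->", unmet or "no problems")
```

An empty list for a component means that all of its requirements are met by the pipes that run before it.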

8\. Let's practice!
-------------------

04:47 - 04:50

Let's practice what we've learned.

Adding pipes in spaCy
=====================

You often use an existing spaCy model for different NLP tasks. However, in some cases, running an off-the-shelf model for a single task such as sentence segmentation will take a long time to produce the expected results. In this exercise, you'll practice adding a pipeline component to a spaCy model (text processing pipeline). 

You will use the first five reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the `texts` string. 

The `spaCy` package is already imported for you to use.

Instructions
------------

-   Load a blank `spaCy` English model and add a `sentencizer` component to the model.
-   Create a `Doc` container for the `texts`, create a list to store `sentences` of the given document and print its number of sentences.
-   Print the list of tokens in the second sentence from the `sentences` list.

In [None]:
# Load a blank spaCy English model and add a sentencizer component
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Create Doc containers, store sentences and print its number of sentences
doc = nlp(texts)
sentences = [sent for sent in doc.sents]
print("Number of sentences: ", len(sentences), "\n")

# Print the list of tokens in the second sentence
print("Second sentence tokens: ", [token for token in sentences[1]])

Analyzing pipelines in spaCy
============================

`spaCy` allows you to analyze a spaCy pipeline to check whether any required attributes are not set. In this exercise, you'll practice analyzing a `spaCy` pipeline. Earlier in the video, an existing `en_core_web_sm` pipeline was analyzed and the result was `No problems found.` In this instance, you will analyze a blank `spaCy` English model with a few added components and observe the results of the analysis.

The `spaCy` package is already imported for you to use.

Instructions 1/2
----------------

-   Load a blank `spaCy` English model as `nlp`.
-   Add `tagger` and `entity_linker` pipeline components to the blank model.
-   Analyze the `nlp` pipeline.

In [None]:
# Load a blank spaCy English model
nlp = spacy.blank("en")

# Add tagger and entity_linker pipeline components
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")

# Analyze the pipeline
analysis = nlp.analyze_pipes(pretty=True)

Instructions 2/2
----------------

Question
--------

The output of the `analyze_pipes()` method showed that `entity_linker requirements not met: doc.ents, doc.sents, token.ent_iob, token.ent_type`.

Which NLP components should be added before the `entity_linker` component to ensure the created `spaCy` pipeline has all the required attributes for entity linking?

### Possible answers

[/] ner, sentencizer

[] ner, lemmatizer

[] lemmatizer, sentencizer

1\. spaCy EntityRuler
---------------------

00:00 - 00:11

Welcome! Let's learn about EntityRuler, a component in spaCy that allows us to include or modify named entities using pattern matching rules.

2\. spaCy EntityRuler
---------------------

00:11 - 01:52

EntityRuler lets us add entities to Doc-dot-ents. It can be combined with EntityRecognizer, a spaCy pipeline component for named-entity recognition, to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. We can add named entities to a Doc container using an entity pattern. Entity patterns are dictionaries with two keys. One key is "label", which specifies the label to assign to the entity if the pattern is matched, and the second key is "pattern", which specifies what to match. The entity ruler accepts two types of patterns: phrase entity and token entity patterns. A phrase entity pattern is used for exact string matches. For example, to exactly match Microsoft as a named entity with a label of ORG, we can use an entity pattern dictionary with a "label" equal to ORG and the "pattern" set to "Microsoft". A token entity pattern uses one dictionary to describe one token. For example, to match lowercased san francisco as an entity of type GPE (a geopolitical entity, a location type), we can use an entity pattern dictionary with a "label" equal to GPE and the "pattern" set to a list of two dictionaries, each with the key "LOWER": the first with the value "san" and the second with the value "francisco".

• EntityRuler adds named-entities to a Doc container

• It can be used on its own or combined with EntityRecognizer

• Phrase entity patterns for exact string matches (string):
```python
{"label": "ORG", "pattern": "Microsoft"}
```

• Token entity patterns with one dictionary describing one token (list):
```python
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
```

3\. Adding EntityRuler to spaCy pipeline
----------------------------------------

01:52 - 02:49

The EntityRuler can be added to a spaCy model using the dot-add_pipe() method by passing the name "entity_ruler". When the nlp model is called on a text, it will find matches in the Doc container and add them as entities in doc-dot-ents, using the specified pattern label as the entity label. As an example, we load a blank spaCy model and use the dot-add_pipe("entity_ruler") method to add an EntityRuler component. Next, we define a list of patterns. Patterns can be a combination of phrase entity and token entity patterns. These patterns can be added to the EntityRuler component using the dot-add_patterns() method.

• Using .add_pipe() method 

• List of patterns can be added using .add_patterns() method

```python
nlp = spacy.blank("en")
entity_ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Microsoft"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
entity_ruler.add_patterns(patterns)
```

4\. Adding EntityRuler to spaCy pipeline
----------------------------------------

02:49 - 03:15

Next, we run the model on a given text to generate a Doc container. The nlp model uses the EntityRuler component to populate the dot-ents attribute of the Doc container. In this instance, Microsoft and San Francisco are extracted as entities with ORG and GPE entity labels respectively.

• .ents stores the results of the EntityRuler component

```python
doc = nlp("Microsoft is hiring software developer in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Microsoft', 'ORG'), ('San Francisco', 'GPE')]
```
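
The difference between the two pattern types shows up when the casing of the text varies. Here is a minimal sketch, assuming a blank English pipeline and an illustrative sentence that is not part of the original example.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact string match, so casing matters
    {"label": "ORG", "pattern": "Microsoft"},
    # Token pattern: LOWER compares the lowercased token text
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("microsoft and Microsoft both hire in san francisco.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Only the capitalized Microsoft is matched by the phrase pattern, while the token pattern matches san francisco regardless of casing.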

5\. EntityRuler in action
-------------------------

03:15 - 03:38

The entity ruler is designed to integrate with spaCy's existing components and enhance the named entity recognizer performance. Let us look at an example of "Manhattan associates is a company in the US". In this case, the model is unable to accurately classify Manhattan associates as an ORG.

• Integrates with spaCy pipeline components

• Enhances the named-entity recognizer

• spaCy model without EntityRuler:

```python
nlp = spacy.load("en_core_web_sm")

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Manhattan', 'GPE'), ('U.S.', 'GPE')]
```

6\. EntityRuler in action
-------------------------

03:38 - 04:21

We can add an EntityRuler component to the current nlp pipeline. If we add the ruler after an existing ner component by setting the "after" argument of the -dot-add_pipe() method to "ner", the entity ruler will only add entities to the doc-dot-ents if they don't overlap with existing entities predicted by the model. In this case, the model tags Manhattan with an incorrect GPE type, because the ruler component is called after existing ner (EntityRecognizer) component of the model.

• EntityRuler added after existing ner component:

```python
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", after='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Manhattan', 'GPE'), ('U.S.', 'GPE')]
```

7\. EntityRuler in action
-------------------------

04:21 - 04:50

However, if we add an EntityRuler before the ner component by setting the "before" argument of the dot-add_pipe() method to "ner", to recognize Manhattan associates as an ORG, the entity recognizer will respect the existing entity spans and adjust its predictions based on patterns added to the EntityRuler. This can improve model accuracy in our case.

• EntityRuler added before existing ner component:

```python
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('Manhattan associates', 'ORG'), ('U.S.', 'GPE')]
```
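
Another option, not covered in the video, is the entity ruler's `overwrite_ents` setting, which lets a ruler that runs later replace overlapping entities. The sketch below is only an illustration under stated assumptions: it uses a blank pipeline, and a first entity ruler stands in for the predictions of a trained ner component. The names `build_pipeline`, `first_ruler` and `second_ruler` are hypothetical.

```python
import spacy


def build_pipeline(overwrite):
    """Blank pipeline with two rulers; the first stands in for ner."""
    nlp = spacy.blank("en")
    # Stand-in for a model's prediction (an assumption, so that the
    # sketch runs without a trained pipeline).
    first = nlp.add_pipe("entity_ruler", name="first_ruler")
    first.add_patterns([{"label": "GPE", "pattern": "Manhattan"}])
    # The second ruler runs later and sees the entity set above.
    second = nlp.add_pipe("entity_ruler", name="second_ruler",
                          config={"overwrite_ents": overwrite})
    second.add_patterns([
        {"label": "ORG",
         "pattern": [{"LOWER": "manhattan"}, {"LOWER": "associates"}]},
    ])
    return nlp


text = "Manhattan associates is a company."
kept = [(e.text, e.label_) for e in build_pipeline(False)(text).ents]
replaced = [(e.text, e.label_) for e in build_pipeline(True)(text).ents]
print("overwrite_ents=False:", kept)
print("overwrite_ents=True: ", replaced)
```

With the default setting, the later ruler leaves the existing GPE entity alone; with `overwrite_ents` set to True, its ORG span replaces the overlapping entity.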

8\. Let's practice!
-------------------

04:50 - 04:54

Let's practice!

EntityRuler with blank spaCy model
==================================

`EntityRuler` lets you add entities to `doc.ents`. It can be combined with `EntityRecognizer`, a spaCy pipeline component for named-entity recognition, to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. In this exercise, you will practice adding an `EntityRuler` component to a blank `spaCy` English model and classify named entities of the given `text` using purely rule-based named-entity recognition.

The `spaCy` package is already imported and a blank `spaCy` English model is ready for your use as `nlp`. A list of `patterns` to classify lower cased `OpenAI` and `Microsoft` as `ORG` is already created for your use.

Instructions
------------

-   Create and add an `EntityRuler` component to the pipeline.
-   Add the given patterns to the `EntityRuler` component. 
-   Run the model on the given `text` and create its corresponding `Doc` container. 
-   Print a list of (entity text, entity type) tuples for all entities in the `Doc` container.

In [None]:
nlp = spacy.blank("en")
patterns = [{"label": "ORG", "pattern": [{"LOWER": "openai"}]},
            {"label": "ORG", "pattern": [{"LOWER": "microsoft"}]}]
text = "OpenAI has joined forces with Microsoft."

# Add EntityRuler component to the model
entity_ruler = nlp.add_pipe("entity_ruler")

# Add given patterns to the EntityRuler component
entity_ruler.add_patterns(patterns)

# Run the model on a given text
doc = nlp(text)

# Print entities text and type for all entities in the Doc container
print([(ent.text, ent.label_) for ent in doc.ents])

EntityRuler for NER
===================

`EntityRuler` can be combined with `EntityRecognizer` of an existing model to boost its accuracy. In this exercise, you will practice combining an `EntityRuler` component and an existing `NER` component of the `en_core_web_sm` model. The model is already loaded as `nlp`. 

When `EntityRuler` is added before the `NER` component, the entity recognizer will respect the existing entity spans and adjust its predictions based on patterns added to the `EntityRuler` to improve the accuracy of the named entity recognition task.

Instructions
------------

-   Add an `EntityRuler` to the `nlp` before `ner` component.
-   Define a token entity pattern to classify lower cased `new york group` as `ORG`.
-   Add the `patterns` to the `EntityRuler` component.
-   Run the model and print the tuple of entities text and type for the `Doc` container.

In [None]:
nlp = spacy.load("en_core_web_sm")
text = "New York Group was built in 1987."

# Add an EntityRuler to the nlp before NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

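# Define a token pattern to classify lower cased new york group as ORG
# (one dictionary per token, since each dictionary describes a single token)
patterns = [{"label": "ORG", "pattern": [{"lower": "new"}, {"lower": "york"}, {"lower": "group"}]}]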

# Add the patterns to the EntityRuler component
ruler.add_patterns(patterns)

# Run the model and print entities text and type for all the entities
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

EntityRuler with multi-patterns in spaCy
========================================

`EntityRuler` lets you add entities to `doc.ents` and boost named entity recognition performance. In this exercise, you will practice adding an `EntityRuler` component to an existing `nlp` pipeline to ensure multiple entities are correctly classified.

The `en_core_web_sm` model is already loaded and is available for your use as `nlp`. You can access an example text in `example_text` and use `nlp` and `doc` to access a `spaCy` model and the `Doc` container of `example_text` respectively.

Instructions
------------

-   Print a list of tuples of entities text and types in the `example_text` with the `nlp` model.
-   Define multiple patterns to match lower cased `brother` and `sisters` to the `PERSON` label.
-   Add an `EntityRuler` component to the `nlp` pipeline and add the `patterns` to the `EntityRuler`.
-   Print a tuple of text and type of entities for the `example_text` with the `nlp` model.

In [None]:
nlp = spacy.load("en_core_web_sm")

# Print a list of tuples of entities text and types in the example_text
print("Before EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents], "\n")

# Define pattern to add a label PERSON for lower cased sisters and brother entities
patterns = [{"label": "PERSON", "pattern": [{"lower": "brother"}]},
            {"label": "PERSON", "pattern": [{"lower": "sisters"}]}]

# Add an EntityRuler component and add the patterns to the ruler
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Print a list of tuples of entities text and types
print("After EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents])

1\. RegEx with spaCy
--------------------

00:00 - 00:05

Welcome, let us learn about rule-based information extraction.

2\. What is RegEx?
------------------

00:05 - 00:45

Rule-based information extraction is useful for many NLP tasks. Certain types of entities, such as dates or phone numbers have distinct formats that can be recognized by a set of rules without needing to train any model. Regular expressions, or RegEx, are used for rule-based information extraction with complex string matching patterns. RegEx can be used to retrieve patterns or replace matching patterns in a string with some other patterns. For example, given a text, we can use regular expressions to find any reference to links or phone numbers.

• Rule-based information extraction (IE) is useful for many NLP tasks

• Regular expressions (RegEx) are used for complex string matching patterns

• RegEx finds and retrieves patterns, or replaces matching patterns

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam quis purus a odio dapibus volutpat. Donec sed enim consequat, dapibus nisl at, fermentum tellus. Suspendisse id hendrerit felis. Sed sit amet hendrerit metus. https://www.att.com. Aliquam erat volutpat. In lobortis fermentum nulla non ullamcorper.

www.tellus.com. Donec elementum nibh ut tellus hendrerit consectetur. 555-555-5555 Aliquam eget imperdiet diam. Phasellus molestie rhoncus massa nec bibendum.


3\. RegEx strengths and weaknesses
----------------------------------

00:45 - 01:30

Nearly all data scientists and engineers use RegEx at some stage in their workflow, from cleaning data to implementing machine learning models. RegEx has several advantages. Its expressive syntax lets programmers write robust rules, it can capture many kinds of variation in strings, it runs quickly, and it is supported by many programming languages. Despite these advantages, RegEx has a few weaknesses. Its syntax is quite difficult for beginners, and writing good RegEx patterns requires knowing all the ways a pattern may vary in texts.

Pros:
• Enables writing robust rules to retrieve information
• Can allow us to find many types of variance in strings  
• Runs fast
• Supported by programming languages

Cons:
• Syntax is challenging for beginners
• Requires knowledge of all the ways a pattern may be mentioned in texts

4\. RegEx in Python
-------------------

01:30 - 02:26

Python comes prepackaged with a RegEx library, called re. Let's assume we want to find the phone numbers in a text. The first step is to define a pattern. Assuming a phone number is always written as something like 3 digits-3 digits-4 digits, a pattern to find such phone numbers is shown. In this pattern, backslash-d is a metacharacter that matches any digit from 0 to 9. A number within curly brackets shows how many occurrences of the preceding element are expected. Hence, parenthesis backslash-d followed by 3 in curly brackets matches exactly three digits. We also use a dash between the groups of digits to match the shape of the phone number.

• Python comes prepackaged with a RegEx library, re.

• The first step in using re package is to define a pattern.

• The resulting pattern is used to find matching content.

```python
import re

pattern = r"((\d){3}-(\d){3}-(\d){4})"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
```
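
To see what the curly-bracket quantifier does in isolation, here is a small sketch. The name `simple_pattern` is introduced only for illustration; it is the same phone-number shape written without capturing groups, so that re.findall returns the matched strings directly.

```python
import re

# \d{3} means "exactly three digits"; fullmatch requires the whole
# string to match the pattern.
print(bool(re.fullmatch(r"\d{3}", "832")))  # True
print(bool(re.fullmatch(r"\d{3}", "83")))   # False

# Same shape as the pattern above, without capturing groups
simple_pattern = r"\d{3}-\d{3}-\d{4}"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
numbers = re.findall(simple_pattern, text)
print(numbers)  # ['832-123-5555', '425-123-4567']
```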

5\. RegEx in Python
-------------------

02:26 - 03:10

To find any matching patterns in a given text, we can use the re-dot-finditer() function from the re package. We can iterate through the found matches. Every match contains the start and end character positions of the matching section of the text, which are accessible using the match-dot-start() and match-dot-end() methods. We can see that two phone numbers matching the given pattern, 832-123-5555 and 425-123-4567, are found with their corresponding start and end characters from the input text.

• We use .finditer() method from re package

```python
iter_matches = re.finditer(pattern, text)
for match in iter_matches:
    start_char = match.start()
    end_char = match.end()
    print("Start character: ", start_char, "| End character: ", end_char,
          "| Matching text: ", text[start_char:end_char])
```

```
Start character:  20 | End character:  32 | Matching text:  832-123-5555
Start character:  59 | End character:  71 | Matching text:  425-123-4567
```
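
RegEx can also replace matching patterns, as mentioned earlier. The following is a minimal sketch using re.sub and named groups; the placeholder string "<PHONE>" and the group names are illustrative choices, not part of the original example.

```python
import re

text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

# Replace every phone number with a placeholder
masked = re.sub(r"\d{3}-\d{3}-\d{4}", "<PHONE>", text)
print(masked)

# Named groups give readable access to the parts of the first match
named_pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
first = re.search(named_pattern, text)
print(first.group("area"), first.span())
```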

6\. RegEx in spaCy
------------------

03:10 - 04:48

spaCy has quick ways to implement RegEx in three pipes: Matcher, PhraseMatcher, and EntityRuler. Matcher and PhraseMatcher do not add the matched patterns as entities to doc-dot-ents. For this reason, we utilize EntityRuler to implement regular expressions. We have already learned to use EntityRuler to improve entity recognition accuracy in spaCy. We will learn more about Matcher and PhraseMatcher later on. Let's look at an example of using EntityRuler to find phone numbers. The patterns are a list of dictionaries, each with two keys: label and pattern. In this instance, the label is set to PHONE_NUMBER. To match a pattern such as 3 digits-3 digits-4 digits, we use a pattern that consists of 5 smaller dictionaries, where each dictionary represents one part of the matching pattern. The first, third and fifth dictionaries, with the key SHAPE, represent tokens with a shape of three or four digits by using three or four d's. The second and fourth dictionaries, with the key ORTH, represent an exact string match, which is set to a dash in this pattern. Writing patterns in spaCy requires practice; the spaCy documentation provides more information about the different pattern attributes.

• RegEx in three pipeline components: Matcher, PhraseMatcher and EntityRuler.

```python
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
nlp = spacy.blank("en")
patterns = [{"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
           {"ORTH": "-"}, {"SHAPE": "ddd"},
           {"ORTH": "-"}, {"SHAPE": "dddd"}]}]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
[('832-123-5555', 'PHONE_NUMBER'), ('425-123-4567', 'PHONE_NUMBER')]
```
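
To understand why five dictionaries are needed, it helps to look at how the phone number is tokenized and what shape each token has. A minimal sketch, assuming the blank English tokenizer used above:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("832-123-5555")

# token.shape_ replaces digits with d and keeps punctuation as-is
token_shapes = [(token.text, token.shape_) for token in doc]
print(token_shapes)
```

Each (text, shape) pair lines up with one dictionary in the pattern: SHAPE for the digit groups and ORTH for the dashes.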

7\. Let's practice!
-------------------

04:48 - 04:51

Let's practice!

RegEx in Python
===============

Rule-based information extraction is useful for many NLP tasks. Certain types of entities, such as dates or phone numbers, have distinct formats that can be recognized by a set of rules without needing to train any model. In this exercise, you will practice using the `re` package for RegEx. The goal is to find phone numbers in a given `text`.

The `re` package is already imported for your use. You can use `\d`, a metacharacter that matches any digit from 0 to 9.

Instructions
------------

-   Define a pattern to match phone numbers of the form (111)-111-1111.
-   Find all the matching patterns using `re.finditer()` method.
-   For each match, print start and end characters and matching section of the given `text`.

In [None]:
text = "Our phone number is (425)-123-4567."

# Define a pattern to match phone numbers
pattern = r"\((\d){3}\)-(\d){3}-(\d){4}"

# Find all the matching patterns in the text
phones = re.finditer(pattern, text)

# Print start and end characters and matching section of the text
for match in phones:
    start_char = match.start()
    end_char = match.end()
    print("Start character: ", start_char, "| End character: ", end_char, "| Matching text: ", text[start_char:end_char])

RegEx with EntityRuler in spaCy
===============================

Regular expressions, or RegEx, are used for rule-based information extraction with complex string matching patterns. RegEx can be used to retrieve patterns or replace matching patterns in a string with some other patterns. In this exercise, you will practice using `EntityRuler` in `spaCy` to find phone numbers in a given `text`.

The `spaCy` package is already imported for your use. You can use `\d`, a metacharacter that matches any digit from 0 to 9.

A `spaCy` pattern can use `REGEX` as an attribute. In this case, a pattern will be of shape `[{"TEXT": {"REGEX": "<a given pattern>"}}]`.

Instructions
------------

-   Define a pattern to match phone numbers of the form `8888888888` to be used by the `EntityRuler`.
-   Load a blank `spaCy` English model and add an `EntityRuler` component to the pipeline.
-   Add the compiled pattern to the `EntityRuler` component.
-   Run the model and print the tuple of text and type of entities for the given `text`.

In [None]:
text = "Our phone number is 4251234567."

# Define a pattern to match phone numbers
patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": r"(\d){10}"}}]}]

# Load a blank model and add an EntityRuler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Add the compiled patterns to the EntityRuler
ruler.add_patterns(patterns)

# Print the tuple of entities texts and types for the given text
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

1\. spaCy Matcher and PhraseMatcher
-----------------------------------

00:00 - 00:07

Welcome! Let us learn more about rule-based information extraction using spaCy.

2\. Matcher in spaCy
--------------------

00:07 - 00:51

RegEx patterns are not trivial to read and debug. For these reasons, spaCy provides a readable, production-level, and maintainable alternative, the Matcher class. The Matcher class can match predefined rules to a sequence of tokens in Doc containers. Let's look at an example. We first import spaCy and the Matcher class. We then load the en_core_web_sm model and run the model on the given text to generate a Doc container. Next, a Matcher object is initialized with the given model's vocabulary by using Matcher(nlp-dot-vocab).

• RegEx patterns can be complex, difficult to read and debug.

• spaCy provides a readable and production-level alternative, the Matcher class.

```python
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Good morning, this is our first day on campus.")
matcher = Matcher(nlp.vocab)
```

3\. Matcher in spaCy
--------------------

00:51 - 01:33

Next, we define a pattern to match lower cased good and morning by defining a list of two dictionaries. The first one has a key of "LOWER" and a value of "good", and the second one has a key of "LOWER" and a value of "morning". Then we add this pattern with a custom name, such as morning_greeting, to the Matcher object and run the matcher on the Doc container. The output of a Matcher object is a list of matches, where each match is a tuple of a match id and the start and end token indices of the matched pattern.

• Matching output includes start and end token indices of the matched pattern.

```python
pattern = [{"LOWER": "good"}, {"LOWER": "morning"}]
matcher.add("morning_greeting", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end,
          "| Matched text: ", doc[start:end].text)
```

```
Start token: 0 | End token: 2 | Matched text: Good morning
```
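
The match id in each tuple is a hash of the rule name rather than the name itself. The sketch below shows one way to recover the name through the vocabulary's string store. It assumes a blank English pipeline (`spacy.blank("en")`) instead of `en_core_web_sm`, which should be enough here because `LOWER` only needs the tokenizer; the `results` list is just an illustrative name.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is sufficient, since LOWER is a
# lexical attribute that does not depend on any trained component.
nlp = spacy.blank("en")
doc = nlp("Good morning, this is our first day on campus.")

matcher = Matcher(nlp.vocab)
matcher.add("morning_greeting", [[{"LOWER": "good"}, {"LOWER": "morning"}]])

results = []
for match_id, start, end in matcher(doc):
    # match_id is an integer hash; the StringStore maps it back to the name
    rule_name = nlp.vocab.strings[match_id]
    results.append((rule_name, start, end, doc[start:end].text))

print(results)
```

Looking up the rule name becomes useful once several patterns are registered on the same Matcher object.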

4\. Matcher extended syntax support
-----------------------------------

01:33 - 01:57

The Matcher class allows patterns to be more expressive by allowing some operators inside the curly brackets. These operators are for extended comparison and look similar to Python's in, not in and comparison operators. The table shows a list of supported operators in the Matcher class.

• Allows operators in defining the matching patterns.

• Similar operators to Python's in, not in and comparison operators

| Operator | Value type | Description |
|-----------|------------|-------------|
| IN | any type | Attribute value is a member of a list |
| NOT_IN | any type | Attribute value is not a member of a list |
| ==, >=, <=, >, < | int, float | Comparison operators for equality or inequality checks |
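
As a rough illustration of the operators in the table, the sketch below uses `NOT_IN` on the `LOWER` attribute and `>=` on the `LENGTH` attribute (token length in characters). The sentence, the rule names, and the blank English pipeline are assumptions made for this example, not part of the lesson.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough because LOWER and LENGTH are
# lexical attributes available straight after tokenization.
nlp = spacy.blank("en")
doc = nlp("A tiny cat and a tiny dog met an extraordinarily enthusiastic crowd.")

matcher = Matcher(nlp.vocab)
# "tiny" followed by any token whose lowercase form is NOT in the list
matcher.add("TINY_NOT_DOG", [[{"LOWER": "tiny"}, {"LOWER": {"NOT_IN": ["dog"]}}]])
# Any single token that is at least 10 characters long
matcher.add("LONG_TOKEN", [[{"LENGTH": {">=": 10}}]])

# Group the matched texts by rule name
found = {}
for match_id, start, end in matcher(doc):
    name = nlp.vocab.strings[match_id]
    found.setdefault(name, []).append(doc[start:end].text)

for name in sorted(found):
    print(name, sorted(found[name]))
```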

5\. Matcher extended syntax support
-----------------------------------

01:57 - 02:33

For instance, if we want to match both good morning and good evening in a text, regardless of case, we can use a single matching pattern and the IN operator. In this case, the pattern is a list of two dictionaries. The first one is {"LOWER": "good"} and the second one is {"LOWER": {"IN": ["morning", "evening"]}}.

• Using the IN operator to match both good morning and good evening

```python
doc = nlp("Good morning and good evening.")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])
matches = matcher(doc)
```

• The output of matching using the IN operator

```python
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end,
          "| Matched text: ", doc[start:end].text)
```

```
Start token: 0 | End token: 2 | Matched text: Good morning
Start token: 3 | End token: 5 | Matched text: good evening
```

6\. PhraseMatcher in spaCy
--------------------------

02:33 - 03:26

While processing unstructured text, we often have long lists and dictionaries of terms that we want to find in given texts. Matcher patterns are handcrafted, and each token needs to be coded individually, so Matcher is no longer the best option when we have a long list of phrases. In this situation, the PhraseMatcher class helps us match long lists of terms. As an example, let's assume that we want to match two terms in a given text, Bill Gates and John Smith. First, we import spaCy and the PhraseMatcher class. Then, we load the en_core_web_sm model and initialize the PhraseMatcher object using PhraseMatcher(nlp-dot-vocab).

- PhraseMatcher class matches a long list of phrases in a given text.

```python
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Bill Gates", "John Smith"]
```

7\. PhraseMatcher in spaCy
--------------------------

03:26 - 04:00

Next, we create patterns for the PhraseMatcher object by calling the nlp-dot-make_doc() method on each term. This method only runs the tokenizer and converts each term into a Doc object, which is the pattern format that the PhraseMatcher class expects. Then, we follow similar steps as with the Matcher class: we add the patterns, run the PhraseMatcher object on the Doc container of a text, and iterate through the matches to extract the start and end token indices of the matched patterns.

- PhraseMatcher outputs include start and end token indices of the matched pattern

```python
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("PeopleOfInterest", patterns)
doc = nlp("Bill Gates met John Smith for an important discussion regarding importance of AI.")
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end,
          "| Matched text: ", doc[start:end].text)
```

```
Start token: 0 | End token: 2 | Matched text: Bill Gates
Start token: 3 | End token: 5 | Matched text: John Smith
```
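
When the term list is long, calling `nlp.make_doc()` in a list comprehension still works, but the patterns can also be built in one pass with `nlp.tokenizer.pipe()`. The sketch below is one possible way to do this, again assuming a blank English pipeline since the default exact-text matching only needs tokenization; the `spans` list is an illustrative name.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; PhraseMatcher compares token text
# by default, so no trained components are required.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

terms = ["Bill Gates", "John Smith"]
# Tokenize all terms as a stream instead of calling make_doc one by one
patterns = list(nlp.tokenizer.pipe(terms))
matcher.add("PeopleOfInterest", patterns)

doc = nlp("Bill Gates met John Smith for an important discussion regarding importance of AI.")
spans = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
print(spans)
```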

8\. PhraseMatcher in spaCy
--------------------------

04:00 - 04:54

The previous example shows how to match patterns by their exact text. If we want to match case-insensitively or match on the shape of a token, we can use the attr (attribute) argument of the PhraseMatcher class. In the first example, we set the attr argument to LOWER, so PhraseMatcher compares lowercased token text and finds the terms regardless of case. In the second example, by setting the attr argument to SHAPE, we ask PhraseMatcher to match any text that has the same shape as one of the given terms. Here, we want to retrieve IP addresses in a text, so we provide a few example IP addresses, such as 110-dot-0-dot-0-dot-0, to the PhraseMatcher class.

- We can use the `attr` argument of the `PhraseMatcher` class

```python
# Example 1: match regardless of case by comparing lowercased token text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["Government", "Investment"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("InvestmentTerms", patterns)
doc = nlp("It was interesting to the investment division of the government.")

# Example 2: match on token shape to find text that looks like an IP address
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
terms = ["110.0.0.0", "101.243.0.0"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)
doc = nlp("The tracked IP address was 234.135.0.0.")
```
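
To see why the SHAPE example finds an address that is not in the term list, it can help to look at the shapes themselves and then run both matchers. The following is a minimal sketch under the assumption that a blank English pipeline tokenizes these sentences the same way as `en_core_web_sm`; the variable names (`lower_hits`, `term_shapes`, `shape_hits`) are illustrative only.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; LOWER and SHAPE are lexical attributes.
nlp = spacy.blank("en")

# attr="LOWER": capitalized terms still match their lowercase occurrences
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
lower_matcher.add("InvestmentTerms",
                  [nlp.make_doc(term) for term in ["Government", "Investment"]])
doc = nlp("It was interesting to the investment division of the government.")
lower_hits = sorted(doc[start:end].text for _, start, end in lower_matcher(doc))
print(lower_hits)

# attr="SHAPE": digits become "d", so each address reduces to a digit layout
terms = ["110.0.0.0", "101.243.0.0"]
shape_matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
shape_matcher.add("IPAddresses", [nlp.make_doc(term) for term in terms])
term_shapes = [nlp.make_doc(term)[0].shape_ for term in terms]
print(term_shapes)

doc = nlp("The tracked IP address was 234.135.0.0.")
shape_hits = [doc[start:end].text for _, start, end in shape_matcher(doc)]
print(shape_hits)
```

Shape matching only catches addresses whose digit layout equals that of one of the example terms, so the term list needs one example per layout of interest.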

9\. Let's practice!
-------------------

04:54 - 04:57

Let's practice!

Matching a single term in spaCy
===============================

RegEx patterns are not trivial to read, write and debug. Fortunately, spaCy provides a readable and production-level alternative, the `Matcher` class. The `Matcher` class can match predefined rules to a sequence of tokens in a given `Doc` container. In this exercise, you will practice using `Matcher` to find a single word.

You can access the corresponding text in `example_text` and use `nlp` and `doc` to access a `spaCy` model and the `Doc` container of `example_text` respectively.

Instructions
------------

-   Initialize a `Matcher` object.
-   Define a pattern to match lower cased `witch` in the `example_text`.
-   Add the pattern to the `Matcher` object and find matches.
-   Iterate through matches and print start and end token indices and span of the matched text.

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Initialize a Matcher object
matcher = Matcher(nlp.vocab)

# Define a pattern to match lower cased word witch
pattern = [{"LOWER": "witch"}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print start and end token indices and span of the matched text
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

PhraseMatcher in spaCy
======================

While processing unstructured text, you often have long lists and dictionaries that you want to scan and match in given texts. The Matcher patterns are handcrafted and each token needs to be coded individually. If you have a long list of phrases, `Matcher` is no longer the best option. In this situation, the `PhraseMatcher` class helps you match long lists of terms. In this exercise, you will practice using the `PhraseMatcher` class to retrieve text whose shape matches that of the given terms.

The `en_core_web_sm` model is already loaded and ready for you to use as `nlp`. The `PhraseMatcher` class is already imported. A `text` string and a list of `terms` are available for your use.

Instructions
------------

-   Initialize a `PhraseMatcher` object with an `attr` that matches on the shape of the given `terms`.
-   Create `patterns` to add to the `PhraseMatcher` object.
-   Find matches to the given patterns and print start and end token indices and matching section of the given `text`.

In [None]:
text = "There are only a few acceptable IP addresses: (1) 127.100.0.1, (2) 123.4.1.0."
terms = ["110.0.0.0", "101.243.0.0"]

# Initialize a PhraseMatcher object that matches on the shape of the given terms
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")

# Create patterns to add to the PhraseMatcher object
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)

# Find matches to the given patterns and print start and end token indices and matched text
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

Matching with extended syntax in spaCy
======================================

Rule-based information extraction is a useful part of many NLP pipelines. The `Matcher` class allows patterns to be more expressive by allowing some operators inside the curly brackets. These operators are for extended comparison and look similar to Python's `in`, `not in` and comparison operators. In this exercise, you will practice using `spaCy`'s `Matcher` class to find matches for given terms in an example text.

The `Matcher` class is already imported from the `spacy.matcher` module. In this exercise, you will use the `Doc` container of an example text, available as `doc`. A pre-loaded `spaCy` model is also accessible as `nlp`.

Instructions
------------

-   Define a matcher object using `Matcher` and `nlp`.
-   Use the `IN` operator to define a pattern to match `tiny squares` and `tiny mouthful`.
-   Use this pattern to find matches for `doc`.
-   Print start and end token indices and text span of the matches.

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Define a matcher object
matcher = Matcher(nlp.vocab)

# Define a pattern to match tiny squares and tiny mouthful
pattern = [{"LOWER": "tiny"}, {"LOWER": {"IN": ["squares", "mouthful"]}}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print out start and end token indices and the matched text span per match
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)