In [1]:
import spacy
#!python3 -m spacy download en_core_web_sm

import numpy as np

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")

## Adding pipes in spaCy

### You often use an existing spaCy model for different NLP tasks. However, in some cases, an off-the-shelf pipeline component such as sentence segmentation takes a long time to produce the expected results. In this exercise, you'll practice adding a pipeline component to a spaCy model (a text processing pipeline).

### You will use the first five reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the texts string.

### The spaCy package is already imported for you to use.

### Instructions
-    Load a blank spaCy English model and add a sentencizer component to the model.
-    Create a Doc container for the texts, create a list to store sentences of the given document and print its number of sentences.
-    Print the list of tokens in the second sentence from the sentences list.

In [2]:
texts = 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most. Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo". This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch. If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal. Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.'

In [3]:
# Load a blank spaCy English model and add a sentencizer component
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Create a Doc container, store its sentences and print how many there are
doc = nlp(texts)
sentences = list(doc.sents)
print("Number of sentences: ", len(sentences), "\n")

# Print the list of tokens in the second sentence
print("Second sentence tokens: ", list(sentences[1]))

Number of sentences:  19 

Second sentence tokens:  [The, product, looks, more, like, a, stew, than, a, processed, meat, and, it, smells, better, .]
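
A quick way to confirm that a component was registered is to compare `nlp.pipe_names` before and after the call to `add_pipe`. The sketch below assumes only that spaCy is installed; the two-sentence `sample` string is made up for illustration and is not part of the reviews.

```python
import spacy

# A blank English pipeline contains only a tokenizer, so pipe_names is empty
nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)

# Register the rule-based sentence splitter
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)

# Illustrative sample text (an assumption, not taken from the dataset)
sample = "Hello world. This is spaCy."
sample_sentences = [sent.text for sent in nlp(sample).sents]

print(names_before, names_after)
print(sample_sentences)
```

Because the blank pipeline has no parser, `doc.sents` is only available here once the sentencizer has been added.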


## Analyzing pipelines in spaCy

### spaCy allows you to analyze a pipeline to check whether any required attributes are not set. In this exercise, you'll practice analyzing a spaCy pipeline. Earlier in the video, an existing en_core_web_sm pipeline was analyzed and the result was "No problems found." In this instance, you will analyze a blank spaCy English model with a few added components and observe the results of the analysis.

### The spaCy package is already imported for you to use.

### Instructions 1/2
-    Load a blank spaCy English model as nlp.
-    Add tagger and entity_linker pipeline components to the blank model.
-    Analyze the nlp pipeline.

In [4]:
# Load a blank spaCy English model
nlp = spacy.blank("en")

# Add tagger and entity_linker pipeline components
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")

# Analyze the pipeline
analysis = nlp.analyze_pipes(pretty=True)


#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger          token.tag                          tag_acc       False      
                                                                                
1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False      
                                      doc.sents        nel_micro_r              
                                      token.ent_iob    nel_micro_p              
                                      token.ent_type                            

⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
token.ent_iob, token.ent_type


## EntityRuler with blank spaCy model

### EntityRuler lets you add entities to doc.ents. It can be combined with EntityRecognizer, a spaCy pipeline component for named-entity recognition, to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. In this exercise, you will practice adding an EntityRuler component to a blank spaCy English model and classifying the named entities of the given text using purely rule-based named-entity recognition.

### The spaCy package is already imported and a blank spaCy English model is ready for your use as nlp. A list of patterns to classify lowercased OpenAI and Microsoft as ORG is already created for your use.

### Instructions
-    Create and add an EntityRuler component to the pipeline.
-    Add given patterns to the EntityRuler component.
-    Run the model on the given text and create its corresponding Doc container.
-    Print a list of (entity text, entity type) tuples for all entities in the Doc container.

In [5]:
nlp = spacy.blank("en")
patterns = [{"label": "ORG", "pattern": [{"LOWER": "openai"}]},
            {"label": "ORG", "pattern": [{"LOWER": "microsoft"}]}]
text = "OpenAI has joined forces with Microsoft."

# Add EntityRuler component to the model
entity_ruler = nlp.add_pipe("entity_ruler")

# Add given patterns to the EntityRuler component
entity_ruler.add_patterns(patterns)

# Run the model on a given text
doc = nlp(text)

# Print entities text and type for all entities in the Doc container
print([(ent.text, ent.label_) for ent in doc.ents])

[('OpenAI', 'ORG'), ('Microsoft', 'ORG')]
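
An EntityRuler pattern can be either a list of token dictionaries, as above, or a plain string. The sketch below contrasts the two on a made-up sentence: a string pattern is matched as an exact, case-sensitive phrase, while a token pattern on `LOWER` ignores case. The variable names `nlp_exact` and `nlp_lower` are illustrative.

```python
import spacy

text = "OpenAI and openai refer to the same company."

# String pattern: matched as an exact, case-sensitive phrase
nlp_exact = spacy.blank("en")
nlp_exact.add_pipe("entity_ruler").add_patterns(
    [{"label": "ORG", "pattern": "OpenAI"}]
)
exact_ents = [(ent.text, ent.label_) for ent in nlp_exact(text).ents]

# Token pattern on LOWER: compared against the lowercased token text
nlp_lower = spacy.blank("en")
nlp_lower.add_pipe("entity_ruler").add_patterns(
    [{"label": "ORG", "pattern": [{"LOWER": "openai"}]}]
)
lower_ents = [(ent.text, ent.label_) for ent in nlp_lower(text).ents]

print(exact_ents)  # [('OpenAI', 'ORG')]
print(lower_ents)  # [('OpenAI', 'ORG'), ('openai', 'ORG')]
```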


## EntityRuler for NER

### EntityRuler can be combined with the EntityRecognizer of an existing model to boost its accuracy. In this exercise, you will practice combining an EntityRuler component with the existing NER component of the en_core_web_sm model. The model is already loaded as nlp.

### When an EntityRuler is added before the NER component, the entity recognizer respects the entity spans already set by the ruler and adjusts its predictions around them, which can improve the accuracy of named entity recognition.

### Instructions
-    Add an EntityRuler to nlp before the ner component.
-    Define a token entity pattern to classify lowercased new york group as ORG.
-    Add the patterns to the EntityRuler component.
-    Run the model and print a list of (entity text, entity type) tuples for the Doc container.

In [6]:
nlp = spacy.load("en_core_web_sm")
text = "New York Group was built in 1987."

# Add an EntityRuler to the nlp before NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

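# Define a token pattern to classify lowercased "new york group" as ORG
# (one dictionary per token, since each dictionary matches a single token)
patterns = [{"label": "ORG", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}, {"LOWER": "group"}]}]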

# Add the patterns to the EntityRuler component
ruler.add_patterns(patterns)

# Run the model and print entities text and type for all the entities
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

[('New York Group', 'ORG'), ('1987', 'DATE')]
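
Each dictionary in a token pattern describes exactly one token, so a multi-word phrase needs one dictionary per word. On a blank pipeline, where no statistical NER contributes entities, the difference is easy to see. The helper `ruler_entities` below is a hypothetical name introduced only for this sketch.

```python
import spacy

text = "New York Group was built in 1987."

def ruler_entities(pattern):
    """Run a blank English pipeline with a single EntityRuler pattern."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([{"label": "ORG", "pattern": pattern}])
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

# One dictionary holding the whole phrase: no single token has this text
single_dict = ruler_entities([{"LOWER": "new york group"}])

# One dictionary per token: matches the three consecutive tokens
per_token = ruler_entities([{"LOWER": "new"}, {"LOWER": "york"}, {"LOWER": "group"}])

print(single_dict)  # []
print(per_token)    # [('New York Group', 'ORG')]
```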


## EntityRuler with multi-patterns in spaCy

### EntityRuler lets you add entities to doc.ents and boost named entity recognition performance. In this exercise, you will practice adding an EntityRuler component to an existing nlp pipeline to ensure multiple entities are classified correctly.

### A pretrained English model is available for your use as nlp (the cell below loads en_core_web_md). You can access an example text in example_text and use nlp and doc to access a spaCy model and the Doc container of example_text, respectively.

### Instructions
-    Print a list of tuples of entity text and type for the example_text with the nlp model.
-    Define multiple patterns to match lowercased brother and sisters to the PERSON label.
-    Add an EntityRuler component to the nlp pipeline and add the patterns to the EntityRuler.
-    Print a list of tuples of entity text and type for the example_text with the nlp model.

In [7]:
example_text = 'This is a confection. In this case Filberts. And it is cut into tiny squares. This is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'

In [8]:
nlp = spacy.load("en_core_web_md")

# Print a list of tuples of entities text and types in the example_text
print("Before EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents], "\n")

# Define pattern to add a label PERSON for lower cased sisters and brother entities
patterns = [{"label": "PERSON", "pattern": [{"lower": "brother"}]},
            {"label": "PERSON", "pattern": [{"lower": "sisters"}]}]

# Add an EntityRuler component and add the patterns to the ruler
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Print a list of tuples of entities text and types
print("After EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents])

Before EntityRuler:  [('Filberts', 'PERSON'), ('Edmund', 'PERSON'), ('Sisters', 'ORG')] 

After EntityRuler:  [('Filberts', 'PERSON'), ('Edmund', 'PERSON'), ('Brother', 'PERSON'), ('Sisters', 'ORG')]
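
In the output above, "Sisters" keeps its ORG label because the ruler was added after the NER component and, by default, an EntityRuler does not replace entities that are already set. The sketch below reproduces that behavior on a blank pipeline and shows the `overwrite_ents` setting. It uses a first ruler as a stand-in for the statistical NER, which is an assumption made so the example runs without a downloaded model; the helper `run` and the component names are hypothetical.

```python
import spacy

text = "He sold out his Brother and Sisters."

def run(overwrite):
    nlp = spacy.blank("en")
    # Stand-in for an earlier component: labels "Sisters" as ORG
    first = nlp.add_pipe("entity_ruler", name="first_ruler")
    first.add_patterns([{"label": "ORG", "pattern": [{"LOWER": "sisters"}]}])
    # Later ruler that wants PERSON for the same token
    second = nlp.add_pipe(
        "entity_ruler", name="second_ruler", config={"overwrite_ents": overwrite}
    )
    second.add_patterns([
        {"label": "PERSON", "pattern": [{"LOWER": "brother"}]},
        {"label": "PERSON", "pattern": [{"LOWER": "sisters"}]},
    ])
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

kept = run(overwrite=False)
replaced = run(overwrite=True)

print(kept)      # [('Brother', 'PERSON'), ('Sisters', 'ORG')]
print(replaced)  # [('Brother', 'PERSON'), ('Sisters', 'PERSON')]
```

Adding the ruler with `before="ner"` is the other common option, since the entity recognizer then works around the spans the ruler has already set.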


## RegEx in Python

### Rule-based information extraction is useful for many NLP tasks. Certain types of entities, such as dates or phone numbers, have distinct formats that can be recognized by a set of rules without training any model. In this exercise, you will practice using the re package for RegEx. The goal is to find phone numbers in a given text.

### The re package is already imported for your use. You can use \d, a metacharacter that matches any digit from 0 to 9.

### Instructions
-    Define a pattern to match phone numbers of the form (111)-111-1111.
-    Find all the matching patterns using re.finditer() method.
-    For each match, print start and end characters and matching section of the given text.

In [9]:
import re

In [10]:
text = "Our phone number is (425)-123-4567."

# Define a pattern to match phone numbers
pattern = r"\((\d){3}\)-(\d){3}-(\d){4}"

# Find all the matching patterns in the text
phones = re.finditer(pattern, text)

# Print start and end characters and matching section of the text
for match in phones:
    start_char = match.start()
    end_char = match.end()
    print("Start character: ", start_char, "| End character: ", end_char, "| Matching text: ", text[start_char:end_char])

Start character:  20 | End character:  34 | Matching text:  (425)-123-4567
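
A quantifier placed after a capturing group, as in `(\d){3}`, repeats the group, and the group then keeps only its last repetition. This does not affect the overall match, but it matters as soon as the individual parts of the number are needed. The sketch below compares both forms on the same text.

```python
import re

text = "Our phone number is (425)-123-4567."

# Quantifier outside the group: each group keeps only its last repetition
last_digits = re.search(r"\((\d){3}\)-(\d){3}-(\d){4}", text).groups()

# Quantifier inside the group: each group keeps the whole run of digits
full_parts = re.search(r"\((\d{3})\)-(\d{3})-(\d{4})", text).groups()

print(last_digits)  # ('5', '3', '7')
print(full_parts)   # ('425', '123', '4567')
```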


## RegEx with EntityRuler in spaCy

### Regular expressions, or RegEx, are used for rule-based information extraction with complex string matching patterns. RegEx can be used to retrieve patterns or replace matching patterns in a string with some other patterns. In this exercise, you will practice using EntityRuler in spaCy to find phone numbers in a given text.

### The spaCy package is already imported for your use. As before, \d is a metacharacter that matches any digit from 0 to 9.

### A spaCy pattern can use REGEX as an attribute. In this case, a pattern will be of shape [{"TEXT": {"REGEX": "<a given pattern>"}}].

### Instructions
-    Define a pattern to match phone numbers of the form 8888888888 to be used by the EntityRuler.
-    Load a blank spaCy English model and add an EntityRuler component to the pipeline.
-    Add the compiled pattern to the EntityRuler component.
-    Run the model and print the tuple of text and type of entities for the given text.

In [11]:
text = "Our phone number is 4251234567."

# Define a pattern to match phone numbers
patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": r"(\d){10}"}}]}]

# Load a blank model and add an EntityRuler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Add the compiled patterns to the EntityRuler
ruler.add_patterns(patterns)

# Print the tuple of entities texts and types for the given text
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

[('4251234567', 'PHONE_NUMBERS')]
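
The `REGEX` operator is applied to each token's text separately and succeeds if the expression is found anywhere inside the token. An unanchored ten-digit pattern therefore also accepts longer digit strings. The sketch below, using a made-up sentence and a hypothetical helper `phone_entities`, shows how anchoring with `^` and `$` restricts the match to tokens of exactly ten digits.

```python
import spacy

text = "Call 4251234567, not 12345678901234."

def phone_entities(regex):
    """Return entity texts found by a blank pipeline with one REGEX pattern."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(
        [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": regex}}]}]
    )
    return [ent.text for ent in nlp(text).ents]

# Found anywhere in the token: the 14-digit token also qualifies
unanchored = phone_entities(r"\d{10}")

# Anchored at both ends: only tokens of exactly ten digits qualify
anchored = phone_entities(r"^\d{10}$")

print(unanchored)  # ['4251234567', '12345678901234']
print(anchored)    # ['4251234567']
```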


## Matching a single term in spaCy

### RegEx patterns are not trivial to read, write and debug. Fortunately, spaCy provides a readable, production-level alternative: the Matcher class. The Matcher class can match predefined rules to a sequence of tokens in a given Doc container. In this exercise, you will practice using Matcher to find a single word.

### You can access the corresponding text in example_text and use nlp and doc to access a spaCy model and the Doc container of example_text, respectively.

### Instructions
-    Initialize a Matcher object.
-    Define a pattern to match lowercased witch in the example_text.
-    Add the pattern to the Matcher object and find matches.
-    Iterate through matches and print start and end token indices and span of the matched text.

In [15]:
from spacy.matcher import Matcher
example_text = 'I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'

In [16]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Initialize a Matcher object
matcher = Matcher(nlp.vocab)

# Define a pattern to match the lowercased word witch
pattern = [{"LOWER": "witch"}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print start and end token indices and span of the matched text
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

Start token:  24  | End token:  25 | Matched text:  Witch
Start token:  47  | End token:  48 | Matched text:  Witch
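
Each match is a tuple of `(match_id, start, end)`, where `match_id` is the hash of the name passed to `matcher.add`. As a small sketch on a made-up sentence and a blank pipeline, the name can be recovered by looking the hash up in `nlp.vocab.strings`, which is useful once several rules are registered on the same matcher.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("The Witch met another witch.")

matcher = Matcher(nlp.vocab)
matcher.add("CustomMatcher", [[{"LOWER": "witch"}]])

# Translate each numeric match_id back into the rule name it was added under
results = [
    (nlp.vocab.strings[match_id], start, end, doc[start:end].text)
    for match_id, start, end in matcher(doc)
]

for rule_name, start, end, span_text in results:
    print(rule_name, start, end, span_text)
```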


## PhraseMatcher in spaCy

### While processing unstructured text, you often have long lists and dictionaries of terms that you want to scan for in given texts. Matcher patterns are handcrafted, and each token needs to be coded individually. If you have a long list of phrases, Matcher is no longer the best option. In this instance, the PhraseMatcher class helps match long dictionaries. In this exercise, you will practice using the PhraseMatcher class to retrieve text whose shape matches that of the given terms.

### The en_core_web_sm model is already loaded and ready for you to use as nlp. The PhraseMatcher class is imported. A text string and a list of terms are available for your use.

### Instructions
-    Initialize a PhraseMatcher class with an attr to match to shape of given terms.
-    Create patterns to add to the PhraseMatcher object.
-    Find matches to the given patterns and print start and end token indices and matching section of the given text.

In [17]:
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")

In [18]:
text = "There are only a few acceptable IP addresses: (1) 127.100.0.1, (2) 123.4.1.0."
terms = ["110.0.0.0", "101.243.0.0"]

# Initialize a PhraseMatcher object that matches on the shapes of the given terms
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")

# Create patterns to add to the PhraseMatcher object
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)

# Find matches to the given patterns and print start and end token indices and matched texts
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

Start token:  12  | End token:  13 | Matched text:  127.100.0.1
Start token:  17  | End token:  18 | Matched text:  123.4.1.0
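
The two addresses in the text match even though neither appears in `terms`, because with `attr="SHAPE"` the comparison is made on each token's shape, in which every digit is written as `d` and punctuation is kept as is. The sketch below prints the shapes on a blank English pipeline; it assumes, as the token indices above suggest, that each address is a single token.

```python
import spacy

nlp = spacy.blank("en")

terms = ["110.0.0.0", "101.243.0.0"]
candidates = ["127.100.0.1", "123.4.1.0"]

# Each address is one token, so the first token's shape_ describes it
shapes = {s: nlp.make_doc(s)[0].shape_ for s in terms + candidates}

for address, shape in shapes.items():
    print(address, "->", shape)
```

Here `127.100.0.1` shares the shape `ddd.ddd.d.d` with `101.243.0.0`, and `123.4.1.0` shares `ddd.d.d.d` with `110.0.0.0`. An address with a different digit layout, such as one with a two-digit first block, would need its own term.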


## Matching with extended syntax in spaCy

### Rule-based information extraction is useful in many NLP pipelines. The Matcher class allows patterns to be more expressive by allowing operators inside the curly brackets. These operators support extended comparisons and look similar to Python's in, not in and comparison operators. In this exercise, you will practice with spaCy's matching functionality, Matcher, to find matches for given terms from an example text.

### The Matcher class is already imported from the spacy.matcher library. You will use a Doc container of an example text in this exercise by calling doc. A pre-loaded spaCy model is also accessible as nlp.

### Instructions
-    Define a matcher object using Matcher and nlp.
-    Use the IN operator to define a pattern to match tiny squares and tiny mouthful.
-    Use this pattern to find matches for doc.
-    Print start and end token indices and text span of the matches.

In [20]:
example_text = "It is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven."

In [21]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Define a matcher object
matcher = Matcher(nlp.vocab)
# Define a pattern to match tiny squares and tiny mouthful
pattern = [{"LOWER": "tiny"}, {"LOWER": {"IN": ["squares", "mouthful"]}}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print out start and end token indices and the matched text span per match
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

Start token:  4  | End token:  6 | Matched text:  tiny squares
Start token:  19  | End token:  21 | Matched text:  tiny mouthful
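
The complementary operator `NOT_IN` works the same way but excludes the listed values. As a minimal sketch on a made-up phrase and a blank pipeline, the pattern below matches "tiny" followed by any token other than "squares" or "mouthful". The rule name `TinyOther` is illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("tiny squares, tiny mouthful, tiny house")

matcher = Matcher(nlp.vocab)

# "tiny" followed by a token whose lowercase form is not in the list
pattern = [{"LOWER": "tiny"}, {"LOWER": {"NOT_IN": ["squares", "mouthful"]}}]
matcher.add("TinyOther", [pattern])

other_matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(other_matches)  # ['tiny house']
```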
