In [1]:
import spacy

In [4]:
# Create a blank English nlp object
nlp = spacy.blank("en")

In [7]:
# Process a text
doc = nlp("I like tree kangaroos and narwhals")

In [8]:
# Print the document text
print(doc.text)

I like tree kangaroos and narwhals


In [13]:
# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

tree kangaroos


In [14]:
# A slice of the Doc for "tree kangaroos and narwhals"
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos and narwhals
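
Slicing a Doc returns a Span, which is a view over the tokens rather than a copy of the text. The sketch below assumes only a blank English pipeline, as in the cells above, and shows how the slice indices map onto the span's start, end, and tokens. The variable names are illustrative.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like tree kangaroos and narwhals")

# doc[2:4] covers token indices 2 and 3; the end index is exclusive
span = doc[2:4]

# A Span can be iterated like a Doc and yields Token objects
span_tokens = [token.text for token in span]
print(span.start, span.end, len(span))
print(span_tokens)

# Each token keeps its position in the parent Doc, not in the span
token_indices = [token.i for token in span]
print(token_indices)
```

Because `token.i` refers to the parent Doc, it can be used to look at neighbouring tokens outside the span.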


In this example, you’ll use spaCy’s Doc and Token objects and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

Use the like_num token attribute to check whether a token in the doc resembles a number.

In [15]:
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

In [24]:
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        print(token.text)
        print(f'id index is {token.i}')

        # Get the next token, guarding against a number at the very end of the doc
        if token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            print(f'next symbol {next_token}')
            if next_token.text == '%':
                print("is equal to %")

1990
id index is 1
next symbol ,
60
id index is 5
next symbol %
is equal to %
4
id index is 20
next symbol %
is equal to %
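
The same check can be wrapped in a small helper that returns the percentages instead of printing them. This is a minimal sketch: `find_percentages` is a hypothetical name, not part of spaCy, and it relies only on the `like_num` and `i` token attributes used above.

```python
import spacy


def find_percentages(doc):
    """Return the text of every number-like token directly followed by '%'."""
    results = []
    for token in doc:
        # Skip a number that is the last token, since it has no successor
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                results.append(token.text)
    return results


nlp = spacy.blank("en")
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = find_percentages(doc)
print(percentages)
```

Here "1990" is number-like but is followed by a comma, so only the two values followed by a percent sign are kept.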


The pipelines we’re using in this course are already pre-installed. For more details on spaCy’s trained pipelines and how to install them on your machine, see the documentation.

Use spacy.load to load the small English pipeline "en_core_web_sm".
Process the text and print the document text.

In [25]:
# Load the "en_core_web_sm" pipeline
nlp = spacy.load("en_core_web_sm")

In [26]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"


In [27]:
# Process the text
doc = nlp(text)

Process the text and create a doc object.
Iterate over the doc.ents and print the entity text and label_ attribute.

In [47]:

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    print(spacy.explain(ent.label_))

Apple ORG
Companies, agencies, institutions, etc.
first ORDINAL
"first", "second", etc.
U.S. GPE
Countries, cities, states
$1 trillion MONEY
Monetary values, including unit


In [50]:
print(doc.ents)

(Apple, first, U.S., $1 trillion)
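
`spacy.explain` looks up a label in spaCy's built-in glossary, so it can be called without loading a pipeline. The sketch below collects the descriptions for the labels that appeared in the output above; the `descriptions` dictionary is only an illustration.

```python
import spacy

# Labels taken from the entity output above
labels = ["ORG", "ORDINAL", "GPE", "MONEY"]

# Map each label to its glossary description
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```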


In [52]:
txt = "David is  23 the best man on the world"
nlp = spacy.load("en_core_web_sm")
doc = nlp(txt)
doc

David is  23 the best man on the world

In [54]:
for e in doc.ents:
    print(e.text, e.label_)
    # Explain the label of the current entity, not a leftover variable from an earlier cell
    print(spacy.explain(e.label_))

print(doc.ents)

David PERSON
People, including fictional
23 CARDINAL
Numerals that do not fall under another type
(David, 23)


In [38]:

for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    print(spacy.explain(ent.label_))

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

Process the text with the nlp object.
Iterate over the entities and print the entity text and label.
Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [56]:
nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)
    print(spacy.explain(ent.label_))

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Companies, agencies, institutions, etc.
Missing entity: iPhone X
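
A possible next step, not part of the exercise, is to turn the manual slice into a labelled entity. The sketch below assumes a blank English pipeline (so no entities are predicted) and uses "PRODUCT" as a label chosen purely for illustration.

```python
import spacy
from spacy.tokens import Span

# A blank pipeline only tokenizes, so doc.ents starts out empty
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Build a Span over tokens 1 and 2; "PRODUCT" is an assumed label
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Assigning to doc.ents replaces the current entities of the doc
doc.ents = [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, the new span would typically be appended to the existing `doc.ents` rather than replacing them, provided it does not overlap a predicted entity.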


Let’s try spaCy’s rule-based Matcher. You’ll use the example from the previous exercise and write a pattern that matches the phrase “iPhone X” in the text.

Import the Matcher from spacy.matcher.
Initialize it with the nlp object’s shared vocab.
Create a pattern that matches the "TEXT" values of two tokens: "iPhone" and "X".
Use the matcher.add method to add the pattern to the matcher.
Call the matcher on the doc and store the result in the variable matches.
Iterate over the matches and get the matched span from the start to the end index.

In [58]:


# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']
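
Each match is a tuple of a hash ID and the start and end token indices. The sketch below shows one way to turn the hash back into the rule name through the vocab's string store. It uses a blank English pipeline because the Matcher here relies only on token text; `results` is an illustrative variable.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_X_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

results = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the string store maps it back to the rule name
    rule_name = nlp.vocab.strings[match_id]
    results.append((rule_name, start, end, doc[start:end].text))

print(results)
```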


In [107]:
txt = """ Mehmed is 29 years old.He works in Haemimont, also he was worked on Haemimont and Haemimont. Maria is the best and Maria is with Deyvid.
    David is the man who believe in his farm.Be nice David"""

In [108]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(txt)


for e in doc.ents:
    print(e.text, e.label_)
    print(spacy.explain(e.label_))

Mehmed PERSON
People, including fictional
29 years old DATE
Absolute or relative dates or periods
Haemimont GPE
Countries, cities, states
Haemimont GPE
Countries, cities, states
Haemimont GPE
Countries, cities, states
Maria PERSON
People, including fictional
Maria PERSON
People, including fictional
Deyvid PERSON
People, including fictional


In [98]:
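# Initialize a new Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a single-token pattern that matches the exact text "Haemimont"
pattern = [{"TEXT": "Haemimont"}]
matcher.add("Haemimont", [pattern])
# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])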

Matches: ['Haemimont', 'Haemimont', 'Haemimont']
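
The matcher can also return Span objects directly, which avoids slicing the doc by hand. This is a sketch that assumes spaCy v3, where `as_spans=True` is available; "COMPANY" is a hypothetical rule name, and the sentence is a shortened piece of the text above.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("He works in Haemimont, also he was worked on Haemimont and Haemimont.")

matcher = Matcher(nlp.vocab)
# "COMPANY" is an assumed rule name used only for this example
matcher.add("COMPANY", [[{"TEXT": "Haemimont"}]])

# With as_spans=True each match is a Span whose label is the rule name
spans = matcher(doc, as_spans=True)

labelled = [(span.text, span.label_) for span in spans]
starts = sorted(span.start for span in spans)
print(labelled)
print(starts)
```

Since the rule name becomes the span label, this is one way to attach a different label to tokens that the trained model tagged as GPE.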
