# Chapter 1: Finding words, phrases, names and concepts
----

#### Contents:
* Documents
* Spans
* Tokens
* Lexical Attributes
* Statistical Models
* Named Entity Recognition
* Rule-based Matching

## Getting Started

* Import the English class from spacy.lang.en and create the nlp object.
* Create a doc and print its text.

In [2]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
# It contains the processing pipeline and
# includes language-specific tokenization rules
nlp = English()

# Process a text
doc = nlp("This is a sentence.")


# Print the text of the first token (doc.text would give the whole document)
print(doc[0].text)

This
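
To make the difference between a Doc and its tokens concrete, the short sketch below prints the full document text, the first token, and all token texts. It only needs the English class from above, so no model is required; the variable name token_texts is chosen here for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is a sentence.")

# doc.text is the whole document; doc[0] is the first Token
print(doc.text)
print(doc[0].text)

# Iterating over a Doc yields its tokens in order
token_texts = [token.text for token in doc]
print(token_texts)
```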


## Documents, Spans, Tokens
* When you call nlp on a string, spaCy first tokenizes the text and creates a document object.
* Step 1
    * Import the English language class and create the nlp object.
    * Process the text and instantiate a Doc object in the variable doc.
    * Select the first token of the Doc and print its text.
* Step 2
    * Import the English language class and create the nlp object.
    * Process the text and instantiate a Doc object in the variable doc.
    * Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.   

In [3]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:-1]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals
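
Slicing a Doc gives back a Span rather than a plain string. The sketch below, which assumes only the tokenizer-only English pipeline, shows that a Span keeps track of its tokens and that each token knows its own index in the parent Doc. The names span_length and first_in_span are illustrative.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# Slicing a Doc returns a Span, a view over the same tokens
span = doc[2:4]
print(type(span).__name__, span.text)

# Each token knows its index in the parent Doc via token.i
for token in span:
    print(token.i, token.text)

# A Span has a length and can be indexed like a list
span_length = len(span)
first_in_span = span[0].text
```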


## Lexical Attributes
* In this example, you’ll use spaCy’s Doc and Token objects, and lexical attributes to find percentages in a text. 
* You’ll be looking for two consecutive tokens: a number and a percent sign.
* Use the like_num token attribute to check whether a token in the doc resembles a number.
* Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
* Check whether the next token’s text attribute is a percent sign “%”.

In [4]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
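
Lexical attributes such as like_num depend only on the token text, so they work without a statistical model. As a rough sketch (the example sentence and the name number_like are made up for illustration), the loop below prints a few of these attributes side by side. Note that like_num also recognizes spelled-out numbers such as "ten".

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("It costs $5 or ten dollars.")

# Print text, is_alpha, is_punct and like_num for every token
for token in doc:
    print(f"{token.text:<10}{token.is_alpha!s:<8}{token.is_punct!s:<8}{token.like_num!s:<8}")

# Collect the tokens that resemble a number
number_like = [token.text for token in doc if token.like_num]
print(number_like)
```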


In [6]:
t = doc[0]

### spaCy's Token API Documentation: https://spacy.io/api/token

## Statistical Models

### Loading Models
* The models we’re using in this course are pre-installed.
* For more details on spaCy’s statistical models and how to install them on your machine, see the documentation.
* Use spacy.load to load the small English model "en_core_web_sm".
* Process the text and print the document text.

In [9]:
import spacy
# Load the small English model
nlp = spacy.load("en_core_web_sm")
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
# Process the text
doc = nlp(text)
# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value
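
If the model package is not installed on your machine, spacy.load raises an OSError (the package can be installed with python -m spacy download en_core_web_sm). One defensive sketch, assuming a tokenizer-only pipeline is an acceptable fallback, is shown below. The helper name load_english is hypothetical and not part of spaCy.

```python
import spacy


def load_english(model_name="en_core_web_sm"):
    """Load a model package, or fall back to a blank English pipeline."""
    try:
        return spacy.load(model_name)
    except OSError:
        # Model package not installed: tokenization still works,
        # but there are no tags, dependencies or entities
        return spacy.blank("en")


nlp = load_english()
doc = nlp("Apple is the first U.S. public company to reach a $1 trillion market value")

# An empty list here means the blank fallback was used
print(nlp.pipe_names)
print(doc.text)
```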


### Predicting Linguistic Annotations
* You’ll now get to try one of spaCy’s pre-trained model packages and see its predictions in action. 
* Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. 
* For example: spacy.explain("PROPN") or spacy.explain("GPE").
* Part 1
    * Process the text with the nlp object and create a doc.
    * For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).

In [18]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value. Hyderabad New York. India"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")


It          PRON      nsubj     
’s          VERB      punct     
official    NOUN      ccomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      
.           PUNCT     punct     
Hyderabad   PROPN     compound  
New         PROPN     compound  
York        PROPN     ROOT      
.           PUNCT     punct     
India       PROPN     ROOT      


In [17]:
print(spacy.explain('GPE'))

Countries, cities, states
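
spacy.explain works for part-of-speech tags, dependency labels and entity labels alike, and returns None when it does not know a label. A small sketch looping over a few labels (the label "NOT_A_LABEL" is deliberately invalid):

```python
import spacy

# spacy.explain returns a short description, or None for unknown labels
for label in ["PROPN", "GPE", "nsubj", "NOT_A_LABEL"]:
    print(label, "->", spacy.explain(label))
```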


* Part 2
    * Process the text and create a doc object.
    * Iterate over the doc.ents and print the entity text and label_ attribute.

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first India public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
India GPE
$1 trillion MONEY


## Predicting Named Entities in Context
* Models are statistical and not always right. 
* Whether their predictions are correct depends on the training data and the text you’re processing. 
* Let’s take a look at an example.
* Process the text with the nlp object.
* Iterate over the entities and print the entity text and label.
* Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [21]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
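
A manually created slice is just a Span without a label. If you also want it to show up in doc.ents, one possible approach is to build a labeled Span and assign it to the document's entities. The sketch below uses a blank English pipeline so it runs without a model, and the label "PRODUCT" is an illustrative choice rather than something the model predicted.

```python
from spacy.lang.en import English
from spacy.tokens import Span

# A blank pipeline has no entity recognizer, so doc.ents starts empty
nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled span for tokens 1 and 2 ("iPhone X")
# "PRODUCT" is an illustrative label choice
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Add it to the document's entities
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```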


## Rule-based Matching

### Using the Matcher

* Match patterns are lists of dictionaries, one per token:
    * Match exact token texts: [{"TEXT": "iPhone"}, {"TEXT": "X"}]
    * Match lexical attributes: [{"LOWER": "iphone"}, {"LOWER": "x"}]
    * Match any token attributes: [{"LEMMA": "buy"}, {"POS": "NOUN"}]


* Let’s try spaCy’s rule-based Matcher. 
* You’ll use the example from the previous exercise and write a pattern that can match the phrase “iPhone X” in the text.
* Steps:
    * Import the Matcher from spacy.matcher.
    * Initialize it with the nlp object’s shared vocab.
    * Create a pattern that matches the "TEXT" values of two tokens: "iPhone" and "X".
    * Use the matcher.add method to add the pattern to the matcher.
    * Call the matcher on the doc and store the result in the variable matches.
    * Iterate over the matches and get the matched span from the start to the end index.
    
    
> * If any matches are found, the matcher returns a list of (match_id, start, end) tuples:
>    * match_id: hash value of the pattern name
>    * start: start index of matched span
>    * end: end index of matched span

In [27]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
# (spaCy v2 signature; in spaCy v3 this becomes matcher.add("IPHONE_X_PATTERN", [pattern]))
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']
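
The match_id in each result is a hash, not the pattern name. As a sketch, the name can be recovered through the vocab's string store. This version uses a blank English pipeline (the pattern only needs token texts) and the list-of-patterns form of matcher.add, which is the spaCy v3 signature and is also accepted from v2.2.2 onwards. The names results and pattern_name are illustrative.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# List-of-patterns form of matcher.add (spaCy v3, also accepted by v2.2.2+)
matcher.add("IPHONE_X_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

results = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the vocab's string store maps it back to the name
    pattern_name = nlp.vocab.strings[match_id]
    results.append((pattern_name, start, end, doc[start:end].text))

print(results)
```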


### Matching lexical attributes
Here's an example of a more complex pattern using lexical attributes.
* We're looking for five tokens:
    * A token consisting of only digits.
    * Three case-insensitive tokens for "fifa", "world" and "cup".
    * And a token that consists of punctuation.
    * The pattern matches the tokens "2018 FIFA World Cup:".

In [37]:
# Matching lexical attributes
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
doc = nlp("2018 FIFA World Cup: France won!")
matcher.add("LEXICAL_PATTERN", None, pattern)
matches = matcher(doc)
# Unpack the first match and slice the doc to get the matched span
match_id, start, end = matches[0]
doc[start:end]

2018 FIFA World Cup:
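
Because the pattern uses "LOWER" rather than "TEXT", it should match regardless of capitalization. The sketch below checks this on two texts with a blank English pipeline, since all attributes involved are lexical. The second text, the pattern name "FIFA_PATTERN" and the list-of-patterns form of matcher.add (spaCy v3, also accepted by v2.2.2+) are assumptions made for this example.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
matcher.add("FIFA_PATTERN", [pattern])

# The same pattern is applied to differently capitalized texts
texts = ["2018 FIFA World Cup: France won!", "2022 fifa WORLD cup!"]
found = []
for text in texts:
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        found.append(doc[start:end].text)

print(found)
```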

### Matching other token attributes
* In this example, we're looking for two tokens:
    * A verb with the lemma "love", followed by a noun.
    * This pattern will match "loved dogs" and "love cats".

In [42]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]
doc = nlp("I loved dogs but now I love cats more.")
# Use a distinct pattern name instead of reusing "LEXICAL_PATTERN"
matcher.add("LOVE_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])

loved dogs
love cats


### Using operators and quantifiers (1)
* Operators and quantifiers let you define how often a token should be matched. 
* They can be added using the "OP" key.
* Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

In [43]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]
matcher.add("OPERATOR_PATTERN", None, pattern)
doc = nlp("I bought a smartphone. Now I'm buying apps.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])

bought a smartphone
buying apps


### Using operators and quantifiers (2)
* Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.
* "OP" can have one of four values:
    * An "!" negates the token, so it's matched 0 times.
    * A "?" makes the token optional, and matches it 0 or 1 times.
    * A "+" matches a token 1 or more times.
    * And finally, an "*" matches 0 or more times.
 

| Example | Description |
| --- | --- |
| {"OP": "!"} | Negation: match 0 times |
| {"OP": "?"} | Optional: match 0 or 1 times |
| {"OP": "+"} | Match 1 or more times |
| {"OP": "*"} | Match 0 or more times |
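
To see the "?" operator in isolation, here is a minimal sketch that only uses lexical attributes, so it runs on a blank English pipeline. The example text and the pattern name "IOS_OPTIONAL_VERSION" are made up for illustration, and matcher.add uses the list-of-patterns form (spaCy v3, also accepted by v2.2.2+). When the optional token is present, the matcher reports the match both with and without it.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# "iOS" followed by an optional digit token
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True, "OP": "?"}]
matcher.add("IOS_OPTIONAL_VERSION", [pattern])

doc = nlp("iOS 11 and iOS")
matched_texts = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched_texts)
```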


### Matching Patterns 1:
* In this exercise, you’ll practice writing more complex match patterns using different token attributes and operators.
* Part 1
    * Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

### Matching Patterns 2:
* Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).

In [23]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


### Matching Patterns 3:
* Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).

In [24]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
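
The optional noun produces two overlapping matches, "optional voice" and "optional voice responses". If only the longest one is wanted, one option is spacy.util.filter_spans, which drops overlapping spans in favor of longer ones. The sketch below is an assumption-laden stand-in: it replaces the ADJ/NOUN pattern with a text-based pattern so that it runs without a model, and the pattern name "VOICE_PATTERN" is hypothetical.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = English()
matcher = Matcher(nlp.vocab)
# Text-based stand-in for the ADJ/NOUN pattern so no model is required
pattern = [{"LOWER": "optional"}, {"LOWER": "voice"}, {"LOWER": "responses", "OP": "?"}]
matcher.add("VOICE_PATTERN", [pattern])

doc = nlp("automatic labels and optional voice responses.")
spans = [doc[start:end] for match_id, start, end in matcher(doc)]

# filter_spans keeps the longest span when candidates overlap
longest = filter_spans(spans)
longest_texts = [span.text for span in longest]
print(longest_texts)
```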
