# Chapter 1: Finding words, phrases, names, and concepts
### _Advanced NLP with spaCy_

## 1. Introduction to spaCy

### The nlp object

* center of spaCy: object containing the processing pipeline, usually called `nlp`
* EX. to create an English `nlp` object, import the `English` language class from `spacy.lang.en` and instantiate it
* you can also use the `nlp` object like a function to analyze text
* `nlp` contains all the different components in the pipeline
* also includes language-specific rules used for tokenizing the text into words and punctuation
* spaCy supports a variety of languages that are available in `spacy.lang`

_slide_
* contains the processing pipeline
* includes language-specific rules for tokenization etc.

In [2]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
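
A minimal sketch of the points above, assuming only that spaCy is installed (no model package is needed): the `nlp` object can be called like a function, and a freshly created `English` object has no pipeline components beyond the tokenizer.

```python
from spacy.lang.en import English

# A bare English pipeline: tokenization rules only, no trained components
nlp = English()

# No pipeline components have been added yet
print(nlp.pipe_names)

# Calling nlp like a function processes the text and returns a Doc
doc = nlp("Hello world!")
print(type(doc).__name__, len(doc))
```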

### The Doc object
* when you process text with the `nlp` object, spaCy creates a `Doc` object, short for "document"
* the Doc allows you to access information about the text in a structured way, and no information is lost
* the Doc behaves like a normal Python sequence & lets you iterate over its tokens, or get a token by its index

In [3]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!
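
To illustrate the sequence behavior and the "no information is lost" point, here is a small sketch. It relies on the `text_with_ws` token attribute, which is not used elsewhere in these notes, to rebuild the original string from the tokens.

```python
from spacy.lang.en import English

nlp = English()
text = "Hello world!"
doc = nlp(text)

# The Doc supports len() and indexing, including negative indices
print(len(doc))
print(doc[0].text, doc[-1].text)

# Each token keeps its trailing whitespace, so the text can be rebuilt
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```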


### The Token object
* `Token` objects represent the tokens in a document
    - EX. a word or a punctuation character
* to get a token at a specific position, you can index into the doc
* `Token` objects also provide various attributes that let you access more information about the tokens

In [4]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


### The Span object
* a `Span` object is a slice of the document consisting of one or more tokens
* it's only a view of the `Doc` and doesn't contain any data itself
* to create a span, you can use Python's slice notation
    - EX. `1:3` will create a slice starting from the token at position 1 up to (but not including) the token at pos 3

In [5]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!
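
The "view" idea can be checked with a short sketch: the span only records where it starts and ends and points back to its parent `Doc`. The attributes `start`, `end`, and `doc` are standard `Span` attributes, although the notes above don't mention them.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")
span = doc[1:3]

# Token offsets of the span within the parent Doc (end is exclusive)
print(span.start, span.end)

# The span refers back to the Doc it was sliced from
print(span.doc is doc)

# Tokens in the span keep their index in the parent document
print([token.i for token in span])
```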


### Lexical Attributes
* some of the available token attributes
    - `i`: index of the token within the parent document
    - `text`: returns the token text
    - `is_alpha`, `is_punct`, and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number
    - EX. `like_num` is true for a token "10" or the word "ten"
* these attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context

In [7]:
doc = nlp("It costs $5.")

print("Index: ", [token.i for token in doc])
print("Text: ", [token.text for token in doc])

print("is_alpha: ", [token.is_alpha for token in doc])
print("is_punct: ", [token.is_punct for token in doc])
print("like_num: ", [token.like_num for token in doc])

Index:  [0, 1, 2, 3, 4]
Text:  ['It', 'costs', '$', '5', '.']
is_alpha:  [True, True, False, False, False]
is_punct:  [False, False, False, False, True]
like_num:  [False, False, False, True, False]
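
As a sketch of the "10" vs. "ten" example and of context independence: the same flags can be read straight from the vocabulary entry (a lexeme) via `nlp.vocab`, without any surrounding text. The example sentence is made up for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I bought ten apples and 10 pears.")

# like_num is true for digits and for spelled-out number words
numbers = [token.text for token in doc if token.like_num]
print(numbers)

# The same attributes live on the vocabulary entry, independent of context
lexeme = nlp.vocab["ten"]
print(lexeme.text, lexeme.is_alpha, lexeme.like_num)
```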


## 2. Getting Started

In [1]:
# Part 1: English
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


In [2]:
# Part 2: German
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [3]:
# Part 3: Spanish
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?
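
The three parts above can also be written as one loop. This sketch uses `spacy.blank`, which is not used in these notes: it takes a language code and returns an empty pipeline of the matching language class.

```python
import spacy

# Language code -> sample text (the same texts as in Parts 1-3)
samples = {
    "en": "This is a sentence.",
    "de": "Liebe Grüße!",
    "es": "¿Cómo estás?",
}

classes = {}
for code, text in samples.items():
    # Create a blank pipeline for the language code
    nlp = spacy.blank(code)
    doc = nlp(text)
    classes[code] = type(nlp).__name__
    print(code, classes[code], [token.text for token in doc])
```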


## 3. Documents, spans, and tokens

When you call `nlp` on a string, spaCy first tokenizes the text and then
creates a document object.
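
The tokenization step can be run on its own. The sketch below uses `nlp.make_doc`, which only applies the tokenizer and returns a `Doc`; with a bare `English` object the result should match calling `nlp` directly, since no other components are present.

```python
from spacy.lang.en import English

nlp = English()
text = "I like tree kangaroos and narwhals."

# Tokenize only: turn the string into a Doc without running any components
doc = nlp.make_doc(text)
tokens = [token.text for token in doc]
print(tokens)
```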

Step 1
* Import English language class and create the `nlp` object
* Process the text and instantiate a `Doc` object in the variable `doc`
* Select the first token of the `Doc` and print its `text`

In [4]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


Step 2
* Import the `English` language class and create the `nlp` object
* Process the text and instantiate a `Doc` object in the variable `doc`
* Create slices of the `Doc` for the tokens "tree kangaroos" and "tree kangaroos and narwhals"

In [5]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


## 4. Lexical Attributes

Use `Doc` and `Token` objects and lexical attributes to find percentages in a
text. In this example, we'll look for two consecutive tokens: a number and a
percent sign.

* Use the `like_num` token attribute to check whether a token in the `doc`
    resembles a number
* Get the token following the current token in the document. The index of the
    next token in the `doc` is `token.i + 1`
* Check whether the next token's `text` attribute is a percent sign "%"

In [6]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
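
One caveat with the loop above: `doc[token.i + 1]` raises an `IndexError` if the last token of the text resembles a number. A possible variant with a bounds check is sketched below; `find_percentages` is a hypothetical helper name and the sentence is made up for illustration.

```python
from spacy.lang.en import English

nlp = English()


def find_percentages(doc):
    """Return the text of number-like tokens directly followed by "%"."""
    found = []
    for token in doc:
        # Only look ahead if a following token actually exists
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found


# The text ends in a number, which would break an unguarded lookup
percentages = find_percentages(nlp("Poverty fell from 60% to 4% by 2015"))
print(percentages)
```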


## 5. Statistical Models (Video)

### What are statistical models?

* some of the most interesting things you can analyze are context-specific
    - EX. whether a word is a verb or whether a span of text is a person's name
* statistical models enable spaCy to make predictions in context
* these predictions usually include part-of-speech tags, syntactic dependencies, and named entities
* models are trained on large datasets of labeled example texts
* they can be updated w/ more examples to fine-tune their predictions
    - EX. perform better on your specific data

_slide_ 
* enable spaCy to predict linguistic attributes in context
  - part-of-speech tags
  - syntactic dependencies
  - named entities
* trained on labeled example texts
* can be updated with more examples to fine-tune predictions

### Model Packages

* spaCy provides a number of pre-trained model packages you can download using the `spacy download` command
    - EX. `en_core_web_sm` package: small English model that supports all core capabilities and is trained on web text
    - `$ python -m spacy download en_core_web_sm`
* the `spacy.load` function loads a model package by name and returns an `nlp` object
* the package provides the binary weights that enable spaCy to make predictions
* also includes the vocabulary and meta information to tell spaCy which language class to use and how to configure the processing pipeline

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")
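
The meta information and pipeline configuration mentioned above can be inspected on the loaded object. In this sketch, the fallback to `spacy.blank("en")` is an assumption added only so the code also runs where the model package isn't installed; with the real package, `pipe_names` lists the trained components.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption for this sketch: use an empty English pipeline instead
    nlp = spacy.blank("en")

# Language recorded in the meta information
print(nlp.meta["lang"])

# Names of the components in the processing pipeline
print(nlp.pipe_names)
```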

### Predicting Part-of-speech Tags
* EX. using spaCy to predict part-of-speech tags, the word types in context
    - First, we load the small English model and receive an `nlp` object
    - Next, we process the text "She ate the pizza"
    - For each token in the doc, we can print the text and the `.pos_` attribute, the predicted part-of-speech tag
* attributes that return strings usually end with an underscore
* attributes without the underscore return an integer ID value

In [1]:
import spacy

# load the small English model
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
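
The underscore convention can be seen without a trained model. This sketch uses `orth` and `orth_` (the token's verbatim text) purely because they need no predictions; `pos` and `pos_` follow the same pattern. The integer ID is resolved back to a string through `nlp.vocab.strings`.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("She ate the pizza")
token = doc[3]

# With underscore: the readable string
print(token.orth_)

# Without underscore: an integer ID
print(token.orth)

# The ID maps back to the string via the vocabulary's string store
print(nlp.vocab.strings[token.orth])
```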


* in addition to part-of-speech tags, we can also predict how the words are related
    - EX. is a word the subject of a sentence or an object?
* the `.dep_` attribute returns the predicted dependency label
* the `.head` attribute returns the syntactic head token
* can also think of it as the parent token this word is attached to

In [2]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


### Dependency label scheme
_Note: Dependency label scheme in video_

* to describe syntactic dependencies, spaCy uses a standardized label scheme
* EX of some common labels
    - the pronoun "She" is a nominal subject attached to the verb - in this case, to "ate"
    - the noun "pizza" is a direct object attached to the verb "ate"; it is eaten by the subject "she"
    - the determiner "the", also known as an article, is attached to the noun "pizza"

### Predicting Named Entities
* named entities are "real world objects" that are assigned a name
    - EX: a person, an organization, a country
* the `doc.ents` property lets you access the named entities predicted by the model
* it returns a sequence of `Span` objects, so we can print the entity text and the entity label using the `.label_` attribute

In [14]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Tip: the spacy.explain function
* to get the definition for the most common tags and labels, you can use the `spacy.explain` helper function
    - EX: "GPE" for geopolitical entity

In [15]:
spacy.explain("GPE")

'Countries, cities, states'

In [16]:
spacy.explain("NNP")

'noun, proper singular'

In [17]:
spacy.explain("dobj")

'direct object'
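
When several tags or labels need looking up, the calls above can be collected in one pass. A small sketch, using only the three labels already shown:

```python
import spacy

labels = ["GPE", "NNP", "dobj"]

# Map each tag or label to its short description
explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(f"{label:<6}{description}")
```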

## 6. Model Packages

What's not included in a model package that you can load into spaCy?

1. A meta file including the language, pipeline, and license

2. Binary weights to make statistical predictions

3. The labelled data that the model was trained on

4. Strings of the model's vocabulary and their hashes

Answer: (3)

Explanation: Statistical models allow you to generalize based on a set of
training examples. Once they're trained, they use binary weights to make
predictions. That's why it's not necessary to ship them with their training
data.

## 7. Loading models

The models we're using in this course are already pre-installed.

* Use `spacy.load` to load the small English model `en_core_web_sm`
* Process the text and print the document text

In [18]:
import spacy

# Load the "en_core_web_sm" model
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


## 8. Predicting linguistic annotations

### Part 1
* process the text with the `nlp` object and create a `doc`
* for each token, print the token text, the token's `.pos_` (part-of-speech tag) and the token's `.dep_` (dependency label)

In [19]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      punct     
official    NOUN      ccomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


### Part 2
* process the text and create a `doc` object
* iterate over the `doc.ents` and print the entity text and `label_` attribute

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


## 9. Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing.

* process the text with the `nlp` object
* iterate over the entities and print the entity text and label
* the model didn't predict "iPhone X" as an entity; create a span for those tokens manually

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)
  
# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text) 
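
A plain slice like `doc[1:3]` has no entity label. One possible way to turn it into an entity is sketched below, under two assumptions that are not part of the exercise: a bare `English` pipeline is used so the example runs without a model (it predicts no entities), and the label `"PRODUCT"` is chosen for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Span

# Bare pipeline: no entity recognizer, so doc.ents starts out empty
nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled span over the tokens "iPhone X"
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Add it to the document's entities
doc.ents = list(doc.ents) + [iphone_x]
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```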

## 10. Rule-based matching
* GOAL: look at spaCy's matcher, which lets you write rules to find words and phrases in text

#### Why not just regular expressions?
* Match on `Doc` objects, not just strings
* Match on tokens and token attributes
* Use the model's predictions
    - EX: "duck" (verb) vs. "duck" (noun)

_lecture note_
* Compared to regular expressions, the matcher works with `Doc` and `Token` objects instead of only strings
* It's also more flexible; you can search for texts but also other lexical attributes
* you can even write rules that use the model's predictions
* EX: find the word "duck" only if it's a verb, not a noun

### Match patterns
* Lists of dictionaries, one per token
* Match exact token texts

`[{"TEXT": "iPhone"}, {"TEXT": "X"}]`

* Match lexical attributes

`[{"LOWER": "iphone"}, {"LOWER": "x"}]`

* Match any token attributes

`[{"LEMMA": "buy"}, {"POS": "NOUN"}]`

_lecture notes_
* Match patterns are lists of dictionaries, where each dictionary describes one token
* The keys are the names of token attributes, mapped to their expected values
    - EX: looking for two tokens with the text "iPhone" and "X"
* We can also match on other token attributes
    - EX: Looking for two tokens whose lowercase forms equal "iphone" and "x"
* We can even write patterns using attributes predicted by the model
    - EX: matching a token with the lemma "buy" plus a noun
    - the lemma is the base form; this pattern would match phrases like "buying milk" or "bought flowers"
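
A sketch of how the lowercase pattern above might be used with the `Matcher` class from `spacy.matcher`. It assumes spaCy v2.2.2 or later, where `matcher.add` accepts a list of patterns as its second argument. The pattern name `"IPHONE_PATTERN"` and the example text are chosen for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()

# The matcher shares the vocabulary with the pipeline
matcher = Matcher(nlp.vocab)

# One dictionary per token: match on the lowercase form
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
matcher.add("IPHONE_PATTERN", [pattern])

doc = nlp("Upcoming iPhone X release date leaked")

# Each match is a (match_id, start, end) tuple of token offsets
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```

The lemma and part-of-speech pattern would need a trained model, since those attributes are predicted rather than read from the vocabulary.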