# Chapter 1: Finding words, phrases, names and concepts

In [2]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
doc = nlp("Hello World!")

for token in doc:
    print(token.text)

Hello
World
!


In [5]:
token = doc[0]

print(token.text)

Hello


A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a Span, you can use Python's slice notation. For example, `doc[1:3]` creates a slice starting from the token at position 1, up to but not including the token at position 3.

In [6]:
span = doc[1:3]
print(span.text)

World!
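The "view" behaviour described above can be checked directly. The following is a minimal sketch, assuming a blank English pipeline (`spacy.blank("en")`, which only tokenizes) is acceptable in place of the loaded model; the attributes used (`start`, `end`, `doc`) are standard Span attributes.

```python
import spacy

# Assumption: a blank English pipeline is used because only tokenization is needed.
nlp = spacy.blank("en")
doc = nlp("Hello World!")

span = doc[1:3]

# The span only records where it sits in the parent Doc.
print(span.start, span.end)  # start is inclusive, end is exclusive
print(len(span))             # number of tokens in the slice

# Its tokens are the Doc's own tokens, so they keep their document index.
print(span[0].i)
print(span.doc is doc)       # the span points back at the same Doc object
```

Because the span keeps a reference to the parent document rather than a copy of it, creating many spans over the same Doc is cheap.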


Here you can see some of the available token attributes:

`i` is the index of the token within the parent document.

`text` returns the token text.

`is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation or whether it resembles a number, for example the token "10" or the word "ten".

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [7]:
doc = nlp("This will cost $5.")

print('Index : ', [token.i for token in doc])
print('Text : ', [token.text for token in doc])
print('Is Alpha : ', [token.is_alpha for token in doc])
print('Is Punct : ', [token.is_punct for token in doc])
print('Like Num : ', [token.like_num for token in doc])

Index :  [0, 1, 2, 3, 4, 5]
Text :  ['This', 'will', 'cost', '$', '5', '.']
Is Alpha :  [True, True, True, False, False, False]
Is Punct :  [False, False, False, False, False, True]
Like Num :  [False, False, False, False, True, False]
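The note that `like_num` covers both "10" and "ten" can be tried out as follows. This is a small sketch, assuming a blank English pipeline is sufficient, since lexical attributes do not need a trained model; the example sentence is made up for illustration.

```python
import spacy

# Assumption: lexical attributes come from the language defaults,
# so a blank pipeline without a statistical model is enough.
nlp = spacy.blank("en")
doc = nlp("I paid ten dollars, not 10 euros.")

# Map each token text to its like_num and is_digit flags.
like_num = {token.text: token.like_num for token in doc}
is_digit = {token.text: token.is_digit for token in doc}

for text in ("ten", "10", "dollars"):
    print(text, like_num[text], is_digit[text])
```

`like_num` is the broader check: it accepts spelled-out numbers, while `is_digit` is only true for tokens made of digits.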


### Statistical Models

What are statistical models?

-> Enable spaCy to predict linguistic attributes in context
    1. Part-of-speech tags
    2. Syntactic dependencies
    3. Named entities
-> Trained on labeled example texts
-> Can be updated with more examples to fine-tune predictions

In [8]:
#Model Packages : 
#!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')

# Binary weights
# Vocabulary
# Meta information (language, pipeline)

What’s not included in a model package that you can load into spaCy?

    a. A meta file including the language, pipeline and license.
            All models include a meta.json that defines the language to initialize, the pipeline component names to load as well as general meta information like the model name, version, license, data sources, author and accuracy figures (if available).
    b. Binary weights to make statistical predictions.
            To predict linguistic annotations like part-of-speech tags, dependency labels or named entities, models include binary weights.
    c. The labelled data that the model was trained on. (This is not included)
            Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.
    d. Strings of the model's vocabulary and their hashes.
            Model packages include a strings.json that stores the entries in the model’s vocabulary and the mapping to hashes. This allows spaCy to only communicate in hashes and look up the corresponding string if needed.
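The string-to-hash mapping mentioned in item d can be looked at through `nlp.vocab.strings`. The sketch below assumes a blank English pipeline and a made-up sentence; the word "coffee" is only an illustration. The string has to be seen by the pipeline first so that the reverse lookup (hash to string) succeeds.

```python
import spacy

# Assumption: a blank pipeline is used; its vocab starts without this word.
nlp = spacy.blank("en")
doc = nlp("I love coffee")  # processing the text adds its strings to the vocab

coffee_hash = nlp.vocab.strings["coffee"]       # string -> hash
coffee_string = nlp.vocab.strings[coffee_hash]  # hash -> string

print(coffee_hash, coffee_string)

# Tokens store the hash of their text; the string is looked up on demand.
print(doc[2].orth == coffee_hash)
```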

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing.

### Rule-based Matching

Why not just regular expressions?

1. Match on Doc objects, not just strings
2. Match on tokens and token attributes
3. Use the model's predictions
4. Example: "duck"(verb) vs. "duck"(noun)

### Match Patterns

1. Lists of dictionaries, one per token
2. Match exact token texts

        [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

3. Match lexical attributes

        [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

4. Match any token attributes

        [{'LOWER': 'buy'}, {'POS': 'NOUN'}]
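The difference between matching exact text (`TEXT`) and a lexical attribute (`LOWER`) shows up on input with mixed casing. This is a hedged sketch: the helper `find` and the example sentence are hypothetical, a blank English pipeline is assumed, and `matcher.add` is called in its list-of-patterns form, which to my knowledge requires spaCy v2.2.2 or later and is the only form accepted in v3. The cells in these notes use the older `matcher.add(name, None, pattern)` signature.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: TEXT and LOWER need no trained model, so a blank pipeline is used.
nlp = spacy.blank("en")
doc = nlp("I want the IPHONE X and the iPhone X")


def find(pattern):
    """Hypothetical helper: return the text of every span matching `pattern`."""
    matcher = Matcher(nlp.vocab)
    # List-of-patterns form of add(); assumes spaCy v2.2.2 or later.
    matcher.add("EXAMPLE", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]


exact = find([{"TEXT": "iPhone"}, {"TEXT": "X"}])
lowercase = find([{"LOWER": "iphone"}, {"LOWER": "x"}])

print("TEXT :", exact)      # only the exactly-cased occurrence
print("LOWER:", lowercase)  # both occurrences, regardless of casing
```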

### Using the Matcher Part 1

In [9]:
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the Matcher
pattern = [{'TEXT':'iPhone'}, {'TEXT':'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp('New iPhone X release date leaked')
matches = matcher(doc)

In [10]:
matches

[(9528407286733565721, 1, 3)]

### Using the Matcher Part 2

In [11]:
# Iterate over the matches
# match_id: hash value of the pattern name
# start: start index of matched span
# end: end index of matched span

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
    


iPhone X


### Matching lexical attributes

In [12]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

doc = nlp("2018 FIFA World Cup: France won!")

In [13]:
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN', None, pattern)

matches = matcher(doc)
matches

[(11920309760829426267, 0, 5)]

In [14]:
for match_id, start, end in matches:
    match_span = doc[start:end]
    print(match_span.text)

2018 FIFA World Cup:


### Matching other token attributes

In [15]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS' : 'NOUN'}
]

doc = nlp("I loved dogs but now i love cats more.")

In [16]:
matcher = Matcher(nlp.vocab)
matcher.add('TOKEN_PATTERN', None, pattern)
matches = matcher(doc)
matches

[(5725346615152885079, 1, 3), (5725346615152885079, 6, 8)]

In [17]:
for match_id, start, end in matches:
    match_span = doc[start:end]
    print(match_span.text)

loved dogs
love cats


### Using operators and quantifiers Part 1

In [19]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP':'?'},    # Optional : match 0 or 1 times
    {'POS': 'NOUN'}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")
matcher = Matcher(nlp.vocab)
matcher.add('OPERAOTRS', None, pattern)

matches = matcher(doc)
matches

[(14539848206026242071, 1, 4), (14539848206026242071, 8, 10)]

In [20]:
for match_id, start, end in matches:
    match_span = doc[start:end]
    print(match_span)

bought a smartphone
buying apps


### Using operators and quantifiers Part 2

In [22]:
#Example                 Description


#{'OP': '!'}             Negation: match 0 times
#{'OP': '?'}             Optional: match 0 or 1 times
#{'OP': '+'}             Match 1 or more times
#{'OP': '*'}             Match 0 or more times

print("Matches:", [doc[start:end].text for match_id, start, end in matches])
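To see one of the other operators from the table in action, here is a small sketch of `'+'` (match 1 or more times). It assumes a blank English pipeline, a made-up sentence, and the list-of-patterns form of `matcher.add` (spaCy v2.2.2 or later, as far as I know); none of this comes from the original notes.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: LOWER is a lexical attribute, so no trained model is required.
nlp = spacy.blank("en")
doc = nlp("The food was very very good")

# One or more "very" tokens followed by "good".
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]

matcher = Matcher(nlp.vocab)
matcher.add("VERY_GOOD", [pattern])  # list form; assumes spaCy v2.2.2+

texts = [doc[start:end].text for _, start, end in matcher(doc)]
print("Matches:", texts)
```

The matcher reports every span that satisfies the pattern, so with quantifiers the result can contain overlapping matches of different lengths. If only the longest spans are wanted, they need to be filtered afterwards.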

### Writing Match Patterns