## [Chapter 2](https://course.spacy.io/chapter2)

In [38]:
import spacy
import json

# $ python -m spacy download en_core_web_sm
# $ python -m spacy download en_core_web_md
# $ python -m spacy download en_core_web_lg

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
# Look up the hash value of a string in the vocab's StringStore
nlp.vocab.strings['coffee']

3197928453018144401
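The lookup also works in the other direction, from hash back to string. The following is a minimal sketch, assuming a blank `English` pipeline so that no model download is needed; the sample sentence is made up for illustration. A hash can only be resolved back to text if the vocab has already seen the string, which is why the sketch processes a text containing "coffee" first.

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no trained model required.
nlp = English()

# Processing a text adds its words to the shared StringStore.
doc = nlp("I love coffee")

# String -> hash
coffee_hash = nlp.vocab.strings["coffee"]
# Hash -> string (works because "coffee" has been seen by this vocab)
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_string)

# The Doc exposes the same StringStore through doc.vocab
print(doc.vocab.strings["coffee"] == coffee_hash)
```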

In [4]:
# Lexemes expose attributes, just like tokens.
# They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.
# Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.
lexeme = nlp.vocab['coffee']
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


In [5]:
# In this exercise, you'll create the Doc and Span objects manually and update the named entities, just like spaCy does behind the scenes.
# A blank English nlp object is created below to provide the shared vocab.

from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]
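The `spaces` list in the cell above controls whether each word is followed by a space, which matters as soon as punctuation is involved. Here is a small sketch with a made-up sentence (the words and spaces are illustrative, not from the course data) showing how the two lists combine into the final text.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# One boolean per word: True means the word is followed by a space.
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Each token keeps its own trailing whitespace
print([token.text_with_ws for token in doc])
```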


#### Data Structure Best Practice

In [6]:
# The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)
            
# Why is the code bad?
# It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.

# Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent.
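for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if a next token exists and is a verb
        # (the bounds check avoids an IndexError when the proper noun is the last token)
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)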

Found proper noun before a verb: Berlin
Found proper noun before a verb: Berlin


#### Word vectors and semantic similarity
 - Similarity is determined using word vectors
 - Multi-dimensional meaning representations of words
 - Generated using an algorithm like Word2Vec and lots of text
 - Can be added to spaCy's statistical models
 - Default: cosine similarity, but can be adjusted
 - Doc and Span vectors default to average of token vectors
 - Short phrases are better than long documents with many irrelevant words
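To make the two central ideas in the list concrete (cosine similarity, and a Doc or Span vector as the average of its token vectors), here is a minimal pure-Python sketch. The three-dimensional vectors are invented toy values, not real spaCy vectors, and the helper names `cosine_similarity` and `average_vector` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Component-wise mean, analogous to how a Doc vector is averaged from token vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Toy vectors (made up for illustration)
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}

food_phrase = average_vector([toy_vectors["pizza"], toy_vectors["pasta"]])

print(cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"]))
print(cosine_similarity(toy_vectors["pizza"], toy_vectors["soap"]))
print(cosine_similarity(food_phrase, toy_vectors["soap"]))
```

With these toy values, "pizza" is much closer to "pasta" than to "soap". The real models use vectors with hundreds of dimensions, but the arithmetic is the same.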

In [7]:
# Load a larger model with vectors
nlp = spacy.load("en_core_web_lg")

In [None]:
# Access the vector via the token.vector attribute
doc = nlp("I have a banana")
print(doc[3].vector)

In [11]:
# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8627203210548107


In [12]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.73695457


In [14]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]
print(doc.similarity(token))

0.32531983166759537


In [15]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")
print(span.similarity(doc))

0.6199091710787739


In [22]:
# Similarity is useful for many applications: recommendation systems, flagging duplicates, etc.
# There's no objective definition of "similarity":
# it depends on the context and on what the application needs to do.
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")
print(doc1.similarity(doc2))

print("""
        Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". 
        This makes sense, because both texts express sentiment about cats. 
        But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.
    """)

0.9501446702124066

        Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". 
        This makes sense, because both texts express sentiment about cats. 
        But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.
    


#### Combining models and rules

Statistical models are useful if your application needs to be able to generalize based on a few examples. For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships.

Rule-based approaches on the other hand come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.

In [23]:
# Efficient phrase matching
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
pattern = nlp("Golden Retriever")
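# Pass the patterns as a list (supported from spaCy v2.2.2, required in v3);
# the older call was matcher.add('DOG', None, pattern)
matcher.add('DOG', [pattern])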
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever
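By default the PhraseMatcher compares the exact token text, so "golden retriever" in lowercase would not match the pattern above. A minimal sketch of case-insensitive matching follows, assuming a spaCy version that accepts the `attr` argument and the list form of `matcher.add` (v2.2.2 or later); the breed list and sample sentence are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()

# attr="LOWER" compares the lowercase form of each token instead of the exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

breeds = ["Golden Retriever", "Poodle"]
# make_doc only runs the tokenizer, which is all the PhraseMatcher needs here
patterns = [nlp.make_doc(name) for name in breeds]
matcher.add("DOG", patterns)

doc = nlp("I have a golden retriever and a POODLE")

# Sort by start index so the result order does not depend on the matcher internals
matches = sorted(matcher(doc), key=lambda m: m[1])
matched_texts = [doc[start:end].text for _, start, end in matches]
print(matched_texts)
```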


#### Efficient phrase matching
Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.

In [57]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

with open("exercises/countries.json", encoding="utf8") as f:
    COUNTRIES = json.load(f)
with open("exercises/countries_text.txt", encoding="utf8") as f:
    TEXT = f.read()

nlp = English()
doc = nlp(TEXT)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
# Patterns go in as a list (supported from spaCy v2.2.2, required in v3)
matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print('Matches:')
print([doc[start:end] for match_id, start, end in matches])

# Print the entities in the document
print('\nGPE ENT\'s:')
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

Matches:
[Namibia, South Africa, Cambodia, Kuwait, Somalia, Haiti, Mozambique, Somalia, Rwanda, Singapore, Sierra Leone, Afghanistan, Iraq, Sudan, Congo, Haiti]

GPE ENT's:
[]


Above you wrote a script using spaCy’s PhraseMatcher to find country names in text. Let’s use that country matcher and update the document’s entities with the matched countries.

In [58]:
# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

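    # Get the span's root head token
    # (no dependency parser runs in the blank English pipeline, so the output
    # below shows each span's first token rather than a syntactic head)
    span_root_head = span.root.head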
    
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print('\nGPE ENT\'s:')
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

Namibia --> Namibia
South --> South Africa
Cambodia --> Cambodia
Kuwait --> Kuwait
Somalia --> Somalia
Haiti --> Haiti
Mozambique --> Mozambique
Somalia --> Somalia
Rwanda --> Rwanda
Singapore --> Singapore
Sierra --> Sierra Leone
Afghanistan --> Afghanistan
Iraq --> Iraq
Sudan --> Sudan
Congo --> Congo
Haiti --> Haiti

GPE ENT's:
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
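Assigning to `doc.ents` fails if two spans overlap, which can happen when one pattern is contained in another (for example "Africa" inside "South Africa"). A hedged sketch of one way to guard against this uses `spacy.util.filter_spans`, which keeps the longest span among overlapping candidates. The sentence and candidate spans here are made up for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = English()
doc = nlp("I visited South Africa")

# Two overlapping candidates, as a matcher with nested patterns might produce
candidates = [
    Span(doc, 3, 4, label="GPE"),  # "Africa"
    Span(doc, 2, 4, label="GPE"),  # "South Africa"
]

# Keep only non-overlapping spans, preferring the longest one
spans = filter_spans(candidates)

# Set the entities once instead of reassigning doc.ents inside a loop
doc.ents = list(doc.ents) + spans
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Collecting the spans first and assigning `doc.ents` once also avoids rebuilding the entity list on every iteration.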
