# Text Analysis with spaCy
### The official tutorial

In this project, I work through the official spaCy tutorial to learn about its fundamental capabilities.

# Chapter 1: Finding words, phrases, names and concepts
This chapter will introduce you to the basics of text processing with spaCy. You'll learn about the data structures, how to work with trained pipelines, and how to use them to predict linguistic features in your text.

## 1.1. Introduction to spaCy

In [None]:
# Import spaCy
import spacy

### 1.1.1. Create a blank English nlp object
- Contains the processing pipeline, with all of its components.
- Includes language-specific rules for tokenizing the text into words and punctuation.
- Can be called like a function to analyze text.

In [None]:
nlp = spacy.blank("en")
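As a small sketch of what the blank object provides: it should have no pipeline components, yet its tokenizer rules already split punctuation from words. The sample sentence is an assumption chosen for illustration.

```python
import spacy

# A blank English pipeline: tokenization rules only, no trained components
nlp = spacy.blank("en")

# List the names of the pipeline components (expected to be empty here)
print(nlp.pipe_names)

# Calling nlp on a string runs the tokenizer and returns a Doc
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]
print(tokens)
```

Punctuation ends up in separate tokens even though no model has been loaded, because tokenization is rule-based.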

### 1.1.2. Process a text

In [None]:
doc = nlp("This is a sentence.")

Print the document text

In [None]:
print(doc.text)

### 1.1.3. Iterate over tokens in a Doc

In [None]:
for token in doc:
    print(token.text)

### 1.1.4. Index into the Doc to get a single Token

In [None]:
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

In [None]:
# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

### 1.1.5. Lexical Attributes
Lexical attributes refer to the entry in the vocabulary and don't depend on the token's context.

In [None]:
doc = nlp("It costs $5.")

print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])
print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])
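Because lexical attributes belong to the vocabulary entry, they can also be read without any surrounding text. The following is a minimal sketch under that assumption, looking up entries through nlp.vocab and comparing one of them with the token from the sentence above.

```python
import spacy

nlp = spacy.blank("en")

# Look up vocabulary entries directly, with no sentence around them
for word in ["costs", "5", "."]:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_alpha, lexeme.is_punct, lexeme.like_num)

# The token "5" inside a sentence reports the same context-independent value
doc = nlp("It costs $5.")
five_token = doc[3]
same_value = five_token.like_num == nlp.vocab["5"].like_num
print(five_token.text, same_value)
```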

In [None]:
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document, if there is one
        if token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            # Check if the next token's text equals "%"
            if next_token.text == "%":
                print("Percentage found:", token.text)

## 1.2. Trained Pipelines

What are trained pipelines?
- Models that enable spaCy to predict linguistic attributes in context:
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

To install a trained pipeline package, follow the directions in the following link:
https://www.listendata.com/2019/04/install-python-package.html
The cells below load en_core_web_sm, which can be downloaded with:
python -m spacy download en_core_web_sm

### 1.2.1. Predicting Part-of-speech Tags

In [None]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

### 1.2.2. Predicting Named Entities

In [None]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

### 1.2.3. Tip: the spacy.explain method

In [None]:
spacy.explain("GPE"), spacy.explain("NNP")

## 1.3. Rule-based matching

Why not just regular expressions?
- Match on Doc objects, not just strings
- Match on tokens and token attributes
- Use a model's predictions
- Example: "duck" (verb) vs. "duck" (noun)

Match patterns are lists of dictionaries, one per token:

- Match exact token texts
[{"TEXT": "iPhone"}, {"TEXT": "X"}]

- Match lexical attributes
[{"LOWER": "iphone"}, {"LOWER": "x"}]

- Match any token attributes
[{"LEMMA": "buy"}, {"POS": "NOUN"}]
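As a quick sketch of how such a pattern behaves, the lexical-attribute version can be tried on a blank pipeline, since LOWER needs no trained model (LEMMA and POS patterns do). The pattern name IPHONE_LOWER and the sample sentence are assumptions made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token; LOWER compares against the lowercased token text
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
matcher.add("IPHONE_LOWER", [pattern])

doc = nlp("She sold her IPHONE X and bought an iPhone x yesterday")

# Each match is a (match_id, start, end) tuple of token indices
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```

Both spellings should be found, whereas the exact TEXT pattern would match neither of them.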

### 1.3.1. Matcher

In [None]:
# Import the Matcher
from spacy.matcher import Matcher

In [None]:
# Load a pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

In [None]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

In [None]:
# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

In [None]:
# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

In [None]:
# Call the matcher on the doc
matches = matcher(doc)

In [None]:
len(matches)

### 1.3.2. Using Matcher

In [None]:
# Call the matcher on the doc
doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

In [None]:
doc = nlp("I loved dogs but now I love cats more.")

pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

matcher = Matcher(nlp.vocab)

matcher.add("love_things_PATTERN", [pattern])

matches = matcher(doc)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

# Chapter 2: Large-scale data analysis with spaCy

## 2.1. Data Structures (1): Vocab, Lexemes and StringStore

### 2.1.1. Shared vocab and string store (1)
- Vocab: stores data shared across multiple documents
- To save memory, spaCy encodes all strings to hash values
- Strings are only stored once in the StringStore via nlp.vocab.strings
- String store: lookup table in both directions
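The two-way lookup can be sketched with a standalone StringStore, independent of any pipeline. The second, empty store is an assumption added to illustrate that a hash can only be resolved by a store that has seen the string.

```python
from spacy.strings import StringStore

# A standalone string store, seeded with one string
store = StringStore(["coffee"])

coffee_hash = store["coffee"]        # string -> hash
coffee_string = store[coffee_hash]   # hash -> string
print(coffee_hash, coffee_string)

# Add another string; add() returns its hash
tea_hash = store.add("tea")

# A different store that never saw "tea" cannot turn the hash back into text
other_store = StringStore()
try:
    other_store[tea_hash]
    resolved = True
except KeyError:
    resolved = False
print("resolved in other store:", resolved)
```

This is why a Doc or Lexeme always needs access to the shared vocab: the hash alone does not carry the string.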

In [None]:
nlp.vocab.strings.add("coffee")

In [None]:
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

In [None]:
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

In [None]:
coffee_string

### 2.1.2. Lexemes: entries in the vocabulary

In [None]:
doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

In [None]:
nlp = spacy.blank("en")
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings['cat']
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

## 2.2. Data Structures (2): Doc, Span and Token

### 2.2.1. Creating a Doc object

In [None]:
# Create an nlp object
import spacy
nlp = spacy.blank("en")

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

print(doc.text)

In [None]:
import spacy

nlp = spacy.blank("en")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
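To make the role of the spaces list explicit, here is a plain-Python sketch (no spaCy needed) that rebuilds the text by appending a space after every word whose flag is True. The helper join_words is hypothetical and only mirrors the idea; it is not how spaCy builds doc.text internally.

```python
def join_words(words, spaces):
    """Rebuild text from words and trailing-space flags (hypothetical helper)."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    # Each flag says whether the word at the same position is followed by a space
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


first = join_words(["Hello", "world", "!"], [True, False, False])
second = join_words(["Oh", ",", "really", "?", "!"], [False, True, False, False, False])
print(first)
print(second)
```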

### 2.2.2. Creating spans and entities manually

In [27]:
import spacy

nlp = spacy.blank("en")

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


This example analyzes a text and finds all proper nouns that are followed by a verb.

In [28]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Iterate over the tokens, using native token attributes instead of string lists
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Make sure there is a next token before indexing into the doc
        if token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            # Check if the next token is a verb
            if next_token.pos_ == "VERB":
                print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


## 2.3. Word vectors and semantic similarity

- spaCy can compare two objects and predict similarity
- Doc.similarity(), Span.similarity() and Token.similarity()
- Take another object and return a similarity score (typically between 0 and 1)
- Important: needs a pipeline that has word vectors included, for example:
    - ✅ en_core_web_md (medium)
    - ✅ en_core_web_lg (large)

How does spaCy predict similarity?
- Similarity is determined using word vectors
- Multi-dimensional meaning representations of words
- Generated using an algorithm like Word2Vec and lots of text
- Can be added to spaCy's pipelines
- Default: cosine similarity, but can be adjusted
- Doc and Span vectors default to average of token vectors
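The two ideas in the list above, cosine similarity and averaging token vectors, can be sketched in plain Python. The 3-dimensional vectors below are made up for illustration (real pipeline vectors have many more dimensions), and the helper names cosine and average are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up toy vectors, for illustration only
toy_vectors = {
    "pizza": [0.9, 0.8, 0.1],
    "pasta": [0.8, 0.9, 0.2],
    "soap": [0.1, 0.0, 0.9],
}

food_score = cosine(toy_vectors["pizza"], toy_vectors["pasta"])
soap_score = cosine(toy_vectors["pizza"], toy_vectors["soap"])
print(food_score, soap_score)

# A phrase vector as the average of its token vectors
phrase_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])
print(phrase_vector)
```

Vectors pointing in similar directions score close to 1, and unrelated ones score much lower.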

To install the large pipeline used below, follow the directions in the following link:
https://www.listendata.com/2019/04/install-python-package.html
python -m spacy download en_core_web_lg

In [29]:
# Load a larger pipeline with vectors
nlp = spacy.load('en_core_web_lg')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8698332283318978


In [30]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.685019850730896


In [31]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.1821369691957915


In [32]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.4989228122727765


In [37]:
import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.6348510384559631


### 2.3.1. Word vectors in spaCy
The result below is the 300-dimensional vector for the word "banana".

In [34]:
# Load a larger pipeline with vectors
nlp = spacy.load("en_core_web_lg")

doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 0.20778  -2.4151    0.36605   2.0139   -0.23752  -3.1952   -0.2952
  1.2272   -3.4129   -0.54969   0.32634  -1.0813    0.55626   1.5195
  0.97797  -3.1816   -0.37207  -0.86093   2.1509   -4.0845    0.035405
  3.5702   -0.79413  -1.7025   -1.6371   -3.198    -1.9387    0.91166
  0.85409   1.8039   -1.103    -2.5274    1.6365   -0.82082   1.0278
 -1.705     1.5511   -0.95633  -1.4702   -1.865    -0.19324  -0.49123
  2.2361    2.2119    3.6654    1.7943   -0.20601   1.5483   -1.3964
 -0.50819   2.1288   -2.332     1.3539   -2.1917    1.8923    0.28472
  0.54285   1.2309    0.26027   1.9542    1.1739   -0.40348   3.2028
  0.75381  -2.7179   -1.3587   -1.1965   -2.0923    2.2855   -0.3058
 -0.63174   0.70083   0.16899   1.2325    0.97006  -0.23356  -2.094
 -1.737     3.6075   -1.511    -0.9135    0.53878   0.49268   0.44751
  0.6315    1.4963    4.1725    2.1961   -1.2409    0.4214    2.9678
  1.841     3.0133   -4.4652    0.96521  -0.29787   4.3386   -1.2527
 -1.7734   -3.5637   -0.20035

Similarity depends on the application context
- Useful for many applications: recommendation systems, flagging duplicates etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do

In [36]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9530094042245597
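One factor that may contribute to such a high score is averaging: the two sentences share two of their three tokens, so their averaged vectors stay close even when the differing words point in opposite directions. The sketch below uses made-up 2-dimensional vectors and hypothetical helpers; it does not reproduce the pipeline's actual vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up toy vectors: "like" and "hate" deliberately point in opposite directions
toy = {
    "I": [1.0, 0.2],
    "like": [0.3, 1.0],
    "hate": [0.3, -1.0],
    "cats": [1.0, 0.1],
}

doc1_vector = average([toy[word] for word in ["I", "like", "cats"]])
doc2_vector = average([toy[word] for word in ["I", "hate", "cats"]])

word_score = cosine(toy["like"], toy["hate"])
doc_score = cosine(doc1_vector, doc2_vector)
print("like vs. hate:", word_score)
print("sentence vs. sentence:", doc_score)
```

In this toy setup the word-level score is negative while the sentence-level score stays clearly positive, which is why a sentiment-style application would need a different notion of similarity.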


## 2.4. Combining predictions and rules

Efficient phrase matching

In [38]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: Golden Retriever


In [39]:
import json
import spacy

with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

nlp = spacy.blank("en")
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

FileNotFoundError: [Errno 2] No such file or directory: 'exercises/en/countries.json'
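Since countries.json is not available here, the same steps can be sketched with a short in-memory list. The three country names below are a hypothetical stand-in for the file's contents, not the real data.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Hypothetical stand-in for the contents of countries.json
COUNTRIES = ["Czech Republic", "Slovakia", "Germany"]

matcher = PhraseMatcher(nlp.vocab)

# nlp.pipe processes the strings as a stream and yields one Doc per string
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Collect the text of every matched span
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

A blank pipeline is enough here because the PhraseMatcher compares token texts by default and does not rely on model predictions.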

# Chapter 3: Processing Pipelines