<a href="https://colab.research.google.com/github/diem-ai/natural-language-processing/blob/master/spaCy_chapter2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Chapter 2: Large-scale data analysis with spaCy

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

In [0]:
import spacy
# Import the Doc and Span classes
from spacy.tokens import Doc, Span
from spacy.lang.en import English

from spacy.matcher import Matcher, PhraseMatcher

In [0]:
nlp = spacy.load("en_core_web_sm")

**Strings to hashes**

- Look up the string “cat” in nlp.vocab.strings to get the hash.
- Look up the hash to get back the string.

In [0]:
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


- Look up the string label “PERSON” in nlp.vocab.strings to get the hash.
- Look up the hash to get back the string.

In [0]:
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON
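
Turning a hash back into a string only works if the string store has already seen that string. The following is a minimal sketch of that behaviour, assuming spaCy is installed. It uses a fresh `StringStore` instead of `nlp.vocab.strings`, and the example word "coffee" and all variable names are illustrative.

```python
from spacy.strings import StringStore

# Start from an empty store so that no strings are known in advance
strings = StringStore()

# add() stores the string and returns its hash
coffee_hash = strings.add("coffee")

# Once stored, the lookup works in both directions
print(strings[coffee_hash])               # coffee
print(strings["coffee"] == coffee_hash)   # True

# A second, empty store has never seen "coffee",
# so it cannot turn the hash back into a string
other_strings = StringStore()
try:
    other_strings[coffee_hash]
    resolved = True
except KeyError:
    resolved = False
print(resolved)  # False
```

This is why hashes are usually resolved through the vocab that processed the text: the hash alone does not carry the string.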


**Creating a Doc from scratch**

In [0]:


# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


Create a sentence: "Go, get started!"

In [0]:
# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


Create a sentence: "Oh, really?!"

In [0]:
# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words = words, spaces = spaces)
print(doc.text)

Oh, really?!
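
In the three examples above, each entry in `spaces` says whether the word at the same position is followed by a space. As a rough plain-Python sketch of how the text is assembled from the two lists (the helper `join_words` is hypothetical and not part of spaCy):

```python
def join_words(words, spaces):
    """Rebuild a text from words and their trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    # Append a single space after a word only when its flag is True
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = join_words(["Oh", ",", "really", "?", "!"], [False, True, False, False, False])
print(text)  # Oh, really?!
```

Writing the flags out this way makes it easier to spot a wrong entry, for example a `True` after "Oh" that would produce "Oh , really?!".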


**Doc, Span, Entities from Scratch**

- In this exercise, you’ll create the Doc and Span objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

In [0]:
nlp_en = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp_en.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])


I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


**Data Structure Best Practices**

In [0]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice capital. Munich is historical city. We should stay there for a while. ")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check that a next token exists and that it is a verb
        # (without the bounds check, a proper noun at the end raises IndexError)
        if index + 1 < len(pos_tags) and pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Berlin
Found proper noun before a verb: Munich


In [0]:
print(pos_tags)

['PROPN', 'VERB', 'DET', 'ADJ', 'NOUN', 'PUNCT', 'PROPN', 'VERB', 'ADJ', 'NOUN', 'PUNCT', 'PRON', 'VERB', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN', 'PUNCT']


In [0]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice capital. Munich is historical city. We should stay there for a while. ")

for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check that a next token exists and that it is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin
Found proper noun before a verb: Munich


**Inspect Word Vector**

In [0]:
# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[ 2.1561384   0.6859281  -1.8234854   0.4145496  -0.886605    5.0773377
  0.28650832  3.6156225  -2.627604    5.01052     2.6055033   5.4986916
 -0.82726336 -2.4128723  -1.5714562   0.67344356 -1.1230624   3.017315
  3.4531426   2.6312394  -2.3144596   2.0717711  -0.5736556  -0.5199362
 -0.4892068   1.4417053   1.1748309   3.291245    2.7368522   2.1909308
  2.4100504  -1.5442916  -0.81270695 -1.7967525  -2.4401696   0.96489155
 -5.071314    2.4865592  -1.1760099   1.0010973  -1.8218107   6.159581
  5.876448   -1.9877293   6.579393    1.0499439  -1.5798447  -4.1203165
 -0.17076118 -4.819325   -2.1152763  -4.640588    1.5844907  -3.2757292
  2.1921952  -0.47692332 -1.8678508   1.0092752   0.7716696  -0.37776387
  0.07058215 -0.18511617  5.209738   -3.002555   -1.8404679   4.089005
 -2.0230193   1.0394226  -1.7199193   1.0383378   0.23976706 -0.67239416
  1.3192352  -0.33726573  0.21724188 -0.5032941   0.26279616 -0.58214176
 -3.0981517  -4.9684753  -3.2268834  -4.5933228  -3.0618596  -0

In [0]:
print(len(bananas_vector))

96


**Compare Similarity**

Use the doc.similarity method to compare doc1 to doc2 and print the result.

In [0]:
!python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [0]:
nlp_md = spacy.load("en_core_web_md")

In [0]:
doc1 = nlp_md("It's a warm summer day")
doc2 = nlp_md("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525
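
By default, spaCy computes this score as the cosine similarity between the two objects' vectors, and a Doc's vector defaults to the average of its token vectors. The plain-Python sketch below illustrates that calculation on made-up three-dimensional vectors. The helper names and the numbers are hypothetical and are not taken from en_core_web_md.

```python
import math


def average_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical token vectors for two tiny "documents"
doc1_tokens = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
doc2_tokens = [[2.0, 2.0, 2.0]]

doc1_vector = average_vector(doc1_tokens)  # [1.0, 1.0, 1.0]
doc2_vector = average_vector(doc2_tokens)  # [2.0, 2.0, 2.0]

# Vectors pointing in the same direction score close to 1.0, whatever their length
score = cosine_similarity(doc1_vector, doc2_vector)
print(round(score, 4))
```

Because the score depends only on direction, averaging many token vectors can make quite different texts look similar, which is worth keeping in mind when reading the values in this section.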


Use the token.similarity method to compare token1 to token2 and print the result.

In [0]:
doc = nlp_md("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [0]:
doc = nlp_md("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar" by slicing the doc
span1 = doc[3:5]
span2 = doc[12:15]

print(span1)
print(span2)
# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

great restaurant
really nice bar
0.75173926


In [0]:
for token in doc:
  print(token.text, token.i)

In [0]:
doc = nlp_md("This was a great restaurant. Afterwards, we went to a really nice bar.")

pattern1 = [{"TEXT": "great"}, {"TEXT": "restaurant"}]

matcher = Matcher(nlp_md.vocab)
# spaCy v2 signature; in spaCy v3 this is matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN1", None, pattern1)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)



PATTERN1 great restaurant


**Combining models and rules**

- Both patterns in this exercise contain mistakes and won’t match as expected. Can you fix them? If you get stuck, try printing the tokens in the doc to see how the text will be split and adjust the pattern so that each dictionary represents one token.

- Edit pattern1 so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
- Edit pattern2 so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.

In [0]:
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
# "amazon" in any casing, followed by a title-cased proper noun
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
# "ad-free" is split into three tokens ("ad", "-", "free"), followed by a noun
pattern2 = [{"LOWER": "ad"}, {"POS": "PUNCT"}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
# (spaCy v2 signature; in spaCy v3 pass a list: matcher.add("PATTERN1", [pattern1]))
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], start, end, doc[start:end].text)

PATTERN1 7 9 Amazon Prime
PATTERN2 27 31 ad-free viewing
PATTERN1 39 41 Amazon Prime
PATTERN2 44 48 ad-free viewing
PATTERN2 82 86 ad-free viewing
PATTERN2 102 106 ad-free viewing
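
The hint about printing the tokens matters because the tokenizer splits "ad-free" at the hyphen, so a pattern needs one dictionary for each of the three pieces. A small sketch of that check, assuming spaCy is installed and using a blank English pipeline (tokenizer only, no model download); the name `nlp_blank` is illustrative:

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components
nlp_blank = spacy.blank("en")

doc = nlp_blank("ad-free viewing")
tokens = [token.text for token in doc]
print(tokens)  # ['ad', '-', 'free', 'viewing']
```

The four tokens line up with the four-token spans reported for PATTERN2 in the output above.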


**Efficient PhraseMatcher**

- Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.

- Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
- Add the phrase patterns and call the matcher on the doc.

In [0]:
COUNTRIES = ['Afghanistan', 'Åland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'United States Minor Outlying Islands', 'Virgin Islands (British)', 'Virgin Islands (U.S.)', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cabo Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo (Democratic Republic of the)', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', "Côte d'Ivoire", 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia (the former Yugoslav Republic of)', 'Madagascar', 'Malawi', 
'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Moldova (Republic of)', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', "Korea (Democratic People's Republic of)", 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine, State of', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Kosovo', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Korea (Republic of)', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']

doc = nlp_en("Czech Republic may help Slovakia protect its airspace. France is an ally of United States of America")

matcher = PhraseMatcher(nlp_en.vocab)
# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp_en(country) for country in COUNTRIES]
patterns = list(nlp_en.pipe(COUNTRIES))
# spaCy v2 signature; in spaCy v3 this is matcher.add("COUNTRY", patterns)
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)

print(matches)

print([doc[start:end] for match_id, start, end in matches])

[(4000319556510314152, 0, 2), (4000319556510314152, 4, 5), (4000319556510314152, 9, 10), (4000319556510314152, 14, 18)]
[Czech Republic, Slovakia, France, United States of America]
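
To make the contrast with token patterns concrete, here is a minimal sketch that matches one multi-word country both ways. It assumes spaCy v3 (where `add` takes a list of patterns) and a blank English pipeline; the variable names are illustrative.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp_blank = spacy.blank("en")
doc = nlp_blank("Czech Republic may help Slovakia protect its airspace.")

# Token-based: one dictionary has to be written for every token
token_matcher = Matcher(nlp_blank.vocab)
token_matcher.add("COUNTRY", [[{"TEXT": "Czech"}, {"TEXT": "Republic"}]])

# Phrase-based: the pattern is simply the tokenized string
phrase_matcher = PhraseMatcher(nlp_blank.vocab)
phrase_matcher.add("COUNTRY", [nlp_blank("Czech Republic")])

token_spans = [doc[start:end].text for _, start, end in token_matcher(doc)]
phrase_spans = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(token_spans)   # ['Czech Republic']
print(phrase_spans)  # ['Czech Republic']
```

Both matchers find the same span, but with a list of a few hundred names only the phrase-based version avoids writing the token dictionaries by hand.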


**Extract Countries and Relationships**

- In the previous exercise, you wrote a script using spaCy’s PhraseMatcher to find country names in text. Let’s use that country matcher on a longer text, analyze the syntax and update the document’s entities with the matched countries.

- Iterate over the matches and create a Span with the label "GPE" (geopolitical entity).
- Overwrite the entities in doc.ents and add the matched span.
- Get the matched span’s root head token.
- Print the text of the head token and the span.

In [0]:
matcher = PhraseMatcher(nlp_en.vocab)
patterns = list(nlp_en.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Create a doc and find matches in it
doc = nlp_en("Czech Republic may help Slovakia protect its airspace. France is an ally of United States of America")

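# Iterate over the matches
# Note: nlp_en is a blank English pipeline without a dependency parser, so
# span.root.head is not a syntactic head here (in the output it is simply the
# first token of each span). Use a pipeline with a parser, such as
# en_core_web_sm, to get real heads.
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)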

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

Czech --> Czech Republic
Slovakia --> Slovakia
France --> France
United --> United States of America
[('Czech Republic', 'GPE'), ('Slovakia', 'GPE'), ('France', 'GPE'), ('United States of America', 'GPE')]
