In [1]:
import spacy
from spacy.lang.en import English  # import the English language class
nlp = spacy.load('en')  # load the English model we just downloaded ('en' is a spaCy 2.x shortcut; spaCy 3 needs the full package name, e.g. 'en_core_web_sm')

In [3]:
# This text is copied from a WSJ article by Brianna Abbott, published 10/10/2019 at https://www.wsj.com/articles/vaping-related-lung-illnesses-jump-to-1-299-with-26-deaths-cdc-says-11570730171?mod=hp_lead_pos10

text = """

The number of confirmed and probable lung-injury cases linked to vaping increased to 1,299, including 26 deaths, the federal Centers for Disease Control and Prevention said Thursday.

The count of cases rose by 219 from a week ago.

The cases were spread across 49 states, the District of Columbia, and the U.S. Virgin Islands, and 26 people have died. Alaska is the only state without reported cases.

Connecticut, Pennsylvania, Michigan, Massachusetts, New York and Texas confirmed deaths for the first time over the past week. Georgia and California confirmed an additional death each.

Among the deaths recently reported was of a 17-year-old from New York City, one of the youngest people reported to have died due to vaping-related injury so far.

The CDC’s count of vaping-related deaths didn’t include one reported Wednesday by Utah’s health department. It said a person under the age of 30 years had died at home, without being hospitalized. The victim died after vaping products containing THC, the psychoactive ingredient in marijuana.

If confirmed by the CDC, the Utah death would raise the total number of vaping-related fatalities across the U.S. to 27.

Investigators from the Food and Drug Administration are conducting a criminal probe into the supply chain for vaping products, while health authorities investigate what is causing the vaping-related illnesses.

The authorities have found that, among the 573 patients who reported their vaping habits, 76% reported using products containing THC. Many had bought the products on the black market, according to previous reports.

Yet health officials say they haven’t linked any one product or substance with all of the illnesses, as only a third of the patients have reported exclusive THC use and only 13% have reported exclusive nicotine-product use.

As the numbers of injured have risen, health authorities have urged people to stop using electronic cigarettes, some highlighting THC-containing products specifically.

Separately, states including Massachusetts, New York and Washington have taken steps to crack down on flavored e-cigarettes, which the Trump administration has also said it would take.

"""

## 1. Creating a spaCy doc from our text

In [4]:
doc = nlp(text)

## 2. Finding the token text and associated part of speech for each token in our doc

In [5]:
for token in doc:  # for each token in our Doc...
    print(token.text, token.pos_, "\n")  # ...print its text and coarse-grained part-of-speech tag, then a blank line



 SPACE 

The DET 

number NOUN 

of ADP 

confirmed VERB 

and CCONJ 

probable ADJ 

lung NOUN 

- PUNCT 

injury NOUN 

cases NOUN 

linked VERB 

to ADP 

vaping VERB 

increased VERB 

to ADP 

1,299 NUM 

, PUNCT 

including VERB 

26 NUM 

deaths NOUN 

, PUNCT 

the DET 

federal ADJ 

Centers PROPN 

for ADP 

Disease PROPN 

Control PROPN 

and CCONJ 

Prevention PROPN 

said VERB 

Thursday PROPN 

. PUNCT 



 SPACE 

The DET 

count NOUN 

of ADP 

cases NOUN 

rose VERB 

by ADP 

219 NUM 

from ADP 

a DET 

week NOUN 

ago ADV 

. PUNCT 



 SPACE 

The DET 

cases NOUN 

were VERB 

spread VERB 

across ADP 

49 NUM 

states NOUN 

, PUNCT 

the DET 

District PROPN 

of ADP 

Columbia PROPN 

, PUNCT 

and CCONJ 

the DET 

U.S. PROPN 

Virgin PROPN 

Islands PROPN 

, PUNCT 

and CCONJ 

26 NUM 

people NOUN 

have VERB 

died VERB 

. PUNCT 

Alaska PROPN 

is VERB 

the DET 

only ADJ 

state NOUN 

without ADP 

reported VERB 

cases NOUN 

. PUNCT 



 SPACE 

C
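A per-token listing gets long quickly, so a tally of the tags can be easier to read. The sketch below is one possible way to do that. `count_pos` and `FakeToken` are hypothetical names introduced here, and the stand-in tokens (copied from the first lines of the output above) exist only so the example runs without a spaCy model.

```python
from collections import Counter, namedtuple


def count_pos(tokens):
    """Tally coarse-grained POS tags for any iterable of objects with a .pos_ attribute."""
    return Counter(token.pos_ for token in tokens)


# Hypothetical stand-in for spaCy tokens (an assumption for illustration only);
# the (text, tag) pairs are copied from the start of the output above.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
sample = [
    FakeToken("The", "DET"),
    FakeToken("number", "NOUN"),
    FakeToken("of", "ADP"),
    FakeToken("confirmed", "VERB"),
    FakeToken("and", "CCONJ"),
    FakeToken("probable", "ADJ"),
    FakeToken("lung", "NOUN"),
]

pos_counts = count_pos(sample)
print(pos_counts.most_common(1))  # [('NOUN', 2)]
```

Because the function only reads `.pos_`, passing the real `doc` (that is, `count_pos(doc)`) should work the same way.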

## 3. Creating a set of each geopolitical entity mentioned in the article

In [6]:
gpes = set()

for ent in doc.ents:
    if ent.label_ == 'GPE':
        gpes.add(ent.text)
        
sorted(gpes)

['Alaska',
 'California',
 'Connecticut',
 'Georgia',
 'Massachusetts',
 'Michigan',
 'New York',
 'New York City',
 'Pennsylvania',
 'Texas',
 'U.S.',
 'Utah',
 'Washington',
 'the District of Columbia',
 'the U.S. Virgin Islands']
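A set records which places are mentioned but not how often. If frequency matters, a `Counter` is a small change. This is a sketch under assumptions: `count_entities` and `FakeEnt` are hypothetical names, and the sample entities are stand-ins so the code runs without a spaCy model.

```python
from collections import Counter, namedtuple


def count_entities(ents, label="GPE"):
    """Count how often each entity text with the given label appears."""
    return Counter(ent.text for ent in ents if ent.label_ == label)


# Hypothetical stand-in for spaCy entity spans (illustration only, not real model output).
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
sample_ents = [
    FakeEnt("New York", "GPE"),
    FakeEnt("Thursday", "DATE"),
    FakeEnt("New York", "GPE"),
    FakeEnt("Utah", "GPE"),
]

gpe_counts = count_entities(sample_ents)
print(gpe_counts.most_common())  # [('New York', 2), ('Utah', 1)]
```

With a real document, `count_entities(doc.ents)` should give the per-place mention counts, since the function only reads `.text` and `.label_`.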

## 4. Using a RegEx to find any mention of a death count

In [7]:
import re

expression = r'\d+ (deaths|death|fatality|fatalities)'  # \d+ requires at least one digit before the noun

for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        print("Found match:", span.text)

Found match: 26 deaths
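The quantifier on `\d` decides whether a bare mention such as "an additional death" counts as a match. The sketch below compares `\d*` (zero or more digits) with `\d+` (one or more) using only the `re` module, on two sentences copied from the article. The names `loose` and `strict` are introduced here for illustration.

```python
import re

# Two sentences copied verbatim from the article text above.
sample = (
    "The number of confirmed and probable lung-injury cases linked to vaping "
    "increased to 1,299, including 26 deaths, the federal Centers for Disease "
    "Control and Prevention said Thursday. "
    "Georgia and California confirmed an additional death each."
)

loose = r'\d* (deaths|death|fatality|fatalities)'   # digits are optional
strict = r'\d+ (?:deaths?|fatalit(?:y|ies))'        # at least one digit required

loose_matches = [m.group(0) for m in re.finditer(loose, sample)]
strict_matches = [m.group(0) for m in re.finditer(strict, sample)]

print(loose_matches)   # ['26 deaths', ' death']
print(strict_matches)  # ['26 deaths']
```

A digit-free match starts on a space, which does not line up with a token boundary, so `doc.char_span` returns `None` for it. That is presumably why the `span is not None` check leaves only `26 deaths` in the output.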


## 5. Finding the similarity between the entire doc and the doc "I am happy"

In [8]:
import en_core_web_md  # medium English model, which ships with word vectors
nlp = en_core_web_md.load()

doc1 = nlp(text)
doc2 = nlp("I am happy")

print(doc1.similarity(doc2))

0.5751000633710175
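By default, `Doc.similarity` is the cosine similarity between the two documents' vectors, and a document's vector defaults to the average of its token vectors. The sketch below reproduces that arithmetic with the standard library only. The helper names and the two-dimensional toy vectors are assumptions for illustration, not values from `en_core_web_md`.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "token vectors" (hypothetical): doc_a has two tokens, doc_b has one.
doc_a = mean_vector([[1.0, 0.0], [0.0, 1.0]])  # averages to [0.5, 0.5]
doc_b = mean_vector([[1.0, 1.0]])              # stays [1.0, 1.0]

score = cosine_similarity(doc_a, doc_b)
print(score)  # close to 1.0, because both vectors point the same way
```

Since cosine similarity depends only on direction, a long article and a three-word sentence can still score fairly high when their averaged vectors point in similar directions, so a value like the one above is better read as a rough signal than as evidence of shared meaning.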
