## <center>EDA & Attempts on COVID-19 non-medical dataset</center>

The dataset has around 200,000 articles which are non-medical. All the articles are divided into following 5 major topics(sorted in descending order) - 
business,
general,        
finance,        
tech,            
science,         
healthcare,        
environment,       
automotive,        
ai.<br>We have taken only tech, autombile articles as our starting point. The previous EDA dataset has been updated to a more latest one and might not give the same results:

Dataset Link: <a> https://www.kaggle.com/jannalipenkova/covid19-public-media-dataset </a>


## TOC:

1) [Cleaning text](#clean)<br>
2) [Tokenizing and finding collocations](#collo)<br>
3) [Trying out with noun_chunks with just all extracted verbs](#nc)<br>
4) [Matcher - Spacy](#match)<br>
5) [Redudant Method 2](#redundant)<br>
6) [Regex to match Noun Phrases](#phrase)<br>

## Importing required libraries:

In [None]:
import pandas as pd
import numpy as np 
import pickle

In [None]:
filename = 'data/only_tech_automotive_articles'
getfile = open(filename, 'rb')
df = pickle.load(getfile)
getfile.close()
df.head()

Unnamed: 0,title,url,crawled_time,date,domain,author,content,topic_area
159,The US has its first case of the new Wuhan cor...,https://www.theverge.com/2020/1/21/21075647/us...,2020-03-27,2020-01-21,theverge,Nicole Wetsman,A case of the new virus spreading rapidly in C...,tech
197,Transportation shut down in city where new cor...,https://www.theverge.com/2020/1/22/21077545/co...,2020-03-19,2020-01-22,theverge,Nicole Wetsman,"Disease control officials in Wuhan, the Chines...",tech
198,Rapid global response to the new coronavirus s...,https://www.theverge.com/2020/1/22/21077214/co...,2020-03-19,2020-01-22,theverge,Nicole Wetsman,Scientists think the new virus spreading rapid...,tech
325,Huawei developer conference postponed due to W...,https://www.theverge.com/2020/1/23/21078258/hu...,2020-04-08,2020-01-23,theverge,Sam Byford,Huawei has announced the postponement of a maj...,tech
357,World Health Organization says it’s too early ...,https://www.theverge.com/2020/1/23/21077335/co...,2020-03-20,2020-01-23,theverge,Nicole Wetsman,The World Health Organization (WHO) said today...,tech


In [3]:
if 'url' and 'crawled_time'  in df:
    df.drop(['url','crawled_time'],axis=1,inplace=True)

In [4]:
df.groupby("topic_area")["domain"].value_counts()

topic_area  domain          
automotive  computerweekly       163
            autonews              55
            eenewsautomotive      25
            just-auto             24
tech        theverge            1749
            venturebeat          871
            techcrunch           715
            news.crunchbase      311
            bioworld             237
            engadget             181
            japantimes           180
            biospace              12
Name: domain, dtype: int64

In [5]:
df.describe()

Unnamed: 0,title,date,domain,author,content,topic_area
count,4523,4523,4523,4343,4523,4523
unique,4423,233,12,383,4523,2
top,Facebook introduces new livestreaming features...,2020-03-24,theverge,Kim Lyons,I got a pitch this week to cover an annual dev...,tech
freq,17,95,1749,149,1,4256


In [6]:
df.content[197][:1500]

'Disease control officials in Wuhan, the Chinese city where the outbreak of the new and rapidly spreading virus began, announced that it’s shutting down transportation within the city and will close all airports and train stations. The city is home to over 11 million people. By Thursday evening, the travel ban had been extended to two more cities as officials began closing off the seven million residents of Huanggang, a city about 30 miles east of Wuhan, and nearby Ezhou, a city of one million. The virus is similar to SARS, which circulated around the world in 2002 and 2003. So far, the new virus has sickened over 500 people and killed 17. In addition to the transportation shutdown, companies like General Motors and Ford are restricting and suspending travel to Wuhan, and Olympic qualifying events have been moved out of the city. **MAJOR BREAKING**: Wuhan, ground zero for the China #coronavirus, to be on public transport lockdown as of Thursday 10am, reports @ChinaDaily. All flights an

<a id='clean'></a>
## 2) Cleaning the text from puctuations, hastags and websites: 


In [7]:
import re
import string 

In [8]:
df.reset_index(inplace=True)

if 'index' in df:
    df.drop(['index'],axis=1,inplace=True)

In [9]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
def clean_round_1(text):
    text = text.lower()
    text = re.sub('[#|@]+[\w]+','',text)
    text = re.sub('http\S+','',text)
    #text = re.sub('\w+\d\w+', '', text)
    return text

def clean_round_2(text):
    for each in ['!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~']:
        text = text.replace(each,'')
        text = re.sub('—','',text)
        text = re.sub('’','',text)
        text = re.sub(',','',text)
        text = text.replace('“','')
        text = text.replace('”','')
    return text

In [11]:
df.content = df.content.apply(lambda x: clean_round_1(x))
df.content = df.content.apply(lambda x: clean_round_2(x))

### After cleaning:

In [12]:
df.content[0]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

## Tokenize the content articles: 

In [13]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


[nltk_data] Downloading package punkt to C:\Users\Karthik
[nltk_data]     Pyapali\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [14]:
df["tokens"] = [word_tokenize(each) for each in df.content]

In [15]:
df["unique_tokens"] = [set(each) for each in df.tokens] 

<a id='collo'></a>
## Getting collocations:

In [16]:
import nltk
from nltk.collocations import *

In [17]:
bigram_msr = nltk.collocations.BigramAssocMeasures()

In [18]:
res_txt = BigramCollocationFinder.from_words(df.tokens[197])
res = res_txt.score_ngrams(bigram_msr.raw_freq)
res[:5]

[(('.', '``'), 0.006864988558352402),
 (('.', 'the'), 0.006864988558352402),
 (('decision', 'to'), 0.006864988558352402),
 (("''", 'microsoft'), 0.004576659038901602),
 (("'ve", 'made'), 0.004576659038901602)]

<a id='kg'></a>
## Creating a small KG on just 100 rows from the topic areas - tech and automotive : 

In [19]:
df_kg = df.copy(deep=True)

In [20]:
df_kg.reset_index(inplace=True)

In [21]:
if "index" and "title" and "tokens" and "unique_tokens" and "date" and "domain" and "topic_area" in df_kg:
    df_kg.drop(columns=["index","title","tokens","unique_tokens","date","domain","topic_area"],axis=1,inplace=True)

In [22]:
df_kg = df_kg[:100]

In [23]:
df_kg[:5]

Unnamed: 0,author,content
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...
1,Nicole Wetsman,disease control officials in wuhan the chinese...
2,Nicole Wetsman,scientists think the new virus spreading rapid...
3,Sam Byford,huawei has announced the postponement of a maj...
4,Nicole Wetsman,the world health organization (who) said today...


<a id='aws'></a>
## Using AWS Comprehend to extract entities:

In [24]:
!pip install spacy
!python -m spacy download en

[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
[!] Download successful but linking failed
Creating a shortcut link for 'en' didn't work (maybe you don't have admin
permissions?), but you can still load the model via its full package name: nlp =
spacy.load('en_core_web_sm')


In [25]:
import spacy

In [26]:
nlp = spacy.load("en")

In [27]:
doc = nlp(df_kg.content[0])
print("Nouns:",[chunk.text for chunk in doc.noun_chunks])
print("Verbs:",[token.lemma_ for token in doc if token.pos_ == "VERB"])

Nouns: ['a case', 'the new virus', 'china', 'a patient', 'seattle washington reuters', 'the patient', 'china', 'the first us case', 'the virus', 'wuhan', 'a city', 'central china', 'late december', 'it', 'around 300 people', 'the case', 'the centers', 'disease control', 'prevention', '(cdc', 'a press briefing', 'they', 'the threat', 'the us', 'the virus', 'the designation', 'it', 'a coronavirus', 'the family', 'viruses', 'the sars outbreak', 'that outbreak', 'nearly 800 people', 'me', 'timothy', 'a coronavirus expert', 'assistant professor', 'the university', 'north carolina', 'gillings school', 'global public health', 'the us patient', 'seattle-tacoma international airport', 'january 15th', 'symptoms', 'his medical provider', 'sunday january 19th', 'the patient', 'the reports', 'the wuhan virus', 'them', 'his provider', 'the positive test', 'the coronavirus', 'he', 'little risk', 'hospital staff', 'the general public', 'the cdc', 'its briefing', 'january 17th', 'the cdc', 'enhanced he

In [28]:
# Find named entities, phrases and concepts
print('----Named Entities----')
for entity in doc.ents:
    print(entity.text, entity.label_)

----Named Entities----
china GPE
seattle GPE
washington GPE
china GPE
first ORDINAL
first ORDINAL
wuhan GPE
china GPE
late december 2019 DATE
300 CARDINAL
six CARDINAL
cdc ORG
us GPE
2003 DATE
nearly 800 CARDINAL
timothy sheahan PERSON
the university of north carolina ORG
us GPE
seattle GPE
tacoma international airport FAC
january 15th DATE
sunday january 19th DATE
the wuhan virus FAC
yesterday DATE
cdc ORG
january 17th DATE
cdc ORG
san francisco international FAC
john f. kennedy international PERSON
new york GPE
los angeles GPE
wuhan GPE
chicago GPE
hartsfield GPE
this week DATE
wuhan GPE
sheahan PERSON
2020 DATE
2002 DATE
the world health organization ORG
wuhan GPE
one CARDINAL
chinese NORP
this week DATE
at least one CARDINAL
one CARDINAL
nancy messonnier PERSON
the national center for immunization and respiratory ORG
cdc ORG
the world health organization ORG
wednesday january 22nd DATE
january 21st DATE
cdc ORG


In [29]:
# Visualize NER
from spacy import displacy
doc = nlp(df_kg.content[0])
displacy.render(doc, style='ent', jupyter=True)

In [30]:
## Trying out an example on creating a KG using py and Spacy: 

In [None]:
from spacy.lang.en import English
import networkx as nx
import matplotlib.pyplot as plt

def getSentences(text):
    nlp = English()
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    document = nlp(text)
    return [sent.string.strip() for sent in document.sents]

def printToken(token):
    print(token.text, "->", token.dep_)

def appendChunk(original, chunk):
    return original + ' ' + chunk

def isRelationCandidate(token):
    deps = ["ROOT", "adj", "attr", "agent", "amod"]
    return any(subs in token.dep_ for subs in deps)

def isConstructionCandidate(token):
    deps = ["compound", "prep", "conj", "mod"]
    return any(subs in token.dep_ for subs in deps)

def processSubjectObjectPairs(tokens):
    subject = ''
    object = ''
    relation = ''
    subjectConstruction = ''
    objectConstruction = ''
    for token in tokens:
        printToken(token)
        if "punct" in token.dep_:
            continue
        if isRelationCandidate(token):
            relation = appendChunk(relation, token.lemma_)
        if isConstructionCandidate(token):
            if subjectConstruction:
                subjectConstruction = appendChunk(subjectConstruction, token.text)
            if objectConstruction:
                objectConstruction = appendChunk(objectConstruction, token.text)
        if "subj" in token.dep_:
            subject = appendChunk(subject, token.text)
            subject = appendChunk(subjectConstruction, subject)
            subjectConstruction = ''
        if "obj" in token.dep_:
            object = appendChunk(object, token.text)
            object = appendChunk(objectConstruction, object)
            objectConstruction = ''

    print (subject.strip(), ",", relation.strip(), ",", object.strip())
    return (subject.strip(), relation.strip(), object.strip())

def processSentence(sentence):
    tokens = nlp_model(sentence)
    return processSubjectObjectPairs(tokens)

def printGraph(triples):
    G = nx.Graph()
    for triple in triples:
        G.add_node(triple[0])
        G.add_node(triple[1])
        G.add_node(triple[2])
        G.add_edge(triple[0], triple[1])
        G.add_edge(triple[1], triple[2])

    pos = nx.spring_layout(G)
    plt.figure()
    nx.draw(G, pos, edge_color='black', width=1, linewidths=1,
            node_size=1000, node_color='seagreen', alpha=0.9,
            labels={node: node for node in G.nodes()})
    plt.axis('off')
    plt.show()

if __name__ == "__main__":

    text = df_kg.content[0]
    sentences = getSentences(text)
    nlp_model = spacy.load('en_core_web_sm')

    triples = []
    print (text)
    for sentence in sentences:
        triples.append(processSentence(sentence))

    printGraph(triples)

In [31]:
len(df_kg.content[0])

3206

## First tokenize the text in content column:

In [32]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')

df_kg["tokens"] = df_kg.content.apply(lambda x: nltk.word_tokenize(x)) 
df_kg["all_pos"] = df_kg.tokens.apply(lambda x : nltk.pos_tag(x))

#txt = nltk.pos_tag(txt)
#txt

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Karthik Pyapali\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [33]:
df_kg

Unnamed: 0,author,content,tokens,all_pos
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,"[a, case, of, the, new, virus, spreading, rapi...","[(a, DT), (case, NN), (of, IN), (the, DT), (ne..."
1,Nicole Wetsman,disease control officials in wuhan the chinese...,"[disease, control, officials, in, wuhan, the, ...","[(disease, NN), (control, NN), (officials, NNS..."
2,Nicole Wetsman,scientists think the new virus spreading rapid...,"[scientists, think, the, new, virus, spreading...","[(scientists, NNS), (think, VBP), (the, DT), (..."
3,Sam Byford,huawei has announced the postponement of a maj...,"[huawei, has, announced, the, postponement, of...","[(huawei, NN), (has, VBZ), (announced, VBN), (..."
4,Nicole Wetsman,the world health organization (who) said today...,"[the, world, health, organization, (, who, ), ...","[(the, DT), (world, NN), (health, NN), (organi..."
...,...,...,...,...
95,Jason D. Rowley,supergiant venture capital deals are big. like...,"[supergiant, venture, capital, deals, are, big...","[(supergiant, JJ), (venture, NN), (capital, NN..."
96,Nicole Wetsman,as the us health care system watches the ongoi...,"[as, the, us, health, care, system, watches, t...","[(as, IN), (the, DT), (us, PRP), (health, NN),..."
97,Sam Byford,intel vivo and ntt docomo are joining sony ama...,"[intel, vivo, and, ntt, docomo, are, joining, ...","[(intel, NN), (vivo, NN), (and, CC), (ntt, JJ)..."
98,Casey Newton,say you run a large social network in which yo...,"[say, you, run, a, large, social, network, in,...","[(say, VB), (you, PRP), (run, VBP), (a, DT), (..."


In [34]:
# Noun phrase = optional determinor (DT), followed by any of adjectives (JJ), 
# and ending in a noun (NN). 
pattern = 'NP: {<NN>*<VBP>*<NN>}'

pp = nltk.RegexpParser(pattern)

df_kg["pattern_parsed"] = df_kg.all_pos.apply(lambda x: pp.parse(x))    #.apply(lambda x: pp.parse(x))

In [35]:
df_kg.content[0][:1000]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [36]:
pp.parse(df_kg.all_pos[0])[:5] ## gives some error after sometime so trying out spacy

[('a', 'DT'),
 Tree('NP', [('case', 'NN')]),
 ('of', 'IN'),
 ('the', 'DT'),
 ('new', 'JJ')]

In [37]:
len(df_kg.all_pos[0])

553

In [38]:
len(df_kg.pattern_parsed[0])

538

In [39]:
df_kg.all_pos[0][:10]

[('a', 'DT'),
 ('case', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('new', 'JJ'),
 ('virus', 'NN'),
 ('spreading', 'VBG'),
 ('rapidly', 'RB'),
 ('in', 'IN'),
 ('china', 'NN')]

In [40]:
df_kg[:5]

Unnamed: 0,author,content,tokens,all_pos,pattern_parsed
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,"[a, case, of, the, new, virus, spreading, rapi...","[(a, DT), (case, NN), (of, IN), (the, DT), (ne...","[(a, DT), [(case, NN)], (of, IN), (the, DT), (..."
1,Nicole Wetsman,disease control officials in wuhan the chinese...,"[disease, control, officials, in, wuhan, the, ...","[(disease, NN), (control, NN), (officials, NNS...","[[(disease, NN), (control, NN)], (officials, N..."
2,Nicole Wetsman,scientists think the new virus spreading rapid...,"[scientists, think, the, new, virus, spreading...","[(scientists, NNS), (think, VBP), (the, DT), (...","[(scientists, NNS), (think, VBP), (the, DT), (..."
3,Sam Byford,huawei has announced the postponement of a maj...,"[huawei, has, announced, the, postponement, of...","[(huawei, NN), (has, VBZ), (announced, VBN), (...","[[(huawei, NN)], (has, VBZ), (announced, VBN),..."
4,Nicole Wetsman,the world health organization (who) said today...,"[the, world, health, organization, (, who, ), ...","[(the, DT), (world, NN), (health, NN), (organi...","[(the, DT), [(world, NN), (health, NN), (organ..."


<a id='spacy'></a>
## Using spacy, en_model, dependency parser:

In [41]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [42]:
kg = df_kg.copy(deep=True).drop(["tokens","all_pos","pattern_parsed"],axis=1)

In [43]:
kg[:5]

Unnamed: 0,author,content
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...
1,Nicole Wetsman,disease control officials in wuhan the chinese...
2,Nicole Wetsman,scientists think the new virus spreading rapid...
3,Sam Byford,huawei has announced the postponement of a maj...
4,Nicole Wetsman,the world health organization (who) said today...


In [44]:
kg["segment"] = kg.content.apply(nlp)

In [45]:
kg[:5]

Unnamed: 0,author,content,segment
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,"(a, case, of, the, new, virus, spreading, rapi..."
1,Nicole Wetsman,disease control officials in wuhan the chinese...,"(disease, control, officials, in, wuhan, the, ..."
2,Nicole Wetsman,scientists think the new virus spreading rapid...,"(scientists, think, the, new, virus, spreading..."
3,Sam Byford,huawei has announced the postponement of a maj...,"(huawei, has, announced, the, postponement, of..."
4,Nicole Wetsman,the world health organization (who) said today...,"(the, world, health, organization, (, who, ), ..."


In [46]:
[(x,':',x.dep_) for x in kg.segment[0]][:5]

[(a, ':', 'det'),
 (case, ':', 'nsubjpass'),
 (of, ':', 'prep'),
 (the, ':', 'det'),
 (new, ':', 'amod')]

In [47]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print(len(spacy_stopwords))

326


In [48]:
kg["cleaned"] = ""

In [49]:
type(kg.segment[0])

spacy.tokens.doc.Doc

In [50]:
for doc in range(len(kg.segment)):
    for tok in kg.segment[doc]:
        if not tok.is_stop:
            kg.cleaned[doc] += str(tok)+' ' 

In [51]:
kg.cleaned[:5]

0    case new virus spreading rapidly china reporte...
1    disease control officials wuhan chinese city o...
2    scientists think new virus spreading rapidly c...
3    huawei announced postponement major developers...
4    world health organization ( ) said today early...
Name: cleaned, dtype: object

In [52]:
kg.content[0][:1000]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [53]:
if 'segment' in kg:
    kg.drop("segment",axis=1,inplace=True)

In [54]:
kg["tokens"] = kg.cleaned.apply(nlp)

In [55]:
kg[:5]

Unnamed: 0,author,content,cleaned,tokens
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,case new virus spreading rapidly china reporte...,"(case, new, virus, spreading, rapidly, china, ..."
1,Nicole Wetsman,disease control officials in wuhan the chinese...,disease control officials wuhan chinese city o...,"(disease, control, officials, wuhan, chinese, ..."
2,Nicole Wetsman,scientists think the new virus spreading rapid...,scientists think new virus spreading rapidly c...,"(scientists, think, new, virus, spreading, rap..."
3,Sam Byford,huawei has announced the postponement of a maj...,huawei announced postponement major developers...,"(huawei, announced, postponement, major, devel..."
4,Nicole Wetsman,the world health organization (who) said today...,world health organization ( ) said today early...,"(world, health, organization, (, ), said, toda..."


In [56]:
#kg.entities[0]

### Checking with spaCy dependecy parser:  

In [57]:
tok = {}
for each in kg.tokens[0]:
    tok[each] = each.dep_

In [58]:
from collections import Counter

Counter(tok.values())

Counter({'compound': 82,
         'amod': 30,
         'nsubj': 46,
         'acl': 3,
         'advmod': 10,
         'ROOT': 32,
         'ccomp': 24,
         'punct': 33,
         'dobj': 24,
         'advcl': 2,
         'nmod': 9,
         'npadvmod': 9,
         'nummod': 4,
         'prep': 1,
         'appos': 4,
         'acomp': 1,
         'csubj': 1,
         'xcomp': 4,
         'relcl': 1})

In [59]:
for key,value in tok.items():
    if value=='ROOT':
        a = key
        
        
        #(iNDEX)artcle.loc["report"]+1 b
        
     # PATTERN - (COMPUND +nn)

In [60]:
[key for key,value in tok.items() if value=='compound'][:20]

[case,
 patient,
 seattle,
 washington,
 case,
 wuhan,
 case,
 report,
 disease,
 control,
 press,
 coronavirus,
 family,
 sars,
 timothy,
 sheahan,
 coronavirus,
 expert,
 assistant,
 professor]

In [61]:
len([key for key,value in tok.items() if value=='nsubj'])

46

In [62]:
[key for key,value in tok.items() if value=='dobj'] #[:20]

[china,
 china,
 2019-ncov,
 people,
 flashbacks,
 provider,
 public,
 briefing,
 screenings,
 airport,
 wuhan,
 week,
 wuhan,
 airports,
 humans,
 bats,
 market,
 market,
 questions,
 person,
 reservoir,
 human,
 emergency,
 cdc]

In [63]:
tok1 = {}
for each in kg.tokens[1]:
    tok1[each] = each.dep_

In [64]:
Counter(tok1.values())

Counter({'compound': 97,
         'nsubj': 42,
         'nmod': 16,
         'amod': 42,
         'appos': 3,
         'ROOT': 38,
         'advmod': 9,
         'dobj': 27,
         'ccomp': 18,
         'xcomp': 5,
         'punct': 38,
         'nummod': 7,
         'npadvmod': 8,
         'prep': 3,
         'pobj': 3,
         'dep': 4,
         '': 6,
         'attr': 1,
         'relcl': 2,
         'csubj': 2,
         'acomp': 1,
         'advcl': 1,
         'acl': 3,
         'conj': 1,
         'auxpass': 1,
         'oprd': 1})

In [65]:
kg.columns

Index(['author', 'content', 'cleaned', 'tokens'], dtype='object')

In [66]:
for chunk in kg.tokens[0].noun_chunks:
        print(chunk)    #(chunk.root.text,':',chunk.root.dep_)

case new virus
china
patient seattle washington reuters
patient
china
case virus
wuhan city central china
300 people
case report centers
( cdc
press briefing
threat
virus
designation
coronavirus family viruses
sars
outbreak
nearly 800 people
sars flashbacks
patient
symptoms medical provider
patient familiar reports
wuhan virus
provider positive test coronavirus
little risk hospital staff general public
cdc
briefing
january 17th cdc
enhanced health screenings
san francisco international airport
john f. kennedy international airport new york
los angeles international airport passengers
connected wuhan
agency
chicago ohare international airport hartsfield - jackson atlanta international airport week
flights
wuhan
airports
health officials
sars
prepared respond threats
sheahan
people
sars
common animals
forms
infect humans
virus
sars example
bats
world health organization
wholesale seafood market
wuhan possible source virus
laboratory - confirmed patients
market
questions
new virus
chinese

## Trying out patterns to get < noun> < verb> < noun> meaning out of the big article:

In [67]:
len(kg.tokens[0])

320

In [68]:
len(kg.content[0])

3206

### Trying to segment into sentences and then checking the important tags which we need to describe: 

In [69]:
a = Counter(tok1.values())

In [70]:
type(a)

collections.Counter

In [71]:
[{key:value} for key,value in a.items() if (int(value)>20)]

[{'compound': 97},
 {'nsubj': 42},
 {'amod': 42},
 {'ROOT': 38},
 {'dobj': 27},
 {'punct': 38}]

In [72]:
for key,value in a.items():
    if int(value)>15:
        print({key:value})

{'compound': 97}
{'nsubj': 42}
{'nmod': 16}
{'amod': 42}
{'ROOT': 38}
{'dobj': 27}
{'ccomp': 18}
{'punct': 38}


In [73]:
#kg["Imp_dep_tags"] = [{key:value} for key,value in kg.tokens.items() if (int(value)>20)]

In [74]:
tok1 = {}
for each in kg.tokens[1]:
    tok1[each] = each.dep_

In [75]:
kg['all_token_deps'] = ""

In [76]:
kg.columns

Index(['author', 'content', 'cleaned', 'tokens', 'all_token_deps'], dtype='object')

In [77]:
for each in range(len(kg.tokens)):
    kg.all_token_deps.at[each] = dict([(x,x.dep_) for x in kg.tokens[each]])
kg[:5]

Unnamed: 0,author,content,cleaned,tokens,all_token_deps
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,case new virus spreading rapidly china reporte...,"(case, new, virus, spreading, rapidly, china, ...","{case: 'compound', new: 'amod', virus: 'nsubj'..."
1,Nicole Wetsman,disease control officials in wuhan the chinese...,disease control officials wuhan chinese city o...,"(disease, control, officials, wuhan, chinese, ...","{disease: 'compound', control: 'compound', off..."
2,Nicole Wetsman,scientists think the new virus spreading rapid...,scientists think new virus spreading rapidly c...,"(scientists, think, new, virus, spreading, rap...","{scientists: 'nsubj', think: 'ROOT', new: 'amo..."
3,Sam Byford,huawei has announced the postponement of a maj...,huawei announced postponement major developers...,"(huawei, announced, postponement, major, devel...","{huawei: 'nsubj', announced: 'ROOT', postponem..."
4,Nicole Wetsman,the world health organization (who) said today...,world health organization ( ) said today early...,"(world, health, organization, (, ), said, toda...","{world: 'compound', health: 'compound', organi..."


In [78]:
kg['imp_token_deps'] = ''
kg['imp_deps_count'] = ''

In [79]:
for each in range(len(kg.all_token_deps)):
    each_count = Counter(kg.all_token_deps[each].values())
    for key,value in each_count.items():
        if int(value)>15:
            
            if kg.imp_token_deps is None and kg.imp_deps_count is None:
                kg.imp_token_deps.at[each] += ','+key
                kg.imp_deps_count.at[each] += ','+key+','+str(value)

            else: 
                if key not in kg.imp_token_deps.at[each]:
                    kg.imp_token_deps.at[each] += ','+key
                    kg.imp_deps_count.at[each] += ','+key+','+str(value)
            
kg.imp_token_deps = kg.imp_token_deps.str.lstrip(',')
kg.imp_deps_count = kg.imp_deps_count.str.lstrip(',')

In [80]:
kg[kg.columns[-2:]]

Unnamed: 0,imp_token_deps,imp_deps_count
0,"compound,amod,nsubj,ROOT,ccomp,punct,dobj","compound,82,amod,30,nsubj,46,ROOT,32,ccomp,24,..."
1,"compound,nsubj,nmod,amod,ROOT,dobj,ccomp,punct","compound,97,nsubj,42,nmod,16,amod,42,ROOT,38,d..."
2,"nsubj,ROOT,amod,ccomp,advmod,dobj,punct,compou...","nsubj,80,ROOT,63,amod,74,ccomp,33,advmod,35,do..."
3,"nsubj,ROOT,amod,compound","nsubj,17,ROOT,17,amod,22,compound,43"
4,"compound,nsubj,punct,ROOT,amod,dobj","compound,44,nsubj,27,punct,38,ROOT,25,amod,35,..."
...,...,...
95,"compound,nsubj,ROOT,punct,amod,dobj,npadvmod,nmod","compound,97,nsubj,35,ROOT,31,punct,35,amod,52,..."
96,"compound,nsubj,ROOT,amod,dobj,punct","compound,73,nsubj,30,ROOT,25,amod,49,dobj,27,p..."
97,"compound,nsubj,dobj,ROOT,nmod,amod,punct","compound,75,nsubj,31,dobj,20,ROOT,24,nmod,18,a..."
98,"ROOT,amod,dobj,nsubj,advmod,punct,nummod,npadv...","ROOT,139,amod,172,dobj,143,nsubj,137,advmod,59..."


In [81]:
kg.imp_deps_count[0]

'compound,82,amod,30,nsubj,46,ROOT,32,ccomp,24,punct,33,dobj,24'

In [82]:
#for each in kg.imp_deps_count:
#    print(dict(zip(each)))

In [83]:
for each in range(len(kg.imp_deps_count)):
    kg.imp_deps_count[each] = kg.imp_deps_count[each].split(',')

## To create the Matcher pattern for creating relations, understanding how the data's dependencies are related:

In [84]:
kg_deps = kg[["imp_token_deps","imp_deps_count"]].copy(deep=True)

In [85]:
kg_deps[:5]

Unnamed: 0,imp_token_deps,imp_deps_count
0,"compound,amod,nsubj,ROOT,ccomp,punct,dobj","[compound, 82, amod, 30, nsubj, 46, ROOT, 32, ..."
1,"compound,nsubj,nmod,amod,ROOT,dobj,ccomp,punct","[compound, 97, nsubj, 42, nmod, 16, amod, 42, ..."
2,"nsubj,ROOT,amod,ccomp,advmod,dobj,punct,compou...","[nsubj, 80, ROOT, 63, amod, 74, ccomp, 33, adv..."
3,"nsubj,ROOT,amod,compound","[nsubj, 17, ROOT, 17, amod, 22, compound, 43]"
4,"compound,nsubj,punct,ROOT,amod,dobj","[compound, 44, nsubj, 27, punct, 38, ROOT, 25,..."


In [86]:
for each in range(len(kg_deps.imp_token_deps)):
    kg_deps.imp_token_deps[each] = kg_deps.imp_token_deps[each].split(',')

In [87]:
kg_deps.imp_token_deps[0][0]

'compound'

In [88]:
mySet = set()
for i in kg_deps.imp_token_deps:
    for each in i:
        if each not in mySet:
            mySet.add(each)

In [89]:
mySet   

{'ROOT',
 'acl',
 'advmod',
 'amod',
 'appos',
 'ccomp',
 'compound',
 'dobj',
 'nmod',
 'npadvmod',
 'nsubj',
 'nummod',
 'punct',
 'xcomp'}

In [90]:
for each in mySet:
    kg[each] = ''

In [91]:
kg.imp_deps_count[0][0]

'compound'

In [92]:
kg.head()

Unnamed: 0,author,content,cleaned,tokens,all_token_deps,imp_token_deps,imp_deps_count,acl,compound,appos,...,punct,npadvmod,nsubj,xcomp,nmod,ROOT,ccomp,advmod,amod,dobj
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,case new virus spreading rapidly china reporte...,"(case, new, virus, spreading, rapidly, china, ...","{case: 'compound', new: 'amod', virus: 'nsubj'...","compound,amod,nsubj,ROOT,ccomp,punct,dobj","[compound, 82, amod, 30, nsubj, 46, ROOT, 32, ...",,,,...,,,,,,,,,,
1,Nicole Wetsman,disease control officials in wuhan the chinese...,disease control officials wuhan chinese city o...,"(disease, control, officials, wuhan, chinese, ...","{disease: 'compound', control: 'compound', off...","compound,nsubj,nmod,amod,ROOT,dobj,ccomp,punct","[compound, 97, nsubj, 42, nmod, 16, amod, 42, ...",,,,...,,,,,,,,,,
2,Nicole Wetsman,scientists think the new virus spreading rapid...,scientists think new virus spreading rapidly c...,"(scientists, think, new, virus, spreading, rap...","{scientists: 'nsubj', think: 'ROOT', new: 'amo...","nsubj,ROOT,amod,ccomp,advmod,dobj,punct,compou...","[nsubj, 80, ROOT, 63, amod, 74, ccomp, 33, adv...",,,,...,,,,,,,,,,
3,Sam Byford,huawei has announced the postponement of a maj...,huawei announced postponement major developers...,"(huawei, announced, postponement, major, devel...","{huawei: 'nsubj', announced: 'ROOT', postponem...","nsubj,ROOT,amod,compound","[nsubj, 17, ROOT, 17, amod, 22, compound, 43]",,,,...,,,,,,,,,,
4,Nicole Wetsman,the world health organization (who) said today...,world health organization ( ) said today early...,"(world, health, organization, (, ), said, toda...","{world: 'compound', health: 'compound', organi...","compound,nsubj,punct,ROOT,amod,dobj","[compound, 44, nsubj, 27, punct, 38, ROOT, 25,...",,,,...,,,,,,,,,,


In [93]:
import spacy
from spacy import displacy

txt = 'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airport on january 15th and reported symptoms to his medical provider on sunday january 19th. the patient was familiar with the reports of the wuhan virus and shared them with his provider and the positive test for the coronavirus came back yesterday. he poses little risk to hospital staff or to the general public and is cooperating fully the cdc said in its briefing. on january 17th the cdc started enhanced health screenings at san francisco international airport john f. kennedy international airport in new york and los angeles international airport for passengers who flew from or connected through wuhan. the agency will begin screening at chicago ohare international airport and hartsfield-jackson atlanta international airport this week. any flights from or connecting through wuhan will also begin to be funneled to those airports. health officials experience with sars makes them more prepared to respond to threats from coronaviruses sheahan says. people are aware. before sars people had no idea this could happen. i think everyone is much more prepared now in 2020 than we were in 2002. coronaviruses are common in animals and can evolve into forms that can be passed to and infect humans. the virus that caused sars for example originated in bats. the world health organization has identified a wholesale seafood market in wuhan as a possible source of the virus but it also noted that some laboratory-confirmed patients did not report visiting this market. one of the most pressing questions about the new virus is how easily it spreads. chinese health authorities said this week that there has been at least one confirmed case where the virus passed directly from one person to another without passing through an animal reservoir. the key issue we need to understand is how easily or sustainably the virus is spread from human to human says nancy messonnier director of the national center for immunization and respiratory diseases at the cdc. an expert panel at the world health organization is meeting on wednesday january 22nd to determine if it should declare a global public health emergency. update january 21st 3:08pm et: this report was updated to include new information from the cdc.'
nlp = spacy.load("en_core_web_sm")

doc = nlp(txt)
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [95]:
import spacy
from spacy import displacy

txt = 'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airport on january 15th and reported symptoms to his medical provider on sunday january 19th. the patient was familiar with the reports of the wuhan virus and shared them with his provider and the positive test for the coronavirus came back yesterday. he poses little risk to hospital staff or to the general public and is cooperating fully the cdc said in its briefing. on january 17th the cdc started enhanced health screenings at san francisco international airport john f. kennedy international airport in new york and los angeles international airport for passengers who flew from or connected through wuhan. the agency will begin screening at chicago ohare international airport and hartsfield-jackson atlanta international airport this week. any flights from or connecting through wuhan will also begin to be funneled to those airports. health officials experience with sars makes them more prepared to respond to threats from coronaviruses sheahan says. people are aware. before sars people had no idea this could happen. i think everyone is much more prepared now in 2020 than we were in 2002. coronaviruses are common in animals and can evolve into forms that can be passed to and infect humans. the virus that caused sars for example originated in bats. the world health organization has identified a wholesale seafood market in wuhan as a possible source of the virus but it also noted that some laboratory-confirmed patients did not report visiting this market. one of the most pressing questions about the new virus is how easily it spreads. chinese health authorities said this week that there has been at least one confirmed case where the virus passed directly from one person to another without passing through an animal reservoir. the key issue we need to understand is how easily or sustainably the virus is spread from human to human says nancy messonnier director of the national center for immunization and respiratory diseases at the cdc. an expert panel at the world health organization is meeting on wednesday january 22nd to determine if it should declare a global public health emergency. update january 21st 3:08pm et: this report was updated to include new information from the cdc.'
nlp = spacy.load("en_core_web_sm")

doc = nlp(txt)
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [96]:
doc = nlp(txt)
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [98]:
kg.content[0][:1000]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [99]:
if 'imp_deps_count' and 'entities' in df:
    kg.drop(['imp_deps_count','entities'],axis=1,inplace=True)

In [100]:
type(kg.all_token_deps[0])

dict

In [101]:
tok = {}
for each in kg.tokens[0]:
    tok[each] = each.dep_

In [104]:
tok

{case: 'compound',
 new: 'amod',
 virus: 'nsubj',
 spreading: 'acl',
 rapidly: 'advmod',
 china: 'nsubj',
 reported: 'ROOT',
 patient: 'compound',
 seattle: 'compound',
 washington: 'compound',
 reuters: 'nsubj',
 reports: 'ccomp',
 .: 'punct',
 patient: 'nsubj',
 recently: 'advmod',
 returned: 'ROOT',
 china: 'dobj',
 clinically: 'advmod',
 healthy: 'advmod',
 monitored: 'advcl',
 .: 'punct',
 case: 'compound',
 virus: 'nsubj',
 detected: 'ROOT',
 wuhan: 'compound',
 city: 'nmod',
 central: 'amod',
 china: 'dobj',
 late: 'amod',
 december: 'npadvmod',
 2019: 'nummod',
 .: 'punct',
 sickened: 'ROOT',
 300: 'nummod',
 people: 'nsubj',
 killed: 'ccomp',
 .: 'punct',
 despite: 'prep',
 case: 'compound',
 report: 'compound',
 centers: 'nsubj',
 disease: 'compound',
 control: 'compound',
 prevention: 'nsubj',
 (: 'punct',
 cdc: 'appos',
 ): 'punct',
 said: 'ROOT',
 press: 'compound',
 briefing: 'nsubj',
 believe: 'ccomp',
 threat: 'nsubj',
 remains: 'ccomp',
 low: 'acomp',
 .: 'punct',
 vir

In [115]:
from spacy.matcher import Matcher

# Matcher class object 
matcher = Matcher(nlp.vocab)

#define the pattern 
pattern = [{'DEP':'NOUN','OP':'*'}, 
        {'DEP':'VER','OP':"+"},  
        {'POS':'NOUN','OP':"+"}] 

matcher.add("matching_1", None, pattern) 

matches = matcher(nlp(kg.content[0]))
print(matches)
#k = len(matches) - 1

#span = doc[matches[k][1]:matches[k][2]] 
#print(span)

[]


In [117]:
kg[:2]

Unnamed: 0,author,content,cleaned,tokens,all_token_deps,imp_token_deps,imp_deps_count,acl,compound,appos,...,punct,npadvmod,nsubj,xcomp,nmod,ROOT,ccomp,advmod,amod,dobj
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,case new virus spreading rapidly china reporte...,"(case, new, virus, spreading, rapidly, china, ...","{case: 'compound', new: 'amod', virus: 'nsubj'...","compound,amod,nsubj,ROOT,ccomp,punct,dobj","[compound, 82, amod, 30, nsubj, 46, ROOT, 32, ...",,,,...,,,,,,,,,,
1,Nicole Wetsman,disease control officials in wuhan the chinese...,disease control officials wuhan chinese city o...,"(disease, control, officials, wuhan, chinese, ...","{disease: 'compound', control: 'compound', off...","compound,nsubj,nmod,amod,ROOT,dobj,ccomp,punct","[compound, 97, nsubj, 42, nmod, 16, amod, 42, ...",,,,...,,,,,,,,,,


In [118]:
relations = []

for each in range(len(kg.all_token_deps)):
    for key,value in kg.all_token_deps[each].items():
        if value == 'ROOT':
                relations.append(key)
          

In [119]:
relations = []

def get_entities():
    
    for each in range(len(kg.tokens)):

        for tag in kg.tokens[each]:
        
            if tag.pos_=='VERB':
                relations.append(tag)

            sub = ""
            obj = ""

            prv_tok_pos = ""    
            prv_tok_text = ""   

            prefix = ""

            if tag.pos_=='NOUN':
                prefix = tag

            if (prefix is not None) and (tag.pos_=='VERB'):
                sub = prefix
                prefix = ""
                prv_tok_pos = ""
                prv_tok_text = ""

            if prv_tok_pos=='ADJ':
                prefix = prv_tok_text + ' ' + tag
            elif prv_tok_pos =='NOUN':
                prefix = prv_tok_text + ' ' + tag

            prv_tok_pos = tag.pos_
            prv_tok_text = tag.text
        
    return [sub, obj]

In [120]:
get_entities()

['', '']

In [121]:
len(relations)

7398

In [122]:
relations[:5]

[spreading, reported, reports, returned, monitored]

In [123]:
def get_entities(sent):

    ent1=""
    ent2=""

    prv_tok_pos=""    
    prv_tok_text=""   

    prefix=""
    modifier=""

    for tok in nlp(sent):

        if tok.pos_=="NOUN":
            prefix=tok.text
        
        if prv_tok_pos=="NOUN" or prv_tok_pos=="ADJ":
            prefix=prv_tok_text+" "+tok.text

        ##if tok.pos_.endswith("mod")==True:
        ##    modifier = tok.text

        ##if prv_tok_dep=="compound":
        ##    modifier=prv_tok_text + " "+ tok.text

        if tok.dep_.find("subj")==True:
            ent1=modifier+" "+prefix+" "+tok.text
            prefix=""
            modifier=""
            prv_tok_pos=""
            prv_tok_text=""      

        if tok.dep_.find("obj")==True:
            ent2 = modifier+" "+prefix+" "+tok.text

        prv_tok_pos=tok.dep_
        prv_tok_text=tok.text

    return [ent1.strip(),ent2.strip()]

In [124]:
get_entities('a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports.')

['case case', 'patient reuters']

In [125]:
t1 = 'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airport on january 15th and reported symptoms to his medical provider on sunday january 19th. the patient was familiar with the reports of the wuhan virus and shared them with his provider and the positive test for the coronavirus came back yesterday. he poses little risk to hospital staff or to the general public and is cooperating fully the cdc said in its briefing. on january 17th the cdc started enhanced health screenings at san francisco international airport john f. kennedy international airport in new york and los angeles international airport for passengers who flew from or connected through wuhan. the agency will begin screening at chicago ohare international airport and hartsfield-jackson atlanta international airport this week. any flights from or connecting through wuhan will also begin to be funneled to those airports. health officials experience with sars makes them more prepared to respond to threats from coronaviruses sheahan says. people are aware. before sars people had no idea this could happen. i think everyone is much more prepared now in 2020 than we were in 2002. coronaviruses are common in animals and can evolve into forms that can be passed to and infect humans. the virus that caused sars for example originated in bats. the world health organization has identified a wholesale seafood market in wuhan as a possible source of the virus but it also noted that some laboratory-confirmed patients did not report visiting this market. one of the most pressing questions about the new virus is how easily it spreads. chinese health authorities said this week that there has been at least one confirmed case where the virus passed directly from one person to another without passing through an animal reservoir. the key issue we need to understand is how easily or sustainably the virus is spread from human to human says nancy messonnier director of the national center for immunization and respiratory diseases at the cdc. an expert panel at the world health organization is meeting on wednesday january 22nd to determine if it should declare a global public health emergency. update january 21st 3:08pm et: this report was updated to include new information from the cdc.'

In [126]:
for i in range(len(kg.tokens)):
    for each in range(len(kg.tokens[i])):
            print(kg.tokens[i][each].text)

case
new
virus
spreading
rapidly
china
reported
patient
seattle
washington
reuters
reports
.
patient
recently
returned
china
clinically
healthy
monitored
.
case
virus
detected
wuhan
city
central
china
late
december
2019
.
sickened
300
people
killed
.
despite
case
report
centers
disease
control
prevention
(
cdc
)
said
press
briefing
believe
threat
remains
low
.
virus
currently
known
2019-ncov
.
designation
indicates
coronavirus
family
viruses
caused
sars
outbreak
2003
.
outbreak
killed
nearly
800
people
.
bringing
sars
flashbacks
says
timothy
sheahan
coronavirus
expert
assistant
professor
university
north
carolina
gillings
school
global
public
health
.
patient
flew
seattle
-
tacoma
international
airport
january
15th
reported
symptoms
medical
provider
sunday
january
19th
.
patient
familiar
reports
wuhan
virus
shared
provider
positive
test
coronavirus
came
yesterday
.
poses
little
risk
hospital
staff
general
public
cooperating
fully
cdc
said
briefing
.
january
17th
cdc
started
enhanced
he

lot
things
coronaviruses
emerge
cause
severe
human
disease
sheahan
says
.
people
aware
.
update
january
22nd
5:06pm
et
:
report
updated
include
new
information
.
huawei
announced
postponement
major
developers
conference
safety
precautions
coronavirus
outbreak
wuhan
china
.
hdc.cloud
2020
planned
place
shenzhen
february
11th
12th
pushed
march
27th-28th
according
events
website
.
17
people
confirmed
killed
sars
-
like
virus
far
.
conference
enterprise
-
focused
serves
huaweis
primary
event
developers
.
want
share
ict
technologies
capabilities
huawei
developed
past
30
years
huawei
said
promote
.
kunpeng
ascend
processors
particular
powerful
new
engines
global
developers
.
huawei
unlikely
chinese
company
alter
upcoming
events
calendar
situation
wuhan
treated
countermeasures
.
foxconn
ceo
terry
gou
warned
staff
travel
mainland
china
upcoming
lunar
new
year
period
reuters
reports
.
entire
city
wuhan
11
million
residents
lockdown
thursday
morning
airports
train
stations
closed
notice
.
unclea

knowing
good
search
webcrawler
good
yahoo
power
user
internet
skills
.
google
hit
nt
realize
powerful
good
pagerank
technology
right
away
.
noticed
right
away
trust
search
results
organic
instead
paid
dark
patterns
tricking
clicking
ad
.
reasons
google
won
search
place
old
people
like
addition
superior
technology
drew
harder
line
allowing
paid
advertisements
search
results
competitors
.
search
engines
problem
paid
inclusion
rare
business
practice
exactly
phrase
means
.
knew
seeing
result
web
-
crawling
bot
business
deal
.
new
ad
layout
nt
cross
line
definitely
problematic
definitely
reduces
trust
googles
results
.
paid
inclusion
paid
occlusion
.
today
trust
google
allow
business
dealings
affect
rankings
organic
results
matter
people
nt
visually
tell
difference
glance
?
matter
certain
sections
google
like
hotels
flights
use
paid
inclusion
?
matter
business
dealings
likely
affect
outcome
use
generation
search
google
assistant
?
:
google
willing
visually
muddle
ads
long
users
lose
trust
a

overcast
spotify
casts
.
equity
drops
friday
6:00
pt
subscribe
apple
podcasts
overcast
spotify
casts
.
equity
drops
friday
6:00
pt
subscribe
apple
podcasts
overcast
spotify
casts
.
equity
drops
friday
6:00
pt
subscribe
apple
podcasts
overcast
spotify
casts
.
equity
drops
friday
6:00
pt
subscribe
apple
podcasts
overcast
spotify
casts
.
equity
drops
friday
6:00
pt
subscribe
apple
podcasts
overcast
spotify
casts
.
equity
drops
friday
6:00
pt
subscribe
apple
podcasts
overcast
spotify
casts
.
plague
inc
.
creator
ndemic
creations
announced
friday
players
remember
game
nt
scientific
model
deal
coronavirus
broke
wuhan
china
.
reminder
things
learn
video
games
limits
.
idea
2012
simulation
game
worldwide
pandemic
like
company
concerned
influx
saw
week
-
selling
app
china
.
coronavirus
killed
80
people
far
.
developer
recommended
people
inundating
companys
website
spiking
interest
game
check
official
sources
like
world
health
organization
.
centers
disease
control
answered
q&a
plague
inc
.
2013

apples
lead
supplier
said
tuesday
expect
coronavirus
affect
manufacturing
timelines
.
clear
time
apple
experiencing
retail
slowdowns
country
outbreak
planned
adjusting
manufacturing
plans
separately
.
cook
announcement
investors
apples
quarterly
earnings
release
saying
closed
retail
stores
number
channel
partners
closed
storefronts
.
apple
says
sales
area
city
wuhan
coronavirus
outbreak
said
originated
low
.
said
retail
traffic
country
negatively
affected
situation
.
reason
chinese
government
extended
lunar
new
year
holiday
encouraging
people
stay
home
avoid
unintentionally
spreading
contracting
virus
.
cook
said
apple
accounted
delay
reopening
production
facilities
holiday
extension
.
companys
revenue
projections
upcoming
quarter
reflect
added
.
additionally
apple
providing
care
kits
employees
wuhan
area
regularly
taking
temperature
employees
check
fever
flu
-
like
symptoms
indicative
virus
aggressively
cleaning
retail
stores
offices
.
related
united
airlines
suspending
flights
china


scroll
wo
nt
sell
data
company
snaps
scroll
someday
?
s
prominent
button
deleting
data
.
scrolls
privacy
policy
refreshingly
readable
candid
gathers
nt
share
  
including
honest
sharing
information
governments
required
law
.
notes
data
sale
company
.
basically
suggest
find
delete
information
button
remember
.
second
:
scrolls
entire
method
stopping
ads
absolutely
ingenious
repurposing
-
party
cookies
.
log
scroll
sets
cookie
websites
visit
special
cookie
nt
serve
ads
.
ad
-
blocking
nt
served
.
actually
elegant
second
think
chain
communications
deals
required
elegant
like
hellacious
hack
.
nilay
patel
said
today
nt
web
technology
hellacious
hack
?
details
  
safari
particular
stricter
browsers
requires
extension
.
brave
need
extra
effort
work
scroll
.
(
scroll
snarky
footnote
.
)
:
easier
solution
websites
paid
asking
roll
subscription
.
tracks
visit
automatically
divvies
payment
partner
sites
.
(
eventually
)
quibble
percentage
scroll
taking
:
$
1.50
$
5
thirty
percent
.
independent
s

crew
priority
company
said
email
.
ba.com
currently
shows
direct
flights
mainland
china
flights
hong
kong
unaffected
.
company
said
flights
suspended
receives
information
british
officials
according
bloomberg
.
airline
dropped
china
flights
.
case
united
airlines
announced
suspending
flights
yesterday
decision
taken
significant
decline
demand
specific
safety
concerns
.
british
airways
announced
cancellation
flights
wake
british
foreign
offices
essential
travel
mainland
china
.
uk
arranging
evacuate
citizens
wuhan
surrounding
hubei
province
result
virus
resulted
132
deaths
china
.
6000
people
worldwide
currently
thought
infected
according
cnn
.
american
airlines
suspend
flights
los
angeles
mainland
china
february
9th
march
27th
significant
decline
demand
brought
coronavirus
outbreak
company
announced
wednesday
.
airline
continue
operate
flights
beijing
shanghai
dallas
-
fort
worth
los
angeles
hong
kong
.
company
says
contact
customers
tickets
affected
flights
directly
email
telephone
.


.
new
mom
photo
breastfeeding
child
removed
error
appeal
board
decides
hear
case
landmark
decision
global
nipple
viewability
standards
probably
stand
wait
months
answer
.
business
ads
removed
promote
products
contain
cbd
oil
cbd
oil
widely
legal
-
month
delay
mean
difference
life
death
company
.
worth
noting
-
month
process
far
superior
current
system
justice
involves
filling
form
sending
praying
.
(
new
system
involve
filling
form
sending
praying
chance
independent
board
ask
case
formally
consult
experts
render
binding
opinion
favor
.
)
m
glad
worlds
largest
quasi
-
states
evolved
include
judicial
system
.
worth
noting
system
set
explicitly
redress
complaints
individual
users
.
wo
nt
asked
fix
facebook
broadly
  
judgments
service
health
overall
user
base
world
inhabit
.
remains
sole
discretion
executive
  
facebooks
ceo
.
company
ceo
majority
control
voting
shares
effectively
legislative
branch
.
facebook
taking
boldest
approach
ve
seen
establishing
independent
mechanism
accountabili

major
emergency
china
.
government
said
companies
city
allowed
resume
operations
week
.
company
nt
expect
big
financial
hit
shanghai
-
produced
model
3
represents
tiny
fraction
companys
quarterly
profits
kirkhorn
said
.
tesla
nt
company
disrupted
viral
outbreak
.
google
said
today
temporarily
shutting
china
offices
coronavirus
.
apple
facebook
restricted
employee
travel
week
.
wuhan
coronavirus
death
toll
rises
200
people
number
confirmed
cases
reaches
nearly
10000
15
countries
world
health
organization
(
)
yesterday
declared
public
health
emergency
international
concern
.
coronavirus
continues
spread
misinformation
ways
prevent
treat
.
false
claims
potential
vaccines
preposterous
prevention
methods
avoiding
cold
food
eating
spicy
food
shared
widely
social
media
.
miracle
cures
rinsing
mouth
saline
solution
drinking
bleach
reared
heads
.
facebook
confirmed
looking
limit
spread
misinformation
coronavirus
directing
people
helpful
information
.
company
said
fact
-
checking
content
debunki

cost
expensive
smartphone
.
bad
news
:
subsidizing
at&ts
harebrained
scheme
turn
media
conglomerate
turns
5
g
subscribers
hbo
max
watchers
vice
versa
.
└
amazon
says
150
million
prime
members
huge
holiday
season
occurred
today
amazon
prime
kind
infrastructure
service
includes
hidden
costs
video
content
.
└
heres
need
watch
super
bowl
4k
hdr
good
story
cameron
faulkner
.
story
way
complicated
!
low
-
key
win
amazon
short
answer
best
super
bowl
stream
?
fire
tv
.
└
twitch
caffeine
hosting
super
bowls
big
game
bijan
stephen
:
squint
twitch
rivals
tournament
makes
sense
terms
negotiating
new
better
deal
nfl
;
nt
escaped
leagues
notice
lot
players
  
young
internet
-
literate
  
interested
streaming
professionally
personally
.
└
nintendo
says
plans
new
switch
year
animal
crossing
-
themed
switch
nt
count
.
└
huawei
overtakes
apple
annual
race
samsungs
smartphone
crown
jump
especially
surprising
given
huaweis
continued
presence
usas
entity
list
prevents
company
installing
googles
apps
servic

adult
commentary
underneath
childrenʼs
videos
core
business
.
⭐
facebook
agreed
pay
$
550
million
settle
class
-
action
lawsuit
use
facial
recognition
technology
illinois
.
news
marks
major
victory
privacy
groups
report
natasha
singer
mike
isaac
new
york
times
:
case
stemmed
facebooks
photo
-
labeling
service
tag
suggestions
uses
face
-
matching
software
suggest
names
people
users
photos
.
suit
said
silicon
valley
company
violated
illinois
biometric
privacy
law
harvesting
facial
data
tag
suggestions
photos
millions
users
state
permission
telling
long
data
kept
.
facebook
said
allegations
merit
.
agreement
facebook
pay
$
550
million
eligible
illinois
users
plaintiffs
legal
fees
.
sum
dwarfs
$
380.5
million
equifax
credit
reporting
agency
agreed
month
pay
settle
class
-
action
case
2017
consumer
data
breach
.
mark
zuckerberg
slated
visit
brussels
mid
-
february
meeting
european
union
officials
facebook
fends
antitrust
privacy
scrutiny
handles
user
data
.
judge
texas
temporarily
blocked
r

canadian
company
bludot
sifted
global
news
reports
airline
data
reports
animal
disease
outbreaks
issue
alert
current
coronavirus
outbreak
days
ahead
official
organizations
world
health
organization
.
ai
survey
year
found
70
%
public
technology
leaders
believe
ai
lead
greater
social
isolation
loss
human
intellect
creativity
.
data
collected
knowledge
processed
ai
applications
manipulate
wants
beliefs
effectively
controlling
people
commercial
political
purposes
.
deepfake
videos
created
readily
available
ai
tools
pose
additional
challenge
undermining
ability
know
real
fake
.
course
ai
expected
jobs
.
given
force
technology
nt
governments
bracing
effect
robust
regulations
?
u.s
.
government
far
taking
hands
-
approach
.
u.s
.
chief
technology
officer
michael
kratsios
warned
federal
agencies
-
regulating
companies
developing
artificial
intelligence
.
views
u.s
.
government
nt
want
issue
meaningful
regulation
administration
finds
regulation
antithetical
core
beliefs
.
greater
movement
under

hedge
shared
story
unsubstantiated
claims
wuhan
-
based
scientist
created
new
coronavirus
weapon
doxxed
researcher
publishing
photo
email
phone
number
.
  
buzzfeed
news
discovered
zero
hedge
suggested
readers
"
probably
pay
[
scientist
]
visit
"
--
thinly
-
veiled
threat
violence
.
statement
twitter
said
banned
zero
hedge
violating
social
network
"
platform
manipulation
policy
.
"
  
zero
hedge
said
received
notice
friday
violating
twitter
policies
"
abuse
harassment
.
"
  
twitter
warned
late
january
ban
accounts
involved
"
coordinated
attempts
"
spread
coronavirus
misinformation
.
ban
completely
cut
zero
hedge
social
channels
(
facebook
write
)
significantly
limit
site
ability
disseminate
stories
roughly
670000
followers
.
  
incident
illustrates
risks
sites
use
mainstream
social
networks
spread
conspiracies
threats
.
  
reach
largest
potential
audiences
days
social
media
sites
crack
large
-
scale
abusers
erase
presence
instant
.
-
electric
racing
series
formula
e
canceled
upcoming


revenue
googles
cloud
service
rose
53
%
$
2.6
billion
quarter
advertising
youtube
rose
31
%
$
4.7
billion
.
addition
youtube
generated
$
750
million
subscription
non
-
advertising
revenue
alphabet
ceo
sundar
pichai
said
.
google
namesake
search
engine
properties
youtube
webs
biggest
draw
advertisers
decade
enabling
month
fourth
listed
company
$
1
trillion
market
capitalization
.
new
concerns
emerged
investors
dominance
u.s
.
antitrust
regulators
investigate
google
amazon.com
facebook
continue
grow
ads
businesses
globally
.
google
blamed
foreign
exchange
rates
-
time
product
changes
recent
lapses
20
%
revenue
growth
investors
grown
accustomed
company
.
overall
sales
fourth
quarter
$
46.08
billion
17
%
compared
average
estimate
$
46.94
billion
financial
analysts
tracked
refinitiv
.
google
ad
sales
holiday
shopping
quarter
$
37.93
billion
16.7
%
period
year
googles
revenue
bucket
including
app
store
purchases
cloud
computing
deals
rose
21.6
%
$
7.88
billion
.
shares
company
fell
4.66
%
ex

infections
fauci
said
.
results
remdesivir
trial
nt
expected
end
april
turn
drugs
investigation
effective
treating
new
virus
.
options
available
  
available
quickly
  
testament
research
s
.
   
mobile
world
congress
event
world
--
particularly
nerdy
corner
--
looks
annual
updates
smartphones
networking
technology
.
year
concerns
growing
coronavirus
outbreak
continue
mount
lg
decided
barcelona
-
based
trade
simply
worth
risk
.
"
safety
employees
general
public
foremost
mind
lg
decided
withdraw
exhibiting
participating
mwc
2020
later
month
barcelona
spain
"
statement
emailed
reporters
said
.
"
decision
prevent
needlessly
exposing
hundreds
lg
employees
international
travel
health
experts
advised
.
"
instead
lg
hold
separate
events
"
near
future
"
reveal
batch
2020
smartphones
expected
include
sequel
year
ambitious
underwhelming
g8
thinq
.
officially
begins
february
24th
played
host
100000
attendees
year
based
china
attended
behalf
chinese
companies
.
unfortunately
coronavirus
situation


mislead
people
voting
census
banned
deepfake
videos
pose
risk
egregious
harm
taken
context
.
note
actually
new
  
restated
time
caucuses
.
(
youtube
)
google
limiting
access
key
tools
track
ad
spending
disrupt
hundreds
marketers
rely
tools
jobs
.
situation
underscores
powerful
role
google
plays
digital
advertising
space
prompted
industry
partners
company
anticompetitive
.
(
gerrit
de
vynck
mark
bergen
/
bloomberg
)
ad
industry
groups
asking
california
delay
enforcement
states
new
privacy
law
.
law
went
effect
january
1st
stringent
rules
wo
nt
enforced
july
.
groups
s
time
businesses
compliance
.
(
suhauna
hussain
/
los
angeles
times
)
senator
lindsey
graham
trump
ally
targeting
big
tech
companies
like
apple
facebook
new
child
protection
bill
threaten
use
encryption
.
proposal
weaken
section
230
protections
related
child
exploitation
abuse
laws
.
(
ben
brody
naomi
nix
/
bloomberg
)
popular
pro
-
trump
website
released
personal
information
scientist
wuhan
china
falsely
accusing
creating


profile
redesign
s
remarkably
similar
instagram
.
new
profile
shifts
avatars
follow
count
left
places
emphasis
user
bios
.
(
dami
lee
/
verge
)
teenagers
group
accounts
flood
instagram
hard
-
-
parse
user
data
nt
tied
single
person
.
fascinating
response
young
peoples
anxieties
surveillance
.
teens
!
(
alfred
ng
/
cnet
)
bizarre
twist
fate
iowa
caucus
quarantined
rare
case
moronovirus
.
(
.
)
send
tips
comments
questions
iowa
caucus
results
:
casey.com
zoe.com
.
food
drug
administration
issued
expedited
approval
test
new
coronavirus
signing
use
state
health
labs
.
speed
efforts
detect
cases
virus
sickened
nearly
25000
people
world
.
samples
suspected
cases
sent
centers
disease
control
prevention
testing
.
ability
distribute
diagnostic
test
qualified
labs
critical
step
forward
protecting
public
health
said
fda
commissioner
stephen
hahn
statement
.
fda
sidestepped
usual
regulatory
channels
signed
test
emergency
use
authorization
allows
use
medical
products
life
-
threatening
situations
a

offer
googles
apps
services
year
including
google
play
store
.
problem
huawei
decided
officially
releasing
flagship
mate
30
internationally
.
company
announced
working
operating
system
called
harmony
os
says
investing
$
1
billion
fund
development
user
growth
marketing
huawei
mobile
services
alternative
googles
services
.
play
store
provides
significant
revenue
stream
google
takes
30
percent
cut
sales
store
.
total
thought
company
$
8.8
billion
worldwide
year
according
analyst
quoted
reuters
.
chinas
phone
makers
want
slice
pie
handset
sales
slow
globally
.
reuters
reports
march
launch
planned
new
platform
delayed
coronavirus
outbreak
.
-
interactive
publishing
label
private
division
announced
today
outer
worlds
‘
switch
port
delayed
coronavirus
.
supposed
come
march
6
.
virtuous
studio
handling
port
largest
office
shanghai
china
.
country
combating
coronavirus
private
division
notes
impacting
development
.
illness
causing
production
problems
switch
hardware
.
delaying
  
nintendo
switc

lecher
andrew
j.
hawkins
/
verge
)
face
masks
mandatory
provinces
china
government
tries
contain
coronavirus
.
residents
saying
masks
trip
facial
recognition
technology
everyday
transactions
.
(
anne
quito
/
quartz
)
companies
like
wechat
bytedance
working
stop
misinformation
coronavirus
china
.
filling
void
left
government
slow
acknowledge
crisis
.
(
south
china
morning
post
)
regulators
ireland
launched
inquiries
google
tinder
process
user
data
.
currently
23
ongoing
inquiries
big
tech
companies
include
facebook
twitter
.
(
associated
press
)
⭐
facebook
shutting
mobile
web
arm
audience
network
starting
april
11
.
network
offered
advertisers
way
extend
facebook
ad
campaigns
network
-
party
apps
.
lara
oreilly
digiday
explains
decision
:
open
web
environment
outside
facebooks
properties
changed
significantly
years
audience
network
launched
.
majority
browsers
turned
-
party
web
tracking
default
.
google
web
browser
commands
largest
market
share
indicated
month
plans
switch
support
-
pa

street
view
took
photos
everyones
homes
allows
browse
leisure
.
response
criticism
-
google
ceo
eric
schmidt
famously
suggested
people
angry
loss
privacy
simply
.
(
?
!
)
angry
germans
sued
ultimately
lost
.
courts
ruled
photos
taken
public
road
people
opt
having
homes
shown
privacy
violated
.
course
reason
people
object
massive
data
-
collection
schemes
gather
data
creators
intend
.
street
view
cars
example
connected
unsecured
wi
-
fi
networks
rounds
2008
2010
  
slurped
snippets
e
-
mails
photographs
passwords
chat
messages
[
]
postings
websites
social
networks
according
2012
story
new
york
times
.
google
said
mistake
apologized
germany
fined
shy
maximum
data
privacy
breach
scale
:
hilarious
145000
euros
.
(
leaving
zeroes
accident
.
)
intervening
years
like
data
privacy
scandals
forgotten
.
case
feels
freshly
relevant
light
past
months
news
clearview
ai
.
like
google
2008
clearview
slurps
public
data
  
case
photos
people
posted
publicly
internet
  
build
-
profit
tool
permission
in

focus
efforts
local
demos
new
technology
nt
want
employees
danger
catching
virus
.
ericsson
appreciates
gsma
control
risk
press
release
states
.
largest
exhibitors
ericsson
thousands
visitors
hall
day
risk
low
company
guarantee
health
safety
employees
visitors
.
gsm
association
organization
puts
said
statement
respects
ericssons
decision
encouraged
company
committed
2021
.
reemphasized
end
month
february
24th
27th
.
continuing
monitor
assess
virus
situation
appropriate
changes
.
group
previously
said
increase
medical
support
disinfection
measures
site
communicate
best
practices
attendees
.
speakers
subject
new
microphone
changing
protocol
-
handshake
policy
advised
.
xiaomi
vivo
honor
previously
told
verge
plan
attend
qualcomm
lenovo
motorola
.
ericsson
decision
comes
time
people
looking
company
business
.
attorney
general
william
barr
suggested
week
government
purchase
stake
ericsson
finland
-
based
nokia
effort
thwart
china
-
based
company
huaweis
telecom
ambitions
.
intelligence
ind

said
.
5000
-
6000
visitors
typically
come
china
worlds
premier
telecoms
industry
gathering
companies
spend
millions
stands
hospitality
fill
order
books
year
ahead
.
chinese
companies
huawei
zte
said
attend
ordering
china
-
based
staff
self
-
isolate
ahead
event
ensure
free
illness
drafting
european
staff
cover
stranded
.
china
raised
death
toll
outbreak
811
sunday
passing
number
killed
globally
sars
epidemic
total
confirmed
cases
illness
reached
37198
.
virus
spread
27
countries
territories
according
reuters
count
based
official
reports
infecting
330
people
outside
china
.
deaths
reported
outside
mainland
china
  
chinese
nationals
.
dozen
large
trade
fairs
industry
conferences
china
overseas
postponed
hit
travel
curbs
concerns
spread
virus
potentially
disrupting
billions
dollars
worth
deals
.
(
reporting
douglas
busvine
berlin
jessica
jones
madrid
;
editing
kirsten
donovan
.
)
sony
latest
tech
company
withdraw
mobile
world
conference
(
mwc
)
2020
barcelona
fears
novel
coronavirus
con

health
organization
january
30th
2020
sony
says
statement
posted
website
monday
.
place
utmost
importance
safety
wellbeing
customers
partners
media
employees
taken
difficult
decision
withdraw
exhibiting
participating
mwc
2020
barcelona
spain
.
amazon
tells
techcrunch
outbreak
continued
concerns
novel
coronavirus
amazon
withdraw
exhibiting
participating
mobile
world
congress
2020
scheduled
feb
.
24
-
27
barcelona
spain
.
amazon
historically
major
consumer
-
focusing
presence
mwc
.
sony
hand
uses
reveal
important
mobile
devices
.
year
example
company
announced
xperia
1
flagship
phone
.
sony
says
instead
announcements
online
xperia
youtube
channel
year
.
statement
tcl
said
taken
decision
cancel
press
event
mwc
.
stressed
decision
impact
mwc
2020
activities
planned
company
tcl
announce
latest
mobile
devices
showcase
booth
.
yesterday
gsm
association
organizes
mwc
updated
statement
detailing
countermeasures
taking
spread
coronavirus
.
travelers
chinas
hubei
province
outbreak
began
permitted

s
galaxy
fold
increasingly
obvious
s
motorola
razr
.
spoiler
alert
:
m
razr
tell
state
folding
phones
right
bumpy
  
literally
screen
bumpy
motorola
says
normal
.
says
creaking
noise
hinge
makes
normal
.
heres
statement
motorola
spokesperson
:
folding
unfolding
razr
hear
sound
intrinsic
mechanical
movement
phone
.
razr
undergone
rigorous
durability
testing
reported
sounds
way
affect
quality
product
.
record
think
motorola
different
views
included
considering
quality
product
.
folding
phone
ve
date
shared
following
qualities
:
fair
s
serve
examples
.
z
flip
needs
charm
consumers
rightfully
write
category
awhile
.
nt
think
z
flip
going
answer
single
bullet
points
.
fact
know
wo
nt
.
price
rumored
$
1400
.
tease
samsung
dropped
oscars
showed
hinge
design
looks
similar
galaxy
folds
hinge
  
including
gap
closed
.
fragile
screen
rumor
samsung
finally
figured
way
use
glass
instead
plastic
.
likely
need
thin
flexible
wo
nt
solve
durability
problem
fell
swoop
help
.
m
optimistic
software
z
fli

<a id='match'></a>
## Matcher for pattern recog:

In [127]:
from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)

In [128]:
pattern = [{'POS': 'NOUN', 'OP': '+'},
           {'POS': 'ADJ', 'OP': '?'},
           {'POS': 'PROPN', 'OP': '?'},
           {'POS': 'VERB', 'OP': '*'},
           {'POS': 'NOUN', 'OP': '*'},
           {'POS': 'ADJ', 'OP': '?'},
           {'POS': 'PROPN', 'OP': '?'}]

In [129]:
m_tool.add('triples', None, pattern)

In [130]:
for i in kg.tokens[0]:
    matched = m_tool(kg.tokens[0])

In [131]:
for match_id, start, end in matched:
    string_id = nlp.vocab.strings[match_id]  
    span = kg.tokens[0][start:end]                   
    print(span.text)

case
case new
virus
case new virus
virus spreading
people
people killed
case
case report
centers
centers disease
case report centers
case report centers disease
centers disease control
threat
threat remains
threat remains low
virus
designation
designation indicates
designation indicates coronavirus
family
viruses
family viruses
viruses caused
viruses caused sars
family viruses caused
family viruses caused sars
people
health
15th
15th reported
symptoms
15th reported symptoms
symptoms medical
15th reported symptoms medical
provider
provider sunday
symptoms medical provider
symptoms medical provider sunday
provider sunday january
19th
reports
reports wuhan
reports wuhan virus
provider
provider positive
test
provider positive test
coronavirus
test coronavirus
provider positive test coronavirus
coronavirus came
test coronavirus came
yesterday
coronavirus came yesterday
test coronavirus came yesterday
risk
hospital
risk hospital
staff
hospital staff
risk hospital staff
staff general
staff ge

In [132]:
relations[:3]

[spreading, reported, reports]

<a id='nc'></a>
## Trying out with noun_chunks with just all extracted verbs: 

In [133]:
noun_chunks_0 = []
for chunk in kg.tokens[0].noun_chunks:
        noun_chunks_0.append(chunk)

In [134]:
noun_chunks_0[0]

case new virus

In [135]:
noun_chunks_0[1]

china

In [136]:
sub_0 = noun_chunks_0[::2]

len(sub_0)

37

In [137]:
obj_0 = noun_chunks_0[1::2]
len(obj_0)

37

In [138]:
verbs_0 = []
for each in kg.tokens[0]:
    if each.pos_=='VERB':
        verbs_0.append(each)

In [139]:
len(verbs_0)

65

In [140]:
vrb_0 = verbs_0[0:37]
len(vrb_0)

37

In [141]:
len(obj_0)

37

In [142]:
df_0 = pd.DataFrame({'ent1':sub_0,'verb':vrb_0,'ent2':obj_0})
df_0.head()

Unnamed: 0,ent1,verb,ent2
0,"(case, new, virus)",spreading,(china)
1,"(patient, seattle, washington, reuters)",reported,(patient)
2,(china),reports,"(case, virus)"
3,"(wuhan, city, central, china)",returned,"(300, people)"
4,"(case, report, centers)",monitored,"((, cdc)"


In [143]:
df_0.ent1 = list(df_0.ent1)
df_0.ent2 = list(df_0.ent2)

In [144]:
df_0.head()

Unnamed: 0,ent1,verb,ent2
0,"(case, new, virus)",spreading,(china)
1,"(patient, seattle, washington, reuters)",reported,(patient)
2,(china),reports,"(case, virus)"
3,"(wuhan, city, central, china)",returned,"(300, people)"
4,"(case, report, centers)",monitored,"((, cdc)"


In [145]:
for each in range(len(df_0.ent1)):
    for word in df_0.ent1[each]:
        df_0.ent1.iloc[each] =  str(word) + str(word)+' '

In [146]:
df_0

Unnamed: 0,ent1,verb,ent2
0,virusvirus,spreading,(china)
1,reutersreuters,reported,(patient)
2,chinachina,reports,"(case, virus)"
3,chinachina,returned,"(300, people)"
4,centerscenters,monitored,"((, cdc)"
5,briefingbriefing,detected,(threat)
6,virusvirus,sickened,(designation)
7,virusesviruses,killed,(sars)
8,outbreakoutbreak,said,"(nearly, 800, people)"
9,flashbacksflashbacks,believe,(patient)


<a id = 'redundant'></a>
## Redundant Method 2:

In [147]:
from spacy.symbols import *

np_labels = set(['NOUN','ADJ'])

def itern_nps(doc):
    for word in doc:
        if word.pos_ in np_labels:
            print(word.subtree)

In [148]:
#print(itern_nps(nlp(kg.content[0])))

In [149]:
from spacy.util import filter_spans
from spacy.matcher import Matcher

def noun_chunks(text):
    doc = nlp(text)
    pattern = [
    {'POS': 'DET', 'OP': '?'},
    {'POS': 'ADJ', 'OP': '*'},
    {'POS': 'NOUN', 'OP':'+'},
    {'POS': 'ADP', 'OP':'*'},
    {'POS': 'PROPN', 'OP':'*'},    
    {'POS':'VERB','OP':'*'},
    {'POS':'PROPN','OP':'*'}
    ]
    matcher = Matcher(nlp.vocab)
    matcher.add('NOUN_PHRASE', None, pattern)
    matches = matcher(doc)

    spans = [doc[start:end] for match_id, start, end in matches]
    
    return  filter_spans(spans)
    
   

In [150]:
noun_chunks(kg.content[0])

[a case of,
 the new virus spreading,
 a patient in seattle washington reuters reports,
 the patient,
 case of,
 the virus,
 a city in,
 people,
 the case report,
 the centers for disease control,
 a press briefing,
 the threat to,
 the virus,
 the designation indicates,
 a coronavirus,
 the family of,
 viruses,
 outbreak in,
 that outbreak killed,
 people,
 a coronavirus expert,
 assistant professor at,
 global public health,
 patient flew,
 15th,
 symptoms to,
 his medical provider on sunday january,
 19th,
 the patient,
 the reports of,
 virus,
 his provider,
 the positive test for,
 the coronavirus came,
 yesterday,
 little risk,
 hospital staff,
 the general public,
 its briefing,
 17th,
 health screenings at san francisco international airport john f. kennedy international airport,
 passengers,
 the agency will begin screening,
 this week,
 any flights from,
 those airports,
 health officials experience,
 threats from,
 coronaviruses sheahan says,
 people,
 people,
 no idea,
 cor

In [151]:
kg.content[0]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [152]:
sub_0[1]

patient seattle washington reuters

In [153]:
obj_0[1]

patient

In [154]:
doc = nlp(kg.content[0])

for sent in doc.sents:
    for np in sent.noun_chunks:
        {sent:[np]}

In [155]:
for i in range(len(kg.content)):
    for sent_i, sent in enumerate(nlp(kg.content[i]).sents):
        for token in sent:
            #for each in range(len(sub_0)):
            if token in sub_0[0]:
                print('in')
                print(i,sent_i,token)
                    #print(sent_i,sub_0)

In [156]:
type(sub_0)

list

In [157]:
sub_0[0]

case new virus

In [181]:
#for i in range(len(kg.content)):
for tok in nlp(kg.content[0]):
    #if str(tok) in str(sub_0):
     if str(sub_0) in str(tok):
            print(sub_0,tok.i)

<a id='phrase'></a>
## Regex to match Noun phrases:

In [178]:
import re

#sub_0 -> np before verb
# obj_0 -> np after verb
a = []

for tok in nlp(kg.content[0]): #
        
        for each in range(len(sub_0)):
            patt = re.compile(rf'{sub_0[each]}')
            match = re.search(patt,kg.content[0])   #rf'({each})'
            print(match)
            #if match:
                #if tok in list(sub_0):
                    #print('tok in sub_0')
                ##if str(tok) in str(sub_0[each][1]):
                #    a.append(tok.i, match.group(0))

None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(26

None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object

None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
Non

None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='w

<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), matc

<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='hea

None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2

<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced he

<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match obje

None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.M

<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14,

None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanc

None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2

None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='out

None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756),

<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(21

None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2

<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced he

<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 38

<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736

<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='b

None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='out

<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; span=(488, 502), match='press briefing'>
<re.Match object; span=(18, 23), match='virus'>
None
<re.Match object; span=(697, 705), match='outbreak'>
<re.Match object; span=(773, 788), match='sars flashbacks'>
None
<re.Match object; span=(1136, 1147), match='wuhan virus'>
None
<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14,

<re.Match object; span=(494, 502), match='briefing'>
<re.Match object; span=(1396, 1422), match='enhanced health screenings'>
None
None
None
<re.Match object; span=(286, 291), match='wuhan'>
<re.Match object; span=(1837, 1853), match='health officials'>
None
<re.Match object; span=(374, 380), match='people'>
None
<re.Match object; span=(2184, 2197), match='infect humans'>
None
<re.Match object; span=(2262, 2287), match='world health organization'>
None
<re.Match object; span=(2323, 2329), match='market'>
<re.Match object; span=(14, 23), match='new virus'>
<re.Match object; span=(1736, 1740), match='week'>
<re.Match object; span=(2680, 2686), match='person'>
<re.Match object; span=(2747, 2756), match='key issue'>
None
None
<re.Match object; span=(3078, 3108), match='global public health emergency'>
<re.Match object; span=(3130, 3136), match='3:08pm'>
<re.Match object; span=(60, 66), match='report'>
None
None
<re.Match object; span=(45, 50), match='china'>
None
None
<re.Match object; spa

## Conclusion:<br>Since we have multiple things in one noun phrase this will not work. You get a None match for more than one words which are not sequential in its position. 

In [160]:
sub_0[0][1]

new

In [170]:
list(sub_0[0])

[case, new, virus]

In [179]:
kg.content[0][:1000]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [180]:
verbs_0[:5]

[spreading, reported, reports, returned, monitored]