# <center> Path I</center>

## TOC:

1. [Cleaning the data - hashtags, links](#clean)
2. [Creating a small dataset for KG, topic domain - tech & automotive](#small)
3. [POS using NLTK](#nltk)
4. [spaCy - Noun Chunks](#spacy)
5. [Noun-Verb-Noun](#nvn)

### Importing required libraries:

In [1]:
import pandas as pd
import numpy as np 

In [2]:
import pickle

filename = 'data/only_tech_automotive_articles'
# The context manager closes the file even if unpickling fails.
with open(filename, 'rb') as getfile:
    df = pickle.load(getfile)
df.head(2)

Unnamed: 0,title,url,crawled_time,date,domain,author,content,topic_area
159,The US has its first case of the new Wuhan cor...,https://www.theverge.com/2020/1/21/21075647/us...,2020-03-27,2020-01-21,theverge,Nicole Wetsman,A case of the new virus spreading rapidly in C...,tech
197,Transportation shut down in city where new cor...,https://www.theverge.com/2020/1/22/21077545/co...,2020-03-19,2020-01-22,theverge,Nicole Wetsman,"Disease control officials in Wuhan, the Chines...",tech


In [3]:
# Drop the crawl metadata if present; errors='ignore' skips any column that is missing.
df.drop(columns=['url', 'crawled_time'], errors='ignore', inplace=True)
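
One pitfall worth spelling out when guarding a column drop: Python's `and` does not distribute over `in`, so a test written as `'a' and 'b' in columns` only checks the last name. A minimal sketch, with a plain list standing in for the DataFrame's columns (the `columns` list is illustrative, not the real frame):

```python
# Stand-in for df.columns; 'url' is deliberately missing.
columns = ['title', 'crawled_time', 'date']

# Non-empty strings are truthy, so this reduces to `'crawled_time' in columns`.
chained = bool('url' and 'crawled_time' in columns)

# Checking every name explicitly gives the intended answer.
explicit = all(name in columns for name in ['url', 'crawled_time'])

print(chained, explicit)
```

With pandas, `df.drop(columns=[...], errors='ignore')` sidesteps the guard altogether, because missing columns are simply skipped.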

In [4]:
df.groupby("topic_area")["domain"].value_counts()

topic_area  domain          
automotive  computerweekly       163
            autonews              55
            eenewsautomotive      25
            just-auto             24
tech        theverge            1749
            venturebeat          871
            techcrunch           715
            news.crunchbase      311
            bioworld             237
            engadget             181
            japantimes           180
            biospace              12
Name: domain, dtype: int64

In [5]:
df.describe()

Unnamed: 0,title,date,domain,author,content,topic_area
count,4523,4523,4523,4343,4523,4523
unique,4423,233,12,383,4523,2
top,Facebook introduces new livestreaming features...,2020-03-24,theverge,Kim Lyons,Giving people even more of a reason to stay ho...,tech
freq,17,95,1749,149,1,4256


In [6]:
df.content[197][:1500]

'Disease control officials in Wuhan, the Chinese city where the outbreak of the new and rapidly spreading virus began, announced that it’s shutting down transportation within the city and will close all airports and train stations. The city is home to over 11 million people. By Thursday evening, the travel ban had been extended to two more cities as officials began closing off the seven million residents of Huanggang, a city about 30 miles east of Wuhan, and nearby Ezhou, a city of one million. The virus is similar to SARS, which circulated around the world in 2002 and 2003. So far, the new virus has sickened over 500 people and killed 17. In addition to the transportation shutdown, companies like General Motors and Ford are restricting and suspending travel to Wuhan, and Olympic qualifying events have been moved out of the city. **MAJOR BREAKING**: Wuhan, ground zero for the China #coronavirus, to be on public transport lockdown as of Thursday 10am, reports @ChinaDaily. All flights an

<a id='clean'></a>
## Cleaning the text of punctuation, hashtags and websites:


In [7]:
import re
import string 

In [8]:
df.reset_index(inplace=True)

if 'index' in df:
    df.drop(['index'],axis=1,inplace=True)

In [9]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
def clean_round_1(text):
    """Lowercase the text and strip hashtags, @mentions and links."""
    text = text.lower()
    text = re.sub(r'[#@]+\w+', '', text)   # hashtags and @mentions
    text = re.sub(r'http\S+', '', text)    # URLs
    #text = re.sub('\w+\d\w+', '', text)
    return text

def clean_round_2(text):
    """Strip commas, dashes and typographic quotes.

    Periods, parentheses and hyphens are kept; periods are needed later for sentence splitting.
    """
    for each in ['—', '’', ',', '“', '”']:
        text = text.replace(each, '')
    return text

In [11]:
df.content = df.content.apply(lambda x: clean_round_1(x))
df.content = df.content.apply(lambda x: clean_round_2(x))
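
To see what the first cleaning pass does to a tweet-like fragment, here is a small sketch that applies the same lowercase, hashtag/mention and link patterns to a made-up sentence. The function name `strip_tags_and_links`, the sample text and the URL are illustrative assumptions, not part of the dataset:

```python
import re

def strip_tags_and_links(text):
    """Lowercase, then drop #hashtags, @mentions and http(s) links."""
    text = text.lower()
    text = re.sub(r'[#@]+\w+', '', text)
    text = re.sub(r'http\S+', '', text)
    return text

sample = 'Wuhan #coronavirus lockdown, reports @ChinaDaily https://example.com/x today'
cleaned = strip_tags_and_links(sample)
print(cleaned)
```

The removed tokens leave runs of spaces behind; `' '.join(cleaned.split())` collapses them if that matters downstream.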

### After cleaning:

In [12]:
df.content[0][:1000]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

<a id='small'></a>
## Creating a small KG from just 100 rows of the tech and automotive topic areas:

In [13]:
df_kg = df.copy(deep=True)

In [14]:
df_kg.reset_index(inplace=True)

In [15]:
# Keep only author and content; errors="ignore" skips any listed column that is absent.
df_kg.drop(columns=["index", "title", "date", "domain", "topic_area"], errors="ignore", inplace=True)

In [16]:
df_kg = df_kg[:100]

In [17]:
df_kg[:5]

Unnamed: 0,author,content
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...
1,Nicole Wetsman,disease control officials in wuhan the chinese...
2,Nicole Wetsman,scientists think the new virus spreading rapid...
3,Sam Byford,huawei has announced the postponement of a maj...
4,Nicole Wetsman,the world health organization (who) said today...


<a id='token'></a>
## Tokenizing the text in content column:

<a id='nltk'></a>
## NLTK 

In [18]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')

# Tokenize each article, then tag every token with its part of speech.
df_kg["tokens"] = df_kg.content.apply(word_tokenize)
df_kg["all_pos"] = df_kg.tokens.apply(pos_tag)

#txt = nltk.pos_tag(txt)
#txt

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Karthik Pyapali\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [19]:
df_kg[:5]

Unnamed: 0,author,content,tokens,all_pos
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,"[a, case, of, the, new, virus, spreading, rapi...","[(a, DT), (case, NN), (of, IN), (the, DT), (ne..."
1,Nicole Wetsman,disease control officials in wuhan the chinese...,"[disease, control, officials, in, wuhan, the, ...","[(disease, NN), (control, NN), (officials, NNS..."
2,Nicole Wetsman,scientists think the new virus spreading rapid...,"[scientists, think, the, new, virus, spreading...","[(scientists, NNS), (think, VBP), (the, DT), (..."
3,Sam Byford,huawei has announced the postponement of a maj...,"[huawei, has, announced, the, postponement, of...","[(huawei, NN), (has, VBZ), (announced, VBN), (..."
4,Nicole Wetsman,the world health organization (who) said today...,"[the, world, health, organization, (, who, ), ...","[(the, DT), (world, NN), (health, NN), (organi..."


In [20]:
df_kg.content[0][:1000]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [21]:
len(df_kg.all_pos[0])

553

In [22]:
df_kg.all_pos[0][:10]

[('a', 'DT'),
 ('case', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('new', 'JJ'),
 ('virus', 'NN'),
 ('spreading', 'VBG'),
 ('rapidly', 'RB'),
 ('in', 'IN'),
 ('china', 'NN')]

In [23]:
df_kg[:5]

Unnamed: 0,author,content,tokens,all_pos
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...,"[a, case, of, the, new, virus, spreading, rapi...","[(a, DT), (case, NN), (of, IN), (the, DT), (ne..."
1,Nicole Wetsman,disease control officials in wuhan the chinese...,"[disease, control, officials, in, wuhan, the, ...","[(disease, NN), (control, NN), (officials, NNS..."
2,Nicole Wetsman,scientists think the new virus spreading rapid...,"[scientists, think, the, new, virus, spreading...","[(scientists, NNS), (think, VBP), (the, DT), (..."
3,Sam Byford,huawei has announced the postponement of a maj...,"[huawei, has, announced, the, postponement, of...","[(huawei, NN), (has, VBZ), (announced, VBN), (..."
4,Nicole Wetsman,the world health organization (who) said today...,"[the, world, health, organization, (, who, ), ...","[(the, DT), (world, NN), (health, NN), (organi..."


<a id='spacy'></a>
## Spacy: 

In [24]:
import spacy

In [25]:
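# The "en" shortcut was removed in spaCy 3; load the small English model by its full name.
nlp = spacy.load("en_core_web_sm")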

In [26]:
len(df_kg.content[0]) 

3206

<a id='chunks'></a>
## Noun Chunks:

In [27]:
doc = nlp(df_kg.content[0])
print("Nouns:",[chunk.text for chunk in doc.noun_chunks])
print("Verbs:",[token.lemma_ for token in doc if token.pos_ == "VERB"])

Nouns: ['a case', 'the new virus', 'china', 'a patient', 'seattle washington reuters', 'the patient', 'china', 'the first us case', 'the virus', 'wuhan', 'a city', 'central china', 'late december', 'it', 'around 300 people', 'the case', 'the centers', 'disease control', 'prevention', '(cdc', 'a press briefing', 'they', 'the threat', 'the us', 'the virus', 'the designation', 'it', 'a coronavirus', 'the family', 'viruses', 'the sars outbreak', 'that outbreak', 'nearly 800 people', 'me', 'timothy', 'a coronavirus expert', 'assistant professor', 'the university', 'north carolina', 'gillings school', 'global public health', 'the us patient', 'seattle-tacoma international airport', 'january 15th', 'symptoms', 'his medical provider', 'sunday january 19th', 'the patient', 'the reports', 'the wuhan virus', 'them', 'his provider', 'the positive test', 'the coronavirus', 'he', 'little risk', 'hospital staff', 'the general public', 'the cdc', 'its briefing', 'january 17th', 'the cdc', 'enhanced he

In [28]:
# Find named entities, phrases and concepts
print('----Named Entities----')
for entity in doc.ents:
    print(entity.text, entity.label_)

----Named Entities----
china GPE
seattle GPE
washington GPE
china GPE
first ORDINAL
first ORDINAL
wuhan GPE
china GPE
late december 2019 DATE
300 CARDINAL
six CARDINAL
cdc ORG
us GPE
2003 DATE
nearly 800 CARDINAL
timothy sheahan PERSON
the university of north carolina ORG
us GPE
seattle GPE
tacoma international airport FAC
january 15th DATE
sunday january 19th DATE
the wuhan virus FAC
yesterday DATE
cdc ORG
january 17th DATE
cdc ORG
san francisco international FAC
john f. kennedy international PERSON
new york GPE
los angeles GPE
wuhan GPE
chicago GPE
hartsfield GPE
this week DATE
wuhan GPE
sheahan PERSON
2020 DATE
2002 DATE
the world health organization ORG
wuhan GPE
one CARDINAL
chinese NORP
this week DATE
at least one CARDINAL
one CARDINAL
nancy messonnier PERSON
the national center for immunization and respiratory ORG
cdc ORG
the world health organization ORG
wednesday january 22nd DATE
january 21st DATE
cdc ORG
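
A quick way to gauge how crowded this entity list is, and which entries are plausible graph nodes, is to tally the labels. The sketch below uses a hand-copied subset of the (text, label) pairs printed above rather than a live `doc.ents`, so the counts describe the subset only. Treating `GPE`, `ORG` and `PERSON` as the node-worthy labels is an assumption:

```python
from collections import Counter

# A few (text, label) pairs copied from the output above.
entities = [
    ('china', 'GPE'), ('seattle', 'GPE'), ('washington', 'GPE'),
    ('late december 2019', 'DATE'), ('300', 'CARDINAL'),
    ('cdc', 'ORG'), ('timothy sheahan', 'PERSON'),
    ('the world health organization', 'ORG'), ('january 15th', 'DATE'),
]

label_counts = Counter(label for _, label in entities)

# Assumed set of labels worth keeping as graph nodes.
node_labels = {'GPE', 'ORG', 'PERSON'}
nodes = sorted({text for text, label in entities if label in node_labels})

print(label_counts.most_common())
print(nodes)
```

On the real document the same tally would be `Counter(ent.label_ for ent in doc.ents)`.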


## A single article yields too many entities, so we need a way to extract relationships without being overwhelmed by context.

In [29]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [30]:
from spacy import displacy

txt = 'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airport on january 15th and reported symptoms to his medical provider on sunday january 19th. the patient was familiar with the reports of the wuhan virus and shared them with his provider and the positive test for the coronavirus came back yesterday. he poses little risk to hospital staff or to the general public and is cooperating fully the cdc said in its briefing. on january 17th the cdc started enhanced health screenings at san francisco international airport john f. kennedy international airport in new york and los angeles international airport for passengers who flew from or connected through wuhan. the agency will begin screening at chicago ohare international airport and hartsfield-jackson atlanta international airport this week. any flights from or connecting through wuhan will also begin to be funneled to those airports. health officials experience with sars makes them more prepared to respond to threats from coronaviruses sheahan says. people are aware. 
before sars people had no idea this could happen. i think everyone is much more prepared now in 2020 than we were in 2002. coronaviruses are common in animals and can evolve into forms that can be passed to and infect humans. the virus that caused sars for example originated in bats. the world health organization has identified a wholesale seafood market in wuhan as a possible source of the virus but it also noted that some laboratory-confirmed patients did not report visiting this market. one of the most pressing questions about the new virus is how easily it spreads. chinese health authorities said this week that there has been at least one confirmed case where the virus passed directly from one person to another without passing through an animal reservoir. the key issue we need to understand is how easily or sustainably the virus is spread from human to human says nancy messonnier director of the national center for immunization and respiratory diseases at the cdc. an expert panel at the world health organization is meeting on wednesday january 22nd to determine if it should declare a global public health emergency. update january 21st 3:08pm et: this report was updated to include new information from the cdc.'
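# `nlp` was already loaded in the previous cell, so it is reused here.
doc = nlp(txt)
sentence_spans = list(doc.sents)
# displacy.serve starts a local web server and blocks until it is interrupted;
# inside Jupyter, displacy.render(sentence_spans, style="dep", jupyter=True) draws inline instead.
displacy.serve(sentence_spans, style="dep")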




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [31]:
doc = nlp(txt)
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [32]:
kg = df_kg.copy(deep=True).drop(["tokens","all_pos"],axis=1)

In [33]:
kg[:5]

Unnamed: 0,author,content
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...
1,Nicole Wetsman,disease control officials in wuhan the chinese...
2,Nicole Wetsman,scientists think the new virus spreading rapid...
3,Sam Byford,huawei has announced the postponement of a maj...
4,Nicole Wetsman,the world health organization (who) said today...


In [34]:
kg.content[0]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [45]:
# Remove helper columns from kg if an earlier run added them; missing columns are skipped.
kg.drop(columns=['imp_deps_count', 'entities'], errors='ignore', inplace=True)

<a id='nvn'></a>
## Getting the noun - verb - noun edges and relations:

In [36]:
doc = nlp(kg.content[0])
# Print every token with its document index and coarse POS tag.
for sent in doc.sents:
    for tok in sent:
        print(tok.i, tok.text, tok.pos_)

0 a DET
1 case NOUN
2 of ADP
3 the DET
4 new ADJ
5 virus NOUN
6 spreading VERB
7 rapidly ADV
8 in ADP
9 china PROPN
10 has AUX
11 been AUX
12 reported VERB
13 in ADP
14 a DET
15 patient NOUN
16 in ADP
17 seattle PROPN
18 washington PROPN
19 reuters PROPN
20 reports VERB
21 . PUNCT
22 the DET
23 patient NOUN
24 had AUX
25 recently ADV
26 returned VERB
27 from ADP
28 china PROPN
29 and CCONJ
30 is AUX
31 clinically ADV
32 healthy ADJ
33 but CCONJ
34 still ADV
35 being AUX
36 monitored VERB
37 . PUNCT
38 this DET
39 is AUX
40 the DET
41 first ADJ
42 us PROPN
43 case NOUN
44 of ADP
45 the DET
46 virus NOUN
47 which DET
48 was AUX
49 first ADV
50 detected VERB
51 in ADP
52 wuhan PROPN
53 a DET
54 city NOUN
55 in ADP
56 central ADJ
57 china PROPN
58 in ADP
59 late ADJ
60 december PROPN
61 2019 NUM
62 . PUNCT
63 it PRON
64 has AUX
65 already ADV
66 sickened VERB
67 around ADV
68 300 NUM
69 people NOUN
70 and CCONJ
71 killed VERB
72 six NUM
73 . PUNCT
74 despite SCONJ
75 the DET
76 case NOUN
7

In [37]:
l = ['.', 'monitored', 'being', 'still', 'but', 'healthy', 'clinically', 'is', 'and', 'china', 'from', 'returned',\
     'recently', 'had', 'patient']
l[26:22:-1]

[]
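
The empty result above follows from how Python slices: the list has only 15 items, so indices 26 down to 23 fall outside it. The backward scan in the next section uses the same `[verb_index:start_index:-1]` pattern, where the first index is included and the second is excluded. A small sketch on a toy token list (the tokens are illustrative):

```python
tokens = ['the', 'patient', 'had', 'recently', 'returned']

# Walk backwards from index 4 ("returned") towards index 0.
# The slice includes index 4 but stops before index 0.
backwards = tokens[4:0:-1]

# Both bounds lie past the end of the list and are clamped, so this slice is empty.
out_of_range = tokens[26:22:-1]

print(backwards)
print(out_of_range)
```

Because the stop index is excluded, a noun sitting in the very first position of a sentence would not be seen by such a scan.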

<a id='algo'></a>
## Noun Verb Noun algorithm:

In [38]:
# Triples are accumulated in three parallel lists:
# the noun before the verb, the verb itself, and the noun after it.
ent1_0 = []
ent2_0 = []
verbs_0 = []

NOUN_TAGS = ['NOUN', 'PROPN']

def getBeforeVerbNoun(start, verb_i, article):
    """Return the noun closest to the verb on its left, within the sentence (None if absent)."""
    print('Starting position:', start)
    print('Verb position:', verb_i)

    tokens = list(article)
    print('The sentence:', tokens[start:verb_i + 1])

    # Scan backwards from the verb; the slice stops before `start`,
    # so the first token of the sentence is not examined.
    for i in tokens[verb_i:start:-1]:
        if i.pos_ in NOUN_TAGS:
            print('Before - POS_\'NOUN\':', i)
            return i

def getAfterVerbNoun(verb_i, article):
    """Return the first noun after the verb (None if absent).

    The scan runs to the end of the article, so it can cross sentence boundaries.
    """
    for i in list(article)[verb_i:]:
        if i.pos_ in NOUN_TAGS:
            print('After - POS_\'NOUN\':', i)
            return i

for each in kg.content:
    # Parse each article once and reuse the Doc; re-parsing it for every verb is very slow.
    doc = nlp(each)
    for sent in doc.sents:
        start = sent.start
        for tok in sent:
            if tok.pos_ == 'VERB':
                verbs_0.append(tok)

                ent_1 = getBeforeVerbNoun(start, tok.i, doc)
                print('POS_ \'VERB\'', tok, tok.i)
                print('------------------------------------')
                ent1_0.append(ent_1)

                ent_2 = getAfterVerbNoun(tok.i, doc)
                ent2_0.append(ent_2)

Starting position: 0
Verb position: 6
The sentence: [a, case, of, the, new, virus, spreading]
Before - POS_'NOUN': virus
POS_ 'VERB' spreading 6
------------------------------------
After - POS_'NOUN': china
Starting position: 0
Verb position: 12
The sentence: [a, case, of, the, new, virus, spreading, rapidly, in, china, has, been, reported]
Before - POS_'NOUN': china
POS_ 'VERB' reported 12
------------------------------------
After - POS_'NOUN': patient
Starting position: 0
Verb position: 20
The sentence: [a, case, of, the, new, virus, spreading, rapidly, in, china, has, been, reported, in, a, patient, in, seattle, washington, reuters, reports]
Before - POS_'NOUN': reuters
POS_ 'VERB' reports 20
------------------------------------
After - POS_'NOUN': patient
Starting position: 22
Verb position: 26
The sentence: [the, patient, had, recently, returned]
Before - POS_'NOUN': patient
POS_ 'VERB' returned 26
------------------------------------
After - POS_'NOUN': china
Starting position:

Verb position: 306
The sentence: [any, flights, from, or, connecting]
Before - POS_'NOUN': flights
POS_ 'VERB' connecting 306
------------------------------------
After - POS_'NOUN': wuhan
Starting position: 302
Verb position: 309
The sentence: [any, flights, from, or, connecting, through, wuhan, will]
Before - POS_'NOUN': wuhan
POS_ 'VERB' will 309
------------------------------------
After - POS_'NOUN': airports
Starting position: 302
Verb position: 311
The sentence: [any, flights, from, or, connecting, through, wuhan, will, also, begin]
Before - POS_'NOUN': wuhan
POS_ 'VERB' begin 311
------------------------------------
After - POS_'NOUN': airports
Starting position: 302
Verb position: 314
The sentence: [any, flights, from, or, connecting, through, wuhan, will, also, begin, to, be, funneled]
Before - POS_'NOUN': wuhan
POS_ 'VERB' funneled 314
------------------------------------
After - POS_'NOUN': airports


KeyboardInterrupt: 
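
The same nearest-noun idea can be tried without spaCy on tokens that are already tagged. The sketch below takes the (text, POS) pairs of the second sentence from the token listing earlier in this section. `extract_triples` is a hypothetical helper, and unlike the cell above it stops the forward search at the end of the sentence, which is an assumption about the desired behaviour rather than what the notebook does:

```python
NOUN_TAGS = {'NOUN', 'PROPN'}

def extract_triples(tagged_sentence):
    """Return (noun_before, verb, noun_after) for each VERB in one sentence."""
    triples = []
    for i, (text, pos) in enumerate(tagged_sentence):
        if pos != 'VERB':
            continue
        # Nearest noun to the left of the verb, or None.
        before = next((t for t, p in reversed(tagged_sentence[:i]) if p in NOUN_TAGS), None)
        # Nearest noun to the right of the verb, or None.
        after = next((t for t, p in tagged_sentence[i + 1:] if p in NOUN_TAGS), None)
        triples.append((before, text, after))
    return triples

# Tokens 22-37 of article 0, as tagged by spaCy in the listing above.
sentence = [
    ('the', 'DET'), ('patient', 'NOUN'), ('had', 'AUX'), ('recently', 'ADV'),
    ('returned', 'VERB'), ('from', 'ADP'), ('china', 'PROPN'), ('and', 'CCONJ'),
    ('is', 'AUX'), ('clinically', 'ADV'), ('healthy', 'ADJ'), ('but', 'CCONJ'),
    ('still', 'ADV'), ('being', 'AUX'), ('monitored', 'VERB'), ('.', 'PUNCT'),
]

triples = extract_triples(sentence)
print(triples)
```

Triples with `None` on either side have no second node and would need to be dropped before building edges.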

In [39]:
len(ent1_0)

37

In [40]:
len(ent2_0)

37

In [41]:
len(verbs_0)

38

In [43]:
kg.content[0][:1000]

'a case of the new virus spreading rapidly in china has been reported in a patient in seattle washington reuters reports. the patient had recently returned from china and is clinically healthy but still being monitored. this is the first us case of the virus which was first detected in wuhan a city in central china in late december 2019. it has already sickened around 300 people and killed six. despite the case report the centers for disease control and prevention (cdc) said during a press briefing that they believe the threat to the us remains low. the virus is currently known as 2019-ncov. the designation indicates that it is a coronavirus the family of viruses that also caused the sars outbreak in 2003. that outbreak killed nearly 800 people. its bringing back sars flashbacks for me says timothy sheahan a coronavirus expert and assistant professor at the university of north carolina gillings school of global public health. the us patient flew into seattle-tacoma international airpor

In [46]:
df_article0 = pd.DataFrame({'ent1_article0':ent1_0[0:37],'relations_article0':verbs_0[0:37],'ent2_article0':ent2_0[0:37]})
df_article0.head()

Unnamed: 0,ent1_article0,relations_article0,ent2_article0
0,virus,spreading,china
1,china,reported,patient
2,reuters,reports,patient
3,patient,returned,china
4,china,monitored,us


In [47]:
len(df_article0)

37

#df_article0.to_csv('article0_triples_simple.csv')

In [48]:
kg.head()

Unnamed: 0,author,content
0,Nicole Wetsman,a case of the new virus spreading rapidly in c...
1,Nicole Wetsman,disease control officials in wuhan the chinese...
2,Nicole Wetsman,scientists think the new virus spreading rapid...
3,Sam Byford,huawei has announced the postponement of a maj...
4,Nicole Wetsman,the world health organization (who) said today...


## Pickle the kg dataframe for next step:

In [None]:
import pickle

filename = 'data/Noun-Verb-Noun.pkl'

# The context manager flushes and closes the file even if pickling fails.
with open(filename, 'wb') as out:
    pickle.dump(kg, out)

## Conclusion: <br> For Noun-Verb-Noun, this dataset is ready to be fed into Neo4j.
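
As a rough idea of what loading into Neo4j could look like, the sketch below turns (entity, relation, entity) triples into Cypher `MERGE` statements as plain strings. The `Entity` label, the `name` property and the helper `triple_to_cypher` are assumptions made for illustration; the two triples are taken from the first rows of `df_article0`:

```python
def triple_to_cypher(ent1, relation, ent2, label='Entity'):
    """Build one Cypher MERGE statement for an (ent1, relation, ent2) triple."""
    # Relationship types are kept to letters, digits and underscores, in upper case.
    rel_type = ''.join(ch if ch.isalnum() else '_' for ch in relation).upper()

    def quote(value):
        # Escape backslashes and single quotes inside the string literal.
        return value.replace('\\', '\\\\').replace("'", "\\'")

    return (
        f"MERGE (a:{label} {{name: '{quote(ent1)}'}}) "
        f"MERGE (b:{label} {{name: '{quote(ent2)}'}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )

triples = [('virus', 'spreading', 'china'), ('patient', 'returned', 'china')]
statements = [triple_to_cypher(*t) for t in triples]
for statement in statements:
    print(statement)
```

`MERGE` avoids creating duplicate nodes when the same entity appears in several triples. When running against a real database, passing the names as query parameters would be safer than string interpolation.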