## Using spacy for NER in NLP

The plan is to write scripts that use spacy to ID company names in analyst call transcripts, quickly and at scale.

Here, I'll assume you know how to install spacy. __[See here for a start](https://spacy.io/usage/)__. 

**Note**: __[for windows, this requires installing Visual Studio build tools 2015 as well](https://visualstudio.microsoft.com/visual-cpp-build-tools/)__ .

In [1]:
# setup chunk
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

At some point, the question will arise: "So what all can spacy do?" See below.

![spacy%20functionality%20tbl.png](attachment:spacy%20functionality%20tbl.png)

We'll test-drive a little bit of this functionality in this notebook.

In [2]:
# import sample data
path1 = 'C:/Users/31172/TABA/Session 4/Session 4 Materials/'
filename1 = 'samsung s7 reviews amazon.com.txt' # 'Game of Thrones IMDB reviews.txt'  # 'ISB PGP.txt'
with open(path1 + filename1) as f:  # closes the file once the lines are read
    corpus0 = f.readlines()  # corpus of stacked raw text documents, one per line
len(corpus0)  # 200 review docs

200

In [3]:
# spacy for regular or basic NLP ops. define simple func below
def spacy_ops(corpus):
    corpus_annotated = []
    corpus_lemmatized = []
    corpus_pos = []
    for text in corpus:
        doc1 = nlp(text)  # spacy magic happens here. annotations built.
        out1 = [(token.text, token.lemma_, token.pos_, token.tag_) for token in doc1]
        corpus_lemmatized.append([token.lemma_ for token in doc1])
        corpus_pos.append([token.pos_ for token in doc1])
        corpus_annotated.append(out1)
    return corpus_lemmatized, corpus_pos, corpus_annotated

corpus_lemmatized, corpus_pos, corpus_annotated = spacy_ops(corpus0)  # since 3 objs are returned by func
# corpus_lemmatized[0][:20]  #view first 20 items in the first list elem 
corpus_annotated[0][:20]  #view first 20 items in the first list elem 

[('I', '-PRON-', 'PRON', 'PRP'),
 ('bought', 'buy', 'VERB', 'VBD'),
 ('the', 'the', 'DET', 'DT'),
 ('Samsung', 'Samsung', 'PROPN', 'NNP'),
 ('Galaxy', 'Galaxy', 'PROPN', 'NNP'),
 ('S7', 'S7', 'PROPN', 'NNP'),
 ('Edge', 'Edge', 'PROPN', 'NNP'),
 ('G935F', 'G935F', 'PROPN', 'NNP'),
 ('in', 'in', 'ADP', 'IN'),
 ('Black', 'Black', 'PROPN', 'NNP'),
 ('from', 'from', 'ADP', 'IN'),
 ('this', 'this', 'DET', 'DT'),
 ('product', 'product', 'NOUN', 'NN'),
 ('page', 'page', 'NOUN', 'NN'),
 ('a', 'a', 'DET', 'DT'),
 ('week', 'week', 'NOUN', 'NN'),
 ('ago', 'ago', 'ADV', 'RB'),
 ('.', '.', 'PUNCT', '.'),
 ('A', 'a', 'DET', 'DT'),
 ('lot', 'lot', 'NOUN', 'NN')]

In [4]:
# show above as pandas dataframe
import pandas as pd
labels1 = ['word', 'lemma', 'POS', 'xpos_tag']
# one dataframe per doc (first 9 docs), stacked with pd.concat
ex_df = pd.concat([pd.DataFrame.from_records(doc_rows, columns=labels1) for doc_rows in corpus_annotated[:9]])

ex_df.iloc[:8,:]   # view a few

In [5]:
len(corpus_annotated[:9])

9

#### NER with Spacy's nlp(text).ents
Spacy's NER recognizes the following entity types:
<img src="https://cdn-images-1.medium.com/max/1000/1*qQggIPMugLcy-ndJ8X_aAA.png" alt="Named entity types in Spacy" title="Named Entities in Spacy" />

The way to obtain entities? Just run *nlp(raw_text)* on any raw text to get a spacy object, say *spacy_obj*, whose *.ents* attribute holds all the entities spacy has managed to detect. Further, *spacy_obj.ents* can be output as a list of (text, label) pairs. See below.

In [6]:
# trying entity detection in one sample sentence first
from pprint import pprint  # for pretty-printing
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]
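
Since `Counter` was imported in the setup chunk, a quick tally of entity labels is a natural next step. Below is a minimal sketch, with the (text, label) pairs copied by hand from the output above so that it runs without spacy. The names `ents`, `label_counts` and `org_names` are made up for this illustration.

```python
from collections import Counter

# (text, label) pairs copied by hand from the output above; with spacy loaded,
# [(X.text, X.label_) for X in doc.ents] would supply them instead.
ents = [
    ("European", "NORP"),
    ("Google", "ORG"),
    ("$5.1 billion", "MONEY"),
    ("Wednesday", "DATE"),
]

# Tally how often each entity label occurs.
label_counts = Counter(label for _, label in ents)

# Keep only the organisation names, which is what we are after for company IDs.
org_names = [text for text, label in ents if label == "ORG"]

print(label_counts.most_common())
print(org_names)
```

On a whole corpus, the same tally over every doc gives a rough picture of which entity types dominate before filtering down to ORG.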


#### Spacy's NER with IOB chunking

Recall the IOB scheme (Inside-Outside-Beginning) we'd used to ID text chunks of interest? Well, an expanded version of that scheme, called BILOU, can be seen below. In spacy, each token's *ent_iob_* attribute gives its IOB code and *ent_type_* gives its entity type, and both supplement the NER info.

<img src = "https://cdn-images-1.medium.com/max/800/1*_sYTlDj2p_p-pcSRK25h-Q.png" alt="BILOU tagging scheme" title="BILOU for Named Entities in Spacy" />

In [7]:
# using X.ent_iob_ and X.ent_type_
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(European, 'B', 'NORP'),
 (authorities, 'O', ''),
 (fined, 'O', ''),
 (Google, 'B', 'ORG'),
 (a, 'O', ''),
 (record, 'O', ''),
 ($, 'B', 'MONEY'),
 (5.1, 'I', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (on, 'O', ''),
 (Wednesday, 'B', 'DATE'),
 (for, 'O', ''),
 (abusing, 'O', ''),
 (its, 'O', ''),
 (power, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (mobile, 'O', ''),
 (phone, 'O', ''),
 (market, 'O', ''),
 (and, 'O', ''),
 (ordered, 'O', ''),
 (the, 'O', ''),
 (company, 'O', ''),
 (to, 'O', ''),
 (alter, 'O', ''),
 (its, 'O', ''),
 (practices, 'O', '')]
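
The codes in the output above are plain IOB, which is what `ent_iob_` reports. As a rough sketch of how they relate to the BILOU scheme in the figure, the snippet below converts an IOB sequence to BILOU in plain Python. The tag list is typed in by hand from the first eleven tokens of the output above, and `iob_to_bilou` is a helper made up for this illustration, not a spacy function.

```python
def iob_to_bilou(tags):
    """Convert a list of (iob, ent_type) pairs into a list of BILOU codes."""
    bilou = []
    for i, (iob, ent_type) in enumerate(tags):
        if iob not in ("B", "I"):
            bilou.append("O")  # outside any entity
            continue
        # The entity continues if the next token is tagged 'I' with the same type.
        nxt = tags[i + 1] if i + 1 < len(tags) else ("O", "")
        continues = nxt[0] == "I" and nxt[1] == ent_type
        if iob == "B":
            bilou.append("B" if continues else "U")  # U = single-token (Unit) entity
        else:
            bilou.append("I" if continues else "L")  # L = Last token of an entity
    return bilou

# European authorities fined Google a record $ 5.1 billion on Wednesday
tags = [("B", "NORP"), ("O", ""), ("O", ""), ("B", "ORG"), ("O", ""), ("O", ""),
        ("B", "MONEY"), ("I", "MONEY"), ("I", "MONEY"), ("O", ""), ("B", "DATE")]

print(iob_to_bilou(tags))
```

Single-token entities such as Google come out as U, while the three MONEY tokens become B, I, L.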


In [8]:
# cleaning it up a little bit to drop the Os above
X1 = [(X, X.ent_iob_, X.ent_type_) for X in doc if X.ent_iob_ != 'O']
pprint(X1)

[(European, 'B', 'NORP'),
 (Google, 'B', 'ORG'),
 ($, 'B', 'MONEY'),
 (5.1, 'I', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (Wednesday, 'B', 'DATE')]


In [9]:
# analyze real data - samsung s7 with spacy
import time

start_time = time.perf_counter()  # time.clock() no longer exists in Python 3.8+
# dropping 'O' entities & enumerating doc_num
# (for bigger corpora, nlp.pipe(corpus0) processes texts in batches and is usually faster)
s7_new = [[(i, X, X.ent_iob_, X.ent_type_) for X in nlp(text) if X.ent_iob_ != 'O'] for i, text in enumerate(corpus0)]
print(time.perf_counter() - start_time, "seconds")

s7_new[0][:9]  # view a few records

9.301490941999987 seconds


[(0, Black, 'B', 'GPE'),
 (0, a, 'B', 'DATE'),
 (0, week, 'I', 'DATE'),
 (0, ago, 'I', 'DATE'),
 (0, first, 'B', 'ORDINAL'),
 (0, Saudi, 'B', 'GPE'),
 (0, Arabia, 'I', 'GPE'),
 (0, SA, 'B', 'GPE'),
 (0, WhatsApp, 'B', 'ORG')]

In [10]:
# convert to panda dataframe
import pandas as pd
labels = ['doc_num', 'token', 'entity_IOB', 'entity_type']
# one dataframe per doc, stacked with pd.concat; each doc keeps its own 0-based row index
ent_df = pd.concat([pd.DataFrame.from_records(doc_rows, columns=labels) for doc_rows in s7_new])

ent_df.iloc[:8,:]

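| | doc_num | token | entity_IOB | entity_type |
|---|---|---|---|---|
| 0 | 0 | Black | B | GPE |
| 1 | 0 | a | B | DATE |
| 2 | 0 | week | I | DATE |
| 3 | 0 | ago | I | DATE |
| 4 | 0 | first | B | ORDINAL |
| 5 | 0 | Saudi | B | GPE |
| 6 | 0 | Arabia | I | GPE |
| 7 | 0 | SA | B | GPE |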


In [11]:
# filter above to retain only 'ORG' named entity type
new_ent_df = ent_df[(ent_df.entity_type == 'ORG')]
new_ent_df.iloc[:8, :]

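| | doc_num | token | entity_IOB | entity_type |
|---|---|---|---|---|
| 8 | 0 | WhatsApp | B | ORG |
| 9 | 0 | Samsung | B | ORG |
| 12 | 0 | Samsung | B | ORG |
| 13 | 0 | Exynos | B | ORG |
| 14 | 0 | 8890 | I | ORG |
| 17 | 0 | Google | B | ORG |
| 21 | 0 | Sprint | B | ORG |
| 22 | 0 | GSM | B | ORG |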
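
The ORG rows above are still one token per row, so a multi-token name like Exynos 8890 is split across two rows. Here is a minimal sketch of stitching B and I rows back into full names and counting them, in plain Python. The rows are copied by hand from the table above, and `merge_entities` and `org_counts` are names made up for this illustration.

```python
from collections import Counter

# (doc_num, token, entity_IOB, entity_type) rows copied by hand from the table above.
# In the real dataframe the token column holds spacy Token objects, so use token.text there.
rows = [
    (0, "WhatsApp", "B", "ORG"),
    (0, "Samsung", "B", "ORG"),
    (0, "Samsung", "B", "ORG"),
    (0, "Exynos", "B", "ORG"),
    (0, "8890", "I", "ORG"),
    (0, "Google", "B", "ORG"),
    (0, "Sprint", "B", "ORG"),
    (0, "GSM", "B", "ORG"),
]

def merge_entities(rows):
    """Join each 'B' token with the 'I' tokens that follow it in the same doc."""
    merged = []  # list of (doc_num, name) pairs
    for doc_num, token, iob, _ in rows:
        if iob == "I" and merged and merged[-1][0] == doc_num:
            # Continuation of the previous entity: extend its name.
            merged[-1] = (doc_num, merged[-1][1] + " " + token)
        else:
            merged.append((doc_num, token))
    return merged

# Count how often each full ORG name is mentioned.
org_counts = Counter(name for _, name in merge_entities(rows))
print(org_counts.most_common(3))
```

Joining with a space is a simplification, since it ignores the original whitespace between tokens. It also shows that not every ORG hit is a company (GSM, for instance), so some manual screening is still needed.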


## Spacy for Syntactic Dependency Parsing

Recall we'd covered *syntactic dependency parsing* from computational linguistics (say, CL) while analyzing UDpipe output. While NLP focuses on the tokens/tags as predictors in machine learning models, CL digs into the *relationships* and links among parts of speech.

Hence, CL looks into token organization and inter-related contexts within sentences using word-to-word grammar relationships, which are also known as *dependencies*. Dependency is the notion that syntactic units (words) are connected to each other by *directed links* that describe the relationship between the connected words (see the table below for some common dependency types).

<img src="https://i.ibb.co/wCX1g0Z/dep-Parsing.png" title="Dependency Parsing Table" />

Enough theory, let's get concrete with an example. Consider the old class example we'd used (from the 'NLP in Py' book):
"I shot an elephant in my pajamas." Below is what its dependency parse tree looks like:

In [12]:
doc = nlp('I shot an elephant in my pajamas')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)  # these attributes are already strings

from spacy import displacy
displacy.serve(doc, style='dep')  # type localhost:5000 (or whatever portnumber you get) in browser & refresh

I -PRON- PRON nsubj
shot shoot VERB ROOT
an an DET det
elephant elephant NOUN dobj
in in ADP prep
my -PRON- DET poss
pajamas pajama NOUN pobj

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


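Couldn't yet find a way to get *displacy.serve* to display inline. (In a Jupyter notebook, *displacy.render(doc, style='dep', jupyter=True)* should draw the tree inline instead.) Otherwise, go to a browser window and type *localhost:5000* (coz my port number was 5000) and you should see something like the image below.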

<img src="https://i.ibb.co/89PdLyJ/spacy-dep-tree.png" title="depTree" />

Note the *directed* links (arrows) spanning out from the ROOT (verb) to the other syntactic units, each labelled with a dependency type such as nsubj, dobj or prep.

In [13]:
# Now trying the 'correct' version of the same sentence below:
doc1 = nlp('I in my pajamas, shot an elephant'); doc1
displacy.serve(doc1, style='dep')

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


Here's the new dependency parse tree. Notice the sorta subtle but powerful difference?

<img src="https://i.ibb.co/9TdQGyy/spacy-dep-tree1.png" title="depTree for the reordered sentence" />
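
One way to make the difference concrete is to write the two possible attachments of *in my pajamas* as directed (head, label, child) links. The sketch below is hand-coded for illustration and is not spacy output; which attachment spacy picks for a given sentence depends on the model. The arc lists and the `subtree` helper are made up for this illustration.

```python
# Each arc is (head, label, child). Hand-written, not produced by spacy.
# Reading 1: the prepositional phrase hangs off the verb (I was in my pajamas).
arcs_verb = [
    ("shot", "nsubj", "I"),
    ("shot", "dobj", "elephant"),
    ("elephant", "det", "an"),
    ("shot", "prep", "in"),
    ("in", "pobj", "pajamas"),
    ("pajamas", "poss", "my"),
]

# Reading 2: the prepositional phrase hangs off the noun (the elephant was in my pajamas).
arcs_noun = [
    ("elephant", "prep", "in") if arc == ("shot", "prep", "in") else arc
    for arc in arcs_verb
]

def subtree(arcs, head):
    """Return every word reachable from `head` by following the directed links."""
    words = []
    for h, _, child in arcs:
        if h == head:
            words.append(child)
            words.extend(subtree(arcs, child))
    return words

print(sorted(subtree(arcs_verb, "elephant")))
print(sorted(subtree(arcs_noun, "elephant")))
```

Only one link changes between the two readings, yet the set of words hanging under *elephant* is quite different. With a live spacy doc, the same triples can be read off each token via `token.head.text`, `token.dep_` and `token.text`.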

Well, dassit from me for now. Ciao.

Sudhir