### POS Tagging

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's bag")

In [4]:
doc[4].pos_, doc[4].tag_

('VERB', 'VBD')

In [5]:
print(f"Text\t   POS\tFine Grained POS  Explanation")
for token in doc:
    print(f"{token.text:{10}} {token.pos_:{5}} {token.tag_:>{15}} { spacy.explain(token.tag_)}")

Text	   POS	Fine Grained POS  Explanation
The        DET                DT determiner
quick      ADJ                JJ adjective (English), other noun-modifier (Chinese)
brown      ADJ                JJ adjective (English), other noun-modifier (Chinese)
fox        NOUN               NN noun, singular or mass
jumped     VERB              VBD verb, past tense
over       ADP                IN conjunction, subordinating or preposition
the        DET                DT determiner
lazy       ADJ                JJ adjective (English), other noun-modifier (Chinese)
dog        NOUN               NN noun, singular or mass
's         PART              POS possessive ending
bag        NOUN               NN noun, singular or mass
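
The nested braces in the format specs above (`{token.text:{10}}`, `{token.tag_:>{15}}`) set a minimum column width, not a maximum, so a longer token pushes the later columns to the right. A minimal sketch of that behaviour follows. The `rows` values are typed in by hand (two from the table above plus one longer word), and `format_row` is a hypothetical helper, so no spaCy model is needed to run it.

```python
# Hand-copied (text, coarse POS, fine-grained tag) triples; not produced by a model here.
rows = [
    ("fox", "NOUN", "NN"),
    ("jumped", "VERB", "VBD"),
    ("shareholder", "NOUN", "NN"),  # 11 characters, wider than the 10-character column
]

def format_row(text, pos, tag):
    # Hypothetical helper mirroring the f-string used in the loop above.
    # {text:{10}} left-aligns the string in at least 10 characters,
    # {tag:>{15}} right-aligns it in at least 15 characters.
    return f"{text:{10}} {pos:{5}} {tag:>{15}}"

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

Truncating the text first (for example `text[:10]`) would keep the columns aligned, at the cost of cutting long tokens.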


In [6]:
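# spacy.explain expects a label string such as 'NN'; doc.vocab[10] is a Lexeme,
# so this call returns nothing useful (the next cell passes the tag string instead)
spacy.explain(doc.vocab[10])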

In [7]:
doc[10].tag_, spacy.explain(doc[10].tag_)

('NN', 'noun, singular or mass')

In [8]:
doc = nlp(u'I read books on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')

read       VERB     VBP    verb, non-3rd person singular present


In [9]:
doc = nlp(u'I read a book on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')

read       VERB     VBP    verb, non-3rd person singular present


In [10]:
doc =  nlp(u"The quick brown fox jumped over the lazy dog's back.")
POS_counts  = doc.count_by(spacy.attrs.POS)
POS_counts

{90: 2, 84: 3, 92: 3, 100: 1, 85: 1, 94: 1, 97: 1}

In [11]:
# Count the coarse-grained POS tags in the document, sorted by attribute ID
for k,v in sorted(POS_counts.items()):
    print(k,v, doc.vocab[k].text)

84 3 ADJ
85 1 ADP
90 2 DET
92 3 NOUN
94 1 PART
97 1 PUNCT
100 1 VERB
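
`count_by` returns integer attribute IDs that have to be mapped back to strings through `doc.vocab`. As a rough alternative, the same tally can be built over the string labels with `collections.Counter`. In this sketch the `pos_tags` list is written out by hand so that it agrees with the counts printed above and runs without a model. With a real document the input would be something like `token.pos_ for token in doc`.

```python
from collections import Counter

# Assumed coarse-grained tags for the 12 tokens of the sentence, written out by hand
# so that they agree with the counts printed above.
pos_tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP",
            "DET", "ADJ", "NOUN", "PART", "NOUN", "PUNCT"]

pos_counts = Counter(pos_tags)

# Print the tally sorted alphabetically by label
for label, count in sorted(pos_counts.items()):
    print(f"{label:{6}} {count}")
```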


In [12]:
# Count the fine-grained POS tags in the document
TAG_counts = doc.count_by(spacy.attrs.TAG)
for k,v in sorted(TAG_counts.items()):
    print(f"{k:{20}} {v:{2}} {doc.vocab[k].text:{5}} {spacy.explain(doc.vocab[k].text)}")

                  74  1 POS   possessive ending
 1292078113972184607  1 IN    conjunction, subordinating or preposition
10554686591937588953  3 JJ    adjective (English), other noun-modifier (Chinese)
12646065887601541794  1 .     punctuation mark, sentence closer
15267657372422890137  2 DT    determiner
15308085513773655218  3 NN    noun, singular or mass
17109001835818727656  1 VBD   verb, past tense


### Visualize POS

In [13]:
from spacy import displacy

In [14]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's bag")

In [15]:
displacy.render(doc, style="dep", jupyter = True)

In [16]:
options = {'distance': 110, 'compact' : True , 'color' : 'black'}

In [17]:
displacy.render(doc, style="dep", jupyter = True,options= options)

In [20]:
doc2 = nlp(u"This is a sentence. This is another, possibly longer sentence.")

In [21]:
spans = list(doc2.sents)

In [22]:
displacy.render(spans, jupyter = True)

In [37]:
docs3 = """
Tesla CEO Elon Musk has acquired a 9% stake in Twitter to become its largest shareholder while joining other critics in questioning the social media platform’s dedication to free speech and the First Amendment.

Musk’s ultimate aim in acquiring 73.5 million shares, worth about US$3bil (RM12.65bil), isn’t clear. Yet in late March Musk, who has 80 million Twitter followers and is active on the site, questioned free speech on Twitter and whether the platform is undermining democracy.

In years past, Twitter and other social platforms have taken fire for allowing harmful speech ranging from incitement to violence to coordinated harassment and racial abuse. More recently, these platforms have made concerted efforts to rein in such behaviour, often drawing criticism similar to Musk’s from the political right. Both Twitter and Facebook faced blowback after suspending the accounts run by former US President Donald Trump following the Jan 6 Capitol insurrection last year.

It’s unclear just when Musk bought the stake. A US Securities and Exchange Commission filing made public on Monday says the event triggering the filing happened March 14. Musk has also raised the possibility with his massive and loyal Twitter following, that he could create a rival social media network.

Industry analysts and legal experts say Musk could begin advocating for changes at Twitter immediately if he chooses. In a note to investors, CFRA Analyst Angelo Zino wrote that Twitter could be viewed as an acquisition target because the value of its shares have been falling since early last year.

Twitter co-founder Jack Dorsey stepped down as CEO in November. Musk’s stake in Twitter is now more than four times the size of Dorsey’s, who had been the largest individual shareholder.


"""

In [38]:
print(docs3)


Tesla CEO Elon Musk has acquired a 9% stake in Twitter to become its largest shareholder while joining other critics in questioning the social media platform’s dedication to free speech and the First Amendment.

Musk’s ultimate aim in acquiring 73.5 million shares, worth about US$3bil (RM12.65bil), isn’t clear. Yet in late March Musk, who has 80 million Twitter followers and is active on the site, questioned free speech on Twitter and whether the platform is undermining democracy.

In years past, Twitter and other social platforms have taken fire for allowing harmful speech ranging from incitement to violence to coordinated harassment and racial abuse. More recently, these platforms have made concerted efforts to rein in such behaviour, often drawing criticism similar to Musk’s from the political right. Both Twitter and Facebook faced blowback after suspending the accounts run by former US President Donald Trump following the Jan 6 Capitol insurrection last year.

It’s unclear just wh

In [39]:
docs3 = nlp(docs3)

In [40]:
print(f"Text\t   POS\tFine Grained POS  Explanation")
for token in docs3:
    print(f"{token.text:{10}} {token.pos_:{5}} {token.tag_:>{15}} { spacy.explain(token.tag_)}")

Text	   POS	Fine Grained POS  Explanation

          SPACE             _SP whitespace
Tesla      PROPN             NNP noun, proper singular
CEO        PROPN             NNP noun, proper singular
Elon       PROPN             NNP noun, proper singular
Musk       PROPN             NNP noun, proper singular
has        AUX               VBZ verb, 3rd person singular present
acquired   VERB              VBN verb, past participle
a          DET                DT determiner
9          NUM                CD cardinal number
%          NOUN               NN noun, singular or mass
stake      NOUN               NN noun, singular or mass
in         ADP                IN conjunction, subordinating or preposition
Twitter    PROPN             NNP noun, proper singular
to         PART               TO infinitival "to"
become     VERB               VB verb, base form
its        PRON             PRP$ pronoun, possessive
largest    ADJ               JJS adjective, superlative
shareholder NOUN            

In [41]:
print("Verbs in docs3: ")
for token in docs3[:50]:
    if token.pos_ == 'VERB':
        print(f"{token.text:{10}} {token.pos_:{5}} {token.tag_:>{15}} { spacy.explain(token.tag_)}")

Verbs in docs3: 
acquired   VERB              VBN verb, past participle
become     VERB               VB verb, base form
joining    VERB              VBG verb, gerund or present participle
questioning VERB              VBG verb, gerund or present participle
acquiring  VERB              VBG verb, gerund or present participle


In [44]:
print("Proper nouns in docs3: ")
for token in docs3:
    if token.pos_ == 'PROPN':
        print(f"{token.text:{10}} {token.pos_:{5}} {token.tag_:>{15}} { spacy.explain(token.tag_)}")

Proper nouns in docs3: 
Tesla      PROPN             NNP noun, proper singular
CEO        PROPN             NNP noun, proper singular
Elon       PROPN             NNP noun, proper singular
Musk       PROPN             NNP noun, proper singular
Twitter    PROPN             NNP noun, proper singular
First      PROPN             NNP noun, proper singular
Amendment  PROPN             NNP noun, proper singular
Musk       PROPN             NNP noun, proper singular
3bil       PROPN             NNP noun, proper singular
RM12.65bil PROPN             NNP noun, proper singular
March      PROPN             NNP noun, proper singular
Musk       PROPN             NNP noun, proper singular
Twitter    PROPN             NNP noun, proper singular
Twitter    PROPN             NNP noun, proper singular
Twitter    PROPN             NNP noun, proper singular
Musk       PROPN             NNP noun, proper singular
Twitter    PROPN             NNP noun, proper singular
Facebook   PROPN             NNP noun, proper si

### NER - Part One
- Show Entity
- Add new Entity to list

In [45]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + " - " + ent.label_ + " " + str(spacy.explain(ent.label_)))
    
    else:
        print("No entities found!")

In [46]:
doc = nlp(u"Hi! How are you")
show_ents(doc)

No entities found!


In [47]:
doc2 = nlp(u"May I go to Washington, DC next May to see the Washington Monument?")
show_ents(doc2)

Washington - GPE Countries, cities, states
DC - GPE Countries, cities, states
next May - DATE Absolute or relative dates or periods
the Washington Monument - ORG Companies, agencies, institutions, etc.


In [113]:
doc3 = nlp(u"Can I have 500 dollars of Microsoft stock?")
show_ents(doc3)

500 dollars - MONEY Monetary values, including unit
Microsoft - ORG Companies, agencies, institutions, etc.


In [114]:
doc4 = nlp(u"Tesla to build a U.K factory for 6$ million.")
show_ents(doc4)

U.K - ORG Companies, agencies, institutions, etc.
6$ million - MONEY Monetary values, including unit


In [115]:
from spacy.tokens import Span

In [116]:
ORG = doc.vocab.strings[u"ORG"]
ORG

381

In [117]:
new_ent = Span(doc4,0,1, label = ORG)
new_ent

Tesla

In [118]:
doc4.ents = list(doc4.ents) + [new_ent]

In [119]:
show_ents(doc4)

Tesla - ORG Companies, agencies, institutions, etc.
U.K - ORG Companies, agencies, institutions, etc.
6$ million - MONEY Monetary values, including unit


In [48]:
show_ents(docs3)

Elon Musk - PERSON People, including fictional
9% - PERCENT Percentage, including "%"
Twitter - PRODUCT Objects, vehicles, foods, etc. (not services)
the First Amendment - LAW Named documents made into laws.
73.5 million - CARDINAL Numerals that do not fall under another type
about US$3bil - MONEY Monetary values, including unit
RM12.65bil - ORG Companies, agencies, institutions, etc.
late March Musk - DATE Absolute or relative dates or periods
80 million - CARDINAL Numerals that do not fall under another type
Twitter - PRODUCT Objects, vehicles, foods, etc. (not services)
Twitter - PRODUCT Objects, vehicles, foods, etc. (not services)
years past - DATE Absolute or relative dates or periods
Twitter - PRODUCT Objects, vehicles, foods, etc. (not services)
Musk - PERSON People, including fictional
Twitter - PERSON People, including fictional
US - GPE Countries, cities, states
Donald Trump - PERSON People, including fictional
Jan 6 - DATE Absolute or relative dates or periods
Capitol - FAC

### NER - Part Two

In [131]:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
          u'If successful, the vacuum-cleaner will be our first product.')
show_ents(doc)

first - ORDINAL "first", "second", etc.


In [133]:
from spacy.matcher import PhraseMatcher

In [134]:
phrases_list  = ['vacuum cleaner', 'vacuum-cleaner']
pattern_list = [nlp(phrase) for phrase in phrases_list]
pattern_list

[vacuum cleaner, vacuum-cleaner]

In [141]:
matcher = PhraseMatcher(nlp.vocab)

In [142]:
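# spaCy v2 signature; in spaCy v3 the patterns are passed as a list:
# matcher.add("newproduct", pattern_list)
matcher.add("newproduct", None , *pattern_list)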

In [152]:
pattern_match = matcher(doc)
pattern_match

[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]

In [153]:
PROD = doc.vocab.strings[u'PRODUCT']
PROD

384

In [155]:
new_ents = [Span(doc, match[1], match[2], label = PROD) for match in pattern_match]
new_ents

[vacuum cleaner, vacuum cleaner]

In [158]:
doc.ents = list(doc.ents) + new_ents

In [159]:
show_ents(doc)

vacuum cleaner - PRODUCT Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT Objects, vehicles, foods, etc. (not services)
first - ORDINAL "first", "second", etc.
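
Assigning to `doc.ents` raises an error when two spans overlap, so it can help to drop conflicting matches before building the `Span` objects. The following is a minimal sketch of such a filter over plain `(start, end)` token offsets. `overlaps` and `filter_matches` are hypothetical helper names, and the `existing` offsets are an assumed example rather than values read from the document above.

```python
def overlaps(a, b):
    # Half-open token ranges [start, end) overlap when each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def filter_matches(matches, existing):
    # Keep only the (match_id, start, end) tuples that do not overlap any existing range.
    kept = []
    for match_id, start, end in matches:
        if not any(overlaps((start, end), span) for span in existing):
            kept.append((match_id, start, end))
    return kept

# Match tuples in the shape returned by the matcher above
matches = [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
# Assumed offsets of entities already present in the document (illustrative only)
existing = [(8, 10), (20, 21)]

safe_matches = filter_matches(matches, existing)
print(safe_matches)
```

Only the tuples in `safe_matches` would then be turned into `Span` objects and appended to `doc.ents`.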


### Visualize NER

In [160]:
from spacy import displacy

In [170]:
doc = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $20 million. "
         u"By contrast, Sony only sold 8 thousand Walkman music players"
         )

In [171]:
displacy.render(doc, jupyter = True, style = 'ent')

In [172]:
for sent in doc.sents:
    displacy.render(nlp(sent.text), style = 'ent', jupyter= True)

In [183]:
# colors = {'ORG' : '#aa9cfc'}
colors = {'ORG' : 'linear-gradient(45deg, #aa9cfc, red)'}
options = {'ents' : ['PRODUCT', 'ORG'], 'colors' : colors}
displacy.render(doc, jupyter = True, style = 'ent', options = options)

In [184]:
displacy.serve(doc, style = 'ent', options = options)


Serving on port 5000...
Using the 'ent' visualizer

127.0.0.1 - - [15/May/2021 17:22:49] "GET / HTTP/1.1" 200 2155
127.0.0.1 - - [15/May/2021 17:22:50] "GET /favicon.ico HTTP/1.1" 200 2155

Shutting down server on port 5000.


### Sentence Segmentation

In [185]:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [186]:
# ADD SEGMENTATION RULE

In [187]:
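def set_custom_segmentation(doc):
    # Mark the token after every semicolon as the start of a new sentence
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i+1].is_sent_start = True
    return doc

# spaCy v2 API: add_pipe takes the function itself. In spaCy v3 the function has to be
# registered with @Language.component and added by its registered name.
nlp.add_pipe(set_custom_segmentation, before='parser')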
nlp.pipe_names

['tagger', 'set_custom_segmentation', 'parser', 'ner']

In [189]:
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

In [190]:
for sent in doc3.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker
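
The custom component only marks the token after each semicolon as a sentence start. The remaining boundaries, such as the one before `-Peter Drucker`, presumably come from the parser that runs afterwards. The effect of the semicolon rule on its own can be sketched on a plain list of strings. The token list below is a hand-written approximation of the tokenizer output, and `split_after_semicolon` is a hypothetical name.

```python
def split_after_semicolon(tokens):
    # Start a new group at the token that follows each ";", mirroring
    # doc[token.i + 1].is_sent_start = True in the component above.
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == ";":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

# Assumed tokenization of the quote, written by hand for illustration
tokens = ['"', "Management", "is", "doing", "things", "right", ";",
          "leadership", "is", "doing", "the", "right", "things", ".", '"',
          "-", "Peter", "Drucker"]

groups = split_after_semicolon(tokens)
for group in groups:
    print(" ".join(group))
```

With this rule alone the attribution stays attached to the second group, which is why the three-way split in the output above cannot be due to the semicolon rule by itself.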


In [191]:
#CHANGE SEGMENTATION RULES

In [None]:
nlp = spacy.load("en_core_web_sm")

In [194]:
doc = nlp(u'This is the first sentence. This is another sentence. \n\nThis is the \nlast sentence.')

In [195]:
doc

This is the first sentence. This is another sentence. 

This is the 
last sentence.

In [196]:
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence. 


This is the 
last sentence.


In [197]:
# SentenceSegmenter is part of spaCy v2 only; it was removed in spaCy v3
from spacy.pipeline import SentenceSegmenter

In [198]:
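def split_on_newlines(doc):
    # Yield one span per newline-delimited chunk; each newline token
    # stays at the end of the span that precedes it.
    start = 0
    seen_newline = False

    for word in doc:
        if seen_newline:
            # The previous token was a newline: close the current span here
            yield doc[start:word.i]
            start = word.i
            seen_newline = False

        elif word.text.startswith("\n"):
            seen_newline = True

    # Emit whatever remains after the last newline
    yield doc[start:]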

In [200]:
for sent in split_on_newlines(doc):
    print(sent)

This is the first sentence. This is another sentence. 


This is the 

last sentence.
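
The blank lines in the output above appear because each newline token stays at the end of the span that precedes it. A list-based sketch of the same generator makes that visible without a pipeline. The `tokens` list is a hand-written assumption about how the text is tokenized, and `split_tokens_on_newlines` is a hypothetical name.

```python
def split_tokens_on_newlines(tokens):
    # Same control flow as split_on_newlines above, but over a list of strings.
    start = 0
    seen_newline = False
    for i, tok in enumerate(tokens):
        if seen_newline:
            # The previous token was a newline: close the current span here
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif tok.startswith("\n"):
            seen_newline = True
    yield tokens[start:]

# Assumed tokenization, with whitespace tokens for the two line breaks
tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", ".", "\n\n",
          "This", "is", "the", "\n",
          "last", "sentence", "."]

spans = list(split_tokens_on_newlines(tokens))
for span in spans:
    print(repr(" ".join(span)))
```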


In [201]:
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)

In [202]:
nlp.add_pipe(sbd)

In [205]:
doc = nlp(u'This is the first sentence. This is another sentence. \n\nThis is the \nlast sentence.')


In [206]:
for sent in doc.sents:
    print(sent)

This is the first sentence. This is another sentence. 


This is the 

last sentence.
