**Beginning with spaCy**

Statistical Models
    These models make it possible to perform several NLP tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.
    
en_core_web_sm: English multi-task CNN trained on OntoNotes.

Part-of-speech (POS) tagger
Named entity recognizer (NER)
Syntactic dependency parser

Processing Pipeline

text --> tokenizer --> tagger --> parser --> ner --> ... --> doc

Passing text to the nlp object runs it through this pipeline and returns a Doc.
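
The pipeline idea can be sketched without spaCy: each component receives the shared document, adds its annotations, and hands it to the next component. The dict-based document and the three component functions below are hypothetical stand-ins chosen for illustration; they are not spaCy's internals.

```python
# Toy pipeline: every component receives the document, annotates it, returns it.
# The dict-based "doc" and all component functions are illustrative assumptions.

def toy_tokenizer(text):
    # First step: turn raw text into a document holding whitespace tokens.
    return {"text": text, "tokens": text.split()}

def length_tagger(doc):
    # Stand-in for a tagger: attaches one label per token.
    doc["tags"] = ["LONG" if len(tok) > 3 else "SHORT" for tok in doc["tokens"]]
    return doc

def capital_ner(doc):
    # Stand-in for NER: treats capitalised tokens as entity candidates.
    doc["ents"] = [tok for tok in doc["tokens"] if tok[:1].isupper()]
    return doc

def run_pipeline(text, components):
    # Tokenize once, then apply each component in order.
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc

doc_sketch = run_pipeline("I started learning NLP in Nepal", [length_tagger, capital_ner])
print(doc_sketch["tags"])
print(doc_sketch["ents"])
```

The order of the component list matters in the same way it does in a real pipeline: a later component can rely on annotations written by an earlier one.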

In [12]:
#nlp object
import spacy
nlp = spacy.load('en_core_web_sm')


In [13]:
document = """Hello, i'm Sabin Acharya. I am a computer vision engineer. I wanted to learn NLP. \
So, I started my NLP journey from today dated:june 4th,2020 \
during the lockdown here in Nepal."""


In [14]:
doc = nlp(document)

Tokenization
hierarchy: corpus > document > sentence > word > sub-word > character > sub-character > stroke

Tokenization normally relies on whitespace. For languages written without whitespace between words, there are two common options:

Segment the text into morphemes before tokenization.

Use byte-pair encoding (BPE) to create sub-word units (used by GPT-2, for example; Google's BERT uses the closely related WordPiece scheme).
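
To make the sub-word idea concrete, here is a toy sketch of how BPE learns its units: count adjacent symbol pairs across the corpus, fuse the most frequent pair into one symbol, and repeat. The three-word corpus, its frequencies, and the number of merges are made up for illustration; production tokenizers add many details that are left out here.

```python
from collections import Counter

def most_frequent_pair(word_freqs):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(word_freqs, pair):
    # Rewrite every word with the chosen pair fused into a single symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up corpus: word (as a tuple of characters) -> frequency.
toy_words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(2):
    pair = most_frequent_pair(toy_words)
    merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print(merges)
print(list(toy_words))
```

After two merges the frequent stem has become a single symbol, while the rarer endings are still split into characters.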


In [15]:
for token in doc:
    print(token.text, token.pos_, token.dep_)
    

Hello INTJ intj
, PUNCT punct
i PRON nsubj
'm AUX ROOT
Sabin PROPN compound
Acharya PROPN attr
. PUNCT punct
I PRON nsubj
am AUX ROOT
a DET det
computer NOUN compound
vision NOUN compound
engineer NOUN attr
. PUNCT punct
I PRON nsubj
wanted VERB ROOT
to PART aux
learn VERB xcomp
NLP PROPN dobj
. PUNCT punct
So ADV advmod
, PUNCT punct
I PRON nsubj
started VERB ROOT
my DET poss
NLP PROPN compound
journey NOUN dobj
from ADP prep
today NOUN pobj
dated VERB pcomp
: PUNCT punct
june PROPN compound
4th,2020 NUM npadvmod
during ADP ROOT
the DET det
lockdown NOUN pobj
here ADV advmod
in ADP prep
Nepal PROPN pobj
. PUNCT punct


In [16]:
print(doc)

Hello, i'm Sabin Acharya. I am a computer vision engineer. I wanted to learn NLP. So, I started my NLP journey from today dated:june 4th,2020 during the lockdown here in Nepal.


stop words:-
Very common words that are usually filtered out before word-frequency counts, topic modeling, or count vectorization (for example, to reduce the number of features for a bag-of-words text classifier).

In [17]:
spacy_stop_words = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stop_words)

326

In [18]:
print(nlp.Defaults.stop_words)

{'name', 'throughout', 'however', 'and', 'put', 'since', 'next', 'various', 'yourself', 'few', '’ve', 'nor', '‘s', 'a', 'be', 'otherwise', 'another', 'how', 'our', 'thru', "'s", 'again', 'ever', 'first', "'d", 'whereby', 'more', 'up', "n't", 'bottom', 'them', 'as', 'full', 'must', 'sometimes', 'within', 'yet', 'for', 'yourselves', 'hundred', 'whither', 'against', 'empty', 'sixty', 'even', 'whether', 'therefore', 'elsewhere', 'no', 'hereafter', 'many', 'may', 'three', 'keep', 'towards', 'thus', 'ours', 'everyone', 'four', 'once', 'unless', 'if', 'eleven', 'forty', 'hence', 'here', 'its', 'except', 'everywhere', 'beforehand', 'just', 'themselves', 'cannot', 'someone', 'had', '‘ve', 'latterly', 'ca', 'fifty', 'an', 'of', 'where', 'did', 'were', 'whatever', 'nothing', 'say', 'am', 'being', 'using', 'why', 'only', '’ll', 'becomes', 'give', 'thereupon', 'whose', "'m", 'each', 'whole', 'some', 'also', 'something', 'somewhere', 'due', 'is', '’d', 'already', 'down', 'seem', 'thereafter', 'mysel

In [19]:
nlp.vocab['myself'].is_stop  # check whether a word is a stop word

True

In [31]:
nlp.Defaults.stop_words.add('btw')  # add a stop word to the default set

nlp.vocab['btw'].is_stop = True  # also flag the lexeme itself

In [32]:
len(nlp.Defaults.stop_words)


327

In [33]:
nlp.Defaults.stop_words.remove('btw')  # remove a stop word from the default set

nlp.vocab['btw'].is_stop = False  # also clear the flag on the lexeme

In [34]:
len(nlp.Defaults.stop_words)


326
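
As a small illustration of the word-frequency use case mentioned above, the sketch below counts tokens with and without stop-word filtering. The stop list is a short hand-picked assumption, not spaCy's 326-word set, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter

# Hand-picked stop list for illustration only (an assumption, not spaCy's list).
toy_stop_words = {"i", "am", "a", "to", "the", "my", "so", "in", "from"}

sample_text = "I am a computer vision engineer so I wanted to learn NLP"
sample_tokens = sample_text.lower().split()

# Vocabulary before and after dropping stop words.
raw_counts = Counter(sample_tokens)
filtered_counts = Counter(tok for tok in sample_tokens if tok not in toy_stop_words)

print(len(raw_counts), len(filtered_counts))
print(sorted(filtered_counts))
```

The filtered vocabulary is what would become the feature columns of a bag-of-words model, so fewer distinct tokens means fewer features.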

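Lemmatization:- reducing inflected or variant forms of a word to its base form (lemma)

is, am, are --> be

car, cars, car's --> car

Depends on part-of-speech (POS) tagging, since the same surface form can have different lemmas for different parts of speech.

It is a form of text normalization.

Not much used for deep learning tasks.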
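
Why the POS tag matters can be shown with a toy lookup lemmatizer keyed on (word, POS). The table entries and the function name are made up for illustration; a real lemmatizer combines much larger tables with rules.

```python
# Toy lookup lemmatizer keyed on (word, POS); the table is a made-up assumption.
toy_lemma_table = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("is", "AUX"): "be",
    ("am", "AUX"): "be",
    ("are", "AUX"): "be",
    ("cars", "NOUN"): "car",
}

def toy_lemmatize(word, pos):
    # Fall back to the lower-cased word when the pair is not in the table.
    return toy_lemma_table.get((word.lower(), pos), word.lower())

print(toy_lemmatize("meeting", "NOUN"))
print(toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("Am", "AUX"))
```

The same string "meeting" gets two different lemmas depending on its tag, which is why the tagger runs before lemmatization.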


In [18]:
for token in doc:
    print(token.text, "\t\t",token.lemma_)

Hello 		 hello
, 		 ,
i 		 -PRON-
'm 		 be
Sabin 		 Sabin
Acharya 		 Acharya
. 		 .
I 		 -PRON-
am 		 be
a 		 a
computer 		 computer
vision 		 vision
engineer 		 engineer
. 		 .
I 		 -PRON-
wanted 		 want
to 		 to
learn 		 learn
NLP 		 NLP
. 		 .
So 		 so
, 		 ,
I 		 -PRON-
started 		 start
my 		 -PRON-
NLP 		 NLP
journey 		 journey
from 		 from
today 		 today
dated 		 date
: 		 :
june 		 june
4th,2020 		 4th,2020
during 		 during
the 		 the
lockdown 		 lockdown
here 		 here
in 		 in
Nepal 		 Nepal
. 		 .


Sentence segmentation:-


In [19]:
for sentence in doc.sents:
    print(sentence)

Hello, i'm Sabin Acharya.
I am a computer vision engineer.
I wanted to learn NLP.
So, I started my NLP journey from today dated:june 4th,2020
during the lockdown here in Nepal.
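
In the output above the last sentence is split in two, because the sentence boundaries here come from the dependency parse rather than from punctuation. For comparison, here is a rule-based sketch that splits only after sentence-final punctuation followed by whitespace. The regex is an assumption and will wrongly split abbreviations such as "e.g."; spaCy also ships a rule-based `sentencizer` component for the same purpose.

```python
import re

def split_sentences(text):
    # Split where ., ! or ? is followed by whitespace; the punctuation is kept.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample_document = (
    "Hello, i'm Sabin Acharya. I am a computer vision engineer. "
    "I wanted to learn NLP. So, I started my NLP journey from today "
    "dated:june 4th,2020 during the lockdown here in Nepal."
)

rule_based_sentences = split_sentences(sample_document)
for sentence in rule_based_sentences:
    print(sentence)
```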


NER:

PERSON People, including fictional.

FAC Buildings, airports, highways, bridges, etc.

ORG Companies, agencies, institutions, etc.

GPE Countries, cities, states.

PRODUCT Objects, vehicles, foods, etc. (Not services.)

EVENT Named hurricanes, battles, wars, sports events, etc.

DATE Absolute or relative dates or periods.

TIME Times smaller than a day.

MONEY Monetary values, including unit.

QUANTITY Measurements, as of weight or distance.

ORDINAL “first”, “second”, etc.

CARDINAL Numerals that do not fall under another type.


In [23]:
for ent in doc.ents:  # each item is an entity span, not a single token
    print(ent.text, ent.label_)

Sabin Acharya PERSON
NLP ORG
NLP ORG
today DATE
june 4th,2020 DATE
Nepal GPE


In [25]:
spacy.displacy.render(doc,style = 'ent')

Syntactic dependency parser:-

Extracts the dependency parse of a sentence to represent its grammatical structure.

The parse is a directed graph (a tree rooted at the ROOT token) and can be used as input features for graph neural networks and tree-recursive neural networks.


In [26]:

for token in doc:
    print(token.text,"\t\t",token.dep_,"\t\t",token.head.text)


Hello 		 intj 		 'm
, 		 punct 		 'm
i 		 nsubj 		 'm
'm 		 ROOT 		 'm
Sabin 		 compound 		 Acharya
Acharya 		 attr 		 'm
. 		 punct 		 'm
I 		 nsubj 		 am
am 		 ROOT 		 am
a 		 det 		 engineer
computer 		 compound 		 engineer
vision 		 compound 		 engineer
engineer 		 attr 		 am
. 		 punct 		 am
I 		 nsubj 		 wanted
wanted 		 ROOT 		 wanted
to 		 aux 		 learn
learn 		 xcomp 		 wanted
NLP 		 dobj 		 learn
. 		 punct 		 wanted
So 		 advmod 		 started
, 		 punct 		 started
I 		 nsubj 		 started
started 		 ROOT 		 started
my 		 poss 		 journey
NLP 		 compound 		 journey
journey 		 dobj 		 started
from 		 prep 		 started
today 		 pobj 		 from
dated 		 pcomp 		 from
: 		 punct 		 started
june 		 compound 		 4th,2020
4th,2020 		 npadvmod 		 started
during 		 ROOT 		 during
the 		 det 		 lockdown
lockdown 		 pobj 		 during
here 		 advmod 		 lockdown
in 		 prep 		 here
Nepal 		 pobj 		 in
. 		 punct 		 during
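
The head column above is enough to rebuild the directed graph. The sketch below does this in plain Python for the third sentence; the rows are copied from the table, while the sentence-local indices are added here for illustration.

```python
from collections import defaultdict

# (index, text, dep, head_index) for "I wanted to learn NLP ." taken from the
# table above; the indices are local to this sentence and are an assumption.
dep_rows = [
    (0, "I", "nsubj", 1),
    (1, "wanted", "ROOT", 1),
    (2, "to", "aux", 3),
    (3, "learn", "xcomp", 1),
    (4, "NLP", "dobj", 3),
    (5, ".", "punct", 1),
]

children = defaultdict(list)
root_index = None
for idx, text, dep, head in dep_rows:
    if dep == "ROOT":
        root_index = idx  # the root token is listed as its own head
    else:
        children[head].append(idx)  # edge from head to dependent

def subtree(idx):
    # Collect a node and all of its descendants, returned in sentence order.
    nodes = [idx]
    for child in children[idx]:
        nodes.extend(subtree(child))
    return sorted(nodes)

word_by_index = {idx: text for idx, text, _, _ in dep_rows}
root_children = [word_by_index[i] for i in children[root_index]]
learn_subtree = [word_by_index[i] for i in subtree(3)]
print(word_by_index[root_index], root_children)
print(learn_subtree)
```

Walking the subtree of "learn" recovers the phrase it governs, which is the kind of structure a graph-based or tree-recursive model consumes.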


In [27]:
sentence_spans = list(doc.sents)
spacy.displacy.render(sentence_spans, style="dep")

In [28]:
print(doc.vocab)

<spacy.vocab.Vocab object at 0x7f5ee97d3d48>


In [29]:
print(len(doc),len(doc.vocab))

40 513


Noun chunks:- noun chunks are "base noun phrases": flat phrases that have a noun as their head.

In [33]:
for chunk in doc.noun_chunks:
    print(chunk.text)

i
Sabin Acharya
I
a computer vision engineer
I
NLP
I
my NLP journey
today
the lockdown
Nepal


Useful Doc attributes: doc.sents, doc.ents, doc.noun_chunks, doc.vocab

spaCy has no built-in stemmer, so NLTK is used for stemming below.

In [1]:
import nltk
from nltk.stem.porter import PorterStemmer
# The Porter algorithm is the classic rule-based English stemmer

In [2]:
p_stemmer = PorterStemmer()

In [7]:
words = ['run','runner','running','ran','runs','sing','nicely','fairly']

In [8]:
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
sing --> sing
nicely --> nice
fairly --> fairli


In [9]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer 
s_stemmer = SnowballStemmer(language='english')

In [10]:
for word in words:
    print(word+' --> '+s_stemmer.stem(word))  # use the Snowball stemmer here

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
sing --> sing
nicely --> nice
fairly --> fair
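
Stemmers like the ones above strip suffixes by rule instead of looking words up in a dictionary. The sketch below shows that idea with a deliberately crude rule list; the rules and the minimum stem length are made-up assumptions and far simpler than the Porter or Snowball rule sets.

```python
# Crude suffix stripper; the rule list is a made-up assumption and much simpler
# than the real Porter or Snowball rules.
toy_suffix_rules = [("nning", "n"), ("ing", ""), ("ly", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in toy_suffix_rules:
        # Apply the first matching rule, but keep at least two stem characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["running", "runs", "nicely", "ran", "sing"]:
    print(w + " --> " + toy_stem(w))
```

An irregular form such as "ran" passes through unchanged because no suffix rule applies, which is the gap that lemmatization fills.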


# Rule-based matching
Find words and phrases in the text using user-defined rules.


In [72]:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [73]:
doc = nlp("""The page layout of a document is how information is graphically arranged in the space of the document, e.g., on a page. If the appearance of the document is of concern, the page layout is generally the responsibility of a graphic designer. Typography concerns the design of letter and symbol forms and their physical arrangement in the document (see typesetting). Information design concerns the effective communication of information, especially in industrial-document and public signs. Simple textual-documents may not require visual design and may be drafted only by an author, clerk, or transcriber. Forms may require a visual design for their initial fields, but not to complete the forms.""")

In [74]:
pattern = [{'LOWER': 'document'}]

In [75]:
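# spaCy v2 signature (name, on_match callback, pattern); spaCy v3 expects matcher.add('matchername1', [pattern])
matcher.add('matchername1', None, pattern)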

In [76]:
found_matches = matcher(doc)
print(found_matches)

[(8469217569700685470, 5, 6), (8469217569700685470, 17, 18), (8469217569700685470, 30, 31), (8469217569700685470, 62, 63), (8469217569700685470, 81, 82)]


In [77]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8469217569700685470 matchername1 5 6 document
8469217569700685470 matchername1 17 18 document
8469217569700685470 matchername1 30 31 document
8469217569700685470 matchername1 62 63 document
8469217569700685470 matchername1 81 82 document


In [78]:
matcher.remove('matchername1')

In [79]:
pattern1 = [{'IS_PUNCT': True},{'LOWER': 'documents'}]

In [80]:
# spaCy v2 signature; spaCy v3 expects matcher.add('matchername2', [pattern1])
matcher.add('matchername2', None, pattern1)

In [81]:
found_matches = matcher(doc)
print(found_matches)

[(13298170584691041502, 88, 90)]


In [82]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

13298170584691041502 matchername2 88 90 -documents


In [83]:
matcher.remove('matchername2')
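
How a token pattern is matched can be sketched in plain Python: a pattern is a list of per-token attribute dicts, and a match is any window of tokens in which each token satisfies the dict at its position. This toy supports only `LOWER` and `IS_PUNCT` and uses a naive regex tokenizer, so it is not spaCy's implementation and its token offsets will differ from the ones printed above.

```python
import re
import string

def toy_tokenize(text):
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def token_matches(token, spec):
    # Check each supported attribute in the spec; unknown attributes are ignored.
    for attr, expected in spec.items():
        if attr == "LOWER" and token.lower() != expected:
            return False
        if attr == "IS_PUNCT" and (token in string.punctuation) != expected:
            return False
    return True

def toy_match(tokens, pattern):
    # Slide a window of len(pattern) over the tokens and test it position by position.
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            hits.append((start, start + len(pattern)))
    return hits

toy_tokens = toy_tokenize("Simple textual-documents may differ from a Document.")
single_hits = toy_match(toy_tokens, [{"LOWER": "document"}])
punct_hits = toy_match(toy_tokens, [{"IS_PUNCT": True}, {"LOWER": "documents"}])
print(single_hits)
print(punct_hits)
```

As with the real Matcher, each hit is a (start, end) pair of token offsets, so `toy_tokens[start:end]` gives back the matched tokens.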

# PhraseMatcher

In [84]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [85]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [106]:
doc = nlp("""The page layout of a document is how information is graphically arranged in the space of the document, e.g., on a page. If the appearance of the document is of concern, the page layout is generally the responsibility of a graphic designer. Typography concerns the design of letter and symbol forms and their physical arrangement in the document (see typesetting). Information design concerns the effective communication of information, especially in industrial-document and public signs. Simple textual-documents may not require visual design and may be drafted only by an author, clerk, or transcriber. Forms may require a visual design for their initial fields, but not to complete the forms.""")

In [112]:
phrase_list = ['document','documents']

In [113]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [114]:
# spaCy v2 signature; spaCy v3 expects matcher.add('matcher1', phrase_patterns)
matcher.add('matcher1', None, *phrase_patterns)

matches = matcher(doc)

In [115]:
print(len(matches))

6


In [117]:
print(matches)

[(10800829559984205610, 5, 6), (10800829559984205610, 17, 18), (10800829559984205610, 30, 31), (10800829559984205610, 62, 63), (10800829559984205610, 81, 82), (10800829559984205610, 89, 90)]


In [118]:
for match_id,start,end in matches:
    print(doc[start:end])

document
document
document
document
document
documents
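
Phrase matching boils down to finding exact token sequences from a list of terms, which also covers multi-word phrases. Here is a plain-Python sketch of that idea; the token list, the phrases, and the function name are made up for illustration, and the real PhraseMatcher is implemented far more efficiently.

```python
def find_phrases(tokens, phrases):
    # Return (start, end, phrase) for every exact token-sequence occurrence.
    hits = []
    for phrase in phrases:
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == phrase:
                hits.append((start, start + n, " ".join(phrase)))
    return sorted(hits)

phrase_tokens = "the page layout of a document differs from the page layout of documents".split()
toy_phrases = [["page", "layout"], ["document"], ["documents"]]

phrase_hits = find_phrases(phrase_tokens, toy_phrases)
for start, end, phrase in phrase_hits:
    print(start, end, phrase)
```

Unlike the attribute patterns of the Matcher, the terms here are just token sequences, so a long terminology list needs no hand-written rules.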
