## 0. Load Dependencies

In [69]:
import time
import re
from nltk.stem.porter import PorterStemmer

# Optional - For wider screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## 1. Load Dataset From Gutenberg

In [70]:
! wget https://www.gutenberg.org/files/4300/4300-0.txt  # Ulysses

--2020-08-08 15:58:03--  https://www.gutenberg.org/files/4300/4300-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1586488 (1.5M) [text/plain]
Saving to: ‘4300-0.txt.13’


2020-08-08 15:58:04 (1.71 MB/s) - ‘4300-0.txt.13’ saved [1586488/1586488]



In [71]:
import pickle
with open('library.pkl', 'rb') as fh:
    lib = pickle.load(fh)

## 2. SpaCy 

First, let's build our processing pipeline using spaCy.

In [73]:
t0=time.time()

In [74]:
import spacy
nlp=spacy.load('en_core_web_md')

Use the Sentencizer to split the book into sentences. We can't use the file's lines directly, because the book is stored line by line and the line breaks do not coincide with sentence boundaries. Therefore, we read the book as one continuous text and then split it into sentences.

In [75]:
# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()

In [76]:
nlp.add_pipe(nlp.create_pipe('sentencizer')) # updated

spaCy limits a single text to 1M characters by default (`nlp.max_length`), so I will split each book into chunks of at most 1M characters, process the sentences of each chunk, and then concatenate them.

In [77]:
def div_docs_sent(doc):
    """Split a raw text into sentences, feeding spaCy at most 1M characters at a time."""
    mil = 1000*1000
    sentences = []
    # Walk over the text in consecutive 1M-character slices (one slice if the text is short)
    for start in range(0, len(doc), mil):
        doc_ = nlp(doc[start:start+mil])
        # sent.text is used instead of sent.string, which spaCy 3 no longer provides
        sentences.extend([sent.text.strip() for sent in doc_.sents])
    return sentences
    

In [78]:
books=[div_docs_sent(book) for book in lib]
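
Slicing at fixed 1M-character offsets, as `div_docs_sent` does, can cut a word or a sentence in two at each chunk boundary. Below is a minimal sketch of one way to soften this, assuming it is acceptable to break at the last whitespace before the limit. `chunk_text` and `max_chars` are hypothetical names, not part of the notebook or of spaCy.

```python
def chunk_text(text, max_chars=1_000_000):
    """Split text into pieces of at most max_chars, breaking at whitespace when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Find the last space or newline inside the window and cut just after it
            cut = max(text.rfind(' ', start, end), text.rfind('\n', start, end))
            if cut > start:
                end = cut + 1  # the whitespace stays with the left-hand chunk
        chunks.append(text[start:end])
        start = end
    return chunks


# Tiny limit so the behaviour is visible; the notebook would keep 1,000,000
text = "the quick brown fox jumps"
pieces = chunk_text(text, max_chars=10)
print(pieces)  # ['the quick ', 'brown fox ', 'jumps']
```

Each piece could then be passed to `nlp` in place of the fixed slices. A sentence can still straddle two chunks, so this only reduces the problem rather than removing it.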

In [230]:
len(' '.join(lib))

4876562

In [79]:
flatlist = lambda l : [item  for sublist in l for item in sublist]

In [80]:
sents=flatlist(books)

In [81]:
len(sents)

96699

In [83]:
import pandas as pd
df=pd.DataFrame(sents,columns=['text'])

In [84]:
df

Unnamed: 0,text
0,Project Gutenberg's The Hound of the Baskervil...
1,This eBook is for the use of anyone anywhere a...
2,
3,"You may copy it, give it away or\nre-use it un..."
4,Gutenberg License included\nwith this eBook or...
...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co..."
96695,
96696,Most people start at our Web site which has th...
96697,This Web site includes information about Proje...


In [85]:
sent=nlp(df.loc[0,'text'])

In [86]:
for token in sent:
    print(token.text, token.has_vector, token.vector_norm, token.is_stop)

Project True 6.227214 False
Gutenberg True 6.859441 False
's True 5.1889863 True
The True 4.70935 True
Hound True 7.0336733 False
of True 4.97793 True
the True 4.70935 True
Baskervilles True 6.8786654 False
, True 5.094723 False
by True 6.015159 True
Arthur True 6.281174 False
Conan True 7.009008 False
Doyle True 5.8634067 False


### 2.1 Clean Tabs and whitespace

In [213]:
# str.replace matches literal text only, so the whitespace pattern needs re.sub
clean_shrink = lambda text : re.sub(r'\s+', ' ', text).strip()

In [218]:
df.loc[:,'document']=df.text.map(clean_shrink)
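
When collapsing whitespace, keep in mind that `str.replace` treats its first argument as a literal string, whereas `re.sub` interprets it as a regular expression. Here is a small illustration on a made-up string. The sample text is an assumption, chosen to mimic the hard line breaks in the Gutenberg files.

```python
import re

sample = "You may copy it, give it away or\nre-use it\tunder   the terms"

# str.replace searches for the literal characters backslash, 's', '+': nothing changes
literal = sample.replace(r'\s+', ' ')

# re.sub interprets the pattern and collapses every whitespace run to one space
collapsed = re.sub(r'\s+', ' ', sample).strip()

print(literal == sample)  # True
print(collapsed)          # You may copy it, give it away or re-use it under the terms
```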

In [220]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,stem,lemma
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[project, gutenberg, 's, the, hound, of, the, ...","[Project, Gutenberg, Hound, Baskervilles, Arth..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[thi, ebook, is, for, the, use, of, anyon, any...","[eBook, use, cost, restriction, whatsoever, .]"
2,,,[],[],[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[you, may, copi, it, ,, give, it, away, or, re...","[copy, away, use, term, project]"
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[gutenberg, licens, includ, with, thi, ebook, ...","[Gutenberg, License, include, eBook, online, w..."
...,...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[thu, ,, we, do, not, necessarili, keep, ebook...","[necessarily, ebook, compliance, particular, p..."
96695,,,[],[],[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[most, peopl, start, at, our, web, site, which...","[people, start, web, site, main, pg, search, f..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[thi, web, site, includ, inform, about, projec...","[web, site, include, information, Project, Gut..."


### 2.2 Tokenize Sentences

In [131]:
sentence_tokenizer = lambda sent : [token for token in nlp(sent)]

In [132]:
df.loc[:,'token']=df.document.map(sentence_tokenizer)

In [133]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]"
2,,,[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]"
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ..."
...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ..."
96695,,,[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu..."


### 2.3 Normalize Tokens

In [144]:
punct=r"[^\w\d\s\.\!\?]"  # raw string, so the backslashes reach the regex engine unchanged

In [93]:
normalizer = lambda tokens : [re.sub(punct,'',token.text)  for token in tokens if re.sub(punct,'',token.text) != '']

In [94]:
df.loc[:,'normalized']=df.token.map(normalizer)
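
To see what the `punct` pattern removes, here is a sketch that applies the same substitution to plain strings instead of spaCy tokens. The token list is copied by hand from the first sentence printed earlier, so treat it as an illustration, not as tokenizer output.

```python
import re

# Matches anything that is not a word character, whitespace, '.', '!' or '?'
punct = r"[^\w\d\s\.\!\?]"

tokens = ["Project", "Gutenberg", "'s", "The", "Hound", "of", "the",
          "Baskervilles", ",", "by", "Arthur", "Conan", "Doyle"]

normalized = []
for tok in tokens:
    cleaned = re.sub(punct, '', tok)
    if cleaned != '':  # tokens made only of punctuation disappear entirely
        normalized.append(cleaned)

print(normalized)
# ['Project', 'Gutenberg', 's', 'The', 'Hound', 'of', 'the',
#  'Baskervilles', 'by', 'Arthur', 'Conan', 'Doyle']
```

The possessive `'s` loses its apostrophe and the comma is dropped, while sentence-final `.`, `!` and `?` would be kept.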

In [100]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]"
2,,,[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]"
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ..."
...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ..."
96695,,,[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu..."


### 2.4 Remove Stop Words

In [140]:
normalizer_and_stop = lambda tokens : [re.sub(punct,'',token.text)  for token in tokens if re.sub(punct,'',token.text) != '' and not token.is_stop]

In [143]:
df.loc[:,'cleanTokens']=df.token.map(normalizer_and_stop)

In [31]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]"
2,,,[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]"
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ..."
...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ..."
96695,,,[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu..."


### 2.5 Lemmatize

In [208]:
normalizer_and_stop_lemma = lambda tokens : [re.sub(punct,'',token.lemma_)  for token in tokens if re.sub(punct,'',token.text) != '' and not token.is_stop]

In [209]:
df.loc[:,'lemma']=df.token.map(normalizer_and_stop_lemma)

In [210]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,stem,lemma
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[project, gutenberg, 's, the, hound, of, the, ...","[Project, Gutenberg, Hound, Baskervilles, Arth..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[thi, ebook, is, for, the, use, of, anyon, any...","[eBook, use, cost, restriction, whatsoever, .]"
2,,,[],[],[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[you, may, copi, it, ,, give, it, away, or, re...","[copy, away, use, term, project]"
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[gutenberg, licens, includ, with, thi, ebook, ...","[Gutenberg, License, include, eBook, online, w..."
...,...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[thu, ,, we, do, not, necessarili, keep, ebook...","[necessarily, ebook, compliance, particular, p..."
96695,,,[],[],[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[most, peopl, start, at, our, web, site, which...","[people, start, web, site, main, pg, search, f..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[thi, web, site, includ, inform, about, projec...","[web, site, include, information, Project, Gut..."


### 2.6 Stemmer

In [202]:
stemmer = PorterStemmer()

In [205]:
stems = lambda tokens : [stemmer.stem(token.text) for token in tokens]  # an empty token list simply yields []

In [206]:
df.loc[:,'stem']=df.token.map(stems)

In [207]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,stem
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[project, gutenberg, 's, the, hound, of, the, ..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[thi, ebook, is, for, the, use, of, anyon, any..."
2,,,[],[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[you, may, copi, it, ,, give, it, away, or, re..."
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[gutenberg, licens, includ, with, thi, ebook, ..."
...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[thu, ,, we, do, not, necessarili, keep, ebook..."
96695,,,[],[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[most, peopl, start, at, our, web, site, which..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[thi, web, site, includ, inform, about, projec..."


### 2.7 Part of Speech Tagging

In [211]:
normalizer_and_stop_pos = lambda tokens : [re.sub(punct,'',token.pos_)  for token in tokens if re.sub(punct,'',token.text) != '' and not token.is_stop]

In [36]:
# Map over the spaCy tokens: the strings in cleanTokens have no pos_ or is_stop attributes
df.loc[:,'pos']=df.token.map(normalizer_and_stop_pos)

In [37]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,lemma,pos
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[PROPN, PROPN, PROPN, PROPN, PROPN, PROPN, PROPN]"
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[eBook, use, cost, restriction, whatsoever, .]","[PROPN, NOUN, NOUN, NOUN, ADV, PUNCT]"
2,,,[],[],[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[copy, away, use, term, project]","[VERB, ADV, VERB, NOUN, NOUN]"
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[Gutenberg, License, include, eBook, online, w...","[PROPN, PROPN, VERB, PROPN, ADV, PROPN]"
...,...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[necessarily, ebook, compliance, particular, p...","[ADV, NOUN, NOUN, ADJ, NOUN, NOUN, PUNCT]"
96695,,,[],[],[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[people, start, web, site, main, pg, search, f...","[NOUN, VERB, NOUN, NOUN, ADJ, NOUN, NOUN, NOUN..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[web, site, include, information, Project, Gut...","[NOUN, NOUN, VERB, NOUN, PROPN, PROPN, PROPN, ..."


### 2.8 Token Assembler

In [38]:
token_assembler = lambda tokens : " ".join(tokens)

In [39]:
df.loc[:,'clean_text']=df.cleanTokens.map(token_assembler)

In [40]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,lemma,pos,clean_text
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[PROPN, PROPN, PROPN, PROPN, PROPN, PROPN, PROPN]",Project Gutenberg Hound Baskervilles Arthur Co...
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[eBook, use, cost, restriction, whatsoever, .]","[PROPN, NOUN, NOUN, NOUN, ADV, PUNCT]",eBook use cost restrictions whatsoever .
2,,,[],[],[],[],[],
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[copy, away, use, term, project]","[VERB, ADV, VERB, NOUN, NOUN]",copy away use terms Project
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[Gutenberg, License, include, eBook, online, w...","[PROPN, PROPN, VERB, PROPN, ADV, PROPN]",Gutenberg License included eBook online www.gu...
...,...,...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[necessarily, ebook, compliance, particular, p...","[ADV, NOUN, NOUN, ADJ, NOUN, NOUN, PUNCT]",necessarily eBooks compliance particular paper...
96695,,,[],[],[],[],[],
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[people, start, web, site, main, pg, search, f...","[NOUN, VERB, NOUN, NOUN, ADJ, NOUN, NOUN, NOUN...",people start Web site main PG search facility ...
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[web, site, include, information, Project, Gut...","[NOUN, NOUN, VERB, NOUN, PROPN, PROPN, PROPN, ...",Web site includes information Project Gutenber...


### 2.9 Tagger

Produces NER chunks: a list of `(entity text, entity label)` pairs for each sentence.

In [41]:
tagger = lambda text : [(ent.text, ent.label_) for ent in nlp(text).ents]

In [42]:
df.loc[:,'ner_chunks']=df.loc[:,'document'].map(tagger)

In [43]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,lemma,pos,clean_text,ner_chunks
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[PROPN, PROPN, PROPN, PROPN, PROPN, PROPN, PROPN]",Project Gutenberg Hound Baskervilles Arthur Co...,"[(Gutenberg, PERSON), (The Hound of the Basker..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[eBook, use, cost, restriction, whatsoever, .]","[PROPN, NOUN, NOUN, NOUN, ADV, PUNCT]",eBook use cost restrictions whatsoever .,[]
2,,,[],[],[],[],[],,[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[copy, away, use, term, project]","[VERB, ADV, VERB, NOUN, NOUN]",copy away use terms Project,[]
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[Gutenberg, License, include, eBook, online, w...","[PROPN, PROPN, VERB, PROPN, ADV, PROPN]",Gutenberg License included eBook online www.gu...,"[(Gutenberg, PERSON), (eBook, WORK_OF_ART)]"
...,...,...,...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[necessarily, ebook, compliance, particular, p...","[ADV, NOUN, NOUN, ADJ, NOUN, NOUN, PUNCT]",necessarily eBooks compliance particular paper...,[]
96695,,,[],[],[],[],[],,[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[people, start, web, site, main, pg, search, f...","[NOUN, VERB, NOUN, NOUN, ADJ, NOUN, NOUN, NOUN...",people start Web site main PG search facility ...,[]
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[web, site, include, information, Project, Gut...","[NOUN, NOUN, VERB, NOUN, PROPN, PROPN, PROPN, ...",Web site includes information Project Gutenber...,[(the Project Gutenberg Literary Archive Found...


### 2.10 Regex Parser

In [44]:
# Keep three-word, purely alphabetic noun chunks of the form DET + ADJ + NN/NNP
noun_chunker = lambda text : [
    (chnk, (chnk[0].pos_, chnk[1].pos_, chnk[2].tag_))
    for chnk in nlp(text).noun_chunks
    if len(chnk.text.split()) == 3
    and chnk.text.replace(' ', '').isalpha()
    and chnk[0].pos_ == 'DET'
    and chnk[1].pos_ == 'ADJ'
    and chnk[2].tag_ in ['NN', 'NNP']
]  # and not chnk[0].is_stop

In [45]:
df.loc[:,'RegexpParser'] =df.loc[:,'document'].map(noun_chunker)
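
The filter keeps only three-word noun chunks whose tags follow the pattern determiner, adjective, singular noun. The same test can be sketched without spaCy on hand-tagged `(word, pos, tag)` triples. The tags below are written by hand for illustration, and `matches_pattern` is a hypothetical helper.

```python
def matches_pattern(chunk):
    """True if chunk is exactly DET + ADJ + singular (proper) noun, all alphabetic."""
    if len(chunk) != 3:
        return False
    words = [word for word, _, _ in chunk]
    (_, pos0, _), (_, pos1, _), (_, _, tag2) = chunk
    return (all(w.isalpha() for w in words)
            and pos0 == 'DET' and pos1 == 'ADJ' and tag2 in ('NN', 'NNP'))


# Hand-tagged examples (assumed tags, not produced by a tagger)
chunks = [
    [("a", "DET", "DT"), ("great", "ADJ", "JJ"), ("deal", "NOUN", "NN")],
    [("the", "DET", "DT"), ("moors", "NOUN", "NNS")],
    [("some", "DET", "DT"), ("old", "ADJ", "JJ"), ("papers", "NOUN", "NNS")],
]
kept = [" ".join(w for w, _, _ in c) for c in chunks if matches_pattern(c)]
print(kept)  # ['a great deal']
```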

In [46]:
[chunk for chunk in df.RegexpParser.values if chunk!=[]]

[[(My dear Robinson, ('DET', 'ADJ', 'NNP'))],
 [(this accidental souvenir, ('DET', 'ADJ', 'NN'))],
 [(a great deal, ('DET', 'ADJ', 'NN'))],
 [(a great amount, ('DET', 'ADJ', 'NN'))],
 [(the local hunt, ('DET', 'ADJ', 'NN')),
  (some surgical assistance, ('DET', 'ADJ', 'NN'))],
 [(a remarkable power, ('DET', 'ADJ', 'NN'))],
 [(a good deal, ('DET', 'ADJ', 'NN'))],
 [(a fresh basis, ('DET', 'ADJ', 'NN')),
  (this unknown visitor, ('DET', 'ADJ', 'NN'))],
 [(the obvious conclusion, ('DET', 'ADJ', 'NN'))],
 [(their good will, ('DET', 'ADJ', 'NN'))],
 [(my dear Watson, ('DET', 'ADJ', 'NNP')),
  (a young fellow, ('DET', 'ADJ', 'NN'))],
 [(a favourite dog, ('DET', 'ADJ', 'NN'))],
 [(the latter part, ('DET', 'ADJ', 'NN'))],
 [(that local hunt, ('DET', 'ADJ', 'NN'))],
 [(a heavy stick, ('DET', 'ADJ', 'NN'))],
 [(a professional brother, ('DET', 'ADJ', 'NN'))],
 [(the dramatic moment, ('DET', 'ADJ', 'NN'))],
 [(a long nose, ('DET', 'ADJ', 'NN'))],
 [(a forward thrust, ('DET', 'ADJ', 'NN')),
  (a ge

### 2.11 N-Gram Generator

In [47]:
# n is a parameter with a default instead of a global defined in a later cell
ngram_generator = lambda input_list, n=3: [*zip(*[input_list[i:] for i in range(n)])]

In [48]:
df.loc[:,'triGrams'] = df.loc[:,'token'].map(ngram_generator)
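
The `zip` expression lines up `n` copies of the token list, each shifted one position further, and reads across them. Because `zip` stops at the shortest copy, sentences with fewer than `n` tokens yield no n-grams. Here is a sketch on a plain list of strings, where `ngrams` is a hypothetical name.

```python
def ngrams(items, n):
    """Return all runs of n consecutive items as tuples."""
    shifted = [items[i:] for i in range(n)]  # items, items[1:], ..., items[n-1:]
    return list(zip(*shifted))


words = ["You", "may", "copy", "it"]
print(ngrams(words, 3))  # [('You', 'may', 'copy'), ('may', 'copy', 'it')]
print(ngrams(words, 5))  # []
```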

In [49]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,lemma,pos,clean_text,ner_chunks,RegexpParser,triGrams
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[PROPN, PROPN, PROPN, PROPN, PROPN, PROPN, PROPN]",Project Gutenberg Hound Baskervilles Arthur Co...,"[(Gutenberg, PERSON), (The Hound of the Basker...",[],"[(Project, Gutenberg, 's), (Gutenberg, 's, The..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[eBook, use, cost, restriction, whatsoever, .]","[PROPN, NOUN, NOUN, NOUN, ADV, PUNCT]",eBook use cost restrictions whatsoever .,[],[],"[(This, eBook, is), (eBook, is, for), (is, for..."
2,,,[],[],[],[],[],,[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[copy, away, use, term, project]","[VERB, ADV, VERB, NOUN, NOUN]",copy away use terms Project,[],[],"[(You, may, copy), (may, copy, it), (copy, it,..."
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[Gutenberg, License, include, eBook, online, w...","[PROPN, PROPN, VERB, PROPN, ADV, PROPN]",Gutenberg License included eBook online www.gu...,"[(Gutenberg, PERSON), (eBook, WORK_OF_ART)]",[],"[(Gutenberg, License, included), (License, inc..."
...,...,...,...,...,...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[necessarily, ebook, compliance, particular, p...","[ADV, NOUN, NOUN, ADJ, NOUN, NOUN, PUNCT]",necessarily eBooks compliance particular paper...,[],[],"[(Thus, ,, we), (,, we, do), (we, do, not), (d..."
96695,,,[],[],[],[],[],,[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[people, start, web, site, main, pg, search, f...","[NOUN, VERB, NOUN, NOUN, ADJ, NOUN, NOUN, NOUN...",people start Web site main PG search facility ...,[],[],"[(Most, people, start), (people, start, at), (..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[web, site, include, information, Project, Gut...","[NOUN, NOUN, VERB, NOUN, PROPN, PROPN, PROPN, ...",Web site includes information Project Gutenber...,[(the Project Gutenberg Literary Archive Found...,[],"[(This, Web, site), (Web, site, includes), (si..."


### 2.12 Word2Vec Embeddings

In [50]:
vector = lambda tokens: [(token.text, token.has_vector, token.vector, token.is_oov) for token in tokens]

In [51]:
df.loc[:,'vectors'] = df.loc[:,'token'].map(vector)

In [52]:
df

Unnamed: 0,text,document,token,normalized,cleanTokens,lemma,pos,clean_text,ner_chunks,RegexpParser,triGrams,vectors
0,Project Gutenberg's The Hound of the Baskervil...,Project Gutenberg's The Hound of the Baskervil...,"[Project, Gutenberg, 's, The, Hound, of, the, ...","[Project, Gutenberg, s, The, Hound, of, the, B...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[Project, Gutenberg, Hound, Baskervilles, Arth...","[PROPN, PROPN, PROPN, PROPN, PROPN, PROPN, PROPN]",Project Gutenberg Hound Baskervilles Arthur Co...,"[(Gutenberg, PERSON), (The Hound of the Basker...",[],"[(Project, Gutenberg, 's), (Gutenberg, 's, The...","[(Project, True, [0.085476, -0.56849, 0.20372,..."
1,This eBook is for the use of anyone anywhere a...,This eBook is for the use of anyone anywhere a...,"[This, eBook, is, for, the, use, of, anyone, a...","[This, eBook, is, for, the, use, of, anyone, a...","[eBook, use, cost, restrictions, whatsoever, .]","[eBook, use, cost, restriction, whatsoever, .]","[PROPN, NOUN, NOUN, NOUN, ADV, PUNCT]",eBook use cost restrictions whatsoever .,[],[],"[(This, eBook, is), (eBook, is, for), (is, for...","[(This, True, [-0.087595, 0.35502, 0.063868, 0..."
2,,,[],[],[],[],[],,[],[],[],[]
3,"You may copy it, give it away or\nre-use it un...","You may copy it, give it away or re-use it und...","[You, may, copy, it, ,, give, it, away, or, re...","[You, may, copy, it, give, it, away, or, re, u...","[copy, away, use, terms, Project]","[copy, away, use, term, project]","[VERB, ADV, VERB, NOUN, NOUN]",copy away use terms Project,[],[],"[(You, may, copy), (may, copy, it), (copy, it,...","[(You, True, [-0.11076, 0.30786, -0.5198, 0.03..."
4,Gutenberg License included\nwith this eBook or...,Gutenberg License included with this eBook or ...,"[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, with, this, eBo...","[Gutenberg, License, included, eBook, online, ...","[Gutenberg, License, include, eBook, online, w...","[PROPN, PROPN, VERB, PROPN, ADV, PROPN]",Gutenberg License included eBook online www.gu...,"[(Gutenberg, PERSON), (eBook, WORK_OF_ART)]",[],"[(Gutenberg, License, included), (License, inc...","[(Gutenberg, True, [0.17115, 0.5546, 0.27933, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...
96694,"Thus, we do not\nnecessarily keep eBooks in co...","Thus, we do not necessarily keep eBooks in com...","[Thus, ,, we, do, not, necessarily, keep, eBoo...","[Thus, we, do, not, necessarily, keep, eBooks,...","[necessarily, eBooks, compliance, particular, ...","[necessarily, ebook, compliance, particular, p...","[ADV, NOUN, NOUN, ADJ, NOUN, NOUN, PUNCT]",necessarily eBooks compliance particular paper...,[],[],"[(Thus, ,, we), (,, we, do), (we, do, not), (d...","[(Thus, True, [-0.070575, 0.0741, -0.096746, -..."
96695,,,[],[],[],[],[],,[],[],[],[]
96696,Most people start at our Web site which has th...,Most people start at our Web site which has th...,"[Most, people, start, at, our, Web, site, whic...","[Most, people, start, at, our, Web, site, whic...","[people, start, Web, site, main, PG, search, f...","[people, start, web, site, main, pg, search, f...","[NOUN, VERB, NOUN, NOUN, ADJ, NOUN, NOUN, NOUN...",people start Web site main PG search facility ...,[],[],"[(Most, people, start), (people, start, at), (...","[(Most, True, [-0.19571, 0.056275, -0.096518, ..."
96697,This Web site includes information about Proje...,This Web site includes information about Proje...,"[This, Web, site, includes, information, about...","[This, Web, site, includes, information, about...","[Web, site, includes, information, Project, Gu...","[web, site, include, information, Project, Gut...","[NOUN, NOUN, VERB, NOUN, PROPN, PROPN, PROPN, ...",Web site includes information Project Gutenber...,[(the Project Gutenberg Literary Archive Found...,[],"[(This, Web, site), (Web, site, includes), (si...","[(This, True, [-0.087595, 0.35502, 0.063868, 0..."


### 2.13 Regex Matcher

In [53]:
rules = r'''\b[A-Z]\w+ly\b|Stephen\s(?!Proto|Cardinal)[A-Z]\w+|Simon\s[A-Z]\w+'''
regex_matchers = lambda text : re.findall(rules,text)
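
To see what each alternative in the pattern picks up, here is a small sketch that runs the same `rules` pattern on a made-up sentence. The sample sentence is an assumption for illustration and does not come from the books.

```python
import re

# Same pattern as in the cell above: capitalised words ending in "ly",
# "Stephen <Surname>" (but not "Stephen Proto..." / "Stephen Cardinal..."),
# and "Simon <Surname>".
rules = r'''\b[A-Z]\w+ly\b|Stephen\s(?!Proto|Cardinal)[A-Z]\w+|Simon\s[A-Z]\w+'''

# Hypothetical sentence, written only to exercise the three alternatives.
sample = "Finally Stephen Dedalus met Simon Dedalus, but Stephen Protomartyr was only mentioned."

matches = re.findall(rules, sample)
print(matches)  # ['Finally', 'Stephen Dedalus', 'Simon Dedalus']
```

The negative lookahead is what keeps "Stephen Protomartyr" out of the result, and the lowercase "only" is skipped because the first alternative requires a leading capital letter.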

In [54]:
df.loc[:,'Regex_matches'] =df.loc[:,'document'].map(regex_matchers)

In [55]:
df.Regex_matches[df.Regex_matches.map(len)>1]

13123                      [Polly, Polly]
25669                      [Sally, Sally]
27262                      [Sally, Sally]
27273                      [Polly, Sally]
27340                      [Polly, Sally]
28311                      [Pelly, Dolly]
42016                      [Feely, Feely]
49802                    [Finally, Emily]
52129                    [Lively, Lively]
58295                      [Silly, Milly]
62141                      [Silly, Milly]
64811                       [Only, Molly]
71650                        [Hely, Daly]
74427                      [Healy, Dolly]
77404                      [Molly, Milly]
77437                      [Milly, Molly]
81557                     [Molly, Reilly]
84023          [Szombathely, Szombathely]
89594                       [Healy, Joly]
92206    [Simon Dedalus, Stephen Dedalus]
92980      [Firstly, Nelly, Nelly, Nelly]
93046               [Szombathely, Karoly]
94402       [Reilly, Simon Dedalus, Hely]
94489    [Stephen Dedalus, Simon D

### 2.14 Let's Venture Into The Characters...

Now that we have a dataset with many features, we have a plethora of options to dive into. Let's examine the characters that are in the book...

Let's find NER Chunks that have 'PERSON' tag, consisting of 2 words.

In [57]:
flatlist = lambda l : [re.sub(r"[^a-zA-Z\s']","",item[0]).title().strip()  for sublist in l for item in sublist if item[1]=='PERSON' and len(item[0].split())==2]

In [56]:
ner_chunks = df.ner_chunks.to_list()

In [58]:
names=(flatlist(ner_chunks))

In [59]:
len(names)

4832

4832 names... That looks like too many.
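
The filter above is dense, so here is the same logic as a plain function applied to a toy list. The function name `person_names` and the toy chunks are made up for illustration; the real `ner_chunks` comes from the DataFrame.

```python
import re

def person_names(ner_chunks):
    """Keep two-word PERSON entities, cleaned up and title-cased."""
    names = []
    for sentence_chunks in ner_chunks:          # one list of (text, label) per sentence
        for text, label in sentence_chunks:
            if label == "PERSON" and len(text.split()) == 2:
                # Drop everything except letters, whitespace and apostrophes.
                cleaned = re.sub(r"[^a-zA-Z\s']", "", text).title().strip()
                names.append(cleaned)
    return names

# Toy input in the same shape as ner_chunks (hypothetical values).
toy_chunks = [
    [("Sherlock Holmes", "PERSON"), ("London", "GPE")],
    [],
    [("A. Conan Doyle", "PERSON")],   # three words, so it is skipped
    [("Dr. Watson", "PERSON")],
]

names_demo = person_names(toy_chunks)
print(names_demo)  # ['Sherlock Holmes', 'Dr Watson']
```

Note that the two-word rule keeps titles such as "Dr Watson" and drops full three-part names, which helps explain some of the odd entries in the counts below.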

In [60]:
from collections import Counter

In [61]:
counter=Counter(names).most_common(350) 

Here are the 350 most common names. Many of them are mislabeled instances.

In [62]:
counter

[('St Clare', 306),
 ('Buck Mulligan', 96),
 ('Aunt Chloe', 76),
 ('Martin Cunningham', 74),
 ('Masr George', 45),
 ('Ned Lambert', 44),
 ('Solomon Northup', 43),
 ('Tom Sawyer', 42),
 ('Mary Jane', 38),
 ('John Thornton', 38),
 ('Aunt Sally', 37),
 ('Uncle Tom', 37),
 ('Ben Dollard', 36),
 ('Myles Crawford', 35),
 ('John Eglinton', 32),
 ('Blazes Boylan', 28),
 ('Uncle Abram', 24),
 ('J J', 22),
 ('Sherlock Holmes', 21),
 ('Cissy Caffrey', 21),
 ('Henry Baskerville', 20),
 ('Nosey Flynn', 20),
 ('Mintus Northup', 19),
 ('Bob Doran', 19),
 ('Omadden Burke', 19),
 ('Davy Byrne', 19),
 ('Simon Dedalus', 18),
 ('Edwin Epps', 17),
 ('John Wyse', 17),
 ('Aunt Phebe', 16),
 ('Bantam Lyons', 16),
 ('Tom Rochford', 16),
 ('Edy Boardman', 16),
 ('Charles Baskerville', 15),
 ('Huck Finn', 15),
 ('Leopold Bloom', 15),
 ('Coombe Tracey', 14),
 ('Tom Kernan', 14),
 ('Paddy Dignam', 14),
 ('Richie Goulding', 14),
 ('Mr Power', 13),
 ('Paddy Leonard', 13),
 ('Mrs Breen', 13),
 ('David Widger', 12),
 

Let's examine some NER Chunks and their tags.

In [63]:
ner_chunks

[[('Gutenberg', 'PERSON'),
  ('The Hound of the Baskervilles', 'WORK_OF_ART'),
  ('Arthur Conan Doyle', 'PERSON')],
 [],
 [],
 [],
 [('Gutenberg', 'PERSON'), ('eBook', 'WORK_OF_ART')],
 [('The Hound of the Baskervilles', 'WORK_OF_ART')],
 [],
 [('Arthur Conan Doyle', 'PERSON')],
 [],
 [('December 8, 2008', 'DATE')],
 [('2852', 'MONEY')],
 [('July 19, 2019', 'DATE')],
 [('English', 'LANGUAGE')],
 [],
 [],
 [],
 [],
 [],
 [],
 [('Shreevatsa R', 'PERSON'), ('David Widger', 'PERSON')],
 [('A. Conan Doyle', 'PERSON')],
 [('Robinson', 'PERSON')],
 [],
 [],
 [],
 [('A. Conan Doyle', 'PERSON')],
 [],
 [('Hindhead', 'ORG'), ('Haslemere', 'ORG')],
 [],
 [('Chapter 1', 'LAW')],
 [('Sherlock Holmes', 'PERSON')],
 [('Chapter 2', 'LAW')],
 [],
 [],
 [('Chapter 4', 'LAW')],
 [('Henry Baskerville', 'PERSON')],
 [('Chapter 5', 'LAW')],
 [('Three', 'CARDINAL')],
 [],
 [('Baskerville Hall  ', 'PERSON')],
 [('Stapletons', 'PERSON'), ('Merripit', 'NORP')],
 [('Chapter 8', 'LAW')],
 [('First Report of Dr.',

In [64]:
flatlist = lambda l : [item[1]  for sublist in l for item in sublist]

These are the tags generated by SpaCy.

In [65]:
set(flatlist(ner_chunks))

{'CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART'}

Here is the list of names that occur more than once.

In [224]:
counter=Counter(names).most_common()

In [225]:
counter

[('Buck Mulligan', 93),
 ('Aunt Chloe', 82),
 ('Martin Cunningham', 71),
 ('Bayou Boeuf', 48),
 ('Aunt Sally', 39),
 ('Ned Lambert', 39),
 ('Mary Jane', 38),
 ('Solomon Northup', 36),
 ('John Thornton', 34),
 ('Myles Crawford', 33),
 ('Ben Dollard', 31),
 ('Sherlock Holmes', 30),
 ('Tom Sawyer', 30),
 ('John Eglinton', 29),
 ('Nosey Flynn', 28),
 ('Corny Kelleher', 27),
 ('Mrs Breen', 27),
 ('Father Conmee', 26),
 ('Uncle Tom', 25),
 ('John Wyse', 24),
 ('Henry Baskerville', 23),
 ('Uncle Abram', 22),
 ('Blazes Boylan', 19),
 ('Bob Doran', 18),
 ('Davy Byrne', 18),
 ('Coombe Tracey', 17),
 ('Aunt Phebe', 17),
 ('Simon Dedalus', 17),
 ('Cissy Caffrey', 17),
 ('Edy Boardman', 16),
 ('Mintus Northup', 15),
 ('Edwin Epps', 14),
 ('Mr Bloom', 14),
 ('Tom Rochford', 14),
 ('Tom Kernan', 13),
 ('Paddy Leonard', 13),
 ('Charles Baskerville', 12),
 ('Uncle Silas', 12),
 ('Master Ford', 12),
 ('Anne Northup', 12),
 ('Richie Goulding', 12),
 ('Paddy Dignam', 12),
 ('Grimpen Mire', 11),
 ('Huck Fi

In [226]:
count_2plus=[(name,count) for name,count in counter if count>1]

In [227]:
len(count_2plus)

334
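
As a quick sanity check of the counting and filtering steps, here is the same logic on a handful of made-up names (the list below is an assumption, not data from the books).

```python
from collections import Counter

# Hypothetical name list standing in for `names`.
toy_names = ["Buck Mulligan", "Aunt Chloe", "Buck Mulligan",
             "J J", "Aunt Chloe", "Buck Mulligan"]

# most_common() with no argument returns every (name, count) pair,
# sorted from the most to the least frequent.
toy_counter = Counter(toy_names).most_common()

# Keep only the names that occur more than once.
toy_2plus = [(name, count) for name, count in toy_counter if count > 1]
print(toy_2plus)  # [('Buck Mulligan', 3), ('Aunt Chloe', 2)]
```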

In [68]:
count_2plus

[('St Clare', 306),
 ('Buck Mulligan', 96),
 ('Aunt Chloe', 76),
 ('Martin Cunningham', 74),
 ('Masr George', 45),
 ('Ned Lambert', 44),
 ('Solomon Northup', 43),
 ('Tom Sawyer', 42),
 ('Mary Jane', 38),
 ('John Thornton', 38),
 ('Aunt Sally', 37),
 ('Uncle Tom', 37),
 ('Ben Dollard', 36),
 ('Myles Crawford', 35),
 ('John Eglinton', 32),
 ('Blazes Boylan', 28),
 ('Uncle Abram', 24),
 ('J J', 22),
 ('Sherlock Holmes', 21),
 ('Cissy Caffrey', 21),
 ('Henry Baskerville', 20),
 ('Nosey Flynn', 20),
 ('Mintus Northup', 19),
 ('Bob Doran', 19),
 ('Omadden Burke', 19),
 ('Davy Byrne', 19),
 ('Simon Dedalus', 18),
 ('Edwin Epps', 17),
 ('John Wyse', 17),
 ('Aunt Phebe', 16),
 ('Bantam Lyons', 16),
 ('Tom Rochford', 16),
 ('Edy Boardman', 16),
 ('Charles Baskerville', 15),
 ('Huck Finn', 15),
 ('Leopold Bloom', 15),
 ('Coombe Tracey', 14),
 ('Tom Kernan', 14),
 ('Paddy Dignam', 14),
 ('Richie Goulding', 14),
 ('Mr Power', 13),
 ('Paddy Leonard', 13),
 ('Mrs Breen', 13),
 ('David Widger', 12),
 

In [69]:
df.document

0        Project Gutenberg's The Hound of the Baskervil...
1        This eBook is for the use of anyone anywhere a...
2                                                         
3        You may copy it, give it away or re-use it und...
4        Gutenberg License included with this eBook or ...
                               ...                        
96694    Thus, we do not necessarily keep eBooks in com...
96695                                                     
96696    Most people start at our Web site which has th...
96697    This Web site includes information about Proje...
96698                                                     
Name: document, Length: 96699, dtype: object

In [71]:
t1=time.time()
t1-t0

1081.4753739833832

In [None]:
# Memory usage notes for the SpaCy run:
# start: 2.7 GB, cache 2, swap 0
# end:   5.4 GB, cache 2, swap 0, peak 7.1 GB

## 3. Spark NLP

In [235]:
t0=time.time()

In [236]:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline  # imported explicitly; used to build the pipeline in 3.16
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark.sql.functions import col
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import sparknlp
#spark = SparkSession.builder.getOrCreate()


In [237]:
spark=sparknlp.start(gpu=False)

In [238]:
#spark = SparkSession.builder \
#    .appName("Spark NLP")\
#    .master("local[6]")\
#    .config("spark.driver.memory","24G")\
#    .config("spark.driver.maxResultSize", "2G") \
#    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5")\
#    .config("spark.kryoserializer.buffer.max", "1500M")\
#    .getOrCreate()

In [239]:
#create or get Spark Session
#spark = sparknlp.start()
sparknlp.version()


'2.5.5'

In [240]:
lib=" ".join(lib)

In [241]:
spark_df = spark.createDataFrame([[lib]]).toDF("text")

In [242]:
spark_df.show()

+--------------------+
|                text|
+--------------------+
|
Project Gutenber...|
+--------------------+




### 3.1 Document Assembler
In Spark NLP, we have five different transformers that are mainly used for getting the data in or transforming the data from one AnnotatorType to another.

That is, the DataFrame you have needs to have a column from one of these types if that column will be fed into an annotator; otherwise, you’d need to use one of the Spark NLP transformers. Here is the list of transformers: DocumentAssembler, TokenAssembler, Doc2Chunk, Chunk2Doc, and the Finisher.

So, let’s start with DocumentAssembler(), an entry point to Spark NLP annotators.

To get through the process in Spark NLP, we need to get raw data transformed into Document type at first.

DocumentAssembler() is a special transformer that does this for us; it creates the first annotation of type Document which may be used by annotators down the road.

DocumentAssembler() comes from the sparknlp.base module and has the following settable parameters. The full list and the source code are available in the Spark NLP documentation and repository.

setInputCol() -> the name of the column that will be converted. We can specify only one column here. It can read either a String column or an Array[String]

setOutputCol() -> optional : the name of the column in Document type that is generated. We can specify only one column here. Default is ‘document’

setIdCol() -> optional: String type column with id information

setMetadataCol() -> optional: Map type column with metadata information

setCleanupMode() -> optional: Cleaning up options,

possible values:

disabled: source kept as original. This is the default.
inplace: removes new lines and tabs.
inplace_full: removes new lines and tabs, including those which were converted to strings (i.e. \n).
shrink: removes new lines and tabs, and merges multiple spaces and blank lines into a single space.
shrink_full: removes new lines and tabs, including stringified values, and shrinks spaces and blank lines.
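
To make the cleanup modes a little more concrete, here is a rough plain-Python approximation of what `shrink` is described as doing. It is only a sketch based on the description above, not Spark NLP's implementation; the helper name `shrink` and the trailing `strip()` are assumptions.

```python
import re

def shrink(text):
    # Hypothetical helper: replace every run of whitespace (new lines, tabs,
    # repeated spaces, blank lines) with a single space.
    return re.sub(r"\s+", " ", text).strip()

raw = "Project Gutenberg's\nThe Hound\tof the   Baskervilles\n\n"
cleaned = shrink(raw)
print(cleaned)  # Project Gutenberg's The Hound of the Baskervilles
```

This is why the `document` column in the output below reads as one continuous line even though the raw text is broken line by line.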

In [243]:
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
.setCleanupMode("shrink")

doc_df = documentAssembler.transform(spark_df)

doc_df.show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                                text|                                                                                            document|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|
Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle

This eBook is for the ...|[[document, 0, 4811701, Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle ...|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------

In [244]:
doc_df.select('document.result','document.begin','document.end').show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----+---------+
|                                                                                              result|begin|      end|
+----------------------------------------------------------------------------------------------------+-----+---------+
|[Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle This eBook is for the u...|  [0]|[4811701]|
+----------------------------------------------------------------------------------------------------+-----+---------+



In [245]:
doc_df.withColumn(
    "tmp", 
    F.explode("document"))\
    .select("tmp.*")\
    .show(truncate=300)

+-------------+-----+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------+
|annotatorType|begin|    end|                                                                                                                                                                                                                                                                                                      result|       metadata|embeddings|
+-------------+-----+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [246]:
doc_df.select("document.result").take(1)



### 3.2 Sentence Detector


Finds sentence bounds in raw text.

setCustomBounds(string): Custom sentence separator text e.g. ["\n"]

setUseCustomOnly(bool): Use only custom bounds without considering those of Pragmatic Segmenter. Defaults to false. Needs customBounds.

setUseAbbreviations(bool): Whether to consider abbreviation strategies for better accuracy but slower performance. Defaults to true.

setExplodeSentences(bool): Whether to split sentences into different Dataset rows. Useful for higher parallelism in fat rows. Defaults to false.
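
As an illustration of the custom-bounds idea only, the sketch below splits a text on user-supplied separators and nothing else, roughly what one would expect with custom bounds and useCustomOnly enabled. It is plain Python, not the Pragmatic Segmenter, and the function name and the literal (escaped) treatment of the separators are assumptions.

```python
import re

def split_on_custom_bounds(text, custom_bounds):
    # Hypothetical helper: treat each custom bound as a literal separator.
    pattern = "|".join(re.escape(bound) for bound in custom_bounds)
    parts = re.split(pattern, text)
    # Drop empty pieces and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]

text = "Chapter 1\n\nMr. Sherlock Holmes\n\nChapter 2"
pieces = split_on_custom_bounds(text, ["\n\n"])
print(pieces)  # ['Chapter 1', 'Mr. Sherlock Holmes', 'Chapter 2']
```

Because only the custom separator is used, the period in "Mr." does not end a sentence here.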

In [247]:
# we feed the document column coming from Document Assembler

sentenceDetector = SentenceDetector().setInputCols('document').setOutputCol('sentences').setExplodeSentences(True)

In [248]:
sent_df = sentenceDetector.transform(doc_df)



In [249]:
sent_df.select(sent_df.sentences.result[0]).take(1)

[Row(sentences.result[0]="Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.")]

In [250]:
library=sent_df.select(F.explode('sentences.result').alias('text'))#.show(truncate=100)

In [251]:
library.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle This eBook is for the us...|
|You may copy it, give it away or re-use it under the terms of the Project Gutenberg License inclu...|
|                                               For this and for your help in the details all thanks.|
|                                                                   Yours most truly, A. Conan Doyle.|
|                                                                                Hindhead, Haslemere.|
|Contents Chapter 1 Mr. Sherlock Holmes Chapter 2 The Curse of the Baskervilles Chapter 3 The Prob...|
|                                                                        

In [252]:
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
.setCleanupMode("shrink")

In [253]:
sentenceDetector = SentenceDetector().\
    setInputCols(['document']).\
    setOutputCol('sentences')

### 3.3 Tokenizer
Identifies tokens with tokenization open standards. It is an Annotator Approach, so it requires .fit().

A few rules help to customize it if the defaults do not fit the user's needs.

setExceptions(StringArray): List of tokens to not alter at all. Allows composite tokens like two worded tokens that the user may not want to split.

addException(String): Add a single exception

setExceptionsPath(String): Path to txt file with list of token exceptions

caseSensitiveExceptions(bool): Whether to follow case sensitiveness for matching exceptions in text

contextChars(StringArray): List of 1 character string to rip off from tokens, such as parenthesis or question marks. Ignored if using prefix, infix or suffix patterns.

splitChars(StringArray): List of 1 character string to split tokens inside, such as hyphens. Ignored if using infix, prefix or suffix patterns.

splitPattern (String): Pattern to separate from the inside of tokens. Takes priority over splitChars.

setTargetPattern: Basic regex rule to identify a candidate for tokenization. Defaults to \S+, which means anything not a space.

setSuffixPattern: Regex to identify subtokens that are at the end of the token. Regex has to end with \z and must contain groups (). Each group will become a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses.

setPrefixPattern: Regex to identify subtokens that come in the beginning of the token. Regex has to start with \A and must contain groups (). Each group will become a separate token within the prefix. Defaults to non-letter characters. e.g. quotes or parenthesis

addInfixPattern: Add an extension pattern regex with groups to the top of the rules (will target first, from more specific to the more general).

minLength: Set the minimum allowed length for each token

maxLength: Set the maximum allowed length for each token
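
The interplay between the target pattern and the context characters can be sketched in plain Python. This is a simplified illustration under assumed defaults, not the Spark NLP Tokenizer itself (which has further rules, for example for exceptions); `simple_tokenize` and its `context_chars` default are hypothetical.

```python
import re

def simple_tokenize(text, context_chars=(".", ",", "!", "?", "(", ")", '"')):
    tokens = []
    # Target pattern \S+ : every run of non-space characters is a candidate.
    for candidate in re.findall(r"\S+", text):
        prefix, suffix = [], []
        # Peel context characters off the front of the candidate...
        while candidate and candidate[0] in context_chars:
            prefix.append(candidate[0])
            candidate = candidate[1:]
        # ...and off the end, keeping their original order.
        while candidate and candidate[-1] in context_chars:
            suffix.insert(0, candidate[-1])
            candidate = candidate[:-1]
        tokens.extend(prefix)
        if candidate:
            tokens.append(candidate)
        tokens.extend(suffix)
    return tokens

demo_tokens = simple_tokenize("Yours most truly, (A. Conan Doyle).")
print(demo_tokens)
# ['Yours', 'most', 'truly', ',', '(', 'A', '.', 'Conan', 'Doyle', ')', '.']
```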

In [254]:
tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

### 3.4 NGram Generator
NGramGenerator annotator takes as input a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Stemmer, Lemmatizer, and StopWordsCleaner).

The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words with annotatorType CHUNK same as the Chunker annotator.

Functions:

setN: number of elements per n-gram (>=1)

setEnableCumulative: whether to calculate just the actual n-grams or all n-grams from 1 through n

setDelimiter: Glue character used to join the tokens
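
The three parameters can be illustrated with a few lines of plain Python. This is a sketch of the idea rather than the annotator's code; `make_ngrams` is a hypothetical helper, and the ordering of the cumulative output is an assumption.

```python
def make_ngrams(tokens, n, cumulative=False, delimiter=" "):
    # With cumulative=True, produce all n-grams from size 1 up to n;
    # otherwise only n-grams of exactly size n.
    sizes = range(1, n + 1) if cumulative else [n]
    result = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            result.append(delimiter.join(tokens[start:start + size]))
    return result

words = ["This", "eBook", "is", "for"]

trigrams = make_ngrams(words, 3, delimiter="_")
print(trigrams)       # ['This_eBook_is', 'eBook_is_for']

up_to_bigrams = make_ngrams(words, 2, cumulative=True)
print(up_to_bigrams)  # ['This', 'eBook', 'is', 'for', 'This eBook', 'eBook is', 'is for']
```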

In [255]:
ngrams = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("ngrams") \
            .setN(3) \
            .setEnableCumulative(False)\
            .setDelimiter("_") # Default is space

### 3.5 Normalizer
Text cleaning
Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary
Output type: Token
Input types: Token
Reference: Normalizer | NormalizerModel
Functions:

setCleanupPatterns(patterns): Regular expressions list for normalization, defaults [^A-Za-z]
setLowercase(value): lowercase tokens, default true
setSlangDictionary(path): txt file with delimited words to be transformed into something else
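
Here is a minimal plain-Python sketch of what applying cleanup patterns to tokens amounts to, using the default pattern [^A-Za-z] mentioned above. The helper `normalize_tokens` is hypothetical, and dropping tokens that become empty is an assumption about the behaviour.

```python
import re

def normalize_tokens(tokens, cleanup_patterns=(r"[^A-Za-z]",), lowercase=True):
    result = []
    for token in tokens:
        # Remove every character matched by any cleanup pattern.
        for pattern in cleanup_patterns:
            token = re.sub(pattern, "", token)
        if lowercase:
            token = token.lower()
        if token:                      # skip tokens that were cleaned away entirely
            result.append(token)
    return result

normalized_demo = normalize_tokens(["Gutenberg", "'s", "re-use", ",", "2008"])
print(normalized_demo)  # ['gutenberg', 's', 'reuse']
```

The cell below keeps case and sentence punctuation by passing `setLowercase(False)` and a more permissive cleanup pattern.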

In [256]:
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")\
    .setLowercase(False)\
    .setCleanupPatterns(["[^\w\d\s\.\!\?]"])

### 3.6 Stopwords Cleaner
This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from it.

Functions:

setStopWords: The words to be filtered out. Array[String]

setCaseSensitive: Whether to do a case sensitive comparison over the stop words.
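
The effect of the case-sensitivity flag is easy to show in plain Python. The helper below is a hypothetical sketch with a made-up stop word list, not the annotator's implementation.

```python
def remove_stopwords(tokens, stop_words, case_sensitive=False):
    if case_sensitive:
        stop = set(stop_words)
        return [t for t in tokens if t not in stop]
    # Case-insensitive: compare lowercased forms on both sides.
    stop = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop]

sample_tokens = ["This", "eBook", "is", "for", "the", "use", "of", "anyone"]
sample_stop_words = ["this", "is", "for", "the", "of"]   # hypothetical list

kept = remove_stopwords(sample_tokens, sample_stop_words)
print(kept)      # ['eBook', 'use', 'anyone']

kept_cs = remove_stopwords(sample_tokens, sample_stop_words, case_sensitive=True)
print(kept_cs)   # ['This', 'eBook', 'use', 'anyone']
```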

In [257]:
stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

### 3.7 Lemmatizer
Retrieves lemmas out of words with the objective of returning a base dictionary word

In [258]:
lemma = LemmatizerModel.pretrained('lemma_antbnc') \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("lemma")

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


### 3.8 Stemmer
Returns hard-stems out of words with the objective of retrieving the meaningful part of the word

In [259]:
stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

### 3.9 POSTagger
Part of speech tagger
Sets a POS tag to each word within a sentence. Its train data (train_pos) is a spark dataset of POS format values with Annotation columns.
Output type: POS
Input types: Document, Token
Reference: PerceptronApproach | PerceptronModel
Functions:

setNIterations(number): Number of iterations for training. May improve accuracy but takes longer. Default 5.
setPosColumn(colname): Column containing an array of POS Tags matching every token on the line.

In [260]:
pos = PerceptronModel.pretrained("pos_anc", 'en')\
      .setInputCols("clean_text", "cleanTokens")\
      .setOutputCol("pos")

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]


### 3.10 Chunker

Meaningful phrase matching

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from a document

Output type: Chunk
Input types: Document, POS
Reference: Chunker
Functions:

setRegexParsers(patterns): A list of regex patterns to match chunks, for example: Array("<DT>?<JJ>*<NN>")
addRegexParser(patterns): adds a pattern to the current list of chunk patterns, for example: "<DT>?<JJ>*<NN>"
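
To show how a tag pattern such as <DT>?<JJ>*<NN> selects phrases, here is a plain-Python sketch that runs the pattern over a sequence of POS tags and maps the matches back to tokens. It illustrates the idea only and is not how the Chunker is implemented; the helper name and the example sentence with its tags are assumptions.

```python
import re

def match_pos_chunks(tokens, tags, pattern):
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for tag in tags)
    # Remember which token each tag's character offset belongs to.
    starts, offset = {}, 0
    for index, tag in enumerate(tags):
        starts[offset] = index
        offset += len(tag) + 2
    # Wrap every <TAG> in a non-capturing group so ?, * and + apply to whole tags.
    regex = re.compile(pattern.replace("<", "(?:<").replace(">", ">)"))
    chunks = []
    for match in regex.finditer(tag_string):
        if not match.group(0):
            continue
        first = starts[match.start()]
        count = match.group(0).count("<")      # number of tags in the match
        chunks.append(" ".join(tokens[first:first + count]))
    return chunks

words = ["the", "old", "house", "stood", "on", "a", "hill"]
pos_tags = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN"]   # hand-written tags

phrases = match_pos_chunks(words, pos_tags, "<DT>?<JJ>*<NN>")
print(phrases)  # ['the old house', 'a hill']
```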

In [261]:
chunker = Chunker()\
    .setInputCols(["sentences", "pos"])\
    .setOutputCol("chunk")\
    .setRegexParsers(["<DT>+<JJ>*<NN>"])  ## Determiner - adjective - singular noun

### 3.11 TokenAssembler: Getting data reshaped
This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.

Settable parameters are:

setInputCols(inputs:Array(String))
setOutputCol(output:String)
setPreservePosition(preservePosition:bool): Whether to preserve the actual position of the tokens or reduce them to one space

In [262]:
tokenassembler = TokenAssembler()\
    .setInputCols(["sentences", "cleanTokens"]) \
    .setOutputCol("clean_text")\
 #   .setPreservePosition(True)

In [263]:
tokenizer2 = Tokenizer() \
    .setInputCols(["clean_text"]) \
    .setOutputCol("token2")

### 3.12 WordEmbeddings
Word Embeddings lookup annotator that maps tokens to vectors

Output type: Word_Embeddings

Input types: Document, Token

Reference: WordEmbeddings | WordEmbeddingsModel
Functions:

setStoragePath(path, format): sets word embeddings options.
path: word embeddings file
format: format of word embeddings files:
TEXT -> This format is usually used by Glove
BINARY -> This format is usually used by Word2Vec
setCaseSensitive: whether to ignore case in tokens for embeddings matching

In [264]:
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
          .setInputCols(["document", "lemma"])\
          .setOutputCol("embeddings")\
          .setCaseSensitive(False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


### 3.13 NER DL
Named Entity Recognition Deep Learning annotator
This Named Entity Recognition annotator allows training a generic NER model based on neural networks. Its training data (train_ner) is either a labeled or an external CoNLL 2003 IOB based Spark dataset with Annotations columns. The user also has to provide a word embeddings annotation column.
Neural Network architecture is Char CNNs - BiLSTM - CRF that achieves state-of-the-art in most datasets.
Output type: Named_Entity
Input types: Document, Token, Word_Embeddings
Reference: NerDLApproach | NerDLModel
Functions:

setLabelColumn: If DatasetPath is not provided, this Seq[Annotation] type of column should have labeled data per token.
setMaxEpochs: Maximum number of epochs to train.
setLr: Initial learning rate.
setPo: Learning rate decay coefficient. Real Learning Rate: lr / (1 + po * epoch).
setBatchSize: Batch size for training.
setDropout: Dropout coefficient.
setVerbose: Verbosity level.
setRandomSeed: Random seed.
setOutputLogsPath: Folder path to save training logs.
Note: Please check here in case you get an IllegalArgumentException error with a description such as: Graph [parameter] should be [value]: Could not find a suitable tensorflow graph for embeddings dim: [value] tags: [value] nChars: [value]. Generate graph by python code in python/tensorflow/ner/create_models before usage and use setGraphFolder Param to point to output.
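
The decay formula given for setPo, lr / (1 + po * epoch), can be worked through with a couple of numbers. The values of `lr` and `po` below are arbitrary illustration values, and starting the epoch count at 0 is an assumption.

```python
def decayed_lr(lr, po, epoch):
    # Real learning rate as described above: lr / (1 + po * epoch)
    return lr / (1 + po * epoch)

lr, po = 0.1, 0.5   # hypothetical values, chosen to make the decay visible
schedule = [decayed_lr(lr, po, epoch) for epoch in range(3)]
print(schedule)     # first value is 0.1, then the rate shrinks each epoch
```

With these numbers the rate falls from 0.1 to roughly 0.067 and then 0.05, so a larger `po` makes the rate drop faster.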

In [265]:
onto_ner = NerDLModel.pretrained("onto_100", 'en') \
          .setInputCols(["document", "token", "embeddings"]) \
          .setOutputCol("ner")

onto_100 download started this may take some time.
Approximate size to download 13.5 MB
[OK!]


### 3.14 NER Converter
Converts an IOB or IOB2 representation of NER output into a user-friendly one.
NER Converter is used to finalize the work of NER annotators. It combines entities with types B-, I-, etc. into Chunks with the named entity in the metadata field (if LightPipeline is used, it can be extracted after fullAnnotate()). This converter can be used to turn the output of a NER model into the NER chunk format expected by the DeepSentenceDetector annotator.

Output type: Chunk
Input types: Document, Token, Named_Entity
Reference: NerConverter
Functions:

setWhiteList(Array(String)): If defined, list of entities to process. The rest will be ignored. Do not include IOB prefix on labels.
setPreservePosition(Boolean): Whether to preserve the original position of the tokens in the original document or use the modified tokens.
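
The conversion from token-level IOB tags to entity chunks can be sketched in plain Python as follows. This is an illustration of the idea, not the NerConverter source; the helper name and the example tokens and tags are made up.

```python
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            if current:                       # close the chunk in progress
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):            # continue the current chunk
            current.append(token)
        else:                                 # "O": outside any entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

iob_tokens = ["Mr.", "Sherlock", "Holmes", "lived", "in", "London", "."]
iob_tags = ["O", "B-PERSON", "I-PERSON", "O", "O", "B-GPE", "O"]

entity_chunks = iob_to_chunks(iob_tokens, iob_tags)
print(entity_chunks)  # [('Sherlock Holmes', 'PERSON'), ('London', 'GPE')]
```

The resulting (text, label) pairs have the same shape as the `ner_chunks` column built in the SpaCy part of this notebook.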

In [266]:
ner_converter = NerConverter() \
  .setInputCols(["sentences", "token", "ner"]) \
  .setOutputCol("ner_chunk")

### 3.15 TextMatcher
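Annotator to match entire phrases (by token) provided in a file against a Document. Note that the code cell at the end of this section uses RegexMatcher, which matches regex rules loaded from an external file, rather than TextMatcher.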

Functions:

setEntities(path, format, options): Provides a file with phrases to match. Default: Looks up path in configuration.

path: a path to a file that contains the entities in the specified format.

readAs: the format of the file, can be one of {ReadAs.LINE_BY_LINE, ReadAs.SPARK_DATASET}. Defaults to LINE_BY_LINE.

options: a map of additional parameters. Defaults to {“format”: “text”}.

entityValue : Value for the entity metadata field to indicate which chunk comes from which textMatcher when there are multiple textMatchers.

mergeOverlapping : whether to merge overlapping matched chunks. Defaults false

caseSensitive : whether to match regardless of case. Defaults true

In [267]:
rules = r'''
\b[A-Z]\w+ly\b, starting with a capital letter ending with 'ly'
Stephen\s(?!Proto|Cardinal)[A-Z]\w+, followed by "Stephen"
Simon\s[A-Z]\w+, followed by "Simon"
'''

with open('ulyses_regex_rules.txt', 'w') as f:
    
    f.write(rules)

regex_matcher = RegexMatcher()\
    .setInputCols('sentences')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./ulyses_regex_rules.txt', delimiter=',')
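
Before running the full pipeline, the three rules can be sanity-checked with Python's built-in `re` module. This is only a sketch: it assumes the patterns behave the same in Python as in the Java regex engine that Spark NLP uses (true for these simple patterns, as far as I know), and the rule names and the `match_all` helper are made up for the example.

```python
import re

# The same three patterns as in the rules file, keyed by a short made-up name.
rules = {
    "ends_with_ly": r"\b[A-Z]\w+ly\b",
    "after_stephen": r"Stephen\s(?!Proto|Cardinal)[A-Z]\w+",
    "after_simon": r"Simon\s[A-Z]\w+",
}


def match_all(text):
    """Return every match of every rule, loosely mimicking the MATCH_ALL strategy."""
    return [
        (name, m.group(0))
        for name, pattern in rules.items()
        for m in re.finditer(pattern, text)
    ]


adverb_hits = match_all("Perfectly sound! said Holmes.")
name_hits = match_all("Aunt Sally came in July")
dedalus_hits = match_all("Stephen Dedalus met Simon Dedalus")
blocked_hits = match_all("Stephen Protomartyr")
```

The second call shows that the first rule is broad: any capitalized word ending in "ly", such as "Sally" or "July", is matched, not only adverbs.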

### 3.16 Spark NLP Pipeline



Stacking Spark NLP Annotators in Spark ML Pipeline
Spark NLP provides an easy API to integrate with Spark ML Pipelines, and all Spark NLP annotators and transformers can be used within them. The Pipeline concept is therefore best explained through the official Spark ML documentation.

What is a Pipeline anyway? In machine learning, it is common to run a sequence of algorithms to process and learn from data.

Apache Spark ML represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order.

In simple terms, a Pipeline chains multiple Transformers and Estimators together to specify a machine learning workflow.

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage: each stage’s transform() method updates the dataset and passes it to the next stage. With the help of Pipelines, we can ensure that training and test data go through identical feature processing steps.
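
The stage-chaining idea can be sketched in a few lines of plain Python. This is a toy illustration only, not Spark ML's classes: every class name below is invented for the example, and lists of strings stand in for DataFrames.

```python
class Lowercase:
    """Transformer: needs no training, it just rewrites each row."""
    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyEstimator:
    """Estimator: fit() learns a vocabulary and returns a Transformer."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabularyIndexer(vocab)


class VocabularyIndexer:
    """Transformer produced by VocabularyEstimator.fit()."""
    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}

    def transform(self, rows):
        # Unknown words map to -1.
        return [[self.index.get(w, -1) for w in r.split()] for r in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)       # Estimator -> fitted Transformer
            rows = stage.transform(rows)      # pass transformed data to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(
    ["Call me Ishmael", "Call me back"]
)
encoded = model.transform(["Call ME now"])
```

Because the fitted model replays exactly the same stages, new text goes through the same lowercasing and the same vocabulary as the training text, which is the "identical feature processing" guarantee mentioned above.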

Now let’s see how this can be done in Spark NLP using Annotators and Transformers. Assume that we have steps such as the following that need to be applied one by one on a data frame:

Split text into sentences
Tokenize

And here is how we code this pipeline up in Spark NLP.

In [268]:
nlpPipeline = Pipeline(stages=[
     documentAssembler,
     sentenceDetector,
     tokenizer,
     ngrams,
     normalizer,
     stopwords_cleaner,
     lemma,
     stemmer,
     tokenassembler,
     tokenizer2,
     pos,
     chunker,
     glove_embeddings,
     onto_ner,
     ner_converter,
     regex_matcher
    
 ])


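# None of the stages needs to learn from `library` (they are pretrained or rule-based),
# so fitting on an empty DataFrame is enough to obtain a PipelineModel.
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)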



In [269]:
lib_result = pipelineModel.transform(library)

lib_result.show(truncate=10)

+----------+----------+----------+----------+----------+----------+-----------+----------+----------+----------+----------+----------+-----+----------+----------+----------+-------------+
|      text|  document| sentences|     token|    ngrams|normalized|cleanTokens|     lemma|      stem|clean_text|    token2|       pos|chunk|embeddings|       ner| ner_chunk|regex_matches|
+----------+----------+----------+----------+----------+----------+-----------+----------+----------+----------+----------+----------+-----+----------+----------+----------+-------------+
|Project...|[[docum...|[[docum...|[[token...|[[chunk...|[[token...| [[token...|[[token...|[[token...|[[docum...|[[token...|[[pos, ...|   []|[[word_...|[[named...|[[chunk...|           []|
|You may...|[[docum...|[[docum...|[[token...|[[chunk...|[[token...| [[token...|[[token...|[[token...|[[docum...|[[token...|[[pos, ...|   []|[[word_...|[[named...|[[chunk...|   [[chunk...|
|For thi...|[[docum...|[[docum...|[[token...|[[chunk...|[[to

In [39]:
lib_result.select('text', 'clean_text.result').show(20,truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                                text|                                                                                              result|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle This eBook is for the us...|[Project Gutenbergs Hound Baskervilles Arthur Conan Doyle eBook use anyone anywhere cost almost r...|
|You may copy it, give it away or re-use it under the terms of the Project Gutenberg License inclu...|[may copy give away reuse terms Project Gutenberg License included eBook online ww

In [231]:
lib_result.withColumn(
    "tmp", 
    F.explode("chunk")) \
    .select("tmp.*").select("begin","end","result","metadata.sentence").show(20,truncate = 100)

+-----+---+----------------------------------+--------+
|begin|end|                            result|sentence|
+-----+---+----------------------------------+--------+
|   41| 51|                       every night|       0|
|    0| 12|                     Another point|       0|
|  142|155|                    every particle|       0|
|    0| 10|                       “Every inch|       0|
|    0| 11|                      Another item|       0|
|    3| 12|                        every turn|       0|
|   52| 61|                        every road|       0|
|   28| 40|                     every variety|       0|
|   31| 41|                       every nerve|       0|
|   32| 41|                        every part|       0|
|   20| 53|another of those miserable ponies!|       0|
|   32| 45|                    another member|       0|
|    3| 11|                         every way|       0|
|   32| 42|                       every night|       0|
|   55| 68|                    another glance|  

In [41]:
lib_result.select('clean_text.result', 'cleanTokens.result').show(50,truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                              result|                                                                                              result|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|[Project Gutenbergs Hound Baskervilles Arthur Conan Doyle eBook use anyone anywhere cost almost r...|[Project, Gutenbergs, Hound, Baskervilles, Arthur, Conan, Doyle, eBook, use, anyone, anywhere, co...|
|[may copy give away reuse terms Project Gutenberg License included eBook online www.gutenberg.org...|[may, copy, give, away, reuse, terms, Project, Gutenberg, License, included, eBook

### NerDL OntoNotes 100D
Entities

'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART'

In [42]:
lib_result.select(F.explode(F.arrays_zip('cleanTokens.result', 'ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ner_label")).show(20,truncate = 100)


+-----------------+-------------+
|            token|    ner_label|
+-----------------+-------------+
|          Project|            O|
|       Gutenbergs|B-WORK_OF_ART|
|            Hound|I-WORK_OF_ART|
|     Baskervilles|I-WORK_OF_ART|
|           Arthur|I-WORK_OF_ART|
|            Conan|I-WORK_OF_ART|
|            Doyle|I-WORK_OF_ART|
|            eBook|I-WORK_OF_ART|
|              use|            O|
|           anyone|            O|
|         anywhere|            O|
|             cost|            O|
|           almost|            O|
|     restrictions|            O|
|       whatsoever|            O|
|                .|            O|
|              may|            O|
|             copy|            O|
|             give|            O|
|             away|            O|
|            reuse|            O|
|            terms|            O|
|          Project|B-WORK_OF_ART|
|        Gutenberg|I-WORK_OF_ART|
|          License|I-WORK_OF_ART|
|         included|            O|
|            e

In [43]:
result_ner=lib_result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("ner_chunk"),
        F.expr("cols['1']['entity']").alias("ner_label"))

In [44]:
result_ner.show(10)

+--------------------+-----------+
|           ner_chunk|  ner_label|
+--------------------+-----------+
|         Gutenberg's|WORK_OF_ART|
|               Hound|WORK_OF_ART|
|        Baskervilles|WORK_OF_ART|
|  Arthur Conan Doyle|WORK_OF_ART|
|               eBook|WORK_OF_ART|
|Project Gutenberg...|WORK_OF_ART|
|               eBook|WORK_OF_ART|
|               Title|WORK_OF_ART|
|               Hound|WORK_OF_ART|
| Baskervilles Author|WORK_OF_ART|
+--------------------+-----------+
only showing top 10 rows



In [45]:
result_ner.filter(result_ner.ner_label == "PERSON").take(2)

[Row(ner_chunk='Conan Doyle', ner_label='PERSON'),
 Row(ner_chunk='Robinson', ner_label='PERSON')]

### Let's Venture Into The Characters...Spark NLP Way.

Let's examine the characters in the book, this time using Spark NLP. Note the differences in accuracy compared with SpaCy.

In [46]:
l = result_ner.filter(result_ner.ner_label == "PERSON").select(F.expr("ner_chunk")).collect()

In [47]:
# Keep only two-word, letters-only PERSON chunks and title-case them.
names = [re.sub(r"[^a-zA-Z\s']", "", l_[0]).title() for l_ in l
         if l_[0].replace(' ', '').isalpha() and len(l_[0].strip().split()) == 2 and "’" not in l_[0]]

In [48]:
# Always empty: a two-word string contains a space, so isalpha() on the whole string is False.
set([l_[0] for l_ in l if l_[0].strip().isalpha() and len(l_[0].strip().split())==2])

set()
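
The two filters above differ only in how they treat the space between the words, which is why one returns names and the other an empty set. A small sketch with made-up sample strings (the helper `keep_name` is hypothetical and simply restates the condition used for `names`):

```python
import re


def keep_name(chunk):
    """Same condition as in the `names` comprehension: two words, letters only."""
    return (
        chunk.replace(" ", "").isalpha()
        and len(chunk.strip().split()) == 2
        and "’" not in chunk
    )


samples = ["Buck Mulligan", "buck MULLIGAN", "Holmes", "Mr. Bloom", "Aunt Sally"]

# Title-casing folds differently capitalized variants into one spelling.
sample_names = [re.sub(r"[^a-zA-Z\s']", "", s).title() for s in samples if keep_name(s)]

# isalpha() on the whole string fails as soon as there is a space.
whole_string_is_alpha = "Buck Mulligan".strip().isalpha()
```

Removing the spaces before calling `isalpha()` is what lets two-word names through; testing the whole string rejects every one of them.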

The number of names in the book looks more accurate.

In [49]:
len(set(names))

1284

In [50]:
set(names)

{'Edward Fitzgerald',
 'Arthur Chace',
 'East Lynne',
 'Harry Hughes',
 'Cousin Stephen',
 'William Delany',
 'Mr Love',
 'Glory Allelujurum',
 'Cormac Macart',
 'Mrs Talboys',
 'Ape Swillale',
 'Timothy Harrington',
 'Lambert Simnel',
 'Brian Confucius',
 'Bernard Corrigan',
 'Pooles Myriorama',
 'Bandmann Palmer',
 'Georgina Johnson',
 'Samuel F',
 'Arthur Wellesley',
 'Widow Douglas',
 'Galway Lynches',
 'Barney Kiernan',
 'Samuel Clemens',
 'George Harris',
 'Mr Geo',
 'Martha P',
 'Massa Ford',
 'Father Cowley',
 'Tim Kelly',
 'Louis Veuillot',
 'Baby Boardman',
 'Joking Jesus',
 'Stark Ruth',
 'Tommy Barnes',
 'Uncle Alfred',
 'Baskerville Hall',
 'Xxiii Henrique',
 'Beau Mount',
 'Irving Bishop',
 'Joe Maas',
 'Lei Si',
 'Val Dillon',
 'Lotty Clarke',
 'Thereupon Tibeats',
 'Thomas Scott',
 'Baton Rouge',
 'Unfallen Adam',
 'Theobald Mathew',
 'Mrs Moll',
 'Dr Horne',
 'Constance Louisa',
 'Richie Poldy',
 'Madcap Ciss',
 'Mortimer Edward',
 'Doctor Diet',
 'Michael Cross',
 'Th

In [51]:
from collections import Counter

In [52]:
counter=Counter(names).most_common(350)   # Spark NLP

In [270]:
counter

[('Buck Mulligan', 93),
 ('Aunt Chloe', 82),
 ('Martin Cunningham', 71),
 ('Bayou Boeuf', 48),
 ('Aunt Sally', 39),
 ('Ned Lambert', 39),
 ('Mary Jane', 38),
 ('Solomon Northup', 36),
 ('John Thornton', 34),
 ('Myles Crawford', 33),
 ('Ben Dollard', 31),
 ('Sherlock Holmes', 30),
 ('Tom Sawyer', 30),
 ('John Eglinton', 29),
 ('Nosey Flynn', 28),
 ('Corny Kelleher', 27),
 ('Mrs Breen', 27),
 ('Father Conmee', 26),
 ('Uncle Tom', 25),
 ('John Wyse', 24),
 ('Henry Baskerville', 23),
 ('Uncle Abram', 22),
 ('Blazes Boylan', 19),
 ('Bob Doran', 18),
 ('Davy Byrne', 18),
 ('Coombe Tracey', 17),
 ('Aunt Phebe', 17),
 ('Simon Dedalus', 17),
 ('Cissy Caffrey', 17),
 ('Edy Boardman', 16),
 ('Mintus Northup', 15),
 ('Edwin Epps', 14),
 ('Mr Bloom', 14),
 ('Tom Rochford', 14),
 ('Tom Kernan', 13),
 ('Paddy Leonard', 13),
 ('Charles Baskerville', 12),
 ('Uncle Silas', 12),
 ('Master Ford', 12),
 ('Anne Northup', 12),
 ('Richie Goulding', 12),
 ('Paddy Dignam', 12),
 ('Grimpen Mire', 11),
 ('Huck Fi

In [53]:
count_2plus=[(name,count) for name,count in counter if count>1]

In [54]:
len(count_2plus)

334
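
Note the order of operations in the cells above: `most_common(350)` first caps the list at the 350 most frequent names, and only then are names seen once filtered out. A tiny sketch with invented sample data shows the same two steps:

```python
from collections import Counter

sample = [
    "Buck Mulligan", "Aunt Chloe", "Buck Mulligan",
    "Tom Sawyer", "Aunt Chloe", "Buck Mulligan",
]

# Step 1: keep the top 2 names as (name, count) pairs, most frequent first.
top = Counter(sample).most_common(2)

# Step 2: drop names that occur only once.
repeated = [(name, count) for name, count in top if count > 1]
```

If more names than the cap occur at least twice, the cap silently drops some of them, so it is worth checking that the filtered list is shorter than the cap, as it is here (334 < 350).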

In [55]:
t1=time.time()

In [56]:
t1-t0

580.898642539978

In [None]:
start: memory 2.6 GB, cache 2, swap 0
peak: 6.2
end: memory 4.6 GB, cache 2.9, swap

580 sec -- 9 min 40 sec

In [None]:
start: memory 2.7 GB, cache 2, swap 0
peak: 7.1 GB
end: memory 5.4 GB, cache 2, swap 0

1081 sec -- 18 min 1 sec

In [57]:

lib_result.select('lemma.result').show(truncate=80)

+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[Project, Gutenbergs, Hound, Baskervilles, Arthur, Conan, Doyle, eBook, use, ...|
|[may, copy, give, away, reuse, term, Project, Gutenberg, License, include, eB...|
|                                                        [help, detail, thank, .]|
|                                                     [truly, ., Conan, Doyle, .]|
|                                                        [Hindhead, Haslemere, .]|
|[Contents, Chapter, 1, Mr, ., Sherlock, Holmes, Chapter, 2, Curse, Baskervill...|
|                                                                   [1, ., Mr, .]|
|[Sherlock, Holmes, Mr, ., Sherlock, Holmes, usually, late, morning, save, upo...|
|         [stand, upon, hearthrug, pick, stick, visitor, leave, behind, night, .]|
|   

In [58]:

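# Caution: judging by the output, 'lemma' was computed on the stop-word-filtered tokens, so its
# array is shorter than 'token'; arrays_zip pairs by position, so the lemma column is shifted.
result_df = lib_result.select(F.explode(F.arrays_zip('token.result', 'stem.result',  'lemma.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("stem"),
        F.expr("cols['2']").alias("lemma")).toPandas()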

result_df.head(10)

          token        stem         lemma
0       Project     project       Project
1   Gutenberg's  gutenberg'    Gutenbergs
2           The         the         Hound
3         Hound       hound  Baskervilles
4            of          of        Arthur
5           the         the         Conan
6  Baskervilles   baskervil         Doyle
7             ,           ,         eBook
8            by          by           use
9        Arthur      arthur        anyone
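
In the table above, the lemma column is shifted relative to the token column (for example, `The` sits next to `Hound`). A plausible explanation, assuming the lemmatizer ran on the stop-word-filtered tokens, is that the zipped arrays have different lengths and are paired purely by position. The plain-Python sketch below reproduces the effect with a made-up stop-word list; `zip_longest` stands in for `arrays_zip`, which pads shorter arrays with nulls.

```python
from itertools import zip_longest

tokens = ["The", "Hound", "of", "the", "Baskervilles"]
stop_words = {"the", "of"}  # made-up list for the example

# Stop-word removal shortens the sequence.
clean = [t for t in tokens if t.lower() not in stop_words]

# Positional pairing of arrays with different lengths: rows no longer line up.
misaligned = list(zip_longest(tokens, clean))

# Pairing arrays derived from the same sequence keeps rows aligned.
aligned = list(zip(clean, [t.lower() for t in clean]))
```

To compare tokens, stems and lemmas row by row, zip only columns that were computed from the same input column.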


In [59]:
lib_result.select('text','clean_text').show(100,truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                                text|                                                                                          clean_text|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle This eBook is for the us...|[[document, 0, 118, Project Gutenbergs Hound Baskervilles Arthur Conan Doyle eBook use anyone any...|
|You may copy it, give it away or re-use it under the terms of the Project Gutenberg License inclu...|[[document, 0, 476, may copy give away reuse terms Project Gutenberg License inclu

In [60]:
lib_result.select('regex_matches.result').alias('result').filter(F.size('result')>0).show(10)

+-----------+
|     result|
+-----------+
|     [July]|
|[Perfectly]|
|   [Really]|
|    [Apply]|
|[Obviously]|
|     [Only]|
|    [Early]|
|     [Holy]|
|  [Exactly]|
|     [Only]|
+-----------+
only showing top 10 rows



In [61]:
lib_result.select('sentences.result','regex_matches.result')\
.toDF('sentences','matches').filter(F.size('matches')==1)\
.show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------+
|                                                                                           sentences|    matches|
+----------------------------------------------------------------------------------------------------+-----------+
|[You may copy it, give it away or re-use it under the terms of the Project Gutenberg License incl...|     [July]|
|                                                                 [” “Perfectly sound!” said Holmes.]|[Perfectly]|
|[” “Really, Watson, you excel yourself,” said Holmes, pushing back his chair and lighting a cigar...|   [Really]|
|[Apply them!” “I can only think of the obvious conclusion that the man has practised in town befo...|    [Apply]|
|[Obviously at the moment when Dr. Mortimer withdrew from the service of the hospital in order to ...|[Obviously]|
|                          [“Why was it bad?” “Only that you have disarranged ou

In [62]:
lib_result.filter(F.size('regex_matches.result')>1).show(5,truncate = 10)

+----------+----------+----------+----------+----------+----------+-----------+----------+----------+----------+----------+----------+-----+----------+----------+----------+-------------+
|      text|  document| sentences|     token|    ngrams|normalized|cleanTokens|     lemma|      stem|clean_text|    token2|       pos|chunk|embeddings|       ner| ner_chunk|regex_matches|
+----------+----------+----------+----------+----------+----------+-----------+----------+----------+----------+----------+----------+-----+----------+----------+----------+-------------+
|Hotchki...|[[docum...|[[docum...|[[token...|[[chunk...|[[token...| [[token...|[[token...|[[token...|[[docum...|[[token...|[[pos, ...|   []|[[word_...|[[named...|[[chunk...|   [[chunk...|
| Aunt P...|[[docum...|[[docum...|[[token...|[[chunk...|[[token...| [[token...|[[token...|[[token...|[[docum...|[[token...|[[pos, ...|   []|[[word_...|[[named...|[[chunk...|   [[chunk...|
| So the...|[[docum...|[[docum...|[[token...|[[chunk...|[[to

### 3.17 Finisher
Once the NLP pipeline is ready, we often want to use the annotation results somewhere else, in a form that is easy to consume. The Finisher outputs annotation values as strings.

If we only want the desired output column in the final DataFrame, we can use Finisher to drop the columns from previous stages and keep just the result.

This is very handy when the output of a Spark NLP annotator serves as input to another Spark ML transformer.

Settable parameters are:

setInputCols()

setOutputCols()

setCleanAnnotations(True) -> Whether to remove intermediate annotations

setValueSplitSymbol("#") -> Character used to split values within an annotation

setAnnotationSplitSymbol("@") -> Character used to split values between annotations

setIncludeMetadata(False) -> Whether to include metadata keys. Sometimes useful in some annotations.

setOutputAsArray(False) -> Whether to output as Array. Useful as input for other Spark transformers.
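
As a rough mental model, the Finisher reduces each annotation record to its plain result string. The sketch below is a toy in plain Python, not the library's code: the function `finish`, its arguments, and the dictionary layout of the annotations are all assumptions made for illustration.

```python
def finish(annotations, as_array=True, annotation_split="@"):
    """Reduce annotation records to their plain `result` strings."""
    results = [a["result"] for a in annotations]
    # Either return a list of strings or join them with the split symbol.
    return results if as_array else annotation_split.join(results)


# Hypothetical annotation records, shaped like chunk annotations.
annotations = [
    {"annotatorType": "chunk", "begin": 15, "end": 19, "result": "Sally", "metadata": {}},
    {"annotatorType": "chunk", "begin": 40, "end": 44, "result": "Truly", "metadata": {}},
]

as_list = finish(annotations)
as_text = finish(annotations, as_array=False)
```

The list form corresponds to an array-of-strings column, which is the shape other Spark ML transformers usually expect.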

In [63]:
finisher = Finisher() \
    .setInputCols(["regex_matches"]) \
    .setIncludeMetadata(False) # set to False to remove metadata

nlpPipeline = Pipeline(stages=[
 documentAssembler, 
 regex_matcher,
 finisher
 ])
 
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

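# Transform lib_result (not library): regex_matcher reads 'sentences', which this short pipeline does not create.
match_df = pipelineModel.transform(lib_result)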

match_df.show(truncate = 50)

+--------------------------------------------------+----------------------+
|                                              text|finished_regex_matches|
+--------------------------------------------------+----------------------+
|Project Gutenberg's The Hound of the Baskervill...|                    []|
|You may copy it, give it away or re-use it unde...|                [July]|
|For this and for your help in the details all t...|                    []|
|                 Yours most truly, A. Conan Doyle.|                    []|
|                              Hindhead, Haslemere.|                    []|
|Contents Chapter 1 Mr. Sherlock Holmes Chapter ...|                    []|
|                                            1. Mr.|                    []|
|Sherlock Holmes Mr. Sherlock Holmes, who was us...|                    []|
|I stood upon the hearth-rug and picked up the s...|                    []|
|It was a fine, thick piece of wood, bulbous-hea...|                    []|
|” Just unde

In [64]:
match_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- finished_regex_matches: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [65]:
match_df.filter(F.size('finished_regex_matches')>1).show(truncate = 50)

+--------------------------------------------------+-----------------------+
|                                              text| finished_regex_matches|
+--------------------------------------------------+-----------------------+
|Hotchkiss Aunt Sally talks to Huck Tom Sawyer w...|         [Sally, Truly]|
| Aunt Polly--Tom's Aunt Polly, she is--and Mary...|         [Polly, Polly]|
| So then we went away and went to the rubbage-p...|         [Sally, Sally]|
|then we went and waited around the spoon-basket...|         [Sally, Sally]|
|Phelps took me for Tom Sawyer--she chipped in a...|         [Sally, Sally]|
|Well, Aunt Polly she said that when Aunt Sally ...|         [Polly, Sally]|
|We had Jim out of the chains in no time, and wh...|         [Polly, Sally]|
|At the Pelly one morning, as they were harnessi...|         [Pelly, Dolly]|
|O'Niel, the Tanner--Conversation with Aunt Pheb...|     [Ugly, Melancholy]|
|Finally, after much more of supplication, the p...|       [Finally, Emily]|