Testing features of the spaCy NLP package 

In [35]:
import spacy
from spacy import displacy
import pandas as pd
from pathlib import Path
from itertools import islice
from time import time
import matplotlib.pyplot as plt

%matplotlib inline 

In [2]:
nlp = spacy.load('en_core_web_md')

In [3]:
DATA_DIR = Path('../../data/wiki10/text')

In [4]:
# iterdir() yields files in arbitrary order, so the file at index 100 may differ between machines
sample_text = list(DATA_DIR.iterdir())[100].read_text()

Passing text to the language object returns a Doc object

In [5]:
doc = nlp(sample_text)
type(doc)

spacy.tokens.doc.Doc

Iterating over a Doc yields Token objects

In [6]:
records = ((token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
           for token in doc)

columns = ["TEXT","LEMMA","POS","TAG","DEP","SHAPE","ALPHA","STOP"]

pd.DataFrame.from_records(records, columns=columns).head(20)

,TEXT,LEMMA,POS,TAG,DEP,SHAPE,ALPHA,STOP
0,The,the,DET,DT,det,Xxx,True,False
1,purchasing,purchase,VERB,VBG,compound,xxxx,True,False
2,power,power,NOUN,NN,compound,xxxx,True,False
3,parity,parity,NOUN,NN,nmod,xxxx,True,False
4,(,(,PUNCT,-LRB-,punct,(,False,False
5,PPP,ppp,PROPN,NNP,appos,XXX,True,False
6,),),PUNCT,-RRB-,punct,),False,False
7,theory,theory,NOUN,NN,nsubj,xxxx,True,False
8,uses,use,VERB,VBZ,ROOT,xxxx,True,False
9,the,the,DET,DT,det,xxx,True,False
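
The SHAPE column is an orthographic abstraction of each token. A rough sketch of how such a feature can be computed is below. This is a standalone approximation written for illustration, not spaCy's own code; the max_run cap of four is an assumption chosen to match the output above.

```python
def word_shape(text: str, max_run: int = 4) -> str:
    """Map letters to X/x and digits to d, capping runs of one symbol at max_run."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        # count how many times the same symbol has repeated in a row
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["The", "purchasing", "PPP", "("]:
    print(word, "->", word_shape(word))
```

For the four tokens above this reproduces the shapes in the table (Xxx, xxxx, XXX and the bracket itself).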


A Doc also exposes the named entities found by the NER component through doc.ents

In [7]:
records = ((ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)

columns = ["TEXT","START_CHAR","END_CHAR","LABEL_"]

pd.DataFrame.from_records(records, columns=columns).head(20)

,TEXT,START_CHAR,END_CHAR,LABEL_
0,PPP,29,32,ORG
1,two,89,92,CARDINAL
2,Gustav Cassel,153,166,PERSON
3,1920,170,174,DATE
4,one,202,205,CARDINAL
5,only one,295,303,CARDINAL
6,SEM,333,336,ORG
7,PPP,458,461,ORG
8,PPP,576,579,ORG
9,Geary-Khamis,817,829,PERSON
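
START_CHAR and END_CHAR are offsets into the raw text, with the end offset exclusive. A small check of that, using the opening of the sample text as reconstructed from the token table (assuming single spaces between tokens); the context helper is hypothetical and only there for illustration.

```python
# Opening of the sample text, reconstructed from the token table (single spaces assumed)
text = "The purchasing power parity (PPP) theory uses the"

# First row of the entity table: (text, start_char, end_char, label)
ent_text, start, end, label = ("PPP", 29, 32, "ORG")

# Slicing the raw string with the offsets recovers the entity text
print(text[start:end] == ent_text)


def context(text: str, start: int, end: int, width: int = 10) -> str:
    """Return the entity plus up to `width` characters of context on each side."""
    return text[max(0, start - width):end + width]


print(context(text, start, end))
```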


The model can also contain word vectors for each token. Tokens have attributes to check whether a vector exists (has_vector), to get its raw values (vector) and its L2 norm (vector_norm), and a method to compare it with other tokens (similarity).

A Doc also has a similarity method, which uses a simple average of the token vectors in that document.

In [8]:
token = doc[1]

In [9]:
print(token.text)
print('has vector:', token.has_vector)
print('l2 norm:', token.vector_norm)
print(f'similarity with "{doc[2].text}":', token.similarity(doc[2]))

purchasing
has vector: True
l2 norm: 5.8623
similarity with "power": 0.24701
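
The Doc-level similarity described above (cosine similarity between averaged token vectors) can be sketched in plain Python. The helper names and the 3-dimensional vectors are made up for illustration; real vectors come from the loaded model.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical token vectors for two tiny "documents"
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_b = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

print(cosine(mean_vector(doc_a), mean_vector(doc_b)))
```

The two toy documents share no token vector, yet their averages point in the same direction, so the score comes out at essentially 1. This hints at how much detail a simple average throws away.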


Checking which tokens in our Doc do not have vectors in the currently loaded model "en_core_web_md"

In [10]:
{token.text for token in doc if not token.has_vector}

{' ',
 '  ',
 '/£)=',
 '1.01/1.03',
 '1.50/£.',
 '1.80/£',
 '30,615',
 '7,204',
 'Balassa',
 'CPIs',
 'CommSec',
 'GBP£3',
 'GDPs',
 'Index[3',
 'US$',
 'USD$4',
 'underemphasise',
 '\xa0'}

Most of these are obscure tokens (whitespace, currency amounts, figures), although some of them may exist in a model with a larger vector table, e.g. "en_core_web_lg", which ships with hundreds of thousands of unique word vectors.

A Doc also has a sents attribute, which returns a generator of sentences. The language-specific sentence splitting is more accurate than simply looking for full stops.

In [11]:
sum(1 for _ in doc.sents)

94
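
To see why splitting on full stops alone is not enough, here is a minimal sketch on a made-up two-sentence text containing the kind of tokens found in this document (a currency amount and an abbreviation).

```python
# Hypothetical text with two sentences
text = "The rate was US$1.50 per pound in 1920. Prof. Cassel wrote about it."

# Naive approach: treat every full stop as a sentence boundary
naive = [part.strip() for part in text.split(".") if part.strip()]

print(len(naive))
for part in naive:
    print(repr(part))
```

The naive split yields four fragments instead of two sentences, because the decimal point and the abbreviation are both treated as boundaries.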

Example of the spaCy visualizer, displaCy. It can also render dependency parses (style='dep').

In [12]:
displacy.render(doc, style='ent', jupyter=True) 

Multiple documents can be processed with nlp.pipe, a generator that supports multithreading. The linked page also describes the use case where input metadata must stay tied to the document stream: https://spacy.io/usage/processing-pipelines#multithreading
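
For the metadata use case, nlp.pipe accepts as_tuples=True and then takes (text, context) pairs and yields (doc, context) pairs. The sketch below only mimics that pairing logic in plain Python: pipe_with_context is a hypothetical helper, str.split stands in for the spaCy pipeline, and the filenames are invented.

```python
def pipe_with_context(process, pairs):
    """Lazily yield (process(text), context) for each (text, context) pair."""
    for text, context in pairs:
        yield process(text), context


# Hypothetical (text, metadata) pairs; here the metadata is a filename
pairs = [
    ("purchasing power parity", "doc_001.txt"),
    ("exchange rates", "doc_002.txt"),
]

for tokens, filename in pipe_with_context(str.split, pairs):
    print(filename, len(tokens))
```

Because the helper is a generator, nothing is processed until the stream is consumed, which keeps memory use flat for large corpora.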

Checking processing time for tokenizing documents

In [58]:
# creating fresh language object to reset vocab
# also excluding everything in pipeline apart from tokenizer 
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner']) 

In [59]:
%%time
for doc in nlp.pipe(path.read_text() for path in islice(DATA_DIR.iterdir(), 5000)):
    pass 

CPU times: user 1min 51s, sys: 306 ms, total: 1min 51s
Wall time: 1min 56s


About 2 minutes to tokenize 5k documents. The Language.pipe() method uses multithreading and is supposed to be quicker than simply tokenizing in series. We will test that now.

In [61]:
# creating fresh language object to reset vocab
# also excluding everything in pipeline apart from tokenizer 
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner']) 

In [62]:
%%time
for text in (path.read_text() for path in islice(DATA_DIR.iterdir(), 5000)):
    nlp(text)

CPU times: user 1min 52s, sys: 483 ms, total: 1min 52s
Wall time: 1min 55s


We do not see any slowdown when processing documents one at a time instead of through nlp.pipe. On investigation, it seems that only the tagger and parser make use of multithreading, so there is no difference when only the tokenizer runs. We may be able to speed things up with multiprocessing: https://github.com/explosion/spaCy/issues/1321
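
A minimal sketch of the multiprocessing idea: split the texts into batches and hand each batch to a worker process. Everything here is an assumption for illustration: batched and tokenize_batch are hypothetical helpers, whitespace splitting stands in for the spaCy tokenizer, and the batch size and worker count are arbitrary. Newer spaCy releases also expose an n_process argument on Language.pipe, which may make a hand-rolled pool unnecessary.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def tokenize_batch(texts):
    """Stand-in worker: count whitespace tokens per text.

    With spaCy, each worker would load its own language object and run it over the batch.
    """
    return [len(text.split()) for text in texts]


def count_tokens_parallel(texts, batch_size=2, workers=2):
    """Tokenize batches in separate processes and flatten the per-batch results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(tokenize_batch, batched(texts, batch_size))
        return [count for batch in results for count in batch]


if __name__ == "__main__":
    sample = ["purchasing power parity", "exchange rates", "price index", "a b c d"]
    print(count_tokens_parallel(sample))
```

The main guard matters when this runs as a script on platforms that start worker processes by re-importing the module.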

We will now compare the speed of tokenizing the same documents with NLTK

In [63]:
import nltk
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [65]:
%%time
# doing both sentence and word tokenization for a fair comparison
# (note: nltk.word_tokenize also runs its own sentence splitting unless preserve_line=True)
for text in (path.read_text() for path in islice(DATA_DIR.iterdir(), 5000)):
    for sent in sentence_tokenizer.tokenize(text):
        nltk.word_tokenize(sent)

CPU times: user 4min 44s, sys: 237 ms, total: 4min 44s
Wall time: 4min 49s


The NLTK tokenizer is roughly 2.5x slower (4min 49s against 1min 55s wall time), even though spaCy is also building a vocabulary object while tokenizing, whereas NLTK is just returning strings.
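
The wall times above can be turned into throughput figures with a little arithmetic. The numbers are copied from the %%time outputs in this notebook; only the variable names are new.

```python
N_DOCS = 5000

# Wall times reported by %%time above, converted to seconds
wall_seconds = {
    "spaCy nlp.pipe": 1 * 60 + 56,
    "spaCy serial": 1 * 60 + 55,
    "NLTK": 4 * 60 + 49,
}

for name, seconds in wall_seconds.items():
    print(f"{name}: {N_DOCS / seconds:.1f} docs/s")

ratio = wall_seconds["NLTK"] / wall_seconds["spaCy serial"]
print(f"NLTK / spaCy serial wall time: {ratio:.2f}x")
```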