## Cython NLP

The example below comes from NeuralCoref's material on using Cython for 'blazing fast NLP'. However, it requires already-parsed spaCy `Doc` objects as input, whereas in our current process the pipe that produces those docs is the bottleneck, so it is not very useful here.

In [1]:
import setuptools
%load_ext Cython

In [2]:
import urllib.request
import spacy
# Build a dataset of 10 parsed documents extracted from the Wikitext-2 dataset
with urllib.request.urlopen('https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/valid.txt') as response:
    text = response.read()
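# Note: with the tagger disabled, tokens carry no tags, so the 'NN' match
# in the Cython cell below cannot fire (consistent with the output of 0).
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger'])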
doc_list = list(nlp(text[:800000].decode('utf8')) for i in range(10))

In [3]:
%%cython -+
import numpy  # Compilation sometimes fails with a numpy import error unless numpy is imported here
from cymem.cymem cimport Pool
from spacy.tokens.doc cimport Doc
from spacy.typedefs cimport hash_t
from spacy.structs cimport TokenC

cdef struct DocElement:
    TokenC* c
    int length

cdef int fast_loop(DocElement* docs, int n_docs, hash_t word, hash_t tag):
    cdef int n_out = 0
    for doc in docs[:n_docs]:
        for c in doc.c[:doc.length]:
            if c.lex.lower == word and c.tag == tag:
                n_out += 1
    return n_out

def main_nlp_fast(doc_list):
    cdef int i, n_out, n_docs = len(doc_list)
    cdef Pool mem = Pool()
    cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))
    cdef Doc doc
    for i, doc in enumerate(doc_list): # Populate our database structure
        docs[i].c = doc.c
        docs[i].length = (<Doc>doc).length
    word_hash = doc.vocab.strings.add('run')
    tag_hash = doc.vocab.strings.add('NN')
    n_out = fast_loop(docs, n_docs, word_hash, tag_hash)
    print(n_out)

In [6]:
%%time
main_nlp_fast(doc_list)

0
Wall time: 34.9 ms
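
For comparison, the Cython loop above counts tokens whose lowercased form is `run` and whose tag is `NN`. Here is a minimal pure-Python sketch of the same count, assuming tokens expose `lower_` and `tag_` attributes as spaCy `Token` objects do. `count_word_with_tag` and `FakeToken` are hypothetical names used only for illustration, not part of the NeuralCoref example.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token, so the sketch runs without a model.
FakeToken = namedtuple("FakeToken", ["lower_", "tag_"])


def count_word_with_tag(docs, word, tag):
    """Count tokens across all docs whose lowercased text and tag both match."""
    n_out = 0
    for doc in docs:
        for token in doc:
            if token.lower_ == word and token.tag_ == tag:
                n_out += 1
    return n_out


docs = [
    [FakeToken("run", "NN"), FakeToken("run", "VB")],
    [FakeToken("a", "DT"), FakeToken("run", "NN")],
    [FakeToken("run", "")],  # an untagged token never matches a tag
]
print(count_word_with_tag(docs, "run", "NN"))
```

With real spaCy docs the same function could be called as `count_word_with_tag(doc_list, 'run', 'NN')`. It iterates Python `Token` objects rather than the underlying C structs, which is the overhead the Cython version avoids.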


------
## Using spaCy's built-in nlp.pipe

In [None]:
# OPENBLAS_NUM_THREADS=1

In [7]:
import spacy
import pandas as pd
emails_df = pd.read_csv('emails_processed.csv', nrows=10000)

In [12]:
%%time

nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger'])
# import neuralcoref
# neuralcoref.add_to_pipe(nlp)

emails_processed = []
# Note: n_threads is deprecated as of spaCy v2.1.
for email in nlp.pipe(emails_df['content'], batch_size=75, n_threads=5):
    emails_processed.append(email)

Wall time: 3min 30s
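
To compare batch sizes without re-running the cell by hand, a small timing helper along these lines may help. This is a sketch only: `time_batch_sizes` and `fake_pipe` are hypothetical names, and the helper assumes the function it is given accepts `(texts, batch_size=...)` and returns an iterable of processed docs, as `nlp.pipe` does.

```python
import time


def time_batch_sizes(pipe_fn, texts, batch_sizes):
    """Time one full pass over texts for each batch size.

    pipe_fn is assumed to accept (texts, batch_size=...) and return an
    iterable of processed docs, as spaCy's nlp.pipe does.
    """
    timings = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        # Consume the iterable so the work is actually performed.
        n_docs = sum(1 for _ in pipe_fn(texts, batch_size=batch_size))
        timings[batch_size] = (n_docs, time.perf_counter() - start)
    return timings


def fake_pipe(texts, batch_size=1):
    """Hypothetical stand-in for nlp.pipe so the sketch runs without a model."""
    for text in texts:
        yield text.lower()


results = time_batch_sizes(fake_pipe, ["Hello", "World", "Again"], [1, 2])
for batch_size, (n_docs, seconds) in results.items():
    print(f"batch_size={batch_size}: {n_docs} docs in {seconds:.4f}s")
```

With the real pipeline, the call might look like `time_batch_sizes(nlp.pipe, list(emails_df['content']), [75, 100, 150])`. The texts are iterated once per batch size, so pass a list or Series rather than a one-shot generator.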


### Without neuralcoref, parser, tagger
- 2,000 emails, batch_size=500: 24s
- 5,000 emails, batch_size=500: 1min
- 10,000 emails, batch_size=500: 2min 43s
- 10,000 emails, batch_size=250: 2min 40s
- 10,000 emails, batch_size=150: 2min 34s
- 10,000 emails, batch_size=100: 2min 30s
- 10,000 emails, batch_size=75: 3min 30s

### Without neuralcoref
- 10,000 emails, batch_size=50: 7min 5s
- 20,000 emails, batch_size=50: 16min 29s

### With neuralcoref
- 10,000 emails, batch_size=30: 1hr 4min 43s
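
The timings above are easier to compare as throughput. The sketch below converts three of the 10,000-email runs into docs per second. `docs_per_second` is a hypothetical helper, and the labels are shorthand for the headings above.

```python
def docs_per_second(n_docs, hours=0, minutes=0, seconds=0):
    """Convert a document count and a wall time into docs per second."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return n_docs / total_seconds


# Timings copied from the lists above (10,000 emails each).
runs = {
    "no neuralcoref/parser/tagger, batch_size=100": docs_per_second(
        10000, minutes=2, seconds=30
    ),
    "no neuralcoref, batch_size=50": docs_per_second(
        10000, minutes=7, seconds=5
    ),
    "with neuralcoref, batch_size=30": docs_per_second(
        10000, hours=1, minutes=4, seconds=43
    ),
}
for label, rate in runs.items():
    print(f"{label}: {rate:.1f} docs/s")
```

These runs used different batch sizes, so the comparison is only indicative. Even so, the three configurations work out to roughly 67, 24, and 2.6 docs per second respectively.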
