# Named Entity Recognition in Python using spaCy

I thought I would give the spaCy NLP library a test in the DS environment to see what it's like to work there on some NLP tasks.

In this environment, we used the pip requirements file at startup to download and install the spaCy Python module. In addition to the module, we need the actual models that are available for spaCy. So, from the Python interpreter, we shell out and call another Python command to download the models.

Note that the final step of the download (linking the downloaded model to a simplified name) fails in the output below, because the Python interpreter has insufficient privileges to make the link. After a couple of emails with the DS team, they suggested the explicit `sudo` link command in the second cell. They also pointed out that we could make a new environment and put the necessary installs in the Dockerfile, which is a better long-term solution if we expect to use a package a lot, since the modification we're making here will disappear as soon as this environment is shut down.

In [1]:
!python -m spacy download en

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |################################| 37.4MB 727kB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
You are using pip version 9.0.2, however version 9.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

    Error: Couldn't link model to 'en'
    Cr

In [2]:
!sudo python -m spacy link en_core_web_sm en --force


    Linking successful
    /usr/local/lib/python2.7/dist-packages/en_core_web_sm -->
    /usr/local/lib/python2.7/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')



Now we can load the English models using the "simple" name, like so:

In [1]:
import spacy
nlp = spacy.load('en')

Now here's the tricky part with DS: much the same way that there's no model management currently (although they are working on it), there's also no abstraction for data sets.

In some of the examples that we've seen from them, the data that you would be working on is checked into the same GitHub repo as the code for building and evaluating the models. While this may be tolerable for small data sets (perhaps a few hundred examples in a CSV?) it's problematic for larger datasets. For example, the OntoNotes 5 data set that is used to train one of the spaCy NER models is about 4.7GB.

It's more likely that datasets will be drawn from object storage (S3 in the case of AWS) or from an HDFS instance running in the same cloud space as the DS system. For now, each DS user decides this for themselves. I expect that data scientists at their current customers use both methods in an ad hoc way.

Most of these NLP tools presume that the data that you're going to be training from or evaluating has been processed down into some simple format, like having annotations in JSON for training spaCy models or files of plain text data for annotation.
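To make the "simple format" point concrete, here is a rough sketch of storing one annotated example as a line of JSON and reading it back. The `(text, {'entities': [(start, end, label)]})` shape is the one used for training later in this notebook; the record layout, the sample sentence, and the helper names `to_json_line` and `from_json_line` are my own assumptions, not a spaCy file format.

```python
import json

# One training example in the (text, {'entities': [(start, end, label)]})
# shape used later in this notebook. The sentence is a made-up sample.
example = ("Sun Microsystems Inc. said Tuesday it introduced the JavaStation.",
           {"entities": [(0, 21, "ORG"), (27, 34, "DATE")]})


def to_json_line(example):
    """Serialize one (text, annotations) pair as a single JSON line."""
    text, annotations = example
    record = {"text": text,
              "entities": [list(span) for span in annotations["entities"]]}
    return json.dumps(record, sort_keys=True)


def from_json_line(line):
    """Rebuild the (text, annotations) pair; JSON has no tuples, so restore them."""
    record = json.loads(line)
    spans = [(start, end, label) for start, end, label in record["entities"]]
    return (record["text"], {"entities": spans})


line = to_json_line(example)
restored = from_json_line(line)
print(line)
```

One record per line keeps large files streamable, which matters once the data no longer fits comfortably in a Git repo.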

We'll punt and give you an idea of the capabilities by looking at a couple of examples that we checked into our source code. They're pulled from RCV1, so they come from a previous era. Note that spaCy operates only on unicode strings.


In [2]:
text = unicode(open('/home/datascience/stones/examples/sun.txt').read())
text

u'Sun Microsystems unveils network computer.\n\nSun Microsystems Inc. said Tuesday it introduced its first network\ncomputer, called the JavaStation.  The scaled-down personal computer,\ndesigned to access corporate networks and the Internet, carries a\nprice tag of $742, the company said.  Sun said the entry level model,\ncontaining eight megabytes of memory, will ship beginning in\nDecember. A fully configured system, which includes memory, keyboard,\na mouse and fourteen-inch monitor will cost $995.  The company also\nsaid it will offer a JavaStation with 16 megabytes of memory for\n$1,565.\n'

Ah, memories. Let's run this through the NLP pipeline and see what we get.

In [3]:
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

(u'Sun Microsystems', 0, 16, u'ORG')
(u'Sun Microsystems Inc.', 44, 65, u'ORG')
(u'Tuesday', 71, 78, u'DATE')
(u'\n', 110, 111, u'GPE')
(u'JavaStation', 132, 143, u'ORG')
(u'\n', 180, 181, u'GPE')
(u'\n', 246, 247, u'GPE')
(u'742', 261, 264, u'MONEY')
(u'Sun', 285, 288, u'ORG')
(u'\n', 316, 317, u'GPE')
(u'eight megabytes', 328, 343, u'QUANTITY')
(u'\n', 377, 378, u'GPE')
(u'December', 378, 386, u'DATE')
(u'\n', 447, 448, u'GPE')
(u'fourteen-inch', 460, 473, u'QUANTITY')
(u'995', 493, 496, u'MONEY')
(u'\n', 515, 516, u'GPE')
(u'JavaStation', 537, 548, u'ORG')
(u'16 megabytes', 554, 566, u'QUANTITY')
(u'\n', 580, 581, u'GPE')
(u'1,565', 582, 587, u'MONEY')
(u'\n', 588, 589, u'GPE')


Looks like it did a reasonable job, but it's tagging whitespace as GPE? That's weird, and it turns out it's a problem in the current code, which we can deal with by adding a post-processing step to the pipeline:

In [4]:
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

nlp.add_pipe(remove_whitespace_entities, after='ner')
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

(u'Sun Microsystems', 0, 16, u'ORG')
(u'Sun Microsystems Inc.', 44, 65, u'ORG')
(u'Tuesday', 71, 78, u'DATE')
(u'JavaStation', 132, 143, u'ORG')
(u'742', 261, 264, u'MONEY')
(u'Sun', 285, 288, u'ORG')
(u'eight megabytes', 328, 343, u'QUANTITY')
(u'December', 378, 386, u'DATE')
(u'fourteen-inch', 460, 473, u'QUANTITY')
(u'995', 493, 496, u'MONEY')
(u'JavaStation', 537, 548, u'ORG')
(u'16 megabytes', 554, 566, u'QUANTITY')
(u'1,565', 582, 587, u'MONEY')


That's better. Let's see what Oracle's up to:

In [5]:
text = unicode(open('/home/datascience/stones/examples/oracle.txt').read())
text

u"Oracle licenses Borland Java.\n\nOracle Corp will license Borland International Inc's Java and\nC++Builder technologies, both companies said.  Under terms, Oracle\nwill integrate and distribute Borland's C++Builder, a new object\ndevelopment tool, and JBuilder software tools with several of its\nexisting and future products.  The license is worldwide and\nnon-exclusive.  Oracle will incorporate Borland's Java technologies\ninto many of its tools, including Developer/2000, Designer/2000 and\nSedona.\n"

In [6]:
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

(u'Borland Java', 16, 28, u'ORG')
(u'Oracle Corp', 31, 42, u'PERSON')
(u"Borland International Inc's", 56, 83, u'ORG')
(u'Java', 84, 88, u'NORP')
(u'Oracle\n', 153, 160, u'ORG')
(u'Borland', 190, 197, u'GPE')
(u'C++Builder', 200, 210, u'ORG')
(u'JBuilder', 247, 255, u'ORG')
(u'Oracle', 367, 373, u'PERSON')
(u'Borland', 391, 398, u'GPE')
(u'Java', 401, 405, u'NORP')


Not so great on the entities in that one. So let's try to add some examples to the NER and see if we can make it any better. We need to generate training data in the format that spaCy needs.

Let's run a few examples through the NER, correct the mistakes that it makes, and then use the fixed up annotated versions to update the NER model. 

We'll use a function that does a list comprehension on the results of the NER to make our initial list. Note that this data is not exactly in the training format for spaCy: I've added the text of the extracted entity to the tuples with the entity labels so that it'll be easier to understand how to fix when cleaning it up.

In [8]:
def annotate_example(doc):
    return (doc.text, {'entities': [(e.start_char, e.end_char, e.label_, e.text) for e in doc.ents]})

unclean_train = [annotate_example(nlp(unicode(open('/home/datascience/stones/examples/'+f).read()))) for f in ['sun_one.txt', 'oracle_one.txt', 'oracle2_one.txt', 'oracle3_one.txt']]
unclean_train


[(u'Sun Microsystems unveils network computer.  Sun Microsystems Inc. said Tuesday it introduced its first network computer, called the JavaStation.  The scaled-down personal computer, designed to access corporate networks and the Internet, carries a price tag of $742, the company said.  Sun said the entry level model, containing eight megabytes of memory, will ship beginning in December. A fully configured system, which includes memory, keyboard, a mouse and fourteen-inch monitor will cost $995.  The company also said it will offer a JavaStation with 16 megabytes of memory for $1,565. \n',
  {'entities': [(0, 16, u'ORG', u'Sun Microsystems'),
    (44, 65, u'ORG', u'Sun Microsystems Inc.'),
    (71, 78, u'DATE', u'Tuesday'),
    (97, 102, u'ORDINAL', u'first'),
    (132, 143, u'ORG', u'JavaStation'),
    (261, 264, u'MONEY', u'742'),
    (285, 288, u'ORG', u'Sun'),
    (328, 343, u'QUANTITY', u'eight megabytes'),
    (378, 386, u'DATE', u'December'),
    (460, 473, u'QUANTITY', u'fourt

I'm not going to show the editing. I'm doing that like a gentleman (i.e., in emacs) and I'll paste the results into the session when I'm finished.
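Once the labels are fixed, the helper text has to come back off each tuple to get to the three-element form used for training. A minimal sketch of that step, assuming the four-element tuples produced above; `to_training_format` is a hypothetical helper of mine, not part of spaCy, and the sample sentence is made up. It also checks that each offset pair still matches the recorded text, which catches slips made while editing by hand.

```python
def to_training_format(annotated):
    """Drop the helper entity text from (start, end, label, text) tuples.

    Raises ValueError if an offset pair no longer matches the recorded text.
    """
    text, annotations = annotated
    entities = []
    for start, end, label, entity_text in annotations['entities']:
        if text[start:end] != entity_text:
            raise ValueError('offsets %d-%d give %r, expected %r'
                             % (start, end, text[start:end], entity_text))
        entities.append((start, end, label))
    return (text, {'entities': entities})


# Made-up sample in the same shape as the annotate_example output.
sample = ('Oracle Corp will license Java.',
          {'entities': [(0, 11, 'ORG', 'Oracle Corp'),
                        (25, 29, 'PRODUCT', 'Java')]})
print(to_training_format(sample))
```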

*time passes*

And here we go:

In [13]:
clean_train = [(u'Sun Microsystems unveils network computer.  Sun Microsystems Inc. said Tuesday it introduced its first network computer, called the JavaStation.  The scaled-down personal computer, designed to access corporate networks and the Internet, carries a price tag of $742, the company said.  Sun said the entry level model, containing eight megabytes of memory, will ship beginning in December. A fully configured system, which includes memory, keyboard, a mouse and fourteen-inch monitor will cost $995.  The company also said it will offer a JavaStation with 16 megabytes of memory for $1,565. \n',
  {'entities': [(0, 16, u'ORG'),
    (44, 65, u'ORG'),
    (71, 78, u'DATE'),
    (97, 102, u'ORDINAL'),
    (132, 143, u'PRODUCT'),
    (261, 264, u'MONEY'),
    (285, 288, u'ORG'),
    (328, 343, u'QUANTITY'),
    (378, 386, u'DATE'),
    (460, 473, u'QUANTITY'),
    (493, 496, u'MONEY'),
    (537, 548, u'PRODUCT'),
    (554, 566, u'QUANTITY'),
    (582, 587, u'MONEY')]}),
 (u"Oracle licenses Borland Java.  Oracle Corp will license Borland International Inc's Java and C++Builder technologies, both companies said.  Under terms, Oracle will integrate and distribute Borland's C++Builder, a new object development tool, and JBuilder software tools with several of its existing and future products.  The license is worldwide and non-exclusive.  Oracle will incorporate Borland's Java technologies into many of its tools, including Developer/2000, Designer/2000 and Sedona. \n",
  {'entities': [(16, 28, u'PRODUCT'),
    (31, 42, u'ORG'),
    (56, 83, u'ORG'),
    (84, 88, u'PRODUCT'),
    (190, 197, u'ORG'),
    (200, 210, u'PRODUCT'),
    (247, 255, u'PRODUCT'),
    (367, 373, u'ORG'),
    (391, 398, u'ORG'),
    (401, 405, u'PRODUCT'),
    (487, 493, u'PRODUCT')]}),
 (u"Oracle's Oracle8 universal server in beta. Oracle Corp said on Wednesday that its new Oracle Universal Server, Oracle8, entered beta testing last week.  Oracle said the software will be tested by customers on five different Unix platforms and Windows NT.  The company said more than 300 users had been trained on the product, which will undergo intensive testing before becoming generally available.\n",
  {'entities': [(43, 54, u'ORG'),
    (63, 72, u'DATE'),
    (86, 109, u'PRODUCT'),
    (141, 150, u'DATE'),
    (153, 159, u'ORG'),
    (209, 213, u'CARDINAL'),
    (243, 253, u'PRODUCT'),
    (273, 286, u'CARDINAL')]}),
 (u'Oracle to buy Datalogix for $94 mln. Oracle Corp and Datalogix International Inc have signed a definitive agreement under which Oracle is to acquire Datalogix for about $94 million in cash, the two companies said Tuesday. Oracle currently owns 13.4 percent of the outstanding shares of Datalogix, a Valhalla, N.Y.-base company that provides software for process manufacturing. Oracle is to purchase all of the remaining shares for $81 million, or $8.00 a share. The deal will close once it is approved by shareholders and regulators.\n',
  {'entities': [(14, 23, u'ORG'),
    (29, 31, u'MONEY'),
    (37, 48, u'ORG'),
    (53, 80, u'ORG'),
    (128, 134, u'ORG'),
    (149, 158, u'ORG'),
    (163, 180, u'MONEY'),
    (194, 197, u'CARDINAL'),
    (213, 220, u'DATE'),
    (244, 256, u'PERCENT'),
    (286, 295, u'ORG'),
    (299, 307, u'GPE'),
    (309, 312, u'GPE'),
    (431, 442, u'MONEY'),
    (448, 452, u'MONEY')]})]

The keen-eyed will note that this documents-and-offsets approach for representing a document and its annotations is essentially what our DocView toolkit does. All right, let's update our NER model. Note that we're only adding training data for the existing entity types, although we could also add data for a new label here.
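A quick way to confirm that a training set stays within the existing entity types is to list the labels it uses. This is only a sketch: `labels_used` is a hypothetical helper, `known_labels` is a hand-written stand-in covering just the types that show up in this notebook, and the two sample sentences are made up. If a label turned up that the model didn't know, spaCy 2's NER component would need it registered first via `ner.add_label`.

```python
def labels_used(train_data):
    """Collect every entity label that appears in the training examples."""
    return set(label
               for _, annotations in train_data
               for _, _, label in annotations['entities'])


# Stand-in for the labels the loaded model already knows; an assumption,
# listing only the types that appear in this notebook.
known_labels = set(['ORG', 'PERSON', 'GPE', 'NORP', 'DATE', 'MONEY',
                    'QUANTITY', 'ORDINAL', 'CARDINAL', 'PERCENT', 'PRODUCT'])

# Made-up examples in the training format.
sample_train = [
    ('Oracle Corp said on Wednesday.',
     {'entities': [(0, 11, 'ORG'), (20, 29, 'DATE')]}),
    ('The JavaStation costs $742.',
     {'entities': [(4, 15, 'PRODUCT'), (23, 26, 'MONEY')]}),
]

# Anything left over here would be a new label for the model.
new_labels = labels_used(sample_train) - known_labels
print(sorted(new_labels))
```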

This code is taken pretty much verbatim from the spaCy documentation.

In [14]:
import random
ner = nlp.get_pipe('ner')
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(30):
        random.shuffle(clean_train)
        losses = {}
        for text, annotations in clean_train:
            nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
        print(losses)

{u'ner': 62.12408423431214}
{u'ner': 47.842529071138095}
{u'ner': 46.09913378181367}
{u'ner': 31.179111287359255}
{u'ner': 32.94380223598171}
{u'ner': 19.204441226902027}
{u'ner': 21.73971905057339}
{u'ner': 34.78534688709221}
{u'ner': 19.85029929094718}
{u'ner': 8.297226552338868}
{u'ner': 10.496909284611247}
{u'ner': 5.026812666582371}
{u'ner': 12.235058422822858}
{u'ner': 9.222539850029868}
{u'ner': 5.215412125393285}
{u'ner': 6.1230676908295045}
{u'ner': 4.792570929734541}
{u'ner': 5.440482463634183}
{u'ner': 4.078647335200401}
{u'ner': 11.612032433464192}
{u'ner': 6.163535577891217}
{u'ner': 4.854314655370253}
{u'ner': 7.771697830316763}
{u'ner': 5.7436849484252726}
{u'ner': 15.860697749271731}
{u'ner': 8.735517317146856}
{u'ner': 3.921501150986006}
{u'ner': 1.866006226812808}
{u'ner': 2.1762267108019775}
{u'ner': 5.779164969960698}


The model's been updated. At this point we could save it somewhere, which is where the DS model management infrastructure would come into play. Since we're just messing around, we won't save it for now. In a perfect world, at this point we would run some unseen test examples through the system, but for now let's just see how it does on the training data that we used.
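For the perfect-world version, the unseen examples have to be set aside before training starts. A minimal sketch of such a hold-out split; `split_examples`, the 25% fraction, and the fixed seed are all my own assumptions, and with only four documents the split is illustrative at best.

```python
import random


def split_examples(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one example.
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder strings stand in for (text, annotations) pairs.
examples = ['sun', 'oracle', 'oracle2', 'oracle3']
train, test = split_examples(examples)
print('%d train, %d test' % (len(train), len(test)))
```

Seeding a private `random.Random` keeps the split reproducible without disturbing the `random.shuffle` calls in the training loop.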

In [18]:
for text, _ in clean_train:
    doc = nlp(text)
    print(text)
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    print('')

Oracle licenses Borland Java.  Oracle Corp will license Borland International Inc's Java and C++Builder technologies, both companies said.  Under terms, Oracle will integrate and distribute Borland's C++Builder, a new object development tool, and JBuilder software tools with several of its existing and future products.  The license is worldwide and non-exclusive.  Oracle will incorporate Borland's Java technologies into many of its tools, including Developer/2000, Designer/2000 and Sedona. 

(u'Borland Java', 16, 28, u'PRODUCT')
(u'Oracle Corp', 31, 42, u'ORG')
(u"Borland International Inc's", 56, 83, u'ORG')
(u'Java', 84, 88, u'PRODUCT')
(u'Borland', 190, 197, u'ORG')
(u'C++Builder', 200, 210, u'PRODUCT')
(u'JBuilder', 247, 255, u'PRODUCT')
(u'Oracle', 367, 373, u'ORG')
(u'Borland', 391, 398, u'ORG')
(u'Java', 401, 405, u'PRODUCT')
(u'Sedona', 487, 493, u'PRODUCT')

Oracle to buy Datalogix for $94 mln. Oracle Corp and Datalogix International Inc have signed a definitive agreement unde

It appears that the model is getting more of the entity types right. Note that spaCy uses neural models, so it's possible that the neural net has simply memorized these examples, which is why we need unseen testing data.
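With unseen data in hand, "more right" can be turned into a number. Here's a sketch of exact-match precision, recall, and F1 over `(start, end, label)` spans; `score_entities` is a hypothetical helper, not a spaCy function. The predicted spans could be built from `doc.ents` using the same `start_char`, `end_char`, and `label_` attributes printed above. The sample values are three spans from the Oracle example: the hand-corrected labels versus what the original model produced.

```python
def score_entities(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / float(len(predicted)) if predicted else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hand-corrected spans versus the original model's output for the same offsets.
gold = [(31, 42, 'ORG'), (56, 83, 'ORG'), (84, 88, 'PRODUCT')]
predicted = [(31, 42, 'PERSON'), (56, 83, 'ORG'), (84, 88, 'NORP')]
print(score_entities(gold, predicted))
```

Exact matching is strict: a span with the right boundaries but the wrong label counts as both a miss and a false alarm.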