## NER tagging via Stanford NER service

This is actually an instance of the CoreNLP server, which also provides a number of other annotators that you may find useful. See here:

https://stanfordnlp.github.io/CoreNLP/annotators.html

This requires CoreNLP to be running on your local machine or on a server, e.g.:

```
$ mkdir /usr/local/stanford
$ cd /usr/local/stanford
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip
$ unzip stanford-corenlp-full-2017-06-09.zip
$ cd stanford-corenlp-full-2017-06-09
$ nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
```

The code below requires NLTK 3.2.5: https://github.com/nltk/nltk/releases/tag/3.2.5



In [None]:
from nltk.tag.stanford import CoreNLPNERTagger

It'll be much faster if you run the notebook on the same machine as the server.

In [None]:
server = 'localhost' # or your servername (without 'http://')
port = 9000 # change if you decide on a different port
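
Before tagging anything, it can help to confirm that something is actually listening on that host and port. Below is a minimal sketch using only the standard library. `corenlp_url` and `port_is_open` are hypothetical helper names (not part of NLTK or CoreNLP), and an open port only tells you that some process is listening, not that it is CoreNLP.

```python
import socket

def corenlp_url(server, port):
    """Build the base URL passed to the tagger (hypothetical helper)."""
    return 'http://{}:{}'.format(server, port)

def port_is_open(server, port, timeout=2.0):
    """Return True if a TCP connection to server:port succeeds."""
    try:
        # create_connection raises OSError (refused, timed out, unknown host)
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open(server, port)` before building the tagger gives a quick yes/no answer instead of a long stack trace from the first request.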

There are some simpler functions in NLTK than the one I've used below, but they are seriously limiting. See here:
http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford

In [None]:
s_ner = CoreNLPNERTagger(url='http://'+server+":"+str(port))
def socket_ner(text, properties=None, regexner=False):
    
    sents = []
    
    if regexner:
        ann = 'tokenize,ssplit,ner,regexner,entitymentions'
    else:
        ann = 'tokenize,ssplit,ner,entitymentions'
    props = {
        'ssplit.isOneSentence': 'true',
        'annotators': ann
    }
    # if you override 'annotators' this will likely break. 
    # If you add more properties, be sure to check the results carefully
    props.update(properties or {})
    
    # accept a single string as well as a list of strings
    if isinstance(text, str):
        text = [text]
        
    for s in text:
        # returns a dict
        ret = s_ner.api_call(s, properties=props)
        
        for r in ret['sentences']:
            res = {
                'sentence' : s,
                'entities' : [(m['text'], m['ner']) for m in r['entitymentions']],
                'tokens' : [(t['word'], t['ner']) for t in r['tokens']]
            }
            sents.append(res)
    return sents
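
The effect of `props.update(properties or {})` can be seen without a running server. The sketch below reproduces just the merge step. `merge_props` is a hypothetical helper written for illustration (it is not called by `socket_ner`), and the warning on an overridden `annotators` key is an added assumption about what you might want.

```python
import warnings

DEFAULT_ANNOTATORS = 'tokenize,ssplit,ner,entitymentions'

def merge_props(properties=None, annotators=DEFAULT_ANNOTATORS):
    """Merge caller-supplied properties over the defaults, as socket_ner does."""
    props = {
        'ssplit.isOneSentence': 'true',
        'annotators': annotators,
    }
    extra = properties or {}
    if 'annotators' in extra:
        # socket_ner reads r['entitymentions']; a pipeline without the
        # entitymentions annotator is likely to be missing that key.
        warnings.warn("overriding 'annotators' may break socket_ner")
    props.update(extra)
    return props

# A caller-supplied key wins over the default; untouched keys are kept.
merged = merge_props({'ssplit.isOneSentence': 'false'})
print(merged)
```

Because `dict.update` overwrites matching keys, anything passed in `properties` silently replaces the defaults, which is why the comment in `socket_ner` warns about `annotators`.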

Some institution (affiliation) strings taken from PubMed papers:

In [None]:
text = [
    'Department of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Kaohsiung, Taiwan.',
    'Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan.',
    'Department of Radiation Oncology, Cancer Center, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan.',
    'Department of Gynecology, Renmin Hospital of Wuhan University Wuhan 430060, China.',
    'Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5.',
    'Department of Chemistry and Biochemistry, University of Colorado, 215 UCB, Boulder, CO 80309, USA. callie.cole@colorado.edu.'
]

Returns a list of dicts, one per sentence, each holding the original sentence, the entity mentions, and the per-token tags

In [None]:
socket_ner(text)
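
Since each result is a plain dict, picking out one entity type takes only a short comprehension. This is a sketch that assumes the output shape built by `socket_ner`. `entities_with_label` is a hypothetical helper, and the `sample` list is hand-written to mimic that shape; it is not real tagger output.

```python
def entities_with_label(results, label):
    """Collect the text of every entity mention tagged with `label`."""
    return [
        ent_text
        for res in results
        for ent_text, ent_label in res['entities']
        if ent_label == label
    ]

# Hand-written stand-in for socket_ner output (illustrative labels only).
sample = [
    {
        'sentence': 'Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan.',
        'entities': [('Academia Sinica', 'ORGANIZATION'), ('Taipei', 'LOCATION')],
        'tokens': [],
    },
]

print(entities_with_label(sample, 'LOCATION'))  # prints ['Taipei']
```

With a live server you would pass the return value of `socket_ner(text)` instead of `sample`.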

Passing `regexner=True` adds the `regexner` annotator, which gives the extended tagset described on this page: https://stanfordnlp.github.io/CoreNLP/regexner.html

In [None]:
socket_ner(text, regexner=True)