This notebook performs named entity recognition (NER) using a StanfordCoreNLP server.

You will need the Python environment listed in the cell below.

You will need Java 8 (64-bit), available here:

https://www.java.com/en/download/manual.jsp

Java may need to be added to your PATH.

You will need StanfordCoreNLP, available here:

https://stanfordnlp.github.io/CoreNLP/

Then follow the instructions further down to start a server and run the notebook.

In [1]:
print(__import__('sys').version)
!conda list -n NLP37

3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
# packages in environment at C:\Anaconda3\envs\NLP37:
#
# Name                    Version                   Build  Channel
_pytorch_select           1.1.0                       cpu  
altair                    3.1.0                    py37_0    conda-forge
asn1crypto                0.24.0                   py37_0  
atomicwrites              1.3.0                    py37_1  
attrs                     19.1.0                   py37_1  
backcall                  0.1.0                    py37_0  
beautifulsoup4            4.7.1                    pypi_0    pypi
blas                      1.0                         mkl  
boto                      2.49.0                   py37_0    anaconda
boto3                     1.9.162                    py_0    anaconda
botocore                  1.12.163                   py_0    anaconda
branca                    0.3.1                      py_0    conda-forge
bs4                       0.

In [1]:
%pylab

from pycorenlp import StanfordCoreNLP
from datetime import datetime as dt
from tqdm import tqdm
from toolz import compose, curry, concat
from unicodedata import normalize
from string import punctuation

import datetime, os, re, string

try:
    import cPickle as pickle  # Python 2 fallback
except ImportError:
    import pickle

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


Open a shell in your Stanford core download folder i.e.:
    
cd C:\StanfordNLP\stanford-corenlp-full-2018-10-05

Then run the following command to start a server:

java -mx12g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -maxCharLength 100000000000 -timeout 10000000000 -tokenize.options untokenizable=allDelete

"-mx12g" sets the amount of RAM allocated to the server (here 12 GB), so make sure you have that much available.


Make sure the maxCharLength is equal to or larger than the MAX_CHARACTER_LENGTH variable in one of the below cells.
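
As a minimal sketch of how the start-up command and the length constraint fit together, the helper below assembles the same flags as the command above and refuses to build a command whose `-maxCharLength` is smaller than `MAX_CHARACTER_LENGTH`. The function name `build_server_command` and its parameter names are hypothetical and only used for illustration; the flags themselves are the ones shown above.

```python
# Hypothetical helper: assembles the server command shown above as an argument list.
MAX_CHARACTER_LENGTH = int(1e7)  # mirrors the constant defined later in the notebook

def build_server_command(memory_gb=12, port=9000,
                         max_char_length=100000000000,
                         timeout_ms=10000000000):
    # The server must accept strings at least as long as the largest batch
    if max_char_length < MAX_CHARACTER_LENGTH:
        raise ValueError('maxCharLength must be >= MAX_CHARACTER_LENGTH')
    return ['java', '-mx%dg' % memory_gb, '-cp', '*',
            'edu.stanford.nlp.pipeline.StanfordCoreNLPServer',
            '-port', str(port),
            '-maxCharLength', str(max_char_length),
            '-timeout', str(timeout_ms),
            '-tokenize.options', 'untokenizable=allDelete']

command = build_server_command()
print(' '.join(command))
```

One possible use is to pass `command` to `subprocess.Popen` with `cwd` set to the StanfordCoreNLP download folder, so the `*` classpath resolves against the jars there.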

To shut down the server at any time hold CTRL and press C when in the shell screen.

The basic approach of this code is to exploit the fact that StanfordCoreNLP is faster when it runs as a server, when only the required annotators are used, when only the required classes are loaded, and when it works on a few large strings rather than many small ones. With this in mind, we split the corpus into batches of roughly equal character length, add the date and article id to each article's text, and then join the articles in a batch into a single large string. This leaves us with fewer, larger strings, and lets us recover the date and id of each article by searching the tokens that StanfordCoreNLP outputs.
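
The marker idea can be sketched without a server. In the toy example below, `pack` and `unpack` are hypothetical names, the two documents are made-up data, and `str.split` stands in for the CoreNLP tokenizer (an assumption made purely for illustration). The marker strings are the ones used later in the notebook.

```python
from datetime import datetime

def pack(docs):
    """Join documents into one string, prefixing each with marker tokens."""
    parts = []
    for doc in docs:
        parts.extend(['HEREISTHEDATE', doc['datetime'].strftime('%d-%b-%Y'),
                      'HEREISTHEID', str(doc['id']),
                      doc['content'], 'ENDOFDOC.'])
    return ' '.join(parts)

def unpack(tokens):
    """Walk a token stream and recover (date, id, words) for each document."""
    recovered, words = [], []
    date = doc_id = None
    stream = iter(tokens)
    for tok in stream:
        if tok == 'HEREISTHEDATE':
            date = datetime.strptime(next(stream), '%d-%b-%Y')
        elif tok == 'HEREISTHEID':
            doc_id = int(next(stream))
        elif tok.startswith('ENDOFDOC'):
            recovered.append((date, doc_id, words))
            words = []
        else:
            words.append(tok)
    return recovered

# Made-up documents for illustration only
docs = [
    {'id': 20, 'datetime': datetime(2017, 1, 19), 'content': 'First article text'},
    {'id': 21, 'datetime': datetime(2017, 1, 20), 'content': 'Second article text'},
]

# str.split replaces the real tokenizer in this sketch
recovered = unpack(pack(docs).split())
print(recovered)
```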

In [2]:
###########################################################################
#   These are the three main classifiers that perform NER. One may be     #
#   selected and used for the server. Testing shows CLF4 gives the best   #
#   results.                                                              #
###########################################################################

CLF3 = r'edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz'
CLF4 = r'edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz'
CLF7 = r'edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz'

In [3]:
def load_pickle(filename):
    print('Loading file...')
    with open(os.path.normpath(filename), 'rb') as open_file:
        return pickle.load(open_file)

def save_pickle(filename, data):
    print('saving...')
    with open(os.path.normpath(filename), 'wb') as open_file:
        pickle.dump(data, open_file)
        
def directory_explorer(extension, directory):
    '''
    A generator to find filenames with a given extension within a given 
    directory
    '''
    # Match the extension in either all-lowercase or all-uppercase form
    ext_lower, ext_upper = extension.lower(), extension.upper()
    for filename in os.listdir(os.path.normpath(directory)):
        if filename.endswith(ext_lower) \
        or filename.endswith(ext_upper):
            yield '%s/%s' % (directory, filename)

In [4]:
def unpack_dictionary(dic):
    '''
    Unpack dictionary data structure into a list for easier batch processing
    '''
    print('Unpacking data...')
    output = []
    for date in dic:
        for doc in dic[date]:
            copy = {
                    'id': doc['id'],
                    'content': doc['content'],
                    'datetime':date,
                    }
            
            output.append(copy)
    return output

In [5]:
def divide_into_batches(max_len, lis):
    '''
    Divides the corpus documents into batches
    '''
    output = []
    batch = []
    batch_len = 0
    for doc in lis:
        doc_len = len(doc['content'])
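        if (batch_len + doc_len) < max_len:
            batch.append(doc)
            batch_len += doc_len
        else:
            # Close the current batch; skip it if it is empty, which happens
            # when the very first document alone reaches max_len
            if batch:
                output.append(batch)
            batch = [doc]
            batch_len = doc_len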
    if len(batch) > 0:
        output.append(batch)
    return output
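
To make the batching rule concrete, here is a small generator-based sketch of the same greedy strategy, run on made-up documents. The name `batches_by_length` is hypothetical. A document is added to the current batch while the running character count stays under the limit, and a document that is too large on its own ends up in a batch by itself.

```python
def batches_by_length(docs, max_len):
    """Greedy batching: yield lists of docs whose combined length stays under max_len."""
    batch, batch_len = [], 0
    for doc in docs:
        doc_len = len(doc['content'])
        # Close the current batch when adding this doc would reach the limit
        if batch and batch_len + doc_len >= max_len:
            yield batch
            batch, batch_len = [], 0
        batch.append(doc)
        batch_len += doc_len
    if batch:
        yield batch

# Made-up documents with content lengths 4, 4, 4, 12 and 3
docs = [{'id': i, 'content': 'x' * n} for i, n in enumerate([4, 4, 4, 12, 3])]
sizes = [[len(d['content']) for d in b] for b in batches_by_length(docs, 10)]
print(sizes)  # [[4, 4], [4], [12], [3]]
```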

In [6]:
def aggregate_articles(lis):
    '''
    Flattens a batch of articles into a single list of strings, wrapping each
    article in marker tokens so its date and id can be recovered from the
    NER output
    '''
    print('Building input for NER...')
    output = []
    for doc in tqdm(lis):
        output.extend(['HEREISTHEDATE',
                       doc['datetime'].strftime('%d-%b-%Y'),
                       'HEREISTHEID',
                       str(doc['id']),
                       doc['content'],
                       'ENDOFDOC.'])
    return output

In [8]:
def unpack_ner_output(ner_output):
    '''
    Unpacks NER output
    
    Input:
        StanfordCoreNLP output
        
    Output:
        a dictionary with an 'n_articles' count plus one key per date (a
        datetime object). Each date maps document ids to documents, where a
        document holds its id and a list of sentences; each sentence is a list
        of the entities found in it, represented as {'word', 'tag'} dictionaries
    '''

    print('Unpacking NER output...')
    output = {'n_articles': 0}
    doc = {}
    sentencewise_ents = []
    date_flag = None
    id_flag = None
    stop_flag = False
    
#     if len(custom_entities) > 0:
#         search_for_custom_entities = True
#     else:
#         False
    # pycorenlp returns the raw response text when the server reports an error
    if isinstance(ner_output, str):
        raise Exception('StanfordCoreNLPServer has encountered an error: %s' % ner_output)
        
    for sentence in tqdm(ner_output['sentences']):
            
        ent_flag = False
        ents = []
            
        for token in sentence['tokens']:

            if token['word'] == 'ENDOFDOC':
                stop_flag = True

            ######################################################################
            #                           Collect Date                             #
            ######################################################################

            if 'HEREISTHEDATE' in token['word']:
                date_flag = True
                continue
            if date_flag:
                date = dt.strptime(token['word'], '%d-%b-%Y')
                if date not in output:
                    output[date] = {}
                date_flag = False
                continue

            ######################################################################
            #                           Collect ID                               #
            ######################################################################

            if 'HEREISTHEID' in token['word']:
                id_flag = True
                continue
            if id_flag:
                doc['id'] = int(token['word'])
                id_flag = False
                continue

#             ######################################################################
#             #                        Collect Entities                            #
#             ######################################################################
            
#             if search_for_custom_entities:
#                 if token['word'] in custom_entities:
#                     ents.append({'word': token['word'], 'tag': token['ner']})
#                     ent_flag = False

            if token['ner'] != 'O': # Entity
                if ent_flag:
                    # Same entity class
                    if token['ner'] == entity['tag']:
                        entity['word'] = '%s %s' % (entity['word'], token['word'])
                    # Different entity classes
                    else:
                        ents.append(entity)
                        entity = {'word': token['word'], 'tag': token['ner']}
                else:
                    entity = {'word': token['word'], 'tag': token['ner']}
                    ent_flag = True # Collect following entities of same class

            elif token['ner'] == 'O': # Non-Entity
                if ent_flag:
                    ents.append(entity) # Capture previous entity
                    entity = {'word': None, 'tag': None}
                ent_flag = False
            
        # Flush an entity that is still open when the sentence ends
        if ent_flag:
            ents.append(entity)
            ent_flag = False

        if len(ents) > 0:
            sentencewise_ents.append(ents)
                
        if stop_flag:
            doc['sentences'] = sentencewise_ents
            output[date][doc['id']] = doc
            output['n_articles'] += 1
            sentencewise_ents = []
            doc = {}
            stop_flag = False

    return output
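
The entity-merging step in the function above (joining adjacent tokens that share a tag into one entity) can be illustrated in isolation. The sketch below uses `itertools.groupby` rather than the flag-based loop, so it is an alternative formulation and not the notebook's own code. The name `merge_entities` and the sample sentence are illustrative assumptions, and the token dictionaries only mimic the `word` and `ner` fields of the server output.

```python
from itertools import groupby

def merge_entities(tokens):
    """Merge runs of adjacent tokens that share a non-'O' NER tag."""
    entities = []
    for tag, run in groupby(tokens, key=lambda t: t['ner']):
        if tag != 'O':
            entities.append({'word': ' '.join(t['word'] for t in run),
                             'tag': tag})
    return entities

# Illustrative tokens shaped like the 'tokens' entries in the server output
sentence = [
    {'word': 'Nawaz', 'ner': 'PERSON'},
    {'word': 'Sharif', 'ner': 'PERSON'},
    {'word': 'visited', 'ner': 'O'},
    {'word': 'Pakistan', 'ner': 'COUNTRY'},
]
merged = merge_entities(sentence)
print(merged)
```

Because `groupby` closes a run whenever the tag changes or the input ends, an entity at the very end of a sentence is kept without any extra bookkeeping.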

In [9]:
def extract_persons(ner_output):
    print('Extracting persons...')
    output = []
    person = []
    for sentence in ner_output['sentences']:
        for token in sentence['tokens']:
            if token['ner'] != 'PERSON':
                if len(person) > 1:
                    output.append(' '.join(person))
                    person = []
                elif person:
                    person = []
            elif token['ner'] == 'PERSON':
                person.append(token['word'])
    return output

In [11]:
class NER:
    '''
    A simple wrapper for StanfordCoreNLP mainly for simpler syntax
    '''
    def __init__(self, **kwargs):
        port = kwargs.get('port', '9000')
        # Build the default client only when no 'clf' is supplied
        self.clf = kwargs.get('clf') or StanfordCoreNLP('http://localhost:%s' % port)
        self.properties = kwargs.get(
            'properties', {
                           'annotators': 'tokenize, ssplit, pos, lemma, ner',
                           'ner.model': 'edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz',
                           'ner.useSUTime': 'false',
                           'ner.applyNumericClassifiers': 'false',
                           #'tokenizerFactory': 'edu/stanford/nlp.process/PTBLexer',
                           'tokenize.options': "untokenizable=noneDelete",
                           'outputFormat': 'json',
                           #'maxCharLength': 1000000000,
#                            'timeout': 6000000,
                           })
        print('Initializing...')
        self.annotate('Initialization in progress...')
        print('Complete.')
        
    def annotate(self, text):
        print('Performing NER...')
        return self.clf.annotate(text, properties=self.properties)
    
#java -mx11g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 6000000 -maxCharLength 1000000000 -ner.model edu.stanford.nlp.models.ner.english.conll.4class.distsim.crf.ser.gz -ner.useSUTime false -ner.applyNumericClassifiers false -tokenizerFactory edu.stanford.nlp.process.PTBTokenizer -tokenizerOptions "untokenizable=noneDelete"

#### Main NER pipe

The following code was used to perform NER on the filtered corpus.

In [12]:
###########################################################################
#                              Constants                                  #
###########################################################################

TEMPORARY_FOLDER = r'temp'
INPUT_FILENAME = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\05 Filter Docs\temp\filtered.pkl'
PATH_TO_STANFORDNLP = r'C:\StanfordNLP\stanford-corenlp-full-2018-10-05'

MAX_CHARACTER_LENGTH = 1e7

You need a StanfordCoreNLP Server running to run this cell.

In [13]:
ner = NER()

##########################################################################
#                       For splitting pipe flow                          #
##########################################################################

left_fork = lambda f, x: (f(x), x)

right_fork = lambda f, x: (x, f(x))

right_deactivated_fork = lambda x: (x, None)  # unused; passes x through with an empty second slot

unpack_tuple = lambda f, x: f(*x)

##########################################################################
#                         More auxiliary functions                       #
##########################################################################

def loss_check(x, y):
    '''
    Check that every article in the batch was recovered from the NER output
    '''
    assert len(x) == y['n_articles'], 'Some articles were lost during NER'
    return y

i = 0

def create_savename(temp, x):
    '''
    To generate a single savename for a batch
    '''
    global i  # the counter lives at module level, so it must be declared before use
    savename = '%s/batch %s.pkl' % (temp, i)
    i += 1
    return savename

create_savenames = compose(
                           list,
                           curry(map)(lambda x: '%s/batch %s.pkl' % (TEMPORARY_FOLDER, x)),
                           range,
                           len,
                           )

save_batches = curry(unpack_tuple)(curry(map)(save_pickle))

concatonate_by_space = lambda x: ' '.join(x)

join = lambda x: ''.join(x)

save_batch = curry(unpack_tuple)(save_pickle) #lambda x: save_pickle(*x)

file_finder = curry(directory_explorer)('.pkl')

Initializing...
Performing NER...
Complete.
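
The fork helpers are easier to follow with a tiny worked example. The sketch below re-creates `right_fork` and `unpack_tuple` and chains them with a hand-written `compose` and `functools.partial`, used here as stand-ins for `toolz.compose` and `toolz.curry` so the example runs without extra packages. `fake_save` is a hypothetical replacement for `save_pickle` that records its arguments instead of writing a file.

```python
from functools import partial, reduce

def compose(*funcs):
    """Right-to-left composition, mirroring how toolz.compose orders its arguments."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(funcs), x)

right_fork = lambda f, x: (x, f(x))   # keep the input, add the result
unpack_tuple = lambda f, x: f(*x)     # spread a tuple over f's arguments

saved = {}

def fake_save(filename, data):
    # Hypothetical stand-in for save_pickle: records instead of writing to disk
    saved[filename] = data

pipe = compose(
    partial(unpack_tuple, fake_save),   # runs second: fake_save(filename, result)
    partial(right_fork, str.upper),     # runs first: (filename, filename.upper())
)
pipe('batch 0')
print(saved)  # {'batch 0': 'BATCH 0'}
```

This is the same shape as a pipe that carries a filename alongside the processed data so the result can be written back to the file it came from.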


In [15]:
batch_creation_pipe = compose( 
                              list,
                              save_batches,
                              curry(left_fork)(create_savenames),
                              curry(divide_into_batches)(MAX_CHARACTER_LENGTH),
                              load_pickle,
                              )

core_pipe = compose(
                    unpack_ner_output,
                    ner.annotate,
                    join,
                    list,
                    concatonate_by_space, # ' '.join(x)
                    aggregate_articles,   # formats a batch into a list of strings 
                    )

inner_pipe = compose(
                     curry(unpack_tuple)(loss_check), # Check all docs are recovered
                     curry(right_fork)(core_pipe),
                     load_pickle,
                     )

outer_pipe = compose(
                     save_batch,
                     curry(right_fork)(inner_pipe),
                     )

ner_pipe = compose(
                   list,
                   curry(map)(outer_pipe),
                   list,
                   file_finder,
                   )

Running this cell will execute the NER process in two stages:

1) The input data is partitioned into batches, and each batch is saved in a temporary folder.

2) The NER pipe searches the temporary folder, performs NER on each batch, and then overwrites the batch file with the entities found.

In [16]:
batch_creation_pipe(INPUT_FILENAME)
ner_pipe(TEMPORARY_FOLDER)

Loading file...
saving...
Loading file...
Building input for NER...


100%|██████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<?, ?it/s]


Performing NER...
Unpacking NER output...


100%|█████████████████████████████████████████████████████████████████████████████| 319/319 [00:00<00:00, 79632.36it/s]


saving...


[None]
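
The two-stage flow described above (write batches to disk, then process each file and overwrite it) can be sketched without a server. In the example below, `write_batches` and `process_in_place` are hypothetical names, the batches are made-up data, and `len` stands in for the NER step.

```python
import os
import pickle
import tempfile

def write_batches(batches, folder):
    """Stage 1: save each batch to its own pickle file and return the paths."""
    paths = []
    for n, batch in enumerate(batches):
        path = os.path.join(folder, 'batch %s.pkl' % n)
        with open(path, 'wb') as fh:
            pickle.dump(batch, fh)
        paths.append(path)
    return paths

def process_in_place(folder, transform):
    """Stage 2: load every .pkl file, transform it, and overwrite the file."""
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith('.pkl'):
            path = os.path.join(folder, name)
            with open(path, 'rb') as fh:
                data = pickle.load(fh)
            with open(path, 'wb') as fh:
                pickle.dump(transform(data), fh)

with tempfile.TemporaryDirectory() as tmp:
    paths = write_batches([['a', 'b'], ['c']], tmp)
    process_in_place(tmp, len)  # len is a stand-in for the NER step
    results = []
    for p in paths:
        with open(p, 'rb') as fh:
            results.append(pickle.load(fh))

print(results)  # [2, 1]
```

Saving the batches first means a failed run can be resumed from the files already on disk, at the cost of the raw batch being replaced once it has been processed.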

In [17]:
load_pickle(r'temp\batch 0.pkl')

Loading file...


{'n_articles': 13,
 datetime.datetime(2017, 1, 19, 0, 0): {20: {'id': 20,
   'sentences': [[{'word': 'Pakistani', 'tag': 'NATIONALITY'}],
    [{'word': 'Qamar Javed Bajwa', 'tag': 'PERSON'},
     {'word': 'commander', 'tag': 'TITLE'},
     {'word': 'Nawaz Sharif', 'tag': 'PERSON'},
     {'word': 'prime minister', 'tag': 'TITLE'}],
    [{'word': 'Sharif', 'tag': 'PERSON'}, {'word': 'general', 'tag': 'TITLE'}],
    [{'word': 'General', 'tag': 'TITLE'},
     {'word': 'Bajwa', 'tag': 'PERSON'},
     {'word': 'Sharif', 'tag': 'PERSON'}],
    [{'word': 'Pakistan', 'tag': 'COUNTRY'},
     {'word': 'war', 'tag': 'CAUSE_OF_DEATH'},
     {'word': 'Islamist', 'tag': 'IDEOLOGY'}],
    [{'word': 'Pakistan', 'tag': 'COUNTRY'},
     {'word': 'terrorism', 'tag': 'CRIMINAL_CHARGE'}],
    [{'word': 'general', 'tag': 'TITLE'}],
    [{'word': 'senior minister', 'tag': 'TITLE'},
     {'word': 'terrorism', 'tag': 'CRIMINAL_CHARGE'},
     {'word': 'Pakistan', 'tag': 'COUNTRY'},
     {'word': 'intelligence ch

#### Extract persons of interest from GTD data

This pipe was used to extract person entities from the terror event descriptions in the Global Terrorism Database (GTD).

You need a StanfordCoreNLP Server running to run this cell.

In [13]:
INPUT_FILENAME = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\__Data__\GTD\Preprocessed Info.pkl'
OUTPUT_FILENAME = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\__Data__\GTD\GTD Persons.pkl'

ner = NER()

extract_gtd_persons_pipe = compose(
                                   curry(save_pickle)(OUTPUT_FILENAME),
                                   set,
                                   extract_persons,
                                   ner.annotate,
                                   lambda x: ' '.join(x),
                                   load_pickle,
                                   )


entities = extract_gtd_persons_pipe(INPUT_FILENAME)              