# Named Entity Extraction
*Named entity extraction* refers to the identification and extraction of proper names, such as the names of people, companies or places, from a text. It can be used to automatically parse text documents to find the names of people or companies.

This notebook provides a simple example of how to get started with named entity extraction, using it to pull out the names of companies (as well as a lot of junk!) from the UK MPs' register of interests.

Note that simple named entity extraction only takes you so far... but it's maybe a start...

The Python `nltk` package contains a wide range of natural language and text processing utilities that can be used to develop named entity extraction tools.

In [29]:
#!pip3 install nltk
import nltk 
nltk.download('punkt')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /home/jovyan/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [37]:
#via https://gist.github.com/onyxfish/322906
import nltk 
#with open('sample.txt', 'r') as f:
#    sample = f.read()

sample="a John Smith of Mad Up Name Ltd and some Person Names as well"
    
def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

def extract_entity_names_txt(sample):
    sentences = nltk.sent_tokenize(sample)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
    entity_names = []
    for tree in chunked_sentences:
        entity_names.extend(extract_entity_names(tree))
    return set(entity_names)


extract_entity_names_txt(sample)

{'John Smith', 'Mad Up Name Ltd', 'Person Names'}
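
To see what the recursive walk in `extract_entity_names` is doing, here is a minimal sketch that runs the same logic over a hand-built tree rather than one produced by `nltk.ne_chunk_sents`. The tree contents, the part-of-speech tags and the name `walk_entities` are made up for illustration; only `nltk.Tree` is used, so no model downloads are needed.

```python
from nltk import Tree

# Hand-built stand-in for a chunked sentence (assumed shape: an 'S' root
# whose children are (word, tag) tuples or 'NE' subtrees).
toy_tree = Tree('S', [
    ('a', 'DT'),
    Tree('NE', [('John', 'NNP'), ('Smith', 'NNP')]),
    ('of', 'IN'),
    Tree('NE', [('Mad', 'NNP'), ('Up', 'NNP'), ('Name', 'NNP'), ('Ltd', 'NNP')]),
])

def walk_entities(t):
    """Collect the text of every 'NE' subtree, depth first."""
    names = []
    # Leaves are plain tuples with no label() method, so they are skipped.
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            names.append(' '.join(word for word, tag in t))
        else:
            for child in t:
                names.extend(walk_entities(child))
    return names

toy_entities = walk_entities(toy_tree)
print(toy_entities)
```

With `binary=True` the chunker labels every entity simply as `NE`, which is why the walk only has to test for that one label.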

## Example Data - UK MPs' Register of Interests
In the UK, the Register of Interests of MPs is available as open data. It is republished at http://www.membersinterests.org.uk/ as CSV data files.

In [45]:
#Download some data
!wget http://downloads.membersinterests.org.uk/register/161031.zip -P data/mpinterests/

--2016-11-10 17:40:24--  http://downloads.membersinterests.org.uk/register/161031.zip
Resolving downloads.membersinterests.org.uk (downloads.membersinterests.org.uk)... 191.239.203.8
Connecting to downloads.membersinterests.org.uk (downloads.membersinterests.org.uk)|191.239.203.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 240328 (235K) [application/octet-stream]
Saving to: ‘data/mpinterests/161031.zip’


2016-11-10 17:40:24 (2.46 MB/s) - ‘data/mpinterests/161031.zip’ saved [240328/240328]



In [41]:
!ls data/mpinterests/

100927.zip  1017.zip  161031.csv  911.zip  925.zip


Preview the data using a *pandas* dataframe (*pandas* is a powerful library for dealing with tabular data).

In [42]:
import pandas as pd
pd.read_csv('data/mpinterests/1017.zip',compression='zip',skiprows=50,nrows=10,header=None)
#Hmm - the following table looks like structured data in a record format?

Unnamed: 0,0,1,2,3,4
0,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,Amount of donation or nature and value if dona...
1,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,Date of receipt: 28 June 2013
2,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,Date of acceptance: 5 July 2013
3,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"Donor status: private limited company, registr..."
4,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,(Registered 21 July 2013)
5,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,Name of donor: Stonegrave Properties Limited
6,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"Address of donor: Stonegrave House, Stonegrav..."
7,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,Amount of donation or nature and value if dona...
8,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,Date of receipt: 28 June 2013
9,Nigel Adams,Selby and Ainsty,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,Date of acceptance: 5 July 2013
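
If the rows really are `Label: value` fields belonging to a record, one way to start making use of that is to fold them into a dictionary. The sketch below is only a rough idea: it assumes the lines for one record have already been grouped together (which is not tackled here), and the function name `rows_to_record` and the `other` key are invented for the example. Only the untruncated strings from the preview are used.

```python
# Complete field strings copied from the preview; truncated ones are left out.
rows = [
    "Date of receipt: 28 June 2013",
    "Date of acceptance: 5 July 2013",
    "(Registered 21 July 2013)",
    "Name of donor: Stonegrave Properties Limited",
]

def rows_to_record(lines):
    """Fold 'Label: value' lines into a dict; keep anything else under 'other'."""
    record = {"other": []}
    for line in lines:
        # partition() splits on the first ': ' only, so colons in the value survive
        key, sep, value = line.partition(": ")
        if sep:
            record[key] = value.strip()
        else:
            record["other"].append(line)
    return record

record = rows_to_record(rows)
print(record["Name of donor"])
```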


In [39]:
#Import some sample data from UK Members of Parliament register of interests
df=pd.read_csv('data/mpinterests/161031.csv',skiprows=50,nrows=10,header=None)
df
#But this looks scruffier?

Unnamed: 0,0,1,2,3,4
0,Heidi Allen,South Cambridgeshire,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,House in St Albans: (i). (Registered 22 May 20...
1,Heidi Allen,South Cambridgeshire,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"House in London, owned jointly with my husband..."
2,Heidi Allen,South Cambridgeshire,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"RS Bike Paint Ltd, a paint manufacturing compa..."
3,Graham Allen,Nottingham North,Labour,http://www.publications.parliament.uk/pa/cm/cm...,I am unremunerated Founding Chair of trustees ...
4,Graham Allen,Nottingham North,Labour,http://www.publications.parliament.uk/pa/cm/cm...,I am unremunerated Chair of the Rebalancing th...
5,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,Payments received from St George's Hospital NH...
6,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,"16 June 2016, received £1,946.96. This include..."
7,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,Allowances received as a Councillor for Wandsw...
8,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,"13 July 2016, received £680.95. (Registered 29..."
9,David Amess,Southend West,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"Until further notice, I am Parliamentary Advis..."


We can apply the simple named entity extraction function to the registered interest string in column 4.

In [40]:
df['entities']=df[4].apply(extract_entity_names_txt)
df

Unnamed: 0,0,1,2,3,4,entities
0,Heidi Allen,South Cambridgeshire,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,House in St Albans: (i). (Registered 22 May 20...,"{St Albans, House}"
1,Heidi Allen,South Cambridgeshire,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"House in London, owned jointly with my husband...","{London, House}"
2,Heidi Allen,South Cambridgeshire,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"RS Bike Paint Ltd, a paint manufacturing compa...",{RS Bike Paint Ltd}
3,Graham Allen,Nottingham North,Labour,http://www.publications.parliament.uk/pa/cm/cm...,I am unremunerated Founding Chair of trustees ...,{Early Intervention Foundation}
4,Graham Allen,Nottingham North,Labour,http://www.publications.parliament.uk/pa/cm/cm...,I am unremunerated Chair of the Rebalancing th...,{}
5,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,Payments received from St George's Hospital NH...,"{Blackshaw Road, Hospital, London}"
6,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,"16 June 2016, received £1,946.96. This include...",{}
7,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,Allowances received as a Councillor for Wandsw...,"{Town Hall, Wandsworth High Street, Wandsworth..."
8,Rosena Allin-Khan,Tooting,Labour,http://www.publications.parliament.uk/pa/cm/cm...,"13 July 2016, received £680.95. (Registered 29...",{}
9,David Amess,Southend West,Conservative,http://www.publications.parliament.uk/pa/cm/cm...,"Until further notice, I am Parliamentary Advis...","{East Grinstead House, East Grinstead, Caravan..."


In [9]:
#Chunk loader - handy for loading in data a bit at a time if the file is stupidly large...
import os

PATH='data/mpinterests'

for fname in os.listdir(PATH):
    #if a file is a CSV file, process it
    if fname.endswith('.csv'):
        fname='{}/{}'.format(PATH.rstrip('/'),fname)
        #Read in 10,000 rows at a time
        chunks=pd.read_csv(fname,chunksize=10000)
        for chunk in chunks:
            #do something
            pass
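
To make the chunking behaviour concrete, here is a small sketch that reads an in-memory CSV two rows at a time. The CSV text is made up and `io.StringIO` simply stands in for a file on disk; the point is that each `chunk` is an ordinary dataframe holding at most `chunksize` rows.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large file (the rows are made up).
csv_text = "a,1\nb,2\nc,3\nd,4\ne,5\n"

chunk_sizes = []
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2):
    # Each chunk is a DataFrame with at most 2 rows
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)

print(chunk_sizes, total_rows)
```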

## Shared Named Entities

A simple example showing how we can parse the named entities out of the MP register of interests and construct a graph from MP to named entity.

The *networkx* library is great for constructing graphs.

In [68]:
import networkx as nx
DG=nx.DiGraph()

Create a simple function that will add edges from node to extracted named entity. Use a *directed graph* going from MP to named entity extracted from register entry.

In [69]:
def mpinterestsGrapher(x,DG):
    for d in extract_entity_names_txt(x[4]):
        DG.add_node(x[0], Label=x[0])
        DG.add_node(d, Label=d)
        DG.add_edge(x[0],d)

In [70]:
#Construct graph based on records in 2016 dataset
for chunk in pd.read_csv('data/mpinterests/161031.csv',chunksize=1000):
    tmp=chunk.apply(lambda x: mpinterestsGrapher(x,DG),axis=1)

In [71]:
#How many nodes are there?
DG.number_of_nodes()

4222

In [80]:
#Save the graph as a file we can visualise in something like Gephi
nx.write_gexf(DG, "membersInterests.gexf")

Having built the graph, we can look to see which nodes have the highest in-degree (the entities referenced by the most MPs) and which have the highest out-degree (the MPs with the most named entities extracted from their register entries).

Note that at this stage the named entity nodes could be really dirty...
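
As a reminder of what the two measures count, here is a small sketch that works them out from a plain edge list using only the standard library. The MP and entity names are placeholders, not taken from the register. Duplicate edges are removed first because a `DiGraph` stores at most one edge per (source, target) pair.

```python
from collections import Counter

# Toy (MP, entity) edges; the names are placeholders.
edges = [
    ("MP A", "Acme Ltd"),
    ("MP A", "London"),
    ("MP A", "London"),  # repeated mention, counted once
    ("MP B", "London"),
    ("MP C", "London"),
]
unique_edges = set(edges)

# Out-degree: edges leaving each MP. In-degree: edges arriving at each entity.
out_degree = Counter(src for src, dst in unique_edges)
in_degree = Counter(dst for src, dst in unique_edges)

print(in_degree.most_common(1))
print(out_degree.most_common(1))
```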

In [81]:
from operator import itemgetter

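#Show the most popular named entity nodes
#Note: in_degree_iter() is the networkx 1.x API; in networkx 2.0 and later, use DG.in_degree() instead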
for node in sorted(DG.in_degree_iter(),key=itemgetter(1),reverse=True)[:20]:
    print(node)

('Name', 385)
('London', 339)
('Date', 198)
('Ipsos MORI', 67)
('House', 57)
('ComRes', 52)
('London E1W', 46)
('Square', 45)
('Address', 44)
('Payment', 39)
('Flat', 37)
('Foreign Affairs', 36)
('Fee', 35)
('Four Millbank', 35)
('Israel', 33)
('APPG', 33)
('Office Manager', 29)
('British', 26)
('London W1A', 26)
('London N1', 24)


In [107]:
#look for something more meaningful - dig a bit deeper...
for node in sorted(DG.in_degree_iter(),key=itemgetter(1),reverse=True):
    name=node[0]
    indegree=node[1]
    if indegree>1 and [n for n in ['Ltd', 'Limited', 'LLP', 'Inc', 'BBC', 'ITV', 'News'] if n in name]:
        print(node)

('BBC', 19)
('Israel Ltd Address', 18)
('China Forum Ltd Address', 9)
('Guardian News', 8)
('UK China Forum Ltd', 7)
('News UK', 6)
('Media Ltd', 5)
('Australia Israel Cultural Exchange Limited', 5)
('Ireland Ltd', 4)
('Georgina Capel Associates Ltd', 3)
('Northumbrian Water Ltd Address', 2)
('DCD Properties Ltd Address', 2)
('Random House Group Ltd', 2)
('LLP Address', 2)
('BBC Broadcasting House', 2)
('Newsquest', 2)
('GO Movement Ltd', 2)
('Telegraph Media Group Ltd', 2)
('Partners Ltd', 2)
('Hat Trick Productions Ltd', 2)
('News Building', 2)
('Australia Israel Cultural Exchange Ltd Address', 2)
('Grassroots Out Ltd', 2)
('QUBRIC Ltd', 2)
('Gleneagles Hotel Ltd Address', 2)
('Weightmans LLP', 2)
('Carlton Rock Ltd Address', 2)
('Saxton Green LLP', 2)
('Associated Newspapers', 2)
('Guardian News Media Ltd', 2)


In [82]:
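#Show the members with most linked items in their register entries
#Note: out_degree_iter() is the networkx 1.x API; in networkx 2.0 and later, use DG.out_degree() instead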
for node in sorted(DG.out_degree_iter(),key=itemgetter(1),reverse=True)[:20]:
    print(node)

('Kenneth Clarke', 104)
('Liam Fox', 93)
('Mark Pritchard', 59)
('Michael Gove', 58)
('Alex Salmond', 56)
('Nick Clegg', 54)
('Graham Brady', 52)
('Tasmina Ahmed-Sheikh', 50)
('Henry Smith', 49)
('Gisela Stuart', 49)
('Geoffrey Cox', 49)
('Bob Blackman', 49)
('Virendra Sharma', 46)
('Jack Lopresti', 46)
('Nigel Evans', 46)
('Thomas Tugendhat', 45)
('Tristram Hunt', 44)
('Simon Danczuk', 44)
('Graham Jones', 43)
('David Lammy', 43)


In [118]:
#How about another file?

def mpinterestsGrapher2(x,DG):
    if x[4].startswith("Name of donor: "):
        DG.add_node(x[0], Label=x[0])
        d=x[4].replace("Name of donor: ",'').strip()
        DG.add_node(d, Label=d)
        DG.add_edge(x[0],d)
        
DG=nx.DiGraph()
for chunk in pd.read_csv('data/mpinterests/1017.zip',compression='zip',header=None,chunksize=1000):
    tmp=chunk.apply(lambda x: mpinterestsGrapher2(x,DG),axis=1)

In [120]:
for node in sorted(DG.in_degree_iter(),key=itemgetter(1),reverse=True)[:20]:
    print(node)

('', 24)
('United and Cecil Club', 18)
('Labour Friends of Israel', 11)
('Government of the United Arab Emirates', 10)
('Conservative Friends of Israel', 9)
('JTI', 9)
('HM Government of Gibraltar', 7)
('Majlis As Shura, Shura Council of the Kingdom of Saudi Arabia', 7)
('British Association for Shooting and Conservation (BASC)', 6)
('Sun Mark Ltd', 6)
('Results UK', 5)
('BASF PLC', 5)
('BASF plc', 5)
('Motor Sports Association', 4)
('European Parliamentary Forum on Population and Development', 4)
('Catholic Bishops Conference of England and Wales', 4)
('The European Azerbaijan Society', 4)
('RESULTS UK', 3)
('Christian Aid', 3)
('(1) UK-Korea forum for the Future; (2) The Korea Foundation', 3)


In [121]:
for node in sorted(DG.in_degree_iter(),key=itemgetter(1),reverse=True):
    name=node[0]
    indegree=node[1]
    if indegree>1 and [n for n in ['Ltd', 'Limited', 'LLP', 'Inc', 'BBC', 'ITV', 'News'] if n in name]:
        print(node)

('Sun Mark Ltd', 6)
('Brompton Capital Ltd', 3)
('Dukehill Services Ltd', 3)
('BBC Cymru Wales', 2)
('JCB Research Limited', 2)
('PricewaterhouseCoopers LLP', 2)
('JCB Research Ltd', 2)
('Ministry of Sound Group Ltd', 2)
('ITV plc', 2)
('DCD Properties Ltd', 2)
('Hitachi Rail Europe Ltd', 2)
