## Getting started

In [1]:
import pandas as pd
import requests

import spacy
# Import the English language class
from spacy.lang.en import English

# make use of widescreen
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


## Reading in GOV.UK data
Currently this data comes from the taxonomy pipeline. We should build our own bespoke pipeline, because the taxonomy output is all lower case and has some potentially useful punctuation removed, e.g. £, $ and percentage signs.

In [120]:
# available from GOV.UK data scientists
# data has been pre-processed for taxonomy work
# this loses us useful information such as capital letters
# will want to adjust pre-processing pipeline
df = pd.read_csv("data/11-02-19/labelled.csv",
                 usecols=["base_path", "content_id",
                          "description", "locale",
                          "title", "body", "combined_text"])

In [3]:
df.head()

Unnamed: 0,base_path,content_id,description,locale,title,body,combined_text
0,/government/publications/list-of-psychologists...,04a0cc0d-0b9f-45ad-bf57-7c54cbab9df9,list of english speaking psychologists and psy...,en,chile - list of psychologists and psychiatrist...,prepared by british embassy/consulate santiago...,chile - list of psychologists and psychiatrist...
1,/government/news/charity-commission-names-furt...,5fa49c52-7631-11e4-a3cb-005056011aef,regulator increases transparency of its work.,en,charity commission names further charities und...,the charity commission has today named further...,charity commission names further charities und...
2,/government/publications/trust-and-confidence-...,d0341424-12a1-4b4c-9045-2e74ba17f2d5,independent research into trust and confidence...,en,trust and confidence in the charity commission...,the charity commission commissioned populus to...,trust and confidence in the charity commission...
3,/government/speeches/william-shawcross-speech-...,9245dfca-4210-41d9-9ffd-7fcc35dc1642,william shawcross asks charities to pull toget...,en,william shawcross speech at commission’s publi...,good morning and thank you for joining us here...,william shawcross speech at commission’s publi...
4,/government/statistics/crime-statistics-focus-...,5fec046a-7631-11e4-a3cb-005056011aef,crime statistics from the crime survey for eng...,en,public perceptions of crime and the police and...,official statistics are produced impartially a...,public perceptions of crime and the police and...


## Documents, spans and tokens

In [4]:
# Create the nlp object
nlp = English()

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import the English language class from `spacy.lang.en` and instantiate it. You can use the nlp object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in `spacy.lang`.

In [5]:
df.at[3,"title"]

'william shawcross speech at commission’s public meeting in southampton'

In [6]:
# Created by processing a string of text with the nlp object
doc = nlp(df.at[3,"title"])

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

william
shawcross
speech
at
commission
’s
public
meeting
in
southampton


When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence by the way and lets you iterate over its tokens, or get a token by its index. But more on that later!

In [7]:
doc[0]

william

In [8]:
# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

william


When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

In [9]:
# can take slices
doc[0:-1]

william shawcross speech at commission’s public meeting in
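
The indexing and slicing shown above can be tried without the GOV.UK data. The following is a minimal sketch that assumes a blank English pipeline (tokenizer only) and a made-up sentence; it shows that a `Doc` supports `len`, negative indexing and slicing, and that a slice is a `Span` that knows its own start and end token positions.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed
nlp = spacy.blank("en")

# Made-up example sentence (an assumption, not taken from the data set)
doc = nlp("the queen visited southampton")

print(len(doc))        # number of tokens in the Doc
print(doc[0].text)     # first token
print(doc[-1].text)    # last token, using negative indexing

# Slicing a Doc returns a Span, a view over a run of tokens
span = doc[1:3]
print(span.text, span.start, span.end)
```

A `Span` does not copy the text; it is a view onto the underlying `Doc`, which is why it can report token offsets.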

## Lexical attributes

In [10]:
doc = nlp(df.at[18,"description"])
doc

the resting place of 4 members of the royal warwickshire regiment has finally been marked more than 100 years after they gave their lives for their country.

In [11]:
# Find a duration or period: a number followed by "years"
# Could also be used for percentages, but we have removed % from text
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "years"
        if next_token.text == "years":
            print("Time period found (years):", token.text)

Time period found (years): 100
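
The same look-ahead would work for percentages if the `%` sign survived pre-processing. This is a sketch only: the sentence is invented for illustration and a blank English pipeline is assumed, since `like_num` is a lexical attribute and needs no statistical model.

```python
import spacy

nlp = spacy.blank("en")

# Invented sentence that still contains percentage signs
doc = nlp("The grant covers 60% of costs and the remaining 40% is match funded.")

percentages = []
for token in doc:
    # Guard against running off the end of the Doc before looking ahead
    if token.like_num and token.i + 1 < len(doc):
        if doc[token.i + 1].text == "%":
            percentages.append(token.text)

print(percentages)
```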


## Statistical models
Let's add some more power to the nlp object!

In this lesson, you'll learn about spaCy's statistical models.

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.

In [12]:
# https://spacy.io/usage/models
# https://stackoverflow.com/questions/52677634/pycharm-cant-find-spacy-model-en

nlp = spacy.load("en_core_web_sm")

In [13]:
df.title

0         chile - list of psychologists and psychiatrist...
1         charity commission names further charities und...
2         trust and confidence in the charity commission...
3         william shawcross speech at commission’s publi...
4         public perceptions of crime and the police and...
5                      britain honours its holocaust heroes
6                            esf funding for the north east
7         charities: holding moving and receiving funds ...
8                       english indices of deprivation 2015
9         dcms improves efficiency and cuts costs with r...
10                  advice for british nationals in kolkata
11        wales office minister welcomes prime minister’...
12          birth summary tables in england and wales: 2013
13        punishment and reform: effective community sen...
14                           lord-lieutenant for midlothian
15        cic36: application to form a community interes...
16        long-term effects of childhood

### Predicting part of speech (POS) tags

In [14]:
# Process a text
doc = nlp(df.at[22,"description"])

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

the DET
queen NOUN
has VERB
been VERB
pleased ADJ
to PART
approve VERB
that ADP
the DET
honour NOUN
of ADP
knighthood NOUN
be VERB
conferred VERB
upon ADP
oliver NOUN
heald NOUN
mp PRON
. PUNCT


Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

First, we load the small English model and receive an nlp object.

Next, we're processing the text from a page's description.

For each token in the Doc, we can print the text and the `pos_` attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an ID.

Here, the model reasonably tagged "pleased" as an adjective and correctly predicted "knighthood" as a noun, but got "mp" wrong by labelling it a pronoun.
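
The underscore convention can be checked on any attribute pair. The sketch below uses `orth` and `orth_` rather than `pos` and `pos_`, purely so that it runs on a blank pipeline without a model (an assumption made for convenience); the integer ID can be turned back into its string through the shared string store.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("the queen has been pleased")
token = doc[1]

print(token.orth_)                    # string form of the token text
print(token.orth)                     # integer ID (a hash) for the same value
print(nlp.vocab.strings[token.orth])  # look the ID up to recover the string
```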

### Predicting syntactic dependencies

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The `dep_` attribute returns the predicted dependency label.

The head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

To describe syntactic dependencies, spaCy uses a standardized label scheme.



In [15]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

the DET det queen
queen NOUN nsubj been
has VERB aux been
been VERB ROOT been
pleased ADJ acomp been
to PART aux approve
approve VERB xcomp pleased
that ADP mark conferred
the DET det honour
honour NOUN nsubjpass conferred
of ADP prep honour
knighthood NOUN pobj of
be VERB auxpass conferred
conferred VERB ccomp approve
upon ADP prep conferred
oliver NOUN compound mp
heald NOUN compound mp
mp PRON pobj upon
. PUNCT punct been


### Predicting named entities
Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The `doc.ents` property lets you access the named entities predicted by the model.

It returns a sequence of Span objects, so we can print the entity text and the entity label using the `label_` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

In [16]:
# Process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


However, it performs poorly on our data, possibly due to the lower-casing applied during pre-processing.

In [17]:
#df.body

In [18]:
# show what we are looking at
print(df.at[0,"body"])
# Process a text
doc = nlp(df.at[0,"body"])

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print("--------")
    print(ent.text, ent.label_)

prepared by british embassy/consulate santiago chile. list of psychologists and psychiatrists 2017 pdf 412kb 8 pages
--------
british NORP
--------
2017 DATE
--------
412 CARDINAL
--------
8 CARDINAL


In [126]:
# And if we don't use lower case
# Process a text
doc = nlp("It's Gareth Heyes not Gareth Hayes!")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print("--------")
    print(ent.text, ent.label_)

--------
Gareth Heyes PERSON
--------
Gareth Hayes PERSON


In [19]:
# And if we don't use lower case
# Process a text
doc = nlp("Prepared by British embassy/consulate Santiago, Chile. List of psychologists and psychiatrists 2017 pdf 412kb 8 pages")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print("--------")
    print(ent.text, ent.label_)

--------
British NORP
--------
Santiago GPE
--------
Chile GPE
--------
2017 CARDINAL
--------
412 CARDINAL
--------
8 CARDINAL


In [20]:
# show what we are looking at
print(df.at[305692,"body"])
# Process a text
doc = nlp(df.at[305692,"body"])

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print("--------")
    print(ent.text, ent.label_)

the dame lesley strathie operational excellence award recognises excellence in the delivery of public services. this includes putting user needs at the heart of a project and significantly improving the quality value for money or productivity of services to the public. gov.uk notify is delivered by a multidisciplinary team of 12 people including designers user researchers and developers. they work closely with service teams across the country to constantly iterate and improve it. the judges said: gov.uk notify is a great example of a small diverse set of civil servants challenging established ways of doing things to rapidly deliver a product benefitting millions of people whilst saving taxpayers millions. gov.uk notify product manager pete herlihy said: it’s ace. obviously we’re incredibly proud of notify and the impact it’s having right across the public sector but this recognition for how our little team goes about delivering it really means so much to us. about gov.uk notify gov.uk 

#### Tip: explain the method

In [21]:
spacy.explain('CARDINAL')
# Could extract dates discussed and adjust search results accordingly.

'Numerals that do not fall under another type'

In [22]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [23]:
spacy.explain('GPE')

'Countries, cities, states'

In [24]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

#### Pre-processing is important
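By lower-casing we lose information the model relies on: in the examples above, the cased text yields Santiago and Chile as GPE entities, which the lower-cased version misses.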
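
One possible direction for a gentler pre-processing step is sketched below. `light_clean` is a hypothetical helper, not part of the taxonomy pipeline, and the input string is invented; the point is only that HTML tags and stray whitespace can be removed while case, currency symbols and percentage signs are left intact.

```python
import re

def light_clean(text):
    """Hypothetical cleaner: strip HTML tags and tidy whitespace, keep case and symbols."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop anything that looks like an HTML tag
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

# Invented example input, not taken from the data set
raw = "<p>GOV.UK Notify saved   £1.5m, a 20% cut in costs.</p>"
cleaned = light_clean(raw)
print(cleaned)
```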

## Predicting named entities in context
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

In [25]:
# show what we are looking at
print(df.at[305692,"title"])
# Process a text
doc = nlp(df.at[305692,"title"])

gov.uk notify wins civil service operational excellence award


In [26]:
# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "GOV.UK" and "Notify"
govuk = doc[0:1]
notify = doc[1:2]

# Print the span text
print("Missing entity:", govuk.text)
print("Missing entity:", notify.text)

Missing entity: gov.uk
Missing entity: notify


Of course, you don't always have to do this manually. In the next exercise, you'll learn about spaCy's rule-based matcher, which can help you find certain words and phrases in text.

## Rule-based matching
We'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.

Why not just use regular expressions?

Consider: "duck" (verb) vs. "duck" (noun)

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use the model's predictions.

#### Match patterns
Lists of dictionaries, one per token

* Match exact token texts: `[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]`
* Match lexical attributes: `[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`
* Match any token attributes: `[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".
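
The first two pattern styles can be compared side by side. This is a sketch under some assumptions: it uses a blank English pipeline, an invented sentence, and the `matcher.add(name, [pattern])` call style, which is the form spaCy v3 expects. The `LEMMA` and `POS` pattern is left out because it needs a loaded statistical model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Exact token text versus case-insensitive lowercase form
matcher.add("IPHONE_X_EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
matcher.add("IPHONE_X_LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("I upgraded to the iPhone X but she kept her old iphone x")

for match_id, start, end in matcher(doc):
    # The match ID is a hash; the string store maps it back to the pattern name
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```

The exact-text pattern only fires on the cased phrase, while the lowercase pattern fires on both occurrences.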

#### Government context specific matching
Fortunately registers exist which are [canonical sources of lists about Government](https://www.registers.service.gov.uk/category/government); there are registers for [organisations](https://www.registers.service.gov.uk/registers/government-organisation) and [services](https://www.registers.service.gov.uk/registers/government-service) (although if departments have changed name we won't be able to spot these). We can use the API to get these and create a dictionary of match patterns.

This is better than relying on regex, as an organisation name often precedes one of its services in the text, where the text refers to the service and not the parent organisation, e.g. GOV.UK Notify.
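
A dictionary of match patterns could be built from register names along the following lines. This is a sketch rather than the approach taken later in the notebook: the two names are a hand-picked sample standing in for the full register, the sentence is invented, and the patterns are made case-insensitive with `LOWER` because our text is lower case. Each name is tokenized with `nlp.make_doc` so that the pattern has one dictionary per spaCy token, including commas.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Small sample standing in for the names fetched from the register
org_names = ["Royal Parks", "Department for Digital, Culture, Media and Sport"]

# One token-level pattern per organisation, keyed by the canonical name
patterns = {
    name: [{"LOWER": token.lower_} for token in nlp.make_doc(name)]
    for name in org_names
}

matcher = Matcher(nlp.vocab)
for name, pattern in patterns.items():
    matcher.add(name, [pattern])  # spaCy v3 call style: a list of patterns

doc = nlp("the royal parks charity works with the department for digital, culture, media and sport")
found = sorted(nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc))
print(found)
```

Using the canonical name as the match key means the match ID can be mapped straight back to the register entry.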

In [31]:
import requests
import pandas as pd
from io import StringIO

In [32]:
orgs = requests.get('https://government-organisation.register.gov.uk/records.csv?page-size=5000')


In [33]:
orgs = orgs.text
orgs = StringIO(orgs)
orgs = pd.read_csv(orgs)
print(orgs.head())
print(orgs.tail())

   index-entry-number  entry-number       entry-timestamp     key  \
0                1011          1011  2019-01-21T16:44:12Z  OT1268   
1                1010          1010  2019-01-21T16:31:46Z  OT1267   
2                1009          1009  2019-01-21T16:23:43Z  OT1266   
3                1008          1008  2019-01-07T14:39:48Z    EA41   
4                1007          1007  2018-10-18T09:50:55Z  OT1067   

  government-organisation                                      name  \
0                  OT1268            UK Council for Internet Safety   
1                  OT1267  Employment Agency Standards Inspectorate   
2                  OT1266            Single Financial Guidance Body   
3                    EA41                               Royal Parks   
4                  OT1067                            UKTI Education   

                                             website start-date    end-date  
0  https://www.gov.uk/government/organisations/uk...        NaN         NaN  
1 

In [34]:
GOV_ORGS = list(orgs.name.values)
GOV_ORGS

['UK Council for Internet Safety',
 'Employment Agency Standards Inspectorate',
 'Single Financial Guidance Body',
 'Royal Parks',
 'UKTI Education',
 'Department for Digital, Culture, Media and Sport',
 'Department of Culture, Arts and Leisure Northern Ireland',
 'Export Control Joint Unit',
 'Office of the Secretary of State for Scotland',
 'Race Disparity Unit',
 'Government Property Agency',
 'Teaching Regulation Agency',
 'National College for Teaching and Leadership',
 'Office for Civil Society',
 'Commission for Countering Extremism',
 'Department of Culture, Arts and Leisure',
 'Department for Constitutional Affairs',
 'Department for Children, Schools and Families',
 'Office for Product Safety and Standards',
 'Department for Business, Enterprise and Regulatory Reform',
 'HM Inspectorate of Constabulary and Fire & Rescue Services',
 'Ministry of Housing, Communities and Local Government',
 'Department of Health and Social Care',
 'Gangmasters and Labour Abuse Authority',
 'NHS

In [35]:
isinstance(GOV_ORGS, list)

True

In [36]:
# actually this doesn't give us what we need, which is the name of the service as it would appear in website text
# however it could be used to create metadata, refers to service, if we check for structural links that have ho

#services = requests.get('https://government-service.register.gov.uk/records.csv?page-size=5000')
#services = StringIO(services.text)
#df = pd.read_csv(services)
#df.head()


#### Using the matcher
We use the API call above to create a Matcher

In [37]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'GOV.UK'}, {'TEXT': 'Verify'}]
matcher.add('GOVUK_SERVICE_PATTERN', None, pattern)

# Process some text
doc = nlp("GOV.UK is home to Verify - a secure way to prove who you are online.\
You need to have a UK address to use GOV.UK Verify. You don’t have to be a UK citizen.")

# Call the matcher on the doc
matches = matcher(doc)

The matcher is initialized with the shared vocabulary, `nlp.vocab`. You'll learn more about this later – for now, just remember to always pass it in.

The `matcher.add` method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to None. The third argument is the pattern.

To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.
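
The call `matcher.add(name, None, pattern)` used in this notebook is the spaCy v2 style. In spaCy v3 the patterns are passed as a list and the callback becomes the optional `on_match` keyword. The sketch below shows that form with a callback; `collect_match` and the `found` list are illustrative names, and a blank English pipeline is assumed so no model is needed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

found = []

def collect_match(matcher, doc, i, matches):
    """Callback run once per match; stores the matched text."""
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)

pattern = [{"TEXT": "GOV.UK"}, {"TEXT": "Verify"}]
# List of patterns plus an optional on_match callback
matcher.add("GOVUK_SERVICE_PATTERN", [pattern], on_match=collect_match)

doc = nlp("GOV.UK is home to Verify. You need a UK address to use GOV.UK Verify.")
matches = matcher(doc)
print(found)
```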

In [38]:
matches

[(8064984008469307775, 25, 27)]

When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start index and the end index of the matched span.

This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.

In [39]:
# Call the matcher on the doc
doc = nlp("GOV.UK is home to Verify - a secure way to prove who you are online.\
You need to have a UK address to use GOV.UK Verify. You don’t have to be a UK citizen.")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

GOV.UK Verify


Here's an example of a more complex pattern using lexical attributes.

We're looking for three tokens:

* Two case-insensitive tokens for "general" and "election".
* A token consisting of only digits.

The pattern matches the tokens "General election 2017".

In [40]:
pattern = [
    {'LOWER': 'general'},
    {'LOWER': 'election'},
    {'IS_DIGIT': True}
]

matcher.add('GENERAL_ELECTION_DATE_PATTERN', None, pattern)


doc = nlp("PM statement: General election 2017 \
Prime Minister Theresa May made a speech outside Downing Street following the 2017 general election.")

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

General election 2017


#### Operators and quantifiers
But we miss 'the 2017 general election'. Let's adjust our pattern to cope.

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

Here, the "?" operator makes the digit tokens optional, so the pattern matches "general election" with or without a number directly before or after it.

In [41]:
pattern = [
    {'IS_DIGIT': True, 'OP': '?'}, # optional match 0 or 1 times
    {'LOWER': 'general'},
    {'LOWER': 'election'},
    {'IS_DIGIT': True, 'OP': '?'} # optional match 0 or 1 times
]

matcher.add('GENERAL_ELECTION_PATTERN', None, pattern)


doc = nlp("PM statement: General election 2017 \
Prime Minister Theresa May made a speech outside Downing Street following the 2017 general election.")

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

General election 2017
General election 2017
2017 general election
general election


"OP" can have one of four values:

An "!" negates the token, a "?" makes the token optional, a "+" matches a token 1 or more times, and an "*" matches 0 or more times.

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.
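
Optional tokens produce overlapping matches, and in the output above the repeated "General election 2017" is most likely there because the earlier `GENERAL_ELECTION_DATE_PATTERN` is still registered on the same matcher. One way to tidy this up, sketched here on a fresh matcher with an invented sentence, is `spacy.util.filter_spans`, which keeps the longest span wherever spans overlap.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)  # fresh matcher, so no earlier patterns linger

pattern = [
    {"IS_DIGIT": True, "OP": "?"},  # optional year before
    {"LOWER": "general"},
    {"LOWER": "election"},
    {"IS_DIGIT": True, "OP": "?"},  # optional year after
]
matcher.add("GENERAL_ELECTION_PATTERN", [pattern])

doc = nlp("General election 2017 statement from the PM, who spoke after the 2017 general election.")

# Turn every match into a Span, then keep only the longest non-overlapping ones
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
longest = filter_spans(spans)
print([span.text for span in longest])
```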

#### Lemmas and verbs
Lemmas give you flexibility. We could determine whether a page contained anything about an entity 'buying' a thing.

## Enriching metadata with Matcher
Now that we've learnt some basics, let's try to loop through all our content and enrich the metadata about these pages.

Let's use the simple general election matcher pattern from above. We'll identify pages that mention a general election.

## Enriching metadata with PhraseMatcher or spacy-lookup
From reading the docs, [PhraseMatcher](https://spacy.io/api/phrasematcher) might be better suited to our fixed list of organisation and service names from the Registers.

The PhraseMatcher will match on the `ORTH` value, i.e. the exact text. This lets it match large terminology lists and exact occurrences of strings, without having to worry about spaCy's tokenization. For more background on this, why the PhraseMatcher can't work on other attributes, and possible solutions for case-insensitivity, see this [discussion](https://github.com/explosion/spaCy/issues/1579) on the issue tracker.
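
spaCy releases from v2.1 onwards also accept an `attr` argument on the `PhraseMatcher`, which offers one route to case-insensitive matching. The sketch below contrasts the default exact-text behaviour with `attr="LOWER"`; the three organisation names are a small sample standing in for the register list, the sentence is invented, and the `add(key, docs)` call style is the one spaCy v3 expects.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Small sample standing in for the full register list
org_names = ["Home Office", "Crown Prosecution Service", "UK Sport"]
patterns = [nlp.make_doc(name) for name in org_names]

exact = PhraseMatcher(nlp.vocab)                # default: exact token text
lower = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercase forms instead
exact.add("GOV_ORG", patterns)
lower.add("GOV_ORG", patterns)

doc = nlp("the crown prosecution service works with the Home Office")

exact_hits = [doc[s:e].text for _, s, e in sorted(exact(doc), key=lambda m: m[1])]
lower_hits = [doc[s:e].text for _, s, e in sorted(lower(doc), key=lambda m: m[1])]
print(exact_hits)
print(lower_hits)
```

Only the `LOWER` matcher finds the lower-cased organisation name, which matters for data that has been through the taxonomy pre-processing.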

There's also a community plug-in that looks handy called [spacy-lookup](https://github.com/mpuig/spacy-lookup). `spacy-lookup` only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!), or in a pipeline with a loaded model. If you're loading a model and your pipeline includes a tagger, parser and entity recognizer, make sure to add the entity component as last=True, so the spans are merged at the end of the pipeline.

In [73]:
import spacy
from spacy_lookup import Entity

nlp = spacy.load('en_core_web_sm')
entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"I am a product manager for a java and python.")
assert doc._.has_entities == True
assert doc[0]._.is_entity == False
assert doc[3]._.entity_desc == 'product manager'
assert doc[3]._.is_entity == True

print([(token.text, token._.canonical) for token in doc if token._.is_entity])

[('product manager', 'product manager'), ('python', 'python')]


In [74]:
import spacy
from spacy_lookup import Entity

nlp = spacy.load('en_core_web_sm')
# our list of all Government Organisations from the Register
entity = Entity(keywords_list=GOV_ORGS)
nlp.add_pipe(entity, last=True)

doc = nlp(u"This document contains the following information, the deepcut review.")
assert doc._.has_entities == True
assert doc[0]._.is_entity == False
assert doc[8]._.entity_desc == 'deepcut review'
assert doc[8]._.is_entity == True

print([(token.text, token._.canonical) for token in doc if token._.is_entity])

[('deepcut review', 'Deepcut Review')]


### Looping through many docs
We want to pass a bunch of content, as a column of a pandas dataframe. We then want to loop through each doc and return the GOV_ORGS that are found therein as a list.

Start off with a function for one doc, then iterate through.
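
As an alternative to spacy-lookup, the same "one doc, then many" shape can be sketched with a `PhraseMatcher` and `nlp.pipe`, which streams texts through the pipeline in batches. Everything here is illustrative: `orgs_in_doc` is a hypothetical helper, the two organisation names stand in for `GOV_ORGS`, and the texts are invented.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Sample names standing in for GOV_ORGS
org_names = ["Home Office", "Crown Prosecution Service"]
canonical = {name.lower(): name for name in org_names}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("GOV_ORG", [nlp.make_doc(name) for name in org_names])

def orgs_in_doc(doc):
    """Return canonical organisation names found in a Doc, in order, without duplicates."""
    matches = sorted(matcher(doc), key=lambda m: m[1])
    names = [canonical[doc[start:end].text.lower()] for _, start, end in matches]
    return list(dict.fromkeys(names))

texts = [
    "the home office and the crown prosecution service published guidance. the home office will review it.",
    "no organisations are mentioned here.",
]

# nlp.pipe yields one Doc per input text
results = [orgs_in_doc(doc) for doc in nlp.pipe(texts)]
print(results)
```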

In [86]:
import spacy
from spacy_lookup import Entity

# this should be specified on import
nlp = spacy.load('en_core_web_sm')
# GOV_ORGS needs to be specified
entity = Entity(keywords_list=GOV_ORGS)
nlp.add_pipe(entity, last=True)

def text_gov_org_match(text):
    """Return a list of GOV_ORG entities found in a str."""
    doc = nlp(text)
    return [token._.canonical for token in doc if token._.is_entity]

text_gov_org_match(df.at[3,"body"])


1 loop, best of 5: 469 ms per loop


Now, what if we pass in a dataframe column of text?

In [89]:
df.shape

(305703, 7)

In [102]:
df_small = df.head(30).copy()

In [118]:
def text_gov_org_match(text):
    """Return a list of GOV_ORG entities found at least once in a str."""
    doc = nlp(text)
    # remove any duplicates because dictionaries cannot have duplicate keys
    list_of_gov_org_entities_with_duplicates = [token._.canonical for token in doc if token._.is_entity]
    list_no_duplicates = list(dict.fromkeys(list_of_gov_org_entities_with_duplicates))
    return list_no_duplicates

text_gov_org_match(df.at[3,"body"])

['The Charity Commission']

In [115]:
df.columns

Index(['base_path', 'content_id', 'description', 'locale', 'title', 'body',
       'combined_text'],
      dtype='object')

In [119]:
# iterate over rows with iterrows()
for index, row in df.head(100).iterrows():
     # access data using column names
     print(row['base_path'], "REFERS_TO", text_gov_org_match(df.at[index,"combined_text"]))

/government/publications/list-of-psychologists-and-psychiatrists-2017 REFERS_TO []
/government/news/charity-commission-names-further-charities-under-investigation REFERS_TO ['The Charity Commission']
/government/publications/trust-and-confidence-in-the-charity-commission-2015 REFERS_TO ['The Charity Commission']
/government/speeches/william-shawcross-speech-at-commissions-public-meeting-in-southampton REFERS_TO ['The Charity Commission']
/government/statistics/crime-statistics-focus-on-public-perceptions-of-crime-and-the-police-and-the-personal-well-being-of-victims-april-2013-to-march-2014 REFERS_TO []
/government/news/britain-honours-its-holocaust-heroes REFERS_TO ['Foreign & Commonwealth Office']
/government/publications/esf-funding-allocated-to-the-north-east REFERS_TO ['Department for Work and Pensions', 'Skills Funding Agency', 'National Offender Management Service']
/government/publications/charities-holding-moving-and-receiving-funds-safely REFERS_TO []
/government/statistics/e

/government/speeches/david-camerons-holocaust-commission-speech REFERS_TO []
/government/publications/growing-a-culture-of-social-impact-investing-in-the-uk REFERS_TO ['Financial Conduct Authority']
/government/news/communities-minister-celebrates-english-language-learners-in-bradford REFERS_TO []
/government/publications/businesses-who-have-signed-the-armed-forces-covenant-company-names-beginning-with-d REFERS_TO []
/guidance/anti-social-behaviour-on-public-transport-safety-measures REFERS_TO ['Crown Prosecution Service', 'Home Office']
/government/news/a-slice-of-the-big-apple-coming-to-a-neighbourhood-near-you REFERS_TO ['Monitor']
/government/statistics/deaths-registered-in-england-and-wales-provisional-week-ending-24-mar-2017 REFERS_TO []
/government/publications/community-server REFERS_TO ['The Charity Commission']
/government/publications/form-n11m-defence-form-mortgaged-residential-premises REFERS_TO []
/government/news/councils-are-free-to-avoid-charges-for-royal-wedding-stree

Caveat: this returns all found entities, but we might want to limit it to just GOV_ORG entities. The only reason other entities aren't showing up at the moment is that the text is lower case. Perhaps we can subset entities by their type, filtering for our custom label? There might also be some conflicts.

Actually, after testing I'm not so sure about this. We will need to review spacy-lookup and its `canonical` attribute.
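
Filtering entities by label might look like the sketch below. It does not use spacy-lookup: the entities are set by hand on a blank pipeline, and `GOV_ORG` is a hypothetical custom label, so this only illustrates the filtering step and not how the label would be assigned.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The Home Office opened a new office in Canada")

# Hand-set, non-overlapping entities; GOV_ORG is a hypothetical custom label
doc.ents = [
    Span(doc, 1, 3, label="GOV_ORG"),  # "Home Office"
    Span(doc, 8, 9, label="GPE"),      # "Canada"
]

# Keep only the entities carrying our custom label
gov_orgs = [ent.text for ent in doc.ents if ent.label_ == "GOV_ORG"]
print(gov_orgs)
```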

In [125]:
text_gov_org_match(u"The UK is nice. Scotland also is lovely.")

[]

In [123]:
# problems if overlap between a GPE and GOV_ORG
text_gov_org_match(u"British Film Institute the UK Sport blah blah Canada")

ValueError: [E103] Trying to set conflicting doc.ents: '(4, 5, 'GPE')' and '(4, 6, '')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

## Convert into edge list
We want our nodes to be base_paths and GOV_ORGS, with edges REFERS_TO. This will let us load the relationships into a graph database.

In [144]:
# https://stackoverflow.com/questions/31674557/how-to-append-rows-in-a-pandas-dataframe-in-a-for-loop
cols = ['base_path', 'gov_org']
lst = []
for index, row in df.head(100).iterrows():
    lst.append([row['base_path'], text_gov_org_match(df.at[index,"combined_text"])])
df1 = pd.DataFrame(lst, columns=cols)
df1

Unnamed: 0,base_path,gov_org
0,/government/publications/list-of-psychologists...,[]
1,/government/news/charity-commission-names-furt...,[The Charity Commission]
2,/government/publications/trust-and-confidence-...,[The Charity Commission]
3,/government/speeches/william-shawcross-speech-...,[The Charity Commission]
4,/government/statistics/crime-statistics-focus-...,[]
5,/government/news/britain-honours-its-holocaust...,[Foreign & Commonwealth Office]
6,/government/publications/esf-funding-allocated...,"[Department for Work and Pensions, Skills Fund..."
7,/government/publications/charities-holding-mov...,[]
8,/government/statistics/english-indices-of-depr...,[]
9,/government/news/dcms-improves-efficiency-and-...,"[UK Film Council, British Film Institute, UK S..."


In [145]:
# https://stackoverflow.com/questions/27263805/pandas-when-cell-contents-are-lists-create-a-row-for-each-element-in-the-list
# Actually we want it like this, as we need a unique id 
s = df1.apply(lambda x: pd.Series(x['gov_org']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'gov_org_entity'

df1.drop('gov_org', axis=1).join(s)

Unnamed: 0,base_path,gov_org_entity
0,/government/publications/list-of-psychologists...,
1,/government/news/charity-commission-names-furt...,The Charity Commission
2,/government/publications/trust-and-confidence-...,The Charity Commission
3,/government/speeches/william-shawcross-speech-...,The Charity Commission
4,/government/statistics/crime-statistics-focus-...,
5,/government/news/britain-honours-its-holocaust...,Foreign & Commonwealth Office
6,/government/publications/esf-funding-allocated...,Department for Work and Pensions
6,/government/publications/esf-funding-allocated...,Skills Funding Agency
6,/government/publications/esf-funding-allocated...,National Offender Management Service
7,/government/publications/charities-holding-mov...,


In [147]:
# this is our edge list, ideally we need unique ids for all base-paths and gov_org
# we have these
res = df1.set_index(['base_path'])['gov_org'].apply(pd.Series).stack()
res = res.reset_index()
res.rename(columns={'level_1':'count', 0:'gov_org'}, inplace=True)
# don't need the number of unique gov_org for graph db
#res.drop(columns=['count'], inplace=True)
res

Unnamed: 0,base_path,count,gov_org
0,/government/news/charity-commission-names-furt...,0,The Charity Commission
1,/government/publications/trust-and-confidence-...,0,The Charity Commission
2,/government/speeches/william-shawcross-speech-...,0,The Charity Commission
3,/government/news/britain-honours-its-holocaust...,0,Foreign & Commonwealth Office
4,/government/publications/esf-funding-allocated...,0,Department for Work and Pensions
5,/government/publications/esf-funding-allocated...,1,Skills Funding Agency
6,/government/publications/esf-funding-allocated...,2,National Offender Management Service
7,/government/news/dcms-improves-efficiency-and-...,0,UK Film Council
8,/government/news/dcms-improves-efficiency-and-...,1,British Film Institute
9,/government/news/dcms-improves-efficiency-and-...,2,UK Sport
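
With pandas 0.25 or later, the same edge list can be produced more directly with `DataFrame.explode`. This is a sketch on a tiny hand-made frame: the base paths are placeholders, and the `relationship` column is an assumption about what a graph import might want.

```python
import pandas as pd

# Tiny stand-in for df1: one list of organisations per page (paths are placeholders)
df1 = pd.DataFrame({
    "base_path": ["/example/page-a", "/example/page-b", "/example/page-c"],
    "gov_org": [
        [],
        ["The Charity Commission"],
        ["Department for Work and Pensions", "Skills Funding Agency"],
    ],
})

edges = (
    df1.explode("gov_org")           # one row per (page, organisation) pair
       .dropna(subset=["gov_org"])   # empty lists become NaN, so drop them
       .reset_index(drop=True)
)
edges["relationship"] = "REFERS_TO"
print(edges)
```

Pages with no matched organisation drop out of the result, which matches the edge list above.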
