## Getting started

In [1]:
import pandas as pd

import spacy
# Import the English language class
from spacy.lang.en import English

# make use of widescreen
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


## Reading in GOV.UK data

In [2]:
# available from GOV.UK data scientists
# data has been pre-processed for taxonomy work
# this loses us useful information such as capital letters
# will want to adjust pre-processing pipeline
df = pd.read_csv("data/11-02-19/labelled.csv", usecols=["base_path", "content_id", "description", "locale", "title", "body", "combined_text"])

In [3]:
df.head()

| | base_path | content_id | description | locale | title | body | combined_text |
|---|---|---|---|---|---|---|---|
| 0 | /government/publications/list-of-psychologists... | 04a0cc0d-0b9f-45ad-bf57-7c54cbab9df9 | list of english speaking psychologists and psy... | en | chile - list of psychologists and psychiatrist... | prepared by british embassy/consulate santiago... | chile - list of psychologists and psychiatrist... |
| 1 | /government/news/charity-commission-names-furt... | 5fa49c52-7631-11e4-a3cb-005056011aef | regulator increases transparency of its work. | en | charity commission names further charities und... | the charity commission has today named further... | charity commission names further charities und... |
| 2 | /government/publications/trust-and-confidence-... | d0341424-12a1-4b4c-9045-2e74ba17f2d5 | independent research into trust and confidence... | en | trust and confidence in the charity commission... | the charity commission commissioned populus to... | trust and confidence in the charity commission... |
| 3 | /government/speeches/william-shawcross-speech-... | 9245dfca-4210-41d9-9ffd-7fcc35dc1642 | william shawcross asks charities to pull toget... | en | william shawcross speech at commission’s publi... | good morning and thank you for joining us here... | william shawcross speech at commission’s publi... |
| 4 | /government/statistics/crime-statistics-focus-... | 5fec046a-7631-11e4-a3cb-005056011aef | crime statistics from the crime survey for eng... | en | public perceptions of crime and the police and... | official statistics are produced impartially a... | public perceptions of crime and the police and... |


## Documents, spans and tokens

In [4]:
# Create the nlp object
nlp = English()

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import the English language class from spacy.lang.en and instantiate it. You can then use the nlp object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages, which are available in spacy.lang.

In [5]:
df.at[3,"title"]

'william shawcross speech at commission’s public meeting in southampton'

In [6]:
# Created by processing a string of text with the nlp object
doc = nlp(df.at[3,"title"])

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

william
shawcross
speech
at
commission
’s
public
meeting
in
southampton


When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.
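As a small illustration of the claim that no information is lost, the sketch below rebuilds the original string from the tokens. It assumes a blank English pipeline (tokenizer only) and passes the title in as a literal string instead of reading it from the DataFrame.

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no trained model required.
nlp = English()

# The title from the cell above, written out as a literal for this sketch.
text = "william shawcross speech at commission’s public meeting in southampton"
doc = nlp(text)

# Each token keeps its trailing whitespace, so joining text_with_ws
# reproduces the input string exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

# The Doc supports len() and indexing, including negative indices.
print(len(doc), doc[0].text, doc[-1].text)
```

Because tokenization is non-destructive, character offsets such as `token.idx` can be mapped straight back to the source text.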

In [7]:
doc[0]

william

In [8]:
# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

william


When you call nlp on a string, spaCy first tokenizes the text and creates a document object. This section looks at the Doc, as well as its views Token and Span.

In [9]:
# can take slices
doc[0:-1]

william shawcross speech at commission’s public meeting in
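To make the idea of a Span as a view more concrete, here is a minimal sketch. It assumes a blank English pipeline and uses the same title as a literal string; the variable name `name` is only illustrative.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()  # tokenizer only
doc = nlp("william shawcross speech at commission’s public meeting in southampton")

# Slicing a Doc returns a Span object rather than a list or a plain string.
name = doc[0:2]
print(type(name).__name__, "|", name.text)

# A Span records its token boundaries in the parent Doc (end is exclusive).
print(name.start, name.end, len(name))

# Iterating over a Span yields Token objects, just like iterating over a Doc.
print([token.text for token in name])
```

The Span does not copy the text; it points back to the Doc it was sliced from, which is available as `name.doc`.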

## Lexical attributes

In [10]:
doc = nlp(df.at[18,"description"])
doc

the resting place of 4 members of the royal warwickshire regiment has finally been marked more than 100 years after they gave their lives for their country.

In [11]:
# Find a duration by looking for a number followed by "years"
# Could also be used for percentages, but "%" has been removed from the text
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "years"
        if next_token.text == "years":
            print("Time period found (years):", token.text)

Time period found (years): 100
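Lexical attributes such as `like_num` depend only on the token itself, not on its context, so they work without a trained model. The sketch below lists a few of them for a shortened version of the description above; the sentence is typed in as a literal and the `rows` name is just for illustration.

```python
from spacy.lang.en import English

nlp = English()  # lexical attributes need only the tokenizer
doc = nlp("marked more than 100 years after they gave their lives.")

# Collect index, text and a few boolean lexical attributes per token.
rows = [(t.i, t.text, t.is_alpha, t.is_punct, t.like_num) for t in doc]
for row in rows:
    print(row)

# like_num also recognises spelled-out English number words.
print(nlp("ten years")[0].like_num)
```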


## Statistical models
Let's add some more power to the nlp object!

In this lesson, you'll learn about spaCy's statistical models.

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.

In [12]:
# https://spacy.io/usage/models
# https://stackoverflow.com/questions/52677634/pycharm-cant-find-spacy-model-en

nlp = spacy.load("en_core_web_sm")

In [25]:
df.title

0         chile - list of psychologists and psychiatrist...
1         charity commission names further charities und...
2         trust and confidence in the charity commission...
3         william shawcross speech at commission’s publi...
4         public perceptions of crime and the police and...
5                      britain honours its holocaust heroes
6                            esf funding for the north east
7         charities: holding moving and receiving funds ...
8                       english indices of deprivation 2015
9         dcms improves efficiency and cuts costs with r...
10                  advice for british nationals in kolkata
11        wales office minister welcomes prime minister’...
12          birth summary tables in england and wales: 2013
13        punishment and reform: effective community sen...
14                           lord-lieutenant for midlothian
15        cic36: application to form a community interes...
16        long-term effects of childhood

### Predicting part of speech (POS) tags

In [21]:
# Process a text
doc = nlp(df.at[22,"description"])

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

the DET
queen NOUN
has VERB
been VERB
pleased ADJ
to PART
approve VERB
that ADP
the DET
honour NOUN
of ADP
knighthood NOUN
be VERB
conferred VERB
upon ADP
oliver NOUN
heald NOUN
mp PRON
. PUNCT


Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

First, we load the small English model and receive an nlp object.

Next, we process the text from a page's description.

For each token in the Doc, we can print the text and the pos_ attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an ID.

Here, the model tagged "pleased" as an adjective and "knighthood" as a noun, both of which are reasonable, but it got "mp" wrong by tagging it as a pronoun. It also tagged "oliver" and "heald" as common nouns rather than proper nouns, possibly because the text has been lowercased.
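The underscore convention can be seen without a trained model by using a lexical attribute. The sketch below uses `lower_` and `lower` on a blank English pipeline as a stand-in; the same pattern applies to `pos_` and `pos` once a model is loaded.

```python
from spacy.lang.en import English

nlp = English()  # blank pipeline, so only lexical attributes are available
doc = nlp("The Queen")
token = doc[1]

# With the underscore: a readable string.
print(token.lower_)

# Without the underscore: an integer ID (a hash of that string).
print(token.lower)

# The vocabulary's string store maps the ID back to the string.
print(nlp.vocab.strings[token.lower] == token.lower_)
```

Working with IDs internally is what lets spaCy store each string only once and compare attributes cheaply.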

### Predicting syntactic dependencies

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The dep_ attribute returns the predicted dependency label.

The head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

To describe syntactic dependencies, spaCy uses a standardized label scheme.



In [23]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

the DET det queen
queen NOUN nsubj been
has VERB aux been
been VERB ROOT been
pleased ADJ acomp been
to PART aux approve
approve VERB xcomp pleased
that ADP mark conferred
the DET det honour
honour NOUN nsubjpass conferred
of ADP prep honour
knighthood NOUN pobj of
be VERB auxpass conferred
conferred VERB ccomp approve
upon ADP prep conferred
oliver NOUN compound mp
heald NOUN compound mp
mp PRON pobj upon
. PUNCT punct been
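The dependency labels in this output are terse. One way to decode them is `spacy.explain`, which looks labels up in spaCy's built-in glossary and needs no trained model. The list of labels below is simply picked from the output above.

```python
import spacy

# Labels taken from the dependency output above.
labels = ["nsubj", "nsubjpass", "acomp", "xcomp", "ccomp", "pobj"]

# spacy.explain returns a short description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:10} {description}")
```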


### Predicting named entities
Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The doc.ents property lets you access the named entities predicted by the model.

It returns a sequence of Span objects, so we can print the entity text and the entity label using the label_ attribute.

In the example below, the model correctly predicts "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

In [32]:
# Process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


However, it performs poorly on our data, possibly because the text was lowercased during pre-processing.

In [52]:
df.body

0         prepared by british embassy/consulate santiago...
1         the charity commission has today named further...
2         the charity commission commissioned populus to...
3         good morning and thank you for joining us here...
4         official statistics are produced impartially a...
5         at an event at the foreign & commonwealth offi...
6         the funding is broken down by co financing org...
7         chapter 4 of the commission’s compliance toolk...
8         these statistics update the english indices of...
9         a number of the department for culture media a...
10        the consular section at the british deputy hig...
11        wales office minister david jones today welcom...
12        official statistics are produced impartially a...
13        this document contains the following informati...
14        the queen has been pleased to appoint sir robe...
15        when applying to form a community interest com...
16        there is a body of evidence su

In [54]:
# show what we are looking at
print(df.at[0,"body"])
# Process a text
doc = nlp(df.at[0,"body"])

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print("--------")
    print(ent.text, ent.label_)

prepared by british embassy/consulate santiago chile. list of psychologists and psychiatrists 2017 pdf 412kb 8 pages
--------
british NORP
--------
2017 DATE
--------
412 CARDINAL
--------
8 CARDINAL


In [62]:
# show what we are looking at
print(df.at[305692,"body"])
# Process a text
doc = nlp(df.at[305692,"body"])

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print("--------")
    print(ent.text, ent.label_)

the dame lesley strathie operational excellence award recognises excellence in the delivery of public services. this includes putting user needs at the heart of a project and significantly improving the quality value for money or productivity of services to the public. gov.uk notify is delivered by a multidisciplinary team of 12 people including designers user researchers and developers. they work closely with service teams across the country to constantly iterate and improve it. the judges said: gov.uk notify is a great example of a small diverse set of civil servants challenging established ways of doing things to rapidly deliver a product benefitting millions of people whilst saving taxpayers millions. gov.uk notify product manager pete herlihy said: it’s ace. obviously we’re incredibly proud of notify and the impact it’s having right across the public sector but this recognition for how our little team goes about delivering it really means so much to us. about gov.uk notify gov.uk 

#### Tip: the explain method

In [60]:
spacy.explain('CARDINAL')
# Could extract dates discussed and adjust search results accordingly.

'Numerals that do not fall under another type'

In [59]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

#### Pre-processing is important
By lowercasing the text we lose information, such as the capital letters that help signal names.

## Predicting named entities in context
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

In [63]:
# show what we are looking at
print(df.at[305692,"title"])
# Process a text
doc = nlp(df.at[305692,"title"])

gov.uk notify wins civil service operational excellence award


In [69]:
# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "GOV.UK" and "Notify"
govuk = doc[0:1]
notify = doc[1:2]

# Print the span text
print("Missing entity:", govuk.text)
print("Missing entity:", notify.text)

Missing entity: gov.uk
Missing entity: notify
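Once a missed entity has been located, it can also be recorded on the Doc by hand. The sketch below is one possible way to do that, not part of the original analysis: it assumes a blank English pipeline, and the `PRODUCT` label is an arbitrary choice made for illustration.

```python
from spacy.lang.en import English

nlp = English()  # no trained model, so doc.ents starts out empty
doc = nlp("gov.uk notify wins civil service operational excellence award")

# char_span builds a Span from character offsets; it returns None when
# the offsets do not line up with token boundaries.
missed = doc.char_span(0, len("gov.uk notify"), label="PRODUCT")

# Assigning a list of Spans to doc.ents replaces the Doc's entities.
if missed is not None:
    doc.ents = [missed]

print([(ent.text, ent.label_) for ent in doc.ents])
```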


Of course, you don't always have to do this manually. In the next section, you'll learn about spaCy's rule-based matcher, which can help you find certain words and phrases in text.

## Rule-based matching
We'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.

Why not just use regular expressions?

Consider: "duck" (verb) vs. "duck" (noun)

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search not only for texts but also for other lexical attributes.

You can even write rules that use the model's predictions.
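To make the "duck" example concrete, the sketch below shows what a plain regular expression sees. The sentence is made up for illustration and only the standard library is used.

```python
import re

# Hypothetical sentence with "duck" used first as a noun, then as a verb.
text = "i saw a duck by the pond. duck when the ball comes."

# A regular expression works on characters only, so both uses look identical.
hits = [(m.start(), m.group()) for m in re.finditer(r"\bduck\b", text)]
print(hits)
```

Both occurrences are returned with nothing to tell them apart, whereas a token-based rule could additionally require a particular part-of-speech tag.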

#### Match patterns
Lists of dictionaries, one per token

* Match exact token texts  
[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]  
* Match lexical attributes  
[{'LOWER': 'iphone'}, {'LOWER': 'x'}]
* Match any token attributes  
[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]  

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".
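The first two kinds of pattern can be tried without a trained model. The sketch below is a minimal example under a few assumptions: it uses a blank English pipeline, it uses the list-of-patterns form of `Matcher.add` (the spaCy v3 signature, which differs from older releases), and the pattern names, the `find_matches` helper and the example sentences are made up for illustration. The `LEMMA` and `POS` pattern is left out because it needs a model's predictions.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

# Blank pipeline: TEXT, LOWER and LIKE_NUM need no trained model.
nlp = English()
matcher = Matcher(nlp.vocab)

# Assumption: Matcher.add takes a name and a list of patterns (spaCy v3 style).
matcher.add("IPHONE_EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
matcher.add("IPHONE_LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])
# A number-like token followed by "years", as in the lexical attributes cell.
matcher.add("YEARS", [[{"LIKE_NUM": True}, {"LOWER": "years"}]])


def find_matches(text):
    """Return sorted (pattern name, matched text) pairs for one string."""
    doc = nlp(text)
    return sorted(
        (nlp.vocab.strings[match_id], doc[start:end].text)
        for match_id, start, end in matcher(doc)
    )


print(find_matches("The iPhone X costs more than the iphone x case"))
print(find_matches("marked more than 100 years after they gave their lives"))
```

Calling the matcher on a Doc returns tuples of match ID, start and end token index, so each match can be turned back into a Span with `doc[start:end]`.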