## Getting started

In [23]:
import pandas as pd

# Import the English language class
from spacy.lang.en import English

# Use the full width of the browser window
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


## Reading in GOV.UK data

In [8]:
# available from GOV.UK data scientists
# data has been pre-processed for taxonomy work
df = pd.read_csv("data/11-02-19/labelled.csv", usecols=["base_path", "content_id", "description", "locale", "title", "body", "combined_text"])

In [9]:
df.head()

| | base_path | content_id | description | locale | title | body | combined_text |
|---|---|---|---|---|---|---|---|
| 0 | /government/publications/list-of-psychologists... | 04a0cc0d-0b9f-45ad-bf57-7c54cbab9df9 | list of english speaking psychologists and psy... | en | chile - list of psychologists and psychiatrist... | prepared by british embassy/consulate santiago... | chile - list of psychologists and psychiatrist... |
| 1 | /government/news/charity-commission-names-furt... | 5fa49c52-7631-11e4-a3cb-005056011aef | regulator increases transparency of its work. | en | charity commission names further charities und... | the charity commission has today named further... | charity commission names further charities und... |
| 2 | /government/publications/trust-and-confidence-... | d0341424-12a1-4b4c-9045-2e74ba17f2d5 | independent research into trust and confidence... | en | trust and confidence in the charity commission... | the charity commission commissioned populus to... | trust and confidence in the charity commission... |
| 3 | /government/speeches/william-shawcross-speech-... | 9245dfca-4210-41d9-9ffd-7fcc35dc1642 | william shawcross asks charities to pull toget... | en | william shawcross speech at commission’s publi... | good morning and thank you for joining us here... | william shawcross speech at commission’s publi... |
| 4 | /government/statistics/crime-statistics-focus-... | 5fec046a-7631-11e4-a3cb-005056011aef | crime statistics from the crime survey for eng... | en | public perceptions of crime and the police and... | official statistics are produced impartially a... | public perceptions of crime and the police and... |


## Documents, spans and tokens

In [10]:
# Create the nlp object
nlp = English()

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import the English language class from spacy.lang.en and instantiate it. You can then call the nlp object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages, which are available in spacy.lang.
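As a minimal sketch of the points above, the cell below builds two blank language objects and tokenizes a short string with each. The example sentences are made up for illustration, and the German class is only there to show that other languages live alongside English in spacy.lang.

```python
# Blank language classes: tokenization rules only, no trained components
from spacy.lang.en import English
from spacy.lang.de import German

nlp_en = English()
nlp_de = German()

# Calling the nlp object on a string returns a Doc of tokens
english_tokens = [token.text for token in nlp_en("Hello world!")]
german_tokens = [token.text for token in nlp_de("Guten Morgen!")]

print(english_tokens)
print(german_tokens)

# A freshly created language object has no pipeline components yet
print(nlp_en.pipe_names)
```

Because nothing has been added to the pipeline, pipe_names is empty here; only the language-specific tokenizer runs.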

In [13]:
df.at[3,"title"]

'william shawcross speech at commission’s public meeting in southampton'

In [15]:
# Created by processing a string of text with the nlp object
doc = nlp(df.at[3,"title"])

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

william
shawcross
speech
at
commission
’s
public
meeting
in
southampton


When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. More on that later!

In [18]:
doc[0]

william

In [20]:
# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

william


When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

In [21]:
# Slicing a Doc returns a Span (here, every token except the last)
doc[0:-1]

william shawcross speech at commission’s public meeting in
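To make the relationship between Doc, Token and Span concrete, here is a small sketch using a blank English pipeline. The input string is a simplified, made-up variant of the title above, not a row from the dataset.

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span, Token

nlp = English()
text = "william shawcross speech at public meeting in southampton"
doc = nlp(text)

# Indexing a Doc gives a Token; slicing gives a Span
first_token = doc[0]
name_span = doc[0:2]

print(type(doc).__name__, type(first_token).__name__, type(name_span).__name__)
print(len(doc))                         # number of tokens
print(name_span.text)                   # text covered by the span
print(name_span.start, name_span.end)   # token offsets of the span

# The Doc keeps the original text, including whitespace
print(doc.text == text)
```

A Span is a view onto the Doc rather than a copy, so its start and end are token positions in the parent document.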

## Lexical attributes

In [28]:
doc = nlp(df.at[18,"description"])
doc

the resting place of 4 members of the royal warwickshire regiment has finally been marked more than 100 years after they gave their lives for their country.

In [30]:
# Find a duration: a number-like token followed by "years"
# Could also be used for percentages, but % has been removed from the text
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "years"
        if next_token.text == "years":
            print("Time period found (years):", token.text)

Time period found (years): 100
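The same idea can be sketched more generally. Lexical attributes such as like_num and is_alpha depend only on the token's own text, so they work with a blank pipeline. The sentence and the TIME_UNITS set below are illustrative assumptions, not taken from the dataset.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("marked more than 100 years after ten long months")

# like_num is True for digits and for number words such as "ten"
numbers = [token.text for token in doc if token.like_num]

# is_alpha is True for tokens made up of alphabetic characters only
words = [token.text for token in doc if token.is_alpha]

# Hypothetical set of units, used only for this example
TIME_UNITS = {"years", "months", "days"}

# Collect (number, unit) pairs where a number-like token directly precedes a unit
periods = [
    (token.text, doc[token.i + 1].text)
    for token in doc
    if token.like_num
    and token.i + 1 < len(doc)
    and doc[token.i + 1].text in TIME_UNITS
]

print(numbers)
print(periods)
```

Note that "ten long months" is not picked up, because the rule only looks at the token immediately after the number. Matching across intervening words would need a more flexible pattern.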


## Statistical models
Let's add some more power to the nlp object!

In this lesson, you'll learn about spaCy's statistical models.

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they make predictions using weights stored in binary form. That’s why it’s not necessary to ship them with their training data.
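A rough sketch of what this looks like in practice follows. It assumes the small English model package en_core_web_sm has been installed separately; that package and the example sentence are assumptions, since this notebook has only used the blank English class so far. No expected output is shown because the predictions depend on the model version.

```python
import spacy

# Assumption: the en_core_web_sm model package is installed.
# spacy.load raises OSError if the package cannot be found.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = None
    print("Model 'en_core_web_sm' is not installed")

if nlp is not None:
    doc = nlp("William Shawcross spoke at a public meeting in Southampton.")

    # Context-dependent predictions: part-of-speech tag and dependency label
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    # Named entities predicted by the model
    for ent in doc.ents:
        print(ent.text, ent.label_)
```

Unlike the blank English object, the loaded pipeline ships with trained weights, which is what makes the pos_, dep_ and ents predictions available.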