# Validated NER data exploration

This notebook investigates named entity data whose labels were validated by humans.

In [1]:
# import libraries
import pandas as pd
import numpy as np
from google.colab import drive
import os
import re

In [2]:
drive.mount('/content/drive')

Mounted at /content/drive


Here are all the folders containing data from the Doccano labelling exercise.

In [3]:
# where we keep data for this project
PROJECT_DATA_DIR = 'drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/content-metadata-2021/Data'

# validated data folder
DATA_DIR = 'drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/'
os.listdir(DATA_DIR)

['Import data for team session on 03 02 2020',
 'Outputs from team sessions',
 'Import data content support + personalisation 02 03 20',
 'Import data for data science team 12 03 20',
 'Output from 2nd March - Content Designers session',
 'Output from 12th March - Data Scientists session',
 'TSV reference files for input data',
 'Import data volunteer session 19 06 20',
 'models']

## Inspecting data

Let's inspect some of this data...

Starting with the "import data". This is the text that was pre-annotated using the GCP NLP API and WordNet synsets, and then imported into Doccano for validation.

In [4]:
annotated_fh = os.listdir(DATA_DIR + 'Import data volunteer session 19 06 20')

In [5]:
df = pd.read_json(DATA_DIR + 'Import data volunteer session 19 06 20/' + annotated_fh[5],
                  lines=True)

In [6]:
df.head()

Unnamed: 0,text,labels
0,This is a Ministry of Defence ( MOD ) initiati...,"[[10, 29, ORGANIZATION], [33, 36, ORGANIZATION..."
1,Housing Possession Court Duty Scheme \ ( HPCDS...,"[[19, 24, ORGANIZATION], [41, 46, ORGANIZATION..."
2,If you get Disability Living Allowance . Disab...,"[[11, 72, FINANCE], [76, 82, EVENT], [87, 93, ..."
3,As long as your employee has actually spent th...,"[[16, 24, PERSON], [48, 66, FINANCE], [70, 87,..."
4,Help recording training . Contact DVSA to get ...,"[[0, 23, EVENT], [34, 38, ORGANIZATION], [46, ..."


In [7]:
# add a column to show the text that was labelled and what label it was given
df['entities'] = df.apply(lambda x: [(x['text'][pos[0]:pos[1]], pos[2]) for pos in x['labels']], axis=1)

In [8]:
df

Unnamed: 0,text,labels,entities
0,This is a Ministry of Defence ( MOD ) initiati...,"[[10, 29, ORGANIZATION], [33, 36, ORGANIZATION...","[(Ministry of Defence, ORGANIZATION), (OD , OR..."
1,Housing Possession Court Duty Scheme \ ( HPCDS...,"[[19, 24, ORGANIZATION], [41, 46, ORGANIZATION...","[(Court, ORGANIZATION), (HPCDS, ORGANIZATION),..."
2,If you get Disability Living Allowance . Disab...,"[[11, 72, FINANCE], [76, 82, EVENT], [87, 93, ...",[(Disability Living Allowance . Disability Liv...
3,As long as your employee has actually spent th...,"[[16, 24, PERSON], [48, 66, FINANCE], [70, 87,...","[(employee, PERSON), (scale rate payment, FINA..."
4,Help recording training . Contact DVSA to get ...,"[[0, 23, EVENT], [34, 38, ORGANIZATION], [46, ...","[(Help recording training, EVENT), (DVSA, ORGA..."
...,...,...,...
264,Personal data an employer can keep about an em...,"[[17, 25, PERSON], [44, 52, PERSON], [72, 80, ...","[(employer, PERSON), (employee, PERSON), (empl..."
265,Individual trust financial statements do not p...,"[[11, 37, FINANCE], [67, 78, CONTACT], [119, 1...","[(trust financial statements, FINANCE), (infor..."
266,The SPF describes how UK government organisati...,"[[4, 7, ORGANIZATION], [25, 49, ORGANIZATION],...","[(SPF, ORGANIZATION), (government organisation..."
267,Psychosocial immaturity is prevalent in young ...,"[[46, 49, PERSON]]","[(men, PERSON)]"


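Annotation quality is low here, as expected. For example, in row 0 the span meant for "MOD" is misaligned and captures "OD " instead, and in row 2 a single FINANCE span runs across a sentence boundary.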

Now let's look at the data that was validated:

In [9]:
annotated_fh_val = os.listdir(DATA_DIR + 'Outputs from team sessions')
annotated_fh_val

['ner_output_team5_dataset1.json1', 'ner_output_team1_dataset1.json1']

In [10]:
df_val = pd.read_json(DATA_DIR + 'Outputs from team sessions/' + annotated_fh_val[0],
                      lines=True)

In [11]:
df_val.head()

Unnamed: 0,id,text,meta,annotation_approver,labels
0,869,If you decide not to be paid Child Benefit you...,{},,"[[48, 70, FINANCE], [29, 42, FINANCE]]"
1,870,Moving somewhere to study does not count as no...,{},,"[[7, 16, LOCATION]]"
2,871,Your partner must apply to their own employer ...,{},,"[[71, 75, FORM], [64, 67, ORGANIZATION], [37, ..."
3,872,You have to pay tax on it if your income is ov...,{},,"[[34, 40, FINANCE], [53, 71, FINANCE]]"
4,873,Apply for Widowed Parent ’ s Allowance within ...,{},,"[[46, 54, DATE], [10, 38, FINANCE], [70, 75, E..."


In [12]:
# add a column which shows exactly what text is annotated with what label
df_val['entities'] = df_val.apply(lambda x: [(x['text'][pos[0]:pos[1]], pos[2]) for pos in x['labels']], axis=1)

In [13]:
df_val

Unnamed: 0,id,text,meta,annotation_approver,labels,entities
0,869,If you decide not to be paid Child Benefit you...,{},,"[[48, 70, FINANCE], [29, 42, FINANCE]]","[(Guardian ’ s Allowance, FINANCE), (Child Ben..."
1,870,Moving somewhere to study does not count as no...,{},,"[[7, 16, LOCATION]]","[(somewhere, LOCATION)]"
2,871,Your partner must apply to their own employer ...,{},,"[[71, 75, FORM], [64, 67, ORGANIZATION], [37, ...","[(ShPP, FORM), (SPL, ORGANIZATION), (employer,..."
3,872,You have to pay tax on it if your income is ov...,{},,"[[34, 40, FINANCE], [53, 71, FINANCE]]","[(income, FINANCE), (Personal Allowance, FINAN..."
4,873,Apply for Widowed Parent ’ s Allowance within ...,{},,"[[46, 54, DATE], [10, 38, FINANCE], [70, 75, E...","[(3 months, DATE), (Widowed Parent ’ s Allowan..."
...,...,...,...,...,...,...
214,1083,If you want them to manage your payments they ...,{},,"[[71, 80, PERSON]]","[(appointee, PERSON)]"
215,1084,If your course starts on or after 1 August 201...,{},,"[[79, 88, EVENT], [60, 71, DATE], [34, 47, DATE]]","[(first day, EVENT), (19 or older, DATE), (1 A..."
216,1085,If they die the money passes to whoever inheri...,{},,"[[32, 61, PERSON]]","[(whoever inherits their estate, PERSON)]"
217,1086,If a child has been excluded for a fixed perio...,{},,"[[89, 109, DATE], [20, 47, EVENT], [48, 55, OR...","[(first 5 school days , DATE), (excluded for a..."


The initial impression is that the validated data is of much higher quality than the unvalidated data. In a sample of several hundred labels from each set, only one validated label appeared erroneous, whereas roughly half of the unvalidated labels did.

## Merging validated data

The validated data is spread across multiple files and folders. Let's merge everything into one dataset.

First, we need the file paths to all of the validated data.

In [14]:
validated_data_folders = [
                          'Outputs from team sessions',
                          'Output from 2nd March - Content Designers session',
                          'Output from 12th March - Data Scientists session',
                          ]


In [15]:
validated_data = [os.path.join(DATA_DIR,folder,file) for folder in validated_data_folders for file in os.listdir(DATA_DIR + folder)]
validated_data

['drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/Outputs from team sessions/ner_output_team5_dataset1.json1',
 'drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/Outputs from team sessions/ner_output_team1_dataset1.json1',
 'drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/Output from 2nd March - Content Designers session/doccano_content_designers_project_2.json1',
 'drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/Output from 2nd March - Content Designers session/doccano_content_designers_project_1.json1',
 'drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/Output from 2nd March - Content Designers session/doccano_content_designers_project_3.json1',
 'drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/Output from 2nd March - Content Designers session/doccano_content_designers_project_4.json1',
 'drive/Shareddrives/GOV.UK 

Now we concatenate the data from each file into a single data frame.

In [16]:
vd = pd.concat([pd.read_json(data, lines=True)[['text','labels']] for data in validated_data])

In [17]:
# labels aren't always in order of occurrence, so sort them by character position
vd.labels = vd.labels.apply(sorted)

In [18]:
# add a column to show exactly what text is assigned to which label
vd['labelled_entities'] = vd.apply(lambda x: [(x['text'][pos[0]:pos[1]], pos[2]) for pos in x['labels']], axis=1)

In [19]:
def charPositionLabelsToTokenMapping(labels, text, convention='IOB'):
  '''
  Returns a list of labels, mapping to each token in a given text.
  This is specific to the situation in which only named entities are labelled, and those labels are not mapped
  directly to tokens, but are mapped to character positions.

    Parameters:
      labels (list of lists): contains character positions of named entities assigned to a given label, in a given text, e.g. [[10, 38, FINANCE], [46, 54, DATE]]
      text (string): the string which has been labelled
      convention (string): tagging format. Currently has no effect: the IOB branch below is commented out, so every entity token is tagged 'I-'.

    Returns:
      (label_list, token_list) (tuple of lists): the label for each token in text, and the tokens themselves
    
  '''
  # flatten labels into a list of character positions
  positions = [index for label in labels for index in label[:2]]

  # append None and 0 to handle edge cases
  positions.append(None)
  prev_positions = [0] + positions

  # maintain a list of sections of text which belong to the same label
  sections = []

  # identify sections of text which belong to the same label
  for begin, end in zip(prev_positions, positions):
    sections.append(text[begin:end])

  # remove empty strings and strip spaces left over
  sections = [section.strip() for section in sections if section.strip()]

  # create a dict of what text corresponds to named entities, and what
  # named entity that text has been labelled as
  named_entities = {text[i:j]:f"I-{label}" for i,j,label in labels}

  # group sections of tokens together with their label
  label_token_list = [(section.split(), named_entities[section]) if section in named_entities else (section.split(), 'O') for section in sections]

  # if convention == 'IOB':
  #   # if we have a multi-word entity, make the first label prefixed with 'B-' instead of 'I-'
  #   new_l = []
  #   for x in label_token_list:
  #     if x[1] != 'O' and len(x[0]) > 1:
  #       new_l.append((x[0][0],x[1].replace('I-','B-')))
  #       new_l.append((x[0][1:],x[1]))
  #     else:
  #       new_l.append(x)
  #   label_token_list = new_l

  # directly map each label to a token
  label_list = [label_token[-1] for label_token in label_token_list for _ in label_token[0]]
  token_list = [token for label_token in label_token_list for token in label_token[0]]

  return (label_list, token_list)
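
The function above looks entities up by their text, so a span with a trailing space (such as "decision ") or an unlabelled repeat of an entity's text can end up with the wrong tag. As a rough alternative, the sketch below assigns tags from character offsets directly and emits `B-` prefixes. It is not the notebook's method: `spans_to_iob` is a hypothetical name, and it assumes whitespace tokenisation and non-overlapping spans.

```python
import re

def spans_to_iob(text, labels):
    """Tag whitespace tokens with IOB labels using character offsets.

    labels is a list like [[10, 38, 'FINANCE'], [46, 54, 'DATE']].
    Assumes entity spans do not overlap each other.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in sorted(labels):
            # a token belongs to an entity if the two character ranges overlap
            if match.start() < end and match.end() > start:
                # the token that contains the span's start offset opens the entity
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tags, tokens

text = "Apply for Widowed Parent ’ s Allowance within 3 months"
tags, tokens = spans_to_iob(text, [[46, 54, "DATE"], [10, 38, "FINANCE"]])
```

Because a token is tagged whenever it overlaps a span, a misaligned span that starts or ends mid-word still labels the whole token. Whether that is desirable depends on how much the spans are trusted.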


In [20]:
# add columns for label lists and the tokens each label maps to
vd[['label_list', 'text_tokens']] = vd.apply(lambda x: charPositionLabelsToTokenMapping(x.labels, x.text), axis=1, result_type='expand')
vd

Unnamed: 0,text,labels,labelled_entities,label_list,text_tokens
0,If you decide not to be paid Child Benefit you...,"[[29, 42, FINANCE], [48, 70, FINANCE]]","[(Child Benefit, FINANCE), (Guardian ’ s Allow...","[O, O, O, O, O, O, O, I-FINANCE, I-FINANCE, O,...","[If, you, decide, not, to, be, paid, Child, Be..."
1,Moving somewhere to study does not count as no...,"[[7, 16, LOCATION]]","[(somewhere, LOCATION)]","[O, I-LOCATION, O, O, O, O, O, O, O, O, O, O]","[Moving, somewhere, to, study, does, not, coun..."
2,Your partner must apply to their own employer ...,"[[5, 12, PERSON], [37, 45, ORGANIZATION], [64,...","[(partner, PERSON), (employer, ORGANIZATION), ...","[O, I-PERSON, O, O, O, O, O, I-ORGANIZATION, O...","[Your, partner, must, apply, to, their, own, e..."
3,You have to pay tax on it if your income is ov...,"[[34, 40, FINANCE], [53, 71, FINANCE]]","[(income, FINANCE), (Personal Allowance, FINAN...","[O, O, O, O, O, O, O, O, O, I-FINANCE, O, O, O...","[You, have, to, pay, tax, on, it, if, your, in..."
4,Apply for Widowed Parent ’ s Allowance within ...,"[[10, 38, FINANCE], [46, 54, DATE], [70, 75, E...","[(Widowed Parent ’ s Allowance, FINANCE), (3 m...","[O, O, I-FINANCE, I-FINANCE, I-FINANCE, I-FINA...","[Apply, for, Widowed, Parent, ’, s, Allowance,..."
...,...,...,...,...,...
451,This may mean you have difficulty getting a mo...,"[[44, 52, FINANCE], [65, 69, LOCATION]]","[(mortgage, FINANCE), (home, LOCATION)]","[O, O, O, O, O, O, O, O, I-FINANCE, O, O, O, I...","[This, may, mean, you, have, difficulty, getti..."
452,Reapplying for ESA You may be able to re - app...,"[[15, 18, ORGANIZATION], [58, 66, DATE], [116,...","[(ESA, ORGANIZATION), (12 weeks, DATE), (ESA, ...","[O, O, I-ORGANIZATION, O, O, O, O, O, O, O, O,...","[Reapplying, for, ESA, You, may, be, able, to,..."
453,The qualified person can include you if they a...,"[[4, 20, PERSON], [62, 81, STATE]]","[(qualified person, PERSON), (permanent reside...","[O, I-PERSON, I-PERSON, O, O, O, O, O, O, O, O...","[The, qualified, person, can, include, you, if..."
454,You must see a border officer when you arrive ...,"[[15, 29, PERSON], [53, 55, LOCATION]]","[(border officer, PERSON), (UK, LOCATION)]","[O, O, O, O, I-PERSON, I-PERSON, O, O, O, O, O...","[You, must, see, a, border, officer, when, you..."


In [21]:
#vd.to_csv(os.path.join(PROJECT_DATA_DIR, 'govuk-labelled-data-ner-validated.csv'))

## Label counts

Value counts of labels for each entity.

In [22]:
vd.labelled_entities.apply(lambda x: [label[1] for label in x]).explode().value_counts()

ORGANIZATION    3096
PERSON          2883
FINANCE         2648
FORM            1366
EVENT           1324
LOCATION        1029
DATE             791
CONTACT          669
STATE            663
MISC             503
Name: labelled_entities, dtype: int64

Value counts of labels for each token.

In [23]:
vd.label_list.explode().value_counts()

O                 101033
I-ORGANIZATION      5503
I-FINANCE           4409
I-PERSON            3410
I-FORM              2715
I-EVENT             2267
I-DATE              1945
I-STATE             1482
I-LOCATION          1431
I-MISC              1064
I-CONTACT            987
Name: label_list, dtype: int64

MONEY and SCHEME are not present in the validated data.

## Problems with data

Was any data duplicated? In our context, duplication means that the same text was validated more than once.

In [24]:
vd[~vd.text.duplicated()]

Unnamed: 0,text,labels,labelled_entities,label_list,text_tokens
0,If you decide not to be paid Child Benefit you...,"[[29, 42, FINANCE], [48, 70, FINANCE]]","[(Child Benefit, FINANCE), (Guardian ’ s Allow...","[O, O, O, O, O, O, O, I-FINANCE, I-FINANCE, O,...","[If, you, decide, not, to, be, paid, Child, Be..."
1,Moving somewhere to study does not count as no...,"[[7, 16, LOCATION]]","[(somewhere, LOCATION)]","[O, I-LOCATION, O, O, O, O, O, O, O, O, O, O]","[Moving, somewhere, to, study, does, not, coun..."
2,Your partner must apply to their own employer ...,"[[5, 12, PERSON], [37, 45, ORGANIZATION], [64,...","[(partner, PERSON), (employer, ORGANIZATION), ...","[O, I-PERSON, O, O, O, O, O, I-ORGANIZATION, O...","[Your, partner, must, apply, to, their, own, e..."
3,You have to pay tax on it if your income is ov...,"[[34, 40, FINANCE], [53, 71, FINANCE]]","[(income, FINANCE), (Personal Allowance, FINAN...","[O, O, O, O, O, O, O, O, O, I-FINANCE, O, O, O...","[You, have, to, pay, tax, on, it, if, your, in..."
4,Apply for Widowed Parent ’ s Allowance within ...,"[[10, 38, FINANCE], [46, 54, DATE], [70, 75, E...","[(Widowed Parent ’ s Allowance, FINANCE), (3 m...","[O, O, I-FINANCE, I-FINANCE, I-FINANCE, I-FINA...","[Apply, for, Widowed, Parent, ’, s, Allowance,..."
...,...,...,...,...,...
450,It ’ s against the law for a school or other e...,"[[29, 35, ORGANIZATION], [45, 63, ORGANIZATION...","[(school, ORGANIZATION), (education provider, ...","[O, O, O, O, O, O, O, O, I-ORGANIZATION, O, O,...","[It, ’, s, against, the, law, for, a, school, ..."
451,This may mean you have difficulty getting a mo...,"[[44, 52, FINANCE], [65, 69, LOCATION]]","[(mortgage, FINANCE), (home, LOCATION)]","[O, O, O, O, O, O, O, O, I-FINANCE, O, O, O, I...","[This, may, mean, you, have, difficulty, getti..."
453,The qualified person can include you if they a...,"[[4, 20, PERSON], [62, 81, STATE]]","[(qualified person, PERSON), (permanent reside...","[O, I-PERSON, I-PERSON, O, O, O, O, O, O, O, O...","[The, qualified, person, can, include, you, if..."
454,You must see a border officer when you arrive ...,"[[15, 29, PERSON], [53, 55, LOCATION]]","[(border officer, PERSON), (UK, LOCATION)]","[O, O, O, O, I-PERSON, I-PERSON, O, O, O, O, O...","[You, must, see, a, border, officer, when, you..."


Yes. Previously we had 7129 rows of data; after removing rows containing duplicated text, this decreases by 1224 to 5905 rows. So 5905 unique texts were validated, and many of them were validated more than once.
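
Row counts like these are easy to cross-check in code. The sketch below uses a made-up toy frame (not the notebook's `vd`) to show how the total, the number of unique texts, and the number of rows removed relate to each other.

```python
import pandas as pd

# toy stand-in for vd; the real frame has many more columns and rows
toy = pd.DataFrame({"text": ["a", "b", "a", "c", "a", "b"]})

total_rows = len(toy)
# duplicated() marks every repeat after the first occurrence of a text
unique_texts = int((~toy["text"].duplicated()).sum())
rows_removed = total_rows - unique_texts
# number of distinct texts that appear more than once
repeated_texts = int((toy["text"].value_counts() > 1).sum())

print(total_rows, unique_texts, rows_removed, repeated_texts)
```

The three row counts should always satisfy `total_rows == unique_texts + rows_removed`.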

Let's examine the differences in labelling between repeats of validation.

In [25]:
vd[vd.text.duplicated(keep=False)][['text','labelled_entities']].sort_values('text')

Unnamed: 0,text,labelled_entities
426,A child aged between 6 months and 3 years must...,"[(child, PERSON), (between 6 months and 3 year..."
286,A child aged between 6 months and 3 years must...,"[(child, PERSON), (between 6 months and 3 year..."
362,A judge will listen to both sides of the argum...,"[(judge, PERSON), (decision , CONTACT)]"
103,A judge will listen to both sides of the argum...,"[(judge, PERSON)]"
42,A judge will listen to both sides of the argum...,"[(judge, PERSON)]"
...,...,...
135,Your sponsor will give you your certificate of...,"[(sponsor, PERSON), (certificate of sponsorshi..."
441,Your sponsor will give you your certificate of...,"[(sponsor, PERSON), (certificate of sponsorshi..."
171,Your suitability to foster will be assessed .,"[(suitability to foster, STATE)]"
199,Your suitability to foster will be assessed .,"[(suitability to foster, STATE)]"
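
Reading the table by eye only goes so far. One possible way to surface the disagreements, sketched here with toy records loosely modelled on the output above, is to group annotations by text and keep the texts whose repeats do not all carry the same set of entities. `label_disagreements` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def label_disagreements(records):
    """Return texts that were validated more than once with differing entities.

    records is an iterable of (text, entities) pairs, where entities is a
    list of (entity_text, label) tuples.
    """
    variants = defaultdict(set)
    counts = defaultdict(int)
    for text, entities in records:
        # strip entity text so that "decision " and "decision" compare equal
        variants[text].add(frozenset((ent.strip(), label) for ent, label in entities))
        counts[text] += 1
    return {text: sets for text, sets in variants.items()
            if counts[text] > 1 and len(sets) > 1}

# toy data, invented for illustration
records = [
    ("The judge will make a decision .", [("judge", "PERSON"), ("decision ", "CONTACT")]),
    ("The judge will make a decision .", [("judge", "PERSON")]),
    ("The judge will make a decision .", [("judge", "PERSON")]),
    ("Your suitability to foster will be assessed .", [("suitability to foster", "STATE")]),
    ("Your suitability to foster will be assessed .", [("suitability to foster", "STATE")]),
]
disagreements = label_disagreements(records)
```

With the notebook's data, the records could presumably be produced with `vd[['text', 'labelled_entities']].itertuples(index=False, name=None)`.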


Unvalidated data conflates times with money. How are times handled in the validated data?

In [26]:
vd[vd.text.str.contains('pm ', case=False)][['text','labelled_entities']]

Unnamed: 0,text,labelled_entities
78,Acas helpline Telephone : 0300 123 1100 Textph...,"[(Acas, ORGANIZATION), (helpline, CONTACT), (T..."
94,The Insolvency Service Telephone : 0330 331 00...,"[(Insolvency Service, ORGANIZATION), (Telephon..."
398,Invest Northern Ireland Telephone : 0800 181 4...,"[(Invest Northern Ireland, ORGANIZATION)]"
450,Employer helpline 0800 916 0614 Monday to Frid...,"[(call charges, FINANCE)]"
37,Student Finance England Postgraduate Loan team...,[(Student Finance England Postgraduate Loan te...
43,Flexitime The employee chooses when to start a...,"[(employee, PERSON), (start, EVENT), (end work..."
287,Environment Agency Email : enquiries @ environ...,"[(Environment Agency, ORGANIZATION), (Email, C..."
384,BCMS Helpline bcmsctsonline @ rpa . gov . uk 0...,"[(BCMS Helpline, CONTACT), (bcmsctsonline @ rp..."
462,Student Finance England Postgraduate Loan team...,[(Student Finance England Postgraduate Loan te...
227,DBS sensitive applications team sensitive @ db...,"[(applications team, ORGANIZATION)]"


Times are now assigned the DATE label instead of MONEY, although not with complete consistency: some times are not assigned to any named entity.
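
To put a number on that inconsistency, one option is to find clock times with a regular expression and look up which label, if any, covers each match. The sketch below is illustrative only: the pattern is an assumption about how times are written (for example "9am" or "5:30 pm") and will miss other formats, and `labels_for_times` is a hypothetical helper.

```python
import re

# assumed time format: 1-2 digits, optional minutes, optional space, am/pm
TIME_PATTERN = re.compile(r"\b\d{1,2}(?:[:.]\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)

def labels_for_times(text, labels):
    """Return (time_text, label) pairs; label is None if no span covers the time."""
    results = []
    for match in TIME_PATTERN.finditer(text):
        covering = None
        for start, end, label in labels:
            # the span must fully contain the matched time
            if start <= match.start() and match.end() <= end:
                covering = label
                break
        results.append((match.group(), covering))
    return results

# toy example: only the first time is covered by a DATE span
text = "Monday to Friday , 8am to 6pm"
labels = [[0, 16, "DATE"], [19, 22, "DATE"]]
found = labels_for_times(text, labels)
```

Applied row by row to `text` and `labels`, the share of `None` results would indicate how often times were left unlabelled.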

Where has the MONEY label gone? Let's check if it was ever used by the Google NLP API.

In [27]:
os.listdir(DATA_DIR)

['Import data for team session on 03 02 2020',
 'Outputs from team sessions',
 'Import data content support + personalisation 02 03 20',
 'Import data for data science team 12 03 20',
 'Output from 2nd March - Content Designers session',
 'Output from 12th March - Data Scientists session',
 'TSV reference files for input data',
 'Import data volunteer session 19 06 20',
 'models']

In [28]:
unvalidated_data_folders = [
                          'Import data for team session on 03 02 2020',
                          'Import data for data science team 12 03 20',
                          'Import data volunteer session 19 06 20'
                          ]

In [29]:
unvalidated_data = [os.path.join(DATA_DIR,folder,file) for folder in unvalidated_data_folders for file in os.listdir(DATA_DIR + folder)]

In [30]:
uvd = pd.concat([pd.read_json(data, lines=True)[['text','labels']] for data in unvalidated_data])
# add a column to show exactly what text is assigned to which label
uvd['labelled_entities'] = uvd.apply(lambda x: [(x['text'][pos[0]:pos[1]], pos[2]) for pos in x['labels']], axis=1)
# add columns for label lists and the tokens each label maps to
uvd[['label_list', 'text_tokens']] = uvd.apply(lambda x: charPositionLabelsToTokenMapping(x.labels, x.text), axis=1, result_type='expand')
uvd

Unnamed: 0,text,labels,labelled_entities,label_list,text_tokens
0,Contact your school or local council to find o...,"[[13, 19, ORGANIZATION], [29, 36, ORGANIZATION]]","[(school, ORGANIZATION), (council, ORGANIZATION)]","[O, O, I-ORGANIZATION, O, O, I-ORGANIZATION, O...","[Contact, your, school, or, local, council, to..."
1,The judge will then make a decision .,"[[4, 9, PERSON]]","[(judge, PERSON)]","[O, I-PERSON, O, O, O, O, O, O]","[The, judge, will, then, make, a, decision, .]"
2,Complaints You can complain to the Child Benef...,"[[35, 55, ORGANIZATION]]","[(Child Benefit Office, ORGANIZATION)]","[O, O, O, O, O, O, I-ORGANIZATION, I-ORGANIZAT...","[Complaints, You, can, complain, to, the, Chil..."
3,Problems and disputes Ask your employer to exp...,"[[31, 39, PERSON], [56, 59, ORGANIZATION]]","[(employer, PERSON), (SMP, ORGANIZATION)]","[O, O, O, O, O, I-PERSON, O, O, O, I-ORGANIZAT...","[Problems, and, disputes, Ask, your, employer,..."
4,You have certain responsibilities until the ch...,"[[44, 49, PERSON], [72, 77, PERSON]]","[(child, PERSON), (child, PERSON)]","[O, O, O, O, O, O, I-PERSON, O, O, O, O, O, I-...","[You, have, certain, responsibilities, until, ..."
...,...,...,...,...,...
264,This information will be useful to : school ad...,"[[5, 16, CONTACT], [37, 43, ORGANIZATION], [54...","[(information, CONTACT), (school, ORGANIZATION...","[O, I-CONTACT, O, O, O, O, O, I-ORGANIZATION, ...","[This, information, will, be, useful, to, :, s..."
265,If your organisation applies for additional se...,"[[8, 20, ORGANIZATION], [44, 52, MISC], [65, 7...","[(organisation, ORGANIZATION), (services, MISC...","[O, O, I-ORGANIZATION, O, O, O, I-MISC, O, O, ...","[If, your, organisation, applies, for, additio..."
266,Read the instructions for entry into Canada fo...,"[[9, 21, FORM], [37, 43, LOCATION], [61, 78, P...","[(instructions, FORM), (Canada, LOCATION), (bu...","[O, O, I-FORM, O, O, O, I-LOCATION, O, O, O, O...","[Read, the, instructions, for, entry, into, Ca..."
267,This information is for lead schools of a Scho...,"[[5, 16, CONTACT], [29, 36, ORGANIZATION], [56...","[(information, CONTACT), (schools, ORGANIZATIO...","[O, I-CONTACT, O, O, O, I-ORGANIZATION, O, O, ...","[This, information, is, for, lead, schools, of..."


I note here that there are 9765 rows of unvalidated data, compared to 7129 rows of validated data.

In [31]:
uvd.labelled_entities.apply(lambda x: [label[1] for label in x]).explode().value_counts()

ORGANIZATION    7876
PERSON          5119
EVENT           4279
FINANCE         4162
LOCATION        3132
CONTACT         2134
FORM            2105
DATE            1508
STATE           1224
MISC             453
SCHEME           149
MONEY             21
Name: labelled_entities, dtype: int64

There are only 21 entities assigned the MONEY label. Let's inspect the corresponding text.

In [32]:
uvd[uvd.label_list.apply(lambda x: 'I-MONEY' in x)][['text','labelled_entities']]

Unnamed: 0,text,labelled_entities
30,They take place from 9am to 5pm and include : ...,"[(place, LOCATION), (5pm, MONEY), (laboratory,..."
37,Contact PRaP Operational Support Team ( POST )...,"[(Contact, CONTACT), (ail :, CONTACT), (lephon..."
105,Digital Tachograph Team Telephone : 0300 300 2...,"[(Digital Tachograph Team, ORGANIZATION), (Tel..."
151,Defra defra . helpline @ defra . gov . uk Tele...,"[(Defra, ORGANIZATION), (defra . helpline, CON..."
160,NCS @ cms - cmno . com Telephone : 0844 984 01...,"[(Telephone, CONTACT), (Monday to Friday, DATE..."
170,Assisted Prison Visits Unit assisted . prison ...,"[(assisted . prison . visits, LOCATION), (Tele..."
214,DVSA theory test booking support customercare ...,"[(DVSA, ORGANIZATION), (theory test, EVENT), (..."
154,DVSA Telephone : 0300 123 9000 Monday to Frida...,"[(Telephone, CONTACT), (Monday to Friday, DATE..."
192,Biodiversity Unit 028 9056 9605 Monday to Frid...,"[(Biodiversity Unit, ORGANIZATION), (Monday to..."
32,Business Wales Helpline Telephone : 0300 060 3...,"[(Wales, LOCATION), (Telephone, CONTACT), (Mon..."


In this output, MONEY is only ever assigned to times, so the validators will have corrected it. But does the text actually mention money? Let's look for text containing currency symbols.

In [33]:
vd[vd.text.str.contains('\\£|\\€|\\$')][['text','labelled_entities']]

Unnamed: 0,text,labelled_entities
67,How much it costs It costs £103 to register wi...,"[(£103, FINANCE), (Ofsted, ORGANIZATION)]"
71,Tax relief On top of the £10 000 exemption you...,"[(Tax relief, FINANCE), (£10 000 exemption, FI..."
66,How much it costs It costs £103 to register wi...,"[(£103, FINANCE), (Ofsted, ORGANIZATION)]"
70,Tax relief On top of the £10 000 exemption you...,"[(Tax relief, FINANCE), (£10 000 exemption, FI..."
52,You may be prosecuted or have to pay a £50 pen...,"[(be prosecuted, EVENT), (pay, FINANCE), (£50,..."
74,If you ’ re appealing a Self Assessment penalt...,"[(Self Assessment penalty, FINANCE), (£100 pen..."
476,You can choose a low share value ( for example...,"[(low share value, FINANCE), (shareholders, PE..."
87,How much it costs It usually costs £35 to regi...,"[(Ofsted, ORGANIZATION)]"
360,It costs £50 for each form you submit .,"[(each, PERSON), (form, FORM)]"
39,How much it costs It costs £103 to register wi...,"[(Ofsted, ORGANIZATION)]"


Sometimes money is tagged to FINANCE, sometimes it is not tagged to anything.

What about entities tagged to SCHEME? We don't have any of those in the validated data, either.

In [34]:
uvd[uvd.label_list.apply(lambda x: 'I-SCHEME' in x)][['text','labelled_entities']]

Unnamed: 0,text,labelled_entities
19,Registering with a redress scheme as a propert...,"[(scheme, SCHEME), (property agent, PERSON), (..."
122,You can use the stakeholder needs and quality ...,"[(scheme, SCHEME)]"
56,Business rates : retail discount - guidance Gu...,"[(Business rates, FINANCE), (discount, FINANCE..."
66,VAT Annual Accounting Scheme . Annual accounti...,"[(VAT, FINANCE), (Annual Accounting Scheme, OR..."
137,Stop being an employer . Tell HMRC if you stop...,"[(employer, PERSON), (Tell HMRC, ORGANIZATION)..."
...,...,...
84,The money is paid through the Domestic RHI sch...,"[(money, FINANCE), (paid, FINANCE), (Domestic ..."
106,Fill in the Park Homes Warm Home Discount appl...,"[(Park Homes Warm Home Discount, EVENT), (appl..."
188,Tax advantages only apply if the shares are of...,"[(apply, FORM), (shares, FINANCE), (schemes, S..."
198,Defined benefit pension schemes Your employer ...,"[(benefit pension, FINANCE), (schemes, SCHEME)..."


In the unvalidated data, the bare word 'scheme' is tagged as SCHEME. On its own, that word says nothing about which scheme a page mentions. In some cases, however, the text preceding 'scheme' names an actual scheme. During human validation, the SCHEME tags were removed rather than extended backwards to cover those names.
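
For illustration, extending a SCHEME span backwards could look something like the sketch below. It is a crude heuristic and not something the validators or the notebook did: it assumes that scheme names are runs of capitalised tokens directly before the tagged word, and `extend_span_backwards` is a hypothetical helper.

```python
def extend_span_backwards(text, start, end):
    """Move start left over directly preceding tokens that begin with a capital."""
    new_start = start
    while True:
        # text before the current span, without trailing whitespace
        prefix = text[:new_start].rstrip()
        if not prefix:
            break
        # last whitespace-delimited token before the span
        token = prefix.split()[-1]
        if not token[0].isupper():
            break
        new_start = len(prefix) - len(token)
    return new_start, end

# toy sentence modelled on the output above
text = "The money is paid through the Domestic RHI scheme ."
start = text.index("scheme")
new_start, new_end = extend_span_backwards(text, start, start + len("scheme"))
extended = text[new_start:new_end]
```

A capitalised word at the start of a sentence would also be absorbed if the run of capitals reaches it, so any spans produced this way would still need human review.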