# Validated NER data exploration

This notebook investigates named entity data whose labels were validated by humans.

In [None]:
# import libraries
import itertools
import os
import re

import numpy as np
import pandas as pd
from google.colab import drive

In [None]:
drive.mount("/content/drive")

Here are all the folders containing data from the Doccano labelling exercise.

In [None]:
# where we keep data for this project
PROJECT_DATA_DIR = (
    "drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/content-metadata-2021/Data"
)

# validated data folder
DATA_DIR = (
    "drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/govNER (1)/Exported data/"
)
os.listdir(DATA_DIR)

## Inspecting data

Let's inspect some of this data...

We start with the "import data". This is the data annotated by the GCP NLP API and WordNet synsets, which was then imported into Doccano for validation.

In [None]:
annotated_fh = os.listdir(DATA_DIR + "Import data volunteer session 19 06 20")

In [None]:
df = pd.read_json(
    DATA_DIR + "Import data volunteer session 19 06 20/" + annotated_fh[5], lines=True
)

In [None]:
df.head()

In [None]:
# add a column to show the text that was labelled and what label it was given
df["entities"] = df.apply(
    lambda x: [(x["text"][pos[0] : pos[1]], pos[2]) for pos in x["labels"]], axis=1
)
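
To make the span format concrete: each entry in `labels` is `[start, end, label]`, where `start` and `end` are character offsets into `text` and the end offset is exclusive. A minimal sketch on a made-up record (the sentence and offsets are illustrative, not taken from the dataset):

```python
# Hypothetical record in the same shape as the exported JSON lines.
record = {
    "text": "Apply for Universal Credit by 5 April",
    "labels": [[10, 26, "FINANCE"], [30, 37, "DATE"]],
}

# Slice the text with each (start, end) pair and keep the label alongside it.
entities = [
    (record["text"][start:end], label) for start, end, label in record["labels"]
]
print(entities)
```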

In [None]:
df

Annotation quality is low here, as expected.

Now let's look at the data that was validated:

In [None]:
annotated_fh_val = os.listdir(DATA_DIR + "Outputs from team sessions")
annotated_fh_val

In [None]:
df_val = pd.read_json(
    DATA_DIR + "Outputs from team sessions/" + annotated_fh_val[0], lines=True
)

In [None]:
df_val.head()

In [None]:
# add a column which shows exactly what text is annotated with what label
df_val["entities"] = df_val.apply(
    lambda x: [(x["text"][pos[0] : pos[1]], pos[2]) for pos in x["labels"]], axis=1
)

In [None]:
df_val

My initial impression is that the validated data is of much higher quality than the unvalidated data. In a sample of several hundred labels from each set, only one validated label appeared erroneous, whereas roughly half of the unvalidated labels did.

## Merging validated data

The validated data is spread across multiple files and folders. Let's merge everything into one dataset.

First, we need the file paths to all of the validated data.

In [None]:
validated_data_folders = [
    "Outputs from team sessions",
    "Output from 2nd March - Content Designers session",
    "Output from 12th March - Data Scientists session",
]

In [None]:
validated_data = [
    os.path.join(DATA_DIR, folder, file)
    for folder in validated_data_folders
    for file in os.listdir(DATA_DIR + folder)
]
validated_data

Now we concatenate the data from each file into a single data frame.

In [None]:
vd = pd.concat(
    [pd.read_json(data, lines=True)[["text", "labels"]] for data in validated_data]
)

In [None]:
# labels aren't always in order of occurrence, so sort them by character position
vd.labels = vd.labels.apply(sorted)

In [None]:
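# some labels are repeated, so we remove them here
# (groupby only collapses adjacent repeats, which is why the labels were sorted first)
vd.labels = vd.labels.apply(
    lambda labels: [label for label, _ in itertools.groupby(labels)]
)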
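
`itertools.groupby` only collapses runs of equal neighbours, so this de-duplication relies on the labels having been sorted beforehand. A small sketch with made-up spans shows the difference (`drop_adjacent_repeats` is a hypothetical helper name used only for this illustration):

```python
import itertools

# Made-up label spans containing one repeat that is not adjacent.
labels = [[46, 54, "DATE"], [10, 38, "FINANCE"], [46, 54, "DATE"]]


def drop_adjacent_repeats(items):
    """Keep one item from each run of equal neighbours."""
    return [key for key, _ in itertools.groupby(items)]


unsorted_result = drop_adjacent_repeats(labels)
sorted_result = drop_adjacent_repeats(sorted(labels))

print(unsorted_result)  # the repeat survives because it is not adjacent
print(sorted_result)  # sorting first brings the repeats together
```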

In [None]:
# add a column to show exactly what text is assigned to which label
vd["labelled_entities"] = vd.apply(
    lambda x: [(x["text"][pos[0] : pos[1]], pos[2]) for pos in x["labels"]], axis=1
)

In [None]:
def charPositionLabelsToTokenMapping(labels, text, convention="IOB"):
    """
    Returns a list of labels, mapping to each token in a given text.
    This is specific to the situation in which only named entities are labelled, and those labels are not mapped
    directly to tokens, but are mapped to character positions.

      Parameters:
        labels (list of lists): contains character positions of named entities assigned to a given label, in a given text, e.g. [[10, 38, "FINANCE"], [46, 54, "DATE"]]
        text (string): the string which has been labelled
        convention (string): tagging format to be used. Inside-outside-beginning (IOB) as default.
          Currently unused: the IOB step below is commented out, so every entity token is prefixed with 'I-'.

      Returns:
        (label_list, token_list) (tuple of lists): the label for each token in text, and the tokens themselves

    """
    # flatten labels into a list of character positions
    positions = [index for label in labels for index in label[:2]]

    # append None (end of text) and prepend 0 (start of text) so that the
    # text before the first entity and after the last entity is captured
    positions.append(None)
    prev_positions = [0] + positions

    # maintain a list of sections of text which belong to the same label
    sections = []

    # identify sections of text which belong to the same label
    for begin, end in zip(prev_positions, positions):
        sections.append(text[begin:end])

    # remove empty strings and strip spaces left over
    sections = [section.strip() for section in sections if section.strip()]

    # create a dict of what text corresponds to named entities, and what
    # named entity that text has been labelled as
    named_entities = {text[i:j]: f"I-{label}" for i, j, label in labels}

    # group sections of tokens together with their label
    label_token_list = [
        (section.split(), named_entities[section])
        if section in named_entities
        else (section.split(), "O")
        for section in sections
    ]

    # if convention == 'IOB':
    #   # if we have a multi-word entity, make the first label prefixed with 'B-' instead of 'I-'
    #   new_l = []
    #   for x in label_token_list:
    #     if x[1] != 'O' and len(x[0]) > 1:
    #       new_l.append((x[0][0],x[1].replace('I-','B-')))
    #       new_l.append((x[0][1:],x[1]))
    #     else:
    #       new_l.append(x)
    #   label_token_list = new_l

    # directly map each label to a token
    label_list = [
        label_token[-1] for label_token in label_token_list for _ in label_token[0]
    ]
    token_list = [token for label_token in label_token_list for token in label_token[0]]

    return (label_list, token_list)
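
The function above prefixes every entity token with `I-`, because the IOB step is commented out. As a rough sketch of what such a conversion could look like, the hypothetical helper `to_iob2` below takes groups in the same shape as `label_token_list` and marks the first token of every entity with `B-`. This is the IOB2 variant, which is an assumption: the commented-out code only switches to `B-` for multi-word entities.

```python
def to_iob2(label_token_groups):
    """Hypothetical helper: expand (tokens, label) groups into IOB2 tags.

    Each group is (list_of_tokens, label), where label is "O" or "I-<TYPE>",
    mirroring label_token_list in the function above.
    """
    tags = []
    for tokens, label in label_token_groups:
        if label == "O":
            tags.extend(["O"] * len(tokens))
            continue
        entity_type = label[2:]  # drop the "I-" prefix
        for position, _ in enumerate(tokens):
            prefix = "B-" if position == 0 else "I-"
            tags.append(prefix + entity_type)
    return tags


# Made-up groups, not taken from the dataset.
groups = [
    (["Apply", "for"], "O"),
    (["Universal", "Credit"], "I-FINANCE"),
    (["by"], "O"),
    (["5", "April"], "I-DATE"),
]
iob2_tags = to_iob2(groups)
print(iob2_tags)
```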

In [None]:
# add columns for label lists and the tokens each label maps to
vd[["label_list", "text_tokens"]] = vd.apply(
    lambda x: charPositionLabelsToTokenMapping(x.labels, x.text),
    axis=1,
    result_type="expand",
)
vd
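
To see what the mapping does, the sketch below replays the splitting step on a made-up sentence (not from the dataset). It mirrors the logic of `charPositionLabelsToTokenMapping` in condensed form, so treat it as an illustration rather than a replacement.

```python
text = "Apply for Universal Credit by 5 April"
labels = [[10, 26, "FINANCE"], [30, 37, "DATE"]]

# Every span boundary becomes a cut point; None means "to the end of the text".
cuts = [index for label in labels for index in label[:2]] + [None]
sections = [text[begin:end] for begin, end in zip([0] + cuts, cuts)]
sections = [section.strip() for section in sections if section.strip()]

# Sections that exactly match a labelled span get that label, the rest get "O".
named_entities = {text[i:j]: f"I-{label}" for i, j, label in labels}
label_list = [
    named_entities.get(section, "O") for section in sections for _ in section.split()
]
token_list = [token for section in sections for token in section.split()]

print(sections)
print(list(zip(token_list, label_list)))
```

Because the lookup is keyed on the span text, a span that includes leading or trailing whitespace no longer matches its stripped section and falls back to `O`.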

In [None]:
vd.to_csv(os.path.join(PROJECT_DATA_DIR, "govuk-labelled-data-ner-validated.csv"))

## Label counts

Value counts of labels for each entity.

In [None]:
vd.labelled_entities.apply(lambda x: [label[1] for label in x]).explode().value_counts()

Value counts of labels for each token.

In [None]:
vd.label_list.explode().value_counts()

MONEY and SCHEME are not present in the validated data.

## Problems with data

Was any data duplicated? Here, duplication means that the same text was validated more than once.

In [None]:
vd[~vd.text.duplicated()]

Yes. Previously we had 7129 rows of data. After removing rows containing duplicated text, this has decreased by 1224 to 5905 rows. This means 5905 unique sentences were validated, and many were validated multiple times.

Let's examine the differences in labelling between repeats of validation.

In [None]:
vd[vd.text.duplicated(keep=False)][["text", "labelled_entities"]].sort_values("text")
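
One way to narrow this down is to keep only the texts whose repeats disagree. The sketch below uses a toy frame with the same `text` and `labelled_entities` columns; the rows are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "text": [
            "Open until 5pm",
            "Open until 5pm",
            "Pay by 5 April",
            "Pay by 5 April",
        ],
        "labelled_entities": [
            [("5pm", "DATE")],
            [],
            [("5 April", "DATE")],
            [("5 April", "DATE")],
        ],
    }
)

# Lists are unhashable, so convert each annotation to a tuple before counting.
annotation = toy.labelled_entities.apply(tuple)

# Number of distinct annotations seen for each text.
n_variants = annotation.groupby(toy.text).nunique()

# Texts that were validated more than once with differing labels.
conflicting_texts = n_variants[n_variants > 1].index.tolist()
print(conflicting_texts)
```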

Unvalidated data conflates times with money. How are times handled in the validated data?

In [None]:
vd[vd.text.str.contains("pm ", case=False)][["text", "labelled_entities"]]

Times are now assigned the DATE label instead of MONEY, although not with complete consistency: some times are not assigned to any named entity.
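
Searching for `"pm "` is a crude filter: it misses morning times and times at the end of a sentence, and it also matches words that merely end in "pm". A slightly stricter pattern could look like the sketch below. The regular expression is an assumption about how times are written, not something derived from the data.

```python
import re

# Assumed formats: "5pm", "5 pm", "5:30pm", "5.30 am" (12-hour clock only).
TIME_PATTERN = re.compile(r"\b\d{1,2}(?:[:.]\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)

# Invented sentences for illustration.
examples = [
    "The office closes at 5pm.",
    "Lines are open 8:30am to 6 pm",
    "Submit the form by 5 April",
]
matches = [TIME_PATTERN.findall(sentence) for sentence in examples]
print(matches)
```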

Where has the MONEY label gone? Let's check if it was ever used by the Google NLP API.

In [None]:
os.listdir(DATA_DIR)

In [None]:
unvalidated_data_folders = [
    "Import data for team session on 03 02 2020",
    "Import data for data science team 12 03 20",
    "Import data volunteer session 19 06 20",
]

In [None]:
unvalidated_data = [
    os.path.join(DATA_DIR, folder, file)
    for folder in unvalidated_data_folders
    for file in os.listdir(DATA_DIR + folder)
]

In [None]:
uvd = pd.concat(
    [pd.read_json(data, lines=True)[["text", "labels"]] for data in unvalidated_data]
)
# add a column to show exactly what text is assigned to which label
uvd["labelled_entities"] = uvd.apply(
    lambda x: [(x["text"][pos[0] : pos[1]], pos[2]) for pos in x["labels"]], axis=1
)
# add columns for label lists and the tokens each label maps to
uvd[["label_list", "text_tokens"]] = uvd.apply(
    lambda x: charPositionLabelsToTokenMapping(x.labels, x.text),
    axis=1,
    result_type="expand",
)
uvd

I note here that there are 9765 rows of unvalidated data, compared to 7129 rows of validated data.

In [None]:
uvd.labelled_entities.apply(
    lambda x: [label[1] for label in x]
).explode().value_counts()

There are only 21 entities assigned the MONEY label. Let's inspect the corresponding text.

In [None]:
uvd[uvd.label_list.apply(lambda x: "I-MONEY" in x)][["text", "labelled_entities"]]

Inspecting this output, MONEY was only ever assigned to times, so these labels will have been corrected during validation. But does the text actually mention money? Let's look for text containing currency symbols.

In [None]:
vd[vd.text.str.contains(r"[£€$]")][["text", "labelled_entities"]]

Sometimes money is tagged to FINANCE, sometimes it is not tagged to anything.
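
To quantify this rather than judge it by eye, one could check whether each currency symbol falls inside a labelled span. A minimal sketch, using invented rows in the same `text`/`labels` shape (`currency_labels` is a hypothetical helper, not part of the notebook):

```python
import re

CURRENCY = re.compile(r"[£€$]")


def currency_labels(text, labels):
    """For each currency symbol in text, return the label covering it, or None."""
    found = []
    for match in CURRENCY.finditer(text):
        position = match.start()
        covering = [label for start, end, label in labels if start <= position < end]
        found.append(covering[0] if covering else None)
    return found


# Invented rows: one amount inside a FINANCE span, one with no label at all.
rows = [
    ("You can claim up to £500 a year", [[20, 24, "FINANCE"]]),
    ("The fee is £30", []),
]
results = [currency_labels(text, labels) for text, labels in rows]
print(results)
```

Counting the `None` entries across the validated data would give the share of currency mentions that were left unlabelled.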

What about entities tagged to SCHEME? We don't have any of those in the validated data, either.

In [None]:
uvd[uvd.label_list.apply(lambda x: "I-SCHEME" in x)][["text", "labelled_entities"]]

The word 'scheme' is tagged as SCHEME. Understandably, the word 'scheme' isn't informative of what schemes are mentioned on a page. However, in some cases, looking at the text preceding the word 'scheme', an actual scheme is mentioned. After human validation, the SCHEME tags were removed, instead of being extended backwards.