<a id="Prepare Data"></a>

# Prepare Re3d Dataset for NER

Re3d is a dataset compiled for Named Entity Recognition (NER) in the national defense domain. It draws on sources such as:

* Australian Department of Foreign Affairs
* BBC Online
* CENTCOM
* Delegation of the European Union to Syria
* UK Government
* US State Department
* Wikipedia

To prepare this dataset, we will need tokenized words, their BIO tags, and the number of the sentence each word belongs to. Let's investigate how the data is structured.

In [29]:
import json
from pathlib import Path
from ast import literal_eval

In [30]:
ROOT_DIR = Path('notebooks/format_data.ipynb').resolve().parents[2]
DATA_DIR = ROOT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
PREPARED_DIR = DATA_DIR / "prepared"

In [31]:
with open(RAW_DIR / "UK Government/documents.json", "r") as fp:
    lines = fp.readlines()

In [32]:
lines[0]

'{ "_id" : "09FF200CBB76594D688F8EB565717543", "sourceName" : "UK Government", "sourceUrl" : "https://www.gov.uk/government/news/minister-for-the-middle-east-condemns-attack-in-baghdad", "wordCount" : 95, "sentenceCount" : 5, "title" : "Minister for the Middle East condemns attack in Baghdad", "text" : "Minister for the Middle East, Tobias Ellwood, responds to the attack in the Karrada district of Baghdad\\n\\nForeign Office Minister, Tobias Ellwood, said:\\n\\nThe attack was a horrific act of terror against innocent people in the early hours of Sunday morning in Karrada, Baghdad. I offer my sincere condolences to the families and friends of the victims, and hope those injured in the attack recover quickly.\\n\\nDaesh targeted families out shopping for Eid and those celebrating suhur. It violated the peace of Ramadan. The UK stands by Iraq to defeat Daesh and end this violence." }\n'

This is the first document from the UK Government source. What data type is it?

In [33]:
type(lines[0])

str

Each line is a string, so we need to parse these documents into dictionaries before we can work with their fields.

In [34]:
document = literal_eval(lines[0])
document

{'_id': '09FF200CBB76594D688F8EB565717543',
 'sourceName': 'UK Government',
 'sourceUrl': 'https://www.gov.uk/government/news/minister-for-the-middle-east-condemns-attack-in-baghdad',
 'wordCount': 95,
 'sentenceCount': 5,
 'title': 'Minister for the Middle East condemns attack in Baghdad',
 'text': 'Minister for the Middle East, Tobias Ellwood, responds to the attack in the Karrada district of Baghdad\n\nForeign Office Minister, Tobias Ellwood, said:\n\nThe attack was a horrific act of terror against innocent people in the early hours of Sunday morning in Karrada, Baghdad. I offer my sincere condolences to the families and friends of the victims, and hope those injured in the attack recover quickly.\n\nDaesh targeted families out shopping for Eid and those celebrating suhur. It violated the peace of Ramadan. The UK stands by Iraq to defeat Daesh and end this violence.'}

In [35]:
type(document)

dict
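Because each line of `documents.json` looks like a JSON object, `json.loads` is a possible alternative to `literal_eval`. The sketch below uses a shortened, made-up record (the `reviewed` field is hypothetical and does not appear in the sample above) to show one difference: JSON's `true`, `false`, and `null` are not Python literals, so `literal_eval` rejects them while `json.loads` converts them.

```python
import json
from ast import literal_eval

# Made-up record in the same one-object-per-line shape as documents.json.
# The "reviewed" field is hypothetical and only here to show a JSON boolean.
line = '{ "_id" : "abc", "wordCount" : 3, "reviewed" : true, "text" : "The UK stands" }\n'

record = json.loads(line)
print(type(record).__name__)  # dict
print(record["reviewed"])     # True

# literal_eval only understands Python literals, so the bare `true` fails
try:
    literal_eval(line)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
print(literal_eval_ok)        # False
```

The sample document contains only strings and integers, which is why `literal_eval` works on it as written.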

Now that we have the dictionary, we need to tokenize the text. We will use spaCy because its tokenizer tends to keep abbreviations intact (e.g. "U.S." as one token), which matters when the tokens are later matched against entity spans.

In [36]:
import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp(document["text"])
for token in doc:
    print(token.text)

Minister
for
the
Middle
East
,
Tobias
Ellwood
,
responds
to
the
attack
in
the
Karrada
district
of
Baghdad



Foreign
Office
Minister
,
Tobias
Ellwood
,
said
:



The
attack
was
a
horrific
act
of
terror
against
innocent
people
in
the
early
hours
of
Sunday
morning
in
Karrada
,
Baghdad
.
I
offer
my
sincere
condolences
to
the
families
and
friends
of
the
victims
,
and
hope
those
injured
in
the
attack
recover
quickly
.



Daesh
targeted
families
out
shopping
for
Eid
and
those
celebrating
suhur
.
It
violated
the
peace
of
Ramadan
.
The
UK
stands
by
Iraq
to
defeat
Daesh
and
end
this
violence
.


Which of these tokens are labeled as entities, and what type of entity are they? For convenience, the entities for this document are copied verbatim from the entities.json file.

In [37]:
entities = [{ "_id" : "09FF200CBB76594D688F8EB565717543-0-0-28-Person", "begin" : 0, "end" : 28, "type" : "Person", "value" : "Minister for the Middle East", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-105-128-Person", "begin" : 105, "end" : 128, "type" : "Person", "value" : "Foreign Office Minister", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-130-144-Person", "begin" : 130, "end" : 144, "type" : "Person", "value" : "Tobias Ellwood", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-17-28-Location", "begin" : 17, "end" : 28, "type" : "Location", "value" : "Middle East", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 0 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-224-245-Temporal", "begin" : 224, "end" : 245, "type" : "Temporal", "value" : "early hours of Sunday", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-257-264-Location", "begin" : 257, "end" : 264, "type" : "Location", "value" : "Karrada", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-266-273-Location", "begin" : 266, "end" : 273, "type" : "Location", "value" : "Baghdad", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-30-44-Person", "begin" : 30, "end" : 44, "type" : "Person", "value" : "Tobias Ellwood", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-72-103-Location", "begin" : 72, "end" : 103, "type" : "Location", "value" : "the Karrada district of Baghdad", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-0-96-103-Location", "begin" : 96, "end" : 103, "type" : "Location", "value" : "Baghdad", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 0 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-1-0-1-Person", "begin" : 275, "end" : 276, "type" : "Person", "value" : "I", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-1-79-83-Person", "begin" : 354, "end" : 358, "type" : "Person", "value" : "hope", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 0 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-2-0-5-Organisation", "begin" : 405, "end" : 410, "type" : "Organisation", "value" : "Daesh", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-2-15-23-Organisation", "begin" : 420, "end" : 428, "type" : "Organisation", "value" : "families", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-3-25-32-Temporal", "begin" : 504, "end" : 511, "type" : "Temporal", "value" : "Ramadan", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-4-0-6-Organisation", "begin" : 513, "end" : 519, "type" : "Organisation", "value" : "The UK", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-4-17-21-Location", "begin" : 530, "end" : 534, "type" : "Location", "value" : "Iraq", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 0 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-4-17-21-Organisation", "begin" : 530, "end" : 534, "type" : "Organisation", "value" : "Iraq", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 0.97 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-4-32-37-Organisation", "begin" : 545, "end" : 550, "type" : "Organisation", "value" : "Daesh", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 1 },
{ "_id" : "09FF200CBB76594D688F8EB565717543-4-4-6-Location", "begin" : 517, "end" : 519, "type" : "Location", "value" : "UK", "documentId" : "09FF200CBB76594D688F8EB565717543", "confidence" : 0 }]

Each entity has a label (`type`) and a character span given by its start (`begin`) and end (`end`) offsets. To match entities to tokens, we first need the character spans of the tokens generated previously.
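As a quick sanity check, the `begin` and `end` values appear to be character offsets into the document text, with `end` exclusive. The sketch below slices the opening words of the sample document with three offset pairs copied from the entity list above.

```python
# Opening words of the sample document, copied from the output above
text = "Minister for the Middle East, Tobias Ellwood, responds to the attack"

# (begin, end, value) triples copied from the entity list above
samples = [
    (0, 28, "Minister for the Middle East"),
    (17, 28, "Middle East"),
    (30, 44, "Tobias Ellwood"),
]

# `end` is exclusive, so it lines up with Python slice semantics
matches = [text[begin:end] for begin, end, _ in samples]

for (_, _, value), match in zip(samples, matches):
    print(repr(match), match == value)
```

Note that "Middle East" is nested inside "Minister for the Middle East", so a single token can belong to more than one entity.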

In [38]:
from typing import List, Tuple, Dict, Union

Re3dDict = Dict[str, Union[str, int]]
TokenSpan = List[Tuple[int, str, int, int]]

def get_token_spans(text: str) -> TokenSpan:
    """Generate Token Spans given text

    Args:
        text (str): raw text

    Returns:
        TokenSpan: List of (sentence_idx, word, index_start, index_end)
    """

    doc = nlp(text)
    token_spans = []
    for sentence_idx, sent in enumerate(doc.sents):
        for token in sent:
            token_span = doc[token.i : token.i + 1]
            token_spans.append((sentence_idx, token.text, token_span.start_char, token_span.end_char))

    return token_spans

In [39]:
token_spans = get_token_spans(document["text"])
token_spans

[(0, 'Minister', 0, 8),
 (0, 'for', 9, 12),
 (0, 'the', 13, 16),
 (0, 'Middle', 17, 23),
 (0, 'East', 24, 28),
 (0, ',', 28, 29),
 (0, 'Tobias', 30, 36),
 (0, 'Ellwood', 37, 44),
 (0, ',', 44, 45),
 (0, 'responds', 46, 54),
 (0, 'to', 55, 57),
 (0, 'the', 58, 61),
 (0, 'attack', 62, 68),
 (0, 'in', 69, 71),
 (0, 'the', 72, 75),
 (0, 'Karrada', 76, 83),
 (0, 'district', 84, 92),
 (0, 'of', 93, 95),
 (0, 'Baghdad', 96, 103),
 (0, '\n\n', 103, 105),
 (1, 'Foreign', 105, 112),
 (1, 'Office', 113, 119),
 (1, 'Minister', 120, 128),
 (1, ',', 128, 129),
 (1, 'Tobias', 130, 136),
 (1, 'Ellwood', 137, 144),
 (1, ',', 144, 145),
 (1, 'said', 146, 150),
 (1, ':', 150, 151),
 (1, '\n\n', 151, 153),
 (1, 'The', 153, 156),
 (1, 'attack', 157, 163),
 (1, 'was', 164, 167),
 (1, 'a', 168, 169),
 (1, 'horrific', 170, 178),
 (1, 'act', 179, 182),
 (1, 'of', 183, 185),
 (1, 'terror', 186, 192),
 (1, 'against', 193, 200),
 (1, 'innocent', 201, 209),
 (1, 'people', 210, 216),
 (1, 'in', 217, 219),
 (1, 'the
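To make the idea of a token character span concrete without loading a spaCy model, here is a minimal sketch that uses a naive regular-expression tokenizer. The function `simple_token_spans` is a hypothetical helper, not part of this notebook or of spaCy, and it has no sentence segmentation.

```python
import re
from typing import List, Tuple


def simple_token_spans(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start_char, end_char) triples from a naive regex tokenizer."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    pattern = re.compile(r"\w+|[^\w\s]")
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]


spans = simple_token_spans("Middle East, Tobias")
print(spans)
# [('Middle', 0, 6), ('East', 7, 11), (',', 11, 12), ('Tobias', 13, 19)]
```

Unlike spaCy, this regex would split "U.S." into four tokens and it drops whitespace tokens such as `'\n\n'`, so it is only an illustration of how start and end offsets relate to the text.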

We will need a data structure to hold the sentence number, token, start index, end index, and labels for each token. Let's define that now.

In [40]:
from dataclasses import dataclass, field


@dataclass
class Word:
    """Class for a word and tags"""

    sentence_num: int
    word: str
    start_idx: int
    end_idx: int
    tags: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Default to the "outside" tag only when no tags were supplied
        if not self.tags:
            self.tags.append("O")

Now that we have the tokens and their character spans, we can label the words appropriately.

In [41]:
def validate_entity(entity: Re3dDict, token_spans: TokenSpan) -> bool:
    """Check that an Re3d entity's span starts on a token boundary.

    Some entities in Re3d start inside words; this keeps only entities whose
    start offset coincides with the start of a token. The end offset is not checked.

    Args:
        entity (Re3dDict): Re3d Entity Dictionary
        token_spans (TokenSpan): Token spans

    Returns:
        bool: True if the entity starts on a token boundary
    """
    return entity["begin"] in {start for _, _, start, _ in token_spans}
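The check above looks only at the entity's `begin` offset. If entities that end inside a word should be rejected as well, a stricter variant could test both offsets. The sketch below is one way to do that; `validate_entity_strict` and the `mid_word` entity are hypothetical and are not used elsewhere in this notebook.

```python
from typing import Dict, List, Tuple, Union

Re3dDict = Dict[str, Union[str, int]]
TokenSpan = List[Tuple[int, str, int, int]]


def validate_entity_strict(entity: Re3dDict, token_spans: TokenSpan) -> bool:
    """Return True only if the entity starts AND ends on token boundaries."""
    starts = {start for _, _, start, _ in token_spans}
    ends = {end for _, _, _, end in token_spans}
    return entity["begin"] in starts and entity["end"] in ends


# Token spans copied from the get_token_spans output above
token_spans = [(0, "Middle", 17, 23), (0, "East", 24, 28), (0, ",", 28, 29)]

aligned = {"begin": 17, "end": 28, "type": "Location"}   # "Middle East"
mid_word = {"begin": 17, "end": 26, "type": "Location"}  # made-up: ends inside "East"

print(validate_entity_strict(aligned, token_spans))   # True
print(validate_entity_strict(mid_word, token_spans))  # False
```

Whether the stricter rule is desirable depends on how many Re3d annotations it would discard, which this notebook does not measure.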

In [42]:
def label_words(token_spans: TokenSpan, entities: List[Re3dDict]) -> List[Word]:
    """Label individual words with Re3d entity tags

    Args:
        token_spans (TokenSpan): Token spans
        entities (List[Re3dDict]): List of entities

    Returns:
        List[Word]: List of Re3d labeled words
    """
    words = []

    # Validate once up front; validity does not depend on the word being labeled
    valid_entities = [entity for entity in entities if validate_entity(entity, token_spans)]

    for sentence_num, text, start_idx, end_idx in token_spans:
        word = Word(sentence_num, text, start_idx, end_idx)

        for entity in valid_entities:
            begin, end, entity_type = entity["begin"], entity["end"], entity["type"]

            # Tag the word when its span lies entirely inside the entity span
            if begin <= word.start_idx <= word.end_idx <= end:
                if word.tags == ["O"]:
                    word.tags = []
                word.tags.append(entity_type)

        words.append(word)

    return words

In [43]:
words = label_words(token_spans, entities)
words

[Word(sentence_num=0, word='Minister', start_idx=0, end_idx=8, tags=['Person']),
 Word(sentence_num=0, word='for', start_idx=9, end_idx=12, tags=['Person']),
 Word(sentence_num=0, word='the', start_idx=13, end_idx=16, tags=['Person']),
 Word(sentence_num=0, word='Middle', start_idx=17, end_idx=23, tags=['Person', 'Location']),
 Word(sentence_num=0, word='East', start_idx=24, end_idx=28, tags=['Person', 'Location']),
 Word(sentence_num=0, word=',', start_idx=28, end_idx=29, tags=['O']),
 Word(sentence_num=0, word='Tobias', start_idx=30, end_idx=36, tags=['Person']),
 Word(sentence_num=0, word='Ellwood', start_idx=37, end_idx=44, tags=['Person']),
 Word(sentence_num=0, word=',', start_idx=44, end_idx=45, tags=['O']),
 Word(sentence_num=0, word='responds', start_idx=46, end_idx=54, tags=['O']),
 Word(sentence_num=0, word='to', start_idx=55, end_idx=57, tags=['O']),
 Word(sentence_num=0, word='the', start_idx=58, end_idx=61, tags=['O']),
 Word(sentence_num=0, word='attack', start_idx=62, e

Notice that the Re3d tag scheme doesn't mark where an entity begins and where it continues, as the BIO (Begin, Inside, Outside) scheme does. We will need to convert the tags to BIO for downstream modeling.

In [44]:
from copy import deepcopy


def bio_tagger(words: List[Word]) -> List[Word]:
    """Format Re3d Entity tags into BIO schema

    Args:
        words (List[Word]): List of Words with Re3d Schema

    Returns:
        List[Word]: Words with BIO Schema
    """
    max_multilabel_len = max(len(word.tags) for word in words)
    words_out = []
    prev_tag = ["O"] * max_multilabel_len
    word_iter = 0
    for word in words:
        bio_tagged = []
        labels = deepcopy(word.tags)

        for idx, label in enumerate(labels):
            if label == "O":
                bio_tagged.append(label)
            elif label != "O" and prev_tag[idx] == "O":  # Begin NE
                bio_tagged.append("B-" + label)
            elif prev_tag[idx] != "O" and prev_tag[idx] == label:  # Inside NE
                bio_tagged.append("I-" + label)
            elif prev_tag[idx] != "O" and prev_tag[idx] != label:  # Adjacent NE
                bio_tagged.append("B-" + label)
            prev_tag[idx] = label

        # Reset secondary/tertiary labels if no extra labels
        if len(labels) > 1:
            word_iter += 1
            if word_iter > 1:
                for idx in range(1, len(prev_tag)):
                    prev_tag[idx] = "O"
                word_iter = 0

        word.tags = bio_tagged

        words_out.append(word)

    return words_out

In [45]:
bio_tagger(words)

[Word(sentence_num=0, word='Minister', start_idx=0, end_idx=8, tags=['B-Person']),
 Word(sentence_num=0, word='for', start_idx=9, end_idx=12, tags=['I-Person']),
 Word(sentence_num=0, word='the', start_idx=13, end_idx=16, tags=['I-Person']),
 Word(sentence_num=0, word='Middle', start_idx=17, end_idx=23, tags=['I-Person', 'B-Location']),
 Word(sentence_num=0, word='East', start_idx=24, end_idx=28, tags=['I-Person', 'I-Location']),
 Word(sentence_num=0, word=',', start_idx=28, end_idx=29, tags=['O']),
 Word(sentence_num=0, word='Tobias', start_idx=30, end_idx=36, tags=['B-Person']),
 Word(sentence_num=0, word='Ellwood', start_idx=37, end_idx=44, tags=['I-Person']),
 Word(sentence_num=0, word=',', start_idx=44, end_idx=45, tags=['O']),
 Word(sentence_num=0, word='responds', start_idx=46, end_idx=54, tags=['O']),
 Word(sentence_num=0, word='to', start_idx=55, end_idx=57, tags=['O']),
 Word(sentence_num=0, word='the', start_idx=58, end_idx=61, tags=['O']),
 Word(sentence_num=0, word='attack
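To isolate the core BIO rule from the multi-label bookkeeping in `bio_tagger`, here is a simplified single-label sketch. The function `to_bio` is a hypothetical helper for illustration and is not a replacement for `bio_tagger`.

```python
from typing import List


def to_bio(labels: List[str]) -> List[str]:
    """Convert one plain entity label per token into BIO tags."""
    bio = []
    prev = "O"
    for label in labels:
        if label == "O":
            bio.append("O")               # outside any entity
        elif label == prev:
            bio.append("I-" + label)      # continues the previous entity
        else:
            bio.append("B-" + label)      # starts a new entity
        prev = label
    return bio


labels = ["Person", "Person", "O", "Location", "Organisation", "Organisation"]
print(to_bio(labels))
# ['B-Person', 'I-Person', 'O', 'B-Location', 'B-Organisation', 'I-Organisation']
```

One limitation of this rule is that two distinct entities of the same type that sit directly next to each other are merged into a single B/I run, because the plain labels carry no boundary information.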

Now that we have the words tagged in the BIO schema, we need to compile the words across the entire Re3d dataset. Outlined below is the process for a single set of documents.

In [46]:
def match_doc_id(doc: Re3dDict, line: Re3dDict) -> bool:
    """Match entities to document by document ID

    Args:
        doc (Re3dDict): Re3d Document dictionary
        line (Re3dDict): Re3d Entity dictionary

    Returns:
        bool: True if document id matches between arguments
    """
    return doc["_id"] == line["documentId"]

In [47]:
import pandas as pd


def preprocess_docs(doc_path: Path) -> List[pd.DataFrame]:
    """ETL a given documents.json into a dataframe with sentence number,
    word, start index, end index, and BIO tags

    Args:
        doc_path (Path): Path to */documents.json

    Returns:
        List[pd.DataFrame]: Dataframe for each document in doc_path
    """
    # load docs
    with open(doc_path, "r") as fp:
        docs = [literal_eval(line) for line in fp]

    # Load entities once rather than re-reading the file for every document
    entities_path = doc_path.parent / "entities.json"
    with open(entities_path, "r") as fp:
        all_entities = [literal_eval(line) for line in fp]

    output = []

    for doc in docs:
        # Select entities for doc
        entities = [entity for entity in all_entities if match_doc_id(doc, entity)]

        # ID sentence number, token, start/end idx
        token_spans = get_token_spans(doc["text"])

        # Add tags
        labeled_words = label_words(token_spans, entities)

        # Format tags as BIO
        tagged_words = bio_tagger(labeled_words)

        # Create dataframe
        doc_df = pd.DataFrame(tagged_words)

        output.append(doc_df)

    return output


In [48]:
dfs = preprocess_docs(RAW_DIR / "Australian Department of Foreign Affairs/documents.json")

In [49]:
def merge_dfs(dfs: List[List[pd.DataFrame]]) -> pd.DataFrame:
    """Merge nested lists of dataframes into one dataframe

    Args:
        dfs (List[List[pd.DataFrame]]): Nested dataframes

    Returns:
        pd.DataFrame: Merged dataframe
    """
    flat_dfs = [item for sublist in dfs for item in sublist]
    base_df = flat_dfs.pop()

    while flat_dfs:
        df = flat_dfs.pop()
        # Shift sentence numbers so they stay unique across documents
        max_sentence_num = base_df["sentence_num"].max()
        df["sentence_num"] = df["sentence_num"] + max_sentence_num + 1
        base_df = pd.concat([base_df, df], axis=0)

    return base_df
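The key step in `merge_dfs` is offsetting each document's sentence numbers so that they do not collide after concatenation. The sketch below shows the same idea on plain lists, without pandas; `renumber_sentences` is a hypothetical helper, and it walks the documents in forward order, whereas `merge_dfs` pops them from the end of the list.

```python
from typing import List


def renumber_sentences(docs: List[List[int]]) -> List[int]:
    """Concatenate per-document sentence numbers into one non-colliding sequence."""
    merged: List[int] = []
    offset = 0
    for sentence_nums in docs:
        merged.extend(num + offset for num in sentence_nums)
        if merged:
            # The next document starts numbering after the largest number so far
            offset = max(merged) + 1
    return merged


doc_a = [0, 0, 1, 1]  # per-token sentence numbers of a first document
doc_b = [0, 1, 1, 2]  # a second document also starts at 0
print(renumber_sentences([doc_a, doc_b]))
# [0, 0, 1, 1, 2, 3, 3, 4]
```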

In [50]:
docs_df = merge_dfs([dfs])

In [51]:
docs_df.reset_index(inplace=True, drop=True)

In [52]:
docs_df

     sentence_num        word  start_idx  end_idx                  tags
0               0     Foreign          0        7            [B-Person]
1               0    Minister          8       16            [I-Person]
2               0       Julie         17       22  [I-Person, B-Person]
3               0      Bishop         23       29  [I-Person, I-Person]
4               0         met         30       33                   [O]
..            ...         ...        ...      ...                   ...
383            15          in        939      941                   [O]
384            15  particular        942      952                   [O]
385            15     against        953      960                   [O]
386            15        ISIL        961      965      [B-Organisation]


In [53]:
type(docs_df["tags"][0])

list

In [54]:
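# Reuse the `nlp` pipeline loaded earlier; reloading en_core_web_lg is unnecessary

def add_pos_tag(word: str) -> str:
    """Return the coarse part-of-speech tag spaCy assigns to a single word.

    The word is tagged in isolation, so sentence context is not available.
    """
    doc = nlp(word)
    return doc[0].pos_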

In [55]:
docs_df["single_tag"] = docs_df["tags"].apply(lambda x: x[0])
docs_df["POS"] = docs_df["word"].apply(add_pos_tag)

In [56]:
docs_df.head()

     sentence_num        word  start_idx  end_idx                  tags  single_tag    POS
0               0     Foreign          0        7            [B-Person]    B-Person    ADJ
1               0    Minister          8       16            [I-Person]    I-Person  PROPN
2               0       Julie         17       22  [I-Person, B-Person]    I-Person  PROPN
3               0      Bishop         23       29  [I-Person, I-Person]    I-Person  PROPN
4               0         met         30       33                   [O]           O   VERB


In [None]:
from tqdm import tqdm

def main() -> None:
    """Transform raw Re3d Data into a master csv"""

    docs_list = list(RAW_DIR.glob("**/documents.json"))

    # Process docs
    dfs = [preprocess_docs(doc_path) for doc_path in tqdm(docs_list, desc="Formatting Doc Paths")]

    # Merge docs
    master_df = merge_dfs(dfs)
    master_df.reset_index(drop=True, inplace=True)

    # Split single labels out instead of multi-label
    master_df["single_tag"] = master_df["tags"].apply(lambda x: x[0])
    master_df["POS"] = master_df["word"].apply(add_pos_tag)

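    # Save master df, creating the output directory if it does not exist yet
    PREPARED_DIR.mkdir(parents=True, exist_ok=True)
    master_df.to_csv(PREPARED_DIR / "master.csv", index=False)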
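One thing to keep in mind when reading `master.csv` back: CSV has no list type, so a list-valued column such as `tags` would come back as a string. The sketch below imitates that round trip with the standard `csv` module, assuming the list is written as its Python string representation, and recovers the list with `literal_eval`. The in-memory buffer stands in for the real file.

```python
import csv
import io
from ast import literal_eval

# Write one row whose "tags" cell is the string form of a list (assumed format)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["word", "tags"])
writer.writerow(["Julie", str(["I-Person", "B-Person"])])

# Read it back: every cell is a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["tags"]

# Parse the string back into a list of tags
parsed = literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)  # str list
print(parsed)                                     # ['I-Person', 'B-Person']
```

The `single_tag` column avoids this issue because it holds one plain string per row.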