### Converting Document-Level Data and Labels Into Sentence-Level Data and Labels

This notebook explores converting the document-level data and labels into
sentence-level data and labels for training Masked Language Models, like Kaggle
model 1, and Named Entity Recognition models.

In [46]:
import json
import os
from itertools import count, islice
from typing import List

import pandas as pd
import regex as re
import spacy
from unidecode import unidecode
import tqdm.notebook as tqdm

import democratizing_data_ml_algorithms.models.regex_model as rm

nlp = spacy.load('en_core_web_trf')

In [3]:
kaggle_labels = pd.read_csv("../data/kaggle/train.csv")
kaggle_labels.head(2)

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
1,2f26f645-3dec-485d-b68d-f013c9e05e60,Educational Attainment of High School Dropouts...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study


This file contains only a single label per row, so we need to aggregate the
labels by document id.

In [4]:
aggregated_labels = pd.DataFrame({"id": kaggle_labels["Id"].unique()})

def aggregate_clean_label(row: pd.DataFrame):
    labels = list(map(lambda x: x.strip(), row["dataset_label"].unique()))
    return "|".join(labels)

unique_labels = kaggle_labels.groupby("Id").apply(aggregate_clean_label)
aggregated_labels["label"] = aggregated_labels["id"].apply(lambda x: unique_labels[x])
aggregated_labels.head(2)

Unnamed: 0,id,label
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,National Education Longitudinal Study|Educatio...
1,2f26f645-3dec-485d-b68d-f013c9e05e60,National Education Longitudinal Study|Educatio...
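As a cross-check on the aggregation above, here is a minimal sketch of the same idea on a toy frame. The frame `toy_labels` and its values are made up for illustration. Note one difference from the cell above: this version strips whitespace before de-duplicating, so `"Study One "` and `"Study One"` collapse into one label.

```python
import pandas as pd

# Toy stand-in for train.csv: one row per (document, label) pair.
toy_labels = pd.DataFrame({
    "Id": ["doc-a", "doc-a", "doc-b", "doc-a"],
    "dataset_label": ["Study One ", "Study Two", "Study One", "Study One"],
})

# Group by document, strip whitespace, drop duplicates, join with "|".
# sort=False keeps documents in order of first appearance.
toy_aggregated = (
    toy_labels.groupby("Id", sort=False)["dataset_label"]
    .agg(lambda s: "|".join(s.str.strip().unique()))
    .reset_index()
    .rename(columns={"Id": "id", "dataset_label": "label"})
)
print(toy_aggregated)
```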


In [5]:
document_id = aggregated_labels.id.values[0]

with open("../data/kaggle/train/" + document_id + ".json") as f:
    document = json.load(f)

text = unidecode(" ".join(list(map(
    lambda x: x["text"].strip().replace("\n", " "), 
    document
))))

label = aggregated_labels[aggregated_labels.id == document_id].label.values[0]

print(text[:25])
print(label)

This study used data from
National Education Longitudinal Study|Education Longitudinal Study


To produce a sentence-level dataset we will use spaCy to parse the document and
split it into sentences. We will then assign tags and store them in a format similar
to [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/), which is a popular dataset for NER models. The CoNLL-2003
data looks like this:

```
   U.N.         NNP  I-NP  I-ORG 
   official     NN   I-NP  O 
   Ekeus        NNP  I-NP  I-PER 
   heads        VBZ  I-VP  O 
   for          IN   I-PP  O 
   Baghdad      NNP  I-NP  I-LOC 
   .            .    O     O 
```

The first column contains the tokens, the second column contains the
part-of-speech tagging, the third column contains the syntactic chunk tagging,
and the fourth column contains the named entity tagging.
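To make the column layout concrete, here is a small sketch that reads whitespace-separated rows like the ones above into named fields. `ConllRow` and `parse_conll_sentence` are hypothetical helper names, not part of any library, and the sketch assumes exactly four columns per row.

```python
from typing import List, NamedTuple

class ConllRow(NamedTuple):
    token: str   # column 1: the token
    pos: str     # column 2: part-of-speech tag
    chunk: str   # column 3: syntactic chunk tag
    ner: str     # column 4: named entity tag

def parse_conll_sentence(block: str) -> List[ConllRow]:
    """Parse one sentence of whitespace-separated CoNLL-2003 style rows."""
    rows = []
    for line in block.strip().splitlines():
        fields = line.split()
        if len(fields) == 4:  # skip blank or malformed lines
            rows.append(ConllRow(*fields))
    return rows

sample = """
   U.N.         NNP  I-NP  I-ORG
   official     NN   I-NP  O
   Ekeus        NNP  I-NP  I-PER
"""
rows = parse_conll_sentence(sample)
print([(r.token, r.ner) for r in rows])
```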

We will use the same format, but we will only use the first two columns and add
a third column indicating the label for each token using the [IOB
format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).


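Our tags will be `I-DAT` (inside a dataset mention), `O` (outside any mention), and `B-DAT` (start of a mention that directly follows another one).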
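The tagging scheme can be sketched without spaCy. The helper below assumes the convention visible in the CoNLL-2003 sample, where an entity starts with `I-` and `B-` appears only when an entity directly follows another one. `iob1_tags` and `to_columns` are hypothetical names, and spans are given as token indices with an exclusive end.

```python
from typing import List, Tuple

def iob1_tags(n_tokens: int, spans: List[Tuple[int, int]], entity: str = "DAT") -> List[str]:
    """Tag non-overlapping (start, end) token spans; end is exclusive."""
    tags = ["O"] * n_tokens
    for start, end in sorted(spans):
        # B- is only needed to separate an entity from one right before it.
        prefix = "B-" if start > 0 and tags[start - 1] != "O" else "I-"
        tags[start] = prefix + entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity
    return tags

def to_columns(tokens: List[str], pos_tags: List[str], ner_tags: List[str]) -> str:
    """Render the three-column token / POS / label layout, tab-separated."""
    return "\n".join("\t".join(row) for row in zip(tokens, pos_tags, ner_tags))

toy_tokens = ["the", "National", "Education", "Longitudinal", "Study", "("]
toy_pos = ["DT", "NNP", "NNP", "NNP", "NNP", "-LRB-"]
toy_ner = iob1_tags(len(toy_tokens), [(1, 5)])
print(to_columns(toy_tokens, toy_pos, toy_ner))
```

Two adjacent spans, for example `[(0, 2), (2, 4)]`, are the only case where this sketch emits `B-DAT`.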

In [6]:
parsed_document = nlp(text)

Let's print out the text and tokens for a look at the data.

In [7]:
for sentence in islice(parsed_document.sents, 1):
    print("Sentence:\n", sentence)
    for token in islice(sentence, 20):
        print("\t".join([token.text, token.tag_]))

Sentence:
 This study used data from the National Education Longitudinal Study (NELS:88) to examine the effects of dual enrollment programs for high school students on college degree attainment.
This	DT
study	NN
used	VBD
data	NNS
from	IN
the	DT
National	NNP
Education	NNP
Longitudinal	NNP
Study	NNP
(	-LRB-
NELS:88	NNP
)	-RRB-
to	TO
examine	VB
the	DT
effects	NNS
of	IN
dual	JJ
enrollment	NN


To add the labels we need to see if any of the labels are in the sentence first.

In [8]:
# Throw in a bad label to test for false positives
labels = label.split("|") + ["Banana Pancakes"]
regex_labels = list(map(
    re.compile,
    map(rm.RegexModel.regexify_keyword, labels)
))
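For readers without the `regex_model` module at hand, the idea of turning label strings into patterns can be sketched with the standard library. This is a stand-in under stated assumptions, not the behavior of `rm.RegexModel.regexify_keyword`: `keyword_to_pattern` and `find_labels` are hypothetical helpers that escape each word, tolerate any whitespace between words, and require non-word characters on both sides.

```python
import re
from typing import List

def keyword_to_pattern(keyword: str) -> re.Pattern:
    """Compile a label into a pattern that tolerates variable whitespace."""
    words = [re.escape(word) for word in keyword.split()]
    # Lookarounds keep the match from starting or ending mid-word.
    return re.compile(r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)")

def find_labels(patterns: List[re.Pattern], sentence: str) -> List[str]:
    """Return the matched text for every pattern that occurs in the sentence."""
    return [m.group(0) for p in patterns for m in p.finditer(sentence)]

toy_sentence = "This study used data from the National Education Longitudinal Study (NELS:88)."
toy_label_strings = [
    "National Education Longitudinal Study",
    "Education Longitudinal Study",
    "Banana Pancakes",  # deliberately absent, to check for false positives
]
toy_patterns = [keyword_to_pattern(s) for s in toy_label_strings]
found = find_labels(toy_patterns, toy_sentence)
print(found)
```

The shorter label is found inside the longer one, which is why the tagging step further down has to handle matches that are subsets of other matches.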

In [19]:
def detect_labels(labels:List[re.Pattern], sentence:str) -> List[List[str]]:
    return list(map(
        lambda match: match.captures(), # It's possible to have more than one match
        filter(
            bool,
            map(
                lambda rl: rl.search(sentence),
                labels
            )
        )
    ))

def tag_sentence(regex_labels:List[re.Pattern], sentence:spacy.tokens.span.Span):
    match_lists = sorted(
        detect_labels(regex_labels, sentence.text), 
        key=lambda x: max(map(len, x)), 
        reverse=True
    )

    tokens = [token.text for token in sentence]
    tags = [token.tag_ for token in sentence]
    ner_tags = ["O"] * len(sentence) # assume no match

    for matches in match_lists:
        for match in matches:
            label_tokens = nlp(match)
            # NOTE: index() returns the first occurrence of the label's first
            # token, which may not be where this particular match starts
            start_idx = tokens.index(label_tokens[0].text)
            idxs = list(range(start_idx, start_idx + len(label_tokens)))

            prev_tag = ner_tags[start_idx - 1] if start_idx > 0 else "O"
            # If there are any tokens that are already marked then this match
            # could be a subset of another match
            if not any(map(lambda x: x!="O", ner_tags[start_idx: start_idx + len(label_tokens)])):
                if prev_tag=="O":
                    ner_tags[start_idx] = "I-DAT"
                else:
                    ner_tags[start_idx] = "B-DAT"

                for idx in idxs[1:]:
                    ner_tags[idx] = "I-DAT"

    return tokens, tags, ner_tags

# https://spacy.io/usage/visualizers/#span
from spacy import displacy
from spacy.tokens import Span

for sentence in islice(parsed_document.sents, 1):
    tokens, tags, ner_tags = tag_sentence(regex_labels, sentence)
    
    start = sentence.text.index(tokens[ner_tags.index("I-DAT")])
    end = start + len(" ".join([tokens[i] for i in range(len(tokens)) if ner_tags[i] == "I-DAT"]))

    # https://spacy.io/usage/visualizers/#span
    # https://spacy.io/usage/visualizers/#manual-usage
    ex = [{
        "text": sentence.text,
        "ents": [{"start": start, "end": end, "label": "Dataset"}],
        "title": None 
    }]

    displacy.render(ex, style="ent", manual=True)
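One fragile spot in `tag_sentence` is `tokens.index(...)`, which finds the first occurrence of the label's first token even if the full label appears later in the sentence. A possible alternative, sketched here with a hypothetical helper `find_token_span`, is to search for the whole token sequence:

```python
from typing import List, Optional

def find_token_span(tokens: List[str], label_tokens: List[str]) -> Optional[int]:
    """Return the index where label_tokens first occurs as a contiguous run."""
    n = len(label_tokens)
    if n == 0:
        return None
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == label_tokens:
            return start
    return None  # the label does not occur as a token sequence

toy_tokens = ["The", "Education", "Department", "used", "the",
              "Education", "Longitudinal", "Study"]
toy_label = ["Education", "Longitudinal", "Study"]

naive_start = toy_tokens.index(toy_label[0])       # first "Education" only
span_start = find_token_span(toy_tokens, toy_label)  # start of the full label
print(naive_start, span_start)
```

Returning `None` also gives the caller a way to skip labels whose tokenization differs from the sentence's, instead of raising `ValueError`.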

Let's convert the entire document.

In [36]:
from itertools import chain

tokens_tags_ner = list(map(
    lambda sentence: tag_sentence(regex_labels, sentence),
    parsed_document.sents
))

# this is a list of tuples, one for each sentence
# this flattens the list of tuples into a list of the tokens, tags, and ner tags
# for the whole document
tokens, tags, ner = list(map(lambda x: list(chain(*x)), zip(*tokens_tags_ner)))
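The `zip(*...)` plus `chain` step is compact, so here is the same transformation on made-up toy data: `zip(*per_sentence)` regroups the per-sentence tuples into three columns, and `chain.from_iterable` concatenates each column into one flat list.

```python
from itertools import chain

# One (tokens, tags, ner_tags) tuple per sentence, as tag_sentence returns.
per_sentence = [
    (["This", "study"], ["DT", "NN"], ["O", "O"]),
    (["NELS", "data"], ["NNP", "NNS"], ["I-DAT", "O"]),
]

# Transpose to three columns, then flatten each column across sentences.
toy_tokens, toy_tags, toy_ner = (
    list(chain.from_iterable(column)) for column in zip(*per_sentence)
)
print(toy_tokens)
print(toy_ner)
```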

In [45]:
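from typing import Dict, Union

def extract_entities(ner_tags:List[str], tokens:List[str]) -> List[Dict[str, Union[int, str]]]:
    entities = []
    checked = -1  # index of the last token already assigned to an entity

    for i in range(len(ner_tags)):
        if ner_tags[i] in ["B-DAT", "I-DAT"] and i > checked:
            # character offset of token i in " ".join(tokens); skip the
            # joining space unless this is the first token
            start = len(" ".join(tokens[:i])) + (1 if i > 0 else 0)
            end = i + 1
            # bounds check so an entity that ends the document doesn't raise
            while end < len(ner_tags) and ner_tags[end] == "I-DAT":
                end += 1
            # leave token `end` unchecked: it may be a B-DAT starting a new entity
            checked = end - 1
            end = len(" ".join(tokens[:end]))
            entities.append(dict(start=start, end=end, label="Dataset"))
    return entities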

ex = [{
    "text": " ".join(tokens),
    "ents": extract_entities(ner, tokens),
    "title": None 
}]
displacy.render(ex, style="ent", manual=True)
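Character offsets are easy to get wrong by one, so it can help to check that slicing the joined text with the computed offsets returns exactly the entity text. The sketch below uses a hypothetical helper `token_char_spans` and toy tokens, and assumes tokens are joined with single spaces as in the cell above.

```python
from typing import List, Tuple

def token_char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character (start, end) of each token within " ".join(tokens)."""
    spans, cursor = [], 0
    for tok in tokens:
        spans.append((cursor, cursor + len(tok)))
        cursor += len(tok) + 1  # +1 for the joining space
    return spans

toy_tokens = ["data", "from", "Education", "Longitudinal", "Study", "."]
toy_text = " ".join(toy_tokens)
spans = token_char_spans(toy_tokens)

# Entity covering tokens 2..4 inclusive.
ent_start, ent_end = spans[2][0], spans[4][1]
print(repr(toy_text[ent_start:ent_end]))
```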
