### Converting Document-Level Data and Labels Into Sentence-Level Data and Labels

This notebook explores converting the document-level data and labels into
sentence-level data and labels for training Masked Language Models (such as
Kaggle model 1) and Named Entity Recognition (NER) models.

In [66]:
import json
from itertools import islice
from typing import List

import pandas as pd
import regex as re
import spacy
from unidecode import unidecode

import src.models.regex_model as rm

nlp = spacy.load('en_core_web_trf')

In [4]:
kaggle_labels = pd.read_csv("../data/kaggle/train.csv")
kaggle_labels.head(2)

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
1,2f26f645-3dec-485d-b68d-f013c9e05e60,Educational Attainment of High School Dropouts...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study


Each row contains only a single label, so we need to aggregate the labels by
`Id` to get one row per document.

In [15]:
aggregated_labels = pd.DataFrame({"id": kaggle_labels["Id"].unique()})

def aggregate_labels(group: pd.DataFrame) -> str:
    # Join the distinct raw labels of one document into a "|"-separated string.
    labels = [label.strip() for label in group["dataset_label"].unique()]
    return "|".join(labels)

unique_labels = kaggle_labels.groupby("Id").apply(aggregate_labels)
aggregated_labels["label"] = aggregated_labels["id"].apply(lambda x: unique_labels[x])
aggregated_labels.head(2)

Unnamed: 0,id,label
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,National Education Longitudinal Study|Educatio...
1,2f26f645-3dec-485d-b68d-f013c9e05e60,National Education Longitudinal Study|Educatio...


In [18]:
document_id = aggregated_labels.id.values[0]

with open("../data/kaggle/train/" + document_id + ".json") as f:
    document = json.load(f)

# Join the text of every section into one ASCII string, flattening newlines to spaces.
text = unidecode(" ".join(
    section["text"].strip().replace("\n", " ") for section in document
))

label = aggregated_labels[aggregated_labels.id == document_id].label.values[0]

print(text[:25])
print(label)

This study used data from
National Education Longitudinal Study|Education Longitudinal Study


To produce a sentence-level dataset, we will use spaCy to parse the document and
split it into sentences. We will then assign a tag to each token and store the
result in a format similar to [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/), a popular dataset for NER-based models. The CoNLL-2003
data looks like this:

```
   U.N.         NNP  I-NP  I-ORG 
   official     NN   I-NP  O 
   Ekeus        NNP  I-NP  I-PER 
   heads        VBZ  I-VP  O 
   for          IN   I-PP  O 
   Baghdad      NNP  I-NP  I-LOC 
   .            .    O     O 
```

The first column contains the tokens, the second column contains the
part-of-speech tags, the third column contains the syntactic chunk tags,
and the fourth column contains the named entity tags.

We will use the same format, but we will only use the first two columns and add
a third column indicating the label for each token using the [IOB
format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).
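
As a rough illustration of the target format, the sketch below tags a short token list and prints token, part-of-speech tag, and label columns. It assumes the variant of IOB in which every labelled span starts with a `B-` prefix, and the `DATASET` tag name and the helpers `iob_tags` and `to_conll_rows` are hypothetical names chosen for this example, not part of the notebook's code. The tokens and part-of-speech tags are hard-coded so the example runs without spaCy.

```python
from typing import List, Sequence


def iob_tags(tokens: Sequence[str], label_tokens: Sequence[str], tag: str = "DATASET") -> List[str]:
    """Return one IOB tag per token, marking every occurrence of label_tokens."""
    tags = ["O"] * len(tokens)
    n = len(label_tokens)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if list(tokens[i:i + n]) == list(label_tokens):
            # First token of the span gets B-, the rest get I-.
            tags[i] = f"B-{tag}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{tag}"
            i += n
        else:
            i += 1
    return tags


def to_conll_rows(tokens: Sequence[str], pos_tags: Sequence[str], tags: Sequence[str]) -> str:
    """Format the three columns as tab-separated rows, one token per line."""
    return "\n".join("\t".join(row) for row in zip(tokens, pos_tags, tags))


tokens = ["data", "from", "the", "National", "Education", "Longitudinal", "Study"]
pos_tags = ["NNS", "IN", "DT", "NNP", "NNP", "NNP", "NNP"]
tags = iob_tags(tokens, ["National", "Education", "Longitudinal", "Study"])
print(to_conll_rows(tokens, pos_tags, tags))
```

Matching on exact token sequences is the simplest option; it would miss labels whose tokenization differs from the sentence's, which is one reason the notebook works with regular expressions instead.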


In [23]:
parsed_document = nlp(text)

Let's print out the text and tokens for a look at the data.

In [37]:
for sentence in islice(parsed_document.sents, 1):
    print("Sentence:\n", sentence)
    for token in islice(sentence, 10):
        print("\t".join([token.text, token.tag_]))

Sentence:
 This study used data from the National Education Longitudinal Study (NELS:88) to examine the effects of dual enrollment programs for high school students on college degree attainment.
This	DT
study	NN
used	VBD
data	NNS
from	IN
the	DT
National	NNP
Education	NNP
Longitudinal	NNP
Study	NNP


To add the labels, we first need to check whether any of them appear in the sentence.

In [65]:
# Throw in a bad label to test for false positives.
labels = label.split("|") + ["Banana Pancakes"]
# Convert each label to a regex string, then compile it.
regex_labels = list(map(
    re.compile,
    map(rm.RegexModel.regexify_keyword, labels)
))

In [73]:
def detect_labels(labels: List[re.Pattern], sentence: str) -> List[List[str]]:
    # Search the sentence with each pattern; patterns that do not match yield None.
    matches = (pattern.search(sentence) for pattern in labels)
    # `captures()` returns a list, so each detected label is a list of strings.
    return [match.captures() for match in matches if match]

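for sentence in islice(parsed_document.sents, 1):
    print("Sentence:\n", sentence)
    detected = detect_labels(regex_labels, sentence.text)
    print(detected)
    # Start with the longest match, because one dataset name can be a substring of another.
    # Each entry is a list of captured strings, so sort on the length of the first string;
    # `key=len` would compare the lengths of the lists instead.
    matches = sorted(
        detected,
        key=lambda captures: len(captures[0]),
        reverse=True
    )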





Sentence:
 This study used data from the National Education Longitudinal Study (NELS:88) to examine the effects of dual enrollment programs for high school students on college degree attainment.
[['National Education Longitudinal Study'], ['Education Longitudinal Study']]
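
The output above shows both the full label and a shorter label nested inside it matching the same sentence. One possible way to keep only the longest match is sketched below. It is an assumption-laden illustration: it uses the standard-library `re` module with `re.escape` in place of `rm.RegexModel.regexify_keyword`, and `longest_non_overlapping` is a hypothetical helper that is not part of the project code.

```python
import re
from typing import List, Tuple


def longest_non_overlapping(labels: List[str], sentence: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) spans, dropping matches that overlap a longer one."""
    spans = []
    for label in labels:
        for match in re.finditer(re.escape(label), sentence):
            spans.append((match.start(), match.end(), match.group()))
    # Visit the longest spans first so nested, shorter labels are seen later.
    spans.sort(key=lambda span: span[1] - span[0], reverse=True)
    kept: List[Tuple[int, int, str]] = []
    for start, end, text in spans:
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, text))
    # Return the surviving spans in reading order.
    return sorted(kept)


sentence = "This study used data from the National Education Longitudinal Study (NELS:88)."
labels = [
    "National Education Longitudinal Study",
    "Education Longitudinal Study",
    "Banana Pancakes",
]
spans = longest_non_overlapping(labels, sentence)
print(spans)
```

Keeping the character offsets alongside the text makes it possible to map each surviving span back onto token positions when assigning tags.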
