### Extracting additional labels from text

This notebook explores the first-place Kaggle submission's algorithm
for extracting labels that were not included in the training set.

[notebook here](https://github.com/Coleridge-Initiative/rc-kaggle-models/blob/main/1st%20ZALO%20FTW/notebooks/get_candidate_labels.ipynb)

The first-place submission uses the discovered labels for validation only,
not for training. The code below is adapted from its notebooks.

In [7]:
import json
from typing import List

import pandas as pd
from unidecode import unidecode

In [2]:
kaggle_labels = pd.read_csv("../data/kaggle/train.csv")
kaggle_labels.head(2)

| | Id | pub_title | dataset_title | dataset_label | cleaned_label |
|---|---|---|---|---|---|
| 0 | d0fa7568-7d8e-4db9-870f-f9c6f668c17b | The Impact of Dual Enrollment on College Degre... | National Education Longitudinal Study | National Education Longitudinal Study | national education longitudinal study |
| 1 | 2f26f645-3dec-485d-b68d-f013c9e05e60 | Educational Attainment of High School Dropouts... | National Education Longitudinal Study | National Education Longitudinal Study | national education longitudinal study |


In [3]:
# One row per publication id, in order of first appearance.
aggregated_labels = pd.DataFrame({"id": kaggle_labels["Id"].unique()})

def aggregate_clean_label(group: pd.DataFrame) -> str:
    # Join the distinct raw labels of one publication with "|".
    labels = [label.strip() for label in group["dataset_label"].unique()]
    return "|".join(labels)

unique_labels = kaggle_labels.groupby("Id").apply(aggregate_clean_label)
aggregated_labels["label"] = aggregated_labels["id"].map(unique_labels)
aggregated_labels.head(2)

| | id | label |
|---|---|---|
| 0 | d0fa7568-7d8e-4db9-870f-f9c6f668c17b | National Education Longitudinal Study\|Educatio... |
| 1 | 2f26f645-3dec-485d-b68d-f013c9e05e60 | National Education Longitudinal Study\|Educatio... |
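To make the aggregation step concrete, here is a minimal sketch on a toy frame. The ids and label strings are invented for illustration, and the sketch uses a column-wise `agg` instead of the `apply` call above; on this data it should produce the same pipe-joined strings.

```python
import pandas as pd

# Toy stand-in for train.csv; ids and labels are invented for illustration.
toy = pd.DataFrame({
    "Id": ["doc-a", "doc-a", "doc-a", "doc-b"],
    "dataset_label": ["Survey X", "Survey X", "Study Y", "Study Y"],
})

def join_labels(labels: pd.Series) -> str:
    # Keep each distinct label once (first-seen order), strip it, join with "|".
    return "|".join(label.strip() for label in labels.unique())

# One aggregated string per document id.
toy_labels = toy.groupby("Id")["dataset_label"].agg(join_labels)
print(toy_labels.to_dict())
```

Each document ends up with one string in which its distinct labels are separated by `|`, which is the shape of the `label` column shown above.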


In [4]:
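def get_text(document_id: str) -> str:
    # Each training document is a JSON list of sections with a "text" field.
    with open("../data/kaggle/train/" + document_id + ".json") as f:
        document = json.load(f)

    # Flatten newlines, join the sections, and transliterate to ASCII.
    sections = (section["text"].strip().replace("\n", " ") for section in document)
    return unidecode(" ".join(sections))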

In [8]:
text = get_text("d0fa7568-7d8e-4db9-870f-f9c6f668c17b")
text[:100]

'This study used data from the National Education Longitudinal Study (NELS:88) to examine the effects'
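The following sketch shows the same flattening on an in-memory document, so the effect of each step is visible without the data files. The two sections are hypothetical, the `section_title` key is an assumption about the file layout (the code above only reads `text`), and `to_ascii` is a rough standard-library stand-in for `unidecode`, not an equivalent.

```python
import unicodedata

# Hypothetical document: a list of sections, each with a "text" field.
document = [
    {"section_title": "Abstract", "text": "  Data come from the\nNELS:88 survey. "},
    {"section_title": "Results", "text": "Caf\u00e9-style na\u00efve text.\n"},
]

def to_ascii(text: str) -> str:
    # Rough stand-in for unidecode: decompose accents, then drop non-ASCII marks.
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

def join_sections(sections: list) -> str:
    # Same steps as get_text: strip, flatten newlines, join with spaces.
    cleaned = (s["text"].strip().replace("\n", " ") for s in sections)
    return to_ascii(" ".join(cleaned))

flat_text = join_sections(document)
print(flat_text)
```

Unlike `unidecode`, this stand-in simply drops characters that have no ASCII decomposition, so it is only suitable for illustrating the pipeline.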

The description from the notebook says that candidates are selected in the 
following way:

```
2. (Optional) We detect the keywords (Dataset, Database, Study, Survey, ...) 
position in the input string then look forward/backward of that keyword until
meet two consecutive lowercase words.
```
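One way to read that description is as a token scan. The sketch below is not the submission's code: the keyword set, the whitespace tokenization, and the "first character is lowercase" test are all assumptions made for illustration.

```python
from typing import List

# Assumed keyword set, taken from the examples named in the description.
KEYWORDS = {"Dataset", "Database", "Study", "Survey"}

def is_lowercase(token: str) -> bool:
    # Assumption: a word counts as lowercase if its first character is.
    return token[:1].islower()

def expand(tokens: List[str], start: int, step: int) -> int:
    """Walk from `start` in direction `step` (+1 or -1) and return the last
    index kept, stopping before two consecutive lowercase words."""
    current = start
    while 0 <= current + step < len(tokens):
        nxt, after = current + step, current + 2 * step
        if is_lowercase(tokens[nxt]):
            at_edge = not 0 <= after < len(tokens)
            if at_edge or is_lowercase(tokens[after]):
                break
        current = nxt
    return current

def find_candidates(text: str) -> List[str]:
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        # Ignore surrounding punctuation when testing for a keyword.
        if token.strip(".,;:()") in KEYWORDS:
            left, right = expand(tokens, i, -1), expand(tokens, i, +1)
            candidates.append(" ".join(tokens[left:right + 1]))
    return candidates

sample = ("This study used data from the National Education Longitudinal "
          "Study (NELS:88) to examine the effects")
candidates = find_candidates(sample)
print(candidates)
```

In this reading, a single lowercase word such as "of" is absorbed when a capitalized word follows it, and the scan stops only when two lowercase words appear in a row or the text ends.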

The original notebook implements this with a few functions and `for` loops.
Let's try to do it with a regular expression instead.

Here's the regex101:

https://regex101.com/r/58x52j/1

In [None]:
connecting_words = ["on", "for", "of", "in"]
keywords = ["Database", "Dataset", "Databases", "Datasets"]

def NON_CAPTURING_GROUP(s: str) -> str:
    return "(?:" + s + ")"

def CAPTURING_GROUP(s: str) -> str:
    return "(" + s + ")"

# Both word classes also accept commas and apostrophes.
LOWERCASE_WORD = r"[a-z,\']+"
CAPITALIZED_WORD = r"[A-Z][a-z,\']+"
SPACE = r"\s+"
