<center> <h1> Exploiting Asymmetry for Synthetic Training Data Generation: <br>SynthIE and the Case of Information Extraction </h1> </center>

---

## Table of Contents
1. [Data](#Data)<br>
    1.1. [Direct Download and Loading](#Direct-Download-and-Loading) (required for running code in the repository)<br>
    1.2. [HuggingFace Datasets Download and Loading](#HuggingFace-Datasets-Download-and-Loading) (only for providing access to the data through HuggingFace)

2. [Models and Inference](#Models-and-Inference)<br>
    2.1 [Model Download](#Model-Download)<br>
    2.2 [Model Loading](#Model-Loading)<br>
    2.3 [Unconstrained Decoding](#Unconstrained-Decoding)<br>
    2.4 [Constrained Decoding](#Constrained-Decoding)<br>

3. [Loading Models and Datasets with Hydra](#Loading-Models-and-Datasets-with-Hydra)

4. [Loading and Using the WikidataID2Name Dictionaries](#Loading-and-Using-the-WikidataID2Name-Dictionaries)

---

## Data

The HuggingFace [dataset card](https://huggingface.co/datasets/martinjosifoski/SynthIE) contains information about the available datasets and data splits, a detailed description of the fields, and some basic statistics.

### Direct Download and Loading

The code in the repository assumes that all the necessary data is in the `data` directory, with different parts of the code relying on different data. To download all the data to the `data` directory (around 4.4GB), run the following command from the project's root directory:

```bash download_data.sh```

If you want to omit some files, comment out the specific lines in the script. 

In [1]:
# If you are using a different directory for your data, update the path below
DATA_DIR="../data"

# To download the data, uncomment the following line and run the cell
# !bash ../download_data.sh $DATA_DIR

In [2]:
import os
import sys
import gzip
import jsonlines

sys.path.append("../")

In [3]:
dataset_name2folder_name = {
    "synthie_text": "sdg_text_davinci_003",
    "synthie_text_pc": "sdg_text_davinci_003",
    "synthie_code": "sdg_code_davinci_002", 
    "synthie_code_pc": "sdg_code_davinci_002", 
    "rebel": "rebel",
    "rebel_pc": "rebel"
}


def get_full_path(data_dir, dataset_name, split):
    file_name = f"{split}.jsonl.gz"
    
    if dataset_name.endswith("_pc"):
        data_dir = os.path.join(data_dir, "processed")
        if not dataset_name.startswith("rebel"):
            file_name = f"{split}_ordered.jsonl.gz"
    
    return os.path.join(data_dir, dataset_name2folder_name[dataset_name], file_name)

def read_gzipped_jsonlines(path_to_file):
    with gzip.open(path_to_file, "r") as fp:
        json_reader = jsonlines.Reader(fp)
        data = list(json_reader)
    return data

dataset_name = 'synthie_text' # 'synthie_code', 'rebel', 'synthie_text_pc', 'synthie_code_pc', 'rebel_pc'
split = "test" # "train", "test", "test_small" 

path_to_file = get_full_path(DATA_DIR, dataset_name, split)
data = read_gzipped_jsonlines(path_to_file)
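
For large splits such as `train`, loading the whole file into a list may be more than is needed. The following is a minimal sketch, using only the standard library, of a lazy alternative to `read_gzipped_jsonlines`. The helper name `iter_gzipped_jsonlines` and the tiny demo file are illustrative assumptions and are not part of the repository.

```python
import gzip
import json
import os
import tempfile


def iter_gzipped_jsonlines(path_to_file):
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    # "rt" opens the archive in text mode, so every line is already a str
    with gzip.open(path_to_file, "rt", encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a throwaway file (the records below are made up for illustration)
demo_records = [{"id": 0, "text": "first"}, {"id": 1, "text": "second"}]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fp:
    for record in demo_records:
        fp.write(json.dumps(record) + "\n")

# Only the first line is decompressed and parsed here
first_record = next(iter_gzipped_jsonlines(demo_path))
print(first_record)
```

To use it on the real data, pass the `path_to_file` computed by `get_full_path` instead of `demo_path`.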

### HuggingFace Datasets Download and Loading

In [4]:
import datasets

dataset_name = 'synthie_text' # 'synthie_code', 'rebel', 'synthie_text_pc', 'synthie_code_pc', 'rebel_pc'
split = "test" # "train", "test", "test_small"

dataset = datasets.load_dataset("martinjosifoski/SynthIE", dataset_name, split=split)

Found cached dataset synth_ie (/home/martin/.cache/huggingface/datasets/martinjosifoski___synth_ie/synthie_text/1.0.0/dafe24bbede03960f5e57caca7e56d3dc917919ba1161566607366ef74a4b52b)


## Models and Inference

### Model Download

For a list and a description of the provided models see the HuggingFace [model card](https://huggingface.co/martinjosifoski/SynthIE). For more details about the models, please refer to the paper.

To download all the models to the `data/models` directory (around ToDO), run the following command from the project's root directory:

```bash download_models.sh```

If you want to omit some models, comment out the specific lines in the script. 

In [5]:
# If you are using a different directory for your data, update the path below
MODELS_DIR="../data/models"

# To download the models, uncomment the following line and run the cell
# !bash ../download_models.sh $MODELS_DIR

### Model Loading

In [6]:
"""Load the Model (downloaded in the ../data/models directory)"""
from src.models import GenIEFlanT5PL

ckpt_name = "synthie_base_sc.ckpt"
path_to_checkpoint = os.path.join(MODELS_DIR, ckpt_name)
model = GenIEFlanT5PL.load_from_checkpoint(checkpoint_path=path_to_checkpoint)
model.to("cuda");

For inference use the `model.sample` function. 

Under the hood, **SynthIE** uses HuggingFace's `generate` function, so it accepts the same generation parameters. The model's default generation parameters can be overridden as shown in the example.

In [7]:
override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 1,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123,
    "length_penalty": 0.8
}

texts = ['The Journal of Colloid and Interface Science is a bibliographic review indexed in Scopus and published by Elsevier. Its main subject is chemical engineering, and it is written in the English language. It is based in the United States, and is owned by Elsevier, the same company that owns Scopus.']
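
The override behaviour described above can be pictured as a dictionary merge in which explicitly passed values win over stored defaults. The sketch below illustrates only that idea. The default values and the helper `resolve_generation_parameters` are hypothetical and do not reflect the model's actual defaults or internals.

```python
# Hypothetical defaults, for illustration only; the real ones live with the model
default_generation_parameters = {
    "num_beams": 5,
    "max_new_tokens": 256,
    "length_penalty": 1.0,
}


def resolve_generation_parameters(defaults, **overrides):
    """Return a new dict in which the overrides take precedence over the defaults."""
    resolved = dict(defaults)   # copy, so the stored defaults stay untouched
    resolved.update(overrides)  # explicitly passed values win
    return resolved


resolved = resolve_generation_parameters(
    default_generation_parameters, num_beams=10, length_penalty=0.8
)
print(resolved)
```

Parameters that are not overridden (here `max_new_tokens`) keep their default value.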

### Unconstrained Decoding

In [8]:
output = model.sample(texts,
                      convert_to_triplets=True,
                      return_generation_outputs=True,
                      **override_models_default_hf_generation_parameters)

output['grouped_decoded_outputs'][0]

[{('Journal_of_Colloid_and_Interface_Science',
   'country of origin',
   'United_States'),
  ('Journal_of_Colloid_and_Interface_Science',
   'indexed in bibliographic review',
   'Scopus'),
  ('Journal_of_Colloid_and_Interface_Science',
   'language of work or name',
   'English_language'),
  ('Journal_of_Colloid_and_Interface_Science',
   'main subject',
   'Chemical_engineering'),
  ('Journal_of_Colloid_and_Interface_Science', 'publisher', 'Elsevier'),
  ('Scopus', 'owned by', 'Elsevier')}]

In [9]:
# ~~ synthie_base_sc
# ~~ Load model ~~
from src.models import GenIEFlanT5PL

from pprint import pprint 

ckpt_name = "synthie_base_sc.ckpt"
path_to_checkpoint = os.path.join(MODELS_DIR, ckpt_name)
model = GenIEFlanT5PL.load_from_checkpoint(checkpoint_path=path_to_checkpoint)
model.to("cuda");

output = model.sample(texts,
                      convert_to_triplets=True,
                      return_generation_outputs=True,
                      **override_models_default_hf_generation_parameters)

print("Checkpoint name:")
pprint(output['grouped_decoded_outputs'][0])
pprint(model.tokenizer.batch_decode(
    output['generation_outputs'].sequences, skip_special_tokens=True
))

Checkpoint name:
[{('Journal_of_Colloid_and_Interface_Science',
   'country of origin',
   'United_States'),
  ('Journal_of_Colloid_and_Interface_Science',
   'indexed in bibliographic review',
   'Scopus'),
  ('Journal_of_Colloid_and_Interface_Science',
   'language of work or name',
   'English_language'),
  ('Journal_of_Colloid_and_Interface_Science',
   'main subject',
   'Chemical_engineering'),
  ('Journal_of_Colloid_and_Interface_Science', 'publisher', 'Elsevier'),
  ('Scopus', 'owned by', 'Elsevier')}]
['[s] Journal_of_Colloid_and_Interface_Science [r] indexed in bibliographic '
 'review [o] Scopus [r] publisher [o] Elsevier [r] main subject [o] '
 'Chemical_engineering [r] language of work or name [o] English_language [r] '
 'country of origin [o] United_States [e] [s] Scopus [r] owned by [o] Elsevier '
 '[e]']
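
The second printed value is the raw linearized sequence produced by the model, in which `[s]`, `[r]`, `[o]` and `[e]` delimit subjects, relations, objects and the end of a subject block. The following is a minimal sketch of how such a string could be parsed back into a set of triplets, assuming exactly the delimiter layout visible in the output above. The function `parse_linearized` is a hypothetical helper and not the repository's own conversion code.

```python
import re


def parse_linearized(text):
    """Parse '[s] subj [r] rel [o] obj ... [e]' blocks into a set of triplets."""
    triplets = set()
    # Each subject block runs from "[s]" up to the next closing "[e]"
    for block in re.findall(r"\[s\](.*?)\[e\]", text):
        subject, *pairs = block.split("[r]")
        subject = subject.strip()
        for pair in pairs:
            if "[o]" not in pair:
                continue  # skip malformed relation/object pairs
            relation, obj = pair.split("[o]", 1)
            triplets.add((subject, relation.strip(), obj.strip()))
    return triplets


linearized = (
    "[s] Journal_of_Colloid_and_Interface_Science [r] indexed in bibliographic "
    "review [o] Scopus [r] publisher [o] Elsevier [r] main subject [o] "
    "Chemical_engineering [r] language of work or name [o] English_language [r] "
    "country of origin [o] United_States [e] [s] Scopus [r] owned by [o] Elsevier "
    "[e]"
)

parsed_triplets = parse_linearized(linearized)
for triplet in sorted(parsed_triplets):
    print(triplet)
```

On the example string this yields the same six triplets as `output['grouped_decoded_outputs'][0]`.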


### Constrained Decoding 

Constrained decoding assumes that the constrained world definitions (the `constrained_worlds` directory) have been downloaded using the `download_data.sh` script.

In [10]:
"""Load constrained decoding module"""
from src.constrained_generation import IEConstrainedGeneration

params = {}
params['constrained_worlds_dir'] = os.path.join(DATA_DIR, "constrained_worlds")
params['constrained_world_id'] = "genie_t5_tokenizeable" # specifies the folder name from which the constrained world is loaded
params['identifier'] = "genie_t5_tokenizeable" # specifies the cache subfolder where the trie will be stored
    
params['path_to_trie_cache_dir'] = os.path.join(DATA_DIR, ".cache")
params['path_to_entid2name_mapping'] = os.path.join(DATA_DIR, "id2name_mappings", "entity_mapping.jsonl")
params['path_to_relid2name_mapping'] = os.path.join(DATA_DIR, "id2name_mappings", "relation_mapping.jsonl")

constraint_module = IEConstrainedGeneration.from_constrained_world(model=model, 
                                                                   linearization_class_id=model.hparams.linearization_class_id, 
                                                                   **params)

model.constraint_module = constraint_module

In [11]:
output = model.sample(texts,
                      convert_to_triplets=True,
                      **override_models_default_hf_generation_parameters)

output['grouped_decoded_outputs'][0]

[{('Journal_of_Colloid_and_Interface_Science',
   'country of origin',
   'United_States'),
  ('Journal_of_Colloid_and_Interface_Science',
   'indexed in bibliographic review',
   'Scopus'),
  ('Journal_of_Colloid_and_Interface_Science',
   'language of work or name',
   'English_language'),
  ('Journal_of_Colloid_and_Interface_Science',
   'main subject',
   'Chemical_engineering'),
  ('Journal_of_Colloid_and_Interface_Science', 'publisher', 'Elsevier'),
  ('Scopus', 'owned by', 'Elsevier')}]
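
The parameters above mention a trie that is cached on disk. As a rough intuition for trie-based constrained decoding, the sketch below stores allowed token sequences in a prefix tree and, given the tokens generated so far, returns the tokens that may legally follow. The class `TokenTrie` and the toy token IDs are assumptions for illustration and are not the implementation in `src.constrained_generation`.

```python
class TokenTrie:
    """Prefix tree over token-ID sequences (e.g. tokenized entity names)."""

    def __init__(self, sequences):
        self.root = {}
        for sequence in sequences:
            node = self.root
            for token in sequence:
                node = node.setdefault(token, {})

    def allowed_next_tokens(self, prefix):
        """Return the sorted token IDs that can extend the given prefix."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []  # the prefix is not part of any allowed sequence
            node = node[token]
        return sorted(node)


# Toy "vocabulary" of four allowed names, written as made-up token IDs
allowed_names = [[5, 7, 9], [5, 7, 2], [5, 3], [8, 1]]
trie = TokenTrie(allowed_names)

print(trie.allowed_next_tokens([]))      # [5, 8]
print(trie.allowed_next_tokens([5, 7]))  # [2, 9]
print(trie.allowed_next_tokens([4]))     # []
```

A lookup of this kind could, for example, be exposed to HuggingFace's `generate` through its `prefix_allowed_tokens_fn` argument, so that beam search only explores continuations that spell out a valid entity or relation name. Whether the repository wires it up in exactly this way is not shown here.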

----

## Loading Models and Datasets with Hydra

An alternative way to load the models (or the data) is to use the configuration framework [Hydra](https://hydra.cc/). Below we provide an example of using Hydra in a Jupyter notebook, but the library really shines when used in scripts. See our training and evaluation pipelines, outlined in the README, for an example.

In [12]:
# ~~~Load config~~~
import hydra

configs_path = "../configs"
config_name = "inference_root.yaml"
model_id = "synthie_base_sc"
dataset = "sdg_text_davinci_003_pc"

with hydra.initialize(version_base="1.2", config_path=configs_path):
    cfg = hydra.compose(config_name=config_name, 
                           overrides=[f"data_dir={DATA_DIR}",
                                      f"work_dir=../",
                                      f"+experiment/inference={model_id}",
                                      f"datamodule={dataset}"
                                     ])
    
# ~~~Load model~~~
model = hydra.utils.instantiate(cfg.model, _recursive_=False)
model.to("cuda");

# ~~~Load dataset~~~
datamodule = hydra.utils.instantiate(cfg.datamodule, _recursive_=False)
datamodule.set_tokenizer(model.tokenizer)
datamodule.setup("validate")
datamodule.set_tokenizer(model.tokenizer)

# If defined, use the model's collate function (otherwise proceed with the PyTorch's default collate_fn)
if getattr(model, "collator", None):
    datamodule.set_collate_fn(model.collator.collate_fn)

Loading the data from: ../data/processed/sdg_text_davinci_003/val_ordered.jsonl.gz: 100%|█| 1
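
The calls to `hydra.utils.instantiate` above build objects from config nodes that name their class under a `_target_` key. The following standard-library sketch mimics only that core idea (resolve a dotted path, then call it with the remaining keys as keyword arguments). It is a simplification that ignores Hydra features such as recursive instantiation, and the helper `instantiate_from_config` is hypothetical.

```python
import importlib


def instantiate_from_config(config):
    """Build an object from a dict with a dotted '_target_' path and kwargs."""
    kwargs = dict(config)  # copy, so the caller's config is not modified
    module_path, _, attribute = kwargs.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attribute)
    return target(**kwargs)


# A standard-library class stands in for a model or datamodule class
half = instantiate_from_config(
    {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2}
)
print(half)  # 1/2
```

In the notebook cell, `cfg.model` and `cfg.datamodule` play the role of the dictionary passed in here.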


In [13]:
override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 1,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123,
    "length_penalty": 0.8
}

texts = [datamodule.data_val[1]['text']]

In [14]:
output = model.sample(texts,
                      convert_to_triplets=True,
                      **override_models_default_hf_generation_parameters)

output['grouped_decoded_outputs'][0]

[{('Journal_of_Colloid_and_Interface_Science',
   'country of origin',
   'United_States'),
  ('Journal_of_Colloid_and_Interface_Science',
   'indexed in bibliographic review',
   'Scopus'),
  ('Journal_of_Colloid_and_Interface_Science',
   'language of work or name',
   'English_language'),
  ('Journal_of_Colloid_and_Interface_Science',
   'main subject',
   'Chemical_engineering'),
  ('Journal_of_Colloid_and_Interface_Science', 'publisher', 'Elsevier'),
  ('Scopus', 'owned by', 'Elsevier')}]

## Loading and Using the WikidataID2Name Dictionaries

In [15]:
path_to_entity_id2name_mapping = os.path.join(DATA_DIR, "id2name_mappings", "entity_mapping.jsonl")
with jsonlines.open(path_to_entity_id2name_mapping) as reader:
    entity_id2name_mapping = {obj["id"]: obj["en_label"] for obj in reader}

In [16]:
list(entity_id2name_mapping.items())[:5]

[('Q191069', 'Paleoclimatology'),
 ('Q47716', 'Charleston,_South_Carolina'),
 ('Q15643', 'Juliet_(moon)'),
 ('Q2143143', 'Owaneco,_Illinois'),
 ('Q1899035', 'Erhards_Grove_Township,_Otter_Tail_County,_Minnesota')]

In [17]:
path_to_relation_id2name_mapping = os.path.join(DATA_DIR, "id2name_mappings", "relation_mapping.jsonl")
with jsonlines.open(path_to_relation_id2name_mapping) as reader:
    relation_id2name_mapping = {obj["id"]: obj["en_label"] for obj in reader}

In [18]:
list(relation_id2name_mapping.items())[:5]

[('P2673', 'next crossing upstream'),
 ('P571', 'inception'),
 ('P186', 'material used'),
 ('P1066', 'student of'),
 ('P453', 'character role')]
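
Once loaded, the two dictionaries can be used to translate triplets between Wikidata IDs and names. The sketch below works on a few entries copied from the outputs above. The helper names and the example triplet are made up purely to exercise the lookups (the triplet is not a real fact), and inverting the mappings assumes that names are unique.

```python
# A few entries copied from the printed mappings above
entity_id2name = {
    "Q191069": "Paleoclimatology",
    "Q47716": "Charleston,_South_Carolina",
}
relation_id2name = {"P571": "inception", "P186": "material used"}

# Inverse lookups (assumes every name maps back to exactly one ID)
entity_name2id = {name: id_ for id_, name in entity_id2name.items()}
relation_name2id = {name: id_ for id_, name in relation_id2name.items()}


def triplet_ids_to_names(triplet):
    """Map a (subject_id, relation_id, object_id) triplet to names."""
    subject_id, relation_id, object_id = triplet
    return (
        entity_id2name[subject_id],
        relation_id2name[relation_id],
        entity_id2name[object_id],
    )


def triplet_names_to_ids(triplet):
    """Map a (subject, relation, object) triplet of names back to IDs."""
    subject, relation, obj = triplet
    return (entity_name2id[subject], relation_name2id[relation], entity_name2id[obj])


# Illustrative triplet only; it exists to demonstrate the round trip
id_triplet = ("Q47716", "P186", "Q191069")
name_triplet = triplet_ids_to_names(id_triplet)
print(name_triplet)
print(triplet_names_to_ids(name_triplet))
```

With the full mappings, replace the small dictionaries with `entity_id2name_mapping` and `relation_id2name_mapping`.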