### A Dataset for Hyper-Relational Extraction and a Cube-Filling Approach

GitHub: https://github.com/declare-lab/HyperRED

In [1]:
!git clone https://github.com/declare-lab/HyperRED.git
!cd HyperRED && git checkout 388a87f
!cp -a HyperRED/* .

# Install requirements but keep the preinstalled torch (remove the sed line below if not in Colab)
!sed -i '/torch/d' requirements.txt
!pip install -q -r requirements.txt

Cloning into 'HyperRED'...
remote: Enumerating objects: 739, done.
remote: Counting objects: 100% (739/739), done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 739 (delta 511), reused 739 (delta 511), pack-reused 0
Receiving objects: 100% (739/739), 208.99 KiB | 447.00 KiB/s, done.
Resolving deltas: 100% (511/511), done.
Note: checking out '388a87f'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 388a87f Add diagram

In [3]:
from data_process import download_data, process_many

def colab_demo_truncate_data(path: str, limit: int):
    # Keep only the first `limit` lines of the file so the demo trains faster
    with open(path) as f:
        lines = f.readlines()
    with open(path, "w") as f:
        f.writelines(lines[:limit])

download_data("data/hyperred/")
colab_demo_truncate_data("data/hyperred/train.json", limit=5000)
process_many("data/hyperred/", "data/processed")



  0%|          | 0/3 [00:00<?, ?it/s]

{'path_out': PosixPath('data/hyperred/train.json')}
{'path_out': PosixPath('data/hyperred/dev.json')}
{'path_out': PosixPath('data/hyperred/test.json')}


data/hyperred/dev.json: 100%|██████████| 1000/1000 [00:00<00:00, 8820.64it/s]


{
  "sents": 1000,
  "relations": 1201,
  "relation_labels": 60,
  "qualifiers": 1342,
  "qualifier_labels": 43,
  "hash": "85647a5d5cf820eed9a742b2fe9b9c75"
}


data/hyperred/test.json: 100%|██████████| 4000/4000 [00:00<00:00, 6647.79it/s]


{
  "sents": 4000,
  "relations": 4878,
  "relation_labels": 62,
  "qualifiers": 5533,
  "qualifier_labels": 44,
  "hash": "ba67d954cac2e51fbbd1c1f2f5d50d61"
}


data/hyperred/train.json: 100%|██████████| 5000/5000 [00:00<00:00, 7040.45it/s]


{
  "sents": 5000,
  "relations": 6532,
  "relation_labels": 62,
  "qualifiers": 7556,
  "qualifier_labels": 44,
  "hash": "130cd3445879909292b58e553f551ea8"
}


1000it [00:00, 4310.52it/s]
4000it [00:00, 6662.40it/s]
5000it [00:00, 6605.62it/s]


{'relations': 62, 'qualifiers': 44}
{'process': {'source_file': 'temp/dev.json', 'target_file': 'data/processed/dev.json', 'label_file': 'data/processed/label.json', 'pretrained_model': 'bert-base-uncased', 'mode': 'joint'}}
Load bert-base-uncased tokenizer successfully.


100%|██████████| 1000/1000 [00:01<00:00, 527.36it/s]


{'process': {'source_file': 'temp/test.json', 'target_file': 'data/processed/test.json', 'label_file': 'data/processed/label.json', 'pretrained_model': 'bert-base-uncased', 'mode': 'joint'}}
Load bert-base-uncased tokenizer successfully.


100%|██████████| 4000/4000 [00:07<00:00, 520.69it/s]


{'process': {'source_file': 'temp/train.json', 'target_file': 'data/processed/train.json', 'label_file': 'data/processed/label.json', 'pretrained_model': 'bert-base-uncased', 'mode': 'joint'}}
Load bert-base-uncased tokenizer successfully.


100%|██████████| 5000/5000 [00:10<00:00, 492.02it/s]
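
The split statistics printed above can be turned into simple averages, which give a rough sense of how dense the annotations are. The following is a minimal sketch that copies the counts from the output by hand; the `stats` dictionary and the `density` helper are illustrative names and are not part of the HyperRED code.

```python
# Counts copied from the analysis output above
# (train is the 5000-sentence demo subset, not the full training set).
stats = {
    "train": {"sents": 5000, "relations": 6532, "qualifiers": 7556},
    "dev": {"sents": 1000, "relations": 1201, "qualifiers": 1342},
    "test": {"sents": 4000, "relations": 4878, "qualifiers": 5533},
}


def density(counts: dict) -> dict:
    """Hypothetical helper: relations per sentence and qualifiers per relation."""
    return {
        "relations_per_sent": round(counts["relations"] / counts["sents"], 3),
        "qualifiers_per_relation": round(counts["qualifiers"] / counts["relations"], 3),
    }


for split, counts in stats.items():
    print(split, density(counts))
```

In all three splits each relation carries slightly more than one qualifier on average.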


In [4]:
# Data Exploration

from data_process import Data

def explore_data(path: str):
    data = Data.load(path)
    data.analyze()

    for s in data.sents[:3]:
        print(f"\nText: {s.text}")
        print(f"Tokens: {s.tokens}")
        # Spans are half-open token ranges: (start, end) selects tokens[start:end]
        def span_text(span):
            return " ".join(s.tokens[span[0] : span[1]])

        for r in s.relations:
            print(f"\tRelation: {r}")
            print(f"\tHead: {span_text(r.head)}, Relation: {r.label}, Tail: {span_text(r.tail)}")
            for q in r.qualifiers:
                print(f"\t\tQualifier: {q.label}, Value: {span_text(q.span)}")
        print()

explore_data("data/hyperred/train.json")

data/hyperred/train.json: 100%|██████████| 5000/5000 [00:00<00:00, 6776.36it/s]


{
  "sents": 5000,
  "relations": 6532,
  "relation_labels": 62,
  "qualifiers": 7556,
  "qualifier_labels": 44,
  "hash": "130cd3445879909292b58e553f551ea8"
}

Text: She is known for her two best - selling novels , The Fountainhead ( 1943 ) and Atlas Shrugged ( 1957 ) , and for developing a philosophical system she called Objectivism .
Tokens: ['She', 'is', 'known', 'for', 'her', 'two', 'best', '-', 'selling', 'novels', ',', 'The', 'Fountainhead', '(', '1943', ')', 'and', 'Atlas', 'Shrugged', '(', '1957', ')', ',', 'and', 'for', 'developing', 'a', 'philosophical', 'system', 'she', 'called', 'Objectivism', '.']
	Relation: head=(0, 1) tail=(11, 13) label='notable work' qualifiers=[Entity(span=(14, 15), label='publication date')]
	Head: She, Relation: notable work, Tail: The Fountainhead
		Qualifier: publication date, Value: 1943


Text: Apollo is the son of Zeus and Leto , and has a twin sister , the chaste huntress Artemis .
Tokens: ['Apollo', 'is', 'the', 'son', 'of', 'Zeus', 'and', '
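
The first printed relation uses half-open token spans: `head=(0, 1)` selects `tokens[0:1]` and `tail=(11, 13)` selects `tokens[11:13]`. As a minimal sketch of how such a record flattens into (head, relation, tail, qualifier, value) tuples, the snippet below rebuilds that example with plain dataclasses. The `Qualifier`, `Relation`, and `to_quintuplets` names are hypothetical stand-ins for illustration, not the classes defined in `data_process`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range: (start, end)


@dataclass
class Qualifier:  # hypothetical stand-in for the Entity shown in the output
    span: Span
    label: str


@dataclass
class Relation:  # hypothetical stand-in that mirrors the printed fields
    head: Span
    tail: Span
    label: str
    qualifiers: List[Qualifier] = field(default_factory=list)


def span_text(tokens: List[str], span: Span) -> str:
    """Join the tokens covered by a half-open span."""
    return " ".join(tokens[span[0]:span[1]])


def to_quintuplets(tokens: List[str], relation: Relation) -> List[Tuple[str, str, str, str, str]]:
    """Emit one (head, relation, tail, qualifier, value) tuple per qualifier."""
    head = span_text(tokens, relation.head)
    tail = span_text(tokens, relation.tail)
    return [
        (head, relation.label, tail, q.label, span_text(tokens, q.span))
        for q in relation.qualifiers
    ]


# First 16 tokens of the first training sentence printed above.
tokens = "She is known for her two best - selling novels , The Fountainhead ( 1943 )".split()
relation = Relation(
    head=(0, 1),
    tail=(11, 13),
    label="notable work",
    qualifiers=[Qualifier(span=(14, 15), label="publication date")],
)
print(to_quintuplets(tokens, relation))
```

A relation with several qualifiers yields several tuples that share the same head, relation label, and tail.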

In [5]:
# Download Pretrained Model
!wget https://github.com/declare-lab/HyperRED/releases/download/v1.0.0/cube_model.zip
!unzip cube_model.zip

--2022-11-23 09:45:40--  https://github.com/declare-lab/HyperRED/releases/download/v1.0.0/cube_model.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/569615636/df79920c-9103-49d0-b1f3-a44a912498fe?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221123%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221123T094540Z&X-Amz-Expires=300&X-Amz-Signature=94073b996b6c3d78c6435643eafbee47127df255ccc901fe605d0238614b3ee2&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=569615636&response-content-disposition=attachment%3B%20filename%3Dcube_model.zip&response-content-type=application%2Foctet-stream [following]
--2022-11-23 09:45:40--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/569615636/df79920c-9103-49d0-b1f3-a44a912498fe?X-

In [8]:
# Use Pretrained Model for Prediction

from prediction import run_predict

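# Inputs are split on whitespace; "Poland," below stays one token and is printed as *@UNK@* in the output
texts = [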
    "Leonard Parker received his PhD from Harvard University in 1967 .",
    "Szewczyk played 37 times for Poland, scoring 3 goals .",
]
preds = run_predict(texts, path_checkpoint="cube_model")
preds.save("preds.json")
explore_data("preds.json")

{'load': 'cube_model/best_model'}
{
  "config_file": "config.yml",
  "save_dir": "ckpt/cube_prune_20_seed_0",
  "data_dir": "data/processed",
  "train_file": "data/processed/train.json",
  "dev_file": "data/processed/dev.json",
  "test_file": "data/processed/test.json",
  "ent_rel_file": "data/processed/label.json",
  "max_sent_len": 80,
  "max_wordpiece_len": 80,
  "test": false,
  "freeze_bert": false,
  "load_weight_path": "",
  "prune_topk": 20,
  "task": "quintuplet",
  "embedding_model": "bert",
  "bert_model_name": "bert-base-uncased",
  "pretrained_model_name": null,
  "bert_output_size": 0,
  "bert_dropout": 0.0,
  "fine_tune": true,
  "max_span_length": 10,
  "mlp_hidden_size": 150,
  "dropout": 0.4,
  "separate_threshold": 1.4,
  "logit_dropout": 0.2,
  "gradient_clipping": 5.0,
  "learning_rate": 5e-05,
  "bert_learning_rate": 5e-05,
  "lr_decay_rate": 0.9,
  "adam_beta1": 0.9,
  "adam_beta2": 0.9,
  "adam_epsilon": 1e-12,
  "adam_weight_decay_rate": 1e-05,
  "adam_bert_wei

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


{'cuda': 0}



1it [00:00, 49.93it/s]

preds.json: 100%|██████████| 2/2 [00:00<00:00, 3289.65it/s]

{
  "sents": 2,
  "relations": 2,
  "relation_labels": 2,
  "qualifiers": 4,
  "qualifier_labels": 4,
  "hash": "65fbd277ea331512418590035d409d72"
}

Text: Leonard Parker received his PhD from Harvard University in 1967 .
Tokens: ['Leonard', 'Parker', 'received', 'his', 'PhD', 'from', 'Harvard', 'University', 'in', '1967', '.']
	Relation: head=(0, 2) tail=(6, 8) label='educated at' qualifiers=[Entity(span=(4, 5), label='academic degree'), Entity(span=(9, 10), label='end time')]
	Head: Leonard Parker, Relation: educated at, Tail: Harvard University
		Qualifier: academic degree, Value: PhD
		Qualifier: end time, Value: 1967


Text: *@UNK@* played 37 times for *@UNK@* scoring 3 goals .
Tokens: ['*@UNK@*', 'played', '37', 'times', 'for', '*@UNK@*', 'scoring', '3', 'goals', '.']
	Relation: head=(0, 1) tail=(5, 6) label='member of sports team' qualifiers=[Entity(span=(2, 3), label='number of matches played/races/starts'), Entity(span=(7, 8), label='number of points/goals/set scored')]
	Head:
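
In the second prediction, `Poland,` reaches the model as a single whitespace-delimited token and is printed as `*@UNK@*`, whereas the dataset text separates punctuation with spaces. A minimal sketch of a pre-tokenization step, assuming a simple regex split is acceptable for your inputs, is shown below. `pretokenize` is a hypothetical helper and not part of the HyperRED code; whether a separated word is then recognized still depends on the model's vocabulary, as the `Szewczyk` case suggests.

```python
import re


def pretokenize(text: str) -> str:
    """Hypothetical helper: space-separate punctuation so a whitespace split yields one token each."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return " ".join(re.findall(r"\w+|[^\w\s]", text))


print(pretokenize("Szewczyk played 37 times for Poland, scoring 3 goals ."))
```

This split also breaks hyphenated words and decimal numbers into separate tokens, which matches the `best - selling` style seen in the dataset but may not suit every input.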




In [None]:
# Train CubeRE Model from scratch
!python training.py --save_dir ckpt/cube_prune_20 --data_dir data/processed --prune_topk 20 --config_file config.yml

{
  "config_file": "config.yml",
  "save_dir": "ckpt/cube_prune_20",
  "data_dir": "data/processed",
  "train_file": "data/processed/train.json",
  "dev_file": "data/processed/dev.json",
  "test_file": "data/processed/test.json",
  "ent_rel_file": "data/processed/label.json",
  "max_sent_len": 80,
  "max_wordpiece_len": 80,
  "test": false,
  "freeze_bert": false,
  "load_weight_path": "",
  "prune_topk": 20,
  "task": "quintuplet",
  "embedding_model": "bert",
  "bert_model_name": "bert-base-uncased",
  "pretrained_model_name": null,
  "bert_output_size": 0,
  "bert_dropout": 0.0,
  "fine_tune": true,
  "max_span_length": 10,
  "mlp_hidden_size": 150,
  "dropout": 0.4,
  "separate_threshold": 1.4,
  "logit_dropout": 0.2,
  "gradient_clipping": 5.0,
  "learning_rate": 5e-05,
  "bert_learning_rate": 5e-05,
  "lr_decay_rate": 0.9,
  "adam_beta1": 0.9,
  "adam_beta2": 0.9,
  "adam_epsilon": 1e-12,
  "adam_weight_decay_rate": 1e-05,
  "adam_bert_weight_decay_rate": 1e-05,
  "seed": 5216,
 