# GenIE: Generative Information Extraction

---

## Table of Contents
1. [How to download the required artefacts?](#Download)
2. [How to load the models?](#Loading-the-Models)
3. [How to run inference?](#Inference)
    - [Unconstrained Generation](#Unconstrained-Generation)
    - [Constrained Generation](#Constrainted-Generation)
    - [Extracting the Wikidata Disambiguated Triplet Sets](#Extracting-the-Wikidata-Disambiguated-Triplet-Sets)
4. [Loading models and running inference with Hydra](#Loading-Models-and-Running-Inference-with-Hydra)
5. [How to load and use the datasets?](#Loading-Datasets)
6. Optional
    1. [How to constrain the model with a custom set of strings?](#Constructing-Prefix-Tries-for-A-Custom-Set-of-Strings)
    2. [Loading and Using the WikidataID2Name Dictionaries](#Loading-and-Using-the-WikidataID2Name-Dictionaries)

## Download

The data that we release consists of:

1. **Pre-trained Model(s)**
    - Wiki-NRE (W): [Random Initialization](https://zenodo.org/record/6139236/files/genie_w.ckpt)
    - Rebel (R): [Random Initialization](https://zenodo.org/record/6139236/files/genie_r.ckpt) – [Pretrained Language Model](https://zenodo.org/record/6139236/files/genie_plm_r.ckpt) – [Pretrained Entity Linker (GENRE)](https://zenodo.org/record/6139236/files/genie_genre_r.ckpt)
    - Rebel + Wiki-NRE (R+W): [Random Initialization](https://zenodo.org/record/6139236/files/genie_rw.ckpt)
2. [**Prefix Trees (tries) for Constrained Generation**](https://zenodo.org/record/6139236/files/tries.zip)
    - relation trie
    - entity trie
3. **Datasets** \[Not required for inference\] 
    - [Rebel](https://zenodo.org/record/6139236/files/rebel.zip)
    - [FewRel](https://zenodo.org/record/6139236/files/fewrel.zip)
    - [Wikipedia-NRE](https://zenodo.org/record/6139236/files/wikipedia_nre.zip)
    - [Geo-NRE](https://zenodo.org/record/6139236/files/geo_nre.zip)
4. [**World Definitions**](https://zenodo.org/record/6139236/files/world_definitions.zip) \[Not required for inference\] 
5. **Mapping between Unique Names and Wikidata Identifiers** ([used by GenIE](https://zenodo.org/record/6139236/files/surface_form_dicts.zip), [full snapshot](https://zenodo.org/record/6139236/files/surface_form_dicts_from_snapshot.zip)) \[Optional. Necessary for processing data\] 
    - relation name to wikidata ID (and vice-versa)
    - entity name to wikidata ID (and vice-versa)

You can download the data by executing the <code>download_data.sh</code> script. If you want to omit some files, comment out the corresponding parts of the script.

Alternatively, you can access the data [here](https://zenodo.org/record/6139236#.YhJdiJPMJhH).

In [1]:
# If you are using a different directory for your data, update the path below
DATA_DIR="../data"

# To download the data uncomment and run the following line
# !bash ../download_data.sh $DATA_DIR

# If your working directory is not the GenIE folder, add the path to it to sys.path to make the library available
import os
import sys

sys.path.append("../")

# Loading the Models

In [2]:
"""Load the Model"""
from genie.models import GeniePL

ckpt_name = "genie_r.ckpt"
path_to_checkpoint = os.path.join(DATA_DIR, 'models', ckpt_name)
model = GeniePL.load_from_checkpoint(checkpoint_path=path_to_checkpoint)

In [3]:
"""Load the Prefix Tries"""
from genie.constrained_generation import Trie

# Large schema tries (correspond to Rebel; see the paper for details) 
entity_trie_path = os.path.join(DATA_DIR, "tries/large/entity_trie.pickle")
entity_trie = Trie.load(entity_trie_path)

relation_trie_path = os.path.join(DATA_DIR, "tries/large/relation_trie.pickle")
relation_trie = Trie.load(relation_trie_path)

large_schema_tries = {'entity_trie': entity_trie, 'relation_trie': relation_trie}

# Small schema tries (correspond to Wiki-NRE; see the paper for details) 
entity_trie_path = os.path.join(DATA_DIR, "tries/small/entity_trie.pickle")
entity_trie = Trie.load(entity_trie_path)

relation_trie_path = os.path.join(DATA_DIR, "tries/small/relation_trie.pickle")
relation_trie = Trie.load(relation_trie_path)

small_schema_tries = {'entity_trie': entity_trie, 'relation_trie': relation_trie}

To construct a prefix trie for your custom set of strings, see [this section](#Constructing-Prefix-Tries-for-A-Custom-Set-of-Strings).

# Inference

For inference use the `model.sample` function. 

Under the hood, **GenIE** uses Hugging Face's `generate` function, so it accepts the same generation parameters. By default, inference uses the model's default generation parameters, but you can override them in the call to the function, as shown in the examples.

In [4]:
sentences = ["Prior to KTRK, Carson was an anchor for KSAZ in Phoenix, Arizona."]

----

### Unconstrained Generation

In [5]:
override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123
}

output = model.sample(sentences, 
                      **override_models_default_hf_generation_parameters)

output

[[{'text': ' <sub> KSAZ-TV <rel> headquarters location <obj> Phoenix, Arizona <et>',
   'log_prob': -0.1262681782245636},
  {'text': ' <sub> KSAZ <rel> headquarters location <obj> Phoenix, Arizona <et>',
   'log_prob': -0.13841137290000916}]]

---

<a id="Constrainted-Generation"></a>
### Constrained Generation

To constrain the generation, pass the entity trie and the relation trie as the `entity_trie` and `relation_trie` arguments of `model.sample`.

#### Small Schema Constrained Generation

In [6]:
"""Small Schema Constrained Generation"""

override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123
}

output = model.sample(sentences, 
                      **small_schema_tries, 
                      **override_models_default_hf_generation_parameters)

output

[[{'text': ' <sub> Phoenix, Arizona <rel> capital of <obj> Arizona <et> <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>',
   'log_prob': -0.30889713764190674},
  {'text': ' <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>',
   'log_prob': -0.3368832468986511}]]

Applying the small schema constraints results in the model linking the "KSAZ" entity to the [Fox Broadcasting Company](https://en.wikipedia.org/wiki/Fox_Broadcasting_Company). This is because the correct entity is missing from the "small" schema.

Note that in comparison with the unconstrained generation, the ("best" eligible) predictions in this case are assigned a much lower score (log probability).
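
To put the gap between the two scores in perspective, the log-probabilities can be exponentiated. The snippet below is a minimal sketch, assuming the `log_prob` values are natural logarithms; the helper `to_probability` is hypothetical and not part of GenIE. Beam-search scores may be length-normalised, so the results are best read as a relative measure of confidence rather than exact probabilities.

```python
import math

# Top-ranked log-probabilities copied from the outputs shown above.
unconstrained_best = -0.1262681782245636
small_schema_best = -0.30889713764190674


def to_probability(log_prob: float) -> float:
    """Convert a natural-log score into a value in (0, 1]."""
    return math.exp(log_prob)


p_unconstrained = to_probability(unconstrained_best)
p_small = to_probability(small_schema_best)

print(f"unconstrained: {p_unconstrained:.3f}")
print(f"small schema:  {p_small:.3f}")
print(f"ratio:         {p_small / p_unconstrained:.3f}")
```

Under these assumptions, the small-schema prediction comes out at roughly 0.73 against roughly 0.88 for the unconstrained one.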

#### Large Schema Constrained Generation

In [7]:
"""Large Schema Constrained Generation"""

override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123
}

output = model.sample(sentences,
                      **large_schema_tries, 
                      **override_models_default_hf_generation_parameters)

output

[[{'text': ' <sub> KSAZ-TV <rel> headquarters location <obj> Phoenix, Arizona <et>',
   'log_prob': -0.1262681782245636},
  {'text': ' <sub> KSAZ <rel> headquarters location <obj> Phoenix, Arizona <et>',
   'log_prob': -0.13841137290000916}]]

Finally, under the large schema constraint, the model links the subject of the triplets to: 1) the TV station [KSAZ-TV](https://en.wikipedia.org/wiki/KSAZ-TV); 2) [KSAZ](https://en.wikipedia.org/wiki/KSAZ), the disambiguation page containing [KSAZ-TV](https://en.wikipedia.org/wiki/KSAZ-TV) and [KSAZ-AM](https://en.wikipedia.org/wiki/KSAZ_(AM)).

The last two examples illustrate how the generation for any of the **GenIE** models can be constrained with arbitrary prefix tries. See how you can construct your custom prefix trie [here](#Constructing-Prefix-Tries-for-A-Custom-Set-of-Strings).

----

### Extracting the Wikidata Disambiguated Triplet Sets

##### Textual Set of triplets

The textual predictions produced by the sampling function can be mapped directly to subject-relation-object tuples by passing the argument <code>convert_to_triplets=True</code>. This requires <code>return_dict_in_generate=True</code> and adds a <code>textual_triplets</code> field to the output.
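
For intuition, the linearised format seen in the outputs above (`<sub> ... <rel> ... <obj> ... <et>`) can be split into tuples with a few lines of Python. This is a minimal sketch and not GenIE's own conversion code; `parse_triplets` is a hypothetical helper, and skipping malformed chunks is an assumption made here.

```python
import re

# One triplet: "<sub> subject <rel> relation <obj> object", terminated by "<et>".
TRIPLET_PATTERN = re.compile(r"\s*<sub>(.*?)<rel>(.*?)<obj>(.*)", flags=re.DOTALL)


def parse_triplets(text: str) -> set:
    """Split a linearised prediction into (subject, relation, object) tuples."""
    triplets = set()
    for chunk in text.split("<et>"):
        match = TRIPLET_PATTERN.fullmatch(chunk)
        if match is None:
            continue  # empty tail after the last <et>, or a malformed chunk
        subject, relation, obj = (part.strip() for part in match.groups())
        triplets.add((subject, relation, obj))
    return triplets


single = " <sub> KSAZ-TV <rel> headquarters location <obj> Phoenix, Arizona <et>"
double = (" <sub> Phoenix, Arizona <rel> capital of <obj> Arizona <et>"
          " <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>")

print(parse_triplets(single))
print(len(parse_triplets(double)))
```

A set is used so that a triplet generated twice in the same sequence is counted once, matching the set-valued `textual_triplets` field in the example output.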

##### WikidataID set of triplets

The entity and relation names produced by GenIE are textual identifiers that can be uniquely translated to an element from Wikidata. To facilitate future research and make the output of the models more useful, the data that we release includes mappings between Wikidata IDs and entity/relation names that cover all of the elements in a snapshot of the English Wikipedia. See [this](#Loading-and-Using-the-WikidataID2Name-Dictionaries) section for details.

The sample function can also return the predicted triplets as tuples of Wikidata IDs, translated according to name-to-ID mappings for entities and relations that are passed to the call. This functionality requires <code>convert_to_triplets=True</code> and adds an <code>id_triplets</code> field to the output. The example below shows how to use the provided dictionaries for this purpose.

In [8]:
from genie.datamodule.utils import WikidataID2SurfaceForm

# Entity Mapping
ent_id2surface_info_path = os.path.join(DATA_DIR, "surface_form_dicts", "ent_id2surface_form.jsonl") # used in our experiments
ent_mapping = WikidataID2SurfaceForm(ent_id2surface_info_path)
ent_mapping.load()

# Relation Mapping
rel_id2surface_info_path = os.path.join(DATA_DIR, "surface_form_dicts", "rel_id2surface_form.jsonl") # used in our experiments
rel_mapping = WikidataID2SurfaceForm(rel_id2surface_info_path)
rel_mapping.load()

Reading mapping from: ../../../GenIE_public/GenIE/data/surface_form_dicts/ent_id2surface_form.jsonl
Reading mapping from: ../../../GenIE_public/GenIE/data/surface_form_dicts/rel_id2surface_form.jsonl


In [9]:
sentences = ["The physicist Einstein was given a Nobel Prize."]

In [10]:
"""Large Schema Constrained Generation"""

override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123
}

convert_to_triples = True
surface_form_mappings = {'entity_name2id': ent_mapping.surface_form2id, 'relation_name2id': rel_mapping.surface_form2id}

output = model.sample(sentences,
                      convert_to_triplets=convert_to_triples,
                      surface_form_mappings=surface_form_mappings,
                      **large_schema_tries, 
                      **override_models_default_hf_generation_parameters)

output

[[{'text': ' <sub> Albert Einstein <rel> award received <obj> Nobel Prize in Physics <et>',
   'log_prob': -0.1404302716255188,
   'textual_triplets': {('Albert Einstein',
     'award received',
     'Nobel Prize in Physics')},
   'id_triplets': [['Q937', 'P166', 'Q38104']]},
  {'text': ' <sub> Albert Einstein <rel> award received <obj> Nobel Prize in Physiology or Medicine <et>',
   'log_prob': -0.21287985146045685,
   'textual_triplets': {('Albert Einstein',
     'award received',
     'Nobel Prize in Physiology or Medicine')},
   'id_triplets': [['Q937', 'P166', 'Q80061']]}]]
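
The relationship between the `textual_triplets` and `id_triplets` fields above can be pictured as a dictionary lookup per element. The snippet below is a minimal sketch with toy dictionaries holding only the names and IDs visible in the output; `to_id_triplets` is a hypothetical helper, and dropping triplets with an unmapped name is an assumption made here, not documented GenIE behaviour.

```python
# Toy name-to-ID dictionaries, restricted to values shown in the output above.
entity_name2id = {
    "Albert Einstein": "Q937",
    "Nobel Prize in Physics": "Q38104",
}
relation_name2id = {"award received": "P166"}


def to_id_triplets(textual_triplets, entity_name2id, relation_name2id):
    """Translate (subject, relation, object) names into Wikidata IDs."""
    id_triplets = []
    for subject, relation, obj in sorted(textual_triplets):
        ids = (
            entity_name2id.get(subject),
            relation_name2id.get(relation),
            entity_name2id.get(obj),
        )
        if None in ids:
            continue  # assumption: skip triplets with an unmapped element
        id_triplets.append(list(ids))
    return id_triplets


textual = {("Albert Einstein", "award received", "Nobel Prize in Physics")}
print(to_id_triplets(textual, entity_name2id, relation_name2id))
```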

---

# Loading Models and Running Inference with Hydra

An alternative way to load the models (or data) is to use the configuration framework [Hydra](https://hydra.cc/). Below is one way of using Hydra in a Jupyter notebook, but the library shines when used in scripts. See our training and evaluation code for an example.

## Loading a Model and Performing Inference

In [11]:
# Load config
import hydra

configs_path = "../configs"
config_name = "config.yaml"

with hydra.initialize(config_path=configs_path):
    config = hydra.compose(config_name=config_name, 
                           overrides=[f"data_dir={DATA_DIR}/",
                                      "work_dir=../",
                                      "model=ckpt_genie"
                                     ])

In [12]:
# Load model
model = hydra.utils.instantiate(config.model)

In [13]:
output = model.sample(sentences, 
                      entity_trie=model.entity_trie, 
                      relation_trie=model.relation_trie, 
                      seed=123)

output

[[{'text': ' <sub> Albert Einstein <rel> award received <obj> Nobel Prize in Physics <et>',
   'log_prob': -0.14043039083480835},
  {'text': ' <sub> Albert Einstein <rel> award received <obj> Nobel Prize in Physiology or Medicine <et>',
   'log_prob': -0.2128797322511673}]]

# Loading Datasets

### With Hydra and PyTorch Lightning (Recommended)

In [14]:
# Load config
import sys
sys.path.append("../")

import hydra

configs_path = "../configs"
config_name = "config.yaml"

"""
datamodule -> rebel, wikipedia_nre, geo_nre, fewrel
"""

with hydra.initialize(config_path=configs_path):
    config = hydra.compose(config_name=config_name, 
                           overrides=[f"data_dir={DATA_DIR}/",
                                      "work_dir=../",
                                      "datamodule=wikipedia_nre"
                                     ])

In [15]:
# Load datamodule
datamodule = hydra.utils.instantiate(config.datamodule, tokenizer=None)

In [16]:
datamodule.setup("validate")
len(datamodule.data_val)

Loading data from ../../../GenIE_public/GenIE/data//wikipedia_nre/val_data


980

In [17]:
datamodule.data_val[5]

{'id': 5,
 'src': 'During these five years Mikhail Bukinik studied with Alfred von Glehn ( who was also the teacher of Gregor Piatigorsky ) at the Moscow Conservatory .',
 'trg': ' <sub> Mikhail Bukinik <rel> educated at <obj> Moscow Conservatory <et>',
 'non_formatted_wikidata_id_output': [['Q1930285', 'P69', 'Q215539']]}

### Raw Data Loading

In [18]:
import sys
sys.path.append("../")

from genie.datamodule.datasets import Rebel, WikipediaNRE, FewRel

In [19]:
"""
Rebel -> train, val, test
WikipediaNRE -> train, val, test and trip (which corresponds to GeoNRE)
FewRel -> test
"""

split='val'
raw_data, dataset = WikipediaNRE.from_kilt_dataset(data_split=split, 
                                                   tokenizer=None, 
                                                   return_raw_data=True,
                                                   matching_status="title",
                                                   relations_to_keep=f'{DATA_DIR}/world_definitions/complete_relations.jsonl',
                                                   )
len(dataset)

Loading data from ../../../GenIE_public/GenIE/data//wikipedia_nre/val_data


980

In [20]:
dataset[5]

{'id': 5,
 'src': 'During these five years Mikhail Bukinik studied with Alfred von Glehn ( who was also the teacher of Gregor Piatigorsky ) at the Moscow Conservatory .',
 'trg': ' <sub> Mikhail Bukinik <rel> educated at <obj> Moscow Conservatory <et>',
 'non_formatted_wikidata_id_output': [['Q1930285', 'P69', 'Q215539']]}
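
Samples of this shape are plain dictionaries, so standard Python tooling applies to them. As a minimal sketch, the snippet below counts how often each relation ID occurs across a list of samples; `count_relations` is a hypothetical helper, and the list holds only the single sample shown above rather than a loaded dataset.

```python
from collections import Counter

# A stand-in for a dataset: the one sample printed above.
samples = [
    {
        "id": 5,
        "src": "During these five years Mikhail Bukinik studied with Alfred von Glehn "
               "( who was also the teacher of Gregor Piatigorsky ) at the Moscow Conservatory .",
        "trg": " <sub> Mikhail Bukinik <rel> educated at <obj> Moscow Conservatory <et>",
        "non_formatted_wikidata_id_output": [["Q1930285", "P69", "Q215539"]],
    }
]


def count_relations(samples):
    """Count relation IDs over the ID triplets of every sample."""
    counts = Counter()
    for sample in samples:
        for _subject_id, relation_id, _object_id in sample["non_formatted_wikidata_id_output"]:
            counts[relation_id] += 1
    return counts


relation_counts = count_relations(samples)
print(relation_counts.most_common(3))
```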

# Optional

## Constructing Prefix Tries for A Custom Set of Strings

In [21]:
from genie.constrained_generation.trie import Trie, get_trie_from_strings

In [22]:
names = ["place of birth", "father", "employer"]
output_folder_path = os.path.join(DATA_DIR, "tries") # If the output folder path is None, the trie won't be saved to disk
tokenizer = None # If the tokenizer is set to None the function is loading GenIE's tokenizer by default. If you are using a different model, load the tokenizer specific to your model.

trie = get_trie_from_strings(names, 
                             output_folder_path=output_folder_path, 
                             trie_name="myCustomTrie", 
                             tokenizer=tokenizer)

100%|█████████████████████████████████████| 3/3 [00:00<00:00, 2465.30it/s]
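
To illustrate what a prefix trie provides during constrained generation, here is a minimal, self-contained sketch. It is not GenIE's `Trie` class: `ToyTrie` and its methods are hypothetical, whitespace-separated words stand in for the token IDs a real tokenizer would produce, and the name "place of death" is added only to show branching.

```python
class ToyTrie:
    """A nested-dict prefix trie over token sequences."""

    def __init__(self, sequences=()):
        self.root = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, sequence):
        """Insert one token sequence, creating nodes as needed."""
        node = self.root
        for token in sequence:
            node = node.setdefault(token, {})

    def next_tokens(self, prefix):
        """Return the tokens that may follow `prefix`, or [] if it is not in the trie."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []
            node = node[token]
        return sorted(node)


names = ["place of birth", "place of death", "father", "employer"]
toy_trie = ToyTrie(name.split() for name in names)

print(toy_trie.next_tokens([]))
print(toy_trie.next_tokens(["place", "of"]))
print(toy_trie.next_tokens(["mother"]))
```

A lookup of this kind is what allows a decoder to restrict the candidates at each step to continuations of some allowed name.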


## Loading and Using the WikidataID2Name Dictionaries

#### Loading

In [23]:
import os
import config

from genie.datamodule.utils import WikidataID2SurfaceForm

In [24]:
# Entity Mapping
ent_id2surface_info_path = os.path.join(DATA_DIR, "surface_form_dicts", "ent_id2surface_form.jsonl") # used in our experiments
# ent_id2surface_info_path = os.path.join(DATA_DIR, "surface_form_dicts_from_snapshot", "ent_id2surface_form.jsonl") # maximal mapping extracted from the English Wikipedia snapshot
ent_mapping = WikidataID2SurfaceForm(ent_id2surface_info_path)
ent_mapping.load()

Reading mapping from: ../../../GenIE_public/GenIE/data/surface_form_dicts/ent_id2surface_form.jsonl


In [25]:
# Relation Mapping

rel_id2surface_info_path = os.path.join(DATA_DIR, "surface_form_dicts", "rel_id2surface_form.jsonl") # used in our experiments
# rel_id2surface_info_path = os.path.join(DATA_DIR, "surface_form_dicts_from_snapshot", "rel_id2surface_form.jsonl") # maximal mapping extracted from the English Wikipedia snapshot
rel_mapping = WikidataID2SurfaceForm(rel_id2surface_info_path)
rel_mapping.load()

Reading mapping from: ../../../GenIE_public/GenIE/data/surface_form_dicts/rel_id2surface_form.jsonl


#### Map IDs to Names

In [26]:
def map_ids_to_names(ids, mapping, allow_querying=False, allow_labels=False, invalid_tokens=set([" <"])):
    """For some items there might not be a match in the current dictionary for several reasons, such as:
    1) The item is a redirect, hence it will be resolved to another item;
    2) The item was added after the snapshot from which the dictionary was generated;
    3) The entity doesn't have an English article associated with it.
    Issues like the first and the second can be resolved by querying Wikidata (setting allow_querying=True).
    To resolve the third, we might use the label of the item (items that are not associated with an English article are often assigned an English label). However, labels are not necessarily unique and might lead to duplicates. To enable this, set allow_labels=True"""
    
    id2title = {}
    for _id in ids:
        invalid = False

        unq_name, _ = mapping.get_from_wikidata_id(_id, 
                                                   return_provenance=True,
                                                   query_wikidata=allow_querying, 
                                                   allow_labels=allow_labels)

        if unq_name is not None:
            for token in invalid_tokens:
                if token in unq_name:
                    print("The name contains a special token ->", unq_name)
                    invalid = True
                    break

            if invalid:
                continue

            id2title[_id] = unq_name

    print("Out of the original `{}` elements, `{}` were successfully mapped".format(len(ids), len(id2title)))
    return id2title

In [27]:
ids = list(ent_mapping.id2surface_form.keys())[:5]
invalid_ids = ["randomID", "randomID2", "randomID3"]
ids += invalid_ids

map_ids_to_names(ids, ent_mapping)

Out of the original `8` elements, `5` were successfully mapped


{'Q9659': 'A',
 'Q41746': 'Achilles',
 'Q20127832': 'Achilles Stakes',
 'Q4673749': 'Achilles Rink',
 'Q4673754': 'Achilles Rizzoli'}

#### Map Names To IDs

In [28]:
def map_names_to_ids(names, mapping):
    surface_form2id = mapping.surface_form2id
    
    name2id = {name: surface_form2id[name] for name in names if name in surface_form2id}
    
    print("Out of the original `{}` elements, `{}` were successfully mapped".format(len(names), len(name2id)))
    return name2id

In [29]:
names = ["place of birth", "father", "employer"]
invalid_names = ["randomName", "randomName2", "randomName3"]
names += invalid_names

map_names_to_ids(names, rel_mapping)

Out of the original `6` elements, `3` were successfully mapped


{'place of birth': 'P19', 'father': 'P22', 'employer': 'P108'}
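
Since each name is meant to identify a single Wikidata element, a name-to-ID dictionary can be thought of as the inverse of an ID-to-name dictionary. The snippet below is a minimal sketch of that inversion with an explicit uniqueness check; `invert_mapping` is a hypothetical helper, not a method of `WikidataID2SurfaceForm`, and the toy dictionary holds only the three relations from the output above.

```python
def invert_mapping(id2name):
    """Build a name -> ID dictionary, refusing names shared by several IDs."""
    name2id = {}
    for wikidata_id, name in id2name.items():
        if name in name2id:
            raise ValueError(
                f"Name {name!r} maps to both {name2id[name]!r} and {wikidata_id!r}"
            )
        name2id[name] = wikidata_id
    return name2id


rel_id2name = {"P19": "place of birth", "P22": "father", "P108": "employer"}
rel_name2id = invert_mapping(rel_id2name)

print(rel_name2id)
```

Raising on a duplicate name makes any violation of the uniqueness assumption visible instead of silently overwriting an entry.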