The general objective is to extract facts of the form (e1, relation, e2), where each element is a Wikidata ID, from raw text.
We refer to e1 and e2 as "entities", which appear in text as "mentions" that refer to them. We call an entity's common mentions its "aliases". Similarly, we call relation mentions "verbs" (even though they are not always grammatical verbs).
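As an illustration of this vocabulary, here is one fact triplet with real Wikidata IDs (the alias and verb lists are shortened for the example):

```python
# A fact is a triplet of Wikidata IDs: (entity 1, relation, entity 2)
fact = ("Q30", "P36", "Q61")  # (United States, capital, Washington, D.C.)

# "Aliases" are an entity's common mentions in text:
entity_aliases = {"Q30": ["United States", "USA", "US"]}

# "Verbs" are a relation's common mentions (not always actual verbs):
relation_verbs = {"P36": ["capital", "has capital"]}

print(fact)
```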
Preliminary note: in order to get spacy and neuralcoref working together, we recommend working in a new conda env with python 3.7.
First, install the required Python modules with:
pip install -r requirements.txt
To obtain all the dataset and model files, either run the setup commands included below or download everything (all our models and data) from Google Drive.
This repo uses spacy's 'en' model, please download it using:
python -m spacy download en
We also use the SPRING AMR parser in our pipeline
(see the SPRING article), with scripts adapted
directly from the SPRING GitHub repository. To make the AMR parsing scripts
work, download the AMR parsing weights (the link is on
the SPRING GitHub page) and place them at models/spring_amr/AMR3.pt
Note that all the data setups are done with dataset_setups.py,
which documents what each setup step does. You can set
up this repo by reading the helpers in that file or by following this README.
We take the aliases of the Wikidata-vitals relations from TorchKGE.
The following command saves a dictionary to wikidatavitals/data/property_verbs.json
that maps a Wikidata property ID to a
list of verbs that represent it:
python dataset_setups.py --verbs
The created file weighs 98.7 kB (with the maximum number of verbs per property set to 5), and the process takes around 7 minutes.
We also save a list at wikidatavitals/data/verb_idx2id.json
mapping the index of a verb to its original property ID.
We chose the Universal Sentence Encoder (USE) as the vector representation for our sentences.
To compare two sentences, we use the cosine similarity of their USE representations.
The code for this comparison is located in models/comparators.py
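As a minimal sketch, the comparison reduces to a cosine similarity between embedding vectors. The toy vectors below stand in for real USE embeddings; the commented loading snippet is a common tensorflow_hub usage pattern, not this repo's exact code:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# In the real pipeline the vectors come from the Universal Sentence
# Encoder, e.g. via tensorflow_hub (illustrative, not the repo's code):
#   import tensorflow_hub as hub
#   use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
#   u, v = use(["sentence one", "sentence two"]).numpy()

# Toy vectors standing in for USE embeddings:
u = np.array([0.2, 0.8, 0.1])
v = np.array([0.25, 0.7, 0.05])
print(round(cosine_similarity(u, v), 3))
```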
Thanks to TorchKGE, it is simple to access the id -> title mapping of wikidata-vitals entities.
We save this dictionary in wikidatavitals/data/entity_names.json
with:
python dataset_setups.py --entities
The file weighs 1.4 MB.
This command also saves a dictionary mapping entity IDs to the corresponding entity's aliases to
wikidatavitals/data/entity_aliases.json. This file weighs 4.0 MB.
Obtaining the aliases requires querying the Wikidata API, which takes around 7 hours in total.
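For reference, alias retrieval goes through the standard Wikidata `wbgetentities` API action. The helper below is a hypothetical sketch (the repo's actual query code may differ); only the request construction is shown, the network call stays commented out:

```python
def alias_request_params(entity_ids):
    """Build query params for the Wikidata wbgetentities API action.

    Hypothetical helper, not the repo's actual code. The API accepts
    up to 50 pipe-separated IDs per request.
    """
    return {
        "action": "wbgetentities",
        "ids": "|".join(entity_ids),
        "props": "aliases",
        "languages": "en",
        "format": "json",
    }

params = alias_request_params(["Q30", "Q61"])
print(params["ids"])
# The actual request would then be:
#   requests.get("https://www.wikidata.org/w/api.php", params=params)
```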
Just like for entities, we save a dictionary id -> title for the relations, as well as the list of all the fact triplets in Wikidata-vitals using:
python dataset_setups.py --relations
This execution takes about a minute, and the two files weigh 31 kB and 6.9 MB.
In order to filter triplet possibilities, we need to store the types of each entity and the possible types of the arguments of each relation. This is done with:
python dataset_setups.py --entity-types
python dataset_setups.py --relation-types
To assess the quality of the knowledge extraction, we test it on the USA article: a predicted fact is counted as correct if it is already present in Wikidata-vitals.
The benchmark requires a setup (please also go through the general setup first!):
python benchmark.py --prepare
It can then be run using (example for v1):
python v1.py --usa-benchmark
This process takes 7 minutes with an RTX3090 (python allocates a lot of GPU memory but uses little GPU processing power here).
Another similar benchmark is the "Hundo" benchmark, which does the same as the USA benchmark but with 100 random articles (note that these articles are sampled at random, so they can fall anywhere in a train/val split).
This is a baseline that outputs the most likely relation that satisfies type constraints. A triplet is considered legal if the entities have legal types for the relation. An entity type is defined by an "instance-of" relation in Wikidata, and the legal types for the head (h) and tail (t) entities for relation (r) are all the types of the corresponding entities in triplets (h', r, t') present in Wikidata.
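A minimal sketch of this legality check, assuming illustrative entity IDs and type sets (the function and variable names are hypothetical, not the repo's API):

```python
from collections import defaultdict

# (head, relation, tail) triplets already known from Wikidata; IDs are real
# Wikidata IDs but the lists are illustrative.
known_triplets = [
    ("Q90", "P17", "Q142"),   # Paris,  country, France
    ("Q64", "P17", "Q183"),   # Berlin, country, Germany
]
# entity id -> set of "instance-of" type ids
entity_types = {
    "Q90": {"Q515"}, "Q64": {"Q515"},      # Q515 = city
    "Q142": {"Q6256"}, "Q183": {"Q6256"},  # Q6256 = country
    "Q1234567": {"Q5"},                    # Q5 = human (illustrative entity)
}

# Collect the legal head/tail types for each relation from known triplets.
legal_head, legal_tail = defaultdict(set), defaultdict(set)
for h, r, t in known_triplets:
    legal_head[r] |= entity_types.get(h, set())
    legal_tail[r] |= entity_types.get(t, set())

def is_legal(h, r, t):
    """A candidate triplet is legal if both entities have types seen for r."""
    return bool(entity_types.get(h, set()) & legal_head[r]) and \
           bool(entity_types.get(t, set()) & legal_tail[r])

print(is_legal("Q90", "P17", "Q183"))      # city -> country: legal
print(is_legal("Q1234567", "P17", "Q183")) # human head: illegal
```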
We compare a query sentence with a list of generated sentences of the form "entity 1" "verb" "entity 2".
We apply the comparison method to all the (ordered) entity pairs in the original sentence; for each pair (that isn't too far apart in the text) we find the most similar property.
We added a threshold method to avoid creating spurious facts. For example, in 'Carlos Santana is a Mexican guitarist.' we don't want a property between "Mexican (nationality)" and "guitarist (occupation)". So far this method isn't very effective.
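The selection-plus-threshold step can be sketched as follows (the function name is illustrative and the similarity scores are stubbed; in the pipeline they come from the USE comparison):

```python
def best_property(similarities, threshold=0.5):
    """Return the most similar property for an entity pair, or None if
    even the best score falls below the threshold (spurious-fact guard).

    Illustrative sketch, not the repo's actual API; the default
    threshold value is made up for the example.
    """
    prop, score = max(similarities.items(), key=lambda kv: kv[1])
    return prop if score >= threshold else None

# Stubbed similarities between the query sentence and generated
# "entity1 verb entity2" sentences, one score per candidate property:
sims_good = {"P17": 0.81, "P36": 0.40}  # confident match -> keep P17
sims_weak = {"P17": 0.32, "P36": 0.28}  # all weak -> discard the pair
print(best_property(sims_good))
print(best_property(sims_weak))
```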
In order to test v1 on a sentence "[sentence]", run the command:
python v1.py --sentence "[sentence]"
/!\ This uses a TensorFlow model (the Universal Sentence Encoder), so having a GPU available is recommended.
Given a fact (e1, r, e2), we build a "sentence" using the entity aliases and relation verbs. The next step is to compute the BERT [CLS] output on all the built sentences. These vectors serve as input for a supervised classifier (XGBoost).
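The sentence-building step can be sketched like this (the alias/verb data is illustrative and the helper name hypothetical; the real lists come from the JSON files described above):

```python
entity_aliases = {"Q30": ["United States", "USA"],
                  "Q61": ["Washington, D.C."]}
relation_verbs = {"P36": ["capital", "has capital"]}

def build_sentences(e1, r, e2, max_sentences=4):
    """Combine entity aliases and relation verbs into short synthetic
    sentences for the fact (e1, r, e2). Illustrative sketch only."""
    out = []
    for a1 in entity_aliases[e1]:
        for verb in relation_verbs[r]:
            for a2 in entity_aliases[e2]:
                out.append(f"{a1} {verb} {a2}")
    return out[:max_sentences]

sentences = build_sentences("Q30", "P36", "Q61")
print(sentences)
# Each built sentence is then encoded with BERT's [CLS] output, and the
# resulting vectors feed the XGBoost classifier.
```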
In order to test v2 on a sentence "[sentence]", run the command:
python v2.py --sentence "[sentence]"
In order to save the classification dataset to wikidatavitals/data/encoded/
, run:
python dataset_setups.py --encode
This requires about 5 GB of RAM and 3 GB of VRAM; the process takes under 2 minutes with an RTX 3090. The produced files weigh 1.2 GB combined.
Adjust the configuration in the config file (the default value has good results on the train/val sets), then run:
python v2.py --train
This model is trained on Wikipedia sentences annotated with known Wikidata facts. It is a slight variation on the v2 idea; however, preparing its dataset is extremely costly.
python dataset_setups.py --save-wikivitals
python dataset_setups.py --annotate-wikivitals
python dataset_setups.py --encode-wikivitals
python v2.py --train --point-five
This model classifies entity pairs (i.e., a pair of word groups detected by wikifier) into relations. The word pairs are encoded using BERT's attentions, and we classify these pair encodings using an XGB model.
- V3.5 idea: use the TransE pair result in the pipeline
The following command takes the sentences from wikivitals/data/encoded/train_sentences.json
and creates an
annotated pair dataset from it:
python dataset_setups.py --encode-pairs
python v3.py --train
These results hold true for the default configuration in config.py:
Version | Detections | Correct % | Correct pairs % |
---|---|---|---|
V0 | 459 | 0.65% | 20.7% |
V1 | 51 | 0% | 7.84% |
V2 | 87 | 3.45% | 13.79% |
V2.5 | 161 | 22.36% | 25.47% |
V3 | 698 | 3.72% | 8.13% |
Version | Detections | Correct % | Correct pairs % | MRR |
---|---|---|---|---|
V0 | 459 | 0.65% | 20.7% | 0.296 |
V1 | 51 | 0% | 6.38% | --- |
V2 | 81 | 3.7% | 9.88% | 0.477 |
V2.5 | 156 | 23.72% | 26.28% | 0.929 |
V3 | 615 | 4.35% | 7.42% | 0.657 |
These results hold true for optimised thresholds (not the default configuration in config.py):
Version | Detections | Correct % | Correct pairs % | MRR |
---|---|---|---|---|
V0 | 4826 | 0.35% | 13.8% | 0.32 |
V1 | 488 | 0.28% | 1.40% | 0.75 |
V2 | 1755 | 0.46% | 1.82% | 0.41 |
V2.5 | 2179 | 2.52% | 3.99% | 0.73 |
V3 | 12621 | 0.61% | 2.12% | 0.42 |
To parse a sentence into AMR format, use the AMRParser
from models/amr.py
. This parser converts
a raw text sentence into AMR format using SPRING, and also detects
Wikidata-vitals entities.
Using amr_relation_detection_helper.py
, you can quickly build a simple model that detects a specific relation
by giving it sentences that portray this relation.