# Detection of Relation Assertion Errors in Knowledge Graphs
Implementation of the PaTyBRED error detection method from the paper "Detection of Relation Assertion Errors in Knowledge Graphs", published in the proceedings of K-CAP 2017.
## How to use
First, the dataset needs to be converted into the NPZ format supported by our system. This can be done with `load_kb.py`, which takes NT files as input. If the dataset is in another format, it first has to be converted into the NT format (a standard RDF conversion tool can be used).
```
python load_kb.py dataset.nt
```
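As a rough illustration of what such a conversion involves, the sketch below maps the URIs of a simple N-Triples file to integer ids. Note that the actual NPZ layout produced by `load_kb.py` is an assumption here, and `parse_nt_line` / `index_triples` are hypothetical helpers, not the repository's code:

```python
import re

def parse_nt_line(line):
    # Match the three URI terms of a simple N-Triples statement
    # (literals and blank nodes are ignored in this sketch).
    m = re.match(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.', line)
    return m.groups() if m else None

def index_triples(lines):
    """Map entity/relation URIs to integer ids, as an NPZ-style
    conversion might do (the real array layout may differ)."""
    ent, rel, triples = {}, {}, []
    for line in lines:
        parsed = parse_nt_line(line)
        if parsed is None:
            continue
        s, p, o = parsed
        si = ent.setdefault(s, len(ent))
        pi = rel.setdefault(p, len(rel))
        oi = ent.setdefault(o, len(ent))
        triples.append((si, pi, oi))
    return ent, rel, triples
```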
Once the dataset has been converted, the triples can be scored and all facts in the data ranked with `rank_facts.py`. The KG model is selected with `-m`, the path of the ranked data output is set with `-o`, and the learned model can be saved to the path given by `-sp`.
```
python rank_facts.py dataset.npz -m patybred -o ranked_dataset.pkl -sp learned-model.pkl
```
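For intuition, here is a much-simplified, type-only stand-in for PaTyBRED's per-relation scoring. The real method trains a local classifier per relation on path and type features; this toy version only scores a triple by how often its (subject type, object type) pair occurs with that relation, and is not the repository's code:

```python
from collections import Counter, defaultdict

def type_score_model(triples, types):
    """Toy stand-in for PaTyBRED's per-relation local classifiers:
    a triple is scored by the relative frequency of its best-matching
    (subject type, object type) pair for that relation. Low scores
    flag suspicious facts."""
    counts = defaultdict(Counter)
    for s, p, o in triples:
        for st in types[s]:
            for ot in types[o]:
                counts[p][(st, ot)] += 1

    def score(s, p, o):
        total = sum(counts[p].values()) or 1
        best = max((counts[p][(st, ot)]
                    for st in types[s] for ot in types[o]), default=0)
        return best / total

    return score
```

Ranking all facts by this score in ascending order puts the most suspicious assertions first, which is the ordering `rank_facts.py` writes out.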
An evaluation can be performed by adding noise (wrong triples) to a dataset and subsequently detecting it with the chosen method.
To add noise, `generate_errors.py` can be used.
`-pe` sets the ratio of noise to be generated (`0.01` means 1% of the original number of triples). `-ek` selects the kind of noise, which is generated by corrupting correct triples through replacement of the subject or object (`1` substitutes the original entity with a random entity of any type, `2` with a random entity of the same type as the original).
A NPZ file with the original data plus the generated errors will be created as output.
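The corruption step described above can be sketched as follows. The function name and signature are illustrative, not the script's actual API; only the `-pe`/`-ek` semantics are taken from the text:

```python
import random

def corrupt_triples(triples, types, pe=0.01, ek=1, seed=0):
    """Corrupt the subject or object of randomly chosen triples.
    ek=1 swaps in any random entity; ek=2 swaps in a random entity
    sharing a type with the replaced one. pe is the fraction of
    triples to corrupt."""
    rng = random.Random(seed)
    entities = list(types)
    by_type = {}
    for e, ts in types.items():
        for t in ts:
            by_type.setdefault(t, []).append(e)
    errors = []
    n_errors = max(1, int(pe * len(triples)))
    for s, p, o in rng.sample(triples, n_errors):
        pos = rng.choice(('s', 'o'))
        original = s if pos == 's' else o
        if ek == 1:
            repl = rng.choice(entities)
        else:
            t = rng.choice(sorted(types[original]))
            repl = rng.choice(by_type[t])
        errors.append((repl, p, o) if pos == 's' else (s, p, repl))
    return errors
```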
This file can then be used by `detect_errors.py`, which learns a KG model on the noisy dataset, ranks the facts, and evaluates how the erroneous facts are ranked.
Evaluation results are shown with various performance measures.
```
python generate_errors.py dataset.npz -pe 0.01 -ek 1
python detect_errors.py dataset-ek1.npz -m patybred -o ranked_dataset.pkl -sp learned-model.pkl
```
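One simple measure of this kind of evaluation can be sketched as below: rank all facts by ascending score (most suspicious first) and report the mean rank of the injected errors. The function name is illustrative; the exact metrics reported by `detect_errors.py` are described in the paper:

```python
def mean_rank_of_errors(scores, error_flags):
    """Rank facts by ascending score (suspicious first) and return
    the mean rank of the injected errors; lower is better."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [r + 1 for r, i in enumerate(order) if error_flags[i]]
    return sum(ranks) / len(ranks)
```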
## Generating SHACL Constraints
Implementation of the generation of SHACL-SPARQL relation constraints from the paper "Automatic Detection of Relation Assertion Errors and Induction of Relation Constraints", submitted to the Semantic Web Journal.
To generate the SHACL constraints, it is necessary to learn a PaTyBRED model with decision trees as local classifiers (`-m patybred -clf dt`).
When generating the constraints, there are two mandatory parameters: the first is the path to the learned model, and the second is the path to the original KG dataset, which contains the relation and type names.
```
python shacl-sparql.py learned-model.pkl dataset.npz -c 0.99 -ms 10
```
`-c` specifies the minimum confidence and `-ms` the minimum support. These parameters are used when pruning the learned decision tree.
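The pruning idea can be sketched as below: keep only tree leaves whose confidence and support clear the `-c`/`-ms` thresholds. The nested-dict tree layout is an assumption made for illustration, not the actual data structure used by the scripts:

```python
def prune(node, min_conf=0.99, min_sup=10):
    """Prune a decision tree (as a nested dict) by minimum confidence
    and minimum support. Leaves below either threshold are dropped;
    inner nodes with no surviving children are dropped too."""
    if 'leaf' in node:
        keep = node['conf'] >= min_conf and node['support'] >= min_sup
        return node if keep else None
    children = {}
    for key, child in node['children'].items():
        pruned = prune(child, min_conf, min_sup)
        if pruned is not None:
            children[key] = pruned
    return {**node, 'children': children} if children else None
```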
To validate your dataset against the set of learned constraints, you can use the TopBraid implementation of SHACL, which is based on Jena.
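For reference, a generated relation constraint might look roughly like the following SHACL-SPARQL shape; the `ex:` namespace and the concrete relation and type names are made up for illustration:

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

# Flags every subject of ex:capitalOf that is not typed as a City.
ex:CapitalOfSubjectShape
    a sh:NodeShape ;
    sh:targetSubjectsOf ex:capitalOf ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "Subject of ex:capitalOf is expected to be of type ex:City" ;
        sh:select """
            SELECT $this
            WHERE {
                FILTER NOT EXISTS { $this a <http://example.org/City> . }
            }
            """ ;
    ] .
```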
The datasets used in the paper's automatic evaluation (containing generated erroneous triples) can be downloaded here:
- Semantic Bible
- Nobel Prize
- AIFB portal
The datasets used in the paper's manual evaluation can be downloaded here: