# Content

- This notebook explains how to create synthetic instances for the purpose of discovering Differential Causal Rules.
- Synthetic instances are used to define a distance threshold in the embedding space. This threshold is then used to match similar instances and to discover differential causal rules. These steps are described in Tutorial2_Mining_DCRules.ipynb.

# Libraries Import

In [1]:
import pandas as pd
import numpy as np

import random
import uuid

In [2]:
import ampligraph
from ampligraph.datasets import load_from_csv

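if ampligraph.__version__ == '1.4.0':
    print("AmpliGraph version OK")
else:
    # make a version mismatch visible instead of passing silently
    print(f"Warning: expected AmpliGraph 1.4.0, found {ampligraph.__version__}")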

AmpliGraph version OK


In [3]:
from synthetic_generation import *

# Data Import

In [4]:
directory_path = '../datasets'
file_name = 'dbpedia_extract.csv'
X = load_from_csv(directory_path,file_name, sep=',')

In [5]:
# checking the import
print(f"The knowledge graph is composed of {len(X)} triples")

The knowledge graph is composed of 6908 triples


# Presenting the Schema

<img src="images/schema_dbpedia.png">

In [6]:
target_class = 'http://dbpedia.org/ontology/Writer'
target_class_instances = get_instances_for_type(X,target_class)
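The body of `get_instances_for_type` is not shown in this notebook. As a rough illustration only, the sketch below assumes it collects the subjects of all triples of the form (s, rdf:type, target class). The function name `instances_of_type_sketch` and the `ex:` identifiers are hypothetical and not part of `synthetic_generation`.

```python
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

def instances_of_type_sketch(triples, class_uri):
    """Return the sorted subjects s such that (s, rdf:type, class_uri) is in triples.

    Hypothetical stand-in for get_instances_for_type; works on any iterable
    of (subject, predicate, object) rows, including an (n, 3) array.
    """
    return sorted({s for s, p, o in triples if p == RDF_TYPE and o == class_uri})

# toy triples with made-up identifiers
toy = [
    ['ex:w1', RDF_TYPE, 'http://dbpedia.org/ontology/Writer'],
    ['ex:w1', 'http://dbpedia.org/ontology/birthDate', '1930'],
    ['ex:b1', RDF_TYPE, 'http://dbpedia.org/ontology/Book'],
    ['ex:w2', RDF_TYPE, 'http://dbpedia.org/ontology/Writer'],
]
toy_writers = instances_of_type_sketch(toy, 'http://dbpedia.org/ontology/Writer')
print(toy_writers)
```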

The dataset used in this tutorial is an extract from DBpedia that we name DBPediaW. It has been studied in [1] and [2]. The target class is Writer. In this tutorial, we therefore create synthetic instances from writers already present in DBPediaW.

[1] Munch, M., Dibie, J., Wuillemin, P., Manfredotti, C.E.: Towards interactive causal relation discovery driven by an ontology. In: International Florida Artificial Intelligence Research Society Conference (2019)

[2] Simonne, L., Pernelle, N., Sais, F., Thomopoulos, R.: Differential Causal Rules Mining in Knowledge Graphs. In: Proceedings of the Knowledge Capture Conference, K-CAP 2021 (2021)

# Creating A Synthetic Instance

<div class="alert alert-success">
    <b>Note :</b>
    As a reminder, a synthetic instance is a modified version of an existing instance of the target class. Each synthetic instance is associated with a <b>number of differences</b> (definition to be given later).
</div>

To create a synthetic instance of writer from an original writer instance, several paths can be modified :
- on the Writer description :
    - birthDate
    - genre
    - gender
- on the description of its nodes :
    - Book :
        - releaseDate
        - number of pages
    - University :
        - arwu
        - endowment
        - Country :
            - name

The protocol for creating synthetic instances can be structured in two sections :
- <b>Section 1 : Get the descriptions and select the subset of graph to modify.</b>
    - get the descriptions of a given writer, its books, university and country
    - given a number of differences, select the paths to be modified and the nodes accordingly
- <b>Section 2 : Build new triples.</b>
    - query the KG to find whether nodes respecting the changes to apply already exist or not
    - create new nodes if needed
    - create the new triples from the nodes previously determined (newly created and/or existing)

In [7]:
def get_synthetic_instance_from_original(X,instance_writer,number_differences,blocked_p):
    """
    Return the URI of the synthetic writer and the list of triples to add to the KG.
    """
    # obtain RDF description from instance
    dic_writer, dic_of_books, uni, dic_uni, country, dic_country = get_description_for_generation(X,instance_writer)
    
    # apply number of differences (blocked_p lists the predicates that must not be modified)
    dic_paths_to_change = get_paths_to_change(X,instance_writer,dic_writer,dic_of_books,uni,dic_uni,country,dic_country,number_differences,blocked_p)
    
    # obtain new triples to add to the KG
    new_writer_URI, triples_to_add = get_triples_to_add(X,instance_writer,dic_paths_to_change,dic_writer,dic_of_books,uni,dic_uni,country,dic_country)
    
    return new_writer_URI, triples_to_add

In [8]:
blocked_p = [
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    'http://dbpedia.org/ontology/author'
]

instance_test = random.sample(target_class_instances,1)[0]
NUMBER_DIFFERENCES = 5

In [9]:
dic_writer, dic_of_books, uni, dic_uni, country, dic_country = get_description_for_generation(X,instance_test)

In [10]:
dic_paths_to_change = get_paths_to_change(
    X,
    instance_test,
    dic_writer,
    dic_of_books,
    uni,
    dic_uni,
    country,
    dic_country,
    NUMBER_DIFFERENCES,
    blocked_p
)

In [11]:
dic_paths_to_change

{'http://dbpedia.org/resource/Jennifer,_Hecate,_Macbeth,_William_McKinley,_and_Me,_Elizabeth': ['http://dbpedia.org/ontology/releaseDate'],
 'http://dbpedia.org/resource/E._L._Konigsburg': ['http://dbpedia.org/ontology/genre',
  'http://dbpedia.org/ontology/birthDate'],
 'http://dbpedia.org/resource/Carnegie_Mellon_University': ['http://dbpedia.org/ontology/arwuW'],
 'http://dbpedia.org/resource/The_View_from_Saturday': ['http://dbpedia.org/ontology/numberOfPages']}
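A quick sanity check on this output is to count the selected paths and compare the total with `NUMBER_DIFFERENCES`. The sketch below copies the dictionary printed above into a variable named `dic_paths_example` (a hypothetical name, used so the snippet runs on its own).

```python
# dictionary copied from the output above
dic_paths_example = {
    'http://dbpedia.org/resource/Jennifer,_Hecate,_Macbeth,_William_McKinley,_and_Me,_Elizabeth': [
        'http://dbpedia.org/ontology/releaseDate'],
    'http://dbpedia.org/resource/E._L._Konigsburg': [
        'http://dbpedia.org/ontology/genre',
        'http://dbpedia.org/ontology/birthDate'],
    'http://dbpedia.org/resource/Carnegie_Mellon_University': [
        'http://dbpedia.org/ontology/arwuW'],
    'http://dbpedia.org/resource/The_View_from_Saturday': [
        'http://dbpedia.org/ontology/numberOfPages'],
}

# one difference per (node, predicate) path
total_differences = sum(len(predicates) for predicates in dic_paths_example.values())
print(total_differences)  # 5, matching NUMBER_DIFFERENCES
```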

In [12]:
new_writer_URI, triples_to_add = get_triples_to_add(X,instance_test,dic_paths_to_change,dic_writer,dic_of_books,uni,dic_uni,country,dic_country)

In [15]:
print('The URI of the synthetic writer is :',new_writer_URI)

The URI of the synthetic writer is : 9f97ea1d-0593-4936-8acf-574bc1bf9d3f
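The identifier above has the form of a version 4 UUID, which is consistent with the `uuid` import at the top of the notebook. The sketch below shows, under that assumption, how a new node could be minted and the triples of an existing node copied with some values changed. `copy_node_with_changes_sketch` and the `ex:` identifiers are hypothetical and do not reproduce `get_triples_to_add`.

```python
import uuid

def copy_node_with_changes_sketch(triples, original_node, changes):
    """Copy the triples of `original_node` onto a freshly minted node.

    changes: {predicate: new_value} applied to the outgoing triples.
    Returns (new_node_id, new_triples).
    """
    new_node = str(uuid.uuid4())  # random identifier for the new node
    new_triples = []
    for s, p, o in triples:
        if s == original_node:
            # outgoing triple: keep the value unless a change is requested
            new_triples.append([new_node, p, changes.get(p, o)])
        elif o == original_node:
            # incoming triple: point it at the new node
            new_triples.append([s, p, new_node])
    return new_node, new_triples

# toy triples with made-up identifiers
toy_triples = [
    ['ex:uni', 'ex:arwu', '101'],
    ['ex:uni', 'ex:label', 'Some University'],
    ['ex:country', 'ex:isCountryOf', 'ex:uni'],
    ['ex:other', 'ex:label', 'Unrelated'],
]
new_node, new_triples = copy_node_with_changes_sketch(toy_triples, 'ex:uni', {'ex:arwu': '501'})
print(new_node)
print(new_triples)
```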


In [14]:
triples_to_add

[['24659fae-42b2-481c-9175-26a2c106a692',
  'http://dbpedia.org/ontology/hasForStudent',
  '9f97ea1d-0593-4936-8acf-574bc1bf9d3f'],
 ['24659fae-42b2-481c-9175-26a2c106a692',
  'http://dbpedia.org/ontology/hasForStudent',
  'http://dbpedia.org/resource/E._L._Konigsburg'],
 ['24659fae-42b2-481c-9175-26a2c106a692',
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
  'http://dbpedia.org/ontology/University'],
 ['24659fae-42b2-481c-9175-26a2c106a692',
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
  'http://www.w3.org/2002/07/owl#NamedIndividual'],
 ['24659fae-42b2-481c-9175-26a2c106a692',
  'http://dbpedia.org/ontology/arwuW',
  '501'],
 ['24659fae-42b2-481c-9175-26a2c106a692',
  'http://www.w3.org/2000/01/rdf-schema#label',
  'Carnegie Mellon University'],
 ['24659fae-42b2-481c-9175-26a2c106a692',
  'http://dbpedia.org/ontology/endowment',
  '1.739E9'],
 ['http://dbpedia.org/resource/U.S.',
  'http://dbpedia.org/ontology/isCountryOf',
  '24659fae-42b2-481c-9175-26a2c106a692'],
 [

# Next steps : Creating Synthetic Instances and saving them

(1) We saw how to create a synthetic instance given an existing instance of the target class and a number of differences.

(2) The user can then generate a set of synthetic instances:
- keep track of the pairs (original instance, synthetic instance) for each number of differences
- create a set of instances for a range of numbers of differences to observe the effect on the rules
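Step (2) can be sketched as a loop over a range of numbers of differences that records each (original, synthetic) pair and accumulates the new triples. This is a minimal sketch under assumptions: `generate_synthetic_set_sketch` is a hypothetical helper, and `generate_one` stands for any callable returning `(synthetic_uri, triples)`, for example a wrapper around `get_synthetic_instance_from_original`. A fake generator is used here so the snippet runs on its own.

```python
import numpy as np

def generate_synthetic_set_sketch(X, instances, differences_range, generate_one):
    """Generate one synthetic instance per (original, number of differences).

    Returns the list of (original, synthetic, n_diff) pairs and the
    augmented triple array (original + synthetic triples).
    """
    pairs = []
    new_triples = []
    for n_diff in differences_range:
        for original in instances:
            synthetic_uri, triples = generate_one(X, original, n_diff)
            pairs.append((original, synthetic_uri, n_diff))
            new_triples.extend(triples)
    # reshape keeps a (0, 3) shape when nothing was generated
    added = np.array(new_triples, dtype=object).reshape(-1, 3)
    X_augmented = np.vstack([X, added])
    return pairs, X_augmented

def fake_generate_one(X, original, n_diff):
    # stand-in for the real generator: one made-up triple per synthetic instance
    synthetic = f"{original}_synthetic_{n_diff}"
    return synthetic, [[synthetic, 'ex:derivedFrom', original]]

toy_X = np.array([
    ['ex:w1', 'ex:genre', 'Novel'],
    ['ex:w2', 'ex:genre', 'Poetry'],
], dtype=object)
pairs, X_augmented = generate_synthetic_set_sketch(
    toy_X, ['ex:w1', 'ex:w2'], range(1, 4), fake_generate_one
)
print(len(pairs), X_augmented.shape)
```

Storing the number of differences with each pair makes it possible to group distances by that number later on.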

(3) The last step is presented in the next tutorial (Tutorial2_Mining_DCRules.ipynb) and consists of :
- training a KG embedding model on the whole dataset (original + synthetic instances)
- computing the distance between pairs of instances for a given number of differences to define a matching threshold
- selecting pairs of instances according to a treatment, obtaining their outcomes and computing the rule metrics
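To make the second bullet concrete, the sketch below computes Euclidean distances between the embeddings of (original, synthetic) pairs and derives a threshold from them. Both choices are assumptions made for illustration: the distance and the use of the mean as the threshold are not taken from the next tutorial, and the toy embeddings are made up rather than produced by a trained model.

```python
import numpy as np

def pair_distances_sketch(embeddings, pairs):
    """Euclidean distance between the embeddings of each (a, b) pair."""
    return np.array([np.linalg.norm(embeddings[a] - embeddings[b]) for a, b in pairs])

# made-up 2-dimensional embeddings; a trained KG embedding model would supply these
toy_embeddings = {
    'w1': np.array([0.0, 0.0]),
    'w1_syn': np.array([3.0, 4.0]),
    'w2': np.array([1.0, 1.0]),
    'w2_syn': np.array([1.0, 2.0]),
}
toy_pairs = [('w1', 'w1_syn'), ('w2', 'w2_syn')]

distances = pair_distances_sketch(toy_embeddings, toy_pairs)
# assumption: the mean pair distance is used as the matching threshold
threshold = float(distances.mean())
print(distances, threshold)
```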