# Dataset Creation

This notebook creates a `Dataset` from a knowledge graph.

A small subgraph of DBpedia is stored in the OpenKE format in `data/dbpedia/toy`. This subgraph, denoted `DBP50K`, contains:
- 54,795 entities
- 776 relations
- 316,114 triples

In this notebook, we load this graph, create a `Dataset` by sampling instances from it, and store the result on disk. This illustrates the use of the `KnowledgeGraph`, `Taxonomy`, and `Dataset` classes.

**Step 1** Load the graph from disk

In [1]:
from libs.graph import KnowledgeGraph

kg = KnowledgeGraph.from_dir("toy")

Triples: 100%|██████████| 316114/316114 [00:02<00:00, 118048.13it/s]


**Step 2** In our setting, the graph contains no taxonomic information, i.e., no `rdfs:subClassOf` relations.
The gold standard must therefore come from an external source. Here, we use the axioms stored in `data/taxonomy/toy.txt`.

In [5]:
from libs.taxonomy import Taxonomy

T = Taxonomy.from_file("data/taxonomy/toy.txt", add_root="root")
T.print()

     ┌dbo:Location┐
     │            └dbo:PopulatedPlace┐
     │                               └dbo:Settlement
 root┤
     │         ┌dbo:Organisation
     ├dbo:Agent┤
     │         └dbo:Person┐
     │                    └dbo:Athlete
     │         ┌dbo:SocietalEvent
     └dbo:Event┤
               └dbo:SportsEvent
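
To make the structure of this tree concrete, the sketch below re-encodes it as a plain child-to-parent mapping and derives superclass chains and direct subclass pairs from it. The names `PARENT`, `ancestors`, and `subclass_axioms` are hypothetical and are not part of `libs.taxonomy`; the format actually returned by `T.to_axioms()` is not shown in this notebook, so the `(subclass, superclass)` pairs here are only an assumption about what such axioms might look like.

```python
# Hypothetical encoding of the tree printed above: child -> parent.
# This is NOT the internal representation used by libs.taxonomy.
PARENT = {
    "dbo:Location": "root",
    "dbo:PopulatedPlace": "dbo:Location",
    "dbo:Settlement": "dbo:PopulatedPlace",
    "dbo:Agent": "root",
    "dbo:Organisation": "dbo:Agent",
    "dbo:Person": "dbo:Agent",
    "dbo:Athlete": "dbo:Person",
    "dbo:Event": "root",
    "dbo:SocietalEvent": "dbo:Event",
    "dbo:SportsEvent": "dbo:Event",
}


def ancestors(name, parent=PARENT):
    """Return the superclasses of `name`, nearest first, ending at the root."""
    chain = []
    while name in parent:
        name = parent[name]
        chain.append(name)
    return chain


def subclass_axioms(parent=PARENT):
    """Return direct (subclass, superclass) pairs, skipping the artificial root."""
    return sorted((c, p) for c, p in parent.items() if p != "root")


print(ancestors("dbo:Athlete"))   # ['dbo:Person', 'dbo:Agent', 'root']
print(len(subclass_axioms()))     # 7
```

Three of the ten classes sit directly under the artificial `root`, which leaves seven direct subclass pairs between real classes.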


**Step 3** For each class in the taxonomy, sample entities and add them to the dataset

In [8]:
classes = {cls.name for cls in T if not cls.is_root}
classes

{'dbo:Agent',
 'dbo:Athlete',
 'dbo:Event',
 'dbo:Location',
 'dbo:Organisation',
 'dbo:Person',
 'dbo:PopulatedPlace',
 'dbo:Settlement',
 'dbo:SocietalEvent',
 'dbo:SportsEvent'}

In [10]:
from libs.dataset import Dataset
from sklearn.utils import shuffle

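n_classes = len(classes)  # 10 classes in the toy taxonomy
n_entities = 200          # instances sampled per class

used_indices = set()  # entities already assigned to a class
indices = []
labels = []
name2cls, cls2name = dict(), dict()

for name in classes:
    # Assign the next free integer id to this class
    cls = len(name2cls)
    name2cls[name] = cls
    cls2name[cls] = name

    # Sample entities of this type, excluding those already assigned to another class
    for instance in kg.sample_instances(n_entities, from_type=name, exclude_ids=used_indices):
        used_indices.add(instance)
        indices.append(instance)
        labels.append(cls)

# Shuffle instances and labels in unison so the pairs stay aligned
indices, labels = shuffle(indices, labels)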

data = Dataset(indices, labels, name2cls, cls2name, axioms=T.to_axioms())

The dataset is now created. A summary can be printed with:

In [13]:
print(data.summary())

Dataset (10 classes, 2000 instances):
---
dbo:Settlement       200
dbo:Location         200
dbo:SportsEvent      200
dbo:PopulatedPlace   200
dbo:Athlete          200
...
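
The sampling loop in Step 3 can be tried without the DBpedia files. The following is a minimal sketch under stated assumptions: `TOY_TYPES` is a hypothetical stand-in for the knowledge graph, and the local `sample_instances` function only mimics what `kg.sample_instances` appears to do from its arguments (sample up to `n` entities of a type while skipping excluded ids); the real method may behave differently. The standard-library `random` module replaces `sklearn.utils.shuffle` so that the block has no dependencies.

```python
import random
from collections import Counter

# Hypothetical stand-in for the knowledge graph: class name -> entity ids.
# Members of a subclass also appear under its superclass.
TOY_TYPES = {
    "dbo:Agent": set(range(0, 40)),
    "dbo:Person": set(range(0, 20)),
    "dbo:Organisation": set(range(20, 40)),
}


def sample_instances(n, from_type, exclude_ids, rng):
    """Sample up to `n` ids of type `from_type` that are not in `exclude_ids`."""
    candidates = sorted(TOY_TYPES[from_type] - exclude_ids)
    return rng.sample(candidates, min(n, len(candidates)))


def build_samples(class_names, n_per_class, seed=0):
    """Draw disjoint samples per class and return shuffled (indices, labels)."""
    rng = random.Random(seed)
    used, indices, labels = set(), [], []
    name2cls = {}
    for name in class_names:
        cls = len(name2cls)          # next free integer id
        name2cls[name] = cls
        for instance in sample_instances(n_per_class, name, used, rng):
            used.add(instance)       # never reuse an entity for another class
            indices.append(instance)
            labels.append(cls)
    # Shuffle the (index, label) pairs together so they stay aligned.
    paired = list(zip(indices, labels))
    rng.shuffle(paired)
    indices = [i for i, _ in paired]
    labels = [l for _, l in paired]
    return indices, labels, name2cls


indices, labels, name2cls = build_samples(
    ["dbo:Person", "dbo:Organisation", "dbo:Agent"], n_per_class=5
)
print(len(set(indices)), Counter(labels))
```

Because already-used ids are excluded, each entity receives exactly one label even when its types overlap, and in this sketch the label it receives depends on the order in which the classes are visited.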


**Step 4** Save the dataset. It can then be reloaded with `Dataset.load('data/dataset/toy')`.

In [11]:
import os

dirname = "data/dataset/toy"
# Save only if the target directory does not exist yet
if not os.path.exists(dirname):
    data.save(dirname)
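
One consequence of the existence check is that re-running the notebook does not replace a dataset that is already on disk, even though the sampling in Step 3 is random. The sketch below illustrates this pattern in isolation. The `save_once` helper and the JSON file layout are hypothetical and say nothing about how `Dataset.save` actually serializes its data.

```python
import json
import tempfile
from pathlib import Path


def save_once(payload, dirname):
    """Write `payload` to `dirname/dataset.json` unless the directory exists.

    Returns True if the data was written, False if the save was skipped.
    """
    target = Path(dirname)
    if target.exists():
        return False
    target.mkdir(parents=True)
    (target / "dataset.json").write_text(json.dumps(payload))
    return True


with tempfile.TemporaryDirectory() as tmp:
    dirname = Path(tmp) / "dataset" / "toy"
    first = save_once({"labels": [0, 1]}, dirname)
    second = save_once({"labels": [1, 0]}, dirname)  # skipped: directory exists
    stored = json.loads((dirname / "dataset.json").read_text())
    print(first, second, stored)
```

To store a freshly sampled dataset under the same path, the existing directory would have to be removed or a different `dirname` chosen first.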