# Part 2: Knowledge Graph Generation

[![Notebook](https://shields.io/badge/notebook-access-green?logo=jupyter&style=for-the-badge)](https://github.com/cognitedata/neat/blob/main/docs/tutorial/notebooks/part-2-knowledge-graph-generation.ipynb)


* author: Nikola Vasiljevic
* date: 2023-04-03

This notebook is Part 2 of the NEAT onboarding tutorial. If you have not completed the previous part(s), we strongly suggest you do so before starting this one.

Often we do not have a knowledge graph per se. Instead, we have scattered and unconnected information that needs to be brought together to form one.
It is also often useful to test how a knowledge graph based on a certain data model will behave, for example how queries perform on a very large knowledge graph or on a very deep one (many hops).

NEAT can help in both cases, but it requires that you have already defined a data model through `Transformation Rules`.

For this purpose we have prepared simple transformation rules, which you can download [using this link](https://github.com/cognitedata/neat/blob/main/cognite/neat/examples/rules/power-grid-example.xlsx). The same Excel sheet is also available directly as part of the `examples` module of `neat`, as demonstrated in the Part 1 tutorial.


In this notebook we will demonstrate both scenarios, basing the demonstration on the data model defined in Part 1.
Accordingly, we will cover:

1. Generation of a knowledge graph using the Graph Capturing Sheet generated by NEAT
2. Generation of mock knowledge graphs of arbitrary size using the mock module


Let's import all necessary libraries:

In [1]:
from cognite.neat.rules import parse_rules_from_excel_file
from cognite.neat.graph.extractors import NeatGraphStore
from cognite.neat.graph.extractors import extract_graph_from_sheet

from cognite.neat.rules.examples import power_grid_model
from cognite.neat.rules.exporter.rules2graph_sheet import rules2graph_capturing_sheet

from cognite.neat.rules.analysis import get_defined_classes

# from cognite.neat.core import loader, parser, extractors
from cognite.neat.utils.utils import remove_namespace, add_triples
from cognite.neat.graph.extractors.mocks.graph import generate_triples, _rules_to_dict

%reload_ext autoreload
%autoreload 2


Since we already defined the data model in Part 1, we will load it and use it for the rest of this notebook.

Here we parse the transformation rules, which contain the data model definition, from the example Excel file referenced by `power_grid_model`:

In [2]:
transformation_rules = parse_rules_from_excel_file(power_grid_model)

Let's now take a look and see how many defined classes we have:

In [3]:
get_defined_classes(transformation_rules)

{'GeographicalRegion', 'SubGeographicalRegion', 'Substation', 'Terminal'}

Let's now inspect the properties of one of the classes. Here we can see that the class `Substation` has two properties. The first property, `name`, holds a value of type string. In semantic data modeling this kind of property is known as a datatype property; in general graph theory it is a node attribute, where a node is equivalent to a class instance. The second property, `subGeographicalRegion`, holds a link to a `SubGeographicalRegion` instance. In semantic data modeling this kind of property is known as an object property; in general graph theory it represents an edge that connects nodes of two types.

In [4]:
_rules_to_dict(transformation_rules)['Substation']

| | property_type | value_type | min_count | max_count |
|---|---|---|---|---|
| name | DatatypeProperty | string | 1 | 1 |
| subGeographicalRegion | ObjectProperty | SubGeographicalRegion | 1 | 1 |
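
To make the distinction between the two kinds of property concrete, here is a minimal sketch in plain Python that does not use NEAT. The `Node` dataclass, the identifiers `Substation-1` and `SubGeographicalRegion-1`, and the substation name are hypothetical and chosen only for illustration; the property names come from the output above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A class instance, i.e. a node in the graph (illustrative only, not a NEAT type)."""

    identifier: str
    class_name: str
    attributes: dict = field(default_factory=dict)  # datatype properties: literal values
    edges: dict = field(default_factory=dict)  # object properties: identifiers of other nodes


region = Node("SubGeographicalRegion-1", "SubGeographicalRegion")
substation = Node(
    "Substation-1",
    "Substation",
    attributes={"name": "Example substation"},
    edges={"subGeographicalRegion": region.identifier},
)
nodes = {node.identifier: node for node in (region, substation)}

# Following the edge leads from the substation to its sub-geographical region.
linked = nodes[substation.edges["subGeographicalRegion"]]
print(linked.class_name)
```

The attribute stays on the node itself, while the edge only stores the identifier of another node, which is what later allows multi-hop traversal.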


Now we can use the above data model to generate what we call a `Graph Capturing Sheet`, a tailored Excel file containing:
- a sheet for each of the defined classes
- columns corresponding to each property defined in the data model

This sheet is generated using the function `rules2graph_capturing_sheet`, imported above from `cognite.neat.rules.exporter.rules2graph_sheet`. The function takes the following arguments:

- `transformation_rules`: instance of the transformation rules containing the data model definition
- `file_path`: path where the graph capturing sheet should be stored
- `no_rows`: expected maximum number of rows in each sheet, i.e. the maximum number of instances of any defined class; defaults to 10000
- `auto_identifier_type`: type of automatic identifier created for each class instance; defaults to `index`, meaning index-based identifiers where the index is the row number
- `add_drop_down_list`: flag indicating whether to provide a drop-down selection of identifiers (i.e. links) for object properties (i.e. edges)

We will use the default argument values, meaning automatic index-based identifiers, 10 000 rows, and drop-down menus for object properties:

In [5]:
rules2graph_capturing_sheet(transformation_rules, "./power-grid-graph-capture.xlsx")
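
The layout described above (one sheet per class, one column per property, index-based identifiers) can be sketched in plain Python without writing an Excel file. This is only an illustration, not how `rules2graph_capturing_sheet` is implemented: the `sheet_layout` helper, the `identifier` column name, the `<class>-<row>` identifier format, and the properties listed for `SubGeographicalRegion` are all assumptions.

```python
# Simplified, hypothetical stand-in for the data model: class name -> property names.
# Only the `Substation` properties are taken from the tutorial output.
data_model = {
    "SubGeographicalRegion": ["name"],  # assumed for illustration
    "Substation": ["name", "subGeographicalRegion"],
}


def sheet_layout(data_model, no_rows=3):
    """Return {sheet name: rows}; the first row of each sheet is the header."""
    layout = {}
    for class_name, properties in data_model.items():
        header = ["identifier", *properties]
        # Index-based identifiers: the row number makes every identifier unique.
        rows = [
            [f"{class_name}-{index}"] + [""] * len(properties)
            for index in range(1, no_rows + 1)
        ]
        layout[class_name] = [header, *rows]
    return layout


layout = sheet_layout(data_model)
for sheet_name, rows in layout.items():
    print(sheet_name, rows[0])
```

Each empty cell is a slot that the user later fills in, either with a literal value or with the identifier of another instance.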

The video below shows what the generated graph capturing sheet looks like and how a graph is captured with it.


<video src="../../videos/tutorial-2-graph-capturing-sheet.mp4" controls>
</video>


A row in a sheet represents an instance of a class. As one enters a value for the property in column `B`, the identifier is added automatically.
As we define instances, their identifiers become available in the drop-down menus for properties that are "edges" between "nodes". By connecting "nodes" we build a knowledge graph.

Let's now convert the filled graph capturing sheet into a knowledge graph. First we create an empty graph store object, then extract triples from the filled sheet using the data model defined in the transformation rules, and finally add those triples to the graph store:

In [6]:
graph_store = NeatGraphStore(prefixes=transformation_rules.prefixes, 
                             namespace=transformation_rules.metadata.namespace)

graph_store.init_graph(base_prefix=transformation_rules.metadata.prefix)


triples = extract_graph_from_sheet("./examples/power-grid-example.xlsx", transformation_rules)


add_triples(graph_store, triples)
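
Conceptually, extracting a graph from the sheet means turning every filled row into a set of (subject, predicate, object) triples. The sketch below shows that idea in plain Python under simplifying assumptions: the `rows_to_triples` helper, the `identifier` column name, the example values, and the short `rdf:type` label are hypothetical and are not taken from `extract_graph_from_sheet`.

```python
RDF_TYPE = "rdf:type"  # shortened label standing in for the full RDF type URI

# Hypothetical filled-in sheets: sheet name -> rows, each row keyed by column name.
sheets = {
    "SubGeographicalRegion": [
        {"identifier": "SubGeographicalRegion-1", "name": "North"},
    ],
    "Substation": [
        {
            "identifier": "Substation-1",
            "name": "Alpha",
            "subGeographicalRegion": "SubGeographicalRegion-1",
        },
    ],
}


def rows_to_triples(sheets):
    """Turn every row into one type triple plus one triple per filled cell."""
    triples = []
    for class_name, rows in sheets.items():
        for row in rows:
            subject = row["identifier"]
            triples.append((subject, RDF_TYPE, class_name))
            for column, value in row.items():
                if column != "identifier" and value not in ("", None):
                    triples.append((subject, column, value))
    return triples


sheet_triples = rows_to_triples(sheets)
for triple in sheet_triples:
    print(triple)
```

A real extractor also has to map names to URIs in the model namespace and distinguish literals from links, which this sketch leaves out.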

To check the graph content we can run a `SPARQL` query that counts all class instances:
```
SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)
```

When processing the results, we purposely remove the namespaces from the class names:

In [7]:
for res in list(graph_store.graph.query("SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)")):
    print(f"{remove_namespace(res[0]):25} {res[1]}" )

GeographicalRegion        2
SubGeographicalRegion     2
Substation                2
Terminal                  2
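
For readers less familiar with SPARQL, the same count can be expressed in plain Python over a list of triples. This is a sketch of what the query computes, not of how the graph store evaluates it: the namespace `http://example.org/power-grid#`, the sample triples, and the `strip_namespace` helper (a stand-in for `remove_namespace`, whose implementation is not shown here) are assumptions.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NS = "http://example.org/power-grid#"  # hypothetical namespace

sample_triples = [
    (NS + "Substation-1", RDF_TYPE, NS + "Substation"),
    (NS + "Substation-2", RDF_TYPE, NS + "Substation"),
    (NS + "Terminal-1", RDF_TYPE, NS + "Terminal"),
    (NS + "Substation-1", NS + "name", "Alpha"),  # not a type triple, so not counted
]


def strip_namespace(uri):
    """Keep only the part after the last '#' or '/' (illustrative helper)."""
    return uri.replace("#", "/").rsplit("/", 1)[-1]


# `?s a ?class` matches rdf:type triples; GROUP BY with COUNT becomes a Counter.
counts = Counter(obj for _, pred, obj in sample_triples if pred == RDF_TYPE)

# `most_common()` sorts by descending count, like ORDER BY DESC(?instances).
for class_uri, instances in counts.most_common():
    print(f"{strip_namespace(class_uri):25} {instances}")
```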


As expected, we have two instances of each class, which we captured through the graph capturing sheet.
This completes the first scenario: using NEAT to create a knowledge graph when one does not exist.

In the second scenario we will use the data model to generate a mock graph.
We do this by configuring the desired number of instances for each of the above classes.
We store the desired number of instances in a dictionary that we call `class_count`:

In [8]:
class_count = {"GeographicalRegion":5, 
               "SubGeographicalRegion":10, 
               "Substation": 20, 
               "Terminal": 60}

To generate the mock graph we re-initialize an empty graph store, in which we will store the triples that represent our mock graph:

In [9]:
graph_store = NeatGraphStore(prefixes=transformation_rules.prefixes, 
                             namespace=transformation_rules.metadata.namespace)
graph_store.init_graph(base_prefix=transformation_rules.metadata.prefix)

We will create the triples and then add them to the graph.

The triples are created by passing our data model and the desired number of instances per class, in the form of a dictionary, to the function `generate_triples`.

Afterwards, we add those triples to our graph using the function `add_triples`:

In [10]:
mock_triples = generate_triples(transformation_rules, class_count)
add_triples(graph_store, mock_triples)
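
To give an intuition for what mock generation involves, the sketch below creates the requested number of instances per class and links them at random, using only the standard library. It is an assumption-laden illustration and not the implementation of `generate_triples`: the `generate_mock_triples` helper, the identifier format, the seeded random linking, and the reduced two-class model are all hypothetical.

```python
import random

RDF_TYPE = "rdf:type"  # shortened label standing in for the full RDF type URI

# Only the single link known from the `Substation` output is modelled here.
object_properties = {"Substation": {"subGeographicalRegion": "SubGeographicalRegion"}}
example_class_count = {"SubGeographicalRegion": 10, "Substation": 20}


def generate_mock_triples(class_count, object_properties, seed=0):
    rng = random.Random(seed)  # seeded so the mock graph is reproducible
    # 1) Create the requested number of instances for every class.
    instances = {
        cls: [f"{cls}-{index}" for index in range(1, count + 1)]
        for cls, count in class_count.items()
    }
    triples = [(s, RDF_TYPE, cls) for cls, ids in instances.items() for s in ids]
    # 2) Link each instance to a randomly chosen instance of the target class.
    for cls, links in object_properties.items():
        for subject in instances[cls]:
            for prop, target_cls in links.items():
                triples.append((subject, prop, rng.choice(instances[target_cls])))
    return triples


example_triples = generate_mock_triples(example_class_count, object_properties)
print(len(example_triples))
```

Because the instance counts are plain dictionary values, scaling the mock graph up for performance tests only requires changing those numbers.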

After successfully creating and adding the mock triples, let's take a look at the graph and check that it contains the expected number of class instances:

In [11]:
for res in list(graph_store.graph.query("SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)")):
    print(f"{remove_namespace(res[0]):25} {res[1]}" )

Terminal                  60
Substation                20
SubGeographicalRegion     10
GeographicalRegion        5
