# Part 3: Knowledge Graph to Asset Hierarchy


[![Notebook](https://shields.io/badge/notebook-access-green?logo=jupyter&style=for-the-badge)](https://github.com/cognitedata/neat/blob/main/docs/tutorial/part-3-knowledge-graph-to-asset-hierarchy.ipynb)


* author: Nikola Vasiljevic
* date: 2023-05-11

Up until this notebook you did not need to have a [Cognite Data Fusion](https://www.cognite.com/en/product/cognite_data_fusion_industrial_dataops_platform) (CDF) instance running. For this notebook, however, CDF is a prerequisite, as is a client configured to interact with CDF. How to configure CDF and the client is beyond the scope of this notebook and tutorial. We suggest you find the relevant information at the [Cognite Developer Center](https://developer.cognite.com/dev/#authenticate).

In Part 3 of the tutorial we describe how NEAT generates an asset hierarchy (i.e., CDF classic graph) from an RDF graph (i.e., knowledge graph), as depicted in the following image:

<img src="../../figs/rdf2cdf-graph.jpg"  width="75%" alt="RDF2CDF">


The image above shows the high-level flow from an RDF graph to a CDF graph. As we can see, an RDF graph can be decomposed into:
- nodes (i.e. instances of specific classes)
- edges (i.e., relationships that connect nodes)

On the other hand, a CDF graph based on the asset-centric data model (aka classic CDF) consists of:
- assets
- asset hierarchy 
- relationships among assets

Accordingly, based on the transformation rules, NEAT converts:
- RDF nodes to CDF assets
- certain type(s) of RDF edges to CDF asset hierarchy
- certain type(s) of RDF edges to CDF relationships between assets 
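
The mapping above can be illustrated with a minimal, self-contained sketch. It is not NEAT's implementation: the triple layout, the predicate names (`hasParent`, `connectedTo`) and the helper `split_graph` are hypothetical, chosen only to show how nodes become assets while different edge types become either hierarchy links or relationships.

```python
# Illustrative sketch only; all names below are hypothetical, not NEAT's API.
RDF_TYPE = "rdf:type"

# Hypothetical rules: which predicates define the hierarchy and which define relationships.
HIERARCHY_PREDICATES = {"hasParent"}
RELATIONSHIP_PREDICATES = {"connectedTo"}


def split_graph(triples):
    """Split (subject, predicate, object) triples into assets, hierarchy and relationships."""
    assets = {}         # node id -> class name
    parent_of = {}      # child id -> parent id (asset hierarchy)
    relationships = []  # (source id, target id) pairs

    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            assets[subject] = obj
        elif predicate in HIERARCHY_PREDICATES:
            parent_of[subject] = obj
        elif predicate in RELATIONSHIP_PREDICATES:
            relationships.append((subject, obj))
    return assets, parent_of, relationships


triples = [
    ("Substation-1", RDF_TYPE, "Substation"),
    ("Terminal-1", RDF_TYPE, "Terminal"),
    ("Terminal-1", "hasParent", "Substation-1"),
    ("Terminal-1", "connectedTo", "Substation-1"),
]

assets, parent_of, relationships = split_graph(triples)
print(assets)
print(parent_of)
print(relationships)
```

In NEAT itself, which edges feed the hierarchy and which feed relationships is decided by the transformation rules rather than by hard-coded sets as in this sketch.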


First, download the transformation rules from this [link](https://github.com/cognitedata/neat/blob/main/cognite/neat/examples/rules/source-to-solution-mapping-rules.xlsx) and place them at an appropriate location. Alternatively, if you have cloned the `neat` repository, you can find the file at:

 `./cognite/neat/examples/rules/source-to-solution-mapping-rules.xlsx`.


These transformation rules are a bit more extensive than in the previous parts and contain a more detailed data model. Their details will be covered in Part 4.


Also, for convenience, store the configuration of the Cognite client in a `.env` file with the following structure:

```

TENANT_ID = ...
CLIENT_ID = ...
CLIENT_SECRET = ...
CDF_CLUSTER = ...
COGNITE_PROJECT = ...

```

This file will be loaded as a config dictionary and used to configure the Cognite client.
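
Because a missing or empty value in `.env` tends to surface only later as an authentication failure, it can help to check the loaded dictionary first. The sketch below assumes the five keys listed above; the helper `missing_keys` is hypothetical and a plain dictionary stands in for the one loaded from the file.

```python
REQUIRED_KEYS = ("TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "CDF_CLUSTER", "COGNITE_PROJECT")


def missing_keys(config):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Stand-in for the dictionary loaded from .env; the values are placeholders.
example_config = {
    "TENANT_ID": "my-tenant",
    "CLIENT_ID": "my-client",
    "CLIENT_SECRET": "",          # present but empty
    "CDF_CLUSTER": "my-cluster",  # COGNITE_PROJECT is missing entirely
}

print(missing_keys(example_config))
```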



Let's import all the necessary libraries, create the CDF client and build the mock RDF graph that we will process:

In [2]:
from pathlib import Path

from cognite.client import CogniteClient
from cognite.client.credentials import OAuthClientCredentials
from cognite.client.config import ClientConfig, global_config


from cognite.neat.workflows.examples import source_to_solution_mapping

from cognite.neat.rules import load_rules_from_excel_file
from cognite.neat.graph.stores import NeatGraphStore

from cognite.neat.graph.loaders import rdf2assets, categorize_assets
from cognite.neat.graph.loaders import rdf2relationships, categorize_relationships
from cognite.neat.graph.loaders import upload_labels, upload_assets, upload_relationships

from cognite.neat.utils.utils import remove_namespace, add_triples
from cognite.neat.graph.extractors.mocks.graph import generate_triples

from dotenv import dotenv_values


In [55]:
# To use a .env file outside the working directory, pass its path to dotenv_values()

config = dotenv_values() or dict(TENANT_ID="", CLIENT_ID="", CLIENT_SECRET="", CDF_CLUSTER="", COGNITE_PROJECT="")

SCOPES = [f"https://{config['CDF_CLUSTER']}.cognitedata.com/.default"]
TOKEN_URL = f"https://login.microsoftonline.com/{config['TENANT_ID']}/oauth2/v2.0/token"

credentials = OAuthClientCredentials(token_url=TOKEN_URL, 
                                     client_id=config['CLIENT_ID'], 
                                     client_secret=config['CLIENT_SECRET'], 
                                     scopes=SCOPES)

client_config = ClientConfig(client_name="cognite",
                             base_url=f"https://{config['CDF_CLUSTER']}.cognitedata.com",
                             project=config['COGNITE_PROJECT'],
                             credentials=credentials,
                             max_workers=1,
                             timeout=5 * 60,)

client = CogniteClient(client_config)

Let's now load the transformation rules and check which classes are defined:

In [13]:
# Replace `source_to_solution_mapping` with Path to your own transformation rules
transformation_rules = load_rules_from_excel_file(source_to_solution_mapping)
transformation_rules.get_defined_classes()

{'GeographicalRegion',
 'Orphanage',
 'RootCIMNode',
 'SubGeographicalRegion',
 'Substation',
 'Terminal'}

Let's now configure the desired number of instances for each of the above classes. We store these numbers in a dictionary called `class_count`:

In [14]:
class_count = {"RootCIMNode":1, 
               "GeographicalRegion":5, 
               "SubGeographicalRegion":10, 
               "Substation": 20, 
               "Terminal": 60}

To generate the mock graph, we first create an empty graph in which we will store the triples that represent our mock graph:

In [15]:
graph_store = NeatGraphStore(prefixes=transformation_rules.prefixes, 
                             namespace=transformation_rules.metadata.namespace)
graph_store.init_graph(base_prefix=transformation_rules.metadata.prefix)

We will create triples and then add them to the graph.

The triples are created by passing our data model and the desired number of instances per class, in the form of a dictionary, to the function `generate_triples`. Afterwards, we add those triples to our graph using the function `add_triples`:

In [16]:
mock_triples = generate_triples(transformation_rules, class_count)
add_triples(graph_store, mock_triples)

At this point we have an RDF graph stored in memory. We can check whether NEAT really created the desired RDF graph by checking the number of instances (i.e., nodes) per class. We do this by executing a `SPARQL` query against the RDF graph:

```
SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)
```

which counts the number of instances per class:

In [17]:
query = "SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)"
results = list(graph_store.graph.query(query))

for r in results:
    print(f"{r[0]:50} {r[1]}" )

http://purl.org/cognite/tnt#Terminal               60
http://purl.org/cognite/tnt#Substation             20
http://purl.org/cognite/tnt#SubGeographicalRegion  10
http://purl.org/cognite/tnt#GeographicalRegion     5
http://purl.org/cognite/tnt#RootCIMNode            1
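
For readers less familiar with SPARQL, the same group-and-count can be expressed in plain Python. The sketch below is only an analogy: it works on a small hand-written list of `(subject, predicate, object)` tuples rather than on the NEAT graph store, and the name `count_instances` is hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"  # stands in for the `a` keyword used in the SPARQL query


def count_instances(triples):
    """Count type triples per class, ordered from most to least frequent."""
    counts = Counter(obj for _, predicate, obj in triples if predicate == RDF_TYPE)
    return counts.most_common()


triples = [
    ("Terminal-1", RDF_TYPE, "Terminal"),
    ("Terminal-2", RDF_TYPE, "Terminal"),
    ("Substation-1", RDF_TYPE, "Substation"),
    ("Terminal-1", "connectedTo", "Substation-1"),  # not a type triple, ignored
]

for class_name, instances in count_instances(triples):
    print(f"{class_name:15} {instances}")
```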


Now let's assume that our CDF dataset does not contain any assets, asset hierarchy or relationships. Accordingly, NEAT will go through all the steps shown in the image below to produce the CDF graph.
In the first step, we will call:
- `rdf2assets`, which produces the candidate assets to be uploaded to CDF, as well as the asset hierarchy
- `categorize_assets`, which categorizes these candidate assets against CDF, splitting them into those that are to be `created`, `updated`, `decommissioned` and `resurrected`
- `upload_assets`, which uploads the categorized assets to CDF


In the subsequent step, we will call:
- `rdf2relationships`, which produces the candidate relationships to be uploaded to CDF
- `categorize_relationships`, which categorizes these candidate relationships against CDF, splitting them into those that are to be `created`, `decommissioned` and `resurrected`. This function checks both the existence and the state of existing relationships and assets
- `upload_relationships`, which uploads the categorized relationships to CDF

<img src="../../figs/rdf2cdf-init-run.jpg"  width="50%" alt="RDF2CDF">
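
Conceptually, categorization compares the candidates with what already exists in CDF. The following sketch shows one way such a comparison could work for assets. It is an illustration built on assumptions, not NEAT's code: assets are reduced to dictionaries keyed by external id, the `active` flag stands in for the asset status, and `categorize` is a hypothetical helper.

```python
def categorize(candidates, existing):
    """Compare candidate assets with existing ones.

    candidates: external id -> metadata dict (derived from the RDF graph)
    existing:   external id -> {"metadata": dict, "active": bool} (state in CDF)
    """
    categories = {"create": [], "update": [], "resurrect": [], "decommission": []}

    for external_id, metadata in candidates.items():
        current = existing.get(external_id)
        if current is None:
            categories["create"].append(external_id)
        elif not current["active"]:
            categories["resurrect"].append(external_id)
        elif current["metadata"] != metadata:
            categories["update"].append(external_id)

    # Active assets that no longer appear among the candidates are decommissioned.
    for external_id, current in existing.items():
        if external_id not in candidates and current["active"]:
            categories["decommission"].append(external_id)

    return categories


existing = {
    "Substation-1": {"metadata": {"name": "S1"}, "active": True},
    "Substation-2": {"metadata": {"name": "S2"}, "active": True},
    "Substation-3": {"metadata": {"name": "S3"}, "active": False},
}
candidates = {
    "Substation-1": {"name": "S1 renamed"},  # changed metadata
    "Substation-3": {"name": "S3"},          # previously decommissioned
    "Substation-4": {"name": "S4"},          # new
}

print(categorize(candidates, existing))
```

With an empty `existing` dictionary, which corresponds to the empty dataset assumed here, every candidate ends up under `create`.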

Now let's run each step and inspect the results. When running `rdf2assets`, an ERROR will be logged regarding `Orphanage`, a special asset that NEAT expects to find in the RDF graph. Since we defined it in the transformation rules but did not create it in the graph, NEAT logs this as an error and also fixes the issue by creating this asset:

In [18]:
candidate_assets = rdf2assets(graph_store, transformation_rules)


ERROR:root:Error while loading instances of class <http://purl.org/cognite/tnt#Orphanage> into cache. Reason: 'instance'


As expected, the total number of assets is 97: the 96 instances we have in the RDF graph plus one additional special asset for Orphanage.

In [19]:
print(f"Total number of assets extracted: {len(candidate_assets)}")

Total number of assets extracted: 97
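
The expected total can be cross-checked directly against the `class_count` dictionary defined earlier. The snippet repeats that dictionary so it runs on its own; the variable names `instances_in_graph` and `expected_assets` are introduced here only for illustration.

```python
class_count = {
    "RootCIMNode": 1,
    "GeographicalRegion": 5,
    "SubGeographicalRegion": 10,
    "Substation": 20,
    "Terminal": 60,
}

instances_in_graph = sum(class_count.values())  # 1 + 5 + 10 + 20 + 60
expected_assets = instances_in_graph + 1        # plus the Orphanage asset added by NEAT

print(instances_in_graph, expected_assets)
```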


Let's now categorize and upload the assets. This completes our first step, i.e., the creation of the assets and the asset hierarchy, as shown below:

<img src="../../figs/rdf2cdf-init-run-step1.jpg"  width="50%" alt="RDF2CDF">

We expect to see assets only under the `create` category.

In [21]:
categorized_assets = categorize_assets(client, 
                                       candidate_assets, 
                                       transformation_rules.metadata.data_set_id)


for cat in categorized_assets:
    print(f"Category {cat:15} has {len(categorized_assets[cat]):2} assets")


Category create          has 97 assets
Category update          has  0 assets
Category resurrect       has  0 assets
Category decommission    has  0 assets


Before we upload the assets, we need to create the labels that we use to mark asset and relationship types as well as their status:

In [23]:
upload_labels(client, transformation_rules)

Let's now upload the assets:

In [24]:
upload_assets(client, categorized_assets, batch_size=1000)

Let's quickly inspect the created CDF asset hierarchy:

<video src="../../videos/tutorial-3-asset-hierarchy.mp4" controls>
</video>

If we now re-run `categorize_assets`, no asset will be present under any of the categories. This means that we have successfully uploaded the assets and created the asset hierarchy!

In [26]:
categorized_assets = categorize_assets(client, 
                                       candidate_assets, 
                                       transformation_rules.metadata.data_set_id)


for cat in categorized_assets:
    print(f"Category {cat:15} has {len(categorized_assets[cat]):2} assets")


Category create          has  0 assets
Category update          has  0 assets
Category resurrect       has  0 assets
Category decommission    has  0 assets


Let's now create, categorize and upload relationships. This completes our last step as shown below:

<img src="../../figs/rdf2cdf-init-run-step2.png"  width="50%" alt="RDF2CDF">

We expect to see relationships only under the `create` category.

In [27]:
candidate_relationships = rdf2relationships(graph_store, transformation_rules)


categorized_relationships = categorize_relationships(client, 
                                                     candidate_relationships, 
                                                     transformation_rules.metadata.data_set_id)


for cat in categorized_relationships:
    print(f"Category {cat:15} has {len(categorized_relationships[cat]):2} relationships")

Category create          has 135 relationships
Category resurrect       has  0 relationships
Category decommission    has  0 relationships


In [28]:
upload_relationships(client, categorized_relationships, batch_size=1000)

Let's inspect relationships that are created for one of the objects:

<video src="../../videos/tutorial-3-relationships.mp4" controls>
</video>


As with the assets, we can rerun the categorization of relationships and see that no relationships are to be created, resurrected or decommissioned:

In [29]:
categorized_relationships = categorize_relationships(client, 
                                                     candidate_relationships, 
                                                     transformation_rules.metadata.data_set_id)


for cat in categorized_relationships:
    print(f"Category {cat:15} has {len(categorized_relationships[cat]):2} relationships")

Category create          has  0 relationships
Category resurrect       has  0 relationships
Category decommission    has  0 relationships


Let's now introduce a change in the RDF graph and see how NEAT reacts to it.
Specifically, we will:
- remove the node `Substation-13`
- remove the relationship between the nodes `Substation-3` and `Terminal-3`

This should produce a graph with a reduced number of nodes and edges, as conceptually depicted below:


<img src="../../figs/rdf2cdf-graph-change.jpg"  width="50%" alt="RDF2CDF">

The original state of the RDF graph and its state after removing certain parts.

In [30]:

# Remove all triples that have Substation-13 as their subject
graph_store.graph.remove((transformation_rules.metadata.namespace["Substation-13"], None, None))


# Remove only the relationship between Substation-3 and Terminal-3
graph_store.graph.remove((transformation_rules.metadata.namespace["Substation-3"],
                          None, 
                          transformation_rules.metadata.namespace["Terminal-3"]))

<Graph identifier=Ne992e5442a75415f8931ce910923ab65 (<class 'rdflib.graph.Graph'>)>
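
In the triple patterns above, `None` acts as a wildcard. The sketch below mimics that matching on a plain Python set so the effect of each pattern is easy to see. The helper `remove_matching` and the predicate names are hypothetical; only the wildcard idea is carried over from `graph.remove`.

```python
def remove_matching(triples, pattern):
    """Return the triples that do NOT match `pattern`; None matches anything."""
    def matches(triple):
        return all(p is None or p == value for p, value in zip(pattern, triple))

    return {triple for triple in triples if not matches(triple)}


triples = {
    ("Substation-13", "type", "Substation"),
    ("Substation-13", "hasTerminal", "Terminal-13"),
    ("Terminal-13", "hasSubstation", "Substation-13"),
    ("Substation-3", "hasTerminal", "Terminal-3"),
    ("Substation-3", "type", "Substation"),
}

# Everything with Substation-13 as subject, whatever the predicate and object.
triples = remove_matching(triples, ("Substation-13", None, None))

# Only the edges from Substation-3 to Terminal-3, whatever the predicate.
triples = remove_matching(triples, ("Substation-3", None, "Terminal-3"))

for triple in sorted(triples):
    print(triple)
```

Note that the first pattern only matches triples in which `Substation-13` is the subject. In this sketch, the triple pointing to it from `Terminal-13` therefore remains in the set.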

Let's repeat the previous process to see how the assets and relationships will change.
We expect to see the following results when it comes to assets:

- asset `Substation-13` will be decommissioned, because, unlike in the RDF graph, we never delete assets from the CDF graph
- asset `Substation-3` will be updated, because its metadata will no longer contain the field `Substation.Terminal` now that we have removed this relationship from the RDF graph

In [31]:
candidate_assets = rdf2assets(graph_store, transformation_rules)

categorized_assets = categorize_assets(client, candidate_assets, transformation_rules.metadata.data_set_id)

print(44*"-")
for cat in categorized_assets:
    print(f"Category {cat:15} has {len(categorized_assets[cat]):2} assets")
print(44*"-")


print(f"Asset to decommission {categorized_assets['decommission'][0].external_id}")
print(f"Asset to update {categorized_assets['update'][0].external_id}")

upload_assets(client, categorized_assets, batch_size=1000)

ERROR:root:Error while loading instances of class <http://purl.org/cognite/tnt#Orphanage> into cache. Reason: 'instance'


--------------------------------------------
Category create          has  0 assets
Category update          has  1 assets
Category resurrect       has  0 assets
Category decommission    has  1 assets
--------------------------------------------
Asset to decommission Substation-13
Asset to update Substation-3


When it comes to relationships, we expect to see 7 relationships being decommissioned.
One is the relationship we explicitly removed from the RDF graph, `Substation-3` -> `Terminal-3`, while the remaining 6 are the result of removing `Substation-13`:
- `Substation-13` -> `Terminal-53`
- `Substation-13` -> `Terminal-33`
- `Substation-13` -> `Terminal-13`
- `Terminal-53` -> `Substation-13`
- `Terminal-33` -> `Substation-13`
- `Terminal-13` -> `Substation-13`

In [32]:
candidate_relationships = rdf2relationships(graph_store, transformation_rules)


categorized_relationships = categorize_relationships(client, 
                                                     candidate_relationships, 
                                                     transformation_rules.metadata.data_set_id)

print(44*"-")
for cat in categorized_relationships:
    print(f"Category {cat:15} has {len(categorized_relationships[cat]):2} relationships")
print(44*"-")
    
for relationship in categorized_relationships["decommission"]:
    print(f"Relationship to decommission: {relationship._external_id}")
    
    
upload_relationships(client, categorized_relationships, batch_size=1000)

--------------------------------------------
Category create          has  0 relationships
Category resurrect       has  0 relationships
Category decommission    has  7 relationships
--------------------------------------------
Relationship to decommission: Substation-3:Terminal-3
Relationship to decommission: Substation-13:Terminal-13
Relationship to decommission: Substation-13:Terminal-33
Relationship to decommission: Terminal-13:Substation-13
Relationship to decommission: Terminal-53:Substation-13
Relationship to decommission: Substation-13:Terminal-53
Relationship to decommission: Terminal-33:Substation-13
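
Some of the decommissioned relationships start at a terminal rather than at `Substation-13`. One plausible reading, consistent with the earlier remark that `categorize_relationships` checks the state of both relationships and assets, is that a relationship is retired as soon as either of its endpoints is no longer an active asset. The sketch below illustrates that idea only; it is an assumption about the logic, and `relationships_to_decommission` and its data are hypothetical.

```python
def relationships_to_decommission(existing, candidates, active_assets):
    """Pick existing relationships that should be decommissioned.

    A relationship (source, target) is decommissioned when it is no longer
    among the candidates or when either endpoint is not an active asset.
    """
    return sorted(
        (source, target)
        for source, target in existing
        if (source, target) not in candidates
        or source not in active_assets
        or target not in active_assets
    )


existing = {
    ("Substation-3", "Terminal-3"),
    ("Terminal-3", "Substation-3"),
    ("Substation-13", "Terminal-13"),
    ("Terminal-13", "Substation-13"),
}
# Candidates derived from the changed graph: the Terminal-13 edge is still
# a candidate, but its target asset has been decommissioned.
candidates = {("Terminal-3", "Substation-3"), ("Terminal-13", "Substation-13")}
active_assets = {"Substation-3", "Terminal-3", "Terminal-13"}

for source, target in relationships_to_decommission(existing, candidates, active_assets):
    print(f"{source}:{target}")
```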


Accordingly, we will see that the assets and associated relationships have been decommissioned in CDF, as shown in the video below:

<video src="../../videos/tutorial-3-decommissioning.mp4" controls>
</video>



The entire process of updating the CDF graph is shown below. As demonstrated through the code as well as in the video above, at its core NEAT performs set operations between the RDF and CDF graphs. Specifically, it first finds the difference in nodes (i.e., assets) between the CDF and RDF graphs, which identifies the nodes to be decommissioned (yellow dots). Afterwards, it finds the difference in relationships (i.e., edges) between the two graphs, thus identifying the relationships to decommission (curved yellow lines). It is important to remember that NEAT never deletes an asset or relationship; it only decommissions them.

<img src="../../figs/rdf2cdf-post-init.jpg"  width="100%" alt="RDF2CDF">
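
The "never delete" rule can be pictured as a status that is toggled instead of a record being removed. The class below is a hypothetical sketch of such a soft-delete store, tracking only external ids and their status. None of these names come from NEAT's API, and in practice the status lives on the CDF resources themselves.

```python
class SoftDeleteStore:
    """Keeps every record ever seen and only flips its status."""

    def __init__(self):
        self.status = {}  # external id -> "active" or "decommissioned"

    def sync(self, desired_ids):
        """Bring the store in line with `desired_ids` without deleting anything."""
        desired = set(desired_ids)
        changes = {"create": set(), "resurrect": set(), "decommission": set()}

        for external_id in desired:
            if external_id not in self.status:
                changes["create"].add(external_id)
            elif self.status[external_id] == "decommissioned":
                changes["resurrect"].add(external_id)
            self.status[external_id] = "active"

        # Set difference: active records that are no longer desired.
        for external_id, state in self.status.items():
            if external_id not in desired and state == "active":
                changes["decommission"].add(external_id)
        for external_id in changes["decommission"]:
            self.status[external_id] = "decommissioned"

        return changes


store = SoftDeleteStore()
print(store.sync({"Substation-3", "Substation-13"}))  # initial load
print(store.sync({"Substation-3"}))                   # node removed from the graph
print(store.sync({"Substation-3", "Substation-13"}))  # node added back
```

Because the record for `Substation-13` is kept after the second call, the third call can report it under `resurrect` rather than `create`.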

We can repeat the same process, this time bringing back the removed node and relationships, thus "resurrecting" the decommissioned asset and relationships.
This is depicted in the figure below.

<img src="../../figs/rdf2cdf-graph-resurrect.jpg"  width="100%" alt="RDF2CDF">

To do this we will add removed triples back:

In [33]:
add_triples(graph_store, mock_triples)

In [34]:
candidate_assets = rdf2assets(graph_store, transformation_rules)

categorized_assets = categorize_assets(client, candidate_assets, transformation_rules.metadata.data_set_id)

print(44*"-")
for cat in categorized_assets:
    print(f"Category {cat:15} has {len(categorized_assets[cat]):2} assets")
print(44*"-")


print(f"Asset to resurrect {categorized_assets['resurrect'][0].external_id}")
print(f"Asset to update {categorized_assets['update'][0].external_id}")

upload_assets(client, categorized_assets, batch_size=1000)

ERROR:root:Error while loading instances of class <http://purl.org/cognite/tnt#Orphanage> into cache. Reason: 'instance'


--------------------------------------------
Category create          has  0 assets
Category update          has  1 assets
Category resurrect       has  1 assets
Category decommission    has  0 assets
--------------------------------------------
Asset to resurrect Substation-13
Asset to update Substation-3


In [35]:
candidate_relationships = rdf2relationships(graph_store, transformation_rules)


categorized_relationships = categorize_relationships(client, candidate_relationships, transformation_rules.metadata.data_set_id)

print(44*"-")
for cat in categorized_relationships:
    print(f"Category {cat:15} has {len(categorized_relationships[cat]):2} relationships")
print(44*"-")
    
for relationship in categorized_relationships["decommission"]:
    print(f"Relationship to decommission: {relationship._external_id}")
    

upload_relationships(client, categorized_relationships, batch_size=1000)

--------------------------------------------
Category create          has  0 relationships
Category resurrect       has  7 relationships
Category decommission    has  0 relationships
--------------------------------------------


In CDF we will now see that the decommissioned asset and relationships have been resurrected, as shown in the video below:

<video src="../../videos/tutorial-3-resurrection.mp4" controls>
</video>
