# Part 4: Source to Solution Graph

[![Notebook](https://shields.io/badge/notebook-access-green?logo=jupyter&style=for-the-badge)](https://github.com/cognitedata/neat/blob/docs/tutorial/notebooks/part-4-knowledge-graph-transformation.ipynb)

* author: Nikola Vasiljevic
* date: 2023-02-12


In Part 4 of this tutorial series we explore the most comprehensive feature of NEAT: the transformation of a source knowledge graph into a solution graph, and the conversion of the latter into a CDF asset hierarchy. We will work with the Nordic44 knowledge graph, sourced from an RDF/XML file. Specifically, we use the Nordic44 Equipment Profile knowledge graph, which contains a number of instances that conform to the CIM (Common Information Model) data model. Nordic44 is open source and is primarily tailored for research purposes.

First, download the necessary files:
- [Transformation rules](https://github.com/cognitedata/neat/blob/main/cognite/neat/examples/rules/source-to-solution-mapping-rules.xlsx)
- [Nordic44 knowledge graph](https://github.com/cognitedata/neat/blob/main/cognite/neat/examples/source-graphs/Knowledge-Graph-Nordic44.xml)

and place them at a convenient location for loading in this notebook.


Alternatively, if you have cloned the `neat` repository, you will find the aforementioned files in:

 - `./cognite/neat/examples/rules/source-to-solution-mapping-rules.xlsx`
 - `./cognite/neat/examples/source-graphs/Knowledge-Graph-Nordic44.xml`


Also, for convenience, store the configuration of the Cognite client in a `.env` file with the following structure:

```

TENANT_ID = ...
CLIENT_ID = ...
CLIENT_SECRET = ...
CDF_CLUSTER = ...
COGNITE_PROJECT = ...

```

This file will be loaded as a config dictionary and used to configure the Cognite client.
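Before building the client, it can help to verify that every expected key is present in the loaded dictionary. The following is a minimal, stdlib-only sketch: the helper `missing_config_keys` and the placeholder values are hypothetical and are not part of NEAT or `python-dotenv`. It works on any dictionary, including the one returned by `dotenv_values`.

```python
# Keys listed in the `.env` structure above.
REQUIRED_KEYS = ("TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "CDF_CLUSTER", "COGNITE_PROJECT")


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]


# Hypothetical placeholder values, for illustration only.
example_config = {
    "TENANT_ID": "my-tenant",
    "CLIENT_ID": "my-client",
    "CLIENT_SECRET": "",  # empty values count as missing
    "CDF_CLUSTER": "my-cluster",
}

print(missing_config_keys(example_config))
```

An empty list means the dictionary has everything the client configuration in this notebook reads.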


Once you have located the necessary files and created the `.env` file, load the necessary libraries:

In [1]:
from pathlib import Path

from cognite.client import CogniteClient, ClientConfig
from cognite.client.credentials import OAuthClientCredentials

from cognite.neat.core import loader, parser, extractors
from cognite.neat.core.mocks.graph import generate_triples
from cognite.neat.core.utils import add_triples, remove_namespace
from cognite.neat.core import transformer

from dotenv import dotenv_values


%reload_ext autoreload
%autoreload 2

Let's instantiate the CDF client in the same way we did in Part 3:

In [2]:

config = dotenv_values("insert_path_to_.env_file_here")

SCOPES = [f"https://{config['CDF_CLUSTER']}.cognitedata.com/.default"]
TOKEN_URL = f"https://login.microsoftonline.com/{config['TENANT_ID']}/oauth2/v2.0/token"

credentials = OAuthClientCredentials(token_url=TOKEN_URL, 
                                     client_id=config['CLIENT_ID'], 
                                     client_secret=config['CLIENT_SECRET'], 
                                     scopes=SCOPES)

client_config = ClientConfig(client_name="cognite",
                             base_url=f"https://{config['CDF_CLUSTER']}.cognitedata.com",
                             project=config['COGNITE_PROJECT'],
                             credentials=credentials,
                             max_workers=1,
                             timeout=5 * 60,)

client = CogniteClient(client_config)

Let's load transformation rules:

In [3]:
TRANSFORMATION_RULES = Path("insert_path_to_transformation_rules.xlsx_here")
raw_sheets = loader.rules.excel_file_to_table_by_name(TRANSFORMATION_RULES)
transformation_rules = parser.parse_transformation_rules(raw_sheets)

transformation_rules.get_defined_classes()

{'GeographicalRegion',
 'Orphanage',
 'RootCIMNode',
 'SubGeographicalRegion',
 'Substation',
 'Terminal'}

Unlike Part 3, in this tutorial we need to create two graph store instances: one to hold the triples of the source graph and a second to hold the triples of the solution graph.
As we are loading an existing graph into the source store, we need to specify its namespace, which we have conveniently stored in the `Prefixes` sheet under `nordic44`:

In [4]:
source_store = loader.NeatGraphStore(prefixes=transformation_rules.prefixes, 
                                     namespace=transformation_rules.prefixes["nordic44"])
source_store.init_graph(base_prefix=transformation_rules.metadata.prefix)


solution_store = loader.NeatGraphStore(prefixes=transformation_rules.prefixes, 
                                    namespace=transformation_rules.metadata.namespace)
solution_store.init_graph(base_prefix=transformation_rules.metadata.prefix)
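To make the role of prefixes and namespaces concrete, the sketch below expands a prefixed name into a full URI using a plain dictionary. The `nordic44` namespace URI is a made-up placeholder (the real one lives in the `Prefixes` sheet), and `expand` is a hypothetical helper, not a NEAT function.

```python
# Hypothetical prefix table; the real one comes from the `Prefixes` sheet.
prefixes = {
    "nordic44": "http://example.org/nordic44#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name, prefixes):
    """Turn `prefix:local` into a full URI by looking up the prefix."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"Unknown prefix: {prefix!r}")
    return prefixes[prefix] + local_name


print(expand("nordic44:Substation", prefixes))
print(expand("rdf:type", prefixes))
```

Giving each store its own namespace keeps identifiers from the source graph and the solution graph from colliding.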

Now let's import Nordic44 triples to `source_store`:

In [5]:
source_store.import_from_file(Path("insert_path_to_Knowledge-Graph-Nordic44.xml_here"))


Let's list the most populated classes and the number of their instances, as we did in previous tutorials. The query below returns a list of tuples containing the URIs (i.e., references, globally unique ids) of the classes in the Nordic44 knowledge graph together with their instance counts. The full result is a mix of base RDFS classes, such as `Class` and `Property`, and classes specific to the `CIM` namespace. As we will see, the number of substations is 44, which is why Nordic44 has 44 in its name.

In [6]:
# Count instances per class, most populated classes first
print(f"{'class name':20} | {'instances':20}")
print(40*"-")

for i, res in enumerate(list(source_store.graph.query("SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)"))):
    print(f"{remove_namespace(res[0]):20} | {res[1]:20}" )
    if i > 20:
        break

class name           | instances           
----------------------------------------
CurrentLimit         | 530                 
Terminal             | 452                 
OperationalLimitSet  | 238                 
OperatingShare       | 207                 
VoltageLimit         | 184                 
AnalogValue          | 133                 
ConnectivityNode     | 89                  
GeneratingUnit       | 80                  
SynchronousMachine   | 80                  
ACLineSegment        | 68                  
Line                 | 68                  
BusbarSection        | 46                  
VoltageLevel         | 45                  
Substation           | 44                  
ConformLoad          | 35                  
ConformLoadGroup     | 35                  
Analog               | 30                  
Breaker              | 29                  
Disconnector         | 26                  
PowerTransformerEnd  | 24                  
RegulatingControl    | 18          
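
The SPARQL query in the cell above groups every `?s a ?class` triple by class and counts the subjects. The same aggregation can be sketched in plain Python over a handful of made-up triples. Here `strip_namespace` is a hypothetical stand-in that only approximates what `remove_namespace` might do; it is not NEAT's implementation, and the namespace is a placeholder.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NS = "http://example.org/cim#"  # hypothetical namespace

# Made-up (subject, predicate, object) triples.
triples = [
    (NS + "t1", RDF_TYPE, NS + "Terminal"),
    (NS + "t2", RDF_TYPE, NS + "Terminal"),
    (NS + "s1", RDF_TYPE, NS + "Substation"),
    (NS + "t1", NS + "name", "Terminal 1"),  # not a type triple, ignored
]


def strip_namespace(uri):
    """Keep only the part after the last '#' or '/'."""
    return uri.replace("#", "/").rsplit("/", 1)[-1]


# Equivalent of grouping `?s a ?class` by class and counting the subjects.
counts = Counter(obj for _, pred, obj in triples if pred == RDF_TYPE)

# `most_common()` orders classes by descending instance count.
for class_uri, instances in counts.most_common():
    print(f"{strip_namespace(class_uri):20} | {instances}")
```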

If we try to do the same for `solution_store`, we will see that it is empty:

In [7]:
# Same query against the (still empty) solution store
print(f"{'class name':20} | {'instances':20}")
print(40*"-")

for i, res in enumerate(list(solution_store.graph.query("SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)"))):
    print(f"{remove_namespace(res[0]):20} | {res[1]:20}" )
    if i > 20:
        break

class name           | instances           
----------------------------------------


If you closely inspect the transformation rules, you will see that we are not interested in all the classes represented in the Nordic44 knowledge graph, but only in a select few.
Furthermore, as described in [rule types](../../rule-types.md), source graphs are often deep, and multiple hops are required to acquire a specific piece of information.
Therefore, it is convenient to shorten paths, that is, to flatten the graph structure so that queries become simpler and more performant.

You can see that this is exactly what we are doing with the links between substations and terminals: we are greatly reducing the traversal path.
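
As an illustration of path shortening, the sketch below follows a three-hop path through a tiny made-up graph and replaces it with a single direct link. The node names, the predicate names and the `follow` helper are all hypothetical; they are not taken from the rules file or from NEAT.

```python
# Made-up triples describing a three-hop path from a terminal to a substation.
# The predicate names are illustrative only.
triples = {
    ("terminal-1", "connectedTo", "breaker-1"),
    ("breaker-1", "containedIn", "voltage-level-1"),
    ("voltage-level-1", "partOf", "substation-1"),
}


def follow(start, path, triples):
    """Return all nodes reached from `start` by following `path` predicate by predicate."""
    frontier = {start}
    for predicate in path:
        frontier = {o for s, p, o in triples if s in frontier and p == predicate}
    return frontier


# Replace the three-hop path with one direct, hypothetical `substation` link.
path = ["connectedTo", "containedIn", "partOf"]
shortcuts = {
    ("terminal-1", "substation", target)
    for target in follow("terminal-1", path, triples)
}

print(shortcuts)
```

Once the direct triple exists in the solution graph, a query for a terminal's substation needs one hop instead of three.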

The actual knowledge graph transformation is performed by the method `source2solution_graph`, which executes the transformation rules one by one.
To automatically commit new triples, we wrap this method in `NeatGraphStore.set_graph()`.
The method accepts the following arguments:
- source knowledge graph
- transformation rules
- target knowledge graph (to make sure triples are committed to the graph database as they are being created)
- extra triples to be injected into the target knowledge graph (see the INSTANCES sheet in the transformation rules Excel file)
- instance of Cognite Client (to be able to fetch data from CDF RAW in case of `rawlookup` rules)
- CDF RAW database name (to be able to fetch data from CDF RAW in case of `rawlookup` rules)

The call below passes only the first four arguments; the last two matter only when the rules include `rawlookup` rules.

In [8]:
solution_store.set_graph(
    transformer.source2solution_graph(
        source_store.get_graph(),
        transformation_rules,
        solution_store.get_graph(),
        extra_triples=transformation_rules.instances,
    )
)

Let's now inspect the graph in `solution_store` and see the breakdown of the number of instances per class:

In [9]:
# Count instances per class in the solution graph
print(f"{'class name':30} | {'instances':20}")
print(40*"-")

for res in list(solution_store.graph.query("SELECT ?class (count(?s) as ?instances ) WHERE { ?s a ?class . } group by ?class order by DESC(?instances)")):
    print(f"{remove_namespace(res[0]):30} | {res[1]:20}" )

class name                     | instances           
----------------------------------------
Terminal                       | 452                 
Substation                     | 44                  
SubGeographicalRegion          | 10                  
GeographicalRegion             | 2                   
RootCIMNode                    | 1                   


As you can see, with the transformation rules we have cherry-picked existing classes and created new properties to suit our needs.

Let's continue and see what the corresponding assets will look like in CDF. We will use the same methods as in the Part 3 notebook.

<!-- Let's create them using method `rdf2asset`. To this method we are passing following arguments:
- target knowledge graph
- transformation rules, which contain mapping of RDF classes and properties to CDF Assets and their properties
- prefix to external id (useful if multiple people are working with the same CDF project to avoid conflicts in external ids)
- external id of Orphanage root asset (this is used in case of RDF instances which are expected to have parent asset, but do not have it defined in the source knowledge graph, so we will assign them to this root asset)


 and later on categorize them to those that will be:
- freshly created
- updated
- decommissioned (setting their end date, and labeling them as historic)
- resurrected (stating date when they are reactivated and removing historic label) -->

In [10]:
candidate_assets = extractors.rdf2assets(solution_store, transformation_rules)


ERROR:root:Error while loading instances of class <http://purl.org/cognite/tnt#Orphanage> into cache. Reason: 'instance'


Orphanage with external id orphanage-2626756768281823 not found in asset hierarchy!






We have deliberately corrupted the Nordic44 knowledge graph to show how `NEAT` handles missing properties and isolated nodes.

In the source knowledge graph there are four "problematic" instances, which ended up in the target knowledge graph:
- An instance of GeographicalRegion that is missing the relationship to its parent asset, specifically `RootCIMNode`
- An instance of SubGeographicalRegion that is missing the relationship to a `GeographicalRegion`, i.e. its parent asset
- An instance of Terminal that is missing the property that maps to the CDF Asset name
- An instance of Terminal that has an alias property that maps to the CDF Asset name

NEAT manages these instances as follows:
- The instances of GeographicalRegion and SubGeographicalRegion that are missing the relationship to their parent asset are assigned to the Orphanage root asset
- The instance of Terminal that is missing the property that maps to the CDF Asset name uses its identifier, with the namespace removed, as the CDF Asset name
- The instance of Terminal that has an alias property that maps to the CDF Asset name uses its alias property as the CDF Asset name
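
The fallbacks listed above can be summarized in a short sketch. This is only an illustration of the described behavior, not NEAT's implementation: `resolve_asset`, `strip_namespace`, the Orphanage id and the example URIs are hypothetical, and the priority of a proper name over an alias is an assumption, since the text does not state it.

```python
ORPHANAGE_ID = "orphanage-example"  # hypothetical external id of the Orphanage root


def strip_namespace(uri):
    """Keep only the part after the last '#' or '/'."""
    return uri.replace("#", "/").rsplit("/", 1)[-1]


def resolve_asset(identifier, name=None, alias=None, parent=None):
    """Apply the name and parent fallbacks to a single non-root instance."""
    # Assumption: a proper name wins over an alias; the identifier is the last resort.
    resolved_name = name or alias or strip_namespace(identifier)
    # Instances without a parent are attached to the Orphanage root asset.
    resolved_parent = parent or ORPHANAGE_ID
    return {
        "external_id": strip_namespace(identifier),
        "name": resolved_name,
        "parent_external_id": resolved_parent,
    }


NS = "http://example.org/ns#"  # hypothetical namespace
print(resolve_asset(NS + "terminal-without-name-property", parent="voltage-level-1"))
print(resolve_asset(NS + "terminal-2", alias="Alias Name", parent="voltage-level-1"))
print(resolve_asset(NS + "region-1", name="LA"))
```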

Let's confirm this by checking the corresponding assets:

In [11]:
print(f"{'External ID':36} | {'Name':30} | {'Parent External ID':36} | {'Asset Type':20}")
print(132*"-")

# Show only the orphaned assets and the two terminals with fixed names
for asset in candidate_assets.values():
    if (
        asset["parent_external_id"] == "orphanage-2626756768281823"
        or asset["name"] in ("terminal-without-name-property", "Alias Name")
    ):
        print(f"{asset['external_id']:36} | {asset['name']:30} | {asset['parent_external_id']:36} | {asset['metadata']['type']:20}")

External ID                          | Name                           | Parent External ID                   | Asset Type          
------------------------------------------------------------------------------------------------------------------------------------
lazarevac                            | LA                             | orphanage-2626756768281823           | GeographicalRegion  
f17696b3-9aeb-11e5-91da-b8763fd99c5f | FI1 SGR                        | orphanage-2626756768281823           | SubGeographicalRegion
2dd901a4-bdfb-11e5-94fa-c8f73332c8f4 | Alias Name                     | f1769682-9aeb-11e5-91da-b8763fd99c5f | Terminal            
terminal-without-name-property       | terminal-without-name-property | f1769688-9aeb-11e5-91da-b8763fd99c5f | Terminal            


Let's now categorize the assets and see how many of them will be:
- created
- updated
- decommissioned (setting their end date and labeling them as historic)
- resurrected (stating the date when they are reactivated and removing the historic label)

We pass the Cognite client, the asset dictionary and the data set id to the method `categorize_assets`, which returns a dictionary of categorized assets.
If there are no assets in CDF yet, the returned dictionary should only contain assets under the category "create":

In [12]:
categorized_assets = extractors.categorize_assets(client, 
                                                  candidate_assets, 
                                                  transformation_rules.metadata.data_set_id)


for cat in categorized_assets:
    print(f"Category {cat:15} has {len(categorized_assets[cat]):2} assets")


Category create          has 510 assets
Category update          has  0 assets
Category resurrect       has  0 assets
Category decommission    has  0 assets
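
The four categories can be thought of as a comparison between the candidate assets and what already exists in CDF. The sketch below illustrates that idea with plain dictionaries. It is not how `categorize_assets` is implemented: the `categorize` function, the `historic` flag standing in for the historic label, and the name-only change detection are simplifying assumptions.

```python
def categorize(candidates, existing):
    """Split assets into create / update / resurrect / decommission buckets.

    Both arguments map external ids to asset dictionaries. The `historic`
    flag is a hypothetical stand-in for the historic label.
    """
    categories = {"create": {}, "update": {}, "resurrect": {}, "decommission": {}}

    for external_id, asset in candidates.items():
        current = existing.get(external_id)
        if current is None:
            categories["create"][external_id] = asset
        elif current.get("historic"):
            categories["resurrect"][external_id] = asset
        elif current["name"] != asset["name"]:
            categories["update"][external_id] = asset

    # Active assets that no longer appear among the candidates are retired.
    for external_id, current in existing.items():
        if external_id not in candidates and not current.get("historic"):
            categories["decommission"][external_id] = current

    return categories


candidates = {"a": {"name": "A"}, "b": {"name": "B2"}, "c": {"name": "C"}}
existing = {
    "b": {"name": "B"},                    # name changed -> update
    "c": {"name": "C", "historic": True},  # reappears -> resurrect
    "d": {"name": "D"},                    # no longer in the graph -> decommission
}

for category, assets in categorize(candidates, existing).items():
    print(f"Category {category:15} has {len(assets):2} assets")
```

With an empty `existing` dictionary every candidate lands in "create", which matches the situation in this notebook.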


Before we upload the assets, we need to create the labels that we use to mark asset and relationship types as well as their status:

In [13]:
extractors.labels.upload_labels(client, transformation_rules)

Finally we can upload categorized assets to CDF:

In [14]:
extractors.upload_assets(client, categorized_assets, batch_size=1000)
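
The `batch_size` argument suggests that items are sent in chunks. The sketch below shows how such chunking typically works, assuming `batch_size` caps the number of items per request; the `batches` helper and the payload are hypothetical and do not reproduce NEAT's upload code.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


assets = [f"asset-{i}" for i in range(2500)]  # made-up payload
sizes = [len(batch) for batch in batches(assets, batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```

Under that assumption, the 510 assets of this tutorial fit into a single batch of 1000.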

For the sake of completeness, let's check that the two orphaned assets were added under Orphanage and that the two terminals with missing and alias names had their names properly set:

<video src="../../videos/tutorial-4-asset-hierarchy.mp4" controls>
</video>


Let's repeat the process for relationships:


In [15]:
candidate_relationships = extractors.rdf2relationships(solution_store, transformation_rules)


categorized_relationships = extractors.categorize_relationships(client, 
                                                                candidate_relationships, 
                                                                transformation_rules.metadata.data_set_id)


for cat in categorized_relationships:
    print(f"Category {cat:15} has {len(categorized_relationships[cat]):2} relationships")



Category create          has 454 relationships
Category resurrect       has  0 relationships
Category decommission    has  0 relationships


In [16]:
extractors.upload_relationships(client, categorized_relationships, batch_size=1000)