# Source to Solution Graph

[![Notebook](https://shields.io/badge/notebook-access-green?logo=jupyter&style=for-the-badge)](https://github.com/cognitedata/neat/blob/docs/tutorial/notebooks/source-to-solution-graph.ipynb)

* author: Nikola Vasiljevic
* date: 2023-02-12


In this notebook we demonstrate key NEAT functionalities and a typical workflow. We will work with a knowledge graph provided both as a file (RDF/XML format) and loaded into GraphDB, a dedicated RDF graph database. We will use the Nordic44 Equipment Profile knowledge graph, which contains a number of instances that conform to the CIM (Common Information Model) ontology. Nordic44 is open source and primarily tailored for research purposes.

# Necessary libraries

We will use a handful of NEAT methods, which we load in the cell below:

In [1]:
import os
from pathlib import Path

from cognite.neat.core.utils import get_cognite_client_from_env
from cognite.neat.core import loader, parser
from cognite.neat.core.data_classes.config import RdfStoreType
from cognite.neat.core.parser.instances import parse_instances
from cognite.neat.core.transformer import domain2app_knowledge_graph


%reload_ext autoreload
%autoreload 2

# Connecting to CDF

The convenience method `get_cognite_client_from_env` returns a Cognite SDK client instance, which we will use to fetch resources from and push resources to CDF. This method requires the environment variables below to be set:

In [2]:
CDF_CLUSTER = "az-power-no-northeurope"
TENANT_ID = ""

os.environ["CDF_TOKEN_URL"] = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
os.environ["CDF_CLIENT_ID"] = ""
os.environ["CDF_CLIENT_SECRET"] = ""
os.environ["CDF_SCOPES"] = f"https://{CDF_CLUSTER}.cognitedata.com/.default"
os.environ["CDF_CLIENT_NAME"] = "cognite"
os.environ["CDF_BASE_URL"] = f"https://{CDF_CLUSTER}.cognitedata.com"
os.environ["CDF_PROJECT"] = "get-power-grid"

client = get_cognite_client_from_env()
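
Since the credentials above are left empty on purpose, it can help to check that every variable is filled in before creating the client. The following is a minimal sketch using only the standard library; the helper `missing_cdf_env_vars` is a hypothetical name introduced here for illustration and is not part of NEAT.

```python
import os

# Variable names taken from the cell above.
REQUIRED_CDF_VARS = (
    "CDF_TOKEN_URL",
    "CDF_CLIENT_ID",
    "CDF_CLIENT_SECRET",
    "CDF_SCOPES",
    "CDF_CLIENT_NAME",
    "CDF_BASE_URL",
    "CDF_PROJECT",
)


def missing_cdf_env_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of NEAT.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_CDF_VARS if not env.get(name)]


# Example with a plain dictionary standing in for os.environ
example_env = {name: "value" for name in REQUIRED_CDF_VARS}
example_env["CDF_CLIENT_SECRET"] = ""
print(missing_cdf_env_vars(example_env))
```

Calling `missing_cdf_env_vars()` without arguments inspects the real environment, so an empty list indicates that all variables listed above have a value.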

# Loading Transformation Rules
All our references 

In [4]:
# Loading the transformation rules

ROOT = Path().resolve().parent.parent
DATA_FOLDER = ROOT / "data"
TRANSFORMATION_RULES = DATA_FOLDER / "Rules-Nordic44-to-TNT.xlsx"
IN_MEMORY_KNOWLEDGE_GRAPH = DATA_FOLDER / "Knowledge-Graph-Nordic44.xml"

raw_sheets = loader.rules.excel_file_to_table_by_name(TRANSFORMATION_RULES)
transformation_rules = parser.parse_transformation_rules(raw_sheets)

# Loading Knowledge Graph to Graph Database
In the first part of this tutorial we will work with a `NeatGraphStore` object connected to GraphDB.

Accordingly, we need to make sure that GraphDB is running on our computer. Follow these steps:

1. Install Docker and make sure that it is running
2. In a terminal, go to the root of the NEAT repository and execute:

    ```make build-docker```

3. Make sure to update `docker-compose.yaml` in the `./docker` folder with the version of GraphDB that suits your computer architecture. Currently it is set to `arm64`, which is suitable for M1 and M2 chips.

4. After updating `docker-compose.yaml`, start the containers by executing:

    ```make compose-up```

5. Go to `http://localhost:7201` in your browser; this should resolve to the GraphDB user interface


6. Navigate through the user interface and create two repositories, one named `nordic44` and another named `tnt-solution` (the name must match the repository URLs used when connecting later in this notebook)

7. In the `nordic44` repository, load the knowledge graph provided as the file `Knowledge-Graph-Nordic44.xml` located in `./data`, and set `baseIRI` to `http://purl.org/nordic44#`. The `baseIRI` is needed because Nordic44 instances do not contain it and all instance references are "relative"; in other words, they are missing a namespace.

8. Select `nordic44`, navigate to Class Hierarchy, and check that you see CIM classes such as Terminal in this repository. Alternatively, go to http://localhost:7201/hierarchy

If you are unable to create the repositories, or have a hard time loading the knowledge graph, check out the following screen recording (no sound):

https://drive.google.com/file/d/1SnS6XvVmuOKOuIaZcsTyt8My8T56OYfS/view?usp=sharing
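
To illustrate why step 7 asks for a `baseIRI`, the sketch below resolves a relative reference against the base using Python's standard `urllib.parse.urljoin`. The relative reference shown is an illustrative assumption about how an instance might be referenced in the file; GraphDB performs the equivalent resolution itself during import.

```python
from urllib.parse import urljoin

BASE_IRI = "http://purl.org/nordic44#"


def resolve(reference, base=BASE_IRI):
    """Resolve a (possibly relative) reference against the base IRI."""
    return urljoin(base, reference)


# A relative reference gains the namespace from the base IRI
print(resolve("#_f17696b3-9aeb-11e5-91da-b8763fd99c5f"))
# http://purl.org/nordic44#_f17696b3-9aeb-11e5-91da-b8763fd99c5f

# An absolute URI is left untouched
print(resolve("http://iec.ch/TC57/2013/CIM-schema-cim16#Terminal"))
# http://iec.ch/TC57/2013/CIM-schema-cim16#Terminal
```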

Let's continue and connect to the repositories we made by instantiating and configuring two `NeatGraphStore` objects. 
We will call these two instances `source_graph` (linked to the `nordic44` repository, which contains the source knowledge graph) and `target_graph` (where we will store the transformed knowledge graph, which we will refer to as `tnt`).

Here we instantiate each object and bind the prefixes, which can be found in the PREFIXES sheet of the transformation rules Excel file. 
> Check out the other docs to understand why we are using prefixes instead of full namespaces.

In [5]:
source_graph = loader.NeatGraphStore(prefixes=transformation_rules.prefixes)
source_graph.init_graph(
    rdf_store_type = RdfStoreType.GRAPHDB,
    rdf_store_query_url = "http://localhost:7201/repositories/nordic44",
    rdf_store_update_url = "http://localhost:7201/repositories/nordic44/statements",
    graph_name = "nordic44",
)


target_graph = loader.NeatGraphStore(prefixes=transformation_rules.prefixes)
target_graph.init_graph(
    rdf_store_type = RdfStoreType.GRAPHDB,
    rdf_store_query_url = "http://localhost:7201/repositories/tnt-solution",
    rdf_store_update_url = "http://localhost:7201/repositories/tnt-solution/statements",
    graph_name = "tnt-solution",
)
target_graph.graph_db_rest_url = "http://localhost:7201"
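
As a rough illustration of what binding prefixes buys us, the sketch below expands a prefixed name into a full URI and compacts it back. The namespace URIs are the ones that appear in this notebook's outputs; the prefix name `cim` and the helper functions are assumptions made for this example, and the real prefix table lives in the PREFIXES sheet.

```python
# Prefix table assumed for illustration; see the PREFIXES sheet for the real one.
PREFIXES = {
    "cim": "http://iec.ch/TC57/2013/CIM-schema-cim16#",
    "tnt": "http://purl.org/cognite/tnt#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn 'prefix:LocalName' into a full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local_name


def compact(uri, prefixes=PREFIXES):
    """Turn a full URI into 'prefix:LocalName' when a prefix matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known prefix, return the URI unchanged


print(expand("tnt:SubGeographicalRegion"))
print(compact("http://iec.ch/TC57/2013/CIM-schema-cim16#Terminal"))
```

Short prefixed names keep rules and queries readable, while the store still works with the full URIs.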

Let's check that the connections to the repositories are correctly established. 
We will do this by executing very simple SPARQL queries. 

First, let's query the source graph and request the first 10 RDFS classes.
This query should return a list of tuples containing the URIs (i.e., references, globally unique ids) of RDFS classes in the Nordic44 knowledge graph. The result may be a mix of base RDFS classes such as `Class` and `Property`, and classes specific to the `CIM` namespace:

In [6]:
res = list(source_graph.query("SELECT DISTINCT ?class WHERE { ?s a ?class } Limit 10"))

print(f"{'namespace':50} | {'class name':20}")
print(80*"-")
for i in res:
    # i[0] is the class URI: the part before '#' is the namespace, the part after it the class name
    namespace, class_name = i[0].split('#')[:2]
    print(f"{namespace:50} | {class_name:20}")

namespace                                          | class name          
--------------------------------------------------------------------------------
http://iec.ch/TC57/61970-552/ModelDescription/1    | FullModel           
http://iec.ch/TC57/2013/CIM-schema-cim16           | ACLineSegment       
http://iec.ch/TC57/2013/CIM-schema-cim16           | BaseVoltage         
http://iec.ch/TC57/2013/CIM-schema-cim16           | Line                
http://entsoe.eu/CIM/SchemaExtension/3/2           | LineCircuit         
http://iec.ch/TC57/2013/CIM-schema-cim16           | ActivePowerLimit    
http://iec.ch/TC57/2013/CIM-schema-cim16           | OperationalLimitSet 
http://iec.ch/TC57/2013/CIM-schema-cim16           | OperationalLimitType
http://iec.ch/TC57/2013/CIM-schema-cim16           | Analog              
http://entsoe.eu/CIM/SchemaExtension/3/2           | EnergyCongestionZone
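
The printing loop above assumes that every class URI contains a `#`. As a hedged alternative, the sketch below splits a URI on the last `#` and falls back to the last `/` for namespaces that end with a slash. The helper `split_uri` is a hypothetical name introduced here, not a NEAT function.

```python
def split_uri(uri):
    """Split a URI into (namespace, local name).

    Prefers the last '#', falls back to the last '/'.
    Hypothetical helper for illustration; not part of NEAT.
    """
    uri = str(uri)
    for separator in ("#", "/"):
        namespace, found, local_name = uri.rpartition(separator)
        if found:
            return namespace, local_name
    return "", uri  # no separator at all


print(split_uri("http://iec.ch/TC57/2013/CIM-schema-cim16#ACLineSegment"))
print(split_uri("http://example.org/ns/Thing"))
```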


Now, let's do the same for the target repository. Unlike the `nordic44` repository, we should only get a handful of standard/base classes (from the RDF, RDFS and OWL namespaces):

In [7]:
res = list(target_graph.query("SELECT DISTINCT ?class WHERE { ?s a ?class } Limit 10"))

print(f"{'namespace':50} | {'class name':20}")
print(80*"-")
for i in res:
    # i[0] is the class URI: the part before '#' is the namespace, the part after it the class name
    namespace, class_name = i[0].split('#')[:2]
    print(f"{namespace:50} | {class_name:20}")

namespace                                          | class name          
--------------------------------------------------------------------------------
http://www.w3.org/1999/02/22-rdf-syntax-ns         | Property            
http://www.w3.org/2002/07/owl                      | TransitiveProperty  
http://www.w3.org/2002/07/owl                      | SymmetricProperty   
http://www.w3.org/1999/02/22-rdf-syntax-ns         | List                
http://www.w3.org/2000/01/rdf-schema               | Class               
http://www.w3.org/2000/01/rdf-schema               | Datatype            
http://www.w3.org/2000/01/rdf-schema               | ContainerMembershipProperty


Now we are ready to start transforming the Nordic44 knowledge graph (i.e., the source knowledge graph) into the TNT knowledge graph (i.e., the target knowledge graph).
To be sure that we are starting from scratch, we will first delete all triples from the target repository using the `.drop()` method:

In [8]:
target_graph.drop()

The actual knowledge graph transformation is performed by the method `domain2app_knowledge_graph`, which executes the transformation rules one by one.
To automatically commit new triples, we wrap this method in `NeatGraphStore.set_graph()`. 
The method accepts the following arguments (the call below passes the first four):
- source knowledge graph
- transformation rules
- target knowledge graph (to make sure triples are committed to the graph database as they are being created)
- extra triples to be injected into the target knowledge graph (see the INSTANCES sheet in the transformation rules Excel file)
- instance of the Cognite client (to be able to fetch data from CDF RAW in case of `rawlookup` rules)
- CDF RAW database name (to be able to fetch data from CDF RAW in case of `rawlookup` rules)

In [9]:
target_graph.set_graph(
    domain2app_knowledge_graph(
        source_graph.get_graph(),
        transformation_rules,
        target_graph.get_graph(),
        extra_triples=parse_instances(raw_sheets),
    )
)
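
To give a feel for what "executing transformation rules one by one" means, here is a toy sketch over plain Python tuples. It is not NEAT's implementation: the rule layout (source class, source property, target class, target property), the `ex:` identifiers and the function `transform` are all assumptions made for illustration.

```python
RDF_TYPE = "rdf:type"

# Toy source graph as (subject, predicate, object) tuples
source_triples = [
    ("ex:t1", RDF_TYPE, "cim:Terminal"),
    ("ex:t1", "cim:IdentifiedObject.name", "Terminal 1"),
    ("ex:l1", RDF_TYPE, "cim:Line"),
]

# Hypothetical rules: (source class, source property, target class, target property)
rules = [
    ("cim:Terminal", "cim:IdentifiedObject.name", "tnt:Terminal", "tnt:IdentifiedObject.name"),
]


def transform(triples, rules):
    """Apply each rule in turn and collect the resulting target triples."""
    target = []
    for src_class, src_prop, tgt_class, tgt_prop in rules:
        # Find the instances of the source class this rule applies to
        instances = [s for s, p, o in triples if p == RDF_TYPE and o == src_class]
        for instance in instances:
            type_triple = (instance, RDF_TYPE, tgt_class)
            if type_triple not in target:
                target.append(type_triple)
            # Copy the mapped property under its target name
            for s, p, o in triples:
                if s == instance and p == src_prop:
                    target.append((s, tgt_prop, o))
    return target


for triple in transform(source_triples, rules):
    print(triple)
```

Only instances covered by a rule reach the target; the `cim:Line` instance is left behind because no rule mentions it.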

Let's now inspect the target repository and see if we have added any new classes or instances:

In [10]:
res = list(target_graph.query("SELECT DISTINCT ?class WHERE { ?s a ?class } Limit 10"))

print(f"{'namespace':{50}} | {'class name':{20}}")
print(80*"-")
for i in res:
    # i[0] is the class URI: the part before '#' is the namespace, the part after it the class name
    namespace, class_name = i[0].split('#')[:2]
    print(f"{namespace:50} | {class_name:20}")

namespace                                          | class name          
--------------------------------------------------------------------------------
http://www.w3.org/1999/02/22-rdf-syntax-ns         | Property            
http://www.w3.org/2002/07/owl                      | TransitiveProperty  
http://www.w3.org/2002/07/owl                      | SymmetricProperty   
http://www.w3.org/1999/02/22-rdf-syntax-ns         | List                
http://www.w3.org/2000/01/rdf-schema               | Class               
http://www.w3.org/2000/01/rdf-schema               | Datatype            
http://www.w3.org/2000/01/rdf-schema               | ContainerMembershipProperty
http://iec.ch/TC57/2013/CIM-schema-cim16           | GeographicalRegion  
http://purl.org/cognite/tnt                        | GeographicalRegion  
http://purl.org/cognite/tnt                        | SubGeographicalRegion


You might have noticed that a class name can appear twice in the list, once in the CIM namespace and once in the `tnt` namespace; in the 10-row output above this is visible for `GeographicalRegion`. Taking `SubGeographicalRegion` as the example, this happens because we have used the "*" wildcard in the transformation rules, which copies all statements (subject, predicate, object) for instances of `http://iec.ch/TC57/2013/CIM-schema-cim16#SubGeographicalRegion` from the source to the target knowledge graph. 

Among these statements are the ones declaring that an entity is an instance of the class `http://iec.ch/TC57/2013/CIM-schema-cim16#SubGeographicalRegion`. At the same time, we also state that these instances in the target knowledge graph are instances of the new class `tnt:SubGeographicalRegion`, which we have defined in the transformation rules, hence the "duplication". 

Another issue with this rule type is that you will probably get even more duplication, since you will also be explicitly mapping single properties from the source to the target knowledge graph (check "Rule Types in NEAT"). In short: do not be lazy. Explicitly define transformation rules for all classes and properties you want to transform, and avoid using the "*" wildcard in the transformation rules.
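
The duplication described above can be reproduced with a toy example. The sketch below compares a wildcard-style copy with an explicit property mapping on a set of plain tuples; the functions and the `ex:` identifier are hypothetical and only mimic the effect, they are not how NEAT implements its rules.

```python
RDF_TYPE = "rdf:type"

source = {
    ("ex:sgr1", RDF_TYPE, "cim:SubGeographicalRegion"),
    ("ex:sgr1", "cim:IdentifiedObject.name", "FI1 SGR"),
}


def instances_of(triples, cls):
    return {s for s, p, o in triples if p == RDF_TYPE and o == cls}


def wildcard_copy(triples, source_class, target_class):
    """Copy every statement of the instances, then add the new type."""
    instances = instances_of(triples, source_class)
    copied = {t for t in triples if t[0] in instances}  # includes the old rdf:type
    typed = {(s, RDF_TYPE, target_class) for s in instances}
    return copied | typed


def explicit_copy(triples, source_class, target_class, property_map):
    """Copy only the explicitly mapped properties, under their new names."""
    instances = instances_of(triples, source_class)
    result = {(s, RDF_TYPE, target_class) for s in instances}
    result |= {
        (s, property_map[p], o)
        for s, p, o in triples
        if s in instances and p in property_map
    }
    return result


def classes(triples):
    return sorted({o for _, p, o in triples if p == RDF_TYPE})


print(classes(wildcard_copy(source, "cim:SubGeographicalRegion", "tnt:SubGeographicalRegion")))
print(classes(explicit_copy(
    source,
    "cim:SubGeographicalRegion",
    "tnt:SubGeographicalRegion",
    {"cim:IdentifiedObject.name": "tnt:IdentifiedObject.name"},
)))
```

With the wildcard copy the instance ends up typed with both the CIM and the `tnt` class, while the explicit mapping leaves only the `tnt` class.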


But let's continue and see what the corresponding assets will look like in CDF. Let's create them using the method `rdf2asset_dictionary`. This method accepts the following arguments (the call below passes only the first two):
- target knowledge graph
- transformation rules, which contain the mapping of RDF classes and properties to CDF Assets and their properties
- prefix for external ids (useful when multiple people work with the same CDF project, to avoid conflicts in external ids)
- external id of the Orphanage root asset (used for RDF instances that are expected to have a parent asset but do not have one defined in the source knowledge graph; such instances are assigned to this root asset)


Later on we will categorize the assets into those that will be:
- freshly created
- updated
- decommissioned (setting their end date, and labeling them as historic)
- resurrected (stating the date when they are reactivated and removing the historic label)

In [11]:
asset_dictionary = parser.rdf2asset_dictionary(target_graph, transformation_rules)

In the source knowledge graph there are three "problematic" instances, which ended up in the target knowledge graph:
- An instance of SubGeographicalRegion that is missing the relationship to its parent asset
- An instance of Terminal that is missing the property that maps to the CDF Asset name
- An instance of Terminal that has an alias property that maps to the CDF Asset name

NEAT handles these instances as follows:
- The SubGeographicalRegion instance without a parent relationship is assigned to the Orphanage root asset
- The Terminal instance without a name property uses its identifier, with the namespace removed, as the CDF Asset name
- The Terminal instance with an alias property uses that alias as the CDF Asset name

Let's confirm this by checking the corresponding assets:

In [12]:
print(f"{'External ID':36} | {'Name':30} | {'Parent External ID':36} | {'Asset Type':20}")
print(132*"-")

# Show only the "problematic" assets: orphans and the two special Terminal instances
special_names = ("terminal-without-name-property", "Alias Name")
for external_id, asset in asset_dictionary.items():
    if asset["parent_external_id"] == "orphanage" or asset["name"] in special_names:
        print(f"{asset['external_id']:36} | {asset['name']:30} | {asset['parent_external_id']:36} | {asset['metadata']['type']:20}")

External ID                          | Name                           | Parent External ID                   | Asset Type          
------------------------------------------------------------------------------------------------------------------------------------
lazarevac                            | LA                             | orphanage                            | GeographicalRegion  
f17696b3-9aeb-11e5-91da-b8763fd99c5f | FI1 SGR                        | orphanage                            | SubGeographicalRegion
2dd901a4-bdfb-11e5-94fa-c8f73332c8f4 | Alias Name                     | f1769682-9aeb-11e5-91da-b8763fd99c5f | Terminal            
terminal-without-name-property       | terminal-without-name-property | f1769688-9aeb-11e5-91da-b8763fd99c5f | Terminal            
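
The fallback behaviour described above can be sketched in a few lines. This is only an approximation under stated assumptions: the function `build_asset`, its parameters, and the order in which name and alias are tried are hypothetical; only the `"orphanage"` external id and the resulting field names come from the output above.

```python
ORPHANAGE_EXTERNAL_ID = "orphanage"


def local_identifier(uri):
    """Strip the namespace, keeping what follows the last '#'."""
    return uri.split("#")[-1]


def build_asset(uri, name=None, alias=None, parent_external_id=None):
    """Assemble a minimal asset dictionary with fallbacks.

    Hypothetical sketch; the precedence name -> alias -> identifier is an assumption.
    """
    identifier = local_identifier(uri)
    return {
        "external_id": identifier,
        "name": name or alias or identifier,
        "parent_external_id": parent_external_id or ORPHANAGE_EXTERNAL_ID,
    }


# Terminal without a name property: the identifier becomes the name
print(build_asset(
    "http://purl.org/nordic44#terminal-without-name-property",
    parent_external_id="f1769688-9aeb-11e5-91da-b8763fd99c5f",
))

# Instance without a parent: it is attached to the orphanage root
print(build_asset("http://purl.org/nordic44#f17696b3-9aeb-11e5-91da-b8763fd99c5f", name="FI1 SGR"))
```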


Let's now categorize assets and see how many of them will be:
- freshly created
- updated
- decommissioned (setting their end date, and labeling them as historic)
- resurrected (stating date when they are reactivated and removing historic label)

We pass the Cognite client, the asset dictionary and the data set id to the method `categorize_assets`, which returns a dictionary with categorized assets.
You might see the following errors:
```
ERROR:root:Error while getting non-historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'
ERROR:root:Error while getting historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'
```

Do not be alarmed: this is because we have not yet created any assets in CDF, so there are no assets to fetch.

If this is the case, the returned dictionary should only have assets under the category "create":

In [13]:
assets = parser.categorize_assets(client, 
                                  asset_dictionary, 
                                  transformation_rules.metadata.data_set_id)

for category in assets.keys():
    print(f"{category}: {len(assets[category])}")

ERROR:root:Error while getting non-historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'
ERROR:root:Error while getting historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'


create: 510
update: 0
decommission: 0
resurrect: 0


Finally, we can upload the categorized assets to CDF:

In [14]:
parser.upload_assets(client, assets)

If we now try to categorize the assets again, we should see that there are no assets to be created:

In [15]:
assets = parser.categorize_assets(client, 
                                  asset_dictionary, 
                                  transformation_rules.metadata.data_set_id)

for category in assets.keys():
    print(f"{category}: {len(assets[category])}")

ERROR:root:Error while getting historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'


create: 0
update: 0
decommission: 0
resurrect: 0


For the sake of demonstrating NEAT's capabilities, let's now decommission the "terminal-without-name-property" asset and see how NEAT handles this situation. We do this by removing the asset from the asset dictionary before categorizing again:

In [16]:
asset_dictionary.pop("terminal-without-name-property")

assets = parser.categorize_assets(client, 
                                  asset_dictionary, 
                                  transformation_rules.metadata.data_set_id)

for category in assets.keys():
    print(f"{category}: {len(assets[category])}")

print(assets["decommission"]["terminal-without-name-property"])
parser.upload_assets(client, assets)

ERROR:root:Error while getting historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'


create: 0
update: 0
decommission: 1
resurrect: 0
{
    "external_id": "terminal-without-name-property",
    "name": "terminal-without-name-property",
    "parent_external_id": "f1769688-9aeb-11e5-91da-b8763fd99c5f",
    "data_set_id": 2626756768281823,
    "metadata": {
        "IdentifiedObject.aliasName": "",
        "IdentifiedObject.mRID": "terminal-without-name-property",
        "IdentifiedObject.name": "",
        "Terminal.Substation": "f1769688-9aeb-11e5-91da-b8763fd99c5f",
        "active": "false",
        "identifier": "terminal-without-name-property",
        "start_time": "2023-03-01 09:47:35.710839",
        "type": "Terminal",
        "update_time": "2023-03-01 09:48:15.216802",
        "end_time": "2023-03-01 09:48:15.216819"
    }
}


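For the sake of demonstrating NEAT's capabilities, let's now resurrect the "terminal-without-name-property" asset and see how NEAT handles this situation. We do this by rebuilding the asset dictionary from the target graph, which still contains the corresponding instance: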

In [18]:
# This rebuilds the asset dictionary from the target graph
asset_dictionary = parser.rdf2asset_dictionary(target_graph, transformation_rules)



assets = parser.categorize_assets(client, 
                                  asset_dictionary, 
                                  transformation_rules.metadata.data_set_id)

for category in assets.keys():
    print(f"{category}: {len(assets[category])}")

print(assets["resurrect"]["terminal-without-name-property"])
parser.upload_assets(client, assets)


create: 0
update: 0
decommission: 0
resurrect: 1
{
    "external_id": "terminal-without-name-property",
    "name": "terminal-without-name-property",
    "parent_external_id": "f1769688-9aeb-11e5-91da-b8763fd99c5f",
    "data_set_id": 2626756768281823,
    "metadata": {
        "IdentifiedObject.aliasName": "",
        "IdentifiedObject.mRID": "terminal-without-name-property",
        "IdentifiedObject.name": "",
        "Terminal.Substation": "f1769688-9aeb-11e5-91da-b8763fd99c5f",
        "active": "true",
        "identifier": "terminal-without-name-property",
        "start_time": "2023-03-01 09:47:35.710839",
        "type": "Terminal",
        "update_time": "2023-03-01 09:49:13.118279",
        "resurrection_time": "2023-03-01 09:49:13.118301"
    }
}
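
Comparing the two printed assets suggests how the lifecycle is recorded in metadata: decommissioning sets `active` to `"false"` and adds `end_time`, while resurrecting sets `active` back to `"true"`, drops `end_time` and adds `resurrection_time`. The sketch below mimics those changes; the functions are hypothetical, and unlike the output above they stamp `update_time` and the lifecycle time with the same value.

```python
from datetime import datetime


def decommission(asset, now=None):
    """Return a copy of the asset marked as historic (hypothetical sketch)."""
    now = str(now or datetime.now())
    metadata = dict(asset["metadata"], active="false", update_time=now, end_time=now)
    return {**asset, "metadata": metadata}


def resurrect(asset, now=None):
    """Return a copy of the asset marked as active again (hypothetical sketch)."""
    now = str(now or datetime.now())
    metadata = dict(asset["metadata"], active="true", update_time=now, resurrection_time=now)
    metadata.pop("end_time", None)  # the resurrected asset above has no end_time
    return {**asset, "metadata": metadata}


asset = {
    "external_id": "terminal-without-name-property",
    "metadata": {"active": "true", "start_time": "2023-03-01 09:47:35.710839"},
}

historic = decommission(asset, now="2023-03-01 09:48:15.216819")
revived = resurrect(historic, now="2023-03-01 09:49:13.118301")
print(historic["metadata"])
print(revived["metadata"])
```

Both functions return new dictionaries and leave the input untouched, which keeps the original asset dictionary reusable between categorization runs.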


Assets must exist before relationships can be built. 
Relationships are built in a similar way to assets, so we will chain all the methods together to create relationships between assets.

Again, if you see an error such as:
```
ERROR:root:Error while getting historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'
```

Do not be alarmed: this is because at this point we do not have any historic assets in CDF (i.e., assets that have been decommissioned).

In [27]:
# Build, categorize and upload relationships. If the assets above are not
# present in CDF, there will be no relationships to create; existing ones
# will actually be removed.
relationship_dictionary = parser.rdf2relationship_dict(target_graph, transformation_rules)
categorized_relationships = parser.categorize_relationships(
    client,
    relationship_dictionary,
    transformation_rules.metadata.data_set_id,
)
relationships = parser.upload_relationships(client, categorized_relationships)

ERROR:root:Error while getting historic assets from CDF. Reason: 'DataFrame' object has no attribute 'external_id'
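
The dependency of relationships on existing assets can be pictured as a simple filter: a relationship is only usable when both of its endpoints are known assets. The sketch below is an illustration under assumptions, not NEAT's logic; the function `split_relationships` and the relationship external id are hypothetical, while `source_external_id` and `target_external_id` follow the field names of CDF relationships.

```python
def split_relationships(relationships, asset_external_ids):
    """Separate relationships with known endpoints from dangling ones.

    Hypothetical sketch; not part of NEAT.
    """
    valid, dangling = {}, {}
    for external_id, relationship in relationships.items():
        endpoints = (
            relationship["source_external_id"],
            relationship["target_external_id"],
        )
        if all(endpoint in asset_external_ids for endpoint in endpoints):
            valid[external_id] = relationship
        else:
            dangling[external_id] = relationship
    return valid, dangling


relationships = {
    "terminal-to-substation": {
        "source_external_id": "terminal-without-name-property",
        "target_external_id": "f1769688-9aeb-11e5-91da-b8763fd99c5f",
    },
}

# Both endpoints exist as assets
valid, dangling = split_relationships(
    relationships,
    {"terminal-without-name-property", "f1769688-9aeb-11e5-91da-b8763fd99c5f"},
)
print(len(valid), len(dangling))

# The terminal asset is missing, so the relationship is dangling
valid_missing, dangling_missing = split_relationships(
    relationships,
    {"f1769688-9aeb-11e5-91da-b8763fd99c5f"},
)
print(len(valid_missing), len(dangling_missing))
```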


Now it is time to inspect the created asset hierarchy and relationships in CDF.

Visit:

https://cog-get-power.fusion.cognite.com/get-power-grid/explore/search/asset/5166671824792444/asset 