![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/main/assets/img/ODPi_Egeria_Logo_color.png)

### Egeria Hands-On Lab
# Welcome to the Open Lineage Lab

## Introduction

Egeria is an open source project that provides open standards and implementation libraries to connect tools, catalogs and platforms together so they can share information (called metadata) about data and the technology that supports it.

In this hands-on lab you will work with Egeria metadata and governance servers and learn how to manually create metadata that describes lineage for data movement processes. For this purpose we use **Open Lineage Services**, a governance server solution designed to capture and manage a historical warehouse of lineage information.
We will also show how to use the general **Egeria UI** to search for data assets and visualize the lineage previously created.

To read more about lineage concepts and features in Egeria, see https://egeria-project.org/features/lineage-management/overview/.

## The Scenario

The Egeria team use the personas and scenarios from the fictitious company called [Coco Pharmaceuticals](https://egeria-project.org/practices/coco-pharmaceuticals/).

On their business transformation journey, after they successfully created a data catalog for the data lake, a new challenge emerged. Due to regulatory requirements, the business asked for improved data traceability. Introducing data lineage for critical data flows was an ideal use case for the next level of maturity in their governance program.

In this lab we discover how to manually catalogue data assets in the data lake and describe the data movement for a simple data transformation process executed by their in-house ETL tool. Finally, users can find the data assets and visualize end-to-end lineage in the web UI.

Peter Profile and Erin Overview have been assigned to work on a solution to capture and report data lineage using Egeria.


## Setting up

Coco Pharmaceuticals make widespread use of Egeria for tracking and managing their data and related assets.
Figure 1 below shows their metadata servers and the Open Metadata and Governance (OMAG) Server Platforms that are hosting them.  Each metadata server supports a department in the organization.  The servers are distributed across the platform to even out the workload.  Servers can be moved to a different platform if needed.

![Figure 1](../images/coco-pharmaceuticals-systems-omag-server-platforms-metadata-server.png)
> **Figure 1:** Coco Pharmaceuticals' OMAG Server Platforms

For the scope of this lab, we are going to interact with two servers hosted on the Data Lake platform, plus the UI platform:
 - `cocoMDS1` as the metadata repository that stores all the assets;
 - `cocoOLS1` as the dedicated governance server that enables open lineage services and the historical lineage repository;
 - `UI platform` running the APIs that support the Egeria UI application.
 
 > **Important**: When running this lab using [`kubernetes deployment`](https://odpi.github.io/egeria-docs/guides/operations/kubernetes/charts/lab/) the UI Platform is already configured and started for you.

The code below checks that the platforms are running.  It checks that the servers are configured and then if they are running on the platform.  If a server is configured, but not running, it will start it.

Look for the "Done." message.  This appears when `environment-check` has finished.


In [None]:
%run ../common/common-functions.ipynb
%run ../common/environment-check.ipynb
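
The kind of check described above can be sketched in a few lines. This is not the code inside `environment-check`; it is a minimal illustration that assumes the platform answers the platform services "origin" query, and the function names `platform_origin_url` and `platform_is_up` are hypothetical.

```python
import ssl
import urllib.error
import urllib.request


def platform_origin_url(platform_url, user_id):
    """Build the URL of the 'origin' query for one OMAG Server Platform."""
    return (platform_url.rstrip("/")
            + "/open-metadata/platform-services/users/" + user_id
            + "/server-platform/origin")


def platform_is_up(platform_url, user_id, timeout=5):
    """Return True if the platform answers the origin query with HTTP 200."""
    # Assumption: the lab platforms use self-signed certificates, so
    # certificate verification is switched off. Do not do this in production.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(platform_origin_url(platform_url, user_id),
                                    timeout=timeout, context=context) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error: treat all as "not running".
        return False
```

Calling `platform_is_up(dataLakePlatformURL, "erinoverview")` would then report whether the Data Lake platform is reachable, assuming that user is allowed to issue the query.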

## Exercise 1

### Capturing lineage manually

In this exercise Peter and Erin start with a minimal use case and execute the steps to create lineage manually. They are looking at a simple, high-level transformation activity implemented using CocoETL, an in-house ETL tool that uses the Python scripting language. Files from previous clinical trials are stored in a server location accessible by the tool. `ConvertFileToCSV` is a script that reads a file coming out of a legacy system of record and transforms it to a CSV file structure.

![Figure 2](../images/open-lineage-service-lab-assets.png)
> Figure 2: Simple asset lineage


For use cases like this one, the **Data Engine Access Service (OMAS)** API is a good match. It enables external data platforms, tools or engines to interact with Egeria and share the metadata needed to construct the lineage graph.


#### Check if assets are present in the catalog

First, Erin wants to confirm that the assets are not yet present in the catalog. She uses the Egeria UI Asset Catalog search option, but first she needs to log in.

> **Important:** When running this lab using kubernetes deployment, make sure that you [expose the Egeria UI](https://odpi.github.io/egeria-docs/guides/operations/kubernetes/charts/lab/#accessing-the-egeria-ui) running in the container to your local network and access it via localhost.



To access Egeria UI go to https://localhost:8443/ 
    
    username: erinoverview
    password: secret

![Erin Logon](../images/egeria-ui-erin-logon.png)
> **Figure 3** Log on as Erin Overview

Erin already knows the descriptive name of the data file asset of interest, so she enters the text "archive" in the search box and selects the type `Asset` from the list.

![Search field input](../images/egeria-ui-asset-catalog-search-field-input.png) 
> **Figure 4** Assets search

The UI doesn't display any entries, meaning no asset matching this input was found. This is expected since at this moment the assets are not yet created.

![No assets found](../images/egeria-ui-asset-catalog-no-rows.png)
> **Figure 5** No assets found

#### Adding assets in the catalog

Peter is now ready to start creating assets using API calls. He uses the Data Engine Access Service (OMAS) REST API available on the `cocoMDS1` metadata server, which is hosted on the Data Lake platform.

In [None]:
platformURL         = dataLakePlatformURL
serverName          = "cocoMDS1"

To be able to call the Data Engine OMAS endpoints, parameters such as the unique qualified name of the tool and its service account are required.

In [None]:
cocoETLName         = "CocoPharma/DataEngine/CocoETL"
cocoETLUser         = "cocoETLnpa"
dataEngineOMASEndpoint = platformURL + '/servers/' + serverName + '/open-metadata/access-services/data-engine/users/' + cocoETLUser
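
The endpoint built above follows a fixed pattern: platform URL, then server name, then access service, then calling user. As a small sketch, the same concatenation can be wrapped in a helper so the parts are easier to see. The function name `data_engine_endpoint` and the sample platform URL are assumptions for illustration; in the lab, `dataLakePlatformURL` comes from the common functions notebook.

```python
def data_engine_endpoint(platform_url, server_name, user_id):
    """Return the root URL for Data Engine OMAS calls made by user_id."""
    return (platform_url.rstrip("/")
            + "/servers/" + server_name
            + "/open-metadata/access-services/data-engine/users/" + user_id)


# Hypothetical platform URL, used here only to show the resulting shape.
example_endpoint = data_engine_endpoint("https://localhost:9443", "cocoMDS1", "cocoETLnpa")
print(example_endpoint)
```

Each request in the following steps appends a resource path such as `/registration` or `/data-files` to this root.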

##### Step 1 - Register the tool

External systems interacting with Egeria through the Data Engine OMAS need to be registered first. This step is required only once, as long as `cocoETLName` does not change.
In our case, to register the tool properly Peter provides descriptive information that helps others understand as much as possible about the characteristics of the external source of metadata.

In [None]:

url = dataEngineOMASEndpoint + '/registration'

requestBody = {
    "dataEngine":
        {
            "qualifiedName": cocoETLName,
            "displayName": "CocoETL",
            "description": "Requesting to register external data engine capability for Coco Pharmaceuticals in-house Data Platform ETL tool CocoETL.",
            "engineType": "DataEngine",
            "engineVersion": "1",
            "enginePatchLevel": "0",
            "vendor": "Coco Pharmaceuticals",
            "version": "1",
            "source": "CocoPharma"
        }
}


print(requestBody)
postAndPrintResult(url, json=requestBody, headers=None)


At this point, the tool is properly registered and its name can be used as *externalSourceName* further on.

> Note: This information gets stored as [`SoftwareCapability`](https://egeria-project.org/types/0/0042-Software-Capabilities) in Egeria.
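
When reading the printed result of a call such as the registration, it helps to know what to look for. The sketch below assumes the response body is JSON carrying a `relatedHTTPCode` field, a `guid` on success and an `exceptionErrorMessage` on failure. That layout, the helper name `summarize_response` and the sample values are assumptions for illustration, so compare them with what your platform actually returns.

```python
def summarize_response(response_body):
    """Turn a parsed JSON response body into a one-line summary."""
    code = response_body.get("relatedHTTPCode")
    if code == 200:
        return "OK, guid=" + str(response_body.get("guid"))
    return "Failed (" + str(code) + "): " + str(response_body.get("exceptionErrorMessage"))


# Made-up sample bodies, not output captured from a real server.
print(summarize_response({"relatedHTTPCode": 200, "guid": "1234-abcd"}))
print(summarize_response({"relatedHTTPCode": 400, "exceptionErrorMessage": "bad request"}))
```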

##### Step 2 - Create file assets
Let's look at the files. They are stored in a well-known server location defined by the `networkAddress` and the file system location.

In [None]:
networkAddress      = "filesrv01.coco.net"
filesRoot           = "file://secured/research/previous-clinical-trials/"

Peter onboards the source file `old-archive.dat`. He is using [DataFile](https://egeria-project.org/types/2/0220-Files-and-Folders/) as fileType. 

In [None]:

url = dataEngineOMASEndpoint + '/data-files'

fileName1 = "old-archive.dat"
filePath1 = filesRoot + fileName1
fileQualifiedName1 = filePath1 + "@" + cocoETLName

requestFileBody = {
    "externalSourceName": cocoETLName,
    "file": {
        "fileType": "DataFile",
        "qualifiedName": fileQualifiedName1,
        "displayName": fileName1,
        "pathName": filePath1,
        "networkAddress": networkAddress,
        "columns": []
    }
}

print(requestFileBody)
postAndPrintResult(url, json=requestFileBody, headers=None)


Next, he calls the same endpoint but this time for the destination file `old-archive.csv`. He is using [CSVFile](https://egeria-project.org/types/2/0220-Files-and-Folders/) as fileType.

In [None]:

url = dataEngineOMASEndpoint + '/data-files'

fileName2 = "old-archive.csv"
filePath2 = filesRoot + fileName2
fileQualifiedName2 = filePath2 + "@" + cocoETLName

requestFileBody = {
    "externalSourceName": cocoETLName,
    "file": {
        "fileType": "CSVFile",
        "qualifiedName": fileQualifiedName2,
        "displayName": fileName2,
        "pathName": filePath2,
        "networkAddress": networkAddress,
        "columns": []
    }
}

print(requestFileBody)
postAndPrintResult(url, json=requestFileBody, headers=None)


> Note that in both calls the columns are not provided, because this exercise focuses only on high-level lineage without schema-level details.
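
The two file requests differ only in file name and file type. As an optional sketch, a small helper can build the request body from those two values and apply the same qualified-name convention (`path@engineName`) used in the cells above. The helper name `build_file_request` is hypothetical; the field names are copied from the requests in this step.

```python
def build_file_request(file_name, file_type, files_root, network_address, engine_name):
    """Build a Data Engine OMAS data-file request body with no columns."""
    path = files_root + file_name
    return {
        "externalSourceName": engine_name,
        "file": {
            "fileType": file_type,
            # Same convention as above: path + "@" + name of the registered tool.
            "qualifiedName": path + "@" + engine_name,
            "displayName": file_name,
            "pathName": path,
            "networkAddress": network_address,
            "columns": [],
        },
    }


example_body = build_file_request(
    "old-archive.csv", "CSVFile",
    "file://secured/research/previous-clinical-trials/",
    "filesrv01.coco.net", "CocoPharma/DataEngine/CocoETL")
print(example_body["file"]["qualifiedName"])
```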

##### Step 3 - Create process assets
Using a suitable name and description for the activity, he then requests a new asset to represent the process.

In [None]:

url = dataEngineOMASEndpoint + '/processes'

activityName = "ConvertFileToCSV"
processQualifiedName = activityName + "@" + cocoETLName

requestProcessBody = {
    "process":
        {
            "qualifiedName": processQualifiedName,
            "displayName": activityName,
            "name": activityName,
            "description": "Process named 'ConvertFileToCSV' representing high level processing activity performed by CocoETL tool.",
            "owner": cocoETLUser,
            "updateSemantic": "REPLACE"
        },
    "externalSourceName": cocoETLName
}

print(requestProcessBody)
postAndPrintResult(url, json=requestProcessBody, headers=None)


Well done. At this point all the assets are stored in the catalog.

#### Adding data flows in the catalog

Finally, he needs to send the data flows connecting the assets. This is done using their fully qualified names.

In [None]:

url = dataEngineOMASEndpoint + '/data-flows'

requestDataFlowsBody = {
    "dataFlows": [
        {
            "dataSupplier": fileQualifiedName1,
            "dataConsumer": processQualifiedName
        },
        {
            "dataSupplier": processQualifiedName,
            "dataConsumer": fileQualifiedName2
        }
    ],
    "externalSourceName": cocoETLName
}

print(requestDataFlowsBody)
postAndPrintResult(url, json=requestDataFlowsBody, headers=None)
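
The request above lists each hop of the lineage by hand: file to process, then process to file. For longer pipelines the same body can be derived from an ordered list of qualified names, pairing each element with its successor. This is only a sketch of that idea; the helper name `chain_data_flows` is hypothetical, and the field names are taken from the request above.

```python
def chain_data_flows(qualified_names, engine_name):
    """Link consecutive qualified names as supplier -> consumer data flows."""
    flows = [
        {"dataSupplier": supplier, "dataConsumer": consumer}
        for supplier, consumer in zip(qualified_names, qualified_names[1:])
    ]
    return {"dataFlows": flows, "externalSourceName": engine_name}


# Short placeholder names stand in for the real qualified names.
example_flows = chain_data_flows(["source-file", "process", "target-file"], "CocoETL")
print(example_flows["dataFlows"])
```

With the variables from this lab, `chain_data_flows([fileQualifiedName1, processQualifiedName, fileQualifiedName2], cocoETLName)` yields the same two data flows as the hand-written request.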


#### Finding assets in the UI and showing lineage

Erin is ready to inspect the catalog again. She goes back to the search page and searches for the text "archive".
This time, she is able to find the file assets Peter created in the previous steps.

> Tip: Once logged on, Erin can directly navigate to the search results using https://localhost:8443/assets/catalog?q=archive&types=Asset
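
The direct link in the tip is an ordinary query string with the search text in `q` and the selected type in `types`. A minimal sketch for building such links for other search terms follows. The helper name `asset_search_url` is hypothetical, and the parameter names are simply read off the URL in the tip.

```python
from urllib.parse import urlencode


def asset_search_url(ui_base_url, query, asset_type="Asset"):
    """Build an Egeria UI asset catalog search link for the given text."""
    # urlencode also escapes spaces and special characters in the query.
    params = urlencode({"q": query, "types": asset_type})
    return ui_base_url.rstrip("/") + "/assets/catalog?" + params


print(asset_search_url("https://localhost:8443/", "archive"))
```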

![Asset catalog search results](../images/egeria-ui-asset-catalog-archive-search-results.png)

Clicking one of the file names, she can access the details page.

![Asset details page](../images/egeria-ui-asset-end-to-end-lineage.png)

To inspect the lineage graph, Erin clicks on `end-to-end`.

![End-to-end lineage graph](../images/egeria-ui-end-to-end-lineage-graph.png)

This step completes **Exercise 1**.