# Part 3: Asset data dive
Let's get started with a guided exploration of the Valhall Platform. In this notebook we will pick one piece of equipment that we visualized in Operational Intelligence and take a closer look at all of the available data!


## Quick links
* Back to the [Hackathon github repo](https://github.com/cognitedata/open-industrial-data/tree/master/workshops/uni-hackathon)
* Documentation of [CDP concepts](https://doc.cognitedata.com/concepts/)
* Reference documentation for the [Python SDK](https://cognite-sdk-python.readthedocs-hosted.com/en/latest/)
<hr>

# Step 0: Environment Setup

#### Install the Cognite SDK package

In [None]:
# if you're working in google colab or similar
!pip install -q cognite-sdk

#### Import the required packages

In [None]:
%matplotlib inline

import os
from datetime import datetime, timedelta
from getpass import getpass
from typing import List, Any
from itertools import islice

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

from cognite import CogniteClient

import networkx as nx
from networkx.algorithms.traversal.depth_first_search import dfs_tree
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)


pd.set_option('display.max_rows', 10)

If you are using Google Colab, you have to call `configure_plotly_browser_state()` in every cell that creates plotly graphs.

In [None]:
def configure_plotly_browser_state():
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))

#### Connect to the Cognite Data Platform
The SDK client is the entrypoint to all data in CDP, and simply requires the API key that you generated in Part 1.

When prompted for your API key, use the key generated by open industrial data as mentioned in the Getting Started steps.

In [None]:
client = CogniteClient(api_key=getpass("Open Industrial Data API-KEY: "))

# Step 1: Learn about Organizing Industrial Data

The Cognite Data Platform organizes digital information about the physical world. The building blocks of this representation are called *resources*, which you can read up on in detail [here](https://doc.cognitedata.com/concepts/#core-concepts).

An important resource to understand is the Asset resource. This is the foundation for organizing industrial data -- time series, work orders, event logs and arbitrary files -- from across complex industrial systems.
Assets are linked together with parent-child relationships to build a top-down hierarchical tree, known as "The Asset Hierarchy".
For example, an Asset Hierarchy could look like this:
```
  Gas Export Compressor
    |- First stage export compressor
    |    |- Compressor
    |    |- Scrubber
    |    |- ...
    |- Second stage export compressor
    |- ...
```
Timeseries, events, files and other resources are attached to each Asset.

The hierarchical structure can make it easier to find the timeseries data that you're looking for. Though there are [other ways](https://doc.cognitedata.com/concepts/#_3d-models-and-revisions) to do this, we'll focus on using the hierarchy today!
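
As a rough illustration of the parent-child idea, the sketch below rebuilds a small tree from flat records in plain Python. The records, ids and names are made up for this example; it only assumes that each asset carries an `id`, a `parentId` and a `name`, as the DataFrames later in this notebook do.

```python
from collections import defaultdict

# Hypothetical flat asset records (ids and names invented for illustration).
assets = [
    {"id": 1, "parentId": None, "name": "Gas Export Compressor"},
    {"id": 2, "parentId": 1, "name": "First stage export compressor"},
    {"id": 3, "parentId": 1, "name": "Second stage export compressor"},
    {"id": 4, "parentId": 2, "name": "Compressor"},
    {"id": 5, "parentId": 2, "name": "Scrubber"},
]

# Index the children of every asset by the parent's id.
children = defaultdict(list)
for asset in assets:
    children[asset["parentId"]].append(asset)


def render(parent_id=None, depth=0):
    """Return the hierarchy below `parent_id` as a list of indented lines."""
    lines = []
    for asset in children.get(parent_id, []):
        lines.append("  " * depth + asset["name"])
        lines.extend(render(asset["id"], depth + 1))
    return lines


tree_lines = render()
print("\n".join(tree_lines))
```

The root is simply the asset whose `parentId` is empty; everything else hangs off it through the `parentId` links.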

In [None]:
# download a sample of assets up to a certain depth in the hierarchy
df_sample_assets = client.assets.get_assets(limit=1000, depth=6).to_pandas().sort_values('depth')
df_sample_assets

To make this clearer, let's visualize our asset hierarchy.
The cells below define a few helper functions for generating tree plots; don't be scared of them :)

In [None]:
def sliding_window(seq, n=2):
    """ iterator, which generates a sliding window for `n` elements from a given iterator """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

In [None]:
def make_assets_tree(df: pd.DataFrame) -> nx.DiGraph:
    """ generates directional graph of assets from a given dataframe """
    G = nx.DiGraph()
    for path in df['path'].values:
        for parent_id, child_id in sliding_window(path, n=2):
            G.add_edge(parent_id, child_id)

    return G
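
To see what the two helpers above do together, here is a small dependency-free sketch. It assumes, as the code above does, that each `path` is the list of asset ids from the root down to the asset; the ids themselves are invented. Pairing consecutive ids yields the parent-child edges, and a plain `set` stands in for the graph so that repeated edges collapse.

```python
# Hypothetical `path` values: ids from the root down to each asset.
paths = [
    [1],
    [1, 2],
    [1, 2, 4],
    [1, 3],
]

edges = set()
for path in paths:
    # zip(path, path[1:]) pairs each id with its successor, which gives the
    # same pairs as sliding_window(path, n=2).
    for parent_id, child_id in zip(path, path[1:]):
        edges.add((parent_id, child_id))

print(sorted(edges))
```

A path of length one (the root itself) contributes no edges, which is why the root only shows up as a parent.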

In [None]:
def hierarchy_pos(G, root=None, width=10000., vert_gap = 0.2, vert_loc = 0, xcenter = 0.5):
    '''
    If the graph is a tree this will return the positions to plot this in a 
    hierarchical layout.

    G: the graph (must be a tree)

    root: the root node of current branch 
    - if the tree is directed and this is not given, the root will be found and used
    - if the tree is directed and this is given, then the positions will be just for the descendants of this node.
    - if the tree is undirected and not given, then a random choice will be used.

    width: horizontal space allocated for this branch - avoids overlap with other branches

    vert_gap: gap between levels of hierarchy

    vert_loc: vertical location of root

    xcenter: horizontal location of root
    '''
    if not nx.is_tree(G):
        raise TypeError('cannot use hierarchy_pos on a graph that is not a tree')

    def _hierarchy_pos(G, root, width=1., vert_gap = 0.2, vert_loc = 0, xcenter = 0.5, pos = None, parent = None):
        '''
        see hierarchy_pos docstring for most arguments

        pos: a dict saying where all nodes go if they have been assigned
        parent: parent of this branch. - only affects it if non-directed

        '''

        if pos is None:
            pos = {root:(xcenter,vert_loc)}
        else:
            pos[root] = (xcenter, vert_loc)
        children = list(G.neighbors(root))
        if not isinstance(G, nx.DiGraph) and parent is not None:
            children.remove(parent)  
        if len(children)!=0:
            dx = width/len(children) 
            nextx = xcenter - width/2 - dx/2
            for child in children:
                nextx += dx
                pos = _hierarchy_pos(G,child, width = dx, vert_gap = vert_gap, 
                                    vert_loc = vert_loc-vert_gap, xcenter=nextx,
                                    pos=pos, parent = root)
        return pos


    return _hierarchy_pos(G, root, width, vert_gap, vert_loc, xcenter)

In [None]:
def get_label(id_):
    """ Get asset's name by given asset id """
    asset_info = client.assets.get_asset(id_)
    return asset_info.to_json()['name']

In [None]:
def make_assets_tree_plot(df: pd.DataFrame, root_id: int=None, max_depth: int=None) -> go.Figure:
    assets_tree = make_assets_tree(df)  # use the DataFrame passed in, not the global one
    
    if root_id is None:
        # the first node in topological order is the root; next(iter(...)) works
        # whether topological_sort returns a list (networkx 1.11) or a generator
        root_id = next(iter(nx.topological_sort(assets_tree)))
    
    assets_tree = dfs_tree(assets_tree, source=root_id, depth_limit=max_depth)
    pos = hierarchy_pos(assets_tree, root=root_id)
    
    # extract node coordinates and labels
    Xn = [pos[i][0] for i in pos.keys()]
    Yn = [pos[i][1] for i in pos.keys()]
    labels = [get_label(id_) for id_ in pos.keys()]
    
    # extract edges from tree
    Xe = list()
    Ye = list()
    for e in assets_tree.edges():
        Xe.extend([pos[e[0]][0], pos[e[1]][0], None])
        Ye.extend([pos[e[0]][1], pos[e[1]][1], None])
    
    # make plotly traces
    trace_nodes=dict(
      type='scatter',
      x=Xn, 
      y=Yn,
      mode='markers',
      marker=dict(size=20, color='rgb(0, 0, 204)'),
      text=labels,
      hoverinfo='text'
    )
    trace_edges=dict(
      type='scatter',
      mode='lines',
      x=Xe,
      y=Ye,
      line=dict(width=1, color='rgb(25,25,25)'),
      hoverinfo='none' 
    )
    
    # some pretty details
    axis=dict(
        showline=False, # hide axis line, grid, ticklabels and  title
        zeroline=False,
        showgrid=False,
        showticklabels=False,
        title='' 
    )
    layout = go.Layout(
      autosize=True,
      showlegend=False,
      xaxis=axis,
      yaxis=axis,
      hovermode='closest',
    )
    
    fig = go.Figure(data=[trace_edges, trace_nodes], layout=layout)
    return fig

In [None]:
configure_plotly_browser_state()

fig = make_assets_tree_plot(df_sample_assets)
iplot(fig)

You may find this overcomplicated, but it's a good way to explore your assets interactively. You can also play with the `root_id` and `max_depth` arguments of `make_assets_tree_plot()` to plot a single branch in more detail.
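
The effect of `root_id` and `max_depth` can be pictured with a plain-Python, depth-limited walk. This is only a sketch of the idea behind `dfs_tree(..., depth_limit=...)`, not networkx's implementation, and the adjacency dict is invented for the example.

```python
# Hypothetical adjacency: parent id -> list of child ids.
children = {1: [2, 3], 2: [4, 5], 4: [6]}


def subtree_edges(root_id, max_depth=None):
    """Collect edges reachable from `root_id`, at most `max_depth` levels down."""
    edges = []
    stack = [(root_id, 0)]
    while stack:
        node, depth = stack.pop()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand nodes that sit at the depth limit
        for child in children.get(node, []):
            edges.append((node, child))
            stack.append((child, depth + 1))
    return edges


one_level = sorted(subtree_edges(1, max_depth=1))
branch = sorted(subtree_edges(2))
print(one_level)
print(branch)
```

Choosing a different root restricts the plot to one branch, and a small depth limit keeps the number of nodes (and label lookups) manageable.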

In [None]:
df_sample_assets = client.assets.get_assets(limit=2000, depth=15).to_pandas().sort_values('depth')

In [None]:
configure_plotly_browser_state()

fig = make_assets_tree_plot(df_sample_assets, root_id=4518112062673878, max_depth=4)
iplot(fig)

# Step 2: Pick an asset for further investigation
For the rest of the workshop, you'll be working with one of the subsystems that you visualized in the [LIVE Operational Intelligence System Overview](https://opint.cogniteapp.com/publicdata/infographics/-LOHKEJPLvt0eRIZu8mE) (see below).
Either pick an asset yourself below, or let fate decide ;)


In [None]:
import random
SYSTEM_OVERVIEW_ASSETS = [
    '23-ESDV-92501A',
    '23-ESDV-92501B',
    '23-HA-9103',
    '23-PV-92538',
    '23-VG-9101',
    '23-KA-9101',
    '23-HA-9115',
    '23-HA-9114',
    '23-FV-92543',
    '23-ESDV-92551A',
    '23-ESDV-92551B',
]

In [None]:
# fetch the asset metadata from CDP using the assets client

df_system_overview_assets = pd.concat([
    client.assets.get_assets(name=n).to_pandas()
    for n in SYSTEM_OVERVIEW_ASSETS
])[['name', 'id', 'parentId', 'description']].set_index('name')

df_system_overview_assets

In [None]:
# Choose an asset for analysis, or let fate decide :)

asset_name = random.choice(SYSTEM_OVERVIEW_ASSETS)

asset_id = df_system_overview_assets.loc[asset_name, 'id']

print("And my asset is!")
df_system_overview_assets.loc[asset_name]

# Step 3: Find all the timeseries for your asset

The interface `client.assets.get_asset_subtree()` can be used to retrieve all of the *children* of an Asset. The `depth` parameter sets how far we traverse down the hierarchy.

In [None]:
df_asset_children = client.assets.get_asset_subtree(
    asset_id=df_system_overview_assets.loc[asset_name, 'id'],
    depth=10
).to_pandas().sort_values('depth')
df_asset_children[['depth', 'id', 'parentId', 'description']]

... Assets are interesting to see how things are put together, but what I'm sure you're really after are those petabytes of **time series**; those beautiful pressure (PT), temperature (TT) and flow (FT) sensors that have recorded the life of the platform for the last few years.

First we need to find all these time series. We can use the `path` parameter in the `time_series` client to get all the time series attached to assets below our system overview asset. Note that this parameter maps directly to the CDP API, and therefore needs to provide the asset id formatted carefully as a json string: `"[id, ]"`.
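
One detail worth guarding against: ids read out of a pandas DataFrame are typically NumPy integers rather than plain `int`s, and depending on the NumPy version their text form inside a list may not be a bare number. A defensive sketch is to cast to `int` and serialize with `json.dumps`. The helper name `asset_path_filter` is hypothetical and not part of the SDK, and the id is just an example value.

```python
import json


def asset_path_filter(asset_id):
    """Format a single asset id as a JSON array string, e.g. "[123]"."""
    # int() turns NumPy integer scalars into plain Python ints, which
    # json.dumps can serialize.
    return json.dumps([int(asset_id)])


path_filter = asset_path_filter(4518112062673878)
print(path_filter)
```

The resulting string could then be passed as the `path` argument in place of `str([asset_id])`.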

In [None]:
df_asset_children_timeseries = client.time_series.get_time_series(path=str([asset_id])).to_pandas()
df_asset_children_timeseries

Great! We have discovered the timeseries below our asset!

**Note**: CDP can also store string and step timeseries. Step timeseries have different aggregation methods, and support dead-band-compression for time series that do not change very often (e.g. valve opening angles).
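
Dead-band compression can be sketched in a few lines: a new point is stored only when it moves more than a chosen threshold away from the last stored value. The function below illustrates the general idea with invented data and an arbitrary threshold; it is not a description of how CDP implements it.

```python
def deadband_compress(points, threshold):
    """Keep (timestamp, value) points that differ from the last kept value
    by more than `threshold`. The first point is always kept."""
    kept = []
    for timestamp, value in points:
        if not kept or abs(value - kept[-1][1]) > threshold:
            kept.append((timestamp, value))
    return kept


# Hypothetical valve opening angle in degrees, sampled once per second
# (timestamps in milliseconds).
raw = [(0, 10.0), (1000, 10.1), (2000, 10.05), (3000, 12.0), (4000, 12.2), (5000, 15.0)]
compressed = deadband_compress(raw, threshold=0.5)
print(compressed)
```

For a signal that mostly sits still, this stores only the moments where something actually changed.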

# Step 4: Into the timeseries datapoints!
In CDP we do some very clever things in the backend to store and serve up timeseries just the way you like them:
- Store the timeseries (timestamp, value) in their raw format, because one day we'll need it
- Precompute aggregations for millisecond response times
- Build tabular structures server side
- Enable natural language time specifications (e.g. `start='8d-ago'` and `granularity='10m'`)

So once you've located the timeseries you're interested in analyzing, the `datapoints` client has several options for downloading the data.
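
To make aggregates and granularity concrete, here is a plain-Python sketch of an hourly average over raw (timestamp, value) pairs. It only illustrates what `aggregates=['avg']` with `granularity='1h'` means conceptually; the data is invented, and in CDP the computation happens server side.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000


def hourly_average(points):
    """Average (timestamp_ms, value) points per hour bucket.

    Returns a dict mapping each bucket's start time (ms) to the mean value.
    """
    buckets = defaultdict(list)
    for timestamp_ms, value in points:
        bucket_start = timestamp_ms - timestamp_ms % HOUR_MS
        buckets[bucket_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical raw datapoints: three in the first hour, one in the second.
raw = [(0, 1.0), (600_000, 2.0), (1_200_000, 3.0), (3_600_000, 10.0)]
averages = hourly_average(raw)
print(averages)
```

Each returned row then corresponds to one bucket, which is why an hourly frame over 30 days stays small no matter how dense the raw data is.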

**Note:** The timestamp column is represented throughout CDP as milliseconds since epoch time. Pandas offers an easy conversion to python datetime with `pd.to_datetime(<column/value>, unit='ms')`.
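
For a single value, the same conversion can be done with the standard library alone. The helper name below is made up for this sketch. Note that it returns a timezone-aware UTC datetime, whereas `pd.to_datetime(..., unit='ms')` returns timezone-naive timestamps.

```python
from datetime import datetime, timezone


def ms_to_datetime(timestamp_ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


converted = ms_to_datetime(1546300800000)
print(converted.isoformat())
```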

In [None]:
# set aside string time series for now because they do not aggregate together with numerical time series
# consider investigating the string timeseries in part 4b
lst_timeseries = df_asset_children_timeseries[~df_asset_children_timeseries['isString']]['name'].tolist()
lst_timeseries

In [None]:
df_data = client.datapoints.get_datapoints_frame(
    time_series=lst_timeseries,
    aggregates=['avg'],
    granularity='1h',
    start='30d-ago',
)

df_data = df_data.set_index(pd.to_datetime(df_data['timestamp'], unit='ms')).drop('timestamp', axis=1)
df_data

In [None]:
# plot up to 10 randomly chosen columns (one column per time series)
df_plot_sample = df_data[random.sample(list(df_data.columns), min(10, len(df_data.columns)))]

df_plot_sample.plot(subplots=True, figsize=(20,20));

# Congratulations, you are done with part 3!

Save your notebook, and remember your asset for the next part, where we dig deeper into the data.