# Part 3: Asset data dive
Let's get started with a guided exploration of the Valhall platform. In this notebook we will pick one piece of equipment that we visualized in Operational Intelligence and take a closer look at all of its available data!


## Quick links
* Back to the [Hackathon github repo](https://github.com/cognitedata/open-industrial-data/tree/master/workshops/uni-hackathon)
* Documentation of [CDP concepts](https://doc.cognitedata.com/concepts/)
* Reference documentation for the [Python SDK](https://cognite-sdk-python.readthedocs-hosted.com/en/latest/)
<hr>

# Step 0: Environment Setup

#### Install the Cognite SDK package

In [None]:
# if you're working in google colab or similar
!pip install -q cognite-sdk

#### Import the required packages

In [None]:
%matplotlib inline

import os
from datetime import datetime, timedelta
from getpass import getpass

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

from cognite import CogniteClient

pd.set_option('display.max_rows', 10)

#### Connect to the Cognite Data Platform
The SDK client is the entry point to all data in CDP, and it only requires the API key that you generated in Part 1.

When prompted for your API key, use the key generated by Open Industrial Data, as mentioned in the Getting Started steps.
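
If you restart the notebook often, typing the key every time gets tedious. The following is a minimal sketch, not part of the SDK, that looks for the key in an environment variable first and only prompts when it is missing. The variable name `COGNITE_API_KEY` and the helper `load_api_key` are assumptions made for this example.

```python
import os
from getpass import getpass


def load_api_key(env=os.environ, prompt=getpass):
    """Return the API key from the environment, or ask for it interactively."""
    # COGNITE_API_KEY is a hypothetical variable name chosen for this sketch
    key = env.get("COGNITE_API_KEY")
    if not key:
        key = prompt("Open Industrial Data API-KEY: ")
    # strip stray whitespace that often sneaks in when copy-pasting keys
    return key.strip()
```

The result could then be passed to the client in place of the `getpass(...)` call, for example `CogniteClient(api_key=load_api_key())`.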

In [None]:
client = CogniteClient(api_key=getpass("Open Industrial Data API-KEY: "))

# Step 1: Learn about Organizing Industrial Data

The Cognite Data Platform organizes digital information about the physical world. The building blocks of this representation are called *resources*, which you can read up on in detail [here](https://doc.cognitedata.com/concepts/#core-concepts).

An important resource to understand is the Asset resource. This is the foundation for organizing industrial data -- time series, work orders, event logs and arbitrary files -- from across complex industrial systems.
Assets are linked together with parent-child relationships to build a top-down hierarchical tree, known as "The Asset Hierarchy".
For example, an Asset Hierarchy could look like this:
```
  Gas Export Compressor
    |- First stage export compressor
    |    |- Compressor
    |    |- Scrubber
    |    |- ...
    |- Second stage export compressor
    |- ...
```
Timeseries, events, files and other resources are attached to each Asset.
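
To make the parent-child idea concrete, here is a small sketch that models the example hierarchy above with plain Python dictionaries. The records and ids are made up for illustration; only the `id`, `name` and `parentId` fields mirror what you will see in the asset tables later in this notebook. Counting the root as depth 0 is an assumption of this sketch.

```python
# Hypothetical asset records: only 'id', 'name' and 'parentId' matter here
assets = [
    {"id": 1, "name": "Gas Export Compressor", "parentId": None},
    {"id": 2, "name": "First stage export compressor", "parentId": 1},
    {"id": 3, "name": "Compressor", "parentId": 2},
    {"id": 4, "name": "Scrubber", "parentId": 2},
    {"id": 5, "name": "Second stage export compressor", "parentId": 1},
]

by_id = {a["id"]: a for a in assets}


def path_to_root(asset_id):
    """Return the asset names from the root down to the given asset."""
    names = []
    current = by_id[asset_id]
    while current is not None:
        names.append(current["name"])
        parent_id = current["parentId"]
        current = by_id[parent_id] if parent_id is not None else None
    return list(reversed(names))


def depth(asset_id):
    """Depth in the tree, counting the root as 0 (an assumption of this sketch)."""
    return len(path_to_root(asset_id)) - 1


print(" > ".join(path_to_root(4)))
```

Following `parentId` links upwards is all it takes to place any asset in the tree, which is why a single parent reference per asset is enough to describe the whole hierarchy.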

The hierarchical structure can make it easier to find the timeseries data that you're looking for. Though there are [other ways](https://doc.cognitedata.com/concepts/#_3d-models-and-revisions) to do this, we'll focus on using the hierarchy today!

In [None]:
# download a sample of assets up to a certain depth in the hierarchy
df_sample_assets = client.assets.get_assets(limit=1000, depth=4).to_pandas().sort_values('depth')
df_sample_assets

# Step 2: Pick an asset for further investigation
For the rest of the workshop, you'll be working with one of the subsystems that you visualized in the [LIVE Operational Intelligence System Overview](https://opint.cogniteapp.com/publicdata/infographics/-LOHKEJPLvt0eRIZu8mE) (see below).
Either pick an asset yourself below, or let fate decide ;)

![Operational Intelligence System Overview](SystemOverview.png)



In [None]:
import random
SYSTEM_OVERVIEW_ASSETS = [
    '23-ESDV-92501A',
    '23-ESDV-92501B',
    '23-HA-9103',
    '23-PV-92538',
    '23-VG-9101',
    '23-KA-9101',
    '23-HA-9115',
    '23-HA-9114',
    '23-FV-92543',
    '23-ESDV-92551A',
    '23-ESDV-92551B',
]

In [None]:
# fetch the asset metadata from CDP using the assets client

df_system_overview_assets = pd.concat([
    client.assets.get_assets(name=n).to_pandas()
    for n in SYSTEM_OVERVIEW_ASSETS
])[['name', 'id', 'parentId', 'description']].set_index('name')

df_system_overview_assets

In [None]:
# Choose an asset for analysis, or let fate decide :)

asset_name = random.choice(SYSTEM_OVERVIEW_ASSETS)

asset_id = df_system_overview_assets.loc[asset_name, 'id']

print("And my asset is:")
df_system_overview_assets.loc[asset_name]

# Step 3: Find all the timeseries for your asset

The interface `client.assets.get_asset_subtree()` can be used to retrieve all of the *children* of an Asset. The `depth` parameter sets how far we traverse down the hierarchy.
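
Conceptually, a depth-limited subtree lookup is a breadth-first walk that stops descending once the limit is reached. The sketch below illustrates that idea on made-up ids; it is not how the API is implemented, and the names `PARENT_OF` and `subtree` are hypothetical. It includes the starting asset at relative depth 0, which is a choice of this sketch.

```python
from collections import defaultdict, deque

# Hypothetical mapping of asset id -> parent id; None marks the root
PARENT_OF = {1: None, 2: 1, 3: 2, 4: 2, 5: 1, 6: 3}


def subtree(root_id, max_depth):
    """Return (asset_id, relative_depth) pairs, breadth first, up to max_depth."""
    # invert the parent links into a parent -> children index
    children = defaultdict(list)
    for asset_id, parent_id in PARENT_OF.items():
        if parent_id is not None:
            children[parent_id].append(asset_id)

    result = []
    queue = deque([(root_id, 0)])
    while queue:
        asset_id, level = queue.popleft()
        result.append((asset_id, level))
        # only descend further while we are above the depth limit
        if level < max_depth:
            for child_id in children[asset_id]:
                queue.append((child_id, level + 1))
    return result


print(subtree(2, max_depth=1))
```

With `max_depth=1` only the direct children of asset 2 are returned; a generous limit such as 10 returns everything below the starting asset in a shallow tree.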

In [None]:
df_asset_children = client.assets.get_asset_subtree(
    asset_id=df_system_overview_assets.loc[asset_name, 'id'],
    depth=10
).to_pandas().sort_values('depth')
df_asset_children[['depth', 'id', 'parentId', 'description']]

Assets are interesting for seeing how things are put together, but what I'm sure you're really after are those petabytes of **time series**: those beautiful pressure (PT), temperature (TT) and flow (FT) sensors that have recorded the life of the platform for the last few years.

First we need to find all these time series. We can use the `path` parameter in the `time_series` client to get all the time series attached to assets below our system overview asset. Note that this parameter maps directly to the CDP API, so the asset id must be formatted carefully as a JSON list string: `"[id, ]"`.
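
One way to build that string safely is sketched below. The helper name `asset_path_filter` is hypothetical. The cast to `int` is a precaution: ids read from a pandas DataFrame are NumPy integers, and depending on the NumPy version their representation inside a Python list may not be valid JSON.

```python
import json


def asset_path_filter(asset_id):
    """Format a single asset id as a JSON list string such as '[42]'."""
    # int() guards against NumPy integer types, whose repr inside a list
    # may not be valid JSON (e.g. 'np.int64(42)' on recent NumPy versions)
    return json.dumps([int(asset_id)])


print(asset_path_filter(42))  # prints [42]
```

For a plain Python integer this gives the same result as `str([asset_id])`.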

In [None]:
# cast to a plain int so the id is rendered as "[<id>]" regardless of its NumPy type
df_asset_children_timeseries = client.time_series.get_time_series(path=str([int(asset_id)])).to_pandas()
df_asset_children_timeseries

Great! We have discovered the timeseries below our asset!

**Note**: CDP can also store string and step timeseries. Step timeseries have different aggregation methods, and support dead-band compression for time series that do not change very often (e.g. valve opening angles).
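
If dead-band compression is new to you, the sketch below shows the general idea: a new point is stored only when its value has moved more than a chosen band away from the last stored value. This is a simplified illustration and not necessarily the exact scheme CDP uses; the function name and the sample valve angles are made up.

```python
def deadband_compress(points, deadband):
    """Keep a point only when its value moved more than `deadband`
    away from the last kept value. `points` is a list of (timestamp_ms, value)."""
    if not points:
        return []
    kept = [points[0]]  # always keep the first observation
    for timestamp, value in points[1:]:
        if abs(value - kept[-1][1]) > deadband:
            kept.append((timestamp, value))
    return kept


# hypothetical valve opening angle that jitters, then jumps once
valve_angle = [(0, 10.0), (1000, 10.1), (2000, 10.05), (3000, 25.0), (4000, 25.2)]
print(deadband_compress(valve_angle, deadband=0.5))
```

Here five raw points shrink to two, because only the jump from about 10 to 25 degrees exceeds the band. Between stored points the value is treated as constant, which is what makes such a series a step series.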

# Step 4: Into the timeseries datapoints!
In CDP we do some very clever things in the backend to store and serve up timeseries just the way you like them:
- Store the timeseries (timestamp, value) in their raw format, because one day we'll need it
- Precompute aggregations for millisecond response times
- Build tabular structures server side
- Enable natural language time specifications (e.g. `start='8d-ago'` and `granularity='10m'`)

So once you've located the timeseries you're interested in analyzing, the `datapoints` client has several options for downloading the data.
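
To build some intuition for what a relative start time and a granularity mean, here is a rough sketch that resolves specs like `'8d-ago'` and `'10m'` to milliseconds and averages raw points per time bucket. It is only an illustration of the concepts, not the SDK's or the backend's implementation; all function names are hypothetical and only the units `s`, `m`, `h` and `d` are handled.

```python
import re
from collections import defaultdict

UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def duration_ms(spec):
    """Convert a spec such as '10m' or '1h' to milliseconds."""
    match = re.fullmatch(r"(\d+)([smhd])", spec)
    if match is None:
        raise ValueError(f"unsupported duration: {spec!r}")
    return int(match.group(1)) * UNIT_MS[match.group(2)]


def resolve_start(spec, now_ms):
    """Convert a relative spec such as '8d-ago' to milliseconds since epoch."""
    if not spec.endswith("-ago"):
        raise ValueError(f"unsupported start: {spec!r}")
    return now_ms - duration_ms(spec[: -len("-ago")])


def average_by_bucket(points, granularity):
    """Average (timestamp_ms, value) pairs per bucket of the given granularity."""
    width = duration_ms(granularity)
    buckets = defaultdict(list)
    for timestamp, value in points:
        # each point falls into the bucket that starts at a multiple of `width`
        buckets[timestamp - timestamp % width].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
```

For example, `average_by_bucket([(0, 1.0), (1000, 3.0), (3_600_000, 5.0)], '1h')` groups the first two points into the hour starting at 0 and the third into the next hour. Precomputing such buckets on the server is what lets aggregated queries return quickly even for long time ranges.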

**Note:** The timestamp column is represented throughout CDP as milliseconds since epoch time. Pandas offers an easy conversion to Python datetimes with `pd.to_datetime(<column/value>, unit='ms')`.
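
For single values outside a DataFrame, the same conversion can be done with the standard library. The helpers below are a small sketch with hypothetical names. They return timezone-aware UTC datetimes, whereas `pd.to_datetime(..., unit='ms')` returns timezone-naive timestamps by default.

```python
from datetime import datetime, timezone


def ms_to_datetime(timestamp_ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


def datetime_to_ms(moment):
    """Convert a timezone-aware datetime back to milliseconds since the epoch."""
    return int(round(moment.timestamp() * 1000))


print(ms_to_datetime(1_546_300_800_000).isoformat())  # prints 2019-01-01T00:00:00+00:00
```

The reverse helper is handy when you want to pass an explicit start or end time as a millisecond timestamp.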

In [None]:
# set aside string time series for now because they do not aggregate together with numerical time series
# consider investigating the string timeseries in part 4b
lst_timeseries = df_asset_children_timeseries[~df_asset_children_timeseries['isString']]['name'].tolist()
lst_timeseries

In [None]:
df_data = client.datapoints.get_datapoints_frame(
    time_series=lst_timeseries,
    aggregates=['avg'],
    granularity='1h',
    start='30d-ago',
)

df_data = df_data.set_index(pd.to_datetime(df_data['timestamp'], unit='ms')).drop('timestamp', axis=1)
df_data

In [None]:
# plot up to 10 randomly chosen time series (columns)
df_plot_sample = df_data[random.sample(list(df_data.columns), min(10, len(df_data.columns)))]

df_plot_sample.plot(subplots=True, figsize=(20,20));

# Congratulations, you are done with part 3!

Save your notebook, and remember your asset for the next part, where we dig deeper into the data.