# [Tour of the Cognite Data Platform using the SDK](https://doc.cognitedata.com)
*January 2019*

>This notebook provides a quick reference for common access patterns to the Cognite Data Platform (CDP) via the [Python SDK](https://github.com/cognitedata/cognite-sdk-python).
For a detailed explanation of concepts please see [our documentation homepage](https://doc.cognitedata.com).

<hr>

# Prerequisites

- Install the dependencies, including the `cognite-sdk`, with either `pip` (requirements.txt) or `pipenv` (Pipfile)
- Make sure that you have received an API key from http://openindustrialdata.com/ 
- Set the API key as an environment variable `PUBLICDATA_API_KEY`, or paste it into the prompt.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
import os

import datetime

from cognite import CogniteClient

# 1.  [Projects](https://doc.cognitedata.com/concepts/#projects) and Authentication with API keys

Projects are the top-level unit of organization in CDP. Each customer normally has a single project.

Authentication to CDP is done using API keys.
Keys are granted different permission levels.
For Open Industrial Data in this tutorial we only provide read access. Best practice is to use read-only keys for exploratory analysis and development, and to use keys with write access only when deploying.

In [None]:
from getpass import getpass
API_KEY = os.environ['PUBLICDATA_API_KEY'] if 'PUBLICDATA_API_KEY' in os.environ else getpass("Enter API key:")
client = CogniteClient(api_key=API_KEY, project="publicdata")

# 2.  [Assets](https://doc.cognitedata.com/concepts/#assets)

The data in the Cognite Data Platform is structured by assets, where an asset can be a specific piece of equipment or an equipment type.

### Discovering assets
To discover assets in the hierarchy, `client.assets.search_for_assets` provides text search, but it is often more practical to load the entire hierarchy with `client.assets.get_assets()` and take advantage of pandas' filtering.
When downloading large hierarchies, be sure to set `autopaging=True` to get all the records.

In [None]:
# when you know the equipment, fuzzy search through names
client.assets.search_for_assets(name='23-KA-9101xxx', limit=5).to_pandas()

In [None]:
# or even wild card search through the asset descriptions
client.assets.search_for_assets(query="lube oil", limit=5).to_pandas()

In [None]:
# fetch all assets
df_assets = client.assets.get_assets().to_pandas()

# fetch the highest level assets that contain the word compressor
df_assets[df_assets.description.str.lower().str.contains("compressor")].sort_values('depth').head(5)

### Traversing the asset hierarchy
Once you know the asset id, there are further options for traversing the hierarchy.

The `get_assets()` function maps to the low level API functions. The following parameters are useful:
 - `path`: Fetches all assets below a certain node in the hierarchy. It is passed directly to the API and therefore requires a JSON list of ids (e.g. `"[1,2,3]"`).
 - `depth`: How many levels down the asset hierarchy to traverse
 
The above parameters are useful to understand because they appear throughout the API, but more intuitive access to subtrees is available with `client.assets.get_asset_subtree`.
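
Because `path` is passed straight through to the API, one safe way to build it is to serialize a Python list with the standard `json` module rather than formatting the string by hand. A minimal sketch, where `to_path_param` is a hypothetical helper and not part of the SDK:

```python
import json


def to_path_param(asset_ids):
    """Serialize asset ids into a JSON list string suitable for `path`.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    # int() also converts numpy integers (e.g. ids read from a DataFrame),
    # which json.dumps cannot serialize directly.
    return json.dumps([int(asset_id) for asset_id in asset_ids])


path_param = to_path_param([4518112062673878])
print(path_param)
```

The resulting string can then be passed as `path=path_param`. For plain Python integers, `str([...])` produces the same text.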


In [None]:
# Display the hierarchy below the first stage compressor node
dcols = ["depth", "description", "id", "name", "parentId", "path"]
FIRST_STAGE_COMPRESSOR_ID = 4518112062673878
df_asset_sample = client.assets.get_asset_subtree(FIRST_STAGE_COMPRESSOR_ID, depth=1)\
    .to_pandas()\
    .sort_values(by=["depth"])

df_asset_sample.head(5)

**Tip:** Each returned asset carries information about its position in the hierarchy.
`parentId` gives the id of the immediate parent, and `path` gives the nodes up to the root node of the project.
This is useful for grabbing assets further up the hierarchy:

In [None]:
# show the lineage of an asset using path
lineage = df_asset_sample.loc[0, 'path']
print("path: ", lineage)

# show the parent of the first stage compressor together with the parent's
# children, i.e. the compressor and its siblings
client.assets.get_asset_subtree(lineage[-2], depth=1).to_pandas().sort_values("depth")
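
The same navigation can be tried offline on a small hand-made table. The sketch below uses made-up ids and names, and assumes that `path` lists the ids from the root down to the asset itself, so that the second-to-last entry is the parent:

```python
import pandas as pd

# Toy hierarchy with made-up ids: 1 is the root, 2 and 3 are its children,
# and 4 is a child of 2.
toy_assets = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["root", "compressor", "cooler", "scrubber"],
    "parentId": [None, 1, 1, 2],
    "path": [[1], [1, 2], [1, 3], [1, 2, 4]],
})

# Lineage of the "compressor" row, from the root down to the asset itself.
toy_lineage = toy_assets.loc[1, "path"]
parent_id = toy_lineage[-2]

# Siblings share the same parentId (the asset itself is excluded).
siblings = toy_assets[
    (toy_assets["parentId"] == parent_id) & (toy_assets["id"] != 2)
]
print(parent_id, siblings["name"].tolist())
```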

# 3. [Timeseries](https://doc.cognitedata.com/concepts/#time-series)

A time series consists of a sequence of data points connected to a single asset. Here we take a quick look at some common access patterns.

### Discovering timeseries
Parameters similar to those of the asset API are available for discovering timeseries by asset. `client.time_series.get_time_series` fetches metadata on the timeseries below a certain node in the hierarchy, with the following notable arguments:
 - `path`: Fetches all timeseries below a certain node in the hierarchy. It is passed directly to the API and therefore requires a JSON list of ids (e.g. `"[1,2,3]"`).
 - `depth`: How many levels down the asset hierarchy to traverse
 
Again, if a large number of results are expected, the `autopaging` parameter is recommended.

In [None]:
# Get all time series below a particular asset
timeseries_metadata = client.time_series.get_time_series(path=str([FIRST_STAGE_COMPRESSOR_ID])).to_pandas()
timeseries_metadata

In [None]:
# again, pandas functionality is hard to beat for finding the timeseries you need
scrubber_timeseries = timeseries_metadata[
    timeseries_metadata['description'].str.lower().str.contains('scrubber')
    & timeseries_metadata['name'].str.contains('Value')
].reset_index(drop=True)
scrubber_timeseries

In [None]:
scrubber_timeseries.name.tolist()

### Downloading timeseries data with [datapoints](https://doc.cognitedata.com/concepts/#data-points)

One of the strengths of CDP is the expressiveness of timeseries downloads. The `datapoints` client provides access to timeseries data in two ways:
- `client.datapoints.get_datapoints` for a single timeseries, capable of downloading raw datapoints
- `client.datapoints.get_datapoints_frame` for multiple timeseries clocked to a common time axis

For both APIs, the following parameters are important to understand:
- `start` and `end` can be given as Python `datetime` objects, as milliseconds since epoch (UTC), or as relative time strings of the form `N[timeunit]-ago`, where `timeunit` is one of `w`, `d`, `h`, `m`, `s` (for example `7d-ago`).
- `granularity` specifies the aggregation window using the same time units, e.g. `10m`.
- `aggregates` specifies the list of aggregate functions to apply to the data. Valid aggregate functions are `average`/`avg`, `max`, `min`, `count`, `sum`, `interpolation`/`int` and `stepinterpolation`/`step`.

The `granularity` parameter standardizes the time axis of the data returned from CDP, and the `aggregates` parameter sets the up- or downsampling algorithm. Providing neither of these parameters to the `get_datapoints` function returns raw datapoints.
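
To build intuition for what a granularity and a set of aggregate functions do together, the effect can be imitated locally with pandas on made-up raw data. This is only an analogue of the idea, not a description of how CDP computes aggregates:

```python
import pandas as pd

# Made-up raw datapoints at irregular timestamps (milliseconds since epoch).
raw = pd.DataFrame({
    "timestamp": [0, 120_000, 540_000, 660_000, 1_140_000],
    "value": [1.0, 3.0, 5.0, 7.0, 9.0],
})
raw.index = pd.to_datetime(raw["timestamp"], unit="ms")

# Local analogue of a 10-minute granularity with average, max and count:
# bucket the points into 10-minute windows and aggregate each window.
resampled = raw["value"].resample("10min").agg(["mean", "max", "count"])
print(resampled)
```

The five irregular points collapse into two evenly spaced rows, one per 10-minute window, which is what makes aggregated series easy to align on a common time axis.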

**Tip:** the timestamps in CDP are in units of milliseconds since epoch time. The easiest way to transform this to a python `datetime` is to use `pd.to_datetime(df.timestamp, unit='ms')`.
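
A short sketch of the conversion in both directions, using made-up timestamps. Going from a `datetime` back to milliseconds can be handy when a start or end time needs to be expressed as an epoch value:

```python
import datetime

import pandas as pd

# Made-up datapoints with millisecond timestamps.
df = pd.DataFrame({
    "timestamp": [1546300800000, 1546300860000],
    "value": [0.5, 0.7],
})

# Milliseconds since epoch -> pandas timestamps (UTC, timezone-naive).
df["time"] = pd.to_datetime(df["timestamp"], unit="ms")

# Timezone-aware datetime -> milliseconds since epoch.
start = datetime.datetime(2019, 1, 1, tzinfo=datetime.timezone.utc)
start_ms = int(start.timestamp() * 1000)
print(df["time"].iloc[0], start_ms)
```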

#### Raw datapoints

In [None]:
# Note: data on openindustrialdata.com lags by one week for data security,
# so the window requested below ends 8 days ago.

# Fetch an hour of raw data
df_raw = client.datapoints.get_datapoints(
    name=scrubber_timeseries.loc[0, 'name'],
    start=datetime.datetime.now() - datetime.timedelta(hours=8*24 + 1),
    end=datetime.datetime.now() - datetime.timedelta(hours=8*24),
).to_pandas()

df_raw.set_index(pd.to_datetime(df_raw.timestamp, unit='ms'))['value'].plot()
df_raw

#### Resampled dataframes
Tabular structures serve as a good entry point for downstream analysis. From these structures we can proceed to data quality checks, exploratory data analysis and model development. For more complex feature engineering, we often need the raw datapoints instead, fetched as shown above.

In [None]:
df_frame = client.datapoints.get_datapoints_frame(
    time_series=scrubber_timeseries['name'].tolist(),
    start='14d-ago',
    end='7d-ago',
    granularity='1h',
    aggregates=['avg'],
)

df_frame = df_frame.set_index(pd.to_datetime(df_frame.timestamp, unit='ms')).drop('timestamp', axis=1)
df_frame
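
As a first data quality pass on a time-indexed frame like this one, it helps to look at missing values and at gaps in the time axis. A minimal sketch on a made-up frame (the column names are placeholders), assuming one row per hour is expected:

```python
import numpy as np
import pandas as pd

# Made-up hourly frame with one missing hour (03:00) and one missing value.
index = pd.to_datetime([
    "2019-01-01 00:00",
    "2019-01-01 01:00",
    "2019-01-01 02:00",
    "2019-01-01 04:00",
])
toy_frame = pd.DataFrame(
    {
        "sensor_a": [1.0, np.nan, 3.0, 4.0],
        "sensor_b": [10.0, 11.0, 12.0, 13.0],
    },
    index=index,
)

# Fraction of missing values per column.
missing_fraction = toy_frame.isna().mean()

# Rows whose distance to the previous row exceeds the expected granularity.
step = toy_frame.index.to_series().diff()
gaps = step[step > pd.Timedelta(hours=1)]
print(missing_fraction, gaps, sep="\n")
```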

In [None]:
from sklearn.preprocessing import QuantileTransformer
pd.DataFrame(
    data=QuantileTransformer(output_distribution='normal', n_quantiles=100).fit_transform(df_frame),
    columns=df_frame.columns,
    index=df_frame.index
).plot(legend=False)

#### Other useful datapoint access patterns


In [None]:
# get the latest datapoint
client.datapoints.get_latest(scrubber_timeseries.loc[0, 'name']).to_pandas()

In [None]:
# listening for streaming data
for dp in client.datapoints.live_data_generator(scrubber_timeseries.loc[0, 'name']):
    print(dp)