# Introduction

See also the [documentation](https://github.com/cognitedata/cdp-spark-datasource#reading-and-writing-cognite-data-platform-resource-types) for examples for each resource type. The general pattern is:

```python
my_data_frame = spark.read.format("cognite.spark.v1") \
  .option("type", "some-resource-type") \
  .option("apiKey", dbutils.secrets.get("your-scope", "api-key-for-project")) \
  .load()
```
The Community edition of Databricks does not support the personal access tokens that are needed to create secret scopes and secrets. If you are using the Community edition, you therefore have to put the API key directly in the notebook.


The resource types are:
- `assets`
- `events`
- `timeseries`: time series metadata
- `datapoints`: data points for a time series, also supports aggregates
- `raw`: "RAW" tables, which also require `.option("database", "some-database")` and `.option("table", "some-table")`
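
The general pattern and the resource types above can be wrapped in a small helper. This is only a sketch: `read_cdf` and its parameters are hypothetical names, not part of the data source, and it assumes nothing beyond the `type`, `apiKey`, `database` and `table` options mentioned above.

```python
def read_cdf(spark, resource_type, api_key, **extra_options):
    """Return a lazy DataFrame for one resource type (hypothetical helper).

    `spark` is the SparkSession provided by the notebook. `extra_options`
    carries type-specific options, such as database= and table= for "raw".
    """
    if resource_type == "raw":
        missing = {"database", "table"} - set(extra_options)
        if missing:
            raise ValueError(f"raw tables also need: {sorted(missing)}")

    reader = (
        spark.read.format("cognite.spark.v1")
        .option("type", resource_type)
        .option("apiKey", api_key)
    )
    # Forward any remaining options unchanged to the data source
    for key, value in extra_options.items():
        reader = reader.option(key, value)
    return reader.load()
```

In a notebook this could be called as `read_cdf(spark, "raw", api_key, database="some-database", table="some-table")`, where `api_key` is whatever key you use for the project.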

Let's start by reading some data from the `publicdata` project. If you don't have an API key, go get one from the [Open Industrial Data project](https://openindustrialdata.com/).

In [0]:
#insert API-key to publicdata project
API_KEY="ZTc3NmUxMzUtNGMwZC00YmM2LTgxNWQtZjQ4Yjc1ZGYyODlk"

In [0]:
assets = spark.read.format("cognite.spark.v1") \
    .option("type", "assets") \
    .option("apiKey", API_KEY) \
    .load()

# DataFrames

We get back a [Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html) from `spark.read.format...load()`, which "is conceptually equivalent to a table in a relational database or a data frame in R/Python".

`spark` is your entry point to the Spark API, and it's a `SparkSession` with a connection to the cluster. You will mostly use it to read data frames, and then interact with the data frames.

You may have noticed that the command finished almost immediately. The data frame is a lazy data structure, and doesn't actually load any data until it has to. You can view the schema (the column names and their types) by clicking the small arrow next to the output. The schema is constant for assets, so we didn't need to read any data to produce that schema.

Data is loaded only when you perform an *action* on a data frame, which requires data to be present. Examples of actions include `.count()` (for counting the rows), `.show()` (for printing the first few rows), `.toPandas()` (for converting to a Pandas data frame, downloading all data to your Python process), and pretty much anything else that uses data.
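
Because `.toPandas()` brings every row into your Python process, it can help to shrink the data frame first. The sketch below uses the standard `select()` and `limit()` transformations before the `toPandas()` action. The helper name `preview_as_pandas` is hypothetical, and how much work the data source itself saves is not something this sketch guarantees.

```python
def preview_as_pandas(df, columns, n=5):
    """Return at most `n` rows of the chosen columns as a pandas data frame.

    select() and limit() are lazy transformations, so nothing is read
    until toPandas() (an action) runs on the reduced data frame.
    """
    reduced = df.select(*columns).limit(n)  # still lazy: no data loaded yet
    return reduced.toPandas()               # action: triggers the read
```

For example, `preview_as_pandas(assets, ["id", "name"])` would collect at most five rows into pandas.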

In [0]:
print(assets.count())
assets.show()
assets.toPandas()

Unnamed: 0,externalId,name,parentId,parentExternalId,description,metadata,source,id,createdTime,lastUpdatedTime,rootId,aggregates,dataSetId,labels
0,,23-LI-96182-01,2.357112e+15,,LEVEL INDICATOR,"{'WMT_TAG_ID_ANCESTOR': '346682', 'WMT_TAG_ISA...",,315863051423411,1970-01-01,1970-01-01,6687602007296940,,,
1,,23-YAHH-96133-02,2.770649e+15,,VRD - PH 1STSTG MOTOR JOURN BRG NDE : VIBRATIO...,"{'WMT_TAG_ID_ANCESTOR': '347006', 'WMT_TAG_ISA...",,322603444177639,1970-01-01,1970-01-01,6687602007296940,,,
2,,23-TT-96147-02,2.539007e+15,,VRD - PH 1STSTG COMP SEAL GAS HTR,"{'WMT_TAG_ID_ANCESTOR': '345716', 'WMT_TAG_ISA...",,323277268913879,1970-01-01,1970-01-01,6687602007296940,,,
3,,23-PAH-96150-01,4.050791e+15,,VRD - PH 1STSTG COMP INNER SEAL NDE : PRESSURE...,"{'WMT_TAG_ID_ANCESTOR': '346318', 'WMT_TAG_ISA...",,336762264880010,1970-01-01,1970-01-01,6687602007296940,,,
4,,23-PAL-96160,4.144009e+15,,SOFT TAG VRD - PH 1STSTG COMP SEPAR GAS,"{'WMT_TAG_ID_ANCESTOR': '346163', 'WMT_TAG_COM...",,227614941284667,1970-01-01,1970-01-01,6687602007296940,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1110,,23-PDT-92534,7.372310e+15,,VRD - PH 1STSTGCOMP SUCTION,"{'WMT_TAG_ID_ANCESTOR': '681824', 'WMT_TAG_ISA...",,8571696650722129,1970-01-01,1970-01-01,6687602007296940,,,
1111,,23-ESDV-92501B-J01,1.659946e+15,,EMERGENCY SHUTDOWN VALVE-JUNCTION BOX,"{'WMT_TAG_ID_ANCESTOR': '345655', 'WMT_TAG_ISA...",,8575451726408247,1970-01-01,1970-01-01,6687602007296940,,,
1112,,23-HV-92541-01,2.011813e+15,,VRD - PH 1STSTGCOMP DISCH PSV INLET,"{'WMT_TAG_ID_ANCESTOR': '346236', 'WMT_TAG_ISA...",,8594787774355555,1970-01-01,1970-01-01,6687602007296940,,,
1113,,23-TE-96131-02,6.191827e+15,,VRD - PH 1STSTG MOTOR JOURN BRG NDE,"{'WMT_TAG_ID_ANCESTOR': '346066', 'WMT_TAG_ISA...",,8598061348328169,1970-01-01,1970-01-01,6687602007296940,,,


DataFrames can be distributed with many partitions being placed on different nodes in our Spark cluster.

In [0]:
assets.rdd.getNumPartitions()
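
The number of partitions can also be changed. The following is a minimal sketch, assuming the standard `coalesce()` and `repartition()` DataFrame methods; `with_partition_count` is a hypothetical helper name.

```python
def with_partition_count(df, target):
    """Return a DataFrame with `target` partitions (hypothetical helper)."""
    current = df.rdd.getNumPartitions()
    if target == current:
        return df
    if target < current:
        # coalesce() merges existing partitions and avoids a full shuffle
        return df.coalesce(target)
    # repartition() shuffles rows across the cluster to create more partitions
    return df.repartition(target)
```

Calling `with_partition_count(assets, 8).rdd.getNumPartitions()` should then report 8.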

We can drop and rename columns, getting a DataFrame with a new schema.

In [0]:
assets.drop("metadata") \
  .drop("externalId") \
  .drop("source") \
  .withColumnRenamed("description", "descr") \
  .withColumnRenamed("lastUpdatedTime", "updatedAt") \
  .printSchema()
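
DataFrames are immutable, so each of these calls returns a new DataFrame and leaves `assets` itself unchanged. As a sketch of the same transformation in reusable form (the function name `tidy_assets` is hypothetical), note that `drop()` also accepts several column names at once:

```python
def tidy_assets(assets):
    """Return a new DataFrame; the `assets` argument is left unchanged."""
    return (
        assets.drop("metadata", "externalId", "source")  # several names in one call
        .withColumnRenamed("description", "descr")
        .withColumnRenamed("lastUpdatedTime", "updatedAt")
    )
```

`tidy_assets(assets).printSchema()` should print the reduced schema, while `assets.printSchema()` still shows every original column.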

Notebooks have autocompletion built in and you can view keyboard shortcuts by clicking the keyboard icon at the top bar.

# Displaying data

In Databricks there's a convenient `display()` function you can use to show the data in data frames (and a few other formats, such as pandas data frames and matplotlib figures). Since showing the data requires it to be loaded, this will also trigger an action.
By default, only the first 1000 rows are displayed in the widget, even if Spark needs to load more data than this in the background.

Note that you might need to scroll within the widget to show all the results.

You can sort the rows shown by different columns, and you can expand the "string to string" map in the metadata column by clicking the arrow.

In [0]:
display(assets)

externalId,name,parentId,parentExternalId,description,metadata,source,id,createdTime,lastUpdatedTime,rootId,aggregates,dataSetId,labels
,23-LI-96182-01,2357112351749647.0,,LEVEL INDICATOR,"Map(WMT_CATEGORY_ID -> 1116, WMT_SYSTEM_ID -> 4440, WMT_TAG_GLOBALID -> 1000000000278675, ELC_STATUS_ID -> 1212, WMT_TAG_STATUSCHGDATE -> 2014-09-26 12:52:05, WMT_AREA_ID -> 1585, WMT_TAG_ISACTIVE -> 0, SOURCE_DB -> workmate, WMT_TAG_NAME -> 23-LI-96182-01, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 8137, WMT_TAG_ID -> 694413, WMT_TAG_CREATED_DATE -> 2013-03-20 08:57:38, WMT_FUNC_CODE_ID -> 4442, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 346682, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2014-09-26 12:52:05, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> LEVEL INDICATOR, WMT_TAG_MAINID -> 681760)",,315863051423411,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,23-YAHH-96133-02,2770649304694274.0,,VRD - PH 1STSTG MOTOR JOURN BRG NDE : VIBRATION ALARM HIGH HIGH,"Map(WMT_CATEGORY_ID -> 1152, WMT_SYSTEM_ID -> 4440, WMT_TAG_GLOBALID -> 1000000000292009, WMT_CONTRACTOR_ID -> 1686, ELC_STATUS_ID -> 1211, WMT_PO_ID -> 8309, WMT_AREA_ID -> 1600, WMT_TAG_ISACTIVE -> 1, SOURCE_DB -> workmate, WMT_TAG_NAME -> 23-YAHH-96133-02, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 1001, WMT_TAG_ID -> 698840, WMT_TAG_CREATED_DATE -> 2013-05-16 11:50:16, WMT_FUNC_CODE_ID -> 11405, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 347006, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2015-10-06 12:28:32, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> VRD - PH 1STSTG MOTOR JOURN BRG NDE : VIBRATION ALARM HIGH HIGH, WMT_TAG_MAINID -> 681760)",,322603444177639,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,23-TT-96147-02,2539007469802785.0,,VRD - PH 1STSTG COMP SEAL GAS HTR,"Map(WMT_CATEGORY_ID -> 1116, WMT_SYSTEM_ID -> 4440, WMT_TAG_LOOP -> 96147, WMT_TAG_GLOBALID -> 1000000000684487, WMT_CONTRACTOR_ID -> 1686, ELC_STATUS_ID -> 1211, WMT_TAG_STATUSCHGDATE -> 2017-10-26 14:00:33, WMT_PO_ID -> 8309, WMT_AREA_ID -> 1600, WMT_TAG_ISACTIVE -> 1, SOURCE_DB -> workmate, RES_ID -> 650245, WMT_TAG_NAME -> 23-TT-96147-02, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 8137, WMT_TAG_ID -> 346632, WMT_TAG_CREATED_DATE -> 2009-06-26 15:36:37, WMT_FUNC_CODE_ID -> 4588, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 345716, WMT_SAFETYCRITICALELEMENT_ID -> 1069, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_COMMENT -> Added Alarm_Setpoint_Units: ?C (Ref: VRD-CCR-000505)Alarm Limit_HH changed from 140 to 180, controller st = 160 on 08.02.2013 (Ref: VRD-BP-BP-EQ-000290-A-C-003)., WMT_TAG_UPDATED_DATE -> 2017-10-26 14:00:33, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> VRD - PH 1STSTG COMP SEAL GAS HTR, WMT_TAG_MAINID -> 681760)",,323277268913879,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,23-PAH-96150-01,4050790831683279.0,,VRD - PH 1STSTG COMP INNER SEAL NDE : PRESSURE ALARM HIGH,"Map(WMT_CATEGORY_ID -> 1152, WMT_SYSTEM_ID -> 4440, WMT_TAG_GLOBALID -> 1000000000112421, WMT_CONTRACTOR_ID -> 1686, ELC_STATUS_ID -> 1211, WMT_PO_ID -> 8309, WMT_AREA_ID -> 1600, WMT_TAG_ISACTIVE -> 1, SOURCE_DB -> workmate, WMT_TAG_NAME -> 23-PAH-96150-01, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 1001, WMT_TAG_ID -> 769886, WMT_TAG_CREATED_DATE -> 2015-10-06 12:18:22, WMT_FUNC_CODE_ID -> 11405, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 346318, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2015-10-06 13:14:18, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> VRD - PH 1STSTG COMP INNER SEAL NDE : PRESSURE ALARM HIGH, WMT_TAG_MAINID -> 681760)",,336762264880010,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,60-EN-9011A+80E2,2539007469802785.0,,FEEDER THYRISTOR +82B SEAL GAS HEATER FOR 1 ST STAGE COMP.,"Map(WMT_CATEGORY_ID -> 1114, WMT_SYSTEM_ID -> 3118, WMT_TAG_GLOBALID -> 1000000000281000, WMT_CONTRACTOR_ID -> 1686, ELC_STATUS_ID -> 1211, WMT_TAG_STATUSCHGDATE -> 2017-07-25 10:10:03, WMT_AREA_ID -> 1650, WMT_TAG_ISACTIVE -> 1, SOURCE_DB -> workmate, WMT_TAG_NAME -> 60-EN-9011A+80E2, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 8137, WMT_TAG_ID -> 694792, WMT_TAG_CREATED_DATE -> 2013-04-05 15:12:55, WMT_FUNC_CODE_ID -> 4291, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 345716, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2017-07-25 10:10:03, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> FEEDER THYRISTOR +82B SEAL GAS HEATER FOR 1 ST STAGE COMP., WMT_TAG_MAINID -> 681760)",,135258102320048,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,23-PIC-92538-11,3956345651792907.0,,SOFT TAG VRD - PH 1STSTGCOMP SUCTION STV,"Map(WMT_CATEGORY_ID -> 1152, WMT_SYSTEM_ID -> 4440, WMT_TAG_GLOBALID -> 1000000000291748, WMT_CONTRACTOR_ID -> 1686, ELC_STATUS_ID -> 1211, WMT_AREA_ID -> 1585, WMT_TAG_ISACTIVE -> 1, SOURCE_DB -> workmate, WMT_TAG_NAME -> 23-PIC-92538-11, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 1001, WMT_TAG_ID -> 698580, WMT_TAG_CREATED_DATE -> 2013-05-16 11:50:16, WMT_FUNC_CODE_ID -> 11275, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 346362, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2015-10-09 11:56:33, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> SOFT TAG VRD - PH 1STSTGCOMP SUCTION STV, WMT_TAG_MAINID -> 681760)",,137775225587472,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,48-TSL-96960,3476609559144829.0,,VRD - PH 1STSTG COMPR WATER MIST,"Map(WMT_CATEGORY_ID -> 1116, WMT_SYSTEM_ID -> 2491, WMT_TAG_LOOP -> 48-T-9, WMT_TAG_GLOBALID -> 1000000000711630, WMT_CONTRACTOR_ID -> 1350, ELC_STATUS_ID -> 1225, WMT_TAG_STATUSCHGDATE -> 2013-07-16 10:55:46, WMT_AREA_ID -> 1600, WMT_TAG_ISACTIVE -> 0, SOURCE_DB -> workmate, WMT_TAG_NAME -> 48-TSL-96960, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 1001, WMT_TAG_ID -> 351922, WMT_TAG_CREATED_DATE -> 2009-06-26 15:36:40, WMT_FUNC_CODE_ID -> 4577, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 411403, WMT_SAFETYCRITICALELEMENT_ID -> 1033, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2015-10-08 10:03:52, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> VRD - PH 1STSTG COMPR WATER MIST, WMT_TAG_MAINID -> 681760)",,141678634874122,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,23-LSHH-96138 (2),6191827428964450.0,,VRD PH,"Map(WMT_CATEGORY_ID -> 1116, WMT_SYSTEM_ID -> 4440, WMT_TAG_GLOBALID -> 1000000000250774, ELC_STATUS_ID -> 1212, WMT_TAG_STATUSCHGDATE -> 2014-09-26 12:52:05, WMT_AREA_ID -> 1444, WMT_TAG_ISACTIVE -> 0, SOURCE_DB -> workmate, WMT_TAG_NAME -> 23-LSHH-96138 (2), WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 8137, WMT_TAG_ID -> 682995, WMT_TAG_CREATED_DATE -> 2012-12-13 14:13:52, WMT_FUNC_CODE_ID -> 4465, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 346066, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2014-09-26 12:52:05, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> VRD PH, WMT_TAG_MAINID -> 681760)",,142563446487784,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,23-ZAHH-96107-01,3701946487227614.0,,VRD - PH 1STSTGCOMP SHAFT AXIAL POS : POSITION ALARM HIGH HIGH,"Map(WMT_CATEGORY_ID -> 1152, WMT_SYSTEM_ID -> 4440, WMT_TAG_GLOBALID -> 1000000000112529, WMT_CONTRACTOR_ID -> 1686, ELC_STATUS_ID -> 1211, WMT_TAG_STATUSCHGDATE -> 2016-06-06 08:48:28, WMT_PO_ID -> 8309, WMT_AREA_ID -> 1600, WMT_TAG_ISACTIVE -> 1, SOURCE_DB -> workmate, WMT_TAG_NAME -> 23-ZAHH-96107-01, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 2829, WMT_TAG_ID -> 769994, WMT_TAG_CREATED_DATE -> 2015-10-06 12:18:22, WMT_FUNC_CODE_ID -> 11405, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 347100, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2016-06-06 08:48:28, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> VRD - PH 1STSTGCOMP SHAFT AXIAL POS : POSITION ALARM HIGH HIGH, WMT_TAG_MAINID -> 681760)",,146479202708416,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
,23-PAH-92539,3147733389929639.0,,VRD - PH 1STSTGCOMP DISCHARGE : PRESSURE ALARM HIGH,"Map(WMT_CATEGORY_ID -> 1152, WMT_SYSTEM_ID -> 4440, WMT_TAG_GLOBALID -> 1000000000293597, WMT_CONTRACTOR_ID -> 1686, ELC_STATUS_ID -> 1211, WMT_PO_ID -> 8309, WMT_AREA_ID -> 1600, WMT_TAG_ISACTIVE -> 1, SOURCE_DB -> workmate, WMT_TAG_NAME -> 23-PAH-92539, WMT_LOCATION_ID -> 1004, WMT_TAG_UPDATED_BY -> 1001, WMT_TAG_ID -> 700429, WMT_TAG_CREATED_DATE -> 2013-05-16 11:50:16, WMT_FUNC_CODE_ID -> 11405, WMT_TAG_HISTORYREQUIRED -> Y, WMT_TAG_ID_ANCESTOR -> 346290, SOURCE_TABLE -> wmate_dba.wmt_tag, WMT_TAG_UPDATED_DATE -> 2015-10-06 12:28:33, WMT_TAG_ISOWNEDBYPROJECT -> 0, WMT_TAG_CRITICALLINE -> N, WMT_TAG_DESC -> VRD - PH 1STSTGCOMP DISCHARGE : PRESSURE ALARM HIGH, WMT_TAG_MAINID -> 681760)",,155186911505639,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,
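
Individual entries of the metadata map can also be pulled out into their own columns in code. This is a sketch using the standard `Column.getItem()` and `alias()` methods; the helper name `metadata_columns` is hypothetical, and the keys in the usage example are simply ones visible in the output above. Rows that lack a given key are expected to show a null value under Spark's default settings.

```python
def metadata_columns(df, keys):
    """Select `name` plus one column per requested metadata key."""
    # df["metadata"] is a Column; getItem(key) looks the key up in the map
    extracted = [df["metadata"].getItem(key).alias(key) for key in keys]
    return df.select("name", *extracted)
```

For example: `display(metadata_columns(assets, ["SOURCE_DB", "WMT_TAG_ISACTIVE"]))`.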


In [0]:
# we will explain groupBy() and count() in the section on aggregations
display(assets.groupBy(assets.parentId).count())

parentId,count
5901066000673985.0,11
2861239574637735.0,24
786220428505816.0,2
4518112062673878.0,3
3904753668320840.0,25
1300024350927843.0,3
4050790831683279.0,2
3147733389929639.0,3
3995911524431747.0,6
8338677202728690.0,2


# Caching data

As we mentioned before, the data frame is a lazy structure that loads data when it is needed. Loading data over and over again can be slow and wasteful when we don't absolutely need it to be completely up-to-date.
In that case, we can create a cached data frame by adding `.cache()` at the end.

In [0]:
# Note: despite its name, `events` holds assets here, since "type" is set to "assets".
events = spark.read.format("cognite.spark.v1") \
    .option("type", "assets") \
    .option("apiKey", API_KEY) \
    .load() \
    .cache()

In [0]:
events.printSchema()

In [0]:
print(events.count())
events.show()
events.toPandas()

Unnamed: 0,externalId,name,parentId,parentExternalId,description,metadata,source,id,createdTime,lastUpdatedTime,rootId,aggregates,dataSetId,labels
0,,23-CB-9103B,2.137558e+15,,VRD - 1ST STAGE COMPRESSOR LUBE OIL FILTER B,"{'WMT_TAG_ID_ANCESTOR': '682544', 'WMT_TAG_ISA...",,379497117972793,1970-01-01,1970-01-01,6687602007296940,,,
1,,45-PAH-92508,7.059526e+14,,VRD - PH 1STSTGSUCTCOOL COOLMED OUT : PRESSURE...,"{'WMT_TAG_ID_ANCESTOR': '350293', 'WMT_TAG_ISA...",,380318889330669,1970-01-01,1970-01-01,6687602007296940,,,
2,,23-TAH-96147-02,3.232773e+14,,VRD - PH 1STSTG COMP SEAL GAS HTR : TEMPERATUR...,"{'WMT_TAG_ID_ANCESTOR': '346632', 'WMT_TAG_ISA...",,387454215987725,1970-01-01,1970-01-01,6687602007296940,,,
3,,23-YE-96106-02,3.047932e+15,,VRD - PH 1STSTG COMP JOURN BRG NDE,"{'WMT_TAG_ID_ANCESTOR': '346064', 'WMT_TAG_ISA...",,392115388490331,1970-01-01,1970-01-01,6687602007296940,,,
4,,48-SU-9106,3.476610e+15,,VRD - 1ST STAGE COMP WATER MIST AIR CYLINDER 2,"{'WMT_TAG_ID_ANCESTOR': '411403', 'WMT_TAG_ISA...",,402526641391660,1970-01-01,1970-01-01,6687602007296940,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1110,,23-KA-9101-SPD,3.047932e+15,,VRD - PH 1ST STAGE SHAFT POWER DEV,"{'WMT_TAG_ID_ANCESTOR': '346064', 'WMT_TAG_ISA...",,8610767224227528,1970-01-01,1970-01-01,6687602007296940,,,
1111,,23-YAH-96119-02,4.948409e+15,,VRD - PH 1STSTGGEAR 2 JOURNBRG NDE : VIBRATION...,"{'WMT_TAG_ID_ANCESTOR': '347000', 'WMT_TAG_COM...",,8616078351113857,1970-01-01,1970-01-01,6687602007296940,,,
1112,,23-PDAH-92602,8.777968e+15,,VRD - PH 1ST STG DISCH GAS COOLERS : PRESSURE ...,"{'WMT_TAG_ID_ANCESTOR': '346178', 'WMT_TAG_ISA...",,8624217952557185,1970-01-01,1970-01-01,6687602007296940,,,
1113,,23-YSV-92445,2.861240e+15,,VRD - PH 1STSTGSUCTCOOL GAS IN,"{'WMT_TAG_ID_ANCESTOR': '345949', 'WMT_TAG_ISA...",,8624522362488710,1970-01-01,1970-01-01,6687602007296940,,,


If we run the same commands again they should finish more quickly (potentially much more quickly if there's a lot of data).

In [0]:
print(events.count())
events.show()
events.toPandas()

Unnamed: 0,externalId,name,parentId,parentExternalId,description,metadata,source,id,createdTime,lastUpdatedTime,rootId,aggregates,dataSetId,labels
0,,23-CB-9103B,2.137558e+15,,VRD - 1ST STAGE COMPRESSOR LUBE OIL FILTER B,"{'WMT_TAG_ID_ANCESTOR': '682544', 'WMT_TAG_ISA...",,379497117972793,1970-01-01,1970-01-01,6687602007296940,,,
1,,45-PAH-92508,7.059526e+14,,VRD - PH 1STSTGSUCTCOOL COOLMED OUT : PRESSURE...,"{'WMT_TAG_ID_ANCESTOR': '350293', 'WMT_TAG_ISA...",,380318889330669,1970-01-01,1970-01-01,6687602007296940,,,
2,,23-TAH-96147-02,3.232773e+14,,VRD - PH 1STSTG COMP SEAL GAS HTR : TEMPERATUR...,"{'WMT_TAG_ID_ANCESTOR': '346632', 'WMT_TAG_ISA...",,387454215987725,1970-01-01,1970-01-01,6687602007296940,,,
3,,23-YE-96106-02,3.047932e+15,,VRD - PH 1STSTG COMP JOURN BRG NDE,"{'WMT_TAG_ID_ANCESTOR': '346064', 'WMT_TAG_ISA...",,392115388490331,1970-01-01,1970-01-01,6687602007296940,,,
4,,48-SU-9106,3.476610e+15,,VRD - 1ST STAGE COMP WATER MIST AIR CYLINDER 2,"{'WMT_TAG_ID_ANCESTOR': '411403', 'WMT_TAG_ISA...",,402526641391660,1970-01-01,1970-01-01,6687602007296940,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1110,,23-KA-9101-SPD,3.047932e+15,,VRD - PH 1ST STAGE SHAFT POWER DEV,"{'WMT_TAG_ID_ANCESTOR': '346064', 'WMT_TAG_ISA...",,8610767224227528,1970-01-01,1970-01-01,6687602007296940,,,
1111,,23-YAH-96119-02,4.948409e+15,,VRD - PH 1STSTGGEAR 2 JOURNBRG NDE : VIBRATION...,"{'WMT_TAG_ID_ANCESTOR': '347000', 'WMT_TAG_COM...",,8616078351113857,1970-01-01,1970-01-01,6687602007296940,,,
1112,,23-PDAH-92602,8.777968e+15,,VRD - PH 1ST STG DISCH GAS COOLERS : PRESSURE ...,"{'WMT_TAG_ID_ANCESTOR': '346178', 'WMT_TAG_ISA...",,8624217952557185,1970-01-01,1970-01-01,6687602007296940,,,
1113,,23-YSV-92445,2.861240e+15,,VRD - PH 1STSTGSUCTCOOL GAS IN,"{'WMT_TAG_ID_ANCESTOR': '345949', 'WMT_TAG_ISA...",,8624522362488710,1970-01-01,1970-01-01,6687602007296940,,,


Spark now keeps all the asset data in memory where possible, and reloads it only if a node crashes. Even then, it tries to reload only the data that was held on that node.
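
One way to check the speed-up yourself is to time an action before and after the cache is filled. The helper below is a sketch in plain Python; `timed` is a hypothetical name, and the measured times will vary from run to run.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start
```

For instance, `timed(events.count)` returns the row count together with the time the action took. When the cached data is no longer needed, `events.unpersist()` releases it.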

Caching is a good idea if you have a large amount of data that will not be changed.

However, a cached copy never receives new data: if you cached events the way we cached assets above, new events would not appear in it.
This might seem obvious, but it means that if you're doing something like waiting for new events that you
have just created to show up, you should *not* cache the DataFrame you're using to check for new events!
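
As a sketch of that polling scenario, the function below builds a fresh, uncached data frame on every attempt. The name `wait_for_events` and its parameters are hypothetical. It assumes the `events` resource type and the `apiKey` option described in the introduction.

```python
import time

def wait_for_events(spark, api_key, expected_count, attempts=10, delay_s=5.0):
    """Poll until at least `expected_count` events exist; return True on success."""
    for attempt in range(attempts):
        # Deliberately no .cache(): every attempt must see the current data
        events_now = (
            spark.read.format("cognite.spark.v1")
            .option("type", "events")
            .option("apiKey", api_key)
            .load()
        )
        if events_now.count() >= expected_count:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # wait before asking again
    return False
```

Counting all events on every attempt can be slow for large projects, so in practice you would probably filter the data frame before counting.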

# Time series metadata

Let's read the time series metadata into another cached data frame.

In [0]:
tsmd = spark.read.format("cognite.spark.v1") \
    .option("type", "timeseries") \
    .option("apiKey", API_KEY) \
    .load() \
    .cache()

In [0]:
tsmd.printSchema()

In [0]:
tsmd.count()

In [0]:
display(tsmd)

name,isString,metadata,unit,assetId,isStep,description,securityCategories,id,externalId,createdTime,lastUpdatedTime,dataSetId
VAL_23-KA-9101-M01_TCS_Protection_Trip:VALUE,False,"Map(engunits -> , span -> 1, instrumenttag -> 23-KA-9101-M01_TCS_Protection_Trip:VALUE, location4 -> 1, totalcode -> 0, step -> 1, _replicatedInternalId -> 7763082528285459, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23-KA-9101-M01_TCS_Protection_Trip:VALUE, ptclassrev -> 1, zero -> 147, location3 -> 1, pointsource -> VLH_PCN5, exdesc -> , descriptor -> PH (SwitchGear) MV-COMP.M. FEEDER/ TCS Protection Trip, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160259, recno -> 144759, typicalvalue -> 0, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 2, location1 -> 1, compmax -> 28800, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> Valhall_True_False, compdev -> 0, userint2 -> 0, pointtype -> Digital)",,6191827428964450,False,PH (SwitchGear) MV-COMP.M. FEEDER/ TCS Protection Trip,,338601415916476,pi:160259,2020-06-24T18:51:54.298+0000,2020-06-30T10:26:53.087+0000,
VAL_23-TIC-92604:Control Module:PD,False,"Map(engunits -> s, span -> 100, instrumenttag -> 23-TIC-92604:Control Module:PD, location4 -> 1, totalcode -> 0, step -> 0, _replicatedInternalId -> 6952626872991737, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23-TIC-92604:Control Module:PD, ptclassrev -> 1, zero -> 0, location3 -> 1, pointsource -> LAB-WAS-VAL800, exdesc -> , descriptor -> PH 1stStgDiscCool Gas Out Derivate time, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 0, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 86400, shutdown -> 0, pointid -> 160785, recno -> 145285, typicalvalue -> 50, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 86400, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> , compdev -> 0, userint2 -> 0, pointtype -> Float32)",s,5838942267158947,False,PH 1stStgDiscCool Gas Out Derivate time,,410193671385460,pi:160785,2020-06-24T18:51:54.298+0000,2020-06-30T10:27:52.615+0000,
VAL_23-KA-9101-M01_Motor_Winding_Temp_Trip:VALUE,False,"Map(engunits -> , span -> 1, instrumenttag -> 23-KA-9101-M01_Motor_Winding_Temp_Trip:VALUE, location4 -> 1, totalcode -> 0, step -> 1, _replicatedInternalId -> 1189681016960910, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23-KA-9101-M01_Motor_Winding_Temp_Trip:VALUE, ptclassrev -> 1, zero -> 147, location3 -> 1, pointsource -> VLH_PCN5, exdesc -> , descriptor -> PH (SwitchGear) MV-COMP.M. FEEDER/ motor winding warning trip, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160244, recno -> 144744, typicalvalue -> 0, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 2, location1 -> 1, compmax -> 28800, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> Valhall_True_False, compdev -> 0, userint2 -> 0, pointtype -> Digital)",,6191827428964450,False,PH (SwitchGear) MV-COMP.M. FEEDER/ motor winding warning trip,,292999953190750,pi:160244,2020-06-24T18:51:54.298+0000,2020-06-30T10:26:47.804+0000,
VAL_23-TIC-96147:Z.Y.Value,False,"Map(engunits -> degC, span -> 100, instrumenttag -> 23-TIC-96147:Z.Y.Value, location4 -> 1, totalcode -> 0, step -> 0, _replicatedInternalId -> 4841176833159959, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23-TIC-96147:Z.Y.Value, ptclassrev -> 1, zero -> 0, location3 -> 1, pointsource -> VLH_PCN1, exdesc -> , descriptor -> PH 1stStg Comp Seal Gas Output, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160836, recno -> 145336, typicalvalue -> 50, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 1200, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> , compdev -> 0, userint2 -> 0, pointtype -> Float32)",degC,8054863939437682,False,PH 1stStg Comp Seal Gas Output,,297023562584540,pi:160836,2020-06-24T18:51:54.298+0000,2020-06-30T10:27:58.730+0000,
VAL_23_FT_92537_03:Z.X.Value,False,"Map(engunits -> , span -> 100, instrumenttag -> 23_FT_92537_03:Z.X.Value, location4 -> 1, totalcode -> 0, step -> 0, _replicatedInternalId -> 8635841543797800, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23_FT_92537_03:Z.X.Value, ptclassrev -> 1, zero -> 0, location3 -> 1, pointsource -> VLH_PCN1, exdesc -> , descriptor -> PH 1stStgComp Flow, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160061, recno -> 144561, typicalvalue -> 50, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 28800, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> , compdev -> 0, userint2 -> 0, pointtype -> Float32)",,3111454725058294,False,PH 1stStgComp Flow,,138649441615650,pi:160061,2020-06-24T18:51:54.298+0000,2020-06-30T10:28:34.895+0000,
VAL_23_FT_92537_02:Z.X.Value,False,"Map(engunits -> , span -> 100, instrumenttag -> 23_FT_92537_02:Z.X.Value, location4 -> 1, totalcode -> 0, step -> 0, _replicatedInternalId -> 2310142689596435, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23_FT_92537_02:Z.X.Value, ptclassrev -> 1, zero -> 0, location3 -> 1, pointsource -> VLH_PCN1, exdesc -> , descriptor -> PH 1stStgComp Discharge, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160060, recno -> 144560, typicalvalue -> 50, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 28800, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> , compdev -> 0, userint2 -> 0, pointtype -> Float32)",,3111454725058294,False,PH 1stStgComp Discharge,,163138308360797,pi:160060,2020-06-24T18:51:54.298+0000,2020-06-30T10:28:34.538+0000,
VAL_23_PIC_92538_02:Z.X.Value,False,"Map(engunits -> , span -> 100, instrumenttag -> 23_PIC_92538_02:Z.X.Value, location4 -> 1, totalcode -> 0, step -> 0, _replicatedInternalId -> 7039376143296283, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23_PIC_92538_02:Z.X.Value, ptclassrev -> 1, zero -> 0, location3 -> 1, pointsource -> VLH_PCN1, exdesc -> , descriptor -> PH 1stStgComp STV Perf Act CtrlMod, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160072, recno -> 144572, typicalvalue -> 50, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 28800, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> , compdev -> 0, userint2 -> 0, pointtype -> Float32)",,7625290958805900,False,PH 1stStgComp STV Perf Act CtrlMod,,174000715181921,pi:160072,2020-06-24T18:51:54.298+0000,2020-06-30T10:28:36.795+0000,
VAL_23-PDI-96249:X.Value,False,"Map(engunits -> , span -> 100, instrumenttag -> 23-PDI-96249:X.Value, location4 -> 1, totalcode -> 0, step -> 0, _replicatedInternalId -> 3724033485234832, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23-PDI-96249:X.Value, ptclassrev -> 1, zero -> 0, location3 -> 1, pointsource -> VLH_PCN1, exdesc -> , descriptor -> PH 1st Stg Comp Over Inner Seal, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160623, recno -> 145123, typicalvalue -> 50, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 28800, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> , compdev -> 0, userint2 -> 0, pointtype -> Float32)",,1729948629622439,False,PH 1st Stg Comp Over Inner Seal,,263486820031535,pi:160623,2020-06-24T18:51:54.298+0000,2020-06-30T10:27:27.381+0000,
VAL_23-YZSL-92545-F:X.Value,False,"Map(engunits -> , span -> 1, instrumenttag -> 23-YZSL-92545-F:X.Value, location4 -> 1, totalcode -> 0, step -> 1, _replicatedInternalId -> 941068575373461, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23-YZSL-92545-F:X.Value, ptclassrev -> 1, zero -> 147, location3 -> 1, pointsource -> VLH_PCN1, exdesc -> , descriptor -> PH 1st Stg Comp Disch BDV SOV ZSL Flt, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 161062, recno -> 145562, typicalvalue -> 0, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 28800, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> Valhall_True_False, compdev -> 0, userint2 -> 0, pointtype -> Digital)",,2523354509747914,False,PH 1st Stg Comp Disch BDV SOV ZSL Flt,,367129420543379,pi:161062,2020-06-24T18:51:54.298+0000,2020-06-30T10:28:27.503+0000,
VAL_23-PIC-96153:Control Module:YR,False,"Map(engunits -> barg, span -> 100, instrumenttag -> 23-PIC-96153:Control Module:YR, location4 -> 1, totalcode -> 0, step -> 1, _replicatedInternalId -> 7306338983811570, filtercode -> 0, excdevpercent -> 0, userreal1 -> 0, compdevpercent -> 0, ptclassname -> classic, archiving -> 1, _replicatedSource -> akerbp, excdev -> 0, future -> 0, compressing -> 1, sourcetag -> , tag -> VAL_23-PIC-96153:Control Module:YR, ptclassrev -> 1, zero -> 0, location3 -> 1, pointsource -> VLH_PCN1, exdesc -> , descriptor -> PH 1stStg Comp Inner Seal Working Setpoint, excmin -> 0, userreal2 -> 0, compmin -> 0, scan -> 1, displaydigits -> -5, ptclassid -> 2, squareroot -> 0, excmax -> 600, shutdown -> 0, pointid -> 160670, recno -> 145170, typicalvalue -> 50, convers -> 1, srcptid -> 0, userint1 -> 0, location2 -> 0, location1 -> 1, compmax -> 1200, _replicatedTime -> 1593024714000, location5 -> 0, digitalset -> , compdev -> 0, userint2 -> 0, pointtype -> Float32)",barg,2057246661426095,True,PH 1stStg Comp Inner Seal Working Setpoint,,373599046381078,pi:160670,2020-06-24T18:51:54.298+0000,2020-06-30T10:27:35.629+0000,


# Aggregations

We can group and count with PySpark. For example, how many time series do we have per asset?
One way to find out is to group the time series metadata by asset id, count
the number of items in each group, and sort the counts in descending order.

In [0]:
display(tsmd.groupBy("assetId").count().orderBy("count", ascending=False))

assetId,count
3111454725058294,60
6191827428964450,43
1081261865374641,34
5072327905985771,15
3047932288982463,12
1657658425202359,12
7625290958805900,11
7738334214915176,10
1552506104397035,10
5838942267158947,9


How many different asset descriptions do we have, and how many assets per description?

In [0]:
display(assets.groupBy("description").count().orderBy("count", ascending=False))

description,count
VRD - PH HV SWGR 1STSTG COMP MTR,24
SOFT TAG VRD - PH 1STSTGCOMP SUCTION STV,14
SOFT TAG VRD - PH 1STSTGCOMP DISCHARGE,13
VRD - PH 1STSTGSUCTCOOL GAS IN,12
VRD - PH 1STSTG COMP LO HEADER,12
VRD PH,11
SOFT TAG VRD - PH 1STSTGCOMP SUCTION,10
VRD - PH 1STSTGSUCTCLR GAS IN,9
VRD - PH 1STSTGDISCHCOOLER B/D,9
VRD - PH 1STSTGSUCTSCRUB B/D,9


Spark has support for many different [types of aggregations](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData), such as `min`, `max`, `mean`, `sum`, etc.

We can plot the number of time series per asset to get an overall view of how many time series assets typically have.

Since this query produces two "count" columns, we'll use `.withColumn` to copy the first one to a new name.
We'll say more about `F` in a little bit, but for now we only need to know that `F.col("count")` lets us refer to
the column with the name "count".

In [0]:
import pyspark.sql.functions as F

display(tsmd.groupBy("assetId").count()
        # We call "count" twice, so we keep the per-asset counts by copying
        # the "count" column to a new column named "countsPerAsset".
        .withColumn("countsPerAsset", F.col("count"))
        .groupBy("countsPerAsset")
        # Now we count the number of assets with 1, 2, etc. time series connected to them.
        .count()
        # Sort by "countsPerAsset" so the bars in our chart appear in a stable order.
        .orderBy("countsPerAsset"))

countsPerAsset,count
1,115
2,8
3,1
5,1
8,1
9,6
10,2
11,1
12,2
15,1


If we just want to know the average number of time series per asset, we can use `agg` and the `avg` function directly.

In [0]:
display(tsmd.groupBy("assetId").count().agg(F.avg(F.col("count"))))

avg(count)
2.893617021276596
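
The two-step count and the average above can be mirrored in plain Python on a toy list, which may make the logic easier to follow. This is only a sketch: the `asset_ids` values are made up, and `collections.Counter` stands in for Spark's `groupBy(...).count()`.

```python
from collections import Counter

# Hypothetical toy data: one entry per time series, holding its assetId.
asset_ids = ["a", "a", "a", "b", "c", "c", "d"]

# First groupBy("assetId").count(): number of time series per asset.
series_per_asset = Counter(asset_ids)

# Second groupBy("countsPerAsset").count(): how many assets share each count.
assets_per_count = Counter(series_per_asset.values())

# Sort by the per-asset count, like orderBy("countsPerAsset").
histogram = sorted(assets_per_count.items())
print(histogram)

# The agg(F.avg(...)) step: mean number of time series per asset.
average = sum(series_per_asset.values()) / len(series_per_asset)
print(average)
```

Here two assets have one time series each, one asset has two, and one has three, so the histogram has three rows.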


From here on we will see more of `F`, our alias for the `pyspark.sql.functions` module.
Importing it as `F` allows us to use autocompletion to find functions in that module, and avoids
shadowing Python built-ins such as `min`, but it is also common to see individual functions imported like this:

In [0]:
from pyspark.sql.functions import avg

Since autocompletion is very useful, we recommend the `F` style.

# Filtering

We can use `.filter` or `.where` (same method by different names) to select a subset of data. `select` can be used to pick out specific columns, or even parts of columns like `metadata.SOURCE_TABLE`.

In [0]:
display(assets.where(assets.description == "VRD - 1ST STAGE COMPRESSOR LUBE OIL HEATER") \
       .select("name", "description", "metadata.SOURCE_TABLE"))

name,description,SOURCE_TABLE
23-FE-9106A,VRD - 1ST STAGE COMPRESSOR LUBE OIL HEATER,wmate_dba.wmt_tag
23-FE-9106,VRD - 1ST STAGE COMPRESSOR LUBE OIL HEATER,wmate_dba.wmt_tag
23-FE-9106B,VRD - 1ST STAGE COMPRESSOR LUBE OIL HEATER,wmate_dba.wmt_tag
60-EN-9010A+24B1,VRD - 1ST STAGE COMPRESSOR LUBE OIL HEATER,wmate_dba.wmt_tag


Root nodes are defined as having no parent, so their `.parentId` should be null.

In [0]:
display(assets.where(assets.parentId.isNull()))

externalId,name,parentId,parentExternalId,description,metadata,source,id,createdTime,lastUpdatedTime,rootId,aggregates,dataSetId,labels
,Vulkan Control Room,,,,Map(source -> Point cloud model of control room at Vulkan provided by Energima),,4093404255107247,2020-10-06T12:35:55.451+0000,2020-10-06T12:35:55.451+0000,4093404255107247,,,
houston.00. Support systems.Reverse osmosis,Reverse osmosis,,,,"Map(_replicatedInternalId -> 1536954437306151, _replicatedOriginalParentExternalId -> houston.00. Support systems, _replicatedOriginalParentId -> 7151733852409234, _replicatedTime -> 1592572207000)",,5072327905985771,2020-06-19T13:10:07.249+0000,2020-06-19T14:51:35.754+0000,5072327905985771,,,
,Aker BP,,,Aker BP,Map(),,6687602007296940,1970-01-01T00:00:00.000+0000,1970-01-01T00:00:00.000+0000,6687602007296940,,,


Similarly, we can look for uncontextualized time series metadata, which have a null `assetId`.

In [0]:
display(tsmd.where(tsmd.assetId.isNull()))

name,isString,metadata,unit,assetId,isStep,description,securityCategories,id,externalId,createdTime,lastUpdatedTime,dataSetId


As expected, all time series in `publicdata` are contextualized. We can negate a filter expression using `~`
to instead filter for time series that have been contextualized.

In the case of filtering based on non-`NULL` values we can also use `.isNotNull()`.

In [0]:
print(tsmd.where(~tsmd.assetId.isNull()).count())
print(tsmd.where(tsmd.assetId.isNotNull()).count())

# Column objects

`assets.description` and `assets.parentId` return [Column](https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.Column) objects.

Column objects have a wide range of useful methods, and we will see many examples from here on out.
We can also construct them from our DataFrame using string indexing, like `assets["description"]`,
which is necessary if the column name contains characters that are not valid Python identifiers.

For example, we can say `assets["VALUE (%C)"]`. We can also use `F.col("VALUE (%C)")` to create a Column directly.
However, if we do `F.col("name")` and there are several DataFrames involved that have a `name` column, we'd be in trouble
since we didn't specify which `name` column we meant, while `assets.name` would have been unambiguous.

We'll see more of that when looking at joins.

For those reasons, we recommend indexing the DataFrame (using `.name` when possible) to create Column objects, even if it can become a bit tedious to spell out the DataFrame name.

However, in the previous section we used `F.col("count")` because we didn't have a DataFrame object with a column
named count. Our "count" column only existed on an intermediate DataFrame. We could have stored that DataFrame and
given it a name, and then we could have used `df["count"]` (not `df.count`, which refers to the DataFrame's `count()`
method), but sometimes it just makes sense to not bother naming each intermediate DataFrame.
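
One caveat with attribute access is that a column whose name matches an existing method, such as `count`, cannot be reached with the dot syntax. The toy class below is a hypothetical stand-in (`ToyFrame` is not part of PySpark) that models this with Python's `__getattr__`, which only runs when normal attribute lookup fails. PySpark DataFrames are expected to behave the same way, but treat this as a model rather than the library's implementation.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not part of PySpark."""

    def __init__(self, columns):
        self._columns = list(columns)

    def count(self):
        # A regular method, playing the role of DataFrame.count().
        return 0

    def __getitem__(self, name):
        # String indexing always looks up a column.
        if name not in self._columns:
            raise KeyError(name)
        return f"Column<{name}>"

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so it never sees names that are already methods.
        if name in self._columns:
            return self[name]
        raise AttributeError(name)


df = ToyFrame(["assetId", "count", "VALUE (%C)"])

print(df.assetId)          # attribute access reaches the column
print(df["count"])         # indexing reaches the "count" column
print(callable(df.count))  # the dot syntax finds the method instead
print("VALUE (%C)".isidentifier())  # not a valid identifier, so indexing is required
```

String indexing works in every case, which is why it is the safe fallback when a column name is unusual or clashes with a method.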

# Joins

We can join data from different data frames to answer questions such as: which time series belong to the assets with ids `4050790831683279` and `3195126756929465`?

In [0]:
display(assets.where(assets.id.isin([4050790831683279, 3195126756929465])) \
        .join(tsmd, tsmd.assetId == assets.id) \
        .select(assets.name, assets.description, tsmd.description, tsmd.name))

name,description,description.1,name.1
23-GK-9107B-M01,VRD - 1ST STAGE COMPRESSOR ENCLOSURE COOLING UNIT B,PH 1stStg Comp Encl Cooling UnitB,VAL_23-GK-9107B-M01:Z.Y.Value
23-GK-9107B-M01,VRD - 1ST STAGE COMPRESSOR ENCLOSURE COOLING UNIT B,PH 1stStg Comp Encl Cooling UnitB,VAL_23-GK-9107B-M01-EL:XS.MeasuredValues.CurrentL2.Value.Value
23-PT-96150-01,VRD - PH 1STSTG COMP INNER SEAL NDE,PH 1st Stg Inner Seal NDE,VAL_23-PT-96150:Z.X1.Value


When doing joins we often have the same column name in both tables, which can cause confusing results. As you can see, we ended up with two `description` columns and two `name` columns.

`.alias` can be used to rename columns and help us keep track of which description belongs to the asset and which one belongs to the time series.

In [0]:
display(assets.where(assets.id.isin([4050790831683279, 3195126756929465]))
        .join(tsmd, tsmd.assetId == assets.id)
        .select(assets.name, assets.description, tsmd.description.alias("tsDescription"), tsmd.name.alias("tsName")))

name,description,tsDescription,tsName
23-GK-9107B-M01,VRD - 1ST STAGE COMPRESSOR ENCLOSURE COOLING UNIT B,PH 1stStg Comp Encl Cooling UnitB,VAL_23-GK-9107B-M01:Z.Y.Value
23-GK-9107B-M01,VRD - 1ST STAGE COMPRESSOR ENCLOSURE COOLING UNIT B,PH 1stStg Comp Encl Cooling UnitB,VAL_23-GK-9107B-M01-EL:XS.MeasuredValues.CurrentL2.Value.Value
23-PT-96150-01,VRD - PH 1STSTG COMP INNER SEAL NDE,PH 1st Stg Inner Seal NDE,VAL_23-PT-96150:Z.X1.Value


# Data points

We can retrieve the data for a time series by using the `datapoints` resource type. This one is a bit special, because it will return no data unless you have specified the `id` or `externalId` of the time series you want to get data for.

As a consequence, you should *not* cache data frames using the `datapoints` resource type. Otherwise the data frame will cache an empty result (and remain empty!) if you don't specify a time series when querying it.

In [0]:
dp = spark.read.format("cognite.spark.v1") \
    .option("type", "datapoints") \
    .option("apiKey", API_KEY) \
    .load()

In [0]:
display(dp.where(dp.externalId == "pi:160184") \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.timestamp < F.lit("2017-10-31")))

id,externalId,timestamp,value,aggregation,granularity
,pi:160184,2017-10-01T00:00:00.182+0000,155938.515625,,
,pi:160184,2017-10-01T00:00:01.182+0000,158635.5,,
,pi:160184,2017-10-01T00:00:03.166+0000,159136.90625,,
,pi:160184,2017-10-01T00:00:04.182+0000,162804.8125,,
,pi:160184,2017-10-01T00:00:05.182+0000,162559.09375,,
,pi:160184,2017-10-01T00:00:06.182+0000,163509.84375,,
,pi:160184,2017-10-01T00:00:07.182+0000,165762.171875,,
,pi:160184,2017-10-01T00:00:08.182+0000,168093.34375,,
,pi:160184,2017-10-01T00:00:10.166+0000,166568.25,,
,pi:160184,2017-10-01T00:00:11.182+0000,170824.25,,


If we don't specify an upper bound, [getLatest](https://doc.cognitedata.com/api/0.5/#operation/getLatest) will be
used to retrieve the maximum timestamp available.

Similarly, if there is no lower bound the Spark data source will make a query to the time series API to find the timestamp
of the first available data point.

Raw data points are downloaded by default, but the data points DataFrame also has full support for aggregates.

In [0]:
display(dp.where(dp.externalId == "pi:160184") \
        .where(dp.granularity == "7d") \
        .where(dp.aggregation.isin(["min", "average", "max"])) \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.timestamp < F.lit("2017-10-31")))

id,externalId,timestamp,value,aggregation,granularity
,pi:160184,2017-10-05T00:00:00.000+0000,134688.28125,min,7d
,pi:160184,2017-10-12T00:00:00.000+0000,133945.765625,min,7d
,pi:160184,2017-10-19T00:00:00.000+0000,137619.421875,min,7d
,pi:160184,2017-10-26T00:00:00.000+0000,136727.890625,min,7d
,pi:160184,2017-10-05T00:00:00.000+0000,229973.625,max,7d
,pi:160184,2017-10-12T00:00:00.000+0000,242748.78125,max,7d
,pi:160184,2017-10-19T00:00:00.000+0000,226412.546875,max,7d
,pi:160184,2017-10-26T00:00:00.000+0000,213801.640625,max,7d
,pi:160184,2017-10-05T00:00:00.000+0000,165691.68889511496,average,7d
,pi:160184,2017-10-12T00:00:00.000+0000,165339.72175126543,average,7d
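
The bucket timestamps in the `7d` output above can be inspected with standard-library Python. The sketch below parses them and checks whether each one falls on a whole multiple of seven days since the Unix epoch. That alignment is an observation about this sample output, not a documented guarantee of the API, but it would explain why the first bucket returned starts on 2017-10-05 rather than on the lower bound of 2017-10-01.

```python
from datetime import datetime, timezone

# Bucket start times copied from the 7d aggregate output above.
bucket_starts = [
    "2017-10-05T00:00:00.000+0000",
    "2017-10-12T00:00:00.000+0000",
    "2017-10-19T00:00:00.000+0000",
    "2017-10-26T00:00:00.000+0000",
]

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SEVEN_DAYS_IN_SECONDS = 7 * 24 * 60 * 60


def seconds_since_epoch(text):
    # %z accepts the "+0000" offset used in the displayed timestamps.
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")
    return int((parsed - EPOCH).total_seconds())


# A remainder of zero means the bucket starts on a 7-day boundary.
remainders = [seconds_since_epoch(t) % SEVEN_DAYS_IN_SECONDS for t in bucket_starts]
print(remainders)
```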


# Plotting data

The `display()` widget has a number of options for showing data in different ways, including a line plot that can group results by a column.

Using this we can easily create a plot showing the minimum, average, and maximum values for a time series.

In [0]:
display(dp.where(dp.externalId == "pi:160184") \
        .where(dp.granularity == "1d") \
        .where(dp.aggregation.isin(["min", "average", "max"])) \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.timestamp < F.lit("2017-10-31")))

id,externalId,timestamp,value,aggregation,granularity
,pi:160184,2017-10-02T00:00:00.000+0000,197392.984375,max,1d
,pi:160184,2017-10-03T00:00:00.000+0000,202713.421875,max,1d
,pi:160184,2017-10-04T00:00:00.000+0000,204082.140625,max,1d
,pi:160184,2017-10-05T00:00:00.000+0000,229973.625,max,1d
,pi:160184,2017-10-06T00:00:00.000+0000,198926.25,max,1d
,pi:160184,2017-10-07T00:00:00.000+0000,198872.640625,max,1d
,pi:160184,2017-10-08T00:00:00.000+0000,202595.515625,max,1d
,pi:160184,2017-10-09T00:00:00.000+0000,202687.828125,max,1d
,pi:160184,2017-10-10T00:00:00.000+0000,203020.140625,max,1d
,pi:160184,2017-10-11T00:00:00.000+0000,200695.15625,max,1d


# Joins with data points

Due to limitations in Spark (which we may be able to work around one day), it's not possible to join `datapoints` directly. Instead, we can get the externalIds of the time series we want to look at as a Python list by using `.collect()`.

For example, let's say we want to look at data points from the time series with description `PH 1stStgComp Discharge` that are connected to the assets with description `VRD - PH 1STSTGCOMP DISCHARGE` that we found above. First we get the externalIds of those time series into a Python list.

In [0]:
discharge_time_series = assets.where(assets.description == "VRD - PH 1STSTGCOMP DISCHARGE") \
  .join(tsmd, tsmd.assetId == assets.id) \
  .select(tsmd.externalId.alias("tsName"))
discharge_time_series_names = [ t.tsName for t in discharge_time_series.collect() ]
discharge_time_series_names

Then we can use `.where(dp.externalId.isin(discharge_time_series_names))` to get the effect of the join we wanted.

In [0]:
display(dp.where(dp.externalId.isin(discharge_time_series_names)) \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.aggregation == 'min') \
        .where(dp.granularity == "7d"))

id,externalId,timestamp,value,aggregation,granularity
,pi:160111,2017-10-05T00:00:00.000+0000,120.47130584716795,min,7d
,pi:160111,2017-10-12T00:00:00.000+0000,117.5018310546875,min,7d
,pi:160111,2017-10-19T00:00:00.000+0000,118.24420166015624,min,7d
,pi:160111,2017-10-26T00:00:00.000+0000,119.8217315673828,min,7d
,pi:160111,2017-11-02T00:00:00.000+0000,118.52259063720705,min,7d
,pi:160111,2017-11-09T00:00:00.000+0000,120.00732421875,min,7d
,pi:160111,2017-11-16T00:00:00.000+0000,68.36630249023438,min,7d
,pi:160111,2017-11-23T00:00:00.000+0000,109.9853515625,min,7d
,pi:160111,2017-11-30T00:00:00.000+0000,118.29059600830078,min,7d
,pi:160111,2017-12-07T00:00:00.000+0000,119.77533721923828,min,7d
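
The collect-then-filter pattern used above amounts to a two-step lookup: first gather the keys from the small tables, then filter the large table by membership. A plain-Python sketch on made-up rows (all ids and values below are hypothetical) shows the shape of it:

```python
# Hypothetical toy rows standing in for the assets, time series metadata
# and data points tables.
assets_rows = [
    {"id": 1, "description": "VRD - PH 1STSTGCOMP DISCHARGE"},
    {"id": 2, "description": "VRD PH"},
]
tsmd_rows = [
    {"externalId": "pi:1", "assetId": 1},
    {"externalId": "pi:2", "assetId": 2},
    {"externalId": "pi:3", "assetId": 1},
]
dp_rows = [
    {"externalId": "pi:1", "value": 120.5},
    {"externalId": "pi:2", "value": 7.0},
    {"externalId": "pi:3", "value": 118.2},
]

# Step 1: the small lookup (the join plus collect() in the notebook).
wanted_assets = {a["id"] for a in assets_rows
                 if a["description"] == "VRD - PH 1STSTGCOMP DISCHARGE"}
wanted_external_ids = [t["externalId"] for t in tsmd_rows
                       if t["assetId"] in wanted_assets]

# Step 2: a plain membership filter (the isin() call in the notebook).
selected = [d for d in dp_rows if d["externalId"] in wanted_external_ids]
print(selected)
```

This works well as long as the collected list is small enough to hold on the driver, which is the case for a handful of time series identifiers.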


# Files metadata

We also support files metadata: currently you can read and update existing files metadata.

In [0]:
files = spark.read.format("cognite.spark.v1") \
  .option("type", "files") \
  .option("apiKey", API_KEY) \
  .load() \
  .cache()

In [0]:
files.printSchema()

In [0]:
display(files.groupBy(files.mimeType) \
        .count() \
        .orderBy("count", ascending=False))

mimeType,count
application/pdf,11
image/svg+xml,7
