# Introduction

See also the [documentation](https://github.com/cognitedata/cdp-spark-datasource#reading-and-writing-cognite-data-platform-resource-types) for examples for each resource type. The general pattern is:

```python
my_data_frame = spark.read.format("cognite.spark.v1") \
  .option("type", "some-resource-type") \
  .option("apiKey", dbutils.secrets.get("your-scope", "api-key-for-project")) \
  .load()
```

The resource types are:
- `assets`
- `events`
- `timeseries`: time series metadata
- `datapoints`: data points for a time series, also supports aggregates
- `raw`: "RAW" tables, which also require `.option("database", "some-database")` and `.option("table", "some-table")`

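The general pattern and the extra options for `raw` can be wrapped in a small helper. This is only a sketch: `cdf_read_options` and `read_resource` are hypothetical names that are not part of the data source, and the check assumes that `database` and `table` are the only extra options `raw` needs.

```python
def cdf_read_options(resource_type, api_key, **extra):
    """Build the reader options for one resource type (hypothetical helper)."""
    options = {"type": resource_type, "apiKey": api_key}
    options.update(extra)
    if resource_type == "raw":
        # RAW tables are addressed by database and table name.
        missing = [key for key in ("database", "table") if key not in options]
        if missing:
            raise ValueError("raw reads also need: " + ", ".join(missing))
    return options


def read_resource(spark, resource_type, api_key, **extra):
    """Return a DataFrame for the resource type; `spark` is the notebook's SparkSession."""
    return (spark.read.format("cognite.spark.v1")
            .options(**cdf_read_options(resource_type, api_key, **extra))
            .load())


# Building the options needs neither a cluster nor an API key.
raw_options = cdf_read_options("raw", "my-key", database="some-database", table="some-table")
```
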
Let's start by reading some data from the `publicdata` project. If you don't have an API key, go get one from the [Open Industrial Data project](https://openindustrialdata.com/).

We'll assume that you have set up a [secret scope](https://docs.databricks.com/user-guide/secrets/index.html) in Databricks with a key `api-key-publicdata` containing your API key for Open Industrial Data.

In [None]:
secret_scope = "" # name your secret scope here
project_key = "" # name the key to use from secret scope here

In [None]:
assets = spark.read.format("cognite.spark.v1") \
    .option("type", "assets") \
    .option("apiKey", dbutils.secrets.get(secret_scope, project_key)) \
    .load()

# DataFrames

We get back a [Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html) from `spark.read.format...load()`, which "is conceptually equivalent to a table in a relational database or a data frame in R/Python".

`spark` is your entry point to the Spark API, and it's a `SparkSession` with a connection to the cluster. You will mostly use it to read data frames, and then interact with the data frames.

You may have noticed that the command finished almost immediately. The data frame is a lazy data structure, and doesn't actually load any data until it has to. You can view the schema (the column names and their types) by clicking the small arrow next to the output. The schema is constant for assets, so we didn't need to read any data to produce that schema.

Data is loaded only when you perform an *action* on a data frame, which requires data to be present. Examples of actions include `.count()` (for counting the rows), `.show()` (for printing the first few rows), `.toPandas()` (for converting to a Pandas data frame, downloading all data to your Python process), and pretty much anything else that uses data.

In [None]:
print(assets.count())
assets.show()
assets.toPandas()

DataFrames can be distributed with many partitions being placed on different nodes in our Spark cluster.

In [None]:
assets.rdd.getNumPartitions()

We can drop and rename columns, getting a DataFrame with a new schema.

In [None]:
assets.drop("metadata") \
  .drop("externalId") \
  .drop("source") \
  .withColumnRenamed("description", "descr") \
  .withColumnRenamed("lastUpdatedTime", "updatedAt") \
  .printSchema()

Notebooks have autocompletion built in, and you can view keyboard shortcuts by clicking the keyboard icon in the top bar.

# Displaying data

In Databricks there's a convenient `display()` function you can use to show data in data frames (and a few other formats, like pandas data frames and matplotlib figures). Since showing the data requires it to be loaded, this will also trigger an action.
By default, only the first 1000 rows are displayed in the widget, even if Spark needs to load more data than this in the background.

Note that you might need to scroll within the widget to show all the results.

You can sort the rows shown by different columns, and you can expand the "string to string" map in the metadata column by clicking the arrow.

In [None]:
display(assets)

In [None]:
# we will explain groupBy() and count() in the section on aggregations
display(assets.groupBy(assets.parentId).count())

# Caching data

As we mentioned before, the data frame is a lazy structure that loads data when it is needed. Loading data over and over again can be slow and wasteful when we don't absolutely need it to be completely up-to-date.
In that case, we can create a cached data frame by adding `.cache()` at the end.

In [None]:
events = spark.read.format("cognite.spark.v1") \
    .option("type", "events") \
    .option("apiKey", dbutils.secrets.get(secret_scope, project_key)) \
    .load() \
    .cache()

In [None]:
events.printSchema()

In [None]:
print(events.count())
events.show()
events.toPandas()

If we run the same commands again they should finish more quickly (potentially much more quickly if there's a lot of data).

In [None]:
print(events.count())
events.show()
events.toPandas()

The cached data is now kept in memory by Spark, if possible, and it is reloaded only if a node crashes. Even then, only the data that was kept on that node is reloaded, if possible.

Caching is a good idea if you have a large amount of data that will not be changed.

However, if you cache events as above, your cached copy will not receive new events.
This might seem obvious, but it means that if you're doing something like waiting for new events that you
have just created to show up, you should *not* cache the DataFrame you're using to check for new events!

# Time series metadata

Let's read the time series metadata into another cached data frame.

In [None]:
tsmd = spark.read.format("cognite.spark.v1") \
    .option("type", "timeseries") \
    .option("apiKey", dbutils.secrets.get(secret_scope, project_key)) \
    .load() \
    .cache()

In [None]:
tsmd.printSchema()

In [None]:
tsmd.count()

In [None]:
display(tsmd)

# Aggregations

We can group and count rows using PySpark. For example, how many time series do we have per asset?
One way to find out is to group the time series metadata by asset id, count the number of items in each group,
and then sort the counts in descending order.

In [None]:
display(tsmd.groupBy("assetId").count().orderBy("count", ascending=False))

How many different asset descriptions do we have, and how many assets per description?

In [None]:
display(assets.groupBy("description").count().orderBy("count", ascending=False))

Spark has support for many different [types of aggregations](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData), such as `min`, `max`, `mean`, `sum`, etc.

We can plot the number of time series associated with each asset, to get an overall view of how many time series assets typically have.

Since we have several "counts" in this query, we'll use `.withColumn` to copy the first one into a column with a new name.
We'll say more about `F` in a little bit, but for now we only need to know that `F.col("count")` lets us refer to
the column with the name "count".

In [None]:
import pyspark.sql.functions as F

display(tsmd.groupBy("assetId").count() \
        # We will do two "count" calls here, so we need to remember the counts per asset,
        # by renaming the column named "count" at this point to "countsPerAsset"
        .withColumn("countsPerAsset", F.col("count")) \
        .groupBy("countsPerAsset") \
        # Now we count the number of assets with 1, 2, etc. time series connected to them.
        .count() \
        # In order to avoid a random order in our bar chart we can sort by "countsPerAsset"
        .orderBy("countsPerAsset"))

If we just want to know the average number of time series per asset, we can use `agg` and the `avg` function directly.

In [None]:
display(tsmd.groupBy("assetId").count().agg(F.avg(F.col("count"))))

From here on we will see more of `F`, our alias for the `pyspark.sql.functions` module.
Importing it as `F` allows us to use autocompletion to find functions in that module, and avoids
ambiguities with built-ins like `min`, but it is also common to see individual functions imported like this:

In [None]:
from pyspark.sql.functions import avg

Since autocompletion is very useful, we recommend the `F` style.

# Filtering

We can use `.filter` or `.where` (two names for the same method) to select a subset of data. `.select` can be used to pick out specific columns, or even parts of columns like `metadata.SOURCE_TABLE`.

In [None]:
display(assets.where(assets.description == "VRD - 1ST STAGE COMPRESSOR LUBE OIL HEATER") \
       .select("name", "description", "metadata.SOURCE_TABLE"))

Root nodes are defined as having no parent, so their `.parentId` should be null.

In [None]:
display(assets.where(assets.parentId.isNull()))

Similarly, we can look for uncontextualized time series metadata, which have a null `assetId`.

In [None]:
display(tsmd.where(tsmd.assetId.isNull()))

As expected, all time series in `publicdata` are contextualized. We can negate a filter expression using `~`
to instead filter for time series that have been contextualized.

In the case of filtering based on non-`NULL` values we can also use `.isNotNull()`.

In [None]:
print(tsmd.where(~tsmd.assetId.isNull()).count())
print(tsmd.where(tsmd.assetId.isNotNull()).count())

# Column objects

`assets.description` and `assets.parentId` return [Column](https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.Column) objects.

Column objects have a wide range of useful methods, and we will see many examples from here on out.
We can also construct them from our DataFrame using string indexing, like `assets["description"]`,
which is necessary if the column name contains characters that are not valid in Python identifiers.

For example, we can say `assets["VALUE (%C)"]`. We can also use `F.col("VALUE (%C)")` to create a Column directly.
However, if we do `F.col("name")` and there are several DataFrames involved that have a `name` column, we'd be in trouble
since we didn't specify which `name` column we meant, while `assets.name` would have been unambiguous.

We'll see more of that when looking at joins.

For those reasons, we recommend indexing the DataFrame (using `.name` when possible) to create Column objects, even if it can become a bit tedious to spell out the DataFrame name.

However, in the section on aggregations we used `F.col("count")` because we didn't have a DataFrame object with a column
named count. Our "count" column only existed on an intermediate DataFrame. We could have stored that DataFrame and
given it a name, and then we could have used `df["count"]` (not `df.count`, which is the DataFrame's `count()` method),
but sometimes it just makes sense to not bother naming each intermediate DataFrame.

# Joins

We can join data from different data frames together to answer questions like, what are the time series for asset ids `4050790831683279` and `3195126756929465`?

In [None]:
display(assets.where(assets.id.isin([4050790831683279, 3195126756929465])) \
        .join(tsmd, tsmd.assetId == assets.id) \
        .select(assets.name, assets.description, tsmd.description, tsmd.name))

When doing joins we often have the same column name in both tables, which can cause confusing results. As you can see, we ended up with two `description` columns and two `name` columns.

`.alias` can be used to rename columns and help us keep track of which description belongs to the asset and which one belongs to the time series.

In [None]:
display(assets.where(assets.id.isin([4050790831683279, 3195126756929465]))
        .join(tsmd, tsmd.assetId == assets.id)
        .select(assets.name, assets.description, tsmd.description.alias("tsDescription"), tsmd.name.alias("tsName")))

# Data points

We can retrieve the data for a time series by using the `datapoints` resource type. This one is a bit special, because it will return no data unless you have specified which time series you want data for, for example by filtering on `externalId`.

As a consequence, you should *not* cache data frames using the `datapoints` resource type. Otherwise the data frame will cache an empty result (and remain empty!) if you don't specify a time series when querying it.

In [None]:
dp = spark.read.format("cognite.spark.v1") \
    .option("type", "datapoints") \
    .option("apiKey", dbutils.secrets.get(secret_scope, project_key)) \
    .load()

In [None]:
display(dp.where(dp.externalId == "pi:160184") \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.timestamp < F.lit("2017-10-31")))

If we don't specify an upper bound, [getLatest](https://doc.cognitedata.com/api/0.5/#operation/getLatest) will be
used to retrieve the maximum timestamp available.

Similarly, if there is no lower bound the Spark data source will make a query to the time series API to find the timestamp
of the first available data point.

Raw data points are downloaded by default, but the data points DataFrame also has full support for aggregates.

In [None]:
display(dp.where(dp.externalId == "pi:160184") \
        .where(dp.granularity == "7d") \
        .where(dp.aggregation.isin(["min", "average", "max"])) \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.timestamp < F.lit("2017-10-31")))

# Plotting data

The `display()` widget has a number of options for showing data in different ways, including a line plot that can group results by a column.

Using this we can easily create a plot showing the minimum, average, and maximum values for a time series.

In [None]:
display(dp.where(dp.externalId == "pi:160184") \
        .where(dp.granularity == "1d") \
        .where(dp.aggregation.isin(["min", "average", "max"])) \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.timestamp < F.lit("2017-10-31")))

# Joins with data points

Due to limitations in Spark (that we may perhaps one day be able to work around) it's not possible to join `datapoints` directly, but we can get the externalIds of the time series we want to look at as a Python list by using `.collect()`.

For example, let's say we want to look at data points from the time series with description `PH 1stStgComp Discharge` that are connected to the assets with description `VRD - PH 1STSTGCOMP DISCHARGE` that we found above. First we get the externalIds of those time series into a Python list.

In [None]:
discharge_time_series = assets.where(assets.description == "VRD - PH 1STSTGCOMP DISCHARGE") \
  .join(tsmd, tsmd.assetId == assets.id) \
  .select(tsmd.externalId.alias("tsName"))
discharge_time_series_names = [ t.tsName for t in discharge_time_series.collect() ]
discharge_time_series_names

Then we can use `.where(dp.externalId.isin(discharge_time_series_names))` to do the join we wanted.

In [None]:
display(dp.where(dp.externalId.isin(discharge_time_series_names)) \
        .where(dp.timestamp > F.lit("2017-10-01")) \
        .where(dp.aggregation == 'min') \
        .where(dp.granularity == "7d"))

# Files metadata

We also have support for files metadata. Currently we support reading and updating existing files metadata.

In [None]:
files = spark.read.format("cognite.spark.v1") \
  .option("type", "files") \
  .option("apiKey", dbutils.secrets.get(secret_scope, project_key)) \
  .load() \
  .cache()

In [None]:
files.printSchema()

In [None]:
display(files.groupBy(files.mimeType) \
        .count() \
        .orderBy("count", ascending=False))