# Synthetic time series

When creating synthetic time series, we often want to perform calculations on multiple time series per timestamp.
In this notebook you'll find an example of how to do so. Feel free to copy this notebook and use it as a template for your own synthetic time series.

## Create your DataFrame to access your data points in clean

Since you want to read your data points from the Cognite Data Platform (CDP), you need to create a Spark DataFrame for them using `.option("type", "datapoints")` with Cognite's Spark data source.

In [2]:
dp = spark.read.format("com.cognite.spark.datasource") \
    .option("apiKey", dbutils.secrets.get("linda", "publicdata")) \
    .option("type", "datapoints") \
    .load()

## Pivot the data points of the time series you need

By default, data points in CDP come in a narrow, tall format like this:

| name          | timestamp | value |
|---------------|-----------|-------|
| time series 1 | 1         | 5     |
| time series 1 | 2         | 6     |
| time series 1 | 3         | 7     |
| time series 2 | 1         | 8     |
| time series 2 | 2         | 9     |
| time series 2 | 3         | 10    |

To do row-based calculations, we need to get them into a wide format like this:

| timestamp | time series 1 | time series 2 |
|-----------|---------------|---------------|
| 1         | 5             | 8             |
| 2         | 6             | 9             |
| 3         | 7             | 10            |

So we select the `name`, `timestamp` and `value` of the time series of interest: `dp.select("value", "timestamp", "name").where(dp.name.isin(["VAL_23-FT-92537-01:X.Value","VAL_23_FT_92537_03:Z.X.Value"]))`, with `VAL_23-FT-92537-01:X.Value` and `VAL_23_FT_92537_03:Z.X.Value` being the `name`s of our time series.

Then we need to make sure that all our time series have values at the same timestamps, which we do by using CDP's aggregates: `.where(dp.aggregation.isin('avg')).filter(dp.granularity == '5m')`. **Keep in mind that CDP aggregates are "best effort" and might not be available or up to date for all time windows if you recently ingested or changed data.**
- In this example we use a granularity of 5 minutes: `.filter(dp.granularity == '5m')`. That also means that the synthetic time series we create will have a value every 5 minutes. You can adapt this to fit your needs. *TODO: docu on possible granularity*
- Here we use the average value of each time series within the chosen granularity: `.where(dp.aggregation.isin('avg'))`. Which aggregation to use depends on your data, so adapt this to fit your needs. *TODO: docu on possible aggregation functions*
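
To make the effect of the `avg` aggregate and the `5m` granularity concrete, here is a plain-Python sketch of the idea, with no Spark or CDP involved. It assumes that timestamps are epoch milliseconds and that each window is labelled by its start. `five_minute_averages` is a hypothetical helper, not CDP's implementation, and CDP may treat window boundaries differently.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5 minutes in milliseconds (assumed timestamp unit)

def five_minute_averages(points):
    """Average (timestamp_ms, value) pairs per 5-minute window.

    Hypothetical helper for illustration; windows are labelled by their start.
    """
    buckets = defaultdict(list)
    for timestamp_ms, value in points:
        window_start = timestamp_ms - timestamp_ms % WINDOW_MS
        buckets[window_start].append(value)
    return {start: sum(values) / len(values) for start, values in sorted(buckets.items())}

# Two series sampled at different, irregular times...
series_a = [(10_000, 4.0), (200_000, 6.0), (310_000, 8.0)]
series_b = [(50_000, 1.0), (400_000, 3.0)]

# ...end up with values on the same window starts (0 and 300000)
avg_a = five_minute_averages(series_a)
avg_b = five_minute_averages(series_b)
print(avg_a)
print(avg_b)
```

Because both results share the same keys, the two series can be combined row by row, which is what the pivot further down relies on.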

We group by timestamp (`.groupBy("timestamp")`) so that all values measured at the same time, across all time series, end up in one group and therefore in one row.

To get a wide table, we pivot on the `name` of the time series (`.pivot("name")`), so each unique name becomes a column. When we pivot, we need to specify an aggregation, which tells Spark what to do if several values end up in the same cell. Pivoting by name and grouping by timestamp on CDP data points will never result in that situation, but we still need to specify it. In this case we chose `.sum("value")`.

In [4]:
pivot = dp.select("value", "timestamp", "name")\
  .where(dp.name.isin(["VAL_23-FT-92537-01:X.Value", "VAL_23_FT_92537_03:Z.X.Value"]))\
  .where(dp.aggregation.isin('avg'))\
  .filter(dp.granularity == '5m')\
  .groupBy("timestamp")\
  .pivot("name")\
  .sum("value")
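
To see what `groupBy`, `pivot` and `sum` do to the small example tables above, here is a plain-Python sketch with no Spark involved. `pivot_sum` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# The narrow example table from above: (name, timestamp, value)
narrow = [
    ("time series 1", 1, 5), ("time series 1", 2, 6), ("time series 1", 3, 7),
    ("time series 2", 1, 8), ("time series 2", 2, 9), ("time series 2", 3, 10),
]

def pivot_sum(rows):
    """Group by timestamp, turn each name into a column, and sum the values."""
    wide = defaultdict(dict)
    for name, timestamp, value in rows:
        wide[timestamp][name] = wide[timestamp].get(name, 0) + value
    return dict(wide)

wide = pivot_sum(narrow)
for timestamp in sorted(wide):
    print(timestamp, wide[timestamp])
```

If a time series has no value for a timestamp, its key is simply missing from that row here. In Spark, the corresponding cell is null.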

##  Apply calculations

We'll import everything from `pyspark.sql.functions` to be able to use `lit` (to define Spark SQL constants), `col` (to refer to columns using strings), `when` (similar to a SQL CASE statement), etc.

In [6]:
from pyspark.sql.functions import *

We'll refer to the columns of the time series we were looking at above in our calculations several times, so we can shorten the calculations by declaring some variables.

In [8]:
c1 = col("`VAL_23-FT-92537-01:X.Value`")
c2 = col("`VAL_23_FT_92537_03:Z.X.Value`")

Now we can easily create a DataFrame to hold the results for our synthetic time series.

- We'll name our new DataFrame `synth_dp` with a `name` column, which should have the name of our new time series as its value. We'll name the new time series `name of your new ts` and use `lit` to create this constant value: `synth_dp = pivot.withColumn("name", lit("name of your new ts"))`
- Then we create a new column named `value` and fill it with the output of our calculations. To define which calculation needs to be done, we use [pyspark's column functions](http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.Column): `.withColumn("value", when(c1.isNull(), -2.0).when(c1 > 6224.0, c2*2.0 / c1).otherwise(c1*c2))`
- When writing data points to CDP, we need to satisfy the data points schema of the CDP Spark data source, which contains `aggregation` and `granularity`. We always write raw data points, so we set those columns to null using `lit(None)`: `.withColumn("aggregation",lit(None)).withColumn("granularity", lit(None))`
- Finally we select all the columns we need for the data points schema: `.select("name", "timestamp", "value", "aggregation", "granularity")`

In [10]:
synth_dp = pivot.withColumn("name", lit("name of your new ts"))\
  .withColumn("value", when(c1.isNull(), -2.0).when(c1 > 6224.0, c2*2.0 / c1).otherwise(c1*c2))\
  .withColumn("aggregation", lit(None))\
  .withColumn("granularity", lit(None))\
  .select("name", "timestamp", "value", "aggregation", "granularity")

In [11]:
display(synth_dp)
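
Before writing anything back, it can help to check a few inputs by hand. The function below is a plain-Python mirror of the `when`/`otherwise` chain, with `None` standing in for SQL null. `expected_value` is a hypothetical name, and the null handling reflects standard Spark SQL semantics (arithmetic on null gives null) rather than anything specific to CDP.

```python
def expected_value(c1, c2):
    """Plain-Python mirror of the when/otherwise chain; None stands in for null."""
    if c1 is None:
        return -2.0           # first when(): c1 is null
    if c2 is None:
        return None           # both remaining branches use c2, and null arithmetic gives null
    if c1 > 6224.0:
        return c2 * 2.0 / c1  # second when()
    return c1 * c2            # otherwise()

print(expected_value(None, 3.0))     # -2.0
print(expected_value(10000.0, 5.0))  # 0.001
print(expected_value(2.0, 3.0))      # 6.0
print(expected_value(2.0, None))     # None
```

Comparing a handful of rows from `display(synth_dp)` against such a function is a cheap way to catch a misplaced branch.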

##  Write back to CDP

Now that we have a proper DataFrame with our brand new synthetic time series, we want to store it back in CDP.
For that we need a DataFrame that references a data points resource with an API key that has write permissions (which we don't have for Open Industrial Data).

To run this successfully, the metadata for your target time series must already exist. In other words, you need to have a time series named `name of your new ts` in your destination CDP project.

In [13]:
dp.createOrReplaceTempView("dp")
synth_dp.write.insertInto("dp")
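
Spark's `insertInto` matches columns by position rather than by name, so the column order of `synth_dp` has to match the target. A minimal sketch of a guard follows. `assert_same_columns` is a hypothetical helper, and the column list in the example is taken from the `select` above rather than from the actual data source schema.

```python
def assert_same_columns(target_columns, source_columns):
    """Raise ValueError if the two column lists differ in names or order."""
    target, source = list(target_columns), list(source_columns)
    if target != source:
        raise ValueError(
            "Column mismatch: target has {}, source has {}".format(target, source)
        )

# In the notebook this would be called as:
#   assert_same_columns(dp.columns, synth_dp.columns)
expected = ["name", "timestamp", "value", "aggregation", "granularity"]
assert_same_columns(expected, ["name", "timestamp", "value", "aggregation", "granularity"])
```

Running the check before `insertInto` turns a silent column mix-up into an explicit error.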

##  Congratulations, you're done!

***But wait. What if my calculations are super complicated and cannot be done with the column functions?***

## Create a UDF for your calculation

A User Defined Function (UDF) is the most powerful option in Spark (but also the least performant one). It can do almost anything you can do using normal Python code, and you can use Python libraries in it as you please.

When using `Column` and `pyspark.sql.functions` all the transformations will be executed at full speed as Java byte code, and error messages might be more helpful as well. However, if you really cannot do your super complicated calculation using `Column` and `pyspark.sql.functions`, [UDFs](https://docs.databricks.com/spark/latest/spark-sql/udf-python.html) might be the way to go.

Let's create a normal Python function and put it in a UDF.

In [16]:
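from pyspark.sql.types import *
from pyspark.sql.functions import udf

def randomCalculation_(value1, value2):
  # Spark passes null values to a Python UDF as None
  if value1 is None:
    return -2.0
  elif value2 is None:
    # Match the Column version, where arithmetic on null gives null
    return None
  elif value1 > 6224:
    return (value2 * 2.0) / value1
  else:
    return float(value1 * value2)

# Return floats: with DoubleType, an int result would come back as null
randomCalculation = udf(randomCalculation_, DoubleType())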

Creating your DataFrame using that UDF then looks like this:

In [18]:
from pyspark.sql.functions import *

c1 = col("`VAL_23-FT-92537-01:X.Value`")
c2 = col("`VAL_23_FT_92537_03:Z.X.Value`")

synth_dp = pivot.withColumn("name", lit("name of your new ts"))\
  .withColumn("value", randomCalculation(c1,c2))\
  .withColumn("aggregation", lit(None))\
  .withColumn("granularity", lit(None))\
  .select("name", "timestamp", "value", "aggregation", "granularity")
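
One thing to watch with Python UDFs: the declared return type is not a conversion. As far as we know, a UDF declared with `DoubleType()` that returns a Python `int` yields null instead of the number. A small sketch of a guard follows, where `returns_double` and `product_or_null` are hypothetical names used only for illustration.

```python
import functools

def returns_double(func):
    """Coerce a function's result to float, passing None (null) through unchanged."""
    @functools.wraps(func)
    def wrapper(*args):
        result = func(*args)
        return None if result is None else float(result)
    return wrapper

@returns_double
def product_or_null(value1, value2):
    # Hypothetical calculation that may return an int or None
    if value1 is None or value2 is None:
        return None
    return value1 * value2

print(product_or_null(2, 3))     # 6.0
print(product_or_null(None, 3))  # None
```

The decorated function can then be wrapped with `udf(..., DoubleType())` in the same way as `randomCalculation_`.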

Alternatively you can use [Pandas UDFs](https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html), which will give you better performance.

Instead of calling your UDF for each row, Pandas UDFs are called with `pandas.Series` arguments containing the values for multiple rows of a column, and should also return a `pandas.Series` with one value for each row.

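Similar to `Column` in PySpark, `pandas.Series` supports operators like `>`, `*` and `/`, applied element by element. A Python `if` cannot branch on a whole `pandas.Series`, though, so the conditions are expressed with `Series.where` instead. The calculation stays the same, but it will typically run much faster than the row-by-row UDF.

In [20]:
import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import DoubleType

# Declare the function and create the UDF
def randomCalculationPandas_(value1, value2):
  # value1 and value2 are pandas.Series; null values arrive as NaN
  # Keep the product where value1 <= 6224, otherwise use the ratio
  result = (value1 * value2).where(value1 <= 6224, (value2 * 2) / value1)
  # Keep the result where value1 is present, use -2.0 where it is missing
  return result.where(value1.notnull(), -2.0)

randomCalculationPandas = pandas_udf(randomCalculationPandas_, returnType=DoubleType())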

Creating your DataFrame using that Pandas UDF then looks like this:

In [22]:
from pyspark.sql.functions import *

c1 = col("`VAL_23-FT-92537-01:X.Value`")
c2 = col("`VAL_23_FT_92537_03:Z.X.Value`")


synth_dp = pivot.withColumn("name", lit("name of your new ts"))\
  .withColumn("value", randomCalculationPandas(c1,c2))\
  .withColumn("aggregation", lit(None))\
  .withColumn("granularity", lit(None))\
  .select("name", "timestamp", "value", "aggregation", "granularity")