In [4]:
%run "./Includes/Classroom-Setup"

## Databricks Delta Time Travel

The Databricks Delta log records which files are valid for each read/write operation.

By referencing this list, you can request the data as it existed at a specific point in time.

This is similar to the concept of revision histories for code.
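
To make the idea concrete, here is a minimal pure-Python sketch of replaying a list of commits to find the files valid at a given version. The `Commit` structure and the `files_at_version` helper are hypothetical illustrations, not the actual Delta log format or API.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for one entry in a transaction log.
@dataclass
class Commit:
    version: int
    added: set = field(default_factory=set)     # files that become valid
    removed: set = field(default_factory=set)   # files that stop being valid

def files_at_version(log, version):
    """Replay commits up to and including `version` and return the valid files."""
    valid = set()
    for commit in sorted(log, key=lambda c: c.version):
        if commit.version > version:
            break
        valid |= commit.added
        valid -= commit.removed
    return valid

toy_log = [
    Commit(0, added={"part-0.parquet"}),
    Commit(1, added={"part-1.parquet"}),
    Commit(2, added={"part-2.parquet"}, removed={"part-0.parquet"}),
]

print(sorted(files_at_version(toy_log, 1)))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(files_at_version(toy_log, 2)))  # ['part-1.parquet', 'part-2.parquet']
```

Because older commits are never rewritten in this toy model, asking for version 1 after version 2 exists still returns the earlier file list, which is the essence of time travel.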

Examples of Time Travel use cases are:
* Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). 
  * This could be useful for debugging or auditing, especially in regulated industries.
* Writing complex temporal queries.
* Fixing mistakes in your data.
* Providing snapshot isolation for a set of queries on fast-changing tables.

## Slow Stream of Files

Our stream source is a repository of many small files.

In [7]:
from pyspark.sql.types import StructType, StructField, DoubleType
spark.conf.set("spark.sql.shuffle.partitions", 8)

dataPath = "/mnt/training/power-plant/streamed.parquet"

dataSchema = StructType([
  StructField("AT", DoubleType(), True),
  StructField("V", DoubleType(), True),
  StructField("AP", DoubleType(), True),
  StructField("RH", DoubleType(), True),
  StructField("PE", DoubleType(), True)
])

initialDF = (spark
  .readStream                        # Returns DataStreamReader
  .option("maxFilesPerTrigger", 1)   # Force processing of only 1 file per trigger 
  .schema(dataSchema)                # Required for all streaming DataFrames
  .parquet(dataPath) 
)

## Append to a Databricks Delta Table

We will append the stream to a Databricks Delta table named `powerTable`.

In [9]:
from pyspark.sql.types import TimestampType

writePath      = workingDir + "/output.parquet"    # A subdirectory for our output
checkpointPath = workingDir + "/output.checkpoint" # A subdirectory for our checkpoint & write-ahead logs

powerTable = "powerTable"

To help manage our streams, we will use **`untilStreamIsReady()`** and **`stopAllStreams()`**, and define the stream name **`myStreamName`**:

In [11]:
myStreamName = "lesson08_ps"

## Introducing Time Travel

Databricks Delta time travel allows you to query an older snapshot of a table.

Here, we introduce a new option to Databricks Delta.

`.option("timestampAsOf", now)` 

Here `now` is the current timestamp, which must be a STRING that can be cast to a Timestamp.

There is an alternate notation as well:

`.option("versionAsOf", version)`
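
As a minimal sketch of the two notations, the helpers below format a `datetime` as a string that can be cast to a timestamp and build the matching option dictionary. The names `as_timestamp_string` and `time_travel_options` are hypothetical and introduced only for illustration; they are not part of any Delta API.

```python
import datetime

def as_timestamp_string(moment):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS', a form castable to a timestamp."""
    return moment.strftime("%Y-%m-%d %H:%M:%S")

def time_travel_options(moment=None, version=None):
    """Return the option for exactly one of the two notations."""
    if (moment is None) == (version is None):
        raise ValueError("Provide exactly one of moment or version")
    if moment is not None:
        return {"timestampAsOf": as_timestamp_string(moment)}
    return {"versionAsOf": str(version)}

example = datetime.datetime(2020, 4, 15, 11, 6, 15)
print(time_travel_options(moment=example))  # {'timestampAsOf': '2020-04-15 11:06:15'}
print(time_travel_options(version=1))       # {'versionAsOf': '1'}
```

On a cluster, options like these are typically applied when reading a table, for example `spark.read.format("delta").option("versionAsOf", 1).load(path)`, where `path` is assumed to be the table's storage location.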

In [13]:
import datetime
now = str(datetime.datetime.now())  # A string that can be cast to a Timestamp

streamingQuery = (initialDF                     # Start with our "streaming" DataFrame
  .writeStream                                  # Get the DataStreamWriter
  .trigger(processingTime="3 seconds")          # Configure for a 3-second micro-batch
  .queryName(myStreamName)                      # Specify the query name
  .format("delta")                              # Specify the sink type, a Databricks Delta table
  .option("timestampAsOf", now)                 # Pass the timestamp in a form that can be cast to a Timestamp
  .outputMode("append")                         # Write only new data to the table
  .option("checkpointLocation", checkpointPath) # Specify the location of checkpoint files & write-ahead logs
  .table(powerTable)
)

In [14]:
# Wait until the stream is ready before proceeding
untilStreamIsReady(myStreamName)

## Retention Period and Table Properties

You configure retention periods using `ALTER TABLE` syntax with the following table properties:

* `delta.logRetentionDuration = "interval interval-string"`
  * Configure how far back in time you can go. The default is `interval 30 days`.

* `delta.deletedFileRetentionDuration = "interval interval-string" `
  * Configure how long stale data files are kept around before being deleted with VACUUM. Default is interval 1 week.
  
* `interval-string` is in the form `30 days` or `1 week`

For full access to 30 days of historical data, set `delta.deletedFileRetentionDuration = "interval 30 days"` on your table.

A long retention period keeps more stale data files in storage, which can increase your storage costs significantly.
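
The interaction between the two properties can be sketched with plain date arithmetic. This is a deliberately conservative, simplified model: it assumes a snapshot is only safely reachable while its age is inside both retention windows, and it ignores when `VACUUM` and log cleanup actually run. The function name `is_snapshot_reachable` is hypothetical.

```python
from datetime import datetime, timedelta

def is_snapshot_reachable(snapshot_time, now, log_retention, file_retention):
    """Conservative check: the snapshot must fall inside both retention windows."""
    age = now - snapshot_time
    return age <= min(log_retention, file_retention)

now = datetime(2020, 5, 1)
snapshot = datetime(2020, 4, 15)  # 16 days old

# Default values: 30-day log retention, 1-week deleted-file retention
print(is_snapshot_reachable(snapshot, now, timedelta(days=30), timedelta(weeks=1)))  # False

# Both properties set to 30 days
print(is_snapshot_reachable(snapshot, now, timedelta(days=30), timedelta(days=30)))  # True
```

Under these assumptions, the shorter of the two intervals determines how far back you can reliably travel, which is why the deleted-file retention needs to be raised to match the log retention.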

In [16]:
spark.sql(f"""ALTER TABLE {powerTable} SET TBLPROPERTIES (delta.deletedFileRetentionDuration="interval 10 days") """)
tblPropDF = spark.sql(f"SHOW TBLPROPERTIES {powerTable}")
display(tblPropDF)

key,value
delta.deletedFileRetentionDuration,interval 10 days


Run the following cell multiple times to see the row count change as the stream appends data.

In [18]:
countDF = spark.sql(f"SELECT count(*) FROM {powerTable}")
display(countDF)

count(1)
144


In [19]:
historyDF = spark.sql(f"SELECT timestamp FROM (DESCRIBE HISTORY {powerTable}) ORDER BY timestamp")
display(historyDF)

timestamp
2020-04-15T11:05:55.000+0000
2020-04-15T11:06:15.000+0000
2020-04-15T11:06:36.000+0000
2020-04-15T11:07:03.000+0000
2020-04-15T11:07:34.000+0000
2020-04-15T11:08:13.000+0000
2020-04-15T11:08:58.000+0000
2020-04-15T11:09:52.000+0000


Let's rewind to almost the beginning, when the table had just a handful of rows: say, the 2nd write.

In [21]:
# List timestamps of when table writes occurred
historyDF = spark.sql(f"SELECT timestamp FROM (DESCRIBE HISTORY {powerTable}) ORDER BY timestamp")

# Pick out 2nd write
oldTimestamp = historyDF.take(2)[-1].timestamp

# Re-build the DataFrame as it was in the 2nd write
rewoundDF = spark.sql(f"SELECT * FROM {powerTable} TIMESTAMP AS OF '{oldTimestamp}'")

We had this many (few) rows back then.

In [23]:
rewoundDF.count()
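
A timestamp does not have to match a write exactly. As I understand it, a timestamp query resolves to the latest version committed at or before the requested time; the pure-Python sketch below models that lookup. The `resolve_version` helper and the version numbers are hypothetical, and the timestamps are the first three from the history output above.

```python
from bisect import bisect_right
from datetime import datetime

def resolve_version(history, requested):
    """Return the latest version committed at or before `requested`, or None."""
    ordered = sorted(history, key=lambda row: row[1])
    timestamps = [ts for _, ts in ordered]
    index = bisect_right(timestamps, requested)
    return ordered[index - 1][0] if index > 0 else None

# Hypothetical (version, timestamp) pairs
history = [
    (0, datetime(2020, 4, 15, 11, 5, 55)),
    (1, datetime(2020, 4, 15, 11, 6, 15)),
    (2, datetime(2020, 4, 15, 11, 6, 36)),
]

print(resolve_version(history, datetime(2020, 4, 15, 11, 6, 15)))  # 1 (exact match)
print(resolve_version(history, datetime(2020, 4, 15, 11, 6, 30)))  # 1 (between writes)
print(resolve_version(history, datetime(2020, 4, 15, 11, 0, 0)))   # None (before first write)
```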

## Clean Up

Stop all remaining streams.

In [26]:
stopAllStreams()

In [28]:
%run "./Includes/Classroom-Cleanup"