## Prepare the notebook

In [None]:
# Storage locations
DELTA_LOCATION = '/user/demo/delta-lake/events'
DELTA_LOCATION_PARTITIONED = '/user/demo/delta-lake/partitioned-events'

In [None]:
# Delete storage locations if they already exist.
# This will make the notebook idempotent (always executable)
dbutils.fs.rm(DELTA_LOCATION, True)
dbutils.fs.rm(DELTA_LOCATION_PARTITIONED, True)

Out[3]: False

## 1. Starting point: The data stored in the object store

For this demo, we use a dataset that is already available in the Databricks environment. In a real project, the source could just as well be an object store mounted to the workspace. Each folder contains a different raw dataset, as would be the case in an object store.

In [None]:
dbutils.fs.ls('dbfs:/databricks-datasets/')

Out[4]: [FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/COVID/', name='COVID/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/README.md', name='README.md', size=976),
 FileInfo(path='dbfs:/databricks-datasets/Rdatasets/', name='Rdatasets/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/SPARK_README.md', name='SPARK_README.md', size=3359),
 FileInfo(path='dbfs:/databricks-datasets/adult/', name='adult/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/airlines/', name='airlines/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/amazon/', name='amazon/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/asa/', name='asa/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/atlas_higgs/', name='atlas_higgs/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/bikeSharing/', name='bikeSharing/', size=0),
 FileInfo(path='dbfs:/databricks-datasets/cctvVideos/', name='cctvVideos/', size=0),
 FileInfo

We will be using the `structured-streaming/events` data. It contains information about opening and closing a webpage. The files are in JSON format. The folder contains only these data files; there is no log alongside them.

In [None]:
display(dbutils.fs.ls('dbfs:/databricks-datasets/structured-streaming/events'))

path,name,size
dbfs:/databricks-datasets/structured-streaming/events/file-0.json,file-0.json,72530
dbfs:/databricks-datasets/structured-streaming/events/file-1.json,file-1.json,72961
dbfs:/databricks-datasets/structured-streaming/events/file-10.json,file-10.json,73025
dbfs:/databricks-datasets/structured-streaming/events/file-11.json,file-11.json,72999
dbfs:/databricks-datasets/structured-streaming/events/file-12.json,file-12.json,72987
dbfs:/databricks-datasets/structured-streaming/events/file-13.json,file-13.json,73006
dbfs:/databricks-datasets/structured-streaming/events/file-14.json,file-14.json,73003
dbfs:/databricks-datasets/structured-streaming/events/file-15.json,file-15.json,73007
dbfs:/databricks-datasets/structured-streaming/events/file-16.json,file-16.json,72978
dbfs:/databricks-datasets/structured-streaming/events/file-17.json,file-17.json,73008


## 2. Ingest the data into the Delta Lake

The data sits in the object store. To have it in the Delta Lake, we first need to ingest it.

In [None]:
# Read the JSON files with Spark and record the source file of each row
from pyspark.sql.functions import input_file_name

# The JSON reader infers the schema from the data by default,
# so no CSV-style options (header, inferSchema, sep) are needed.
df = (spark.read
           .json('/databricks-datasets/structured-streaming/events/file-1*.json')
           .withColumn("filename", input_file_name())
     )
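
The wildcard in `file-1*.json` decides which source files end up in the DataFrame. As a rough illustration, Python's `fnmatch` can mimic the pattern on the file names that appear in this notebook's listings. Spark resolves globs through the Hadoop file system, so treat this as an approximation of the matching logic, not the exact implementation.

```python
from fnmatch import fnmatch

# File names taken from the listings shown in this notebook.
names = ["file-0.json", "file-1.json"] + [f"file-{i}.json" for i in range(10, 19)]

pattern = "file-1*.json"
matched = [name for name in names if fnmatch(name, pattern)]

# file-0.json is skipped; file-1.json matches because '*' can match nothing.
print(matched)
```

Because `*` may match an empty string, `file-1.json` is read together with the two-digit `file-1x.json` files.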

In [None]:
df.display()

action,time,filename
Open,1469539208,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539209,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539212,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539212,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539214,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539216,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Close,1469539217,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539217,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Close,1469539219,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Close,1469539222,dbfs:/databricks-datasets/structured-streaming/events/file-10.json


In [None]:
# Check that all the files were really read and that their rows are in the DataFrame
df.groupby('filename').count().sort('filename').display()

filename,count
dbfs:/databricks-datasets/structured-streaming/events/file-1.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-10.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-11.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-12.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-13.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-14.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-15.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-16.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-17.json,2000
dbfs:/databricks-datasets/structured-streaming/events/file-18.json,2000


In [None]:
# The JSON reader inferred the column types from the data ('time' is a long, not a string)
df.printSchema()

root
 |-- action: string (nullable = true)
 |-- time: long (nullable = true)
 |-- filename: string (nullable = false)



In [None]:
print('The dataset has', df.count(), 'rows')
df.display()

The dataset has 22000 rows


action,time,filename
Open,1469539208,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539209,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539212,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539212,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539214,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539216,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Close,1469539217,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Open,1469539217,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Close,1469539219,dbfs:/databricks-datasets/structured-streaming/events/file-10.json
Close,1469539222,dbfs:/databricks-datasets/structured-streaming/events/file-10.json


Spark has now read the data and created an in-memory DataFrame. Since we want this data to become part of the Delta Lake, we need to save it in the Delta format. The location can be anywhere. We will put it in `/user/demo/delta-lake/events`.

In [None]:
# Save it to the Delta format 
df.write.format('delta').save(DELTA_LOCATION)

Let's look at what we have in `DELTA_LOCATION`.

In [None]:
display(dbutils.fs.ls(DELTA_LOCATION))

path,name,size
dbfs:/user/demo/delta-lake/events/_delta_log/,_delta_log/,0
dbfs:/user/demo/delta-lake/events/part-00000-1757a66d-2abd-4cb1-bdac-4e88ad117c53-c000.snappy.parquet,part-00000-1757a66d-2abd-4cb1-bdac-4e88ad117c53-c000.snappy.parquet,20241
dbfs:/user/demo/delta-lake/events/part-00001-d8ed9e6d-ebbc-40db-9665-d957dccc3ca0-c000.snappy.parquet,part-00001-d8ed9e6d-ebbc-40db-9665-d957dccc3ca0-c000.snappy.parquet,20271
dbfs:/user/demo/delta-lake/events/part-00002-e282cde4-9d12-497b-ba8f-2b653a7f02c8-c000.snappy.parquet,part-00002-e282cde4-9d12-497b-ba8f-2b653a7f02c8-c000.snappy.parquet,20219
dbfs:/user/demo/delta-lake/events/part-00003-536379f2-fd6b-4fbc-a3bb-c122b44802f8-c000.snappy.parquet,part-00003-536379f2-fd6b-4fbc-a3bb-c122b44802f8-c000.snappy.parquet,20119
dbfs:/user/demo/delta-lake/events/part-00004-44b2d8f8-c894-4ab5-97d2-8e02c4939071-c000.snappy.parquet,part-00004-44b2d8f8-c894-4ab5-97d2-8e02c4939071-c000.snappy.parquet,20124
dbfs:/user/demo/delta-lake/events/part-00005-038517c3-2626-48ea-8067-8086181091d2-c000.snappy.parquet,part-00005-038517c3-2626-48ea-8067-8086181091d2-c000.snappy.parquet,10813


What you should notice is the following:

1. The data is stored in Parquet format. The total size is much smaller than that of the JSON source files.
2. One additional directory appeared: `_delta_log`. This is the core of the Delta Lake. It stores all the modifications and enables time travel.
3. The number of Parquet files does not correspond to the number of JSON source files. We could partition the data differently.
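
The size claim in point 1 can be checked with plain arithmetic on the two directory listings above. This is only a sketch: the JSON side uses just the `file-1*` sizes that are visible in the truncated listing, so the JSON total is a lower bound.

```python
# Sizes in bytes of the Parquet files listed under DELTA_LOCATION.
parquet_sizes = [20241, 20271, 20219, 20119, 20124, 10813]

# Sizes in bytes of the file-1* JSON files visible in the source listing.
json_sizes = {
    "file-1.json": 72961,
    "file-10.json": 73025,
    "file-11.json": 72999,
    "file-12.json": 72987,
    "file-13.json": 73006,
    "file-14.json": 73003,
    "file-15.json": 73007,
    "file-16.json": 72978,
    "file-17.json": 73008,
}

parquet_total = sum(parquet_sizes)
json_partial_total = sum(json_sizes.values())

print("Parquet total (bytes):", parquet_total)
print("JSON total, visible files only (bytes):", json_partial_total)
print("JSON is at least", round(json_partial_total / parquet_total, 1), "times larger")
```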

In [None]:
# Example of partitioning
DELTA_LOCATION_PARTITIONED = '/user/demo/delta-lake/partitioned-events'
(df.write
 .format('delta')
 .partitionBy('filename')
 .save(DELTA_LOCATION_PARTITIONED)
)

In [None]:
# List of all partitions
display(dbutils.fs.ls(DELTA_LOCATION_PARTITIONED))

# Looking into one partition
display(dbutils.fs.ls(DELTA_LOCATION_PARTITIONED + '/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-1.json/'))

path,name,size
dbfs:/user/demo/delta-lake/partitioned-events/_delta_log/,_delta_log/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-1.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-1.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-10.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-10.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-11.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-11.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-12.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-12.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-13.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-13.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-14.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-14.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-15.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-15.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-16.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-16.json/,0
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-17.json/,filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-17.json/,0


path,name,size
dbfs:/user/demo/delta-lake/partitioned-events/filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-1.json/part-00005-d4de63bf-6bc0-4457-9091-888f05e612aa.c000.snappy.parquet,part-00005-d4de63bf-6bc0-4457-9091-888f05e612aa.c000.snappy.parquet,10095
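
Each partition directory is named `column=value`, with special characters in the value percent-encoded. A minimal sketch of how such a name can be decoded in plain Python follows. The helper `parse_partition_dir` is hypothetical (it is not part of Spark or Delta), and standard percent-decoding is assumed to be sufficient for the names shown here.

```python
from typing import Tuple
from urllib.parse import unquote


def parse_partition_dir(dirname: str) -> Tuple[str, str]:
    """Split a 'column=value/' directory name and percent-decode the value."""
    column, _, encoded_value = dirname.rstrip("/").partition("=")
    return column, unquote(encoded_value)


# Directory name copied from the partition listing above.
dirname = "filename=dbfs%3A%2Fdatabricks-datasets%2Fstructured-streaming%2Fevents%2Ffile-1.json/"
column, value = parse_partition_dir(dirname)
print(column, "->", value)
```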


Display the Delta log and explore its content.

In [None]:
display(dbutils.fs.ls(DELTA_LOCATION + '/_delta_log'))

path,name,size
dbfs:/user/demo/delta-lake/events/_delta_log/.s3-optimization-0,.s3-optimization-0,0
dbfs:/user/demo/delta-lake/events/_delta_log/.s3-optimization-1,.s3-optimization-1,0
dbfs:/user/demo/delta-lake/events/_delta_log/.s3-optimization-2,.s3-optimization-2,0
dbfs:/user/demo/delta-lake/events/_delta_log/00000000000000000000.crc,00000000000000000000.crc,91
dbfs:/user/demo/delta-lake/events/_delta_log/00000000000000000000.json,00000000000000000000.json,4249


Click on the arrow in the individual cells in the table below to expand the content and make it more readable.

In [None]:
# The JSON commit file lists every action of this transaction (add, metaData, protocol, commitInfo)
spark.read.json(DELTA_LOCATION + '/_delta_log/00000000000000000000.json').display()

add,commitInfo,metaData,protocol
,,,"List(1, 2)"
,,"List(1665594184614, List(parquet), 73eb4ea2-aaad-4ed1-b029-aa3cb8fe0ce7, List(), {""type"":""struct"",""fields"":[{""name"":""action"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""time"",""type"":""long"",""nullable"":true,""metadata"":{}},{""name"":""filename"",""type"":""string"",""nullable"":true,""metadata"":{}}]})",
"List(true, 1665594192000, part-00000-1757a66d-2abd-4cb1-bdac-4e88ad117c53-c000.snappy.parquet, 20241, {""numRecords"":4000,""minValues"":{""action"":""Close"",""time"":1469506633,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469542738,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594192000000, 268435456))",,,
"List(true, 1665594192000, part-00001-d8ed9e6d-ebbc-40db-9665-d957dccc3ca0-c000.snappy.parquet, 20271, {""numRecords"":4000,""minValues"":{""action"":""Close"",""time"":1469542744,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469549980,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594192000001, 268435456))",,,
"List(true, 1665594192000, part-00002-e282cde4-9d12-497b-ba8f-2b653a7f02c8-c000.snappy.parquet, 20219, {""numRecords"":4000,""minValues"":{""action"":""Close"",""time"":1469549981,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469557136,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594192000002, 268435456))",,,
"List(true, 1665594192000, part-00003-536379f2-fd6b-4fbc-a3bb-c122b44802f8-c000.snappy.parquet, 20119, {""numRecords"":4000,""minValues"":{""action"":""Close"",""time"":1469557136,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469564356,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594192000003, 268435456))",,,
"List(true, 1665594192000, part-00004-44b2d8f8-c894-4ab5-97d2-8e02c4939071-c000.snappy.parquet, 20124, {""numRecords"":4000,""minValues"":{""action"":""Close"",""time"":1469564356,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469571518,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594192000004, 268435456))",,,
"List(true, 1665594192000, part-00005-038517c3-2626-48ea-8067-8086181091d2-c000.snappy.parquet, 10813, {""numRecords"":2000,""minValues"":{""action"":""Close"",""time"":1469571524,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469575090,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594192000005, 268435456))",,,
,"List(1012-165842-a8mwgm4, true, WriteSerializable, List(1111564559459043), WRITE, List(6, 111787, 22000), List(ErrorIfExists, []), 1665594192745, 6145438733202696, petra@adaltas.com)",,


In [None]:
# The .crc file holds summary statistics for this table version (total size, number of files, ...)
spark.read.text(DELTA_LOCATION + '/_delta_log/00000000000000000000.crc').display()

value
"{""tableSizeBytes"":111787,""numFiles"":6,""numMetadata"":1,""numProtocol"":1,""numTransactions"":0}"


**Remember:** The `_delta_log` materializes the difference between a data lake and a lakehouse.

## 3. Read the Delta table

We can read data stored in the Delta format with a programmatic API (PySpark, Scala) or with SQL. Either way, the result is a DataFrame.

In [None]:
# Read with Python API
delta_table = spark.read.format("delta").load(DELTA_LOCATION)
delta_table.display()

action,time,filename
Close,1469506633,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Close,1469506636,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506642,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506644,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506646,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506647,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506648,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506651,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Close,1469506653,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506653,dbfs:/databricks-datasets/structured-streaming/events/file-1.json


In [None]:
%sql
-- Read with SQL API
SELECT * FROM delta.`/user/demo/delta-lake/events`

action,time,filename
Close,1469506633,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Close,1469506636,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506642,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506644,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506646,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506647,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506648,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506651,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Close,1469506653,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506653,dbfs:/databricks-datasets/structured-streaming/events/file-1.json


In [None]:
display(dbutils.fs.ls('dbfs:/user/hive/warehouse'))

path,name,size
dbfs:/user/hive/warehouse/beans/,beans/,0
dbfs:/user/hive/warehouse/petra_adaltas_com_db.db/,petra_adaltas_com_db.db/,0
dbfs:/user/hive/warehouse/student/,student/,0
dbfs:/user/hive/warehouse/student1/,student1/,0


## 4. Modify the table

To use the Delta table operations (delete, create, merge, vacuum...), we need to load the data as a `DeltaTable` object rather than as a plain DataFrame.

In [None]:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, DELTA_LOCATION)
deltaTable.delete("action = 'Close'")  # predicate as a SQL-formatted string

In [None]:
# Check the Delta log.
# There is a new .json file and a new .crc file.
display(dbutils.fs.ls(DELTA_LOCATION + '/_delta_log'))

path,name,size
dbfs:/user/demo/delta-lake/events/_delta_log/.s3-optimization-0,.s3-optimization-0,0
dbfs:/user/demo/delta-lake/events/_delta_log/.s3-optimization-1,.s3-optimization-1,0
dbfs:/user/demo/delta-lake/events/_delta_log/.s3-optimization-2,.s3-optimization-2,0
dbfs:/user/demo/delta-lake/events/_delta_log/00000000000000000000.crc,00000000000000000000.crc,91
dbfs:/user/demo/delta-lake/events/_delta_log/00000000000000000000.json,00000000000000000000.json,4249
dbfs:/user/demo/delta-lake/events/_delta_log/00000000000000000001.crc,00000000000000000001.crc,90
dbfs:/user/demo/delta-lake/events/_delta_log/00000000000000000001.json,00000000000000000001.json,5569
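
The log file names in this listing are the table version, zero-padded to 20 digits. A small sketch of the mapping in both directions follows; the helper names `log_file_name` and `version_of` are made up for illustration and are not part of the Delta API.

```python
def log_file_name(version: int, suffix: str = "json") -> str:
    """Build the log file name for a table version (20-digit zero padding)."""
    return f"{version:020d}.{suffix}"


def version_of(file_name: str) -> int:
    """Recover the table version from a log file name."""
    return int(file_name.split(".")[0])


print(log_file_name(0))
print(log_file_name(1, "crc"))
print(version_of("00000000000000000001.json"))
```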


In [None]:
spark.read.json(DELTA_LOCATION + '/_delta_log/00000000000000000001.json').display()

add,commitInfo,remove
,,"List(true, 1665594239987, true, part-00001-d8ed9e6d-ebbc-40db-9665-d957dccc3ca0-c000.snappy.parquet, 20271, List(1665594192000001, 268435456))"
,,"List(true, 1665594239987, true, part-00000-1757a66d-2abd-4cb1-bdac-4e88ad117c53-c000.snappy.parquet, 20241, List(1665594192000000, 268435456))"
,,"List(true, 1665594239987, true, part-00002-e282cde4-9d12-497b-ba8f-2b653a7f02c8-c000.snappy.parquet, 20219, List(1665594192000002, 268435456))"
,,"List(true, 1665594239987, true, part-00004-44b2d8f8-c894-4ab5-97d2-8e02c4939071-c000.snappy.parquet, 20124, List(1665594192000004, 268435456))"
,,"List(true, 1665594239987, true, part-00003-536379f2-fd6b-4fbc-a3bb-c122b44802f8-c000.snappy.parquet, 20119, List(1665594192000003, 268435456))"
,,"List(true, 1665594239987, true, part-00005-038517c3-2626-48ea-8067-8086181091d2-c000.snappy.parquet, 10813, List(1665594192000005, 268435456))"
"List(true, 1665594240000, part-00000-50401728-66a0-4c16-bc85-8f818b773f3e-c000.snappy.parquet, 8736, {""numRecords"":2012,""minValues"":{""action"":""Open"",""time"":1469542744,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469549980,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594240000000, 268435456))",,
"List(true, 1665594240000, part-00001-c8a14108-3cc7-4444-b9d0-2428d61e2326-c000.snappy.parquet, 8688, {""numRecords"":2012,""minValues"":{""action"":""Open"",""time"":1469506642,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469542733,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594240000001, 268435456))",,
"List(true, 1665594240000, part-00002-c9c4033a-27a9-485a-81a2-c2ca0f705003-c000.snappy.parquet, 8607, {""numRecords"":1989,""minValues"":{""action"":""Open"",""time"":1469549981,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469557136,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594240000002, 268435456))",,
"List(true, 1665594240000, part-00003-6bef8fb6-7aa6-45d8-9bc8-8f3d3c5dbdf3-c000.snappy.parquet, 8639, {""numRecords"":1988,""minValues"":{""action"":""Open"",""time"":1469564356,""filename"":""dbfs:/databricks-datasets/struct""},""maxValues"":{""action"":""Open"",""time"":1469571511,""filename"":""dbfs:/databricks-datasets/struct�""},""nullCount"":{""action"":0,""time"":0,""filename"":0}}, List(1665594240000003, 268435456))",,


## 5. History and time travel

The transaction log records every modification to the Delta table, which lets us read or restore a previous version (= time travel). Keep in mind that this history is not kept forever: by default, log entries are retained for 30 days, and data files that are no longer part of the current table version are physically deleted once `VACUUM` runs (by default, only files older than 7 days are removed).
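
To make the idea concrete, here is a simplified, pure-Python sketch of how a reader could replay the log to find the data files that belong to a given version. It assumes each commit file holds one JSON action per line and only looks at the `path` of `add` and `remove` actions. The file names are placeholders, and real log entries carry many more fields (statistics, timestamps, `commitInfo`).

```python
import json
from typing import Dict, Set


def files_at_version(commits: Dict[int, str], version: int) -> Set[str]:
    """Replay commits 0..version and return the data files that make up the table."""
    live: Set[str] = set()
    for v in sorted(commits):
        if v > version:
            break
        for line in commits[v].splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Version 0 writes six files; version 1 (the DELETE) removes them and adds six rewritten files.
commits = {
    0: "\n".join(json.dumps({"add": {"path": f"old-{i}.parquet"}}) for i in range(6)),
    1: "\n".join(
        [json.dumps({"remove": {"path": f"old-{i}.parquet"}}) for i in range(6)]
        + [json.dumps({"add": {"path": f"new-{i}.parquet"}}) for i in range(6)]
    ),
}

print(sorted(files_at_version(commits, 0)))
print(sorted(files_at_version(commits, 1)))
```

A `remove` action only drops a file from the table's current view. The file itself stays on storage until it is cleaned up, which is why version 0 remains readable after the delete.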

In [None]:
%sql
DESCRIBE HISTORY delta.`/user/demo/delta-lake/events`

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata
1,2022-10-12T17:04:01.000+0000,6145438733202696,petra@adaltas.com,DELETE,"Map(predicate -> [""(`action` = 'Close')""])",,List(1111564559459043),1012-165842-a8mwgm4,0.0,WriteSerializable,False,"Map(numRemovedFiles -> 6, numCopiedRows -> 10999, numAddedChangeFiles -> 0, executionTimeMs -> 5447, numDeletedRows -> 11001, scanTimeMs -> 2996, numAddedFiles -> 6, rewriteTimeMs -> 2442)",
0,2022-10-12T17:03:14.000+0000,6145438733202696,petra@adaltas.com,WRITE,"Map(mode -> ErrorIfExists, partitionBy -> [])",,List(1111564559459043),1012-165842-a8mwgm4,,WriteSerializable,True,"Map(numFiles -> 6, numOutputRows -> 22000, numOutputBytes -> 111787)",
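
The `operationMetrics` of the two versions can be cross-checked with a little arithmetic. The sketch below copies the values from the history output above. Comparing the result with `numCopiedRows` works here because the delete rewrote all six files, so every surviving row was copied.

```python
# Values copied from the operationMetrics column of DESCRIBE HISTORY.
write_metrics = {"numFiles": 6, "numOutputRows": 22000}
delete_metrics = {"numRemovedFiles": 6, "numDeletedRows": 11001, "numCopiedRows": 10999}

rows_before = write_metrics["numOutputRows"]
rows_after = rows_before - delete_metrics["numDeletedRows"]

print("Rows in version 0:", rows_before)
print("Rows in version 1:", rows_after)
print("Matches numCopiedRows:", rows_after == delete_metrics["numCopiedRows"])
```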


In [None]:
# If you don't want to check out the whole history, you can get only the last n operations
display(deltaTable.history(1))

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata
1,2022-10-12T17:04:01.000+0000,6145438733202696,petra@adaltas.com,DELETE,"Map(predicate -> [""(`action` = 'Close')""])",,List(1111564559459043),1012-165842-a8mwgm4,0,WriteSerializable,False,"Map(numRemovedFiles -> 6, numCopiedRows -> 10999, numAddedChangeFiles -> 0, executionTimeMs -> 5447, numDeletedRows -> 11001, scanTimeMs -> 2996, numAddedFiles -> 6, rewriteTimeMs -> 2442)",


In [None]:
# Reading the table as a DataFrame now returns its current state (only rows with action 'Open')
df_open = (spark
           .read
           .format("delta")
           .load(DELTA_LOCATION)
          )

print('Number of rows with action = Open:', df_open.count())
display(df_open)

Number of rows with action = Open: 10999


action,time,filename
Open,1469542744,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542746,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542746,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542758,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542763,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542763,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542777,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542779,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542781,dbfs:/databricks-datasets/structured-streaming/events/file-11.json
Open,1469542784,dbfs:/databricks-datasets/structured-streaming/events/file-11.json


In [None]:
# The transaction log enables time travel: we can access any previous version of the Delta table.
# Let's read version 0 (where the actions are both 'Open' and 'Close').
# The command is the same as above, with the version specified in addition.
df_initial = (spark
              .read
              .format("delta")
              .option("versionAsOf", 0)
              .load(DELTA_LOCATION)
            )

print('Number of rows in version 0:', df_initial.count())
display(df_initial)

Number of rows in version 0: 22000


action,time,filename
Close,1469506633,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Close,1469506636,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506642,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506644,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506646,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506647,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506648,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506651,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Close,1469506653,dbfs:/databricks-datasets/structured-streaming/events/file-1.json
Open,1469506653,dbfs:/databricks-datasets/structured-streaming/events/file-1.json


When you finish, familiarize yourself with:
- [VACUUM](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-vacuum.html)
- [OPTIMIZE and Z-ORDER](https://docs.databricks.com/delta/file-mgmt.html)

Then work through the following tasks:
- Explain what each command does.
- Write at least one functional example for each.
- Think about how you would test or illustrate what happened (the state before you ran the command and the state after it ran).