# Delta Lake

##### Objectives
1. Create a Delta Table
1. Understand the transaction log
1. Read data from your Delta Table
1. Update data in your Delta Table
1. Access previous versions of the table using time travel
1. Vacuum

##### Documentation
- <a href="https://docs.delta.io/latest/quick-start.html#create-a-table" target="_blank">Delta Table</a> 
- <a href="https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html" target="_blank">Transaction Log</a> 
- <a href="https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html" target="_blank">Time Travel</a>

In [0]:
%run "./Includes/Classroom-Setup"

### Create a Delta Table
Let's first read the Parquet-format BedBricks events dataset.

In [0]:
eventsDF = spark.read.parquet(eventsPath)
display(eventsDF)

Write the data in Delta format to the directory given by `deltaPath`.

In [0]:
deltaPath = workingDir + "/delta-events"
eventsDF.write.format("delta").mode("overwrite").save(deltaPath)

Write the data in Delta format as a managed table in the metastore.

In [0]:
eventsDF.write.format("delta").mode("overwrite").saveAsTable("delta_events")

As with other file formats, Delta supports partitioning your data in storage using the unique values in a specified column (often referred to as "Hive partitioning").

Let's **overwrite** the Delta dataset in the `deltaPath` directory to partition by state. This can accelerate queries that filter by state.

In [0]:
from pyspark.sql.functions import col

stateEventsDF = eventsDF.withColumn("state", col("geo.state"))

stateEventsDF.write.format("delta").mode("overwrite").partitionBy("state").option("overwriteSchema", "true").save(deltaPath)
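
To make the resulting directory layout concrete, here is a minimal pure-Python sketch of how Hive-style partitioning maps rows to directories. The sample rows and the `partition_dir` and `group_by_partition` helpers are hypothetical and for illustration only; Spark and Delta handle this internally, and the sketch ignores details such as escaping of special characters in values.

```python
from collections import defaultdict

def partition_dir(column, value):
    """Return the Hive-style directory name for one partition value."""
    return f"{column}={value}"

def group_by_partition(rows, column):
    """Group row dicts by the directory their partition value maps to."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_dir(column, row[column])].append(row)
    return dict(groups)

# Hypothetical sample rows, not taken from the events dataset.
sample_rows = [
    {"device": "iOS", "state": "CA"},
    {"device": "Android", "state": "NY"},
    {"device": "macOS", "state": "CA"},
]

layout = group_by_partition(sample_rows, "state")
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

A query that filters on `state` only has to read the matching directory, which is why partitioning can accelerate such queries.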

### Understand the Transaction Log
We can see how Delta stores the different state partitions in separate directories.

We can also see a directory called `_delta_log`, which holds the transaction log.

When a Delta Lake dataset is created, its transaction log is automatically created in the `_delta_log` subdirectory.

In [0]:
display(dbutils.fs.ls(deltaPath))

When changes are made to that table, these changes are recorded as ordered, atomic commits in the transaction log.

Each commit is written out as a JSON file, starting with 00000000000000000000.json.

Additional changes to the table generate subsequent JSON files in ascending numerical order.

<div style="img align: center; line-height: 0; padding-top: 9px;">
  <img src="https://user-images.githubusercontent.com/20408077/87174138-609fe600-c29c-11ea-90cc-84df0c1357f1.png" width="500"/>
</div>
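
The naming scheme described above can be sketched in a few lines of plain Python. The helper names `commit_file_name` and `version_from_file_name` are hypothetical; the sketch only assumes that the version number is zero-padded to 20 digits, as in the file name shown above.

```python
def commit_file_name(version):
    """Return the log file name for a commit version, zero-padded to 20 digits."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"

def version_from_file_name(name):
    """Recover the commit version from a commit file name."""
    return int(name.split(".")[0])

names = [commit_file_name(v) for v in range(3)]
print(names)
```

Because of the zero padding, sorting the file names alphabetically also sorts the commits in the order they were made.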

In [0]:
display(dbutils.fs.ls(deltaPath + "/_delta_log/"))

Next, let's take a look at a transaction log file.


The <a href="https://docs.databricks.com/delta/delta-utility.html" target="_blank">four columns</a> each represent a different part of the very first commit to the Delta Table, which created the table.
- The `add` column describes each data file added by the commit, including statistics about the data as a whole and about individual columns.
- The `commitInfo` column has useful information about what the operation was (for example, WRITE) and who executed it. Reads are not recorded as commits.
- The `metaData` column contains information about the column schema.
- The `protocol` column contains the minimum Delta protocol versions necessary to read from or write to this Delta Table.

In [0]:
display(spark.read.json(deltaPath + "/_delta_log/00000000000000000000.json"))
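
Each line of a commit file is a separate JSON object whose top-level key names the action, which is why Spark shows one column per action type. The following sketch parses such a file with the standard library only. The sample lines are hypothetical, heavily abbreviated stand-ins for a real commit file, and `parse_actions` is an illustrative helper, not part of any Delta API.

```python
import json
from collections import Counter

# Hypothetical, abbreviated commit file contents (one JSON action per line).
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"partitionColumns": []}}),
    json.dumps({"add": {"path": "part-00000.snappy.parquet", "size": 1024}}),
    json.dumps({"add": {"path": "part-00001.snappy.parquet", "size": 2048}}),
])

def parse_actions(text):
    """Return a list of (action_type, payload) pairs from newline-delimited JSON."""
    actions = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for action_type, payload in record.items():
            actions.append((action_type, payload))
    return actions

actions = parse_actions(sample_commit)
action_counts = Counter(action_type for action_type, _ in actions)
print(dict(action_counts))
```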

One key difference between the first commit file and the second one, shown in the next cell, is the size of the JSON file: the second has 206 rows compared to the first file's 7.

To understand why, let's take a look at the `commitInfo` column. In the `operationParameters` section, `partitionBy` is now set to the `state` column. Furthermore, if we look at the `add` section on row 3, we can see that a new section called `partitionValues` has appeared. Because each data file gets its own `add` entry, writing separate files for every state yields many more rows. As we saw above, Delta stores partitions in separate directories in storage; however, it records information about all of these partitions in the same transaction log file.

In [0]:
display(spark.read.json(deltaPath + "/_delta_log/00000000000000000001.json"))

Finally, let's take a look at the files inside one of the state partitions. The files inside correspond to the partitioned commit (file 01) in the `_delta_log` directory.

In [0]:
display(dbutils.fs.ls(deltaPath + "/state=CA/"))

### Read from your Delta table

In [0]:
df = spark.read.format("delta").load(deltaPath)
display(df)

### Update your Delta Table

Let's filter for rows where the event takes place on a mobile device.

In [0]:
df_update = stateEventsDF.filter(col("device").isin(["Android", "iOS"]))
display(df_update)

In [0]:
df_update.write.format("delta").mode("overwrite").save(deltaPath)

In [0]:
df = spark.read.format("delta").load(deltaPath)
display(df)

Let's look at the files in the California partition post-update. Remember, the different files in this directory are snapshots of your DataFrame corresponding to different commits.

In [0]:
display(dbutils.fs.ls(deltaPath + "/state=CA/"))

### Access previous versions of the table using Time Travel

Oops, it turns out we actually need the entire dataset! You can access a previous version of your Delta Table using Time Travel. Use the following two cells to access your version history. Delta Lake keeps a 30-day version history by default, but it can be configured to keep the history for longer if necessary.

In [0]:
spark.sql("DROP TABLE IF EXISTS train_delta")
spark.sql(f"CREATE TABLE train_delta USING DELTA LOCATION '{deltaPath}'")

In [0]:
%sql
DESCRIBE HISTORY train_delta

The `versionAsOf` option lets you easily access previous versions of your Delta Table.

In [0]:
df = spark.read.format("delta").option("versionAsOf", 0).load(deltaPath)
display(df)
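
Conceptually, reading version N means replaying commits 0 through N and keeping the data files that were added and not later removed. The sketch below is a simplified model of that idea, not Delta Lake's implementation: `files_at_version` and the commit contents are hypothetical, and real Delta tables also rely on checkpoints and richer metadata.

```python
def files_at_version(commits, version):
    """Replay commits 0..version and return the set of live data file paths."""
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} does not exist")
    live = set()
    for commit in commits[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            elif action == "remove":
                live.discard(path)
    return live

# Hypothetical log: version 0 writes two files, version 1 overwrites them with one.
commits = [
    [("add", "a.parquet"), ("add", "b.parquet")],
    [("remove", "a.parquet"), ("remove", "b.parquet"), ("add", "c.parquet")],
]

print(sorted(files_at_version(commits, 0)))
print(sorted(files_at_version(commits, 1)))
```

In this model, time travel to version 0 works only as long as `a.parquet` and `b.parquet` still exist in storage.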

You can also access older versions using a timestamp.

Replace the timestamp string with the information from your version history. Note that you can use a date without the time information if necessary.

In [0]:
# TODO
timeStampString = <FILL_IN>
df = spark.read.format("delta").option("timestampAsOf", timeStampString).load(deltaPath)
display(df)
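
A timestamp is resolved to a version: roughly, the latest commit made at or before the given time. The following sketch models that lookup in plain Python. The commit times are hypothetical, `version_as_of` is an illustrative helper rather than a Delta API, and raising an error for timestamps before the first commit is a modeling choice made here.

```python
from bisect import bisect_right
from datetime import datetime

def version_as_of(commit_times, timestamp):
    """Return the latest version committed at or before `timestamp`.

    `commit_times` is a list of commit datetimes ordered by version.
    """
    index = bisect_right(commit_times, timestamp)
    if index == 0:
        raise ValueError("timestamp is earlier than the first commit")
    return index - 1

# Hypothetical commit times for versions 0, 1, and 2.
history = [
    datetime(2020, 7, 1, 9, 0),
    datetime(2020, 7, 1, 9, 5),
    datetime(2020, 7, 2, 14, 30),
]

print(version_as_of(history, datetime(2020, 7, 1, 9, 3)))
print(version_as_of(history, datetime.fromisoformat("2020-07-02")))
```

Note that a date without a time is interpreted here as midnight at the start of that day, so it resolves to the last commit made before the day began.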

### Vacuum 

Now that we're happy with our Delta Table, we can clean up our directory using `VACUUM`. Vacuum accepts a retention period in hours as an input.

The code below fails if you uncomment and run it! By default, to prevent accidentally vacuuming recent commits, Delta Lake will not let users vacuum with a retention period under 7 days (168 hours). Once a table is vacuumed, you cannot return to the affected prior commits through time travel; with a retention period of 0, only the most recent version of your Delta Table is kept.

In [0]:
# from delta.tables import *

# deltaTable = DeltaTable.forPath(spark, deltaPath)
# deltaTable.vacuum(0)

We can work around this by setting a Spark configuration that bypasses the default retention period check.

In [0]:
from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
deltaTable = DeltaTable.forPath(spark, deltaPath)
deltaTable.vacuum(0)
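
The retention check and its effect can be modeled in plain Python. This is a simplified sketch under stated assumptions, not Delta Lake's implementation: `files_to_vacuum`, `file_ages` (hours since each file was written), and the sample values are all hypothetical.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days, the default threshold described above

def files_to_vacuum(file_ages, live_files, retention_hours, check_enabled=True):
    """Return the files a vacuum would delete in this simplified model.

    file_ages: maps each file path to the hours since it was written.
    live_files: files referenced by the latest table version; never deleted.
    """
    if retention_hours < 0:
        raise ValueError("retention_hours must be non-negative")
    if check_enabled and retention_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError("retention below 168 hours requires disabling the safety check")
    return {
        path
        for path, age in file_ages.items()
        if path not in live_files and age >= retention_hours
    }

# Hypothetical table: only c.parquet belongs to the latest version.
file_ages = {"a.parquet": 2.0, "b.parquet": 2.0, "c.parquet": 0.5}
live_files = {"c.parquet"}

deleted = files_to_vacuum(file_ages, live_files, 0, check_enabled=False)
print(sorted(deleted))
```

With the check enabled and the default threshold, nothing in this example is old enough to delete, so earlier versions stay readable.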

Let's take a look at our Delta Table files now. After vacuuming, the directory holds only the files that belong to our most recent Delta Table commit.

In [0]:
display(dbutils.fs.ls(deltaPath + "/state=CA/"))

Since vacuuming deleted the data files that only older versions of the Delta Table referenced, we can no longer access past versions. The code below should throw an error when uncommented.

In [0]:
# df = spark.read.format("delta").option("versionAsOf", 0).load(deltaPath)
# display(df)

# Delta Lab
##### Tasks
1. Write sales data to Delta
1. Modify sales data to show item count instead of item array
1. Rewrite sales data to same Delta path
1. Create table and view version history
1. Time travel to read previous version

In [0]:
salesDF = spark.read.parquet(salesPath)
deltaSalesPath = workingDir + "/delta-sales"

### 1. Write sales data to Delta
Write **`salesDF`** to **`deltaSalesPath`**

In [0]:
# TODO
salesDF.FILL_IN

**CHECK YOUR WORK**

In [0]:
assert len(dbutils.fs.ls(deltaSalesPath)) > 0

### 2. Modify sales data to show item count instead of item array
Replace values in the **`items`** column with an integer value of the items array size.  
Assign the resulting DataFrame to **`updatedSalesDF`**.

In [0]:
# TODO
updatedSalesDF = FILL_IN
display(updatedSalesDF)

**CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import IntegerType

assert updatedSalesDF.schema[6].dataType == IntegerType()

### 3. Rewrite sales data to same Delta path
Write **`updatedSalesDF`** to the same Delta location **`deltaSalesPath`**.

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> This will fail without an option to overwrite the schema.

In [0]:
# TODO
updatedSalesDF.FILL_IN

**CHECK YOUR WORK**

In [0]:
assert spark.read.format("delta").load(deltaSalesPath).schema[6].dataType == IntegerType()

### 4. Create table and view version history
Run SQL queries to perform the following steps.
- Drop table **`sales_delta`** if it exists
- Create **`sales_delta`** table using the **`deltaSalesPath`** location
- List version history for the **`sales_delta`** table

In [0]:
# TODO

In [0]:
# TODO

**CHECK YOUR WORK**

In [0]:
salesDeltaDF = spark.sql("SELECT * FROM sales_delta")
assert salesDeltaDF.count() == 210370
assert salesDeltaDF.schema[6].dataType == IntegerType()

### 5. Time travel to read previous version
Read the Delta table at **`deltaSalesPath`** at version 0.  
Assign the resulting DataFrame to **`oldSalesDF`**.

In [0]:
# TODO
oldSalesDF = FILL_IN
display(oldSalesDF)

**CHECK YOUR WORK**

In [0]:
from pyspark.sql.functions import col, size
assert oldSalesDF.select(size(col("items"))).first()[0] == 1

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup