# Delta Tables

* In this notebook you will learn the basic functionality of Delta Lake

## Useful links:
* [User Guide](https://docs.delta.io/latest/delta-batch.html)
* [Python API](https://docs.delta.io/latest/api/python/spark/index.html)
* [Delta Protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-transaction-log-protocol)

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta.tables import DeltaTable
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('delta-conf')
    .config('spark.jars.packages', 'io.delta:delta-spark_2.12:3.2.1')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

In [None]:
print(spark.version)

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

users_base_path = os.path.join(project_path, 'data/users_base')
users_increment_path = os.path.join(project_path, 'data/users_increment')
accounts_output_path = os.path.join(project_path, 'output/accounts')

### Create a Delta table

* First, drop the table `accounts` if it already exists
  * use a SQL command to drop the table
  * see the [docs](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-drop-table.html#drop-table) for the drop command
  * be aware that this doesn't remove the actual files; it only removes the table's entry from the metastore (the table is created with an explicit location, so Spark treats it as external). If the files exist, you need to delete them manually
* Take the data from `users_base_path` and save it as a Delta table named `accounts`
* Use `accounts_output_path` as the location of the table
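
Since dropping the table leaves the data files in place, the leftover directory can be cleared with the standard library. This is a minimal sketch, assuming the table lives on the local filesystem; `remove_table_dir` is a hypothetical helper, not part of Spark or Delta:

```python
import os
import shutil


def remove_table_dir(table_path):
    """Delete the table directory if it exists; return True if something was removed."""
    if os.path.isdir(table_path):
        # removes the data files and the _delta_log directory in one go
        shutil.rmtree(table_path)
        return True
    return False
```

After the drop, `remove_table_dir(accounts_output_path)` would leave a clean location for the new table.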

In [None]:
# drop the table if it exists:

spark.sql('drop table if exists accounts')

In [None]:
# Now create a new table accounts from the data
# use the format delta when saving

(
    spark.read.parquet(users_base_path)
    .write
    .mode('overwrite')
    .format('delta')
    .option('path', accounts_output_path)
    .saveAsTable('accounts')
)

### Verify that the table is created

You can use the following SQL commands:
* [show tables](https://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-tables.html)
* [describe extended](https://spark.apache.org/docs/latest/sql-ref-syntax-aux-describe-table.html) table_name
* [describe detail](https://docs.delta.io/latest/delta-utility.html#retrieve-delta-table-details) table_name

You can also use the Delta table API:
* create the Delta table object using [DeltaTable.forName](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaTable.forName)

In [None]:
# check if the table was successfully created:

spark.sql('show tables').show()

In [None]:
spark.sql('desc extended accounts').show(truncate=False, n=50)

In [None]:
spark.sql('describe detail accounts').printSchema()

In [None]:
spark.sql('describe detail accounts').select('location', 'numFiles', 'sizeInBytes', 'properties').show(truncate=30)

In [None]:
# detail() function called on the Delta table:

DeltaTable.forName(spark, 'accounts').detail().printSchema()

### Version history of the delta table

Inspect the history of the Delta table. You can use:
* SQL command: [describe history](https://docs.delta.io/latest/delta-utility.html#retrieve-delta-table-history) table_name
* [history](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaTable.history) function on the Delta table object
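
The history shown by these commands comes from the transaction log: a `_delta_log` directory under the table location that holds one zero-padded JSON commit file per version. The sketch below reads the operation name of each commit directly from those files. It assumes a table on the local filesystem whose JSON commits have not been cleaned up yet; `read_commit_operations` is a hypothetical helper name:

```python
import json
import os


def read_commit_operations(table_path):
    """Return a list of (version, operation) read from the JSON commit files."""
    log_dir = os.path.join(table_path, '_delta_log')
    result = []
    for name in sorted(os.listdir(log_dir)):
        stem, ext = os.path.splitext(name)
        # commit files look like 00000000000000000001.json;
        # skip checkpoints, .crc files and anything else
        if ext != '.json' or not stem.isdigit():
            continue
        operation = None
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                if not line.strip():
                    continue
                # every line of a commit file is one JSON action
                action = json.loads(line)
                if 'commitInfo' in action:
                    operation = action['commitInfo'].get('operation')
        result.append((int(stem), operation))
    return result
```

For the table in this notebook the call would be `read_commit_operations(accounts_output_path)`.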

In [None]:
spark.sql('describe history accounts').printSchema()

In [None]:
DeltaTable.forName(spark, 'accounts').history().select('version', 'timestamp', 'operation').show(truncate=50)

### Upsert / Merge

* Load the increment into a Spark DataFrame
  * use the path `users_increment_path`
* Upsert the increment into the `accounts` table (use merge)
* Useful links for merge:
  * docs for [merge](https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge)
  * [merge](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaMergeBuilder) api
  * delta blog for [merge](https://delta.io/blog/2023-02-14-delta-lake-merge/)
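
To make the upsert semantics concrete, here is a plain-Python illustration of what "update all when matched, insert all when not matched" means for rows keyed by `user_id`. It only mimics the logic on lists of dicts and is not how Delta implements merge. The helper `upsert`, the sample rows, and the `name` column are hypothetical, and the sketch assumes each key appears at most once in the increment:

```python
def upsert(target_rows, increment_rows, key='user_id'):
    """Return (merged_rows, inserted_count, updated_count) for lists of dicts."""
    merged = {row[key]: dict(row) for row in target_rows}
    inserted = updated = 0
    for row in increment_rows:
        if row[key] in merged:
            updated += 1   # matched: replace every column of the target row
        else:
            inserted += 1  # not matched: add the row as new
        merged[row[key]] = dict(row)
    return list(merged.values()), inserted, updated


base_rows = [{'user_id': 1, 'name': 'a'}, {'user_id': 2, 'name': 'b'}]
new_rows = [{'user_id': 2, 'name': 'B'}, {'user_id': 3, 'name': 'c'}]
rows, inserted, updated = upsert(base_rows, new_rows)
print(inserted, updated)  # 1 1
```

The two counters play the same role as `numTargetRowsInserted` and `numTargetRowsUpdated` in the `operationMetrics` of the table history.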

In [None]:
# read the increment:

increment = spark.read.parquet(users_increment_path)

In [None]:
# do the merge:

(
    DeltaTable.forName(spark, 'accounts')
    .alias('accounts')
    .merge(
        increment.alias('increment'),
        'accounts.user_id == increment.user_id'
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

In [None]:
# check the history again to see that we have created a new version

spark.sql('describe history accounts').select('version', 'timestamp', 'operation', 'operationMetrics').show(truncate=50)

In [None]:
# from the field operationMetrics see the keys: numTargetRowsInserted, numTargetRowsUpdated
(
    spark.sql('describe history accounts')
    .select(
        'version',
        col('operationMetrics')['numTargetRowsInserted'],
        col('operationMetrics')['numTargetRowsUpdated']
    )
).show(truncate=130)

### Optimize

Once you are done writing to the Delta table, it can be useful to call optimize on it, which compacts many small data files into fewer, larger ones.

* call [optimize](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaOptimizeBuilder)
* use [z-order](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaOptimizeBuilder.executeZOrderBy) by the column `user_id`
* other useful links:
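  * https://docs.delta.io/latest/optimizations-oss.html#optimize-performance-with-file-management
* manually check the files in the table directory to confirm that they were compacted: a new, larger file appears, while the old small files stay on disk until they are vacuumed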
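
The manual check of the table directory can also be scripted. This is a minimal sketch, assuming the table is stored on the local filesystem; `list_parquet_files` is a hypothetical helper that lists the Parquet data files with their sizes so you can compare the directory before and after optimize:

```python
import os


def list_parquet_files(table_path):
    """Return sorted (relative_path, size_in_bytes) for the Parquet data files of the table."""
    files = []
    for root, dirs, names in os.walk(table_path):
        # skip the transaction log; it holds commits and checkpoints, not table data
        dirs[:] = [d for d in dirs if d != '_delta_log']
        for name in names:
            if name.endswith('.parquet'):
                full = os.path.join(root, name)
                files.append((os.path.relpath(full, table_path), os.path.getsize(full)))
    return sorted(files)
```

For this notebook the call would be `list_parquet_files(accounts_output_path)`.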


In [None]:
DeltaTable.forName(spark, 'accounts').optimize().executeZOrderBy('user_id')

### Vacuum

* remove old data files that are no longer referenced by the table
* run [vacuum](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaTable.vacuum) on the accounts delta table
  * you can also use [SQL](https://docs.delta.io/latest/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-table)
* check that the files from the directory were removed

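Note:
* by default, vacuum refuses a retention period shorter than 7 days (168 hours). To remove more recent files, for example with `vacuum(0)` (the argument is the retention in hours), you first have to set `spark.databricks.delta.retentionDurationCheck.enabled` to `false`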
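
To see which files vacuum could remove, you can replay the `add` and `remove` actions of the transaction log and compare the result with the directory content. The sketch below is a rough illustration under several assumptions: the table is unpartitioned and on the local filesystem, all JSON commits are still present, and the retention period is ignored. The helpers `active_files` and `unreferenced_files` are hypothetical names:

```python
import json
import os
from urllib.parse import unquote


def active_files(table_path):
    """Replay add/remove actions from the JSON commits and return the live file paths."""
    log_dir = os.path.join(table_path, '_delta_log')
    commits = sorted(
        n for n in os.listdir(log_dir)
        if n.endswith('.json') and n[:-len('.json')].isdigit()
    )
    live = set()
    for name in commits:
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                if not line.strip():
                    continue
                action = json.loads(line)
                # paths in the log are URL-encoded and relative to the table root
                if 'add' in action:
                    live.add(unquote(action['add']['path']))
                elif 'remove' in action:
                    live.discard(unquote(action['remove']['path']))
    return live


def unreferenced_files(table_path):
    """Return Parquet files on disk that the latest table version no longer references."""
    on_disk = {n for n in os.listdir(table_path) if n.endswith('.parquet')}
    return sorted(on_disk - active_files(table_path))
```

Right after optimize and before vacuum, `unreferenced_files(accounts_output_path)` should list the small files that were compacted away.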

In [None]:
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', False)

DeltaTable.forName(spark, 'accounts').vacuum(0)

### Delete from delta

* [delete](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaTable.delete) the row where `user_id = 79` from the accounts table
* check that the row was really deleted

In [None]:
# delete:

DeltaTable.forName(spark, 'accounts').delete(col('user_id') == 79)

In [None]:
# check that it was deleted:

spark.table('accounts').filter(col('user_id') == 79).count()

### Time travel

* Now imagine that you made a mistake and you actually don't want to remove the user. Delta allows you to revert this operation.

1) First, check the table history to see in which version the row was deleted
2) Then take a look at a particular snapshot using the option `versionAsOf` on the DataFrameReader
3) If you decide that you really want to revert the operation, use `restoreToVersion` on the Delta table

useful links:
* docs for [time travel](https://docs.delta.io/latest/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel) feature
* docs for [restoreToVersion](https://docs.delta.io/latest/api/python/spark/index.html#delta.tables.DeltaTable.restoreToVersion)
* delta blog for [time travel](https://delta.io/blog/2023-02-01-delta-lake-time-travel/)
* delta blog for [rollback](https://delta.io/blog/2022-10-03-rollback-delta-lake-restore/)
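
Instead of reading the version number off the history output by eye, the version to restore can be derived from the history rows. This is a minimal sketch with a hypothetical helper, `version_before_last`, working on any sequence of rows that support lookup by column name. The sample rows are made up for illustration and do not claim to match the history of your table:

```python
def version_before_last(history_rows, operation='DELETE'):
    """Return the version just before the most recent commit with the given operation."""
    matching = [row['version'] for row in history_rows if row['operation'] == operation]
    if not matching:
        raise ValueError(f'no {operation} commit found in the history')
    target = max(matching) - 1
    if target < 0:
        raise ValueError(f'{operation} is the first commit, nothing to go back to')
    return target


# made-up history rows, newest first, as describe history would order them
sample_history = [
    {'version': 2, 'operation': 'DELETE'},
    {'version': 1, 'operation': 'MERGE'},
    {'version': 0, 'operation': 'CREATE OR REPLACE TABLE AS SELECT'},
]
print(version_before_last(sample_history))  # 1
```

The rows returned by `DeltaTable.forName(spark, 'accounts').history().collect()` can be passed in directly, since a Spark `Row` supports `row['version']` style access.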

In [None]:
DeltaTable.forName(spark, 'accounts').history().select('version', 'timestamp', 'operation').show(truncate=50)

In [None]:
# take a look at the version of the table where the user wasn't deleted yet
# (version 4 here; use the version just before the DELETE in the history above):

spark.read.option('versionAsOf', 4).table('accounts').filter(col('user_id') == 79).count()

In [None]:
# roll back to that version

DeltaTable.forName(spark, 'accounts').restoreToVersion(4)

In [None]:
# check the table to see that the row is present there

spark.table('accounts').filter(col('user_id') == 79).count()

In [None]:
# see how it was written to the history

spark.sql('describe history accounts').select('version', 'timestamp', 'operation', 'operationMetrics').show(truncate=50)

In [None]:
spark.stop()