# Delta Tables

* In this notebook you will learn the basic functionality of Delta tables

In [None]:
import os

from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

In [None]:
spark = configure_spark_with_delta_pip(
    SparkSession.builder.appName('delta_tables_test')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .enableHiveSupport()
).getOrCreate()

In [None]:
base_path = os.getcwd()

# the project root is three directories above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

users_base_path = os.path.join(project_path, 'data/users_base')
users_increment_path = os.path.join(project_path, 'data/users_increment')
accounts_output_path = os.path.join(project_path, 'data/delta/accounts')
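
The same path arithmetic can be sketched with `pathlib`, which avoids splitting on `/` by hand. This is only an illustration: the helper `project_root` and the example directory are hypothetical, and the real project layout is not known here.

```python
from pathlib import Path


def project_root(start: Path, levels_up: int = 3) -> Path:
    """Return the directory `levels_up` levels above `start`."""
    # parents[0] is the direct parent, so index levels_up - 1 is
    # `levels_up` directories above `start`.
    return start.parents[levels_up - 1]


# Hypothetical stand-in for os.getcwd(); replace it with Path.cwd() in the notebook.
example_cwd = Path('/home/user/project/notebooks/delta/solutions')
root = project_root(example_cwd)

users_base = root / 'data' / 'users_base'
print(users_base)
```

In the notebook, `str(root / 'data' / 'users_base')` could be passed to Spark wherever `users_base_path` is used.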

### Create a Delta table

* First drop the table `accounts` if it already exists
  * use a SQL command to drop the table
  * see the [docs](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-drop-table.html#drop-table) for the drop command
  * be aware that the table is created with an explicit path (an external table), so dropping it only removes the entry from the metastore. The data files stay in place and you need to delete them manually
* Take the data from `users_base_path` and save it as a Delta table named `accounts`
* Use `accounts_output_path` as the location of the table
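
Deleting the leftover files by hand can be sketched as follows. This is a minimal sketch that assumes the table lives on the local filesystem (it would not work for HDFS or an object store); the helper `remove_table_files` is hypothetical and the demo uses a throwaway directory instead of the real table location.

```python
import shutil
import tempfile
from pathlib import Path


def remove_table_files(table_path: str) -> bool:
    """Delete the directory of a dropped external table.

    Returns True if a directory was removed, False if nothing was there.
    """
    path = Path(table_path)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Demo on a throwaway directory standing in for accounts_output_path.
demo_dir = Path(tempfile.mkdtemp()) / 'accounts'
(demo_dir / '_delta_log').mkdir(parents=True)

first_call = remove_table_files(str(demo_dir))   # the directory existed
second_call = remove_table_files(str(demo_dir))  # it is already gone
print(first_call, second_call)
```

In the notebook you would call `remove_table_files(accounts_output_path)` right after the drop command.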

In [None]:
# drop the table if it exists:

spark.sql("drop table if exists accounts")

In [None]:
# Now create a new table accounts from the data
(
    spark.read.parquet(users_base_path)
    .write
    .format('delta')
    .option('path', accounts_output_path)
    .saveAsTable('accounts')
)

### Verify that the table is created

You can use the following SQL commands:
* `show tables`
* `describe formatted table_name`
* `describe extended table_name`
* `describe detail table_name`

You can also use the API of the Delta table:
* create the Delta table object using [DeltaTable.forName](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.forName)

In [None]:
# check if the table was successfully created:

spark.sql('show tables').show()

In [None]:
# truncate=False keeps long values such as the table location readable
spark.sql('desc formatted accounts').show(truncate=False)

In [None]:
spark.sql('describe detail accounts').printSchema()

In [None]:
DeltaTable.forName(spark, 'accounts').detail().printSchema()

### Version history of the delta table

See the history of the Delta table. You can use:
* the SQL command `describe history table_name`
* the [history](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.history) function on the Delta table object
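
The history is backed by the transaction log that Delta keeps next to the data files. As a rough sketch, assuming the usual layout of a `_delta_log` directory with one zero-padded JSON file per commit, the versions can also be listed without Spark. The helper `list_commit_versions` is hypothetical, and older commit files may already have been cleaned up on a long-lived table, so the list is not guaranteed to be the full history.

```python
import tempfile
from pathlib import Path


def list_commit_versions(table_path: str) -> list[int]:
    """Return the versions that have a JSON commit file in _delta_log."""
    log_dir = Path(table_path) / '_delta_log'
    versions = []
    for commit in log_dir.glob('*.json'):
        # Commit files are named after the zero-padded version number;
        # skip anything else that happens to end in .json.
        if commit.stem.isdigit():
            versions.append(int(commit.stem))
    return sorted(versions)


# Demo on a fake table directory with two commits and a checkpoint pointer.
demo_table = Path(tempfile.mkdtemp())
log_dir = demo_table / '_delta_log'
log_dir.mkdir()
for name in ('00000000000000000000.json',
             '00000000000000000001.json',
             '_last_checkpoint'):
    (log_dir / name).touch()

versions = list_commit_versions(str(demo_table))
print(versions)
```

In the notebook, `list_commit_versions(accounts_output_path)` should line up with the `version` column of `describe history accounts`.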

In [None]:
spark.sql('describe history accounts').printSchema()

In [None]:
DeltaTable.forName(spark, 'accounts').history().select('version', 'timestamp').show()

### Upsert / Merge

* Load the increment into a Spark DataFrame
  * use the path `users_increment_path`
* Upsert the increment into the `accounts` table (use merge)
* Useful links for merge:
  * docs for [merge](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.merge)
  * delta blog for [merge](https://delta.io/blog/2023-02-14-delta-lake-merge/)
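
To make the upsert semantics concrete, here is a plain-Python model of a merge that updates all columns on a key match and inserts the row otherwise. It only illustrates the logic and is not how Delta implements it; the function `upsert` and the `name` column are hypothetical. A real merge fails when several source rows match the same target row, whereas this model simply lets the last one win.

```python
def upsert(target: list[dict], source: list[dict], key: str) -> list[dict]:
    """Plain-Python model of 'update all when matched, insert all otherwise'."""
    merged = {row[key]: row for row in target}
    for row in source:
        # A matching key replaces the whole target row (update all);
        # a new key adds the row (insert all).
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical rows; only the user_id column comes from the notebook.
account_rows = [{'user_id': 1, 'name': 'Ann'}, {'user_id': 2, 'name': 'Bob'}]
increment_rows = [{'user_id': 2, 'name': 'Robert'}, {'user_id': 3, 'name': 'Cy'}]

result = upsert(account_rows, increment_rows, key='user_id')
print(result)
```

User 1 is untouched, user 2 is updated and user 3 is inserted, which mirrors what `whenMatchedUpdateAll` and `whenNotMatchedInsertAll` do on the table.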

In [None]:
# read the increment:

increment = spark.read.parquet(users_increment_path)

In [None]:
# do the merge:

(
    DeltaTable.forName(spark, 'accounts')
    .alias('accounts')
    .merge(
        increment.alias('increment'),
        'accounts.user_id == increment.user_id'
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

In [None]:
# check the history again to see that we have created a new version

spark.sql('describe history accounts').select('version', 'timestamp', 'operationMetrics').show()

### Optimize

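Merges and other writes tend to leave many small files behind. After you are done with the writes on the Delta table, it can be useful to call optimize on it, which compacts them into fewer, larger files.

* call [optimize](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.optimize)
* use [z-order](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaOptimizeBuilder.executeZOrderBy) by the column `user_id`
* inspect the files under the table location to see the compaction: the new, larger files are written next to the old small ones, which stay on disk until vacuum removes them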
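
Inspecting the files can be done in a file browser, or with a small helper like the one below. It is a sketch that assumes a table on the local filesystem; the helper `data_files` is hypothetical and the demo runs on a fake directory with made-up file names. Running it before and after optimize shows which Parquet files were added.

```python
import tempfile
from pathlib import Path


def data_files(table_path: str) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every Parquet data file."""
    root = Path(table_path)
    files = []
    for f in root.rglob('*.parquet'):
        relative = f.relative_to(root)
        # Skip checkpoint files kept inside the transaction log.
        if '_delta_log' in relative.parts:
            continue
        files.append((relative.as_posix(), f.stat().st_size))
    return sorted(files)


# Demo on a fake table directory; file names and contents are made up.
demo_table = Path(tempfile.mkdtemp())
(demo_table / 'part-0.parquet').write_bytes(b'abc')
(demo_table / 'part-1.parquet').write_bytes(b'abcde')
(demo_table / '_delta_log').mkdir()
(demo_table / '_delta_log' / '00000000000000000010.checkpoint.parquet').write_bytes(b'x')

listing = data_files(str(demo_table))
print(listing)
```

In the notebook you would call `data_files(accounts_output_path)`. The number of files on disk grows after optimize; the count of files that belong to the current version is the `numFiles` column of `describe detail accounts`.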


In [None]:
DeltaTable.forName(spark, 'accounts').optimize().executeZOrderBy('user_id')

### Delete from delta

* [delete](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.delete) the row with `user_id = 79` from the `accounts` table
* check that the row was really deleted (the count should be 0)

In [None]:
# delete:

DeltaTable.forName(spark, 'accounts').delete(col('user_id') == 79)

In [None]:
# check that it was deleted:

spark.table('accounts').filter(col('user_id') == 79).count()

### Time travel

* Now imagine that you made a mistake and you actually don't want to remove the user. Delta allows you to revert this operation.

1) You can first take a look at a particular snapshot using the option `versionAsOf` on the DataFrameReader
2) Then, if you decide that you really want to revert the operation, you can use `restoreToVersion` on the Delta table

Useful links:
* docs for [restoreToVersion](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.restoreToVersion)
* delta blog for [time travel](https://delta.io/blog/2023-02-01-delta-lake-time-travel/)
* delta blog for [rollback](https://delta.io/blog/2022-10-03-rollback-delta-lake-restore/)
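
Instead of hard-coding the version number, the version to restore can be derived from the table history. The sketch below works on plain `(version, operation)` pairs; the helper `version_before_last` is hypothetical, and the operation names in the example are an assumption about what Delta records on a fresh run of this notebook.

```python
from typing import Optional


def version_before_last(history, operation: str) -> Optional[int]:
    """Return the version just before the most recent `operation`, if any.

    `history` is an iterable of (version, operation) pairs in any order.
    """
    matching = [version for version, op in history if op == operation]
    if not matching or max(matching) == 0:
        return None
    return max(matching) - 1


# Assumed history of a fresh run: create, merge, optimize, delete.
example_history = [
    (0, 'CREATE TABLE AS SELECT'),
    (1, 'MERGE'),
    (2, 'OPTIMIZE'),
    (3, 'DELETE'),
]

restore_to = version_before_last(example_history, 'DELETE')
print(restore_to)
```

In the notebook the pairs could come from `DeltaTable.forName(spark, 'accounts').history().select('version', 'operation').collect()`, since each collected row unpacks like a tuple.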

In [None]:
# take a look at the version of the table where the user wasn't deleted yet
# (on a fresh run: 0 = create, 1 = merge, 2 = optimize, 3 = delete):

spark.read.option("versionAsOf", "2").table('accounts').filter(col('user_id') == 79).count()

In [None]:
# roll back to that version

DeltaTable.forName(spark, 'accounts').restoreToVersion(2)

In [None]:
# check the table to see that the row is present there

spark.table('accounts').filter(col('user_id') == 79).count()

### Vacuum

* remove old files that the table no longer references and that you don't need anymore
* run [vacuum](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.vacuum) on the `accounts` Delta table

Note:
* if you want to remove files that are younger than the default retention period of 7 days, you will have to disable the following safety check: `spark.databricks.delta.retentionDurationCheck.enabled`
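
The argument of `vacuum` is a retention period in hours, so the 7-day default corresponds to 168 hours. A tiny sketch of when the safety check gets in the way, assuming the table uses the default retention; the helper `needs_check_disabled` is hypothetical.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # default retention of 7 days, in hours


def needs_check_disabled(retention_hours: float) -> bool:
    """True if the requested retention is shorter than the default one."""
    return retention_hours < DEFAULT_RETENTION_HOURS


print(needs_check_disabled(1))    # a one-hour retention is below the default
print(needs_check_disabled(168))  # exactly the default retention
```

Keep in mind that time travel to a version whose files were vacuumed no longer works.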

In [None]:
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', False)

In [None]:
# the retention period is given in hours: remove unreferenced files older than 1 hour
DeltaTable.forName(spark, 'accounts').vacuum(1)

In [None]:
spark.stop()