# Delta Tables

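* In this notebook you will learn the basic functionality of Delta Lake: creating a table, inspecting its history, merging, optimizing, deleting, time travel, and vacuum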

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # only `col` is used in this notebook
from delta.tables import DeltaTable
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('delta-conf')
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

In [None]:
spark.version

In [5]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

users_base_path = os.path.join(project_path, 'data/users_base')
users_increment_path = os.path.join(project_path, 'data/users_increment')
accounts_output_path = os.path.join(project_path, 'output/delta/accounts')
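
The expression for `project_path` above drops the last three components of the working directory. A minimal sketch of the same idea with `pathlib`, assuming a POSIX-style path; the helper name `project_root` and the example directory are hypothetical and not part of the notebook:

```python
from pathlib import PurePosixPath


def project_root(base_path: str, levels_up: int = 3) -> str:
    """Return base_path with its last `levels_up` components removed."""
    # parents[0] is the direct parent, so parents[levels_up - 1] is `levels_up` levels up
    return str(PurePosixPath(base_path).parents[levels_up - 1])


# hypothetical working directory, three levels below the project root
example = "/home/user/project/notebooks/delta/solutions"
root = project_root(example)

# same result as the split/join expression used in the notebook
assert root == "/".join(example.split("/")[0:-3])
print(root)
```

The `pathlib` variant raises an `IndexError` when the path has fewer than three parent levels, whereas the split/join version silently returns a shorter or empty string.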

### Create a Delta table

* First, drop the table `accounts` if it already exists
  * use a SQL command to drop the table
  * see the [docs](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-drop-table.html#drop-table) for the `DROP TABLE` command
  * be aware that the table below is created with an explicit `path`, so it is an external table: dropping it only removes the entry from the metastore and leaves the data files in place. You need to delete the files under `accounts_output_path` yourself
* Take the data from `users_base_path` and save it as a Delta table named `accounts`
* Use `accounts_output_path` as the location of the table
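
Because dropping the table leaves its files behind, a rerun of the notebook may need to clear the output directory first. A minimal sketch, assuming the table lives on the local filesystem; the helper name `remove_table_files` is hypothetical:

```python
import shutil
from pathlib import Path


def remove_table_files(table_path: str) -> bool:
    """Delete the directory of a dropped external table; return True if something was removed."""
    path = Path(table_path)
    if not path.exists():
        return False
    shutil.rmtree(path)  # removes the data files and the _delta_log directory
    return True


# intended usage in this notebook, after `drop table if exists accounts`:
# remove_table_files(accounts_output_path)
```

This is irreversible, so it is worth double-checking the path before calling it.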

In [6]:
# drop the table if it exists:

spark.sql("drop table if exists accounts")

DataFrame[]

In [7]:
# Now create a new table accounts from the data
(
    spark.read.parquet(users_base_path)
    .write
    .format('delta')
    .option('path', accounts_output_path)
    .saveAsTable('accounts')
)

                                                                                

### Verify that the table is created

You can use the following SQL commands:
* `show tables`
* `describe formatted table_name`
* `describe extended table_name`
* `describe detail table_name`

You can also use the API of the Delta table:
* create the Delta table object using [DeltaTable.forName](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.forName) and call `detail()` on it
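
Besides the metastore, the table location itself can be checked: a Delta table keeps its transaction log in a `_delta_log` directory of JSON commit files. A minimal sketch for a local path; the helper name `looks_like_delta_table` is hypothetical, and the check is only a heuristic, not a validation of the log contents:

```python
from pathlib import Path


def looks_like_delta_table(table_path: str) -> bool:
    """Return True if the directory holds a _delta_log folder with at least one JSON commit."""
    log_dir = Path(table_path) / "_delta_log"
    return log_dir.is_dir() and any(log_dir.glob("*.json"))


# intended usage in this notebook, after the table has been written:
# looks_like_delta_table(accounts_output_path)
```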

In [8]:
# check if the table was successfully created:

spark.sql('show tables').show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|  default| accounts|      false|
+---------+---------+-----------+



In [10]:
spark.sql('desc formatted accounts').show(truncate=False, n=50)

+----------------------------+-----------------------------------------------------------------------+-------+
|col_name                    |data_type                                                              |comment|
+----------------------------+-----------------------------------------------------------------------+-------+
|user_id                     |bigint                                                                 |NULL   |
|display_name                |string                                                                 |NULL   |
|about                       |string                                                                 |NULL   |
|location                    |string                                                                 |NULL   |
|downvotes                   |bigint                                                                 |NULL   |
|upvotes                     |bigint                                                                 |NULL   |
|

In [11]:
spark.sql('describe detail accounts').printSchema()

root
 |-- format: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location: string (nullable = true)
 |-- createdAt: timestamp (nullable = true)
 |-- lastModified: timestamp (nullable = true)
 |-- partitionColumns: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- clusteringColumns: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- numFiles: long (nullable = true)
 |-- sizeInBytes: long (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- minReaderVersion: integer (nullable = true)
 |-- minWriterVersion: integer (nullable = true)
 |-- tableFeatures: array (nullable = true)
 |    |-- element: string (containsNull = true)



                                                                                

In [17]:
spark.sql('describe detail accounts').select('location', 'numFiles', 'sizeInBytes', 'properties').show(truncate=30)

+------------------------------+--------+-----------+----------+
|                      location|numFiles|sizeInBytes|properties|
+------------------------------+--------+-----------+----------+
|file:/home/ubuntu/Apache-Sp...|       4|    3995164|        {}|
+------------------------------+--------+-----------+----------+



In [18]:
DeltaTable.forName(spark, 'accounts').detail().printSchema()

root
 |-- format: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location: string (nullable = true)
 |-- createdAt: timestamp (nullable = true)
 |-- lastModified: timestamp (nullable = true)
 |-- partitionColumns: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- clusteringColumns: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- numFiles: long (nullable = true)
 |-- sizeInBytes: long (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- minReaderVersion: integer (nullable = true)
 |-- minWriterVersion: integer (nullable = true)
 |-- tableFeatures: array (nullable = true)
 |    |-- element: string (containsNull = true)



### Version history of the delta table

Inspect the history of the Delta table. You can use:
* SQL command: `describe history table_name`
* [history](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.history) function on the Delta table object
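
The history comes from the transaction log: each version is a numbered JSON file in `_delta_log`, with one action per line, and the `commitInfo` action records the operation. A minimal sketch that reads those files directly from a local table; the helper name `list_commits` is hypothetical, and the sketch ignores checkpoints, so it is no substitute for `describe history`:

```python
import json
from pathlib import Path


def list_commits(table_path: str) -> list:
    """Return (version, operation) pairs read from the JSON commit files, newest first."""
    commits = []
    for commit_file in Path(table_path, "_delta_log").glob("*.json"):
        if not commit_file.stem.isdigit():
            continue  # skip anything that is not a numbered commit file
        operation = "UNKNOWN"
        for line in commit_file.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # every line is one JSON action
            if "commitInfo" in action:
                operation = action["commitInfo"].get("operation", "UNKNOWN")
        commits.append((int(commit_file.stem), operation))
    return sorted(commits, reverse=True)


# intended usage in this notebook:
# list_commits(accounts_output_path)
```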

In [19]:
spark.sql('describe history accounts').printSchema()

root
 |-- version: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- userId: string (nullable = true)
 |-- userName: string (nullable = true)
 |-- operation: string (nullable = true)
 |-- operationParameters: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- job: struct (nullable = true)
 |    |-- jobId: string (nullable = true)
 |    |-- jobName: string (nullable = true)
 |    |-- jobRunId: string (nullable = true)
 |    |-- runId: string (nullable = true)
 |    |-- jobOwnerId: string (nullable = true)
 |    |-- triggerType: string (nullable = true)
 |-- notebook: struct (nullable = true)
 |    |-- notebookId: string (nullable = true)
 |-- clusterId: string (nullable = true)
 |-- readVersion: long (nullable = true)
 |-- isolationLevel: string (nullable = true)
 |-- isBlindAppend: boolean (nullable = true)
 |-- operationMetrics: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

In [21]:
DeltaTable.forName(spark, 'accounts').history().select('version', 'timestamp').show(truncate=30)

+-------+-----------------------+
|version|              timestamp|
+-------+-----------------------+
|      0|2024-10-06 14:41:24.372|
+-------+-----------------------+



### Upsert / Merge

* Load the increment into a Spark DataFrame
  * use the path `users_increment_path`
* Upsert the increment into the `accounts` table (use merge)
* Useful links for merge:
  * docs for [merge](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.merge)
  * Delta blog post on [merge](https://delta.io/blog/2023-02-14-delta-lake-merge/)
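
To make the upsert semantics concrete, here is a small in-memory model of what `whenMatchedUpdateAll` combined with `whenNotMatchedInsertAll` is expected to do. It is only an illustration on plain Python dictionaries with made-up rows, not how Delta implements merge; the function name `upsert` is hypothetical:

```python
def upsert(target: list, increment: list, key: str = "user_id") -> list:
    """Model of whenMatchedUpdateAll + whenNotMatchedInsertAll on in-memory rows."""
    merged = {row[key]: row for row in target}
    for row in increment:
        # matched key: the whole row is replaced; unmatched key: the row is inserted
        merged[row[key]] = row
    return sorted(merged.values(), key=lambda row: row[key])


# made-up rows for illustration only
accounts = [
    {"user_id": 1, "display_name": "Ann"},
    {"user_id": 2, "display_name": "Bob"},
]
increment = [
    {"user_id": 2, "display_name": "Robert"},  # existing user, gets updated
    {"user_id": 3, "display_name": "Cid"},     # new user, gets inserted
]
result = upsert(accounts, increment)
print(result)
```

One difference worth knowing: if the increment contained two rows with the same key, this model would keep the last one, whereas a Delta merge is expected to fail when several source rows match the same target row.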

In [22]:
# read the increment:

increment = spark.read.parquet(users_increment_path)

In [23]:
# do the merge:

(
    DeltaTable.forName(spark, 'accounts')
    .alias('accounts')
    .merge(
        increment.alias('increment'),
        'accounts.user_id == increment.user_id'
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

                                                                                

In [35]:
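# check the history again to see that the merge created a new version (version 1)
# note: judging by the execution counts, this output was captured after the optimize
# and delete cells below had also run, which is why versions 2 and 3 appear as well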

spark.sql('describe history accounts').select('version', 'timestamp', 'operationMetrics').show(truncate=100)

+-------+-----------------------+----------------------------------------------------------------------------------------------------+
|version|              timestamp|                                                                                    operationMetrics|
+-------+-----------------------+----------------------------------------------------------------------------------------------------+
|      3|2024-10-06 14:47:20.673|{numRemovedFiles -> 1, numRemovedBytes -> 4059419, numCopiedRows -> 124224, numDeletionVectorsAdd...|
|      2|2024-10-06 14:46:46.875|{numRemovedFiles -> 4, numRemovedBytes -> 4120185, p25FileSize -> 4059419, numDeletionVectorsRemo...|
|      1|2024-10-06 14:45:40.178|{numTargetRowsCopied -> 91560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 4, numTargetByte...|
|      0|2024-10-06 14:41:24.372|                                 {numFiles -> 4, numOutputRows -> 124225, numOutputBytes -> 3995164}|
+-------+-----------------------+----------------------------------------------------------------------------------------------------+

### Optimize

Once the writes to the Delta table are finished, it can be useful to call optimize on it, which compacts small files into larger ones.

* call [optimize](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.optimize)
* use [z-order](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaOptimizeBuilder.executeZOrderBy) by the column `user_id`
* manually check the files under the table location to see that the data was compacted
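
For the manual check, a short sketch that lists the Parquet data files under a local table directory together with their sizes. The helper name `data_file_sizes` is hypothetical. Keep in mind that files replaced by optimize stay on disk until vacuum removes them, so the listing can show both the old small files and the new compacted one; `numFiles` from `describe detail` counts only the files of the current version.

```python
from pathlib import Path


def data_file_sizes(table_path: str) -> dict:
    """Map each Parquet data file under the table directory to its size in bytes."""
    root = Path(table_path)
    sizes = {}
    for f in sorted(root.rglob("*.parquet")):
        relative = f.relative_to(root)
        if "_delta_log" in relative.parts:
            continue  # ignore checkpoint files that live in the transaction log
        sizes[str(relative)] = f.stat().st_size
    return sizes


# intended usage in this notebook, before and after the optimize call:
# data_file_sizes(accounts_output_path)
```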


In [30]:
DeltaTable.forName(spark, 'accounts').optimize().executeZOrderBy('user_id')

                                                                                

DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,clusteringStats:struct<inputZCubeFiles:struct<numFiles:bigint,size:bigint>,inputOtherFiles:struct<numFiles:bigint,size:bigint>,inputNumZCubes:bigint,mergedFiles:struct<numFiles:bigint,size:bigint>,numOutputZCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterPar

### Delete from delta

* [delete](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.delete) the row where `user_id = 79` from the `accounts` table
* check that the row was really deleted

In [33]:
# delete:

DeltaTable.forName(spark, 'accounts').delete(col('user_id') == 79)

                                                                                

In [34]:
# check that it was deleted:

spark.table('accounts').filter(col('user_id') == 79).count()

                                                                                

0

### Time travel

* Now imagine that you made a mistake and do not actually want to remove the user. Delta allows you to revert this operation.

1) First, take a look at a particular snapshot using the option `versionAsOf` on the DataFrameReader
2) Then, if you decide that you really want to revert the operation, use `restoreToVersion` on the Delta table

Useful links:
* docs for [restoreToVersion](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.restoreToVersion)
* delta blog for [time travel](https://delta.io/blog/2023-02-01-delta-lake-time-travel/)
* delta blog for [rollback](https://delta.io/blog/2022-10-03-rollback-delta-lake-restore/)
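
The two steps can be pictured with a toy model in which every change appends a new snapshot. It is only an illustration with made-up user ids, not Delta's implementation, and the class name `VersionedTable` is hypothetical. The point it shows is that a restore is itself a new version, so the history before it is kept:

```python
from typing import Optional


class VersionedTable:
    """Toy model of a table whose every change is stored as a new snapshot."""

    def __init__(self, rows: set) -> None:
        self._snapshots = [frozenset(rows)]  # version 0

    @property
    def latest_version(self) -> int:
        return len(self._snapshots) - 1

    def read(self, version: Optional[int] = None) -> frozenset:
        """Return the snapshot at `version`, or the latest one (like versionAsOf)."""
        return self._snapshots[self.latest_version if version is None else version]

    def delete(self, user_id: int) -> None:
        self._snapshots.append(self.read() - {user_id})

    def restore_to_version(self, version: int) -> None:
        # a restore is itself a new commit, so earlier history is kept
        self._snapshots.append(self.read(version))


table = VersionedTable({78, 79, 80})
table.delete(79)             # creates version 1
assert 79 not in table.read()
assert 79 in table.read(0)   # time travel to the older snapshot
table.restore_to_version(0)  # creates version 2 with the content of version 0
assert 79 in table.read()
```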

In [36]:
# take a look at the version of the table where the user wasn't deleted:

spark.read.option("versionAsOf", "2").table('accounts').filter(col('user_id') == 79).count()

                                                                                

1

In [37]:
# roll back to version 2, the version checked above in which the user is still present

DeltaTable.forName(spark, 'accounts').restoreToVersion(2)

                                                                                

DataFrame[table_size_after_restore: bigint, num_of_files_after_restore: bigint, num_removed_files: bigint, num_restored_files: bigint, removed_files_size: bigint, restored_files_size: bigint]

In [38]:
# check the table to see that the row is present there

spark.table('accounts').filter(col('user_id') == 79).count()

                                                                                

1

### Vacuum

* remove old files that you don't need anymore
* run [vacuum](https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.vacuum) on the `accounts` Delta table

Note:
* if you want to remove files that are less than 7 days old, you will have to disable the following safety check: `spark.databricks.delta.retentionDurationCheck.enabled`
* the argument of `vacuum` is the retention period in hours, so `vacuum(1)` keeps every file younger than one hour
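
The effect of the retention period can be sketched with a simplified rule: a file is a candidate for removal only if the current table version no longer references it and it is older than the retention period. This is a model under that assumption, not Delta's actual vacuum logic; the function name `vacuum_candidates`, the file names, and the timestamps are hypothetical:

```python
from datetime import datetime, timedelta


def vacuum_candidates(files: dict, active: set, retention_hours: float, now: datetime) -> list:
    """Files no longer referenced by the table and older than the retention period."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        name
        for name, modified in files.items()
        if name not in active and modified < cutoff
    )


# made-up file listing: name -> last modification time
now = datetime(2024, 10, 6, 14, 50)
files = {
    "old-small-file.parquet": datetime(2024, 10, 6, 14, 41),  # replaced by a later version
    "compacted-file.parquet": datetime(2024, 10, 6, 14, 46),  # still referenced
}
active = {"compacted-file.parquet"}

print(vacuum_candidates(files, active, retention_hours=1, now=now))  # nothing is old enough
print(vacuum_candidates(files, active, retention_hours=0, now=now))  # the replaced file qualifies
```

Under this model, a `vacuum(1)` that runs a few minutes after the writes finds nothing to remove, which would be consistent with the "Deleted 0 files" message in the output below.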

In [39]:
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', False)

In [40]:
DeltaTable.forName(spark, 'accounts').vacuum(1)

                                                                                

Deleted 0 files and directories in a total of 1 directories.


DataFrame[]

In [None]:
spark.stop()