## Write to and read from a Delta Lake table

### Write a Spark DataFrame to a Delta Lake table

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Delta Lake Demo") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

26/02/07 06:48:30 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
data = spark.range(0, 5)

(data
  .write
  .format("delta")
  .save("/tmp/delta-table")
)

                                                                                

### Read the Delta Lake table into a Spark DataFrame and display it

In [3]:
df = (spark
        .read
        .format("delta")
        .load("/tmp/delta-table")
        .orderBy("id")
      )

df.show()

26/02/07 06:51:33 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+



## Overwrite a Delta Lake table

### Overwrite the Delta Lake table written in the previous step

In [4]:
data = spark.range(5, 10)

(data
  .write
  .format("delta")
  .mode("overwrite")
  .save("/tmp/delta-table")
)

In [5]:
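# Register the DataFrame as a temporary view so that SQL can reference it.
# createOrReplaceTempView returns None, so its result is not assigned.
data.createOrReplaceTempView("new_data")

# Insert source rows whose id is missing from the target; matching rows are left untouched.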
spark.sql("""
  MERGE INTO delta.`/tmp/delta-table` AS target
  USING new_data AS source
  ON target.id = source.id
  WHEN NOT MATCHED THEN
    INSERT *
""")

26/02/07 06:54:41 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
26/02/07 06:54:41 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore UNKNOWN@172.19.0.2
26/02/07 06:54:41 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
26/02/07 06:54:41 WARN ObjectStore: Failed to get database delta, returning NoSuchObjectException


DataFrame[num_affected_rows: bigint, num_updated_rows: bigint, num_deleted_rows: bigint, num_inserted_rows: bigint]

### Read the overwritten Delta Lake table into a Spark DataFrame and display it

In [6]:
df = (spark
        .read
        .format("delta")
        .load("/tmp/delta-table")
        .orderBy("id")
      )

df.show()

+---+
| id|
+---+
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
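
The effect of `.mode("overwrite")` is easier to see next to Spark's other save modes (`append`, `ignore`, and the default `error`/`errorifexists`). The following is a minimal plain-Python sketch of those semantics, not Spark or Delta Lake code; `toy_save` is a hypothetical helper and a Python list stands in for the table.

```python
def toy_save(existing, new_rows, mode="errorifexists"):
    """Return the table contents after a write.

    `existing` is None when no table exists yet at the target path.
    Hypothetical helper that only models the save-mode semantics.
    """
    new_rows = list(new_rows)
    if existing is None:
        # No table yet: every mode simply writes the new rows.
        return new_rows
    if mode == "overwrite":
        return new_rows              # replace the old contents
    if mode == "append":
        return existing + new_rows   # keep the old rows and add the new ones
    if mode == "ignore":
        return existing              # silently leave the table as it is
    if mode in ("error", "errorifexists"):
        raise FileExistsError("table already exists")
    raise ValueError(f"unknown save mode: {mode}")


first = toy_save(None, range(0, 5))                    # like cell In [2]
overwritten = toy_save(first, range(5, 10), "overwrite")  # like cell In [4]
appended = toy_save(first, range(5, 10), "append")

print(overwritten)  # [5, 6, 7, 8, 9]
print(appended)     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With the default mode, writing to a path that already holds a table raises an error in this model, which is why the second write in this notebook sets the mode explicitly.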



## Delta Lake and [ACID](https://en.wikipedia.org/wiki/ACID)

### Update rows in the Delta Lake table

In [7]:
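# Import only the names used below; a wildcard import of pyspark.sql.functions
# would shadow Python built-ins such as sum, min, and max.
from delta.tables import DeltaTable
from pyspark.sql.functions import col, expr

delta_table = DeltaTable.forPath(spark, "/tmp/delta-table")

# Add 100 to every even value
(delta_table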
  .update(
    condition = expr("id % 2 == 0"),
    set = { "id": expr("id + 100") }
  )
)

(delta_table
  .toDF()
  .orderBy("id")
  .show()
)

26/02/07 06:59:00 WARN UpdateCommand: Could not validate number of records due to missing statistics.


+---+
| id|
+---+
|  5|
|  7|
|  9|
|106|
|108|
+---+
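
As a sanity check on the output above, the same rule can be applied to the ids in plain Python. This is only a sketch of the arithmetic, assuming the table held the ids 5 to 9 before the update; it does not touch Spark or the Delta table.

```python
# Table contents after the overwrite step: ids 5..9.
ids_before = list(range(5, 10))

# Same rule as the Delta update: rows matching `id % 2 == 0` become id + 100,
# all other rows are left unchanged.
ids_after = sorted(i + 100 if i % 2 == 0 else i for i in ids_before)

print(ids_after)  # [5, 7, 9, 106, 108]
```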



### Delete rows from the Delta Lake table

In [8]:
# Delete every even value
(delta_table
  .delete(
    condition = expr("id % 2 == 0")
  )
)

(delta_table
  .toDF()
  .orderBy("id")
  .show()
)

26/02/07 06:59:35 WARN DeleteCommand: Could not validate number of records due to missing statistics.


+---+
| id|
+---+
|  5|
|  7|
|  9|
+---+



### Merge (upsert) new data into the Delta Lake table

In [9]:
# Upsert (merge) new data
new_data = spark.range(0, 20)

(delta_table.alias("old_data")
  .merge(
      new_data.alias("new_data"),
      "old_data.id = new_data.id"
      )
  .whenMatchedUpdate(set = { "id": col("new_data.id") })
  .whenNotMatchedInsert(values = { "id": col("new_data.id") })
  .execute()
)

(delta_table
  .toDF()
  .orderBy("id")
  .show()
)

26/02/07 07:00:52 WARN MapPartitionsRDD: RDD 152 was locally checkpointed, its lineage has been truncated and cannot be recomputed after unpersisting


+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
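
The upsert logic of the merge above can be modelled with a dictionary keyed by `id`: keys present on both sides are updated, keys present only in the source are inserted. This is a plain-Python sketch of the idea, not how Delta Lake executes a merge; `upsert` is a hypothetical helper.

```python
def upsert(target, source):
    """Return (merged, stats): matched keys are updated, unmatched keys inserted."""
    merged = dict(target)  # copy so the original target is left untouched
    stats = {"updated": 0, "inserted": 0}
    for key, row in source.items():
        if key in merged:
            stats["updated"] += 1   # corresponds to whenMatchedUpdate
        else:
            stats["inserted"] += 1  # corresponds to whenNotMatchedInsert
        merged[key] = row
    return merged, stats


# Target holds the rows left after the delete step; source is range(0, 20).
target = {i: {"id": i} for i in (5, 7, 9)}
source = {i: {"id": i} for i in range(0, 20)}

merged, stats = upsert(target, source)
print(sorted(merged))  # ids 0 through 19
print(stats)           # {'updated': 3, 'inserted': 17}
```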



## Time travel feature of Delta Lake

### Display the entire history of the above Delta Lake table

In [10]:
# get the full history of the table
delta_table_history = (DeltaTable
                        .forPath(spark, "/tmp/delta-table")
                        .history()
                      )

(delta_table_history
   .select("version", "timestamp", "operation", "operationParameters", "operationMetrics", "engineInfo")
   .show()
)

+-------+--------------------+---------+--------------------+--------------------+--------------------+
|version|           timestamp|operation| operationParameters|    operationMetrics|          engineInfo|
+-------+--------------------+---------+--------------------+--------------------+--------------------+
|      4|2026-02-07 07:00:...|    MERGE|{predicate -> ["(...|{numTargetRowsCop...|Apache-Spark/4.0....|
|      3|2026-02-07 06:59:...|   DELETE|{predicate -> ["(...|{numRemovedFiles ...|Apache-Spark/4.0....|
|      2|2026-02-07 06:59:...|   UPDATE|{predicate -> ["(...|{numRemovedFiles ...|Apache-Spark/4.0....|
|      1|2026-02-07 06:53:...|    WRITE|{mode -> Overwrit...|{numFiles -> 6, n...|Apache-Spark/4.0....|
|      0|2026-02-07 06:49:...|    WRITE|{mode -> ErrorIfE...|{numFiles -> 6, n...|Apache-Spark/4.0....|
+-------+--------------------+---------+--------------------+--------------------+--------------------+



### Read the latest version of the Delta Lake table

In [11]:
df = (spark
        .read
        .format("delta")
        .load("/tmp/delta-table")
        .orderBy("id")
      )

df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+



### Time travel to version `0` of the Delta Lake table using Delta Lake's history feature

In [12]:
df = (spark
        .read
        .format("delta")
        .option("versionAsOf", 0)  # read the table as of this version number
        .load("/tmp/delta-table")
        .orderBy("id")
      )

df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+



### Time travel to version `3` of the Delta Lake table using Delta Lake's history feature

In [13]:
df = (spark
        .read
        .format("delta")
        .option("versionAsOf", 3)  # read the table as of this version number
        .load("/tmp/delta-table")
        .orderBy("id")
      )

df.show()

+---+
| id|
+---+
|  5|
|  7|
|  9|
+---+
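
To make the link between the history table and `versionAsOf` concrete, the sketch below replays the five operations from this notebook against a toy versioned table. It is a simplified plain-Python model: `ToyVersionedTable` is hypothetical and stores a full snapshot per commit, whereas Delta Lake records changes in a transaction log.

```python
class ToyVersionedTable:
    """Keeps one immutable snapshot per commit; version numbers start at 0."""

    def __init__(self):
        self._snapshots = []
        self._operations = []

    def commit(self, operation, rows):
        """Store a new snapshot and return its version number."""
        self._snapshots.append(tuple(sorted(rows)))
        self._operations.append(operation)
        return len(self._snapshots) - 1

    def latest(self):
        return list(self._snapshots[-1])

    def as_of(self, version):
        """Return the rows as they were at the given version."""
        if not 0 <= version < len(self._snapshots):
            raise ValueError(f"version {version} does not exist")
        return list(self._snapshots[version])

    def history(self):
        """Return (version, operation) pairs, newest first."""
        return list(reversed(list(enumerate(self._operations))))


table = ToyVersionedTable()
table.commit("WRITE", range(0, 5))                                   # version 0
table.commit("WRITE", range(5, 10))                                  # version 1
table.commit("UPDATE", [i + 100 if i % 2 == 0 else i for i in table.latest()])  # version 2
table.commit("DELETE", [i for i in table.latest() if i % 2 != 0])    # version 3
table.commit("MERGE", set(table.latest()) | set(range(0, 20)))       # version 4

print(table.history())  # [(4, 'MERGE'), (3, 'DELETE'), (2, 'UPDATE'), (1, 'WRITE'), (0, 'WRITE')]
print(table.as_of(0))   # [0, 1, 2, 3, 4]
print(table.as_of(3))   # [5, 7, 9]
```

The snapshots at versions 0 and 3 match the two time travel reads above, and the newest-first ordering mirrors the history output.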

