# Delta Lake Deletion Vectors

This notebook measures the performance impact of deletion vectors by running the same operations on two Delta tables: one without deletion vectors enabled and one with them enabled. The operations are:

* a delete that matches rows in only a few files
* a delete that matches rows spread across many files

In [5]:
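from pathlib import Path

import delta
import deltalake  # used below for deltalake.DeltaTable
import levi  # used below for file-size and skipping stats
import pyspark
from delta import configure_spark_with_delta_pip
from pyspark.sql import functions as F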

In [6]:
builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.executor.memory", "10G")
    .config("spark.driver.memory", "25G")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

:: loading settings :: url = jar:file:/Users/matthew.powers/opt/miniconda3/envs/pyspark-350-delta-320/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/matthew.powers/.ivy2/cache
The jars for the packages stored in: /Users/matthew.powers/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e31ade8f-7ae7-4f65-b053-4481de1c6685;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.2.0 in central
	found io.delta#delta-storage;3.2.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 90ms :: artifacts dl 4ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.2.0 from central in [default]
	io.delta#delta-storage;3.2.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]

In [7]:
df = (
    spark.read.format("csv")
    .option("header", True)
    .load(f"{Path.home()}/data/G1_1e9_1e2_0_0.csv")
)

In [8]:
df.show()

+-----+-----+------------+---+---+-------+---+---+---------+
|  id1|  id2|         id3|id4|id5|    id6| v1| v2|       v3|
+-----+-----+------------+---+---+-------+---+---+---------+
|id016|id059|id0009584273| 31| 54|7579268|  5|  2|92.709317|
|id039|id028|id0008226858| 32| 73|1462759|  3| 14|23.308717|
|id047|id073|id0004357983| 52| 71| 354157|  2| 15|98.462728|
|id043|id069|id0006903604| 37| 35| 372382|  5|  5|32.566149|
|id054|id095|id0005719264| 94| 99|6957127|  5| 11| 97.89284|
|id029|id027|id0007119528| 11| 41|6768037|  2|  7|26.394021|
|id047|id053|id0003186028| 93| 64|3300443|  3| 14|79.319642|
|id091|id097|id0007718026| 22| 50|3609381|  5| 15|94.510853|
|id090|id033|id0007857423|  5| 65|3618630|  1|  5| 1.579951|
|id070|id062|id0001399833| 90| 99|6131090|  5| 14|24.892749|
|id039|id030|id0000654974| 22| 18|1298417|  1| 10|15.321252|
|id023|id095|id0005131426| 52| 38|7811474|  4| 12| 25.65414|
|id070|id013|id0009420524| 14| 33|7075062|  5| 14|43.468912|
|id022|id026|id000519192

## Baseline - Delta table without deletion vectors enabled

In [36]:
delta_path = f"{Path.home()}/data/delta_baseline_G1_1e9_1e2_0_0"

In [37]:
%%time

df.write.format("delta").save(delta_path)

                                                                                

CPU times: user 330 ms, sys: 116 ms, total: 447 ms
Wall time: 6min 2s


In [38]:
%%time

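# Bind delta_table to the table at delta_path before optimizing it.
delta_table = delta.DeltaTable.forPath(spark, delta_path)
delta_table.optimize().executeZOrderBy("id1")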

                                                                                

CPU times: user 646 ms, sys: 214 ms, total: 860 ms
Wall time: 10min 39s


DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:bigint>,de

In [39]:
dt = deltalake.DeltaTable(delta_path)
levi.delta_file_sizes(dt)

{'num_files_<1mb': 0,
 'num_files_1mb-500mb': 0,
 'num_files_500mb-1gb': 1,
 'num_files_1gb-2gb': 24,
 'num_files_>2gb': 0}

In [40]:
levi.skipped_stats(dt, filters=[("id1", "=", "'id001'")])

{'num_files': 25, 'num_files_skipped': 24, 'num_bytes_skipped': 25997502904}
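
The skipping stats above suggest why a delete on `id1` can be cheap after Z-ordering by `id1`: only one of the 25 files needs to be read for `id1 = 'id001'`. A minimal sketch of that arithmetic in plain Python follows. `summarize_skipping` is a hypothetical helper, not part of levi, and the input dict is copied from the output above.

```python
def summarize_skipping(stats: dict) -> dict:
    """Derive the number of files to read and the skip ratio from a skipped_stats-style dict."""
    num_files = stats["num_files"]
    skipped = stats["num_files_skipped"]
    return {
        "files_to_read": num_files - skipped,
        "pct_files_skipped": round(100 * skipped / num_files, 1),
        "gb_skipped": round(stats["num_bytes_skipped"] / 1e9, 1),
    }


# Values copied from the levi.skipped_stats output above.
stats = {"num_files": 25, "num_files_skipped": 24, "num_bytes_skipped": 25997502904}
summary = summarize_skipping(stats)
print(summary)
```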

### Computations before deletion vectors are enabled

In [42]:
%%time

delta_table = delta.DeltaTable.forPath(spark, delta_path)
delta_table.delete(F.col("id1") == "id001")

                                                                                

CPU times: user 6.4 ms, sys: 3.96 ms, total: 10.4 ms
Wall time: 10.3 s


In [43]:
spark.read.format("delta").load(delta_path).createOrReplaceTempView("x0")

In [44]:
%%time

spark.sql("select id1, sum(v1) as v1 from x0 group by id1").show()



+-----+-----------+
|  id1|         v1|
+-----+-----------+
|id089|2.9990077E7|
|id087|2.9997379E7|
|id086|3.0003608E7|
|id085|3.0010513E7|
|id088|2.9999642E7|
|id013|2.9989026E7|
|id014|2.9998476E7|
|id016|3.0003304E7|
|id017|2.9995061E7|
|id015|3.0006177E7|
|id023|2.9988818E7|
|id021|2.9982118E7|
|id025|3.0016745E7|
|id022|2.9994847E7|
|id024|3.0003956E7|
|id051|2.9994785E7|
|id052|3.0014118E7|
|id053| 2.999236E7|
|id050|3.0008271E7|
|id049|2.9978475E7|
+-----+-----------+
only showing top 20 rows

CPU times: user 68.9 ms, sys: 23.1 ms, total: 92 ms
Wall time: 19.5 s


                                                                                

In [50]:
%%time

delta_table = delta.DeltaTable.forPath(spark, delta_path)
delta_table.delete((F.col("id2") == F.lit("id030")) & (F.col("id4") == F.lit(22)))

                                                                                

CPU times: user 219 ms, sys: 70.5 ms, total: 289 ms
Wall time: 3min 58s


In [51]:
%%time

spark.sql("select id2, sum(id4) as sum_id4 from x0 group by id2").show()



+-----+------------+
|  id2|     sum_id4|
+-----+------------+
|id089|4.99702831E8|
|id080|5.00174365E8|
|id087|4.99700307E8|
|id073|4.99578553E8|
|id043|4.99827245E8|
|id064|4.99800549E8|
|id051|4.99712039E8|
|id045|4.99602129E8|
|id074|4.99709236E8|
|id023|5.00068708E8|
|id006|  4.999594E8|
|id013| 5.0012219E8|
|id055|4.99983587E8|
|id099| 4.9972291E8|
|id056| 5.0021648E8|
|id052|4.99804639E8|
|id093|4.99838447E8|
|id034|4.99737564E8|
|id075|4.99923346E8|
|id036|4.99873408E8|
+-----+------------+
only showing top 20 rows

CPU times: user 58.3 ms, sys: 18.3 ms, total: 76.6 ms
Wall time: 21.2 s


                                                                                

## Computations with deletion vectors enabled

In [52]:
delta_path = f"{Path.home()}/data/delta_dv_G1_1e9_1e2_0_0"

In [53]:
%%time

df.write.format("delta").save(delta_path)

AnalysisException: Cannot write to already existent path file:/Users/matthew.powers/data/delta_dv_G1_1e9_1e2_0_0 without setting OVERWRITE = 'true'.

In [None]:
%%time

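# Rebind delta_table to the new path; otherwise it still points at the baseline table.
delta_table = delta.DeltaTable.forPath(spark, delta_path)
delta_table.optimize().executeZOrderBy("id1")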

In [54]:
# enable deletion vectors

spark.sql(
    f"ALTER TABLE delta.`{delta_path}` SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)"
)

DataFrame[]

In [55]:
spark.read.format("delta").load(delta_path).createOrReplaceTempView("x1")

In [57]:
%%time

delta_table = delta.DeltaTable.forPath(spark, delta_path)
delta_table.delete(F.col("id1") == "id001")

                                                                                

CPU times: user 11.9 ms, sys: 7.52 ms, total: 19.4 ms
Wall time: 1min 33s


In [59]:
%%time

spark.sql("select id1, sum(v1) as v1 from x1 group by id1").show()



+-----+-----------+
|  id1|         v1|
+-----+-----------+
|id080| 2.997988E7|
|id084|3.0005578E7|
|id083|3.0005209E7|
|id085|3.0010513E7|
|id082|3.0007351E7|
|id081|2.9988686E7|
|id034|3.0010786E7|
|id036|2.9994349E7|
|id032|2.9986434E7|
|id037|2.9996759E7|
|id033|2.9983262E7|
|id035|3.0003917E7|
|id005|2.9993888E7|
|id003|3.0003365E7|
|id002|2.9996534E7|
|id004| 3.001599E7|
|id056|2.9987234E7|
|id059|3.0010798E7|
|id057|2.9991822E7|
|id058|2.9999957E7|
+-----+-----------+
only showing top 20 rows

CPU times: user 21.3 ms, sys: 10.1 ms, total: 31.4 ms
Wall time: 26.8 s


                                                                                

In [60]:
%%time

delta_table = delta.DeltaTable.forPath(spark, delta_path)
delta_table.delete((F.col("id2") == F.lit("id030")) & (F.col("id4") == F.lit(22)))

                                                                                

CPU times: user 25.4 ms, sys: 10 ms, total: 35.4 ms
Wall time: 16.7 s


In [61]:
%%time

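# x1 is the view over the deletion-vector table; the output recorded below was captured from a query against x0.
spark.sql("select id2, sum(id4) as sum_id4 from x1 group by id2").show()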



+-----+------------+
|  id2|     sum_id4|
+-----+------------+
|id089|4.99702831E8|
|id080|5.00174365E8|
|id087|4.99700307E8|
|id073|4.99578553E8|
|id043|4.99827245E8|
|id064|4.99800549E8|
|id051|4.99712039E8|
|id045|4.99602129E8|
|id074|4.99709236E8|
|id023|5.00068708E8|
|id006|  4.999594E8|
|id013| 5.0012219E8|
|id055|4.99983587E8|
|id099| 4.9972291E8|
|id056| 5.0021648E8|
|id052|4.99804639E8|
|id093|4.99838447E8|
|id034|4.99737564E8|
|id075|4.99923346E8|
|id036|4.99873408E8|
+-----+------------+
only showing top 20 rows

CPU times: user 61.2 ms, sys: 20.9 ms, total: 82 ms
Wall time: 20.8 s
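
The wall times recorded above for the two deletes can be put side by side. The sketch below converts the `%%time` strings to seconds and prints the baseline-to-deletion-vector ratio. `wall_time_seconds` is a hypothetical helper written for this comparison. These are single runs on one machine, and the write cell for the deletion-vector table raised an error while its Z-order cell shows no execution count, so the two tables may not share the same file layout. Treat the ratios as indicative only.

```python
import re


def wall_time_seconds(text: str) -> float:
    """Convert a %%time wall-time string such as '3min 58s' or '10.3 s' to seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"([\d.]+)\s*s\b", text)
    total = 0.0
    if minutes:
        total += 60 * int(minutes.group(1))
    if seconds:
        total += float(seconds.group(1))
    return total


# (baseline, deletion vectors) wall times copied from the cells above.
runs = {
    "delete where id1 == 'id001'": ("10.3 s", "1min 33s"),
    "delete where id2 == 'id030' and id4 == 22": ("3min 58s", "16.7 s"),
}

for name, (baseline, dv) in runs.items():
    b, d = wall_time_seconds(baseline), wall_time_seconds(dv)
    print(f"{name}: baseline {b:.1f}s, deletion vectors {d:.1f}s, baseline/dv = {b / d:.2f}")
```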


                                                                                

## Purging Deletion Vectors

In [17]:
df = spark.createDataFrame([("bob", 1), ("li", 2), ("leonard", 3)]).toDF(
    "first_name", "id"
)

In [18]:
delta_path = f"{Path().absolute()}/tmp/some_ppl"

In [19]:
df.repartition(1).write.format("delta").save(delta_path)

In [20]:
spark.sql(
    f"ALTER TABLE delta.`{delta_path}` SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)"
)

DataFrame[]

In [21]:
!tree tmp/some_ppl

tmp/some_ppl
├── _delta_log
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
└── part-00000-1d16df58-7cbe-4608-9782-429f3a9bd639-c000.snappy.parquet

1 directory, 3 files


In [23]:
delta_table = delta.DeltaTable.forPath(spark, delta_path)

In [24]:
delta_table.delete(F.col("id") == F.lit(1))

In [25]:
!tree tmp/some_ppl

tmp/some_ppl
├── _delta_log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   └── 00000000000000000002.json
├── deletion_vector_6a611095-5f85-4961-9394-950bfb49a4a0.bin
└── part-00000-1d16df58-7cbe-4608-9782-429f3a9bd639-c000.snappy.parquet

1 directory, 5 files


In [26]:
delta_table.delete(F.col("id") == F.lit(2))

In [27]:
!tree tmp/some_ppl

tmp/some_ppl
├── _delta_log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   └── 00000000000000000003.json
├── deletion_vector_6a611095-5f85-4961-9394-950bfb49a4a0.bin
├── deletion_vector_e2928f07-529a-48d7-9cc0-a97b4ce20582.bin
└── part-00000-1d16df58-7cbe-4608-9782-429f3a9bd639-c000.snappy.parquet

1 directory, 7 files
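
Each delete above added a commit file and a `deletion_vector_*.bin` file while leaving the Parquet file untouched. In the Delta protocol, the link between the two is a `deletionVector` descriptor on the `add` action in the commit JSON. The sketch below scans a `_delta_log` directory for such descriptors. `deletion_vector_actions` is a hypothetical helper, and the demo runs against a synthetic one-commit log with made-up values so that it works without Spark.

```python
import json
import tempfile
from pathlib import Path


def deletion_vector_actions(table_path):
    """Return (commit file, data file, descriptor) for every add action with a deletion vector."""
    found = []
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            add = json.loads(line).get("add")
            if add and add.get("deletionVector"):
                found.append((commit.name, add["path"], add["deletionVector"]))
    return found


# Synthetic log so the sketch runs anywhere; every value here is made up.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "_delta_log"
    log_dir.mkdir()
    add_action = {
        "add": {
            "path": "part-00000-example.snappy.parquet",
            "deletionVector": {
                "storageType": "u",
                "pathOrInlineDv": "example",
                "offset": 1,
                "sizeInBytes": 34,
                "cardinality": 1,
            },
        }
    }
    (log_dir / "00000000000000000002.json").write_text(json.dumps(add_action) + "\n")
    found = deletion_vector_actions(tmp)

for commit_name, data_file, dv in found:
    print(commit_name, data_file, dv["storageType"], dv["cardinality"])
```

To inspect the table from this notebook instead, call `deletion_vector_actions("tmp/some_ppl")`.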


In [28]:
spark.sql(f"REORG TABLE delta.`{delta_path}` APPLY (PURGE);")

DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:bigint>,de

In [29]:
!tree tmp/some_ppl

tmp/some_ppl
├── _delta_log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── 00000000000000000003.json
│   └── 00000000000000000004.json
├── deletion_vector_6a611095-5f85-4961-9394-950bfb49a4a0.bin
├── deletion_vector_e2928f07-529a-48d7-9cc0-a97b4ce20582.bin
├── part-00000-1d16df58-7cbe-4608-9782-429f3a9bd639-c000.snappy.parquet
└── part-00000-40ba36d8-8de6-4728-8c0c-17ac407f5340-c000.snappy.parquet

1 directory, 9 files


In [30]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table.vacuum(0)

                                                                                

Deleted 3 files and directories in a total of 1 directories.


DataFrame[]

In [31]:
!tree tmp/some_ppl

tmp/some_ppl
├── _delta_log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── 00000000000000000003.json
│   ├── 00000000000000000004.json
│   ├── 00000000000000000005.json
│   └── 00000000000000000006.json
└── part-00000-40ba36d8-8de6-4728-8c0c-17ac407f5340-c000.snappy.parquet

1 directory, 8 files
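
The `tree` listings above show the lifecycle: deletes add deletion vector files, `REORG ... APPLY (PURGE)` rewrites the data file without the deleted rows, and `vacuum` then removes the old Parquet file and both `.bin` files. A small sketch for checking the same thing programmatically is below. `classify_table_files` is a hypothetical helper, and the demo directory is a synthetic stand-in (placeholder file names) for the layout seen after the second delete.

```python
import tempfile
from collections import Counter
from pathlib import Path


def classify_table_files(table_path):
    """Count commit files, Parquet data files, and deletion vector files under a table directory."""
    counts = Counter()
    for path in Path(table_path).rglob("*"):
        if not path.is_file():
            continue
        if path.parent.name == "_delta_log":
            # Only JSON commits are counted; checksums and checkpoints fall under "other".
            counts["commits" if path.suffix == ".json" else "other"] += 1
        elif path.name.startswith("deletion_vector_") and path.suffix == ".bin":
            counts["deletion_vectors"] += 1
        elif path.suffix == ".parquet":
            counts["data_files"] += 1
        else:
            counts["other"] += 1
    return counts


# Synthetic directory mimicking the layout after the second delete (placeholder names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "_delta_log").mkdir()
    for version in range(4):
        (root / "_delta_log" / f"{version:020d}.json").touch()
    (root / "deletion_vector_a.bin").touch()
    (root / "deletion_vector_b.bin").touch()
    (root / "part-00000-example.snappy.parquet").touch()
    counts = classify_table_files(root)

print(dict(counts))
```

Running `classify_table_files("tmp/some_ppl")` after the vacuum step would be expected to report no deletion vector files, matching the final listing above.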


## Deletion vector interop

In [5]:
df = spark.createDataFrame([("bob", 1), ("li", 2), ("leonard", 3)]).toDF(
    "first_name", "id"
)

In [6]:
delta_path = f"{Path().absolute()}/tmp/some_ppl2"

In [7]:
df.repartition(1).write.format("delta").save(delta_path)

23/09/23 10:57:16 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [8]:
spark.sql(
    f"ALTER TABLE delta.`{delta_path}` SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)"
)

DataFrame[]

In [9]:
delta_table = delta.DeltaTable.forPath(spark, delta_path)

In [10]:
delta_table.delete(F.col("id") == F.lit(1))

In [11]:
delta_table.toDF().show()

+----------+---+
|first_name| id|
+----------+---+
|        li|  2|
|   leonard|  3|
+----------+---+




Switch to the Python `deltalake` package to check whether it can read the table.

In [1]:
from pathlib import Path

import pandas as pd
from deltalake import DeltaTable

In [2]:
delta_path = f"{Path().absolute()}/tmp/some_ppl2"

In [3]:
DeltaTable(delta_path).to_pandas()

DeltaProtocolError: The table's minimum reader version is 3but deltalake only supports up to version 1.

In [4]:
dt = DeltaTable(delta_path)

In [5]:
dt.protocol()

ProtocolVersions(min_reader_version=3, min_writer_version=7)
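
The error above follows from the protocol versions: the table requires reader version 3, while the installed `deltalake` release reported support only up to reader version 1. Under the Delta protocol, reader version 3 and writer version 7 mean the table lists its required features explicitly, and a reader must support every listed reader feature. The sketch below encodes that rule. `can_read` is a hypothetical helper, and the protocol dict is an assumed example shaped like the protocol action of a deletion-vector table, not one read from this table.

```python
def can_read(protocol: dict, max_reader_version: int, supported_features: set) -> bool:
    """Decide whether a reader can open a table, based on its protocol action."""
    min_reader = protocol["minReaderVersion"]
    if min_reader > max_reader_version:
        return False
    if min_reader >= 3:
        # With table features, every listed reader feature must be supported.
        return set(protocol.get("readerFeatures", [])) <= set(supported_features)
    return True


# Assumed protocol action for a table with deletion vectors enabled.
dv_protocol = {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    "readerFeatures": ["deletionVectors"],
    "writerFeatures": ["deletionVectors"],
}

old_reader = can_read(dv_protocol, max_reader_version=1, supported_features=set())
new_reader = can_read(dv_protocol, max_reader_version=3, supported_features={"deletionVectors"})
print(old_reader, new_reader)
```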

## Read the DAT (Delta Acceptance Testing) deletion vectors table

In [9]:
path = "/Users/matthew.powers/data/out/reader_tests/generated/deletion_vectors/delta"

In [10]:
spark.read.format("delta").load(path).show()

24/10/05 16:46:32 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+------+---+----------+
|letter|int|      date|
+------+---+----------+
|     b|228|1978-12-01|
+------+---+----------+



                                                                                