We are excited to announce the release of Delta Lake 1.1.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. The key features in this release are as follows.
Performance improvements in MERGE operation
- On partitioned tables, MERGE operations will automatically repartition the output data before writing to files. This ensures better performance out-of-the-box for both the MERGE operation as well as subsequent read operations.
- On very wide tables (e.g., 1000 or more columns), MERGE operations can be faster because they now avoid quadratic complexity when resolving column names.
Support for passing Hadoop configurations via DataFrameReader/Writer options - You can now set Hadoop FileSystem configurations (e.g., access credentials) via DataFrameReader/Writer options. Earlier the only way to pass such configurations was to set Spark session configuration which would set them to the same value for all reads and writes. Now you can set them to different values for each read and write. See the documentation for more details.
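As a minimal sketch of per-read and per-write Hadoop settings, the helpers below pass an access credential through DataFrameReader/Writer options instead of the Spark session configuration. The option name follows the usual hadoop-azure convention for ADLS Gen2 account keys; the storage account, key, and function names are placeholders chosen for illustration, not part of the release.

```python
from typing import Dict


def adls_account_key_option(storage_account: str, account_key: str) -> Dict[str, str]:
    """Build the Hadoop option that carries an ADLS Gen2 account key.

    Both arguments are placeholders supplied by the caller.
    """
    conf_name = f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"
    return {conf_name: account_key}


def read_delta_with_hadoop_options(spark, path: str, hadoop_options: Dict[str, str]):
    """Read a Delta table, applying the Hadoop settings to this read only."""
    return spark.read.format("delta").options(**hadoop_options).load(path)


def write_delta_with_hadoop_options(df, path: str, hadoop_options: Dict[str, str]) -> None:
    """Append to a Delta table, applying the Hadoop settings to this write only."""
    df.write.format("delta").mode("append").options(**hadoop_options).save(path)
```

Two reads in the same application can then use different credentials by calling `read_delta_with_hadoop_options` with different option dictionaries.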
Support for arbitrary expressions in replaceWhere DataFrameWriter option - Instead of expressions only on partition columns, you can now use arbitrary expressions in the replaceWhere DataFrameWriter option. That is, you can replace arbitrary data in a table directly with DataFrame writes. See the documentation for more details.
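A short sketch of a replaceWhere overwrite on a non-partition column might look like the following. The column name and date range are hypothetical, and the helper names are ours; the DataFrame being written is expected to contain only rows that satisfy the same predicate.

```python
def date_range_predicate(column: str, start: str, end: str) -> str:
    """Return a SQL predicate selecting rows with start <= column < end."""
    return f"{column} >= '{start}' AND {column} < '{end}'"


def overwrite_matching_rows(df, path: str, predicate: str) -> None:
    """Replace only the rows of the Delta table at `path` that match `predicate`."""
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .save(path)
    )
```

For example, `overwrite_matching_rows(january_df, "/tmp/events", date_range_predicate("event_time", "2021-01-01", "2021-02-01"))` would rewrite one month of data and leave the rest of the table untouched.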
Improvements to nested field resolution and schema evolution in MERGE operation on array of structs - When applying the MERGE operation on a target table having a column typed as an array of nested structs, the nested columns between the source and target data are now resolved by name and not by position in the struct. This ensures structs in arrays have a consistent behavior with structs outside arrays. When automatic schema evolution is enabled for MERGE, nested columns in structs in arrays will follow the same evolution rules (e.g., column added if no column by the same name exists in the table) as columns in structs outside arrays. See the documentation for more details.
Support for Generated Columns in MERGE operation - You can now apply MERGE operations on tables having Generated Columns.
Fix for rare data corruption issue on GCS - Experimental GCS support released in Delta Lake 1.0 has a rare bug that can lead to Delta tables being unreadable due to partially written transaction log files. This issue has now been fixed (1, 2).
Fix for the incorrect return object in Python DeltaTable.convertToDelta() - This existing API now returns the correct Python object of type delta.tables.DeltaTable instead of an incorrectly-typed, and therefore unusable, object.
Python type annotations - We have added Python type annotations, which improve auto-completion in editors that support type hints. Optionally, you can enable static checking through mypy or built-in tools (for example, PyCharm tools).
Other notable changes
- Removed support for reading tables with certain special characters in partition column names. See migration guide for details.
- Support for “delta.`path`” in DeltaTable.forName() for consistency with other APIs.
- Improvements to DeltaTableBuilder API introduced in Delta 1.0.0
- Improved support for MERGE/UPDATE/DELETE on temp views.
- Support for setting userMetadata in the commit information when creating or replacing tables.
- Fix for an incorrect analysis exception in MERGE with multiple INSERT and UPDATE clauses and automatic schema evolution enabled.
- Fix for incorrect handling of special characters (e.g. spaces) in paths by MERGE/UPDATE/DELETE operations.
- Fix to prevent Vacuum parallel mode from being affected by Adaptive Query Execution, which is enabled by default in Apache Spark 3.2.
- Fix for earliest valid time travel version.
- Fix for Hadoop configurations not being used to write checkpoints.
- Multiple fixes (1, 2, 3) to Delta Constraints.
Abhishek Somani, Adam Binford, Alex Jing, Alexandre Lopes, Allison Portis, Bogdan Raducanu, Bart Samwel, Burak Yavuz, David Lewis, Eunjin Song, Feng Zhu, Flavio Cruz, Florian Valeye, Fred Liu, Guy Khazma, Jacek Laskowski, Jackie Zhang, Jarred Parrett, JassAbidi, Jose Torres, Junlin Zeng, Junyong Lee, KamCheung Ting, Karen Feng, Lars Kroll, Li Zhang, Linhong Liu, Liwen Sun, Maciej, Max Gekk, Meng Tong, Prakhar Jain, Pranav Anand, Rahul Mahadev, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Shuting Zhang, Tathagata Das, Terry Kim, Tom Lynch, Vijayan Prabhakaran, Vítor Mussa, Wenchen Fan, Yaohua Zhao, Yijia Cui, YuXuan Tay, Yuchen Huo, Yuhong Chen, Yuming Wang, Yuyuan Tang, Zach Schuermann, ericfchang, gurunath
We are excited to announce the release of Delta Lake 1.0.0 on Apache Spark 3.1. The key features in this release are as follows.
NOT MATCHED clauses for merge operations in SQL - With the upgrade to Apache Spark 3.1, the MERGE SQL command now supports any number of WHEN NOT MATCHED clauses (the Scala, Java, and Python APIs have supported unlimited clauses since 0.8.0 on Spark 3.0). See the documentation on MERGE for more details.
New programmatic APIs to create tables - Delta Lake now allows you to directly create new Delta tables programmatically (Scala, Java, and Python) without using DataFrame APIs. We have introduced new DeltaTableBuilder and DeltaColumnBuilder APIs to specify all the table details that you can specify through SQL CREATE TABLE. See the documentation for details and examples.
Experimental support for Generated Columns - Delta Lake now supports Generated Columns which are a special type of columns whose values are automatically generated based on a user-specified function over other columns in the Delta table. You can use most built-in SQL functions in Apache Spark to generate the values of these generated columns. For example, you can automatically generate a date column (for partitioning the table by date) from the timestamp column; any writes into the table need only specify the data for the timestamp column. You can create Delta tables with Generated Columns using the new programmatic APIs to create tables. See the documentation for details.
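The date-from-timestamp example above can be sketched with the Python table builder roughly as follows. The table and column names are hypothetical, and the import is done inside the function only so the snippet can be loaded without a Spark installation.

```python
# Expression used to derive the partition column from the timestamp column.
EVENT_DATE_EXPR = "CAST(eventTime AS DATE)"


def create_events_table(spark, table_name: str = "default.events"):
    """Create a Delta table whose eventDate column is generated from eventTime."""
    from delta.tables import DeltaTable  # requires the delta-spark package

    return (
        DeltaTable.create(spark)
        .tableName(table_name)
        .addColumn("eventId", "BIGINT")
        .addColumn("eventTime", "TIMESTAMP")
        # Writers only supply eventTime; eventDate is computed by Delta Lake.
        .addColumn("eventDate", "DATE", generatedAlwaysAs=EVENT_DATE_EXPR)
        .partitionedBy("eventDate")
        .execute()
    )
```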
Simplified storage configuration - Delta Lake can now automatically load the correct LogStore needed for common storage systems hosting the Delta table being read or written to. Users no longer need to explicitly configure the LogStore implementation if they are running Delta Lake on AWS S3, Azure blob stores, and HDFS. This also allows the same application to simultaneously read and write to Delta tables on different cloud storage systems. The scheme of the Delta table path is used to dynamically load the necessary LogStore implementation. Using storage systems other than the ones listed above still needs explicit configuration. See the documentation on storage configuration for details.
Experimental support for additional cloud storage systems - Delta Lake now has experimental support for Google Cloud Storage, Oracle Cloud Storage, and IBM Cloud Object Storage. You will have to add an additional Maven artifact delta-contribs to access the LogStores corresponding to them, and explicitly configure the LogStore names corresponding to the relevant path schemes. See the documentation on storage configuration for details. In addition, we have also defined a more stable LogStore API for building custom implementations.
Public APIs for catching exceptions due to conflicts - The exceptions thrown on conflict between concurrent operations have now been converted to public APIs. This allows you to catch those exceptions and retry your write operations. See the API documentation for details.
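One way to use these exceptions is a small retry wrapper, sketched below. The wrapper itself is generic and the names are ours; which exception classes to pass is left to the caller. With the Python package we would expect these to be classes from the delta.exceptions module, such as ConcurrentAppendException, but treat that as an assumption and check the API documentation.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_conflict(
    operation: Callable[[], T],
    conflict_types: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Run `operation`, retrying when it raises one of `conflict_types`."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except conflict_types:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last conflict
            time.sleep(backoff_seconds * attempt)  # linear backoff between tries
    raise ValueError("max_attempts must be at least 1")
```

The operation should re-read any table state it depends on at the start of each attempt, since a conflict means the table changed underneath it.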
PyPI release - Delta Lake can now be installed from PyPI with pip install delta-spark. However, along with pip installation, you also have to configure the SparkSession. See the documentation for details.
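A minimal sketch of the SparkSession configuration after a pip installation is shown below. The application name is a placeholder, and the imports are placed inside the function so the snippet loads even where Spark is not installed.

```python
# Session settings that enable Delta Lake's SQL extension and catalog.
DELTA_SESSION_CONFS = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name: str = "delta-example"):
    """Create a SparkSession configured for a pip-installed delta-spark."""
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_SESSION_CONFS.items():
        builder = builder.config(key, value)
    # Adds the matching Delta Lake Maven artifact to the session's packages.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```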
Other notable changes
- New Maven artifact delta-contribs, which contains contributions from the community that are still experimental and need more testing before being packaged in the main artifact.
- Execution time metrics for UPDATE, DELETE, and MERGE operations are available in table history.
- Fixed multiple bugs in schema evolution of nested columns in MERGE operation.
- Fixed bug in handling dots in column names.
In relation to this release, we have also introduced a new Delta Sharing project, an open protocol for secure real-time exchange of large datasets that enables organizations to share data in real time regardless of which computing platforms they use. It is a simple REST protocol that securely shares access to part of a cloud dataset and leverages modern cloud storage systems, such as S3, ADLS, or GCS, to reliably transfer data. See the project repository and the release notes for details.
Alex Ott, Ali Afroozeh, Antonio, Bruno Palos, Burak Yavuz, Christopher Grant, Denny Lee, Gengliang Wang, Guy Khazma, Howard Xiao, Jacek Laskowski, Joe Widen, Jose Torres, Lars Kroll, Linhong Liu, Meng Tong, Prakhar Jain, Pranav Anand, R. Tyler Croy, Rahul Mahadev, Ranu Vikram, Sabir Akhadov, Shixiong Zhu, Stefan Zeiger, Tathagata Das, Tom van Bussel, Vijayan Prabhakaran, Vivek Bhaskar, Wenchen Fan, Yijia Cui, Yingyi Bu, Yuchen Huo, Brenner Heintz, fvaleye, Herman van Hovell, Liwen Sun, Mahmoud Mahdi, Sabir Akhadov, Yaohua Zhao
We are excited to announce the release of Delta Lake 0.8.0, which introduces the following key features.
NOT MATCHED clauses for merge operations in Scala, Java, and Python - Merge operations now support any number of whenNotMatched clauses. In addition, merge queries that unconditionally delete matched rows no longer throw errors on multiple matches. See the documentation for details.
MERGE operation now supports schema evolution of nested columns - Schema evolution of nested columns now has the same semantics as that of top-level columns. For example, new nested columns can be automatically added to a StructType column. See Automatic schema evolution in Merge for details.
MERGE INTO and UPDATE operations now resolve nested struct columns by name - The UPDATE and MERGE INTO commands now resolve nested struct columns by name. That is, when comparing or assigning columns of type StructType, the order of the nested columns does not matter (exactly in the same way as the order of top-level columns). To revert to resolving by position, set the Spark configuration
Check constraints on Delta tables - Delta now supports CHECK constraints. When supplied, Delta automatically verifies that data added to a table satisfies the specified constraint expression. To add CHECK constraints, use the ALTER TABLE ADD CONSTRAINT command. See the documentation for details.
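As a rough illustration, a CHECK constraint can be added from Python by issuing the SQL statement through the session. The statement form used here (ALTER TABLE ... ADD CONSTRAINT ... CHECK (...)) reflects our understanding of the syntax, and the table, constraint name, and expression are hypothetical.

```python
def add_check_constraint_sql(table: str, name: str, expression: str) -> str:
    """Build the SQL statement that adds a named CHECK constraint to a table."""
    return f"ALTER TABLE {table} ADD CONSTRAINT {name} CHECK ({expression})"


def add_check_constraint(spark, table: str, name: str, expression: str) -> None:
    """Add the constraint; later writes that violate it are expected to fail."""
    spark.sql(add_check_constraint_sql(table, name, expression))
```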
Start streaming a table from a specific version (#474) - When using Delta as a streaming source, you can use the startingVersion option to start processing the table from a given version and onwards. You can also set it to latest to skip existing data in the table and stream from the new incoming data. See the documentation for details.
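A streaming read that uses startingVersion could be sketched like this. The function name and default are ours; the version is passed as a string so that both a version number and latest fit the same parameter.

```python
def stream_delta_from(spark, path: str, starting_version: str = "latest"):
    """Open a streaming read on a Delta table from a given table version.

    Pass a version number as a string (e.g. "5"), or "latest" to skip
    existing data and receive only changes that arrive afterwards.
    """
    return (
        spark.readStream.format("delta")
        .option("startingVersion", starting_version)
        .load(path)
    )
```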
Ability to perform parallel deletes with VACUUM (#395) - When using VACUUM, you can set the session configuration “true” in order to use Spark to perform the deletion of files in parallel (based on the number of shuffle partitions). See the documentation for details.
Use Scala implicits to simplify read and write APIs - You can import io.delta.implicits._ to use the delta method with Spark read and write APIs such as spark.read.delta("/my/table/path"). See the documentation for details.
Adam Binford, Alan Jin, Alex liu, Ali Afroozeh, Andrew Fogarty, Burak Yavuz, David Lewis, Gengliang Wang, HyukjinKwon, Jacek Laskowski, Jose Torres, Kian Ghodoussi, Linhong Liu, Liwen Sun, Mahmoud Mahdi, Maryann Xue, Michael Armbrust, Mike Dias, Pranav Anand, Rahul Mahadev, Scott Sandre, Shixiong Zhu, Stephanie Bodoff, Tathagata Das, Wenchen Fan, Wesley Hoffman, Xiao Li, Yijia Cui, Yuanjian Li, Zach Schuermann, contrun, ekoifman, Yi Wu
We are excited to announce the release of Delta Lake 0.7.0 on Apache Spark 3.0. This is the first release on Spark 3.x and adds support for metastore-defined tables and SQL DDLs. The key features in this release are as follows.
Support for defining tables in the Hive metastore (#85) - You can now define Delta tables in the Hive metastore and use the table name in all SQL operations. Specifically, we have added support for:
- SQL DDLs to create tables, insert into tables, explicitly alter the schema of the tables, etc. See the Scala and Python examples for details.
- DeltaTable.forName(tableName) API to create instances of
This integration uses Catalog APIs introduced in Spark 3.0. You must enable the Delta Catalog by setting additional configurations when starting your SparkSession. See the documentation for details.
Support for SQL Delete, Update and Merge - With Spark 3.0, you can now use the SQL DML operations DELETE, UPDATE, and MERGE. See the documentation for details.
Support for automatic and incremental Presto/Athena manifest generation (#453) - You can now use ALTER TABLE SET TBLPROPERTIES to enable automatic regeneration of the Presto/Athena manifest files on every operation on a Delta table. This regeneration is incremental, that is, manifest files are updated for only the partitions that have been updated by the operation. See the documentation for details.
Support for controlling the retention of the table history - You can now use ALTER TABLE SET TBLPROPERTIES to configure how long the table history and deleted files are maintained in Delta tables. See the documentation for details.
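As a sketch, the retention settings can be applied by building the ALTER TABLE statement and running it through the session. The property names delta.logRetentionDuration and delta.deletedFileRetentionDuration are the ones we understand to control history and deleted-file retention; the table name and interval values are placeholders.

```python
def set_retention_sql(
    table: str,
    log_retention: str = "interval 30 days",
    deleted_file_retention: str = "interval 7 days",
) -> str:
    """Build the statement that sets history and deleted-file retention."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{log_retention}', "
        f"'delta.deletedFileRetentionDuration' = '{deleted_file_retention}')"
    )


def set_retention(spark, table: str, **intervals: str) -> None:
    """Apply the retention settings to `table` using the given SparkSession."""
    spark.sql(set_retention_sql(table, **intervals))
```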
Support for adding user-defined metadata in Delta table commits - You can now add user-defined metadata as strings in commits made to a Delta table by any operation. For DataFrame.writeStream operations, you can set the option userMetadata. For other operations, you can set the SparkSession configuration spark.databricks.delta.commitInfo.userMetadata. See the documentation for details.
Support Azure Data Lake Storage Gen2 (#288) - Spark 3.0 has support for Hadoop 3.2 libraries which enables support for Azure Data Lake Storage Gen2. See the documentation for details on how to configure Delta Lake with the correct versions of Spark and Hadoop libraries for Azure storage systems.
Improved support for streaming one-time triggers - With Spark 3.0, we now ensure that one-time trigger (also known as Trigger.Once) processes all outstanding data in a Delta table in a single micro-batch even if rate limits are set with the
Due to the significant internal changes, workloads on previous versions of Delta using the DeltaTable programmatic APIs may require additional changes to migrate to 0.7.0. See the Migration Guide for details.
Alan Jin, Alex Ott, Burak Yavuz, Jose Torres, Pranav Anand, QP Hou, Rahul Mahadev, Rob Kelly, Shixiong Zhu, Subhash Burramsetty, Tathagata Das, Wesley Hoffman, Yin Huai, Youngbin Kim, Zach Schuermann, Eric Chang, Herman van Hovell, Mahmoud Mahdi
We are excited to announce the release of Delta Lake 0.6.1, which fixes a few critical bugs in merge operation and operation metrics. If you are using version 0.6.0, it is strongly recommended that you upgrade to version 0.6.1. The details of the fixed bugs are as follows:
Invalid MERGE INTO AnalysisExceptions (#419) - A couple of bugs related to merge operation were causing analysis errors in 0.6.0 on previously supported merge queries.
- Fixing one of these bugs required reverting a minor change to the DeltaTable 0.6.0 API. In 0.6.1 (similar to 0.5.0), if the table’s schema has changed since the creation of the DeltaTable instance, DeltaTable.toDF() does not return a DataFrame with the latest schema. In such scenarios, you must recreate the DeltaTable instance for it to recognize the latest schema.
Incorrect operations metrics in history - 0.6.0 reported an incorrect number of rows processed during Update and Delete. This is fixed in 0.6.1.
Alan Jin, Jose Torres, Rahul Mahadev, Tathagata Das
We are excited to announce the release of Delta Lake 0.6.0, which introduces schema evolution and performance improvements in merge, and operation metrics in table history. The key features in this release are:
Support for schema evolution in merge operations (#170) - You can now automatically evolve the schema of the top-level columns of a Delta table with the merge operation. This is useful in scenarios where you want to upsert change data into a table and the schema of the data changes over time. Instead of detecting and applying schema changes before upserting, merge can simultaneously evolve the schema and upsert the changes. See the documentation for details.
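A hedged sketch of an upsert that also evolves the schema is shown below. It assumes automatic schema evolution is switched on through the spark.databricks.delta.schema.autoMerge.enabled session configuration and uses the Python merge API; the function name, path, and join condition are placeholders.

```python
AUTO_MERGE_CONF = "spark.databricks.delta.schema.autoMerge.enabled"


def upsert_with_schema_evolution(spark, target_path: str, source_df, join_condition: str) -> None:
    """Upsert `source_df` into the Delta table at `target_path`.

    With automatic schema evolution enabled, new top-level columns in the
    source are expected to be added to the target as part of the merge.
    """
    from delta.tables import DeltaTable  # requires the Delta Lake Python package

    spark.conf.set(AUTO_MERGE_CONF, "true")
    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(source_df.alias("s"), join_condition)
        .whenMatchedUpdateAll()      # update every column of matched rows
        .whenNotMatchedInsertAll()   # insert source rows that have no match
        .execute()
    )
```

A call might look like `upsert_with_schema_evolution(spark, "/tmp/customers", changes_df, "t.id = s.id")`.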
Improved merge performance with automatic repartitioning (#349) - When merging into partitioned tables, you can choose to automatically repartition the data by the partition columns before writing to the table. In cases where the merge operation on a partitioned table is slow because it generates too many small files (#345), enabling automatic repartition can improve performance. See the documentation for details.
Improved performance when there is no insert clause (#342) - You can now get better performance in a merge operation if it does not have any insert clause.
Operation metrics in DESCRIBE HISTORY (#312) - You can now see operation metrics (for example, number of files and rows changed) for all writes, updates, and deletes on a Delta table in the table history. See the documentation for details.
Support for reading Delta tables from any file system (#347) - You can now read Delta tables on any storage system with a Hadoop FileSystem implementation. However, writing to Delta tables still requires configuring a LogStore implementation that gives the necessary guarantees on the storage system. See the documentation for details.
Ali Afroozeh, Andrew Fogarty, Anurag870, Burak Yavuz, Erik LaBianca, Gengliang Wang, IonutBoicuAms, Jakub Orłowski, Jose Torres, KevinKarlBob, Michael Armbrust, Pranav Anand, Rahul Govind, Rahul Mahadev, Shixiong Zhu, Steve Suh, Tathagata Das, Timothy Zhang, Tom van Bussel, Wesley Hoffman, Xiao Li, chet, Eugene Koifman, Herman van Hovell, hongdd, lswyyy, lys0716, Mahmoud Mahdi, Maryann Xue
We are excited to announce the release of Delta Lake 0.5.0, which introduces Presto/Athena support and improved concurrency. The key features in this release are:
Support for other processing engines using manifest files (#76) - You can now query Delta tables from Presto and Amazon Athena using manifest files, which you can generate using Scala, Java, Python and SQL APIs. See the documentation for details.
Improved concurrency for all Delta Lake operations (#9, #72, #228) - You can now run more Delta Lake operations concurrently. Delta Lake’s optimistic concurrency control has been improved by making conflict detection more fine-grained. This makes it easier to run complex workflows on Delta tables. For example:
- Running deletes (e.g. for GDPR compliance) concurrently on older partitions while newer partitions are being appended.
- Running updates and merges concurrently on disjoint sets of partitions.
- Running file compactions concurrently with appends (see below).
See the documentation on concurrency control for more details.
Improved support for file compaction (#146) - You can now compact files by rewriting them with the false. This option allows a compaction operation to run concurrently with other batch and streaming operations. See this example in the documentation for details.
Improved performance for insert-only merge (#246) - Delta Lake now provides more optimized performance for merge operations that have only insert clauses and no update clauses. Furthermore, Delta Lake ensures that writes from such insert-only merges only append new data to the table. Hence, you can now use Structured Streaming and insert-only merges to do continuous deduplication of data (e.g. logs). See this example in the documentation for details.
Experimental support for Snowflake and Redshift Spectrum - You can now query Delta tables from Snowflake and Redshift Spectrum. This support is considered experimental in this release. See the documentation for details.
Andreas Neumann, Andrew Fogarty, Burak Yavuz, Denny Lee, Fabio B. Silva, JassAbidi, Matthew Powers, Mukul Murthy, Nicolas Paris, Pranav Anand, Rahul Mahadev, Reynold Xin, Shixiong Zhu, Tathagata Das, Tomas Bartalos, Xiao Li
Thank you for your contributions.
We are excited to announce the release of Delta Lake 0.4.0 which introduces Python APIs for manipulating and managing data in Delta tables. The key features in this release are:
Python APIs for DML and utility operations (#89) - You can now use Python APIs to update/delete/merge data in Delta Lake tables and to run utility operations (i.e., vacuum, history) on them. These are great for building complex workloads in Python, e.g., Slowly Changing Dimension (SCD) operations, merging change data for replication, and upserts from streaming queries. See the documentation for more details.
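The following is a small sketch of the Python APIs for a delete followed by the history and vacuum utilities. The path, predicate, and retention period are placeholders, and the function name is ours.

```python
# Retention period passed to vacuum, in hours (7 days).
DEFAULT_RETENTION_HOURS = 7 * 24


def delete_then_clean_up(spark, path: str, delete_predicate: str,
                         retention_hours: float = DEFAULT_RETENTION_HOURS) -> None:
    """Delete matching rows, show recent commits, then vacuum stale files."""
    from delta.tables import DeltaTable  # requires the Delta Lake Python package

    table = DeltaTable.forPath(spark, path)
    table.delete(delete_predicate)   # remove rows matching the SQL predicate
    table.history(5).show()          # print the five most recent commits
    table.vacuum(retention_hours)    # drop unreferenced files older than the period
```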
Convert-to-Delta (#78) - You can now convert a Parquet table in place to a Delta Lake table without rewriting any of the data. This is great for converting very large Parquet tables which would be costly to rewrite as a Delta table. Furthermore, this process is reversible - you can convert a Parquet table to Delta Lake table, operate on it (e.g., delete or merge), and easily convert it back to a Parquet table. See the documentation for more details.
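An in-place conversion might be sketched in Python as follows. The path is a placeholder, and the optional partition schema (for example "year INT, month INT") is only needed when the Parquet table is partitioned.

```python
from typing import Optional


def parquet_identifier(path: str) -> str:
    """Return the table identifier form used to refer to a Parquet directory."""
    return f"parquet.`{path}`"


def convert_parquet_to_delta(spark, path: str, partition_schema: Optional[str] = None):
    """Convert the Parquet table at `path` to Delta without rewriting data files."""
    from delta.tables import DeltaTable  # requires the Delta Lake Python package

    if partition_schema is None:
        return DeltaTable.convertToDelta(spark, parquet_identifier(path))
    return DeltaTable.convertToDelta(spark, parquet_identifier(path), partition_schema)
```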
SQL for utility operations - You can now use SQL to run utility operations vacuum and history. See the documentation for more details on how to configure Spark to execute these Delta-specific SQL commands.
To try out Delta Lake 0.4.0, please follow the Delta Lake Quickstart.
We are excited to announce the availability of Delta Lake 0.3.0 which introduces new programmatic APIs for manipulating and managing data in Delta Lake tables. Here are the main features:
Scala/Java APIs for DML commands - You can now modify data in Delta Lake tables using programmatic APIs for Delete (#44), Update (#43) and Merge (#42). These APIs mirror the syntax and semantics of their corresponding SQL commands and are great for many workloads, e.g., Slowly Changing Dimension (SCD) operations, merging change data for replication, and upserts from streaming queries. See the documentation for more details.
Scala/Java APIs for query commit history (#54) - You can now query a table’s commit history to see what operations modified the table. This enables you to audit data changes, run time travel queries on specific versions, and debug and recover data from accidental deletions. See the documentation for more details.
Scala/Java APIs for vacuuming old files (#48) - Delta Lake uses MVCC to enable snapshot isolation and time travel. However, keeping all versions of a table forever can be prohibitively expensive. Stale snapshots (as well as other uncommitted files from aborted transactions) can be garbage collected by vacuuming the table. See the documentation for more details.
To try out Delta Lake 0.3.0, please follow the Delta Lake Quickstart.
We are delighted to announce the availability of Delta Lake 0.2.0!
To try out Delta Lake 0.2.0, please follow the Delta Lake Quickstart.
This release introduces two main features:
Cloud storage support - In addition to HDFS, you can now configure Delta Lake to read and write data on cloud storage services such as Amazon S3 (issue #39) and Azure Blob Storage (issue #40). See here for configuration instructions.
Improved concurrency (issue #69) - Delta Lake now allows concurrent append-only writes while still ensuring serializability. To be considered as append-only, a writer must be only adding new data without reading or modifying existing data in any way. See here for more details.
We have also greatly expanded the test coverage as part of this release.