An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Latest commit 956ffc7 Jun 12, 2020

README.md

Delta Lake Logo

Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.

See the Delta Lake Documentation for details.

See the Quick Start Guide to get started with Scala, Java and Python.

Latest Binaries

Maven

To use Delta Lake in a Maven project, add it as a dependency in your POM file. Delta Lake is cross-compiled with Scala versions 2.11 and 2.12; choose the artifact that matches your project's Scala version. If you are writing a Java project, you can use either version.

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>0.6.1</version>
</dependency>

SBT

To use Delta Lake in an SBT project, add the following line to your build.sbt file:

libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"

API Documentation

Compatibility

Compatibility with Apache Spark Versions

Delta Lake currently requires Apache Spark 2.4.2. Earlier versions are missing SPARK-27453, which breaks the partitionBy clause of the DataFrameWriter.

API Compatibility

The only stable public APIs currently provided by Delta Lake are those exposed through the DataFrameReader/Writer (i.e., spark.read, df.write, spark.readStream and df.writeStream). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).
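As an illustration of the batch side of these APIs, the following is a minimal Python sketch that writes a small DataFrame in the `delta` format and reads it back. The function name `write_then_read` is hypothetical, and the sketch assumes the caller passes in an existing SparkSession that already has the Delta Lake package available.

```python
def write_then_read(spark, path):
    """Write a tiny DataFrame as a Delta table at `path`, then read it back.

    Assumptions: `spark` is an existing SparkSession with the Delta Lake
    package on its classpath, and `path` is a location it can write to.
    """
    # Build a five-row DataFrame with a single `id` column.
    df = spark.range(0, 5)

    # Stable writer entry point: df.write with the "delta" format.
    df.write.format("delta").mode("overwrite").save(path)

    # Stable reader entry point: spark.read with the "delta" format.
    return spark.read.format("delta").load(path)
```

The streaming entry points (spark.readStream and df.writeStream) take the same `format("delta")` call.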

All other interfaces in this library are considered internal, and they are subject to change across minor/patch releases.

Data Storage Compatibility

Delta Lake guarantees backward compatibility for all Delta Lake tables (i.e., newer versions of Delta Lake will always be able to read tables written by older versions of Delta Lake). However, we reserve the right to break forward compatibility as new features are introduced to the transaction protocol (i.e., an older version of Delta Lake may not be able to read a table produced by a newer version).

Breaking changes in the protocol are indicated by incrementing the minimum reader/writer version in the Protocol action.
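To make the version check concrete, here is a small Python sketch of how a client might compare its own supported versions against the minimums recorded for a table. The class and function names, and the client version constants, are hypothetical and are not part of the Delta Lake API; the sketch also assumes that a writer needs read support as well, since it has to read the log before committing.

```python
from dataclasses import dataclass

# Hypothetical: the highest protocol versions this imaginary client implements.
CLIENT_READER_VERSION = 1
CLIENT_WRITER_VERSION = 2


@dataclass(frozen=True)
class Protocol:
    """The two minimum versions carried by a table's Protocol action."""
    min_reader_version: int
    min_writer_version: int


def can_read(table: Protocol, client_reader: int = CLIENT_READER_VERSION) -> bool:
    """A client may read only if it implements at least the table's reader version."""
    return client_reader >= table.min_reader_version


def can_write(
    table: Protocol,
    client_reader: int = CLIENT_READER_VERSION,
    client_writer: int = CLIENT_WRITER_VERSION,
) -> bool:
    """Assumption of this sketch: writing requires read support as well."""
    return can_read(table, client_reader) and client_writer >= table.min_writer_version


if __name__ == "__main__":
    older_table = Protocol(min_reader_version=1, min_writer_version=2)
    newer_table = Protocol(min_reader_version=2, min_writer_version=3)
    print(can_read(older_table), can_write(older_table))  # True True
    print(can_read(newer_table), can_write(newer_table))  # False False
```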

Roadmap

Delta Lake is a recent open-source project based on technology developed at Databricks. We plan to open-source all APIs that are required to correctly run Spark programs that read and write Delta tables. For a detailed timeline of this effort, see the project roadmap.

Building

Delta Lake Core is compiled using SBT.

To compile, run

build/sbt compile

To generate artifacts, run

build/sbt package

To execute tests, run

build/sbt test

Refer to SBT docs for more commands.

Transaction Protocol

The Delta Transaction Log Protocol document provides a specification of the transaction protocol.

Requirements for Underlying Storage Systems

Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, we require the storage system to provide the following.

  1. Atomic visibility: There must be a way for a file to be visible in its entirety or not visible at all.
  2. Mutual exclusion: Only one writer must be able to create (or rename) a file at the final destination.
  3. Consistent listing: Once a file has been written in a directory, all future listings for that directory must return that file.
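The first two requirements can be illustrated on a local POSIX-style file system. The sketch below is not how Delta Lake's LogStore implementations work; it is a hedged stand-alone example with a hypothetical function name, `commit_exclusively`, that writes the full payload to a temporary file and then hard-links it to the final name, relying on the fact that `os.link` fails if the target already exists.

```python
import os
import tempfile


def commit_exclusively(directory: str, name: str, data: bytes) -> bool:
    """Try to publish `data` as `directory/name`; return False if it already exists.

    Atomic visibility: the payload is completely written to a temporary file
    before the final name is created, so readers never see a partial file.
    Mutual exclusion: os.link raises FileExistsError when the final name is
    already taken, so only one writer can win.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before publishing
        try:
            os.link(tmp_path, os.path.join(directory, name))
        except FileExistsError:
            return False  # another writer created the file first
        return True
    finally:
        os.unlink(tmp_path)  # the temporary name is no longer needed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        print(commit_exclusively(log_dir, "00000000000000000000.json", b"first"))   # True
        print(commit_exclusively(log_dir, "00000000000000000000.json", b"second"))  # False
```

Object stores generally do not offer an equivalent of this hard-link step, which is one reason storage-specific handling is needed.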

Given that storage systems do not necessarily provide all of these guarantees out-of-the-box, Delta Lake transactional operations typically go through the LogStore API instead of accessing the storage system directly. We can plug in custom LogStore implementations in order to provide the above guarantees for different storage systems. Delta Lake has built-in LogStore implementations for HDFS, Amazon S3 and Azure storage services. Please see Delta Lake Storage Configuration for more details. If you are interested in adding a custom LogStore implementation for your storage system, you can start discussions in the community mailing group.

As an optimization, storage systems can also allow partial listing of a directory, given a start marker. Delta Lake can use this ability to efficiently discover the latest version of a table, without listing all of the files in the transaction log.
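The following Python sketch shows the idea behind listing from a start marker. It relies on the naming scheme described in the transaction protocol document, where commit files are named by zero-padded 20-digit version number so that lexicographic order matches version order. The functions `commit_name`, `list_from` and `latest_version` are hypothetical, and an in-memory sorted list stands in for the storage system's listing call.

```python
from bisect import bisect_left
from typing import List, Optional


def commit_name(version: int) -> str:
    """File name of a commit: the version zero-padded to 20 digits, plus .json."""
    return f"{version:020d}.json"


def list_from(sorted_names: List[str], start_marker: str) -> List[str]:
    """Stand-in for a storage listing that returns only names >= start_marker."""
    return sorted_names[bisect_left(sorted_names, start_marker):]


def latest_version(sorted_names: List[str], known_version: int) -> Optional[int]:
    """Find the newest commit, scanning only from an already known version."""
    tail = list_from(sorted_names, commit_name(known_version))
    versions = []
    for name in tail:
        stem, dot, ext = name.partition(".")
        # Keep only plain commit files such as 00000000000000000011.json.
        if ext == "json" and stem.isdigit():
            versions.append(int(stem))
    return max(versions, default=None)


if __name__ == "__main__":
    log = sorted(commit_name(v) for v in range(12))
    # Only the entries from version 10 onward are examined.
    print(latest_version(log, known_version=10))  # 11
```

With a known recent version as the marker, the scan touches only the tail of the log rather than every commit file.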

Concurrency Control

Delta Lake ensures serializability for concurrent reads and writes. Please see Delta Lake Concurrency Control for more details.

Reporting issues

We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.

Contributing

We welcome contributions to Delta Lake. See our CONTRIBUTING.md for more details.

License

Apache License 2.0, see LICENSE.

Community

There are two mediums of communication within the Delta Lake community.