
Support for setting dataChange=false #146

Open
marmbrus opened this issue Aug 30, 2019 · 3 comments

Comments

@marmbrus (Contributor) commented Aug 30, 2019

The Delta transaction protocol allows entries in the transaction log to be marked with dataChange=false, indicating that they only rearrange data that is already part of the table. This is powerful because it lets you perform compaction and other read-performance optimizations without breaking the ability to use a Delta table as a streaming source. We should expose this as a DataFrame writer option for overwrites.
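For illustration, here is a rough sketch of what such a writer option could look like from the user's side. The option name dataChange is an assumption based on the protocol field, not a confirmed API:

// Hypothetical usage sketch: compact a table without signalling a data change.
// The "dataChange" option name mirrors the protocol field and is an assumption
// here; this issue proposes adding it, it is not yet part of the writer API.
spark.read
  .format("delta")
  .load("/some/path/data")
  .repartition(10)
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false") // mark the add/remove actions as dataChange=false
  .save("/some/path/data")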

@btbbass commented Aug 31, 2019

Thanks, that is exactly what I meant.

I can confirm that in the compaction test I ran on existing data, the transaction log entries contained dataChange=true.

@MrPowers commented Sep 3, 2019

@btbbass - after compacting a Delta data lake, should the transaction log contain dataChange=true for the files that are added, the files that are removed, or both?

Here's the current behavior. Suppose we have a Delta data lake with 1,000 files and run this code:

// Read the existing Delta table
val df = spark
  .read
  .format("delta")
  .load("/some/path/data")

// Compact it into 10 files by overwriting the same path
df
  .repartition(10)
  .write
  .format("delta")
  .mode("overwrite")
  .save("/some/path/data")

The _delta_log/00000000000000000001.json file will contain 10 rows like this:

{
  "add":{
    "path":"part-00000-compacted-data-c000.snappy.parquet",
    "partitionValues":{

    },
    "size":123456,
    "modificationTime":1567528453000,
    "dataChange":true
  }
}

The _delta_log/00000000000000000001.json file will also contain 1,000 rows like this:

{
  "remove":{
    "path":"part-00154-some-stuff-c000.snappy.parquet",
    "deletionTimestamp":1567528580934,
    "dataChange":true
  }
}
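
For anyone who wants to verify this locally, one way to inspect the actions in a commit file is to read it as line-delimited JSON with Spark. A minimal sketch, assuming the table path from the example above:

// Minimal sketch: count the dataChange flags in a Delta commit file.
// Assumes the table path used above; each line in the commit file is one
// JSON action (add, remove, commitInfo, ...).
val log = spark.read.json("/some/path/data/_delta_log/00000000000000000001.json")

// How many added files are flagged as a data change?
log.filter("add is not null").groupBy("add.dataChange").count().show()

// And the removed files?
log.filter("remove is not null").groupBy("remove.dataChange").count().show()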
@btbbass commented Sep 12, 2019

I think the expected result is dataChange=false for the 10 new files: no data has changed, the underlying data has only been reorganised into fewer files.

The effect we want is that a streaming job that uses the Delta table as a source, and has already processed the data in the 1,000 files, should not reprocess data that is merely being reorganised. A sketch of such a job follows.
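For concreteness, a minimal sketch of the kind of streaming job this affects; the checkpoint and output paths are placeholders:

// Minimal sketch of a streaming consumer of the Delta table.
// Paths are placeholders. With dataChange=true, the compacted commit looks
// like a change to the table's data, which breaks this query's append-only
// assumption; with dataChange=false the rewrite would be invisible to it.
val stream = spark.readStream
  .format("delta")
  .load("/some/path/data")

stream.writeStream
  .format("delta")
  .option("checkpointLocation", "/some/path/checkpoint") // placeholder
  .start("/some/path/output")                            // placeholder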
