Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SymlinkTextInputFormat Manifest Generation for Presto/Athena read support #76

Closed
marmbrus opened this issue Jun 21, 2019 · 24 comments
Closed
Labels
Milestone

Comments

@marmbrus
Copy link
Contributor

@marmbrus marmbrus commented Jun 21, 2019

Delta's use of MVCC causes external readers (presto, hive, etc) to see inconsistent or duplicate data. The SymlinkTextInputFormat allows these systems to read a set of manifest files, each containing a list of file paths, in order to determine which data files to read (rather that listing the files present on the file system).

This issue tracks adding support for generating these manifest files from the Delta transaction log, automatically after each commit.

This feature would give support for reading Delta tables from Presto. Hive would require some addition work on the Hive side, as Hive does not use the file extension to determine the final InputFormat to use to decode the data (and as such interprets the files incorrectly as text).

@kkr78

This comment has been minimized.

Copy link

@kkr78 kkr78 commented Aug 5, 2019

@marmbrus is this feature available in 0.3.0 release? If not, will it be available anytime soon?

@kkr78

This comment has been minimized.

Copy link

@kkr78 kkr78 commented Aug 5, 2019

Let me know if I can be of any help developing any features for this project. is this something need to be ported from data bricks?

@millecker

This comment has been minimized.

Copy link

@millecker millecker commented Oct 15, 2019

@marmbrus Any updates on this issue?

@kkr78

This comment has been minimized.

Copy link

@kkr78 kkr78 commented Oct 15, 2019

@marmbrus if we generate the manifest based on Delta transaction log of the last version, does that works with Athena? My understanding is the last version should have all the files that required to create the manifest. I m banking on this approach until its available in Delta Lake.

@marmbrus

This comment has been minimized.

Copy link
Contributor Author

@marmbrus marmbrus commented Oct 21, 2019

Yes, if you generate a manifest using the transaction log that should work with Athena.

@kkr78

This comment has been minimized.

Copy link

@kkr78 kkr78 commented Oct 22, 2019

@marmbrus thanks for confirming

@FabioBatSilva FabioBatSilva mentioned this issue Oct 31, 2019
3 of 6 tasks complete
@tdas tdas added this to the 0.5.0 milestone Nov 5, 2019
@tdas tdas pinned this issue Nov 5, 2019
@tdas tdas changed the title Support for SymlinkTextInputFormat Manifest Generation Support for SymlinkTextInputFormat Manifest Generation (Presto/Athena read support) Nov 5, 2019
@tdas

This comment has been minimized.

Copy link
Contributor

@tdas tdas commented Nov 5, 2019

We are working on pushing our existing manifest generation code and tests to OSS..

@kkr78

This comment has been minimized.

Copy link

@kkr78 kkr78 commented Nov 5, 2019

Right now, I m developing the code to generate that manifest. I will probably use it temporarily until its available in Delta Lake. please let me know if you have any ETA.

Some of our Delta Lake tables got partition columns. do you know if the manifest generation works w/ partition columns as well?

@tdas

This comment has been minimized.

Copy link
Contributor

@tdas tdas commented Nov 5, 2019

@kkr78 yeah, our manifest generation will work with partition columns. for partitioned tables, the manifest files are itself partitioned by the same columns, so query (at least in Presto/Athena) will read manifests for only the partitions that it will query.

@ChethanUK

This comment has been minimized.

Copy link

@ChethanUK ChethanUK commented Nov 11, 2019

work with Athena.

So, Today we can't query in Presto (PrestoDB or PrestoSQL), right?

When #232 is resolved, presto can query delta lake?

@FabioBatSilva

This comment has been minimized.

Copy link

@FabioBatSilva FabioBatSilva commented Nov 12, 2019

For anyone looking for a workaround this is what i'm using for now : https://gist.github.com/FabioBatSilva/d6b168a01cf4ba991d9e77881c5ea1e5

@cozos

This comment has been minimized.

Copy link

@cozos cozos commented Nov 12, 2019

@FabioBatSilva So we can use SymlinkManifestWriter right now to make Presto compatible manifest files? ;)

@FabioBatSilva

This comment has been minimized.

Copy link

@FabioBatSilva FabioBatSilva commented Nov 12, 2019

@FabioBatSilva So we can use SymlinkManifestWriter right now to make Presto compatible manifest files? ;)

@cozos Yes, That is what i'm using to get data into AWS Athena.
But keep in mind this is something I've put together in a few hours to test out Delta Lake + Athena.

@cozos

This comment has been minimized.

Copy link

@cozos cozos commented Nov 12, 2019

@FabioBatSilva Do you mind telling me about the manifest workflow? Do you generate a new manifest file everytime the Delta Transaction Log is modified?

@kkr78

This comment has been minimized.

Copy link

@kkr78 kkr78 commented Nov 12, 2019

@cozos See the sample python code below that I developed to generate manifest based on Delta Transaction Log. Basically the code traverses through all the JSON files in _delta_log directory, identifies the list of parquet files and writes them to symlink text. The manifest is regenerated when the Transaction Log is modified. This code only works on s3.

Disclaimer: This code is not production-ready, not fully tested, this is just to give an idea of how it can be implemented.

https://github.com/kkr78/delta-util

The Athena table points to _symlink directory where symlink.txt files are located. For partitions, generate a symlink.txt file for each partition.

@tdas

This comment has been minimized.

Copy link
Contributor

@tdas tdas commented Nov 13, 2019

Hey all, we are open-sourcing our Databricks Delta's manifest generation code in this PR #250
This PR contains the core functionality to generate the manifest, but not the public APIs yet. Future PRs are going to add Scala, Python and SQL APIs to generate the manifest.

Thank you, everyone, for showing so much interest in Delta Lake and its compatibility with other engines.

@seddonm1

This comment has been minimized.

Copy link

@seddonm1 seddonm1 commented Nov 13, 2019

@tdas is this PR sufficient scope for a point release?

@tdas

This comment has been minimized.

Copy link
Contributor

@tdas tdas commented Nov 13, 2019

By "point release", do you mean patch release ... as in 0.4.1?
The current plan is to release it in 0.5.0 which is expected to be some time in december.

@seddonm1

This comment has been minimized.

Copy link

@seddonm1 seddonm1 commented Nov 13, 2019

Hi. Sorry I did not see the 7 Dec milestone for the 0.5.0 point release.

@tdas tdas changed the title Support for SymlinkTextInputFormat Manifest Generation (Presto/Athena read support) SymlinkTextInputFormat Manifest Generation for Presto/Athena read support Nov 15, 2019
mukulmurthy added a commit that referenced this issue Nov 18, 2019
…tion for Presto/Athena support

## What changes were proposed in this pull request?

This PR is the first in the sequence of PRs to add manifest file generation (SymlinkInputFormat) to OSS Delta for Presto/Athena read support (issue #76). Specifically, this PR adds the core functionality for manifest generation and rigorous tests to verify the contents of the manifest. Future PRs will add the public APIs for on-demand generation.

- Added post-commit hooks to run tasks after a successful commit.

- Added GenerateSymlinkManifest implementation of post-commit hook to generate the manifests.
  - Each manifest contains the name of data files to read for querying the whole table or partition
  - Non-partitioned table produces a single manifest file containing all the data files.
  - Partitioned table produces partitioned manifest files; same partition structured like the table, each partition directory containing one manifest file containing data files of that partition. This allows Presto/Athena partition-pruned queries to read only manifest files of the necessary partitions.
  - Each attempt to generate manifest will atomically (as much as possible) overwrite the manifest files in the directories (if they exist) and also delete manifest files of partitions that have been deleted from the table.

Closes #250

Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Co-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#6910 is resolved by tdas/SC-25511.

GitOrigin-RevId: a3e04f2fcdafb6ac29c3adcfb791a3d0611583dc
@tdas

This comment has been minimized.

Copy link
Contributor

@tdas tdas commented Nov 20, 2019

The core manifest generation code for symlink style manifest generation has been merged in b18ffba We are currently working on the Scala/Python/SQL APIs for the manifest generation.

@hudsondba

This comment has been minimized.

Copy link

@hudsondba hudsondba commented Nov 25, 2019

If someone would like to compile the project to test into the AWS Athena, the command to export the Symlink is:

`import org.apache.spark.sql.delta._
import org.apache.spark.sql.delta.hooks.GenerateSymlinkManifest

val deltaLog = DeltaLog.forTable(spark, "s3a://....")

GenerateSymlinkManifest.generateFullManifest(spark, deltaLog)`

Table syntax:
CREATE EXTERNAL TABLE .....( ....) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://.../_symlink_format_manifest/

@tdas

This comment has been minimized.

Copy link
Contributor

@tdas tdas commented Nov 26, 2019

This is the PR for public APIs for manifest generation - #262

@joakibo

This comment has been minimized.

Copy link

@joakibo joakibo commented Dec 2, 2019

Great stuff, looking forward to getting this and #262 in 0.5.0!

@tdas

This comment has been minimized.

Copy link
Contributor

@tdas tdas commented Dec 11, 2019

The public APIs were committed in 5b3e3eb

@tdas tdas closed this Dec 11, 2019
@zsxwing zsxwing unpinned this issue Jan 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
10 participants
You can’t perform that action at this time.