delta (parquet) format #13

Closed
rambabu-posa opened this issue Apr 26, 2019 · 9 comments
Labels
question Questions on how to use Delta Lake

Comments

rambabu-posa commented Apr 26, 2019

Hi Delta team,

I tried Delta, and it's interesting. I have a few questions.

Even though we use the "delta" format, the underlying format is "parquet". So is it possible to use this Spark Delta format to read my existing Parquet data that was written without Delta?

Why does Delta support only Parquet? Why not the other Spark-supported formats? Do you plan to support them in the future?

I'm able to read and write data from and to this Delta Lake. Is it possible to see that data in the Delta Lake, just like we can see HDFS data from the Hue UI?

Many thanks,
Ram

tdas (Contributor) commented Apr 26, 2019

@rambabu-posa Let me try to answer each question one by one

Is it possible to use this Spark Delta format to read my existing Parquet data that was written without Delta?

It is not possible directly, because the Delta format relies on the transaction log being present, and a Parquet table obviously does not have that log. So an attempt to read a Parquet table using the delta format will throw an error. That said, Managed Delta Lake already has a CONVERT command that can convert a Parquet table into a Delta table in place by writing a new transaction log inside the same directory. We are hoping that we can eventually bring that command to open source.
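
For illustration, a minimal sketch of both points, assuming an active SparkSession `spark` (e.g. in spark-shell) and a hypothetical path; the CONVERT syntax shown is the one documented for Managed Delta Lake and is not part of the open-source release at this point:

```scala
// Reading a plain Parquet directory with the "delta" format fails, because
// there is no _delta_log transaction log in that directory:
val df = spark.read.format("delta").load("/data/existing_parquet_table")
// => throws an error along the lines of "... is not a Delta table"

// Managed Delta Lake's CONVERT command (SQL) writes a transaction log into
// the same directory, converting the table in place:
//   CONVERT TO DELTA parquet.`/data/existing_parquet_table`

// After conversion, the same directory can be read as a Delta table:
val events = spark.read.format("delta").load("/data/existing_parquet_table")
```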

Why does Delta support only Parquet? Why not the other Spark-supported formats? Do you plan to support them in the future?

As of now, we support only the Parquet format so that Delta Lake users get the maximum benefit of Parquet data skipping, etc. when querying the Delta Lake. We may make this configurable in the future. But really, as a Delta Lake user, you should not have to worry about what the underlying file format is: you query the table using the "delta" format and you still get the maximum benefit of partition pruning, data skipping, etc.
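
As a sketch of what that looks like from the user's side (the `events` DataFrame, the `date` column, and the path are hypothetical):

```scala
// Write a partitioned Delta table; the data files underneath happen to be
// Parquet, but the writer only ever specifies "delta".
events.write
  .format("delta")
  .partitionBy("date")
  .save("/delta/events")

// Read it back through the "delta" format. A filter on the partition column
// is pruned, and file-level statistics allow data skipping, without the
// reader referring to Parquet directly.
val recent = spark.read
  .format("delta")
  .load("/delta/events")
  .where("date >= '2019-04-01'")
```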

I'm able to read and write data from and to this Delta Lake. Is it possible to see that data in the Delta Lake, just like we can see HDFS data from the Hue UI?

When you say "see that data", do you mean visually see it in Hue? I am not that familiar with the Hue UI, so I am not sure how to answer this. But my guess would be that it depends on how Hue detects which format a directory uses.

@tdas added the "question" label (Questions on how to use Delta Lake) Apr 26, 2019
@hkak03key

I have a question about the format, too.

Currently "delta" uses snappy for compression, but I would like to use gzip. gzip is slower to write, but it reads about as fast as snappy and, most of all, it compresses better than snappy.

Why does delta use snappy? Do you plan to support gzip in the future?

@rambabu-posa (Author)

I'm able to read and write data from and to this Delta Lake. Is it possible to see that data in the Delta Lake, just like we can see HDFS data from the Hue UI?

Regarding this question: when I read my data (already written in my previous steps) using spark.read.format("delta").load("/delta/events"), I'm able to see the following file at that location on my local file system:

part-00007-144fb4c5-dff0-4487-a4e8-241a9e850b35.c000.snappy.parquet

I'm able to see it in my local FS when I read it. In the same way, if we write it to HDFS, I can browse to the given path in HDFS using the Hue UI and see this kind of file(s) at that location.

Similarly, is it possible to log in to the Delta Lake file system (if I think of it as a DFS, for instance HDFS, S3, etc.), browse to the given path, and see my files?

@mukulmurthy (Collaborator)

@hkak03key - One of the reasons we chose snappy is that gzip isn't splittable. We have no current plans to support gzip in the future, but that can always change based on community feedback.

zsxwing (Member) commented Apr 29, 2019 via email

@mukulmurthy (Collaborator)

@rambabu-posa - Everything Shixiong said is correct, but one warning - while you may be able to see all the individual files Delta writes, you're not looking at a consistent view of the table, because some of those files may have been logically removed from the table.
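
A small sketch of that caveat (hypothetical local path, assuming an active SparkSession `spark`): after an overwrite, the previous Parquet part files are still physically present in the directory, because they are only logically removed in the transaction log.

```scala
// Two overwrites of the same table path.
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta/events")

// A plain directory listing (in Hue, `hadoop fs -ls`, or `ls`) shows part
// files from both writes: the first write's files are tombstoned in the
// _delta_log rather than deleted.

// Reading through the "delta" format returns only the current table version:
spark.read.format("delta").load("/tmp/delta/events").count()  // => 5
```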

@zsxwing changed the title from "delta (parquest) format" to "delta (parquet) format" Apr 29, 2019
@yuhuali1989

@hkak03key - One of the reasons we chose snappy is that gzip isn't splittable. We have no current plans to support gzip in the future, but that can always change based on community feedback.

Based on my data engineering experience, a Parquet file with gzip compression is actually splittable.
See also the answers at https://stackoverflow.com/questions/43323882/is-gzipped-parquet-file-splittable-in-hdfs-for-spark

@mukulmurthy (Collaborator)

@hkak03key - My mistake, we actually do support gzip through the Spark config spark.sql.parquet.compression.codec. You can set this conf to "gzip".
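
For example (a sketch; `df` and the target path are hypothetical):

```scala
// Switch the Parquet compression codec used by subsequent writes.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// Later writes to a Delta table then produce gzip-compressed Parquet files
// (e.g. part-...c000.gz.parquet instead of part-...snappy.parquet).
df.write.format("delta").save("/delta/events_gzip")
```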

tdas (Contributor) commented May 10, 2019

I am closing this issue for now. Feel free to reopen it if your questions haven't been answered.

@tdas tdas closed this as completed May 11, 2019
jbguerraz pushed a commit to jbguerraz/delta that referenced this issue Jul 6, 2022
andreaschat-db added a commit to andreaschat-db/delta that referenced this issue Apr 23, 2024
andreaschat-db added a commit to andreaschat-db/delta that referenced this issue Apr 26, 2024