delta (parquet) format #13

Closed
rambabu-posa opened this issue Apr 26, 2019 · 9 comments
Labels
question Questions on how to use Delta Lake

Comments

rambabu-posa commented Apr 26, 2019

Hi Delta team,

I tried Delta, and it's interesting. I have a few questions.

Even though we use the "delta" format, the underlying format is "parquet". So is it possible to use this Spark Delta format to read my existing Parquet data that was written without Delta?

Why does Delta support only Parquet? Why not the other Spark-supported formats? Do you plan to support them in the future?

I'm able to read and write data from and to this Delta Lake. Is it possible to see that data in the Delta Lake, just like we can see HDFS data from the Hue UI?

Many thanks,
Ram

tdas (Contributor) commented Apr 26, 2019

@rambabu-posa Let me try to answer each question one by one

Is it possible to use this Spark Delta format to read my existing Parquet data that was written without Delta?

It is not possible directly, because the Delta format relies on the transaction log being present, and a Parquet table obviously does not have that log. So an attempt to read a Parquet table using the delta format will throw an error. That said, Managed Delta Lake already has a CONVERT command that can convert a Parquet table into a Delta table in place by writing a new transaction log inside the same directory. We are hoping that we can eventually bring that command to open source.
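
For illustration, a minimal sketch of both points, assuming an active SparkSession `spark` (e.g. in spark-shell) and a hypothetical path; the CONVERT syntax shown is the one documented for Managed Delta Lake and is not part of the open-source release at this point:

```scala
// Reading a plain Parquet directory with the "delta" format fails, because
// there is no _delta_log transaction log in that directory:
val df = spark.read.format("delta").load("/data/existing_parquet_table")
// => throws an error along the lines of "... is not a Delta table"

// Managed Delta Lake's CONVERT command (SQL) writes a transaction log into
// the same directory, converting the table in place:
//   CONVERT TO DELTA parquet.`/data/existing_parquet_table`

// After conversion, the same directory can be read as a Delta table:
val events = spark.read.format("delta").load("/data/existing_parquet_table")
```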

Why does Delta support only Parquet? Why not the other Spark-supported formats? Do you plan to support them in the future?

As of now, we support only the Parquet format so that Delta Lake users get the maximum benefit of Parquet data skipping, etc. when querying the Delta Lake. We may make this configurable in the future. But really, as a Delta Lake user, you should not have to worry about what the underlying file format is: you query the table using the "delta" format and you still get the maximum benefit of partition pruning, data skipping, etc.
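
As a sketch of what that looks like from the user's side (the `events` DataFrame, the `date` column, and the path are hypothetical):

```scala
// Write a partitioned Delta table; the data files underneath happen to be
// Parquet, but the writer only ever specifies "delta".
events.write
  .format("delta")
  .partitionBy("date")
  .save("/delta/events")

// Read it back through the "delta" format. A filter on the partition column
// is pruned, and file-level statistics allow data skipping, without the
// reader referring to Parquet directly.
val recent = spark.read
  .format("delta")
  .load("/delta/events")
  .where("date >= '2019-04-01'")
```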

I'm able to read and write data from and to this Delta Lake. Is it possible to see that data in the Delta Lake, just like we can see HDFS data from the Hue UI?

When you say "see that data", do you mean visually see it in Hue? I am not that familiar with the Hue UI, so I am not sure how to answer this. But my guess would be that it depends on how Hue detects which format a directory uses.

@tdas added the "question" label (Questions on how to use Delta Lake) Apr 26, 2019
@hkak03key

I have a question about the format, too.

Currently "delta" uses snappy for compression, but I would like to use gzip. gzip is slower to write, but it reads about as fast as snappy and, most of all, it compresses better than snappy.

Why does delta use snappy? Do you plan to support gzip in the future?

@rambabu-posa (Author)

I'm able to read and write data from and to this Delta Lake. Is it possible to see that data in the Delta Lake, just like we can see HDFS data from the Hue UI?

Regarding this question: when I read my data (already written in my previous steps) using spark.read.format("delta").load("/delta/events"), I'm able to see the following file at that location on my local file system:

part-00007-144fb4c5-dff0-4487-a4e8-241a9e850b35.c000.snappy.parquet

I'm able to see it in my local FS when I read it. In the same way, if we write it to HDFS, I can browse to the given path in HDFS using the Hue UI and see this kind of file(s) at that location.

Similarly, is it possible to log in to the Delta Lake file system (if I think of it as a DFS, for instance HDFS, S3, etc.), browse to the given path, and see my files?

@mukulmurthy (Collaborator)

@hkak03key - One of the reasons we chose snappy is that gzip isn't splittable. We have no current plans to support gzip in the future, but that can always change based on community feedback.

zsxwing (Member) commented Apr 29, 2019 via email

@mukulmurthy (Collaborator)

@rambabu-posa - Everything Shixiong said is correct, but one warning - while you may be able to see all the individual files Delta writes, you're not looking at a consistent view of the table, because some of those files may have been logically removed from the table.
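
A small sketch of that caveat (hypothetical local path, assuming an active SparkSession `spark`): after an overwrite, the previous Parquet part files are still physically present in the directory, because they are only logically removed in the transaction log.

```scala
// Two overwrites of the same table path.
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta/events")

// A plain directory listing (in Hue, `hadoop fs -ls`, or `ls`) shows part
// files from both writes: the first write's files are tombstoned in the
// _delta_log rather than deleted.

// Reading through the "delta" format returns only the current table version:
spark.read.format("delta").load("/tmp/delta/events").count()  // => 5
```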

@zsxwing changed the title from "delta (parquest) format" to "delta (parquet) format" Apr 29, 2019
@yuhuali1989

@hkak03key - One of the reasons we chose snappy is that gzip isn't splittable. We have no current plans to support gzip in the future, but that can always change based on community feedback.

Based on my data engineering experience, a Parquet file with gzip compression is actually splittable.
See also the answers at https://stackoverflow.com/questions/43323882/is-gzipped-parquet-file-splittable-in-hdfs-for-spark

@mukulmurthy (Collaborator)

@hkak03key - My mistake, we actually do support gzip through the Spark config spark.sql.parquet.compression.codec. You can set this conf to "gzip".
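
For example (a sketch; `df` and the target path are hypothetical):

```scala
// Switch the Parquet compression codec used by subsequent writes.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// Later writes to a Delta table then produce gzip-compressed Parquet files
// (e.g. part-...c000.gz.parquet instead of part-...snappy.parquet).
df.write.format("delta").save("/delta/events_gzip")
```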

tdas (Contributor) commented May 10, 2019

I am closing this issue for now. Feel free to reopen it if your questions haven't been answered.

@tdas tdas closed this as completed May 11, 2019
jbguerraz pushed a commit to jbguerraz/delta that referenced this issue Jul 6, 2022
andreaschat-db added a commit to andreaschat-db/delta that referenced this issue Apr 23, 2024
andreaschat-db added a commit to andreaschat-db/delta that referenced this issue Apr 26, 2024