How to process deltas in Delta Lake? #28
We are working towards making DML commands like UPDATE, DELETE, and MERGE available in this OSS Delta Lake.
Hi tdas. Is there another way to update only some rows (deltas) and then be able to time travel with timestampAsOf? If not, do you have an ETA on this? I know this was just open sourced, but it is not a delta lake if we can't process deltas with it. Thanks for your help.
@CyborgDroid MERGE and UPDATE are the most elegant way of doing exactly that: updating a few rows. Until they are available, you can use the already-supported Spark DataFrame APIs to read and rewrite entire partitions (with the modified rows) of a partitioned Delta table, and then use time travel on the Delta table.
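A minimal PySpark sketch of that workaround, assuming a table partitioned by a `date` column; the path, partition value, and column names below are hypothetical. It rewrites just the affected partition with `replaceWhere`, then reads an earlier snapshot with `timestampAsOf`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-rewrite").getOrCreate()

path = "/delta/events"  # hypothetical table path

# Read only the partition that contains the rows to change.
df = spark.read.format("delta").load(path).where("date = '2019-06-01'")

# Apply the row-level changes with ordinary DataFrame transformations
# (hypothetical example: flag the rows), then overwrite just that partition.
updated = df.withColumn("status", F.lit("processed"))
(updated.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "date = '2019-06-01'")
    .save(path))

# Time travel still works: read the table as of an earlier timestamp.
old = (spark.read.format("delta")
    .option("timestampAsOf", "2019-06-01 00:00:00")
    .load(path))
```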
Adding a timestamp field and appending records to a regular Parquet or CSV file would do the same thing. That is not the same as tracking changes (deltas). Say I have a thousand rows and only 30 are randomly updated every second: in a real delta table I would end up with 2,593,000 records per day (1000 + 30 × 60 × 60 × 24), whereas with the current functionality of overwriting entire partitions I would end up with 86,400,000 (1000 × 60 × 60 × 24). The updates are random, so very few partitions would be skipped in an overwrite unless I make a partition per unique ID (1000 partitions), which hurts speed from what I understand. I'll try partitioning per unique ID, though, and see if it works. Please correct my understanding if I am wrong.
Closing this issue as a duplicate of #42. Feel free to reopen if you have further questions.
How do I optimize Delta tables using the PySpark API? I create the Delta table using the following: `endpoints_delta_table = DeltaTable.forPath(spark, HDFS_DIR)`. HDFS_DIR is the HDFS location that my streaming PySpark application is merging data to; it holds the Parquet files of the Delta table.
The OPTIMIZE SQL command is not available in Delta Lake OSS. Instead, you can do manual compaction: https://docs.delta.io/latest/best-practices.html#compact-files. Also, please note that this has nothing to do with the original issue. Please use new issues for new questions.
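The compaction approach on the linked best-practices page boils down to rewriting the table into fewer files. A minimal sketch, with a hypothetical path and target file count; `dataChange = false` marks the rewrite as a no-op for downstream streaming readers:

```python
path = "/delta/events"  # hypothetical table path
num_files = 16          # target number of files after compaction

(spark.read
    .format("delta")
    .load(path)
    .repartition(num_files)          # coalesce many small files into a few
    .write
    .option("dataChange", "false")   # data is unchanged, only re-laid-out
    .format("delta")
    .mode("overwrite")
    .save(path))
```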
In Databricks I can use MERGE, but that doesn't seem to be supported in the open-source version.
error:
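For anyone landing here later: MERGE has since been added to Delta Lake OSS. A minimal sketch of the `DeltaTable.merge` Python API as it exists in later releases; the paths, the `updates` DataFrame, and the `id` join key are hypothetical, and `spark` is an existing SparkSession:

```python
from delta.tables import DeltaTable

# Hypothetical inputs: the target table and a DataFrame of changed rows.
target = DeltaTable.forPath(spark, "/delta/events")
updates = spark.read.format("delta").load("/delta/staged_updates")

# Upsert: update rows whose keys match, insert the rest.
(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")  # hypothetical join key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```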