
Support SQL-like method #14518

@hudi-bot

Description


As we know, Hudi uses the Spark datasource API to upsert data. For example, if we want to update a row, we need to read the old row's data first, and then use the upsert method to write it back.
But there is another situation where someone just wants to update one column. Described as SQL, it is {{update table set col1 = X where col2 = Y}}. This is something Hudi cannot deal with directly at present; we can only read all the data involved as a dataset first and then merge it.
So I think maybe we can create a new subproject that processes the batch data through a SQL-like API. For example:

{code}
val hudiTable = new HudiTable(path)
hudiTable.update.set("col1 = X").where("col2 = Y")
hudiTable.delete.where("col3 = Z")
hudiTable.commit
{code}

It could also be extended to support JDBC-like access, in the spirit of RFC-14: [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller]

I hope everyone can provide some suggestions on whether this plan is feasible.

Comments

30/Dec/19 18:17 - vinoth: I am not sure if CLI is the right component for this. A few questions before I can triage this..

  • Is this intended to be a Spark API? We have thought about adding support in Spark SQL to specify the merge logic vs the HoodieRecordPayload interface.. This sounds similar.
  • I think we need to move towards the Spark Datasource V2 API first.. and then rethink how this will fit in HUDI-30

07/Jan/20 02:57 - chenxiang: [~vinoth]
I checked the Spark project. It seems that the Spark SQL syntax tree only supports the DELETE keyword at present; UPDATE and MERGE are not supported yet. I think this may be because Spark's design is about relationships between datasets, and existing operators can solve similar problems, just not in a SQL-like way.
My current idea is to build a SQL syntax layer on top of hudi-core, and use antlr4 to process the semantics. For example, an UPDATE statement can be parsed into first filtering the data according to the WHERE condition, and then upserting the result into Hudi.
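A minimal sketch of that parsing step, in plain Scala. A naive regex stands in for the proposed antlr4 grammar, and `UpdateStatement`/`UpdateParser` are hypothetical names for illustration only, not Hudi or Spark API:

```scala
// Hypothetical sketch: split a simple
//   update <table> set <col> = <value> where <predicate>
// statement into the pieces needed for the filter-then-upsert plan described
// above. A real implementation would use an antlr4 grammar; this regex only
// handles the single-assignment case.
case class UpdateStatement(table: String, setColumn: String, setValue: String, predicate: String)

object UpdateParser {
  private val Pattern =
    """(?i)update\s+(\w+)\s+set\s+(\w+)\s*=\s*(\S+)\s+where\s+(.+)""".r

  def parse(sql: String): Option[UpdateStatement] = sql.trim match {
    case Pattern(table, col, value, pred) =>
      Some(UpdateStatement(table, col, value, pred.trim))
    case _ => None
  }
}
```

Given {{update table set col1 = X where col2 = Y}}, this yields the table, the assignment, and the predicate: enough to filter the dataset by the WHERE clause, rewrite the one column, and hand the result to upsert.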


08/Jan/20 02:28 - vinoth: Hi [~chenxiang], [https://github.com/apache/spark/blob/master/docs/sql-keywords.md] does list the DELETE and UPDATE keywords in the language itself.. I think it's up to the datasource to implement them. Can we consider this once we move to Datasource V2 first? Isn't that a pre-req for this?


09/Jan/20 01:11 - chenxiang: [~vinoth]

Oh, I've seen in [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4] that Spark can recognize these keywords.

I have built a project to first try out whether it is feasible. I don't know yet whether it depends on V1 or V2. Once I have a result, I will let you know as soon as possible.


10/Jan/20 00:39 - chenxiang: [~vinoth]

I opened Spark's GitHub again this morning and suddenly realized that yesterday I was looking at the master branch (Spark 3.0). When I switched to version 2.4, there were no UPDATE or MERGE keywords. This shows that Spark does not support these keywords in version 2.4, which may be a problem.


12/Jan/20 07:09 - vinoth: I see.. so if the future 3.x versions will have it, it's fine right? We can just build based off that?


14/Oct/20 00:35 - chenxiang: I have created a GitHub project at https://github.com/shangyuantech/hudi-sql . Later, I can show some design ideas and usage scenarios in this project.


14/Oct/20 02:13 - x1q1j1: Hi [~chenxiang], there is a part of the syntax that needs to be extended; we can further refine it. Compare https://docs.delta.io/latest/delta-update.html#update-a-table.


14/Oct/20 03:40 - 309637554: [~chenxiang] [~x1q1j1] Hi, I also have some plans around this: https://issues.apache.org/jira/browse/HUDI-1341. We can discuss often :D


14/Oct/20 05:19 - chenxiang: [~309637554] Glad to see your interest. I've added a "relates to" link with HUDI-1341.


20/Oct/20 22:06 - vinoth: > If we use a sql to describe, it is {{update table set col1 = X where col2 = Y}}. This is something hudi cannot deal with directly at present, we can only get all the data involved as a dataset first and then merge it.

I don't think we can avoid getting the dataset first, i.e. reading the older parquet file to merge the record. In fact, I would argue that Hudi uniquely lets you deal with the single-column update scenario today, by allowing custom payloads to specify the merging: the base file can contain the entire record while the log contains just the updated column value, and we will be able to merge the two.
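A minimal sketch of that merge, using plain Scala maps in place of Hudi's HoodieRecordPayload and Avro records; the object and method names here are illustrative, not Hudi API:

```scala
// Illustrative sketch of payload-based partial updates: the base file holds
// the full record, the log holds only the changed column(s), and merging
// combines them. Plain maps stand in for Avro records.
object PartialUpdateMerge {
  type Record = Map[String, Any]

  // Log (partial update) values win; columns absent from the partial
  // update are carried over unchanged from the base record.
  def merge(base: Record, partialUpdate: Record): Record =
    base ++ partialUpdate
}
```

For example, merging a base record {{key=1, col1=old, col2=Y}} with a log entry that carries only {{col1=X}} yields the fully updated row, with {{col2}} retained from the base file.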

 

What we are missing is SQL support for merges, which we should build out under HUDI-1297's scope. wdyt?


21/Oct/20 14:32 - 309637554: [~vinoth] Agree with you.

1. At present we cannot avoid getting the dataset first. I agree that the log can contain just the updated column value and we will be able to merge it. If we had column statistics or clustering, like a z-ordering index, this scenario could be optimized.
2. I see that Hudi support for Spark 3.0 will land soon. We can build the SQL API (HUDI-1297) on the Datasource V2 API.
