
[HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table #9261

Merged
nsivabalan merged 7 commits into apache:master from nsivabalan:enableUpsertsDeletesPkless
Aug 6, 2023

Conversation

@nsivabalan
Contributor

Change Logs

Adding support for upsert and deletes with spark datasource for pk less table.

Impact

This patch opens up the possibility of updating or deleting records with Spark datasource writes for a primary-key-less table.

For example, suppose the user wants to delete all records for a given employee from a pk-less table:

val df = spark.read.format("hudi").load(basePath).filter("emp_id = '100'")
df.write.format("hudi")
  .option("hoodie.datasource.write.operation", "delete")
  .mode(SaveMode.Append)
  .save(basePath)

This is feasible for a pk less table.
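The upsert path works the same way. Below is a hedged sketch (not code from this PR): the column names emp_id and salary, the read-back-then-modify flow, and basePath are illustrative assumptions. The key point, per the discussion in this PR, is that rows read back from a Hudi table keep their meta columns, which is what lets the prepped write locate the records on a pk-less table.

```scala
// Illustrative sketch: update rows of a pk-less table by reading them back
// (so the Hudi meta columns identifying each record are retained) and
// writing with the upsert operation. Column names are hypothetical.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

val toUpdate = spark.read.format("hudi").load(basePath)
  .filter("emp_id = '100'")
  .withColumn("salary", lit(95000))

toUpdate.write.format("hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save(basePath)
```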

Risk level (write none, low, medium or high below)

Low.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added release-0.14.0 priority:blocker Production down; release blocker labels Jul 21, 2023
@apache apache deleted a comment from hudi-bot Jul 23, 2023
private def canRemoveMetaFields(optParams: Map[String, String]): Boolean = {
  !(optParams.getOrDefault(SPARK_SQL_WRITES_PREPPED_KEY, "false").toBoolean
    || optParams.getOrDefault(SPARK_SQL_MERGE_INTO_PREPPED_KEY, "false").toBoolean
    || !optParams.containsKey(RECORDKEY_FIELD.key()))
}
Member

For pk-less tables, RECORDKEY_FIELD may not be present, so this condition can be true. Do we want to drop the meta fields in such cases? From the comment it seems that for pk-less tables we don't want to drop the meta fields.

Contributor Author

Not sure I get it. We will drop the meta fields down the line while creating the HoodieRecordPayload. Here we are just trying to gauge whether we can go with prepped writes or regular non-prepped writes. If the incoming df has meta fields and it is a pk-less table, we go with the prepped flow.

Member

My point was that for a pk-less table this condition would be true, so canRemoveMetaFields can return true, isn't it? However, we need the meta fields for pk-less tables.

!optParams.containsKey(RECORDKEY_FIELD.key())
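The behavior the reviewers converge on (keep meta fields whenever no record key is configured, since pk-less tables need them) can be sketched as a standalone predicate. This is a hedged reimplementation for illustration, not Hudi's actual code: the key strings are stand-ins for Hudi's config constants, and a plain Scala Map replaces the writer's option map.

```scala
// Standalone sketch of the canRemoveMetaFields predicate discussed above.
// Key strings are hypothetical stand-ins for Hudi's config constants.
object MetaFieldsCheck {
  val SparkSqlWritesPreppedKey    = "hoodie.spark.sql.writes.prepped"
  val SparkSqlMergeIntoPreppedKey = "hoodie.spark.sql.merge.into.prepped"
  val RecordKeyField              = "hoodie.datasource.write.recordkey.field"

  // Meta fields may be dropped only for non-prepped writes on a table with
  // an explicit record key; pk-less tables (no record key configured) must
  // keep them so the prepped flow can locate each record.
  def canRemoveMetaFields(optParams: Map[String, String]): Boolean =
    !(optParams.getOrElse(SparkSqlWritesPreppedKey, "false").toBoolean
      || optParams.getOrElse(SparkSqlMergeIntoPreppedKey, "false").toBoolean
      || !optParams.contains(RecordKeyField))
}
```

With this shape, a pk-less write (no record key in the options) always keeps the meta fields, which matches the fix the author applies later in the thread.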

@nsivabalan
Contributor Author

@codope : addressed all comments.

@codope codope left a comment (Member)

Can you please look into the CI failures?


@nsivabalan
Contributor Author

Oops, my bad. Not sure how I missed it. Will fix.

@nsivabalan
Contributor Author

@codope : Updated

@codope codope left a comment (Member)

Will land once the CI succeeds.

@nsivabalan nsivabalan force-pushed the enableUpsertsDeletesPkless branch 3 times, most recently from 7d85ec6 to 7e7efc7 Compare August 3, 2023 22:28
@codope codope self-assigned this Aug 4, 2023
@nsivabalan nsivabalan force-pushed the enableUpsertsDeletesPkless branch from 7e7efc7 to 3968bf3 Compare August 4, 2023 15:34
@hudi-bot
Collaborator

hudi-bot commented Aug 5, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit 7061652 into apache:master Aug 6, 2023
val df = if (preppedWriteOperation || preppedSparkSqlWrites || preppedSparkSqlMergeInto) {
  sourceDf
} else {
  sourceDf.drop(HoodieRecord.HOODIE_META_COLUMNS: _*)
}
Member

Dropping meta cols here caused a problem with HoodieStreamingSink: with val sourceDF = spark.readStream.format("hudi").load; sourceDF.writeStream.format("hudi").start(), the source DF is a streaming source, and dropping the meta cols failed the assertion "Queries with streaming sources must be executed with writeStream.start()" because internally Hudi writes the DF as a batch.

We should either keep df the same as sourceDF when using sourceDF.writeStream.format("hudi").start(), or use sourceDF.writeStream.foreachBatch {...}.
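The foreachBatch workaround mentioned above can be sketched as follows. This is a hedged illustration rather than the PR's code: basePath, targetPath, checkpointPath, and the upsert target are all assumed names.

```scala
// Illustrative sketch of the foreachBatch workaround: each micro-batch
// arrives as a plain (non-streaming) DataFrame, so Hudi's internal batch
// write (including any meta-column drop) no longer trips the
// "streaming sources must be executed with writeStream.start()" assertion.
import org.apache.spark.sql.{DataFrame, SaveMode}

val sourceDF = spark.readStream.format("hudi").load(basePath)

sourceDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write.format("hudi")
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      .save(targetPath)
  }
  .option("checkpointLocation", checkpointPath)
  .start()
```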


Labels

priority:blocker Production down; release blocker release-0.14.0


4 participants