[HUDI-18691] Honor IF NOT EXISTS when creating indexes #18699

Open
201573 wants to merge 2 commits into apache:master from 201573:codex/hudi-18691-create-index-if-not-exists

Conversation

201573 commented May 7, 2026

Describe the issue this Pull Request addresses

Closes #18691.

Spark SQL parses IF NOT EXISTS for CREATE INDEX, but the parsed flag was not propagated into the Spark index client. As a result, duplicate index creation still failed even when users explicitly requested idempotent behavior.

Summary and Changelog

This pull request honors IF NOT EXISTS for Spark SQL CREATE INDEX statements.

Changes:

  • Pass the parsed ignoreIfExists flag from Spark SQL CREATE INDEX commands into the Spark index client.
  • Skip index creation when the index already exists and IF NOT EXISTS is used.
  • Preserve the existing duplicate-index failure behavior when IF NOT EXISTS is not specified.
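The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `IndexClientSketch`, its in-memory `Set`, and the boolean return value are hypothetical stand-ins, and only the `ignoreIfExists` flag name comes from the description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for an index client; not Hudi's actual API. */
public class IndexClientSketch {
    // In-memory registry standing in for the table's index metadata (assumption).
    private final Set<String> indexes = new HashSet<>();

    /** Returns true if the index was created, false if creation was skipped. */
    public boolean create(String indexName, boolean ignoreIfExists) {
        if (indexes.contains(indexName)) {
            if (ignoreIfExists) {
                // CREATE INDEX IF NOT EXISTS: an existing index is a no-op.
                return false;
            }
            // Plain CREATE INDEX: keep the strict duplicate failure.
            throw new IllegalStateException("Index already exists: " + indexName);
        }
        indexes.add(indexName);
        return true;
    }

    public static void main(String[] args) {
        IndexClientSketch client = new IndexClientSketch();
        System.out.println(client.create("idx_city", false)); // prints "true"
        System.out.println(client.create("idx_city", true));  // prints "false"
        try {
            client.create("idx_city", false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Index already exists: idx_city"
        }
    }
}
```

The point of the sketch is that the flag only changes the outcome when the index already exists; a first-time creation behaves the same either way.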

No code was copied from external sources.

Impact

Low user-facing impact. This makes CREATE INDEX IF NOT EXISTS behave as expected for existing indexes while keeping the existing strict failure behavior for plain CREATE INDEX.

There is no public API change, storage format change, or expected performance impact.

Risk Level

low

The change is scoped to Spark SQL index creation and preserves the existing non-IF NOT EXISTS failure path. Verification covered both syntax handling and secondary index behavior.

Documentation Update

none

This fixes existing command semantics and does not add a new user-facing command, config, or API.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Testing:

  • git diff --check
  • mvn -pl hudi-spark-datasource/hudi-spark -am -Pspark3.5 -DskipTests -DskipITs -DskipUTs -DskipFTs -DskipDocker -Drat.skip=true -Dmaven.javadoc.skip=true install
  • mvn -pl hudi-spark-datasource/hudi-spark -Pspark3.5 -DwildcardSuites=org.apache.spark.sql.hudi.feature.index.TestIndexSyntax -Drat.skip=true org.scalatest:scalatest-maven-plugin:2.2.0:test
  • mvn -pl hudi-spark-datasource/hudi-spark -Pspark3.5 -DwildcardSuites=org.apache.spark.sql.hudi.feature.index.TestSecondaryIndex -Drat.skip=true org.scalatest:scalatest-maven-plugin:2.2.0:test

Contributor

hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR threads an ignoreIfExists flag through HoodieSparkIndexClient.create so that CREATE INDEX IF NOT EXISTS is honored for record, secondary, expression, bloom_filters, and column_stats indexes, and adds tests covering the new path. The existence-check prefix logic mirrors what HoodieIndexUtils.getSecondaryOrExpressionIndexDefinition already uses, and the default behavior (without IF NOT EXISTS) is preserved. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small readability suggestion on the guard pattern in createExpressionOrSecondaryIndex; rest of the change is clean.

cc @yihua

String fullIndexName = indexType.equals(PARTITION_NAME_SECONDARY_INDEX)
    ? PARTITION_NAME_SECONDARY_INDEX_PREFIX + userIndexName
    : PARTITION_NAME_EXPRESSION_INDEX_PREFIX + userIndexName;
if (indexExists(metaClient, fullIndexName) && ignoreIfExists) {
Contributor

🤖 nit: the && guard here reads differently from the nested-if pattern used in createRecordIndex (line ~117). A future reader might wonder what happens when indexExists && !ignoreIfExists — not immediately obvious that the throw is handled further down. Could you use the same nested-if shape for consistency, e.g. if (indexExists(...)) { if (ignoreIfExists) { return; } }?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Author

Done, updated this guard to match the nested shape used in createRecordIndex. Verified with git diff --check and the hudi-spark-client Spark 3.5 build.

github-actions bot added the size:S label (PR with lines of changes in (10, 100]) on May 7, 2026
Contributor

hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR threads an ignoreIfExists flag from CreateIndexCommand through HoodieSparkIndexClient.create so that CREATE INDEX IF NOT EXISTS short-circuits when the index already exists, while leaving the existing duplicate-error path intact for plain CREATE INDEX. I traced the record-index and secondary/expression-index paths; the fullIndexName prefix logic matches what HoodieIndexUtils.getSecondaryOrExpressionIndexDefinition uses for its own duplicate check, so behavior stays consistent when ignoreIfExists is false. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small simplification suggestion below; otherwise the change reads cleanly.

cc @yihua

String fullIndexName = indexType.equals(PARTITION_NAME_SECONDARY_INDEX)
    ? PARTITION_NAME_SECONDARY_INDEX_PREFIX + userIndexName
    : PARTITION_NAME_EXPRESSION_INDEX_PREFIX + userIndexName;
if (indexExists(metaClient, fullIndexName)) {
Contributor

🤖 nit: since there's no throw after the inner branch (unlike createRecordIndex), this nested if is just a conjunction — could you flatten to if (ignoreIfExists && indexExists(metaClient, fullIndexName)) { ... } for readability?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
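The claim that the nested guard is "just a conjunction" when nothing follows the inner branch can be checked over all four input combinations. The sketch below is illustrative only; `GuardShapes`, `skipFlat`, and `skipNested` are hypothetical names, and the booleans stand in for `indexExists(metaClient, fullIndexName)` and `ignoreIfExists`.

```java
/** Hypothetical comparison of the two guard shapes discussed in the review. */
public class GuardShapes {
    /** Flattened guard: skip creation only when both conditions hold. */
    public static boolean skipFlat(boolean exists, boolean ignoreIfExists) {
        return ignoreIfExists && exists;
    }

    /** Nested guard with no throw after the inner branch. */
    public static boolean skipNested(boolean exists, boolean ignoreIfExists) {
        if (exists) {
            if (ignoreIfExists) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean exists : values) {
            for (boolean ignore : values) {
                // Both columns agree on every row of the truth table.
                System.out.printf("exists=%b ignoreIfExists=%b flat=%b nested=%b%n",
                    exists, ignore, skipFlat(exists, ignore), skipNested(exists, ignore));
            }
        }
    }
}
```

The two shapes only diverge in behavior once a statement such as a `throw` is placed after the inner `if`, which is the situation the reviewer describes for createRecordIndex.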

Author

Thanks. I kept the nested shape here to match the record-index guard above and make the plain CREATE INDEX path fall through to the existing duplicate-index check in HoodieIndexUtils.getSecondaryOrExpressionIndexDefinition. That keeps the two CREATE INDEX paths visually consistent.
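The fall-through described in this reply can be sketched as below, under stated assumptions: `SecondaryIndexGuardSketch` and `registerDefinition` are hypothetical, the prefix strings are placeholder values chosen for illustration, and the helper merely stands in for the duplicate check attributed to HoodieIndexUtils.getSecondaryOrExpressionIndexDefinition.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the secondary/expression index guard; not the PR's code. */
public class SecondaryIndexGuardSketch {
    // Placeholder prefix values, assumed for illustration only.
    static final String SECONDARY_PREFIX = "secondary_index_";
    static final String EXPRESSION_PREFIX = "expr_index_";

    private final Set<String> existing = new HashSet<>();

    /** Stand-in for the downstream helper that rejects duplicate definitions. */
    private void registerDefinition(String fullIndexName) {
        if (!existing.add(fullIndexName)) {
            throw new IllegalStateException("Index already exists: " + fullIndexName);
        }
    }

    /** Returns true if the index was created, false if creation was skipped. */
    public boolean create(String userIndexName, boolean secondary, boolean ignoreIfExists) {
        String fullIndexName = (secondary ? SECONDARY_PREFIX : EXPRESSION_PREFIX) + userIndexName;
        if (existing.contains(fullIndexName)) {
            if (ignoreIfExists) {
                return false; // IF NOT EXISTS: skip quietly
            }
            // No throw here: plain CREATE INDEX falls through to the helper's check.
        }
        registerDefinition(fullIndexName);
        return true;
    }

    public static void main(String[] args) {
        SecondaryIndexGuardSketch client = new SecondaryIndexGuardSketch();
        System.out.println(client.create("idx_city", true, false)); // prints "true"
        System.out.println(client.create("idx_city", true, true));  // prints "false"
    }
}
```

In this shape the guard never throws on its own; the strict failure for a duplicate plain CREATE INDEX still comes from the single downstream check.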

Collaborator

hudi-bot commented May 7, 2026

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous
Member

Thank you for the contribution. Could you please edit the PR description using the provided template so that the compliance check passes?

Thank you!

@codecov-commenter

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.87%. Comparing base (34e9c7c) to head (f93f2c6).
⚠️ Report is 11 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../org/apache/hudi/index/HoodieSparkIndexClient.java | 85.71% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18699      +/-   ##
============================================
- Coverage     67.92%   67.87%   -0.05%     
+ Complexity    29003    28978      -25     
============================================
  Files          2522     2522              
  Lines        141166   141181      +15     
  Branches      17506    17509       +3     
============================================
- Hits          95881    95826      -55     
- Misses        37415    37489      +74     
+ Partials       7870     7866       -4     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.17% <0.00%> (-0.01%) ⬇️ |
| hadoop-mr-java-client | 45.00% <ø> (+<0.01%) ⬆️ |
| spark-client-hadoop-common | 48.34% <0.00%> (-0.01%) ⬇️ |
| spark-java-tests | 48.76% <40.00%> (-0.24%) ⬇️ |
| spark-scala-tests | 44.91% <90.00%> (+0.01%) ⬆️ |
| utilities | 37.63% <25.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| .../apache/spark/sql/hudi/command/IndexCommands.scala | 87.35% <100.00%> (+0.45%) ⬆️ |
| .../org/apache/hudi/index/HoodieSparkIndexClient.java | 85.61% <85.71%> (+1.28%) ⬆️ |

... and 15 files with indirect coverage changes

Author

201573 commented May 12, 2026

I updated the PR description to match the current Hudi PR template sections. The new PR Compliance run triggered by the edit is currently in action_required state and appears to need maintainer approval.

For the Java CI failure, the actual Java UT 1 - Common & Spark step completed successfully; the failed step is Generate merged coverage report in the Spark 4.0 / Scala 2.13 Java 17 job. I do not have permission to rerun the upstream checks directly, so could a maintainer please approve/re-run the failed checks when convenient?

Development

Successfully merging this pull request may close these issues.

[BUG] CREATE INDEX IF NOT EXISTS record_index throws "Index already exists" — ignoreIfExists flag is silently dropped

5 participants