[Cherry-pick to branch-1.2] [#10096] feat(iceberg): support build in Iceberg stats updater job (#10106) #10264

Merged
jerryshao merged 1 commit into branch-1.2 from cherry-pick-f7b9c390-to-branch-1.2
Mar 6, 2026
github-actions bot commented Mar 6, 2026

Cherry-pick Information:

  • Original commit: f7b9c39
  • Target branch: branch-1.2
  • Status: ✅ Clean cherry-pick (no conflicts)

…10106)

### What changes were proposed in this pull request?

This PR enhances the built-in Iceberg update job and optimizer
integration in four areas:

1. Add built-in Iceberg stats/metrics update capability
- Introduce/rename the built-in job implementation to
`IcebergUpdateStatsAndMetricsJob`.
- Support `update_mode` with `stats | metrics | all`.
- Keep one execution path that can update Gravitino statistics and/or
persist metrics (including JDBC-backed metrics storage).

2. Reuse shared Spark/Iceberg config logic
- Add `IcebergSparkConfigUtils` in `maintenance/optimizer-api`.
- Share Spark template config generation, flat JSON parsing, and catalog
config validation logic.
- Reuse the same utility in both update-stats and rewrite-data-files
jobs to avoid diverging behavior.

3. Simplify submit-update-stats-job UX
- Simplify optimizer command context and submit command input handling.
- Remove `--target-file-size-bytes` from optimizer CLI and built-in job
template inputs.
- Use fixed defaults in job logic:
  - `datafile_mse` target file size: **128 MiB**
  - small file threshold: **32 MiB**

4. Strengthen tests
- Add unit tests for `IcebergSparkConfigUtils`.
- Update/extend job and optimizer tests to cover new signatures and
validation paths.
- Keep Spark-based update-stats tests aligned with the new fixed-target
behavior.
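
The three `update_mode` values from item 1 could be modeled as a small enum. This is a minimal sketch, not the code in this PR: the `UpdateMode` type, its method names, and the choice to reject blank or unknown values are assumptions made for illustration.

```java
import java.util.Locale;

/** Hypothetical enum mirroring the update_mode values stats | metrics | all. */
enum UpdateMode {
  STATS(true, false),
  METRICS(false, true),
  ALL(true, true);

  private final boolean updatesStats;
  private final boolean updatesMetrics;

  UpdateMode(boolean updatesStats, boolean updatesMetrics) {
    this.updatesStats = updatesStats;
    this.updatesMetrics = updatesMetrics;
  }

  /** True when this mode should update Gravitino statistics. */
  boolean updatesStats() {
    return updatesStats;
  }

  /** True when this mode should persist metrics. */
  boolean updatesMetrics() {
    return updatesMetrics;
  }

  /** Parses a case-insensitive mode string; rejecting bad input is an assumption. */
  static UpdateMode parse(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      throw new IllegalArgumentException("update_mode must not be empty");
    }
    try {
      return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Unsupported update_mode '" + raw + "', expected stats, metrics, or all", e);
    }
  }
}
```

With this shape, a single execution path can branch on `updatesStats()` and `updatesMetrics()` rather than comparing mode strings in several places.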

### Why are the changes needed?

- The update-stats flow now also needs metrics updates, and users need
one consistent built-in job for `stats`, `metrics`, and `all` modes.
- Spark config building/parsing/validation was duplicated across job
paths; centralizing it reduces maintenance risk and inconsistent
behavior.
- User feedback indicated the submit command had too many parameters;
removing `--target-file-size-bytes` and hard-coding safe defaults reduces
configuration overhead and mistakes.
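
To illustrate the kind of catalog config validation that could live in one shared place, here is a sketch over a flat Spark config map. `SparkCatalogConfSketch` and `requireCatalogImpl` are hypothetical names and do not reflect the actual `IcebergSparkConfigUtils` API. The only external fact relied on is Spark's `spark.sql.catalog.<name>` key convention for registering a catalog implementation.

```java
import java.util.Map;

/** Hypothetical helper; not the IcebergSparkConfigUtils API from this PR. */
final class SparkCatalogConfSketch {
  // Spark registers a catalog implementation under spark.sql.catalog.<name>.
  private static final String CATALOG_PREFIX = "spark.sql.catalog.";

  private SparkCatalogConfSketch() {}

  /** Returns the configured catalog implementation class, or throws if it is absent. */
  static String requireCatalogImpl(String catalogName, Map<String, String> sparkConf) {
    if (catalogName == null || catalogName.trim().isEmpty()) {
      throw new IllegalArgumentException("catalog name must not be empty");
    }
    String key = CATALOG_PREFIX + catalogName.trim();
    String impl = sparkConf.get(key);
    if (impl == null || impl.trim().isEmpty()) {
      throw new IllegalArgumentException("Missing Spark config '" + key + "'");
    }
    return impl.trim();
  }
}
```

If both the update-stats and rewrite-data-files jobs call one helper like this, a missing catalog entry fails the same way in either job.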

Fix: #10096

### Does this PR introduce _any_ user-facing change?

Yes.

1. Optimizer CLI behavior change
- `submit-update-stats-job` no longer accepts
`--target-file-size-bytes`.
- Users should provide `update_mode`, `updater_options`, and
`spark_conf` (via CLI or optimizer config).

2. Built-in job input change
- `target_file_size_bytes` is removed from built-in job template/job
config for update-stats flow.
- `datafile_mse` now always uses a fixed 128 MiB target size.

3. Functionality enhancement
- The built-in update job now supports a metrics-only mode and a combined
stats+metrics mode.
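
The fixed defaults can be pictured as two constants. The sketch below also shows one plausible reading of `datafile_mse`, namely the mean squared deviation of data file sizes from the target. The PR does not give the formula, so `meanSquaredError`, `countSmallFiles`, the class name, and the strict "less than" comparison for small files are all assumptions.

```java
import java.util.List;

/** Hypothetical sketch of the fixed defaults; formulas are assumed, not from the PR. */
final class FileSizeStatsSketch {
  // Fixed defaults stated in the PR description.
  static final long TARGET_FILE_SIZE_BYTES = 128L * 1024 * 1024; // 128 MiB
  static final long SMALL_FILE_THRESHOLD_BYTES = 32L * 1024 * 1024; // 32 MiB

  private FileSizeStatsSketch() {}

  /** Assumed definition: mean of (size - target)^2 over all files, in bytes squared. */
  static double meanSquaredError(List<Long> fileSizes) {
    if (fileSizes.isEmpty()) {
      return 0.0;
    }
    double sum = 0.0;
    for (long size : fileSizes) {
      double diff = (double) size - TARGET_FILE_SIZE_BYTES;
      sum += diff * diff;
    }
    return sum / fileSizes.size();
  }

  /** Counts files strictly below the small-file threshold (boundary handling assumed). */
  static long countSmallFiles(List<Long> fileSizes) {
    return fileSizes.stream().filter(size -> size < SMALL_FILE_THRESHOLD_BYTES).count();
  }
}
```

Because the target is a constant and no longer an input, two runs over the same table produce comparable values regardless of who submitted the job.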

### How was this patch tested?

Locally validated with formatting, compilation, and targeted tests:

```bash
./gradlew :maintenance:optimizer-api:spotlessApply :maintenance:jobs:spotlessApply :maintenance:optimizer:spotlessApply
./gradlew :maintenance:optimizer-api:test --tests org.apache.gravitino.maintenance.optimizer.common.util.TestIcebergSparkConfigUtils
./gradlew :maintenance:jobs:test --tests org.apache.gravitino.maintenance.jobs.iceberg.TestIcebergRewriteDataFilesJob --tests org.apache.gravitino.maintenance.jobs.iceberg.TestIcebergUpdateStatsJob
./gradlew :maintenance:optimizer:test --tests org.apache.gravitino.maintenance.optimizer.TestOptimizerCmd
```

After removing `--target-file-size-bytes`, the affected tests were rerun:

```bash
./gradlew :maintenance:jobs:test --tests org.apache.gravitino.maintenance.jobs.iceberg.TestIcebergUpdateStatsJob
./gradlew :maintenance:optimizer:test --tests org.apache.gravitino.maintenance.optimizer.TestOptimizerCmd
```

github-actions bot requested a review from jerryshao March 6, 2026 03:02
jerryshao closed this Mar 6, 2026
jerryshao reopened this Mar 6, 2026
jerryshao merged commit 7f989d0 into branch-1.2 Mar 6, 2026
26 checks passed
jerryshao deleted the cherry-pick-f7b9c390-to-branch-1.2 branch March 6, 2026 06:46