Skip to content

[SPARK-57245][SQL] Simplify FormatNumber codegen by extracting a static Java helper#56299

Closed
LuciferYang wants to merge 1 commit into
apache:masterfrom
LuciferYang:formatnumber-codegen-helper
Closed

[SPARK-57245][SQL] Simplify FormatNumber codegen by extracting a static Java helper#56299
LuciferYang wants to merge 1 commit into
apache:masterfrom
LuciferYang:formatnumber-codegen-helper

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This is a sub-task of SPARK-56908 (reduce generated Java size in whole-stage codegen).

For an integer d, FormatNumber.doGenCode inlined the number-pattern build (the default grouping pattern followed by d decimal places, applied to a DecimalFormat), an exact transliteration of the same block in nullSafeEval. The codegen also threaded a pattern StringBuilder through mutable state:

if (d >= 0) {
  pattern.delete(0, pattern.length());
  if (d != lastDValue) {
    pattern.append("#,###,###,###,###,###,##0");
    if (d > 0) {
      pattern.append(".");
      for (int i = 0; i < d; i++) {
        pattern.append("0");
      }
    }
    lastDValue = d;
    numberFormat.applyLocalizedPattern(pattern.toString());
  }
  ...
}

This PR moves the pattern build into a static helper ExpressionImplUtils.applyNumberFormatScale(DecimalFormat, String defaultFormat, int scale) that both eval and doGenCode call, mirroring the ascii helper added in SPARK-57208. The integer-d codegen becomes:

if (d >= 0) {
  if (d != lastDValue) {
    lastDValue = d;
    ExpressionImplUtils.applyNumberFormatScale(numberFormat, "#,###,###,###,###,###,##0", d);
  }
  ...
}

This removes the duplicated block, drops the pattern StringBuilder (an instance field in eval and a mutable-state slot in codegen), and removes a dead pattern.delete(...) that the string-pattern branch never read. The string-pattern path and the per-d caching (only rebuild when d changes) are unchanged.

Why are the changes needed?

To reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. The integer-d branch shrinks to a single helper call, and the generated class drops one mutable-state field (the pattern StringBuilder). The same block is also deduplicated from eval.

Does this PR introduce any user-facing change?

No. This is a codegen/refactor change; eval, nullable, dataType, and results are unchanged, so SQL output and golden files are unaffected.

How was this patch tested?

Existing tests, which exercise the changed path in both interpreted and codegen modes:

  • StringExpressionsSuite "format_number / FormatNumber" (catalyst) covers integer d with scale > 0, scale = 0, and scale < 0 (null) across all numeric input types, plus the string-pattern path and null cases.
  • StringFunctionsSuite "number format function" (sql/core).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…ic Java helper

FormatNumber.doGenCode inlined the integer-scale number pattern build
(the default grouping pattern plus `d` decimal places, applied to a
DecimalFormat), an exact transliteration of the same block in
nullSafeEval. The codegen also threaded a `pattern` StringBuilder through
mutable state.

Move the pattern build into a static helper
`ExpressionImplUtils.applyNumberFormatScale` that both eval and codegen
call, mirroring the Ascii helper. This removes the duplicated block, drops
the `pattern` StringBuilder (an instance field in eval and a mutable-state
slot in codegen), and shrinks the generated Java for the integer-`d` case
to a single call. The string-pattern path and the per-`d` caching are
unchanged; behavior is preserved.
@HyukjinKwon
Copy link
Copy Markdown
Member

cc @gengliangwang

@LuciferYang
Copy link
Copy Markdown
Contributor Author

@gengliangwang This should comply with the preceding criteria.

@gengliangwang
Copy link
Copy Markdown
Member

@LuciferYang Thanks for working on this!

LuciferYang added a commit that referenced this pull request Jun 4, 2026
…ic Java helper

### What changes were proposed in this pull request?

This is a sub-task of [SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce generated Java size in whole-stage codegen).

For an integer `d`, `FormatNumber.doGenCode` inlined the number-pattern build (the default grouping pattern followed by `d` decimal places, applied to a `DecimalFormat`), an exact transliteration of the same block in `nullSafeEval`. The codegen also threaded a `pattern` `StringBuilder` through mutable state:

```java
if (d >= 0) {
  pattern.delete(0, pattern.length());
  if (d != lastDValue) {
    pattern.append("#,###,###,###,###,###,##0");
    if (d > 0) {
      pattern.append(".");
      for (int i = 0; i < d; i++) {
        pattern.append("0");
      }
    }
    lastDValue = d;
    numberFormat.applyLocalizedPattern(pattern.toString());
  }
  ...
}
```

This PR moves the pattern build into a static helper `ExpressionImplUtils.applyNumberFormatScale(DecimalFormat, String defaultFormat, int scale)` that both `eval` and `doGenCode` call, mirroring the `ascii` helper added in SPARK-57208. The integer-`d` codegen becomes:

```java
if (d >= 0) {
  if (d != lastDValue) {
    lastDValue = d;
    ExpressionImplUtils.applyNumberFormatScale(numberFormat, "#,###,###,###,###,###,##0", d);
  }
  ...
}
```

This removes the duplicated block, drops the `pattern` `StringBuilder` (an instance field in `eval` and a mutable-state slot in codegen), and removes a dead `pattern.delete(...)` that the string-pattern branch never read. The string-pattern path and the per-`d` caching (only rebuild when `d` changes) are unchanged.

### Why are the changes needed?

To reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. The integer-`d` branch shrinks to a single helper call, and the generated class drops one mutable-state field (the `pattern` `StringBuilder`). The same block is also deduplicated from `eval`.

### Does this PR introduce _any_ user-facing change?

No. This is a codegen/refactor change; `eval`, `nullable`, `dataType`, and results are unchanged, so SQL output and golden files are unaffected.

### How was this patch tested?

Existing tests, which exercise the changed path in both interpreted and codegen modes:
- `StringExpressionsSuite` "format_number / FormatNumber" (catalyst) covers integer `d` with scale > 0, scale = 0, and scale < 0 (null) across all numeric input types, plus the string-pattern path and null cases.
- `StringFunctionsSuite` "number format function" (sql/core).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Closes #56299 from LuciferYang/formatnumber-codegen-helper.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit d172a62)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@LuciferYang
Copy link
Copy Markdown
Contributor Author

Merged into master/branch-4.x. Thanks @gengliangwang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants