Skip to content

[FLINK-39759][starrocks] Fix CHAR/VARCHAR mapping for utf8mb4 characters#4447

Merged
leonardBang merged 2 commits into
apache:masterfrom
haruki-830:FLINK-39759
Jun 24, 2026
Merged

[FLINK-39759][starrocks] Fix CHAR/VARCHAR mapping for utf8mb4 characters#4447
leonardBang merged 2 commits into
apache:masterfrom
haruki-830:FLINK-39759

Conversation

@haruki-830

Copy link
Copy Markdown
Contributor

Summary

This commit fixes CHAR / VARCHAR mapping in the Flink CDC StarRocks connector for utf8mb4 characters by introducing a configurable unicode-char.max-bytes option.

Key Changes

Configurable Character Length Mapping

  • Added a new optional StarRocks sink option: unicode-char.max-bytes
  • Default value is 3, preserving existing behavior
  • Users can set it to 4 for utf8mb4 sources to avoid underestimating target column lengths

Schema Mapping and Evolution Support

  • Updated StarRocksUtils.CdcDataTypeTransformer to use a configurable byte multiplier instead of the hard-coded 3
  • Integrated the option into TableCreateConfig
  • Updated create table, add column, and alter column type paths to honor the configured value
  • Preserved existing primary key handling and length capping behavior

Validation, Tests, and Docs

  • Added validation to ensure unicode-char.max-bytes is positive
  • Added unit and integration test coverage for both default behavior and unicode-char.max-bytes = 4
  • Updated English and Chinese StarRocks connector docs to document the new option

Configuration Example

Enable 4-byte character length mapping for utf8mb4 sources

sink:
  type: starrocks
  jdbc-url: jdbc:mysql://fe_host1:9030
  load-url: fe_host1:8030
  username: root
  password: password
  unicode-char.max-bytes: 4

Default behavior remains unchanged

sink:
  type: starrocks
  jdbc-url: jdbc:mysql://fe_host1:9030
  load-url: fe_host1:8030
  username: root
  password: password

JIRA Reference

https://issues.apache.org/jira/browse/FLINK-39759

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added docs Improvements or additions to documentation starrocks-pipeline-connector labels Jun 22, 2026
@haruki-830 haruki-830 marked this pull request as ready for review June 22, 2026 02:38
@leonardBang leonardBang requested a review from Copilot June 22, 2026 09:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Introduces a configurable unicode-char.max-bytes option to correctly map CDC CHAR/VARCHAR lengths (character count) into StarRocks byte-length semantics, with support across create/evolve paths and test/doc updates.

Changes:

  • Added unicode-char.max-bytes sink option (default 3) and wired it into TableCreateConfig and schema mapping.
  • Updated StarRocks type transformation logic to scale CHAR/VARCHAR lengths by the configured max bytes.
  • Added/updated unit + IT coverage and documented the new option (EN + ZH).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/TableCreateConfig.java Adds unicodeCharMaxBytes to table creation config with basic validation and config parsing.
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksUtils.java Plumbs the new setting through datatype mapping and updates CHAR/VARCHAR scaling logic.
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksMetadataApplier.java Ensures add/alter column schema evolution uses configured unicodeCharMaxBytes.
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksDataSinkOptions.java Defines the new UNICODE_CHAR_MAX_BYTES config option and description.
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksDataSinkFactory.java Exposes the new option as an optional sink option.
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksUtilsTest.java Adds unit test asserting 4-byte scaling outcome for CHAR/VARCHAR.
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/CdcDataTypeTransformerTest.java Adds unit tests to cover scaling/capping behavior with unicode-char.max-bytes=4.
flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksMetadataApplierITCase.java Adds IT coverage for schema creation with unicode-char.max-bytes=4 and allows injecting extra config.
docs/content/docs/connectors/pipeline-connectors/starrocks.md Documents the new sink option in English docs.
docs/content.zh/docs/connectors/pipeline-connectors/starrocks.md Documents the new sink option in Chinese docs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +58 to +66
public TableCreateConfig(
@Nullable Integer numBuckets, Map<String, String> properties, int unicodeCharMaxBytes) {
Preconditions.checkArgument(
unicodeCharMaxBytes > 0,
"unicode-char.max-bytes must be positive, but actually is %s",
unicodeCharMaxBytes);
this.numBuckets = numBuckets;
this.properties = new HashMap<>(properties);
this.unicodeCharMaxBytes = unicodeCharMaxBytes;
Comment on lines +213 to +217
StarRocksColumn nameColumn =
table.getColumns().stream()
.filter(c -> c.getColumnName().equals("name"))
.findFirst()
.get();

@leonardBang leonardBang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @haruki-830 for the contribution, the code change looks good to me, I left comments about docs.

And in English docs and Chinese docs, the mapping table still says CHAR(n) threshold is n <= 85 and output is n * 3, we also need to update them.

<td>optional</td>
<td style="word-wrap: break-word;">3</td>
<td>Integer</td>
<td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths.</td>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths.</td>
<td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths. The default value of 3 is retained for backward compatibility.</td>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@leonardBang

Copy link
Copy Markdown
Contributor

Thanks for the update, LGTM, wait the CI green

@leonardBang leonardBang merged commit 504a4b4 into apache:master Jun 24, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved docs Improvements or additions to documentation reviewed starrocks-pipeline-connector

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants