[FLINK-39759][starrocks] Fix CHAR/VARCHAR mapping for utf8mb4 characters by haruki-830 · Pull Request #4447 · apache/flink-cdc

haruki-830 · 2026-06-22T02:37:45Z

Summary

This commit fixes CHAR / VARCHAR mapping in the Flink CDC StarRocks connector for utf8mb4 characters by introducing a configurable unicode-char.max-bytes option.

Key Changes

Configurable Character Length Mapping

Added a new optional StarRocks sink option: unicode-char.max-bytes
Default value is 3, preserving existing behavior
Users can set it to 4 for utf8mb4 sources to avoid underestimating target column lengths
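
To illustrate why a multiplier of 3 can fall short, here is a minimal sketch that compares character counts with UTF-8 byte counts. The class and method names are made up for this example and are not part of the connector.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of flink-cdc.
final class Utf8WidthDemo {

    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points, i.e. what an upstream CHAR(n) length counts. */
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String cjk = "\u4E2D";          // U+4E2D, inside the Basic Multilingual Plane
        String emoji = "\uD83D\uDE00";  // U+1F600, outside the BMP (needs utf8mb4 in MySQL)

        // One character each, but different byte widths.
        System.out.println(codePoints(cjk) + " char, " + utf8Bytes(cjk) + " bytes");
        System.out.println(codePoints(emoji) + " char, " + utf8Bytes(emoji) + " bytes");
    }
}
```

A column sized at 3 bytes per character can therefore be too short for values that contain 4-byte characters, which is the case the new option addresses.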

Schema Mapping and Evolution Support

Updated StarRocksUtils.CdcDataTypeTransformer to use a configurable byte multiplier instead of the hard-coded 3
Integrated the option into TableCreateConfig
Updated create table, add column, and alter column type paths to honor the configured value
Preserved existing primary key handling and length capping behavior
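
The scaling and capping described above can be sketched roughly as follows. This is not the code from StarRocksUtils.CdcDataTypeTransformer: the class, method names, and the two size limits (255 bytes for CHAR, 1048576 bytes for VARCHAR) are assumptions for illustration, and primary key handling is left out.

```java
// Hypothetical sketch; names and limits are assumptions, not the connector's code.
final class CharLengthMappingSketch {

    // Assumed StarRocks limits, in bytes.
    static final int MAX_CHAR_SIZE = 255;
    static final int MAX_VARCHAR_SIZE = 1048576;

    /** Maps an upstream CHAR(charLength) to a StarRocks type string. */
    static String mapChar(int charLength, int unicodeCharMaxBytes) {
        long byteLength = scale(charLength, unicodeCharMaxBytes);
        if (byteLength <= MAX_CHAR_SIZE) {
            return "CHAR(" + byteLength + ")";
        }
        // Too wide for CHAR: fall back to VARCHAR, capped at the assumed maximum.
        return "VARCHAR(" + Math.min(byteLength, MAX_VARCHAR_SIZE) + ")";
    }

    /** Maps an upstream VARCHAR(charLength) to a StarRocks type string. */
    static String mapVarchar(int charLength, int unicodeCharMaxBytes) {
        long byteLength = scale(charLength, unicodeCharMaxBytes);
        return "VARCHAR(" + Math.min(byteLength, MAX_VARCHAR_SIZE) + ")";
    }

    /** Converts a character count to a byte count using the configured multiplier. */
    private static long scale(int charLength, int unicodeCharMaxBytes) {
        if (unicodeCharMaxBytes <= 0) {
            throw new IllegalArgumentException(
                    "unicode-char.max-bytes must be positive, but actually is "
                            + unicodeCharMaxBytes);
        }
        return (long) charLength * unicodeCharMaxBytes;
    }
}
```

Under these assumptions, `mapChar(85, 3)` stays a CHAR, while `mapChar(85, 4)` exceeds the assumed CHAR limit and becomes a VARCHAR.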

Validation, Tests, and Docs

Added validation to ensure unicode-char.max-bytes is positive
Added unit and integration test coverage for both default behavior and unicode-char.max-bytes = 4
Updated English and Chinese StarRocks connector docs to document the new option

Configuration Example

Enable 4-byte character length mapping for utf8mb4 sources

sink:
  type: starrocks
  jdbc-url: jdbc:mysql://fe_host1:9030
  load-url: fe_host1:8030
  username: root
  password: password
  unicode-char.max-bytes: 4

Default behavior remains unchanged

sink:
  type: starrocks
  jdbc-url: jdbc:mysql://fe_host1:9030
  load-url: fe_host1:8030
  username: root
  password: password

JIRA Reference

https://issues.apache.org/jira/browse/FLINK-39759

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Introduces a configurable unicode-char.max-bytes option to correctly map CDC CHAR/VARCHAR lengths (character count) into StarRocks byte-length semantics, with support across create/evolve paths and test/doc updates.

Changes:

Added unicode-char.max-bytes sink option (default 3) and wired it into TableCreateConfig and schema mapping.
Updated StarRocks type transformation logic to scale CHAR/VARCHAR lengths by the configured max bytes.
Added/updated unit + IT coverage and documented the new option (EN + ZH).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/TableCreateConfig.java` | Adds `unicodeCharMaxBytes` to table creation config with basic validation and config parsing. |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksUtils.java` | Plumbs the new setting through datatype mapping and updates `CHAR`/`VARCHAR` scaling logic. |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksMetadataApplier.java` | Ensures add/alter column schema evolution uses configured `unicodeCharMaxBytes`. |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksDataSinkOptions.java` | Defines the new `UNICODE_CHAR_MAX_BYTES` config option and description. |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksDataSinkFactory.java` | Exposes the new option as an optional sink option. |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksUtilsTest.java` | Adds unit test asserting 4-byte scaling outcome for `CHAR`/`VARCHAR`. |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/CdcDataTypeTransformerTest.java` | Adds unit tests to cover scaling/capping behavior with `unicode-char.max-bytes=4`. |
| `flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksMetadataApplierITCase.java` | Adds IT coverage for schema creation with `unicode-char.max-bytes=4` and allows injecting extra config. |
| `docs/content/docs/connectors/pipeline-connectors/starrocks.md` | Documents the new sink option in English docs. |
| `docs/content.zh/docs/connectors/pipeline-connectors/starrocks.md` | Documents the new sink option in Chinese docs. |

+    public TableCreateConfig(
+            @Nullable Integer numBuckets, Map<String, String> properties, int unicodeCharMaxBytes) {
+        Preconditions.checkArgument(
+                unicodeCharMaxBytes > 0,
+                "unicode-char.max-bytes must be positive, but actually is %s",
+                unicodeCharMaxBytes);
        this.numBuckets = numBuckets;
        this.properties = new HashMap<>(properties);
+        this.unicodeCharMaxBytes = unicodeCharMaxBytes;


+        StarRocksColumn nameColumn =
+                table.getColumns().stream()
+                        .filter(c -> c.getColumnName().equals("name"))
+                        .findFirst()
+                        .get();
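
The lookup in the test hunk above calls `Optional.get()` directly, which throws a bare `NoSuchElementException` if the column is missing. One common alternative, shown here only as a sketch, is to fail with a descriptive message. The `Column` class and `requireColumn` helper below are hypothetical stand-ins and do not mirror the connector's `StarRocksColumn` API.

```java
import java.util.List;

// Hypothetical sketch; not taken from the PR.
final class ColumnLookupSketch {

    /** Minimal stand-in for a column descriptor. */
    static final class Column {
        final String name;
        final String type;

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Returns the column with the given name or fails with a clear message. */
    static Column requireColumn(List<Column> columns, String name) {
        return columns.stream()
                .filter(c -> c.name.equals(name))
                .findFirst()
                .orElseThrow(
                        () -> new IllegalStateException("Column '" + name + "' not found"));
    }
}
```

In a test, this makes a missing column show up as an explicit failure message rather than an unexplained exception from the stream pipeline.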


leonardBang

Thanks @haruki-830 for the contribution. The code change looks good to me; I left comments about the docs.

Also, the mapping tables in both the English and Chinese docs still say the CHAR(n) threshold is n <= 85 and the output is n * 3, so we need to update them as well.
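
As a worked example of how that documented threshold depends on the multiplier, here is a small sketch. It assumes the threshold comes from a 255-byte CHAR limit, which is consistent with 85 * 3 = 255 but is not stated in this thread. The class and method names are hypothetical.

```java
// Hypothetical sketch; the 255-byte CHAR limit is an assumption.
final class CharThresholdSketch {

    static final int ASSUMED_MAX_CHAR_BYTES = 255;

    /** Largest upstream CHAR(n) length that still fits in a CHAR column of the assumed size. */
    static int maxLengthKeptAsChar(int unicodeCharMaxBytes) {
        if (unicodeCharMaxBytes <= 0) {
            throw new IllegalArgumentException("unicode-char.max-bytes must be positive");
        }
        // Integer division rounds down, so n * unicodeCharMaxBytes never exceeds the limit.
        return ASSUMED_MAX_CHAR_BYTES / unicodeCharMaxBytes;
    }

    public static void main(String[] args) {
        System.out.println("max-bytes=3 -> n <= " + maxLengthKeptAsChar(3));
        System.out.println("max-bytes=4 -> n <= " + maxLengthKeptAsChar(4));
    }
}
```

Under this assumption the threshold is 85 for the default of 3 and 63 when the option is set to 4, so a docs table that hard-codes 85 and n * 3 is only accurate for the default.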

leonardBang · 2026-06-22T11:39:12Z

+      <td>optional</td>
+      <td style="word-wrap: break-word;">3</td>
+      <td>Integer</td>
+      <td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths.</td>


Suggested change

<td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths.</td>

<td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths. The default value of 3 is retained for backward compatibility.</td>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

leonardBang · 2026-06-24T04:07:50Z

Thanks for the update. LGTM, let's wait for CI to go green.

[FLINK-39759][starrocks] Fix CHAR/VARCHAR mapping for utf8mb4 characters

5333ce8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added the docs and starrocks-pipeline-connector labels Jun 22, 2026

haruki-830 marked this pull request as ready for review June 22, 2026 02:38

leonardBang requested a review from Copilot June 22, 2026 09:09

Copilot AI reviewed Jun 22, 2026

Copilot started reviewing on behalf of leonardBang June 22, 2026 09:38

leonardBang reviewed Jun 22, 2026

address review comments

e4c9654

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

leonardBang approved these changes Jun 24, 2026

github-actions Bot added the approved and reviewed labels Jun 24, 2026

leonardBang merged commit 504a4b4 into apache:master Jun 24, 2026
24 checks passed

leonardBang merged 2 commits into apache:master from haruki-830:FLINK-39759

3 participants