[FLINK-39759][starrocks] Fix CHAR/VARCHAR mapping for utf8mb4 characters #4447
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Introduces a configurable unicode-char.max-bytes option to correctly map CDC CHAR/VARCHAR lengths (character count) into StarRocks byte-length semantics, with support across create/evolve paths and test/doc updates.
Changes:
- Added `unicode-char.max-bytes` sink option (default `3`) and wired it into `TableCreateConfig` and schema mapping.
- Updated StarRocks type transformation logic to scale `CHAR`/`VARCHAR` lengths by the configured max bytes.
- Added/updated unit + IT coverage and documented the new option (EN + ZH).
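The scaling described in the list above could look roughly like the following. This is a minimal sketch, not the connector's actual code: the class and method names are hypothetical, and the limits of 255 bytes for `CHAR` and 1048576 bytes for `VARCHAR` are assumptions about StarRocks rather than values quoted in this PR.

```java
// Hypothetical sketch of scaling character-count lengths into byte lengths.
class CharLengthMappingSketch {
    // Assumed StarRocks limits, in bytes (not taken from the PR text).
    static final int MAX_CHAR_SIZE = 255;
    static final int MAX_VARCHAR_SIZE = 1048576;

    // Maps an upstream CHAR(length), counted in characters, to a StarRocks type string.
    static String mapChar(int length, int unicodeCharMaxBytes) {
        long bytes = (long) length * unicodeCharMaxBytes;
        if (bytes <= MAX_CHAR_SIZE) {
            return "CHAR(" + bytes + ")";
        }
        // Too long for CHAR once scaled, so fall back to a (capped) VARCHAR.
        return "VARCHAR(" + Math.min(bytes, MAX_VARCHAR_SIZE) + ")";
    }

    // Maps an upstream VARCHAR(length) to a StarRocks VARCHAR, capped at the assumed maximum.
    static String mapVarchar(int length, int unicodeCharMaxBytes) {
        long bytes = (long) length * unicodeCharMaxBytes;
        return "VARCHAR(" + Math.min(bytes, MAX_VARCHAR_SIZE) + ")";
    }

    public static void main(String[] args) {
        System.out.println(mapChar(85, 3));         // CHAR(255)
        System.out.println(mapChar(85, 4));         // VARCHAR(340)
        System.out.println(mapVarchar(300000, 4));  // VARCHAR(1048576)
    }
}
```

Using `long` for the product avoids integer overflow when a very large upstream length is multiplied by the per-character byte count.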
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/TableCreateConfig.java | Adds unicodeCharMaxBytes to table creation config with basic validation and config parsing. |
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksUtils.java | Plumbs the new setting through datatype mapping and updates CHAR/VARCHAR scaling logic. |
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksMetadataApplier.java | Ensures add/alter column schema evolution uses configured unicodeCharMaxBytes. |
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksDataSinkOptions.java | Defines the new UNICODE_CHAR_MAX_BYTES config option and description. |
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/main/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksDataSinkFactory.java | Exposes the new option as an optional sink option. |
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksUtilsTest.java | Adds unit test asserting 4-byte scaling outcome for CHAR/VARCHAR. |
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/CdcDataTypeTransformerTest.java | Adds unit tests to cover scaling/capping behavior with unicode-char.max-bytes=4. |
| flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-starrocks/src/test/java/org/apache/flink/cdc/connectors/starrocks/sink/StarRocksMetadataApplierITCase.java | Adds IT coverage for schema creation with unicode-char.max-bytes=4 and allows injecting extra config. |
| docs/content/docs/connectors/pipeline-connectors/starrocks.md | Documents the new sink option in English docs. |
| docs/content.zh/docs/connectors/pipeline-connectors/starrocks.md | Documents the new sink option in Chinese docs. |
    public TableCreateConfig(
            @Nullable Integer numBuckets, Map<String, String> properties, int unicodeCharMaxBytes) {
        Preconditions.checkArgument(
                unicodeCharMaxBytes > 0,
                "unicode-char.max-bytes must be positive, but actually is %s",
                unicodeCharMaxBytes);
        this.numBuckets = numBuckets;
        this.properties = new HashMap<>(properties);
        this.unicodeCharMaxBytes = unicodeCharMaxBytes;
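For readers without the Flink CDC classes at hand, the validation in the hunk above can be approximated with plain Java. This is a hedged sketch only: the class name, the `parse` helper, and reading the option from a raw `Map<String, String>` are illustrative assumptions; the option name, the default of 3, and the error message come from this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the option parsing and validation; not the connector's API.
class UnicodeCharMaxBytesConfigSketch {
    static final String KEY = "unicode-char.max-bytes";
    static final int DEFAULT_MAX_BYTES = 3; // default stated in the PR

    // Reads the option from a raw string map, falling back to the default when absent.
    static int parse(Map<String, String> options) {
        String raw = options.get(KEY);
        int value = (raw == null) ? DEFAULT_MAX_BYTES : Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(
                    String.format(
                            "unicode-char.max-bytes must be positive, but actually is %s",
                            value));
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        System.out.println(parse(options)); // 3 (default)
        options.put(KEY, "4");
        System.out.println(parse(options)); // 4
    }
}
```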
    StarRocksColumn nameColumn =
            table.getColumns().stream()
                    .filter(c -> c.getColumnName().equals("name"))
                    .findFirst()
                    .get();
leonardBang left a comment
Thanks @haruki-830 for the contribution. The code change looks good to me; I left comments about the docs.
Also, in both the English and Chinese docs, the mapping table still says the CHAR(n) threshold is n <= 85 and the output is n * 3, so we need to update those as well.
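To make the threshold arithmetic concrete: n <= 85 with output n * 3 is consistent with a 255-byte ceiling on StarRocks `CHAR`, which is an assumption here rather than something stated in this thread. Under that assumption, the largest upstream `CHAR(n)` that still fits is the integer quotient of 255 and the configured bytes per character. The class and method names below are hypothetical.

```java
// Hypothetical helper showing how the documented CHAR(n) threshold depends on the option.
class CharThresholdSketch {
    static final int MAX_CHAR_SIZE = 255; // assumed StarRocks CHAR limit in bytes

    // Largest n for which n * unicodeCharMaxBytes still fits in a StarRocks CHAR.
    static int maxCharLength(int unicodeCharMaxBytes) {
        return MAX_CHAR_SIZE / unicodeCharMaxBytes;
    }

    public static void main(String[] args) {
        System.out.println(maxCharLength(3)); // 85
        System.out.println(maxCharLength(4)); // 63
    }
}
```

So with `unicode-char.max-bytes` set to 4, the mapping table's threshold would read n <= 63 and the output n * 4, if the assumed limit holds.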
    <td>optional</td>
    <td style="word-wrap: break-word;">3</td>
    <td>Integer</td>
    <td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths.</td>
Suggested change:

    - <td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths.</td>
    + <td>The maximum number of bytes allocated for each upstream character when mapping CHAR and VARCHAR types to StarRocks, whose length is measured in bytes. If the upstream source uses utf8mb4, set this option to 4 to avoid underestimating column lengths. The default value of 3 is retained for backward compatibility.</td>
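As an illustration of why 3 bytes per character can underestimate lengths for utf8mb4 data, the following sketch counts code points and UTF-8 bytes for a few sample strings using only the Java standard library. The class and helper names are hypothetical, and the samples are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: character count versus UTF-8 byte count.
class Utf8ByteLengthSketch {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "a";            // U+0061
        String cjk = "\u4e2d";         // U+4E2D, a CJK ideograph
        String emoji = "\uD83D\uDE00"; // U+1F600, outside the Basic Multilingual Plane

        System.out.println(codePoints(ascii) + " char, " + utf8Bytes(ascii) + " byte(s)"); // 1 char, 1 byte(s)
        System.out.println(codePoints(cjk) + " char, " + utf8Bytes(cjk) + " byte(s)");     // 1 char, 3 byte(s)
        System.out.println(codePoints(emoji) + " char, " + utf8Bytes(emoji) + " byte(s)"); // 1 char, 4 byte(s)
    }
}
```

A `CHAR(10)` value made of ten such emoji needs 40 bytes, which would not fit in a column sized at 10 * 3 = 30 bytes.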
Thanks for the update. LGTM; waiting for CI to turn green.
Summary
This commit fixes CHAR / VARCHAR mapping in the Flink CDC StarRocks connector for utf8mb4 characters by introducing a configurable unicode-char.max-bytes option.
Key Changes
Configurable Character Length Mapping
Schema Mapping and Evolution Support
Validation, Tests, and Docs
Configuration Example
Enable 4-byte character length mapping for utf8mb4 sources
Default behavior remains unchanged
JIRA Reference
https://issues.apache.org/jira/browse/FLINK-39759