
changefeedccl: fix decimal writes into parquet #145248

Draft
KeithCh wants to merge 1 commit into cockroachdb:master from KeithCh:parquet-decimal

Conversation

@KeithCh
Contributor

@KeithCh KeithCh commented Apr 25, 2025

Previously we wrote decimals into parquet as
string bytes. This caused issues when being read
since the readers expect the bytes to be a valid
decimal representation.

Epic: CRDB-41784
Fixes: #130909
Release note (bug fix): Fix decimal writes into parquet files from
changefeeds and exports.

@cockroach-teamcity
Member

This change is Reviewable

@KeithCh KeithCh force-pushed the parquet-decimal branch 2 times, most recently from a4c6470 to 298d3b7 Compare April 25, 2025 17:28
@KeithCh KeithCh changed the title changefeedccl: Fixed decimal writes into parquet changefeedccl: fix decimal writes into parquet Apr 25, 2025
@KeithCh KeithCh requested a review from Copilot April 25, 2025 17:29
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes the encoding of decimal values in parquet files by switching from string bytes to a proper twos-complement byte representation. Key changes include:

  • Expanding the decimal test case in writer_test.go to cover a broader range of decimals.
  • Introducing new helper functions (twosComplement and formatDecimal) in write_functions.go for proper decimal encoding.
  • Updating the decimal decoding logic in decoders.go to convert from the twos-complement format.

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

File | Description
pkg/util/parquet/writer_test.go | Updated the test case for decimals by adding more decimal columns and test values.
pkg/util/parquet/write_functions.go | Added functions to compute the twos-complement and format decimals correctly.
pkg/util/parquet/testutils.go | Extended datum validation to include comparisons of decimal coefficients and sign.
pkg/util/parquet/decoders.go | Revised decimal decoding to convert twos-complement bytes to decimal values.

Files not reviewed (1)
  • pkg/util/parquet/BUILD.bazel: Language not supported
Comments suppressed due to low confidence (1)

pkg/util/parquet/writer_test.go:240

  • [nitpick] The updated test values for decimals now use finite numbers instead of '-inf' and 'nan'. Please verify that these finite values provide sufficient coverage of edge cases previously intended.
if datums[1], err = tree.ParseDDecimal("1.222"); err != nil {

@cockroachdb cockroachdb deleted a comment from Copilot AI Apr 25, 2025
@KeithCh KeithCh requested review from a team and rharding6373 and removed request for a team April 25, 2025 17:32
@KeithCh KeithCh added backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.1.x backport-25.2.x Flags PRs that need to be backported to 25.2 labels Apr 25, 2025
@KeithCh KeithCh marked this pull request as draft April 25, 2025 19:32
@KeithCh KeithCh force-pushed the parquet-decimal branch 4 times, most recently from c223a07 to e7a2847 Compare April 28, 2025 15:18
@KeithCh KeithCh marked this pull request as ready for review April 28, 2025 15:19
@KeithCh KeithCh force-pushed the parquet-decimal branch from e7a2847 to 0394463 Compare May 1, 2025 16:29
@KeithCh KeithCh requested a review from a team as a code owner May 1, 2025 16:29
@KeithCh KeithCh requested a review from a team May 1, 2025 16:29
@KeithCh KeithCh marked this pull request as draft May 1, 2025 16:30
@asg0451 asg0451 self-requested a review May 1, 2025 16:46
@KeithCh KeithCh force-pushed the parquet-decimal branch from 0394463 to 9061664 Compare May 1, 2025 18:36
@KeithCh
Contributor Author

KeithCh commented May 1, 2025

This currently works for 95% of cases, with the following caveats:

  • We currently convert every decimal column into a (decimal, string) tuple. This means we cannot handle decimals that are already inside a tuple, because we do not support writing nested tuples to parquet, e.g. (decimal, int) -> ((decimal, string), int).
  • Exports to parquet panic when they encounter non-finite values such as inf and -inf. Changefeeds work fine.

Previously we wrote decimals into parquet as
string bytes. This caused issues when being read
since the readers expect the bytes to be a valid
decimal representation. We now write all decimal
columns in parquet as a (decimal, string) tuple.
The decimal value is null if it is not a finite
decimal (e.g. NaN, Inf). The string value is null
if the decimal value is finite.
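
The null rules in this commit message can be sketched as below. This is only an illustration, not the PR's code: the types `decimalValue` and `decimalTuple` and the function `toTuple` are hypothetical stand-ins, and the finite member is kept as text here rather than as encoded decimal bytes.

```go
package main

import "fmt"

// decimalValue is a hypothetical stand-in for a SQL decimal datum.
type decimalValue struct {
	text   string // textual form, e.g. "1.222", "NaN", "-Infinity"
	finite bool   // false for NaN and the infinities
}

// decimalTuple is a hypothetical stand-in for the (decimal, string) tuple;
// a nil pointer represents a NULL member.
type decimalTuple struct {
	decimal *string
	str     *string
}

// toTuple populates exactly one member: the decimal member for finite
// values, the string member for non-finite ones.
func toTuple(d decimalValue) decimalTuple {
	text := d.text
	if d.finite {
		return decimalTuple{decimal: &text}
	}
	return decimalTuple{str: &text}
}

// show renders a nullable member for printing.
func show(p *string) string {
	if p == nil {
		return "NULL"
	}
	return *p
}

func main() {
	inputs := []decimalValue{{"1.222", true}, {"NaN", false}, {"-Infinity", false}}
	for _, d := range inputs {
		t := toTuple(d)
		fmt.Printf("%-10s -> (%s, %s)\n", d.text, show(t.decimal), show(t.str))
	}
}
```

A reader can then treat a non-null string member as the signal that the value was not representable as a finite Parquet decimal.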

Epic: CRDB-41784
Fixes: cockroachdb#130909
Release note (bug fix): Fix decimal writes into parquet files from
changefeeds and exports.
@KeithCh KeithCh force-pushed the parquet-decimal branch from 9061664 to c68ef59 Compare May 2, 2025 18:49
@asg0451 asg0451 removed their request for review May 8, 2025 22:08
@rharding6373 rharding6373 removed their request for review January 6, 2026 17:25

Labels

  • backport-23.2.x: PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only
  • backport-24.1.x: Flags PRs that need to be backported to 24.1.
  • backport-24.3.x: Flags PRs that need to be backported to 24.3
  • backport-25.2.x: Flags PRs that need to be backported to 25.2

Development

Successfully merging this pull request may close these issues.

cdc: decimal types are written incorrectly in parquet format

4 participants