
CC-4775: logical type preservation for hdfs connector when output format is parquet #1124

Merged
merged 1 commit into confluentinc:5.1.x from liukrimhrim:CC-4775 on Jun 12, 2019

Conversation

@liukrimhrim (Member) commented May 20, 2019

The root of the logical type loss lies in the old way the avro-converter set logical types: it used a prop named "logicalType" instead of setting the logical type directly. Even though Apache Avro can recognize such a prop as the logical type, Apache Parquet cannot.

To fix it, I added code that sets the logical type directly, while keeping the old code for backward compatibility.
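To illustrate the difference, here is a minimal sketch using the Apache Avro Java API (the class name and printed checks are illustrative, not code from this PR):

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class LogicalTypeDemo {
  public static void main(String[] args) {
    // Old way: record the logical type only as a string prop. Avro's JSON
    // parser can recover it when re-parsing, but a programmatically built
    // schema carries no LogicalType object, so Parquet sees plain bytes.
    Schema propOnly = Schema.create(Schema.Type.BYTES);
    propOnly.addProp("logicalType", "decimal");
    System.out.println(propOnly.getLogicalType()); // null

    // New way: attach the logical type directly via addToSchema.
    Schema direct = LogicalTypes.decimal(38, 2)
        .addToSchema(Schema.create(Schema.Type.BYTES));
    System.out.println(direct.getLogicalType().getName()); // decimal
  }
}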

@liukrimhrim liukrimhrim requested review from kkonstantine and confluentinc/connect May 20, 2019

@liukrimhrim liukrimhrim self-assigned this May 20, 2019

@ConfluentCLABot commented May 20, 2019

It looks like @liukrimhrim hasn't signed our Contributor License Agreement yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)

You can read and sign our full Contributor License Agreement here.

Once you've signed, reply with [clabot:check] to prove it.

Appreciation of efforts,

clabot

@liukrimhrim (Author, Member) commented May 20, 2019

[clabot:check]

@ConfluentCLABot commented May 20, 2019

@confluentinc It looks like @liukrimhrim just signed our Contributor License Agreement. 👍

Always at your service,

clabot

@kkonstantine (Member) left a comment

Thanks @liukrimhrim for this catch and the fix!
My main concern is around the assumptions regarding precision and scale.
Then, I think we definitely need unit tests that will test the new conversion explicitly (probably without the old way being asserted).

@@ -950,6 +956,24 @@ private static Object maybeWrapSchemaless(Schema schema, Object value, String ty
}
}

// the new and correct way to handle logical type

@kkonstantine (Member) commented May 20, 2019

I suggest moving the new, improved way to handle logical types above the older one.

// the new and correct way to handle logical type
if (schema.name() != null) {
  if (Decimal.LOGICAL_NAME.equalsIgnoreCase(schema.name())) {
    String precisionString = schema.parameters().get(CONNECT_AVRO_DECIMAL_PRECISION_PROP);

@kkonstantine (Member) commented May 20, 2019

If precision and scale are not there, we can't assume default values, which could potentially lead to data loss.
We'll need to fall back to the current behavior, which treats these types as bytes. Please add a comment explaining the assumption when precision or scale is not present.

@liukrimhrim (Author, Member) commented May 21, 2019

According to Apache Avro, if the scale is not specified, the default value is 0. Does this mean I can use the default value for scale?
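For reference, a quick sketch of that default using Avro's Java API (the class name is just for illustration):

import org.apache.avro.LogicalTypes;

public class DefaultScaleDemo {
  public static void main(String[] args) {
    // Per the Avro spec, a decimal logical type without an explicit
    // scale attribute is treated as having scale 0.
    LogicalTypes.Decimal dec = LogicalTypes.decimal(10);
    System.out.println(dec.getScale()); // prints 0
  }
}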

@kkonstantine (Member) commented May 21, 2019

I'd suggest leaving it unspecified if we can, to match the input exactly.

@@ -933,6 +933,12 @@ private static Object maybeWrapSchemaless(Schema schema, Object value, String ty
}
}

// The code block below was the older way to handle logical type, but in fact,

@kkonstantine (Member) commented May 20, 2019

How about making this simpler?
A suggestion:

// Initially, to add support for logical types a new property was added with key `logicalType`.
// This enabled logical types for avro schemas but not others, such as parquet.
// The use of `addToSchema` above supersedes this method here, which should eventually be removed.
// Keeping for backwards compatibility until a major version upgrade happens.
@kkonstantine (Member) commented May 20, 2019

Also, fyi: the build revealed an issue caught by spotbugs that needs fixing.

@liukrimhrim (Author, Member) commented May 22, 2019

I did three things in this commit:

  1. Now, for a Connect Decimal to be mapped to an Avro decimal, the precision has to be specified; otherwise, the Connect Decimal is mapped to Avro bytes (see the sketch after this list).
  2. Previously, when converting an Avro schema to a Connect schema, AvroData would throw an exception if the scale was not specified. Now it uses the default value 0 instead; this default is per the Apache Avro documentation.
  3. Added several unit tests that pass only when the new way of handling logical types is used.
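A minimal sketch of the semantics in item 1, assuming the precision travels as a Connect schema parameter (PRECISION_PROP here is an illustrative stand-in; AvroData defines its own constant, CONNECT_AVRO_DECIMAL_PRECISION_PROP):

import java.util.Map;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.kafka.connect.data.Decimal;

public class DecimalMappingSketch {
  // illustrative stand-in for AvroData's precision parameter key
  static final String PRECISION_PROP = "connect.decimal.precision";

  static Schema toAvro(org.apache.kafka.connect.data.Schema connectSchema) {
    Schema bytes = Schema.create(Schema.Type.BYTES);
    Map<String, String> params = connectSchema.parameters();
    if (Decimal.LOGICAL_NAME.equals(connectSchema.name())
        && params != null && params.containsKey(PRECISION_PROP)) {
      // precision is present: attach the decimal logical type directly
      int precision = Integer.parseInt(params.get(PRECISION_PROP));
      int scale = Integer.parseInt(params.get(Decimal.SCALE_FIELD));
      return LogicalTypes.decimal(precision, scale).addToSchema(bytes);
    }
    return bytes; // no precision: fall back to plain Avro bytes
  }
}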
@liukrimhrim (Author, Member) commented May 22, 2019

In this commit, I rolled back the second item from the previous comment.

@kkonstantine (Member) left a comment

Thanks @liukrimhrim for addressing my initial comments.
Took another pass and have a few follow-up comments.

}
// if the precision is not specified, we should not assume a default value,

@kkonstantine (Member) commented Jun 10, 2019

We should not retain commented-out code.

@@ -491,6 +491,9 @@ private static Object fromConnectData(
requireContainer);

case BYTES: {
if (value instanceof BigDecimal) {

@kkonstantine (Member) commented Jun 10, 2019

Why are we adding this conversion?

@liukrimhrim (Author, Member) commented Jun 10, 2019

Because we are now enforcing the semantics that the decimal type is preserved if and only if the precision is specified; otherwise it is converted to bytes. As a result, Connect data whose type is Decimal without a precision ends up with an Avro schema of type bytes after conversion. So in fromConnectData(), even though the schema type is bytes, the value to be converted may still be a BigDecimal. I added this conversion because I actually encountered a bug caused by this.
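A minimal sketch of that conversion, assuming the unscaled-bytes encoding used by Connect's Decimal.fromLogical (the helper and class names are illustrative):

import java.math.BigDecimal;
import java.nio.ByteBuffer;

final class BytesCaseSketch {
  // When the schema says bytes but the value is still a BigDecimal
  // (a Decimal that lost its logical type for lack of precision),
  // emit the unscaled value as two's-complement bytes.
  static ByteBuffer toBytes(Object value) {
    if (value instanceof BigDecimal) {
      return ByteBuffer.wrap(((BigDecimal) value).unscaledValue().toByteArray());
    }
    if (value instanceof byte[]) {
      return ByteBuffer.wrap((byte[]) value);
    }
    return (ByteBuffer) value;
  }
}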

// Only Avro named types (record, enum, fixed) may contain namespace + name. Only Connect's
// struct converts to one of those (record), so for everything else that has a name we store
// the full name into a special property. For uniformity, we also duplicate this info into
// the same field in records as well even though it will also be available in the namespace()
// and name().
if (schema.name() != null) {
if (Decimal.LOGICAL_NAME.equalsIgnoreCase(schema.name())) {
if (Decimal.LOGICAL_NAME.equalsIgnoreCase(schema.name())

@kkonstantine (Member) commented Jun 10, 2019

Why are we changing this in the old conversion type?

@liukrimhrim (Author, Member) commented Jun 10, 2019

Because we now require that the decimal logical type is preserved only when the precision is specified. So, even in the old path, we might want to check whether the precision is specified.

@kkonstantine (Member) commented Jun 12, 2019

I'd like us to leave the old logic unchanged in this PR.

@kkonstantine (Member) commented Jun 12, 2019

On second thought, let's keep them in sync with the block above.

// which should eventually be removed.
// Keeping for backwards compatibility until a major version upgrade happens.

// OLD COMMENT

@kkonstantine (Member) commented Jun 10, 2019

Let's not use all caps in comments, because it's not clear what that means.
Suggestion:

// Below follows the older method of supporting logical types via properties. It is retained for now and will be deprecated eventually. 
@kkonstantine (Member) commented Jun 10, 2019

Also, there's a test failure, that it'd be great to investigate and see if a rebase on top of recent changes fixes it. Thanks!

@liukrimhrim liukrimhrim force-pushed the liukrimhrim:CC-4775 branch from 9ce639d to df00c95 Jun 11, 2019

@kkonstantine kkonstantine force-pushed the liukrimhrim:CC-4775 branch from df00c95 to 29b0d38 Jun 11, 2019


// Below follows the older method of supporting logical types via properties.
// It is retained for now and will be deprecated eventually.
//

@kkonstantine (Member) commented Jun 11, 2019

nit: let's remove this empty comment line

@liukrimhrim liukrimhrim force-pushed the liukrimhrim:CC-4775 branch from 29b0d38 to 06d8a83 Jun 11, 2019

@liukrimhrim liukrimhrim changed the base branch from 5.2.x to 5.1.x Jun 11, 2019

@liukrimhrim liukrimhrim force-pushed the liukrimhrim:CC-4775 branch from 06d8a83 to f852fbd Jun 11, 2019

baseSchema
    .addProp(AVRO_LOGICAL_DECIMAL_PRECISION_PROP,
        new IntNode(CONNECT_AVRO_DECIMAL_PRECISION_DEFAULT));
baseSchema.addProp(AVRO_LOGICAL_DECIMAL_PRECISION_PROP,

@kkonstantine (Member) commented Jun 12, 2019

Seems like we are introducing an artificial change; this code is practically unchanged.

@kkonstantine (Member) left a comment

LGTM!
Thanks @liukrimhrim!

@liukrimhrim liukrimhrim force-pushed the liukrimhrim:CC-4775 branch from f852fbd to e852859 Jun 12, 2019

@liukrimhrim liukrimhrim merged commit b33f0e4 into confluentinc:5.1.x Jun 12, 2019

1 check passed

continuous-integration/jenkins/pr-merge This commit looks good