Support Avro's decimal #69

Merged · 11 commits into master · Jun 20, 2023

Conversation

@aymkhalil (Collaborator) commented May 31, 2023

This patch adds support for an Avro-only logical type: decimal.

It enables conversion from other types, like the CQL decimal type available in Cassandra (C*) CDC.

For example, the user can use an Avro-decimal-aware sink with an upstream topic that carries a CQL decimal type like the following:

{
  "name": "longitude",
  "type": [
    "null",
    {
      "type": "record",
      "name": "cql_decimal",
      "namespace": "",
      "fields": [
        {
          "name": "bigint",
          "type": "bytes"
        },
        {
          "name": "scale",
          "type": "int"
        }
      ],
      "logicalType": "cql_decimal"
    }
  ]
}

This can be converted to Avro's decimal with the following conversion:

{
    "steps": [
        {
            "type": "compute",
            "fields": [
                {
                    "name": "value.longitude",
                    "expression": "fn:decimal(value.longitude.bigint, value.longitude.scale)",
                    "type": "DECIMAL"
                }
            ]
        }
    ]
}

The output schema of the conversion would look like this:

{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4, // would match the BigDecimal.precision that is retrievable after building the BigDecimal the takes bytes + scale as input (but not precision)
  "scale": 2 // would match the scale in the CQL decimal type
}
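
Concretely, both numbers come straight from the built BigDecimal. A small illustration (the value 13.34 is hypothetical, chosen to match the schema above):

import java.math.BigDecimal;

BigDecimal d = new BigDecimal("13.34");
d.precision(); // 4 -- matches "precision" in the schema
d.scale();     // 2 -- matches "scale" in the schema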

In order to support the conversion, a new function is registered with the following signature:

decimal(byte[] bytes, int scale) --> BigDecimal
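
Under the hood this maps naturally onto the BigDecimal(BigInteger, int) constructor. A minimal sketch of what such a function can do (illustrative, not the PR's verbatim code):

import java.math.BigDecimal;
import java.math.BigInteger;

// The bytes are the two's-complement big-endian encoding of the unscaled value,
// which is exactly what the BigInteger(byte[]) constructor expects.
static BigDecimal decimal(byte[] unscaledBytes, int scale) {
    return new BigDecimal(new BigInteger(unscaledBytes), scale);
}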

@eolivelli (Collaborator) left a comment

Good idea

I wonder if 'toDecimal' would work better than 'decimal'.

A second thought: the table probably has many columns of this type, so we could ALSO have a specific step, 'convert Cassandra CDC logical types', that does the conversion for all the fields.

It would help a lot because it would be a no-brainer and a simple checkbox in the UI.

@aymkhalil marked this pull request as ready for review May 31, 2023 07:49
@aymkhalil requested a review from cbornet May 31, 2023 07:49
@lhotari left a comment

LGTM. Amazing work @aymkhalil!

@aymkhalil (Collaborator, Author):

I wonder if 'toDecimal' would work better than 'decimal'.

Actually, I originally had it as toBigDecimal (though toDecimal is a better name). Then I realized we have toString, which we expose to the user as str, and we don't follow a to* convention, so I thought decimal was more in line with the convention. I'm open to both.

A second thought: the table probably has many columns of this type, so we could ALSO have a specific step, 'convert Cassandra CDC logical types', that does the conversion for all the fields.
It would help a lot because it would be a no-brainer and a simple checkbox in the UI.

Yeah, you are right, this conversion could be needed many times, resulting in duplicate steps. It is a "reasonable" workaround for the CDC issue specifically. I was avoiding adding too specific a conversion step, to keep the transformation generic enough. I actually thought of adding another step like compute (but called computeSchema) that would define a schema conversion rule and is not specific to a particular field. For example:

{
    "steps": [
        {
            "type": "schema", // indicates that this step is a schema transformation step. Could be named "computeSchema" to draw analogy with the compute step 
            "fields": [
                {
                    "name": "cql_decimal", // the name of the schema to be transformed
                    "type": "record", // type of the schema. "record" it indicates the fields names are nested
                    "output": { // the desired output schema, could be named outputSchema or newSchema to be more explicit 
                        "type": "bytes", 
                        "value": "bigint", // another way to referece those is to use dot notations like cql_decimal.bigint or record.bigint. This would leverage the Expression Language for completeness
                        "logicalType": "decimal",
                        "precision": 4, // use can hard code stuff directly 
                        "scale": "scale" // again, could be cq_decimal.scale or record.scale
                    }
                }
            ]
        }
    ]
}

but it turned out more complex than I thought; however, I can pursue this direction if we see value in it.

@cbornet what are your thoughts regarding both points above?

@aymkhalil (Collaborator, Author) commented May 31, 2023

I realized that in practice, BigDecimal values have variable precision/scale. So what will happen in this patch is: if the scale changes and we reuse the schema, validation will pass only if no rounding is required (https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Conversions.java#L122-L132); if we cache per field, then compatibility rules would apply.
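
To illustrate the failure mode (a sketch, not the Avro library source; Avro surfaces the underlying rounding error as an AvroTypeException):

import java.math.BigDecimal;
import java.math.RoundingMode;

int schemaScale = 3; // scale fixed by the first value written to the topic

// A value with a smaller scale rescales losslessly, so encoding passes:
new BigDecimal("1.3").setScale(schemaScale, RoundingMode.UNNECESSARY);     // ok -> 1.300

// A value with a larger scale cannot be rescaled without rounding, so encoding fails:
new BigDecimal("1.33499").setScale(schemaScale, RoundingMode.UNNECESSARY); // throws ArithmeticException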

I'm not sure how practical the changes in the PR would be. Yes, it conforms to Avro's logical type, but it may not be very useful to CDC customers in particular (they could still use the STRING conversion, which makes the JSTL function useful, but not necessarily the DECIMAL type itself).

@@ -239,6 +245,12 @@ protected byte[] coerceToBytes(Object value) {
    if (value instanceof byte[]) {
        return (byte[]) value;
    }
    if (value instanceof ByteBuffer) {
Collaborator:

When do we get ByteBuffer?

@aymkhalil (Collaborator, Author):

It comes from the custom CDC CQL type (the bigint part is encoded as a ByteBuffer), so I mimicked the same in the tests. The coerceToBytes conversion from ByteBuffer will kick in when expressions like value.cqlDecimalField.bigint are evaluated.
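
For reference, the ByteBuffer branch truncated in the hunk above typically looks something like this (a sketch under that assumption, not the PR's verbatim code):

import java.nio.ByteBuffer;

if (value instanceof ByteBuffer) {
    // copy the buffer's remaining bytes without disturbing its position
    ByteBuffer buffer = ((ByteBuffer) value).duplicate();
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    return bytes;
}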

Collaborator:

So this is only for tests? Can we instead use the type that we'll get from the Avro deserializer (probably byte[]?)

@aymkhalil (Collaborator, Author):

Not only for testing. If the byte[] was encoded as a ByteBuffer, the Avro deserializer will return a ByteBuffer even if the schema type is bytes.

This is how the CDC code converts it: https://github.com/datastax/cdc-apache-cassandra/blob/cd64bdd03af7a608687b6c54aa614021c62d8027/commons/src/main/java/com/datastax/oss/cdc/CqlLogicalTypes.java#L146

Collaborator:

I don't see how the info about which structure (byte[] or ByteBuffer) was used during serialization would end up in the Avro record...

@aymkhalil (Collaborator, Author):

I just realized the conversion from ByteBuffer has been removed from coerceToBytes and is now limited to coerceToBigInteger only. However, I'm not sure how Avro recognizes ByteBuffer under the hood.

@aymkhalil (Collaborator, Author):

@cbornet I addressed your feedback, PTAL.
However, I think the DECIMAL type (although compliant with Avro) is not practical. Here is why, in the context of CDC:

create table ks1.d2 (id int primary key, v1 decimal) with cdc=true;
insert into ks1.d2(id, v1) values (1, 1.334);

Assume the following transform is in place:

{
   "steps":[
      {
         "type":"compute",
         "fields":[
            {
               "name":"value.v2",
               "expression":"fn:decimal(value.v1.bigint, value.v1.scale)",
               "type":"DECIMAL"
            }
         ]
      }
   ]
}

This works fine and results in the following schema:

{
        "name": "v2",
        "type": [
          "null",
          {
            "type": "bytes",
            "logicalType": "decimal",
            "precision": 4,
            "scale": 3
          }
        ]
      }

Now, decreasing the scale works fine as well (without changing the schema on the topic):

insert into ks1.d2(id, v1) values (1, 1.3);

However, increasing the scale is problematic:

insert into ks1.d2(id, v1) values (1, 1.33499);

will fail with:

org.apache.avro.AvroTypeException: Cannot encode decimal with scale 5 as scale 3 without rounding in field v2

So basically, depending on the very first decimal that was received, the scale will limit all future decimal values, which is impractical. Otherwise, this patch is technically correct. Let me know WDYT.

@aymkhalil requested a review from cbornet June 1, 2023 22:49
@cbornet (Collaborator) commented Jun 19, 2023

So basically, depending on the very first decimal that was received, the scale will limit all future decimal values, which is impractical.

Is this because it sets the schema on the topic with the first scale received?

@aymkhalil (Collaborator, Author):

So basically, depending on the very first decimal that was received, the scale will limit all future decimal values, which is impractical.

Is this because it sets the schema on the topic with the first scale received?

Exactly.

@cbornet (Collaborator) commented Jun 20, 2023

LGTM.
We can rebase and merge.
As discussed, it would be useful to have a decimalFromNumberWithScale(decimal, scale) function to be able to fix the scale of the output in CDC, but it can be done in a subsequent PR.
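
A possible shape for that follow-up (purely illustrative; the function is only proposed at this point, so both the body and the rounding mode are assumptions):

import java.math.BigDecimal;
import java.math.RoundingMode;

// Coerce any incoming decimal to a fixed, schema-declared scale,
// rounding when necessary so the Avro encoding never rejects the value.
static BigDecimal decimalFromNumberWithScale(BigDecimal value, int scale) {
    return value.setScale(scale, RoundingMode.HALF_UP);
}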

@aymkhalil (Collaborator, Author):

LGTM. We can rebase and merge. As discussed, it would be useful to have a decimalFromNumberWithScale(decimal, scale) function to be able to fix the scale of the output in CDC, but it can be done in a subsequent PR.

Issue for tracking: #75

@aymkhalil merged commit ff3fa27 into master Jun 20, 2023
@cbornet deleted the avro-decimal branch June 24, 2023 14:24