[PIP] PIP-173 : Create a built-in Function implementing the most common basic transformations #15902

Closed
cbornet opened this issue Jun 2, 2022 · 9 comments

Comments

@cbornet
Contributor

cbornet commented Jun 2, 2022

Motivation

Currently, when users want to modify the data in Pulsar, they need to write a Function.
For a lot of use cases, it would be handy for them to be able to use a ready-made built-in Function that implements the most common basic transformations like the ones available in Kafka Connect’s SMTs.
This relieves users of the burden of writing the Function themselves, of having to understand the intricacies of Pulsar Schemas, and of coding in a language they may not have mastered (probably Java if they want to do advanced things). They also benefit from battle-tested, maintained, performance-optimised code.

Goal

This PIP is about providing a TransformFunction that executes a sequence of basic transformations on the data.
The TransformFunction shall be easy to configure and launchable as a built-in NAR.
The TransformFunction shall be able to apply a sequence of common transformations in-memory so we don’t need to execute the TransformFunction multiple times and read/write to a topic each time.

This PIP is not about appending such a Function to a Source or a Sink.
That is the ultimate goal, since it would provide an experience similar to Kafka SMTs and avoid a read/write to a topic, but that work will be done in a future PIP.
It is expected that the code written for this PIP will be reusable in this future work.

API Changes

This PIP will introduce a new transform module in the pulsar-function multi-module project.
The produced artifact will be a NAR of the TransformFunction.

Implementation

When it processes a record, TransformFunction will:


  • Call in sequence the process method of a series of TransformStep implementations.
    Each TransformStep will modify the output message and topic as needed.

  • Send the transformed message to the output topic.
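
The two steps above can be sketched roughly as follows. This is only an illustration of the "run all steps in memory, then send once" idea, not the code of the PIP: `TransformState`, `TransformPipeline`, and the shape of the `TransformStep` interface are assumptions made for this sketch, and no Pulsar classes are used.

```java
import java.util.List;

// Hypothetical holder for what the steps may modify (not a Pulsar type).
final class TransformState {
    Object key;
    Object value;
    String outputTopic;

    TransformState(Object key, Object value, String outputTopic) {
        this.key = key;
        this.value = value;
        this.outputTopic = outputTopic;
    }
}

// Assumed step contract: each step mutates the state in place.
interface TransformStep {
    void process(TransformState state);
}

// Applies every step in order without touching a topic in between.
final class TransformPipeline {
    private final List<TransformStep> steps;

    TransformPipeline(List<TransformStep> steps) {
        this.steps = List.copyOf(steps);
    }

    TransformState apply(TransformState state) {
        for (TransformStep step : steps) {
            step.process(state);
        }
        // A real function would now send state.value to state.outputTopic.
        return state;
    }
}

final class PipelineDemo {
    public static void main(String[] args) {
        TransformPipeline pipeline = new TransformPipeline(List.of(
                state -> state.value = state.value.toString().toUpperCase(),
                state -> state.outputTopic = "upper-" + state.outputTopic));
        TransformState out = pipeline.apply(new TransformState(null, "hello", "events"));
        System.out.println(out.outputTopic + " <- " + out.value);
    }
}
```

Keeping the steps behind one small interface is what would let the same chain be reused later next to a Source or a Sink, as mentioned in the Goal section.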


The TransformFunction will read its configuration as JSON from userConfig in the format:

{
  "steps": [
    {
      "type": "drop-fields", "fields": "keyField1,keyField2", "part": "key"
    },
    {
      "type": "merge-key-value"
    },
    {
      "type": "unwrap-key-value"
    },
    {
      "type": "cast", "schema-type": "STRING"
    }
  ]
}

Each step is defined by its type and uses its own arguments.
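
As an illustration of "defined by its type and uses its own arguments", here is a minimal sketch of how such a config could be validated once it has been decoded into a `Map<String, Object>`. The `StepConfigParser` and `StepSpec` names are hypothetical, and the required/optional status of each argument is an assumption based only on the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class StepConfigParser {
    // Hypothetical parsed form of one entry of "steps".
    static final class StepSpec {
        final String type;
        final List<String> fields;  // only used by drop-fields
        final String part;          // "key" in the example; may be null
        final String schemaType;    // only used by cast

        StepSpec(String type, List<String> fields, String part, String schemaType) {
            this.type = type;
            this.fields = fields;
            this.part = part;
            this.schemaType = schemaType;
        }
    }

    @SuppressWarnings("unchecked")
    static List<StepSpec> parse(Map<String, Object> userConfig) {
        Object raw = userConfig.get("steps");
        if (!(raw instanceof List)) {
            throw new IllegalArgumentException("userConfig must contain a 'steps' array");
        }
        List<StepSpec> specs = new ArrayList<>();
        for (Object item : (List<?>) raw) {
            Map<String, Object> step = (Map<String, Object>) item;
            String type = (String) step.get("type");
            if (type == null) {
                throw new IllegalArgumentException("Every step needs a 'type'");
            }
            switch (type) {
                case "drop-fields": {
                    String fields = (String) step.get("fields");
                    if (fields == null) {
                        throw new IllegalArgumentException("drop-fields needs 'fields'");
                    }
                    specs.add(new StepSpec(type, Arrays.asList(fields.split(",")),
                            (String) step.get("part"), null));
                    break;
                }
                case "merge-key-value":
                case "unwrap-key-value":
                    // These two steps take no arguments in the example.
                    specs.add(new StepSpec(type, List.of(), null, null));
                    break;
                case "cast": {
                    String schemaType = (String) step.get("schema-type");
                    if (schemaType == null) {
                        throw new IllegalArgumentException("cast needs 'schema-type'");
                    }
                    specs.add(new StepSpec(type, List.of(), null, schemaType));
                    break;
                }
                default:
                    throw new IllegalArgumentException("Unknown step type: " + type);
            }
        }
        return specs;
    }
}
```

Failing fast on an unknown type or a missing argument means a misconfigured function is rejected at start-up rather than on the first record.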

This example config applied on a KeyValue<AVRO, AVRO> input record with value {key={keyField1: key1, keyField2: key2, keyField3: key3}, value={valueField1: value1, valueField2: value2, valueField3: value3}} will give after each step:

{key={keyField1: key1, keyField2: key2, keyField3: key3}, value={valueField1: value1, valueField2: value2, valueField3: value3}} (KeyValue<AVRO, AVRO>)
           |
           | "type": "drop-fields", "fields": "keyField1,keyField2", "part": "key"
           |
{key={keyField3: key3}, value={valueField1: value1, valueField2: value2, valueField3: value3}} (KeyValue<AVRO, AVRO>)
           |
           | "type": "merge-key-value"
           |
{key={keyField3: key3}, value={keyField3: key3, valueField1: value1, valueField2: value2, valueField3: value3}} (KeyValue<AVRO, AVRO>)
           |
           | "type": "unwrap-key-value"
           |
{keyField3: key3, valueField1: value1, valueField2: value2, valueField3: value3} (AVRO)
           |
           | "type": "cast", "schema-type": "STRING"
           |
{"keyField3": "key3", "valueField1": "value1", "valueField2": "value2", "valueField3": "value3"} (STRING)
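
The trace above can be reproduced with plain maps, which may help when reasoning about step order. This is only a sketch: real records would be AVRO with schemas, `TransformTrace` and its methods are hypothetical, the string rendering does no escaping, and the rule that value fields win on a name clash during merge is an assumption the PIP does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class TransformTrace {
    // "drop-fields": copy of the record without the listed fields.
    static Map<String, String> dropFields(Map<String, String> rec, Set<String> fields) {
        Map<String, String> out = new LinkedHashMap<>(rec);
        out.keySet().removeAll(fields);
        return out;
    }

    // "merge-key-value": key fields first, then value fields
    // (assumption: a value field overwrites a key field with the same name).
    static Map<String, String> mergeKeyValue(Map<String, String> key, Map<String, String> value) {
        Map<String, String> out = new LinkedHashMap<>(key);
        out.putAll(value);
        return out;
    }

    // "cast" to STRING: naive JSON-like rendering, no escaping (demo only).
    static String castToString(Map<String, String> rec) {
        return rec.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\": \"" + e.getValue() + "\"")
                .collect(Collectors.joining(", ", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, String> key = new LinkedHashMap<>();
        key.put("keyField1", "key1");
        key.put("keyField2", "key2");
        key.put("keyField3", "key3");
        Map<String, String> value = new LinkedHashMap<>();
        value.put("valueField1", "value1");
        value.put("valueField2", "value2");
        value.put("valueField3", "value3");

        Map<String, String> trimmedKey = dropFields(key, Set.of("keyField1", "keyField2"));
        Map<String, String> mergedValue = mergeKeyValue(trimmedKey, value);
        // "unwrap-key-value": from here on only the value part is kept.
        System.out.println(castToString(mergedValue));
    }
}
```

Note that the order matters: running merge-key-value before drop-fields on the key would copy keyField1 and keyField2 into the value.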

TransformFunction will be built as a NAR including a pulsar-io.yaml service file so it can be registered as a built-in function with name transform.

Rejected Alternatives

Create a separate third party project not managed by the Pulsar community.
Problems:

  • it won't be easily available to all Pulsar users
  • it would be hard to guarantee compatibility with many Pulsar versions, and the Transformations will use many advanced features of Pulsar APIs
@eolivelli
Contributor

I would not go too far into implementation details such as TransformContext in the PIP.

I would only cite the steps, the configurations and the first operations that will be available.
The TransformContext is very hard to understand without the PR and, actually, it is an implementation detail, hidden from users, that we will change in the future as needed.

I would add to "Rejected Alternatives":

Create a separate third party project not managed by the Pulsar community.
Problems:
it won't be easily available to all Pulsar users
it would be hard to guarantee compatibility with many Pulsar versions, and the Transformations will use many advanced features of Pulsar APIs

@cbornet
Contributor Author

cbornet commented Jun 3, 2022

Thanks Enrico. I did the updates.

@nlu90
Member

nlu90 commented Jun 8, 2022

I would suggest having some concrete implementations of the TransformFunction in a separate repo first instead of starting right in the main pulsar project.

@cbornet
Contributor Author

cbornet commented Jun 9, 2022

I would suggest having some concrete implementations of the TransformFunction in a separate repo first instead of starting right in the main pulsar project.

@nlu90 see the comment from @eolivelli . We've put this in "Rejected alternatives" for the reasons that:

  • it won't be easily available to all Pulsar users
  • it would be hard to guarantee compatibility with many Pulsar versions, and the Transformations will use many advanced features of Pulsar APIs

@nicoloboschi
Contributor

@nlu90 The transformations will be well tested in the codebase. The use cases did not come out of the blue (in that case you might wonder whether they add value for users); they are inspired by Kafka/Confluent Platform.

The very good thing about this proposal is that the transformations are built in, but they won't add weight for users who do not want to use them.

@dave2wave
Member

@nlu90 Would you like to have a conversation with @cbornet to discuss the performance motivations for this change?

@nlu90
Member

nlu90 commented Jun 13, 2022

@dave2wave @cbornet @nicoloboschi

Thanks for your replies. I really would like to hear more about use cases and performance motivations to better understand it.

@github-actions

The issue had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jul 14, 2022
@eolivelli
Contributor

This PIP was not accepted.

The work has been posted here:
https://github.com/datastax/pulsar-transformations
It is available to anyone and is Apache 2 licensed.
