Adding upsert functionality by devinbost · Pull Request #4012 · apache/pulsar

devinbost · 2019-04-09T18:55:23Z

Motivation

Upsert functionality is intended to simplify continuous deployment processes. This simplification is especially desirable because there is no Python Admin API available that provides typed objects. Rather, users are required to build workflows with the Admin CLI, which returns strings that can be messy to handle. Upsert helps solve this complexity by allowing Pulsar to handle behavior according to whether the component has already been created or not.

Motivation (with future in mind)

Upsert is a dependency of BulkUpsert (to be added), which will greatly simplify seamless continuous deployments by allowing Pulsar to rapidly update Functions, Sinks, and Sources (and leverage parallelization). BulkUpsert will enable automated deployments to handle all but deletes on components that are no longer used. (For that, functionality will need to be added to allow the user to easily compare their expected component tree with the component tree in Pulsar to identify components that need to be deleted.)

Modifications

My contributions add Upsert functionality for Functions, Sinks, and Sources to the REST API and to the Admin CLI. Upsert conditionally combines the functionality of Create and Update operations. If the component already exists in Pulsar, Upsert updates the component; if the component doesn't already exist in Pulsar, Upsert creates the component.

Commits 29e2a66 and 7749419 were just to get the latest changes from the Pulsar repo.
My primary code is in commit 441ba03, and the unit tests are in commit c58a122.

Verifying this change

This commit is largely covered already by unit tests for registration and updates for Functions, Sinks, and Sources. However, additional unit tests have been added for Upsert specifically. Tests were added to:

org.apache.pulsar.functions.worker/rest.api/FunctionApiV3ResourceTest.java
org.apache.pulsar.functions.worker/rest.api/SinkApiV3ResourceTest.java
org.apache.pulsar.functions.worker/rest.api/SourceApiV3ResourceTest.java
org.apache.pulsar.functions.worker/FunctionMetaDataManagerTest.java

When running the build tests, I did notice these unrelated test failures:

[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 13.506 s <<< FAILURE! - in org.apache.pulsar.broker.service.RackAwareTest
[ERROR] testPlacement(org.apache.pulsar.broker.service.RackAwareTest) Time elapsed: 1.085 s <<< FAILURE!
java.lang.AssertionError: first bookie in rack 0 not included in ensemble expected [true] but found [false]

[ERROR] Tests run: 13, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 59.026 s <<< FAILURE! - in org.apache.pulsar.io.PulsarFunctionE2ETest
[ERROR] testPulsarSinkStatsWithUrl(org.apache.pulsar.io.PulsarFunctionE2ETest) Time elapsed: 0.614 s <<< FAILURE!
java.lang.NullPointerException
at org.apache.pulsar.io.PulsarFunctionE2ETest.testPulsarSinkStats(PulsarFunctionE2ETest.java:527)
at org.apache.pulsar.io.PulsarFunctionE2ETest.testPulsarSinkStatsWithUrl(PulsarFunctionE2ETest.java:708)

It appears that these test failures are unrelated to any changes that I made to this repo. However, I ran these tests before merging in commit 29e2a66. (Sorry if that's confusing. I was working in a different directory.)

This pull request affects:

The public API: (yes)
The rest endpoints: (yes)
The admin cli options: (yes)
Anything that affects deployment: (don't know)

Documentation

Does this pull request introduce a new feature? (yes)
If yes, how is the feature documented? (not documented YET). (Please provide guidance on where it needs to be documented. There are 930 markdown (.md) files in this repo...)

devinbost · 2019-04-09T19:45:28Z

It looks like there were some failing tests that I didn't see locally, so I'll work on fixing those now. However, I'd still like some guidance on where to update documentation for this feature.

jerrypeng

I would first like to thank @devinbost for actively participating in improving pulsar functions!

I have some concerns about the implementation and design of this upsert functionality, specifically that it increases the complexity of the core function code.

Why does this need to be part of the REST API? Can't users create their own wrapper libraries to do this logic outside of Pulsar? Something like:

if (function exists) {
   updateFunction()
} else {
   createFunction()
}

Is this inconvenient enough for users to warrant an additional API endpoint for doing this?

We can even add this functionality into the pulsar-admin CLI.
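
A client-side wrapper of this kind could look roughly like the sketch below. It is only an illustration: `ComponentAdmin`, `InMemoryAdmin`, and `ClientSideUpsert` are hypothetical names invented here, not part of the Pulsar admin API, and the in-memory implementation exists only so the wrapper can be exercised without a cluster.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an admin client; not part of the Pulsar API. */
interface ComponentAdmin {
    boolean exists(String name);
    void create(String name, String config);
    void update(String name, String config);
}

/** In-memory implementation used only to exercise the wrapper. */
final class InMemoryAdmin implements ComponentAdmin {
    final Map<String, String> store = new HashMap<>();

    public boolean exists(String name) {
        return store.containsKey(name);
    }

    public void create(String name, String config) {
        if (store.containsKey(name)) {
            throw new IllegalStateException(name + " already exists");
        }
        store.put(name, config);
    }

    public void update(String name, String config) {
        if (!store.containsKey(name)) {
            throw new IllegalStateException(name + " does not exist");
        }
        store.put(name, config);
    }
}

final class ClientSideUpsert {
    /** Updates the component if it exists, otherwise creates it; returns the action taken. */
    static String upsert(ComponentAdmin admin, String name, String config) {
        if (admin.exists(name)) {
            admin.update(name, config);
            return "updated";
        }
        admin.create(name, config);
        return "created";
    }
}
```

Calling `ClientSideUpsert.upsert(admin, "func1", "v1")` twice with different configs creates the component on the first call and updates it on the second. Note that the check and the write are two separate calls, so another client could change the component in between.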

Even if we add a REST API endpoint for upserts, can't we just use the above logic, i.e. a combination of existing APIs, to implement the functionality? Do we need to modify core classes like FunctionRuntimeManager and FunctionMetaDataManager to do so? I would rather not unnecessarily increase the complexity of those classes.

devinbost · 2019-04-09T23:07:43Z

@jerrypeng Thank you for asking these questions. I will explain my reasoning behind these changes and then answer your questions.

The Goal

My primary goal is to improve the continuous deployment process for Pulsar environments operating at-scale. In large projects where multiple teams are collaborating to develop functions, sinks, and sources, we need an automated deployment process that allows us to quickly and easily update a production Pulsar environment without downtime for cases where we may have hundreds or thousands of inter-dependent Pulsar functions, sinks, and sources that need to be deployed simultaneously. My hope is to allow Pulsar to handle more of the deployment logic to simplify the implementation of Pulsar in large/high-performance production environments.

The Problem

Because our pulsar-admin commands are dockerized to allow them to operate at scale, there is performance overhead with every pulsar-admin command that must be executed. Each pulsar-admin command takes approximately 3-5 seconds to execute. In a deployment with 300 Pulsar functions, if each pulsar-admin command must be executed in series (rather than in parallel), executing 300 pulsar-admin commands to update these objects takes 15-25 minutes. Because Pulsar is in a broken state while these commands are being executed, this deployment approach could result in a production Pulsar environment being down for 15-25 minutes, far beyond our SLA of 300 milliseconds of downtime. Furthermore, if we need to execute a get-status check before each update command (so that we can conditionally execute a create command instead), we double our wait time and significantly increase the complexity of the deployment code that we must maintain.
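
The estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this comment (300 commands, 3 to 5 seconds each); the class and method names are made up for illustration.

```java
final class DeploymentTimeEstimate {
    /** Total minutes needed to run the given number of CLI commands one after another. */
    static double serialMinutes(int commands, double secondsPerCommand) {
        return commands * secondsPerCommand / 60.0;
    }

    public static void main(String[] args) {
        int functions = 300;
        // One update command per function.
        System.out.println(serialMinutes(functions, 3.0));     // 15.0
        System.out.println(serialMinutes(functions, 5.0));     // 25.0
        // A get-status check before every update doubles the number of commands.
        System.out.println(serialMinutes(2 * functions, 3.0)); // 30.0
        System.out.println(serialMinutes(2 * functions, 5.0)); // 50.0
    }
}
```

With a status check before each update, the serial deployment window grows from 15-25 minutes to 30-50 minutes.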

The Hope

If we can offload more of the deployment logic to Pulsar, then we could easily parallelize the deployment process in a way that avoids the overhead and complexity associated with repeatedly executing pulsar admin commands and handling the text output.

The Current Approach

The way we have handled the deployment so far is to use SaltStack to read a YAML manifest file with component (function, sink, and source) metadata to generate pulsar-admin commands to construct the create statements for these components. Here is an obfuscated example of this YAML:

- type: source
  namespace: ns1
  tenant: tenant1
  name: source1-kafka
  sourceType: kafka
  destinationTopicName: persistent://tenant1/ns1/topic1
  configs:
    bootstrapServers: kafka_bootstrap_servers
    groupId: "kafkaGroupId"
    topic: "kafkaTopic"
    consumerConfigProperties:
      security.protocol: "SASL_PLAINTEXT"
      sasl.kerberos.service.name: "kafka"
      auto.offset.reset: "latest"
      sasl.jaas.config: sasl_jaas_config
- type: function
  namespace: ns1
  name: func1
  tenant: tenant1
  artifactFileName: tenant1-1.1-SNAPSHOT-jar-with-dependencies.jar
  className: com.path.to.className1
  inputs:
    - persistent://tenant1/ns1/topic1
  logTopic: persistent://tenant1/ns1/logTopic1
  output: persistent://tenant1/ns1/topic2
- type: sink
  namespace: ns1
  name: sink1-redis
  tenant: tenant1
  artifactFileName: tenant1-1.1-SNAPSHOT-jar-with-dependencies.jar
  className: com.path.to.className2
  inputs:
    - persistent://tenant1/ns1/topic2
  configs:
    hostname: redis_hostname
    port: redis_port
    password: redis_pass

The Idea

If we could pass a YAML file like this to Pulsar and have Pulsar ensure that its state matches our YAML file, it would make large-scale continuous deployments seamless. The Upsert functionality is one step in this direction, but it's not anywhere near as important as a bigger-picture solution to this problem. My team wants to build changes into Pulsar to simplify deployments, but we need architectural guidance about where to make these changes in Pulsar so that we don't violate architectural expectations.
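
To make the idea more concrete, here is a rough sketch of the comparison step: given the desired components from a manifest and the components currently in the cluster, work out what to create, update, and delete. Everything in it is an assumption made for illustration. `ReconcilePlan` is a hypothetical class, components are keyed by name only, and each config is treated as an opaque string; nothing like this exists in Pulsar today.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical reconciliation plan; none of these names exist in Pulsar. */
final class ReconcilePlan {
    final Set<String> toCreate = new TreeSet<>();
    final Set<String> toUpdate = new TreeSet<>();
    final Set<String> toDelete = new TreeSet<>();

    /** Compares desired state (from the manifest) with actual state (from the cluster). */
    static ReconcilePlan diff(Map<String, String> desired, Map<String, String> actual) {
        ReconcilePlan plan = new ReconcilePlan();
        for (Map.Entry<String, String> entry : desired.entrySet()) {
            String name = entry.getKey();
            if (!actual.containsKey(name)) {
                plan.toCreate.add(name);           // in the manifest, missing from the cluster
            } else if (!entry.getValue().equals(actual.get(name))) {
                plan.toUpdate.add(name);           // present in both, but the config differs
            }
        }
        for (String name : actual.keySet()) {
            if (!desired.containsKey(name)) {
                plan.toDelete.add(name);           // in the cluster, no longer in the manifest
            }
        }
        return plan;
    }
}
```

An upsert covers the `toCreate` and `toUpdate` sets in a single operation; the `toDelete` set is the part that still needs separate handling.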

Answering your questions

Regarding your Question 1:

My current understanding is that the Admin CLI creates REST calls that hit the Pulsar REST API. If my understanding of the behavior of the Admin CLI is incorrect, then Upsert would not need to be added to the REST API.

Regarding your Question 2:

I agree that it would be better to use the existing APIs and not modify FunctionRuntimeManager and FunctionMetaDataManager, especially because the current PR introduces a considerable amount of code duplication. However, I didn't want to create new classes to handle the new functionality without getting architectural guidance because I wasn't sure where to put the new classes.

What are your thoughts? If we can solve the broader deployment problem with a YAML based approach, then we could create a robust solution that wouldn't need Upsert to be added to the REST API to simplify the deployment process.

devinbost · 2019-04-10T17:45:03Z

This relates to issue #4021 (Pulsar admin commands do not support parallelization.)

jerrypeng · 2019-04-10T18:30:13Z

@devinbost thanks for sharing your use case and the hurdles you are trying to overcome!

In regards to:

Because our pulsar-admin commands are dockerized to allow them to operate at scale, there is performance overhead with every pulsar-admin command that must be executed.

Is there a reason why you can't just submit/update functions via the REST endpoints instead of using the pulsar-admin CLI from docker containers? Submitting/Updating functions by just making an HTTP REST call will be a lot faster than starting up a docker container every time to execute commands via the command line.

In a deployment with 300 Pulsar functions, if each pulsar-admin command must be executed in series (rather than in parallel), executing 300 pulsar-admin commands to update these objects takes 15-25 minutes.

Do you have 300 individual functions or is there a function with 300 instances or a group of functions that total 300 instances? There will be a huge submission time difference depending on which scenario. Submitting one function with 300 instances will take much less time than submitting 300 functions with one instance each.

Because Pulsar is in a broken state while these commands are being executed

What do you mean by this? The cluster will be running as it should when submitting functions.

this deployment approach could result in a production Pulsar environment being down for 15-25 minutes, far beyond our SLA of 300 milliseconds of downtime.

In a situation where your whole Pulsar cluster is somehow down and all your functions have disappeared, it is unrealistic to expect the downtime to be less than 300 milliseconds. As you probably already know, starting up a Pulsar cluster, regardless of functions, will take longer than that. If you are just talking about resubmitting 300 functions, I am not sure it's realistic to expect that all the JARs/packages for 300 functions can be uploaded in 300 milliseconds. If you are trying to avoid downtime caused by a catastrophic event in your cluster, I would recommend having redundancy: run geo-replicated clusters across multiple regions so that you can seamlessly cut traffic over from your downed cluster to another cluster.

If you have 300 functions, I don't think it's going to be the norm for you to need to update all 300 functions. It's more likely that it's going to be a subset of them.

I think the functionality you are looking for is bulk create, update, or upsert. You want to bring a cluster from a potentially unknown state into a known, consistent state with regard to functions. Am I understanding you correctly?

While we can add upserts and even bulk upserts, I would suggest that you first try creating/updating functions directly using the REST endpoints to see whether that is good enough.

I would still very much like to see features like bulk create/update/upsert in Pulsar Functions. I do believe we can accomplish them by only adding to or modifying the "front end" code, i.e. the REST endpoints and ComponentImpl.java, to implement the bulk actions. Please reference the code in registerFunction and updateFunction; we can probably just run that in a loop for bulk actions.

Regarding this PR and implementing upserts, I think you can just do something like the following in ComponentImpl.java:

if(functionMetaDataManager.containsFunction(tenant, namespace, functionName)) {
   updateFunction(...)
} else {
   registerFunction(...)
}

The caveat in the above logic is that a function can be deleted after "containsFunction" is called. To handle that scenario, I would suggest looking at how the updateFunction code works, then copying that code and modifying it so that functions that don't exist can also proceed through the logic.
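
To illustrate the race and one general way around it, the sketch below folds the existence check and the write into a single atomic step using `ConcurrentHashMap.compute`. This is a different technique from copying the updateFunction logic, and it is not how FunctionMetaDataManager works; `AtomicUpsertRegistry` is a hypothetical in-memory registry used only to show the pattern.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical in-memory registry; it only illustrates an atomic check-and-write. */
final class AtomicUpsertRegistry {
    private final ConcurrentHashMap<String, String> components = new ConcurrentHashMap<>();

    /**
     * Creates or updates the entry in one atomic step.
     * Returns true if the entry was created, false if an existing entry was updated.
     */
    boolean upsert(String name, String config) {
        AtomicBoolean created = new AtomicBoolean(false);
        // compute() runs the remapping function atomically for this key, so a
        // concurrent delete cannot slip in between the check and the write.
        components.compute(name, (key, existing) -> {
            created.set(existing == null);
            return config;
        });
        return created.get();
    }

    void delete(String name) {
        components.remove(name);
    }

    String get(String name) {
        return components.get(name);
    }
}
```

Because the decision is made inside `compute`, a delete that lands between two upserts simply causes the next upsert to report a create instead of failing.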

devinbost · 2019-04-10T19:36:56Z

@jerrypeng Thank you for your very detailed response. I appreciate your time and attention to this matter.

Regarding:

Is there a reason why you can't just submit/update functions via the REST endpoints instead of using the pulsar-admin CLI from docker containers? Submitting/Updating functions by just making an HTTP REST call will be a lot faster . . .

I appreciate your guidance. Based on advice from @merlimat earlier today, I am currently working on an implementation using the REST endpoints.

Regarding:

Do you have 300 individual functions or is there a function with 300 instances or a group of functions that total 300 instances? There will be a huge submission time difference depending on which scenario. Submitting one function with 300 instances will take much less time than submitting 300 functions with one instance each.

At the current moment, all of our functions are individual because they represent different use cases. However, we appreciate your advice about the performance improvement that we will get from deploying function instances, so we will examine ways that we can refactor to obtain those benefits.

Regarding:

What do you mean by this? The cluster will be running as it should when submitting functions.

I may have been unintentionally misleading, and I apologize for that. Please let me clarify. When I said:

Pulsar is in a broken state

I didn't mean that the Pulsar cluster is not running. What I meant is that our end-to-end production message pipelines will be in a broken state. (i.e. Our customers will experience problems.)
Consider a plumbing analogy. If you need to re-route pipes while water is flowing, if you can't do it extremely quickly, then water will end up leaking everywhere, and the people who are expecting water at a particular location will notice a loss of service. This doesn't mean that the water system is completely broken or that water is not flowing; however, it means that water is not reaching our customers.
In our case, if we have a production data flow that is processing tens of thousands of messages per second, if we need to deploy updates to functions that are inter-dependent, then until all of the functions are deployed, some of the functions may introduce breaking changes that could cause data loss or could cause messages to fail to reach the final destination topic until all of the updated functions are deployed.
Does this make more sense?

Regarding:

I think the functionality you are looking for is bulk create, update, or upsert. You want to bring a cluster from a potentially unknown state into a known, consistent state with regard to functions. Am I understanding you correctly?

That is exactly right.
I think you're right that we won't likely always need to update all 300 functions every time we deploy updates. However, we need to ensure that Pulsar can quickly and seamlessly match the expected state when we deploy updates.

Regarding:

While we can add upserts and even bulk upserts, I would suggest that you first try creating/updating functions directly using the REST endpoints to see whether that is good enough.

I will investigate your suggestions for implementing these changes for bulk actions.

Thank you also for the guidance and example change to ComponentImpl.java for the Upsert functionality for this PR.

merlimat · 2019-04-10T20:38:11Z

I didn't mean that the Pulsar cluster is not running. What I meant is that our end-to-end production message pipelines will be in a broken state. (i.e. Our customers will experience problems.)
Consider a plumbing analogy. If you need to re-route pipes while water is flowing, if you can't do it extremely quickly, then water will end up leaking everywhere, and the people who are expecting water at a particular location will notice a loss of service.

The purpose of the messaging system is to buffer data in exactly these scenarios. Each function uses an associated subscription, which retains the data while no consumers are connected.

You have to plan the amount of disk space based on the time buffer you want to have in case of issues.

During an update of the function there will be a quick restart of the process. If you want to minimize that time, you just need to have more than one instance per function. During the rolling upgrade, there will always be one function instance consuming from the topic.

devinbost · 2019-04-10T21:15:40Z

@jerrypeng I wrote a Java method to performance test the process of creating components with the Pulsar Java Admin API, and it's orders of magnitude faster. (I was running Pulsar standalone locally in a docker container, but it's still indicative of the speedup.) Here's what I got:

Creation of 1000 tenants executed in 6411 milliseconds.
It then took an additional 48 milliseconds to get the results.

So, it definitely appears that we can use the REST API to implement bulk operations because that performance is acceptable.
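
A bulk operation built on top of per-component calls could be parallelized along the lines of the sketch below. It is a sketch under assumptions, not the test code described above: `BulkSubmit` is a hypothetical helper, and the `action` callback stands in for whatever admin call creates or updates a single component.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper that applies one action per component on a bounded thread pool. */
final class BulkSubmit {
    /** Runs the action for every name in parallel; returns the names whose action threw. */
    static List<String> runAll(List<String> names, Consumer<String> action, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String name : names) {
                tasks.add(() -> {
                    action.accept(name);
                    return null;
                });
            }
            // invokeAll blocks until every task has finished and keeps the input order.
            List<Future<Void>> futures = pool.invokeAll(tasks);
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < futures.size(); i++) {
                try {
                    futures.get(i).get();
                } catch (ExecutionException e) {
                    failed.add(names.get(i)); // record the failure and keep going
                }
            }
            return failed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("bulk submission interrupted", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting failures instead of aborting on the first one lets a deployment report every component that did not apply, which matters when the components are inter-dependent.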

Devin Bost added 4 commits April 8, 2019 17:55

Adding functionality for upsert, conditional update/create.

441ba03

Merge branch 'master' of https://github.com/devinbost/pulsar

29e2a66

Merge remote-tracking branch 'upstream/master'

7749419

Added unit tests for Upsert capability for functions, sinks, and sources.

c58a122

Fixed bug in test methods for Upsert.

3d88000

srkukarni requested review from jerrypeng, sijie and srkukarni April 9, 2019 21:34

srkukarni added the area/function label Apr 9, 2019

srkukarni added this to the 2.4.0 milestone Apr 9, 2019

jerrypeng requested changes Apr 9, 2019

jerrypeng assigned devinbost Apr 9, 2019

devinbost mentioned this pull request Apr 10, 2019

Pulsar-admin commands do not support parallelization #4021

Closed

sijie removed this from the 2.4.0 milestone Jun 9, 2019

Devin Bost added 2 commits January 10, 2020 18:30

Added grpc plugin to protoc command to fix gRPC-generated files to include server registration method (#4175)

0694bd6

Rebuilt gRPC-generated files after fixing bug in generate.sh script. Now, the gRPC files contains the handler method for gRPC server registration. Also, upgraded proto package version to 3 from 2. (#4175)

19d3829

devinbost closed this Apr 13, 2021

devinbost wants to merge 7 commits into apache:master from devinbost:master

5 participants
