
Re-work Function MetaDataManager to make all metadata writes only by the leader #7255

Merged: 35 commits merged into apache:master from functions_leader_executor on Jun 26, 2020

Conversation

srkukarni
Contributor

Fixes #

Master Issue: #

Motivation

Currently the function metadata topic is not compacted, which means that in a long-running system with a sufficient number of function submissions/updates/state changes, the startup lag for workers that read the topic from the beginning grows linearly.
However, the current mechanism of function metadata topic writes does not lend itself to compaction, because all workers write into the topic and only one of them wins (and the winner need not be the last write).
This PR takes a first stab at simplifying the current workflow. Now, upon a function submission/update/state change, the workers simply pass that request to the leader. The leader is the arbiter of what goes in (just as it is today) and is the only one writing to the function metadata topic. The rest of the workers still tail the topic to receive the appropriate updates. The leader does not run the tailer, and instead updates its in-memory state directly when it writes to the metadata topic.
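
For illustration, the intended flow could be sketched roughly as below; all class, method, and field names here are hypothetical simplifications, not the actual code added in this PR:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Rough sketch of the leader-only write flow (illustrative names, not the real Pulsar classes).
    public class LeaderOnlyMetadataFlow {

        // Placeholder for the real function metadata record.
        record FunctionMeta(String fullyQualifiedName, byte[] payload) {}

        private volatile boolean isLeader;                                 // set by leader election
        private final Map<String, FunctionMeta> inMemoryState = new ConcurrentHashMap<>();

        // Every worker handles submit/update/state-change requests through this entry point.
        public void update(FunctionMeta request) throws Exception {
            if (isLeader) {
                updateOnLeader(request);   // the leader decides and persists
            } else {
                forwardToLeader(request);  // non-leaders relay to the leader and keep tailing the topic
            }
        }

        // Leader-only path: the single writer to the function metadata topic.
        private void updateOnLeader(FunctionMeta request) throws Exception {
            writeToMetadataTopic(request.payload());
            // The leader runs no tailer, so it applies the change to its in-memory view directly.
            inMemoryState.put(request.fullyQualifiedName(), request);
        }

        private void writeToMetadataTopic(byte[] payload) throws Exception {
            // exclusive producer send to the function metadata topic (omitted)
        }

        private void forwardToLeader(FunctionMeta request) throws Exception {
            // REST/admin call to the current leader (omitted)
        }
    }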

Modifications

Describe the modifications you've done.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)
  • The rest endpoints: (yes / no)
  • The admin cli options: (yes / no)
  • Anything that affects deployment: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

this.functionMetaDataTopicTailer = new FunctionMetaDataTopicTailer(this,
        pulsarClient.newReader(), this.workerConfig, this.errorNotifier);
// read all existing messages
this.setInitializePhase(true);
while (this.functionMetaDataTopicTailer.getReader().hasMessageAvailable()) {
    this.functionMetaDataTopicTailer.processRequest(this.functionMetaDataTopicTailer.getReader().readNext());
}
Contributor

It is kind of weird that functionMetaDataTopicTailer.processRequest() calls back into FunctionMetadataManager. It seems like an awkward interaction between the classes. Perhaps we can refactor it in a subsequent PR.

Contributor Author

Agreed

    readerThread.start();
}

@Override
public void run() {
    while (running) {
Contributor

To check whether we have really reached the end of the topic, I think it's safer to check that reader.hasMessageAvailable() == false and that reader.readNext(5, TimeUnit.SECONDS) returns null.
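
A rough sketch of the double check suggested here, using the public Pulsar Reader API (the helper and its names are illustrative, not the code in this PR):

    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    import org.apache.pulsar.client.api.Message;
    import org.apache.pulsar.client.api.Reader;

    // Keep draining until both signals agree the topic is exhausted:
    // hasMessageAvailable() is false AND a timed readNext() returns null.
    static void catchUp(Reader<byte[]> reader, Consumer<Message<byte[]>> process) throws Exception {
        while (true) {
            if (reader.hasMessageAvailable()) {
                process.accept(reader.readNext());
                continue;
            }
            Message<byte[]> msg = reader.readNext(5, TimeUnit.SECONDS);
            if (msg == null) {
                return;                // both checks agree: end of topic reached
            }
            process.accept(msg);       // a message still arrived within the timeout window
        }
    }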

    .build();
try {
    lastMessageSeen = exclusiveLeaderProducer.send(serviceRequest.toByteArray());
} catch (Exception e) {
Contributor

Shouldn't we return a 500 error to the end user? If we just call "errorNotifier.triggerError(e)", the worker dies and the end user will likely get no response or a timeout error.

Contributor Author

Good point. Changed.

Contributor
@jerrypeng Jun 23, 2020

"errorNotifier.triggerError(e);" is still being called. The worker might exit before exception gets bubbled up and a response send back


Contributor Author

So the question here is what's the right thing to do. If we are having issues writing to the producer, should the leader just reject the request with an internal server error and hope that things will be better next time? Or is the right approach to trigger worker death?

Contributor
@jerrypeng Jun 25, 2020

We should return the error to the worker making the call to the leader; otherwise that worker might have to wait for a timeout. I think we should just return an error and the user can retry. There is no guarantee that restarting the worker or electing another leader will solve the issue, since all the workers have the same configuration. Restarting can also be heavy, and I would prefer to minimize the number of forced restarts.
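
A sketch of the behaviour being argued for: surface the write failure to the caller, so it can become a 500 to the user, instead of tearing the worker down. The exception type and helper method shown are illustrative choices, not the exact code that was merged:

    import javax.ws.rs.InternalServerErrorException;

    import org.apache.pulsar.client.api.MessageId;
    import org.apache.pulsar.client.api.Producer;

    // Fail the individual request and let the user retry, instead of calling
    // errorNotifier.triggerError(e) and shutting the whole worker down.
    static MessageId writeUpdate(Producer<byte[]> exclusiveLeaderProducer, byte[] serializedRequest) {
        try {
            return exclusiveLeaderProducer.send(serializedRequest);
        } catch (Exception e) {
            // Surfaced to the calling worker/REST layer as an internal server error;
            // restarting the leader would not fix a persistent write problem anyway.
            throw new InternalServerErrorException("Failed to write function metadata update", e);
        }
    }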

@srkukarni merged commit c83a656 into apache:master on Jun 26, 2020
@srkukarni deleted the functions_leader_executor branch on June 26, 2020 at 14:49
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
…the leader (apache#7255)

* Function workers re-direct call update requests to the leader

* Fixed test

* tests pass

* Working version

* Fix test

* Short circuit update

* Fix test

* Fix test

* Fix tests

* Added one more catch

* Added one more catch

* Seperated internal and external errors

* Fix test

* Address feedback

* Do not expose updateOnLeader to functions

* hide api

* hide api

* removed duplicate comments

* Do leadership changes in function metadata manager

* make the function sync

* Added more comments

* Throw error

* Changed name

* address comments

* Deleted unused classes

* Rework metadata manager

* Working

* Fix test

* A better way for test

* Address feedback

Co-authored-by: Sanjeev Kulkarni <sanjeevk@splunk.com>
sijie added a commit to sijie/pulsar that referenced this pull request Jan 21, 2021
*Motivation*

apache#7255 re-worked the Function MetaDataManager so that all metadata writes are done only by the leader.
This unintentionally broke Pulsar Functions when m-TLS is used for authentication, because it doesn't
take the TLS port into consideration and always uses a non-TLS port to communicate with the leader broker.

The PR fixes the broken implementation and ensures Pulsar Functions use the right service URL and
authentication plugin to communicate with the leader.

*Tests*

Add an integration test to reproduce the issue and ensure the functions worker works with m-TLS.
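
For illustration only, the kind of fix described might look roughly like the sketch below; every parameter name here is hypothetical rather than taken from the actual change, and only standard Pulsar admin-client APIs are used:

    import org.apache.pulsar.client.admin.PulsarAdmin;
    import org.apache.pulsar.client.api.AuthenticationFactory;

    // Rough sketch: when TLS is enabled, talk to the leader over its TLS web service URL
    // with the worker's configured auth plugin, instead of the plain-text URL.
    static PulsarAdmin buildLeaderAdmin(boolean tlsEnabled,
                                        String leaderUrl, String leaderUrlTls,
                                        String authPlugin, String authParams,
                                        String tlsTrustCertsFilePath) throws Exception {
        return PulsarAdmin.builder()
                .serviceHttpUrl(tlsEnabled ? leaderUrlTls : leaderUrl)
                .authentication(AuthenticationFactory.create(authPlugin, authParams))
                .tlsTrustCertsFilePath(tlsTrustCertsFilePath)
                .build();
    }
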
codelipenghui pushed a commit that referenced this pull request Feb 5, 2021
ivankelly pushed a commit to ivankelly/pulsar that referenced this pull request Aug 10, 2021