Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ServerCnx] Only reply to client when code completes producerFuture #13949

Merged

Conversation

michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Jan 25, 2022

Motivation

We should only send the error response to the client when the code is able to complete the producerFuture. This logic is described here:

// If client timed out, the future would have been completed
// by subsequent close. Send error back to
// client, only if not completed already.
if (producerFuture.completeExceptionally(exception)) {
commandSender.sendErrorResponse(requestId,
BrokerServiceException.getClientErrorCode(cause), cause.getMessage());
}
producers.remove(producerId, producerFuture);

Edit: in a previous version of this motivation section, I attributed the current behavior to #12874. That PR did not introduce this behavior, though.

Modifications

  • Move the response to the client into a conditional block that only runs when this section of the code is able to complete the future.

Verifying this change

This is a trivial change.

Does this pull request potentially affect one of the following parts:

It impacts how the broker interacts with the client. This change ensures that we have the correct behavior.

Documentation

  • no-need-doc

This change is completely internal.

@michaeljmarshall michaeljmarshall added type/cleanup Code or doc cleanups e.g. remove the outdated documentation or remove the code no longer in use area/broker doc-not-needed Your PR changes do not impact docs labels Jan 25, 2022
@michaeljmarshall michaeljmarshall added this to the 2.10.0 milestone Jan 25, 2022
@michaeljmarshall michaeljmarshall self-assigned this Jan 25, 2022
Copy link
Member

@mattisonchao mattisonchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jason918
Copy link
Contributor

In #12874, we reply to the client in all cases. That is not our current design though.

This issue exists long before #12874 which just changed the sync blocking to async implementation.

I wonder if there are some other cases have the same issue.

@michaeljmarshall
Copy link
Member Author

In #12874, we reply to the client in all cases. That is not our current design though.

This issue exists long before #12874 which just changed the sync blocking to async implementation.

I wonder if there are some other cases have the same issue.

@Jason918 - thank you for calling this out, sorry about that misattribution. I noticed the code today when looking at recent changes, but I failed to dig deep enough to know that it predated your commit. I updated the PR description.

I did inspect the rest of the class today, and I don't see the behavior anywhere else.

@codelipenghui
Copy link
Contributor

We should apply this change carefully.

If the client is like the followings:

  1. The client sends the command to create the producer
  2. The broker received the command and start to process the request, but can't complete it after the client-side operation timeout
  3. The client tries to send a new create producer command with the same producer ID but a different request ID (Of course this usually doesn't happen)
  4. Now the broker received the new create producer command and will use the existing future https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1173 to process the new request.
  5. Then the previous request failed, and complete the future with an exception
  6. The client will not able to receive the response for the second request.

In my opinion, we should handle such case in the following principles:

  1. The server side should return a response to the client, one producer future might map to multiple request IDs, we should avoid broker miss response to the client, of course, if the connection is not available, there is no way to make the above guarantee.
  2. The client-side should make sure to only process one response for a request ID, if the request ID has already been finished by another response, the client to ignore the subsequent responses.

I think it's the more easy and clear way to handle both the client-side and server-side. And currently, the client-side also follow this way https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L534, https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L719

@codelipenghui
Copy link
Contributor

A related discussion before: #13245

@michaeljmarshall
Copy link
Member Author

If the client is like the followings:

1. The client sends the command to create the producer

2. The broker received the command and start to process the request, but can't complete it after the client-side operation timeout

3. The client tries to send a new create producer command with the same producer ID but a different request ID (Of course this usually doesn't happen)

4. Now the broker received the new create producer command and will use the existing `future` https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1173 to process the new request.

5. Then the previous request failed, and complete the future with an exception

6. The client will not able to receive the response for the second request.

This is a valid concern. However, the client is supposed to send a CloseProducer command when timing out the first request before sending a Producer command. When it sends the CloseProducer command, the above scenario is avoided. This was not done in the Java Client until I fixed it a month ago: #13161. Additionally, it was not explicitly called out in the protocol documentation until a month ago either: #12948.

The consequence of my proposed PR: if a client does not follow the protocol spec, it will not receive a response (step 6), then the client will timeout waiting for a response and then it'll retry and succeed (or fail).

I agree that we need to be careful implementing this. I wouldn't cherry pick these changes to previous branches. However, as I have mentioned elsewhere, I think this is the right design because we should only respond to a client once.

The server side should return a response to the client, one producer future might map to multiple request IDs, we should avoid broker miss response to the client, of course, if the connection is not available, there is no way to make the above guarantee.

I don't think we need to design for this case because the client is not supposed to send two Producer commands without a CloseProducer command in between. Note that we don't store the requestId for a second Producer command, so there is no way for the broker to respond to the second request. Since the client removed the first request when it timed out, replying to it won't matter. The second client request will timeout, too, and the client will need to resend the Producer command a third time.

I think it's the more easy and clear way to handle both the client-side and server-side. And currently, the client-side also follow this way https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L534, https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L719

In both of these examples, the client will log warning messages if it receives responses for requests that are already completed. Those warnings won't be actionable. I think we should respond to client requests just once.

@codelipenghui
Copy link
Contributor

However, as I have mentioned elsewhere, I think this is the right design because we should only respond to a client once.

Yes, I'm also talking about this one, the only difference is the behavior of the client-side sending multiple requests with the same producer ID.

Two options we are talking about:

  1. Provide the spec to tell the implements, "If you create multiple requests using the same producer ID, you might never get a response".
  2. Return an error to the client, "The current producer is creating, please close the old one first!"

I would tend to the second one.

@michaeljmarshall
Copy link
Member Author

  1. Return an error to the client, "The current producer is creating, please close the old one first!"

This is already the current behavior. See:

if (existingProducerFuture != null) {
if (existingProducerFuture.isDone() && !existingProducerFuture.isCompletedExceptionally()) {
Producer producer = existingProducerFuture.getNow(null);
log.info("[{}] Producer with the same id is already created:"
+ " producerId={}, producer={}", remoteAddress, producerId, producer);
commandSender.sendProducerSuccessResponse(requestId, producer.getProducerName(),
producer.getSchemaVersion());
return null;
} else {
// There was an early request to create a producer with same producerId.
// This can happen when client timeout is lower than the broker timeouts.
// We need to wait until the previous producer creation request
// either complete or fails.
ServerError error = null;
if (!existingProducerFuture.isDone()) {
error = ServerError.ServiceNotReady;
} else {
error = getErrorCode(existingProducerFuture);
// remove producer with producerId as it's already completed with exception
producers.remove(producerId, existingProducerFuture);
}
log.warn("[{}][{}] Producer with id is already present on the connection, producerId={}",
remoteAddress, topicName, producerId);
commandSender.sendErrorResponse(requestId, error, "Producer is already present on the connection");
return null;
}
}
.

I agree that sending a failure is the right design, since the client is not following the protocol spec (it shouldn't try to create the same producer twice). Although, technically, if the producer is already created, we just respond that it was created successfully. I am not sure that I like this design, but that is a different discussion.

I described in detail why it is problematic if the client does not send the CloseProducer command before trying to create a new producer here (https://lists.apache.org/thread/x7886r5v1dtg4c4nbptdfn97ryw097wl):

Specifically, if the client fails to send a CloseProducer command,
it ends up getting into a sequence of retries where each new
Producer command receives an immediate ErrorResponse because the
ServerCnx already has a pending producer. By sending a
CloseProducer command, the client gives the broker permission to
stop keeping track of the original create producer request. It also
means that if the topic eventually loads, the broker will respond to
the right request id with a ProducerSuccessResponse command.

This is another reason why the broker shouldn't respond if the producer future is already completed: it gets completed when the client sends a CloseProducer command.

@codelipenghui
Copy link
Contributor

@michaeljmarshall Thanks for the explanation, LGTM.

Although, technically, if the producer is already created, we just respond that it was created successfully. I am not sure that I like this design, but that is a different discussion.

Yes, it should be a different discussion, I think we should return such as a ProducerConflict exception to the client, otherwise, if the client implementation does not follow the spec, it will have more than 1 active producer instance with the same producer ID.

@codelipenghui codelipenghui merged commit 3c6aae3 into apache:master Jan 28, 2022
@michaeljmarshall michaeljmarshall deleted the conditionally-reply-to-client branch January 28, 2022 06:08
@michaeljmarshall
Copy link
Member Author

Yes, it should be a different discussion, I think we should return such as a ProducerConflict exception to the client, otherwise, if the client implementation does not follow the spec, it will have more than 1 active producer instance with the same producer ID.

@codelipenghui - that's a great point, and I support exploring a change to the behavior. I just looked at the git history, and we've responded to this behavior with a ProducerSuccess since the initial import of the project in 2016. The only change would be to reject the second request. The initial producer would continue to be connected.

Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
…he#13949)

### Motivation

We should only send the error response to the client when the code is able to complete the `producerFuture`. This logic is described here:

https://github.com/apache/pulsar/blob/2285d02aa9957af7877b9d3d3c628a750d813ca7/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1286-L1293

Edit: in a previous version of this motivation section, I attributed the current behavior to apache#12874. That PR did not introduce this behavior, though.

### Modifications

* Move the response to the client into a conditional block that only runs when this section of the code is able to complete the future.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker doc-not-needed Your PR changes do not impact docs type/cleanup Code or doc cleanups e.g. remove the outdated documentation or remove the code no longer in use
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants