[Proto] java.lang.IllegalStateException: Some required fields are missing #14436

Closed

congbobo184 opened this issue Feb 24, 2022 · 38 comments

Labels: lifecycle/stale, type/bug

@congbobo184
Contributor

congbobo184 commented Feb 24, 2022

Describe the bug
Parsing the message metadata throws an exception:

[screenshot: IllegalStateException stack trace]
To Reproduce
Steps to reproduce the behavior:

  1. Start up a standalone Pulsar (version 2.10.0 or newer).
  2. Change the broker config:
    transactionCoordinatorEnabled=true
    managedLedgerMaxEntriesPerLedger=3
    managedLedgerMinLedgerRolloverTimeMinutes=1
  3. Configure the namespace message TTL: ./pulsar-admin namespaces set-message-ttl -ttl 10 public/default
  4. Run the following client code:

         PulsarClient pulsarClient = PulsarClient.builder()
                 .enableTransaction(true)
                 .serviceUrl("pulsar://127.0.0.1:6650")
                 .build();
         int sendTopicsCount = 200;
         String topicName = "test55";
         List<Producer<String>> producers = new ArrayList<>(sendTopicsCount);

         for (int i = 0; i < sendTopicsCount; i++) {
             producers.add(pulsarClient.newProducer(Schema.STRING)
                     .sendTimeout(0, TimeUnit.SECONDS)
                     .topic(i + topicName)
                     .create());
             Consumer<String> consumer = pulsarClient.newConsumer(Schema.STRING)
                     .subscriptionName("test")
                     .topic(i + topicName)
                     .subscribe();
             consumer.closeAsync();
         }
         for (int j = 0; j < 10000; j++) {
             for (int i = 0; i < sendTopicsCount; i++) {
                 Transaction transaction = pulsarClient.newTransaction()
                         .withTransactionTimeout(10, TimeUnit.SECONDS)
                         .build().get();
                 System.out.println("send one transaction messageId " + producers.get(i)
                         .newMessage(transaction).value(i + topicName).send() + "   topic name : " + i);
                 transaction.commit();
                 Thread.sleep(10);
             }
         }
    

Execute the above code multiple times, and run ./pulsar-admin namespaces unload public/default several times while the code is running.

  5. See the error:

[screenshot: broker stack trace]
Expected behavior
No exception should be thrown.


Additional context
This is not a stable reproduction; it may require multiple attempts.

@hangc0276
Contributor

test1.log.zip

less test1.log|grep "read entry from bookie : 1889:205"|less

@lhotari
Member

lhotari commented Mar 4, 2022

I made an attempt to fix this issue in #14556.

The IllegalStateException: Some required fields are missing could happen if any of publishTime, producerName, or sequenceId is unset.

required string producer_name = 1;
required uint64 sequence_id = 2;
required uint64 publish_time = 3;

One interesting detail is this code:

if (!msgMetadata.hasPublishTime()) {
    msgMetadata.setPublishTime(client.getClientClock().millis());
    checkArgument(!msgMetadata.hasProducerName());
    msgMetadata.setProducerName(producerName);
    if (conf.getCompressionType() != CompressionType.NONE) {
        msgMetadata.setCompression(
                CompressionCodecProvider.convertToWireProtocol(conf.getCompressionType()));
    }
    msgMetadata.setUncompressedSize(uncompressedSize);
}

I guess that's fine (similar logic has been in place since the initial commit). The problem is more likely a thread-safety issue or a shared thread-local instance issue. That's why #14556 could be effective.

@congbobo184
Contributor Author

@lhotari hi, I added some logging. I found that when the ByteBuf is read from the bookie client, the data length is not correct and the data is not the entry data that was requested. So the problem may simply be that the ByteBuf is not cleaned up properly; it may be a Netty bug or a bookie problem.

@lhotari
Member

lhotari commented Mar 4, 2022

@lhotari hi, I added some logging. I found that when the ByteBuf is read from the bookie client, the data length is not correct and the data is not the entry data that was requested. So the problem may simply be that the ByteBuf is not cleaned up properly; it may be a Netty bug or a bookie problem.

@congbobo184 Perhaps the ByteBuf was already released and this resulted in the memory location being reused for something else?

@congbobo184
Contributor Author

@lhotari hi, I added some logging. I found that when the ByteBuf is read from the bookie client, the data length is not correct and the data is not the entry data that was requested. So the problem may simply be that the ByteBuf is not cleaned up properly; it may be a Netty bug or a bookie problem.

@congbobo184 Perhaps the ByteBuf was already released and this resulted in the memory location being reused for something else?

Yes, most likely because the ByteBuf was not released cleanly and was reused.

@lhotari
Member

lhotari commented Mar 4, 2022

@congbobo184 @merlimat @codelipenghui @BewareMyPower When digging into the code, I keep circling back to the fact that OpReadEntry is not thread safe. OpReadEntry.entries is mutated and accessed from multiple threads. That is a big problem.

I have an attempt to improve the situation with #11387, but that's not ready and won't resolve the issue completely. The description in #11387 contains some explanation of the issue.

Do we all agree that OpReadEntry is not thread safe and that is a problem that must be resolved?
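
To make the concern concrete, here is a minimal, hypothetical sketch of the pattern being discussed (this is not the actual OpReadEntry code, just an illustration): a pooled "op" object whose plain list field is written on one executor thread and read on another, with nothing establishing a happens-before edge between the two.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the pattern under discussion; NOT the actual OpReadEntry code.
// A pooled "op" object has a plain List field that one thread fills while another thread
// reads it, with no synchronization or other happens-before edge in between.
public class UnsafeReadOp {

    // Plain mutable field: writes made on thread A are not guaranteed to be visible
    // to thread B unless something (a lock, a volatile, a single shared executor, ...)
    // orders the two accesses.
    private List<String> entries = new ArrayList<>();

    void addEntry(String entry) {       // e.g. called from a BookKeeper callback thread
        entries.add(entry);
    }

    int entryCount() {                  // e.g. called from a managed-ledger worker thread
        return entries.size();          // may observe a stale or partially built list
    }

    public static void main(String[] args) {
        UnsafeReadOp op = new UnsafeReadOp();
        ExecutorService callbackThread = Executors.newSingleThreadExecutor();
        ExecutorService workerThread = Executors.newSingleThreadExecutor();

        callbackThread.submit(() -> op.addEntry("entry-0"));
        // Without a shared barrier between the two executors, this read is not guaranteed
        // to see the write above; that is the kind of race being described here.
        workerThread.submit(() -> System.out.println("entries seen: " + op.entryCount()));

        callbackThread.shutdown();
        workerThread.shutdown();
    }
}
```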

@BewareMyPower
Contributor

Do we all agree that OpReadEntry is not thread safe and that is a problem that must be resolved?

Yes. But it looks like there are some problems with the pin executor solution. Could we solve this problem by changing entries to a CopyOnWriteArrayList and making the field volatile?
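
For illustration, a minimal sketch of that suggestion, applied to a simplified stand-in class rather than the real OpReadEntry:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the suggestion above, applied to the hypothetical example rather than to the
// real OpReadEntry: a volatile field holding a CopyOnWriteArrayList, so both the field
// reference and the list contents are safely published across threads.
public class SaferReadOp {

    private volatile List<String> entries = new CopyOnWriteArrayList<>();

    void addEntry(String entry) {
        entries.add(entry);                      // CopyOnWriteArrayList handles its own synchronization
    }

    int entryCount() {
        return entries.size();                   // volatile read guarantees visibility of the latest list
    }

    void recycleForReuse() {
        entries = new CopyOnWriteArrayList<>();  // volatile write publishes the fresh list
    }
}
```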

@BewareMyPower
Contributor

But it looks like the thread safety issue is not the cause of this issue.

@lhotari
Member

lhotari commented Mar 7, 2022

But it looks like the thread safety issue is not the cause of this issue.

@BewareMyPower What's the current assumption of the cause of this issue?

@lhotari
Member

lhotari commented Mar 7, 2022

Yes. But it looks like there are some problems with the pin executor solution. Could we solve this problem by changing entries to a CopyOnWriteArrayList and making the field volatile?

That could work, but I'm not sure about the consequences for performance. It would be worth testing to see what the impact would be.

@BewareMyPower
Contributor

BewareMyPower commented Mar 7, 2022

What's the current assumption of the cause of this issue?

I'm not working on this issue currently; it's just that this thread safety issue has existed from the very beginning. If this issue can be reproduced only on Pulsar 2.10, then the cause might not be related to OpReadEntry.

@lhotari
Member

lhotari commented Mar 7, 2022

What's the current assumption of the cause of this issue?

I'm not working on this issue currently; it's just that this thread safety issue has existed from the very beginning. If this issue can be reproduced only on Pulsar 2.10, then the cause might not be related to OpReadEntry.

Ok, so you don't have an assumption of the root cause?

@congbobo184 Could you please try running the test case with -Dio.netty.recycler.maxCapacity=0 -Dio.netty.recycler.maxCapacityPerThread=0 in the broker and in the client (perhaps separately)? That setting disables the Netty Recycler. The benefit of the experiment would be that it would reveal new information about the problem if disabling Netty Recycler makes the problem go away.

Please be aware that io.netty.recycler.maxCapacity.default is not the correct name of the system property. I created #14576 to remove the invalid property name from Pulsar files.

@congbobo184
Contributor Author

Hi @lhotari, I started standalone with -Dio.netty.recycler.maxCapacity=0 -Dio.netty.recycler.maxCapacityPerThread=0 and the problem does not seem to reproduce. I don't know whether that is because of performance degradation or because this Netty configuration makes it unreproducible. I logged the buffer length when reading from the bookie, so I think the problem might not be with the broker's OpReadEntry or MessageMetadata.

[screenshot: buffer length log output]

@lhotari
Member

lhotari commented Mar 8, 2022

@congbobo184 Thank you. Based on this experiment alone we cannot determine the issue, but since disabling the Netty Recycler has an impact, there's a possibility that recycling instances is causing issues in one way or another.

The next experiment could be to leave the Netty Recycler setting unchanged (enabled) and instead comment out the line recyclerHandle.recycle(this); in OpReadEntry: https://github.com/apache/pulsar/blob/9032a9afc15e029d8f205761b5e104f44f9c4be0/man[…]c/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

That will disable recycling of OpReadEntry instances. If that makes the problem go away, it's again more information about the possible problem.

@congbobo184 Would you be able to continue with this experiment?
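
For clarity, this is roughly what the experiment would look like, shown here on a hypothetical Recycler-backed class rather than the actual OpReadEntry: the call that returns the instance to the pool is commented out, so every acquisition allocates a fresh object.

```java
import io.netty.util.Recycler;

// Hypothetical Recycler-backed class (NOT the actual OpReadEntry) showing where the
// recycle call would be commented out for this experiment. With the call disabled,
// RECYCLER.get() always ends up allocating a new instance, so reuse-related races disappear.
public class PooledOp {

    private static final Recycler<PooledOp> RECYCLER = new Recycler<PooledOp>() {
        @Override
        protected PooledOp newObject(Handle<PooledOp> handle) {
            return new PooledOp(handle);
        }
    };

    private final Recycler.Handle<PooledOp> recyclerHandle;
    private Object payload;

    private PooledOp(Recycler.Handle<PooledOp> recyclerHandle) {
        this.recyclerHandle = recyclerHandle;
    }

    public static PooledOp create(Object payload) {
        PooledOp op = RECYCLER.get();
        op.payload = payload;
        return op;
    }

    public void recycle() {
        payload = null;
        // The experiment described above: skip returning the instance to the pool.
        // recyclerHandle.recycle(this);
    }
}
```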

@congbobo184
Contributor Author

congbobo184 commented Mar 9, 2022

Following @lhotari's advice, I rolled Netty back to 4.1.68.Final (which requires netty-tc-native.version 2.0.42.Final), and the problem can no longer be reproduced. Last time I only changed the Netty version to 4.1.68.Final and did not change netty-tc-native.version from 2.0.48.Final down to 2.0.42.Final, and the problem still happened. So the problem may be in netty-tc-native.version 2.0.48.Final.

@lhotari
Member

lhotari commented Mar 9, 2022

So the problem may be in netty-tc-native.version 2.0.48.Final.

Do you happen to use TLS in your setup? If not, it seems odd that the netty-tc-native version would affect whether the problem reproduces.

btw. A major difference between Netty 4.1.68.Final and 4.1.74.Final is the Netty Recycler rewrite that was made in 4.1.71.Final. There's more info in this comment: #13328 (comment)

@lhotari
Member

lhotari commented Mar 9, 2022

@congbobo184
Contributor Author

So the problem may be in netty-tc-native.version 2.0.48.Final.

Do you happen to use TLS in your setup? If not, it seems odd that the netty-tc-native version would affect whether the problem reproduces.

btw. A major difference between Netty 4.1.68.Final and 4.1.74.Final is the Netty Recycler rewrite that was made in 4.1.71.Final. There's more info in this comment: #13328 (comment)

I am not sure the problem is in 2.0.48.Final, so I will test again: change the Netty version to 4.1.68.Final and leave the netty-tcnative version unchanged.

@congbobo184
Contributor Author

congbobo184 commented Mar 11, 2022

We have roughly analyzed the problem. First of all, we believe this is not a problem that should block the 2.10 and 2.9.2 releases. Let's analyze the cause of this problem:

The reason for this is the reuse of the org.apache.bookkeeper.client.impl.LedgerEntryImpl object.

I added logging in three positions:

  1. org.apache.bookkeeper.client.impl.LedgerEntryImpl create
  2. org.apache.bookkeeper.client.impl.LedgerEntryImpl recycle
  3. handling of the read response in the BookKeeper client

[screenshots: log output from the three logging positions]

  1. As we can see in this log, the LedgerEntryImpl for entry (899:173) is created and then recycled without ever logging firstcomplete. This shows that LedgerEntryImpl(1167574521) did not receive any data from the bookie server; it may have received an error and was then recycled.
  2. When the LedgerEntryImpl for entry (899:173) is recycled, it is reused for reading entry (1395:220): it is created again and then receives the response from the bookie client.
  3. But before it is recycled again, the response for reading entry (899:173) arrives, and LedgerEntryImpl(1167574521), which is now reading entry (1395:220), is used to handle that response, even though LedgerEntryImpl(1167574521) is no longer in use by the read of entry (899:173). So the data returned to the broker is not (1395:220)'s data but entry (899:173)'s: the entry ID is (1395:220) but the data is entry (899:173)'s. This is what produces the problem.
  4. Then the org.apache.bookkeeper.client.impl.LedgerEntryImpl is recycled.

As we can see, this problem comes from the bookie client and is a long-standing one. So we should not block the 2.10 and 2.9.2 releases; we should wait for the bookie fix and release, i.e., release the broker first and then fix this issue in the bookie client.

Once the fix is complete, I will link it in this issue.
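
To make the sequence above easier to follow, here is a hypothetical, simplified sketch of the reuse race (a stand-in for LedgerEntryImpl, not the real BookKeeper client code):

```java
// Hypothetical timeline of the reuse race described above; NOT the actual BookKeeper client
// code. A pooled entry object is recycled after a failed read and immediately reused for a
// different entry; a late response for the first read then writes its data into the reused
// instance, so the entry id and the payload no longer match.
public class EntryReuseRaceSketch {

    static final class PooledEntry {   // simplified stand-in for LedgerEntryImpl
        long ledgerId;
        long entryId;
        String data;

        void resetFor(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.data = null;
        }
    }

    public static void main(String[] args) {
        PooledEntry pooled = new PooledEntry();

        // 1. The instance is created for read (899:173); the read fails, no data arrives,
        //    and the instance is recycled.
        pooled.resetFor(899, 173);

        // 2. The same instance is taken from the pool and reused for read (1395:220).
        pooled.resetFor(1395, 220);

        // 3. The late response for (899:173) still holds a reference to this instance and
        //    fills it in, so the object now claims to be (1395:220) but carries (899:173)'s
        //    payload -- the mismatch that is returned to the broker.
        pooled.data = "payload of entry 899:173";

        System.out.println("entry " + pooled.ledgerId + ":" + pooled.entryId
                + " -> data: " + pooled.data);
    }
}
```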

@lhotari
Member

lhotari commented Mar 11, 2022

Thanks for the analysis @congbobo184 . Really helpful.

As we can see, this problem comes from the bookie client and is a long-standing one. So we should not block the 2.10 and 2.9.2 releases; we should wait for the bookie fix and release, i.e., release the broker first and then fix this issue in the bookie client.

This doesn't mean that we should accept the bug that is caused by thread safety issues. This is a severe issue and can lead to data inconsistency problems. As discussed privately, this problem seems to become more frequent with Netty 4.1.74.Final compared to Netty 4.1.68.Final. Most likely this behavioral change is caused by the new Netty Recycler that was introduced in Netty 4.1.71.Final. My assumption is that the new implementation is more efficient and brings thread safety issues to the surface.
There's more info about the change in this comment: #13328 (comment)

I'll bring this severe issue up to discussion on the Apache Pulsar and Apache Bookkeeper developer mailing lists.

The quick workaround is to disable Netty Recycler completely by setting the system property -Dio.netty.recycler.maxCapacityPerThread=0. There are early reports from @dave2wave in OMB testing that when running on JDK17 ZGC or Shenandoah GC, there's no negative performance impact in disabling Netty Recycler.
This means that it will be useful to consider removing the Netty Recycler from the Bookkeeper and Pulsar code bases in the future. The current usage patterns contain thread safety issues, and those will continue to hinder Pulsar and Bookkeeper reliability unless they are resolved completely.

@shoothzj
Member

@lhotari do you mean it's a Netty issue? Should we report it to the Netty community and revert the Netty version?

@codelipenghui
Contributor

Most likely this behavioral change is caused by the new Netty Recycler that was introduced in Netty 4.1.71.Final. My assumption is that the new implementation is more efficient and brings thread safety issues to the surface.

I have tested with congbo and can't actually prove it for now. After downgrading the Netty version, the problem can still be reproduced, and the timing of reproduction is uncertain even with the same Netty version.

So far we don't have an exact root cause, but since the problem can be reproduced, I think we can find it.

@lhotari
Member

lhotari commented Mar 11, 2022

@lhotari do you mean it's a Netty issue? Should we report it to the Netty community and revert the Netty version?

No, it's not a Netty issue. It's a problem caused by our code in Pulsar and Bookkeeper violating the Java Memory Model and thread safety rules.

There are 2 possible solutions:

  1. Disabling the Netty Recycler by setting -Dio.netty.recycler.maxCapacityPerThread=0. There are early reports from @dave2wave in OMB testing that when running on JDK17 ZGC or Shenandoah GC, there's no negative performance impact in disabling Netty Recycler.

  2. Ensuring that recycled object instances are "safely published" to other threads ("safe publication"; a related concept is the "happens-before visibility guarantee"). There are multiple ways to achieve this, and full synchronization isn't necessary in most cases; see the sketch below.
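
A minimal sketch of option 2, illustrating the idea with a plain volatile field rather than patching any specific class:

```java
// Illustration of "safe publication" for a pooled object that is handed from one thread to
// another: all plain fields are written before a single volatile write, and the consumer
// performs the matching volatile read before touching the plain fields.
public class SafelyPublishedOp {

    private long ledgerId;
    private long entryId;
    private volatile boolean published;   // the volatile "barrier" field

    // Writer thread: set the plain fields first, then do the volatile write.
    void publish(long ledgerId, long entryId) {
        this.ledgerId = ledgerId;
        this.entryId = entryId;
        this.published = true;            // volatile write establishes the happens-before edge
    }

    // Reader thread: the volatile read comes first, then the plain fields are guaranteed to
    // be at least as fresh as they were at the time of the volatile write.
    String describe() {
        if (!published) {
            return "not yet published";
        }
        return ledgerId + ":" + entryId;
    }
}
```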

@lhotari
Member

lhotari commented Mar 11, 2022

Most likely this behavioral change is caused by the new Netty Recycler that was introduced in Netty 4.1.71.Final. My assumption is that the new implementation is more efficient and brings thread safety issues to the surface.

I have tested with congbo and can't actually prove it for now. After downgrading the Netty version, the problem can still be reproduced, and the timing of reproduction is uncertain even with the same Netty version.

So far we don't have an exact root cause, but since the problem can be reproduced, I think we can find it.

I meant to say that with the new Netty Recycler it just happens to occur more frequently. That's not the root cause. The root cause is the violation of the Java Memory Model and thread safety rules in the use of recycled objects across multiple threads.

@lhotari
Member

lhotari commented Mar 11, 2022

In this case, one example of a class that is not thread safe is LedgerEntriesImpl. It has a single field entries: https://github.com/apache/bookkeeper/blob/7087fda4e91e9fe8974f9d36e01e275a08bf38f1/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/impl/LedgerEntriesImpl.java#L38

I'm not saying that full synchronization is required. In this case, it is sufficient to apply the "golden rule" that Aleksey Shipilev presents.
[screenshots: Aleksey Shipilev's "golden rule" slides, at 33:10 in the talk]

I'm thinking of an approach where there would be a base class for recycled objects with methods that should be called after writing fields and before reading fields in different threads. That would be sufficient to ensure thread safety without the need to have full synchronization. It could be implemented with a simple volatile field write and read.
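
A rough sketch of what such a base class could look like (an assumed shape for the API, not an existing class):

```java
// Rough sketch of the base-class idea described above (an assumed shape, not an existing
// class): recycled subclasses call publishState() after mutating their plain fields and
// ensureStateVisible() before reading them on another thread. The volatile write/read pair
// provides the happens-before edge without full synchronization.
public abstract class AbstractRecycledObject {

    private volatile int publishBarrier;

    /** Call on the writing thread after all plain fields have been set. */
    protected final void publishState() {
        publishBarrier = 1;               // volatile write
    }

    /** Call on the reading thread before any plain fields are read. */
    protected final void ensureStateVisible() {
        // A plain volatile read; the value is irrelevant, only the ordering matters.
        int barrier = publishBarrier;
        assert barrier >= 0;
    }
}

// Example usage in a hypothetical recycled op class:
class ExampleRecycledOp extends AbstractRecycledObject {

    private long entryId;

    void fill(long entryId) {             // writer thread
        this.entryId = entryId;
        publishState();
    }

    long readEntryId() {                  // reader thread
        ensureStateVisible();
        return entryId;
    }
}
```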

@lhotari
Member

lhotari commented Mar 11, 2022

Usually, concurrent collections provide causality. For example, when a ConcurrentHashMap is used, it's all good when one thread writes fields and puts the object into the ConcurrentHashMap and another thread gets it from the ConcurrentHashMap and reads the fields. There is causality and the reads are consistent. The problem comes with other threads that don't use the same "barrier"; there's no guarantee of causality in that case. With asynchronous callbacks, multiple threads are used and there isn't a single barrier which would guarantee causality.
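
A small illustration of that point, using a hypothetical Request value class: the ConcurrentHashMap acts as the shared barrier between the writer and the reader, while any thread that reaches the same instance through a different, unsynchronized path gets no visibility guarantee.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration of the point above, using a hypothetical Request value class. The map is the
// shared barrier: the writer's put() happens-before the reader's successful get(), so the
// reader sees a fully initialised Request. A thread that reaches the same instance through
// some other, unsynchronized path gets no such guarantee.
public class BarrierExample {

    static final class Request {          // plain value object with no synchronization of its own
        long requestId;
        String payload;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<Long, Request> pending = new ConcurrentHashMap<>();
        ExecutorService writer = Executors.newSingleThreadExecutor();
        ExecutorService reader = Executors.newSingleThreadExecutor();

        writer.submit(() -> {
            Request r = new Request();
            r.requestId = 42L;
            r.payload = "data";
            pending.put(r.requestId, r);  // safe publication through the concurrent map
        });

        reader.submit(() -> {
            Request r = pending.get(42L); // may be null if the writer has not run yet,
            if (r != null) {              // but a non-null result has fully visible fields
                System.out.println(r.requestId + " -> " + r.payload);
            }
        });

        writer.shutdown();
        reader.shutdown();
    }
}
```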

@eolivelli
Contributor

@lhotari
I agree with your points, we should make all the Recyclable classes that could be shared among threads threadsafe.
The problem is not for every class, but only for those that are meant to be passed across different threads while they are "active" (not parked into the Recycler).

what about opening a discussion on Netty ?
I guess that everyone who uses the Recycler has to deal with this kind of problems.
Maybe they could publish some guidelines based on your thoughts.

@lhotari
Member

lhotari commented Mar 11, 2022

I agree with your points, we should make all the Recyclable classes that could be shared among threads threadsafe. The problem is not for every class, but only for those that are meant to be passed across different threads while they are "active" (not parked into the Recycler).

I was thinking initially that "thread safety" should be achieved for recycled object classes. However, that would be very intrusive, since "safe publication" is sufficient to reach thread safety in this case. There aren't cases where there would be races between writes in multiple threads. The object instance is used and passed across threads, and as long as "safe publication" is achieved, that is sufficient. That's why I'm actually not so sure about the solution. "Safe publication" is also a mandatory consideration for new objects, so it's not actually completely limited to recycled object instances.

It's probably the case that there's already causality for most execution paths. It's just very hard to analyse the code and find the execution paths where this isn't the case.

what about opening a discussion on Netty ? I guess that everyone who uses the Recycler has to deal with this kind of problems. Maybe they could publish some guidelines based on your thoughts.

I don't think that this is a Netty problem. The usages of the Netty Recycler within Netty are most likely covered by causality. I'm not sure if the Netty Recycler is used commonly outside of Netty; there are only 8 projects on grep.app that contain "import io.netty.util.Recycler".

@lhotari
Member

lhotari commented Mar 11, 2022

Just wondering if the #12744 and #12881 changes are somehow triggering this issue?

@lhotari
Member

lhotari commented Mar 12, 2022

The reason for this is the reuse of the org.apache.bookkeeper.client.impl.LedgerEntryImpl object.

@congbobo184 Since this is related to the BookKeeper client, could you please run an experiment with bookkeeperUseV2WireProtocol=false for the broker? The benefit of this experiment is that it would tell whether there's any difference when the client uses the V3 protocol.

The reason for this request is that in the client, V2 requests don't have unique keys and I'm wondering if that could cause collisions:

https://github.com/apache/bookkeeper/blob/74b00a101cb6ffd06dafccfe9d31c8c5be66efaf/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java#L2298-L2313

@lhotari
Member

lhotari commented Mar 13, 2022

There seem to be cases where causality and "happens-before" aren't sufficient and classes must be thread safe.

For example BookKeeper client's LedgerEntriesImpl contains this type of code:
https://github.com/apache/bookkeeper/blob/7087fda4e91e9fe8974f9d36e01e275a08bf38f1/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/impl/LedgerEntriesImpl.java#L57-L63

    private void releaseByteBuf() {
        if (entries != null) {
            entries.forEach(LedgerEntry::close);
            entries.clear();
            entries = null;
        }
    }

If multiple threads call this method, the change to the entries field wouldn't be visible to the other thread, and this could cause the "double release" problem that might be behind this issue, where the logging added by @congbobo184 showed that a single instance was used in two locations simultaneously.

@congbobo184 would you be able to run an experiment where you make all methods in LedgerEntriesImpl synchronized and add a synchronized setter for modifying the "entries" field? This doesn't mean that it should be the way to resolve the issue.
Making LedgerEntriesImpl completely thread safe in this way would show whether the lack of thread safety in LedgerEntriesImpl is a source of the problems.
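
As an illustration of what that experiment could look like, here is a simplified stand-in (not a patch to the real LedgerEntriesImpl) where every access to the entries field is synchronized:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the experiment described above (not a patch to the real
// LedgerEntriesImpl): every method that touches the entries field is synchronized, so two
// threads calling the release path cannot both observe a non-null list and double-release
// the underlying buffers.
public class SynchronizedEntriesHolder {

    private List<AutoCloseable> entries = new ArrayList<>();

    public synchronized void setEntries(List<AutoCloseable> newEntries) {
        this.entries = newEntries;
    }

    public synchronized void releaseByteBuf() {
        if (entries != null) {
            for (AutoCloseable entry : entries) {
                try {
                    entry.close();
                } catch (Exception e) {
                    // ignored in this sketch
                }
            }
            entries.clear();
            entries = null;               // a second caller now sees null and returns early
        }
    }
}
```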

@lhotari
Member

lhotari commented Mar 13, 2022

I was looking into the BookKeeper client and it seems that apache/bookkeeper#1792 breaks thread safety for read-only ledgers.
Thread safety would require happens-before relationships all the way through, but they are missing in this case.

@lhotari
Member

lhotari commented Mar 13, 2022

Since thread safety is broken only for read-only ledgers, reproducing the issue requires a ledger rollover to happen and the Pulsar ledger cache to be empty. These conditions also seem to be met by the reproduction steps described in this report.

@congbobo184
Contributor Author

@lhotari I am working on transaction issues. I'll take a look at this when I'm done with them.

@lhotari
Member

lhotari commented Mar 14, 2022

@congbobo184 I filed apache/bookkeeper#3104 issue in BK for tracking this.

@dlg99
Contributor

dlg99 commented Mar 14, 2022

  • But before it is recycled again, the response for reading entry (899:173) arrives, and LedgerEntryImpl(1167574521), which is now reading entry (1395:220), is used to handle that response, even though LedgerEntryImpl(1167574521) is no longer in use by the read of entry (899:173). So the data returned to the broker is not (1395:220)'s data but entry (899:173)'s: the entry ID is (1395:220) but the data is entry (899:173)'s. This is what produces the problem.

Have you tried disabling speculative reads? Please see apache/bookkeeper#3106 (review)

@congbobo184
Contributor Author

apache/bookkeeper#3110 will fix it

@github-actions

github-actions bot commented May 9, 2022

The issue had no activity for 30 days, mark with Stale label.
