[Proto] java.lang.IllegalStateException: Some required fields are missing #14436

Closed

congbobo184 opened this issue Feb 24, 2022 · 38 comments

Labels: lifecycle/stale, type/bug

@congbobo184
Contributor

congbobo184 commented Feb 24, 2022

Describe the bug
Parsing the message metadata throws an exception:

[screenshot: IllegalStateException stack trace]
To Reproduce
Steps to reproduce the behavior:

  1. Start up a standalone Pulsar (version 2.10.0 or newer).
  2. Change the broker config:
    transactionCoordinatorEnabled=true
    managedLedgerMaxEntriesPerLedger=3
    managedLedgerMinLedgerRolloverTimeMinutes=1
  3. Configure the namespace message TTL: ./pulsar-admin namespaces set-message-ttl -ttl 10 public/default
  4. Run the following client code:

         PulsarClient pulsarClient = PulsarClient.builder()
                 .enableTransaction(true)
                 .serviceUrl("pulsar://127.0.0.1:6650")
                 .build();
         int sendTopicsCount = 200;
         String topicName = "test55";
         List<Producer<String>> producers = new ArrayList<>(sendTopicsCount);

         for (int i = 0; i < sendTopicsCount; i++) {
             producers.add(pulsarClient.newProducer(Schema.STRING)
                     .sendTimeout(0, TimeUnit.SECONDS)
                     .topic(i + topicName)
                     .create());
             Consumer<String> consumer = pulsarClient.newConsumer(Schema.STRING)
                     .subscriptionName("test")
                     .topic(i + topicName)
                     .subscribe();
             consumer.closeAsync();
         }
         for (int j = 0; j < 10000; j++) {
             for (int i = 0; i < sendTopicsCount; i++) {
                 Transaction transaction = pulsarClient.newTransaction()
                         .withTransactionTimeout(10, TimeUnit.SECONDS)
                         .build().get();
                 System.out.println("send one transaction messageId " + producers.get(i)
                         .newMessage(transaction).value(i + topicName).send() + "   topic name : " + i);
                 transaction.commit();
                 Thread.sleep(10);
             }
         }
    

Execute the above code multiple times, and run ./pulsar-admin namespaces unload public/default several times while the code is running.

  5. See the error:

[screenshot: broker stack trace]
Expected behavior
No exception should be thrown.


Additional context
This is not a stable reproduction; it may require multiple attempts.

@hangc0276
Contributor

test1.log.zip

less test1.log|grep "read entry from bookie : 1889:205"|less

@lhotari
Member

lhotari commented Mar 4, 2022

I made an attempt to fix this issue in #14556.

The IllegalStateException: Some required fields are missing could happen if any of publishTime, producerName, or sequenceId is unset.

required string producer_name = 1;
required uint64 sequence_id = 2;
required uint64 publish_time = 3;

One interesting detail is this code:

if (!msgMetadata.hasPublishTime()) {
    msgMetadata.setPublishTime(client.getClientClock().millis());
    checkArgument(!msgMetadata.hasProducerName());
    msgMetadata.setProducerName(producerName);
    if (conf.getCompressionType() != CompressionType.NONE) {
        msgMetadata.setCompression(
                CompressionCodecProvider.convertToWireProtocol(conf.getCompressionType()));
    }
    msgMetadata.setUncompressedSize(uncompressedSize);
}

I guess that's fine (similar logic has been in place since the initial commit). The problem is more likely a thread-safety issue or a shared thread-local instance issue. That's why #14556 could be effective.

@congbobo184
Contributor Author

@lhotari hi, I added some logging. I found that when the ByteBuf is read from the bookie client, the data length is not correct and the data is not the entry data that was requested. So the problem may simply be that the ByteBuf is not cleaned up properly; it may be a Netty bug or a bookie problem.

@lhotari
Member

lhotari commented Mar 4, 2022

@lhotari hi, I added some logging. I found that when the ByteBuf is read from the bookie client, the data length is not correct and the data is not the entry data that was requested. So the problem may simply be that the ByteBuf is not cleaned up properly; it may be a Netty bug or a bookie problem.

@congbobo184 Perhaps the ByteBuf was already released and this resulted in the memory location being reused for something else?

@congbobo184
Contributor Author

@lhotari hi, I added some logging. I found that when the ByteBuf is read from the bookie client, the data length is not correct and the data is not the entry data that was requested. So the problem may simply be that the ByteBuf is not cleaned up properly; it may be a Netty bug or a bookie problem.

@congbobo184 Perhaps the ByteBuf was already released and this resulted in the memory location being reused for something else?

Yes, most likely because the ByteBuf was not released cleanly and was reused.

@lhotari
Member

lhotari commented Mar 4, 2022

@congbobo184 @merlimat @codelipenghui @BewareMyPower When digging into the code, I keep circling back to the fact that OpReadEntry is not thread safe. OpReadEntry.entries is mutated and accessed from multiple threads. That is a big problem.

I have an attempt to improve the situation with #11387, but that's not ready and won't resolve the issue completely. The description in #11387 contains some explanation of the issue.

Do we all agree that OpReadEntry is not thread safe and that is a problem that must be resolved?
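
To make the concern concrete, here is a minimal, hypothetical sketch of the pattern being discussed (this is not the actual OpReadEntry code, just an illustration): a pooled "op" object whose plain list field is written on one executor thread and read on another, with nothing establishing a happens-before edge between the two.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the pattern under discussion; NOT the actual OpReadEntry code.
// A pooled "op" object has a plain List field that one thread fills while another thread
// reads it, with no synchronization or other happens-before edge in between.
public class UnsafeReadOp {

    // Plain mutable field: writes made on thread A are not guaranteed to be visible
    // to thread B unless something (a lock, a volatile, a single shared executor, ...)
    // orders the two accesses.
    private List<String> entries = new ArrayList<>();

    void addEntry(String entry) {       // e.g. called from a BookKeeper callback thread
        entries.add(entry);
    }

    int entryCount() {                  // e.g. called from a managed-ledger worker thread
        return entries.size();          // may observe a stale or partially built list
    }

    public static void main(String[] args) {
        UnsafeReadOp op = new UnsafeReadOp();
        ExecutorService callbackThread = Executors.newSingleThreadExecutor();
        ExecutorService workerThread = Executors.newSingleThreadExecutor();

        callbackThread.submit(() -> op.addEntry("entry-0"));
        // Without a shared barrier between the two executors, this read is not guaranteed
        // to see the write above; that is the kind of race being described here.
        workerThread.submit(() -> System.out.println("entries seen: " + op.entryCount()));

        callbackThread.shutdown();
        workerThread.shutdown();
    }
}
```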

@BewareMyPower
Contributor

Do we all agree that OpReadEntry is not thread safe and that is a problem that must be resolved?

Yes. But it looks like there are some problems with the pin executor solution. Could we solve this problem by changing entries to a CopyOnWriteArrayList and making the field volatile?
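
For illustration, a minimal sketch of that suggestion, applied to a simplified stand-in class rather than the real OpReadEntry:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the suggestion above, applied to the hypothetical example rather than to the
// real OpReadEntry: a volatile field holding a CopyOnWriteArrayList, so both the field
// reference and the list contents are safely published across threads.
public class SaferReadOp {

    private volatile List<String> entries = new CopyOnWriteArrayList<>();

    void addEntry(String entry) {
        entries.add(entry);                      // CopyOnWriteArrayList handles its own synchronization
    }

    int entryCount() {
        return entries.size();                   // volatile read guarantees visibility of the latest list
    }

    void recycleForReuse() {
        entries = new CopyOnWriteArrayList<>();  // volatile write publishes the fresh list
    }
}
```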

@BewareMyPower
Contributor

But it looks like the thread safety issue is not the cause of this issue.

@lhotari
Member

lhotari commented Mar 7, 2022

But it looks like the thread safety issue is not the cause of this issue.

@BewareMyPower What's the current assumption of the cause of this issue?

@lhotari
Member

lhotari commented Mar 7, 2022

Yes. But it looks like there are some problems with the pin executor solution. Could we solve this problem by changing entries to a CopyOnWriteArrayList and making the field volatile?

That could work, but I'm not sure about the consequences for performance. It would be worth testing to see what the impact would be.

@BewareMyPower
Contributor

BewareMyPower commented Mar 7, 2022

What's the current assumption of the cause of this issue?

I'm not working on this issue currently; it's just that this thread safety issue has existed from the very beginning. If this issue can be reproduced only on Pulsar 2.10, then the cause might not be related to OpReadEntry.

@lhotari
Member

lhotari commented Mar 7, 2022

What's the current assumption of the cause of this issue?

I'm not working on this issue currently; it's just that this thread safety issue has existed from the very beginning. If this issue can be reproduced only on Pulsar 2.10, then the cause might not be related to OpReadEntry.

Ok, so you don't have an assumption of the root cause?

@congbobo184 Could you please try running the test case with -Dio.netty.recycler.maxCapacity=0 -Dio.netty.recycler.maxCapacityPerThread=0 in the broker and in the client (perhaps separately)? That setting disables the Netty Recycler. The benefit of the experiment would be that it would reveal new information about the problem if disabling Netty Recycler makes the problem go away.

Please be aware that io.netty.recycler.maxCapacity.default is not the correct name of the system property. I created #14576 to remove the invalid property name from Pulsar files.

@congbobo184
Contributor Author

Hi @lhotari, I started standalone with -Dio.netty.recycler.maxCapacity=0 -Dio.netty.recycler.maxCapacityPerThread=0 and the problem does not seem to reproduce. I don't know whether that is because of performance degradation or because this Netty configuration makes it unreproducible. I logged the buffer length when reading from the bookie, so I think the problem might not be with the broker's OpReadEntry or MessageMetadata.

[screenshot: buffer length log output]

@lhotari
Member

lhotari commented Mar 8, 2022

@congbobo184 Thank you. Based on this experiment alone we cannot determine the issue, but since disabling the Netty Recycler has an impact, there's a possibility that recycling instances is causing issues in one way or another.

The next experiment could be to leave the Netty Recycler setting unchanged (enabled) and instead comment out the line recyclerHandle.recycle(this); in OpReadEntry: https://github.com/apache/pulsar/blob/9032a9afc15e029d8f205761b5e104f44f9c4be0/man[…]c/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

That will disable recycling of OpReadEntry instances. If that makes the problem go away, it's again more information about the possible problem.

@congbobo184 Would you be able to continue with this experiment?
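
For clarity, this is roughly what the experiment would look like, shown here on a hypothetical Recycler-backed class rather than the actual OpReadEntry: the call that returns the instance to the pool is commented out, so every acquisition allocates a fresh object.

```java
import io.netty.util.Recycler;

// Hypothetical Recycler-backed class (NOT the actual OpReadEntry) showing where the
// recycle call would be commented out for this experiment. With the call disabled,
// RECYCLER.get() always ends up allocating a new instance, so reuse-related races disappear.
public class PooledOp {

    private static final Recycler<PooledOp> RECYCLER = new Recycler<PooledOp>() {
        @Override
        protected PooledOp newObject(Handle<PooledOp> handle) {
            return new PooledOp(handle);
        }
    };

    private final Recycler.Handle<PooledOp> recyclerHandle;
    private Object payload;

    private PooledOp(Recycler.Handle<PooledOp> recyclerHandle) {
        this.recyclerHandle = recyclerHandle;
    }

    public static PooledOp create(Object payload) {
        PooledOp op = RECYCLER.get();
        op.payload = payload;
        return op;
    }

    public void recycle() {
        payload = null;
        // The experiment described above: skip returning the instance to the pool.
        // recyclerHandle.recycle(this);
    }
}
```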

@congbobo184
Contributor Author

congbobo184 commented Mar 9, 2022

Following @lhotari's advice, I rolled Netty back to 4.1.68.Final (which requires netty-tc-native.version 2.0.42.Final), and the problem can no longer be reproduced. Last time I only changed the Netty version to 4.1.68.Final and did not change netty-tc-native.version from 2.0.48.Final down to 2.0.42.Final, and the problem still happened. So the problem may be in netty-tc-native.version 2.0.48.Final.

@lhotari
Member

lhotari commented Mar 9, 2022

So the problem may be in netty-tc-native.version 2.0.48.Final.

Do you happen to use TLS in your setup? If not, it seems odd that the netty-tc-native version would affect whether the problem reproduces.

btw. A major difference between Netty 4.1.68.Final and 4.1.74.Final is the Netty Recycler rewrite that was made in 4.1.71.Final. There's more info in this comment: #13328 (comment)

@lhotari
Member

lhotari commented Mar 9, 2022

@congbobo184
Contributor Author

So the problem may be in netty-tc-native.version 2.0.48.Final.

Do you happen to use TLS in your setup? If not, it seems odd that the netty-tc-native version would affect whether the problem reproduces.

btw. A major difference between Netty 4.1.68.Final and 4.1.74.Final is the Netty Recycler rewrite that was made in 4.1.71.Final. There's more info in this comment: #13328 (comment)

I am not sure the problem is in 2.0.48.Final, so I will test again: change the Netty version to 4.1.68.Final and leave the netty-tcnative version unchanged.

@congbobo184
Contributor Author

congbobo184 commented Mar 11, 2022

We have roughly analyzed the problem. First of all, we believe this is not a problem that should block the 2.10 and 2.9.2 releases. Let's analyze the cause of this problem:

The reason for this is the reuse of the org.apache.bookkeeper.client.impl.LedgerEntryImpl object.

I added logging in three positions:

  1. org.apache.bookkeeper.client.impl.LedgerEntryImpl create
  2. org.apache.bookkeeper.client.impl.LedgerEntryImpl recycle
  3. handling of the read response in the BookKeeper client

[screenshots: log output from the three logging positions]

  1. As we can see in this log, the LedgerEntryImpl for entry (899:173) is created and then recycled without ever logging firstcomplete. This shows that LedgerEntryImpl(1167574521) did not receive any data from the bookie server; it may have received an error and was then recycled.
  2. When the LedgerEntryImpl for entry (899:173) is recycled, it is reused for reading entry (1395:220): it is created again and then receives the response from the bookie client.
  3. But before it is recycled again, the response for reading entry (899:173) arrives, and LedgerEntryImpl(1167574521), which is now reading entry (1395:220), is used to handle that response, even though LedgerEntryImpl(1167574521) is no longer in use by the read of entry (899:173). So the data returned to the broker is not (1395:220)'s data but entry (899:173)'s: the entry ID is (1395:220) but the data is entry (899:173)'s. This is what produces the problem.
  4. Then the org.apache.bookkeeper.client.impl.LedgerEntryImpl is recycled.

As we can see, this problem comes from the bookie client and is a long-standing one. So we should not block the 2.10 and 2.9.2 releases; we should wait for the bookie fix and release, i.e., release the broker first and then fix this issue in the bookie client.

Once the fix is complete, I will link it in this issue.
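
To make the sequence above easier to follow, here is a hypothetical, simplified sketch of the reuse race (a stand-in for LedgerEntryImpl, not the real BookKeeper client code):

```java
// Hypothetical timeline of the reuse race described above; NOT the actual BookKeeper client
// code. A pooled entry object is recycled after a failed read and immediately reused for a
// different entry; a late response for the first read then writes its data into the reused
// instance, so the entry id and the payload no longer match.
public class EntryReuseRaceSketch {

    static final class PooledEntry {   // simplified stand-in for LedgerEntryImpl
        long ledgerId;
        long entryId;
        String data;

        void resetFor(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.data = null;
        }
    }

    public static void main(String[] args) {
        PooledEntry pooled = new PooledEntry();

        // 1. The instance is created for read (899:173); the read fails, no data arrives,
        //    and the instance is recycled.
        pooled.resetFor(899, 173);

        // 2. The same instance is taken from the pool and reused for read (1395:220).
        pooled.resetFor(1395, 220);

        // 3. The late response for (899:173) still holds a reference to this instance and
        //    fills it in, so the object now claims to be (1395:220) but carries (899:173)'s
        //    payload -- the mismatch that is returned to the broker.
        pooled.data = "payload of entry 899:173";

        System.out.println("entry " + pooled.ledgerId + ":" + pooled.entryId
                + " -> data: " + pooled.data);
    }
}
```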

@lhotari
Member

lhotari commented Mar 11, 2022

Thanks for the analysis @congbobo184 . Really helpful.

As we can see, this problem comes from the bookie client and is a long-standing one. So we should not block the 2.10 and 2.9.2 releases; we should wait for the bookie fix and release, i.e., release the broker first and then fix this issue in the bookie client.

This doesn't mean that we should accept the bug that is caused by thread safety issues. This is a severe issue and can lead to data inconsistency problems. As discussed privately, this problem seems to become more frequent with Netty 4.1.74.Final compared to Netty 4.1.68.Final. Most likely this behavioral change is caused by the new Netty Recycler that was introduced in Netty 4.1.71.Final. My assumption is that the new implementation is more efficient and brings thread safety issues to the surface.
There's more info about the change in this comment: #13328 (comment)

I'll bring this severe issue up to discussion on the Apache Pulsar and Apache Bookkeeper developer mailing lists.

The quick workaround is to disable Netty Recycler completely by setting the system property -Dio.netty.recycler.maxCapacityPerThread=0. There are early reports from @dave2wave in OMB testing that when running on JDK17 ZGC or Shenandoah GC, there's no negative performance impact in disabling Netty Recycler.
This means that it will be useful to consider removing the Netty Recycler from the Bookkeeper and Pulsar code bases in the future. The current usage patterns contain thread safety issues, and those will continue to hinder Pulsar and Bookkeeper reliability unless they are resolved completely.

@shoothzj
Member

@lhotari do you mean it's a Netty issue? Should we report it to the Netty community and revert the Netty version?

@codelipenghui
Contributor

Most likely this behavioral change is caused by the new Netty Recycler that was introduced in Netty 4.1.71.Final. My assumption is that the new implementation is more efficient and brings thread safety issues to the surface.

I have tested with congbo and can't actually prove it for now. After downgrading the Netty version, the problem can still be reproduced, and the timing of reproduction is uncertain even with the same Netty version.

So far we don't have an exact root cause, but since the problem can be reproduced, I think we can find it.

@lhotari
Member

lhotari commented Mar 11, 2022

@lhotari do you mean it's a Netty issue? Should we report it to the Netty community and revert the Netty version?

No, it's not a Netty issue. It's a problem caused by our code in Pulsar and Bookkeeper violating the Java Memory Model and thread safety rules.

There are 2 possible solutions:

  1. Disabling the Netty Recycler by setting -Dio.netty.recycler.maxCapacityPerThread=0. There are early reports from @dave2wave in OMB testing that when running on JDK17 ZGC or Shenandoah GC, there's no negative performance impact in disabling Netty Recycler.

  2. Ensuring that recycled object instances are "safely published" to other threads ("safe publication"; a related concept is the "happens-before visibility guarantee"). There are multiple ways to achieve this, and full synchronization isn't necessary in most cases; see the sketch below.
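
A minimal sketch of option 2, illustrating the idea with a plain volatile field rather than patching any specific class:

```java
// Illustration of "safe publication" for a pooled object that is handed from one thread to
// another: all plain fields are written before a single volatile write, and the consumer
// performs the matching volatile read before touching the plain fields.
public class SafelyPublishedOp {

    private long ledgerId;
    private long entryId;
    private volatile boolean published;   // the volatile "barrier" field

    // Writer thread: set the plain fields first, then do the volatile write.
    void publish(long ledgerId, long entryId) {
        this.ledgerId = ledgerId;
        this.entryId = entryId;
        this.published = true;            // volatile write establishes the happens-before edge
    }

    // Reader thread: the volatile read comes first, then the plain fields are guaranteed to
    // be at least as fresh as they were at the time of the volatile write.
    String describe() {
        if (!published) {
            return "not yet published";
        }
        return ledgerId + ":" + entryId;
    }
}
```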

@lhotari
Member

lhotari commented Mar 11, 2022

Most likely this behavioral change is caused by the new Netty Recycler that was introduced in Netty 4.1.71.Final. My assumption is that the new implementation is more efficient and brings thread safety issues to the surface.

I have tested with congbo and can't actually prove it for now. After downgrading the Netty version, the problem can still be reproduced, and the timing of reproduction is uncertain even with the same Netty version.

So far we don't have an exact root cause, but since the problem can be reproduced, I think we can find it.

I meant to say that with the new Netty Recycler it just happens to occur more frequently. That's not the root cause. The root cause is the violation of the Java Memory Model and thread safety rules in the use of recycled objects across multiple threads.

@lhotari
Member

lhotari commented Mar 11, 2022

In this case, one example of a class that is not thread safe is LedgerEntriesImpl. It has a single field entries: https://github.com/apache/bookkeeper/blob/7087fda4e91e9fe8974f9d36e01e275a08bf38f1/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/impl/LedgerEntriesImpl.java#L38

I'm not saying that full synchronization is required. In this case, it is sufficient to apply the "golden rule" that Aleksey Shipilev presents.
[screenshots: Aleksey Shipilev's "golden rule" slides, at 33:10 in the talk]

I'm thinking of an approach where there would be a base class for recycled objects with methods that should be called after writing fields and before reading fields in different threads. That would be sufficient to ensure thread safety without the need to have full synchronization. It could be implemented with a simple volatile field write and read.
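
A rough sketch of what such a base class could look like (an assumed shape for the API, not an existing class):

```java
// Rough sketch of the base-class idea described above (an assumed shape, not an existing
// class): recycled subclasses call publishState() after mutating their plain fields and
// ensureStateVisible() before reading them on another thread. The volatile write/read pair
// provides the happens-before edge without full synchronization.
public abstract class AbstractRecycledObject {

    private volatile int publishBarrier;

    /** Call on the writing thread after all plain fields have been set. */
    protected final void publishState() {
        publishBarrier = 1;               // volatile write
    }

    /** Call on the reading thread before any plain fields are read. */
    protected final void ensureStateVisible() {
        // A plain volatile read; the value is irrelevant, only the ordering matters.
        int barrier = publishBarrier;
        assert barrier >= 0;
    }
}

// Example usage in a hypothetical recycled op class:
class ExampleRecycledOp extends AbstractRecycledObject {

    private long entryId;

    void fill(long entryId) {             // writer thread
        this.entryId = entryId;
        publishState();
    }

    long readEntryId() {                  // reader thread
        ensureStateVisible();
        return entryId;
    }
}
```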

@lhotari
Member

lhotari commented Mar 11, 2022

Usually, concurrent collections provide causality. For example, when a ConcurrentHashMap is used, it's all good when one thread writes fields and puts the object into the ConcurrentHashMap and another thread gets it from the ConcurrentHashMap and reads the fields. There is causality and the reads are consistent. The problem comes with other threads that don't use the same "barrier"; there's no guarantee of causality in that case. With asynchronous callbacks, multiple threads are used and there isn't a single barrier which would guarantee causality.
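
A small illustration of that point, using a hypothetical Request value class: the ConcurrentHashMap acts as the shared barrier between the writer and the reader, while any thread that reaches the same instance through a different, unsynchronized path gets no visibility guarantee.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration of the point above, using a hypothetical Request value class. The map is the
// shared barrier: the writer's put() happens-before the reader's successful get(), so the
// reader sees a fully initialised Request. A thread that reaches the same instance through
// some other, unsynchronized path gets no such guarantee.
public class BarrierExample {

    static final class Request {          // plain value object with no synchronization of its own
        long requestId;
        String payload;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<Long, Request> pending = new ConcurrentHashMap<>();
        ExecutorService writer = Executors.newSingleThreadExecutor();
        ExecutorService reader = Executors.newSingleThreadExecutor();

        writer.submit(() -> {
            Request r = new Request();
            r.requestId = 42L;
            r.payload = "data";
            pending.put(r.requestId, r);  // safe publication through the concurrent map
        });

        reader.submit(() -> {
            Request r = pending.get(42L); // may be null if the writer has not run yet,
            if (r != null) {              // but a non-null result has fully visible fields
                System.out.println(r.requestId + " -> " + r.payload);
            }
        });

        writer.shutdown();
        reader.shutdown();
    }
}
```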

@eolivelli
Contributor

@lhotari
I agree with your points, we should make all the Recyclable classes that could be shared among threads threadsafe.
The problem is not for every class, but only for those that are meant to be passed across different threads while they are "active" (not parked into the Recycler).

what about opening a discussion on Netty ?
I guess that everyone who uses the Recycler has to deal with this kind of problems.
Maybe they could publish some guidelines based on your thoughts.

@lhotari
Member

lhotari commented Mar 11, 2022

I agree with your points, we should make all the Recyclable classes that could be shared among threads threadsafe. The problem is not for every class, but only for those that are meant to be passed across different threads while they are "active" (not parked into the Recycler).

I was thinking initially that "thread safety" should be achieved for recycled object classes. However, that would be very intrusive, since "safe publication" is sufficient to reach thread safety in this case. There aren't cases where there would be races between writes in multiple threads. The object instance is used and passed across threads, and as long as "safe publication" is achieved, that is sufficient. That's why I'm actually not so sure about the solution. "Safe publication" is also a mandatory consideration for new objects, so it's not actually completely limited to recycled object instances.

It's probably the case that there's already causality for most execution paths. It's just very hard to analyse the code and find the execution paths where this isn't the case.

what about opening a discussion on Netty ? I guess that everyone who uses the Recycler has to deal with this kind of problems. Maybe they could publish some guidelines based on your thoughts.

I don't think that this is a Netty problem. The usages of the Netty Recycler within Netty are most likely covered by causality. I'm not sure if the Netty Recycler is used commonly outside of Netty; there are only 8 projects on grep.app that contain "import io.netty.util.Recycler".

@lhotari
Member

lhotari commented Mar 11, 2022

Just wondering if the #12744 and #12881 changes are somehow triggering this issue?

@lhotari
Member

lhotari commented Mar 12, 2022

The reason for this is the reuse of the org.apache.bookkeeper.client.impl.LedgerEntryImpl object.

@congbobo184 Since this is related to the BookKeeper client, could you please run an experiment with bookkeeperUseV2WireProtocol=false for the broker? The benefit of this experiment is that it would tell whether there's any difference when the client uses the V3 protocol.

The reason for this request is that in the client, V2 requests don't have unique keys and I'm wondering if that could cause collisions:

https://github.com/apache/bookkeeper/blob/74b00a101cb6ffd06dafccfe9d31c8c5be66efaf/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java#L2298-L2313

@lhotari
Member

lhotari commented Mar 13, 2022

There seem to be cases where causality and "happens-before" aren't sufficient and classes must be thread safe.

For example BookKeeper client's LedgerEntriesImpl contains this type of code:
https://github.com/apache/bookkeeper/blob/7087fda4e91e9fe8974f9d36e01e275a08bf38f1/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/impl/LedgerEntriesImpl.java#L57-L63

    private void releaseByteBuf() {
        if (entries != null) {
            entries.forEach(LedgerEntry::close);
            entries.clear();
            entries = null;
        }
    }

If multiple threads call this method, the change to the entries field wouldn't be visible to the other thread, and this could cause the "double release" problem that might be behind this issue, where the logging added by @congbobo184 showed that a single instance was used in two locations simultaneously.

@congbobo184 would you be able to run an experiment where you make all methods in LedgerEntriesImpl synchronized and add a synchronized setter for modifying the "entries" field? This doesn't mean that it should be the way to resolve the issue.
Making LedgerEntriesImpl completely thread safe in this way would show whether the lack of thread safety in LedgerEntriesImpl is a source of the problems.
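
As an illustration of what that experiment could look like, here is a simplified stand-in (not a patch to the real LedgerEntriesImpl) where every access to the entries field is synchronized:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the experiment described above (not a patch to the real
// LedgerEntriesImpl): every method that touches the entries field is synchronized, so two
// threads calling the release path cannot both observe a non-null list and double-release
// the underlying buffers.
public class SynchronizedEntriesHolder {

    private List<AutoCloseable> entries = new ArrayList<>();

    public synchronized void setEntries(List<AutoCloseable> newEntries) {
        this.entries = newEntries;
    }

    public synchronized void releaseByteBuf() {
        if (entries != null) {
            for (AutoCloseable entry : entries) {
                try {
                    entry.close();
                } catch (Exception e) {
                    // ignored in this sketch
                }
            }
            entries.clear();
            entries = null;               // a second caller now sees null and returns early
        }
    }
}
```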

@lhotari
Member

lhotari commented Mar 13, 2022

I was looking into the BookKeeper client and it seems that apache/bookkeeper#1792 breaks thread safety for read-only ledgers.
Thread safety would require happens-before relationships all the way through, but they are missing in this case.

@lhotari
Member

lhotari commented Mar 13, 2022

Since thread safety is broken only for read-only ledgers, reproducing the issue requires a ledger rollover to happen and the Pulsar ledger cache to be empty. These conditions also seem to be met by the reproduction steps described in this report.

@congbobo184
Contributor Author

@lhotari I am working on transaction issues. I'll take a look at this when I'm done with them.

@lhotari
Member

lhotari commented Mar 14, 2022

@congbobo184 I filed apache/bookkeeper#3104 issue in BK for tracking this.

@dlg99
Contributor

dlg99 commented Mar 14, 2022

  • But before it is recycled again, the response for reading entry (899:173) arrives, and LedgerEntryImpl(1167574521), which is now reading entry (1395:220), is used to handle that response, even though LedgerEntryImpl(1167574521) is no longer in use by the read of entry (899:173). So the data returned to the broker is not (1395:220)'s data but entry (899:173)'s: the entry ID is (1395:220) but the data is entry (899:173)'s. This is what produces the problem.

Have you tried disabling speculative reads? Please see apache/bookkeeper#3106 (review)

@congbobo184
Contributor Author

apache/bookkeeper#3110 will fix it

@github-actions

github-actions bot commented May 9, 2022

The issue had no activity for 30 days, mark with Stale label.
