
[issue #11796] throw NPE when readEntry#11808

Closed
casuallc wants to merge 211 commits into apache:master from casuallc:branch-2.8

Conversation

@casuallc
Contributor

#11796
Check whether the OpReadEntry is created successfully before using it.
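
The actual patch isn't quoted in this thread, so here is a rough, hypothetical illustration of the idea only — the `ReadOp` type and method names below are invented, not the managed-ledger API: verify the read operation object was created before using it, and fail the read cleanly instead of letting a later dereference throw an NPE.

```java
import java.util.concurrent.CompletableFuture;

class ReadEntryGuardSketch {

    interface ReadOp {
        byte[] read();
    }

    static CompletableFuture<byte[]> readEntry(ReadOp op) {
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        if (op == null) {
            // Fail fast with a descriptive error instead of an NPE deeper in the read path.
            future.completeExceptionally(
                    new IllegalStateException("Failed to create the read operation"));
            return future;
        }
        future.complete(op.read());
        return future;
    }
}
```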

codelipenghui and others added 30 commits June 7, 2021 14:10
…10849)

This reverts commit 6e747ec.

### Motivation
The problem in #10798  is more about "epoll" and not about tc-native.
With the adoptopenjdk/openjdk15:alpine-slim Docker image the problem does not reproduce, so it is better to stick to 2.0.38.Final

(cherry picked from commit 416ee72)
#10861)

This reverts commit 4264a67.

### Motivation

The change #8796 has broken the Pulsar
Functions running on Kubernetes.

The Pulsar Functions Kubernetes runtime generates a secret and mounts it
using mode `256`. That means the secret can only be read by its owner.
The StatefulSet created by the Kubernetes runtime mounts the secrets under the
`root` user, so only the root user is able to read the secret. As a result,
any submitted function fails to read the authentication
information.

Because all the Kubernetes resources generated by the Kubernetes runtime
are hardcoded, there is no easy way to change the security context for the
function StatefulSets.

Let's revert this change for 2.8.0, until we can address the issues in the Kubernetes runtime.

(cherry picked from commit 4f556a2)
…ss brokers (#10862)

* Added metadata cache test to simulate multi broker cache

* fix create and delete ops on cache

1. During create we should add a watch for the path in ZooKeeper. Without this
we will not be notified if the znode is changed on another broker.

2. Similarly, when deleting, the cache should be invalidated, but we shouldn't add an
entry to the cache. The entry could get added again on some other broker; in that
case we need to go and fetch the entry from ZooKeeper. Adding an empty
entry to the cache prevents that. (A minimal sketch of these rules follows below.)

* Address review comments

Also add a small delay in the test to allow notifications to propagate to other
caches. Without this the tests occasionally fail.
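
A minimal sketch of the two caching rules above, assuming a plain map in place of the real metadata cache and a `Runnable` in place of a ZooKeeper watch; the names are illustrative, not the actual Pulsar MetadataCache API.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class MetadataCacheSketch<T> {
    private final ConcurrentMap<String, T> cache = new ConcurrentHashMap<>();

    void onCreate(String path, T value, Runnable registerWatch) {
        cache.put(path, value);
        // Register a watch so changes made by another broker invalidate this entry.
        registerWatch.run();
    }

    void onDelete(String path) {
        // Only invalidate; do NOT cache an "empty" entry, because another broker may
        // recreate the znode and we would then need to re-fetch it from ZooKeeper.
        cache.remove(path);
    }

    Optional<T> get(String path) {
        return Optional.ofNullable(cache.get(path));
    }
}
```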

Co-authored-by: Surinder Singh <surinders@splunk.com>
(cherry picked from commit 798b34f)
…created (#10876)

* Always let system topic TRANSACTION_BUFFER_SNAPSHOT be auto created

* Remove unused import

* Add EVENTS_TOPIC_NAMES set; add unit tests

(cherry picked from commit b1b3b3c)
Fixes #10882

### Motivation

CmdSink and CmdSource use `gson` to parse the JSON configs from pulsar-admin, but most connectors use ObjectMapper to serde the config into the actual class. `gson` also converts int/long values into floats by default, so ObjectMapper cannot parse the resulting float strings back into int/long correctly.
 
### Modifications

use ObjectMapper to parse sink/source config.
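
A self-contained illustration of the motivation (not code from this PR): Gson turns JSON integers into `Double`s when deserializing into a generic `Map`, while Jackson keeps them as integers, so downstream int/long fields bind correctly.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.gson.Gson;
import java.util.Map;

public class ConfigParsingDemo {
    public static void main(String[] args) throws Exception {
        String json = "{\"batchSize\": 100}";

        Map<?, ?> viaGson = new Gson().fromJson(json, Map.class);
        System.out.println(viaGson.get("batchSize"));    // 100.0 (a Double)

        Map<?, ?> viaJackson = new ObjectMapper().readValue(json, Map.class);
        System.out.println(viaJackson.get("batchSize")); // 100 (an Integer)
    }
}
```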

(cherry picked from commit 2c9ea81)
*Motivation*

Generating metrics is an expensive task and it can take more than 30 seconds
if you have close to or more than a million topics.

*Modification*

This change provides a setting to adjust the async context timeout.
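
The name of the new setting is not quoted in this thread; conceptually, the metrics endpoint applies the configured value to its servlet `AsyncContext`, roughly as in this sketch (the surrounding class and parameter names are illustrative).

```java
import javax.servlet.AsyncContext;
import javax.servlet.http.HttpServletRequest;

class MetricsTimeoutSketch {
    void startAsyncWithTimeout(HttpServletRequest request, long configuredTimeoutMs) {
        AsyncContext ctx = request.startAsync();
        // The default 30s is not enough when generating metrics for ~1M topics,
        // so let operators raise the timeout via configuration.
        ctx.setTimeout(configuredTimeoutMs);
    }
}
```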

(cherry picked from commit c75d45b)
…0888)

* Make KeyValueSchema an interface visible in the public Schema API
- allow users of pulsar-client-api to use KeyValueSchema
- move KeyValueSchema implementation to KeyValueSchemaImpl
- introduce a new interface KeyValueSchema

(cherry picked from commit 18f2f4a)
Fixes #10906

Also addresses CVE-2021-28169

(cherry picked from commit 6c03154)
- Zookeeper 3.6.2 gets flagged as vulnerable
  https://ossindex.sonatype.org/component/pkg:maven/org.apache.zookeeper/zookeeper@3.6.2
  because of using vulnerable Netty version

(cherry picked from commit aa36eb8)
…r and make SchemaInfo an interface (#10878)

### Motivation

The java-instance.jar generated by the pulsar-functions-runtime-all module should only contain interfaces that Pulsar Function's framework uses to interact with user code. The module should only have the following dependencies
    1. pulsar-io-core
    2. pulsar-functions-api
    3. pulsar-client-api
    4. slf4j-api
    5. log4j-slf4j-impl
    6. log4j-api
    7. log4j-core


### Modifications

Change dep pulsar-client-original to pulsar-client-api

Slight changes in the top level pom for what is included in all sub-modules so that additional deps don't land into java-instance.jar

There is also a fix for an issue introduced by #9673. The thread context class loader was set incorrectly in ThreadRuntime.
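
The exact ThreadRuntime change is not shown here; the standard pattern it relies on is sketched below — set the function's class loader as the thread context class loader while invoking user code, and restore the previous one afterwards (names are illustrative).

```java
import java.util.function.Supplier;

final class ContextClassLoaderSketch {
    static <T> T withClassLoader(ClassLoader functionClassLoader, Supplier<T> userCode) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(functionClassLoader);
        try {
            return userCode.get();
        } finally {
            // Always restore, otherwise later tasks on this thread see the wrong loader.
            current.setContextClassLoader(previous);
        }
    }
}
```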

### Future improvements

1. We should also add a test in the future to make sure external libraries don't accidentally get added to this module and java-instance.jar

2. Rename the module "pulsar-functions-runtime-all" to something that describes its function better.  The current name can be confusing


(cherry picked from commit d81b5f8)
* Use brace initializer to initialize configurations

* Add config tests
### Motivation

When broker entry metadata is enabled, using pulsar-admin to peek messages will fail with

```
[pulsar-web-29-16] ERROR org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [null] Failed to peek message at position 1 from persistent://prop/ns-abc/topic-b604aad8ea8010af my-sub
java.lang.IllegalArgumentException: Invalid unknonwn tag type: 6
	at org.apache.pulsar.common.api.proto.LightProtoCodec.skipUnknownField(LightProtoCodec.java:270) ~[classes/:?]
	at org.apache.pulsar.common.api.proto.MessageMetadata.parseFrom(MessageMetadata.java:1370) ~[classes/:?]
	at org.apache.pulsar.common.protocol.Commands.parseMessageMetadata(Commands.java:426) ~[classes/:?]
```

### Modifications

Skip the broker entry metadata if it exists in `generateResponseWithEntry` and add a test to verify it.
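
A rough sketch of the skip logic, assuming the broker entry metadata is framed as a magic marker followed by a length-prefixed block; the constant value and method name below are illustrative, not the exact helpers in `Commands` or `PersistentTopicsBase`.

```java
import io.netty.buffer.ByteBuf;

class PeekEntrySketch {
    // Illustrative marker; the real magic value is defined by the wire protocol.
    private static final short BROKER_ENTRY_METADATA_MAGIC = (short) 0x0e02;

    static void skipBrokerEntryMetadataIfPresent(ByteBuf metadataAndPayload) {
        metadataAndPayload.markReaderIndex();
        if (metadataAndPayload.readShort() == BROKER_ENTRY_METADATA_MAGIC) {
            int size = metadataAndPayload.readInt();
            metadataAndPayload.skipBytes(size); // jump over the broker entry metadata
        } else {
            metadataAndPayload.resetReaderIndex(); // no broker entry metadata present
        }
    }
}
```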

(cherry picked from commit 7603a9d)
fixes testExpiredLedgerDeletionAfterManagedLedgerRestart

(cherry picked from commit 1623790)
When admin.topics().getDelayedDeliveryPolicy(topicName) is executed, topicPolicies may not be initialized yet.

This problem occurs when only a single unit test is executed.
When the entire test suite is executed, it rarely appears, because other unit tests initialize the TopicPolicies under this namespace.

(cherry picked from commit 96b293a)
* Do not expose meaningless stats for consumer.

Currently we have exposed some meaningless consumer stats to users such as

```
addressLength
addressOffset
connectedSinceOffset
connectedSinceLength
clientVersionOffset
clientVersionLength
stringBuffer
```

None of these stats are meant for users; they are only used internally,
so remove them from the exposed consumer stats.
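
The exact mechanism used by this commit is not shown here; one common way to keep Jackson from exposing such internal getters is `@JsonIgnore`, sketched below with hypothetical field names.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;

class ConsumerStatsView {
    public long msgRateOut;        // meaningful, stays in the exposed stats
    public String clientVersion;   // meaningful, stays in the exposed stats

    @JsonIgnore
    public int getClientVersionOffset() { // internal buffer offset, hidden from users
        return 0;
    }
}
```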

* remove line

(cherry picked from commit ae03e51)
## Motivation
fix backlog issue with --precise-backlog=true.
Currently, when `managedLedger` finishes creating a new `ledger`, and the `markDelete` position is the LAC of the `previousLedger`, the `previousLedger` is removed from the `managedLedger`. When computing the backlog we use a closed range to get `getNumberOfEntries` - 1, so if the `previousLedger` no longer exists we get the wrong number.

![image](https://user-images.githubusercontent.com/39078850/122502847-fce47800-d029-11eb-81b3-abc9e595d93e.png)

## Implementation
Use an open-closed range (`Range.openClosed()`) in `getBacklog`.
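
A self-contained illustration of the range semantics using Guava (not the managed-ledger code): a closed range starts at the mark-delete position itself, so the caller must subtract 1 and must still be able to resolve that position even if its ledger was removed; an open-closed range counts only the remaining (backlog) entries.

```java
import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;

public class BacklogRangeDemo {
    public static void main(String[] args) {
        long markDeleteEntry = 9; // last acknowledged entry; its ledger may already be deleted
        long lastConfirmed = 12;  // last entry written to the topic

        // Closed range includes the mark-delete position, hence the extra "-1".
        long closedCount = ContiguousSet.create(
                Range.closed(markDeleteEntry, lastConfirmed), DiscreteDomain.longs()).size();

        // Open-closed range counts only the unacknowledged entries directly.
        long backlog = ContiguousSet.create(
                Range.openClosed(markDeleteEntry, lastConfirmed), DiscreteDomain.longs()).size();

        System.out.println((closedCount - 1) + " vs " + backlog); // 3 vs 3
    }
}
```
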
### Verifying this change
Add the tests for it

(cherry picked from commit e3a97ee)
### Motivation

Support loading `awsCredentialPluginParam` from secrets for kinesis connectors to avoid leaking of sensitive information with best efforts.

### Modifications

Load kinesis configs with `IOConfigUtils.loadWithSecrets()`.
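
A usage sketch of the approach, assuming a config class whose sensitive field is annotated with `@FieldDoc(sensitive = true)`; the class and field below are illustrative, not the actual Kinesis config, and Lombok's `@Data` is only used to generate accessors.

```java
import java.io.Serializable;
import java.util.Map;
import lombok.Data;
import org.apache.pulsar.io.common.IOConfigUtils;
import org.apache.pulsar.io.core.SinkContext;
import org.apache.pulsar.io.core.annotations.FieldDoc;

@Data
class KinesisLikeConfig implements Serializable {
    @FieldDoc(required = false, defaultValue = "", sensitive = true,
            help = "Credential parameters passed to the AWS credential plugin")
    private String awsCredentialPluginParam;

    static KinesisLikeConfig load(Map<String, Object> map, SinkContext ctx) {
        // Plaintext values still work; secret-backed values are resolved from ctx.
        return IOConfigUtils.loadWithSecrets(map, KinesisLikeConfig.class, ctx);
    }
}
```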

### Verifying this change

This change is already covered by existing tests and this PR also adds tests to confirm both loading credentials from plaintext config object and secrets work as expected.


(cherry picked from commit cb79032)
### Motivation
If we set enableRetry=true, the client will subscribe to the retry topic.
If we set allowAutoTopicCreation=false, the client will receive `Topic does not exist`, which actually means that the retry topic does not exist.
This confuses users: their topic clearly exists, so why does this error appear?
Therefore, the error message is improved to make it easier for users to troubleshoot the problem.

### Modifications
Add the topic name to the returned error message.
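
A minimal sketch of the improved message (the exact exception type and wording used by the client/broker may differ).

```java
import org.apache.pulsar.client.api.PulsarClientException;

class TopicNotFoundMessageSketch {
    static PulsarClientException notFound(String topicName) {
        // Before: "Topic does not exist"; after: the failing topic name is included,
        // so users can see it is the retry/dead-letter topic that is missing.
        return new PulsarClientException.TopicDoesNotExistException(
                "Topic " + topicName + " does not exist");
    }
}
```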

(cherry picked from commit e72a0d8)
Fixes # throwable exception not thrown.

### Motivation

I found some throwable exceptions that are never thrown; this may be a bug.

(cherry picked from commit e0988f7)
…ting the number of partitions (#10910)

Fixes #10673

### Motivation
When updating the number of partitions, we need to update the data in two places in zk:
```
/admin/partitioned-topics
/managed-ledgers/
```

Now we only update the number of partitions in `/admin/partitioned-topics`, so if we do not create a Producer or Consumer, the data obtained in another cluster will be incorrect

### Modifications
1) Try to create managedLedger when updating the number of partitions
2) Ensure that the number of partitions in `/admin/partitioned-topics` is updated every time

(cherry picked from commit 202da11)
If a topic has only non-durable subscriptions and no durable subscriptions,
and the non-durable subscriptions reach the end of the topic, we will get an estimated backlog size of 0,
so the compaction will never be triggered.

The expected behavior is that if there are no durable subscriptions, we should use the total size for triggering
the compaction.
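
A sketch of the intended estimate (not the actual broker code): with no durable subscriptions, fall back to the topic's total size so the compaction threshold check can still fire.

```java
class CompactionBacklogSketch {
    static long estimatedBacklogForCompaction(boolean hasDurableSubscriptions,
                                              long durableSubscriptionsBacklogSize,
                                              long totalTopicSize) {
        // Non-durable readers at the end of the topic must not report a 0 backlog
        // forever, otherwise compaction is never triggered.
        return hasDurableSubscriptions ? durableSubscriptionsBacklogSize : totalTopicSize;
    }
}
```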

(cherry picked from commit 797cb12)
gaoran10 and others added 20 commits August 25, 2021 12:18
…pics (#11731)

* Remove the subscription from the topic when closing Reader subscription.

* remove useless code

(cherry picked from commit 5ac38f1)
#11691)

* Fixed block forever bug in Consumer.batchReceive

Ensure that all poll() calls on pendingBatchReceives
are done within the pinnedInternalExecutor to avoid a
race condition where a peek and a subsequent poll get
different pending receives.

Moved the pinnedInternalExecutor into the ConsumerBase
as both ConsumerImpl and MultiTopicsConsumerImpl require it.

failingPendingReceive() now always submits its work to the
internal executor, returns a CompletableFuture, and all callers
treat it as an asynchronous operation. (A JDK-only sketch of this
pattern follows after these commit notes.)

* Fix broken MultiTopicsConsumerImplTest

Needed a real executor service to run the
failPendingReceive() method.

* Ensure all calls to messageReceived happen on internal executor

* Readd missing return statement in ConsumerImpl.closeAsync()

* Ensure correct usage of consumer internal executors

Ensure that the externalPinnedExecutor is only called for user
code and internalPinnedExecutor used for internal tasks.
Some test refactoring to manage creation of executors.
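
A JDK-only sketch of the pattern described above (not the ConsumerBase code): every peek/poll of the pending-batch-receive queue is confined to one single-threaded executor, so a concurrent caller cannot poll a different element between a peek and the subsequent poll.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PendingReceivesSketch<T> {
    // Accessed only from internalPinnedExecutor, so no extra synchronization is needed.
    private final Queue<CompletableFuture<T>> pendingBatchReceives = new ArrayDeque<>();
    private final ExecutorService internalPinnedExecutor = Executors.newSingleThreadExecutor();

    CompletableFuture<Void> failPendingReceives(Throwable cause) {
        // Always hop onto the pinned executor and report completion asynchronously.
        return CompletableFuture.runAsync(() -> {
            CompletableFuture<T> pending;
            while ((pending = pendingBatchReceives.poll()) != null) {
                pending.completeExceptionally(cause);
            }
        }, internalPinnedExecutor);
    }

    void completeNext(T result) {
        internalPinnedExecutor.execute(() -> {
            CompletableFuture<T> pending = pendingBatchReceives.poll();
            if (pending != null) {
                pending.complete(result);
            }
        });
    }
}
```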

(cherry picked from commit bd942e1)
…11741)

Currently, on an add-entry failure, producer.disconnect() is called multiple times while the producer is disconnecting,
which adds many disconnect-producer tasks to the EventLoop.

1. Added an isDisconnecting state to the producer; if the producer is in the isDisconnecting state, skip the disconnect operation
2. Create the new future list only when the topic has producers, to reduce heap allocation

Added a test to cover disconnecting the producer multiple times while the EventLoop executes the disconnect only once.
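
A JDK-only sketch of the guard described above (not the actual Producer class): an atomic "disconnecting" flag ensures only one disconnect task is queued on the event loop no matter how many add-entry failures arrive concurrently.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

class DisconnectOnceSketch {
    private final AtomicBoolean isDisconnecting = new AtomicBoolean(false);

    void disconnect(Executor eventLoop, Runnable closeConnection) {
        // Only the first caller wins; later callers skip scheduling another task.
        if (isDisconnecting.compareAndSet(false, true)) {
            eventLoop.execute(() -> {
                try {
                    closeConnection.run();
                } finally {
                    isDisconnecting.set(false); // allow a later disconnect once this completes
                }
            });
        }
    }
}
```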

(cherry picked from commit 49c0796)
Fixes #10937

*Motivation*

The previous BookKeeper version introduced an incompatibility
with BouncyCastle FIPS. BookKeeper has already fixed this and we
should upgrade the version to resolve the issue.

For more information: #10937

*Modifications*

Upgrade bk to 4.14.2

(cherry picked from commit e9292b3)
* Fix the topic being stuck in the fenced state and unable to recover.

Here is the log when the issue happens. The producer continues to reconnect to the broker, but the fenced state of the topic is always true.
```
19:01:42.351 [pulsar-io-4-1] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.24.34.151:48052][persistent://public/default/test-8] Creating producer. producerId=8
19:01:42.352 [Thread-174681] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.24.34.151:48052] persistent://public/default/test-8 configured with schema false
19:01:42.352 [Thread-174681] WARN  org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/test-8] Attempting to add producer to a fenced topic
19:01:42.352 [Thread-174681] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.24.34.151:48052] Failed to add producer to topic persistent://public/default/test-8: producerId=8, org.apache.pulsar.broker.service.BrokerServiceException$TopicFencedException: Topic is temporarily unavailable
```

After checking the heap dump of the broker, the `pendingWriteOps` is 5; this is the reason why the topic cannot recover from the fenced state.

The topic only changes to unfenced when `pendingWriteOps` is 0; details can be found at [PersistentTopic.decrementPendingWriteOpsAndCheck()](https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L463)

But checking the ML state of the topic shows that `pendingAddEntries` is 0, which does not equal the `pendingWriteOps` of the topic.
The root cause is that we poll add-entry ops from `pendingAddEntries` in multiple threads: one is the ZK callback thread when ledger creation completes (https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1406, https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1669),
the other is the ML worker thread when the add-entry op completes (https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L181)

When the ongoing add-entry op completes, the corresponding op might already have been polled by the `clearPendingAddEntries` method. It then polls another op, but because that op does not equal the current op, the polled op never gets a chance to be failed, so `pendingWriteOps` never drops back to 0.

I have attached the complete logs for the topic.

The fix is to complete the add-entry op with a ManagedLedgerException if the polled op is not equal to the current op.
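
A simplified sketch of that fix (not the real OpAddEntry code): when the op polled from pendingAddEntries is not the op that just completed, fail the polled op explicitly so its pending-write counter is released and the topic can leave the fenced state.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class PendingAddEntriesSketch {
    interface AddOp {
        void failed(Exception reason);
        void succeeded();
    }

    private final Queue<AddOp> pendingAddEntries = new ConcurrentLinkedQueue<>();

    void onAddComplete(AddOp completedOp) {
        AddOp polled = pendingAddEntries.poll();
        if (polled != null && polled != completedOp) {
            // Another thread (e.g. clearPendingAddEntries) already removed completedOp;
            // do not silently drop the op we just polled -- fail it explicitly.
            polled.failed(new IllegalStateException("Op polled out of order"));
        }
        completedOp.succeeded();
    }
}
```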

* Release buffer.

* Revert

(cherry picked from commit 1bcbab0)
…ned by an existed topic as a part (#11686)

Fixes #11685

validatePartitionTopicUpdate uses contains() to check whether an existing topic conflicts, which makes the update fail when an existing topic name contains the new topic's prefix and we want to update the new topic's partitions;

Suppose we have these topics:
"persistent://public/default/113p-partition-0"
"persistent://public/default/113p-partition-1"
"persistent://public/default/113p-partition-2"
"persistent://public/default/3p-partition-0"

When we want to update topic 3p to more partitions, the update fails because "persistent://public/default/113p-partition-0" contains "3p-partition".

Modifications
Use startsWith() to check whether the same topic already exists (demonstrated below).

* Fix the bug where an existing topic containing the same substring prevented adding new partitions

* Add a test for updating a partitioned topic whose name is contained in an existing topic
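
A self-contained illustration of the check (not the actual validatePartitionTopicUpdate code): `contains()` wrongly flags "113p" as conflicting when updating "3p", while a prefix check on the new topic's partition prefix does not.

```java
public class PartitionPrefixCheckDemo {
    public static void main(String[] args) {
        String existing = "persistent://public/default/113p-partition-0";
        String newTopicPartitionPrefix = "persistent://public/default/3p-partition-";

        System.out.println(existing.contains("3p-partition"));            // true  -> spurious conflict
        System.out.println(existing.startsWith(newTopicPartitionPrefix)); // false -> no conflict
    }
}
```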

Co-authored-by: nicklixinyang <nicklixinyang@didiglobal.com>
(cherry picked from commit 241de4b)
### Motivation
The producer doesn't release the memory when a message send times out.

### Modifications
When a message times out, release the memory as well.
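
A JDK-only sketch of the idea (not the ProducerImpl code): whatever memory was reserved for a pending message must be released on the timeout path too, not only on a successful send or an explicit failure.

```java
import java.util.concurrent.atomic.AtomicLong;

class ProducerMemorySketch {
    private final AtomicLong reservedBytes = new AtomicLong();

    void onSend(long msgSize) {
        reservedBytes.addAndGet(msgSize);
    }

    void onSendCompleted(long msgSize) {
        reservedBytes.addAndGet(-msgSize);
    }

    void onSendTimeout(long msgSize) {
        // The missing piece before the fix: timed-out messages never returned their
        // reservation, so the client eventually hit the configured memory limit.
        reservedBytes.addAndGet(-msgSize);
    }
}
```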

### Documentation
We don't need to update docs, because it's a bug fix.

### Reproduce the problem
- start the Pulsar server
- start a Pulsar producer that sends messages for 10 minutes (queueSize 0, memory limit 50 MB)
- close the Pulsar server
- wait for the 10 minutes to pass
- inspect the heap dump: it shows pendingMessages size is 0, but memory used is 52429824

(cherry picked from commit 1399eeb)
### Motivation

Under certain conditions, applications using the multi-topic consumer might see consumption stall:

The conditions to reproduce the issue are:
 * Consumer is subscribed to multiple topics, but only 1 topic has traffic
 * Messages are published in batches (no repro if no batches)
 * Receiver queue size == 1 (or small, in order to exercise race condition)

The problem is that there is a race condition between two threads when we're deciding to put one of the individual consumers into the "paused" state, when the shared queue is full.

What happens is that, just after we check the conditions and decide to mark the consumer as paused, the application empties the shared queue completely. From that point on, there is no re-attempt to check whether we need to unblock that consumer.

### Modification

Instead of introducing a sync block (contended by many consumers), we just double check the state of the shared queue after marking the consumer as "paused". If the other thread has emptied the queue in the meantime, we'll be guaranteed to unblock the consumer.
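
A JDK-only sketch of that double check (not MultiTopicsConsumerImpl itself): after marking a sub-consumer paused because the shared queue looked full, re-check the queue; if the application drained it in the meantime, resume immediately.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class PausableConsumerSketch {
    private final BlockingQueue<Object> sharedQueue = new LinkedBlockingQueue<>(1);
    private final AtomicBoolean paused = new AtomicBoolean(false);

    void maybePauseReceiving() {
        if (sharedQueue.remainingCapacity() == 0) {
            paused.set(true);
            // Double check: the application may have drained the queue between the
            // capacity check above and setting the flag. Without this re-check there
            // is no later event that would unpause the consumer, stalling consumption.
            if (sharedQueue.isEmpty() && paused.compareAndSet(true, false)) {
                resumeReceiving();
            }
        }
    }

    private void resumeReceiving() {
        // In the real client this re-issues flow permits to the paused sub-consumer.
    }
}
```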

(cherry picked from commit f1d66d1)
Avoid issue due to async update by using Awaitility

Co-authored-by: Surinder Singh <surinders@splunk.com>
(cherry picked from commit 4cf1008)
…ug (#11769)

### Motivation
Running the license check on branch-2.8 failed:
```
com.squareup.okhttp-okhttp-2.7.4.jar unaccounted for in LICENSE

It looks like there are issues with the LICENSE/NOTICE.
```
### Modification
add com.squareup.okhttp-okhttp-2.7.4.jar into LICENSE.bin.txt
* [Security] Use ubuntu:20.04 base image for Pulsar docker images

- Ubuntu fixes critical and high security vulnerabilities.
  - openjdk:11-jdk-slim/openjdk:11-jdk images are based on
    Debian 10 which contains a lot of unfixed vulnerabilities.
    - this causes the Pulsar docker images to get flagged by Docker image
      vulnerability scanning tools such as Clair

* Install Ubuntu updates

* Set DEBIAN_FRONTEND=noninteractive so that dist-upgrade doesn't wait for input

* Set JAVA_HOME

* Fix configuring networkaddress.cache.ttl

- JAVA_HOME set by ENV isn't available for RUN commands

* Configure networkaddress.cache.ttl after installing OpenJDK

(cherry picked from commit f989512)
Fix #11789

### Modification
upgrade aircompressor from 0.19 to 0.20

(cherry picked from commit 051f52d)
…r upgrade to 0.20 (#11792)

### Motivation
Because aircompressor 0.19 can't work with heap buffers on JDK 1.8, #11594 copied data to a direct buffer to avoid a NoSuchMethodError exception. Now that aircompressor 0.20 has been released and #11790 has upgraded the aircompressor version to 0.20 to fix this issue, we can revert #11594 to avoid copying data to a direct buffer and improve performance.

### Modification
1. Revert "Fix java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer when enabling topic metadata compression" (#11594), but keep the tests.

(cherry picked from commit cc1b983)
@casuallc casuallc changed the title Branch 2.8 [issue #11796] throw NPE when readEntry Aug 27, 2021
@casuallc casuallc changed the base branch from master to branch-2.8 August 27, 2021 02:11
@Anonymitaet
Member

Thanks for your contribution. For this PR, do we need to update docs?

(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@casuallc
Contributor Author

> Thanks for your contribution. For this PR, do we need to update docs?
>
> (The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

No need.

@Anonymitaet Anonymitaet added the doc-not-needed Your PR changes do not impact docs label Aug 27, 2021
@BewareMyPower
Contributor

Why are you trying to merge into branch-2.8 instead of master? A PR that doesn't target the master branch won't trigger any CI workflow.

Contributor

@BewareMyPower BewareMyPower left a comment


Blocking this PR for now because it targets branch-2.8.

@casuallc casuallc changed the base branch from branch-2.8 to master August 27, 2021 07:41
@casuallc casuallc closed this Aug 27, 2021
@casuallc casuallc deleted the branch-2.8 branch August 27, 2021 07:43
