Using the Python pulsar client with a logger can cause arbitrary/unrelated `async` Python functions to return `None` #11823

zbentley · 2021-08-27T23:11:37Z

If a logger object (any logger object) is supplied to pulsar.Client objects in Python, its presence can cause arbitrary async Python functions to return None incorrectly.

Steps to reproduce

To reproduce, run the following:

import asyncio
import logging

from pulsar import Client


async def async_func(client, rv):
    consumer = client.subscribe('sometopic', 'somesub')
    consumer.close()
    print("async returning", rv)
    return rv


if __name__ == '__main__':
    client = Client(
        service_url='pulsar://localhost:6650/',
        logger=logging.getLogger("foo")
    )
    print("returned:", asyncio.run(async_func(client, "bar")))

Test setup

Python: 3.9
OS: MacOS 10.11
Arch: x86_64
Pulsar broker: 2.8.0 standalone, running locally
Client: 2.8.0.post0

Pathology/root cause

This appears to be due to a Boost-python issue (and, in my opinion, a pretty bad one) I found while trying to track down this behavior: boostorg/python#374

Unless there is a way to fix that, it looks like any invocation of pulsar-client's Python logger by the C++ code can, in the right circumstances, corrupt the Python interpreter state and cause the calling async function to return None incorrectly.


zbentley · 2021-08-27T23:13:17Z

Since this is a reasonably subtle and silent bug when it occurs, I suggest removing client support for a user-supplied logger.

That breaks backwards compatibility, and is not to be undertaken lightly, but the risk otherwise seems pretty significant: the bug surface is the intersection of "a custom logger is in use", "any async python function is running in my program, anywhere" and "any pulsar C++ object is destructed anywhere in the program". That's not good.

zbentley · 2021-08-27T23:18:45Z

Actually, after thinking about it a bit, this may be possible for the Pulsar client to mitigate (though it still seems like a Boost problem fundamentally).

Since PyErr_Print mutates interpreter state (clears the global exception bit), calling it in the C++ python bindings is making this bug worse. Without that call, any logger interactions during C++ object destruction would still fail, and some Python internal interpreter errors would be emitted, but things would still generally work.

However, with the call to PyErr_Print, other state gets corrupted and return values get messed up.

BewareMyPower · 2021-08-28T04:48:28Z

@lbenc135 Could you help take a look?

BewareMyPower · 2021-08-28T05:07:55Z

I cannot reproduce this bug in my local env (macOS Big Sur 11.2, Boost-python 1.74). Here's my output:

async returning bar
returned: bar

I also noticed PyErr_Print is only called when exceptions are caught. Does removing the call of PyErr_Print really work?

BTW, I found #10981 might fix this issue because the Python client I used was compiled from latest master. Could you also try it?

BewareMyPower · 2021-08-28T11:01:57Z

Strangely, I can reproduce this bug now. The output is:

async returning bar
StopIteration: bar
returned: None
2021-08-28 19:00:00.107 ERROR [0x700004d1c000] ClientConnection:581 | [127.0.0.1:56560 -> 127.0.0.1:6650] Read failed: Operation canceled

BewareMyPower · 2021-08-28T11:25:08Z

I think a temporary workaround is to avoid running a C++ object's destructor inside async functions, for example:

consumer = None

async def async_func(client, rv):
    global consumer # use the global variable
    consumer = client.subscribe('sometopic', 'somesub')
    consumer.close()
    print("async returning", rv)
    return rv
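To see why moving the reference out of the coroutine changes when the destructor runs, here is a minimal pure-Python sketch with no Pulsar involved. `Resource`, `holder`, and `run_and_record` are hypothetical names standing in for the consumer, the global variable, and a test driver. The ordering shown relies on CPython's reference counting, so treat it as an illustration rather than a guarantee.

```python
import asyncio

events = []


class Resource:
    """Hypothetical stand-in for a C++-backed object such as a consumer."""

    def __del__(self):
        events.append("destroyed")


holder = []  # plays the role of the global variable in the workaround


async def local_only():
    resource = Resource()  # the only reference lives in the coroutine frame
    events.append("returning")
    return "bar"


async def kept_alive():
    resource = Resource()
    holder.append(resource)  # an outside reference survives the return
    events.append("returning")
    return "bar"


def run_and_record(func):
    """Run the coroutine function; return (result, events seen during the run)."""
    events.clear()
    result = asyncio.run(func())
    return result, list(events)
```

With `local_only`, the object is destroyed as the coroutine returns, which is the window the bug needs. With `kept_alive`, destruction is deferred until `holder` lets go of the object.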

zbentley · 2021-08-28T12:58:35Z

> Does removing the call of PyErr_Print really work?

I believe so, though that only solves part of the issue. Without the call to PyErr_Print, return values aren't altered. However, the logger won't work in those situations (destruction after an async return) even without the call to PyErr_Print, as boost::python::call fails before invoking the requested function in that context.

> a temporary solution is to avoid C++ object's destructor in async functions

That does indeed work. However, because the Pulsar client is heavily asynchronous, I don't think that workaround is practical. Consider a big Python program with lots of async code that instantiates a Pulsar client globally. Any time that client logs for any reason (not just destructors), there's a chance that the logging action happens while the interpreter is returning from an async function, in which case this bug will occur, even if the async functions running have nothing to do with Pulsar in any way. The case with a destructor is just the most reliable way to encounter this bug, not the only way.

I'm less sure about this, but I think that chance might not be as small as it sounds; it is possible that if the event loop is blocked by something (anything), pending futures whose results have not yet been consumed stay in the StopIteration-exception-pending state until a routine comes along to check on them. If that is the case, then the "coincidence" window where this bug could occur is much wider.
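For readers unfamiliar with the mechanism being discussed: a coroutine's return value leaves the coroutine as the `value` of a `StopIteration` exception, which is why `StopIteration: bar` shows up in the output above. The sketch below shows that in plain Python. `drive` and `drive_and_drop` are hypothetical helper names, and `drive_and_drop` is only a loose analogy: in the real bug the pending exception is cleared at the C level, not by a Python `except` clause.

```python
async def async_func(rv):
    return rv


def drive(coro):
    """Run a coroutine that never suspends and return its result."""
    try:
        coro.send(None)
    except StopIteration as stop:
        # The return value arrives as the exception's `value` attribute.
        return stop.value
    raise RuntimeError("coroutine suspended; this sketch only handles immediate returns")


def drive_and_drop(coro):
    """Same as drive(), but discards the StopIteration without reading it."""
    try:
        coro.send(None)
    except StopIteration:
        pass  # loosely analogous to something clearing the pending exception
    return None
```

If whatever carries the `StopIteration` is discarded before the caller reads it, the caller has nothing left to return except `None`.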

BewareMyPower · 2021-08-30T01:27:52Z

Yeah, I confirmed in my local env that removing PyErr_Print works. That's because when the LoggerWrapper is constructed in a destructor, py::call_method for logger.getEffectiveLevel fails and PyErr_Print is triggered. I'll push a PR soon.

zbentley · 2021-08-30T13:49:00Z

Confirmed that #10981 does not resolve this issue. That's because the root cause is pretty broad (any time C++ code calls back into Python during a C++ destructor, not just during global finalization, the call fails--and worse, if you then call the standard "print why this failed" python utility function, it corrupts the interpreter frame).

BewareMyPower · 2021-08-30T15:05:27Z

Yeah, I just opened PR #11840 to fix this issue, PTAL. Since the test cannot run in the current Python 2 based CI, you can verify it in your local env.

zbentley · 2021-08-30T15:53:08Z

@BewareMyPower testing now.

zbentley · 2021-08-30T17:25:13Z

Unfortunately I am able to reproduce the bug on your branch:

∴ python tests/benchmark/scratch.py
async returning bar
StopIteration: bar
returned: None

zbentley · 2021-08-30T21:11:47Z

Additionally, in the process of debugging this issue, I found a couple more issues related to the logger argument in Python.

Annoying/potentially damages logging integrity (medium/low severity for me, unsure about others): [2.8.0] Python client instances emit logs for every client instance ever constructed with a logger argument pulsar-client-python#40
Extremely problematic for any affected code (high severity for me, likely others too unless there is something about my installation that is very unique): [2.8.0] Python programs hang uninterruptibly on shutdown if pulsar clients are used anywhere non-global with the logger argument #11847

BewareMyPower · 2021-08-31T02:13:58Z

I found the second issue as well. We can take a look at these two issues later.

Regarding this issue, could you upload your scratch.py code so I can reproduce it? In my local env, custom_logger_test.py and the code in this PR work well. Here's my code with debug level logging.

import faulthandler
import asyncio
import logging

from pulsar import Client

def test():
    client = Client(
        service_url='pulsar://localhost:6650/',
        logger=logging.getLogger("foo")
    )

    async def async_func(rv):
        consumer = client.subscribe('sometopic', 'somesub')
        consumer.close()
        print("async returning", rv)
        return rv

    print("returned:", asyncio.run(async_func("bar")))
    client.close()


if __name__ == '__main__':
    faulthandler.enable()
    logging.basicConfig(encoding='utf-8', level=logging.DEBUG)
    test()

When I ran it, the output looked like this:

async returning bar
2021-08-31 10:10:38.706 DEBUG [0x10afd1e00] ConsumerImpl:106 | [persistent://public/default/sometopic, somesub, 0] ~ConsumerImpl
2021-08-31 10:10:38.706 DEBUG [0x10afd1e00] AckGroupingTrackerEnabled:100 | Reference to the HandlerBase is not valid.
DEBUG:foo:Ignoring timer cancelled event, code[system:89]
returned: bar

You can see there are two lines that use the default logger.

In your output, we can still see StopIteration: bar, which must have been printed by PyErr_Print. I didn't remove all PyErr_Print calls because I don't think the remaining ones are reached from destructors. If you can provide code to reproduce, I can dig deeper into the cause.

zbentley · 2021-08-31T13:56:02Z

The contents of my scratch.py were the snippet in the description of the bug.

When I run your code exactly as written, I get this output:

INFO:foo:Subscribing on Topic :sometopic
INFO:foo:Created connection for pulsar://localhost:6650/
INFO:foo:[127.0.0.1:61188 -> 127.0.0.1:6650] Connected to broker
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Getting connection from pool
INFO:foo:Created connection for pulsar://localhost:6650
INFO:foo:[127.0.0.1:61189 -> 127.0.0.1:6650] Connected to broker through proxy. Logical broker: pulsar://localhost:6650
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Created consumer on broker [127.0.0.1:61189 -> 127.0.0.1:6650]
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closing consumer for topic persistent://public/default/sometopic
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closed consumer 0
async returning bar
StopIteration: bar
returned: None
INFO:foo:Closing Pulsar client
INFO:foo:[127.0.0.1:61189 -> 127.0.0.1:6650] Connection closed
INFO:foo:[127.0.0.1:61188 -> 127.0.0.1:6650] Connection closed

...after which the code hangs (per the other bug linked).

I'm using python 3.9.6 installed via brew, in a clean virtual environment; the only commands I've issued other than venv creation are pip install pulsar-client and pip install fastavro.

zbentley · 2021-08-31T14:00:55Z

On Python 3.7.10, the bug still occurs, but I get a lot more output, which may be useful:

DEBUG:foo:Using Binary Lookup
DEBUG:asyncio:Using selector: KqueueSelector
INFO:foo:Subscribing on Topic :sometopic
INFO:foo:Created connection for pulsar://localhost:6650/
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Connected to broker
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Handling incoming command: PARTITIONED_METADATA_RESPONSE
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Received partition-metadata response from server. req_id: 1
DEBUG:foo:PartitionMetadataLookup response for persistent://public/default/sometopic, lookup-broker-url
DEBUG:foo:BatchAcknowledgementTracker for [persistent://public/default/sometopic, somesub, 0] Constructed BatchAcknowledgementTracker
DEBUG:foo:Created negative ack tracker with delay: 60000 ms - Timer interval: 00:00:20
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Getting connection from pool
DEBUG:foo:Got connection from pool for pulsar://localhost:6650/ use_count: 4 @ 0x7fdffc820200
DEBUG:foo:ACK grouping is enabled, grouping time 100ms, grouping max size 1000
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Handling incoming command: LOOKUP_RESPONSE
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2 -- broker-url: pulsar://localhost:6650 -- broker-tls-url:  authoritative: 1 redirect: 1
DEBUG:foo:Lookup response for persistent://public/default/sometopic, lookup-broker-url pulsar://localhost:6650
DEBUG:foo:Getting connection to broker: pulsar://localhost:6650
INFO:foo:Created connection for pulsar://localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Connected to broker through proxy. Logical broker: pulsar://localhost:6650
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Connected to broker: [127.0.0.1:61247 -> 127.0.0.1:6650]
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Received success response from server. req_id: 0
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Created consumer on broker [127.0.0.1:61247 -> 127.0.0.1:6650]
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send initial flow permits: 1000
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send more permits: 1000
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closing consumer for topic persistent://public/default/sometopic
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Received success response from server. req_id: 1
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closed consumer 0
async returning bar
StopIteration: bar

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zac.bentley/.pyenv/versions/3.7.10/lib/python3.7/logging/__init__.py", line 1365, in debug
    if self.isEnabledFor(DEBUG):
SystemError: PyEval_EvalFrameEx returned a result with an error set
DEBUG:foo:Reference to the HandlerBase is not valid.
DEBUG:foo:Ignoring timer cancelled event, code[system:89]
returned: None
INFO:foo:Closing Pulsar client
INFO:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Connection closed
INFO:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Connection closed

BewareMyPower · 2021-08-31T15:02:09Z

> after which the code hangs (per the other bug linked).

The output looks like the case you described in the other issue, but my code calls client.close(), so that should not happen.

> I'm using python 3.9.6 installed via brew, in a clean virtual environment; the only commands I've issued other than venv creation are pip install pulsar-client and pip install fastavro.

How did you build the C++ library? I suspect your C++ library was not built correctly. I didn't use pip install; instead, I built from source using CMake in the pulsar-client-cpp directory.

mkdir -p _builds
# NOTE: I installed the gtest 1.10.0 and protobuf 3.17.3 dependency into ~/software directory
# The Boost dependency was installed by brew, including Boost::python.
SOFTWARE="$HOME/software"
cd _builds
cmake .. -Wno-dev \
    -DPROTOC_PATH=$SOFTWARE/protobuf-3.17.3/bin/protoc \
    -DCMAKE_PREFIX_PATH="$SOFTWARE/gtest-1.10.0;$SOFTWARE/protobuf-3.17.3" \
    -DBUILD_PYTHON_WRAPPER=ON -DBUILD_TESTS=ON -DBUILD_PERF_TOOLS=OFF
make -j4

After compilation completes, copy _pulsar.so from _builds/python to the python directory. I then put my Python scripts in the python directory and ran them from there, so that the Python interpreter finds the local _pulsar.so instead of the one installed by pip.
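Since the result depends on which `_pulsar` build the interpreter actually picks up, a quick check like the following can confirm it before running the reproduction. This is a minimal sketch using only the standard library; `module_origin` is a hypothetical helper name, and the module names `_pulsar` and `pulsar` are taken from this thread.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # `_pulsar` is the compiled extension; `pulsar` is the Python wrapper around it.
    for name in ("_pulsar", "pulsar"):
        print(name, "->", module_origin(name))
```

Run it from the same directory and virtual environment as the test script. If the printed path points into site-packages rather than the local python directory, the pip-installed wheel is still being used.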

NOTE: the picture above used INFO level instead of DEBUG level of my previous code.

BTW, here is the full output with my code before:

DEBUG:foo:Using Binary Lookup
DEBUG:asyncio:Using selector: KqueueSelector
INFO:foo:Subscribing on Topic :sometopic
INFO:foo:[<none> -> pulsar://localhost:6650/] Create ClientConnection, timeout=10000
INFO:foo:Created connection for pulsar://localhost:6650/
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolving localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to 127.0.0.1:6650...
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Connected to broker
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Handling incoming command: PARTITIONED_METADATA_RESPONSE
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Received partition-metadata response from server. req_id: 1
DEBUG:foo:PartitionMetadataLookup response for persistent://public/default/sometopic, lookup-broker-url 
DEBUG:foo:BatchAcknowledgementTracker for [persistent://public/default/sometopic, somesub, 0] Constructed BatchAcknowledgementTracker
DEBUG:foo:Created negative ack tracker with delay: 60000 ms - Timer interval: 00:00:20
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Getting connection from pool
DEBUG:foo:Got connection from pool for pulsar://localhost:6650/ use_count: 5 @ 0x7f8126855200
DEBUG:foo:ACK grouping is enabled, grouping time 100ms, grouping max size 1000
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Handling incoming command: LOOKUP_RESPONSE
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2 -- broker-url: pulsar://localhost:6650 -- broker-tls-url:  authoritative: 1 redirect: 1
DEBUG:foo:Lookup response for persistent://public/default/sometopic, lookup-broker-url pulsar://localhost:6650
DEBUG:foo:Getting connection to broker: pulsar://localhost:6650
INFO:foo:[<none> -> pulsar://localhost:6650/] Create ClientConnection, timeout=10000
INFO:foo:Created connection for pulsar://localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolving localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to 127.0.0.1:6650...
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Connected to broker through proxy. Logical broker: pulsar://localhost:6650
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Connected to broker: [127.0.0.1:63752 -> 127.0.0.1:6650] 
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Received success response from server. req_id: 0
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Created consumer on broker [127.0.0.1:63752 -> 127.0.0.1:6650] 
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send initial flow permits: 1000
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send more permits: 1000
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closing consumer for topic persistent://public/default/sometopic
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Received success response from server. req_id: 1
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closed consumer 0
async returning bar
2021-08-31 22:47:43.457 DEBUG [0x110d24e00] ConsumerImpl:106 | [persistent://public/default/sometopic, somesub, 0] ~ConsumerImpl
2021-08-31 22:47:43.457 DEBUG [0x110d24e00] AckGroupingTrackerEnabled:100 | Reference to the HandlerBase is not valid.
DEBUG:foo:Ignoring timer cancelled event, code[system:89]
returned: bar
INFO:foo:Closing Pulsar client with 0 producers and 1 consumers
DEBUG:foo:Shutting down producers and consumers for client
DEBUG:foo:0 producers and 1 consumers have been shutdown.
ERROR:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Read failed: Operation canceled
INFO:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Connection closed
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650]  Ignoring timer cancelled event, code[system:89]
INFO:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Connection closed
ERROR:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Read failed: Operation canceled
DEBUG:foo:ConnectionPool is closed
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650]  Ignoring timer cancelled event, code[system:89]
DEBUG:foo:ioExecutorProvider_ is closed
DEBUG:foo:listenerExecutorProvider_ is closed
DEBUG:foo:partitionListenerExecutorProvider_ is closed

And here is the output of your scratch.py:

async returning bar
returned: bar
2021-08-31 22:48:45.714 ERROR [0x700005787000] ClientConnection:572 | [127.0.0.1:63915 -> 127.0.0.1:6650] Read failed: Operation canceled

BewareMyPower · 2021-08-31T15:04:51Z

I'll try to build the Python client in an Ubuntu-based Docker image so that the code can be verified in a general environment. It might take some time.

zbentley · 2021-08-31T16:37:04Z

> How did you build the C++ library?

I didn't; pip install pulsar-client does not by default invoke the compiler. Instead, it downloads wheels of precompiled artifacts that were submitted by project maintainers.

However, the usual means of overriding this (pip install pulsar-client --no-binary :all:) does not work for me:

 ∴ pip install pulsar-client --no-binary :all:
ERROR: Could not find a version that satisfies the requirement pulsar-client (from versions: none)
ERROR: No matching distribution found for pulsar-client

I think this means that source artifacts are not published to PyPi, which may be a separate issue. Do you think I should open one?

Anyway, I installed protoc and googletest via Homebrew and ran your build instructions (I didn't have to set -DCMAKE_PREFIX_PATH, but otherwise I did everything the same).

Both tests passed on your branch! I feel very stupid for not fixing up my pathing to use my client compiled against your branch; very sorry for wasting your time with the back-and-forth.

BewareMyPower · 2021-08-31T16:47:50Z

> I think this means that source artifacts are not published to PyPi, which may be a separate issue. Do you think I should open one?

You can open an issue or send an email about this. I'm not familiar with how the Python client is published, but someone else might know.

> very sorry for wasting your time with the back-and-forth.

Never mind, glad to hear it works for you :)

…#11840)

Fixes #11823

### Motivation

When the Python logger is customized with underlying `LoggerWrapper` objects, sometimes `async` Python functions may return an incorrect value like `None`. It's because there's a bug (or feature?) of Boost-python that `py::call_method` will fail in C++ object's destructor. See boostorg/python#374 for details.

For the code example in #11823, it's because in `ConsumerImpl`'s destructor, the logger for `AckGroupingTrackerEnabled` will be created again because the logger is thread local and will be created in new threads. In this case, `py::call_method` in `LoggerWrapper#_updateCurrentPythonLogLevel` will fail, and `PyErr_Print` will be called and the error indicator will be cleared, which leads to the result that `async` functions' result became `None`.

### Modifications

- Reduce unnecessary `Logger.getEffectiveLevel` calls to get Python log level, just get the log level when the logger factory is initialized and pass the same level to all loggers.
- Remove the `PyErr_Print` calls in `LoggerWrapper` related code. In the cases when `py::call_method` failed, use the fallback logger to print logs.
- Add a dependent test for custom logger test because once the `LoggerFactory` was set all other tests would be affected.

### Verifying this change

- [x] Make sure that the change passes the CI checks.

This change added test `CustomLoggingTest`. Since `asyncio` module was introduced from Python 3.3 while CI is based on Python 2.7, this test cannot be tested by CI unless Python3 based CI was added.
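The two main modifications in this commit message (read the log level once when the factory is created, and fall back to another sink when the call into the Python logger fails) can be illustrated with a Python analogue. This is not the C++ code from the PR: `CachedLevelLoggerFactory`, `WrappedLogger`, and the `fallback` parameter are hypothetical names invented for the sketch.

```python
import logging
import sys


class WrappedLogger:
    """Hypothetical per-component logger that reuses a level cached by the factory."""

    def __init__(self, py_logger, level, name, fallback):
        self._py_logger = py_logger
        self._level = level
        self._name = name
        self._fallback = fallback

    def is_enabled(self, level):
        # No call into the user's logger is needed to answer this.
        return level >= self._level

    def log(self, level, message):
        if not self.is_enabled(level):
            return
        try:
            self._py_logger.log(level, "%s: %s", self._name, message)
        except Exception:
            # The user's logger failed; write to the fallback sink instead.
            print(f"{self._name}: {message}", file=self._fallback)


class CachedLevelLoggerFactory:
    """Hypothetical factory that queries the effective level exactly once."""

    def __init__(self, py_logger, fallback=None):
        self._py_logger = py_logger
        self._fallback = fallback if fallback is not None else sys.stderr
        self._level = py_logger.getEffectiveLevel()  # read once, at construction

    def get_logger(self, name):
        return WrappedLogger(self._py_logger, self._level, name, self._fallback)
```

One consequence of caching, at least in this sketch, is that a level change made on the Python logger after the factory is created is not seen by loggers the factory hands out.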

zbentley · 2021-08-31T17:06:14Z

Thanks again for the fix!

What's the Pulsar policy on cutting new client releases? Should your bugfix cause publication of updated Python client packages to PyPi, or should that wait until the next main Pulsar release?

Either one is fine, I just want to know whether I should publish a hand-built version of the Python client to my organization's internal package mirror or wait for PyPi to have it.

BewareMyPower · 2021-09-01T02:50:42Z

It should only be published to PyPi for stable versions. For master branch, you need to build from source.

If the related PR has already been cherry-picked to the branch of the latest stable version (currently branch-2.8), the wheel file will be included in StreamNative's weekly release, like https://github.com/streamnative/pulsar/releases/tag/v2.8.0.15.

zbentley · 2021-09-01T22:34:58Z

I did not know about the weekly releases, thanks!

…#11840)

Fixes #11823

When the Python logger is customized with underlying `LoggerWrapper` objects, sometimes `async` Python functions may return an incorrect value like `None`. It's because there's a bug (or feature?) of Boost-python that `py::call_method` will fail in C++ object's destructor. See boostorg/python#374 for details.

For the code example in #11823, it's because in `ConsumerImpl`'s destructor, the logger for `AckGroupingTrackerEnabled` will be created again because the logger is thread local and will be created in new threads. In this case, `py::call_method` in `LoggerWrapper#_updateCurrentPythonLogLevel` will fail, and `PyErr_Print` will be called and the error indicator will be cleared, which leads to the result that `async` functions' result became `None`.

- Reduce unnecessary `Logger.getEffectiveLevel` calls to get Python log level, just get the log level when the logger factory is initialized and pass the same level to all loggers.
- Remove the `PyErr_Print` calls in `LoggerWrapper` related code. In the cases when `py::call_method` failed, use the fallback logger to print logs.
- Add a dependent test for custom logger test because once the `LoggerFactory` was set all other tests would be affected.

- [x] Make sure that the change passes the CI checks.

This change added test `CustomLoggingTest`. Since `asyncio` module was introduced from Python 3.3 while CI is based on Python 2.7, this test cannot be tested by CI unless Python3 based CI was added.

(cherry picked from commit 9153e71)

zbentley added the type/bug The PR fixed a bug or issue reported a bug label Aug 27, 2021

sijie mentioned this issue Aug 28, 2021

ISSUE-11823: Using the Python pulsar client with a logger can cause arbitrary/unrelated async Python functions to return None streamnative/pulsar-archived#2974

Closed

BewareMyPower self-assigned this Aug 30, 2021

BewareMyPower mentioned this issue Aug 30, 2021

[Python] Handle py::call_method error without mutating internal state #11840

Merged

1 task

BewareMyPower closed this as completed in #11840 Aug 31, 2021

zbentley mentioned this issue Sep 2, 2021

If feasible, publish source releases of the Python pulsar-client package #11899

Closed

sijie mentioned this issue Sep 2, 2021

ISSUE-11899: If feasible, publish source releases of the Python pulsar-client package streamnative/pulsar-archived#3006

Open
