Using the Python pulsar client with a logger can cause arbitrary/unrelated async Python functions to return None #11823

Closed
zbentley opened this issue Aug 27, 2021 · 23 comments · Fixed by #11840

@zbentley
Contributor

If a logger object (any logger object) is supplied to pulsar.Client objects in Python, its presence can cause arbitrary async Python functions to return None incorrectly.

Steps to reproduce

To reproduce, run the following:

import asyncio
import logging

from pulsar import Client


async def async_func(client, rv):
    consumer = client.subscribe('sometopic', 'somesub')
    consumer.close()
    print("async returning", rv)
    return rv


if __name__ == '__main__':
    client = Client(
        service_url='pulsar://localhost:6650/',
        logger=logging.getLogger("foo")
    )
    print("returned:", asyncio.run(async_func(client, "bar")))

Test setup

Python: 3.9
OS: MacOS 10.11
Arch: x86_64
Pulsar broker: 2.8.0 standalone, running locally
Client: 2.8.0.post0

Pathology/root cause

This appears to be due to a Boost-python issue (and, in my opinion, a pretty bad one) I found while trying to track down this behavior: boostorg/python#374

Unless there is a way to fix that, it looks like any invocation of pulsar-client's Python logger by the C++ code can, in the right circumstances, corrupt the Python interpreter state and cause the calling async function to return None incorrectly.

@zbentley zbentley added the type/bug The PR fixed a bug or issue reported a bug label Aug 27, 2021
@zbentley
Contributor Author

Since this is a reasonably subtle and silent bug when it occurs, I suggest removing client support for a user-supplied logger.

That breaks backwards compatibility, and is not to be undertaken lightly, but the risk otherwise seems pretty significant: the bug surface is the intersection of "a custom logger is in use", "any async python function is running in my program, anywhere" and "any pulsar C++ object is destructed anywhere in the program". That's not good.

@zbentley
Contributor Author

Actually, after thinking about it a bit, this may be possible for the Pulsar client to mitigate (though it still seems like a Boost problem fundamentally).

Since PyErr_Print mutates interpreter state (clears the global exception bit), calling it in the C++ python bindings is making this bug worse. Without that call, any logger interactions during C++ object destruction would still fail, and some Python internal interpreter errors would be emitted, but things would still generally work.

However, with the call to PyErr_Print, other state gets corrupted and return values get messed up.
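To illustrate why clearing the error indicator changes the return value: CPython delivers a coroutine's return value to its caller as the `value` of a `StopIteration` exception. The sketch below drives a coroutine by hand to show this. It involves no Pulsar code, and `drive` is a hypothetical helper introduced only for the illustration.

```python
async def async_func(rv):
    return rv


def drive(coro):
    """Step a coroutine once by hand and return the value it hands back."""
    try:
        coro.send(None)
    except StopIteration as stop:
        # `return rv` inside the coroutine surfaces here as StopIteration(rv).
        return stop.value
    # The coroutine suspended instead of returning; clean up and report it.
    coro.close()
    raise RuntimeError("coroutine suspended instead of returning")


print(drive(async_func("bar")))  # prints: bar
```

If something prints and clears that pending `StopIteration` before the caller consumes it (which is what `PyErr_Print` does), the value it carried is lost, which is consistent with the `StopIteration: bar` line followed by `returned: None` seen in this issue.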

@BewareMyPower
Contributor

@lbenc135 Could you help take a look?

@BewareMyPower
Contributor

I cannot reproduce this bug in my local env (macOS Big Sur 11.2, Boost-python 1.74). Here's my output:

async returning bar
returned: bar

I also noticed PyErr_Print is only called when exceptions are caught. Does removing the call of PyErr_Print really work?


BTW, I found #10981 might fix this issue because the Python client I used was compiled from latest master. Could you also try it?

@BewareMyPower
Contributor

Strangely, I can reproduce this bug now. The output is:

async returning bar
StopIteration: bar
returned: None
2021-08-28 19:00:00.107 ERROR [0x700004d1c000] ClientConnection:581 | [127.0.0.1:56560 -> 127.0.0.1:6650] Read failed: Operation canceled

@BewareMyPower
Contributor

I think a temporary solution is to avoid C++ object's destructor in async functions, like

consumer = None

async def async_func(client, rv):
    global consumer # use the global variable
    consumer = client.subscribe('sometopic', 'somesub')
    consumer.close()
    print("async returning", rv)
    return rv

@zbentley
Contributor Author

Does removing the call of PyErr_Print really work?

I believe so, though that only solves part of the issue. Without the call to PyErr_Print, return values aren't altered. However, the logger won't work in those situations (destruction after an async return) even without the call to PyErr_Print, as boost::python::call fails before invoking the requested function in that context.

a temporary solution is to avoid C++ object's destructor in async functions

That does indeed work. However, because the Pulsar client is heavily asynchronous, I don't think that workaround is practical. Consider a big Python program with lots of async code that instantiates a Pulsar client globally. Any time that client logs for any reason (not just destructors), there's a chance that the logging action happens while the interpreter is returning from an async function, in which case this bug will occur, even if the async functions running have nothing to do with Pulsar in any way. The case with a destructor is just the most reliable way to encounter this bug, not the only way.

I'm less sure about this, but I think that chance might not be as small as it sounds; it is possible that if the event loop is blocked by something (anything), pending futures whose results have not yet been consumed stay in the StopIteration-exception-pending state until a routine comes along to check on them. If that is the case, then the "coincidence" window where this bug could occur is much wider.

@BewareMyPower
Contributor

Yeah, I confirmed in my local env that removing PyErr_Print works. That's because when the LoggerWrapper is constructed in a destructor, py::call_method for logger.getEffectiveLevel fails and PyErr_Print is triggered. I'll push a PR soon.

@BewareMyPower BewareMyPower self-assigned this Aug 30, 2021
@zbentley
Copy link
Contributor Author

Confirmed that #10981 does not resolve this issue. That's because the root cause is pretty broad (any time C++ code calls back into Python during a C++ destructor, not just during global finalization, the call fails--and worse, if you then call the standard "print why this failed" python utility function, it corrupts the interpreter frame).

@BewareMyPower
Contributor

Yeah, I just opened PR #11840 to fix this issue, PTAL. Since the test cannot be run in the current Python 2 based CI, you can verify it in your local env.

@zbentley
Contributor Author

@BewareMyPower testing now.

@zbentley
Contributor Author

Unfortunately I am able to reproduce the bug on your branch:

∴ python tests/benchmark/scratch.py
async returning bar
StopIteration: bar
returned: None

@zbentley
Contributor Author

Additionally, in the process of debugging this issue, I found a couple more issues related to the logger argument in Python.

@BewareMyPower
Contributor

I found the second issue as well. We can take a look at these two issues later.

Regarding this issue, could you upload your scratch.py so that I can reproduce it? In my local env, custom_logger_test.py and the code in this PR work well. Here's my code with debug-level logging.

import faulthandler
import asyncio
import logging

from pulsar import Client

def test():
    client = Client(
        service_url='pulsar://localhost:6650/',
        logger=logging.getLogger("foo")
    )

    async def async_func(rv):
        consumer = client.subscribe('sometopic', 'somesub')
        consumer.close()
        print("async returning", rv)
        return rv

    print("returned:", asyncio.run(async_func("bar")))
    client.close()


if __name__ == '__main__':
    faulthandler.enable()
    logging.basicConfig(encoding='utf-8', level=logging.DEBUG)
    test()

When I ran it, the output was:

async returning bar
2021-08-31 10:10:38.706 DEBUG [0x10afd1e00] ConsumerImpl:106 | [persistent://public/default/sometopic, somesub, 0] ~ConsumerImpl
2021-08-31 10:10:38.706 DEBUG [0x10afd1e00] AckGroupingTrackerEnabled:100 | Reference to the HandlerBase is not valid.
DEBUG:foo:Ignoring timer cancelled event, code[system:89]
returned: bar

You can see that two lines use the default logger.

In your output, we can still see StopIteration: bar that should be printed by PyErr_Print. I didn't remove all PyErr_Print calls because I think they won't happen in some destructors. If you can provide code to reproduce, I can debug deeper for the cause.

@zbentley
Contributor Author

The contents of my scratch.py were the snippet in the description of the bug.

When I run your code exactly as written, I get this output:

INFO:foo:Subscribing on Topic :sometopic
INFO:foo:Created connection for pulsar://localhost:6650/
INFO:foo:[127.0.0.1:61188 -> 127.0.0.1:6650] Connected to broker
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Getting connection from pool
INFO:foo:Created connection for pulsar://localhost:6650
INFO:foo:[127.0.0.1:61189 -> 127.0.0.1:6650] Connected to broker through proxy. Logical broker: pulsar://localhost:6650
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Created consumer on broker [127.0.0.1:61189 -> 127.0.0.1:6650]
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closing consumer for topic persistent://public/default/sometopic
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closed consumer 0
async returning bar
StopIteration: bar
returned: None
INFO:foo:Closing Pulsar client
INFO:foo:[127.0.0.1:61189 -> 127.0.0.1:6650] Connection closed
INFO:foo:[127.0.0.1:61188 -> 127.0.0.1:6650] Connection closed

...after which the code hangs (per the other bug linked).

I'm using python 3.9.6 installed via brew, in a clean virtual environment; the only commands I've issued other than venv creation are pip install pulsar-client and pip install fastavro.

@zbentley
Contributor Author

On Python 3.7.10, the bug still occurs, but I get a lot more output, which may be useful:

DEBUG:foo:Using Binary Lookup
DEBUG:asyncio:Using selector: KqueueSelector
INFO:foo:Subscribing on Topic :sometopic
INFO:foo:Created connection for pulsar://localhost:6650/
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Connected to broker
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Handling incoming command: PARTITIONED_METADATA_RESPONSE
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Received partition-metadata response from server. req_id: 1
DEBUG:foo:PartitionMetadataLookup response for persistent://public/default/sometopic, lookup-broker-url
DEBUG:foo:BatchAcknowledgementTracker for [persistent://public/default/sometopic, somesub, 0] Constructed BatchAcknowledgementTracker
DEBUG:foo:Created negative ack tracker with delay: 60000 ms - Timer interval: 00:00:20
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Getting connection from pool
DEBUG:foo:Got connection from pool for pulsar://localhost:6650/ use_count: 4 @ 0x7fdffc820200
DEBUG:foo:ACK grouping is enabled, grouping time 100ms, grouping max size 1000
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Handling incoming command: LOOKUP_RESPONSE
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2
DEBUG:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2 -- broker-url: pulsar://localhost:6650 -- broker-tls-url:  authoritative: 1 redirect: 1
DEBUG:foo:Lookup response for persistent://public/default/sometopic, lookup-broker-url pulsar://localhost:6650
DEBUG:foo:Getting connection to broker: pulsar://localhost:6650
INFO:foo:Created connection for pulsar://localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Connected to broker through proxy. Logical broker: pulsar://localhost:6650
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Connected to broker: [127.0.0.1:61247 -> 127.0.0.1:6650]
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Received success response from server. req_id: 0
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Created consumer on broker [127.0.0.1:61247 -> 127.0.0.1:6650]
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send initial flow permits: 1000
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send more permits: 1000
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closing consumer for topic persistent://public/default/sometopic
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Received success response from server. req_id: 1
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closed consumer 0
async returning bar
StopIteration: bar

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zac.bentley/.pyenv/versions/3.7.10/lib/python3.7/logging/__init__.py", line 1365, in debug
    if self.isEnabledFor(DEBUG):
SystemError: PyEval_EvalFrameEx returned a result with an error set
DEBUG:foo:Reference to the HandlerBase is not valid.
DEBUG:foo:Ignoring timer cancelled event, code[system:89]
returned: None
INFO:foo:Closing Pulsar client
INFO:foo:[127.0.0.1:61247 -> 127.0.0.1:6650] Connection closed
INFO:foo:[127.0.0.1:61246 -> 127.0.0.1:6650] Connection closed

@BewareMyPower
Contributor

after which the code hangs (per the other bug linked).

The output looks like the case you mentioned in the other issue, but my code calls client.close(), so it should not happen.

I'm using python 3.9.6 installed via brew, in a clean virtual environment; the only commands I've issued other than venv creation are pip install pulsar-client and pip install fastavro.

How did you build the C++ library? I suspect your C++ library was not built correctly. I didn't use pip install; instead, I built from source using CMake in the pulsar-client-cpp directory.

mkdir -p _builds
# NOTE: I installed the gtest 1.10.0 and protobuf 3.17.3 dependency into ~/software directory
# The Boost dependency was installed by brew, including Boost::python.
SOFTWARE="$HOME/software"
cd _builds
cmake .. -Wno-dev \
    -DPROTOC_PATH=$SOFTWARE/protobuf-3.17.3/bin/protoc \
    -DCMAKE_PREFIX_PATH="$SOFTWARE/gtest-1.10.0;$SOFTWARE/protobuf-3.17.3" \
    -DBUILD_PYTHON_WRAPPER=ON -DBUILD_TESTS=ON -DBUILD_PERF_TOOLS=OFF
make -j4

After compilation completes, copy _pulsar.so from _builds/python to the python directory. I then put my Python scripts in the python directory and ran them from there, so that the Python interpreter finds the local _pulsar.so instead of the one installed by pip.
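
One way to check which copy of the extension the interpreter would pick up is to ask the import system where it resolves `_pulsar`. This is a minimal sketch using only the standard library; `module_origin` is a hypothetical helper name, and it prints `None` if no `_pulsar` module is importable from the current directory and `sys.path`.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Run this from the same directory as the test script, since the script's
# directory takes precedence over site-packages on sys.path.
print("_pulsar would load from:", module_origin("_pulsar"))
```

If the printed path points into site-packages rather than the local python directory, the pip-installed wheel is still the one being tested.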

image

NOTE: the picture above used INFO level instead of DEBUG level of my previous code.


BTW, here is the full output of my earlier code:

DEBUG:foo:Using Binary Lookup
DEBUG:asyncio:Using selector: KqueueSelector
INFO:foo:Subscribing on Topic :sometopic
INFO:foo:[<none> -> pulsar://localhost:6650/] Create ClientConnection, timeout=10000
INFO:foo:Created connection for pulsar://localhost:6650/
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolving localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to 127.0.0.1:6650...
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Connected to broker
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Handling incoming command: PARTITIONED_METADATA_RESPONSE
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Received partition-metadata response from server. req_id: 1
DEBUG:foo:PartitionMetadataLookup response for persistent://public/default/sometopic, lookup-broker-url 
DEBUG:foo:BatchAcknowledgementTracker for [persistent://public/default/sometopic, somesub, 0] Constructed BatchAcknowledgementTracker
DEBUG:foo:Created negative ack tracker with delay: 60000 ms - Timer interval: 00:00:20
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Getting connection from pool
DEBUG:foo:Got connection from pool for pulsar://localhost:6650/ use_count: 5 @ 0x7f8126855200
DEBUG:foo:ACK grouping is enabled, grouping time 100ms, grouping max size 1000
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Handling incoming command: LOOKUP_RESPONSE
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Received lookup response from server. req_id: 2 -- broker-url: pulsar://localhost:6650 -- broker-tls-url:  authoritative: 1 redirect: 1
DEBUG:foo:Lookup response for persistent://public/default/sometopic, lookup-broker-url pulsar://localhost:6650
DEBUG:foo:Getting connection to broker: pulsar://localhost:6650
INFO:foo:[<none> -> pulsar://localhost:6650/] Create ClientConnection, timeout=10000
INFO:foo:Created connection for pulsar://localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolving localhost:6650
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Connecting to 127.0.0.1:6650...
DEBUG:foo:[<none> -> pulsar://localhost:6650/] Resolved hostname localhost to 127.0.0.1:6650
INFO:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Connected to broker through proxy. Logical broker: pulsar://localhost:6650
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Handling incoming command: CONNECTED
DEBUG:foo:Connection has max message size setting: 5242880
DEBUG:foo:Current max message size is: 5242880
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Connected to broker: [127.0.0.1:63752 -> 127.0.0.1:6650] 
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Received success response from server. req_id: 0
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Created consumer on broker [127.0.0.1:63752 -> 127.0.0.1:6650] 
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send initial flow permits: 1000
DEBUG:foo:[persistent://public/default/sometopic, somesub, 0] Send more permits: 1000
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closing consumer for topic persistent://public/default/sometopic
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Handling incoming command: SUCCESS
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Received success response from server. req_id: 1
INFO:foo:[persistent://public/default/sometopic, somesub, 0] Closed consumer 0
async returning bar
2021-08-31 22:47:43.457 DEBUG [0x110d24e00] ConsumerImpl:106 | [persistent://public/default/sometopic, somesub, 0] ~ConsumerImpl
2021-08-31 22:47:43.457 DEBUG [0x110d24e00] AckGroupingTrackerEnabled:100 | Reference to the HandlerBase is not valid.
DEBUG:foo:Ignoring timer cancelled event, code[system:89]
returned: bar
INFO:foo:Closing Pulsar client with 0 producers and 1 consumers
DEBUG:foo:Shutting down producers and consumers for client
DEBUG:foo:0 producers and 1 consumers have been shutdown.
ERROR:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Read failed: Operation canceled
INFO:foo:[127.0.0.1:63752 -> 127.0.0.1:6650] Connection closed
DEBUG:foo:[127.0.0.1:63752 -> 127.0.0.1:6650]  Ignoring timer cancelled event, code[system:89]
INFO:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Connection closed
ERROR:foo:[127.0.0.1:63751 -> 127.0.0.1:6650] Read failed: Operation canceled
DEBUG:foo:ConnectionPool is closed
DEBUG:foo:[127.0.0.1:63751 -> 127.0.0.1:6650]  Ignoring timer cancelled event, code[system:89]
DEBUG:foo:ioExecutorProvider_ is closed
DEBUG:foo:listenerExecutorProvider_ is closed
DEBUG:foo:partitionListenerExecutorProvider_ is closed

And here is the output of your scratch.py:

async returning bar
returned: bar
2021-08-31 22:48:45.714 ERROR [0x700005787000] ClientConnection:572 | [127.0.0.1:63915 -> 127.0.0.1:6650] Read failed: Operation canceled

image

@BewareMyPower
Contributor

I'll try to build the Python client in an Ubuntu-based Docker image so that the code can be verified in a more general environment. It might take some time.

@zbentley
Contributor Author

How did you build the C++ library?

I didn't; pip install pulsar-client does not by default invoke the compiler. Instead, it downloads wheels of precompiled artifacts that were submitted by project maintainers.

However, the usual means of overriding this (pip install pulsar-client --no-binary :all:) does not work for me:

 ∴ pip install pulsar-client --no-binary :all:
ERROR: Could not find a version that satisfies the requirement pulsar-client (from versions: none)
ERROR: No matching distribution found for pulsar-client

I think this means that source artifacts are not published to PyPi, which may be a separate issue. Do you think I should open one?

Anyway, I installed protoc and googletest via Homebrew and ran your build instructions (I didn't have to set DCMAKE_PREFIX_PATH, but otherwise I did everything else the same).

Both tests passed on your branch! I feel very stupid for not fixing up my pathing to use my client compiled against your branch; very sorry for wasting your time with the back-and-forth.

@BewareMyPower
Contributor

I think this means that source artifacts are not published to PyPi, which may be a separate issue. Do you think I should open one?

You can open an issue or send an email about this. I'm not familiar with how the Python client is published, but someone else might know.

very sorry for wasting your time with the back-and-forth.

Never mind, glad to hear it works for you :)

BewareMyPower added a commit that referenced this issue Aug 31, 2021
…#11840)

Fixes #11823

### Motivation

When the Python logger is customized with underlying `LoggerWrapper` objects, `async` Python functions may sometimes return an incorrect value such as `None`. This is because of a bug (or feature?) in Boost-python: `py::call_method` fails when called from a C++ object's destructor. See boostorg/python#374 for details.

For the code example in #11823, the logger for `AckGroupingTrackerEnabled` is created again in `ConsumerImpl`'s destructor, because the logger is thread-local and is created anew in each new thread. In this case, `py::call_method` in `LoggerWrapper#_updateCurrentPythonLogLevel` fails, `PyErr_Print` is called, and the error indicator is cleared, which causes the `async` function's result to become `None`.

### Modifications

- Reduce unnecessary `Logger.getEffectiveLevel` calls: get the Python log level once, when the logger factory is initialized, and pass the same level to all loggers.
- Remove the `PyErr_Print` calls in `LoggerWrapper`-related code. When `py::call_method` fails, use the fallback logger to print logs.
- Add a separate test for the custom logger, because once the `LoggerFactory` is set, all other tests would be affected.

### Verifying this change

- [x] Make sure that the change passes the CI checks.

This change added the test `CustomLoggingTest`. Since the `asyncio` module was added to the standard library in Python 3.4 while CI is based on Python 2.7, this test cannot be run by CI unless Python 3 based CI is added.
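
The design described under Modifications can be sketched in Python as follows. This is only an illustration of the pattern, not the actual C++ change: `LoggerFactory`, `LoggerWrapper`, `get_logger`, and the `fallback` callable are hypothetical names used here for the sketch. The level is read from the Python logger once, and any failure while calling into the Python logger is routed to a fallback instead of being printed and cleared.

```python
import logging


class LoggerWrapper:
    """Forwards messages to a Python logger, using a fallback if that call fails."""

    def __init__(self, py_logger, level, fallback):
        self._py_logger = py_logger
        self._level = level        # cached level, never re-queried
        self._fallback = fallback  # callable(level, message)

    def log(self, level, message):
        if level < self._level:
            return
        try:
            self._py_logger.log(level, message)
        except Exception:
            # Do not print or clear interpreter error state; just fall back.
            self._fallback(level, message)


class LoggerFactory:
    """Reads the effective level once and hands the same level to every wrapper."""

    def __init__(self, py_logger, fallback):
        self._py_logger = py_logger
        self._fallback = fallback
        self._level = py_logger.getEffectiveLevel()

    def get_logger(self):
        return LoggerWrapper(self._py_logger, self._level, self._fallback)
```

A consequence of caching, in this sketch, is that later changes to the Python logger's level are not seen by wrappers that already exist.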
@zbentley
Contributor Author

Thanks again for the fix!

What's the Pulsar policy on cutting new client releases? Should your bugfix cause publication of updated Python client packages to PyPi, or should that wait until the next main Pulsar release?

Either one is fine, I just want to know whether I should publish a hand-built version of the Python client to my organization's internal package mirror or wait for PyPi to have it.

@BewareMyPower
Contributor

It should only be published to PyPi for stable versions. For master branch, you need to build from source.

If the related PR has already been cherry-picked to the branch of the latest stable version (currently branch-2.8), the wheel file will be included in StreamNative's weekly release, like https://github.com/streamnative/pulsar/releases/tag/v2.8.0.15.

@zbentley
Contributor Author

zbentley commented Sep 1, 2021

I did not know about the weekly releases, thanks!

codelipenghui pushed a commit that referenced this issue Sep 9, 2021
…#11840)

(cherry picked from commit 9153e71)
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this issue Mar 18, 2022
…apache#11840)
