New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using the Python pulsar client with a logger can cause arbitrary/unrelated async
Python functions to return None
#11823
Comments
Since this is a reasonably subtle and silent bug when it occurs, I suggest removing client support for a user-supplied That breaks backwards compatibility, and is not to be undertaken lightly, but the risk otherwise seems pretty significant: the bug surface is the intersection of "a custom logger is in use", "any |
Actually, after thinking about it a bit, this may be possible for the Pulsar client to mitigate (though it still seems like a Boost problem fundamentally). Since However, with the call to |
@lbenc135 Could you help take a look? |
I cannot reproduce this bug in my local env (macOS Big Sur 11.2, Boost-python 1.74). Here's my output:
I also noticed BTW, I found #10981 might fix this issue because the Python client I used was compiled from latest master. Could you also try it? |
It's weird that somehow I can reproduce this bug now, the output is
|
I think a temporary solution is to avoid C++ object's destructor in consumer = None
async def async_func(client, rv):
global consumer # use the global variable
consumer = client.subscribe('sometopic', 'somesub')
consumer.close()
print("async returning", rv)
return rv |
I believe so, though that only solves part of the issue. Without the call to PyErr_Print, return values aren't altered. However, the logger won't work in those situations (destruction after an async return) even without the call to PyErr_Print, as boost::python::call fails before invoking the requested function in that context.
That does indeed work. However, because the Pulsar client is heavily asynchronous, I don't think that workaround is practical. Consider a big python prorgram with lots of async code that instantiates a Pulsar client globally. Any time that client logs for any reason (not just destructors), there's a chance that the logging action happens while the interpreter is returning from an async function, in which case this bug will occur--even if the async functions running have nothing to do with Pulsar in any way. The case with a destructor is just the most reliable way to encounter this bug, not the only way. I'm less sure about this, but I think that chance might not be as small as it sounds; it is possible that if the event loop is blocked by something (anything), pending futures whose results have not yet been consumed stay in the StopIteration-exception-pending state until a routine comes along to check on them. If that is the case, then the "coincidence" window where this bug could occur is much wider. |
Yeah, I tried in my local env that removing |
Confirmed that #10981 does not resolve this issue. That's because the root cause is pretty broad (any time C++ code calls back into Python during a C++ destructor, not just during global finalization, the call fails--and worse, if you then call the standard "print why this failed" python utility function, it corrupts the interpreter frame). |
Yeah, I just opened a PR #11840 to fix this issue, PTAL. Since the test cannot be verified in current Python2 based CI, you can verify it in your local env. |
@BewareMyPower testing now. |
Unfortunately I am able to reproduce the bug on your branch:
|
Additionally, in the process of debugging this issue, I found a couple more issues related to the
|
I found the second issue as well. We can take a look at these two issues later. Regarding to this issue, could you upload you code import faulthandler
import asyncio
import logging
from pulsar import Client
def test():
client = Client(
service_url='pulsar://localhost:6650/',
logger=logging.getLogger("foo")
)
async def async_func(rv):
consumer = client.subscribe('sometopic', 'somesub')
consumer.close()
print("async returning", rv)
return rv
print("returned:", asyncio.run(async_func("bar")))
client.close()
if __name__ == '__main__':
faulthandler.enable()
logging.basicConfig(encoding='utf-8', level=logging.DEBUG)
test() When I ran it, the output was like
You can see there're two lines that use the default logger. In your output, we can still see |
The contents of my When I run your code exactly as written, I get this output:
...after which the code hangs (per the other bug linked). I'm using python 3.9.6 installed via |
On Python 3.7.10, the bug still occurs, but I get a lot more output, which may be useful:
|
The output looks like to be the case that you mentioned in another issue but my code has called
How did you build the C++ library? I suspected your C++ library was not built well. I didn't use mkdir -p _builds
# NOTE: I installed the gtest 1.10.0 and protobuf 3.17.3 dependency into ~/software directory
# The Boost dependency was installed by brew, including Boost::python.
SOFTWARE="$HOME/software"
cd _builds
cmake .. -Wno-dev \
-DPROTOC_PATH=$SOFTWARE/protobuf-3.17.3/bin/protoc \
-DCMAKE_PREFIX_PATH="$SOFTWARE/gtest-1.10.0;$SOFTWARE/protobuf-3.17.3" \
-DBUILD_PYTHON_WRAPPER=ON -DBUILD_TESTS=ON -DBUILD_PERF_TOOLS=OFF
make -j4 After compilation completed, copy the NOTE: the picture above used INFO level instead of DEBUG level of my previous code. BTW, here is the full output with my code before:
And here is the output of your
|
I'll try to build Python client in a Ubuntu based docker image so that the code can be verified in a general environment. It might take some time. |
I didn't; However, the usual means of overriding this (
I think this means that source artifacts are not published to PyPi, which may be a separate issue. Do you think I should open one? Anyway, I installed protoc and googletest via Homebrew and ran your build instructions (I didn't have to set Both tests passed on your branch! I feel very stupid for not fixing up my pathing to use my client compiled against your branch; very sorry for wasting your time with the back-and-forth. |
You can open an issue or send an email for this issue. I'm not familiar with how Python client was published but someone else might know.
Never mind, glad to hear it works for you :) |
…#11840) Fixes #11823 ### Motivation When the Python logger is customized with underlying `LoggerWrapper` objects, sometimes `async` Python functions may return an incorrect value like `None`. It's because there's a bug (or feature?) of Boost-python that `py::call_method` will fail in C++ object's destructor. See boostorg/python#374 for details. For the code example in #11823, it's because in `ConsumerImpl`'s destructor, the logger for `AckGroupingTrackerEnabled` will be created again because the logger is thread local and will be created in new threads. In this case, `py::call_method` in `LoggerWrapper#_updateCurrentPythonLogLevel` will fail, and `PyErr_Print` will be called and the error indicator will be cleared, which leads to the result that `async` functions' result became `None`. ### Modifications - Reduce unnecessary `Logger.getEffectiveLevel` calls to get Python log level , just get the log level when the logger factory is initialized and pass the same level to all loggers. - Remove the `PyErr_Print` calls in `LoggerWrapper` related code. In the cases when `py::call_method` failed, use the fallback logger to print logs. - Add a dependent test for custom logger test because once the `LoggerFactory` was set all other tests would be affected. ### Verifying this change - [x] Make sure that the change passes the CI checks. This change added test `CustomLoggingTest`. Since `asyncio` module was introduced from Python 3.3 while CI is based on Python 2.7, this test cannot be tested by CI unless Python3 based CI was added.
Thanks again for the fix! What's the Pulsar policy on cutting new client releases? Should your bugfix cause publication of updated Python client packages to PyPi, or should that wait until the next main Pulsar release? Either one is fine, I just want to know whether I should publish a hand-built version of the Python client to my organization's internal package mirror or wait for PyPi to have it. |
It should only be published to PyPi for stable versions. For master branch, you need to build from source. If the related PR was already cherry-picked to branch of last stable version (branch-2.8 currently), the wheel file would be included in StreamNative's weekly release, like https://github.com/streamnative/pulsar/releases/tag/v2.8.0.15. |
I did not know about the weekly releases, thanks! |
…#11840) Fixes #11823 When the Python logger is customized with underlying `LoggerWrapper` objects, sometimes `async` Python functions may return an incorrect value like `None`. It's because there's a bug (or feature?) of Boost-python that `py::call_method` will fail in C++ object's destructor. See boostorg/python#374 for details. For the code example in #11823, it's because in `ConsumerImpl`'s destructor, the logger for `AckGroupingTrackerEnabled` will be created again because the logger is thread local and will be created in new threads. In this case, `py::call_method` in `LoggerWrapper#_updateCurrentPythonLogLevel` will fail, and `PyErr_Print` will be called and the error indicator will be cleared, which leads to the result that `async` functions' result became `None`. - Reduce unnecessary `Logger.getEffectiveLevel` calls to get Python log level , just get the log level when the logger factory is initialized and pass the same level to all loggers. - Remove the `PyErr_Print` calls in `LoggerWrapper` related code. In the cases when `py::call_method` failed, use the fallback logger to print logs. - Add a dependent test for custom logger test because once the `LoggerFactory` was set all other tests would be affected. - [x] Make sure that the change passes the CI checks. This change added test `CustomLoggingTest`. Since `asyncio` module was introduced from Python 3.3 while CI is based on Python 2.7, this test cannot be tested by CI unless Python3 based CI was added. (cherry picked from commit 9153e71)
…apache#11840) Fixes apache#11823 ### Motivation When the Python logger is customized with underlying `LoggerWrapper` objects, sometimes `async` Python functions may return an incorrect value like `None`. It's because there's a bug (or feature?) of Boost-python that `py::call_method` will fail in C++ object's destructor. See boostorg/python#374 for details. For the code example in apache#11823, it's because in `ConsumerImpl`'s destructor, the logger for `AckGroupingTrackerEnabled` will be created again because the logger is thread local and will be created in new threads. In this case, `py::call_method` in `LoggerWrapper#_updateCurrentPythonLogLevel` will fail, and `PyErr_Print` will be called and the error indicator will be cleared, which leads to the result that `async` functions' result became `None`. ### Modifications - Reduce unnecessary `Logger.getEffectiveLevel` calls to get Python log level , just get the log level when the logger factory is initialized and pass the same level to all loggers. - Remove the `PyErr_Print` calls in `LoggerWrapper` related code. In the cases when `py::call_method` failed, use the fallback logger to print logs. - Add a dependent test for custom logger test because once the `LoggerFactory` was set all other tests would be affected. ### Verifying this change - [x] Make sure that the change passes the CI checks. This change added test `CustomLoggingTest`. Since `asyncio` module was introduced from Python 3.3 while CI is based on Python 2.7, this test cannot be tested by CI unless Python3 based CI was added.
If a
logger
object (any logger object) is supplied topulsar.Client
objects in Python, its presence can cause arbitraryasync
Python functions to returnNone
incorrectly.Steps to reproduce
To reproduce, run the following:
Test setup
Python: 3.9
OS: MacOS 10.11
Arch: x86_64
Pulsar broker: 2.8.0 standalone, running locally
Client:
2.8.0.post0
Pathology/root cause
This appears to be due to a Boost-python issue (and, in my opinion, a pretty bad one) I found while trying to track down this behavior: boostorg/python#374
Unless there is a way to fix that, it looks like any invocation of
pulsar-client
's Python logger by the C++ code can, in the right circumstances, corrupt the Python interpreter state and cause the callingasync
function to returnNone
incorrectly.The text was updated successfully, but these errors were encountered: