[Bug]: Python SDK sometimes crashes in streaming jobs running on 2.47.0+ SDK #27330

Closed
1 of 15 tasks
tvalentyn opened this issue Jun 30, 2023 · 7 comments
Assignees
Labels
bug done & done Issue has been reviewed after it was closed for verification, followups, etc. P1 python

Comments

@tvalentyn
Contributor

tvalentyn commented Jun 30, 2023

What happened?

We suspect that the upgrade to protobuf==4.x.x in the Beam SDK and worker containers (#24599) introduced a failure mode in Python streaming pipelines, in which the Python process sometimes crashes with AttributeError messages or segmentation faults and, in some cases, leaves the pipeline stuck. We expect this to be resolved in Beam 2.53.0.

Batch pipelines should not be affected.

Mitigations:

  • Use apache-beam==2.53.0 or above (once released), OR

  • Use apache-beam==2.46.0 or below, OR

  • Install protobuf 3.x in the submission and runtime environment. For example, you can use a --requirements_file pipeline option with a file that includes:

     protobuf==3.20.3
     grpcio-status==1.48.2
    

    OR

  • If you must use protobuf 4.x, use the pure-Python implementation of protobuf by setting the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python environment variable in the runtime environment. This might degrade performance, since the Python implementation is less efficient. For example, you could create a custom Beam SDK container from a Dockerfile that looks like the following:

     FROM apache/beam_python3.10_sdk:2.47.0
     ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
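
To check whether an environment falls into the combination that the mitigations above try to avoid, a helper along the following lines may be useful. This is a minimal sketch, not part of Beam: the function names (`parse_version`, `may_be_affected`, `describe_environment`) are hypothetical, the version bounds are simply the ones stated in this issue (2.47.0 up to, but not including, 2.53.0, with protobuf 4.x and a non-Python protobuf backend), and `api_implementation` is an internal protobuf module whose behavior could change between releases.

```python
from importlib import metadata


def parse_version(version):
    """Turn '2.47.0' or '2.47.0.dev' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric segment such as 'dev'
        parts.append(int(piece))
    return tuple(parts)


def may_be_affected(beam_version, protobuf_version, protobuf_impl, streaming=True):
    """Return True if the setup matches the range described in this issue.

    Assumption: the bounds below mirror the issue text and are not an
    authoritative compatibility matrix.
    """
    if not streaming:
        return False  # batch pipelines should not be affected
    beam = parse_version(beam_version)
    if beam < (2, 47, 0) or beam >= (2, 53, 0):
        return False  # outside the SDK range named in the mitigations
    if parse_version(protobuf_version)[:1] < (4,):
        return False  # protobuf 3.x is one of the listed mitigations
    # The pure-Python backend is the remaining listed mitigation.
    return protobuf_impl != "python"


def describe_environment():
    """Best-effort lookup of installed versions; entries are None if missing."""
    def installed(name):
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            return None

    try:
        # Internal protobuf module; Type() reports the active backend.
        from google.protobuf.internal import api_implementation
        impl = api_implementation.Type()
    except ImportError:
        impl = None
    return {
        "apache-beam": installed("apache-beam"),
        "protobuf": installed("protobuf"),
        "impl": impl,
    }
```

Running `describe_environment()` inside the worker container, rather than only on the machine that submits the job, is what confirms that the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION setting or the protobuf pin actually took effect at runtime.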
    

Example errors:

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1655, in _create_pardo_operation
    output_tags = list(transform_proto.outputs.keys())
AttributeError: 'tuple' object has no attribute 'keys'

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1314, in get_output_coders
    pcoll_id in transform_proto.outputs.items()
AttributeError: 'function' object has no attribute 'items'

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 819, in wrapper
    result = cache[args] = func(*args)
  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 979, in topological_height
    for pcoll in descriptor.transforms[transform_id].outputs.values()
AttributeError: 'function' object has no attribute 'values'

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 819, in wrapper
    result = cache[args] = func(*args)
Default Python SDK image for environment is apache/beam_python3.10_sdk:2.47.0.dev
  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 979, in topological_height
    for pcoll in descriptor.transforms[transform_id].outputs.values()
AttributeError: 'traceback' object has no attribute 'values'

Pipelines usually recover after the process crashes, but the crashes may cause delays or leave the pipeline stuck.
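
The example errors share a recognizable shape: an AttributeError raised when `keys`, `items`, or `values` is accessed on an object of an unrelated type. A rough way to triage exported worker logs for that shape is sketched below. The regular expression and the `count_signature_hits` helper are illustrative assumptions, not an official diagnostic, and how the log lines are obtained is left to the reader.

```python
import re
from collections import Counter

# Matches the shape seen in the example errors: a map-like attribute access
# (keys/items/values) failing on an object of an unexpected type.
SIGNATURE = re.compile(
    r"AttributeError: '(?P<type>\w+)' object has no attribute "
    r"'(?P<attr>keys|items|values)'"
)


def count_signature_hits(log_lines):
    """Count matching AttributeError lines, grouped by (type, attribute)."""
    hits = Counter()
    for line in log_lines:
        match = SIGNATURE.search(line)
        if match:
            hits[(match["type"], match["attr"])] += 1
    return hits


if __name__ == "__main__":
    sample = [
        "AttributeError: 'tuple' object has no attribute 'keys'",
        "AttributeError: 'function' object has no attribute 'values'",
        "result = cache[args] = func(*args)",
    ]
    for (type_name, attr), count in count_signature_hits(sample).items():
        print(f"{type_name}.{attr}: {count}")
```

A match is only a hint. Ordinary bugs in user code can raise the same kind of AttributeError, so hits are worth cross-checking against the stack frames in bundle_processor.py shown in the example errors.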

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn
Contributor Author

We suspect the issue affects only the Python 3.10 SDK. An immediate workaround may be to use Python 3.9 or Python 3.11. We are working on confirming the fix.

@victorrgez

We experienced the same issue in a loop for around 5 minutes on Python 3.11 in Dataflow, and then it resolved on its own:

 "Error processing instruction process_bundle-12173-17596. Original traceback is
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 295, in _execute
    response = task()
               ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 370, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 629, in do_instruction
    return getattr(self, request_type)(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 660, in process_bundle
    bundle_processor = self.bundle_processor_cache.get(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 491, in get
    processor = bundle_processor.BundleProcessor(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/bundle_processor.py", line 903, in __init__
    self.ops = self.create_execution_tree(self.process_bundle_descriptor)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/bundle_processor.py", line 985, in create_execution_tree
    get_operation(transform_id))) for transform_id in sorted(
                                                      ^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/bundle_processor.py", line 818, in wrapper
    result = cache[args] = func(*args)
                           ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/bundle_processor.py", line 976, in topological_height
    return 1 + max([0] + [
                         ^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in <listcomp>
    topological_height(consumer)
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/bundle_processor.py", line 818, in wrapper
    result = cache[args] = func(*args)
                           ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/bundle_processor.py", line 978, in topological_height
    for pcoll in descriptor.transforms[transform_id].outputs.values()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'values'

"

@tvalentyn
Contributor Author

tvalentyn commented Sep 7, 2023

@victorrgez Thank you for the feedback. The attribute error issues should be fixed once we upgrade to the upcoming release of the protobuf library (more details on the protobuf issues in #28246).

Python 3.10-specific crashes should be fixed with #28355.

@kennknowles
Member

Any update on this P1?

@tvalentyn
Contributor Author

We expect it to be resolved in 2.51.0.

@jrmccluskey jrmccluskey added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Oct 17, 2023
@tvalentyn tvalentyn reopened this Nov 20, 2023
@tvalentyn tvalentyn removed this from the 2.51.0 Release milestone Nov 20, 2023
@tvalentyn
Contributor Author

There is some evidence that this issue is not fully resolved, so I am reopening it and will take a closer look to see whether I can reproduce it reliably for further investigation.

@tvalentyn tvalentyn changed the title [Bug]: Python SDK crashes sometimes crashes in streaming jobs running on 2.47.0+ SDK with AttributeError messages [Bug]: Python SDK sometimes crashes in streaming jobs running on 2.47.0+ SDK with AttributeError messages Nov 22, 2023
@tvalentyn
Contributor Author

We expect this issue to be resolved in 2.53.0.

@tvalentyn tvalentyn added this to the 2.53.0 Release milestone Dec 19, 2023
@tvalentyn tvalentyn changed the title [Bug]: Python SDK sometimes crashes in streaming jobs running on 2.47.0+ SDK with AttributeError messages [Bug]: Python SDK sometimes crashes in streaming jobs running on 2.47.0+ SDK Dec 19, 2023

4 participants