BDK Crashes occasionally, stack trace attached. #256

dky · 2022-02-10T14:15:44Z

Hey guys, we regularly receive exceptions when running the bot. We are using the bot framework 2.0.1. I've just bumped us up to 2.1.0, but I'm curious whether anyone has insight into what this stack trace indicates.

Traceback (most recent call last):
  File "/home/dky/.pyenv/versions/3.9.6/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dky/.pyenv/versions/3.9.6/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dky/git/bot/src/__main__.py", line 131, in <module>
    asyncio.run(run())
  File "/home/dky/.pyenv/versions/3.9.6/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/dky/.pyenv/versions/3.9.6/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/dky/git/bot/src/__main__.py", line 119, in run
    await datafeed_loop.start()
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/service/datafeed/abstract_datafeed_loop.py", line 99, in start
    await self._run_loop()
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/service/datafeed/abstract_datafeed_loop.py", line 149, in _run_loop
    await self._run_loop_iteration()
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/service/datafeed/datafeed_loop_v1.py", line 54, in _run_loop_iteration
    events = await self._read_datafeed()
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/retry/_asyncio.py", line 118, in async_wrapped
    return await fn(*args, **kwargs)
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/retry/_asyncio.py", line 80, in __call__
    do = await self.iter(retry_state=retry_state)
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/retry/_asyncio.py", line 41, in iter
    should_retry = await self.retry(retry_state=retry_state)
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/retry/strategy.py", line 119, in read_datafeed_retry
    raise exception
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/retry/_asyncio.py", line 83, in __call__
    result = await fn(*args, **kwargs)
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/service/datafeed/datafeed_loop_v1.py", line 63, in _read_datafeed
    events = await self._datafeed_api.v4_datafeed_id_read_get(id=self._datafeed_id,
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/core/client/trace_id.py", line 46, in add_x_trace_id_header
    return await func(*args, **kwargs)
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/gen/api_client.py", line 195, in __call_api
    response_data = await self.request(
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/gen/rest.py", line 190, in GET
    return await self.request("GET", url,
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/symphony/bdk/gen/rest.py", line 165, in request
    r = await self.pool_manager.request(**args)
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 913, in start
    self._continue = None
  File "/home/dky/git/bot/env/lib/python3.9/site-packages/aiohttp/helpers.py", line 718, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

dky · 2022-02-11T14:15:16Z

This looks like an issue with our Symphony pod going offline or having connection issues (we observed the same blip today). Is there any way to add error handling so that the bot reconnects instead of sitting on a TimeoutError?
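As an application-side stopgap for the question above, one option is to restart the loop whenever the timeout escapes it. The following is only a sketch under assumptions: `run_with_restart` and its parameters are hypothetical names, `start` stands in for something like `datafeed_loop.start` from the trace, and whether the BDK loop object can safely be started a second time is not confirmed anywhere in this thread.

```python
import asyncio
import logging


async def run_with_restart(start, *, restart_delay=5.0, max_restarts=None):
    """Await start() and call it again whenever asyncio.TimeoutError escapes.

    `start` is any zero-argument coroutine function (hypothetical stand-in
    for a datafeed loop's start method). Returns the number of restarts
    performed once start() finishes without raising.
    """
    restarts = 0
    while True:
        try:
            await start()
            return restarts  # start() returned normally: stop supervising
        except asyncio.TimeoutError:
            if max_restarts is not None and restarts >= max_restarts:
                raise  # give up and surface the error to the caller
            restarts += 1
            logging.warning("Timed out, restart #%d in %ss", restarts, restart_delay)
            await asyncio.sleep(restart_delay)
```

In a bot's entry point this would replace the bare `await datafeed_loop.start()` with something like `await run_with_restart(datafeed_loop.start)`, again assuming a restart on the same object is valid.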

symphony-youri · 2022-02-14T14:41:52Z

Hi @dky

Could you confirm that the bot is indeed stopping and not recovering from the error? Normally we should have a retry policy in place and the datafeed loop should not stop.

dky · 2022-02-15T03:20:50Z

@symphony-youri Yes, it hangs indefinitely and our users get super frustrated. The only way to recover is to Ctrl-C out of the loop and re-run the bot. I'm open to providing anything you need from my end to see why the retry is failing. This happened to us just this Friday when our pod got rebooted or something.

@Retry

It looks like the existing logic to retry on timeouts or client errors was not working as expected. For instance, the DF (datafeed) loop would exit on timeouts. The aiohttp client can raise different errors (https://docs.aiohttp.org/en/stable/client_reference.html#client-exceptions) as well as asyncio.TimeoutError. I tested locally by misconfiguring the hostname (so it cannot be resolved) and the port (which triggers a timeout). However, this means that retries will be performed upon startup even if, for instance, the hostname cannot be resolved. This is not 100% like the Java BDK, which has different logic for authentication and datafeed retries. The problem is how sessions are refreshed: here it is done lazily, so we have two nested @Retry calls, which makes it difficult to have different strategies. Fixes finos#256
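The idea of retrying only on transient client and timeout errors can be illustrated with a small standalone sketch. This is not the BDK's actual retry code: `retry_async`, `RETRYABLE`, and `make_flaky_read` are hypothetical names, and the backoff values are arbitrary. The sketch uses only the standard library; with aiohttp, one would likely add `aiohttp.ClientError` to the retryable tuple.

```python
import asyncio

# Exceptions treated as transient in this sketch. asyncio.TimeoutError is the
# one from the stack trace; aiohttp.ClientError could be added when aiohttp
# is the HTTP client (assumption, not taken from the BDK source).
RETRYABLE = (asyncio.TimeoutError, ConnectionError)


async def retry_async(fn, *, retryable=RETRYABLE, max_attempts=5, base_delay=0.01):
    """Call fn() until it succeeds, retrying only on `retryable` errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_read(failures):
    """Hypothetical stand-in for a datafeed read that times out `failures` times."""
    state = {"calls": 0}

    async def read():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise asyncio.TimeoutError
        return ["event"]

    return read, state
```

Any exception outside `retryable` (a `ValueError`, say) propagates on the first attempt, which is how one strategy can retry on timeouts while still failing fast on errors it does not consider transient.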

symphony-youri · 2022-02-15T14:41:13Z

It looks like the retry logic is just wrong and not catching the proper errors. I opened #260 to address that.

dky · 2022-02-15T15:28:13Z

Thanks! Hope to see the fix released soon, and thanks for the help!

symphony-youri · 2022-02-15T16:32:09Z

We will have to release a 2.2.1 with this change. It should be there shortly, hopefully by the end of the week.

@Retry

It looks like the existing logic to retry on timeouts or client errors was not working as expected. For instance, the DF (datafeed) loop would exit on timeouts. The aiohttp client can raise different errors (https://docs.aiohttp.org/en/stable/client_reference.html#client-exceptions) as well as asyncio.TimeoutError. I tested locally by misconfiguring the hostname (so it cannot be resolved) and the port (which triggers a timeout). However, this means that retries will be performed upon startup even if, for instance, the hostname cannot be resolved. This is not 100% like the Java BDK, which has different logic for authentication and datafeed retries. The problem is how sessions are refreshed: here it is done lazily, so we have two nested @Retry calls, which makes it difficult to have different strategies. Fixes finos#256 (cherry picked from commit 5112b21)

symphony-youri mentioned this issue Feb 15, 2022

Retry on client/timeout errors #260

Merged

4 tasks

symphony-youri closed this as completed in #260 Feb 15, 2022

symphony-youri added this to the 2.2.1 milestone Feb 15, 2022
