
TimeoutError from asyncio.wait_for is not handled gracefully and crashes Crawlee #1602

@lestarcdog

Description


Issue: A TimeoutError raised by Playwright is handled gracefully, but a TimeoutError raised by asyncio.wait_for is not — it crashes the Crawlee engine and stops the crawl.
Expectation: All TimeoutErrors should be handled gracefully.

Crawlee version: 1.1.1
Python version: 3.14

Stacktrace:

[crawlee.crawlers._playwright._playwright_crawler] ERROR An exception occurred during handling of failed request. This places the crawler and its underlying storages into an unknown state and crawling will be terminated.
      Traceback (most recent call last):
        File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_context_pipeline.py", line 114, in __call__
          await final_context_consumer(cast('TCrawlingContext', crawling_context))
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\router.py", line 98, in __call__
          return await self._default_handler(context)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\projects\xxx\src\xxx\crawl\shopee\timeout_repro.py", line 13, in default_handler
          await asyncio.wait_for(never_future, timeout=5)
        File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 506, in wait_for
          async with timeouts.timeout(timeout):
                     ~~~~~~~~~~~~~~~~^^^^^^^^^
        File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\timeouts.py", line 116, in __aexit__
          raise TimeoutError from exc_val
      TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1415, in __run_task_function
          await self._run_request_handler(context=context)
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1510, in _run_request_handler
          await wait_for(
          ...<5 lines>...
          )
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\_utils\wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_context_pipeline.py", line 120, in __call__
          raise RequestHandlerError(e, crawling_context) from e
      crawlee.errors.RequestHandlerError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1158, in _handle_request_error
          await wait_for(
          ...<5 lines>...
          )
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\_utils\wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1128, in _handle_request_retries
          f'{get_one_line_error_summary_if_possible(error)}'
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
        File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_logging_utils.py", line 52, in get_one_line_error_summary_if_possible
          most_relevant_part = ',' + reduce_asyncio_timeout_error_to_relevant_traceback_parts(error)[-1]
                                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
      IndexError: list index out of range
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 507, in wait_for
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_context_pipeline.py", line 114, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\router.py", line 98, in __call__
    return await self._default_handler(context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\xxx\src\xxx\crawl\shopee\timeout_repro.py", line 13, in default_handler
    await asyncio.wait_for(never_future, timeout=5)
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 506, in wait_for
    async with timeouts.timeout(timeout):
               ~~~~~~~~~~~~~~~~^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1415, in __run_task_function
    await self._run_request_handler(context=context)
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1510, in _run_request_handler
    await wait_for(
    ...<5 lines>...
    )
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\_utils\wait.py", line 37, in wait_for
    return await asyncio.wait_for(operation(), timeout.total_seconds())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 507, in wait_for
    return await fut
           ^^^^^^^^^
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_context_pipeline.py", line 120, in __call__
    raise RequestHandlerError(e, crawling_context) from e
crawlee.errors.RequestHandlerError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\projects\xxx\src\xxx\crawl\shopee\timeout_repro.py", line 20, in <module>
    asyncio.run(main())
    ~~~~~~~~~~~^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "C:\projects\xxx\src\xxx\crawl\shopee\timeout_repro.py", line 16, in main
    await crawler.run(["https://browserleaks.com/"])
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 719, in run
    await run_task
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 774, in _run_crawler
    await self._autoscaled_pool.run()
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\_autoscaling\autoscaled_pool.py", line 126, in run
    await run.result
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\_autoscaling\autoscaled_pool.py", line 277, in _worker_task
    await asyncio.wait_for(
    ...<2 lines>...
    )
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 507, in wait_for
    return await fut
           ^^^^^^^^^
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1449, in __run_task_function
    await self._handle_request_error(primary_error.crawling_context, primary_error.wrapped_exception)
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1158, in _handle_request_error
    await wait_for(
    ...<5 lines>...
    )
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\_utils\wait.py", line 37, in wait_for
    return await asyncio.wait_for(operation(), timeout.total_seconds())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python313\Lib\asyncio\tasks.py", line 507, in wait_for
    return await fut
           ^^^^^^^^^
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_basic_crawler.py", line 1128, in _handle_request_retries
    f'{get_one_line_error_summary_if_possible(error)}'
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "C:\projects\xxx\.venv\Lib\site-packages\crawlee\crawlers\_basic\_logging_utils.py", line 52, in get_one_line_error_summary_if_possible
    most_relevant_part = ',' + reduce_asyncio_timeout_error_to_relevant_traceback_parts(error)[-1]
                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
IndexError: list index out of range
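The crash itself is secondary to the timeout: in `_logging_utils.py` line 52, `reduce_asyncio_timeout_error_to_relevant_traceback_parts(error)` evidently returns an empty list for this nested TimeoutError, so indexing `[-1]` raises IndexError inside the error-handling path, which terminates the crawler. A defensive guard along these lines (a hypothetical sketch with a stand-in function name, not the actual Crawlee code) would avoid the secondary crash:

```python
def summarize_relevant_parts(parts: list[str]) -> str:
    """Return a one-line summary, or '' when no relevant parts were found."""
    # Guard against the empty-list case seen in this traceback:
    # parts[-1] on [] raises IndexError, so fall back to an empty summary.
    if not parts:
        return ''
    return ',' + parts[-1]


print(summarize_relevant_parts([]))
print(summarize_relevant_parts(['  File "x.py", line 13']))
```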

Repro code:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def default_handler(ctx: PlaywrightCrawlingContext):
        ctx.log.info("Request %s", ctx.request.url)
        never_future = asyncio.get_running_loop().create_future()
        await asyncio.wait_for(never_future, timeout=5)
        ctx.log.info("Never finished %s", ctx.request.url)

    await crawler.run(["https://browserleaks.com/"])


if __name__ == '__main__':
    asyncio.run(main())

Labels: t-tooling