Crash when using remote execution on a large element #1961

@abderrahim

While trying to build a large element (after raising the limit on the number of open files to work around #1842), I got another crash:

    An unhandled exception occured:
    
    Traceback (most recent call last):
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/_scheduler/jobs/job.py", line 346, in child_action
        result = self.child_process()  # pylint: disable=assignment-from-no-return
                 ^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/_scheduler/jobs/elementjob.py", line 81, in child_process
        return self._action_cb(self._element)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/_scheduler/queues/buildqueue.py", line 55, in _assemble_element
        element._assemble()
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/element.py", line 1687, in _assemble
        collect = self.assemble(sandbox)  # pylint: disable=assignment-from-no-return
                  ^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/buildelement.py", line 315, in assemble
        with sandbox.batch(root_read_only=True, label="Running commands"):
      File "/usr/lib/python3.11/contextlib.py", line 144, in __exit__
        next(self.gen)
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/sandbox/sandbox.py", line 265, in batch
        batch.execute()
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/sandbox/_sandboxreapi.py", line 234, in execute
        self.sandbox._run_with_flags(
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/sandbox/sandbox.py", line 374, in _run_with_flags
        return self._run(command, flags=flags, cwd=cwd, env=env)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/sandbox/_sandboxreapi.py", line 101, in _run
        action_result = self._execute_action(action, flags)  # pylint: disable=assignment-from-no-return
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/sandbox/_sandboxremote.py", line 302, in _execute_action
        cascache.pull_tree(casremote, tree_digest)
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/_cas/cascache.py", line 251, in pull_tree
        digest = self._fetch_tree(remote, digest)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/_cas/cascache.py", line 617, in _fetch_tree
        dirdigests = self.add_objects(buffers=dirbuffers)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/buildstream/_cas/cascache.py", line 357, in add_objects
        response = local_cas.CaptureFiles(request)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/grpc/_channel.py", line 1181, in __call__
        return _end_unary_response_blocking(state, call, False, None)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/abderrahimkitouni/.local/pipx/venvs/buildstream/lib/python3.11/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
        raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    	status = StatusCode.RESOURCE_EXHAUSTED
    	details = "CLIENT: Received message larger than max (7309916 vs. 4194304)"
    	debug_error_string = "UNKNOWN:Error received from peer unix:/tmp/buildstreample8mkmu/cas/casserver-l6ur9wu9.sock {grpc_message:"CLIENT: Received message larger than max (7309916 vs. 4194304)", grpc_status:8, created_time:"2024-10-09T14:46:23.774345481+01:00"}"
    >

This is again a problem with the current implementation of CasCache.add_objects().

I see two ways to tackle this (and #1842):

  • Use BatchUpdateBlobs to send the data to buildbox-casd, composing batches of appropriate size. This is feasible since all the data is already in memory.
  • Use root_directory_digest instead of tree_digest from the output directories of an action result (we may need to set output_directory_format, and servers are not required to support it).
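
To illustrate the first option, here is a minimal sketch of how in-memory buffers could be grouped into batches that stay under the gRPC message limit. It is not BuildStream code: `batch_buffers`, `PER_BLOB_OVERHEAD` and the safety margin are hypothetical names and values chosen for illustration, and the actual per-blob overhead of a BatchUpdateBlobs request would need to be measured.

```python
from typing import Iterable, Iterator, List

# gRPC message limit reported in the traceback above (4194304 bytes).
GRPC_MAX_MESSAGE_SIZE = 4 * 1024 * 1024

# Hypothetical allowance for the digest and protobuf framing of each blob.
PER_BLOB_OVERHEAD = 128

# Hypothetical safety margin left free in every message.
SAFETY_MARGIN = 64 * 1024


def batch_buffers(
    buffers: Iterable[bytes],
    max_batch_size: int = GRPC_MAX_MESSAGE_SIZE - SAFETY_MARGIN,
    per_blob_overhead: int = PER_BLOB_OVERHEAD,
) -> Iterator[List[bytes]]:
    """Yield lists of buffers whose estimated total size fits in one message."""
    batch: List[bytes] = []
    batch_size = 0

    for buf in buffers:
        cost = len(buf) + per_blob_overhead

        # A single blob that cannot fit in any batch has to be sent
        # some other way (for example, streamed), so report it.
        if cost > max_batch_size:
            raise ValueError(
                "blob of {} bytes does not fit in a batch".format(len(buf))
            )

        # Start a new batch when adding this blob would exceed the limit.
        if batch and batch_size + cost > max_batch_size:
            yield batch
            batch = []
            batch_size = 0

        batch.append(buf)
        batch_size += cost

    # Flush the final, partially filled batch.
    if batch:
        yield batch
```

Each yielded batch would then become one request, so that neither the request nor the response grows with the total number of directories in the tree. Blobs that are individually too large are rejected here; handling them is left out of the sketch.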
