Add new methods to manage instance server (reinstall, backup create, info, download, delete) #279

Open
aliel wants to merge 4 commits into main from aliel-add-instance-backup-restore-operations

aliel (Member) commented Feb 21, 2026

This PR adds new methods to perform new actions on an instance:

  • reinstall_instance to reinstall an existing instance to its initial state
  • create_backup to create a backup of an instance and return a pre-signed download link
  • get_backup to get backup info if it exists and return a pre-signed download link
  • restore_from_file to restore an instance from a provided rootfs file
  • restore_from_volume to restore an instance from a pinned volume

related to:
aleph-im/aleph-vm#874

github-actions commented:

Failed to retrieve llama text: POST 503: 503 Service Unavailable. No server is available to handle this request.

aliel force-pushed the aliel-add-instance-backup-restore-operations branch from e4a8c4c to a8b79ad on February 23, 2026 17:14
- Add VmOperation enum to replace raw operation strings
- Fix file descriptor leak in restore_from_file (use context manager)
- Remove duplicated auth-header boilerplate (already handled in _generate_header)
- Route delete_backup and restore_from_volume through perform_operation
- Validate backup_id to prevent path traversal
- Accept Union[str, Path] for rootfs_path
- Remove sync mode from create_backup (server now defaults to async)
- Remove manual Content-Type header (aiohttp sets it via json param)
- Add tests for all new methods
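
The backup_id validation mentioned in the commit message could look roughly like the sketch below. The function name, the allowed character set, and the length limit are assumptions made for illustration; the PR's actual rule is not shown on this page.

```python
import re

# Assumption: backup IDs are short tokens made of letters, digits, ".", "_" and "-".
# The real SDK may accept a different alphabet or length.
_BACKUP_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,127}")


def validate_backup_id(backup_id: str) -> str:
    """Return backup_id unchanged if it is safe to embed in a URL path segment."""
    # Reject path separators (not in the allowed alphabet) and any ".." sequence,
    # so the ID cannot climb out of the backup endpoint.
    if not _BACKUP_ID_RE.fullmatch(backup_id) or ".." in backup_id:
        raise ValueError(f"Invalid backup_id: {backup_id!r}")
    return backup_id
```

Validating with an allow-list rather than stripping dangerous characters keeps the check easy to reason about: anything outside the expected shape is refused before a URL is built.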

foxpatch-aleph left a comment

The PR is well-structured and correct. The VmOperation enum is a clean improvement, the auth-header deduplication is a good simplification, backup_id validation prevents path traversal, and test coverage is solid for the happy paths. A few minor issues: restore_from_file opens the file with synchronous I/O in an async function (problematic for large rootfs images), some tests use regex URL matchers but don't assert that the expected query params were actually sent, and the get_backup name is ambiguous given it returns info rather than file content.

src/aleph/sdk/client/vm_client.py (line 278): Using synchronous open() (and aiohttp reading the file object synchronously) will block the event loop for large rootfs images. Consider either wrapping the upload in asyncio.get_event_loop().run_in_executor(None, ...) or using aiofiles to open and stream the file asynchronously:

import aiofiles

async with aiofiles.open(rootfs_path, "rb") as f:
    data = aiohttp.FormData()
    data.add_field("rootfs", f, filename=rootfs_path.name, content_type="application/octet-stream")
    async with self.session.post(url, headers=header, data=data) as response:
        ...

For rootfs images (which can easily be several GB) this matters.
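
As a standard-library alternative to aiofiles, the blocking open() and read() calls can be pushed to worker threads and exposed as an async generator. This is only a sketch: iter_file_chunks and its chunk size are hypothetical names and values, not part of the SDK, and it requires Python 3.9+ for asyncio.to_thread.

```python
import asyncio
from pathlib import Path
from typing import AsyncIterator, Union


async def iter_file_chunks(
    path: Union[str, Path], chunk_size: int = 1024 * 1024
) -> AsyncIterator[bytes]:
    """Yield a file's contents in chunks without blocking the event loop."""
    # open(), read() and close() are blocking calls, so each runs in a worker thread.
    f = await asyncio.to_thread(open, path, "rb")
    try:
        while True:
            chunk = await asyncio.to_thread(f.read, chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            yield chunk
    finally:
        await asyncio.to_thread(f.close)
```

aiohttp accepts an async generator as a request body (data=...), so a generator like this can feed a streaming upload. Whether it fits the multipart form used by restore_from_file would need to be checked against the actual endpoint.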

src/aleph/sdk/client/vm_client.py (line 291): restore_from_file catches aiohttp.ClientError but not OSError (e.g. FileNotFoundError). An OSError from the open() call will propagate to the caller without any logging. Either document that file-not-found raises, or catch and log it consistently with the ClientError branch.

src/aleph/sdk/client/vm_client.py (line 241): get_backup is ambiguous — based on the PR description it returns backup info (metadata + pre-signed download link), not a backup file. A name like get_backup_info would be more explicit and prevent confusion with restore_from_file.

tests/unit/test_vm_client.py (line 334): test_reinstall_instance_keep_volumes uses a regex matcher to accept any URL under .../reinstall but doesn't assert that erase_volumes=false was actually included in the query string. The test would pass even if the param was silently dropped. Consider capturing the real request and checking request.url.query (or using the requests dict recorded by aioresponses to inspect the call arguments).

tests/unit/test_vm_client.py (line 378): Same issue as test_reinstall_instance_keep_volumes: test_create_backup_with_options doesn't verify that include_volumes=true and skip_fsfreeze=true were actually sent as query params. The regex match only confirms the right endpoint was called.

tests/unit/test_vm_client.py (line 498): Missing negative-path test for restore_from_file: no test covers what happens when rootfs_path does not exist. Since OSError is not caught (see comment above), this would raise an unhandled exception — worth documenting and testing explicitly.

foxpatch-aleph left a comment

The PR is well-structured: the VmOperation enum cleanly replaces raw strings, the backup_id path-traversal guard is correct, and the refactor to centralise auth-header generation in _generate_header removes duplication. Three quality concerns are worth addressing before merge: blocking synchronous file I/O inside an async function, an inconsistent error-handling surface in restore_from_file, and tests that don't actually verify the query parameters they exist to cover.

src/aleph/sdk/client/vm_client.py (line 278): Blocking I/O in async context. open() is synchronous and will stall the event loop for the entire read time. For a rootfs image (which can be several GB) this will freeze all other async tasks. Consider wrapping the upload in asyncio.get_event_loop().run_in_executor(None, ...) or using aiofiles for the file handle. Alternatively, stream the file through an async generator (the older aiohttp.streamer decorator is deprecated) so only small chunks are read at a time.

src/aleph/sdk/client/vm_client.py (line 291): Inconsistent error surface. aiohttp.ClientError is caught and returned as (None, str(e)), but OSError / FileNotFoundError (file doesn't exist, permission denied, etc.) propagates as a raw exception. Callers currently have to handle two completely different error shapes depending on the failure mode. Either catch OSError in the same block and return (None, str(e)), or let both bubble as exceptions and remove the catch entirely — but pick one convention.
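
The first option (one error shape for every failure mode) might be sketched as follows. Everything here is illustrative: restore_from_file_sketch, the upload callable, and UploadError (a stand-in for aiohttp.ClientError so the example needs no third-party package) are hypothetical, and the file is read in one piece only to keep the sketch short.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Optional, Tuple, Union


class UploadError(Exception):
    """Stand-in for aiohttp.ClientError in this sketch."""


async def restore_from_file_sketch(
    rootfs_path: Union[str, Path],
    upload: Callable[[bytes], Awaitable[int]],
) -> Tuple[Optional[int], str]:
    """Return (status, message); every failure is reported as (None, reason)."""
    try:
        # Read in a worker thread so the event loop is not blocked.
        # A real implementation would stream instead of loading the whole image.
        payload = await asyncio.to_thread(Path(rootfs_path).read_bytes)
        status = await upload(payload)
    except (OSError, UploadError) as e:
        # File-system errors and transport errors share one return shape.
        return None, str(e)
    return status, "ok"
```

With this convention a caller only has to check whether the first element is None, regardless of whether the file was missing or the request failed.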

src/aleph/sdk/client/vm_client.py (line 221): Implicit reliance on server default for erase_volumes=True. When erase_volumes=True (the default), no query parameter is sent at all; the server's own default is assumed to be true. If the server default ever changes, this SDK method silently changes behaviour. Being explicit (always sending erase_volumes=true or erase_volumes=false) would be more robust:

params = {"erase_volumes": "true" if erase_volumes else "false"}

tests/unit/test_vm_client.py (line 335): Test doesn't verify the query parameter is actually sent. re.compile(rf"http://localhost/control/machine/{vm_id}/reinstall") matches any URL with that prefix, including one with no query string at all. The test would pass even if the erase_volumes=false param were dropped. Assert against the recorded request instead:

req = list(m.requests.values())[0][0]
assert req.kwargs["params"] == {"erase_volumes": "false"}

The same applies to test_create_backup_with_options — it uses a regex match but never checks that include_volumes=true&skip_fsfreeze=true appear in the request.
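
To illustrate why a prefix regex is not enough, the sketch below parses the query string explicitly with the standard library. The query_params helper, the vm123 ID, and the /backup path are assumptions for the example, not taken from the test file.

```python
import re
from urllib.parse import parse_qs, urlsplit


def query_params(url: str) -> dict:
    """Return a URL's query parameters as a flat dict (last value wins)."""
    return {k: v[-1] for k, v in parse_qs(urlsplit(url).query).items()}


# Hypothetical endpoint; the real path may differ.
pattern = re.compile(r"http://localhost/control/machine/vm123/backup")
with_params = (
    "http://localhost/control/machine/vm123/backup"
    "?include_volumes=true&skip_fsfreeze=true"
)
without_params = "http://localhost/control/machine/vm123/backup"

# Both URLs satisfy the regex, so the regex alone proves nothing about the params.
assert pattern.match(with_params) and pattern.match(without_params)

# Comparing the parsed query distinguishes the two requests.
assert query_params(with_params) == {
    "include_volumes": "true",
    "skip_fsfreeze": "true",
}
assert query_params(without_params) == {}
```

The same comparison can be applied to whatever URL or params mapping the mocking library records for the call.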

src/aleph/sdk/client/vm_client.py (line 241): Naming nit: get_backup is ambiguous. The PR description says it returns a pre-signed download link, but the name reads like it fetches the backup payload itself. get_backup_info or backup_info would be clearer about what is returned.

src/aleph/sdk/client/vm_client.py (line 318): manage_instance still accepts List[str] while all other call-sites now use VmOperation. This is pre-existing code touched by the refactor, so not a blocker, but it is the one remaining place that bypasses the enum and could accept an invalid operation string silently.
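
One way manage_instance could keep accepting strings while still rejecting invalid ones is to coerce each entry to the enum up front. This is a sketch under assumptions: the VmOperation members listed here and the normalize_operations helper are hypothetical, since the PR's actual enum values are not shown on this page.

```python
from enum import Enum
from typing import Iterable, List, Union


class VmOperation(str, Enum):
    # Hypothetical members; the enum in the PR may define different values.
    STOP = "stop"
    REBOOT = "reboot"
    REINSTALL = "reinstall"


def normalize_operations(
    operations: Iterable[Union[str, VmOperation]],
) -> List[VmOperation]:
    """Convert raw strings to VmOperation, failing loudly on unknown values."""
    normalized = []
    for op in operations:
        try:
            # VmOperation(...) accepts both a member and its string value.
            normalized.append(VmOperation(op))
        except ValueError:
            allowed = ", ".join(o.value for o in VmOperation)
            raise ValueError(
                f"Unknown operation {op!r}; expected one of: {allowed}"
            ) from None
    return normalized
```

Because the enum subclasses str, existing callers that pass plain strings keep working, while a typo such as "rebot" raises immediately instead of reaching the server.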

3 participants