Upgrade Dockerfiles to Python 3.13 and handlers 2.0.0b8 #433

Merged
kthare10 merged 20 commits into main from resource-calendar
May 20, 2026
kthare10 commented May 20, 2026

Summary

  • Upgrade all Dockerfiles from Python 3.12.12 to 3.13.13
  • Bump fabric-am-handlers to 2.0.0b8 (ansible 13.6.0 / ansible-core 2.20)
  • Pin openstack.cloud:<2.0.0 for compatibility with openstacksdk 0.61.0 on remote hosts
  • Install community.libvirt ansible collection in Dockerfile-auth
  • Bump fabric-cf to 2.0.0b2

Test plan

  • VM create/delete tested on UKY site
  • Build and verify all Docker images (auth, broker, orchestrator, cf)
  • End-to-end slice provisioning test

kthare10 added 20 commits April 9, 2026 13:33
Move do_relinquish() from except to finally block in both the
BlockedJoin close path and the probe_pending() Closing path so
the Broker is notified regardless of whether the Authority close
RPC succeeds or fails. do_relinquish() is idempotent so the
duplicate call in update_lease(CloseWait) is safe.
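
The effect of moving the call into `finally` can be sketched as follows. This is a minimal illustration, not the project's code: the `Reservation` class, `close_on_authority()`, and the call counter are hypothetical stand-ins, and only the `do_relinquish()` name comes from the commit message.

```python
class Reservation:
    """Hypothetical stand-in for a reservation being closed."""

    def __init__(self):
        self.relinquished = False
        self.relinquish_calls = 0

    def do_relinquish(self):
        # Idempotent: repeated calls leave the state unchanged.
        self.relinquish_calls += 1
        if self.relinquished:
            return
        self.relinquished = True


def close_on_authority(reservation, authority_close):
    """Attempt the Authority close RPC, then always relinquish to the Broker."""
    try:
        authority_close()
    except Exception as exc:
        # A call placed here would run only when the RPC fails.
        print(f"authority close failed: {exc}")
    finally:
        # Runs on both the success and the failure path.
        reservation.do_relinquish()


def failing_close():
    raise RuntimeError("authority unreachable")


ok = Reservation()
close_on_authority(ok, lambda: None)

failed = Reservation()
close_on_authority(failed, failing_close)
failed.do_relinquish()  # duplicate call, as a later code path might make
```

Because `do_relinquish()` is idempotent in this sketch, the second call on `failed` changes nothing, which mirrors the safety argument in the commit message.
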
occupied_node_capacity() now defaults to start=now, end=now when no time
range is specified. Previously, all Ticketed reservations (including future
advance reservations) were counted as currently occupied, causing
cores_allocated to exceed cores_capacity at sites like UCSD (657 > 640).
Calendar/scheduling callers that pass explicit start/end are unaffected.
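
A rough sketch of the windowing behavior described above, assuming a simple interval-overlap test. The `Ticket` dataclass, the `occupied_cores()` helper, and the core counts are hypothetical; the real `occupied_node_capacity()` signature is not shown in this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class Ticket:
    """Hypothetical Ticketed reservation with a lease window."""
    cores: int
    start: datetime
    end: datetime


def occupied_cores(tickets: Iterable[Ticket],
                   start: Optional[datetime] = None,
                   end: Optional[datetime] = None,
                   now: Optional[datetime] = None) -> int:
    """Sum cores of tickets whose lease overlaps [start, end]."""
    now = now or datetime.now(timezone.utc)
    # With no explicit range, only count what is occupied right now.
    start = start or now
    end = end or now
    return sum(t.cores for t in tickets if t.start <= end and t.end >= start)


now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
tickets = [
    Ticket(cores=8, start=now - timedelta(hours=1), end=now + timedelta(hours=1)),
    Ticket(cores=4, start=now + timedelta(days=1), end=now + timedelta(days=2)),
]

current = occupied_cores(tickets, now=now)
scheduled = occupied_cores(tickets, start=now, end=now + timedelta(days=3), now=now)
```

Here `current` counts only the active ticket, while `scheduled` also includes the future advance reservation because the caller passed an explicit range.
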
…cate broker queries

When PeriodicProcessor starts a refresh, concurrent user requests were
bypassing the cache and each sending expensive queries to the broker.
Now non-forced requests return the stale cached model while a refresh
is already in progress, eliminating redundant broker round-trips.
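
The stale-while-refreshing behavior might look roughly like this. The `ModelCache` class and its method names are illustrative assumptions, and locking is omitted for brevity; a real implementation shared across threads would need it.

```python
class ModelCache:
    """Hypothetical cache of a broker model; not the project's actual class."""

    def __init__(self, query_broker):
        self._query_broker = query_broker
        self._model = None
        self.refresh_in_progress = False

    def start_refresh(self):
        # Called by the periodic refresher before it queries the broker.
        self.refresh_in_progress = True

    def save(self, model):
        self._model = model
        self.refresh_in_progress = False

    def get(self, force=False):
        # Non-forced requests reuse the stale model while a refresh is running.
        if self._model is not None and self.refresh_in_progress and not force:
            return self._model
        if self._model is None or force:
            self.save(self._query_broker())
        return self._model


calls = []


def query_broker():
    calls.append(1)
    return f"broker-model-{len(calls)}"


cache = ModelCache(query_broker)
first = cache.get()                       # cold cache: one broker query
cache.start_refresh()                     # periodic refresh begins
during = [cache.get() for _ in range(3)]  # served from the stale model
cache.save("periodic-model")              # periodic refresh completes
after = cache.get()
```
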
Add timeout-based recovery to BqmWrapper.can_refresh(): if
refresh_in_progress has been True longer than refresh_interval_in_seconds,
treat it as a failed refresh and allow a new one. This prevents the cache
from freezing permanently when save() is never called due to an exception,
hung broker query, or thread issue.
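
One way to picture the timeout-based recovery is the sketch below. The `RefreshGate` class, the injected `clock`, and the `_refresh_started_at` field are assumptions made for illustration; only `can_refresh()`, `save()`, `refresh_in_progress`, and `refresh_interval_in_seconds` are names taken from the commit message.

```python
import time


class RefreshGate:
    """Hypothetical stand-in for the refresh bookkeeping described above."""

    def __init__(self, refresh_interval_in_seconds, clock=time.monotonic):
        self.refresh_interval_in_seconds = refresh_interval_in_seconds
        self._clock = clock
        self.refresh_in_progress = False
        self._refresh_started_at = None

    def start_refresh(self):
        self.refresh_in_progress = True
        self._refresh_started_at = self._clock()

    def save(self):
        # Normal completion path: clears the in-progress flag.
        self.refresh_in_progress = False

    def can_refresh(self):
        if not self.refresh_in_progress:
            return True
        elapsed = self._clock() - self._refresh_started_at
        if elapsed > self.refresh_interval_in_seconds:
            # The refresh has been running too long: treat it as failed.
            self.refresh_in_progress = False
            return True
        return False


fake_time = [0.0]
gate = RefreshGate(refresh_interval_in_seconds=60, clock=lambda: fake_time[0])

gate.start_refresh()            # save() is never called after this
blocked = gate.can_refresh()    # still within the interval
fake_time[0] = 61.0
recovered = gate.can_refresh()  # interval exceeded, new refresh allowed
```

Injecting the clock keeps the example deterministic; it is a testing convenience, not something the PR describes.
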
In plug_produce_bqm_summary(), component allocations were looked up
per-device and accumulated per type/model, causing workers with N
devices of the same type/model to report N× the actual allocation.
Split into two passes: first accumulate capacity per device, then
set allocations once per type/model from DB query results—matching
the pattern already used in plug_produce_bqm().
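
The over-counting and the two-pass fix can be illustrated with a small sketch. The device dictionaries, the `allocations_by_key` lookup, and both function names are hypothetical; they only mimic the shape of the problem described above.

```python
from collections import defaultdict


def summarize_per_device(devices, allocations_by_key):
    """Old pattern: the per-type/model allocation is added once per device."""
    summary = defaultdict(lambda: {"capacity": 0, "allocated": 0})
    for dev in devices:
        key = (dev["type"], dev["model"])
        summary[key]["capacity"] += dev["capacity"]
        summary[key]["allocated"] += allocations_by_key.get(key, 0)
    return dict(summary)


def summarize_two_pass(devices, allocations_by_key):
    """Fixed pattern: accumulate capacity first, then set allocations once."""
    capacity = defaultdict(int)
    for dev in devices:  # pass 1: capacity per device
        capacity[(dev["type"], dev["model"])] += dev["capacity"]
    return {  # pass 2: one allocation value per type/model
        key: {"capacity": cap, "allocated": allocations_by_key.get(key, 0)}
        for key, cap in capacity.items()
    }


key = ("GPU", "ExampleModel")
devices = [{"type": "GPU", "model": "ExampleModel", "capacity": 1} for _ in range(4)]
allocations_by_key = {key: 2}  # stand-in for the DB query result

old = summarize_per_device(devices, allocations_by_key)
new = summarize_two_pass(devices, allocations_by_key)
```

With four identical devices, the per-device version reports four times the actual allocation, while the two-pass version reports it once.
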
Read-only methods (get_reservations, get_components, get_links, etc.)
were not rolling back the session on exception. Since sessions are
cached per thread, a failed query left PostgreSQL in a failed
transaction state, causing all subsequent queries on the same thread
to fail with InFailedSqlTransaction until the process was restarted.
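
The failure mode and the rollback fix can be sketched without a database. `FakeSession` below is a hypothetical stand-in that imitates a cached per-thread session stuck in a failed transaction, and `read_only()` is an assumed helper name; the actual project code is not shown in this PR.

```python
from contextlib import contextmanager


class FakeSession:
    """Hypothetical session that stays broken until rollback() is called."""

    def __init__(self):
        self.in_failed_transaction = False

    def execute(self, query):
        if self.in_failed_transaction:
            raise RuntimeError("InFailedSqlTransaction")
        if query == "bad query":
            self.in_failed_transaction = True
            raise ValueError("query failed")
        return f"rows for {query}"

    def rollback(self):
        self.in_failed_transaction = False


@contextmanager
def read_only(session):
    """Roll the session back if a read fails, then re-raise the error."""
    try:
        yield session
    except Exception:
        session.rollback()
        raise


session = FakeSession()  # imagine this is cached for the current thread

try:
    with read_only(session) as s:
        s.execute("bad query")
except ValueError:
    pass

# The same cached session is usable again because it was rolled back.
result = session.execute("get_reservations")
```

Without the rollback in `read_only()`, the second `execute()` call would raise, which corresponds to the behavior the commit fixes.
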
- Base image: python:3.11.0 → python:3.12.12
- Create /opt/venv in all Dockerfiles to comply with PEP 668
- Update all python/pip paths to use /opt/venv/bin
- AM handlers: 1.9.1 → 2.0.0b2
- ansible: unversioned → 13.6.0
- fabric_fss_utils: 1.6.2 → 1.7.0
- fabrictestbed: 2.0.6 → 2.0.7
- requires-python: >=3.9 → >=3.12
- Version bump to 2.0.0b1
Install community.libvirt, openstack.cloud, and cisco.nso collections
that are not bundled with ansible 13.6.0 but required by AM handlers.
kthare10 requested a review from paul-ruth May 20, 2026 16:26
kthare10 self-assigned this May 20, 2026
kthare10 merged commit 93fdd0f into main May 20, 2026
4 checks passed
kthare10 deleted the resource-calendar branch May 20, 2026 18:43