
Feature branch sync - pub/q2_dev to pub/telemetry #4367

Merged
abhishek-sa1 merged 29 commits into pub/telemetry from pub/q2_dev on May 5, 2026

Conversation

@abhishek-sa1
Collaborator

Feature branch sync - pub/q2_dev to pub/telemetry

sakshi-singla-1735 and others added 28 commits April 16, 2026 15:28
Signed-off-by: sakshi-singla-1735 <sakshi.s@dell.com>
Login compiler node and slurm node atomic lock cuda installation with drivers, dcgm and peermem
* pub telemetry changes

* service to scrape metrics from OTEL collector

* vmservice to scrape metrics from otel collector

* update endpoints

* revert other changes

* revert merge changes as per head

* revert variable set

* revert changes

* revert changes

* pylint fixes

* ansible lint fixes

* updating completion message

* telemetry validation while prepare oim

* update condition

* added check for LDMS
- Extract service readiness checks into separate block/rescue pattern
- Add SMD API health check before discovery attempt
- Implement automatic retry on discovery failure with service restart
- Increase service check timeout to 2 minutes (12 retries × 10s delay)
- Prevent connection refused errors by ensuring SMD endpoint is ready

This addresses the race condition where systemd marks smd service as "started"
but the HTTP endpoint at oimcp.oim.test:8443 isn't accepting connections yet.
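
The readiness pattern described in this commit message could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the playbook's actual tasks: `wait_until_ready`, `discover_with_restart`, and the injected `probe`, `discover`, and `restart` callables are hypothetical names. Only the 12 retries and 10 s delay come from the commit message.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], retries: int = 12,
                     delay: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it returns True or the attempts run out."""
    for attempt in range(1, retries + 1):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused and similar socket errors mean "not ready yet".
            pass
        if attempt < retries:
            sleep(delay)
    return False


def discover_with_restart(probe, discover, restart, sleep=time.sleep,
                          retries=12, delay=10.0):
    """Run discovery once the endpoint is ready; on failure, restart and retry once."""
    if not wait_until_ready(probe, retries, delay, sleep):
        raise TimeoutError("endpoint not ready before first discovery attempt")
    try:
        return discover()
    except Exception:
        # Rescue path: restart the service, wait for the endpoint again, retry.
        restart()
        if not wait_until_ready(probe, retries, delay, sleep):
            raise TimeoutError("endpoint not ready after restart")
        return discover()
```

Passing `probe` and `sleep` in as arguments keeps the sketch independent of any particular HTTP client and lets the retry logic be exercised without waiting on a real service. The point it illustrates is that "service started" and "endpoint accepting connections" are checked separately.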
…ment (#4366)

1. Skip OME credential prompt when enable_bmc_discovery is false
   - Changed discovery credentials from mandatory to conditional_mandatory
     in credential utility vars (gated on enable_bmc_discovery)
   - Added set_fact in prepare_oim.yml to promote enable_bmc_discovery
     from namespaced to top-level scope before credential utility runs

2. Fix PARENT_SERVICE_TAG assignment in PXE mapping
   - Source changed from service_kube_control_plane to service_kube_node
   - Only slurm_node_aarch64 and slurm_node_x86_64 receive PARENT_SERVICE_TAG
   - All other roles (control_plane, kube_node, login, slurm_control) remain empty
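
The two rules described in this commit message could be sketched in Python as below. This is a minimal illustration, not the repository's code: the function names, the rule strings as function arguments, and the `tags_by_role` dictionary are assumptions made for the example. The role names and the `enable_bmc_discovery` flag are taken from the commit message.

```python
# Roles that receive PARENT_SERVICE_TAG, per the commit message.
SLURM_COMPUTE_ROLES = {"slurm_node_aarch64", "slurm_node_x86_64"}


def should_prompt(rule: str, enable_bmc_discovery: bool) -> bool:
    """Decide whether a credential must be prompted for (hypothetical helper)."""
    if rule == "mandatory":
        return True
    if rule == "conditional_mandatory":
        # Item 1: the prompt is gated on the feature flag.
        return enable_bmc_discovery
    return False


def parent_service_tag(role: str, tags_by_role: dict) -> str:
    """Return PARENT_SERVICE_TAG for a role, or an empty string (hypothetical helper)."""
    if role in SLURM_COMPUTE_ROLES:
        # Item 2: the source is service_kube_node, not service_kube_control_plane.
        return tags_by_role.get("service_kube_node", "")
    return ""
```

With `enable_bmc_discovery` set to false, a `conditional_mandatory` credential is skipped, and any role outside the two Slurm compute roles gets an empty tag.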
@abhishek-sa1 abhishek-sa1 marked this pull request as ready for review May 5, 2026 11:34
ochami smd retries added after cloud-init service restart
@abhishek-sa1 abhishek-sa1 merged commit a7b47fd into pub/telemetry May 5, 2026
8 checks passed

5 participants