
Feature branch sync - pub/q2_dev to pub/q2_upgrade #4385

Merged
abhishek-sa1 merged 57 commits into pub/q2_upgrade from pub/q2_dev
May 8, 2026


@abhishek-sa1 abhishek-sa1 commented May 8, 2026

Feature branch sync - pub/q2_dev to pub/q2_upgrade

jagadeeshnv and others added 30 commits April 17, 2026 19:42
…ecution

- Implement literal moustache preservation for cloud-init variables using !unsafe with from_yaml.
- Generalize node_key support to allow full metadata paths (e.g., ds.meta_data.instance_data.local_ipv4).
- Fix b64encode evaluation in permission and directory creation tasks to ensure valid shell command generation.
- Centralize mount-specific runcmd execution in cloud-init templates using base64 encoding to prevent shell injection/parsing issues.
- Update storage_config schema to support flexible cloud-init variable references for node_key.
- Add safety defaults for mount_on_oim selection to prevent 'attribute not found' errors.
- Restore 'union' filter for mounts to maintain idempotency across multiple runs.
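
The base64 approach described above can be sketched in a few lines. This is only an illustration of the general technique, not the template code from this commit: the helper names `wrap_for_runcmd` and `unwrap` are hypothetical, and the `echo ... | base64 -d | sh` form is an assumed way to run the decoded payload.

```python
import base64


def wrap_for_runcmd(script: str) -> str:
    """Return one shell line that decodes and runs `script`.

    The base64 alphabet (A-Z, a-z, 0-9, +, /, =) contains no quotes,
    braces, or dollar signs, so the payload passes through YAML and
    shell parsing without being re-interpreted.
    """
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    return f"echo {encoded} | base64 -d | sh"


def unwrap(line: str) -> str:
    """Recover the original script from a wrapped line (for inspection)."""
    encoded = line.split()[1]
    return base64.b64decode(encoded).decode("utf-8")
```

Because the encoded token is opaque, a script containing quotes, `$VAR` references, or literal `{{ ... }}` cloud-init variables reaches the node unchanged.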
…ion updates

- New tasks: provision/roles/mount_config/tasks/{swap_config.yml,process_single_swap.yml}
  - Build per-functional-group swap entries in cloud_init_groups_dict
  - Validate swap filename path and size format; include optional maxsize
  - Idempotent merge via combine(recursive=True)
- Render swap block in cloud-init group templates
  - Inject conditional swap: {filename, size, [maxsize]} across login, compiler, kube (control/node), slurm (control/node) for x86_64/aarch64
  - Only rendered when cloud_init_groups_dict[fg].swap is present
- Strengthen input validation (common_validation.py)
  - Enforce no overlap of functional_group_prefix/group across swap entries
  - Validate maxsize >= size (supports bytes or K/M/G/T, and "auto")
  - Add helper parsers and validators for size comparison
- Update storage_config schema for swap (storage_config.json)
  - Remove deprecated swap.name field
  - Required: filename, size
  - oneOf: require either functional_group_prefix or group
  - Preserve existing size semantics (supports units/auto)
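
A minimal sketch of the swap validation rules listed above, for readers who want to see them in one place. It is not the code in common_validation.py: the function names `parse_size` and `validate_swap_entry` are hypothetical, units are assumed to be 1024-based, and "auto" is assumed to skip the size comparison.

```python
import re

# Assumption: K/M/G/T are binary multiples; the real parser may differ.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
_SIZE_RE = re.compile(r"^(\d+)([KMGT]?)$")


def parse_size(value):
    """Return the size in bytes, or None for "auto"."""
    text = str(value).strip()
    if text.lower() == "auto":
        return None
    match = _SIZE_RE.match(text.upper())
    if not match:
        raise ValueError(f"invalid swap size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def validate_swap_entry(entry):
    """Return a list of error strings; an empty list means the entry is valid."""
    errors = [f"missing required field: {name}"
              for name in ("filename", "size") if name not in entry]
    # oneOf: exactly one of the two target selectors must be present.
    if ("functional_group_prefix" in entry) == ("group" in entry):
        errors.append("exactly one of functional_group_prefix or group is required")
    if errors:
        return errors
    try:
        size = parse_size(entry["size"])
        maxsize = parse_size(entry["maxsize"]) if "maxsize" in entry else None
    except ValueError as exc:
        return [str(exc)]
    # Compare only when both values are concrete byte counts.
    if size is not None and maxsize is not None and maxsize < size:
        errors.append("maxsize must be >= size")
    return errors
```

For example, an entry with `size: 2G` and `maxsize: 1G` is rejected, while the same entry with `maxsize: auto` passes because there is nothing concrete to compare.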
- Change mount_item reference to config_item parameter
- Update process_single_mount.yml to pass config_item: mount_item
- Update process_single_swap.yml to pass config_item: swap_item
- Enable reuse of target group determination logic

This allows both mount and swap processing to use the same task.
Signed-off-by: Jagadeesh N V <39791839+jagadeeshnv@users.noreply.github.com>
- Fix pxe_mapping_file reference in config and validation flow
- Add validation to detect duplicate mount points per expanded functional group
- Include entry names in error messages for clarity
- Update schema and examples for consistency
- Remove extra mount_params profiles (vast_nfs_performance, powervault_iscsi, network_storage, local_storage, bind_mounts, scratch_storage, global)
- Keep only essential profiles: default and vast_nfs
- Simplify mounts section with two core examples (nfs_slurm, nfs_k8s)
- Remove VAST-specific entries and per-node bind mount examples
- Comment out powervault_config entries (move to examples)
- Comment out swap section (move to examples)
- Fix field names: group → groups in comments and examples
- Remove vast_enabled flag
- Align with storage_config.json schema (only mount_params, mounts, powervault_config, swap sections)
- Update comments to reference correct field names (groups instead of group)
- Enable host-specific mount map generation from PXE mapping data
- Add vendor_data metadata structure to cloud-init runcmd for dynamic mounts
- Update cloud-init variable syntax from Jinja2 template to cloud-init query
- Fix variable reference in build_host_mount_map.yml (PXE_GROUP -> GROUP_NAME)
- Change mkdir commands to verbose mode (mkdir -pv)
- Update pxe_mapping_file_path to use omnia_shared_path with path normalization

This allows per-host mount configurations to be dynamically applied during
node provisioning via cloud-init vendor_data metadata.
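
As a rough illustration of how a per-host mount map could be derived from PXE mapping data, consider the sketch below. The CSV column names (`HOSTNAME`, `GROUP_NAME`), the shape of the `mounts` entries, and the function name are assumptions made for this example; the actual logic lives in build_host_mount_map.yml and may be structured differently.

```python
import csv
import io
from collections import defaultdict


def build_host_mount_map(pxe_csv_text, mounts):
    """Map each hostname to the mount points configured for its group.

    pxe_csv_text: CSV text with (assumed) HOSTNAME and GROUP_NAME columns.
    mounts: list of dicts with (assumed) "mount_point" and "groups" keys.
    Hosts whose group matches no mount entry are left out of the result.
    """
    host_map = defaultdict(list)
    for row in csv.DictReader(io.StringIO(pxe_csv_text)):
        for mount in mounts:
            if row["GROUP_NAME"] in mount["groups"]:
                host_map[row["HOSTNAME"]].append(mount["mount_point"])
    return dict(host_map)
```

A map of this kind could then be exposed to each node through vendor_data so that only the entries for that host are applied at boot.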
- Restructure runcmd to mount NFS shares before creating bind mount sources
- Add explicit mount commands after each NFS fstab entry to ensure mounts
  are available before subdirectories are created on them
- Add mkdir commands for bind mount target directories on mounted NFS
- Consolidate mount -av at appropriate checkpoints to avoid premature
  mounting before all fstab entries and directories are ready
- Update process_single_mount.yml to add mkdir for bind mount targets
  before fstab entries
- Add "Refresh mounts" task to append final mount -av after all setup

This ensures the correct sequence:
1. Add NFS mount to fstab
2. Mount NFS share
3. Create subdirectories on mounted NFS
4. Add bind mount entries to fstab
5. Mount bind mounts

Fixes mount failures: "special device does not exist" and "mount point
does not exist" errors during cloud-init execution.
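
The ordering above can be sketched as a small command builder. This is an illustrative reconstruction, not the task code from process_single_mount.yml: the function name, the fstab options (`nfs defaults 0 0`, `none bind 0 0`), and the assumption that paths contain no whitespace are all choices made for the example.

```python
def build_runcmd(nfs_source, nfs_mount_point, bind_mounts):
    """Return runcmd lines in an order that avoids missing-path mount errors.

    bind_mounts: list of (source_on_nfs, target) path pairs.
    Assumes paths contain no whitespace, so no quoting is applied.
    """
    cmds = [
        f"mkdir -pv {nfs_mount_point}",
        # 1. Add the NFS mount to fstab.
        f"echo '{nfs_source} {nfs_mount_point} nfs defaults 0 0' >> /etc/fstab",
        # 2. Mount the NFS share before touching anything that lives on it.
        f"mount {nfs_mount_point}",
    ]
    # 3. Create bind sources (on the mounted NFS share) and bind targets.
    for source, target in bind_mounts:
        cmds.append(f"mkdir -pv {source} {target}")
    # 4. Add the bind mount entries to fstab.
    for source, target in bind_mounts:
        cmds.append(f"echo '{source} {target} none bind 0 0' >> /etc/fstab")
    # 5. Final refresh picks up the bind mounts.
    cmds.append("mount -av")
    return cmds
```

Creating the bind source only after the share is mounted is what avoids "special device does not exist"; creating the target before the final `mount -av` avoids "mount point does not exist".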
Signed-off-by: Jagadeesh N V <39791839+jagadeeshnv@users.noreply.github.com>
- Replace nfs_client_params with mounts list for consistency
- Extract k8s_nfs_server_ip from mounts source field (split on ':')
- Extract k8s_client_mount_path from mounts mount_point field
- Introduce k8s_nfs_server_path as unified source reference
- Update error message to use k8s_nfs_server_path for clarity
- Comment out deprecated k8s_server_share_path and nfs_server_ip references
- Consolidate 'mount -av' into single set_fact call instead of separate task
- Remove redundant debug task for single_mnt_runcmd
- Add loop_control label for better Ansible output readability
- Improve code formatting for regex_replace operations (no functional change)
- Streamline permission and bind mount runcmd list construction
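
The extraction of the Kubernetes NFS values from a mounts entry can be illustrated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the source is assumed to have the form `server:/export/path`, and splitting on the first ':' means IPv6 literals are not handled.

```python
def split_nfs_source(source):
    """Split "server:/export/path" into (server, path) on the first ':'."""
    server, sep, path = source.partition(":")
    if not sep or not server or not path.startswith("/"):
        raise ValueError(f"expected 'server:/path', got {source!r}")
    return server, path


def k8s_nfs_facts(mount):
    """Derive the two values the commit describes from one mounts entry."""
    server_ip, _export_path = split_nfs_source(mount["source"])
    return {
        "k8s_nfs_server_ip": server_ip,
        "k8s_client_mount_path": mount["mount_point"],
    }
```

Validating the shape of `source` up front gives a clearer error than an undefined-variable failure later in the play.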
jagadeeshnv and others added 24 commits May 7, 2026 00:50
Signed-off-by: sakshi-singla-1735 <sakshi.s@dell.com>
GPU defect, peermem defect, prolog removal
Signed-off-by: Jagadeesh N V <39791839+jagadeeshnv@users.noreply.github.com>
…oot and defect fixes (#4373)

* pub telemetry changes

* service to scrape metrics from OTEL collector

* vmservice to scrape metrics from otel collector

* update endpoints

* revert other changes

* revert merge changes as per head

* revert variable set

* revert changes

* revert changes

* pylint fixes

* ansible lint fixes

* updating completion message

* telemetry validation while prepare oim

* update condition

* added check for LDMS

* Fix for CrashLoopBackOff state on node reboot

* addressing review comment to move into vars

* remove rsyslog layer, update vmscraper and enabled external health monitor of csi driver

* fix for k8s_server_ip undefined variable

* fix for syntax error

* fix for UT issues - DNS resolution and keep powerscale configuration manual

* ansible lint fixes

* Native operator-based vmscraper

* remove unused syslog template

* fix for nfs_client_param check in telemetry config

* update storage_config variable

* fix for k8s_nfs_server_path undefined variable

* fix for kustomization error

* remove powerscale syslog configuration

---------

Signed-off-by: priti-parate <140157516+priti-parate@users.noreply.github.com>
@abhishek-sa1 abhishek-sa1 reopened this May 8, 2026
@abhishek-sa1 abhishek-sa1 changed the title from "Feature branch sync - pub/q2" to "Feature branch sync - pub/q2_dev to pub/q2_upgrade" May 8, 2026
@abhishek-sa1 abhishek-sa1 marked this pull request as ready for review May 8, 2026 09:00
Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>
@abhishek-sa1 abhishek-sa1 merged commit c86e7bd into pub/q2_upgrade May 8, 2026
6 of 8 checks passed
6 participants