Add SLURM 'comment' field to RSC job metrics scuba table #147

Merged
meta-codesync[bot] merged 1 commit into facebookresearch:main from sju2:export-D103691035 on May 13, 2026
@sju2 sju2 commented May 13, 2026

Summary:
In diff D98927579, we added `--comment` to the Slurm script for LCA training, but the field is not currently exported from RSC to the `scuba_fair_gpu_metrics` Scuba table.

This diff exports it, which enables ML Taxonomy use cases where `--comment` serves as a reliable source of truth for model type classification (e.g. `lca_pretrain`, `lca_posttrain`), since unlike `--job-name`, `--comment` cannot be overridden via CLI.

Changes:
- `shelper/local.go`: Added `Comment` field to `SlurmMetadata` and `SlurmMetadataList` structs
- `shelper/slurm_helpers.go`: Parsed `Comment` from `scontrol` output in `AttributeGPU2SlurmMetadata` and aggregated it in `GetGPUData`
- `slurmprocessor/common.go`: Added `SlurmComment` constant and wired into `AddSlurmMetadataStr` / `AddSlurmMetadataSlice`
- `DCGM.thrift`: Added `comment` dimension (field 117) to ODS3 schema
- Updated all test data files and test expectations
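
For readers without access to the diff, the parse-and-aggregate change listed above can be pictured roughly as follows. This is a minimal sketch, not the code in `shelper`: the trimmed-down `SlurmMetadata` struct, the `parseJob` and `uniqueComments` helpers, and the sample text are all hypothetical, and the input is assumed to follow the whitespace-separated `Key=Value` layout of `scontrol show job` output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// SlurmMetadata is a hypothetical, trimmed-down stand-in for the struct in
// shelper/local.go; the real struct carries more fields.
type SlurmMetadata struct {
	JobID   string
	JobName string
	Comment string
}

// parseJob reads `scontrol show job`-style text (assumed to be whitespace-
// separated Key=Value tokens) and fills in the fields used in this sketch.
func parseJob(out string) SlurmMetadata {
	var md SlurmMetadata
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// A comment is free text and may contain spaces, so when a line
		// starts with Comment= the rest of the line is taken as the value.
		if strings.HasPrefix(line, "Comment=") {
			md.Comment = strings.TrimPrefix(line, "Comment=")
			continue
		}
		for _, tok := range strings.Fields(line) {
			key, val, ok := strings.Cut(tok, "=")
			if !ok {
				continue
			}
			switch key {
			case "JobId":
				md.JobID = val
			case "JobName":
				md.JobName = val
			}
		}
	}
	return md
}

// uniqueComments collects the distinct non-empty comments across jobs,
// preserving first-seen order, as one way to aggregate per-GPU metadata.
func uniqueComments(jobs []SlurmMetadata) []string {
	seen := make(map[string]bool)
	var out []string
	for _, j := range jobs {
		if j.Comment == "" || seen[j.Comment] {
			continue
		}
		seen[j.Comment] = true
		out = append(out, j.Comment)
	}
	return out
}

func main() {
	// Hand-written sample, not captured from a real cluster.
	sample := "JobId=1234 JobName=train\n   UserId=alice(1000) GroupId=alice(1000)\n   Comment=lca_pretrain\n"
	md := parseJob(sample)
	fmt.Printf("%+v\n", md)
	fmt.Println(uniqueComments([]SlurmMetadata{md, md, {JobID: "1235"}}))
}
```

Taking the rest of the line for `Comment=` rather than splitting on whitespace is a deliberate choice in this sketch, since a user-supplied comment may contain spaces.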

## Post-land steps

1. **Register the ODS3 schema** — run this command to push the new `comment` dimension to ODS3:
```
buck2 run //monitoring/otel_gateway/ods3_client_schemas:DCGM_ods3-DCGM_ods3-register
```

2. **fotel redeployment** — the slurmprocessor runs inside fotel on each GPU node. Coordinate with the `ai_pe_rsc` oncall team to get fotel rebuilt and rolled out to RSC/AVA clusters. Until fotel is redeployed, the `comment` field will not be collected.

3. Once fotel is redeployed, the `comment` column will auto-appear in `scuba_fair_gpu_metrics` as data flows in — no separate Scuba schema change is needed.

Reviewed By: luccabb

Differential Revision: D103691035

meta-codesync Bot commented May 13, 2026

@sju2 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103691035.

@luccabb luccabb left a comment

Review automatically exported from Phabricator review in Meta.

luccabb commented May 13, 2026

This seems reasonable. `comment` is user input, so be careful downstream with the cardinality increase on ODS3 tables.
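
One possible way to act on the cardinality concern, sketched under assumptions rather than taken from this PR: normalize each comment and cap the number of distinct values passed through as a dimension, folding anything beyond the cap into a single `other` bucket. The `commentLimiter` type, its limits, and the `other` label are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// commentLimiter caps how many distinct comment values are passed through as
// a dimension. The type and its limits are illustrative, not from the PR.
type commentLimiter struct {
	maxValues int
	maxLen    int
	seen      map[string]struct{}
}

func newCommentLimiter(maxValues, maxLen int) *commentLimiter {
	return &commentLimiter{
		maxValues: maxValues,
		maxLen:    maxLen,
		seen:      make(map[string]struct{}),
	}
}

// normalize trims and lowercases the comment, truncates it to maxLen runes,
// and returns "other" for new values once maxValues distinct values have
// already been admitted. Empty comments stay empty.
func (l *commentLimiter) normalize(raw string) string {
	c := strings.ToLower(strings.TrimSpace(raw))
	if c == "" {
		return ""
	}
	if r := []rune(c); len(r) > l.maxLen {
		c = string(r[:l.maxLen])
	}
	if _, ok := l.seen[c]; ok {
		return c
	}
	if len(l.seen) >= l.maxValues {
		return "other"
	}
	l.seen[c] = struct{}{}
	return c
}

func main() {
	lim := newCommentLimiter(2, 32)
	for _, raw := range []string{"lca_pretrain", " LCA_Posttrain ", "my ad-hoc run", "lca_pretrain"} {
		fmt.Printf("%q -> %q\n", raw, lim.normalize(raw))
	}
}
```

The limiter here is per-process and not safe for concurrent use, so different nodes could admit different values; a fixed allowlist of expected comments would be stricter if the set of model types is known in advance.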

@github-actions

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| `/metaci tests` | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| `/metaci integration tests` | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@meta-codesync meta-codesync Bot merged commit 15b7c36 into facebookresearch:main May 13, 2026
37 of 38 checks passed