Add SLURM 'comment' field to RSC job metrics scuba table#147
Summary:
In diff D98927579 we added `--comment` to the Slurm script for LCA training, but the field is not currently exported from RSC to the `scuba_fair_gpu_metrics` Scuba table. This diff exports it.

Exporting the field enables ML Taxonomy use cases where `--comment` is used as a reliable source of truth for model type classification (e.g. `lca_pretrain`, `lca_posttrain`), since unlike `--job-name`, `--comment` cannot be overridden via CLI.

Changes:
- `shelper/local.go`: Added `Comment` field to `SlurmMetadata` and `SlurmMetadataList` structs
- `shelper/slurm_helpers.go`: Parse `Comment` from scontrol output in `AttributeGPU2SlurmMetadata`, aggregate in `GetGPUData`
- `slurmprocessor/common.go`: Added `SlurmComment` constant and wired it into `AddSlurmMetadataStr` / `AddSlurmMetadataSlice`
- `DCGM.thrift`: Added `comment` dimension (field 117) to ODS3 schema
- Updated all test data files and test expectations

## Post-land steps

1. **Register the ODS3 schema**: run this command to push the new `comment` dimension to ODS3:
   ```
   buck2 run //monitoring/otel_gateway/ods3_client_schemas:DCGM_ods3-DCGM_ods3-register
   ```
2. **fotel redeployment**: the slurmprocessor runs inside fotel on each GPU node. Coordinate with the `ai_pe_rsc` oncall team to get fotel rebuilt and rolled out to RSC/AVA clusters. Until fotel is redeployed, the `comment` field will not be collected.
3. Once fotel is redeployed, the `comment` column will auto-appear in `scuba_fair_gpu_metrics` as data flows in; no separate Scuba schema change is needed.

Reviewed By: luccabb

Differential Revision: D103691035
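
For readers unfamiliar with the parsing step, here is a minimal Go sketch of how a `Comment` value could be pulled out of `scontrol show job`-style text. It is not the code in `shelper/slurm_helpers.go`: the function name `parseComment`, the sample job text, and the assumption that `Comment=` starts its own line (as in the default multi-line scontrol output, not the one-line `-o` form) are all illustrative.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseComment is a hypothetical helper (not the PR's actual code).
// It scans scontrol-style output line by line and returns the text that
// follows "Comment=" on the first line starting with that key.
// It returns "" when no such line exists.
// Assumption: the comment occupies the remainder of its own line, so
// comments containing spaces are kept whole.
func parseComment(scontrolOutput string) string {
	scanner := bufio.NewScanner(strings.NewReader(scontrolOutput))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "Comment=") {
			return strings.TrimSpace(strings.TrimPrefix(line, "Comment="))
		}
	}
	return ""
}

func main() {
	// Made-up sample text in the shape of `scontrol show job` output.
	sample := "JobId=12345 JobName=train\n" +
		"   UserId=someone(1000) GroupId=users(100)\n" +
		"   Comment=lca_pretrain \n"
	fmt.Println(parseComment(sample))
}
```

Returning an empty string for jobs without a comment keeps the dimension optional, so jobs submitted without `--comment` would simply carry an empty `comment` value; whether the real implementation does the same is not stated in this summary.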
@sju2 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103691035.
luccabb left a comment
Review automatically exported from Phabricator review in Meta.
this seems reasonable,
Merged 15b7c36 into facebookresearch:main