Skip to content

Convert memory values from MB to bytes before writing to Scuba#71

Draft
Yash0270 wants to merge 1 commit intofacebookresearch:mainfrom
Yash0270:export-D94605913
Draft

Convert memory values from MB to bytes before writing to Scuba#71
Yash0270 wants to merge 1 commit intofacebookresearch:mainfrom
Yash0270:export-D94605913

Conversation

@Yash0270
Copy link
Contributor

Summary:
A user reported jobs not starting despite nodes appearing idle
(Jobs are not starting for more than 2 hours, while nodes are idle in FAIR AWS fair-sc (aka AWS H100/H200 DSS-3/4) Cluster by mahnerak).
During debugging, we discovered that Scuba displays memory values incorrectly
for FAIR HPC cluster data. Slurm reports memory in MB, and GCM writes these
raw MB integers to Scuba without conversion. Scuba's SI-prefix auto-formatter
then displays 1,524,000 (MB) as "1.524 M", which reads as 1.524 megabytes
when it's actually 1.524 TB (1,524,000 MB). This makes investigating node
memory availability confusing and error-prone.

What's New

Component Description
mb_to_bytes New helper in parsing.py to convert MB integers to bytes
convert_memory_to_bytes New helper that converts suffixed strings (e.g., "60G") directly to bytes
maybe_int_mb_to_bytes Parser for sinfo MEMORY/FREE_MEM fields (int MB -> bytes)
sinfo_node.py MEMORY and FREE_MEM now stored in bytes
squeue.py MIN_MEMORY and TRES_MEM_ALLOCATED now stored in bytes
scontrol.py TresMEM now stored in bytes (billing weights unchanged)

Key Design Decisions

  • Field-level conversion over parse_value_from_tres change: parse_value_from_tres
    is shared between actual memory amounts (TRES totals, allocated) and billing weight
    coefficients (TresBillingWeightMEM). Converting at the field level avoids corrupting
    billing weights, which are dimensionless coefficients, not memory amounts.
  • MB * 1,000,000 (decimal): Matches Scuba's SI-prefix formatter which uses
    powers of 10 (K=10^3, M=10^6, G=10^9, T=10^12).

After Fix

Value (bytes) Scuba displays Meaning
500,000,000,000 "500 G" 500 GB
1,524,000,000,000 "1.524 T" 1.524 TB

Reviewed By: luccabb, yonglimeta

Differential Revision: D94605913

@github-actions
Copy link

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow What it runs
GPU Cluster Monitoring Python CI lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command Description Requires approval?
/metaci tests Runs Meta internal integration tests (pytest) Yes — a maintainer must trigger the command and approve the deployment request
/metaci integration tests Same as above (alias) Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@meta-codesync
Copy link

meta-codesync bot commented Feb 28, 2026

@Yash0270 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94605913.

Yash0270 added a commit to Yash0270/gcm that referenced this pull request Feb 28, 2026
…ookresearch#71)

Summary:
Pull Request resolved: facebookresearch#71

A user reported jobs not starting despite nodes appearing idle
(Jobs are not starting for more than 2 hours, while nodes are idle in FAIR AWS fair-sc (aka AWS H100/H200 DSS-3/4) Cluster by mahnerak).
During debugging, we discovered that Scuba displays memory values incorrectly
for FAIR HPC cluster data. Slurm reports memory in MB, and GCM writes these
raw MB integers to Scuba without conversion. Scuba's SI-prefix auto-formatter
then displays 1,524,000 (MB) as "1.524 M", which reads as 1.524 megabytes
when it's actually 1.524 TB (1,524,000 MB). This makes investigating node
memory availability confusing and error-prone.

## What's New
| Component | Description |
|-----------|-------------|
| `mb_to_bytes` | New helper in parsing.py to convert MB integers to bytes |
| `convert_memory_to_bytes` | New helper that converts suffixed strings (e.g., "60G") directly to bytes |
| `maybe_int_mb_to_bytes` | Parser for sinfo MEMORY/FREE_MEM fields (int MB -> bytes) |
| `sinfo_node.py` | MEMORY and FREE_MEM now stored in bytes |
| `squeue.py` | MIN_MEMORY and TRES_MEM_ALLOCATED now stored in bytes |
| `scontrol.py` | TresMEM now stored in bytes (billing weights unchanged) |

## Key Design Decisions
- **Field-level conversion over parse_value_from_tres change**: parse_value_from_tres
  is shared between actual memory amounts (TRES totals, allocated) and billing weight
  coefficients (TresBillingWeightMEM). Converting at the field level avoids corrupting
  billing weights, which are dimensionless coefficients, not memory amounts.
- **MB * 1,000,000 (decimal)**: Matches Scuba's SI-prefix formatter which uses
  powers of 10 (K=10^3, M=10^6, G=10^9, T=10^12).

## After Fix
| Value (bytes) | Scuba displays | Meaning |
|---|---|---|
| 500,000,000,000 | "500 G" | 500 GB |
| 1,524,000,000,000 | "1.524 T" | 1.524 TB |

Reviewed By: luccabb, yonglimeta

Differential Revision: D94605913
data[name] = int(match.group(1))
else:
data[name] = None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why dont we add "slurmctld_thread_cpu" to sdiag here and keep the collect_slurm concise.

return int(float(value[:-1]) * multipliers[suffix])


def maybe_parse_memory_to_bytes(v: str) -> int | None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use this decorator from 74 @log_error(name, return_on_error=0), so no need to import logger

Comment on lines +325 to +327
output = subprocess.check_output(
["scontrol", "show", "config"], text=True
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't feel it's good pattern to subprocess call from inside a call, then also do a regex search. It adds a lot of complexity to sdiag_structured.

@Yash0270 Yash0270 marked this pull request as draft March 2, 2026 18:30
@Yash0270
Copy link
Contributor Author

Yash0270 commented Mar 2, 2026

Thanks for the comments @giongto35 , I pushed an additional incomplete PR in this change. Correcting it now.

Yash0270 added a commit to Yash0270/gcm that referenced this pull request Mar 2, 2026
…ookresearch#71)

Summary:
Pull Request resolved: facebookresearch#71

A user reported jobs not starting despite nodes appearing idle
(Jobs are not starting for more than 2 hours, while nodes are idle in FAIR AWS fair-sc (aka AWS H100/H200 DSS-3/4) Cluster by mahnerak).
During debugging, we discovered that Scuba displays memory values incorrectly
for FAIR HPC cluster data. Slurm reports memory in MB, and GCM writes these
raw MB integers to Scuba without conversion. Scuba's SI-prefix auto-formatter
then displays 1,524,000 (MB) as "1.524 M", which reads as 1.524 megabytes
when it's actually 1.524 TB (1,524,000 MB). This makes investigating node
memory availability confusing and error-prone.

## What's New
| Component | Description |
|-----------|-------------|
| `mb_to_bytes` | New helper in parsing.py to convert MB integers to bytes |
| `convert_memory_to_bytes` | New helper that converts suffixed strings (e.g., "60G") directly to bytes |
| `maybe_int_mb_to_bytes` | Parser for sinfo MEMORY/FREE_MEM fields (int MB -> bytes) |
| `sinfo_node.py` | MEMORY and FREE_MEM now stored in bytes |
| `squeue.py` | MIN_MEMORY and TRES_MEM_ALLOCATED now stored in bytes |
| `scontrol.py` | TresMEM now stored in bytes (billing weights unchanged) |

## Key Design Decisions
- **Field-level conversion over parse_value_from_tres change**: parse_value_from_tres
  is shared between actual memory amounts (TRES totals, allocated) and billing weight
  coefficients (TresBillingWeightMEM). Converting at the field level avoids corrupting
  billing weights, which are dimensionless coefficients, not memory amounts.
- **MB * 1,000,000 (decimal)**: Matches Scuba's SI-prefix formatter which uses
  powers of 10 (K=10^3, M=10^6, G=10^9, T=10^12).

## After Fix
| Value (bytes) | Scuba displays | Meaning |
|---|---|---|
| 500,000,000,000 | "500 G" | 500 GB |
| 1,524,000,000,000 | "1.524 T" | 1.524 TB |

Reviewed By: luccabb, yonglimeta

Differential Revision: D94605913
Summary:
Adds `parse_memory_to_bytes` as a single robust entry point for all Slurm memory parsing, handling plain integers (assumed MB per Slurm default), suffixed strings (M/G/T/P), and N/A values. This addresses reviewer feedback that if Slurm switches away from MB we break.

## What's New
| Component | Description |
|-----------|-------------|
| `parse_memory_to_bytes` | New robust parser: plain ints (MB), suffixed (M/G/T/P), zero |
| `maybe_parse_memory_to_bytes` | Nullable wrapper returning None for N/A/empty/invalid |
| `sinfo_node.py` | FREE_MEM/MEMORY use `maybe_parse_memory_to_bytes` instead of `maybe_int_mb_to_bytes` |
| `squeue.py` | MIN_MEMORY uses `maybe_parse_memory_to_bytes` instead of `convert_memory_to_mb` |
| `convert_memory_to_mb` | Fixed plain-digit branch: treats unsuffixed values as MB (was incorrectly treating as bytes) |

## Key Design Decisions
- **Single entry point**: All memory parsing goes through `parse_memory_to_bytes` so format changes only need fixing in one place
- **Plain int = MB**: Slurm's default unit is MB; unsuffixed integers are treated as MB, not bytes
- **Kept `convert_memory_to_mb`**: Still used by TRES path (`parse_value_from_tres`); fixed its plain-digit branch for consistency

## Related
- Reviewer comments: D94605913 (luccab)
- Parent diff: D94607438

Reviewed By: luccabb, yonglimeta

Differential Revision: D94605913
Copy link
Member

@luccabb luccabb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review automatically exported from Phabricator review in Meta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants