# Node usage analysis with `sacct` output

This notebook analyzes the node usage from the Slurm `sacct` output.  
The analysis includes:
- Parsing the `AllocTRES` field to extract CPU, memory, and node information.
- Calculating total node hours used and remaining.
- Estimating running days for a given number of nodes.
- Estimating the required nodes for specific remaining days.

In [12]:
import numpy as np
import pandas as pd

## Load the `sacct` result

The `sacct` command reports accounting information for each job and job step. The command below was used to generate the input file; `-p` selects pipe-delimited output:

```shell
sacct -S 2024-10-01 -E 2024-12-18 --format="JOBID,JobName,Partition,State,AllocTres,ElapsedRaw" -p -T > sacct.txt
```
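
With `-p`, every line of `sacct` output ends with a trailing `|`, so a reader that splits on `|` sees one extra, empty field per row. pandas labels a blank header `Unnamed: N`, which is where the `Unnamed: 6` column in the tables further down comes from. The following is a minimal sketch using only the standard library; the header line is assumed from the column names shown in this notebook, and the data row is copied from its output.

```python
import csv
import io

# Two lines in the shape produced by `sacct -p`: note the trailing "|".
sample = (
    "JobID|JobName|Partition|State|AllocTRES|ElapsedRaw|\n"
    "14775436|check_cores|normal|COMPLETED|cpu=2,mem=14200M,node=1|2|\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter="|"))
header, first = rows

# The trailing delimiter yields a seventh, empty field on every line.
print(len(header), repr(header[-1]))

# Dropping the last field leaves the six requested columns.
record = dict(zip(header[:-1], first[:-1]))
print(record["AllocTRES"])
```

If the extra column is unwanted, `sacct -P` (`--parsable2`) produces the same pipe-delimited output without the trailing delimiter.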

In [14]:
# Load the `sacct` result
info1 = pd.read_csv("./sacct-20241212.txt", delimiter="|")
info2 = pd.read_csv("./sacct-20241218.txt", delimiter="|")

## Parse `AllocTRES` field

Extract `cpu`, `mem`, and `node` information from the `AllocTRES` column and add them as new columns to the DataFrame.

In [15]:
def parse_AllocTRES(info):
    # Extract cpu, mem, and node from AllocTRES.
    # Rows that do not match (e.g. an empty AllocTRES) yield NaN.
    alloc_cols = info["AllocTRES"].str.extract(r'cpu=(\d+),mem=(\d+M),node=(\d+)')
    alloc_cols.columns = ["cpu", "mem", "node"]

    # Return a new DataFrame with the parsed columns appended
    return pd.concat([info, alloc_cols], axis=1)

In [16]:
info1 = parse_AllocTRES(info1)
info2 = parse_AllocTRES(info2)
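
The pattern in `parse_AllocTRES` expects `cpu`, `mem`, and `node` to be adjacent and in that order, with memory carrying an `M` suffix. That holds for every row shown in this notebook, but a more tolerant approach is to split the string into key/value pairs first. The sketch below is an illustrative alternative, not the method used here: `parse_tres` and `mem_to_mb` are hypothetical helper names, and the set of unit suffixes (K, M, G, T) is an assumption about what Slurm may emit.

```python
# Multipliers relative to the "M" unit; assumed suffix set, adjust if needed.
_UNIT_TO_M = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 ** 2}


def mem_to_mb(value):
    """Convert a memory string such as '14200M' or '4G' to a float in M units."""
    if value and value[-1] in _UNIT_TO_M:
        return float(value[:-1]) * _UNIT_TO_M[value[-1]]
    # No recognized suffix: parse the number as-is (ValueError if not numeric).
    return float(value)


def parse_tres(tres):
    """Split 'key=value,key=value' into a dict; '' or None gives an empty dict."""
    if not tres:
        return {}
    pairs = (item.split("=", 1) for item in tres.split(","))
    return {key: value for key, value in pairs}


tres = parse_tres("billing=3024,cpu=4320,mem=7726620M,node=30")
print(int(tres["cpu"]), mem_to_mb(tres["mem"]), int(tres["node"]))
```

Because the keys are looked up by name, an extra entry such as `billing=3024` or a different ordering does not affect the result.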

In [17]:
info1

Unnamed: 0,JobID,JobName,Partition,State,AllocTRES,ElapsedRaw,Unnamed: 6,cpu,mem,node
0,14775436,check_cores,normal,COMPLETED,"cpu=2,mem=14200M,node=1",2,,2,14200M,1
1,14775436.batch,batch,,COMPLETED,"cpu=2,mem=14200M,node=1",2,,2,14200M,1
2,14775436.extern,extern,,COMPLETED,"cpu=2,mem=14200M,node=1",2,,2,14200M,1
3,14858548,hr5-016-01,normal,COMPLETED,"cpu=32,mem=227200M,node=1",4,,32,227200M,1
4,14858548.batch,batch,,COMPLETED,"cpu=32,mem=227200M,node=1",4,,32,227200M,1
...,...,...,...,...,...,...,...,...,...,...
185,15020763,CPL1,large_cpu,CANCELLED by 43801,,0,,,,
186,15020836,LCDM,large_cpu,CANCELLED by 43801,"billing=3024,cpu=4320,mem=7726620M,node=30",0,,4320,7726620M,30
187,15020838,CPL0,large_cpu,CANCELLED by 43801,"billing=3024,cpu=4320,mem=7726620M,node=30",0,,4320,7726620M,30
188,15020842,CPL1,large_cpu,CANCELLED by 43801,"billing=3024,cpu=4320,mem=7726620M,node=30",0,,4320,7726620M,30


In [18]:
info2

Unnamed: 0,JobID,JobName,Partition,State,AllocTRES,ElapsedRaw,Unnamed: 6,cpu,mem,node
0,14775436,check_cores,normal,COMPLETED,"cpu=2,mem=14200M,node=1",2,,2,14200M,1
1,14775436.batch,batch,,COMPLETED,"cpu=2,mem=14200M,node=1",2,,2,14200M,1
2,14775436.extern,extern,,COMPLETED,"cpu=2,mem=14200M,node=1",2,,2,14200M,1
3,14858548,hr5-016-01,normal,COMPLETED,"cpu=32,mem=227200M,node=1",4,,32,227200M,1
4,14858548.batch,batch,,COMPLETED,"cpu=32,mem=227200M,node=1",4,,32,227200M,1
...,...,...,...,...,...,...,...,...,...,...
215,15126900.0,ramses-lcdm-3d,,RUNNING,"cpu=2400,mem=7726620M,node=30",25510,,2400,7726620M,30
216,15126905,CPL1,large_cpu,RUNNING,"billing=3024,cpu=4320,mem=7726620M,node=30",25450,,4320,7726620M,30
217,15126905.batch,batch,,RUNNING,"cpu=144,mem=257554M,node=1",25450,,144,257554M,1
218,15126905.extern,extern,,RUNNING,"billing=3024,cpu=4320,mem=7726620M,node=30",25450,,4320,7726620M,30


In [20]:
info2[190:]

Unnamed: 0,JobID,JobName,Partition,State,AllocTRES,ElapsedRaw,Unnamed: 6,cpu,mem,node
190,15020838,CPL0,large_cpu,CANCELLED by 43801,"billing=3024,cpu=4320,mem=7726620M,node=30",252085,,4320.0,7726620M,30.0
191,15020838.batch,batch,,CANCELLED,"cpu=144,mem=257554M,node=1",252086,,144.0,257554M,1.0
192,15020838.extern,extern,,COMPLETED,"billing=3024,cpu=4320,mem=7726620M,node=30",252085,,4320.0,7726620M,30.0
193,15020838.0,ramses-cpl0-3d,,FAILED,"cpu=1200,mem=7726620M,node=30",252089,,1200.0,7726620M,30.0
194,15020842,CPL1,large_cpu,CANCELLED by 43801,"billing=3024,cpu=4320,mem=7726620M,node=30",234289,,4320.0,7726620M,30.0
195,15020842.batch,batch,,CANCELLED,"cpu=144,mem=257554M,node=1",234290,,144.0,257554M,1.0
196,15020842.extern,extern,,COMPLETED,"billing=3024,cpu=4320,mem=7726620M,node=30",234289,,4320.0,7726620M,30.0
197,15020842.0,ramses-cpl1-3d,,FAILED,"cpu=1200,mem=7726620M,node=30",234295,,1200.0,7726620M,30.0
198,15034844,LCDM4,large_cpu,CANCELLED by 43801,,0,,,,
199,15065870,LCDM,large_cpu,TIMEOUT,"billing=3024,cpu=4320,mem=7726620M,node=30",86419,,4320.0,7726620M,30.0


In [45]:
info2[0:220:3]

Unnamed: 0,JobID,JobName,Partition,State,AllocTRES,ElapsedRaw,Unnamed: 6,cpu,mem,node,NodeHours
0,14775436,check_cores,normal,COMPLETED,"cpu=2,mem=14200M,node=1",2.0,,2,14200M,1.0,0.000556
3,14858548,hr5-016-01,normal,COMPLETED,"cpu=32,mem=227200M,node=1",4.0,,32,227200M,1.0,0.001111
6,14858548.0,ramses3d,,COMPLETED,"cpu=16,mem=113600M,node=1",4.0,,16,113600M,1.0,0.001111
9,14858550.batch,batch,,COMPLETED,"cpu=32,mem=227200M,node=1",4.0,,32,227200M,1.0,0.001111
12,14858551,hr5-016-02,normal,COMPLETED,"cpu=32,mem=227200M,node=1",2.0,,32,227200M,1.0,0.000556
...,...,...,...,...,...,...,...,...,...,...,...
207,15124476.0,ramses-lcdm-3d,,FAILED,"cpu=1200,mem=7726620M,node=30",106308.0,,1200,7726620M,30.0,885.900000
210,15124479.extern,extern,,RUNNING,"billing=3024,cpu=4320,mem=7726620M,node=30",131039.0,,4320,7726620M,30.0,1091.991667
213,15126900.batch,batch,,RUNNING,"cpu=144,mem=257554M,node=1",25511.0,,144,257554M,1.0,7.086389
216,15126905,CPL1,large_cpu,RUNNING,"billing=3024,cpu=4320,mem=7726620M,node=30",25450.0,,4320,7726620M,30.0,212.083333


## Compute node hours

In [21]:
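def add_node_hours(info):
    # Adds a NodeHours column to `info` in place (nothing is returned)
    info["ElapsedRaw"] = info["ElapsedRaw"].astype(float)  # elapsed time in seconds
    info["node"] = info["node"].astype(float)  # extracted as strings; NaN where AllocTRES is empty
    info["NodeHours"] = (info["ElapsedRaw"] * info["node"]) / 3600  # node-hours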

In [23]:
add_node_hours(info1)
add_node_hours(info2)
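
As a sanity check on the formula, node-hours are elapsed seconds times the node count, divided by 3600. The sketch below recomputes two values that appear in the `NodeHours` column of the `info2[0:220:3]` output; `node_hours` is a hypothetical standalone helper, not part of the notebook's code.

```python
def node_hours(elapsed_seconds, nodes):
    """Node-hours used by a job that held `nodes` nodes for `elapsed_seconds`."""
    return elapsed_seconds * nodes / 3600


# Job 15126905 (CPL1): 25450 s on 30 nodes.
print(round(node_hours(25450, 30), 6))   # 212.083333
# Step 15124476.0: 106308 s on 30 nodes.
print(round(node_hours(106308, 30), 6))  # 885.9
```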

In [43]:
node_hours_used1 = info1[info1['Partition']=='large_cpu']["NodeHours"].sum()
node_hours_used2 = info2[info2['Partition']=='large_cpu']["NodeHours"].sum()
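
Filtering on `Partition == 'large_cpu'` also keeps job steps out of the sum: in the tables shown earlier, step rows such as `.batch`, `.extern`, and `.0` have an empty `Partition`, so only the parent job row is counted. Without that filter the same allocation would be counted more than once. Here is a small sketch with the three rows of job 15126905 from the `info2` output, with plain tuples standing in for the DataFrame:

```python
# (JobID, Partition, ElapsedRaw in seconds, node count), copied from the info2 output.
rows = [
    ("15126905", "large_cpu", 25450, 30),
    ("15126905.batch", "", 25450, 1),
    ("15126905.extern", "", 25450, 30),
]


def hours(row):
    """Node-hours for one row: elapsed seconds times nodes, converted to hours."""
    _, _, elapsed, nodes = row
    return elapsed * nodes / 3600


all_rows = sum(hours(r) for r in rows)
jobs_only = sum(hours(r) for r in rows if r[1] == "large_cpu")

print(f"all rows:  {all_rows:.2f} node-hours")   # 431.24
print(f"jobs only: {jobs_only:.2f} node-hours")  # 212.08
```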

In [34]:
node_hours_used2 - node_hours_used1 # node hours for final runs

7196.800000000003

In [37]:
(node_hours_used2 - node_hours_used1)/24/30 # days of running on 30 nodes for the final runs
# It gives 10 days, but it's more than 5+3+2?

9.99555555555556

In [38]:
info2[info2['Partition']=='large_cpu']["NodeHours"].sum() 

14371.44166666667

In [47]:
node_hours_total = 63000 # total node hours available on Olaf
node_hours_used = info2[info2['Partition']=='large_cpu']["NodeHours"].sum() 
node_hours_left = node_hours_total - node_hours_used

print(f"Total node hours: {node_hours_total:10.2f} h")
print(f"Used node hours:  {node_hours_used:10.2f} h ({node_hours_used/node_hours_total*100:.2f}%)")
print(f"Left node hours:  {node_hours_left:10.2f} h ({node_hours_left/node_hours_total*100:.2f}%)")

Total node hours:   63000.00 h
Used node hours:    14371.44 h (22.81%)
Left node hours:    48628.56 h (77.19%)


## Estimate running days for a given number of nodes

In [48]:
nodes = 90
days_left = node_hours_left / 24 / nodes
print(f"Running days with {nodes} nodes: {days_left:.2f} days")

Running days with 90 nodes: 22.51 days


## Estimate required nodes for a given number of remaining days

In [49]:
days_left = 30
nodes = node_hours_left / 24 / days_left
print(f"Required nodes for {days_left} days: {nodes:.2f} nodes")

Required nodes for 30 days: 67.54 nodes
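
Both estimates rearrange the same relation, node-hours = nodes × days × 24. Since nodes are allocated in whole units, the fractional node count is best read as an upper bound: rounding down gives the largest allocation that stays within the budget for the full period. Here is a minimal sketch that uses the remaining node-hours printed in the summary (48628.56) as a literal; the function names are illustrative.

```python
import math

HOURS_PER_DAY = 24


def days_for_nodes(node_hours_left, nodes):
    """Days the budget lasts when `nodes` nodes run continuously."""
    return node_hours_left / HOURS_PER_DAY / nodes


def max_nodes_for_days(node_hours_left, days):
    """Largest whole number of nodes that can run for `days` within the budget."""
    return math.floor(node_hours_left / HOURS_PER_DAY / days)


left = 48628.56  # remaining node-hours, from the printed summary
print(f"{days_for_nodes(left, 90):.2f} days on 90 nodes")   # 22.51
print(f"{max_nodes_for_days(left, 30)} nodes for 30 days")  # 67
```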


## Estimate the maximum number of runs

In [55]:
# For one simulation,
nodes = 30
days_run = 20 # days (~ 10 days for ideal case?)
node_hours_run = nodes * days_run * 24

In [56]:
node_hours_total / node_hours_run

4.375
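
The ratio above is based on the total allocation, and only whole simulations can be run. The sketch below rounds down and, as a variation that is not part of the original calculation, repeats the estimate with the node-hours that are still unused (48628.56, taken from the printed summary).

```python
import math

node_hours_total = 63000        # total allocation, as above
node_hours_left = 48628.56      # remaining budget, from the printed summary
node_hours_run = 30 * 20 * 24   # one simulation: 30 nodes for 20 days

# Whole runs that fit in the full allocation.
print(math.floor(node_hours_total / node_hours_run))  # 4
# Whole runs that fit in what is left.
print(math.floor(node_hours_left / node_hours_run))   # 3
```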