# Workflow Analysis - DDMD

This notebook runs the DDMD workflow analysis step by step, calling the modular Python scripts in `modules/`.

## Overview
The analysis process includes:
1. Loading workflow data from datalife statistics
2. Estimating transfer rates using 4D interpolation from IOR benchmark data
3. Calculating Storage Performance Metrics (SPM) for different storage configurations
4. Generating visualizations and recommendations

In [1]:
# Import required libraries and modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os

# Import our workflow analysis modules
from modules.workflow_config import DEFAULT_WF, TEST_CONFIGS, STORAGE_LIST
from modules.workflow_data_utils import (
    load_workflow_data, calculate_io_time_breakdown
)
from modules.workflow_interpolation import (
    estimate_transfer_rates_for_workflow, calculate_aggregate_filesize_per_node
)
from modules.workflow_spm_calculator import (
    calculate_spm_for_workflow, filter_storage_options,
    display_top_sorted_averaged_rank, select_best_storage_and_parallelism
)
from modules.workflow_visualization import plot_all_visualizations

from modules.workflow_data_staging import insert_data_staging_rows

print("Modules imported successfully!")

Modules imported successfully!


## Configuration

Set the workflow to analyze and other parameters.
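
The cells in this notebook read `ALLOWED_PARALLELISM` and `NUM_NODES_LIST` from `TEST_CONFIGS[WORKFLOW_NAME]`. As a rough sketch of what such an entry might look like, and of a guard that fails early on an unknown workflow name, consider the following. Only the two key names come from this notebook; `EXAMPLE_CONFIGS`, `get_workflow_config`, and all values are hypothetical.

```python
# Hypothetical stand-in for modules.workflow_config.TEST_CONFIGS.
# Only the key names ALLOWED_PARALLELISM and NUM_NODES_LIST come from this
# notebook; the values below are illustrative assumptions.
EXAMPLE_CONFIGS = {
    "ddmd_4n_l": {
        "ALLOWED_PARALLELISM": [1, 3, 6, 12],
        "NUM_NODES_LIST": [1, 2, 4],
    },
}


def get_workflow_config(name, configs):
    """Return the config for `name`, failing early with a readable message."""
    try:
        config = configs[name]
    except KeyError:
        known = ", ".join(sorted(configs))
        raise ValueError(f"Unknown workflow {name!r}; known workflows: {known}") from None
    missing = {"ALLOWED_PARALLELISM", "NUM_NODES_LIST"} - config.keys()
    if missing:
        raise ValueError(f"Config for {name!r} is missing keys: {sorted(missing)}")
    return config


config = get_workflow_config("ddmd_4n_l", EXAMPLE_CONFIGS)
print(config["NUM_NODES_LIST"])
```

Checking the name up front gives a clearer error than a bare `KeyError` raised several cells later.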

In [2]:
# Configuration
WORKFLOW_NAME = "ddmd_4n_l"  # Change this to analyze different workflows
IOR_DATA_PATH = "../perf_profiles/updated_master_ior_df.csv"

print(f"Analyzing workflow: {WORKFLOW_NAME}")
print(f"Available workflows: {list(TEST_CONFIGS.keys())}")
print(f"IOR data path: {IOR_DATA_PATH}")

ALLOWED_PARALLELISM = TEST_CONFIGS[WORKFLOW_NAME]["ALLOWED_PARALLELISM"]

Analyzing workflow: ddmd_4n_l
Available workflows: ['ddmd_2n_s', 'ddmd_4n_l', '1kg', '1kg_2', 'pyflex_240f', 'pyflex_s9_48f', 'ptychonn', 'montage', 'seismology', 'llm_wf', 'template_workflow']
IOR data path: ../perf_profiles/updated_master_ior_df.csv


## Step 1: Load Workflow Data

Load and process the workflow data from datalife statistics.

In [3]:
# Load workflow data
print("Loading workflow data...")
wf_df, task_order_dict, all_wf_dict = load_workflow_data(WORKFLOW_NAME, debug=False)

print(f"\nWorkflow data loaded:")
print(f"- Total records: {len(wf_df)}")
print(f"- Task definitions: {len(task_order_dict)}")
print(f"- Unique tasks: {list(wf_df['taskName'].unique())}")
print(f"- Stages: {sorted(wf_df['stageOrder'].unique())}")



Loading workflow data...
Trial folders: ['./ddmd/ddmd_4n_pfs_large/4n_pfs_t1']
len(blk_files) = 89





Workflow data loaded:
- Total records: 267
- Task definitions: 4
- Unique tasks: ['openmm', 'aggregate', 'training', 'inference']
- Stages: [np.int64(0), np.int64(1), np.int64(2)]


In [4]:
# Display first few rows
print("\nFirst few rows of workflow data:")
print(wf_df.head())
print(wf_df.columns)


First few rows of workflow data:
   operation  randomOffset  transferSize  aggregateFilesizeMB  numTasks  \
0          0             1   1066.397658             7.293896        12   
1          0             1   1066.397658             7.293896        12   
2          0             1   1066.397658             7.293896        12   
3          0             1   1066.397658             7.293896        12   
4          0             1   1066.397658             7.293896        12   

   parallelism  totalTime numNodesList  numNodes  tasksPerNode       trMiB  \
0           12   0.026276    [1, 2, 4]         1            12  277.589578   
1           12   0.026276    [1, 2, 4]         2             6  277.589578   
2           12   0.026276    [1, 2, 4]         4             3  277.589578   
3           12   0.026276    [1, 2, 4]         1            12  277.589578   
4           12   0.026276    [1, 2, 4]         2             6  277.589578   

   storageType  opCount taskName       taskPID

## Step 2: Calculate I/O Time Breakdown

Calculate the I/O time breakdown for each task in the workflow.
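
As a minimal sketch of the idea, the per-task totals can be thought of as a sum of `totalTime` grouped by task and operation. This version uses plain Python so it runs standalone. It assumes operation code `0` means write and `1` means read, and it ignores any parallelism or node-count adjustment that `calculate_io_time_breakdown` may apply. The records are illustrative, not taken from the trace.

```python
from collections import defaultdict

# Assumed encoding of the `operation` column: 0 = write, 1 = read.
OP_NAMES = {0: "write", 1: "read"}

# Illustrative records, not taken from the real trace.
records = [
    {"taskName": "openmm", "operation": 0, "totalTime": 0.02},
    {"taskName": "openmm", "operation": 0, "totalTime": 0.03},
    {"taskName": "training", "operation": 1, "totalTime": 1.50},
    {"taskName": "training", "operation": 0, "totalTime": 0.25},
]


def io_time_by_task(rows):
    """Sum totalTime per (taskName, operation name)."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["taskName"], OP_NAMES[row["operation"]])
        totals[key] += row["totalTime"]
    return dict(totals)


per_task = io_time_by_task(records)
total_read = sum(t for (_, op), t in per_task.items() if op == "read")
total_write = sum(t for (_, op), t in per_task.items() if op == "write")
print(per_task)
print(f"read={total_read:.2f}s write={total_write:.2f}s")
```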

In [5]:
# Get configuration for the workflow
config = TEST_CONFIGS[WORKFLOW_NAME]
num_nodes_list = config["NUM_NODES_LIST"]

# Create task name to parallelism mapping
task_name_to_parallelism = {task: info['parallelism'] for task, info in task_order_dict.items()}

# Calculate I/O time breakdown
print("Calculating I/O time breakdown...")
io_breakdown = calculate_io_time_breakdown(wf_df, task_name_to_parallelism, num_nodes_list)

print(f"\nI/O breakdown results:")
for key, value in io_breakdown.items():
    if isinstance(value, dict):
        print(f"{key}:")
        for sub_key, sub_value in value.items():
            print(f"  {sub_key}: {sub_value:.2f} seconds")
    else:
        print(f"{key}: {value:.2f} seconds")

Calculating I/O time breakdown...
Total I/O time per taskName:
 aggregate (write): 0.10176218799999999 (sec)
 inference (write): 2.8633e-05 (sec)
 openmm (write): 0.05025775266666667 (sec)
 training (write): 1.95473184 (sec)
 aggregate (read): 2.748506033 (sec)
 inference (read): 1.850840342 (sec)
 training (read): 11.551488481 (sec)
Total I/O time per workflow: 18.257615269666665

I/O breakdown results:
total_io_time: 18.26 seconds
total_write_time: 2.11 seconds
total_read_time: 16.15 seconds
task_io_time_adjust:
  read: 16.15 seconds
  write: 2.11 seconds


## Step 2.1: Calculate Aggregate File Size per Node

Calculate the aggregate file size per node for proper scaling.
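
The DataFrame output later in this notebook suggests how the scaling behaves: for `openmm`, `aggregateFilesizeMB` is 174.608185 on 1 node, 87.304092 on 2, and 43.652046 on 4, while `tasksPerNode` goes 12, 6, 3 (and 14 tasks on 4 nodes gives 4). The sketch below reproduces that arithmetic under the assumption of an even split across nodes and ceiling division for tasks. `per_node_share` is a hypothetical helper, not the module's function.

```python
import math


def per_node_share(stage_total_mb, parallelism, num_nodes):
    """Split a stage's total file size and task count across nodes.

    Assumes data is divided evenly across nodes and tasks per node is
    rounded up; the real helper may differ.
    """
    return {
        "numNodes": num_nodes,
        "tasksPerNode": math.ceil(parallelism / num_nodes),
        "aggregateFilesizeMB": stage_total_mb / num_nodes,
    }


# 174.608185 MB is the single-node value shown for `openmm` in this notebook.
for n in (1, 2, 4):
    print(per_node_share(174.608185, parallelism=12, num_nodes=n))
```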

In [6]:
# Calculate aggregate file size per node
print("Calculating aggregate file size per node...")
wf_df = calculate_aggregate_filesize_per_node(wf_df)

print("\nAggregate file size calculation completed.")
print(f"Updated columns: {[col for col in wf_df.columns if 'aggregateFilesizeMB' in col]}")

Calculating aggregate file size per node...

Aggregate file size calculation completed.
Updated columns: ['aggregateFilesizeMBtask', 'aggregateFilesizeMB']


In [7]:
# # Display rows whose taskName starts with "stage_"
# staged = insert_data_staging_rows(wf_df)
# print(staged[staged['taskName'].str.startswith('stage_')][['taskName', 'stageOrder', 'operation']])

## Step 2.2: Insert Data Movement Steps into the Workflow

In [8]:
# Step 2.2: Insert data staging (I/O) rows into the workflow DataFrame

# Set debug=True to see detailed output, or debug=False for silent operation
wf_df = insert_data_staging_rows(wf_df, debug=False)

print("Data staging rows inserted. New DataFrame shape:", wf_df.shape)
display(wf_df.head(10))

Data staging rows inserted. New DataFrame shape: (393, 19)


Unnamed: 0,operation,randomOffset,transferSize,aggregateFilesizeMBtask,numTasks,parallelism,totalTime,numNodesList,numNodes,tasksPerNode,trMiB,storageType,opCount,taskName,taskPID,fileName,stageOrder,prevTask,aggregateFilesizeMB
0,0,1,1066.397658,7.293896,12,12,0.026276,"[1, 2, 4]",1,12,277.589578,5,7172,openmm,190075-dlt02,stage0000_task0000.h5,0.0,,174.608185
1,0,1,1066.397658,7.293896,12,12,0.026276,"[1, 2, 4]",2,6,277.589578,5,7172,openmm,190075-dlt02,stage0000_task0000.h5,0.0,,87.304092
2,0,1,1066.397658,7.293896,12,12,0.026276,"[1, 2, 4]",4,3,277.589578,5,7172,openmm,190075-dlt02,stage0000_task0000.h5,0.0,,43.652046
3,0,1,1066.397658,7.293896,12,12,0.026276,"[1, 2, 4]",1,12,277.589578,5,7172,openmm,190075-dlt02,stage0000_task0000.dcd,0.0,,174.608185
4,0,1,1066.397658,7.293896,12,12,0.026276,"[1, 2, 4]",2,6,277.589578,5,7172,openmm,190075-dlt02,stage0000_task0000.dcd,0.0,,87.304092
5,0,1,1066.397658,7.293896,12,12,0.026276,"[1, 2, 4]",4,3,277.589578,5,7172,openmm,190075-dlt02,stage0000_task0000.dcd,0.0,,43.652046
6,0,1,1064.135603,7.274364,12,12,0.024391,"[1, 2, 4]",1,12,298.234391,5,7168,openmm,74814-dlt06,stage0000_task0011.dcd,0.0,,174.608185
7,0,1,1064.135603,7.274364,12,12,0.024391,"[1, 2, 4]",2,6,298.234391,5,7168,openmm,74814-dlt06,stage0000_task0011.dcd,0.0,,87.304092
8,0,1,1064.135603,7.274364,12,12,0.024391,"[1, 2, 4]",4,3,298.234391,5,7168,openmm,74814-dlt06,stage0000_task0011.dcd,0.0,,43.652046
9,0,1,1064.135603,7.274364,12,12,0.024391,"[1, 2, 4]",1,12,298.234391,5,7168,openmm,74814-dlt06,stage0000_task0011.h5,0.0,,174.608185


In [9]:
# Check that stage_out rows were added for the selected tasks
for checktask in ['inference', 'aggregate']:
    new_rows = wf_df[wf_df['taskName'] == f'stage_out-{checktask}']
    if not new_rows.empty:
        print(f"\nSample of stage_out-{checktask} rows:")
        print(new_rows[['taskName', 'fileName', 'stageOrder', 'operation']].head(3))
        print(f"Total stage_out-{checktask} rows: {len(new_rows)}")
    else:
        print(f"\nNo stage_out-{checktask} tasks found in the DataFrame")


Sample of stage_out-inference rows:
                taskName                                           fileName  \
363  stage_out-inference  stage0000_task0000.h5,stage0000_task0005.h5,st...   
364  stage_out-inference  stage0000_task0000.h5,stage0000_task0005.h5,st...   
365  stage_out-inference  stage0000_task0000.h5,stage0000_task0005.h5,st...   

     stageOrder operation  
363         2.5        cp  
364         2.5        cp  
365         2.5        cp  
Total stage_out-inference rows: 30

Sample of stage_out-aggregate rows:
                taskName                                           fileName  \
273  stage_out-aggregate  stage0000_task0007.h5,stage0000_task0002.h5,st...   
274  stage_out-aggregate  stage0000_task0007.h5,stage0000_task0002.h5,st...   
275  stage_out-aggregate  stage0000_task0007.h5,stage0000_task0002.h5,st...   

     stageOrder operation  
273         1.5        cp  
274         1.5        cp  
275         1.5        cp  
Total stage_out-aggregate rows: 2
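
The output above shows `stage_out` rows with a `cp` operation at half-step stage orders: 1.5 after `aggregate` and 2.5 after `inference`. A minimal sketch of how such rows could be interleaved follows. The function name `add_staging_rows`, the `stage_in` offset of -0.5, and the default storage label are assumptions for illustration, not the behavior of `insert_data_staging_rows`.

```python
def add_staging_rows(tasks, storage_type="beegfs-ssd"):
    """Return task rows with stage_in/stage_out copy rows interleaved.

    `tasks` is a list of dicts with "taskName" and "stageOrder".
    stage_out rows use stageOrder + 0.5 (matching the output above);
    stage_in rows at stageOrder - 0.5 are an assumption.
    """
    rows = []
    for task in tasks:
        name, order = task["taskName"], task["stageOrder"]
        rows.append({"taskName": f"stage_in-{name}", "stageOrder": order - 0.5,
                     "operation": "cp", "storageType": storage_type})
        rows.append(dict(task))
        rows.append({"taskName": f"stage_out-{name}", "stageOrder": order + 0.5,
                     "operation": "cp", "storageType": storage_type})
    # sorted() is stable, so rows sharing a stageOrder keep insertion order.
    return sorted(rows, key=lambda r: r["stageOrder"])


workflow = [
    {"taskName": "aggregate", "stageOrder": 1, "operation": 1},
    {"taskName": "inference", "stageOrder": 2, "operation": 1},
]
staged = add_staging_rows(workflow)
for row in staged:
    print(row["stageOrder"], row["taskName"], row["operation"])
```

Fractional stage orders keep the copy steps sortable between compute stages without renumbering the original tasks.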

In [10]:
# Display first few rows
print("\nFirst few rows of workflow data:")
print(wf_df.head())
print(wf_df.columns)
print(wf_df[wf_df['operation'] == 0]['taskName'].unique())


First few rows of workflow data:
  operation  randomOffset  transferSize  aggregateFilesizeMBtask  numTasks  \
0         0             1   1066.397658                 7.293896        12   
1         0             1   1066.397658                 7.293896        12   
2         0             1   1066.397658                 7.293896        12   
3         0             1   1066.397658                 7.293896        12   
4         0             1   1066.397658                 7.293896        12   

   parallelism totalTime numNodesList  numNodes  tasksPerNode       trMiB  \
0           12  0.026276    [1, 2, 4]         1            12  277.589578   
1           12  0.026276    [1, 2, 4]         2             6  277.589578   
2           12  0.026276    [1, 2, 4]         4             3  277.589578   
3           12  0.026276    [1, 2, 4]         1            12  277.589578   
4           12  0.026276    [1, 2, 4]         2             6  277.589578   

  storageType  opCount taskName   

## Step 3: Load IOR Benchmark Data and Estimate Transfer Rates

Load the IOR benchmark data and estimate transfer rates for different storage configurations.
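
To illustrate what interpolating over four benchmark dimensions can mean, here is a sketch of multilinear interpolation on a regular 4D grid in plain Python. The choice of axes (transfer size, aggregate file size, node count, tasks per node), the grid values, and the helper name `multilinear_interp` are all assumptions; `estimate_transfer_rates_for_workflow` may use different dimensions or a scattered-data method.

```python
from bisect import bisect_right
from itertools import product


def multilinear_interp(axes, values, point):
    """Multilinear interpolation on a regular N-D grid (N = 4 here).

    axes   -- list of sorted coordinate lists, one per dimension (>= 2 points)
    values -- dict mapping index tuples (i, j, k, l) to measured rates
    point  -- query coordinates; clamped to the grid bounds
    """
    lows, fracs = [], []
    for axis, x in zip(axes, point):
        x = min(max(x, axis[0]), axis[-1])  # clamp: no extrapolation
        i = min(bisect_right(axis, x) - 1, len(axis) - 2)
        lows.append(i)
        fracs.append((x - axis[i]) / (axis[i + 1] - axis[i]))
    result = 0.0
    # Blend the 2**N surrounding grid corners, weighted by proximity.
    for corner in product((0, 1), repeat=len(axes)):
        weight = 1.0
        for bit, frac in zip(corner, fracs):
            weight *= frac if bit else 1.0 - frac
        index = tuple(lo + bit for lo, bit in zip(lows, corner))
        result += weight * values[index]
    return result


# Hypothetical axes; the real benchmark dimensions may differ.
axes = [
    [64.0, 1024.0],  # transfer size
    [32.0, 256.0],   # aggregate file size (MB)
    [1.0, 4.0],      # number of nodes
    [1.0, 12.0],     # tasks per node
]


def rate(t, s, n, p):
    """Synthetic 'benchmark' rate, linear in each coordinate."""
    return 100.0 + 0.5 * t + 2.0 * s + 10.0 * n - 3.0 * p


values = {
    idx: rate(*(axis[i] for axis, i in zip(axes, idx)))
    for idx in product(range(2), repeat=4)
}
query = (512.0, 87.3, 2.0, 6.0)
print(multilinear_interp(axes, values, query))
```

Because the synthetic rate is linear in every coordinate, the interpolated value should match `rate(*query)` up to floating-point error, which makes the sketch easy to sanity-check.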

In [11]:
# Load IOR benchmark data
print("Loading IOR benchmark data...")
if os.path.exists(IOR_DATA_PATH):
    df_ior = pd.read_csv(IOR_DATA_PATH)
    print(f"Loaded {len(df_ior)} IOR benchmark records")

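    # Estimate transfer rates
    print("\nEstimating transfer rates...")
    # Staging rows (cp/scp) may use parallelism values that are not in the
    # configured list, so merge them in before interpolating.
    cp_scp_parallelism = set(wf_df.loc[wf_df['operation'].isin(['cp', 'scp']), 'parallelism'].unique())
    ALLOWED_PARALLELISM = sorted(set(ALLOWED_PARALLELISM).union(cp_scp_parallelism))

    # Estimate rates for every storage type and allowed parallelism level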
    wf_df = estimate_transfer_rates_for_workflow(
        wf_df, df_ior, STORAGE_LIST, ALLOWED_PARALLELISM, multi_nodes=True, debug=False
    )
    # wf_df = estimate_transfer_rates_for_workflow(wf_df, df_ior, STORAGE_LIST, ALLOWED_PARALLELISM)
    print("Transfer rate estimation completed")
    
    # Show estimated transfer rate columns
    estimated_cols = [col for col in wf_df.columns if col.startswith('estimated_trMiB_')]
    print(f"\nEstimated transfer rate columns: {len(estimated_cols)}")
    print(f"Sample columns: {estimated_cols}")
else:
    print(f"Warning: IOR data file not found at {IOR_DATA_PATH}")
    print("Skipping transfer rate estimation...")
    df_ior = pd.DataFrame()

# Save the estimated transfer rates to a new CSV file
os.makedirs("./analysis_data", exist_ok=True)
wf_df.to_csv(f"./analysis_data/{WORKFLOW_NAME}_estimated_transfer_rates.csv", index=True)
print(f"Saved estimated transfer rates to: ./analysis_data/{WORKFLOW_NAME}_estimated_transfer_rates.csv")

Loading IOR benchmark data...
Loaded 21916 IOR benchmark records

Estimating transfer rates...
Found 126 stage_in/stage_out tasks:
  Task: stage_in-aggregate, Operation: cp, Storage: beegfs-ssd
  Task: stage_in-aggregate, Operation: cp, Storage: beegfs-ssd
  Task: stage_in-aggregate, Operation: cp, Storage: beegfs-ssd
  Task: stage_in-aggregate, Operation: cp, Storage: beegfs-tmpfs
  Task: stage_in-aggregate, Operation: cp, Storage: beegfs-tmpfs
  Task: stage_in-aggregate, Operation: cp, Storage: beegfs-tmpfs
  Task: stage_in-aggregate, Operation: scp, Storage: ssd-ssd
  Task: stage_in-aggregate, Operation: scp, Storage: ssd-ssd
  Task: stage_in-aggregate, Operation: scp, Storage: ssd-ssd
  Task: stage_in-aggregate, Operation: scp, Storage: tmpfs-tmpfs
  Task: stage_in-aggregate, Operation: scp, Storage: tmpfs-tmpfs
  Task: stage_in-aggregate, Operation: scp, Storage: tmpfs-tmpfs
  Task: stage_in-training, Operation: cp, Storage: beegfs-ssd
  Task: stage_in-training, Operation: cp, Sto

Task[openmm] Storage[beegfs] Parallelism[6] aggregateFilesizeMB[87.30409240722656] -> estimated_trMiB_beegfs_6p = 3439.5747231786645
Task[openmm] Storage[tmpfs] Parallelism[6] aggregateFilesizeMB[87.30409240722656] -> estimated_trMiB_tmpfs_6p = 16896.208198187138
Task[openmm] Storage[localssd] Parallelism[3] aggregateFilesizeMB[43.65204620361328] -> estimated_trMiB_localssd_3p = 2004.4998460299155
Task[openmm] Storage[beegfs] Parallelism[3] aggregateFilesizeMB[43.65204620361328] -> estimated_trMiB_beegfs_3p = 4751.5405899503285
Task[openmm] Storage[tmpfs] Parallelism[3] aggregateFilesizeMB[43.65204620361328] -> estimated_trMiB_tmpfs_3p = 18191.426045347114
Task[openmm] Storage[localssd] Parallelism[12] aggregateFilesizeMB[174.60818481445312] -> estimated_trMiB_localssd_12p = 908.8797582463574
Task[openmm] Storage[beegfs] Parallelism[12] aggregateFilesizeMB[174.60818481445312] -> estimated_trMiB_beegfs_12p = 2766.716567513573
Task[openmm] Storage[tmpfs] Parallelism[12] aggregateFilesize

In [12]:
# Show all rows
pd.set_option('display.max_rows', None)
# Show all columns
pd.set_option('display.max_columns', None)
# Show full width
pd.set_option('display.width', None)
# Don't truncate column content
pd.set_option('display.max_colwidth', None)

# print(wf_df.head(10))

# Check that stage_out rows were added for the selected tasks
for checktask in ['inference', 'aggregate']:
    new_rows = wf_df[wf_df['taskName'] == f'stage_out-{checktask}']
    if not new_rows.empty:
        print(f"\nSample of stage_out-{checktask} rows:")
        print(new_rows.head(3))
        print(f"Total stage_out-{checktask} rows: {len(new_rows)}")
    else:
        print(f"\nNo stage_out-{checktask} tasks found in the DataFrame")


Sample of stage_out-inference rows:
    operation  randomOffset  transferSize  aggregateFilesizeMBtask  numTasks  \
363        cp             0        4096.0                      NaN        14   
364        cp             0        4096.0                      NaN        14   
365        cp             0        4096.0                      NaN        14   

     parallelism totalTime numNodesList  numNodes  tasksPerNode trMiB  \
363           14              [1, 2, 4]         1            14         
364           14              [1, 2, 4]         2             7         
365           14              [1, 2, 4]         4             4         

    storageType  opCount             taskName taskPID  \
363  beegfs-ssd       14  stage_out-inference           
364  beegfs-ssd       14  stage_out-inference           
365  beegfs-ssd       14  stage_out-inference           

                                                                                                                        

## Step 5: Calculate SPM Values

Calculate Storage Performance Metrics (SPM) for the workflow.

In [13]:
# Calculate SPM values
print("Calculating SPM values...")
# Set debug=True for verbose output, debug=False for minimal output
spm_results = calculate_spm_for_workflow(wf_df, debug=False)

print(f"\nSPM calculation completed:")
print(f"- Producer-consumer pairs: {len(spm_results)}")
for pair in spm_results.keys():
    print(f"  - {pair}")

Calculating SPM values...

SPM calculation completed:
- Producer-consumer pairs: 10
  - openmm:stage_out-openmm
  - openmm:aggregate
  - openmm:inference
  - openmm:training
  - stage_in-aggregate:aggregate
  - stage_in-training:training
  - aggregate:stage_out-aggregate
  - training:stage_out-training
  - stage_in-inference:inference
  - inference:stage_out-inference


In [14]:
# # Print all SPM results by producer:consumer pair

# # Print weighted SPM values for debugging
# for pair, data in spm_results.items():
#     print(f"\nProducer-Consumer Pair: {pair}")
#     print("SPM:")
#     for storage_n, spm_values in data['SPM'].items():
#         print(f"  {storage_n}: {spm_values[0:10]} ...")
#     print("estT_prod:")
#     for storage_n, estT_prod_values in data['estT_prod'].items():
#         print(f"  {storage_n}: {estT_prod_values[0:10]} ...")
#     print("estT_cons:")
#     for storage_n, estT_cons_values in data['estT_cons'].items():
#         print(f"  {storage_n}: {estT_cons_values[0:10]} ...")
#     print("dsize_prod:")
#     for storage_n, dsize_prod_values in data['dsize_prod'].items():
#         print(f"  {storage_n}: {dsize_prod_values[0:10]} ...")
#     print("dsize_cons:")
#     for storage_n, dsize_cons_values in data['dsize_cons'].items():
#         print(f"  {storage_n}: {dsize_cons_values[0:10]} ...")

## Step 6: Filter Storage Options and Select Best Configuration

Filter storage options and select the best storage configuration for each producer-consumer pair.
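
The selection call below passes `baseline=0`, and its output reports both a storage type (for example `beegfs-ssd`) and a full configuration label (for example `beegfs-ssd_12_24p`). One way such a selection could work is sketched here: average each configuration's rank values, pick the one closest to the baseline, and split the storage name off the label. The helper `pick_best`, the label layout `<storage>_<n>_<m>p`, and the sample values are assumptions, not the module's implementation.

```python
from statistics import mean


def pick_best(rank_by_config, baseline=0.0):
    """Return the config whose averaged rank is closest to `baseline`."""
    averaged = {label: mean(vals) for label, vals in rank_by_config.items()}
    best_label = min(averaged, key=lambda label: abs(averaged[label] - baseline))
    # Assumes labels look like "<storage>_<n>_<m>p"; the storage name may
    # itself contain hyphens but no underscores.
    storage = best_label.rsplit("_", 2)[0]
    return {
        "best_storage_type": storage,
        "best_parallelism": best_label,
        "best_rank": averaged[best_label],
    }


# Illustrative values only.
ranks = {
    "beegfs-ssd_12_24p": [0.8, 0.7],
    "beegfs-tmpfs_12_6p": [1.1, 1.2],
    "tmpfs_12_1p": [3.0],
}
best = pick_best(ranks)
print(best)
```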

In [15]:
# Filter storage options
print("Filtering storage options...")
filtered_spm_results = filter_storage_options(spm_results, WORKFLOW_NAME)

# Select best storage and parallelism
print("\nSelecting best storage and parallelism...")
best_results = select_best_storage_and_parallelism(spm_results, baseline=0)

print("\nBest storage configurations:")
for pair, config in best_results.items():
    print(f"{pair}:")
    print(f"  Best storage: {config['best_storage_type']}")
    print(f"  Best parallelism: {config['best_parallelism']}")
    print(f"  Best rank: {config['best_rank']:.2f}")

Filtering storage options...

Selecting best storage and parallelism...
Selecting best storage configurations for 10 pairs...
Selected best storage configurations for 10 pairs.

Best storage configurations:
openmm:stage_out-openmm:
  Best storage: beegfs-ssd
  Best parallelism: beegfs-ssd_12_24p
  Best rank: 0.77
openmm:aggregate:
  Best storage: tmpfs
  Best parallelism: tmpfs_12_1p
  Best rank: 14467.22
openmm:inference:
  Best storage: tmpfs
  Best parallelism: tmpfs_12_1p
  Best rank: 18113.89
openmm:training:
  Best storage: tmpfs
  Best parallelism: tmpfs_12_1p
  Best rank: 48.88
stage_in-aggregate:aggregate:
  Best storage: tmpfs
  Best parallelism: tmpfs_14_1p
  Best rank: 14418.33
stage_in-training:training:
  Best storage: tmpfs
  Best parallelism: tmpfs_37_1p
  Best rank: 0.00
aggregate:stage_out-aggregate:
  Best storage: beegfs-ssd
  Best parallelism: beegfs-ssd_1_1p
  Best rank: 0.01
training:stage_out-training:
  Best storage: beegfs-ssd
  Best parallelism: beegfs-ssd_1_

## Step 7: Display Top Results

Display the top storage configurations based on rank values.
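
The output of this step is headed "Averaged SPM Values Closest to Baseline = 0", which suggests a ranking by distance from the baseline. A small sketch of such a top-N listing follows; `top_configs` and the sample values are hypothetical, and the real `display_top_sorted_averaged_rank` may order ties differently.

```python
from statistics import mean


def top_configs(rank_by_config, top_n=3, baseline=0.0):
    """Return (label, averaged_rank) pairs sorted by distance to `baseline`."""
    averaged = [(label, mean(vals)) for label, vals in rank_by_config.items()]
    # list.sort is stable, so tied configurations keep their input order.
    averaged.sort(key=lambda item: abs(item[1] - baseline))
    return averaged[:top_n]


# Illustrative values only.
ranks = {
    "beegfs-tmpfs_12_6p": [1.0, 1.2],
    "beegfs-ssd_12_24p": [0.75],
    "beegfs-ssd_6_24p": [0.75],
    "beegfs-tmpfs_12_12p": [9.5],
}
for position, (label, value) in enumerate(top_configs(ranks), start=1):
    print(f"- Rank {position}: {label} with averaged rank = {value}")
```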

In [16]:
# Display top results
print("Displaying top results...")
display_top_sorted_averaged_rank(spm_results, top_n=20)

Displaying top results...
Top 20 Averaged SPM Values Closest to Baseline = 0:

Producer: openmm, Consumer: stage_out-openmm
- Rank 1: beegfs-ssd_12_24p with Averaged rank = 0.7714285268133748
- Rank 2: beegfs-ssd_6_24p with Averaged rank = 0.7714285268133748
- Rank 3: beegfs-ssd_3_24p with Averaged rank = 0.7714285268133748
- Rank 4: beegfs-ssd_12_12p with Averaged rank = 0.8008579032661048
- Rank 5: beegfs-ssd_6_12p with Averaged rank = 0.8008579032661048
- Rank 6: beegfs-ssd_3_12p with Averaged rank = 0.8008579032661048
- Rank 7: beegfs-ssd_12_6p with Averaged rank = 0.8328524434130604
- Rank 8: beegfs-ssd_6_6p with Averaged rank = 0.8328524434130604
- Rank 9: beegfs-ssd_3_6p with Averaged rank = 0.8328524434130604
- Rank 10: beegfs-tmpfs_12_6p with Averaged rank = 1.1039801233045914
- Rank 11: beegfs-tmpfs_6_6p with Averaged rank = 1.1039801233045914
- Rank 12: beegfs-tmpfs_3_6p with Averaged rank = 1.1039801233045914
- Rank 13: beegfs-tmpfs_12_12p with Averaged rank = 9.56555576402

## Step 8: Generate Visualizations

Generate comprehensive visualizations of the analysis results.

In [17]:
# # Generate visualizations
# print("Generating visualizations...")
# plot_all_visualizations(wf_df, best_results, io_breakdown['task_io_time_adjust'])
# print("Visualizations completed!")

## Step 9: Save Results

Save the analysis results to files for future reference.

In [18]:
#FIXME Step 9: Debug and save filtered SPM results
import os
import pandas as pd
import numpy as np
from modules.workflow_results_exporter import extract_producer_consumer_results, print_storage_analysis

# Debug: Check what's in filtered_spm_results
print("=== Debugging filtered_spm_results ===")
print(f"Type: {type(filtered_spm_results)}")
print(f"Length: {len(filtered_spm_results) if filtered_spm_results else 'None/Empty'}")

if filtered_spm_results:
    print(f"Keys: {list(filtered_spm_results.keys())[:5]}...")  # First 5 keys
    
    # Examine first item structure
    first_key = list(filtered_spm_results.keys())[0]
    first_value = filtered_spm_results[first_key]
    print(f"\nFirst item - Key: '{first_key}'")
    print(f"  Type: {type(first_value)}")
    if isinstance(first_value, dict):
        print(f"  Keys: {list(first_value.keys())}")
        for k, v in first_value.items():
            print(f"    {k}: {type(v)} = {v}")

# Try to extract results
print("\n=== Attempting to Extract Results ===")
try:
    results_df = extract_producer_consumer_results(filtered_spm_results, wf_df)
    print(f"Extracted {len(results_df)} rows")
    
    if not results_df.empty:
        print("Sample data:")
        print(results_df.head())
        
        # Save to CSV
        output_dir = "workflow_spm_results"
        os.makedirs(output_dir, exist_ok=True)
        workflow_name = WORKFLOW_NAME  # set in the Configuration cell
        csv_filename = f"{workflow_name}_filtered_spm_results.csv"
        csv_path = os.path.join(output_dir, csv_filename)
        
        results_df.to_csv(csv_path, index=False)
        print(f"✓ Saved to: {csv_path}")
        
        # Print storage analysis
        print_storage_analysis(results_df)
        
    else:
        print("Error: Extracted DataFrame is empty - trying alternative method...")
        
        # Alternative extraction method
        results_data = []
        for pair, data in filtered_spm_results.items():
            if isinstance(data, dict):
                # Try to find storage and SPM information
                storage_info = None
                spm_value = None
                
                # Look for storage-related keys
                for key in data.keys():
                    if 'storage' in key.lower() or 'type' in key.lower():
                        storage_info = data[key]
                    if 'spm' in key.lower() or 'rank' in key.lower():
                        spm_value = data[key]
                
                if storage_info:
                    producer, consumer = pair.split(':') if ':' in pair else ('unknown', 'unknown')
                    results_data.append({
                        'producer': producer,
                        'producerStage': -1,
                        'consumer': consumer,
                        'consumerStage': -1,
                        'prodParallelism': np.nan,
                        'consParallelism': np.nan,
                        'p-c-Storage': storage_info,
                        'p-c-SPM': spm_value if spm_value is not None else np.nan
                    })
        
        if results_data:
            alt_df = pd.DataFrame(results_data)
            
            # Fill stage information
            task_stage_mapping = {}
            for _, row in wf_df.iterrows():
                task_name = row['taskName']
                stage_order = row['stageOrder']
                if task_name not in task_stage_mapping:
                    task_stage_mapping[task_name] = stage_order
            
            for i, row in alt_df.iterrows():
                if row['producer'] in task_stage_mapping:
                    alt_df.at[i, 'producerStage'] = task_stage_mapping[row['producer']]
                if row['consumer'] in task_stage_mapping:
                    alt_df.at[i, 'consumerStage'] = task_stage_mapping[row['consumer']]
            
            # Save alternative results
            output_dir = "workflow_spm_results"
            os.makedirs(output_dir, exist_ok=True)
            workflow_name = WORKFLOW_NAME  # set in the Configuration cell
            csv_filename = f"{workflow_name}_filtered_spm_results_alt.csv"
            csv_path = os.path.join(output_dir, csv_filename)
            
            alt_df.to_csv(csv_path, index=False)
            print(f"✓ Saved alternative results to: {csv_path}")
            print(f"Alternative DataFrame shape: {alt_df.shape}")
            print(alt_df.head())
        else:
            print("Error: No data could be extracted with alternative method")
            
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

=== Debugging filtered_spm_results ===
Type: <class 'dict'>
Length: 10
Keys: ['openmm:stage_out-openmm', 'openmm:aggregate', 'openmm:inference', 'openmm:training', 'stage_in-aggregate:aggregate']...

First item - Key: 'openmm:stage_out-openmm'
  Type: <class 'dict'>
  Keys: ['SPM', 'estT_prod', 'estT_cons', 'rank', 'par_prod', 'par_cons', 'dsize_prod', 'dsize_cons']
    SPM: <class 'dict'> = {'beegfs-tmpfs_12_24p': [35.43370223126123], 'beegfs-tmpfs_12_12p': [47.290441065145096], 'beegfs-tmpfs_12_6p': [409.7531663521236], 'beegfs-tmpfs_6_24p': [28.500195938260955], 'beegfs-tmpfs_6_12p': [38.036861843195645], 'beegfs-tmpfs_6_6p': [329.574523461888], 'beegfs-tmpfs_3_24p': [20.63094964732033], 'beegfs-tmpfs_3_12p': [27.534427592322487], 'beegfs-tmpfs_3_6p': [238.57504044222011], 'beegfs-ssd_12_24p': [586.3917853575819], 'beegfs-ssd_12_12p': [564.8434625781006], 'beegfs-ssd_12_6p': [543.1446526831071], 'beegfs-ssd_6_24p': [471.6492978973394], 'beegfs-ssd_6_12p': [454.3174532781143], 'beegf

## Summary

The workflow analysis has been completed successfully. The results include:

1. **Workflow Data**: Processed datalife statistics organized in a DataFrame
2. **I/O Breakdown**: Time analysis for each task in the workflow
3. **Transfer Rate Estimates**: Estimated transfer rates for different storage configurations
4. **SPM Values**: Storage Performance Metrics for producer-consumer pairs
5. **Best Configurations**: Recommended storage and parallelism settings
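6. **Visualizations**: Plots and charts from `plot_all_visualizations` (the call is commented out in Step 8, so none are produced in this run)
7. **Saved Results**: CSV files in `./analysis_data` and `workflow_spm_results` for future reference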

### Key Findings

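For the `ddmd_4n_l` run shown above:
- Reads dominate I/O time: 16.15 s of the 18.26 s total, with `training` reads alone at 11.55 s
- `tmpfs` ranks best for the `openmm` pairs with `aggregate`, `inference`, and `training`, and for the `stage_in` pairs listed
- `beegfs-ssd` ranks best for the `stage_out` pairs listed
- The best-configuration output is truncated, so results for the remaining pairs are not shown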

### Next Steps

You can:
- Analyze different workflows by changing the `WORKFLOW_NAME` variable
- Modify the analysis parameters in the configuration
- Use the saved results for further analysis or comparison
- Run the analysis programmatically using the `workflow_analysis_main.py` script