# Assignment 1: Trace Analysis with HTA

Use Holistic Trace Analysis to compare training setups:
- Baseline (Single GPU)
- DDP (DistributedDataParallel)
- FSDP with different sharding strategies

Key analyses:
1. Temporal breakdown (compute vs communication vs idle)
2. Trace diff (exact operation differences)
3. Memory usage patterns
4. Communication overhead

In [1]:
!pip install HolisticTraceAnalysis

Defaulting to user installation because normal site-packages is not writeable
Collecting HolisticTraceAnalysis
  Downloading holistictraceanalysis-0.5.0-py3-none-any.whl (371 kB)
Collecting plotly>=5.11.0
  Downloading plotly-6.5.0-py3-none-any.whl (9.9 MB)
Collecting jupyterlab>=3.5.1
  Downloading jupyterlab-4.5.1-py3-none-any.whl (12.4 MB)


In [2]:
# Install HTA if needed
# !pip install HolisticTraceAnalysis

import pandas as pd
import matplotlib.pyplot as plt
from hta.trace_analysis import TraceAnalysis
from hta.trace_diff import TraceDiff, LabeledTrace, DeviceType

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/lib/python3/dist-packages/traitlets/config/application.py", line 846, in launch_instance
    app.start()
  File "/usr/lib/python3/dist-packages/ipykernel/kernelapp.py", line 677, in start
    s

AttributeError: _ARRAY_API not found

ImportError: numpy.core.multiarray failed to import
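
The error above reports that a compiled module was built against NumPy 1.x but is running under NumPy 2.2.6, and suggests downgrading to `numpy<2`. A minimal sketch of a check that can run before the imports is shown below. The helper name `needs_numpy_downgrade` is hypothetical, and the printed `pip` command only repeats the suggestion from the error message.

```python
from importlib import metadata


def needs_numpy_downgrade(version: str) -> bool:
    """Return True when the given NumPy version string is 2.x or newer."""
    major = int(version.split(".")[0])
    return major >= 2


# Read the installed version from package metadata, without importing NumPy.
try:
    installed = metadata.version("numpy")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("NumPy is not installed")
elif needs_numpy_downgrade(installed):
    print(f"NumPy {installed} detected; consider: pip install 'numpy<2'")
else:
    print(f"NumPy {installed} is a 1.x release")
```

After changing the installed NumPy version, the kernel usually needs a restart before the import cell is run again.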

## 1. Load Traces

Load all the traces generated from training runs.

In [None]:
# Define trace directories
trace_dirs = {
    'baseline': 'outputs/traces/baseline/',
    'ddp': 'outputs/traces/ddp/',
    'fsdp_full': 'outputs/traces/fsdp_full_shard/',
    'fsdp_grad': 'outputs/traces/fsdp_shard_grad_op/'
}

# Load traces for temporal analysis
traces = {}
for name, trace_dir in trace_dirs.items():
    print(f"Loading {name}...")
    traces[name] = TraceAnalysis(trace_dir=trace_dir)

print("\n✅ All traces loaded!")
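
Loading fails late if one of the trace directories is missing or empty, so it can help to check the paths first. The sketch below assumes the same name-to-directory layout as `trace_dirs`. The helper `find_missing_trace_dirs` is hypothetical and is not part of HTA.

```python
from pathlib import Path


def find_missing_trace_dirs(trace_dirs):
    """Return the names whose directory is missing or contains no files."""
    missing = []
    for name, trace_dir in trace_dirs.items():
        path = Path(trace_dir)
        # A usable trace directory must exist and hold at least one file.
        if not path.is_dir() or not any(p.is_file() for p in path.iterdir()):
            missing.append(name)
    return missing


# Same layout as the trace_dirs dictionary used in the loading cell
example_dirs = {
    'baseline': 'outputs/traces/baseline/',
    'ddp': 'outputs/traces/ddp/',
}
print("Missing or empty:", find_missing_trace_dirs(example_dirs))
```

An empty list means every directory exists and contains at least one file. It does not validate the trace contents themselves.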

## 2. Temporal Breakdown Analysis

See how GPU time is spent in each setup.

In [None]:
# Get temporal breakdown for all setups
breakdowns = {}
for name, trace in traces.items():
    print(f"\n📊 {name.upper()} Temporal Breakdown:")
    breakdown = trace.get_temporal_breakdown(visualize=False)
    breakdowns[name] = breakdown
    print(breakdown)

In [None]:
# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, (name, breakdown) in enumerate(breakdowns.items()):
    # Plot the first row of the breakdown (one row per rank)
    data = breakdown[['compute_time(us)', 'non_compute_time(us)', 'idle_time(us)']].iloc[0]
    
    axes[idx].pie(data, labels=['Compute', 'Non-Compute', 'Idle'], autopct='%1.1f%%')
    axes[idx].set_title(f'{name.upper()} Time Breakdown')

plt.tight_layout()
plt.show()
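
The pie charts show shares rather than absolute times. A minimal sketch of the same conversion in plain Python follows, assuming a row that carries the three microsecond columns used in the plotting cell. The helper `time_shares` is hypothetical, and the numbers are made up for illustration.

```python
def time_shares(row):
    """Convert compute / non-compute / idle microseconds into percentages.

    `row` can be a dict or a pandas Series indexed by the column names
    used in the plotting cell.
    """
    keys = ['compute_time(us)', 'non_compute_time(us)', 'idle_time(us)']
    total = sum(row[k] for k in keys)
    if total == 0:
        # Avoid dividing by zero for an empty trace.
        return {k: 0.0 for k in keys}
    return {k: 100.0 * row[k] / total for k in keys}


# Made-up numbers for illustration only, not measured results
example = {'compute_time(us)': 600, 'non_compute_time(us)': 100, 'idle_time(us)': 300}
shares = time_shares(example)
print(shares)
```

With the notebook's data, a call such as `time_shares(breakdowns['ddp'].iloc[0])` would give the shares for the first rank of the DDP run.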

## 3. Communication-Computation Overlap Analysis

Analyze how well communication overlaps with computation in distributed setups.
Good overlap means the GPU is computing while communication happens in the background,
improving overall efficiency.

In [None]:
# Compute communication-computation overlap for distributed setups
# Skip baseline as it has no communication
for name in ['ddp', 'fsdp_full', 'fsdp_grad']:
    if name in traces:
        print(f"\n📊 {name.upper()} Communication-Computation Overlap:")
        overlap_df = traces[name].get_comm_comp_overlap(visualize=False)
        if overlap_df is not None and len(overlap_df) > 0:
            print(overlap_df)
        else:
            print("⚠️  No overlap data available")

### Understanding Overlap Metrics

Key concepts:
- **comm_exposed**: Communication time that is NOT overlapped with computation (bad)
- **total_comm_duration**: Total time spent in communication operations
- **Overlap %**: Percentage of communication that is hidden by computation (good)
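
The three quantities are related: the overlap percentage is the hidden part of the communication time divided by the total. The sketch below assumes both inputs are available in the same time unit. The function name `overlap_percentage` is hypothetical, and the values are not measured results.

```python
def overlap_percentage(total_comm_duration, comm_exposed):
    """Share of communication time hidden behind computation, in percent."""
    if total_comm_duration <= 0:
        # No communication at all, for example the single-GPU baseline.
        return 0.0
    hidden = total_comm_duration - comm_exposed
    return 100.0 * hidden / total_comm_duration


# Hypothetical values: 200 us of communication, 50 us of it exposed
print(overlap_percentage(total_comm_duration=200.0, comm_exposed=50.0))  # 75.0
```

A value near 100 means almost all communication runs while the GPU is computing. A value near 0 means the GPU mostly waits on communication.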

## 4. Prepare LabeledTrace Objects for TraceDiff

TraceDiff requires LabeledTrace objects and specific rank/iteration numbers.

In [None]:
# Create LabeledTrace objects for each setup
labeled_traces = {
    'baseline': LabeledTrace(label="Baseline", trace_dir='outputs/traces/baseline/'),
    'ddp': LabeledTrace(label="DDP", trace_dir='outputs/traces/ddp/'),
    'fsdp_full': LabeledTrace(label="FSDP_Full", trace_dir='outputs/traces/fsdp_full_shard/'),
    'fsdp_grad': LabeledTrace(label="FSDP_Grad", trace_dir='outputs/traces/fsdp_shard_grad_op/')
}

# Check available ranks and iterations for each trace
for name, lt in labeled_traces.items():
    print(f"\n{name}:")
    print(f"  Ranks: {lt.ranks()}")
    print(f"  Iterations: {lt.iterations()[:5]}...")  # Show first 5 iterations
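
Since the comparisons use the same iteration number for both traces, it is useful to know which iterations every trace contains. A minimal sketch, assuming each trace provides a list of integer iteration numbers. The helper `common_iterations` is hypothetical, and the lists are made up.

```python
def common_iterations(iterations_by_trace):
    """Return the sorted iteration numbers present in every trace."""
    iteration_sets = [set(its) for its in iterations_by_trace.values()]
    if not iteration_sets:
        return []
    return sorted(set.intersection(*iteration_sets))


# Hypothetical iteration lists, not taken from real traces
example = {
    'baseline': [4, 5, 6, 7],
    'ddp': [4, 5, 6],
    'fsdp_full': [5, 6, 7],
}
print(common_iterations(example))  # [5, 6]
```

With the notebook's objects, the input could be built as `{name: lt.iterations() for name, lt in labeled_traces.items()}`.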

## 5. Compare Two Setups (Trace Diff)

Use TraceDiff to compare operator counts and durations between setups.

In [None]:
def compare_two_setups(trace1, trace2, name1, name2, rank1=0, rank2=0, iteration=0, device_type=DeviceType.GPU):
    """
    Compare two training setups using HTA TraceDiff.
    
    Args:
        trace1, trace2: LabeledTrace objects
        name1, name2: Names for display
        rank1, rank2: Rank numbers for each trace
        iteration: Iteration number to compare
        device_type: DeviceType.CPU, DeviceType.GPU, or DeviceType.ALL
    """
    print(f"\n{'='*60}")
    print(f"Comparing: {name1} vs {name2}")
    print(f"Rank: {rank1} vs {rank2}, Iteration: {iteration}, Device: {device_type.name}")
    print(f"{'='*60}")
    
    try:
        # Compare traces using class method
        df_comp = TraceDiff.compare_traces(
            trace1, trace2, 
            rank1, rank2, 
            iteration, iteration, 
            device_type
        )
        
        if df_comp is None or len(df_comp) == 0:
            print("\n⚠️  No operators found to compare")
            return None
        
        print(f"\nFound {len(df_comp)} operators")
        print(f"\nTop 10 operators by duration difference:")
        print(df_comp.nlargest(10, 'diff_duration')[['diff_duration', 'diff_counts']])
        
        # Show operators unique to trace2 (added operations)
        col_name = f"{trace1.label}_counts"
        if col_name in df_comp.columns:
            added_ops = df_comp[df_comp[col_name] == 0]
            if len(added_ops) > 0:
                print(f"\n🆕 Operations added in {name2}: {len(added_ops)}")
                print(added_ops.head(10))
        
        # Get ops that were removed/added
        ops_diff = TraceDiff.ops_diff(
            trace1, trace2,
            rank1, rank2,
            iteration, iteration,
            device_type
        )
        
        if ops_diff and 'added' in ops_diff:
            # Keep only added operators whose names look like collectives
            comm_keywords = ['nccl', 'allreduce', 'allgather', 'reduce_scatter', 'broadcast']
            comm_ops = [op for op in ops_diff['added']
                        if any(k in op.lower() for k in comm_keywords)]
            if comm_ops:
                print("\n📡 Communication/collective operations added:")
                for op in comm_ops[:10]:
                    print(f"  - {op}")
        
        return df_comp
        
    except Exception as e:
        print(f"\n❌ Error comparing traces: {e}")
        print("This might be due to missing ranks/iterations in the traces.")
        return None

## 6. Baseline vs DDP Comparison

In [None]:
# Compare baseline with DDP
# Note: baseline uses rank 0, DDP uses rank 0 as well
# Iteration 4 is the first profiled iteration (profiler schedule: wait=2, warmup=2, active=6)
diff_baseline_ddp = compare_two_setups(
    labeled_traces['baseline'],
    labeled_traces['ddp'],
    'Baseline',
    'DDP',
    rank1=0,
    rank2=0,
    iteration=4,
    device_type=DeviceType.GPU
)
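
The choice of iteration 4 follows from the profiler schedule quoted in the cell comment. The sketch below reproduces that arithmetic under two assumptions: step numbering starts at 0, and the iteration numbers in the trace match the profiler step indices. Both helper names are hypothetical, and `skip_first` defaults to 0 because the notebook does not mention it.

```python
def first_active_step(wait, warmup, skip_first=0):
    """0-based index of the first step recorded by a wait/warmup/active schedule."""
    return skip_first + wait + warmup


def active_steps(wait, warmup, active, skip_first=0):
    """Step indices recorded during the first cycle of the schedule."""
    start = first_active_step(wait, warmup, skip_first)
    return list(range(start, start + active))


# Schedule from the comment in the cell above: wait=2, warmup=2, active=6
print(active_steps(wait=2, warmup=2, active=6))  # [4, 5, 6, 7, 8, 9]
```

If the traces were recorded with a different schedule, the `iteration` argument in the comparison cells has to change accordingly.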

## 7. DDP vs FSDP Comparisons

Compare DDP with different FSDP sharding strategies.

In [None]:
# DDP vs FSDP FULL_SHARD
diff_ddp_fsdp_full = compare_two_setups(
    labeled_traces['ddp'],
    labeled_traces['fsdp_full'],
    'DDP',
    'FSDP_FULL_SHARD',
    rank1=0,
    rank2=0,
    iteration=4,
    device_type=DeviceType.GPU
)

In [None]:
# DDP vs FSDP SHARD_GRAD_OP
diff_ddp_fsdp_grad = compare_two_setups(
    labeled_traces['ddp'],
    labeled_traces['fsdp_grad'],
    'DDP',
    'FSDP_SHARD_GRAD_OP',
    rank1=0,
    rank2=0,
    iteration=4,
    device_type=DeviceType.GPU
)

In [None]:
# FSDP strategies comparison
diff_fsdp_strategies = compare_two_setups(
    labeled_traces['fsdp_full'],
    labeled_traces['fsdp_grad'],
    'FSDP_FULL_SHARD',
    'FSDP_SHARD_GRAD_OP',
    rank1=0,
    rank2=0,
    iteration=4,
    device_type=DeviceType.GPU
)
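
The four comparison cells repeat the same call with different pairs, so they could also be driven from a list. A minimal sketch follows. `run_comparisons` is a hypothetical helper, and the lambda is only a stand-in for `compare_two_setups` so that the example runs without trace files.

```python
def run_comparisons(pairs, compare):
    """Call compare(control, test) for each pair and collect results by key."""
    results = {}
    for control, test in pairs:
        results[f"{control}_vs_{test}"] = compare(control, test)
    return results


# The same four pairs that the notebook compares
pairs = [
    ('baseline', 'ddp'),
    ('ddp', 'fsdp_full'),
    ('ddp', 'fsdp_grad'),
    ('fsdp_full', 'fsdp_grad'),
]

# Stand-in comparison function for illustration only
results = run_comparisons(pairs, lambda a, b: f"{a} -> {b}")
print(list(results))
```

In the notebook, the stand-in could be replaced by a function that looks up both names in `labeled_traces` and forwards them to `compare_two_setups` with `iteration=4`.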