# 3D DNS Performance Optimization Plan

## Overview

This notebook documents a comprehensive performance optimization strategy for the 3D Direct Numerical Simulation (DNS) solver. The plan is based on detailed code analysis and identifies multiple optimization opportunities ranging from low-risk compiler improvements to advanced algorithmic enhancements.

**Current Status:**
- Grid Size: 128×32×33 = 135,168 grid points
- Estimated time per step: ~0.1-0.2 seconds
- Main computational bottlenecks: FFT operations, memory bandwidth, nested loops

**Optimization Goal:** Achieve 3-7× performance improvement through systematic optimization phases

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# For parsing Fortran namelist files
try:
    import f90nml
    f90nml_available = True
except ImportError:
    print("f90nml not available. Install with: pip install f90nml")
    f90nml_available = False

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"f90nml available: {f90nml_available}")

In [None]:
# Parse Fortran Namelist File
config_file = "input_3d_method2.dat"

if f90nml_available and Path(config_file).exists():
    # Read the namelist file
    config = f90nml.read(config_file)
    print(f"Successfully parsed {config_file}")
    print(f"Found namelists: {list(config.keys())}")
else:
    # Fall back to hard-coded defaults if f90nml or the input file is unavailable
    print("Using hard-coded fallback configuration...")
    config = {
        'grid': {'nx_input': 128, 'ny_input': 32, 'nz_input': 33},
        'time_control': {'istart': 0, 'dt': 0.01, 'nsteps': 10000, 'nwrt': 10},
        'simulation': {'alpha': 1.0, 'beta': 1.0, 're': 180.0, 'ta': 0.0, 'ybar': 2.0, 
                      'cgstol': 1e-6, 'cs': 0.1, 'u00': 0.0, 'wavlen': 1.0,
                      'xlen': 12.566370614, 'ylen': 6.283185307, 'use_crank_nicolson': True},
        'output': {'iform': 0, 'iles': 0},
        'flow_control': {'flow_control_method': 2, 'target_pressure_gradient': 0.0166666,
                        'target_bulk_velocity': 1.0, 'controller_gain': 0.15,
                        'controller_update_freq': 7}
    }
    print("Manual configuration loaded")

# Display the configuration structure
for namelist, params in config.items():
    print(f"\n{namelist.upper()}:")
    for key, value in params.items():
        print(f"  {key}: {value}")

In [None]:
# Extract Grid Configuration
grid_params = config['grid']
nx, ny, nz = grid_params['nx_input'], grid_params['ny_input'], grid_params['nz_input']
total_points = nx * ny * nz

print("=== GRID CONFIGURATION ===")
print(f"Grid dimensions: {nx} × {ny} × {nz}")
print(f"Total grid points: {total_points:,}")
print(f"Memory estimate (double precision): {total_points * 8 * 12 / 1e6:.1f} MB")
print("  (assuming 12 main arrays: u,v,w + un,vn,wn + 6 workspace arrays)")

# Calculate grid characteristics for optimization analysis
print(f"\nGrid Analysis for Optimization:")
print(f"- X-direction (streamwise): {nx} points")
print(f"- Y-direction (spanwise): {ny} points") 
print(f"- Z-direction (wall-normal): {nz} points")
print(f"- Aspect ratios: X/Y = {nx/ny:.1f}, X/Z = {nx/nz:.1f}, Y/Z = {ny/nz:.1f}")

# Check for power-of-2 dimensions (important for FFT efficiency)
def is_power_of_2(n):
    return n > 0 and (n & (n - 1)) == 0

print(f"\nFFT Efficiency Check:")
print(f"- nx={nx} is power of 2: {is_power_of_2(nx)}")
print(f"- ny={ny} is power of 2: {is_power_of_2(ny)}")
if not is_power_of_2(nx) or not is_power_of_2(ny):
    print("⚠️  Non-power-of-2 dimensions may reduce FFT efficiency")

In [None]:
# Extract Time Control Parameters
time_params = config['time_control']
dt = time_params['dt']
nsteps = time_params['nsteps']
nwrt = time_params['nwrt']
istart = time_params['istart']

print("=== TIME CONTROL CONFIGURATION ===")
print(f"Time step (dt): {dt}")
print(f"Number of steps: {nsteps:,}")
print(f"Output frequency: every {nwrt} steps")
print(f"Starting step: {istart}")

# Calculate simulation characteristics
total_time = nsteps * dt
output_files = nsteps // nwrt
estimated_step_time = 0.15  # seconds (midpoint of the ~0.1-0.2 s estimate)
estimated_total_runtime = nsteps * estimated_step_time

print(f"\nSimulation Analysis:")
print(f"- Total simulation time: {total_time} time units")
print(f"- Number of output files: {output_files}")
print(f"- Estimated runtime (current): {estimated_total_runtime/3600:.1f} hours")
print(f"- Estimated step time: {estimated_step_time:.3f} seconds")

# Performance optimization potential
speedup_targets = [2, 4, 7]
print(f"\nOptimization Targets:")
for speedup in speedup_targets:
    new_runtime = estimated_total_runtime / speedup
    print(f"- {speedup}× speedup: {new_runtime/3600:.1f} hours ({new_runtime/60:.0f} minutes)")

# CFL and stability analysis
re = config['simulation']['re']
xlen = config['simulation']['xlen']
ylen = config['simulation']['ylen']
dx = xlen / nx
dy = ylen / ny
max_velocity_estimate = 1.5  # Centerline of a laminar plane Poiseuille profile (1.5 × bulk velocity)

cfl_x = max_velocity_estimate * dt / dx
cfl_y = max_velocity_estimate * dt / dy

print(f"\nStability Analysis:")
print(f"- Grid spacing: dx={dx:.4f}, dy={dy:.4f}")
print(f"- CFL numbers: CFLx={cfl_x:.3f}, CFLy={cfl_y:.3f}")
print(f"- Reynolds number: {re}")
if cfl_x > 0.5 or cfl_y > 0.5:
    print("⚠️  CFL numbers may be too high for stability")
else:
    print("✓ CFL numbers appear reasonable")

## Performance Optimization Categories

The optimization strategy is divided into six main categories, each with different risk levels and expected performance gains:

### 1. **Compiler Optimizations** (Low Risk, High Impact)
- Enhanced compiler flags with architecture-specific optimizations
- Link-time optimization (LTO) and function inlining
- OpenMP support for parallel processing

### 2. **Memory Access Optimizations** (Medium Risk, High Impact)
- Loop order optimization for cache locality
- Memory layout restructuring
- Array padding to avoid cache conflicts

### 3. **FFT Optimizations** (Medium Risk, Very High Impact)
- FFTW wisdom files for optimal plans
- Reduced FFT call frequency
- Parallel FFT with threading

### 4. **Algorithmic Optimizations** (Medium Risk, Very High Impact)
- Derivative calculation optimization
- Source term algorithm improvements
- Workspace array reuse strategies

### 5. **Parallelization** (High Risk, Very High Impact)
- OpenMP threading for loops
- Parallel FFT operations
- Memory access optimization for parallel execution

### 6. **Numerical Method Optimizations** (High Risk, High Impact)
- Adaptive time stepping
- Pressure solver improvements
- Advanced boundary condition handling

In [None]:
# Implementation Phases and Expected Performance Gains

# Define optimization phases
phases = {
    'Phase 1: Low-Risk Optimizations': {
        'optimizations': [
            'Enhanced compiler flags',
            'Loop order optimization',
            'Memory alignment fixes', 
            'FFTW plan optimization'
        ],
        'expected_speedup': 1.2,
        'risk_level': 'Low',
        'timeframe': 'Immediate',
        'color': 'green'
    },
    'Phase 2: Medium-Risk Optimizations': {
        'optimizations': [
            'OpenMP parallelization',
            'FFT call reduction',
            'Derivative calculation optimization',
            'Source term algorithm comparison'
        ],
        'expected_speedup': 2.0,
        'risk_level': 'Medium', 
        'timeframe': '1-2 weeks',
        'color': 'orange'
    },
    'Phase 3: High-Risk Optimizations': {
        'optimizations': [
            'Advanced numerical methods',
            'Memory layout restructuring',
            'Pressure solver optimization',
            'Adaptive time stepping'
        ],
        'expected_speedup': 3.0,
        'risk_level': 'High',
        'timeframe': '2-4 weeks',
        'color': 'red'
    }
}

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Expected speedup by phase
phase_names = list(phases.keys())
speedups = [phases[phase]['expected_speedup'] for phase in phase_names]
colors = [phases[phase]['color'] for phase in phase_names]

bars1 = ax1.bar(range(len(phase_names)), speedups, color=colors, alpha=0.7)
ax1.set_xlabel('Optimization Phase')
ax1.set_ylabel('Expected Speedup Factor')
ax1.set_title('Expected Performance Gains by Phase')
ax1.set_xticks(range(len(phase_names)))
ax1.set_xticklabels([p.split(':')[0] for p in phase_names], rotation=45)
ax1.grid(True, alpha=0.3)

# Add value labels on bars
for i, bar in enumerate(bars1):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.1f}×', ha='center', va='bottom', fontweight='bold')

# Plot 2: Cumulative performance improvement
cumulative_speedup = np.cumprod(speedups)
ax2.plot(range(len(phase_names)), cumulative_speedup, 'bo-', linewidth=2, markersize=8)
ax2.fill_between(range(len(phase_names)), 1, cumulative_speedup, alpha=0.3)
ax2.set_xlabel('Optimization Phase')
ax2.set_ylabel('Cumulative Speedup Factor')
ax2.set_title('Cumulative Performance Improvement')
ax2.set_xticks(range(len(phase_names)))
ax2.set_xticklabels([p.split(':')[0] for p in phase_names], rotation=45)
ax2.grid(True, alpha=0.3)

# Add value labels
for i, speedup in enumerate(cumulative_speedup):
    ax2.text(i, speedup + 0.1, f'{speedup:.1f}×', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("=== OPTIMIZATION PHASES SUMMARY ===")
for phase, details in phases.items():
    print(f"\n{phase}")
    print(f"  Expected speedup: {details['expected_speedup']}×")
    print(f"  Risk level: {details['risk_level']}")
    print(f"  Timeframe: {details['timeframe']}")
    print(f"  Key optimizations:")
    for opt in details['optimizations']:
        print(f"    • {opt}")

total_speedup = np.prod(speedups)
current_runtime = estimated_total_runtime / 3600
optimized_runtime = current_runtime / total_speedup

print(f"\n=== OVERALL IMPACT ===")
print(f"Total expected speedup: {total_speedup:.1f}×")
print(f"Current estimated runtime: {current_runtime:.1f} hours")
print(f"Optimized runtime estimate: {optimized_runtime:.1f} hours")
print(f"Time savings: {current_runtime - optimized_runtime:.1f} hours ({(1-1/total_speedup)*100:.0f}% reduction)")

## Detailed Optimization Strategies

### Phase 1: Compiler Optimizations (Low Risk)

#### Current Compiler Flags
```makefile
FFLAGS = -O3 -ffast-math -funroll-loops -g
```

#### Proposed Enhanced Flags
```makefile
FFLAGS = -O3 -ffast-math -funroll-loops -march=native -mtune=native \
         -flto -fomit-frame-pointer -finline-functions \
         -floop-interchange -floop-block -ftree-vectorize \
         -fopenmp
```

#### Benefits
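- **`-march=native`**: Optimizes for the build machine's CPU architecture (the binary may not run on older CPUs)
- **`-flto`**: Link-time optimization for cross-file inlining
- **`-floop-*`**: Loop-nest optimizations (`-floop-block` requires a GCC built with Graphite/ISL support)
- **`-fopenmp`**: Enables OpenMP parallel directives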

#### Expected Impact
- 10-20% performance improvement
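- No source code changes required
- Low risk of introducing bugs, although `-march=native` and LTO can change floating-point rounding relative to the current build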
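
Because optimization flags can alter floating-point rounding, a rebuilt binary may not reproduce the reference field bit for bit. The following is a minimal sketch of a tolerance-based check, assuming both builds can dump a velocity field into a NumPy array. The function name `relative_l2_error`, the synthetic field, and the tolerance `1e-10` are illustrative assumptions, not part of the solver.

```python
import numpy as np

def relative_l2_error(reference, candidate):
    """Return ||candidate - reference||_2 / ||reference||_2 over all grid points."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    if reference.shape != candidate.shape:
        raise ValueError("fields must have the same shape")
    ref_norm = np.linalg.norm(reference.ravel())
    if ref_norm == 0.0:
        raise ValueError("reference field is identically zero")
    return np.linalg.norm((candidate - reference).ravel()) / ref_norm

# Synthetic stand-in for a reference field on the 128 x 32 x 33 grid
rng = np.random.default_rng(0)
u_ref = rng.standard_normal((128, 32, 33))
u_new = u_ref * (1.0 + 1e-13)  # mimic rounding-level differences between builds

err = relative_l2_error(u_ref, u_new)
passes = err < 1e-10  # hypothetical acceptance tolerance
print(f"relative L2 error: {err:.2e}, passes: {passes}")
```

A suitable tolerance depends on how long the test case runs, since rounding differences grow over many time steps in a turbulent flow.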

### Phase 1: Memory Access Optimizations

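#### Cache-Friendly Loop Structure (Target)
Fortran arrays are column-major, so for `u(i,j,k)` the first index `i` is contiguous in memory and belongs in the innermost loop:
```fortran
do k = 1, nz      ! z-direction (slowest varying, outermost)
    do j = 1, ny  ! y-direction  
        do i = 1, nx  ! x-direction (fastest varying, innermost)
```

#### Cache-Unfriendly Structure (Reorder Where Found)
Nests with `k` innermost stride through memory by `nx*ny` elements per iteration and should be rewritten in the order above:
```fortran
do i = 1, nx      ! x-direction (outermost)
    do j = 1, ny  ! y-direction
        do k = 1, nz  ! z-direction (innermost, strided access)
```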

#### Array Padding Strategy
```fortran
! Instead of exact dimensions
integer, parameter :: nx_padded = ((nx + 7) / 8) * 8  ! Align to 8-element boundaries
real(wp), allocatable :: u(:,:,:)  ! (nx_padded, ny, nz)
```

#### Memory Layout Benefits
- Better cache line utilization
- Reduced memory bandwidth requirements  
- Improved vectorization opportunities
- 15-25% performance gain in memory-bound operations
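
The memory layout behind these points can be inspected from Python. The sketch below uses NumPy's `order="F"` to mimic a Fortran array `u(nx, ny, nz)` and prints the byte stride along each index. The helper `padded_extent` is a hypothetical name that mirrors the `((nx + 7) / 8) * 8` expression above.

```python
import numpy as np

nx, ny, nz = 128, 32, 33

# order="F" reproduces Fortran's column-major layout for an array u(nx, ny, nz)
u = np.zeros((nx, ny, nz), dtype=np.float64, order="F")

# Byte distance between neighbours along each index: i is contiguous (8 bytes),
# j jumps one x-line, k jumps one full x-y plane.
stride_i, stride_j, stride_k = u.strides
print(f"strides in bytes: i={stride_i}, j={stride_j}, k={stride_k}")

def padded_extent(n, multiple=8):
    """Round n up to the next multiple, mirroring ((n + 7) / 8) * 8 for multiple=8."""
    return ((n + multiple - 1) // multiple) * multiple

print(f"nx={nx} -> {padded_extent(nx)}, nz={nz} -> {padded_extent(nz)}")
```

For this grid, `nx = 128` is already a multiple of 8, so padding the leading dimension to 8-element boundaries leaves it unchanged.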

### Phase 2: FFT Optimizations (High Impact)

#### Current FFT Usage Pattern
- Plans created at runtime for each operation
- Multiple separate FFT calls for different fields
- No threading in FFTW operations

#### Proposed FFTW Optimizations

**1. FFTW Wisdom and Plan Optimization**
```fortran
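! Initialize threading and import wisdom once at startup, before creating any plan
ierr = fftw_init_threads()                       ! ierr: integer(C_INT), nonzero = success
call fftw_plan_with_nthreads(omp_get_max_threads())
! fftw_import_wisdom_from_filename is a function in the modern Fortran interface
ierr = fftw_import_wisdom_from_filename(C_CHAR_'fftw_wisdom.dat' // C_NULL_CHAR)
! Create plans once for the fixed problem sizes and reuse them every time step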
```

**2. Batch FFT Operations**
```fortran
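! Instead of separate FFTs for u, v, w derivatives, plan one batched transform.
! FFTW has no batch-execute call; batching is set via `howmany` in the advanced planner.
plan_batch = fftw_plan_many_r2r(rank, n, batch_size, input_array, inembed, istride, idist, &
                                output_array, onembed, ostride, odist, kind, FFTW_MEASURE)
call fftw_execute_r2r(plan_batch, input_array, output_array)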
```

**3. In-Place Transformations**
```fortran
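! Reduce memory allocations by using in-place transforms (same array for input and output).
! The planner is a function; the modern (iso_c_binding) interface takes dimensions in reverse order.
plan = fftw_plan_r2r_3d(nz, ny, nx, data, data, ...)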
```

#### Expected FFT Performance Gains
- 40-60% reduction in FFT overhead
- Better memory utilization
- Improved parallel scaling

### Phase 2: OpenMP Parallelization Strategy

#### Target Areas for Parallelization

**1. Main Computational Loops**
```fortran
!$OMP PARALLEL DO PRIVATE(i,j,k) SCHEDULE(STATIC)
do k = 1, nz
    do j = 1, ny
        do i = 1, nx
            ! Velocity calculations
        end do
    end do
end do
!$OMP END PARALLEL DO
```

**2. Derivative Calculations**
```fortran
!$OMP PARALLEL SECTIONS
!$OMP SECTION
    call compute_derivatives_3d(u, dfdx=dudx, calc_dx=.true.)
!$OMP SECTION  
    call compute_derivatives_3d(v, dfdx=dvdx, calc_dx=.true.)
!$OMP SECTION
    call compute_derivatives_3d(w, dfdx=dwdx, calc_dx=.true.)
!$OMP END PARALLEL SECTIONS
```

**3. FFT Operations with Threading**
```fortran
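! Option A: multithreaded FFTW plans (requires fftw_init_threads() before planning)
call fftw_plan_with_nthreads(omp_get_max_threads())
! Option B: single-threaded plan executed on independent x-y planes in parallel.
! Only the fftw_execute family is thread-safe; avoid combining A and B (oversubscription).
!$OMP PARALLEL DO
do k = 1, nz
    call fftw_execute_r2r(plan_2d, field(:,:,k), field(:,:,k))  ! plan_2d must be an in-place plan
end do
!$OMP END PARALLEL DO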
```

#### Parallel Performance Considerations
- **Thread scaling**: Expect 2-4× speedup on modern CPUs
- **Memory bandwidth**: May become limiting factor
- **Load balancing**: Use static scheduling for regular grids
- **False sharing**: Minimize with proper data layout
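
Thread scaling is bounded by the fraction of each time step that actually runs in parallel. The sketch below evaluates Amdahl's law for a few hypothetical parallel fractions. The function name `amdahl_speedup` and the fractions are assumptions for illustration; the real fraction has to come from profiling the solver.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Ideal speedup when only `parallel_fraction` of the runtime scales with threads."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)

# Hypothetical parallel fractions; the real value must come from profiling
for p in (0.5, 0.8, 0.9):
    row = ", ".join(f"{n} threads: {amdahl_speedup(p, n):.2f}x" for n in (2, 4, 8))
    print(f"p={p:.1f} -> {row}")
```

With 80% of a step parallelized, the ideal speedup on 4 threads is 2.5×, before memory bandwidth limits are taken into account.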

In [None]:
# Risk Assessment and Measurement Strategy

# Define risk categories and mitigation strategies
risk_assessment = {
    'Low Risk': {
        'optimizations': ['Compiler flags', 'Loop reordering', 'FFTW plans', 'Memory alignment'],
        'mitigation': ['Regression testing', 'Verification runs', 'Tolerance-based comparison with baseline'],
        'rollback_effort': 'Minimal - configuration changes only',
        'validation_time': '1-2 hours'
    },
    'Medium Risk': {
        'optimizations': ['OpenMP threading', 'FFT restructuring', 'Algorithm changes'],
        'mitigation': ['Extensive testing', 'Reference solutions', 'Gradual implementation'],
        'rollback_effort': 'Moderate - code changes required',
        'validation_time': '1-2 days'
    },
    'High Risk': {
        'optimizations': ['Numerical methods', 'Memory layout', 'Advanced algorithms'],
        'mitigation': ['Prototype testing', 'Academic validation', 'Benchmarking'],
        'rollback_effort': 'Significant - major code restructuring',
        'validation_time': '1-2 weeks'
    }
}

print("=== RISK ASSESSMENT MATRIX ===")
for risk_level, details in risk_assessment.items():
    print(f"\n{risk_level.upper()}:")
    print(f"  Optimizations: {', '.join(details['optimizations'])}")
    print(f"  Mitigation: {', '.join(details['mitigation'])}")
    print(f"  Rollback effort: {details['rollback_effort']}")
    print(f"  Validation time: {details['validation_time']}")

# Performance measurement strategy
measurement_strategy = {
    'Baseline Metrics': [
        'Total runtime for test case',
        'Time per step (average/min/max)',
        'Memory usage (peak and average)',
        'CPU utilization',
        'Cache hit rates',
        'FFT operation time'
    ],
    'Verification Metrics': [
        'Solution accuracy (L2 norm comparison)',
        'Mass conservation error',
        'Energy conservation error', 
        'Maximum divergence',
        'Bulk velocity accuracy (for Method 2)',
        'Pressure gradient convergence'
    ],
    'Performance Metrics': [
        'Speedup factor per phase',
        'Parallel efficiency',
        'Memory bandwidth utilization',
        'FLOPS (floating point operations per second)',
        'Time breakdown by major functions',
        'Scalability with grid size'
    ]
}

print(f"\n=== PERFORMANCE MEASUREMENT STRATEGY ===")
for category, metrics in measurement_strategy.items():
    print(f"\n{category}:")
    for metric in metrics:
        print(f"  • {metric}")

# Testing protocol
print(f"\n=== TESTING PROTOCOL ===")
print("1. Establish baseline performance with current code")
print("2. Implement Phase 1 optimizations with verification")
print("3. Measure and validate performance gains")
print("4. Proceed to Phase 2 only if Phase 1 successful")
print("5. Document all changes and performance impacts")
print("6. Maintain reference solutions for validation")
print("7. Create automated regression test suite")

In [None]:
# Configuration Summary and Next Steps

# Create comprehensive configuration summary
config_summary = {
    'Grid Configuration': {
        'Dimensions': f"{nx} × {ny} × {nz}",
        'Total Points': f"{total_points:,}",
        'Memory Estimate': f"{total_points * 8 * 12 / 1e6:.1f} MB",
        'FFT Efficiency': 'Good' if is_power_of_2(nx) and is_power_of_2(ny) else 'Suboptimal'
    },
    'Simulation Parameters': {
        'Reynolds Number': config['simulation']['re'],
        'Time Step': config['time_control']['dt'],
        'Total Steps': f"{config['time_control']['nsteps']:,}",
        'Estimated Runtime': f"{estimated_total_runtime/3600:.1f} hours",
        'Flow Control': 'Method 2 (PI Controller)' if config['flow_control']['flow_control_method'] == 2 else 'Method 1 (Constant Pressure)'
    },
    'Optimization Potential': {
        'Phase 1 Speedup': '1.2×',
        'Phase 2 Speedup': '2.0×', 
        'Phase 3 Speedup': '3.0×',
        'Total Potential': f"{total_speedup:.1f}×",
        'Optimized Runtime': f"{estimated_total_runtime/total_speedup/3600:.1f} hours"
    }
}

# Display configuration summary table (pandas is already imported in the first cell)

summary_data = []
for category, params in config_summary.items():
    for param, value in params.items():
        summary_data.append({
            'Category': category,
            'Parameter': param,
            'Value': value
        })

df_summary = pd.DataFrame(summary_data)
print("=== COMPREHENSIVE CONFIGURATION SUMMARY ===")
print(df_summary.to_string(index=False))

# Next steps and recommendations
print(f"\n=== IMMEDIATE NEXT STEPS ===")
print("1. 📊 Establish baseline performance measurement")
print("   - Run current code with timing instrumentation")
print("   - Measure memory usage and CPU utilization")
print("   - Document reference solution for validation")

print(f"\n2. 🔧 Implement Phase 1 optimizations (Low Risk)")
print("   - Update Makefile with enhanced compiler flags")
print("   - Optimize loop orders in critical subroutines")
print("   - Implement FFTW plan reuse")

print(f"\n3. ✅ Validate Phase 1 results")
print("   - Compare performance metrics")
print("   - Verify solution accuracy")
print("   - Document performance gains")

print(f"\n4. 📋 Prepare for Phase 2 (if Phase 1 successful)")
print("   - Design OpenMP parallelization strategy")
print("   - Plan FFT optimization implementation")
print("   - Set up parallel testing environment")

print(f"\n=== SUCCESS CRITERIA ===")
print("✓ Phase 1: 15-25% performance improvement with results matching the baseline within tolerance")
print("✓ Phase 2: 2× total speedup with verified accuracy")  
print("✓ Phase 3: 3-7× total speedup with maintained stability")

print(f"\n=== PROJECT TIMELINE ===")
print("Week 1: Baseline measurement + Phase 1 implementation")
print("Week 2: Phase 1 validation + Phase 2 design")
print("Week 3-4: Phase 2 implementation and testing")
print("Week 5-8: Phase 3 implementation (if approved)")

print(f"\n⚠️  IMPORTANT: No changes should be made without explicit approval")
print("📝 This notebook serves as the planning document for all optimizations")