# Shader Workload Efficiency Case Study

This notebook explores the energy efficiency of GPU shader workloads, focusing on compute operations on Apple's GPU architecture. Using simulated benchmark and power data, we'll compare how different shader implementations affect power consumption and identify strategies for maximizing performance per watt.

## Key Concepts

1. **Shader Efficiency** - How effectively shader code uses GPU resources
2. **ALU vs. Memory Operations** - Balance between computation and memory accesses
3. **Instruction Mix** - Types of operations used in shader code
4. **Thread Divergence** - Impact of control flow on GPU execution efficiency
5. **Workgroup Optimization** - Tuning thread counts and organization
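
As a rough illustration of the ALU-versus-memory balance, the sketch below estimates arithmetic intensity (floating-point operations per byte of memory traffic) for a square matrix multiplication. The helper `matmul_arithmetic_intensity` and its `traffic_multiplier` parameter are hypothetical names introduced here, and the traffic model is a deliberate simplification rather than a description of any particular GPU.

```python
def matmul_arithmetic_intensity(n, bytes_per_element=4, traffic_multiplier=1.0):
    """Estimate FLOPs per byte for an n x n matrix multiplication.

    traffic_multiplier is a hypothetical knob: 1.0 models ideal reuse (each
    matrix moved once); larger values model redundant traffic from poor caching.
    """
    flops = 2 * n ** 3                            # one multiply and one add per inner step
    ideal_bytes = 3 * n * n * bytes_per_element   # read A and B once, write C once
    return flops / (ideal_bytes * traffic_multiplier)


ideal = matmul_arithmetic_intensity(1024)
poor_reuse = matmul_arithmetic_intensity(1024, traffic_multiplier=8.0)
print(f"ideal reuse: {ideal:.1f} FLOP/byte, poor reuse: {poor_reuse:.1f} FLOP/byte")
```

A higher value means the kernel does more arithmetic per byte moved, which is the property that tiling and data reuse try to improve.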

In [None]:
# Import required libraries
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add parent directory to path
sys.path.insert(0, os.path.abspath('..'))

# Import project modules
from src.benchmarks.compute_benchmarks import MatrixMultiplication, ConvolutionBenchmark
from src.data_collection.collectors import SimulatedPowerCollector
from src.analysis.efficiency import calculate_energy_consumption, analyze_energy_efficiency
from src.analysis.visualization import plot_power_over_time
from src.analysis.optimization import (
    identify_hotspots, identify_inefficient_patterns,
    identify_dvfs_opportunities, generate_optimization_recommendations
)

## 1. Setting Up the Experiment

First, we'll create a series of shader workload simulations to test various implementation strategies and their impact on energy consumption.

In [None]:
# Create output directory for results
os.makedirs('./data/shader_study', exist_ok=True)

# Initialize data collector
power_collector = SimulatedPowerCollector(output_dir='./data/shader_study')

# Create a custom shader benchmark class to simulate different shader implementations
class ShaderImplementationBenchmark(MatrixMultiplication):
    def __init__(self, shader_type, shader_config=None):
        super().__init__()
        self.name = f"shader_{shader_type}"
        self.description = f"Shader implementation: {shader_type}"
        self.shader_type = shader_type
        self.shader_config = shader_config or {}
    
    def _execute(self, parameters):
        # Base functionality from MatrixMultiplication
        base_result = super()._execute(parameters)
        
        # Scale the base result to mimic each implementation (illustrative factors, not measurements)
        modified_result = base_result.copy()
        
        # Apply implementation-specific adjustments
        if self.shader_type == 'naive':
            # Naive implementation: inefficient memory access, high instruction count
            modified_result['operations'] = base_result['operations'] * 1.5  # More operations (inefficient)
            modified_result['memory_used'] = base_result['memory_used'] * 1.2  # More memory traffic
            modified_result['execution_time'] = base_result.get('execution_time', 0) * 1.4  # Slower
            
        elif self.shader_type == 'optimized':
            # Optimized: efficient ALU usage, good memory patterns
            modified_result['operations'] = base_result['operations'] * 0.9  # Fewer redundant operations
            modified_result['memory_used'] = base_result['memory_used'] * 0.8  # Less memory traffic
            modified_result['execution_time'] = base_result.get('execution_time', 0) * 0.7  # Faster
            
        elif self.shader_type == 'divergent':
            # High thread divergence: conditionals cause inefficiency
            modified_result['operations'] = base_result['operations'] * 1.2  # More operations due to divergence
            modified_result['memory_used'] = base_result['memory_used'] * 1.1  # Slightly more memory traffic
            modified_result['execution_time'] = base_result.get('execution_time', 0) * 1.8  # Much slower
            
        elif self.shader_type == 'tiled':
            # Tiled: optimized for cache/tile memory in Apple GPUs
            modified_result['operations'] = base_result['operations'] * 0.95  # Slightly fewer operations
            modified_result['memory_used'] = base_result['memory_used'] * 0.6  # Much less memory traffic
            modified_result['execution_time'] = base_result.get('execution_time', 0) * 0.5  # Much faster
            
        return modified_result

# Initialize different shader implementation benchmarks
shader_implementations = {
    'naive': {
        'description': 'Naive Implementation',
        'benchmark': ShaderImplementationBenchmark('naive'),
        'color': 'firebrick'
    },
    'divergent': {
        'description': 'High Divergence',
        'benchmark': ShaderImplementationBenchmark('divergent'),
        'color': 'darkorange'
    },
    'optimized': {
        'description': 'Optimized',
        'benchmark': ShaderImplementationBenchmark('optimized'),
        'color': 'royalblue'
    },
    'tiled': {
        'description': 'Tiled/Cache-Optimized',
        'benchmark': ShaderImplementationBenchmark('tiled'),
        'color': 'forestgreen'
    }
}

## 2. Testing Different Shader Implementations

Now let's run the different shader implementations and measure their performance and power consumption.

In [None]:
# Run the shader implementation benchmarks
np.random.seed(42)  # Fix the seed so the simulated activity patterns are reproducible
results = {}
power_data = {}

for impl_key, impl_info in shader_implementations.items():
    print(f"Testing {impl_info['description']} Shader Implementation...")
    
    # Run the benchmark
    benchmark = impl_info['benchmark']
    params = {'matrix_size': 1024}
    results[impl_key] = benchmark.run(params)
    
    # Configure a power profile for this implementation
    duration = 5.0  # seconds
    num_samples = int(duration / power_collector.sampling_interval)
    
    # Create appropriate activity pattern for each implementation
    if impl_key == 'naive':
        # Naive implementation has high, somewhat variable power consumption
        base_activity = 0.9  # High activity
        activity_pattern = np.random.normal(base_activity, 0.05, num_samples)
        activity_pattern = np.clip(activity_pattern, 0.7, 1.0)
        
    elif impl_key == 'divergent':
        # Divergent implementation has spiky power consumption
        base_activity = 0.85
        # Create a pattern with high variability
        x = np.linspace(0, 10, num_samples)
        activity_pattern = np.sin(x * 4) * 0.15 + base_activity
        activity_pattern = np.clip(activity_pattern, 0.6, 1.0)
        
    elif impl_key == 'optimized':
        # Optimized implementation has moderate, steady power
        base_activity = 0.7
        activity_pattern = np.random.normal(base_activity, 0.03, num_samples)
        activity_pattern = np.clip(activity_pattern, 0.6, 0.8)
        
    elif impl_key == 'tiled':
        # Tiled implementation has lower, very steady power
        base_activity = 0.6
        activity_pattern = np.random.normal(base_activity, 0.02, num_samples)
        activity_pattern = np.clip(activity_pattern, 0.55, 0.65)
    
    # Collect power data
    power_data[impl_key] = power_collector.collect_for_duration(duration, activity_pattern)

## 3. Analyzing Execution Time and Power Consumption

Let's analyze the execution time and power consumption of each shader implementation.

In [None]:
# Analyze execution time
execution_times = {impl_key: result.get('mean_execution_time', 0) 
                  for impl_key, result in results.items()}

# Calculate average power consumption
avg_power = {}
for impl_key, data in power_data.items():
    df = pd.DataFrame(data)
    avg_power[impl_key] = df['total_power'].mean()

# Create comparison plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot execution time
bars1 = ax1.bar(
    [shader_implementations[k]['description'] for k in execution_times.keys()],
    list(execution_times.values()),
    color=[shader_implementations[k]['color'] for k in execution_times.keys()]
)

# Add value labels
for bar in bars1:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
            f'{height:.3f} s', ha='center', va='bottom')

ax1.set_title('Execution Time by Shader Implementation')
ax1.set_ylabel('Execution Time (seconds)')
ax1.grid(True, linestyle='--', alpha=0.7, axis='y')

# Plot average power
bars2 = ax2.bar(
    [shader_implementations[k]['description'] for k in avg_power.keys()],
    list(avg_power.values()),
    color=[shader_implementations[k]['color'] for k in avg_power.keys()]
)

# Add value labels
for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{height:.2f} W', ha='center', va='bottom')

ax2.set_title('Average Power Consumption by Shader Implementation')
ax2.set_ylabel('Power (W)')
ax2.grid(True, linestyle='--', alpha=0.7, axis='y')

plt.tight_layout()
plt.savefig('./data/shader_study/execution_power_comparison.png', dpi=300)
plt.show()

## 4. Calculating Energy Consumption

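Now let's calculate the total energy consumed by each shader implementation. Every implementation is sampled over the same 5-second window, so these totals mainly reflect differences in average power rather than energy per completed task.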
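
The internals of `calculate_energy_consumption` are not shown in this notebook. One common way to turn sampled power into energy, assumed here purely for illustration, is to integrate power over time with the trapezoidal rule. `integrate_energy` is a hypothetical helper, not the project's function.

```python
import numpy as np


def integrate_energy(timestamps, power_watts):
    """Integrate power (W) over time (s) with the trapezoidal rule; returns joules."""
    t = np.asarray(timestamps, dtype=float)
    p = np.asarray(power_watts, dtype=float)
    if t.size < 2:
        return 0.0  # A single sample spans no time, so no energy can be attributed
    # Area of each trapezoid: average of adjacent power samples times the time step
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)))


# Sanity check: a constant 10 W draw over 5 s
t = np.linspace(0.0, 5.0, 51)
energy_j = integrate_energy(t, np.full_like(t, 10.0))
print(f"{energy_j:.1f} J")
```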

In [None]:
# Calculate energy consumption for each implementation
energy_consumption = {}

for impl_key, data in power_data.items():
    # Convert to DataFrame
    df = pd.DataFrame(data)
    
    # Calculate energy
    energy = calculate_energy_consumption(df)
    energy_consumption[impl_key] = energy
    
    print(f"{shader_implementations[impl_key]['description']} Energy Consumption: {energy:.2f} joules")

# Create bar chart of energy consumption
plt.figure(figsize=(12, 6))
bars = plt.bar(
    [shader_implementations[k]['description'] for k in energy_consumption.keys()], 
    list(energy_consumption.values()),
    color=[shader_implementations[k]['color'] for k in energy_consumption.keys()]
)

# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.3,
            f'{height:.2f} J', ha='center', va='bottom')

plt.title('Energy Consumption by Shader Implementation')
plt.xlabel('Implementation')
plt.ylabel('Energy Consumption (joules)')
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.savefig('./data/shader_study/energy_consumption.png', dpi=300)
plt.show()

## 5. Power Profiles Over Time

Let's examine the power consumption patterns over time for each implementation.

In [None]:
# Plot power consumption over time for all implementations
plt.figure(figsize=(14, 7))

for impl_key, data in power_data.items():
    df = pd.DataFrame(data)
    # Normalize time to start at 0
    time_values = df['timestamp'] - df['timestamp'].min()
    plt.plot(time_values, df['total_power'], 
             label=shader_implementations[impl_key]['description'],
             color=shader_implementations[impl_key]['color'],
             linewidth=2, alpha=0.8)

plt.title('Power Consumption Over Time by Shader Implementation')
plt.xlabel('Time (s)')
plt.ylabel('Power (W)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.savefig('./data/shader_study/power_over_time.png', dpi=300)
plt.show()

## 6. Evaluating Energy Efficiency

Let's calculate energy efficiency metrics for each implementation.
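
For comparing implementations that finish the same task in different times, energy per completed task is often more informative than energy over a fixed window. The sketch below, with hypothetical helper names and made-up input numbers, multiplies mean power by execution time and divides a nominal count of useful operations (2·N³ for an N×N matrix multiplication) by the result, so that redundant work does not inflate the efficiency figure.

```python
def energy_to_solution(mean_power_w, execution_time_s):
    """Energy (J) to complete one task, assuming roughly constant power while it runs."""
    return mean_power_w * execution_time_s


def useful_ops_per_joule(matrix_size, energy_j):
    """Nominal matmul operations (2 * N^3) per joule; redundant work is not counted."""
    useful_ops = 2 * matrix_size ** 3
    return useful_ops / energy_j if energy_j > 0 else 0.0


# Hypothetical inputs, not taken from the simulation in this notebook
slow_energy = energy_to_solution(mean_power_w=9.0, execution_time_s=1.4)
fast_energy = energy_to_solution(mean_power_w=6.0, execution_time_s=0.5)

for label, energy in [("slow", slow_energy), ("fast", fast_energy)]:
    print(f"{label}: {energy:.2f} J, {useful_ops_per_joule(1024, energy):.2e} ops/J")
```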

In [None]:
# Calculate operations per joule for each implementation
ops_per_joule = {}

for impl_key, result in results.items():
    operations = result.get('operations', 0)
    energy = energy_consumption[impl_key]
    
    if energy > 0:
        ops_per_joule[impl_key] = operations / energy
    else:
        ops_per_joule[impl_key] = 0
    
    print(f"{shader_implementations[impl_key]['description']} Operations per Joule: {ops_per_joule[impl_key]:.2e}")

# Create bar chart of operations per joule
plt.figure(figsize=(12, 6))
bars = plt.bar(
    [shader_implementations[k]['description'] for k in ops_per_joule.keys()], 
    list(ops_per_joule.values()),
    color=[shader_implementations[k]['color'] for k in ops_per_joule.keys()]
)

# Add value labels (offset is relative to bar height so it works at any scale)
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height * 1.01,
            f'{height:.2e}', ha='center', va='bottom')

plt.title('Energy Efficiency (Operations per Joule) by Shader Implementation')
plt.xlabel('Implementation')
plt.ylabel('Operations per Joule')
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.savefig('./data/shader_study/ops_per_joule.png', dpi=300)
plt.show()

## 7. Component Power Analysis

Let's examine which components (compute, memory, I/O) dominate power consumption in each implementation.

In [None]:
# Calculate component power breakdown for each implementation
component_data = []

for impl_key, data in power_data.items():
    df = pd.DataFrame(data)
    
    # Calculate average power for each component
    avg_compute = df['compute_power'].mean()
    avg_memory = df['memory_power'].mean()
    avg_io = df['io_power'].mean()
    avg_total = df['total_power'].mean()
    
    # Calculate percentages
    compute_pct = avg_compute / avg_total * 100
    memory_pct = avg_memory / avg_total * 100
    io_pct = avg_io / avg_total * 100
    
    component_data.append({
        'implementation': impl_key,
        'description': shader_implementations[impl_key]['description'],
        'avg_compute_power': avg_compute,
        'avg_memory_power': avg_memory,
        'avg_io_power': avg_io,
        'avg_total_power': avg_total,
        'compute_pct': compute_pct,
        'memory_pct': memory_pct,
        'io_pct': io_pct
    })

# Convert to DataFrame
component_df = pd.DataFrame(component_data)

# Create stacked bar chart of component power
plt.figure(figsize=(14, 6))

# Create data for stacked bars
implementations = component_df['description']
compute_power = component_df['avg_compute_power']
memory_power = component_df['avg_memory_power']
io_power = component_df['avg_io_power']

# Create stacked bars
plt.bar(implementations, compute_power, label='Compute', color='#5DA5DA')
plt.bar(implementations, memory_power, bottom=compute_power, label='Memory', color='#FAA43A')
plt.bar(implementations, io_power, bottom=compute_power+memory_power, label='I/O', color='#60BD68')

# Add percentage annotations
for i, impl in enumerate(implementations):
    # Compute percentage
    compute_pct = component_df.iloc[i]['compute_pct']
    plt.text(i, compute_power[i]/2, f'{compute_pct:.1f}%', ha='center', va='center', color='white', fontweight='bold')
    
    # Memory percentage
    memory_pct = component_df.iloc[i]['memory_pct']
    plt.text(i, compute_power[i] + memory_power[i]/2, f'{memory_pct:.1f}%', ha='center', va='center', color='white', fontweight='bold')
    
    # I/O percentage
    io_pct = component_df.iloc[i]['io_pct']
    if io_pct > 5:  # Only show percentage if it's large enough
        plt.text(i, compute_power[i] + memory_power[i] + io_power[i]/2, f'{io_pct:.1f}%', ha='center', va='center', color='white', fontweight='bold')

plt.title('Component Power Breakdown by Shader Implementation')
plt.xlabel('Implementation')
plt.ylabel('Power (W)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.savefig('./data/shader_study/component_power.png', dpi=300)
plt.show()

## 8. Optimization Analysis

Based on our findings, let's identify optimization opportunities in shader implementations.
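
How `identify_hotspots` detects hotspots is not shown here. A simple approach, assumed for illustration only, flags contiguous runs of samples whose power exceeds the mean by some number of standard deviations. `find_power_hotspots` and its `num_std` parameter are hypothetical.

```python
import numpy as np


def find_power_hotspots(power_watts, num_std=1.5):
    """Return (start, end) index pairs, end exclusive, where power > mean + num_std * std."""
    p = np.asarray(power_watts, dtype=float)
    threshold = p.mean() + num_std * p.std()
    above = p > threshold
    # Pad with False so every run of True values has a rising and a falling edge
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))


# Synthetic trace: steady 5 W with a short 12 W burst in the middle
power = np.array([5.0] * 20 + [12.0] * 3 + [5.0] * 20)
hotspots = find_power_hotspots(power)
print(hotspots)
```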

In [None]:
# Create combined datasets for analysis
combined_power_data = []
combined_counter_data = []

for impl_key, data in power_data.items():
    # Add implementation identifier to power data
    for sample in data:
        sample_copy = sample.copy()
        sample_copy['implementation'] = impl_key
        combined_power_data.append(sample_copy)
    
    # Create simulated counter data based on implementation characteristics
    # In a real scenario, this would come from hardware performance counters
    if impl_key == 'naive':
        # Naive implementation has inefficient compute and memory patterns
        sm_activity = 95      # High shader-core (SM) utilization
        memory_utilization = 85   # High memory utilization
        cache_hit_rate = 40      # Poor cache utilization
        alu_util = 70            # Moderate ALU utilization
        thread_divergence = 20   # Some thread divergence
        instructions_per_cycle = 1.0  # Low IPC
    elif impl_key == 'divergent':
        # Divergent has high thread divergence, good memory access
        sm_activity = 90
        memory_utilization = 65
        cache_hit_rate = 70
        alu_util = 50            # Low ALU utilization due to divergence
        thread_divergence = 80   # High thread divergence
        instructions_per_cycle = 0.6  # Very low IPC
    elif impl_key == 'optimized':
        # Optimized has efficient compute and good memory patterns
        sm_activity = 85
        memory_utilization = 70
        cache_hit_rate = 80
        alu_util = 90            # High ALU utilization
        thread_divergence = 5    # Very little thread divergence
        instructions_per_cycle = 2.2  # Good IPC
    else:  # tiled
        # Tiled has excellent memory patterns, good compute
        sm_activity = 80
        memory_utilization = 50
        cache_hit_rate = 95      # Excellent cache utilization
        alu_util = 85            # Good ALU utilization
        thread_divergence = 5    # Very little thread divergence
        instructions_per_cycle = 2.8  # Excellent IPC
    
    # Add random variation to counters
    for i, sample in enumerate(data):
        # Create counter data with timestamp matching power data
        counter_sample = {
            'timestamp': sample['timestamp'],
            'implementation': impl_key,
            'sm_activity': sm_activity + np.random.normal(0, 3),
            'memory_utilization': memory_utilization + np.random.normal(0, 3),
            'cache_hit_rate': cache_hit_rate + np.random.normal(0, 2),
            'alu_utilization': alu_util + np.random.normal(0, 3),
            'thread_divergence': thread_divergence + np.random.normal(0, 2),
            'instructions_per_cycle': instructions_per_cycle + np.random.normal(0, 0.1),
            # Operations based on the benchmark results
            'operations': results[impl_key].get('operations', 0) / len(data)
        }
        combined_counter_data.append(counter_sample)

# Convert to DataFrames
power_df = pd.DataFrame(combined_power_data)
counter_df = pd.DataFrame(combined_counter_data)

# Let's focus on the inefficient implementations (naive and divergent)
naive_power = power_df[power_df['implementation'] == 'naive']
naive_counters = counter_df[counter_df['implementation'] == 'naive']

divergent_power = power_df[power_df['implementation'] == 'divergent']
divergent_counters = counter_df[counter_df['implementation'] == 'divergent']

# Identify hotspots and inefficient patterns
# For naive implementation
print("\nAnalyzing Naive Shader Implementation:")
naive_hotspots = identify_hotspots(naive_power, naive_counters)
naive_patterns = identify_inefficient_patterns(naive_counters, naive_power)
naive_dvfs = identify_dvfs_opportunities(naive_counters, naive_power)

if naive_hotspots.get('hotspots_found', False):
    print(f"  Found {naive_hotspots['count']} power hotspots")
    for i, period in enumerate(naive_hotspots.get('hotspot_periods', [])):
        print(f"  Hotspot {i+1}: Dominant component = {period['dominant_component']}")
else:
    print("  No significant power hotspots found.")

print("\n  Inefficient patterns:")
for pattern, details in naive_patterns.items():
    print(f"  - {pattern.replace('_', ' ').title()}: {details.get('recommendation', '')}")

# For divergent implementation
print("\nAnalyzing Divergent Shader Implementation:")
divergent_hotspots = identify_hotspots(divergent_power, divergent_counters)
divergent_patterns = identify_inefficient_patterns(divergent_counters, divergent_power)
divergent_dvfs = identify_dvfs_opportunities(divergent_counters, divergent_power)

if divergent_hotspots.get('hotspots_found', False):
    print(f"  Found {divergent_hotspots['count']} power hotspots")
    for i, period in enumerate(divergent_hotspots.get('hotspot_periods', [])):
        print(f"  Hotspot {i+1}: Dominant component = {period['dominant_component']}")
else:
    print("  No significant power hotspots found.")

print("\n  Inefficient patterns:")
for pattern, details in divergent_patterns.items():
    print(f"  - {pattern.replace('_', ' ').title()}: {details.get('recommendation', '')}")

## 9. Optimization Recommendations

Based on our analysis, let's generate specific recommendations for optimizing shader code for energy efficiency.

In [None]:
# Generate optimization recommendations for naive implementation
naive_recommendations = generate_optimization_recommendations(
    naive_hotspots, naive_patterns, naive_dvfs)

# Generate optimization recommendations for divergent implementation
divergent_recommendations = generate_optimization_recommendations(
    divergent_hotspots, divergent_patterns, divergent_dvfs)

# Display recommendations
print("Optimization Recommendations for Naive Shader Implementation:")
print("=======================================================")
for i, rec in enumerate(naive_recommendations):
    print(f"\n{i+1}. {rec['description']}")
    print(f"   Impact: {rec['estimated_impact']}")
    print(f"   Estimated energy savings: {rec.get('estimated_savings', 0)*100:.1f}%")

print("\n\nOptimization Recommendations for Divergent Shader Implementation:")
print("================================================================")
for i, rec in enumerate(divergent_recommendations):
    print(f"\n{i+1}. {rec['description']}")
    print(f"   Impact: {rec['estimated_impact']}")
    print(f"   Estimated energy savings: {rec.get('estimated_savings', 0)*100:.1f}%")

# Visualize the recommendations
from src.analysis.optimization import visualize_optimization_impact

# Create visualizations
plt.figure(figsize=(16, 12))

plt.subplot(2, 1, 1)
visualize_optimization_impact(naive_recommendations, figsize=(12, 5), max_recommendations=5)
plt.title('Optimization Impact for Naive Shader Implementation')

plt.subplot(2, 1, 2)
visualize_optimization_impact(divergent_recommendations, figsize=(12, 5), max_recommendations=5)
plt.title('Optimization Impact for Divergent Shader Implementation')

plt.tight_layout()
plt.savefig('./data/shader_study/optimization_impact.png', dpi=300)
plt.show()

## 10. Shader Optimization Techniques Analysis

Let's examine specific shader optimization techniques and their impact on energy efficiency.
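
When several techniques are applied together, their percentage reductions generally cannot just be summed. The sketch below assumes, as a simplification, that each technique removes its percentage of whatever energy remains after the others, so savings combine multiplicatively. `combined_energy_reduction` is a hypothetical helper.

```python
def combined_energy_reduction(reductions_pct):
    """Combine percentage reductions multiplicatively (assumes independent effects)."""
    remaining = 1.0
    for pct in reductions_pct:
        remaining *= 1.0 - pct / 100.0   # Each technique scales the energy that is left
    return (1.0 - remaining) * 100.0


example = [40, 25, 8]
combined = combined_energy_reduction(example)
print(f"combined: {combined:.1f}% (simple sum would be {sum(example)}%)")
```

Real techniques often overlap (for example, two memory optimizations targeting the same traffic), so even the multiplicative estimate may be optimistic.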

In [None]:
# Analyze optimization techniques and their impact
techniques = [
    {
        'name': 'Shared Memory Usage',
        'description': 'Using shared/tile memory for data reuse',
        'energy_reduction': 40,  # %
        'complexity': 'Medium',
        'applicable_to': ['naive', 'divergent', 'optimized'],
        'apple_specific': True,
        'category': 'memory'
    },
    {
        'name': 'Divergence Reduction',
        'description': 'Minimizing control flow divergence between threads',
        'energy_reduction': 35,  # %
        'complexity': 'Hard',
        'applicable_to': ['divergent'],
        'apple_specific': False,
        'category': 'compute'
    },
    {
        'name': 'Tiled Execution',
        'description': 'Processing data in cache-friendly tiles',
        'energy_reduction': 30,  # %
        'complexity': 'Medium',
        'applicable_to': ['naive', 'divergent'],
        'apple_specific': True,
        'category': 'memory'
    },
    {
        'name': 'Workgroup Size Optimization',
        'description': 'Tuning workgroup dimensions for hardware',
        'energy_reduction': 15,  # %
        'complexity': 'Low',
        'applicable_to': ['naive', 'divergent', 'optimized'],
        'apple_specific': True,
        'category': 'compute'
    },
    {
        'name': 'Loop Unrolling',
        'description': 'Manually unrolling loops to reduce branch overhead',
        'energy_reduction': 12,  # %
        'complexity': 'Low',
        'applicable_to': ['naive', 'divergent'],
        'apple_specific': False,
        'category': 'compute'
    },
    {
        'name': 'Memory Coalescing',
        'description': 'Ensuring memory accesses are coalesced for efficiency',
        'energy_reduction': 25,  # %
        'complexity': 'Medium',
        'applicable_to': ['naive'],
        'apple_specific': False,
        'category': 'memory'
    },
    {
        'name': 'Math Optimization',
        'description': 'Using specialized/fused math operations (mul-add, etc.)',
        'energy_reduction': 8,  # %
        'complexity': 'Low',
        'applicable_to': ['naive', 'divergent', 'optimized'],
        'apple_specific': False,
        'category': 'compute'
    },
    {
        'name': 'Warp/SIMD Utilization',
        'description': 'Ensuring full utilization of SIMD width',
        'energy_reduction': 20,  # %
        'complexity': 'Medium',
        'applicable_to': ['naive', 'divergent'],
        'apple_specific': False,
        'category': 'compute'
    },
    {
        'name': 'Register Pressure Reduction',
        'description': 'Minimizing register usage for better occupancy',
        'energy_reduction': 10,  # %
        'complexity': 'Hard',
        'applicable_to': ['naive', 'optimized'],
        'apple_specific': False,
        'category': 'compute'
    },
    {
        'name': 'Unified Memory Optimization',
        'description': 'Leveraging unified memory for zero-copy operations',
        'energy_reduction': 18,  # %
        'complexity': 'Low',
        'applicable_to': ['naive', 'divergent', 'optimized'],
        'apple_specific': True,
        'category': 'memory'
    }
]

# Create DataFrame for better analysis
techniques_df = pd.DataFrame(techniques)

# Analyze by category
category_summary = techniques_df.groupby('category')['energy_reduction'].agg(['mean', 'max', 'min', 'count'])
print("Optimization Techniques by Category:")
print(category_summary)

# Analyze Apple-specific optimizations
apple_summary = techniques_df.groupby('apple_specific')['energy_reduction'].agg(['mean', 'max', 'min', 'count'])
print("\nApple-Specific vs. General Optimizations:")
print(apple_summary)

# Sort techniques by energy reduction potential
techniques_df = techniques_df.sort_values('energy_reduction', ascending=False)

# Create visualization of optimization techniques
plt.figure(figsize=(14, 8))

# Create bar colors based on category
colors = techniques_df['category'].map({'memory': '#3498db', 'compute': '#e74c3c'})

# Create bars
bars = plt.bar(techniques_df['name'], techniques_df['energy_reduction'], color=colors)

# Add markers for Apple-specific techniques
for is_apple, bar in zip(techniques_df['apple_specific'], bars):
    if is_apple:
        plt.plot(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, '^', color='black', markersize=10)

# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height - 3,
            f'{height}%', ha='center', va='bottom', color='white', fontweight='bold')

# Add complexity ratings below the x-axis
for x, complexity in enumerate(techniques_df['complexity']):
    color = {'Low': 'green', 'Medium': 'orange', 'Hard': 'red'}[complexity]
    # annotation_clip=False keeps the label visible even though y=-3 lies outside the axes
    plt.annotate(complexity, xy=(x, -3), xytext=(0, -5), textcoords='offset points',
                ha='center', va='top', color=color, fontweight='bold',
                annotation_clip=False)

# Create legend
from matplotlib.patches import Patch
from matplotlib.lines import Line2D

legend_elements = [
    Patch(facecolor='#3498db', label='Memory Optimization'),
    Patch(facecolor='#e74c3c', label='Compute Optimization'),
    Line2D([0], [0], marker='^', color='w', markerfacecolor='black', markersize=10, label='Apple-Specific')
]

# Labels match the 'complexity' values used in the techniques list
complexity_elements = [
    Line2D([0], [0], color='green', lw=0, marker='s', markersize=10, label='Low'),
    Line2D([0], [0], color='orange', lw=0, marker='s', markersize=10, label='Medium'),
    Line2D([0], [0], color='red', lw=0, marker='s', markersize=10, label='Hard')
]

# Create two legends
legend1 = plt.legend(handles=legend_elements, loc='upper right')
plt.gca().add_artist(legend1)
plt.legend(handles=complexity_elements, loc='upper left', title='Complexity')

plt.title('Energy Reduction Potential of Shader Optimization Techniques')
plt.xlabel('Optimization Technique')
plt.ylabel('Energy Reduction Potential (%)')
plt.xticks(rotation=45, ha='right')
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.savefig('./data/shader_study/optimization_techniques.png', dpi=300, bbox_inches='tight')
plt.show()

## 11. Shader Optimization Guidelines for Apple GPUs

Based on our analysis, here are concrete guidelines for optimizing shader code for energy efficiency on Apple GPUs:

### Shader Energy Efficiency Guidelines for Apple GPUs

1. **Optimize for Tile-Based Architecture**
   - Structure computations to maximize work within tiles
   - Ensure render passes are designed for TBDR efficiency
   - Keep tile memory usage within hardware limits to avoid spilling

2. **Minimize Thread Divergence**
   - Avoid conditional code that causes threads to take different paths
   - Move conditionals outside of compute-intensive loops when possible
   - Use predication instead of branches for simple conditionals
   - Consider sorting data to reduce divergence in data-dependent branches

3. **Optimize Memory Access Patterns**
   - Ensure memory accesses are coalesced (sequential for adjacent threads)
   - Use shared/tile memory for data that will be reused
   - Structure algorithms around the memory hierarchy, not the other way around
   - Consider storage formats that match access patterns (AoS vs SoA)

4. **Leverage Apple-Specific Hardware Features**
   - Use unified memory to avoid redundant copies
   - Size threadgroups (workgroups) as multiples of the SIMD-group width, which is 32 threads on Apple GPUs and can be queried via `threadExecutionWidth`
   - Take advantage of Apple's Metal Performance Shaders when applicable
   - Consider explicit memory management with MTLHeaps for large resources

5. **Maximize ALU Efficiency**
   - Use built-in functions and fused operations (fma, etc.) when available
   - Unroll small loops to reduce branch overhead
   - Balance arithmetic intensity with memory operations
   - Consider precision requirements carefully (half precision when appropriate)

6. **Reduce Register Pressure**
   - Limit the number of live variables in complex functions
   - Consider breaking complex shaders into multiple passes
   - Be aware of implicit temp registers from complex expressions
   - Profile register usage with Metal shader profiling tools

7. **Optimize for Power States**
   - Group similar operations to allow for efficient power gating
   - Consider batching work to avoid frequent power state transitions
   - Be aware of the energy cost of waking up idle hardware
   - Check the system's Low Power Mode state (for example, `ProcessInfo.isLowPowerModeEnabled`) and scale back optional GPU work when it is enabled
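
Two of the guidelines above can be sketched on the CPU in NumPy: processing data in tiles, and rounding thread counts up to a multiple of the SIMD width. This is only an analogy for the loop structure a tiled shader would use, not Metal code; `tiled_matmul`, `round_up_to_multiple`, and the tile size of 32 are illustrative assumptions.

```python
import numpy as np


def tiled_matmul(a, b, tile=32):
    """Multiply a (n x k) by b (k x m) one tile at a time to mimic blocked data reuse."""
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    c = np.zeros((n, m), dtype=np.result_type(a, b))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each small block of a and b is reused for a whole output tile;
                # slices past the array edge are clipped automatically by NumPy
                c[i:i + tile, j:j + tile] += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
    return c


def round_up_to_multiple(count, width=32):
    """Round a thread count up to the next multiple of the SIMD width."""
    return ((count + width - 1) // width) * width


rng = np.random.default_rng(0)
a = rng.standard_normal((70, 50))
b = rng.standard_normal((50, 45))
print(np.allclose(tiled_matmul(a, b), a @ b))
print(round_up_to_multiple(1000))
```

On a GPU the inner tile would live in threadgroup or tile memory; here the tiling only changes the order of the arithmetic, not the result.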

## 12. Conclusion

This case study illustrates, using simulated benchmark results and power traces, how strongly shader implementation choices can affect GPU energy consumption. Key findings include:

1. **Tiled/cache-optimized implementations** provide the highest energy efficiency in this model, because they keep reused data in on-chip tile and threadgroup memory and generate the least memory traffic.

2. **Thread divergence** is a major source of energy inefficiency: the divergent variant is modeled with the longest execution time (1.8× the baseline), which raises the energy needed per task.

3. **Memory access patterns** have a larger impact on energy consumption than pure computational optimizations: the memory techniques in Section 10 average roughly 28% estimated energy reduction, versus roughly 17% for compute techniques.

4. **Apple-specific optimizations** (such as leveraging tile memory and unified memory) offer larger estimated savings than general shader optimizations, averaging roughly 26% versus 18% in Section 10.
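
The cost of thread divergence can be illustrated with a toy lockstep-execution model, in which a SIMD group must execute every branch path that at least one of its lanes takes. The function name and cycle costs below are hypothetical and are not derived from the simulation above.

```python
def simd_group_cycles(lane_takes_branch, then_cost, else_cost):
    """Cycles for one SIMD group to run an if/else under lockstep execution.

    Any path taken by at least one lane is executed by the whole group,
    with the lanes that did not take it masked off.
    """
    cycles = 0
    if any(lane_takes_branch):
        cycles += then_cost
    if not all(lane_takes_branch):
        cycles += else_cost
    return cycles


uniform = simd_group_cycles([True] * 32, then_cost=10, else_cost=10)
divergent = simd_group_cycles([lane % 2 == 0 for lane in range(32)], then_cost=10, else_cost=10)
print(f"uniform: {uniform} cycles, divergent: {divergent} cycles")
```

In this model a single lane taking the other path is enough to double the cost, which is why sorting or grouping data so that neighboring threads branch the same way can help.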

By applying the optimization guidelines detailed in this study, developers can improve the energy efficiency of their shader code on Apple GPUs, leading to better performance per watt and longer battery life on Apple devices.