# Comparing Clustering Configurations

This notebook compares different clustering configurations to find the optimal trade-off
between accuracy and computational speed.

We compare:

- **Number of clusters**: How many typical periods are needed?
- **Inner-period segmentation**: Can we reduce timesteps within each cluster?
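To make the two levers concrete, the sketch below counts the timesteps the optimizer sees under each approach. It assumes hourly data and one-day periods; `reduced_timesteps` is a hypothetical helper written for illustration and is not part of flixopt or tsam.

```python
def reduced_timesteps(n_clusters, hours_per_period=24, n_segments=None):
    """Timesteps in the reduced problem (assumes hourly input data)."""
    # Without segmentation every typical period keeps all of its hours;
    # with segmentation each period is represented by n_segments steps.
    steps_per_period = hours_per_period if n_segments is None else n_segments
    return n_clusters * steps_per_period


full_year = 365 * 24  # hourly timesteps in a non-leap year

print('full year:          ', full_year)
print('8 clusters:         ', reduced_timesteps(8))
print('16 clusters x 6 seg:', reduced_timesteps(16, n_segments=6))
```

Clustering shrinks the number of periods, while segmentation shrinks the number of steps inside each period, so the two reductions multiply.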

!!! note "Requirements"
    This notebook requires the `tsam` package: `pip install tsam`

In [None]:
import timeit

import pandas as pd

import flixopt as fx

fx.CONFIG.notebook()

## Setup

District heating system with hourly data for one quarter of a year (`duration='quarter'`), roughly a quarter of the 8760 timesteps in a full year:

In [None]:
from data.generate_example_systems import create_district_heating_system

flow_system = create_district_heating_system(duration='quarter')
flow_system.connect_and_transform()

solver = fx.solvers.HighsSolver(mip_gap=0.01)
peak_series = ['HeatDemand(Q_th)|fixed_relative_profile']

flow_system

## Run Optimizations

Compare full resolution, different cluster counts, and segmentation:

In [None]:
results = {}

# Full resolution baseline
start = timeit.default_timer()
fs_full = flow_system.copy()
fs_full.name = 'Full'
fs_full.optimize(solver)
results['Full'] = {'fs': fs_full, 'time': timeit.default_timer() - start, 'timesteps': len(flow_system.timesteps)}

# Different cluster counts
for n_clusters in [4, 8, 12]:
    start = timeit.default_timer()
    fs = flow_system.transform.cluster(
        n_clusters=n_clusters,
        cluster_duration='1D',
        time_series_for_high_peaks=peak_series,
    )
    fs.name = f'{n_clusters} clusters'
    fs.optimize(solver)
    results[f'{n_clusters} clusters'] = {'fs': fs, 'time': timeit.default_timer() - start, 'timesteps': n_clusters * 24}

# Segmentation (16 clusters with 6 segments each)
start = timeit.default_timer()
fs_seg = flow_system.transform.cluster(
    n_clusters=16,
    cluster_duration='1D',
    n_segments=6,
    time_series_for_high_peaks=peak_series,
)
fs_seg.name = '16x6 segmented'
fs_seg.optimize(solver)
results['16x6 segmented'] = {'fs': fs_seg, 'time': timeit.default_timer() - start, 'timesteps': 16 * 6}
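The timing pattern in the cell above repeats for every configuration, and for the clustered runs the measured time covers both clustering and optimization. One way to factor it out is a small context manager. This is a sketch only: `timed` and `timings` are hypothetical names, and the summation merely stands in for the real cluster-and-optimize work.

```python
import timeit
from contextlib import contextmanager


@contextmanager
def timed(store, name):
    """Record the wall-clock duration of the with-block in store[name]['time']."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        # Runs even if the block raises, so a failed run still gets a timing
        store.setdefault(name, {})['time'] = timeit.default_timer() - start


timings = {}
with timed(timings, 'demo'):
    sum(i * i for i in range(100_000))  # stand-in for cluster + optimize

print(f"demo took {timings['demo']['time']:.4f} s")
```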

## Summary Table

In [None]:
baseline_cost = results['Full']['fs'].solution['costs'].item()
baseline_time = results['Full']['time']

summary = pd.DataFrame(
    {
        name: {
            'Timesteps': r['timesteps'],
            'Time [s]': r['time'],
            'Cost [EUR]': r['fs'].solution['costs'].item(),
            'Cost Gap [%]': (r['fs'].solution['costs'].item() - baseline_cost) / max(abs(baseline_cost), 1) * 100,
            'CHP [kW]': r['fs'].statistics.sizes['CHP(Q_th)'].item(),
            'Storage [kWh]': r['fs'].statistics.sizes['Storage'].item(),
            'Speedup': baseline_time / r['time'],
        }
        for name, r in results.items()
    }
).T

summary.style.format(
    {
        'Timesteps': '{:.0f}',
        'Time [s]': '{:.2f}',
        'Cost [EUR]': '{:.0f}',
        'Cost Gap [%]': '{:+.1f}',
        'CHP [kW]': '{:.1f}',
        'Storage [kWh]': '{:.0f}',
        'Speedup': '{:.1f}x',
    }
)
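The two derived columns are simple ratios. The sketch below restates them as standalone functions with made-up numbers; the costs and times are hypothetical and are not results from this notebook.

```python
def cost_gap_percent(cost, baseline_cost):
    """Relative cost deviation from the baseline, in percent."""
    # max(..., 1) guards against division by zero when the baseline is ~0
    return (cost - baseline_cost) / max(abs(baseline_cost), 1) * 100


def speedup(baseline_time, run_time):
    """How many times faster a run is than the baseline."""
    return baseline_time / run_time


# Hypothetical values for illustration only
print(f'{cost_gap_percent(1_070_000, 1_000_000):+.1f} %')
print(f'{speedup(60.0, 12.0):.1f}x')
```

A positive gap means the reduced model reports a higher cost than the full-resolution baseline; a negative gap means it reports a lower one.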

## Expand Solutions to Full Resolution

Before comparing time series, expand all clustered solutions back to the original timesteps:

In [None]:
# Expand all clustered/segmented solutions
expanded = {
    'Full': results['Full']['fs'],
    '4 clusters': results['4 clusters']['fs'].transform.expand(),
    '8 clusters': results['8 clusters']['fs'].transform.expand(),
    '12 clusters': results['12 clusters']['fs'].transform.expand(),
    '16x6 segmented': results['16x6 segmented']['fs'].transform.expand(),
}

# Rename for clarity
for name, fs in expanded.items():
    fs.name = name
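Conceptually, expansion replaces every original period with the values of the typical period it was assigned to. The sketch below shows that idea on plain lists. It is not flixopt's implementation, and `cluster_order` and `representatives` are hypothetical names chosen for this example.

```python
def expand(cluster_order, representatives):
    """Rebuild a full-length series from per-cluster representative profiles."""
    expanded = []
    for cluster_id in cluster_order:
        # Each original period is filled with its cluster's typical profile
        expanded.extend(representatives[cluster_id])
    return expanded


# Two typical periods with 3 timesteps each (kept short for readability)
representatives = {0: [1.0, 2.0, 1.5], 1: [3.0, 4.0, 3.5]}
# Assignment of five original periods to the two clusters
cluster_order = [0, 0, 1, 0, 1]

full = expand(cluster_order, representatives)
print(len(full), full)
```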

## Compare Component Sizes

In [None]:
comparison = fx.Comparison(list(expanded.values()))
comparison.statistics.sizes

In [None]:
comparison.statistics.plot.sizes(color='case')

## Compare Heat Production

Visualize the CHP heat flow rate across all configurations, followed by the heat demand profile each configuration sees:

In [None]:
comparison.solution['CHP(Q_th)|flow_rate'].fxplot.heatmap(title='Heat Production by Configuration')

In [None]:
comparison.inputs['HeatDemand(Q_th)|fixed_relative_profile'].fxplot.line(
    title='Heat Demand by Configuration', colors='viridis'
)

## Compare Storage Operation

In [None]:
comparison.solution['Storage|charge_state'].fxplot.line(color='case', title='Storage State of Charge')

In [None]:
comparison.statistics.plot.storage('Storage').data.sum('time').to_pandas()

## Clustering Quality Metrics

RMSE and MAE measure how closely the clustered time series reproduce the originals (lower is better). The table below shows RMSE:

In [None]:
# Collect metrics from clustered systems
metrics_list = []
for name in ['4 clusters', '8 clusters', '12 clusters']:
    fs = results[name]['fs']
    df = fs.clustering.metrics.to_dataframe()
    df['Config'] = name
    metrics_list.append(df)

metrics_df = pd.concat(metrics_list)
metrics_df.index.name = 'Time Series'
metrics_df = metrics_df.reset_index()

# Pivot for display
metrics_df.pivot(index='Time Series', columns='Config', values='RMSE').style.format('{:.4f}').background_gradient(
    cmap='RdYlGn_r', axis=1
)
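For reference, the two metrics can be computed by hand as below. This is a plain-Python sketch using the standard definitions; the library may normalize the series before computing its metrics, which this example does not do.

```python
import math


def rmse(original, reconstructed):
    """Root mean squared error: penalizes large deviations more strongly."""
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return math.sqrt(sum(squared) / len(squared))


def mae(original, reconstructed):
    """Mean absolute error: average size of the deviation."""
    absolute = [abs(o - r) for o, r in zip(original, reconstructed)]
    return sum(absolute) / len(absolute)


original = [1, 2, 3, 4]
reconstructed = [1, 2, 3, 8]  # one large miss on the last value

print('RMSE:', rmse(original, reconstructed))
print('MAE: ', mae(original, reconstructed))
```

A single large miss raises RMSE more than MAE, so a series with a high RMSE but a moderate MAE usually has a few badly reproduced peaks rather than a uniform offset.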

## Visualize Clustering Structure

In [None]:
results['8 clusters']['fs'].clustering.plot.compare(kind='duration_curve')

In [None]:
results['8 clusters']['fs'].clustering.plot.heatmap()
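A duration curve is the same series sorted from highest to lowest value, which discards timing and shows only how often each level occurs. The sketch below builds one for a made-up original series and a made-up clustered approximation; both lists are hypothetical.

```python
def duration_curve(values):
    """Sort values in descending order, discarding their position in time."""
    return sorted(values, reverse=True)


original = [3, 9, 5, 7, 4, 8]
clustered = [4, 8, 4, 8, 4, 8]  # smoother approximation of the same series

print('original: ', duration_curve(original))
print('clustered:', duration_curve(clustered))
```

Where the two curves diverge at the left end, the clustered series underestimates peaks; divergence at the right end indicates missed low-load hours.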

## Segmentation: Variable Segment Durations

Segmentation creates variable-length segments that adapt to time series patterns:

In [None]:
fs_seg = results['16x6 segmented']['fs']

# Show segment durations (hours per segment per cluster)
fs_seg.timestep_duration.to_pandas().style.format('{:.0f}').background_gradient(cmap='Blues', axis=None)

In [None]:
# Visualize segment durations
fs_seg.timestep_duration.fxplot.bar(facet_col='cluster', facet_col_wrap=4, title='Segment Durations per Cluster')
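To illustrate how variable-length segments can arise, the sketch below repeatedly merges the two adjacent timesteps whose values are most similar until the requested number of segments remains. This is a simplified greedy variant written for this example, not necessarily the algorithm tsam uses, and the 24-hour `profile` is invented.

```python
def segment_profile(values, n_segments):
    """Greedily merge similar neighbours; returns a list of [mean, duration]."""
    segments = [[float(v), 1] for v in values]  # start with one segment per step
    while len(segments) > n_segments:
        # Pick the adjacent pair whose mean values differ the least
        i = min(
            range(len(segments) - 1),
            key=lambda k: abs(segments[k][0] - segments[k + 1][0]),
        )
        (m1, d1), (m2, d2) = segments[i], segments[i + 1]
        # The duration-weighted mean keeps the total energy unchanged
        merged = [(m1 * d1 + m2 * d2) / (d1 + d2), d1 + d2]
        segments[i : i + 2] = [merged]
    return segments


# Hypothetical hourly heat demand for one day
profile = [20, 20, 20, 20, 21, 25, 40, 60, 65, 60, 50, 45,
           45, 45, 50, 55, 70, 80, 75, 60, 45, 30, 25, 20]

segments = segment_profile(profile, n_segments=6)
for mean, duration in segments:
    print(f'{duration:2d} h at {mean:5.1f}')
```

Flat stretches such as the night hours collapse into long segments, while steep ramps keep short ones, which is why the segment durations differ within a cluster.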

## Recommendations

Based on this comparison:

1. **8 clusters** provide good accuracy (~7% cost gap) with about a 5x speedup
2. **Segmentation** further reduces the number of timesteps, with an acceptable loss of accuracy
3. **4 clusters** may miss demand patterns, leading to undersized or oversized components

### When to use segmentation:

- Large problems where even clustered optimization is slow
- Preliminary design studies where speed matters more than precision
- Sensitivity analyses requiring many optimization runs

### Best practice:

- Always use `time_series_for_high_peaks` to capture extreme demand days
- Use `transform.expand()` to map clustered results back to the original timesteps and validate them at full resolution
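The reason peak preservation matters can be seen on a toy example: averaging several days into one typical day flattens the extreme day, so components sized on the averaged profile may be too small. The sketch below is an illustration of that effect with invented numbers, not a description of how `time_series_for_high_peaks` is implemented.

```python
def mean_profile(group):
    """Element-wise mean of several equally long daily profiles."""
    return [sum(step) / len(group) for step in zip(*group)]


# Hypothetical daily demand profiles (4 timesteps each for brevity)
days = [
    [10, 20, 30, 20],  # ordinary day
    [12, 22, 28, 18],  # ordinary day
    [15, 40, 90, 35],  # extreme day
]

true_peak = max(max(day) for day in days)

# Naive: a single typical day averaged over everything
averaged = mean_profile(days)

# Peak-aware: keep the extreme day as its own representative
peak_day = max(days, key=max)
ordinary = [day for day in days if day is not peak_day]
representatives = [mean_profile(ordinary), peak_day]

print('true peak:          ', true_peak)
print('peak after averaging:', round(max(averaged), 1))
print('peak with extra day: ', max(max(r) for r in representatives))
```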