# Clustering Internals: Weights, TSAM, and Cost Scaling

A deep dive into how time series clustering works under the hood.

This notebook covers:

- **Cluster weights**: How operational costs are scaled to represent the full time horizon
- **TSAM integration**: How the Time Series Aggregation Module performs clustering
- **Typical periods**: Visualizing representative vs original time series
- **Storage handling**: Inter-period linking and cyclic constraints
- **The `_aggregation_info` structure**: Internal data for expansion and analysis

!!! note "Prerequisites"
    This notebook assumes familiarity with [08c-clustering](08c-clustering.ipynb).

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import flixopt as fx

fx.CONFIG.notebook()

In [None]:
# Load the district heating system
data_file = Path('data/district_heating_system.nc4')
if not data_file.exists():
    from data.generate_example_systems import create_district_heating_system

    fs = create_district_heating_system()
    fs.to_netcdf(data_file)

flow_system = fx.FlowSystem.from_netcdf(data_file)
print(f'Loaded: {len(flow_system.timesteps)} timesteps ({len(flow_system.timesteps) / 96:.0f} days)')

In [None]:
# Create a clustered system for analysis
fs_clustered = flow_system.transform.aggregate(
    method='tsam',
    n_representatives=8,
    cluster_duration='1D',
    time_series_for_high_peaks=['HeatDemand(Q_th)|fixed_relative_profile'],
)

print(f'Clustered: {len(fs_clustered.timesteps)} timesteps')

## 1. The `_aggregation_info` Structure

After clustering, the FlowSystem stores metadata in `_aggregation_info` that enables:
- Expanding solutions back to full resolution
- Understanding which original days map to which clusters
- Weighting costs correctly in the objective function

In [None]:
info = fs_clustered._aggregation_info

print('AggregationInfo structure:')
print(f'  backend_name: {info.backend_name}')
print(f'  storage_inter_cluster_linking: {info.storage_inter_cluster_linking}')
print(f'  storage_cyclic: {info.storage_cyclic}')

cs = info.result.cluster_structure
print('\nClusterStructure:')
print(f'  n_clusters: {cs.n_clusters}')
print(f'  timesteps_per_cluster: {cs.timesteps_per_cluster}')
print(f'  cluster_order shape: {cs.cluster_order.shape}')
print(f'  cluster_occurrences: {dict(cs.cluster_occurrences)}')

### Cluster Order: Mapping Days to Clusters

The `cluster_order` array shows which cluster each original day belongs to:

In [None]:
info = fs_clustered._aggregation_info
cs = info.result.cluster_structure
cluster_order = cs.cluster_order.values
n_original_days = len(cluster_order)

# Create a DataFrame for visualization
days_df = pd.DataFrame(
    {
        'Day': range(1, n_original_days + 1),
        'Cluster': cluster_order,
        'Date': pd.date_range('2020-01-01', periods=n_original_days, freq='D'),
    }
)
days_df['Weekday'] = days_df['Date'].dt.day_name()

print(f'Original days: {n_original_days}')
print(f'Number of clusters: {cs.n_clusters}')
print('\nFirst 14 days:')
print(days_df.head(14).to_string(index=False))

In [None]:
# Visualize cluster assignment as a color-coded strip (one bar per day)
fig = px.bar(
    days_df,
    x='Day',
    y=[1] * len(days_df),
    color='Cluster',
    color_continuous_scale='Viridis',
    title='Cluster Assignment by Day',
    labels={'y': ''},
)
fig.update_layout(height=250, yaxis_visible=False, coloraxis_colorbar_title='Cluster')
fig.show()
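
The mapping stored in `cluster_order` is what makes it possible to expand results back to full resolution. As a minimal sketch of the idea (toy data and plain NumPy, not flixopt's actual expansion code), each original day simply looks up the values of the typical day it was assigned to. The array names and numbers below are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 6 original days, 3 clusters, 4 timesteps per day.
cluster_order = np.array([0, 1, 0, 2, 1, 0])

# One row per typical day (cluster), one column per timestep within the day.
typical_values = np.array([
    [1.0, 2.0, 3.0, 2.0],  # cluster 0
    [5.0, 6.0, 7.0, 6.0],  # cluster 1
    [9.0, 9.5, 9.0, 8.5],  # cluster 2
])

# Indexing with cluster_order selects the typical day for each original day;
# flattening yields one value per original timestep.
expanded = typical_values[cluster_order].reshape(-1)

print(expanded.shape)  # (24,)
```

Days 1, 3, and 6 all belong to cluster 0 here, so their expanded values are identical. That is the information lost by clustering.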

## 2. Cluster Weights: Scaling Operational Costs

When we optimize over 8 typical days instead of all 31 original days, the operational costs of each
typical day must be **scaled** by the number of original days it stands for.

### The `cluster_weight` Property

The clustered FlowSystem has a `cluster_weight` that stores the weight for each timestep:

In [None]:
# The cluster_weight is stored on the FlowSystem
print('cluster_weight structure:')
print(fs_clustered.cluster_weight)
print(f'\nShape: {fs_clustered.cluster_weight.shape}')
print(f'Sum of weights: {fs_clustered.cluster_weight.sum().item():.0f}')
print(f'Expected (original timesteps): {len(flow_system.timesteps)}')

In [None]:
# Cluster occurrences (how many original days each cluster represents)
info = fs_clustered._aggregation_info
cs = info.result.cluster_structure
cluster_occurrences = dict(cs.cluster_occurrences)

print('Cluster occurrences (days represented by each typical day):')
for cluster_id, count in sorted(cluster_occurrences.items()):
    print(f'  Cluster {cluster_id}: {count} days (weight = {count})')

print(f'\nTotal: {sum(cluster_occurrences.values())} days')
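
The relationship between `cluster_order`, the occurrence counts, and the per-timestep weights can be reproduced with plain NumPy. This is a sketch under two assumptions: the toy `cluster_order` below is made up, and the reduced time axis is assumed to list the typical days in cluster-ID order.

```python
import numpy as np

cluster_order = np.array([0, 1, 0, 2, 1, 0, 0])  # hypothetical: 7 days, 3 clusters
timesteps_per_cluster = 96  # 15-minute resolution

# Count how many original days fall into each cluster.
occurrences = np.bincount(cluster_order)  # -> [4, 2, 1]

# Every timestep of a typical day carries that day's occurrence count.
cluster_weight = np.repeat(occurrences, timesteps_per_cluster)

# Weight conservation: the weights sum to the number of original timesteps.
n_original_timesteps = len(cluster_order) * timesteps_per_cluster
print(cluster_weight.sum(), n_original_timesteps)  # 672 672
```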

In [None]:
# Visualize weights across the reduced timesteps
info = fs_clustered._aggregation_info
cs = info.result.cluster_structure
weights = fs_clustered.cluster_weight.values
timesteps_per_day = cs.timesteps_per_cluster

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=list(range(len(weights))),
        y=weights,
        mode='lines',
        name='Cluster Weight',
        line=dict(width=1),
    )
)

# Add vertical lines at day boundaries
for i in range(1, cs.n_clusters):
    fig.add_vline(x=i * timesteps_per_day, line_dash='dash', line_color='gray', opacity=0.5)

fig.update_layout(
    height=300,
    title='Cluster Weight per Timestep (Each Typical Day Has Uniform Weight)',
    xaxis_title='Timestep Index',
    yaxis_title='Weight',
)
fig.show()

### How Weights Affect the Objective Function

The objective function multiplies operational costs by the cluster weight:

$$\text{Objective} = \sum_{t \in \text{typical}} w_t \cdot c_t$$

Where:
- $w_t$ = cluster weight for timestep $t$ (= number of original days represented by the typical day that contains $t$)
- $c_t$ = operational cost at timestep $t$

This ensures that a typical day representing 7 similar days contributes 7 times as much to the objective
as a typical day with the same costs that represents only 1 day (e.g., a peak day).
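
A small numeric example makes the weighting concrete. The costs and occurrence counts below are invented for illustration. The sketch checks that the weighted sum over typical days equals the sum obtained by repeating each typical day as often as it occurs.

```python
import numpy as np

# Hypothetical per-timestep costs for 2 typical days with 3 timesteps each.
costs_typical = np.array([
    [10.0, 12.0, 11.0],  # cluster 0
    [30.0, 35.0, 32.0],  # cluster 1 (e.g. a peak day)
])
occurrences = np.array([7, 1])  # cluster 0 represents 7 days, cluster 1 one day

# Weighted objective: sum over typical timesteps of w_t * c_t.
weights = occurrences[:, None]  # broadcast each cluster's weight over its timesteps
objective = float((weights * costs_typical).sum())

# Reference: repeat each typical day as often as it occurs and sum directly.
full_horizon = np.repeat(costs_typical, occurrences, axis=0)

print(objective, full_horizon.sum())  # 328.0 328.0
```

Cluster 0 contributes 7 × 33 = 231 and cluster 1 contributes 1 × 97 = 97, so the regular days dominate the objective even though the peak day is more expensive per day.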

In [None]:
# Demonstrate how weights are applied (conceptually)
solver = fx.solvers.HighsSolver(mip_gap=0.01, log_to_console=False)
fs_clustered.optimize(solver)

# The 'costs' solution is already weighted
total_cost = fs_clustered.solution['costs'].item()

# We can also access the per-timestep costs
costs_per_timestep = fs_clustered.solution['costs(temporal)|per_timestep']

print(f'Total cost (weighted): {total_cost:,.0f} €')
print(f'\nCosts per timestep shape: {costs_per_timestep.shape}')
print(f'Sum of weighted costs: {(costs_per_timestep * fs_clustered.cluster_weight).sum().item():,.0f} €')

## 3. TSAM Integration: The Clustering Algorithm

flixopt uses the [TSAM](https://github.com/FZJ-IEK3-VSA/tsam) (Time Series Aggregation Module)
package for clustering. TSAM groups similar time periods with a clustering algorithm such as
k-means, k-medoids, or hierarchical clustering.
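
To illustrate what grouping similar periods means, here is a minimal k-means sketch in plain NumPy. It is not TSAM's implementation: the function `kmeans_days`, the deterministic farthest-point initialisation, and the synthetic "cold" and "mild" days are all assumptions made for this example.

```python
import numpy as np

def kmeans_days(profiles, n_clusters, n_iter=50):
    """Plain Lloyd's algorithm on the rows of `profiles` (one row per day)."""
    # Deterministic initialisation: start with day 0, then repeatedly add the
    # day that is farthest from all centroids chosen so far.
    centroids = [profiles[0]]
    while len(centroids) < n_clusters:
        nearest = np.min([np.linalg.norm(profiles - c, axis=1) for c in centroids], axis=0)
        centroids.append(profiles[nearest.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assign each day to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(profiles[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its days (keep it if the cluster is empty).
        updated = np.array([
            profiles[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Synthetic data: 10 "cold" days and 10 "mild" days, 24 timesteps each.
rng = np.random.default_rng(42)
shape = np.sin(np.linspace(0, 2 * np.pi, 24))
cold = 10 + shape + 0.1 * rng.standard_normal((10, 24))
mild = 4 + shape + 0.1 * rng.standard_normal((10, 24))
profiles = np.vstack([cold, mild])

labels, centroids = kmeans_days(profiles, n_clusters=2)
print(np.bincount(labels))  # days per cluster
```

The returned `labels` play the role of `cluster_order`, and the rows of `centroids` correspond to the typical periods.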

### The Clustering Object

In [None]:
# Access the TSAM clustering object
clustering = info['clustering']

print(f'Clustering type: {type(clustering).__name__}')
print(f'\nTSAM aggregation object: {type(clustering.tsam).__name__}')

In [None]:
# The TSAM object contains the clustering results
tsam = clustering.tsam

print('TSAM typical periods (centroids):')
print(tsam.typicalPeriods.head(10))

In [None]:
# Normalized original profiles that TSAM used as clustering input
print('\nOriginal time series used for clustering:')
print(f'Shape: {tsam.normalizedPeriodlyProfiles.shape}')
print(f'Columns: {list(tsam.normalizedPeriodlyProfiles.columns)}')

### Visualizing Typical Periods vs Original Data

In [None]:
# Get heat demand from original and clustered systems
original_demand = flow_system.components['HeatDemand'].inputs[0].fixed_relative_profile.values
clustered_demand = fs_clustered.components['HeatDemand'].inputs[0].fixed_relative_profile.values

# Reshape original demand into days
timesteps_per_day = 96  # 15-minute resolution
n_days = len(original_demand) // timesteps_per_day
original_by_day = original_demand[: n_days * timesteps_per_day].reshape(n_days, timesteps_per_day)

# Create subplots
fig = make_subplots(
    rows=2,
    cols=1,
    subplot_titles=[f'Original: All {n_days} Days', f'Clustered: {cs.n_clusters} Typical Days'],
    vertical_spacing=0.15,
)

# Plot all original days (faded)
hours = np.arange(timesteps_per_day) / 4  # Convert to hours
for day in range(n_days):
    fig.add_trace(
        go.Scatter(
            x=hours,
            y=original_by_day[day],
            mode='lines',
            line=dict(width=0.5, color='lightblue'),
            showlegend=False,
            hoverinfo='skip',
        ),
        row=1,
        col=1,
    )

# Plot typical days (bold colors)
colors = px.colors.qualitative.Set1
n_clusters = cs.n_clusters
clustered_by_day = clustered_demand.reshape(n_clusters, timesteps_per_day)

for cluster_id in range(n_clusters):
    weight = cluster_occurrences.get(cluster_id, cluster_occurrences.get(np.int32(cluster_id), 1))
    fig.add_trace(
        go.Scatter(
            x=hours,
            y=clustered_by_day[cluster_id],
            mode='lines',
            name=f'Cluster {cluster_id} (×{weight})',
            line=dict(width=2, color=colors[cluster_id % len(colors)]),
        ),
        row=2,
        col=1,
    )

fig.update_layout(height=600, title='Heat Demand: Original vs Typical Days')
fig.update_xaxes(title_text='Hour of Day', row=2, col=1)
fig.update_yaxes(title_text='MW', row=1, col=1)
fig.update_yaxes(title_text='MW', row=2, col=1)
fig.show()

## 4. Storage Handling in Clustering

Storage behavior across typical periods requires special handling:

### Cyclic Constraint (`storage_cyclic=True`)

When enabled (default), the storage state at the end of each typical period must equal 
the state at the beginning. This prevents the optimizer from "cheating" by starting 
with a full storage and ending empty.
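
The cyclic condition can be checked on any charge trajectory. The following sketch uses a hypothetical helper `charge_trajectory` and ignores losses and efficiencies. It only shows that the end state equals the start state exactly when the net charging over the period sums to zero.

```python
import numpy as np

def charge_trajectory(initial_charge, net_charging, dt_hours=0.25):
    """Charge state (MWh) at n+1 points from n net charging powers (MW)."""
    return initial_charge + np.concatenate([[0.0], np.cumsum(net_charging * dt_hours)])

# Hypothetical net charging power over one typical period (charging > 0).
balanced = np.array([4.0, 4.0, -2.0, -6.0])    # sums to zero
depleting = np.array([0.0, -4.0, -4.0, -8.0])  # drains the storage

for name, flows in [('balanced', balanced), ('depleting', depleting)]:
    soc = charge_trajectory(initial_charge=10.0, net_charging=flows)
    is_cyclic = bool(np.isclose(soc[0], soc[-1]))
    print(f'{name}: start={soc[0]:.1f} MWh, end={soc[-1]:.1f} MWh, cyclic={is_cyclic}')
```

Without the constraint, the "depleting" pattern would look cheap to the optimizer because the 4 MWh it takes from the storage are never paid for.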

### Inter-Period Linking

The `storage_inter_cluster_linking` option controls whether storage states are linked
across typical periods to simulate long-term storage behavior.
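
One common way to think about such linking is to let each typical day have a net change in charge state and to chain these changes along the original sequence of days. The sketch below illustrates that idea only; it is not flixopt's actual constraint formulation, and all numbers are invented.

```python
import numpy as np

cluster_order = np.array([0, 0, 1, 0, 2, 1])  # hypothetical: 6 original days

# Hypothetical net change in charge state (MWh) over each typical day.
delta_per_cluster = np.array([2.0, -3.0, -1.0])

initial_charge = 5.0

# Look up the net change for each original day, then accumulate.
delta_per_day = delta_per_cluster[cluster_order]
soc_at_boundaries = initial_charge + np.concatenate([[0.0], np.cumsum(delta_per_day)])

print(soc_at_boundaries)  # charge at the start of each day, plus the final state
```

In a real model, the values in `soc_at_boundaries` would additionally have to respect the storage capacity limits, which is what makes linked storage more expensive to solve.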

In [None]:
print('Storage settings:')
print(f'  storage_cyclic: {info.storage_cyclic}')
print(f'  storage_inter_cluster_linking: {info.storage_inter_cluster_linking}')

# Show storage charge state in clustered solution
charge_state = fs_clustered.solution['Storage|charge_state']
print(f'\nCharge state shape: {charge_state.shape}')
print(f'Initial charge: {charge_state.values[0]:.1f} MWh')
print(f'Final charge: {charge_state.values[-1]:.1f} MWh')

In [None]:
# Visualize storage behavior across typical periods
fig = go.Figure()

timesteps_per_day = cs.timesteps_per_cluster
charge_values = charge_state.values

# Plot each typical day's storage trajectory
colors = px.colors.qualitative.Set1
for cluster_id in range(cs.n_clusters):
    start_idx = cluster_id * timesteps_per_day
    end_idx = start_idx + timesteps_per_day + 1  # Include endpoint

    if end_idx <= len(charge_values):
        hours = np.arange(timesteps_per_day + 1) / 4
        weight = cluster_occurrences.get(cluster_id, cluster_occurrences.get(np.int32(cluster_id), 1))

        fig.add_trace(
            go.Scatter(
                x=hours,
                y=charge_values[start_idx:end_idx],
                mode='lines',
                name=f'Cluster {cluster_id} (×{weight})',
                line=dict(width=2, color=colors[cluster_id % len(colors)]),
            )
        )

fig.update_layout(
    height=400,
    title='Storage Charge State by Typical Period (Cyclic: Start = End)',
    xaxis_title='Hour of Day',
    yaxis_title='Charge State [MWh]',
)
fig.show()

## 5. The `weights` Property: Unified Access

The FlowSystem provides a unified `weights` property that combines all weighting factors
(aggregation weights, scenario weights, period weights) into a single xarray structure:

In [None]:
# The weights property provides unified access
weights = fs_clustered.weights

print('FlowSystem weights structure:')
print(f'  Type: {type(weights).__name__}')
print(f'  temporal: {weights.temporal}')
print(f'  aggregation_weight: {weights.aggregation_weight}')

In [None]:
# Compare weights for original vs clustered systems
print('Original system weights:')
print(f'  temporal: {flow_system.weights.temporal}')
print(f'  aggregation_weight: {flow_system.weights.aggregation_weight}')

print('\nClustered system weights:')
print(f'  temporal: {fs_clustered.weights.temporal}')
print(f'  aggregation_weight (cluster_weight): sum = {fs_clustered.weights.aggregation_weight.sum().item():.0f}')

## 6. Time Series Weights in Clustering

You can influence which time series are prioritized during clustering using the `weights` parameter.
By default, all time series are weighted equally, but you may want to:

- Give higher weight to demand profiles (more important to capture accurately)
- Give lower weight to price signals (less critical for sizing)
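
The effect of such weights can be seen in the distance between two days. The sketch below assumes that a weight scales the (normalized) series before the Euclidean distance is computed, which is one common way to apply weights in clustering. The helper `weighted_distance` and the data are hypothetical.

```python
import numpy as np

# Two hypothetical days described by two normalized series (3 timesteps each).
day_a = {'demand': np.array([0.2, 0.8, 0.5]), 'price': np.array([0.1, 0.1, 0.9])}
day_b = {'demand': np.array([0.3, 0.7, 0.5]), 'price': np.array([0.9, 0.9, 0.1])}

def weighted_distance(a, b, weights):
    """Euclidean distance after scaling each series by its weight."""
    squared = sum(np.sum((weights[k] * (a[k] - b[k])) ** 2) for k in a)
    return float(np.sqrt(squared))

equal = weighted_distance(day_a, day_b, {'demand': 1.0, 'price': 1.0})
demand_focused = weighted_distance(day_a, day_b, {'demand': 1.0, 'price': 0.1})

print(f'equal weights: {equal:.3f}, price down-weighted: {demand_focused:.3f}')
```

With equal weights the two days look very different because of their price profiles. Once the price is down-weighted, their similar demand dominates and they are much more likely to end up in the same cluster.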

### Automatic Weight Calculation

flixopt automatically calculates weights based on `clustering_group` attributes to avoid
double-counting correlated time series:

In [None]:
# Show the time series used for clustering
if hasattr(clustering, 'tsam') and hasattr(clustering.tsam, 'normalizedPeriodlyProfiles'):
    ts_names = list(clustering.tsam.normalizedPeriodlyProfiles.columns)
    print('Time series used for clustering:')
    for name in ts_names:
        print(f'  - {name}')

## 7. Peak Forcing: Ensuring Extreme Periods Are Captured

The `time_series_for_high_peaks` parameter forces inclusion of periods containing peak values.
This is critical for proper component sizing.

In [None]:
# Find which cluster contains the peak demand day
original_demand = flow_system.components['HeatDemand'].inputs[0].fixed_relative_profile.values
daily_max = original_demand.reshape(-1, 96).max(axis=1)

peak_day = np.argmax(daily_max)
peak_cluster = cluster_order[peak_day]
peak_value = daily_max[peak_day]

# Get weight for the peak cluster
peak_weight = cluster_occurrences.get(peak_cluster, cluster_occurrences.get(np.int32(peak_cluster), 1))

print(f'Peak demand day: Day {peak_day + 1} (0-indexed: {peak_day})')
print(f'Peak value: {peak_value:.1f} MW')
print(f'Assigned to cluster: {peak_cluster}')
print(f'Cluster {peak_cluster} represents {peak_weight} day(s)')

# The peak day should be in a cluster with weight 1 (unique)
if peak_weight == 1:
    print('\n✓ Peak day is isolated in its own cluster (weight=1) - good!')
else:
    print(f'\n⚠ Peak day shares cluster with {peak_weight - 1} other day(s)')
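
The check above only reports where the peak day ended up. To show what forcing a peak period means conceptually, here is a sketch that moves the day containing the highest value into a new cluster of its own. The function `isolate_peak_day` is hypothetical and only mimics the idea of appending an extreme period; it is not the code TSAM or flixopt runs.

```python
import numpy as np

def isolate_peak_day(labels, profiles):
    """Move the day with the highest value into a new cluster of its own."""
    labels = labels.copy()
    peak_day = int(profiles.max(axis=1).argmax())
    # Note: if the peak day was alone in its old cluster, that cluster becomes empty.
    labels[peak_day] = labels.max() + 1
    return labels, peak_day

# Hypothetical: 5 days, 4 timesteps each, already grouped into 2 clusters.
profiles = np.array([
    [1.0, 2.0, 2.0, 1.0],
    [1.1, 2.1, 1.9, 1.0],
    [3.0, 4.0, 9.0, 3.0],  # contains the overall peak
    [3.1, 4.2, 4.0, 3.0],
    [2.9, 4.1, 4.1, 3.1],
])
labels = np.array([0, 0, 1, 1, 1])

new_labels, peak_day = isolate_peak_day(labels, profiles)
print(new_labels)  # [0 0 2 1 1]
```

The peak day now has weight 1, so its 9.0 peak is preserved exactly instead of being averaged with the two milder days in cluster 1.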

## Summary

You learned about the internal mechanics of clustering:

1. **`_aggregation_info`**: Contains all metadata for expansion and analysis
2. **Cluster weights**: Scale operational costs so each typical period represents its original days
3. **TSAM integration**: TSAM groups similar time periods into typical periods
4. **Storage handling**: Cyclic constraints ensure realistic storage behavior
5. **Peak forcing**: Guarantees extreme periods are captured for proper sizing

### Key Formulas

**Weighted objective:**
$$\text{Objective} = \sum_{t \in \text{typical}} w_t \cdot c_t$$

**Weight conservation:**
$$\sum_{t \in \text{typical}} w_t = |\text{original timesteps}|$$

### When to Customize

| Scenario | Solution |
|----------|----------|
| Peak days not captured | Add `time_series_for_high_peaks` |
| Minimum periods important | Add `time_series_for_low_peaks` |
| Specific profiles more important | Use custom `weights` dict |
| Storage behaves unrealistically | Check `storage_cyclic` setting |