# Clustering Internals

Understanding the data structures behind time series clustering.

This notebook demonstrates:

- **`Clustering` dataclass**: The simple data structure storing clustering metadata
- **Key properties**: `n_clusters`, `timesteps_per_cluster`, `cluster_assignments`, `cluster_weights`
- **Visualization**: Using `clustering.plot.heatmap()` to understand cluster structure
- **Data expansion**: Using `expand_data()` to map aggregated data back to original timesteps

!!! note "Prerequisites"
    This notebook assumes familiarity with [08c-clustering](08c-clustering.ipynb).

In [None]:
from data.generate_example_systems import create_district_heating_system

import flixopt as fx

fx.CONFIG.notebook()

flow_system = create_district_heating_system()
flow_system.connect_and_transform()

## Create Clustering

After calling `cluster()`, the metadata is stored on the returned FlowSystem as `fs_clustered.clustering`:

In [None]:
import tsam

fs_clustered = flow_system.transform.cluster(
    n_clusters=8,
    cluster_duration='1D',
    extremes=tsam.ExtremeConfig(max_value=['HeatDemand(Q_th)|fixed_relative_profile']),
)

fs_clustered.clustering

## The `Clustering` Dataclass

The `Clustering` class is a simple Python dataclass that stores all clustering metadata:

| Field | Type | Description |
|-------|------|-------------|
| `cluster_assignments` | `xr.DataArray` | Maps original segments → cluster ID |
| `cluster_weights` | `xr.DataArray` | Count of original segments per cluster |
| `original_timesteps` | `pd.DatetimeIndex` | Original full-resolution time index |
| `predefined` | `PredefinedConfig` | Config to transfer clustering to another system |
| `metrics` | `xr.Dataset` | RMSE, MAE per time series |

All other properties (like `n_clusters`, `timesteps_per_cluster`) are **derived from the data**.
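
To make "derived from the data" concrete, here is a minimal, library-independent sketch using plain Python lists. The assignment values and the 24 timesteps per segment are made up for illustration, and the derivation shown is an assumption about the general idea, not flixopt's actual implementation.

```python
from collections import Counter

# Hypothetical assignment of 10 original days to 3 clusters (illustrative values)
cluster_assignments = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0]
timesteps_per_cluster = 24  # assumed: hourly data, one-day segments

# Sizes follow directly from the assignment array
n_original_clusters = len(cluster_assignments)
n_clusters = len(set(cluster_assignments))  # assumes every cluster ID is used

# Weights are the occurrence counts of each cluster ID
counts = Counter(cluster_assignments)
cluster_weights = [counts[c] for c in range(n_clusters)]

# Length of the original time index implied by the segmentation
n_original_timesteps = n_original_clusters * timesteps_per_cluster

print(f'n_clusters:           {n_clusters}')
print(f'n_original_clusters:  {n_original_clusters}')
print(f'cluster_weights:      {cluster_weights}')
print(f'n_original_timesteps: {n_original_timesteps}')
```

The weights always sum to the number of original segments, because every segment is assigned to exactly one cluster.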

In [None]:
# Core data: cluster assignments
# Shape: [original_cluster] - maps each original segment (e.g., day) to a cluster ID
clustering = fs_clustered.clustering
print('cluster_assignments:')
print(clustering.cluster_assignments)
print(f'\nExample: Day 0 → Cluster {int(clustering.cluster_assignments[0].item())}')

In [None]:
# Core data: cluster weights (occurrence counts)
# Shape: [cluster] - how many original segments each cluster represents
print('cluster_weights:')
print(clustering.cluster_weights)
print(
    f'\nSum of weights: {int(clustering.cluster_weights.sum().item())} (= {clustering.n_original_clusters} original days)'
)

## Derived Properties

Properties are computed from the data, not stored separately:

In [None]:
print(f'n_clusters:            {clustering.n_clusters}')
print(f'n_original_clusters:   {clustering.n_original_clusters}')
print(f'timesteps_per_cluster: {clustering.timesteps_per_cluster}')
print(f'original_timesteps:    {len(clustering.original_timesteps)} timesteps')

## Clustering Quality Metrics

The `metrics` property contains RMSE and MAE per time series:

In [None]:
# View metrics as a DataFrame
clustering.metrics.to_dataframe().style.format('{:.4f}')
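
For readers unfamiliar with the two metrics, the following sketch computes the textbook definitions of MAE and RMSE between an original series and its reconstruction from cluster representatives. The numbers are invented, and any normalization the library may apply before computing its metrics is not covered here.

```python
import math

# Illustrative values only: an original series and its clustered reconstruction
original = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.0, 2.0, 3.0, 2.0]

errors = [o - r for o, r in zip(original, reconstructed)]

# Mean absolute error: average magnitude of the deviations
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalizes large deviations more strongly
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f'MAE:  {mae:.4f}')
print(f'RMSE: {rmse:.4f}')
```

A single large deviation raises RMSE more than MAE, so comparing the two hints at whether the error is spread evenly or concentrated in a few timesteps.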

## Visualizing Cluster Structure

The `.plot.heatmap()` method visualizes which days belong to which cluster:

In [None]:
# Heatmap shows the cluster assignment of each original segment (here: day)
clustering.plot.heatmap()

## Expanding Aggregated Data

The `Clustering.expand_data()` method maps clustered (aggregated) data back to the original timesteps.
This is useful for:

- Comparing the aggregated time series against the originals before optimization
- Expanding solution variables after optimization (via `transform.expand()`)

The method uses the `cluster_assignments` to repeat each cluster's data for all original segments it represents.
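
The repetition logic can be sketched without any library. The helper `expand` below is hypothetical and only illustrates the idea on nested lists; it is not the signature or implementation of `expand_data()`.

```python
def expand(clustered, assignments):
    """Repeat each cluster's profile for every original segment assigned to it."""
    expanded = []
    for cluster_id in assignments:
        # Append the representative profile of this segment's cluster
        expanded.extend(clustered[cluster_id])
    return expanded


# 2 clusters with 2 timesteps each (illustrative values)
clustered = [
    [1.0, 2.0],    # profile of cluster 0
    [10.0, 20.0],  # profile of cluster 1
]

# 4 original segments: segments 0 and 3 belong to cluster 0, segments 1 and 2 to cluster 1
assignments = [0, 1, 1, 0]

expanded = expand(clustered, assignments)
print(expanded)
print(f'{len(expanded)} timesteps = {len(assignments)} segments x {len(clustered[0])} timesteps')
```

The expanded series has one value per original timestep, and all segments that share a cluster receive identical values.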

In [None]:
import numpy as np
import xarray as xr

# Create some example clustered data (n_clusters × timesteps_per_cluster)
n_clusters = clustering.n_clusters
timesteps_per_cluster = clustering.timesteps_per_cluster

# Example: random values for each cluster (seeded so the output is reproducible)
clustered_data = xr.DataArray(
    np.random.default_rng(0).random((n_clusters, timesteps_per_cluster)),
    dims=['cluster', 'time'],
    coords={
        'cluster': range(n_clusters),
        'time': fs_clustered.timesteps,
    },
)

print(f'Clustered data shape: {clustered_data.shape}')
print(f'  - {n_clusters} clusters')
print(f'  - {timesteps_per_cluster} timesteps per cluster')

# Expand back to original resolution
expanded = clustering.expand_data(clustered_data)

print(f'\nExpanded data shape: {expanded.shape}')
print(f'  - {len(expanded.time)} total timesteps (original resolution)')

## Transferring Clustering

The `predefined` property contains a `PredefinedConfig` that can transfer the clustering to another FlowSystem:

In [None]:
# View the predefined config
print(f'predefined type: {type(clustering.predefined).__name__}')
print(clustering.predefined)

## Cluster Weights

Each representative timestep carries a weight equal to the number of original segments its cluster represents.
Weighting each timestep's cost this way scales operational costs back to the full original horizon:

$$\text{Objective} = \sum_{t \in \text{typical}} w_t \cdot c_t$$
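
As a small worked check of this weighting, the sketch below uses made-up per-timestep costs and assignments. It compares the weighted sum over the representative timesteps with the plain sum over the fully expanded series; all values are illustrative.

```python
from collections import Counter

# Illustrative values: 5 original segments assigned to 2 clusters
assignments = [0, 1, 1, 0, 1]

# Cost per representative timestep, 2 timesteps per cluster
cluster_cost = [
    [1.0, 2.0],  # cluster 0
    [3.0, 4.0],  # cluster 1
]

# Weight of a cluster = number of original segments it represents
weights = Counter(assignments)

# Weighted objective over representative timesteps: sum of w_t * c_t
weighted_total = sum(weights[c] * cost for c in weights for cost in cluster_cost[c])

# Reference: expand every segment and sum without weights
expanded_total = sum(cost for c in assignments for cost in cluster_cost[c])

print(f'weighted total: {weighted_total}')
print(f'expanded total: {expanded_total}')
```

Both totals agree, which is why the clustered problem can be optimized on the representative timesteps alone.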

The weights sum to the original segment count:

In [None]:
print(f'Sum of cluster_weights: {int(clustering.cluster_weights.sum().item())}')
print(f'Original segments:      {clustering.n_original_clusters}')
print('\nFlowSystem cluster_weight (per-timestep):')
print(f'  Shape: {fs_clustered.cluster_weight.shape}')
print(f'  Sum:   {fs_clustered.cluster_weight.sum().item():.0f}')

## Summary

### The `Clustering` Dataclass

| Field | Type | Description |
|-------|------|-------------|
| `cluster_assignments` | `xr.DataArray` | Maps original segments → cluster ID |
| `cluster_weights` | `xr.DataArray` | Count of original segments per cluster |
| `original_timesteps` | `pd.DatetimeIndex` | Original full-resolution time index |
| `predefined` | `PredefinedConfig` | Config to transfer clustering to another system |
| `metrics` | `xr.Dataset` | RMSE, MAE per time series |

### Derived Properties

| Property | Type | Description |
|----------|------|-------------|
| `n_clusters` | `int` | Number of clusters |
| `n_original_clusters` | `int` | Number of original segments |
| `timesteps_per_cluster` | `int` | Timesteps per cluster |
| `cluster_order` | `xr.DataArray` | Alias for `cluster_assignments` |

### Key Methods

| Method | Description |
|--------|-------------|
| `expand_data(data)` | Expand clustered data to original timesteps |
| `get_timestep_mapping()` | Get array mapping original → clustered timesteps |
| `to_reference()` | Serialize to dict + DataArrays for IO |
| `from_reference()` | Reconstruct from serialized form |
| `plot.heatmap()` | Visualize cluster assignments |
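
To illustrate what a mapping from original to clustered timesteps can look like, the sketch below builds one from an assignment list. The flat indexing convention (`cluster_id * timesteps_per_cluster + position`) is an assumption made for this example and may differ from what `get_timestep_mapping()` returns.

```python
# Illustrative values: 3 original segments, 2 timesteps per segment
assignments = [1, 0, 1]
timesteps_per_cluster = 2

n_original_timesteps = len(assignments) * timesteps_per_cluster

# For each original timestep: find its segment and position within the segment,
# then point to the same position inside the segment's cluster
mapping = [
    assignments[t // timesteps_per_cluster] * timesteps_per_cluster + t % timesteps_per_cluster
    for t in range(n_original_timesteps)
]

for t, target in enumerate(mapping):
    print(f'original timestep {t} -> clustered timestep {target}')
```

With such a mapping, expanding a flat clustered array reduces to a single indexing operation.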

### IO Serialization

The `Clustering` class supports full round-trip serialization:

```python
# Serialize
ref, arrays = clustering.to_reference()

# Deserialize
clustering = Clustering.from_reference(ref, arrays)
```

This is used internally by `FlowSystem.to_netcdf()` and `FlowSystem.from_netcdf()`.