# Risk-Aware Compression of Electricity Prices with $\delta$-Subsets

We will run a small experiment to illustrate what the $\delta$-subset means for lossy compression. The steps are: electricity prices → discretize into bins → build an alphabet distribution → compute entropy → compute the $\delta$-subset → visualize kept versus dropped bins.

## Imports

We will treat binned electricity prices as an "alphabet" of symbols. Each symbol is a price range (e.g. $\left[50, 55\right)$ EUR/MWhe). The probability of a symbol is the fraction of hours whose price falls in that range. Then, for a chosen risk budget $\delta$, we compute the smallest set of bins whose total probability mass is at least $1-\delta$.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from entropy_lab.coding.alphabet import AlphabetDistribution
from entropy_lab.measures.entropy import compute_entropy


## Loading the hourly price data

Electricity prices have a "boring bulk" (most hours) and a "long tail" (rare spikes). That long tail is exactly where $\delta$-subsets become intuitive: how much tail mass are we willing to ignore in exchange for a simpler model or encoding?

For this example, we will use Switzerland's wholesale hourly electricity prices. Let us load the dataset.

In [None]:
df = pd.read_csv("/Users/fdolci/projects/entropy_lab/data/Switzerland.csv")
prices = df["Price (EUR/MWhe)"].to_numpy()
df.head()

## Discretization

Now that we have a dataset that includes all the prices, we will discretize these prices into bins. This defines the alphabet of symbols:

In [None]:
bin_width = 5.0 # EUR/MWhe
lo = np.floor(prices.min() / bin_width) * bin_width
hi = np.ceil(prices.max() / bin_width) * bin_width
edges = np.arange(lo, hi + bin_width, bin_width)

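# Bin index for each price (np.digitize returns 1-based positions for in-range values)
idx = np.digitize(prices, edges) - 1
# A price exactly equal to the last edge would land past the final bin; clip it back in
idx = np.clip(idx, 0, len(edges) - 2)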

# Convert each bin to a readable symbol label
def bin_label(i: int) -> str:
    a = edges[i]
    b = edges[i + 1]
    return f"[{a:.0f},{b:.0f})"

symbols = [bin_label(i) for i in idx]

# Count symbols
counts = {}
for s in symbols:
    counts[s] = counts.get(s, 0) + 1

dist = AlphabetDistribution.from_counts(counts).ranked()
len(dist.symbols), dist.symbols[:5], dist.p[:5].sum()

Now we have a discrete distribution over price-bins. The most probable bins correspond to "normal market conditions". The least probable bins are rare spikes or rare negative prices.
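To see the binning convention at work on a handful of values, here is a small NumPy-only sketch. The toy prices are made up for illustration and are not taken from the dataset; it also counts with `collections.Counter` instead of going through `AlphabetDistribution`, so it only mirrors the idea of the cell above. Note how 14.9 and 15.0 land in different bins because every bin is half-open.

```python
from collections import Counter

import numpy as np

# Toy prices in EUR/MWhe (made up for illustration, not from the dataset)
toy_prices = np.array([-3.0, 12.0, 14.9, 15.0, 47.3])
bin_width = 5.0

# Same edge construction as in the notebook cell above
lo = np.floor(toy_prices.min() / bin_width) * bin_width
hi = np.ceil(toy_prices.max() / bin_width) * bin_width
edges = np.arange(lo, hi + bin_width, bin_width)

# Half-open bins [a, b): a price equal to b belongs to the next bin
idx = np.clip(np.digitize(toy_prices, edges) - 1, 0, len(edges) - 2)
labels = [f"[{edges[i]:.0f},{edges[i + 1]:.0f})" for i in idx]

# Probability of a symbol = share of observations in that bin
counts = Counter(labels)
probs = {symbol: count / len(labels) for symbol, count in counts.items()}
print(probs)
```

Bins that no price falls into never appear in `counts`, so the alphabet only contains observed price ranges.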

## Entropy of the price-bin distribution

In [None]:
p = dist.p 
H = compute_entropy(p)
H

As explained in the theory, entropy measures the average surprise of the price-bin outcome. A distribution with many similarly likely bins has higher entropy than one dominated by a few bins.
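To make that statement concrete, here is a minimal sketch of Shannon entropy in bits, $H(p) = -\sum_i p_i \log_2 p_i$. It assumes that `compute_entropy` follows this base-2 definition, which the "bits" in the plot titles suggests but this notebook does not confirm. `shannon_entropy_bits` is a hypothetical helper, and both toy distributions are made up.

```python
import numpy as np


def shannon_entropy_bits(p):
    """Shannon entropy in bits of a probability vector (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability symbols contribute nothing to the sum
    return float(-(p * np.log2(p)).sum())


# Toy distributions (made up): four equally likely bins vs. one dominant bin
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]

print(shannon_entropy_bits(uniform))  # maximal for four symbols
print(shannon_entropy_bits(peaked))   # lower, because one bin dominates
```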

## Visualization of the full distribution (ranked by probability)

In [None]:
top_n = min(30, len(dist.symbols))
x = np.arange(top_n)

plt.figure(figsize=(12, 4))
plt.bar(x, dist.p[:top_n])
plt.xticks(x, dist.symbols[:top_n], rotation=70, ha="right")
plt.ylabel("Probability (share of hours)")
plt.title(f"Top {top_n} price bins by frequency (Entropy = {H:.2f} bits)")
plt.tight_layout()
plt.show()

This ranked plot is the visual "alphabet": the left side is the everyday market, the right side is the long tail. We will now compute the $\delta$-subset.

## $\delta$-subset

For a given risk level $\delta$, we keep the most probable bins until they cover at least $1-\delta$ of the probability mass, and drop the remaining tail:

In [None]:
delta = 0.05 # we accept 5% risk of landing in dropped bins
res = dist.delta_subset(delta)

res.threshold, res.kept_probability, res.dropped_probability, len(res.kept.symbols), len(res.dropped_symbols)

Interpretation: with $\delta = 0.05$, we build the smallest set of bins that covers at least $95\%$ of observed hours. The dropped bins represent "rare regimes" (e.g., spikes).
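The selection rule itself is short enough to sketch in plain NumPy: sort the probabilities in descending order, accumulate them, and stop at the first position where the running mass reaches $1-\delta$. The function `smallest_delta_subset` below is a hypothetical stand-in for illustration; it is not the implementation behind `dist.delta_subset`, whose internals are not shown here. The toy probabilities are made up.

```python
import numpy as np


def smallest_delta_subset(p, delta):
    """Return (indices, mass) of the fewest symbols with total mass >= 1 - delta."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(-p, kind="stable")  # most probable symbols first
    cum = np.cumsum(p[order])
    # First position where the running mass reaches 1 - delta;
    # the small tolerance guards against floating-point round-off.
    k = int(np.searchsorted(cum, 1.0 - delta - 1e-12, side="left")) + 1
    k = min(k, len(p))
    return order[:k], float(cum[k - 1])


# Toy probabilities (made up), summing to 1
p = [0.50, 0.25, 0.125, 0.125]
kept, mass = smallest_delta_subset(p, delta=0.2)
print(kept, mass)
```

Taking symbols greedily from most to least probable is what makes the kept set the smallest one: any other set of the same size covers at most as much mass.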

## Plotting

In [None]:
ranked = dist  # already ranked
p_ranked = ranked.p
labels_ranked = ranked.symbols

k = len(res.kept.symbols)

plt.figure(figsize=(12, 4))
x = np.arange(len(p_ranked))
plt.bar(x[:k], p_ranked[:k], label=f"Kept bins (mass ≈ {res.kept_probability:.3f})")
plt.bar(x[k:], p_ranked[k:], label=f"Dropped bins (mass ≈ {res.dropped_probability:.3f})")
plt.axvline(k - 0.5)
plt.ylabel("Probability")
plt.title(f"δ-subset on price bins (δ={delta:.2f} → keep ≥ {1-delta:.2f} mass) | kept bins = {k}/{len(p_ranked)}")
plt.legend()
plt.tight_layout()
plt.show()

As $\delta$ increases (we accept more risk), fewer bins are needed to cover the remaining mass, as the following sweep shows:

In [None]:
deltas = np.linspace(0.0, 0.25, 26)  # 0% to 25% tail risk
kept_sizes = []
kept_mass = []
H_kept = []

for d in deltas:
    r = dist.delta_subset(float(d))
    kept_sizes.append(len(r.kept.symbols))
    kept_mass.append(r.kept_probability)
    H_kept.append(compute_entropy(r.kept.p))

plt.figure(figsize=(12, 4))
plt.plot(deltas, kept_sizes, marker="o")
plt.xlabel("δ (risk budget)")
plt.ylabel("Kept alphabet size |Aδ| (number of bins)")
plt.title("As we allow more risk δ, the required price alphabet shrinks")
plt.tight_layout()
plt.show()
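
One way to read this curve in compression terms, which the notebook itself does not compute: a fixed-length code that only has to label the kept bins needs about $\log_2 |A_\delta|$ bits per symbol (rounded up in practice). The sketch below shows this on made-up probabilities. `kept_alphabet_size` is a hypothetical NumPy-only helper, not a call into `entropy_lab`.

```python
import numpy as np


def kept_alphabet_size(p, delta):
    """Fewest symbols whose total probability is >= 1 - delta (hypothetical helper)."""
    cum = np.cumsum(np.sort(np.asarray(p, dtype=float))[::-1])  # descending order
    k = int(np.searchsorted(cum, 1.0 - delta - 1e-12, side="left")) + 1
    return min(k, len(cum))


# Toy probabilities (made up), summing to 1
p = [0.50, 0.25, 0.125, 0.125]
for delta in (0.0, 0.25, 0.5):
    k = kept_alphabet_size(p, delta)
    # log2(k): bits needed to index k symbols, before rounding up
    print(delta, k, np.log2(k))
```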

## Comparing with Germany

In [None]:
df_de = pd.read_csv("/Users/fdolci/projects/entropy_lab/data/Germany.csv")
prices_de = df_de["Price (EUR/MWhe)"].to_numpy()

# For a fair comparison, use ONE shared set of bin edges
all_prices = np.concatenate([prices, prices_de])

bin_width = 5.0
lo = np.floor(all_prices.min() / bin_width) * bin_width
hi = np.ceil(all_prices.max() / bin_width) * bin_width
edges = np.arange(lo, hi + bin_width, bin_width)

def prices_to_dist(prices_arr: np.ndarray, edges: np.ndarray) -> AlphabetDistribution:
    idx = np.digitize(prices_arr, edges) - 1
    idx = np.clip(idx, 0, len(edges) - 2)

    def bin_label(i: int) -> str:
        a = edges[i]
        b = edges[i + 1]
        return f"[{a:.0f},{b:.0f})"

    symbols = [bin_label(i) for i in idx]

    counts = {}
    for s in symbols:
        counts[s] = counts.get(s, 0) + 1

    return AlphabetDistribution.from_counts(counts).ranked()

dist_ch = prices_to_dist(prices, edges)
dist_de = prices_to_dist(prices_de, edges)

H_ch = compute_entropy(dist_ch.p)
H_de = compute_entropy(dist_de.p)

len(dist_ch.symbols), len(dist_de.symbols), H_ch, H_de

In [None]:
delta = 0.05

res_ch = dist_ch.delta_subset(delta)
res_de = dist_de.delta_subset(delta)

print("CH kept mass / bins:", res_ch.kept_probability, len(res_ch.kept.symbols), "of", len(dist_ch.symbols))
print("GER kept mass / bins:", res_de.kept_probability, len(res_de.kept.symbols), "of", len(dist_de.symbols))

H_ch_kept = compute_entropy(res_ch.kept.p)
H_de_kept = compute_entropy(res_de.kept.p)

print("Entropy full CH / kept:", H_ch, H_ch_kept)
print("Entropy full GER / kept:", H_de, H_de_kept)

In [None]:
deltas = np.linspace(0.0, 0.25, 26)

def compression_profile(dist: AlphabetDistribution, deltas: np.ndarray):
    kept_sizes = []
    kept_mass = []
    kept_entropy = []
    for d in deltas:
        r = dist.delta_subset(float(d))
        kept_sizes.append(len(r.kept.symbols))
        kept_mass.append(r.kept_probability)
        kept_entropy.append(compute_entropy(r.kept.p))
    return np.array(kept_sizes), np.array(kept_mass), np.array(kept_entropy)

sizes_ch, mass_ch, Hk_ch = compression_profile(dist_ch, deltas)
sizes_de, mass_de, Hk_de = compression_profile(dist_de, deltas)

plt.figure(figsize=(12, 5))
plt.plot(deltas, sizes_ch, marker="o", label=f"Switzerland | H={H_ch:.2f} bits")
plt.plot(deltas, sizes_de, marker="o", label=f"Germany      | H={H_de:.2f} bits")

# Highlight the chosen delta
plt.axvline(delta, linestyle="--")
plt.scatter([delta], [len(res_ch.kept.symbols)], s=80)
plt.scatter([delta], [len(res_de.kept.symbols)], s=80)

plt.xlabel("δ (risk budget: probability mass you allow to drop)")
plt.ylabel("Kept alphabet size |Aδ| (number of price bins)")
plt.title("Risk-aware lossy compression of electricity prices: CH vs GER")
plt.legend()
plt.tight_layout()
plt.show()