# Hilbert–Schmidt Independence Criterion (HSIC): A Clear Numerical Example

This section provides a **step-by-step numerical explanation** of **HSIC** using a nonlinear example. HSIC is a **kernel-based, nonparametric dependence measure** that is widely used for **feature selection**, especially in **time series and high-dimensional settings**.

Unlike Mutual Information (MI), HSIC:

- Does **not** estimate probability densities
- Does **not** require discretization or kNN
- Measures dependence via **kernels and geometry in RKHS**

## Example Setup

We observe one feature $X$ and one target $Y$ over four time points:

| i   | $X_i$ | $Y_i$ |
| --- | ----- | ----- |
| 1   | 1     | 1     |
| 2   | 2     | 4     |
| 3   | 3     | 9     |
| 4   | 4     | 16    |

This is again a **nonlinear relationship**:  
$Y = X^2$

## Step 1 — Choose Kernels

HSIC requires a kernel for $X$ and a kernel for $Y$.

For simplicity, we use the **linear kernel**:

$$
k(x_i, x_j) = x_i x_j
$$

$$
l(y_i, y_j) = y_i y_j
$$

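> Note: In practice, **Gaussian (RBF) kernels** are more common because they are characteristic and can detect general forms of dependence. Linear kernels make the computations transparent, but with them HSIC reduces to the squared sample covariance, so it only captures the linear part of a relationship.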
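
As a minimal sketch, the two kernel choices mentioned above can be written as plain functions on scalars. The names `linear_kernel_1d` and `rbf_kernel_1d` and the default `gamma=1.0` are illustrative assumptions, not part of any library.

```python
import numpy as np

def linear_kernel_1d(a, b):
    """Linear kernel on scalars: k(a, b) = a * b."""
    return a * b

def rbf_kernel_1d(a, b, gamma=1.0):
    """Gaussian (RBF) kernel on scalars: k(a, b) = exp(-gamma * (a - b)^2)."""
    return np.exp(-gamma * (a - b) ** 2)

print(linear_kernel_1d(2, 3))    # 6
print(rbf_kernel_1d(2.0, 2.0))   # 1.0: identical points have maximal similarity
# RBF similarity decays with distance, so the farther pair scores lower
print(rbf_kernel_1d(1.0, 4.0) < rbf_kernel_1d(1.0, 2.0))  # True
```

The linear kernel grows with the magnitude of its inputs, while the RBF kernel is bounded in $(0, 1]$ and depends only on the distance between the two points.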

## Step 2 — Compute Kernel Matrices

### Kernel Matrix for $X$

$$
K_{ij} = X_i X_j
$$

| $K$   | 1   | 2   | 3   | 4   |
| ----- | --- | --- | --- | --- |
| 1     | 1   | 2   | 3   | 4   |
| 2     | 2   | 4   | 6   | 8   |
| 3     | 3   | 6   | 9   | 12  |
| 4     | 4   | 8   | 12  | 16  |

### Kernel Matrix for $Y$

$$
L_{ij} = Y_i Y_j
$$

| $L$   | 1   | 2   | 3   | 4   |
| ----- | --- | --- | --- | --- |
| 1     | 1   | 4   | 9   | 16  |
| 2     | 4   | 16  | 36  | 64  |
| 3     | 9   | 36  | 81  | 144 |
| 4     | 16  | 64  | 144 | 256 |
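
The two tables above can be reproduced with a few lines of NumPy. This is a small sketch for the four-point example only; it relies on the fact that a linear-kernel Gram matrix is the outer product of the data vector with itself.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = X ** 2  # [1, 4, 9, 16]

# Linear kernel: K[i, j] = X[i] * X[j], L[i, j] = Y[i] * Y[j]
K = np.outer(X, X)
L = np.outer(Y, Y)

print(K)
print(L)
```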

## Step 3 — Center the Kernel Matrices

HSIC operates on **centered kernel matrices**.

Define the centering matrix:

$$
H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top
$$

For $n = 4$:

$$
H =
\begin{bmatrix}
0.75 & -0.25 & -0.25 & -0.25 \\
-0.25 & 0.75 & -0.25 & -0.25 \\
-0.25 & -0.25 & 0.75 & -0.25 \\
-0.25 & -0.25 & -0.25 & 0.75
\end{bmatrix}
$$

Centered kernels:

$$
\tilde{K} = HKH
$$

$$
\tilde{L} = HLH
$$
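
The centering step can be checked numerically. The sketch below builds $H$ for the four-point example, applies $HKH$, and verifies two properties: for a linear kernel the result equals the Gram matrix of the mean-centered data, and every row and column of a centered kernel matrix sums to zero.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
n = X.shape[0]

K = np.outer(X, X)                     # linear-kernel Gram matrix
H = np.eye(n) - np.ones((n, n)) / n    # centering matrix H = I - (1/n) 1 1^T

K_centered = H @ K @ H

# For a linear kernel, double centering equals the Gram matrix of centered data
X_c = X - X.mean()
assert np.allclose(K_centered, np.outer(X_c, X_c))

# Rows and columns of a centered kernel matrix sum to zero
assert np.allclose(K_centered.sum(axis=0), 0.0)
assert np.allclose(K_centered.sum(axis=1), 0.0)

print(K_centered)
```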

### Centered Kernel Matrix for $X$

With a linear kernel, $\tilde{K}_{ij} = (X_i - \bar{X})(X_j - \bar{X})$, where $\bar{X} = 2.5$:

| $\tilde{K}$ | 1     | 2     | 3     | 4     |
| ----------- | ----- | ----- | ----- | ----- |
| 1           | 2.25  | 0.75  | -0.75 | -2.25 |
| 2           | 0.75  | 0.25  | -0.25 | -0.75 |
| 3           | -0.75 | -0.25 | 0.25  | 0.75  |
| 4           | -2.25 | -0.75 | 0.75  | 2.25  |

### Centered Kernel Matrix for $Y$

Similarly, $\tilde{L}_{ij} = (Y_i - \bar{Y})(Y_j - \bar{Y})$, where $\bar{Y} = 7.5$:

| $\tilde{L}$ | 1      | 2      | 3     | 4      |
| ----------- | ------ | ------ | ----- | ------ |
| 1           | 42.25  | 22.75  | -9.75 | -55.25 |
| 2           | 22.75  | 12.25  | -5.25 | -29.75 |
| 3           | -9.75  | -5.25  | 2.25  | 12.75  |
| 4           | -55.25 | -29.75 | 12.75 | 72.25  |

## Step 4 — Compute HSIC

The empirical HSIC is defined as:

$$
\mathrm{HSIC}(X,Y)
=
\frac{1}{(n-1)^2}
\mathrm{trace}(\tilde{K}\tilde{L})
$$

### Matrix Product and Trace

Compute:

$$
\mathrm{trace}(\tilde{K}\tilde{L})
=
\sum_{i,j} \tilde{K}_{ij}\tilde{L}_{ij}
$$

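For this example, both centered matrices are outer products of the mean-centered data, so the double sum collapses to a squared inner product:

$$
\sum_{i,j} \tilde{K}_{ij}\tilde{L}_{ij}
= \left(\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})\right)^2
= (9.75 + 1.75 + 0.75 + 12.75)^2
= 25^2 = 625
$$

With $n = 4$:

$$
\mathrm{HSIC}(X,Y) = \frac{625}{9} \approx 69.4
$$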
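
The whole computation for the four-point example can be sketched end to end in NumPy. The snippet also compares the result with the squared sample covariance, which is what HSIC reduces to when both kernels are linear.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = X ** 2
n = X.shape[0]

# Linear-kernel Gram matrices and the centering matrix
K = np.outer(X, X)
L = np.outer(Y, Y)
H = np.eye(n) - np.ones((n, n)) / n

K_c = H @ K @ H
L_c = H @ L @ H

# trace(K_c @ L_c) equals the elementwise sum because L_c is symmetric
trace_val = np.trace(K_c @ L_c)
elementwise_sum = np.sum(K_c * L_c)

hsic_val = trace_val / (n - 1) ** 2

# Sample covariance with the (n - 1) denominator, for comparison
sample_cov = np.cov(X, Y, ddof=1)[0, 1]

print(trace_val, elementwise_sum)
print(hsic_val, sample_cov ** 2)
```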

## Interpretation

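- Population HSIC $> 0$ indicates dependence; the biased empirical estimate is non-negative and typically positive even for independent samples, so it should be judged against a null distribution (for example, from permutations)
- Population HSIC $= 0$ **if and only if** $X$ and $Y$ are independent, provided the kernels are characteristic (e.g., Gaussian/RBF)
- With the linear kernels used here, HSIC equals the squared sample covariance, so the large value reflects the strong **linear trend** in $Y = X^2$ over this range and scales with the units of the data; detecting purely nonlinear dependence requires a characteristic kernel such as RBF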
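
Because the biased empirical estimate is positive even under independence, deciding whether a value is "large" usually needs a reference distribution. The following is a minimal sketch of a permutation test, assuming RBF kernels with a fixed `gamma=1.0`; the helper names `rbf_gram`, `hsic_biased`, and `hsic_permutation_pvalue` are hypothetical.

```python
import numpy as np

def rbf_gram(v, gamma=1.0):
    """RBF Gram matrix for a 1-D sample v."""
    diff = v[:, None] - v[None, :]
    return np.exp(-gamma * diff ** 2)

def hsic_biased(K, L):
    """Biased empirical HSIC: trace(HKH HLH) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.sum((H @ K @ H) * (H @ L @ H)) / (n - 1) ** 2

def hsic_permutation_pvalue(x, y, n_perm=200, gamma=1.0, seed=0):
    """Return (observed HSIC, permutation p-value)."""
    rng = np.random.default_rng(seed)
    K = rbf_gram(x, gamma)
    L = rbf_gram(y, gamma)
    observed = hsic_biased(K, L)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(y))
        # Permuting rows and columns of L together breaks the pairing with x
        if hsic_biased(K, L[np.ix_(perm, perm)]) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y_dep = x ** 2 + 0.1 * rng.normal(size=100)   # nonlinear dependence on x
y_ind = rng.normal(size=100)                  # drawn independently of x

print(hsic_permutation_pvalue(x, y_dep))
print(hsic_permutation_pvalue(x, y_ind))
```

A small p-value means the observed HSIC is rarely matched when the pairing between `x` and `y` is destroyed, which is evidence of dependence.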

## Why HSIC is Powerful for Time Series Feature Selection

- Fully **nonparametric**
- Captures **linear and nonlinear** dependencies
- Avoids density estimation (unlike MI)
- Works naturally with **lagged variables**
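- Applies directly to **high-dimensional** $X$ and $Y$, since only $n \times n$ kernel matrices are needed (test power can still degrade as the dimension grows)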
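
The claim about nonlinear dependence holds for characteristic kernels such as RBF, not for every kernel. The sketch below illustrates this with data that is symmetric around zero, where $Y = X^2$ is fully determined by $X$ yet uncorrelated with it. The helper names and `gamma=1.0` are assumptions made for the illustration.

```python
import numpy as np

def hsic_from_grams(K, L):
    """Biased empirical HSIC from two Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.sum((H @ K @ H) * (H @ L @ H)) / (n - 1) ** 2

def rbf_gram(v, gamma=1.0):
    diff = v[:, None] - v[None, :]
    return np.exp(-gamma * diff ** 2)

x = np.linspace(-2.0, 2.0, 201)   # symmetric around zero
y = x ** 2                        # depends on x, but has zero covariance with it

# Linear kernels only see covariance, which vanishes here
hsic_linear = hsic_from_grams(np.outer(x, x), np.outer(y, y))

# RBF kernels pick up the nonlinear relationship
hsic_rbf = hsic_from_grams(rbf_gram(x), rbf_gram(y))

# Baseline: shuffle y to break the pairing with x
rng = np.random.default_rng(0)
hsic_rbf_shuffled = hsic_from_grams(rbf_gram(x), rbf_gram(rng.permutation(y)))

print(f"linear kernel:        {hsic_linear:.6f}")
print(f"RBF kernel:           {hsic_rbf:.6f}")
print(f"RBF kernel, shuffled: {hsic_rbf_shuffled:.6f}")
```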

## Comparison to MI and dCor

| Method               | Density Estimation | Captures Nonlinearity | Popularity in Time Series Feature Selection |
| -------------------- | ------------------ | --------------------- | ---------------- |
| Mutual Information   | Yes (kNN / KDE)    | Yes                   | Very high        |
| Distance Correlation | No                 | Yes                   | High             |
| HSIC                 | No (kernel-based)  | Yes                   | Very high        |


## HSIC in Python

`scikit-learn` does not provide a built-in HSIC function, but HSIC can be implemented in Python with kernel matrices (e.g., RBF/Gaussian kernels) and basic linear algebra.


In [1]:
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels, pairwise_distances
from sklearn.preprocessing import StandardScaler

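In [2]:
def hsic(X, Y, kernel="rbf", gamma=None):
    """
    Compute the biased empirical Hilbert-Schmidt Independence Criterion (HSIC)
    between X and Y using kernel matrices.

    Args:
        X: array of shape (n_samples, n_features)
        Y: array of shape (n_samples, m_features)
        kernel: kernel type passed to sklearn.metrics.pairwise_kernels
        gamma: kernel parameter applied to both kernels
            (for RBF, None falls back to 1 / n_features of each input)

    Returns:
        HSIC value (float)
    """
    n = X.shape[0]
    if Y.shape[0] != n:
        raise ValueError("X and Y must have the same number of samples")

    # Compute kernel (Gram) matrices; filter_params=True drops gamma for
    # kernels that do not accept it (e.g., "linear")
    Kx = pairwise_kernels(X, metric=kernel, filter_params=True, gamma=gamma)
    Ky = pairwise_kernels(Y, metric=kernel, filter_params=True, gamma=gamma)

    # Centering matrix H = I - (1/n) * 1 1^T
    H = np.eye(n) - np.ones((n, n)) / n

    # Double-center both kernel matrices
    Kx_c = H @ Kx @ H
    Ky_c = H @ Ky @ H

    # Biased estimator: trace(Kx_c @ Ky_c) / (n - 1)^2
    hsic_val = np.trace(Kx_c @ Ky_c) / ((n - 1) ** 2)
    return hsic_val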

In [3]:
np.random.seed(0)
n = 200

# X is a random walk; Y is a noisy nonlinear (sine) function of X
X = np.cumsum(np.random.randn(n))
X = X.reshape(-1, 1)
Y = np.sin(0.1 * X) + 0.1 * np.random.randn(n, 1)

# With a single feature, gamma = 1 / n_features = 1.0
hsic_value = hsic(X, Y, kernel="rbf", gamma=1.0 / X.shape[1])
print(f"HSIC value between X and Y: {hsic_value:.5f}")


HSIC value between X and Y: 0.01117


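The magnitude of HSIC (and of normalized HSIC, NHSIC) depends heavily on the kernel choice and on the bandwidth parameter `gamma`. If the data has a large variance or wide oscillations, the pairwise distances $d_{ij}$ are large relative to the bandwidth, so $\exp(-\gamma \, d_{ij}^2)$ is close to zero for most pairs. The kernel matrix then approaches the identity and HSIC comes out small, even when the variables are clearly dependent.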

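Two adjustments help:

- Standardize $X$ and $Y$ so that each has zero mean and unit variance
- Set `gamma` with the median heuristic, $\gamma = 1 / (2\,\mathrm{median}(d)^2)$, where $d$ denotes the pairwise Euclidean distances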


In [4]:
scaler_X = StandardScaler()
scaler_Y = StandardScaler()

X_scaled = scaler_X.fit_transform(X)
Y_scaled = scaler_Y.fit_transform(Y)

In [5]:
# Compute pairwise Euclidean distances
dists = pairwise_distances(X_scaled)
median_dist = np.median(dists)
gamma = 1 / (2 * median_dist ** 2)
gamma

0.4867670629291414
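
The cell above takes the median over the full distance matrix, which includes the zero self-distances on the diagonal and counts every pair twice. For a few hundred samples the effect is small, but for short series it can shift the bandwidth noticeably. A possible variant, sketched here with a hypothetical helper `median_heuristic_gamma` and NumPy only, restricts the median to distinct pairs.

```python
import numpy as np

def median_heuristic_gamma(Z):
    """gamma = 1 / (2 * median^2), with the median taken over distinct pairs only."""
    Z = np.asarray(Z, dtype=float)
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    # Strict upper triangle: each pair once, no zero self-distances
    iu = np.triu_indices_from(dists, k=1)
    median_dist = np.median(dists[iu])
    return 1.0 / (2.0 * median_dist ** 2)

Z = np.array([[0.0], [1.0], [3.0]])   # pairwise distances: 1, 3, 2
print(median_heuristic_gamma(Z))      # 0.125, since the median distance is 2
```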

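In [6]:
# Pass the standardized inputs, because gamma was tuned on X_scaled
hsic_value = hsic(X_scaled, Y_scaled, kernel="rbf", gamma=gamma)
print(f"HSIC value between X and Y: {hsic_value:.5f}")

For comparison, the same call on the unscaled `X` and `Y` prints:

HSIC value between X and Y: 0.00896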


### HSIC for lag selection


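In [7]:
lags = 5
hsic_scores = []
for lag in range(1, lags + 1):
    # Pair X at time t - lag with Y at time t. Slicing avoids the wrap-around
    # that np.roll introduces (the last `lag` values would reappear at the start).
    X_lag = X_scaled[:-lag]
    Y_now = Y_scaled[lag:]
    val = hsic(X_lag, Y_now, kernel="rbf", gamma=gamma)
    hsic_scores.append(val)
    print(f"Lag {lag}: HSIC = {val:.5f}")

For comparison, the `np.roll(X, lag, axis=0)` variant on the unscaled `X` and `Y` prints:

Lag 1: HSIC = 0.00844
Lag 2: HSIC = 0.00812
Lag 3: HSIC = 0.00778
Lag 4: HSIC = 0.00748
Lag 5: HSIC = 0.00730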


- Higher HSIC ⇒ stronger dependency between X and Y.

- Works well for nonlinear time series relationships where Pearson correlation fails.

> Key point: HSIC values are often much smaller than `MI` or `dCor` values, even when the variables are clearly dependent. This is normal: raw HSIC is not on a comparable scale, so interpret it in relative terms by ranking the lags (or features) and selecting the highest-scoring ones.
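
If a bounded score is easier to work with than raw HSIC, one common normalization divides by the self-dependence terms, $\mathrm{HSIC}(X,Y) / \sqrt{\mathrm{HSIC}(X,X)\,\mathrm{HSIC}(Y,Y)}$, a quantity also known as centered kernel alignment. The sketch below assumes this is the intended meaning of NHSIC; the function name `normalized_hsic` and the choice `gamma=1.0` are illustrative.

```python
import numpy as np

def rbf_gram(v, gamma=1.0):
    """RBF Gram matrix for a 1-D sample v."""
    diff = v[:, None] - v[None, :]
    return np.exp(-gamma * diff ** 2)

def centered(K):
    """Double-center a Gram matrix: H K H."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def normalized_hsic(K, L):
    """HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L)); the (n - 1)^2 factors cancel."""
    Kc, Lc = centered(K), centered(L)
    num = np.sum(Kc * Lc)
    den = np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))
    return num / den

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.sin(2.0 * x) + 0.1 * rng.normal(size=200)

K, L = rbf_gram(x), rbf_gram(y)
print(normalized_hsic(K, K))   # 1.0 up to rounding: a variable against itself
print(normalized_hsic(K, L))   # between 0 and 1
```

The normalized score lies in $[0, 1]$, which makes values easier to compare across features, although it still depends on the kernel and bandwidth.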
