# Distance Correlation (dCor): A Clear Numerical Example

Distance correlation (dCor) is a nonparametric measure of dependence between two random variables or vectors that can detect both linear and nonlinear relationships. This section works through a small numerical example **step by step** to show how the sample dCor is computed and why it can capture **nonlinear dependence** that Pearson correlation may miss.

## Example Setup

We observe one feature $X$ and one target $Y$ over four time points:

| i   | $X_i$ | $Y_i$ |
| --- | ----- | ----- |
| 1   | 1     | 1     |
| 2   | 2     | 4     |
| 3   | 3     | 9     |
| 4   | 4     | 16    |

Here, $Y = X^2$, which represents a **nonlinear relationship**.

## Step 1 — Pairwise Distance Matrices

### Distance Matrix for $X$

Definition:

$$
D^X_{ij} = |X_i - X_j|
$$

| $D^X$ | 1   | 2   | 3   | 4   |
| ----- | --- | --- | --- | --- |
| 1     | 0   | 1   | 2   | 3   |
| 2     | 1   | 0   | 1   | 2   |
| 3     | 2   | 1   | 0   | 1   |
| 4     | 3   | 2   | 1   | 0   |

### Distance Matrix for $Y$

Definition:

$$
D^Y_{ij} = |Y_i - Y_j|
$$

| $D^Y$ | 1   | 2   | 3   | 4   |
| ----- | --- | --- | --- | --- |
| 1     | 0   | 3   | 8   | 15  |
| 2     | 3   | 0   | 5   | 12  |
| 3     | 8   | 5   | 0   | 7   |
| 4     | 15  | 12  | 7   | 0   |
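
The two tables above can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming one-dimensional inputs; the variable names `d_x` and `d_y` are illustrative and not part of any library.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = x ** 2  # Y = X^2, as in the example setup

# Broadcasting a column vector against a row vector yields every
# pairwise difference; the absolute value turns it into a distance.
d_x = np.abs(x[:, None] - x[None, :])
d_y = np.abs(y[:, None] - y[None, :])

print(d_x)
print(d_y)
```

For vector-valued observations, the absolute value would be replaced by the Euclidean norm of each pairwise difference.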

## Step 2 — Double Centering of Distance Matrices

Distance correlation operates on **centered distance matrices**.

For matrix $D^X$, the centered version $A$ is defined as:

$$
A_{ij} = D^X_{ij}
- \bar{D}^X_{i\cdot}
- \bar{D}^X_{\cdot j}
+ \bar{D}^X_{\cdot\cdot}
$$

where:

- $\bar{D}^X_{i\cdot}$ is the mean of row $i$
- $\bar{D}^X_{\cdot j}$ is the mean of column $j$
- $\bar{D}^X_{\cdot\cdot}$ is the grand mean

### Means for $D^X$

Row means (identical to the column means, because $D^X$ is symmetric):

- Row 1: 1.5
- Row 2: 1.0
- Row 3: 1.0
- Row 4: 1.5

Grand mean:

$$
\bar{D}^X_{\cdot\cdot} = \frac{20}{16} = 1.25
$$

### Centered Distance Matrix for $X$

| $A$   | 1     | 2     | 3     | 4     |
| ----- | ----- | ----- | ----- | ----- |
| 1     | -1.75 | -0.25 | 0.75  | 1.25  |
| 2     | -0.25 | -0.75 | 0.25  | 0.75  |
| 3     | 0.75  | 0.25  | -0.75 | -0.25 |
| 4     | 1.25  | 0.75  | -0.25 | -1.75 |
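
The double-centering formula can be checked numerically. The sketch below assumes a square distance matrix stored as a NumPy array; `double_center` is a hypothetical helper name chosen for this illustration.

```python
import numpy as np

def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    row_means = d.mean(axis=1, keepdims=True)  # shape (n, 1)
    col_means = d.mean(axis=0, keepdims=True)  # shape (1, n)
    return d - row_means - col_means + d.mean()

x = np.array([1.0, 2.0, 3.0, 4.0])
d_x = np.abs(x[:, None] - x[None, :])
a = double_center(d_x)

print(a)
# Every row and every column of a double-centered matrix sums to zero,
# which is a quick sanity check on hand-computed tables.
print(np.allclose(a.sum(axis=0), 0.0), np.allclose(a.sum(axis=1), 0.0))
```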

## Step 3 — Centered Distance Matrix for \(Y\)

Applying the same procedure to $D^Y$ (row means 6.5, 5.0, 5.0, 8.5; grand mean $100/16 = 6.25$) yields matrix $B$:

| $B$ | 1     | 2     | 3     | 4      |
| --- | ----- | ----- | ----- | ------ |
| 1   | -6.75 | -2.25 | 2.75  | 6.25   |
| 2   | -2.25 | -3.75 | 1.25  | 4.75   |
| 3   | 2.75  | 1.25  | -3.75 | -0.25  |
| 4   | 6.25  | 4.75  | -0.25 | -10.75 |

## Step 4 — Distance Covariance

The squared distance covariance is:

$$
\mathrm{dCov}^2(X,Y)
=
\frac{1}{n^2} \sum_{i,j} A_{ij} B_{ij}
$$

For this example, summing the element-wise products row by row gives:

$$
\sum_{i,j} A_{ij} B_{ij} = 22.25 + 7.25 + 5.25 + 30.25 = 65
$$

With $n = 4$:

$$
\mathrm{dCov}^2(X,Y) = \frac{65}{16} = 4.0625
$$

## Step 5 — Distance Variances

Distance variances are computed as:

$$
\mathrm{dVar}^2(X) = \frac{1}{n^2} \sum_{i,j} A_{ij}^2
$$

$$
\mathrm{dVar}^2(Y) = \frac{1}{n^2} \sum_{i,j} B_{ij}^2
$$

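Results (with $\sum_{i,j} A_{ij}^2 = 13$ and $\sum_{i,j} B_{ij}^2 = 341$):

- $\mathrm{dVar}^2(X) = 13/16 = 0.8125$, so $\mathrm{dVar}(X) = \sqrt{0.8125} \approx 0.901$
- $\mathrm{dVar}^2(Y) = 341/16 = 21.3125$, so $\mathrm{dVar}(Y) = \sqrt{21.3125} \approx 4.617$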

## Step 6 — Distance Correlation

The distance correlation is defined as:

$$
\mathrm{dCor}(X,Y)
=
\frac{\mathrm{dCov}(X,Y)}
{\sqrt{\mathrm{dVar}(X)\mathrm{dVar}(Y)}}
$$

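Substituting $\mathrm{dCov}(X,Y) = \sqrt{4.0625} \approx 2.016$, $\mathrm{dVar}(X) \approx 0.901$, and $\mathrm{dVar}(Y) \approx 4.617$:

$$
\mathrm{dCor}(X,Y)
=
\frac{2.016}{\sqrt{0.901 \times 4.617}}
\approx 0.988
$$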
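
The six steps can be chained into one small function to check the hand calculation. This is a minimal sketch of the biased sample estimator used in this section, assuming one-dimensional inputs; `double_center` and `naive_dcor` are hypothetical names, not functions from an existing package.

```python
import numpy as np

def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def naive_dcor(x, y):
    """Biased sample distance correlation for two 1-D samples of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = double_center(np.abs(x[:, None] - x[None, :]))  # Steps 1-2 for X
    b = double_center(np.abs(y[:, None] - y[None, :]))  # Steps 1-3 for Y
    dcov2 = (a * b).mean()    # Step 4: (1/n^2) * sum of A_ij * B_ij
    dvar2_x = (a * a).mean()  # Step 5: squared distance variance of X
    dvar2_y = (b * b).mean()  # Step 5: squared distance variance of Y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0            # a constant sample has no distance variance
    # Step 6: dCor is the square root of dCov^2 / (dVar(X) * dVar(Y))
    return float(np.sqrt(max(dcov2, 0.0) / denom))

print(round(naive_dcor([1, 2, 3, 4], [1, 4, 9, 16]), 3))  # 0.988
```

The function builds full $n \times n$ matrices, so memory grows quadratically with the sample size; it is meant for checking small examples rather than for large data sets.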

## Interpretation

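- dCor lies in $[0, 1]$. A value close to **1** indicates strong dependence, and the population dCor is 0 only when $X$ and $Y$ are independent.
- dCor captures **both linear and nonlinear relationships**. In this example $Y = X^2$ is monotonic on the observed range, so Pearson correlation is also high (about 0.98); the advantage of dCor appears with non-monotonic relationships.
- No discretization, density estimation, or k-nearest-neighbor search is required.
- The method is fully **nonparametric**. It can be applied to time series data, although the usual significance tests for dCor assume independent observations, so serial dependence should be handled with care.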
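
To illustrate the point about nonlinear relationships, the sketch below uses a hypothetical sample that is symmetric around zero, so that $Y = X^2$ is no longer monotonic. Pearson correlation is then zero, while the biased sample dCor stays clearly positive (roughly 0.5 for this sample). The helper `dcor_1d` is an illustrative name, not a library function.

```python
import numpy as np

def dcor_1d(x, y):
    """Biased sample distance correlation for two 1-D samples (illustrative helper)."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    a = centered(np.asarray(x, dtype=float))
    b = centered(np.asarray(y, dtype=float))
    return float(np.sqrt((a * b).mean() / np.sqrt((a * a).mean() * (b * b).mean())))

# Hypothetical data: symmetric around 0, so the linear association cancels out.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_sym = x_sym ** 2

pearson = np.corrcoef(x_sym, y_sym)[0, 1]
print(f"Pearson: {pearson:.3f}")
print(f"dCor:    {dcor_1d(x_sym, y_sym):.3f}")
```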


## Computing dCor in Python

`scikit-learn` does not natively implement distance correlation, but the `dcor` Python package does. It operates directly on NumPy arrays.


```python
import numpy as np
import dcor

np.random.seed(0)  # fix the seed so the run is reproducible
N = 200

# X is a random walk; Y is a noisy sine of X (a nonlinear, non-monotonic dependence)
X = np.cumsum(np.random.randn(N))
Y = np.sin(0.1 * X) + 0.1 * np.random.randn(N)

dcor_value = dcor.distance_correlation(X, Y)
print(f"Distance Correlation between X and Y: {dcor_value:.4f}")
```

```
Distance Correlation between X and Y: 0.8982
```


## Time Series Lag Example


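```python
max_lag = 5
for lag in range(1, max_lag + 1):
    # np.roll shifts X forward by `lag` steps but wraps the last `lag`
    # values around to the front, so the first `lag` pairs are not true lags.
    X_lag = np.roll(X, lag)
    dcor_val = dcor.distance_correlation(X_lag, Y)
    print(f"Lag {lag}: dCor = {dcor_val:.4f}")
```

```
Lag 1: dCor = 0.8665
Lag 2: dCor = 0.8393
Lag 3: dCor = 0.8224
Lag 4: dCor = 0.8073
Lag 5: dCor = 0.7983
```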
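
The loop above relies on `np.roll`, which is circular: the last `lag` values of `X` reappear at the start of the shifted array. A stricter alignment drops those observations instead. The sketch below shows one way to do this; `lagged_pairs` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def lagged_pairs(x, y, lag):
    """Pair x[t - lag] with y[t], dropping the first `lag` observations of y
    instead of wrapping around."""
    x = np.asarray(x)
    y = np.asarray(y)
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if lag < 1 or lag >= len(x):
        raise ValueError("lag must satisfy 1 <= lag < len(x)")
    return x[:-lag], y[lag:]

x_demo = np.arange(6)
y_demo = 10 * np.arange(6)

print(np.roll(x_demo, 2))  # [4 5 0 1 2 3]  (last two values wrap to the front)
x_lag, y_cur = lagged_pairs(x_demo, y_demo, 2)
print(x_lag, y_cur)        # [0 1 2 3] [20 30 40 50]
```

The aligned arrays can then be passed to `dcor.distance_correlation` in place of `np.roll(X, lag)` and `Y`. The resulting values may differ from the output listed above, since that output was produced with the circular shift.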
