# Similarities amongst sub-indicators by distance correlation

We assume that, for two sub-indicators to stand in a cause-effect relationship, their time-series must evolve in a somewhat similar way over the years. Hence, we start by investigating similarities amongst the sub-indicators of the SDGs. The measure of these similarities should make as few assumptions as possible. A Pearson linear correlation coefficient or a rank correlation coefficient is therefore not our choice, since they assume linearity or monotonicity, respectively.
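To illustrate why these two coefficients can be too restrictive, here is a minimal sketch on synthetic data (the parabola and the helper `average_ranks` are illustrative assumptions, not part of the SDG data). It computes the Pearson coefficient and a rank correlation (Pearson on average ranks) for a relationship that is fully deterministic but neither linear nor monotone.

```python
import numpy as np

def average_ranks(v):
    """1-based ranks; tied values share the average of their ranks."""
    s = np.sort(v)
    left = np.searchsorted(s, v, side="left")    # number of values strictly smaller
    right = np.searchsorted(s, v, side="right")  # number of values smaller or equal
    return (left + right + 1) / 2

# Hypothetical example: y is a deterministic, non-monotone function of x
x = np.arange(-50, 51)
y = x ** 2

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Both coefficients are numerically zero here, although `y` depends entirely on `x`.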

We choose to compute the [distance correlation](https://projecteuclid.org/euclid.aos/1201012979) because it has the following properties:
1. we have an absolute measure of similarity ranging from 0 to 1, $0 \leq \mathcal{R}(X,Y) \leq 1$,
2. $\mathcal{R}(X,Y) = 0$ if and only if $X$ and $Y$ are independent,
3. it is symmetric, $\mathcal{R}(X,Y) = \mathcal{R}(Y,X)$,
4. we are able to investigate non-linear and non-monotone relationships,
5. we can utilise our previously defined Gaussian processes,
6. we can, in theory, find similarities between sub-indicators with different numbers of measurements,
7. the only assumption we need to make is that the probability distributions have finite first moments.

The distance correlation $\mathcal{R}(X,Y)$ is defined as the non-negative square root of:

$$
\mathcal{R}^2(X,Y) = \begin{cases}
\frac{\mathcal{V}^2 (X,Y)}{\sqrt{\mathcal{V}^2 (X)\mathcal{V}^2 (Y)}} &\text{, if $\mathcal{V}^2 (X)\mathcal{V}^2 (Y) > 0$} \\
0 &\text{, if $\mathcal{V}^2 (X)\mathcal{V}^2 (Y) = 0$}
\end{cases}
$$


where


$$
\mathcal{V}^2 (X,Y) = \| f_{X,Y}(t,s) - f_X(t)f_Y(s) \|^2
$$


is the distance covariance with **characteristic functions** $f$, evaluated at arguments $t$ and $s$ that match the dimensions of $X$ and $Y$. Bear in mind that characteristic functions include the imaginary unit $i$, $i^2 = -1$:

$$
f_X(t) = \mathbb{E}[e^{itX}]
$$
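As a small illustration of what a characteristic function looks like numerically, the sketch below estimates $f_X(t)$ for a scalar $X$ by replacing the expectation with a sample average. The standard normal sample and the helper name `empirical_cf` are assumptions made only for this example.

```python
import numpy as np

def empirical_cf(x, t):
    """Estimate f_X(t) = E[exp(i t X)] by a sample average (scalar X)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.exp(1j * t * x))

# Hypothetical data: a standard normal sample, whose true
# characteristic function is exp(-t^2 / 2)
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

value = empirical_cf(x, 1.0)
print("estimate:", value)            # a complex number
print("modulus: ", abs(value))       # never larger than 1
print("true value at t=1:", np.exp(-0.5))
```

The result is complex-valued, which is why the distance covariance works with the modulus of the difference of characteristic functions.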

Thus, we are in the space of complex numbers $\mathbb{C}$. Unfortunately, this means we most likely cannot find exact results; we return to this under Estimators.

## Distance covariance
Let's unpack the distance covariance equation to see what we actually compute:

$$
\mathcal{V}^2 (X,Y) = \| f_{X,Y}(t,s) - f_X(t) \ f_Y(s) \|^2 = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{| f_{X,Y}(t,s) - f_X(t)f_Y(s) |^2}{| t |_p^{1+p} \ | s |_q^{1+q}} \, dt \, ds
$$

where

$$
c_d = \frac{\pi^{(1+d)/2}}{\Gamma \Big( (1+d)/2 \Big)}
$$

where the (complete) Gamma function $\Gamma$ is

$$
\Gamma (z) = \int_0^{\infty} x^{z-1} \ e^{-x} \ dx
$$

with $z \in \mathbb{R}^{+}$. 
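The constant $c_d$ is easy to evaluate with the Gamma function from Python's standard library. The following is a minimal sketch; the function name `c` is chosen here only to mirror the notation.

```python
import math

def c(d):
    """Constant c_d = pi^((1+d)/2) / Gamma((1+d)/2)."""
    half = (1 + d) / 2
    return math.pi ** half / math.gamma(half)

for d in (1, 2, 3):
    print(f"c_{d} = {c(d):.6f}")
```

For example, $c_1 = \pi / \Gamma(1) = \pi$ and $c_2 = \pi^{3/2} / \Gamma(3/2) = 2\pi$, since $\Gamma(3/2) = \sqrt{\pi}/2$.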

$p$ and $q$ are the dimensions of our time-series, i.e. the number of measurements we have available. We can write this as:

$$X \in \mathbb{R}^p$$

$$Y \in \mathbb{R}^q$$


The terrific conclusion of this formulation: **we can, *in theory*, compute similarities between time-series with different numbers of measurements**. 

But we still have some terms in the distance covariance $\mathcal{V}^2 (X,Y)$ which we need to define:

$| t |_p$ denotes the Euclidean norm of $t$ in $\mathbb{R}^p$, so $| t |_p^{1+p}$ is this norm raised to the power $1+p$; $| \cdot |_q$ is defined analogously in $\mathbb{R}^q$.

The numerator of the integrand in $\mathcal{V}^2 (X,Y)$ is bounded from above. For $t \in \mathbb{R}^p$ and $s \in \mathbb{R}^q$:
$$
| f_{X,Y}(t,s) - f_X(t) \ f_Y(s) |^2 \leq \Big( 1- |f_X(t) | ^2 \Big) \ \Big( 1- |f_Y(s) |^2 \Big)
$$

where $|f_X(t) |$ and $|f_Y(s) |$ are the moduli of the complex-valued characteristic functions of the random vectors $X$ and $Y$ of dimensionality $p$ and $q$, respectively.


## Estimators

Since the characteristic functions include the imaginary unit $i$, we cannot recover the exact solution for the distance covariance, and hence for the distance correlation. But we can estimate it with a fairly simple formula. We compute these estimators according to [Huo & Székely, 2016](https://arxiv.org/abs/1410.1503).

We denote the pairwise distances of the $X$ observations by $a_{ij} := \|X_i - X_j \|$ and of the $Y$ observations by $b_{ij} := \|Y_i - Y_j \|$ for $i,j = 1, \dots, n$, where $n$ is the number of measurements, which must be **equal** for $X$ and $Y$. The corresponding distance matrices are denoted by $(A_{ij})^n_{i,j=1}$ and $(B_{ij})^n_{i,j=1}$, where

$$
A_{ij} = a_{ij} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2} \sum_{k,l=1}^n a_{kl}
$$

and

$$
B_{ij} = b_{ij} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2} \sum_{k,l=1}^n b_{kl}
$$

Note that this double-centring is applied to every entry, including the diagonal $i = j$. Setting the diagonal to zero here would allow the resulting covariance estimate to become negative.


Having computed these, we can compute the sample distance covariance $\hat{\mathcal{V}}^2(X,Y)$ by

$$
\hat{\mathcal{V}}^2(X,Y) = \frac{1}{n^2} \sum_{i,j=1}^n A_{ij} \ B_{ij}
$$

The corresponding sample distance variance $\hat{\mathcal{V}}^2(X)$ is consequently:

$$
\hat{\mathcal{V}}^2(X) = \frac{1}{n^2} \sum_{i,j=1}^n A^2_{ij}
$$


Then, we can standardise the covariance to finally arrive at the squared sample distance correlation $\hat{\mathcal{R}}^2(X,Y)$, whose non-negative square root is the sample distance correlation $\hat{\mathcal{R}}(X,Y)$:

$$
\hat{\mathcal{R}}^2(X,Y) = \begin{cases}
\frac{\hat{\mathcal{V}}^2 (X,Y)}{\sqrt{\hat{\mathcal{V}}^2 (X)\hat{\mathcal{V}}^2 (Y)}} &\text{, if $\hat{\mathcal{V}}^2 (X)\hat{\mathcal{V}}^2 (Y) > 0$} \\
0 &\text{, if $\hat{\mathcal{V}}^2 (X)\hat{\mathcal{V}}^2 (Y) = 0$}
\end{cases}
$$
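The estimators above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `dcor` implementation: the function names are hypothetical, and the sketch assumes the double-centring is applied to every entry of the distance matrix, including the diagonal, as in Székely et al. (2007). The function returns the ratio $\hat{\mathcal{V}}^2(X,Y) / \sqrt{\hat{\mathcal{V}}^2(X)\hat{\mathcal{V}}^2(Y)}$.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distances a_ij = ||x_i - x_j|| for n observations."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)  # shape (n, d)
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def double_centred(a):
    """Subtract row and column means and add the grand mean (all entries)."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

def sample_dcor_ratio(x, y):
    """V^2(X,Y) / sqrt(V^2(X) V^2(Y)), or 0 if the denominator is 0."""
    A = double_centred(pairwise_distances(x))
    B = double_centred(pairwise_distances(y))
    n = A.shape[0]
    v_xy = (A * B).sum() / n**2   # sample distance covariance
    v_x = (A * A).sum() / n**2    # sample distance variance of X
    v_y = (B * B).sum() / n**2    # sample distance variance of Y
    denom = np.sqrt(v_x * v_y)
    return v_xy / denom if denom > 0 else 0.0

# Hypothetical test series
x = np.arange(-2, 3)
ratio_reversed = sample_dcor_ratio(x, x[::-1])   # perfectly (negatively) linear
ratio_parabola = sample_dcor_ratio(x, x ** 2)    # non-monotone dependence
pearson_parabola = np.corrcoef(x, x ** 2)[0, 1]

print("reversed series:", ratio_reversed)
print("parabola:       ", ratio_parabola)
print("Pearson, parabola:", pearson_parabola)
```

The reversed series gives a ratio of 1 because both series have identical pairwise distances. The parabola gives a clearly positive ratio, whereas its Pearson coefficient is zero.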

### Unbiased estimators
These estimators are biased. However, we can define an unbiased estimator of the distance covariance $\mathcal{V}^2(X,Y)$, which we call $\Omega_n(x,y)$. We must first redefine our distance matrices $(A_{ij})^n_{i,j=1}$ and $(B_{ij})^n_{i,j=1}$, which we will call $(\tilde{A}_{ij})^n_{i,j=1}$ and $(\tilde{B}_{ij})^n_{i,j=1}$:

$$
\tilde{A}_{ij} = \begin{cases}
a_{ij} - \frac{1}{n-2} \sum_{l=1}^n a_{il} - \frac{1}{n-2} \sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)} \sum_{k,l=1}^n a_{kl} & i \neq j; \\
0 & i = j.
\end{cases}
$$

and

$$
\tilde{B}_{ij} = \begin{cases}
b_{ij} - \frac{1}{n-2} \sum_{l=1}^n b_{il} - \frac{1}{n-2} \sum_{k=1}^n b_{kj} + \frac{1}{(n-1)(n-2)} \sum_{k,l=1}^n b_{kl} & i \neq j; \\
0 & i = j.
\end{cases}
$$

Finally, we can compute the unbiased estimator $\Omega_n(x,y)$ for $\mathcal{V}^2(X,Y)$, which requires $n > 3$:

$$
\Omega_n(x,y) = \frac{1}{n(n-3)} \sum_{i,j=1}^n \tilde{A}_{ij} \ \tilde{B}_{ij}
$$
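The unbiased estimator can be sketched in NumPy as a direct transcription of the definitions of $\tilde{A}_{ij}$, $\tilde{B}_{ij}$ and $\Omega_n(x,y)$ above. The function names and the test series are hypothetical, and the sketch is not the `dcor` implementation.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distances a_ij = ||x_i - x_j|| for n observations."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)  # shape (n, d)
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def u_centred(a):
    """Centre with the (n-2) and (n-1)(n-2) factors; the diagonal is set to 0."""
    n = a.shape[0]
    row = a.sum(axis=1, keepdims=True) / (n - 2)
    col = a.sum(axis=0, keepdims=True) / (n - 2)
    total = a.sum() / ((n - 1) * (n - 2))
    A = a - row - col + total
    np.fill_diagonal(A, 0.0)  # the i = j case
    return A

def unbiased_dcov_sqr(x, y):
    """Omega_n(x, y) = sum_ij A~_ij B~_ij / (n (n - 3))."""
    n = len(x)
    if n <= 3:
        raise ValueError("the unbiased estimator needs n > 3 observations")
    A = u_centred(pairwise_distances(x))
    B = u_centred(pairwise_distances(y))
    return (A * B).sum() / (n * (n - 3))

# Hypothetical test series: y is x reversed
x = np.arange(1.0, 10.0)
y = x[::-1]
omega_xy = unbiased_dcov_sqr(x, y)
omega_xx = unbiased_dcov_sqr(x, x)
print(omega_xy, omega_xx)
```

Because `y` is `x` reversed, both series have the same pairwise distances, so `omega_xy` equals `omega_xx`. Unlike the biased estimator, $\Omega_n(x,y)$ can take negative values for some samples.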


Interestingly, [Lyons (2013)](https://arxiv.org/abs/1106.5758) showed how not only the sample distance correlation but also the population distance correlation can be computed without characteristic functions. This is good to acknowledge, but we do not need to focus on it here. We implement the estimators as we have just defined them.

## Implementation
Fortunately, we are not the first who want to compute distance correlations in Python, so we'll use the package [`dcor`](https://github.com/vnmabus/dcor). Its documentation is available [here](https://dcor.readthedocs.io/en/latest/?badge=latest).

In [1]:
import dcor
import numpy as np

In [3]:
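# Test data: nine observations each, stored as column vectors of shape (n, 1)

a = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9]])  # increasing

b = np.array([[9],[8],[7],[6],[5],[4],[3],[2],[1]])  # a reversed

c = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9]])  # identical to a

d = np.array([[1],[2],[3],[4],[5],[4],[3],[2],[1]])  # rises, then falls (non-monotone)

e = np.array([[5],[4],[3],[2],[1],[2],[3],[4],[5]])  # mirror image of d (e = 6 - d)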

In [4]:
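# u_distance_stats_sqr returns the bias-corrected squared distance covariance,
# correlation and variances; exponent=0.5 applies that power to the Euclidean distances
print('R between a and b: ', dcor.u_distance_stats_sqr(a,b, exponent=0.5))

print('R between a and c: ', dcor.u_distance_stats_sqr(a,c, exponent=0.5))

print('R between d and e: ', dcor.u_distance_stats_sqr(d,e, exponent=0.5))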

R between a and b:  Stats(covariance_xy=0.2964996596350237, correlation_xy=1.0, variance_x=0.2964996596350237, variance_y=0.2964996596350237)
R between a and c:  Stats(covariance_xy=0.2964996596350237, correlation_xy=1.0, variance_x=0.2964996596350237, variance_y=0.2964996596350237)
R between d and e:  Stats(covariance_xy=0.2866037108248322, correlation_xy=1.0, variance_x=0.2866037108248322, variance_y=0.2866037108248322)




--> The two inputs, e.g. `a` and `b`, must contain the same number of observations $n$, i.e. have the same length. Consequently, we need imputations for sub-indicators with different numbers of measurements.
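One simple way to bring two series to a common length is to interpolate both onto a shared set of years. The sketch below is only a stand-in: the years, the values and the helper name `align_by_interpolation` are hypothetical, and linear interpolation is an assumption made for illustration, not necessarily the imputation method used for the SDG data.

```python
import numpy as np

def align_by_interpolation(years_x, x, years_y, y):
    """Linearly interpolate both series onto the union of their years,
    restricted to the period in which both series are observed."""
    start = max(years_x.min(), years_y.min())
    end = min(years_x.max(), years_y.max())
    common = np.union1d(years_x, years_y)
    common = common[(common >= start) & (common <= end)]
    # np.interp expects the sample points (years) in increasing order
    return common, np.interp(common, years_x, x), np.interp(common, years_y, y)

# Hypothetical series: x is measured yearly, y only every second year
years_x = np.array([2000, 2001, 2002, 2003, 2004])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
years_y = np.array([2000, 2002, 2004])
y = np.array([10.0, 8.0, 6.0])

common, x_aligned, y_aligned = align_by_interpolation(years_x, x, years_y, y)
print(common)
print(x_aligned)
print(y_aligned)
```

After this step both series have the same length and can be passed to the estimators, for example after reshaping them to column vectors.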