# Non-linear dependencies amongst the SDGs and climate change by distance correlation

We start by investigating dependencies amongst the SDGs on different levels. The method we use to investigate these dependencies should make as few assumptions as possible. Hence, neither the Pearson linear correlation coefficient nor a rank correlation coefficient is our choice, since they assume linearity and monotonicity, respectively.

We choose to compute the [distance correlation](https://projecteuclid.org/euclid.aos/1201012979), more precisely the [partial distance correlation](https://projecteuclid.org/download/pdfview_1/euclid.aos/1413810731), because of the following properties:
1. we have an absolute measure of dependence ranging from $0$ to $1$, $0 \leq \mathcal{R}(X,Y) \leq 1$,
2. $\mathcal{R}(X,Y) = 0$ if and only if $X$ and $Y$ are independent,
3. $\mathcal{R}(X,Y) = \mathcal{R}(Y,X)$,
4. we are able to investigate non-linear and non-monotone relationships,
5. we can find dependencies between indicators with different numbers of measurements,
6. the only assumption we need to make is that the probability distributions have finite first moments.
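To illustrate why a linear correlation coefficient is not sufficient, the following minimal sketch evaluates the Pearson coefficient on a purely quadratic relationship. The data are made up for illustration and are not SDG indicators; all variable names are hypothetical.

```python
import numpy as np

# hypothetical toy data: y is fully determined by x, but not monotonically
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation coefficient between x and y
pearson = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {pearson:.3e}")
```

Although $y$ is a deterministic function of $x$, the Pearson coefficient vanishes up to floating-point error, because the symmetric parabola has no linear trend.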

The partial distance correlation, which we also refer to as conditional distance correlation in this notebook, has the advantage that it removes the influence of any other targets or goals when we compute the correlation between any two targets or goals. This procedure is also called controlling for confounders.

The **distance correlation** is defined as:

$$
\mathcal{R}^2(X,Y) = \begin{cases}
\frac{\mathcal{V}^2 (X,Y)}{\sqrt{\mathcal{V}^2 (X)\mathcal{V}^2 (Y)}} &\text{, if $\mathcal{V}^2 (X)\mathcal{V}^2 (Y) > 0$} \\
0 &\text{, if $\mathcal{V}^2 (X)\mathcal{V}^2 (Y) = 0$}
\end{cases}
$$


where


$$
\mathcal{V}^2 (X,Y) = \| f_{X,Y}(t) - f_X(t)f_Y(t) \|^2
$$


is the distance covariance with **characteristic functions** $f(t)$. Bear in mind that characteristic functions include the imaginary unit $i$, $i^2 = -1$:

$$
f_X(t) = \mathbb{E}[e^{itX}]
$$

Thus, we are in the space of complex numbers $\mathbb{C}$. Unfortunately, this means we can most likely not find exact results, but we'll get back to this later under Estimators.
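As a small illustration of a characteristic function, the sketch below estimates $f_X(t)$ by averaging $e^{itx}$ over a sample and compares it with the known closed form $e^{-t^2/2}$ of the standard normal distribution. The function name `empirical_cf` and the simulated data are assumptions made for this example only.

```python
import numpy as np

def empirical_cf(sample, t):
    """Empirical characteristic function: mean of exp(i * t * x) over the sample."""
    return np.mean(np.exp(1j * t * np.asarray(sample)))

rng = np.random.default_rng(0)           # fixed seed, for reproducibility only
sample = rng.standard_normal(10_000)     # hypothetical data: standard normal draws

for t in (0.0, 0.5, 1.0):
    estimate = empirical_cf(sample, t)   # complex number
    exact = np.exp(-t ** 2 / 2)          # characteristic function of N(0, 1)
    print(t, estimate, exact)
```

The estimate is complex-valued, with a modulus of at most $1$, and it equals $1$ at $t = 0$.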

The **conditional distance correlation** is defined as:

$$
\mathcal{R}^2(X,Y \ | \ Z) = \begin{cases}
\frac{\mathcal{R}^2 (X,Y) - \mathcal{R}^2 (X,Z) \mathcal{R}^2 (Y,Z)}{\sqrt{1 - \mathcal{R}^4 (X,Z)} \sqrt{1 - \mathcal{R}^4 (Y,Z)}} &\text{, if $\mathcal{R}^4 (X,Z) \neq 1$ and $\mathcal{R}^4 (Y,Z) \neq 1$} \\
0 &\text{, if $\mathcal{R}^4 (X,Z) = 1$ or $\mathcal{R}^4 (Y,Z) = 1$}
\end{cases}
$$
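The formula can be checked with a few numbers. The sketch below combines three given values of $\mathcal{R}^2$ exactly as in the definition; the function name `partial_dcor_sq` and the input values are hypothetical and chosen for illustration only.

```python
import math

def partial_dcor_sq(r_xy, r_xz, r_yz):
    """Combine three values of R^2 as in the definition of R^2(X, Y | Z)."""
    # degenerate case: the denominator would vanish
    if r_xz ** 2 == 1 or r_yz ** 2 == 1:
        return 0.0
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2)
    return numerator / denominator

# hypothetical input values
value = partial_dcor_sq(r_xy=0.5, r_xz=0.6, r_yz=0.7)
print(value)
```

With these inputs, the numerator is $0.5 - 0.6 \cdot 0.7 = 0.08$ and the denominator is $\sqrt{0.64} \sqrt{0.51}$, so a pairwise value of $0.5$ shrinks considerably once the shared dependence on $Z$ is removed.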

# Distance covariance
Let's dismantle the distance covariance equation to know what we actually compute in the distance correlation:

$$
\mathcal{V}^2 (X,Y) = \| f_{X,Y}(t) - f_X(t) \ f_Y(t) \|^2 = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{| f_{X,Y}(t) - f_X(t)f_Y(t) |^2}{| t |_p^{1+p} \ | t |_q^{1+q}} dt
$$

where

$$
c_d = \frac{\pi^{(1+d)/2}}{\Gamma \Big( (1+d)/2 \Big)}
$$

where the (complete) Gamma function $\Gamma$ is

$$
\Gamma (z) = \int_0^{\infty} x^{z-1} \ e^{-x} \ dx
$$

with $z \in \mathbb{R}^{+}$. 
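The constant $c_d$ is straightforward to evaluate with the Gamma function from the Python standard library. A minimal sketch, where the function name `c` is chosen to mirror the notation above:

```python
import math

def c(d):
    """Normalising constant c_d = pi^((1+d)/2) / Gamma((1+d)/2)."""
    return math.pi ** ((1 + d) / 2) / math.gamma((1 + d) / 2)

for d in (1, 2, 3):
    print(d, c(d))
```

For $d = 1$ the constant reduces to $\pi$ because $\Gamma(1) = 1$, and for $d = 3$ it reduces to $\pi^2$ because $\Gamma(2) = 1$.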

$p$ and $q$ are the dimensions of the random vectors $X$ and $Y$. In our setting, they are the numbers of samples of the two time-series: each time point is one observation of a random vector that collects all samples available at that time point. However, the number of samples per time point must not vary within the same time-series. We can write this as:

$$X \ \text{in} \ \mathbb{R}^p$$

$$Y \ \text{in} \ \mathbb{R}^q$$


A preliminary conclusion of this formulation: **we can compute dependencies between time-series with different numbers of samples**. 

But we still have some terms in the distance covariance $\mathcal{V}^2 (X,Y)$ which we need to define:

$|\cdot|_p$ is the Euclidean norm in $\mathbb{R}^p$ and $|\cdot|_q$ is the Euclidean norm in $\mathbb{R}^q$; in the integrand they are raised to the powers $1+p$ and $1+q$, respectively.

The numerator in the integral of $\mathcal{V}^2 (X,Y)$ is bounded from above by the Cauchy-Schwarz inequality:
$$
| f_{X,Y}(t) - f_X(t) \ f_Y(t) |^2 \leq \Big( 1- |f_X(t) | ^2 \Big) \ \Big( 1- |f_Y(t) |^2 \Big)
$$

where $|f_X(t) |$ and $|f_Y(t) |$ are the moduli of the complex-valued characteristic functions of $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, respectively.


## Estimators

Since the characteristic functions include the imaginary unit $i$, we cannot recover the exact solution for the distance covariance. However, we can estimate it by a quite simple form. We compute these estimators according to [Huo & Szekely, 2016](https://arxiv.org/abs/1410.1503).

We denote the pairwise distances of the $X$ observations by $a_{ij} := \|X_i - X_j \|$ and of the $Y$ observations by $b_{ij} := \|Y_i - Y_j \|$ for $i,j = 1, ..., n$, where $n$ is the number of measurements in $X$ and $Y$. The corresponding double-centred distance matrices are denoted by $(A_{ij})^n_{i,j=1}$ and $(B_{ij})^n_{i,j=1}$, where

$$
A_{ij} = a_{ij} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2} \sum_{k,l=1}^n a_{kl}
$$

and

$$
B_{ij} = b_{ij} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2} \sum_{k,l=1}^n b_{kl} .
$$

Note that all entries, including the diagonal ($i = j$), are centred, as in the original definition of [Szekely et al. (2007)](https://projecteuclid.org/euclid.aos/1201012979).


Having computed these, we can compute the sample distance covariance $\hat{\mathcal{V}}^2(X,Y)$ by

$$
\hat{\mathcal{V}}^2(X,Y) = \frac{1}{n^2} \sum_{i,j=1}^n A_{ij} \ B_{ij}
$$

The corresponding sample distance variance $\hat{\mathcal{V}}^2(X)$ is consequently:

$$
\hat{\mathcal{V}}^2(X) = \frac{1}{n^2} \sum_{i,j=1}^n A^2_{ij}
$$


Then, we can scale these covariances to finally arrive at the sample distance correlation $\hat{\mathcal{R}}^2(X,Y)$:

$$
\hat{\mathcal{R}}^2(X,Y) = \begin{cases}
\frac{\hat{\mathcal{V}}^2 (X,Y)}{\sqrt{\hat{\mathcal{V}}^2 (X)\hat{\mathcal{V}}^2 (Y)}} &\text{, if $\hat{\mathcal{V}}^2 (X)\hat{\mathcal{V}}^2 (Y) > 0$} \\
0 &\text{, if $\hat{\mathcal{V}}^2 (X)\hat{\mathcal{V}}^2 (Y) = 0$}
\end{cases}
$$
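These estimators take only a few lines of `numpy`. The following is a minimal sketch under stated assumptions, not the `dcor` implementation: the function names `centred_distances` and `dcor_sq` are hypothetical, the toy data are made up, and every entry of the distance matrix (including the diagonal) is centred, following Szekely et al. (2007).

```python
import numpy as np

def centred_distances(x):
    """Pairwise Euclidean distances of the rows of x, double-centred."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                       # n observations of a 1-dimensional vector
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcor_sq(x, y):
    """Sample distance correlation R^2 with the scaling used in this notebook."""
    a, b = centred_distances(x), centred_distances(y)
    dcov_xy = (a * b).mean()                 # (1 / n^2) * sum of A_ij * B_ij
    dvar_x, dvar_y = (a * a).mean(), (b * b).mean()
    if dvar_x * dvar_y <= 0:
        return 0.0
    return dcov_xy / np.sqrt(dvar_x * dvar_y)

x = np.linspace(-1.0, 1.0, 101)              # hypothetical toy data
print(dcor_sq(x, 2 * x + 1))                 # linear relationship
print(dcor_sq(x, x ** 2))                    # non-monotone relationship
print(dcor_sq(x, np.ones_like(x)))           # constant second variable
```

A linear relationship yields a value of $1$ up to rounding, the parabola yields a clearly positive value although its Pearson correlation vanishes, and a constant variable falls into the second case of the definition.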

### Unbiased estimators
These estimators are biased, but we can define an unbiased estimator of the distance covariance $\mathcal{V}^2(X,Y)$ and call it $\Omega_n(X,Y)$. We must first redefine our distance matrices $(A_{ij})^n_{i,j=1}$ and $(B_{ij})^n_{i,j=1}$, which we will call $(\tilde{A}_{ij})^n_{i,j=1}$ and $(\tilde{B}_{ij})^n_{i,j=1}$:

$$
\tilde{A}_{ij} = \begin{cases}
a_{ij} - \frac{1}{n-2} \sum_{l=1}^n a_{il} - \frac{1}{n-2} \sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)} \sum_{k,l=1}^n a_{kl} & i \neq j; \\
0 & i = j.
\end{cases}
$$

and

$$
\tilde{B}_{ij} = \begin{cases}
b_{ij} - \frac{1}{n-2} \sum_{l=1}^n b_{il} - \frac{1}{n-2} \sum_{k=1}^n b_{kj} + \frac{1}{(n-1)(n-2)} \sum_{k,l=1}^n b_{kl} & i \neq j; \\
0 & i = j.
\end{cases}
$$

Finally, we can compute the unbiased estimator $\Omega_n(X,Y)$ for $\mathcal{V}^2(X,Y)$ as the inner product $\langle \tilde{A}, \tilde{B} \rangle$:

$$
\Omega_n(X,Y) = \langle \tilde{A}, \tilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i,j=1}^n \tilde{A}_{ij} \ \tilde{B}_{ij}
$$
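The modified centring and the inner product can be sketched with `numpy` as follows. The function names `u_centred` and `u_inner` as well as the simulated data are assumptions made for this illustration; the sketch requires $n > 3$.

```python
import numpy as np

def u_centred(d):
    """Apply the modified centring to a pairwise distance matrix d."""
    n = d.shape[0]
    centred = (d
               - d.sum(axis=1, keepdims=True) / (n - 2)
               - d.sum(axis=0, keepdims=True) / (n - 2)
               + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(centred, 0.0)           # the diagonal is zero by definition
    return centred

def u_inner(a_tilde, b_tilde):
    """Inner product 1 / (n (n - 3)) * sum of A~_ij * B~_ij."""
    n = a_tilde.shape[0]
    return (a_tilde * b_tilde).sum() / (n * (n - 3))

rng = np.random.default_rng(1)               # fixed seed, illustration only
x = rng.normal(size=8)                       # hypothetical sample with n = 8
y = x + rng.normal(size=8)
a_tilde = u_centred(np.abs(x[:, None] - x[None, :]))
b_tilde = u_centred(np.abs(y[:, None] - y[None, :]))
print(u_inner(a_tilde, b_tilde))
```

A useful sanity check is that every row of $\tilde{A}$ sums to zero, which follows directly from the definition.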


Interestingly, [Lyons (2013)](https://arxiv.org/abs/1106.5758) showed that not only the sample distance correlation, but also the population distance correlation can be computed without characteristic functions. This is worth acknowledging, but we do not need to pursue it here.

# Conditional distance covariance

We start by computing the unbiased distance matrices $(\tilde{A}_{ij})^n_{i,j=1}$, $(\tilde{B}_{ij})^n_{i,j=1}$, and $(\tilde{C}_{ij})^n_{i,j=1}$ for $X$, $Y$, and $Z$, respectively, as we have done previously for the distance covariance. We define the inner product

$$
\Omega_n(X,Y) = \langle \tilde{A}, \tilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i,j=1}^n \tilde{A}_{ij} \tilde{B}_{ij}
$$

and project the sample $x$ onto $z$ as 

$$
P_z (x) = \frac{\langle \tilde{A}, \tilde{C} \rangle}{\langle \tilde{C}, \tilde{C} \rangle} \tilde{C} .
$$

The complementary projection is consequently

$$
P_{z^{\bot}} (x) = \tilde{A} - P_z (x) = \tilde{A} - \frac{\langle \tilde{A}, \tilde{C} \rangle}{\langle \tilde{C}, \tilde{C} \rangle} \tilde{C} .
$$

Hence, the sample conditional distance covariance is

$$
\hat{\mathcal{V}}^2(X,Y \ | \ Z) = \langle P_{z^{\bot}} (x), P_{z^{\bot}} (y) \rangle .
$$

Then, we can scale these covariances to finally arrive at the sample conditional distance correlation $\hat{\mathcal{R}}^2(X,Y \ | \ Z)$:

$$
\hat{\mathcal{R}}^2(X,Y \ | \ Z) = \begin{cases}
\frac{\langle P_{z^{\bot}} (x), P_{z^{\bot}} (y) \rangle}{\| P_{z^{\bot}} (x) \| \ \| P_{z^{\bot}} (y) \|} &\text{, if} \ \| P_{z^{\bot}} (x) \| \ \| P_{z^{\bot}} (y) \| \neq 0 \\
0 &\text{, if} \ \| P_{z^{\bot}} (x) \| \ \| P_{z^{\bot}} (y) \| = 0
\end{cases}
$$
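The projection steps can be sketched with plain `numpy`. This is an illustration under stated assumptions rather than the `dcor` implementation: all function names are hypothetical, the samples are one-dimensional, and the simulated data mimic a case where $X$ and $Y$ both depend on $Z$.

```python
import numpy as np

def u_centred(x):
    """Unbiased (modified) centring of the distance matrix of a 1-d sample x."""
    d = np.abs(x[:, None] - x[None, :])
    n = len(x)
    c = (d - d.sum(axis=1, keepdims=True) / (n - 2)
           - d.sum(axis=0, keepdims=True) / (n - 2)
           + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(c, 0.0)
    return c

def inner(a, b):
    """Inner product 1 / (n (n - 3)) * sum of a_ij * b_ij."""
    n = a.shape[0]
    return (a * b).sum() / (n * (n - 3))

def complementary_projection(a, c):
    """Remove from a its component along c."""
    return a - inner(a, c) / inner(c, c) * c

def partial_dcor(x, y, z):
    a, b, c = u_centred(x), u_centred(y), u_centred(z)
    pa, pb = complementary_projection(a, c), complementary_projection(b, c)
    norm = np.sqrt(inner(pa, pa) * inner(pb, pb))
    return inner(pa, pb) / norm if norm != 0 else 0.0

rng = np.random.default_rng(2)                     # fixed seed, illustration only
z = rng.uniform(-5, 5, 200)                        # hypothetical confounder
x = z + rng.uniform(-1, 1, 200)
y = 0.25 * z ** 2 + rng.uniform(-1, 1, 200)
print(partial_dcor(x, y, z))
```

By construction, the complementary projection is orthogonal to $\tilde{C}$ with respect to this inner product, so whatever remains of $\tilde{A}$ and $\tilde{B}$ carries no component along the confounder.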

## Implementation
For our computations, we'll use the packages [`dcor`](https://dcor.readthedocs.io/en/latest/?badge=latest) for the partial distance correlation and [`community`](https://github.com/taynaud/python-louvain) for the clustering.

In [None]:
import dcor
import numpy as np
import pickle
import itertools
import pandas as pd
import os
import math
from tqdm import notebook as tqdm

import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import matplotlib.image as mpimg
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

import community

from dcor._dcor_internals import _u_distance_matrix, u_complementary_projection
from sklearn.manifold import MDS

### Loading standardised imputed data set
First of all, we load the standardised imputed data set which we generated in the previous notebook.

In [None]:
indicators_values_i = pickle.load(open('utils/data/indicators_values_i_up_wb.pkl', 'rb'))
targets_values_i = pickle.load(open('utils/data/targets_values_i_up_wb.pkl', 'rb'))
goals_values_i = pickle.load(open('utils/data/goals_values_i_up_wb.pkl', 'rb'))

In [None]:
# check whether T appended
targets_values_i['Belgium'].index

Unfortunately, our temperature data is available from 1991 to 2016 only. Additionally, SDG 13 has data from 2005 to 2019 only. Thus, we can only compute the distance covariances for the 12-dimensional random vectors from 2005 to 2016.
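The common period follows from intersecting the two availability windows stated above. A minimal sketch, where the variable names are chosen for illustration only:

```python
# availability windows stated in the text (first and last year, inclusive)
temperature_years = range(1991, 2016 + 1)
sdg13_years = range(2005, 2019 + 1)

# years covered by both data sets
common_years = sorted(set(temperature_years) & set(sdg13_years))
period_check = [str(year) for year in common_years]

print(period_check[0], period_check[-1], len(period_check))
```

The intersection contains the twelve years from 2005 to 2016.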

In [None]:
period = ['2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016']

In [None]:
# read amended csv file
c = pd.read_csv('utils/countries_wb.csv', dtype=str, delimiter=';', header=None)
countries = list(c[0])
continents = pd.read_csv(r'utils/continents.csv')
groups = pd.read_csv(r'utils/groups.csv')
# harmonise the UN country names with the World Bank names used in our data
country_names = {"Democratic People's Republic of Korea": "Korea, Dem. People's Rep.", 'Gambia': 'Gambia, The', 'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
                 'Congo': 'Congo, Rep.', 'Democratic Republic of the Congo': 'Congo, Dem. Rep.', 'Czechia': 'Czech Republic', 'Iran (Islamic Republic of)': 'Iran, Islamic Rep.',
                 "Côte d'Ivoire": "Cote d'Ivoire", 'Kyrgyzstan': 'Kyrgyz Republic', "Lao People's Democratic Republic": 'Lao PDR', 'Republic of Moldova': 'Moldova',
                 'Micronesia (Federated States of)': 'Micronesia, Fed. Sts.', 'Slovakia': 'Slovak Republic', 'Viet Nam': 'Vietnam', 'Egypt': 'Egypt, Arab Rep.',
                 'United Republic of Tanzania': 'Tanzania', 'United States of America': 'United States', 'Venezuela (Bolivarian Republic of)': 'Venezuela, RB',
                 'Yemen': 'Yemen, Rep.', 'Bahamas': 'Bahamas, The', 'Bolivia (Plurinational State of)': 'Bolivia'}
groups.replace(country_names, inplace=True)
continents.replace(country_names, inplace=True)
info = pd.read_csv(r'utils/wb_info.csv')

We later compute the correlations on an indicator level, but this is too detailed for any network visualisation and for an overarching understanding. Hence, we group here all sub-indicators first on an indicator-level. Then, we compute the distance correlations for the indicators, targets and goals.

We work with the `info` file again, so we don't need to assign all of this by hand.

In [None]:
# check
info['Series Code'].shape

In [None]:
# check
targets_values_i['France'].tail()

We would like to have values for targets, so we must, first of all, generate a list of all unique **targets**.

In [None]:
targets = list(info['target'].unique())    # T for UN data set

Exactly as with the series codes, we need lists of indicators.

In [None]:
dict_targets = {}

for target in targets:
    t = info['Series Code'].where(info['target'] == target)    # replace Series Code with Indicator for UN data set

    dict_targets[target] = list(set([i for i in t if str(i) != 'nan']))

In [None]:
#check 
dict_targets['1.2']

Finally we also generate a list of all unique **goals**.

In [None]:
goals = list(info['SDG'].unique())    # replace SDG with Goal for UN data set
goals

And as above, lists of targets belonging to each goal.

In [None]:
dict_goals = {}

for goal in goals:
    g = info['target'].where(info['SDG'] == goal)

    dict_goals[goal] = list(set([t for t in g if str(t) != 'nan']))

In [None]:
#check 
print(dict_goals['13'])

--------------------------------------------------
#### Partial distance covariance interlude

In [None]:
A = np.random.uniform(-5,5,512).reshape(-1,1)
B = A + np.random.uniform(-1,1,512).reshape(-1,1)
C = 0.25*A**2 + np.random.uniform(-1,1,512).reshape(-1,1)

In [None]:
plt.scatter(B, C);

print('pairwise distance correlation:', dcor.distance_correlation(B, C))
print('partial distance correlation:', dcor.partial_distance_correlation(B, C, A))
print('Pearson correlation:', np.corrcoef(B, C, rowvar=False)[0][1])

We see that $B$ and $C$ are pairwise dependent, but $B$ is independent of $C$ given $A$.

------------------------------------

## Distance correlations between goals

The next step is to compute the distance correlations on a goal-level.

We work with the **concatenated time-series** to compute the conditioned distance correlation directly on goal-level data. Visually speaking, this means that we fit one non-linear function to the data for all targets of these two goals. Since goals often have diverse targets, this may end up in fitting a non-linear curve to very noisy data.

## Working with concatenated time-series

### Conditioning iteratively on subsets of joint distributions of all goals
We condition pairs of two goals iteratively on subsets of all remaining goals. We start with conditioning on the empty set, i.e. we compute the pairwise distance correlation first. Afterwards, we increase the set to condition on until we have reached the set of all remaining 15 goals to condition on. These sets are represented by the joint distributions of the goals entailed in them.

We need to condition on all **subsets** of the remaining SDGs to isolate the dependence which stems solely from the two SDGs under consideration:

In [None]:
def combinations(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    # thin wrapper around itertools.combinations that yields lists instead of tuples
    for combo in itertools.combinations(iterable, r):
        yield list(combo)

In [None]:
def combinations_tuple(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    # yields tuples, which are hashable and can therefore serve as dictionary keys
    yield from itertools.combinations(iterable, r)

In [None]:
def product(pool_0, pool_1):
    # yields one (x, y, z) tuple per pair (x, y) and condition z    # ~ 40 Mio rows
    # adding the filter `if x not in z and y not in z` would leave ~ 10 Mio rows
    for x, y in pool_0:
        for z in pool_1:
            yield (x, y, z)

In [None]:
# create list out of all unique combinations of goals
g_combinations = list(combinations(goals, 2))
conditions_g = []
conditions_g_tuple = []
for i in range(1, 18):
    conditions_g.extend(list(combinations(goals, i)))
    conditions_g_tuple.extend(tuple(combinations_tuple(goals, i)))

pairs = list(product(g_combinations, conditions_g))
pairs_g0 = pd.DataFrame.from_records(pairs, columns=['pair_0', 'pair_1', 'condition'])

In [None]:
# adding empty condition set for pairwise dcor
pairs_g1 = pd.DataFrame.from_records(data=g_combinations, columns=['pair_0', 'pair_1'])
pairs_g1['condition'] = 0
pairs_g = pd.concat([pairs_g0, pairs_g1], ignore_index=True)

In [None]:
# check
pairs_g.tail()

In [None]:
# shapes
print(len(conditions_g))
print(pairs_g0.shape)
print(pairs_g1.shape)
print(pairs_g.shape)
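The sizes printed above can be checked by hand. The sketch below assumes that `goals` has 18 entries (the 17 SDGs plus the temperature series), which is inferred from the 153 pairs mentioned in the code comments, since $\binom{18}{2} = 153$; the variable names are hypothetical.

```python
from math import comb

n_goals = 18          # assumption: 17 SDGs plus the temperature series
n_pairs = comb(n_goals, 2)

# conditional sets of size 1 to 17, drawn from all 18 entries
n_conditions = sum(comb(n_goals, r) for r in range(1, 18))

# all pair-condition combinations before filtering
n_rows = n_pairs * n_conditions

# conditional sets containing neither goal of the pair: non-empty subsets of the other 16
n_rows_filtered = n_pairs * (2 ** (n_goals - 2) - 1)

print(n_pairs, n_conditions, n_rows, n_rows_filtered)
```

Under this assumption, the counts agree with the orders of magnitude noted in the comments of `product` (about 40 million rows before and about 10 million rows after filtering).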

For the dcor computations, we need the same number of samples in each year. We bootstrap the under-represented years to the same number of samples of the year with the maximum samples. We need bootstrapped data for all goals, so we can call them afterwards when conditioning.
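The bootstrapping step can be illustrated on a toy example. The numbers and variable names below are made up; only the resampling logic mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, illustration only

# hypothetical measurements per year with unequal sample counts
samples = {'2005': np.array([0.1, 0.4]),
           '2006': np.array([0.2, 0.3, 0.5, 0.9]),
           '2007': np.array([0.7, 0.8, 0.6])}

max_len = max(len(values) for values in samples.values())

# resample every year with replacement up to the size of the largest year
boot = {year: rng.choice(values, size=max_len, replace=True)
        for year, values in samples.items()}

print({year: len(values) for year, values in boot.items()})
```

Note that the year with the most samples is resampled as well, so its bootstrapped values are in general not identical to the original ones.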

In [None]:
# check
goals_values_i['France']

# Continents

In [None]:
# CHECKPOINT
dict_cov_goals_continents_2 = pickle.load(open('distance_cor/goals/dict_cov_goals_continents_2.pkl', 'rb'))
dict_cor_goals_continents_2 = pickle.load(open('distance_cor/goals/dict_cor_goals_continents_2.pkl', 'rb'))

In [None]:
# data preparation
continents_prep_g = {}
continents_prep_boot_g = {}

for continent in continents:
    print(continent)
    
    continents_prep_g[continent] = pd.DataFrame(columns=period, index=goals, dtype=object)
    continents_prep_boot_g[continent] = pd.DataFrame(columns=period, index=goals, dtype=object)
    
    for goal in goals:
        max_year_g = 0
        
        for year in period:
            y_g = []

            for country in continents[continent].dropna():
                y_g.extend(goals_values_i[country].loc[str(goal), year])
            
            continents_prep_g[continent].loc[goal, year] = np.array(y_g)

            # finding year with most measurements
            len_year_g = len(y_g)
            if len_year_g > max_year_g:
                max_year_g = len_year_g
                
        # bootstrap to have same number of samples in each year
        for year in period:
            # not considering the years without samples
            if len(continents_prep_g[continent].loc[goal, year]) == 0:
                continue    
            else:
                continents_prep_boot_g[continent].loc[goal, year] = np.random.choice(continents_prep_g[continent].loc[goal, year], size=max_year_g, replace=True)

Now we call these data in our `dcor` computations. We first compute the pairwise distance covariance and correlation, then the partial ones with conditioning on all the previously defined sets in `pairs_g`.

### Preparations
Filtering out the conditions that contain goals already being $X$ (`pair_0`) or $Y$ (`pair_1`):

In [None]:
# repeating conditions 153 times to bring to same length as pair_0 and pair_1
conditions_g_tuple_all = conditions_g_tuple * len(pairs_g1)
conditions_g_all = conditions_g * len(pairs_g1)

# pairs_g0 lists all conditions for the first pair, then all conditions for the second pair, and so on
pair_0 = pairs_g0['pair_0'].to_list()
pair_1 = pairs_g0['pair_1'].to_list()

n_conditions = len(conditions_g)
idx_to_delete = []
idx_to_keep = []
keep = []    # boolean mask over the rows of pairs_g0, as required by itertools.compress

for c in tqdm.tqdm(range(len(pairs_g1))):
    for i in range(n_conditions):
        idx = c * n_conditions + i    # row of pair c combined with condition i
        if pair_0[idx] in conditions_g_all[idx] or pair_1[idx] in conditions_g_all[idx]:
            idx_to_delete.append(idx)
            keep.append(False)
        else:
            idx_to_keep.append(idx)    # which rows to keep
            keep.append(True)

# deleting rows with same goals in pairs_0 or pairs_1 and condition
pair_0_left = list(itertools.compress(pair_0, keep))
pair_1_left = list(itertools.compress(pair_1, keep))
conditions_tuple_left = list(itertools.compress(conditions_g_tuple_all, keep))
conditions_left = list(itertools.compress(conditions_g_all, keep))
pairs_g_left = pd.DataFrame.from_dict({'pair_0': pair_0_left, 'pair_1': pair_1_left, 'condition': conditions_tuple_left})
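A note on `itertools.compress(data, selectors)`: it keeps the elements of `data` whose selector is truthy and does not interpret the selectors as positions. The toy example below, with hypothetical pairs and conditional sets, shows the filtering on three rows.

```python
import itertools

pairs = [('1', '2'), ('1', '2'), ('1', '3')]        # hypothetical (X, Y) goal pairs
conditions = [('2', '4'), ('4',), ('4', '5')]       # hypothetical conditional sets

# boolean selectors: True where neither goal of the pair is part of the condition
keep = [x not in cond and y not in cond for (x, y), cond in zip(pairs, conditions)]

pairs_left = list(itertools.compress(pairs, keep))
conditions_left = list(itertools.compress(conditions, keep))
print(pairs_left, conditions_left)
```

The first row is dropped because goal `'2'` appears both in the pair and in its conditional set; the other two rows are kept.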

# Without parallelisation

*(don't even try running it, takes 800 hours per continent)*

In [None]:
# continents
# have conditions in joint random vector
dict_cov_goals_continents_2 = {}
dict_cor_goals_continents_2 = {}

for continent in continents:
    print(continent)
    
    dict_cov_goa_c = pairs_g_left.copy()
    dict_cor_goa_c = pairs_g_left.copy()
    
    for i in tqdm.tqdm(range(len(pairs_g_left))):
        
        # pairwise distance correlation
        if pairs_g_left.iloc[i]['condition'] == 0:
            dict_cov_goa_c.loc[i, 'dcov'] = dcor.distance_covariance(np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_0']].to_list()), np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_1']].to_list()))**2
            dict_cor_goa_c.loc[i, 'dcor'] = dcor.distance_correlation(np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_0']].to_list()), np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_1']].to_list()))**2

        # partial distance correlation
        else:
            # build conditional set
            condition = pd.DataFrame(index=period, columns=list(pairs_g_left.iloc[i]['condition']) + ['combined'])    # the condition is stored as a tuple
            for y in period:
                condition.loc[y, 'combined'] = []
            for c in pairs_g_left.iloc[i]['condition']:
                condition[c] = continents_prep_boot_g[continent].loc[c]
                for y in period:
                    condition.loc[y, 'combined'].extend(condition.loc[y, c])
                
                # drop every column except 'combined' to save memory
                condition.drop(columns=c, inplace=True)

            # partial distance correlation  
            dict_cov_goa_c.loc[i, 'dcov'] = dcor.partial_distance_covariance(np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_0']].to_list()), np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_1']].to_list()), np.array(condition['combined'].to_list()))**2
            dict_cor_goa_c.loc[i, 'dcor'] = dcor.partial_distance_correlation(np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_0']].to_list()), np.array(continents_prep_boot_g[continent].loc[pairs_g_left.iloc[i]['pair_1']].to_list()), np.array(condition['combined'].to_list()))**2
        

    # find minimum distance correlation between any two goals
    dict_cov_goa_con = dict_cov_goa_c.groupby(['pair_0', 'pair_1'])['dcov'].apply(list).reset_index(name='list_dcov')
    dict_cor_goa_con = dict_cor_goa_c.groupby(['pair_0', 'pair_1'])['dcor'].apply(list).reset_index(name='list_dcor')
    
    for i, row_c in dict_cov_goa_con.iterrows():
        dict_cov_goa_con.loc[i, 'min_dcov'] = min(dict_cov_goa_con.loc[i, 'list_dcov'])
        dict_cor_goa_con.loc[i, 'min_dcor'] = min(dict_cor_goa_con.loc[i, 'list_dcor'])
    
    dict_cov_goals_continents_2[continent] = dict_cov_goa_con
    dict_cor_goals_continents_2[continent] = dict_cor_goa_con

In [None]:
# save
g_cov = open('distance_cor/goals/dict_cov_goals_continents_2_wp.pkl', 'wb')
pickle.dump(dict_cov_goals_continents_2, g_cov)
g_cov.close()

g_cor = open('distance_cor/goals/dict_cor_goals_continents_2_wp.pkl', 'wb')
pickle.dump(dict_cor_goals_continents_2, g_cor)
g_cor.close()

# With `multiprocessing` parallelisation


 
### Partial distance correlation

In [None]:
import multiprocessing as mp
print("Number of processors: ", mp.cpu_count())
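When submitting work to a pool, it matters when the results are collected: `AsyncResult.get()` blocks until its task has finished, so tasks only overlap if all of them are submitted before the first `get()` is called. The following minimal sketch uses a hypothetical toy function; it relies on `ThreadPool`, which offers the same interface as `mp.Pool`, so that it runs in any environment.

```python
from multiprocessing.pool import ThreadPool   # same interface as multiprocessing.Pool

def scaled_sum(i, a, b):
    # hypothetical stand-in for an expensive dcor computation
    return i * (a + b)

rows = [(0, 1, 2), (1, 3, 4), (2, 5, 6)]

with ThreadPool(2) as pool:
    # submit every task first ...
    jobs = [pool.apply_async(scaled_sum, args=row) for row in rows]
    # ... and only then wait for the results
    results_async = [job.get() for job in jobs]

    # starmap does the same in one call and keeps the input order
    results_starmap = pool.starmap(scaled_sum, rows)

print(results_async, results_starmap)
```

Both variants return the results in the order of the input rows, so they can be written straight into a new column of a data frame.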

In [None]:
def partial_distance_cov(i, pair_0, pair_1, cond):
    pair_0_array = np.array(continents_prep_boot_g[continent].loc[pair_0].to_list())
    pair_1_array = np.array(continents_prep_boot_g[continent].loc[pair_1].to_list())
    condition_array = np.array(conditions_df.loc[[cond]])
    
    return dcor.partial_distance_covariance(pair_0_array, pair_1_array, condition_array)**2

def partial_distance_cor(i, pair_0, pair_1, cond):
    pair_0_array = np.array(continents_prep_boot_g[continent].loc[pair_0].to_list())
    pair_1_array = np.array(continents_prep_boot_g[continent].loc[pair_1].to_list())
    condition_array = np.array(conditions_df.loc[[cond]])
    
    return dcor.partial_distance_correlation(pair_0_array, pair_1_array, condition_array)**2

In [None]:
# continents

dict_cov_goals_continents_2_cond = {}
dict_cor_goals_continents_2_cond = {}

for continent in continents:
    print(continent)
    
    dict_cov_goa_c = pairs_g_left.copy(deep=True)     # pairs_g_left has all non-empty conditional sets, pairs_g1 has empty conditional sets for pairwise dcor
    dict_cor_goa_c = pairs_g_left.copy(deep=True)
    
    # preparing conditional set
    conditions_dict = {}

    for cond in conditions_g_tuple:
        conditions = np.empty([len(period), len(cond)], dtype=object)    
        conditions_comb = np.empty([len(period), ], dtype=object)
        conditions_comb_boot = np.empty([len(period), ], dtype=object)

        max_len_year = 0    # initialised once per conditional set, so that the maximum runs over all years
        for i_y, y in enumerate(period):
            year = []
            for i_c, c in enumerate(cond):
                conditions[i_y, i_c] = continents_prep_boot_g[continent].loc[c, y]
                year.extend(conditions[i_y, i_c])

            # finding year with most measurements
            if len(year) > max_len_year:
                max_len_year = len(year)

            conditions_comb[i_y] = np.array(year)
        # bootstrap to have same number of samples in each year
        for i_y, y in enumerate(period):
            conditions_comb_boot[i_y] = np.random.choice(conditions_comb[i_y], size=max_len_year, replace=True)

        conditions_dict[cond] = conditions_comb_boot   # combined

    # make dataframe out of dictionary
    conditions_df = pd.DataFrame.from_dict(conditions_dict, orient='index', columns=period)
    
    print('start dcor...')
    
    # partial distance correlation
    pool = mp.Pool(mp.cpu_count())
    
    # partial distance covariance and correlation
    # starmap distributes the rows over all workers; calling .get() directly on each
    # apply_async result would wait for that task before submitting the next one
    rows = list(dict_cov_goa_c.itertuples(name=None))
    cov_results = pool.starmap(partial_distance_cov, rows)
    cor_results = pool.starmap(partial_distance_cor, rows)

    pool.close()
    pool.join()
    
    dict_cov_goa_c['dcov'] = cov_results
    dict_cor_goa_c['dcor'] = cor_results
    
    print('...dcor done')

    # find minimum distance correlation between any two goals
    dict_cov_goa_con = dict_cov_goa_c.groupby(['pair_0', 'pair_1'])['dcov'].apply(list).reset_index(name='list_dcov')
    dict_cor_goa_con = dict_cor_goa_c.groupby(['pair_0', 'pair_1'])['dcor'].apply(list).reset_index(name='list_dcor')
    
    for i, row_c in dict_cov_goa_con.iterrows():
        dict_cov_goa_con.loc[i, 'min_dcov'] = min(dict_cov_goa_con.loc[i, 'list_dcov'])
        dict_cor_goa_con.loc[i, 'min_dcor'] = min(dict_cor_goa_con.loc[i, 'list_dcor'])
    
    dict_cov_goals_continents_2_cond[continent] = dict_cov_goa_con
    dict_cor_goals_continents_2_cond[continent] = dict_cor_goa_con

In [None]:
# check
dict_cor_goals_continents_2_cond['Northern Africa']

### Pairwise distance correlation

In [None]:
def distance_cov(i, pair_0, pair_1):
    pair_0_array = np.array(continents_prep_boot_g[continent].loc[pair_0].to_list())
    pair_1_array = np.array(continents_prep_boot_g[continent].loc[pair_1].to_list())
    
    return dcor.distance_covariance(pair_0_array, pair_1_array)**2

def distance_cor(i, pair_0, pair_1):
    pair_0_array = np.array(continents_prep_boot_g[continent].loc[pair_0].to_list())
    pair_1_array = np.array(continents_prep_boot_g[continent].loc[pair_1].to_list())
    
    return dcor.distance_correlation(pair_0_array, pair_1_array)**2

In [None]:
# continents

dict_cov_goals_continents_2_pair = {}
dict_cor_goals_continents_2_pair = {}

for continent in continents:
    print(continent)
    
    dict_cov_goa_c_pair = pairs_g1.drop(columns=['condition']).copy()     # pairs_g1 has empty conditional sets for pairwise dcor
    dict_cor_goa_c_pair = pairs_g1.drop(columns=['condition']).copy()
    
    pool = mp.Pool(mp.cpu_count())
    
    # pairwise distance covariance and correlation
    # starmap distributes the rows over all workers; calling .get() directly on each
    # apply_async result would wait for that task before submitting the next one
    rows = list(dict_cov_goa_c_pair.itertuples(name=None))
    cov_results = pool.starmap(distance_cov, rows)
    cor_results = pool.starmap(distance_cor, rows)

    pool.close()
    pool.join()
    
    dict_cov_goa_c_pair['dcov'] = cov_results
    dict_cor_goa_c_pair['dcor'] = cor_results
    
    print('dcor done...')
    
    # find minimum distance correlation between any two goals
    dict_cov_goa_con_pair = dict_cov_goa_c_pair.groupby(['pair_0', 'pair_1'])['dcov'].apply(list).reset_index(name='list_dcov')
    dict_cor_goa_con_pair = dict_cor_goa_c_pair.groupby(['pair_0', 'pair_1'])['dcor'].apply(list).reset_index(name='list_dcor')
    
    for i, row_c in dict_cov_goa_con_pair.iterrows():
        dict_cov_goa_con_pair.loc[i, 'min_dcov'] = min(dict_cov_goa_con_pair.loc[i, 'list_dcov'])
        dict_cor_goa_con_pair.loc[i, 'min_dcor'] = min(dict_cor_goa_con_pair.loc[i, 'list_dcor'])
    
    dict_cov_goals_continents_2_pair[continent] = dict_cov_goa_con_pair
    dict_cor_goals_continents_2_pair[continent] = dict_cor_goa_con_pair

In [None]:
# check
dict_cor_goals_continents_2_pair['Northern Africa']

In [None]:
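# merge dictionaries (dict.update() works in place and returns None, so we build new dictionaries instead)
# note: both inputs are keyed by continent, so for a shared key the pairwise entry replaces the conditional one
dict_cov_goals_continents_2 = {**dict_cov_goals_continents_2_cond, **dict_cov_goals_continents_2_pair}
dict_cor_goals_continents_2 = {**dict_cor_goals_continents_2_cond, **dict_cor_goals_continents_2_pair}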

In [None]:
# save
g_cov = open('distance_cor/goals/dict_cov_goals_continents_2.pkl', 'wb')
pickle.dump(dict_cov_goals_continents_2, g_cov)
g_cov.close()

g_cor = open('distance_cor/goals/dict_cor_goals_continents_2.pkl', 'wb')
pickle.dump(dict_cor_goals_continents_2, g_cor)
g_cor.close()

### Independence tests
We want to eliminate spurious distance correlations by performing independence tests on the smallest partial distance correlations. As suggested in [Szekely et al. (2007)](https://projecteuclid.org/download/pdfview_1/euclid.aos/1201012979), we use a permutation test to do so. Note that the partial distance covariance test takes the inner product of the double-centered distance matrices of $\mathbf{U}$ and $\mathbf{V}$ as the test statistic. $\mathbf{U}$ and $\mathbf{V}$ are the metric multi-dimensional scalings (MDS) of the complementary projections $P_{z^{\bot}}(x)$ and $P_{z^{\bot}}(y)$.

In [None]:
# pairwise distance covariance test
def dcov_test(X, Y, permutation, alpha):
    
    statistic = dcor.distance_covariance(X, Y)

    dcov_arr = np.zeros(permutation)

    # create permutations by reshuffling random vector Y
    for per in range(permutation):
        Y_perm = np.random.permutation(Y)
        dcov_arr[per] = dcor.distance_covariance(X, Y_perm)

    dcov_arr_sort = np.sort(dcov_arr)

    # computing 1-alpha threshold
    threshold = dcov_arr_sort[round((1-alpha)*permutation)]

    if statistic > threshold:
        dcov_ = dcor.distance_covariance(X, Y)
        dcor_ = dcor.distance_correlation(X, Y)

    else:
        dcov_ = 0
        dcor_ = 0
    
    return dcov_, dcor_

# partial distance covariance test
def pdcov_test(X, Y, Z, permutation, alpha):
    
    a = _u_distance_matrix(X)
    b = _u_distance_matrix(Y)
    c = _u_distance_matrix(Z)

    proj = u_complementary_projection(c)

    a_p = proj(a)
    b_p = proj(b)

    embedding = MDS(n_components=2)
    U = embedding.fit_transform(a_p)
    V = embedding.fit_transform(b_p)

    # test statistic
    statistic = dcor.u_product(dcor.double_centered(U), dcor.double_centered(V))

    # initiating dcor
    dcov_arr = np.zeros(permutation)

    # create permutations by reshuffling X
    for per in range(permutation):
        index_perm = np.random.permutation(X.shape[0])
        X_perm = X[index_perm]
        a_perm = _u_distance_matrix(X_perm)
        a_p_perm = proj(a_perm)
        U_perm = embedding.fit_transform(a_p_perm)
        dcov_arr[per] = dcor.u_product(dcor.double_centered(U_perm), dcor.double_centered(V))

    dcov_arr_sort = np.sort(dcov_arr)

    # computing 1-alpha threshold
    threshold = dcov_arr_sort[round((1-alpha)*permutation)]

    if statistic > threshold:
        dcov_ = dcor.partial_distance_covariance(X, Y, Z)
        dcor_ = dcor.partial_distance_correlation(X, Y, Z)

    else:
        dcov_ = 0
        dcor_ = 0
    
    return dcov_, dcor_
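
The permutation logic shared by both test functions can be sketched without any third-party dependency. The following is a minimal illustration for one-dimensional samples, not the `dcor` implementation: the helper names `centered_distances`, `dcov_sq` and `permutation_p_value` as well as the toy data are hypothetical, and it reports a p-value rather than comparing against a $1-\alpha$ quantile.

```python
import random


def centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in d]
    grand_mean = sum(row_mean) / n
    # the distance matrix is symmetric, so column means equal row means
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def dcov_sq(A, B, order=None):
    """Squared sample distance covariance; `order` re-indexes the rows/columns of B."""
    n = len(A)
    idx = list(range(n)) if order is None else order
    return sum(A[i][j] * B[idx[i]][idx[j]]
               for i in range(n) for j in range(n)) / n ** 2


def permutation_p_value(x, y, n_perm=199, seed=0):
    """Share of permuted statistics that are at least as large as the observed one."""
    rng = random.Random(seed)
    A, B = centered_distances(x), centered_distances(y)
    observed = dcov_sq(A, B)
    order = list(range(len(y)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # shuffling y breaks any dependence on x
        if dcov_sq(A, B, order) >= observed:
            hits += 1
    # add-one correction: the observed ordering counts as one permutation
    return (hits + 1) / (n_perm + 1)


# hypothetical toy data: y depends on x, but neither linearly nor monotonically
x_demo = [-1 + 2 * i / 29 for i in range(30)]
y_demo = [v ** 2 for v in x_demo]
p_value_demo = permutation_p_value(x_demo, y_demo)
print(p_value_demo)
```

Re-indexing the centered matrix of `y` is equivalent to permuting `y` and centering again, which keeps each permutation cheap.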

In [None]:
# continents
alpha = 0.05

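# have conditions in joint random vector
# the test results go into their own dictionaries so that the merged results above are not overwritten
dict_cov_goals_continents_2_tests = {}
dict_cor_goals_continents_2_tests = {}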

for continent in continents:
    print(continent)
    
    # finding which set to test for independence
    to_test_cov = dict_cov_goa_c.loc[dict_cov_goa_c['dcov'] == dict_cov_goa_con.loc[i, 'min_dcov']]
    to_test_cor = dict_cor_goa_c.loc[dict_cor_goa_c['dcor'] == dict_cor_goa_con.loc[i, 'min_dcor']]

    # empty conditional set
    if to_test_cov['condition'].values[0] == 0:
        dict_cov_goa_con.loc[i, 'min_dcov_test'] = dcov_test(np.array(continents_prep_boot_g[continent].loc[to_test_cov['pair_0'].values[0]].to_list()), np.array(continents_prep_boot_g[continent].loc[to_test_cov['pair_1'].values[0]].to_list()), permutation=1000, alpha=alpha)[0]
        dict_cor_goa_con.loc[i, 'min_dcor_test'] = dcov_test(np.array(continents_prep_boot_g[continent].loc[to_test_cor['pair_0'].values[0]].to_list()), np.array(continents_prep_boot_g[continent].loc[to_test_cor['pair_1'].values[0]].to_list()), permutation=1000, alpha=alpha)[1]

    # non-empty conditional set
    else:
        dict_cov_goa_con.loc[i, 'min_dcov_test'] = pdcov_test(np.array(continents_prep_boot_g[continent].loc[to_test_cov['pair_0'].values[0]].to_list()), np.array(continents_prep_boot_g[continent].loc[to_test_cov['pair_1'].values[0]].to_list()), np.array(condition_test['combined'].to_list()), permutation=1000, alpha=alpha)[0]
        dict_cor_goa_con.loc[i, 'min_dcor_test'] = pdcov_test(np.array(continents_prep_boot_g[continent].loc[to_test_cor['pair_0'].values[0]].to_list()), np.array(continents_prep_boot_g[continent].loc[to_test_cor['pair_1'].values[0]].to_list()), np.array(condition_test['combined'].to_list()), permutation=1000, alpha=alpha)[1]

    dict_cov_goals_continents_2_tests[continent] = dict_cov_goa_con
    dict_cor_goals_continents_2_tests[continent] = dict_cor_goa_con

In [None]:
# save
with open('distance_cor/goals/dict_cov_goals_continents_2_tests.pkl', 'wb') as g_cov:
    pickle.dump(dict_cov_goals_continents_2_tests, g_cov)

with open('distance_cor/goals/dict_cor_goals_continents_2_tests.pkl', 'wb') as g_cor:
    pickle.dump(dict_cor_goals_continents_2_tests, g_cor)

For each pair of goals, we keep the minimum distance correlation over all variants computed for it, i.e. the pairwise one and those conditioned on any potential subset of the remaining goals.

The last step is to write these values into the corresponding cells of a goal-by-goal matrix.
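
As a minimal sketch of these two steps in plain Python: the dictionary `squared_dcor_demo`, its values and the nested-dict matrix are hypothetical stand-ins for the data frames used in this notebook.

```python
import math

# hypothetical squared (partial) distance correlations:
# (goal_0, goal_1) -> one value per conditioning set, the first one being the empty set
squared_dcor_demo = {
    (1, 2): [0.64, 0.25, 0.36],
    (1, 3): [0.09, 0.16],
    (2, 3): [0.81, 0.49, 0.04],
}

goals_demo = [1, 2, 3]
# goal-by-goal matrix stored as nested dicts: matrix_demo[row][column]
matrix_demo = {row: {col: None for col in goals_demo} for row in goals_demo}

for (goal_0, goal_1), values in squared_dcor_demo.items():
    # keep the weakest dependence found across all conditioning sets,
    # then undo the squaring
    matrix_demo[goal_1][goal_0] = math.sqrt(min(values))

for row in goals_demo:
    print(row, matrix_demo[row])
```

Only the lower triangle is filled, since the distance correlation is symmetric in its two arguments.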

In [None]:
for continent in continents:
    for i, row in dict_cov_goals_continents_2[continent].iterrows():
        dict_cov_goals_continents_2[continent].loc[i, 'max_dcov'] = max(dict_cov_goals_continents_2[continent].loc[i, 'list_dcov'])
        dict_cor_goals_continents_2[continent].loc[i, 'max_dcor'] = max(dict_cor_goals_continents_2[continent].loc[i, 'list_dcor'])
        dict_cov_goals_continents_2[continent].loc[i, 'avg_dcov'] = np.mean(dict_cov_goals_continents_2[continent].loc[i, 'list_dcov'])
        dict_cor_goals_continents_2[continent].loc[i, 'avg_dcor'] = np.mean(dict_cor_goals_continents_2[continent].loc[i, 'list_dcor'])

In [None]:
cov_goals_continents_2 = {}
cor_goals_continents_2 = {}

for continent in continents:
    cov_goals_continents_2[continent] = pd.DataFrame(index=goals, columns=goals)
    cor_goals_continents_2[continent] = pd.DataFrame(index=goals, columns=goals)

    for i in tqdm(list(dict_cov_goals_continents_2[continent].index)):
        goal_0 = dict_cov_goals_continents_2[continent].loc[i, 'pair_0']
        goal_1 = dict_cov_goals_continents_2[continent].loc[i, 'pair_1']
        
        cov_goals_continents_2[continent].loc[goal_1, goal_0] = np.sqrt(dict_cov_goals_continents_2[continent].loc[i, 'min_dcov_test'])
        cor_goals_continents_2[continent].loc[goal_1, goal_0] = np.sqrt(dict_cor_goals_continents_2[continent].loc[i, 'min_dcor_test'])

In [None]:
# check
cor_goals_continents_2['Northern Africa']

`cor_goals_continents_2` holds the conditional distance correlations for all continents in a setting of 18 random vectors $X$, $Y$, and $Z_1, Z_2, \dots, Z_{16}$, where $\boldsymbol{Z}$ is the array containing all random vectors we want to condition on.
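
To get a feeling for the size of this setting, the conditioning sets for one pair can be enumerated as below. This is only an illustration with hypothetical names (`conditioning_sets`, `variables_demo`); it does not claim to reproduce how the conditions were generated earlier in this notebook.

```python
from itertools import chain, combinations


def conditioning_sets(variables, pair):
    """All subsets (including the empty one) of the variables outside `pair`."""
    rest = [v for v in variables if v not in pair]
    return list(chain.from_iterable(
        combinations(rest, k) for k in range(len(rest) + 1)))


# small hypothetical setting: X, Y and three candidate conditioning vectors
variables_demo = ['X', 'Y', 'Z1', 'Z2', 'Z3']
sets_demo = conditioning_sets(variables_demo, ('X', 'Y'))
print(len(sets_demo))  # 8 subsets, i.e. 2**3

# with 18 random vectors, each pair leaves 16 candidates
print(2 ** 16)  # 65536 possible conditioning sets per pair
```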

In [None]:
# save

# makedirs also creates the parent directory 'distance_cor' if it is missing
os.makedirs('distance_cor/goals', exist_ok=True)
"""
for continent in continents:
    cov_goals_continents_2[continent].to_csv(r'distance_cor/goals/{}_dcov_goals.csv'.format(continent))
    cor_goals_continents_2[continent].to_csv(r'distance_cor/goals/{}_dcor_goals.csv'.format(continent))
"""
with open('distance_cor/goals/dcov_goals_continents_2.pkl', 'wb') as g_cov:
    pickle.dump(cov_goals_continents_2, g_cov)

with open('distance_cor/goals/dcor_goals_continents_2.pkl', 'wb') as g_cor:
    pickle.dump(cor_goals_continents_2, g_cor)

## Visualisation on goal-level
In addition to the numeric matrices, we also visualise these matrices as heatmaps and plot the correlations as networks.

In [None]:
# continents
for continent in continents:
    # generate a mask for the upper triangle
    # (np.bool was removed from recent NumPy versions, the builtin bool is equivalent)
    mask = np.zeros_like(cor_goals_continents_2[continent].fillna(0), dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # set up the matplotlib figure
    f, ax = plt.subplots(figsize=(25, 22))

    # generate a sequential colormap
    cmap = sns.color_palette("Reds", 100)

    # draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(cor_goals_continents_2[continent].fillna(0), mask=mask, cmap=cmap, vmax=1, center=0.5, vmin=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .8})
    
    plt.title('{}'.format(continent), fontdict={'fontsize': 52})
    plt.savefig('distance_cor/goals/{}_cor_goals.png'.format(continent))

In [None]:
# data preparation for networkX
dcor_dict_g = {}

for continent in cor_goals_continents_2.keys():
    dcor_dict_g[continent] = {}

    for goalcombination in g_combinations:
        dcor_dict_g[continent][goalcombination] = cor_goals_continents_2[continent].loc[goalcombination[1], goalcombination[0]]

In [None]:
for continent in cor_goals_continents_2.keys():
    # iterate over a copy of the keys, since entries are renamed inside the loop
    for key in list(dcor_dict_g[continent].keys()):
        if key[1] == 'T':
            dcor_dict_g[continent][tuple((key[0], '18'))] = dcor_dict_g[continent].pop(tuple((key[0], 'T')))
        elif key[0] == 'T':
            dcor_dict_g[continent][tuple(('18', key[1]))] = dcor_dict_g[continent].pop(tuple(('T', key[1])))
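
An alternative to renaming keys in place is to build a new dictionary, which avoids changing a dictionary while looping over it. A minimal sketch, where `relabel` and `weights_demo` are hypothetical names and values:

```python
def relabel(pair, old='T', new='18'):
    """Replace the label `old` by `new` in a pair of goal labels."""
    return tuple(new if label == old else label for label in pair)


# hypothetical edge weights keyed by pairs of goal labels
weights_demo = {('1', '2'): 0.4, ('1', 'T'): 0.7, ('T', '3'): 0.2}
relabelled_demo = {relabel(pair): w for pair, w in weights_demo.items()}
print(relabelled_demo)
# {('1', '2'): 0.4, ('1', '18'): 0.7, ('18', '3'): 0.2}
```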

In [None]:
# plotting networks with weighted edges

layout = 'circular'

centrality_C = {}     # dictionary to save centralities
degree_C = {}    # dictionary to save degrees
density_C = {}    # dictionary to save weighted densities
p_C = {}    # auxiliary
partition_C = {}    # dictionary to save clusters

for continent in cor_goals_continents_2.keys():
    G_C = nx.Graph()

    for key, value in dcor_dict_g[continent].items():
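        # the palette has 100 colours (indices 0..99), so a weight that rounds to 1.00 is mapped to the last one
        G_C.add_edge(int(key[0]), int(key[1]), weight=value, color=sns.color_palette("Reds", 100)[min(int(np.around(value*100)), 99)], alpha=value)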
        
    if layout == 'circular':
        pos = nx.circular_layout(G_C)
    elif layout == 'spring':
        pos = nx.spring_layout(G_C)
    
    plt.figure(figsize=(24,16))

    # nodes
    nx.draw_networkx_nodes(G_C, pos, node_size=1000)

    # labels
    nx.draw_networkx_labels(G_C, pos, font_size=46, font_family='sans-serif')

    nodes = G_C.nodes()
    edges = G_C.edges()
    colors = [G_C[u][v]['color'] for u,v in edges]
    weights = [G_C[u][v]['weight'] for u,v in edges]

    nx.draw_networkx(G_C, pos, with_labels=False, edgelist=edges, edge_color=colors, node_color='white', node_size=1000, width=np.multiply(weights,20))

    ax=plt.gca()
    fig=plt.gcf()
    trans = ax.transData.transform
    trans_axes = fig.transFigure.inverted().transform
    imsize = 0.08    # this is the image size
    plt.title('{}'.format(continent), fontdict={'fontsize': 52})

    for node in G_C.nodes():
        (x,y) = pos[node]   
        xx,yy = trans((x,y)) # data -> display (pixel) coordinates
        xa,ya = trans_axes((xx,yy)) # display -> figure-fraction coordinates
        a = plt.axes([xa-imsize/2.0,ya-imsize/2.0, imsize, imsize])
        a.imshow(mpimg.imread('utils/images/E_SDG goals_icons-individual-rgb-{}.png'.format(node)))
        a.axis('off')


    plt.axis('off')
    ax.axis('off')
    
    plt.savefig('distance_cor/goals/{}_{}_network_logos.png'.format(continent, layout), format='png')

    plt.show()
    
    # weighted centrality
    centr = nx.eigenvector_centrality(G_C, weight='weight')
    centrality_C[continent] = sorted((v, '{:0.2f}'.format(c)) for v, c in centr.items())
    
    degree_C[continent] = dict(G_C.degree(weight='weight'))
    
    # weighted density
    density_C[continent] = 2 * np.sum(weights) / (len(nodes) * (len(nodes) - 1))
    
    # weighted clustering with Louvain algorithm
    part_C = {}
    modularity_C = {}
    for i in range(100):
        part_C[i] = community.best_partition(G_C, random_state=i)
        modularity_C[i] = community.modularity(part_C[i], G_C)
    
    p_C[continent] = part_C[max(modularity_C, key=modularity_C.get)]

    # having lists with nodes being in different clusters
    partition_C[continent] = {}
    for com in set(p_C[continent].values()) :
        partition_C[continent][com] = [nodes for nodes in p_C[continent].keys() if p_C[continent][nodes] == com]

In [None]:
# clusters
for continent in continents:
    print(continent)
    print(partition_C[continent])
    print('-------------------------')

In [None]:
# centralities
for continent in continents:
    print(continent)
    print(centrality_C[continent])
    print('-------------------------')

In [None]:
# degrees
for continent in continents:
    print(continent)
    print(degree_C[continent])
    print('-------------------------')

In [None]:
# densities
for continent in continents:
    print(continent)
    print(density_C[continent])
    print('-------------------------')
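
The weighted degree and the weighted density printed above follow directly from the edge weights. A minimal sketch with hypothetical weights for three goals:

```python
# hypothetical distance correlations between three goals
edges_demo = {(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.75}
nodes_demo = sorted({node for pair in edges_demo for node in pair})

# weighted degree: sum of the weights of all edges attached to a node
weighted_degree_demo = {node: 0.0 for node in nodes_demo}
for (u, v), w in edges_demo.items():
    weighted_degree_demo[u] += w
    weighted_degree_demo[v] += w

# weighted density: total edge weight relative to the n*(n-1)/2 possible edges
n_demo = len(nodes_demo)
weighted_density_demo = 2 * sum(edges_demo.values()) / (n_demo * (n_demo - 1))

print(weighted_degree_demo)   # {1: 0.75, 2: 1.25, 3: 1.0}
print(weighted_density_demo)  # 0.5
```

Since all weights lie between 0 and 1, the weighted density can be read as the average distance correlation over all pairs of goals.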

### Eigenvector visualisation

In [None]:
def get_image(goal):
    return OffsetImage(plt.imread('utils/images/E_SDG goals_icons-individual-rgb-{}.png'.format(goal)), zoom=0.04)

In [None]:
for continent in cor_goals_continents_2.keys():
    # separating goals from their centralities
    x = []
    y = []
    for cent in centrality_C[continent]:
        x.append(cent[0])
        y.append(float(cent[1]))

    fig, ax = plt.subplots(figsize=(18,12))
    plt.title('{}'.format(continent), fontdict={'fontsize': 52})
    ax.scatter(x, y) 
    
    # adding images; x holds the goal numbers in the same (sorted) order as y,
    # whereas the node order of the last drawn graph may differ
    for x0, y0 in zip(x, y):
        ab = AnnotationBbox(get_image(x0), (x0, y0), frameon=False)
        ax.add_artist(ab)

    ax.set_xticks([])
    ax.yaxis.grid()
    ax.set_ylim(0, 0.6)
    ax.set_ylabel('eigenvector centrality')
    ax.set_xlabel('SDGs')
    
    plt.savefig('distance_cor/goals/{}_eigenvector_centrality.png'.format(continent), format='png')
    
    plt.show()
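
For intuition about the values plotted here, eigenvector centrality can be approximated with a few lines of power iteration. This is a minimal sketch under our own assumptions (iteration on $A + I$, Euclidean normalisation), not the NetworkX implementation; the function name and the weights are hypothetical.

```python
import math


def eigenvector_centrality_sketch(weights, n_iter=1000, tol=1e-10):
    """Leading eigenvector of the weighted adjacency matrix by power iteration."""
    nodes = sorted({node for pair in weights for node in pair})
    neighbours = {node: {} for node in nodes}
    for (u, v), w in weights.items():
        neighbours[u][v] = w
        neighbours[v][u] = w
    x = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(n_iter):
        # one step on A + I; the shift avoids oscillation on bipartite graphs
        x_new = {node: x[node] + sum(w * x[m] for m, w in neighbours[node].items())
                 for node in nodes}
        norm = math.sqrt(sum(value * value for value in x_new.values()))
        x_new = {node: value / norm for node, value in x_new.items()}
        if sum(abs(x_new[node] - x[node]) for node in nodes) < tol:
            return x_new
        x = x_new
    raise RuntimeError('power iteration did not converge')


# hypothetical weights: goal 1 is strongly tied to both other goals
centr_demo = eigenvector_centrality_sketch({(1, 2): 0.9, (1, 3): 0.8, (2, 3): 0.1})
print({node: round(value, 2) for node, value in centr_demo.items()})
```

A goal scores high when it is strongly connected to goals that are themselves strongly connected, so goal 1 ends up with the largest value in this toy example.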

### Cluster visualisation

In [None]:
# plotting clusters in networks with weighted edges

from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

layout = 'spring'

for continent in cor_goals_continents_2.keys():
    G_C = nx.Graph()

    for key, value in dcor_dict_g[continent].items():
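        # the palette has 100 colours (indices 0..99), so a weight that rounds to 1.00 is mapped to the last one
        G_C.add_edge(int(key[0]), int(key[1]), weight=value, color=sns.color_palette("Reds", 100)[min(int(np.around(value*100)), 99)], alpha=value)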
        
    if layout == 'circular':
        pos = nx.circular_layout(G_C)
    elif layout == 'spring':
        pos = nx.spring_layout(G_C, iterations=100, seed=42)
    
    plt.figure(figsize=(24,16))

    # nodes
    nx.draw_networkx_nodes(G_C, pos, node_size=1000)

    # labels
    nx.draw_networkx_labels(G_C, pos, font_size=46, font_family='sans-serif')

    nodes = G_C.nodes()
    edges = G_C.edges()
    colors = [G_C[u][v]['color'] for u,v in edges]
    weights = [G_C[u][v]['weight'] for u,v in edges]

    nx.draw_networkx(G_C, pos, with_labels=False, edgelist=edges, edge_color=colors, node_color='white', node_size=1000, width=np.multiply(weights,20))

    ax=plt.gca()
    fig=plt.gcf()
    trans = ax.transData.transform
    trans_axes = fig.transFigure.inverted().transform
    imsize = 0.08    # this is the image size
    plt.title('{}'.format(continent), fontdict={'fontsize': 52})

    for node in G_C.nodes():
        x,y = pos[node]   
        xx,yy = trans((x,y)) # data -> display (pixel) coordinates
        xa,ya = trans_axes((xx,yy)) # display -> figure-fraction coordinates
        a = plt.axes([xa-imsize/2.0,ya-imsize/2.0, imsize, imsize])
        a.imshow(mpimg.imread('utils/images/E_SDG goals_icons-individual-rgb-{}.png'.format(node)))
        a.axis('off')
    
    # clusters as patches
    patches = []
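    # use a separate name for the cluster members so that the global list `goals` is not overwritten
    for com, cluster_goals in partition_C[continent].items():
        position = []
        for goal in cluster_goals:
            x,y = pos[goal]
            position.append((x,y))
        
        positions = []
        for i in range(6000):
            np.random.shuffle(position)
            positions.extend(position)
        
        # polygons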
        polygon = Polygon(positions, closed=False)
        patches.append(polygon)
    
    np.random.seed(72)
    colors = 100*np.random.rand(len(patches))
    p = PatchCollection(patches, alpha=0.4)
    p.set_array(np.array(colors))
    ax.add_collection(p)
        
    plt.axis('off')
    ax.axis('off')
    
    plt.savefig('distance_cor/goals/{}_{}_network_logos_patches.png'.format(continent, layout), format='png')

    plt.show()
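
The cluster patches above are built by shuffling the node positions many times. A deterministic alternative, which is a different technique from the one used in this notebook, is to outline each cluster by its convex hull. The sketch below uses Andrew's monotone chain algorithm; the function names and the points are hypothetical.

```python
def cross(o, a, b):
    """z-component of the cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts  # a cluster with one or two goals has no area
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other one
    return lower[:-1] + upper[:-1]


# hypothetical node positions of one cluster: four corners and one interior node
cluster_demo = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
print(convex_hull(cluster_demo))  # [(0, 0), (1, 0), (1, 1), (0, 1)]
```

The resulting vertex list could be passed to a closed polygon patch in place of the shuffled positions; the shaded area would then no longer depend on the random state.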

# Groups

TODO: align the code in this section with the changed code for continents.

In [None]:
# CHECKPOINT
with open('distance_cor/goals/dict_cov_goals_groups_2.pkl', 'rb') as g_cov:
    dict_cov_goals_groups_2 = pickle.load(g_cov)
with open('distance_cor/goals/dict_cor_goals_groups_2.pkl', 'rb') as g_cor:
    dict_cor_goals_groups_2 = pickle.load(g_cor)

In [None]:
# data preparation
groups_prep_g = {}
groups_prep_boot_g = {}

for group in groups:
    print(group)
    
    groups_prep_g[group] = pd.DataFrame(columns=period, index=goals)
    groups_prep_boot_g[group] = pd.DataFrame(columns=period, index=goals)
    
    for goal in goals:
        max_year_g = 0
        
        for year in period:
            y_g = []

            for country in groups[group].dropna():
                y_g.extend(goals_values_i[country].loc[str(goal), year])
            
            groups_prep_g[group].loc[goal, year] = y_g
    
            # finding year with most measurements
            len_year_g = len(y_g)
            if len_year_g > max_year_g:
                max_year_g = len_year_g
                
        # bootstrap to have same number of samples in each year
        for year in period:
            # not considering the years without samples
            if len(groups_prep_g[group].loc[goal, year]) == 0:
                continue    
            else:
                groups_prep_boot_g[group].loc[goal, year] = np.random.choice(groups_prep_g[group].loc[goal, year], size=max_year_g, replace=True).tolist()
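
The resampling step can be illustrated in isolation. This is a minimal sketch with the standard library instead of NumPy; `equalise_sample_sizes` and the toy measurements are hypothetical, and years without samples are simply dropped here.

```python
import random


def equalise_sample_sizes(samples_by_year, seed=0):
    """Resample every non-empty year with replacement to the largest sample size."""
    rng = random.Random(seed)
    target = max(len(values) for values in samples_by_year.values())
    return {year: rng.choices(values, k=target)
            for year, values in samples_by_year.items() if values}


# hypothetical measurements of one goal: four in 2000, two in 2001, none in 2002
samples_demo = {2000: [1.0, 2.0, 3.0, 4.0], 2001: [7.0, 8.0], 2002: []}
balanced_demo = equalise_sample_sizes(samples_demo)
print({year: len(values) for year, values in balanced_demo.items()})  # {2000: 4, 2001: 4}
```

Drawing with replacement only repeats existing measurements, so each year keeps its empirical distribution while all years end up with the same number of samples.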

In [None]:
# groups
# have conditions in joint random vector
dict_cov_goals_groups_2 = {}
dict_cor_goals_groups_2 = {}

for group in groups:
    print(group)
    
    dict_cov_goa_g = pairs_g0.copy()
    dict_cor_goa_g = pairs_g0.copy()
    
    for i, row in pairs_g0.iterrows():
        #print(row)
        
        if row['condition'] == 0:            
            # pairwise distance correlation
            dict_cov_goa_g.loc[i, 'dcov'] = dcor.distance_covariance(np.array(groups_prep_boot_g[group].loc[row['pair_0']].to_list()), np.array(groups_prep_boot_g[group].loc[row['pair_1']].to_list()))**2
            dict_cor_goa_g.loc[i, 'dcor'] = dcor.distance_correlation(np.array(groups_prep_boot_g[group].loc[row['pair_0']].to_list()), np.array(groups_prep_boot_g[group].loc[row['pair_1']].to_list()))**2
        
        else:
            # partial distance correlation
            condition = pd.DataFrame(index=period, columns=row['condition'] + ['combined'])
            for y in period:
                condition.loc[y, 'combined'] = []
            for c in row['condition']:
                condition[c] = groups_prep_boot_g[group].loc[c]
                for y in period:
                    condition.loc[y, 'combined'].extend(condition.loc[y, c])
                    
            # square partial distance correlation to range [0, 1]    
            dict_cov_goa_g.loc[i, 'dcov'] = dcor.partial_distance_covariance(np.array(groups_prep_boot_g[group].loc[row['pair_0']].to_list()), np.array(groups_prep_boot_g[group].loc[row['pair_1']].to_list()), np.array(condition['combined'].to_list()))**2
            dict_cor_goa_g.loc[i, 'dcor'] = dcor.partial_distance_correlation(np.array(groups_prep_boot_g[group].loc[row['pair_0']].to_list()), np.array(groups_prep_boot_g[group].loc[row['pair_1']].to_list()), np.array(condition['combined'].to_list()))**2
    
    # find minimum distance correlation between any two goals
    dict_cov_goa_gr = dict_cov_goa_g.groupby(['pair_0', 'pair_1'])['dcov'].apply(list).reset_index(name='list_dcov')
    dict_cor_goa_gr = dict_cor_goa_g.groupby(['pair_0', 'pair_1'])['dcor'].apply(list).reset_index(name='list_dcor')
    
    for i, row_g in dict_cov_goa_gr.iterrows():
        dict_cov_goa_gr.loc[i, 'min_dcov'] = min(dict_cov_goa_gr.loc[i, 'list_dcov'])
        dict_cor_goa_gr.loc[i, 'min_dcor'] = min(dict_cor_goa_gr.loc[i, 'list_dcor'])
       
    dict_cov_goals_groups_2[group] = dict_cov_goa_gr
    dict_cor_goals_groups_2[group] = dict_cor_goa_gr

In [None]:
# save
with open('distance_cor/goals/dict_cov_goals_groups_2.pkl', 'wb') as g_cov:
    pickle.dump(dict_cov_goals_groups_2, g_cov)

with open('distance_cor/goals/dict_cor_goals_groups_2.pkl', 'wb') as g_cor:
    pickle.dump(dict_cor_goals_groups_2, g_cor)

In [None]:
for group in groups:
    for i, row in dict_cov_goals_groups_2[group].iterrows():
        dict_cov_goals_groups_2[group].loc[i, 'max_dcov'] = max(dict_cov_goals_groups_2[group].loc[i, 'list_dcov'])
        dict_cor_goals_groups_2[group].loc[i, 'max_dcor'] = max(dict_cor_goals_groups_2[group].loc[i, 'list_dcor'])
        dict_cov_goals_groups_2[group].loc[i, 'avg_dcov'] = np.mean(dict_cov_goals_groups_2[group].loc[i, 'list_dcov'])
        dict_cor_goals_groups_2[group].loc[i, 'avg_dcor'] = np.mean(dict_cor_goals_groups_2[group].loc[i, 'list_dcor'])

In [None]:
cov_goals_groups_2 = {}
cor_goals_groups_2 = {}

for group in groups:
    cov_goals_groups_2[group] = pd.DataFrame(index=goals, columns=goals)
    cor_goals_groups_2[group] = pd.DataFrame(index=goals, columns=goals)

    for i in list(dict_cov_goals_groups_2[group].index):
        goal_0 = dict_cov_goals_groups_2[group].loc[i, 'pair_0']
        goal_1 = dict_cov_goals_groups_2[group].loc[i, 'pair_1']
        
        cov_goals_groups_2[group].loc[goal_1, goal_0] = np.sqrt(dict_cov_goals_groups_2[group].loc[i, 'min_dcov'])
        cor_goals_groups_2[group].loc[goal_1, goal_0] = np.sqrt(dict_cor_goals_groups_2[group].loc[i, 'min_dcor'])

In [None]:
# save

# makedirs also creates the parent directory 'distance_cor' if it is missing
os.makedirs('distance_cor/goals', exist_ok=True)
"""
for group in groups:
    cov_goals_groups_2[group].to_csv(r'distance_cor/goals/{}_dcov_goals.csv'.format(group))
    cor_goals_groups_2[group].to_csv(r'distance_cor/goals/{}_dcor_goals.csv'.format(group))
"""    
with open('distance_cor/goals/dcov_goals_groups_2.pkl', 'wb') as g_cov:
    pickle.dump(cov_goals_groups_2, g_cov)

with open('distance_cor/goals/dcor_goals_groups_2.pkl', 'wb') as g_cor:
    pickle.dump(cor_goals_groups_2, g_cor)

## Visualisation on goal-level
In addition to the numeric matrices, we also visualise these matrices as heatmaps and plot the correlations as networks.

In [None]:
# groups
for group in groups:
    # generate a mask for the upper triangle
    # (np.bool was removed from recent NumPy versions, the builtin bool is equivalent)
    mask = np.zeros_like(cor_goals_groups_2[group].fillna(0), dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # set up the matplotlib figure
    f, ax = plt.subplots(figsize=(25, 22))

    # generate a sequential colormap
    cmap = sns.color_palette("Reds", 100)

    # draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(cor_goals_groups_2[group].fillna(0), mask=mask, cmap=cmap, vmax=1, center=0.5, vmin=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .8})
    
    plt.title('{}'.format(group), fontdict={'fontsize': 52})
    plt.savefig('distance_cor/goals/{}_cor_goals.png'.format(group))

In [None]:
# data preparation for networkX
dcor_dict_g = {}

for group in cor_goals_groups_2.keys():
    dcor_dict_g[group] = {}

    for goalcombination in g_combinations:
        dcor_dict_g[group][goalcombination] = cor_goals_groups_2[group].loc[goalcombination[1], goalcombination[0]]

In [None]:
for group in cor_goals_groups_2.keys():
    # iterate over a copy of the keys, since entries are renamed inside the loop
    for key in list(dcor_dict_g[group].keys()):
        if key[1] == 'T':
            dcor_dict_g[group][tuple((key[0], '18'))] = dcor_dict_g[group].pop(tuple((key[0], 'T')))
        elif key[0] == 'T':
            dcor_dict_g[group][tuple(('18', key[1]))] = dcor_dict_g[group].pop(tuple(('T', key[1])))

In [None]:
# plotting networks with weighted edges

layout = 'circular'

centrality_G = {}    # dictionary to save centralities
degree_G = {}     # dictionary to save degrees
density_G = {}    # dictionary to save weighted densities
p_G = {}    # auxiliary
partition_G = {}    # dictionary to save clusters

for group in cor_goals_groups_2.keys():
    G_G = nx.Graph()

    for key, value in dcor_dict_g[group].items():
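        # the palette has 100 colours (indices 0..99), so a weight that rounds to 1.00 is mapped to the last one
        G_G.add_edge(int(key[0]), int(key[1]), weight=value, color=sns.color_palette("Reds", 100)[min(int(np.around(100*value)), 99)], alpha=value)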
        
    if layout == 'circular':
        pos = nx.circular_layout(G_G)
    elif layout == 'spring':
        pos = nx.spring_layout(G_G)
    
    plt.figure(figsize=(24,16))

    # nodes
    nx.draw_networkx_nodes(G_G, pos, node_size=1000)

    # labels
    nx.draw_networkx_labels(G_G, pos, font_size=46, font_family='sans-serif')

    nodes = G_G.nodes()
    edges = G_G.edges()
    colors = [G_G[u][v]['color'] for u,v in edges]
    weights = [G_G[u][v]['weight'] for u,v in edges]

    nx.draw_networkx(G_G, pos, with_labels=False, edgelist=edges, edge_color=colors, node_color='white', node_size=1000, width=np.multiply(weights,20))

    ax=plt.gca()
    fig=plt.gcf()
    trans = ax.transData.transform
    trans_axes = fig.transFigure.inverted().transform
    imsize = 0.08    # this is the image size
    plt.title('{}'.format(group), fontdict={'fontsize': 52})

    for node in G_G.nodes():
        (x,y) = pos[node]   
        xx,yy = trans((x,y)) # data -> display (pixel) coordinates
        xa,ya = trans_axes((xx,yy)) # display -> figure-fraction coordinates
        a = plt.axes([xa-imsize/2.0,ya-imsize/2.0, imsize, imsize])
        a.imshow(mpimg.imread('utils/images/E_SDG goals_icons-individual-rgb-{}.png'.format(node)))
        a.axis('off')


    plt.axis('off')
    ax.axis('off')
    
    plt.savefig('distance_cor/goals/{}_{}_network_logos.png'.format(group, layout), format='png')

    plt.show()
    
    # centrality
    centr = nx.eigenvector_centrality(G_G, weight='weight')
    centrality_G[group] = sorted((v, '{:0.2f}'.format(c)) for v, c in centr.items())
    
    degree_G[group] = dict(G_G.degree(weight='weight'))
    
    # weighted density
    density_G[group] = 2 * np.sum(weights) / (len(nodes) * (len(nodes) - 1))
    
    # weighted clustering with Louvain algorithm
    p_G[group] = community.best_partition(G_G)

    # having lists with nodes being in different clusters
    partition_G[group] = {}
    for com in set(p_G[group].values()) :
        partition_G[group][com] = [nodes for nodes in p_G[group].keys() if p_G[group][nodes] == com]

In [None]:
# clusters
for group in groups:
    print(group)
    print(partition_G[group])
    print('-------------------------')

In [None]:
for group in groups:
    print(group)
    print(centrality_G[group])
    print('-------------------------')

In [None]:
for group in groups:
    print(group)
    print(degree_G[group])
    print('-------------------------')

In [None]:
for group in groups:
    print(group)
    print(density_G[group])
    print('-------------------------')

### Eigenvector visualisation

In [None]:
for group in cor_goals_groups_2.keys():
    # separating goals from their centralities
    x = []
    y = []
    for cent in centrality_G[group]:
        x.append(cent[0])
        y.append(float(cent[1]))

    fig, ax = plt.subplots(figsize=(18,12))
    plt.title('{}'.format(group), fontdict={'fontsize': 52})
    ax.scatter(x, y) 
    
    # adding images; x holds the goal numbers in the same (sorted) order as y,
    # whereas the node order of the last drawn graph may differ
    for x0, y0 in zip(x, y):
        ab = AnnotationBbox(get_image(x0), (x0, y0), frameon=False)
        ax.add_artist(ab)

    ax.set_xticks([])
    ax.yaxis.grid()
    ax.set_ylim(0, 0.6)
    ax.set_ylabel('eigenvector centrality')
    ax.set_xlabel('SDGs')
    
    plt.savefig('distance_cor/goals/{}_eigenvector_centrality.png'.format(group), format='png')
    
    plt.show()

### Cluster visualisation

In [None]:
# plotting clusters in networks with weighted edges

from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

layout = 'spring'

for group in cor_goals_groups_2.keys():
    G_G = nx.Graph()

    for key, value in dcor_dict_g[group].items():
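        # the palette has 100 colours (indices 0..99), so a weight that rounds to 1.00 is mapped to the last one
        G_G.add_edge(int(key[0]), int(key[1]), weight=value, color=sns.color_palette("Reds", 100)[min(int(np.around(100*value)), 99)], alpha=value)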
        
    if layout == 'circular':
        pos = nx.circular_layout(G_G)
    elif layout == 'spring':
        pos = nx.spring_layout(G_G, iterations=100, seed=42)
    
    plt.figure(figsize=(24,16))

    # nodes
    nx.draw_networkx_nodes(G_G, pos, node_size=1000)

    # labels
    nx.draw_networkx_labels(G_G, pos, font_size=46, font_family='sans-serif')

    nodes = G_G.nodes()
    edges = G_G.edges()
    colors = [G_G[u][v]['color'] for u,v in edges]
    weights = [G_G[u][v]['weight'] for u,v in edges]

    nx.draw_networkx(G_G, pos, with_labels=False, edgelist=edges, edge_color=colors, node_color='white', node_size=1000, width=np.multiply(weights,20))

    ax=plt.gca()
    fig=plt.gcf()
    trans = ax.transData.transform
    trans_axes = fig.transFigure.inverted().transform
    imsize = 0.08    # this is the image size
    plt.title('{}'.format(group), fontdict={'fontsize': 52})

    for node in G_G.nodes():
        (x,y) = pos[node]   
        xx,yy = trans((x,y)) # data -> display (pixel) coordinates
        xa,ya = trans_axes((xx,yy)) # display -> figure-fraction coordinates
        a = plt.axes([xa-imsize/2.0,ya-imsize/2.0, imsize, imsize])
        a.imshow(mpimg.imread('utils/images/E_SDG goals_icons-individual-rgb-{}.png'.format(node)))
        a.axis('off')
    
    # clusters as patches
    patches = []
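    # use a separate name for the cluster members so that the global list `goals` is not overwritten
    for com, cluster_goals in partition_G[group].items():
        position = []
        for goal in cluster_goals:
            x,y = pos[goal]
            position.append((x,y))
        
        positions = []
        for i in range(6000):
            np.random.shuffle(position)
            positions.extend(position)
        
        # polygons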
        polygon = Polygon(positions, closed=False)
        patches.append(polygon)
    
    np.random.seed(72)
    colors = 100*np.random.rand(len(patches))
    p = PatchCollection(patches, alpha=0.4)
    p.set_array(np.array(colors))
    ax.add_collection(p)

    plt.axis('off')
    ax.axis('off')
    
    plt.savefig('distance_cor/goals/{}_{}_network_logos_patches.png'.format(group, layout), format='png')

    plt.show()