This notebook runs a quick analysis of the node degrees of the graphs studied.

# Packages & Utils
The `order` and `transform_` functions standardize the degree distribution to a standard Pareto distribution so that the extreme nodes can be identified. The threshold is then taken as a percentile of the Pareto-standardized variable.

In [1]:
import numpy as np
import os 
rnd_seed = 42

np.random.seed(rnd_seed)

In [2]:
def order(X):
    """Return the order statistics of X, feature by feature
    (each column of X sorted in ascending order).
    """
    n, d = np.shape(X)
    R = np.sort(X, axis=0)
    return R


def transform_(R, x):
    """Transform each marginal to a standard Pareto scale.
    Parameters
    ----------
    R : numpy array with shape (n_R, d)
        Column-wise sorted training data, as returned by order().
        Used to transform new data.
    x : numpy array with shape (n, d)
        The data to transform (sample by sample, independently) according
        to the training transform.
    Returns
    -------
    a : numpy array with shape (n, d)
        Transformation of x, computed as 1 / (1 - rank / (n_R + 1)).
    """
    n, d = np.shape(x)
    n_R = np.shape(R)[0]
    a = np.zeros((n, d))
    for i in range(d):
        a[:, i] = np.searchsorted(R[:, i], x[:, i]) / float(n_R + 1)
    return 1. / (1-a)
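
As a quick illustration of what the two functions compute together, the sketch below applies the same rank-based transform to a toy degree vector. The helper name `pareto_standardize` and the toy values are assumptions made for this example only; the logic mirrors `order` followed by `transform_`.

```python
import numpy as np


def pareto_standardize(train, x):
    """Rank x against train, column by column, then map the ranks to a
    standard Pareto scale (hypothetical helper combining order + transform_)."""
    R = np.sort(train, axis=0)          # same as order(train)
    n_R = R.shape[0]
    a = np.empty(x.shape, dtype=float)
    for i in range(x.shape[1]):
        # Fraction of training values strictly below each entry of x
        a[:, i] = np.searchsorted(R[:, i], x[:, i]) / float(n_R + 1)
    return 1.0 / (1.0 - a)


# Toy degree vector (made up for illustration), shaped (n, 1)
degrees = np.array([1, 1, 2, 3, 10], dtype=float).reshape(-1, 1)
V = pareto_standardize(degrees, degrees)
print(V.ravel())  # approximately [1, 1, 1.5, 2, 3]
```

Tied degrees receive the same standardized value, because `np.searchsorted` returns the leftmost insertion index by default.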

# Analysis

In [3]:
dataset_dict = dict()

In [4]:
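percentile = 90
for file_name in sorted(os.listdir("./degrees/")):
    table = np.loadtxt(os.path.join("./degrees/", file_name), delimiter=",")
    # Column 1 of each file is used as the degree values.
    degrees = table[:, 1].reshape(-1, 1)
    # Standardize the degrees against their own empirical distribution.
    V = transform_(order(degrees), degrees)
    dataset_dict[file_name] = np.percentile(V, q=percentile)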

In [5]:
dataset_dict

{'deg_BlogCatalog.txt': 9.4103225806451576,
 'deg_CA-AstroPh.txt': 9.2915384615384653,
 'deg_CA-HepPh.txt': 9.415806451612907,
 'deg_Flickr.txt': 9.5,
 'deg_Wiki-Vote.txt': 9.1800586510264015,
 'deg_cora.txt': 6.8399999999999981,
 'deg_mag.txt': 9.8093505039193811}

The dictionary above gives, for each dataset, a threshold equal to the 90th percentile of the Pareto-standardized degrees. A new node $n^{new}$ can be considered extreme if its Pareto-standardized degree exceeds the threshold of the corresponding dataset.
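
The check described above could be sketched as follows. This is a minimal illustration, not the notebook's own code: the function `is_extreme` and the toy degree list are assumptions, and the new node's degree is standardized against the training degrees in the same way as in `transform_`.

```python
import numpy as np


def is_extreme(train_degrees, new_degree, percentile=90):
    """Return True if new_degree, once Pareto-standardized against
    train_degrees, exceeds the percentile threshold (hypothetical helper)."""
    R = np.sort(np.asarray(train_degrees, dtype=float))
    n_R = R.shape[0]

    def to_pareto(x):
        # Same rank-based transform as transform_, for a single feature
        a = np.searchsorted(R, x) / float(n_R + 1)
        return 1.0 / (1.0 - a)

    # Threshold: percentile of the standardized training degrees
    threshold = np.percentile(to_pareto(R), q=percentile)
    return bool(to_pareto(new_degree) > threshold)


# Toy training degrees 1..100 (made up for illustration)
toy_degrees = np.arange(1, 101)
print(is_extreme(toy_degrees, 150))  # degree above every training degree
print(is_extreme(toy_degrees, 50))   # degree near the median
```

In practice the threshold would be read from `dataset_dict` rather than recomputed on every call; it is recomputed here only to keep the sketch self-contained.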