## Edge Weight Shuffle

Adam proposed a null model: shuffle the edge weights that we filter on. This tests whether the observed persistence diagram is a result of the shape of the network alone or of the specific combination of shape and filtration parameter. We implement that here.

#### Preliminaries

In [None]:
# load some packages
import Gavin.utils.random_complexes as rc
import Gavin.utils.make_network as mn
from time import time
import oatpy as oat

# config
DATA_PATH = 'datasets/concept_network/'
CONCEPT_FILE = 'articles_category_for_2l_abstracts_concepts_processed_v1_EX_102.csv.gz' # Applied Mathematics
# CONCEPT_FILE = 'concepts_Applied Economics_1402.csv.gz' # Applied Econ
# CONCEPT_FILE = 'concepts_Zoology_608.csv' # Zoology
MIN_RELEVANCE = 0.7
MIN_FREQ = 0.00006 # 0.006%
MAX_FREQ = 0.0005 # 0.05%
MIN_YEAR = 1920

#### Original Network
Use the data file to create the original network and calculate homology.

In [None]:
# create the network
G_orig = mn.gen_concept_network(
        DATA_PATH + CONCEPT_FILE,
        min_relevance=MIN_RELEVANCE,
        min_year=MIN_YEAR,
        min_articles=MIN_FREQ,
        max_articles=MAX_FREQ,
        normalize_year=True
    )
adj_orig = mn.adj_matrix(G_orig, 'norm_year', True, 0.) # fill in the diagonal with 0s since the shuffled version likely won't work without that

In [None]:
# homology calculation
start = time()

# setup the problem
factored_orig = oat.rust.FactoredBoundaryMatrixVr( # two functions that do this, idk what the other one is
        dissimilarity_matrix=adj_orig,
        homology_dimension_max=2
    )

# solve homology
homology_orig = factored_orig.homology(
        return_cycle_representatives=True, # both flags must be True to make a barcode; this makes the problem take ~30% longer (1:30ish)
        return_bounding_chains=True
    )

f'Homology calculation took {time() - start} secs'

In [None]:
# persistence diagram
fig = oat.plot.pd(homology_orig)
fig.update_layout(
        width=600, 
        height=500,
        margin=dict(l=20, r=20, t=20, b=20)
    )
fig.show()

In [None]:
# Barcode diagram
fig = oat.plot.barcode(homology_orig)
fig.update_layout(
        width=1000, 
        height=500,
        margin=dict(l=20, r=20, t=20, b=20)
    )
fig.show()

#### Shuffled Network
Shuffle the edge weights and, again, calculate homology.

In [None]:
# create the network
G_shuffled = rc.shuffle_edge_weights(G_orig, seed=10)
adj_shuffled = mn.adj_matrix(G_shuffled, 'norm_year', True, 0.)

assert adj_orig.shape == adj_shuffled.shape
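
The body of `rc.shuffle_edge_weights` is not shown in this notebook. As a rough illustration of the null model described above, here is a minimal sketch that assumes the shuffle is a plain permutation of weights over a fixed edge set. The name `shuffle_weights_sketch` and the dict-of-edges representation are hypothetical and are not the project's implementation.

```python
import random


def shuffle_weights_sketch(edge_weights, seed=None):
    """Permute weights among a fixed set of edges (hypothetical helper).

    edge_weights maps an edge, e.g. a (u, v) tuple, to its weight.
    The edge set is unchanged; only the edge-to-weight assignment changes.
    """
    rng = random.Random(seed)          # local RNG so the shuffle is reproducible
    edges = list(edge_weights)
    weights = [edge_weights[e] for e in edges]
    rng.shuffle(weights)               # in-place permutation of the weights
    return dict(zip(edges, weights))


# toy data, made up for illustration only
toy = {("a", "b"): 0.1, ("b", "c"): 0.4, ("a", "c"): 0.7, ("c", "d"): 0.9}
toy_shuffled = shuffle_weights_sketch(toy, seed=10)

# the shape (edge set) and the multiset of weights are both preserved
assert set(toy_shuffled) == set(toy)
assert sorted(toy_shuffled.values()) == sorted(toy.values())
```

Under this assumption the shuffled network keeps the same shape and the same distribution of filtration values, so any difference in the persistence diagram comes from which edge carries which weight.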

In [None]:
# homology calculation
start = time()

# setup the problem
factored_shuffled = oat.rust.FactoredBoundaryMatrixVr( # two functions that do this, idk what the other one is
        dissimilarity_matrix=adj_shuffled,
        homology_dimension_max=2
    )

# solve homology
homology_shuffled = factored_shuffled.homology(
        return_cycle_representatives=True, # both flags must be True to make a barcode; this makes the problem take ~30% longer (1:30ish)
        return_bounding_chains=True
    )

f'Homology calculation took {time() - start} secs'

In [None]:
# persistence diagram
fig = oat.plot.pd(homology_shuffled)
fig.update_layout(
        width=600, 
        height=500,
        margin=dict(l=20, r=20, t=20, b=20)
    )
fig.show()

In [None]:
# Barcode diagram
fig = oat.plot.barcode(homology_shuffled)
fig.update_layout(
        width=1000, 
        height=500,
        margin=dict(l=20, r=20, t=20, b=20)
    )
fig.show()

#### Results
Our null model has a very different structure than our real network. Most obviously, it has substantially more features. The original network has 56,237 total features (9,553 dim 0; 36,112 dim 1; 10,572 dim 2) while the shuffled network has 88,068 total features (9,553 dim 0; 63,171 dim 1; 15,344 dim 2). That's about 57% more features overall, including roughly 75% more dimension 1 features and 45% more dimension 2 features. These features tend to be born later, especially at higher dimensions (the earliest dim 1 feature in the shuffled network is born 0.1 later than the earliest dim 1 feature in the original network), so this pattern might stop as we look at higher-dimensional features. A higher percentage of the features die in the shuffled network, although this is likely because the same features should exist at the end of both networks.
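
The percentages follow directly from the feature counts quoted above. A small sketch of the arithmetic, with the counts copied by hand from this paragraph rather than read from `homology_orig` and `homology_shuffled` (the helper `pct_increase` is just for illustration):

```python
# feature counts per dimension, copied from the paragraph above
orig_counts = {0: 9553, 1: 36112, 2: 10572}
shuffled_counts = {0: 9553, 1: 63171, 2: 15344}


def pct_increase(before, after):
    """Percent change from before to after."""
    return 100.0 * (after - before) / before


total_orig = sum(orig_counts.values())
total_shuffled = sum(shuffled_counts.values())
print(f"total: {total_orig} -> {total_shuffled} "
      f"({pct_increase(total_orig, total_shuffled):+.1f}%)")

for dim in sorted(orig_counts):
    change = pct_increase(orig_counts[dim], shuffled_counts[dim])
    print(f"dim {dim}: {orig_counts[dim]} -> {shuffled_counts[dim]} ({change:+.1f}%)")
```

The dimension 0 count is identical in both networks, so the whole increase comes from dimensions 1 and 2.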

In [None]:
homology_orig[['dimension', 'birth', 'death']].groupby('dimension').describe()

In [None]:
homology_shuffled[['dimension', 'birth', 'death']].groupby('dimension').describe()