# Importing libs

In [13]:
import sys, os
import numpy as np
import pandas as pd
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(''), os.path.pardir))) # if you run this notebook from its current folder on our GitHub, you may need this line
from utils.genlink import DataProcessor # core class, all graphs are processed there
from utils.genlink import Trainer # you can train your nets using our algorithms with this class
from utils.genlink import NullSimulator # you can simulate graphs with that class
from utils.models import GL_TAGConv_3l_512h_w_k3_gnorm # one of our best GNNs; see utils.models for all available classes, which are all used in the same way

In [2]:
# let's also fix our seed globally
seed = 42

Disclaimer: the current GENLINK repository has few internal asserts, so please follow the instructions closely; this code is not designed to handle inputs it was not written for :)

# Graph simulation and visualization

Let's simulate a very simple graph with 4 populations. Initially, you need to pass 3 arguments to the `NullSimulator` class:

* `num_classes` - how many populations you want to simulate (int)
* `edge_probs` - edge probability between every pair of populations; should be a matrix of shape `(num_classes, num_classes)`, each value must be less than 1.0, there are no additional requirements
* `mean_weight` - the $\lambda$ parameter (mean) of the exponential distribution from which edge weights are sampled; should be a matrix of shape `(num_classes, num_classes)`

Remember that `edge_probs[i, j]` corresponds to `mean_weight[i, j]`, and both matrices must be symmetric.
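To make the roles of the two matrices concrete, here is a minimal sketch of how one block of such a null model could be sampled. It is not the `NullSimulator` implementation: the helper `sample_block` is hypothetical, and it assumes that every node pair is connected independently with probability `edge_probs[i, j]`. Note that NumPy's `Generator.exponential` takes a `scale` argument, which is the mean of the distribution.

```python
import numpy as np

def sample_block(n_i, n_j, prob, mean, rng):
    """Hypothetical helper: sample weighted edges between two distinct populations."""
    # Each of the n_i * n_j node pairs gets an edge with probability `prob`.
    mask = rng.random((n_i, n_j)) < prob
    # Edge weights are exponential; `scale` is the mean of the distribution.
    weights = rng.exponential(scale=mean, size=(n_i, n_j))
    # Pairs without an edge get weight 0.
    return np.where(mask, weights, 0.0)

rng = np.random.default_rng(42)
block = sample_block(500, 500, prob=0.3, mean=12.0, rng=rng)

edge_fraction = (block > 0).mean()            # should be close to prob
observed_mean_weight = block[block > 0].mean()  # should be close to mean
print(edge_fraction, observed_mean_weight)
```

With large blocks the observed edge fraction and mean weight approach the requested `prob` and `mean`, which is a quick sanity check for any simulator of this kind.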

In [26]:
ep = np.array([[0.48, 0.02, 0.01, 0.02],
               [0.02, 0.12, 0.1, 0.08],
               [0.01, 0.1, 0.56, 0.32],
               [0.02, 0.08, 0.32, 0.61]])

mw = np.array([[29.16, 10.77, 10.05, 11.54],
               [10.77, 14.13, 12.49, 12.21],
               [10.05, 12.49, 24.76, 19.13],
               [11.54, 12.21, 19.13, 31.08]])

In [27]:
assert np.all(ep == ep.T)
assert np.all(mw == mw.T)
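Since the repository has few internal asserts, it can help to check the inputs more thoroughly before building the simulator. The sketch below is a hypothetical helper (`check_simulator_inputs` is not part of GENLINK) that tests shape, symmetry and value ranges as described above; the requirement that mean weights be positive is an added assumption that follows from them being means of an exponential distribution.

```python
import numpy as np

def check_simulator_inputs(edge_probs, mean_weight, num_classes):
    """Hypothetical helper: raise ValueError if the matrices look inconsistent."""
    edge_probs = np.asarray(edge_probs, dtype=float)
    mean_weight = np.asarray(mean_weight, dtype=float)
    expected_shape = (num_classes, num_classes)

    if edge_probs.shape != expected_shape or mean_weight.shape != expected_shape:
        raise ValueError(f"both matrices must have shape {expected_shape}")
    # allclose tolerates tiny floating-point differences in computed matrices.
    if not (np.allclose(edge_probs, edge_probs.T) and np.allclose(mean_weight, mean_weight.T)):
        raise ValueError("both matrices must be symmetric")
    if np.any(edge_probs < 0.0) or np.any(edge_probs >= 1.0):
        raise ValueError("edge probabilities must lie in [0, 1)")
    if np.any(mean_weight <= 0.0):
        raise ValueError("mean weights must be positive")

# A small valid example with two populations passes silently.
check_simulator_inputs(np.array([[0.5, 0.1], [0.1, 0.4]]),
                       np.array([[20.0, 10.0], [10.0, 15.0]]),
                       num_classes=2)
```

Calling the helper with `ep`, `mw` and `num_classes=4` from this tutorial would be the natural use.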

In [28]:
ns = NullSimulator(num_classes=4, 
                   edge_probs=ep, 
                   mean_weight=mw)

Now you need to specify population sizes and generate the internal simulator objects (`counts, means, pop_index`). Assuming `ns` is your `NullSimulator` object, call its `generate_matrices` method, which takes:

* `population_sizes` - a numpy array with the desired number of individuals in each population (`population_sizes[i]` corresponds to `edge_probs[i, i]` and `mean_weight[i, i]`)
* `rng` - a numpy random number generator, fixed by the seed
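The following sketch shows what "fixed by the seed" means in practice, using only NumPy. Two generators created from the same seed produce identical draws, while reusing one generator object gives different draws because its internal state advances. To reproduce a simulation, create a fresh generator from the seed each time.

```python
import numpy as np

seed = 42

# Two fresh generators with the same seed yield the same numbers.
first = np.random.default_rng(seed).random(3)
second = np.random.default_rng(seed).random(3)
same_seed_matches = np.array_equal(first, second)

# A single generator advances its state between calls.
shared = np.random.default_rng(seed)
draw_a = shared.random(3)
draw_b = shared.random(3)
reused_generator_matches = np.array_equal(draw_a, draw_b)

print(same_seed_matches, reused_generator_matches)
```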

In [29]:
ps = np.array([12, 20, 16, 8])

In [30]:
counts, means, pop_index = ns.generate_matrices(population_sizes=ps,
                                                rng=np.random.default_rng(seed))
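As a rough sanity check, you can estimate how many edges to expect from the chosen probabilities and population sizes. The sketch below assumes each node pair is connected independently with probability `edge_probs[i, j]`; this is an assumption about the null model, not something taken from the `NullSimulator` code, and `expected_edge_counts` is a hypothetical helper.

```python
import numpy as np

def expected_edge_counts(edge_probs, population_sizes):
    """Hypothetical helper: expected number of edges per population pair."""
    n = np.asarray(population_sizes, dtype=float)
    p = np.asarray(edge_probs, dtype=float)
    # Between populations i and j there are n_i * n_j possible node pairs.
    pairs = np.outer(n, n)
    # Within population i there are n_i * (n_i - 1) / 2 unordered pairs.
    np.fill_diagonal(pairs, n * (n - 1) / 2)
    return p * pairs

ep = np.array([[0.48, 0.02, 0.01, 0.02],
               [0.02, 0.12, 0.1, 0.08],
               [0.01, 0.1, 0.56, 0.32],
               [0.02, 0.08, 0.32, 0.61]])
ps = np.array([12, 20, 16, 8])

expected = expected_edge_counts(ep, ps)
# Count each unordered population pair once (upper triangle with diagonal).
total_expected = np.triu(expected).sum()
print(round(total_expected, 2))
```

With the `ep` and `ps` values used in this tutorial, the total comes to roughly 233 expected edges under this assumption.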

Finally, call the `simulate_graph` method and specify the path where you want to save your graph.

In [36]:
graph_file_path = f'{os.environ.get("HOME")}/GENLINK/data/tutorial_graph.csv'
ns.simulate_graph(means=means,
                  counts=counts, 
                  pop_index=pop_index,
                  path=graph_file_path)
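If the `HOME` environment variable is not set (for example on some Windows setups), `os.environ.get("HOME")` returns `None` and the path above starts with the literal text `None`. A minimal alternative, assuming the same `GENLINK/data` directory layout under your home folder, is to build the path with `pathlib`. It is converted to a string here because it is not confirmed that `simulate_graph` accepts `Path` objects.

```python
from pathlib import Path

# Path.home() resolves the user's home directory without relying on HOME directly.
graph_file_path = str(Path.home() / "GENLINK" / "data" / "tutorial_graph.csv")
print(graph_file_path)
```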

Now you have the graph in a readable `.csv` format where each row describes one edge and its properties. Here is the column breakdown:

* `node_id1` - name of a node in the simulated graph
* `node_id2` - name of a node in the simulated graph
* `label_id1` - name of the population in the simulated graph that the node in `node_id1` belongs to
* `label_id2` - name of the population in the simulated graph that the node in `node_id2` belongs to
* `ibd_sum` - the edge weight (each row is an edge)
* `ibd_n` - number of IBD segments; it is always 1 because we can't simulate them (kept only for consistency with real data)

In [37]:
pd.read_csv(graph_file_path)

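| Unnamed: 0 | node_id1 | node_id2 | label_id1 | label_id2 | ibd_sum | ibd_n |
|---|---|---|---|---|---|---|
| 0 | node_1 | node_0 | P0 | P0 | 79.181347 | 1 |
| 1 | node_2 | node_0 | P0 | P0 | 9.701131 | 1 |
| 2 | node_4 | node_0 | P0 | P0 | 28.309291 | 1 |
| 3 | node_5 | node_0 | P0 | P0 | 14.826807 | 1 |
| 4 | node_5 | node_2 | P0 | P0 | 11.424783 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 230 | node_55 | node_47 | P3 | P2 | 6.396516 | 1 |
| 231 | node_55 | node_49 | P3 | P3 | 46.897711 | 1 |
| 232 | node_55 | node_50 | P3 | P3 | 25.709908 | 1 |
| 233 | node_55 | node_52 | P3 | P3 | 156.257226 | 1 |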
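The `Unnamed: 0` column in the output above appears to be the row index that was written to the file. A small sketch of reading such a file with `index_col=0` and summarising edges per population pair is shown below. It uses a handful of the rows displayed above, held in an in-memory string instead of the real file, so it runs without the simulated graph.

```python
import io
import pandas as pd

# A few rows copied from the output above; the first header field is empty,
# which is how a saved index typically appears in a CSV file.
csv_text = """,node_id1,node_id2,label_id1,label_id2,ibd_sum,ibd_n
0,node_1,node_0,P0,P0,79.181347,1
1,node_2,node_0,P0,P0,9.701131,1
2,node_4,node_0,P0,P0,28.309291,1
230,node_55,node_47,P3,P2,6.396516,1
231,node_55,node_49,P3,P3,46.897711,1
"""

# index_col=0 turns the first column into the index instead of "Unnamed: 0".
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Number of edges and mean edge weight for each (label_id1, label_id2) pair.
summary = df.groupby(["label_id1", "label_id2"])["ibd_sum"].agg(["count", "mean"])
print(summary)
```

For the real file, replace `io.StringIO(csv_text)` with `graph_file_path`. Note that this grouping treats `(P3, P2)` and `(P2, P3)` as different pairs if both orderings occur in the data.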


It's time for visualisation! Here we are going to use our `DataProcessor` class. Here are its arguments:

* `path`
* `is_path_object=False`
* `disable_printing=True`
* `dataset_name=None`
* `no_mask_class_in_df=True` 