# NEExT

### Network Embedding Exploration Tool

NEExT is a tool for exploring and building graph embeddings. This tool allows for:
* Cleansing and standardizing a collection of graph data.
* Creating node and structural features for nodes in the graph collection.
* Creating embeddings for graphs.

### Installation Process
NEExT uses Python 3.x (currently tested using Python 3.11).
You can install NEExT using the following:
```console
pip install NEExT
```

### Graph Data Format
You can use a few different data formats to load data into NEExT. Currently, it supports:
* CSV files
* NetworkX Objects (coming soon)

See below for examples of using different data formats.

#### Using CSV Files
Data can be categorized into the following groups:
* Edge File (captures which nodes are connected to which nodes)
* Node Graph Mapping (captures which node belongs to which graph)
* Graph Label Mapping [optional] (captures labels for each graph)
* Node Features [optional] (captures the features for each node)

Below we show examples of how each of the above files should be formatted:

##### Edge File:
|node_a|node_b|
|---|---|
|1|2|
|3|2|
|.|.|

##### Node Graph Mapping:
|node_id|graph_id|
|---|---|
|0|1|
|1|1|
|2|1|
|3|2|
|4|2|
|.|.|

##### Graph Label Mapping:
|graph_id|graph_label|
|---|---|
|0|0|
|1|0|
|2|1|
|3|0|
|4|1|
|.|.|

##### Node Features:
|node_id|node_feat_0|node_feat_1|...|
|---|---|---|---|
|0|0.34| 3.2| .|
|1|0.1| 2.9| .|
|2|1.9| 1.3| .|
|3|0.0| 2.2| .|
|4|11.2| 12.3| .|
|.|.| .| .|

Note that NEExT cannot handle non-numerical features, so any feature engineering needed to make node features numeric must be done by the end user.
Data standardization, however, is handled by NEExT.
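
As a rough illustration of how these files fit together, the sketch below parses small in-memory samples with Python's standard `csv` module. It checks that every node in the edge file has a graph assignment and that every feature value is numeric. The helper names (`read_rows`, `is_number`, `find_problems`) and the sample data are hypothetical and not part of NEExT.

```python
import csv
import io

# Tiny in-memory stand-ins for the CSV files described above (made-up data).
EDGE_CSV = "node_a,node_b\n0,1\n1,2\n3,4\n"
NODE_GRAPH_CSV = "node_id,graph_id\n0,1\n1,1\n2,1\n3,2\n4,2\n"
NODE_FEAT_CSV = "node_id,node_feat_0,node_feat_1\n0,0.34,3.2\n1,0.1,2.9\n2,1.9,red\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def find_problems(edge_rows, mapping_rows, feature_rows):
    """Collect human-readable descriptions of inconsistencies between the files."""
    problems = []
    known_nodes = {row["node_id"] for row in mapping_rows}
    for row in edge_rows:
        for column in ("node_a", "node_b"):
            if row[column] not in known_nodes:
                problems.append(f"edge endpoint {row[column]} has no graph_id")
    for row in feature_rows:
        for column, value in row.items():
            if column != "node_id" and not is_number(value):
                problems.append(f"node {row['node_id']}: {column}={value!r} is not numeric")
    return problems


problems = find_problems(
    read_rows(EDGE_CSV), read_rows(NODE_GRAPH_CSV), read_rows(NODE_FEAT_CSV)
)
for problem in problems:
    print(problem)
```

Here the only reported problem is the non-numeric value `red`, which would need to be encoded as a number before the file is usable.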






# NEExT Tutorial [Getting Started]

In this notebook, we showcase how to use NEExT to analyze graph embeddings.

In [1]:
from NEExT.NEExT import NEExT

The following are links to some graph data, which we will use in this tutorial.
Note that this dataset includes graph labels, which are optional when using NEExT. The datasets were generated using the ABCD Framework, found here (https://github.com/bkamins/ABCDGraphGenerator.jl).

## Loading Data

First we define the paths to the datasets. They are `csv` files, formatted as described in the README file.

In [2]:
edge_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/edge_file.csv"
graph_label_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/graph_label_mapping_file.csv"
node_graph_mapping_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/node_graph_mapping_file.csv"

Now we can instantiate a NEExT object.

In [3]:
nxt = NEExT(quiet_mode="on")

You can load data using the `load_data_from_csv` method:

In [4]:
nxt.load_data_from_csv(edge_file=edge_file, node_graph_mapping_file=node_graph_mapping_file, graph_label_file=graph_label_file)
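
Conceptually, loading combines the edge file with the node-graph mapping so that each edge can be assigned to a graph. The sketch below shows one way to do that grouping with plain dictionaries. It is an illustration with made-up data, not NEExT's internal implementation, and `group_edges_by_graph` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows in the shape of the edge file and the node-graph mapping file.
edges = [(0, 1), (1, 2), (3, 4)]
node_to_graph = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2}


def group_edges_by_graph(edges, node_to_graph):
    """Return {graph_id: [(node_a, node_b), ...]} for edges within one graph."""
    grouped = defaultdict(list)
    for node_a, node_b in edges:
        graph_a = node_to_graph[node_a]
        graph_b = node_to_graph[node_b]
        if graph_a != graph_b:
            # Assumption of this sketch: an edge never connects two different graphs.
            raise ValueError(f"edge ({node_a}, {node_b}) spans graphs {graph_a} and {graph_b}")
        grouped[graph_a].append((node_a, node_b))
    return dict(grouped)


edges_by_graph = group_edges_by_graph(edges, node_to_graph)
print(edges_by_graph)
```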

## Building Features

You can now compute various features on nodes of the subgraphs in the graph collection loaded above.<br>
This can be done using the method `compute_graph_feature`. <br>
To get the list of available node features, you can use the function `get_list_of_graph_features`.

In [5]:
nxt.get_list_of_graph_features()

['lsme',
 'self_walk',
 'basic_expansion',
 'basic_node_features',
 'page_rank',
 'degree_centrality',
 'closeness_centrality',
 'load_centrality',
 'eigenvector_centrality',
 'anomaly_score_CADA',
 'normalized_anomaly_score_CADA',
 'community_association_strength',
 'normalized_within_module_degree',
 'participation_coefficient']

These are the types of node features you can compute on every node of each graph in the graph collection. <br>
So for example, let's compute `page_rank`. We also need to define what the feature vector size should be.

In [6]:
nxt.compute_graph_feature(feat_name="page_rank", feat_vect_len=4)

To compute additional features, simply use the same function and provide the feature name and the length of the feature vector.<br>
Let's add degree centrality to the list of computed features.

In [7]:
nxt.compute_graph_feature(feat_name="degree_centrality", feat_vect_len=4)
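
For intuition about what a per-node feature looks like, the sketch below computes degree centrality by hand, using the usual definition of a node's degree divided by n - 1. It produces only one value per node; how NEExT expands a feature into a vector of length `feat_vect_len` is not covered here. The adjacency dictionary and the local `degree_centrality` helper are illustrative assumptions, not NEExT code.

```python
# A small undirected graph as an adjacency dictionary (made-up example data).
adjacency = {
    0: {1, 2, 3},
    1: {0},
    2: {0, 3},
    3: {0, 2},
}


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n <= 1:
        # No neighbors are possible; this sketch returns 0.0 by its own convention.
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


centrality = degree_centrality(adjacency)
for node, value in sorted(centrality.items()):
    print(node, round(value, 3))
```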

## Building Global Feature Object

Right now, we have 2 features computed on every node, for every graph. We can use these features to construct an overall pooled feature vector, which can then be used to construct graph embeddings. <br>
To do this, we can pool the features using the `pool_graph_features` method.

In [8]:
nxt.pool_graph_features(pool_method="concat")

The overall feature (which we call the global feature) is a concatenated vector of whatever features you have computed on the graph. In this example, it is an 8-dimensional vector: 4 values from `page_rank` and 4 from `degree_centrality`.<br>
You can access the global vector by using the `get_global_feature_vector` method.

In [9]:
df = nxt.get_global_feature_vector()
df.head(3)

|Unnamed: 0|node_id|graph_id|feat_page_rank_0|feat_page_rank_1|feat_page_rank_2|feat_page_rank_3|feat_degree_centrality_0|feat_degree_centrality_1|feat_degree_centrality_2|feat_degree_centrality_3|
|---|---|---|---|---|---|---|---|---|---|---|
|0|0|0|4.014656|1.825315|2.003575|2.062771|4.094288|1.723672|2.023497|2.122162|
|1|1|0|2.651835|1.745939|2.042548|2.045115|2.682074|1.689427|2.023497|2.082461|
|2|2|0|2.672592|1.696518|2.058271|2.071704|2.682074|1.578132|2.120736|2.131372|
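
The `concat` pooling used above can be pictured as joining the per-feature vectors of each node end to end. The sketch below illustrates this with plain dictionaries and made-up numbers. `concat_pool` is a hypothetical helper, not NEExT's implementation.

```python
# Per-node vectors for two features, each of length 4 (made-up values).
page_rank = {0: [4.0, 1.8, 2.0, 2.1], 1: [2.7, 1.7, 2.0, 2.0]}
degree_centrality = {0: [4.1, 1.7, 2.0, 2.1], 1: [2.7, 1.7, 2.0, 2.1]}


def concat_pool(*features):
    """Concatenate each node's vectors, in the order the features are given."""
    nodes = features[0].keys()
    pooled = {}
    for node in nodes:
        vector = []
        for feature in features:
            vector.extend(feature[node])
        pooled[node] = vector
    return pooled


global_features = concat_pool(page_rank, degree_centrality)
print(len(global_features[0]))  # 4 + 4 = 8 values per node
```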


## Dimensionality Reduction

We may wish to reduce the number of dimensions of our data, which could help downstream tasks such as embedding generation or machine learning. This can be done using the `apply_dim_reduc_to_graph_feats` method.

In [10]:
nxt.apply_dim_reduc_to_graph_feats(dim_size=4, reducer_type="pca")

If we take a look at the global feature vector, we can see that it has been updated to the new number of dimensions.

In [11]:
df = nxt.get_global_feature_vector()
df.head()

|Unnamed: 0|node_id|graph_id|feat_0|feat_1|feat_2|feat_3|
|---|---|---|---|---|---|---|
|0|0|0|2.560267|3.463473|1.190519|1.075334|
|1|1|0|2.219077|1.481326|1.365794|0.532872|
|2|2|0|2.225506|1.488865|1.925751|0.467088|
|3|3|0|2.129568|0.363245|-0.311636|1.752476|
|4|4|0|2.009336|0.529704|3.251858|-1.665345|
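
For readers unfamiliar with PCA, the sketch below shows the core idea on made-up 2-D data: center the data, find the direction of largest variance, and project onto it. It uses power iteration in pure Python so that it runs without extra dependencies. This is a from-scratch illustration for intuition only, not how NEExT performs the reduction, and all helper names are hypothetical.

```python
import math


def center(rows):
    """Subtract the column means so every column has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0] * d  # starting vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up 2-D data that lies exactly on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(data)
component = top_component(covariance(centered))
# Project every row onto the leading component: 2 dimensions become 1.
reduced = [sum(x * c for x, c in zip(row, component)) for row in centered]
print(component, reduced)
```

Because the sample points all lie on one line, a single component captures all of the variance, and the leading direction is proportional to (1, 2).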


You still have access to the pre-dimensionality reduction global vector by using the method `get_archived_global_feature_vector`.

In [12]:
df = nxt.get_archived_global_feature_vector()
df.head()

|Unnamed: 0|node_id|graph_id|feat_page_rank_0|feat_page_rank_1|feat_page_rank_2|feat_page_rank_3|feat_degree_centrality_0|feat_degree_centrality_1|feat_degree_centrality_2|feat_degree_centrality_3|
|---|---|---|---|---|---|---|---|---|---|---|
|0|0|0|4.014656|1.825315|2.003575|2.062771|4.094288|1.723672|2.023497|2.122162|
|1|1|0|2.651835|1.745939|2.042548|2.045115|2.682074|1.689427|2.023497|2.082461|
|2|2|0|2.672592|1.696518|2.058271|2.071704|2.682074|1.578132|2.120736|2.131372|
|3|3|0|1.968745|2.028736|1.879435|2.154684|1.975967|2.082671|1.851304|2.161749|
|4|4|0|1.940827|1.3845|2.274468|1.972817|1.975967|1.355541|2.346133|1.966471|


## Building Graph Embeddings

The feature-vector methods used above return a Pandas DataFrame containing the collected features and how they map to the graphs and nodes. <br>
One thing to note is that the data is standardized across all graphs.

We can use the features computed on the graphs to build graph embeddings. To see what graph embedding engines are available to use, we can use the `get_list_of_graph_embedding_engines` function.

In [13]:
nxt.get_list_of_graph_embedding_engines()

['approx_wasserstein', 'wasserstein', 'sinkhornvectorizer']

Now, let's build a 3-dimensional embedding for every graph in the graph collection using the Approximate Wasserstein embedding engine. This can be done by using the method `build_graph_embedding`.

In [14]:
nxt.build_graph_embedding(emb_dim_len=3, emb_engine="approx_wasserstein")

You can access the embedding results by using the method `get_graph_embeddings`.

In [15]:
df = nxt.get_graph_embeddings()
df.head()

|Unnamed: 0|emb_0|emb_1|emb_2|graph_id|
|---|---|---|---|---|
|0|2.121117|1.715908|0.420738|0|
|1|0.927579|1.293987|1.120732|1|
|2|0.070954|1.024027|0.343173|2|
|3|-0.682049|0.990811|0.10676|3|
|4|-1.281254|0.773949|0.212976|4|
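
The engine names refer to the Wasserstein (optimal transport) distance between distributions. For intuition only: in one dimension, the Wasserstein-1 distance between two equally sized samples reduces to the mean absolute difference of their sorted values. The sketch below applies that to made-up per-node feature values from three graphs. It is not NEExT's embedding algorithm, and `wasserstein_1d` is a hypothetical helper.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equally sized 1-D samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this simplified sketch needs samples of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Made-up values of one node feature on three small graphs.
graph_a = [0.1, 0.2, 0.3, 0.4]
graph_b = [0.4, 0.1, 0.3, 0.2]  # same values as graph_a, different node order
graph_c = [1.1, 1.2, 1.3, 1.4]  # every value shifted up by 1.0

print(wasserstein_1d(graph_a, graph_b))  # identical distributions
print(wasserstein_1d(graph_a, graph_c))  # distributions about one unit apart
```

The distance ignores node order and depends only on the distribution of values, which is why comparing graphs through their node-feature distributions can be useful.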


## Visualize Embeddings

You can use the built-in visualization function to gain quick insights into the performance of your embeddings. This can be done by using the method `visualize_graph_embedding`. If you have labels for your graphs (as is the case here), you can color the embedding distributions using the labels. By default, embeddings are not colored.

## Using Sampled Sub-Graphs

We often have to deal with large graph collections, both in the number of sub-graphs in the collection and in the size of each sub-graph. To allow for faster computation, we can sample each sub-graph and compute metrics and features for only a fraction of its nodes. This can be done using the `build_node_sample_collection` method, which takes the fraction of nodes to sample as input. Once this method is called, all further computation uses the sampled node collection.

In [16]:
nxt.build_node_sample_collection(sample_rate=0.1)
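
To illustrate what sampling a fraction of nodes per sub-graph means, the sketch below draws a random subset of node ids from each graph with the standard `random` module. It is a simplified picture with made-up data. `sample_nodes` is a hypothetical helper, and NEExT's own sampling strategy may differ.

```python
import random

# Made-up collection: graph_id -> list of node ids.
graph_nodes = {0: list(range(0, 50)), 1: list(range(50, 80))}


def sample_nodes(graph_nodes, sample_rate, seed=0):
    """Keep roughly sample_rate of the nodes of every graph (at least one)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = {}
    for graph_id, nodes in graph_nodes.items():
        # Never ask for more nodes than exist, and keep at least one if possible.
        k = min(len(nodes), max(1, round(len(nodes) * sample_rate)))
        sampled[graph_id] = sorted(rng.sample(nodes, k))
    return sampled


sampled = sample_nodes(graph_nodes, sample_rate=0.1)
print({graph_id: len(nodes) for graph_id, nodes in sampled.items()})
```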

## Adding Custom Node Feature Function

You can define your own node feature function and load it into `NEExT`. Below, we show an example of loading a custom function into `NEExT` and using it to compute node features. The only thing to keep in mind is the interface for the function, meaning the input and output format.

#### Input:
Your custom function should have the following inputs:
```
def func(G, feat_vect_len):
    ...
```
Where `G` is a `NetworkX` graph object and `feat_vect_len` is an `int` indicating the length of the node feature vector.
#### Output:
Your function should have the following output:
```
def func(G, feat_vect_len):
    ...
    feat_vect = {node_id : [v1, v2, v3, ...], ...}
    return feat_vect
```
Where the output is a `dict` whose keys are the node ids and whose values are lists containing the feature values for that node. The length of each feature vector should be the same as the input `feat_vect_len`.

One thing to note is that, if you have applied `sampling` (as we have done above), `NEExT` will automatically load a sampled version of the graphs into your custom function. The NetworkX graph G passed to your function is a sub-graph with only a fraction of nodes, as defined by the sampling rate.
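
The input and output contract described above can be checked before a function is handed to `NEExT`. The sketch below is one hypothetical way to do so: `check_feature_function` is not part of NEExT, and `TinyGraph` is a stand-in that only mimics the `nodes` attribute of a NetworkX graph so the example runs without extra dependencies.

```python
class TinyGraph:
    """Minimal stand-in exposing only the `nodes` attribute used below."""

    def __init__(self, nodes):
        self.nodes = list(nodes)


def constant_feature(G, feat_vect_len):
    """Example custom function: a vector of ones for every node."""
    return {node: [1.0] * feat_vect_len for node in G.nodes}


def check_feature_function(func, G, feat_vect_len):
    """Raise ValueError if func's output breaks the expected contract."""
    result = func(G, feat_vect_len)
    if not isinstance(result, dict):
        raise ValueError("output must be a dict keyed by node id")
    missing = set(G.nodes) - set(result)
    if missing:
        raise ValueError(f"no feature vector for nodes: {sorted(missing)}")
    for node, vector in result.items():
        if len(vector) != feat_vect_len:
            raise ValueError(f"node {node}: expected {feat_vect_len} values, got {len(vector)}")
    return result


features = check_feature_function(constant_feature, TinyGraph([11, 12, 24]), 4)
print(features[11])
```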

Below, we show an example of:
* Creating a custom function, where the feature vector contains only zeros (you can do something more complicated)
* Loading the custom function into NEExT using `load_custom_node_feature_function` with two parameters (`function` and `function_name`)
* Calling the custom function to compute features.
* Concatenating the new features with the old ones (from above)
* Displaying the new global feature DataFrame.

In [21]:
def my_custom_node_feature(G, feat_vect_len):
    feat_vect = {}
    for i in G.nodes:
        feat_vect[i] = [0]*feat_vect_len
    return feat_vect

nxt.load_custom_node_feature_function(function=my_custom_node_feature, function_name="my_custom_node_feature")
nxt.get_list_of_graph_features()

['lsme',
 'self_walk',
 'basic_expansion',
 'basic_node_features',
 'page_rank',
 'degree_centrality',
 'closeness_centrality',
 'load_centrality',
 'eigenvector_centrality',
 'anomaly_score_CADA',
 'normalized_anomaly_score_CADA',
 'community_association_strength',
 'normalized_within_module_degree',
 'participation_coefficient',
 'my_custom_node_feature']

In [22]:
nxt.compute_graph_feature(feat_name="my_custom_node_feature", feat_vect_len=4)

In [23]:
nxt.pool_graph_features(pool_method="concat")

In [24]:
df = nxt.get_global_feature_vector()
df.head(3)

|Unnamed: 0|node_id|graph_id|feat_page_rank_0|feat_page_rank_1|feat_page_rank_2|feat_page_rank_3|feat_my_custom_node_feature_0|feat_my_custom_node_feature_1|feat_my_custom_node_feature_2|feat_my_custom_node_feature_3|feat_degree_centrality_0|feat_degree_centrality_1|feat_degree_centrality_2|feat_degree_centrality_3|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|11|0|3.978217|1.732728|2.377843|1.84914|0.0|0.0|0.0|0.0|4.133798|1.745547|2.449535|1.847807|
|1|12|0|3.942675|1.65817|2.117825|1.998154|0.0|0.0|0.0|0.0|4.133798|1.642117|2.125281|2.004648|
|2|24|0|2.992725|1.334683|2.1739|2.100552|0.0|0.0|0.0|0.0|2.709091|1.07325|2.190615|2.223901|
