# `OpeNTF` with `GNN`:
`OpeNTF` previously relied on traditional, non-graph embedding methods to provide skill embeddings as an alternative input to one-hot encoded skills. With `GNN` methods, we can now use graph-based **learned** skill embeddings as a more meaningful input. GNNs capture the collaborative ties within our transformed graph data and produce informative embeddings, which results in better recommendation of experts.

## Background

#### **Graph Neural Networks**

The key idea behind graph neural networks is the message-passing mechanism. As the figure below shows, each node's representation at step `k` is computed by aggregating the `(k - 1)`-th representations of its neighboring nodes and combining the result with the node's own previous representation. After GNN training, the learned representations form a collection of node embeddings that can be used in downstream tasks.
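To make the aggregate-and-combine step concrete, here is a minimal sketch of a single message-passing round in plain Python. It assumes mean aggregation and a simple averaging combine step; real GNN layers use learned weights and non-linearities instead, and the function name `message_passing_round` is only illustrative.

```python
def message_passing_round(features, neighbors):
    """One round of message passing with mean aggregation.

    features:  dict node -> list[float], the (k-1)-th representations
    neighbors: dict node -> list of neighboring nodes
    Returns a dict node -> list[float] holding the k-th representations.
    """
    updated = {}
    for node, h_prev in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            # AGGREGATE: element-wise mean of the neighbors' previous representations
            agg = [sum(vals) / len(nbrs) for vals in zip(*(features[n] for n in nbrs))]
        else:
            agg = [0.0] * len(h_prev)
        # COMBINE: average the node's own representation with the aggregated message
        # (an assumption for this sketch; a trained layer would apply weights here)
        updated[node] = [(own + msg) / 2 for own, msg in zip(h_prev, agg)]
    return updated


features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
h1 = message_passing_round(features, neighbors)
print(h1["a"])
```

Calling the function `k` times in a row lets information travel `k` hops, which is why the number of layers bounds the neighborhood each embedding can see.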

<p align="center"><img src='gnn_embedding-min.png' width="500" ></p>

#### **Graph Structures**

In OpeNTF with GNNs, we aim to cover as many graph-structure variations as possible within the limits of the given datasets. We currently implement three graph structures, extending previous state-of-the-art work: `skill-expert` (s-e), `skill-team-expert` (s-t-e), and `skill-team-expert-location` (s-t-e-l). Each name lists the node types that make up the structure.

<p align="center"><img src='graph_structures.png' width="500" ></p>
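The three structures differ only in which node types are connected. The sketch below builds the edge lists for each structure from a toy set of teams, using the edge-type triples and the codes `sm`, `stm`, and `stml` that appear in the parameter file later in this notebook. The input layout (`skills`, `members`, `loc` keys) and the rule that every skill of a team links to every member in `sm` are assumptions made for illustration.

```python
def build_edges(teams, graph_type):
    """Return a dict edge_type -> sorted list of (source, target) pairs."""
    edges = {}

    def add(edge_type, src, dst):
        # a set drops duplicate (source, target) pairs
        edges.setdefault(edge_type, set()).add((src, dst))

    for team_id, team in teams.items():
        if graph_type == "sm":
            # skill-expert: connect skills directly to members of the same team
            for skill in team["skills"]:
                for member in team["members"]:
                    add(("skill", "to", "member"), skill, member)
        elif graph_type in ("stm", "stml"):
            # skill-team-expert: the team node sits between skills and members
            for skill in team["skills"]:
                add(("skill", "to", "team"), skill, team_id)
            for member in team["members"]:
                add(("member", "to", "team"), member, team_id)
            if graph_type == "stml":
                # the location variant adds one more node type around the team
                add(("loc", "to", "team"), team["loc"], team_id)
        else:
            raise ValueError(f"unknown graph type: {graph_type}")
    return {edge_type: sorted(pairs) for edge_type, pairs in edges.items()}


teams = {
    "t0": {"skills": ["s0", "s1"], "members": ["m0"], "loc": "l0"},
    "t1": {"skills": ["s1"], "members": ["m0", "m1"], "loc": "l0"},
}
sm = build_edges(teams, "sm")
stm = build_edges(teams, "stm")
stml = build_edges(teams, "stml")
```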

#### **Neighborhood Sampling**

The graph built from our large-scale datasets is too large to be processed by a single GNN model in one piece. Therefore, we use a `mini-batching` strategy: smaller subgraphs are extracted and fed to the model as batches that together cover the entire graph, and the model is trained on them for a link prediction task. Each subgraph is sampled from the neighborhood surrounding a selected starting edge (the `seed edge`). In each hop, the neighbors are drawn from the nodes touched by the edges selected in that hop. When sampling the neighborhood of a `seed edge`, a `k-hop` neighbor is a node that is `k` steps away from a particular source node.

In the link prediction task, the number of elements in one batch is counted in sampled `links` (`edges`). Hence, a batch size of 128 corresponds to 128 edges collected from a defined `k-hop` neighborhood.

<div style="display: flex; justify-content: center;">
    <img src="Subgraph_1-min.png" width="500">
</div>
<div style="display: flex; justify-content: center;">
    <img src="Subgraph_2-min.png" width="500">
    <img src="Subgraph_3-min.png" width="500">
</div>
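The sampling procedure can be sketched in plain Python as follows. This is a simplified, homogeneous-graph illustration, not the loader used in the codebase: starting from a seed edge, each hop keeps at most a fixed number of randomly chosen neighbors per node. The names `sample_neighborhood` and `seed_edge_batches` are hypothetical.

```python
import random


def sample_neighborhood(adj, seed_edge, num_neighbors, rng):
    """Return the set of nodes reached from a seed edge, hop by hop.

    adj:           dict node -> list of neighboring nodes
    seed_edge:     (source, target) pair that starts the sampling
    num_neighbors: fan-out per hop, e.g. [30, 20]
    rng:           random.Random instance, passed in for reproducibility
    """
    visited = set(seed_edge)
    frontier = set(seed_edge)
    for fanout in num_neighbors:
        reached = set()
        for node in sorted(frontier):
            candidates = adj.get(node, [])
            # keep at most `fanout` randomly chosen neighbors of this node
            reached.update(rng.sample(candidates, min(fanout, len(candidates))))
        frontier = reached - visited  # only expand newly discovered nodes
        visited |= reached
    return visited


def seed_edge_batches(edges, batch_size):
    """Split the seed edges into batches; the batch size counts edges."""
    return [edges[i:i + batch_size] for i in range(0, len(edges), batch_size)]


# A path graph 0 - 1 - 2 - 3 - 4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
one_hop = sample_neighborhood(adj, (0, 1), [10], rng)
two_hop = sample_neighborhood(adj, (0, 1), [10, 10], rng)
batches = seed_edge_batches([(0, 1), (1, 2), (2, 3), (3, 4)], batch_size=3)
```

With a fan-out larger than any node degree, the sample is simply the full `k-hop` neighborhood; smaller fan-outs cap the subgraph size, which is what keeps each batch within memory.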

## **Transfer Learning with GNN**

Although we can successfully predict efficient teams with neural models trained on sparse matrices, our experiments found that using transfer learning with skill embeddings is more effective for team recommendation. The GNN approach involves learning vector representations through message-passing techniques.

<p align="center"><img src='gnn_pipeline.jpg' width="1000" ></p>
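As a rough illustration of the transfer step, the sketch below turns a sparse multi-hot skill vector into a dense input by averaging the learned embeddings of the active skills. Mean pooling is an assumption made here for simplicity; the way the pipeline actually combines skill embeddings may differ, and `dense_input` is a hypothetical helper.

```python
def dense_input(multi_hot, skill_embeddings):
    """Average the embeddings of the skills flagged in a multi-hot vector.

    multi_hot:        list of 0/1 flags, one per skill
    skill_embeddings: list of embedding vectors, indexed like multi_hot
    """
    dim = len(skill_embeddings[0])
    active = [i for i, flag in enumerate(multi_hot) if flag]
    if not active:
        return [0.0] * dim  # no required skill: fall back to a zero vector
    return [sum(skill_embeddings[i][d] for i in active) / len(active) for d in range(dim)]


skill_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy 2-dimensional embeddings
x = dense_input([1, 0, 1], skill_embeddings)
print(x)
```

The dense vector has the embedding dimension (for example 8) instead of one entry per skill, so the input layer of the downstream neural model becomes much smaller.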


## Setup

<font color="green">This notebook should be run in the root folder of the `gnn` branch of the OpeNTF project. It is already included in the desired location. The `main` branch is yet to be merged with the gnn features, hence the special instructions. </font>

We need ``Python >= 3.10`` and the required packages listed in [``requirements.txt``](requirements.txt).

Before installing the required packages, we can install `jupyter` and `ipykernel` so that the Jupyter notebook can be started inside our virtual environment (created using `virtualenv` or `venv`). After activating the virtual environment, run the steps below to register its `kernel` with the Jupyter instance:
```
pip install jupyter
pip install ipython
pip install ipykernel

# For example : name_of_the_env = "opentf"
ipython kernel install --user --name=<name_of_the_env>

# Start jupyter notebook for test run
jupyter notebook

# After starting the jupyter notebook, we can select our desired opentf kernel from any notebook
```

Clone the codebase with ``git``, switch to the `gnn` branch, and install the required packages with ``pip``:
```
git clone --recursive https://github.com/Fani-Lab/opentf
cd opentf
git checkout gnn
pip install -r requirements.txt
```

To install a specific build of a Python package, e.g., for ``CUDA`` compatibility, edit [``requirements.txt``](requirements.txt) so that it references the appropriate `CUDA` version (for example, replace all instances of `torch-2.5.0+cpu` with `torch-2.5.0+cu124` for `CUDA 12.4`).
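If editing by hand is inconvenient, the replacement can be scripted. The sketch below applies the substitution mentioned above to the text of a requirements file; the sample content is made up for the demonstration and does not reflect the actual file.

```python
def switch_torch_build(requirements_text, old="torch-2.5.0+cpu", new="torch-2.5.0+cu124"):
    """Replace every occurrence of the CPU build tag with the CUDA one."""
    return requirements_text.replace(old, new)


# Made-up sample content, only to show the effect of the substitution
sample = "numpy\ntorch-2.5.0+cpu\n"
updated = switch_torch_build(sample)
print(updated)

# To apply it to the real file (run from the repository root):
# from pathlib import Path
# path = Path("requirements.txt")
# path.write_text(switch_torch_build(path.read_text()))
```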

## Quickstart on `OpeNTF` with `GNN`

The entire codebase has two distinct pipelines:

1. ``./src/mdl/team2vec/main.py`` handling the embedding generation step in case of dense vector input for the neural team formation
2. ``./src/main.py`` handling the main pipeline of the neural team formation

The embedding generation pipeline consists of the models ``d2v (Doc2Vec), m2v (Metapath2Vec), gs (GraphSAGE), gat (GraphAttention), gatv2 (GraphAttentionV2), han (Heterogeneous Attention Network), gin (Graph Isomorphism Network) and gine (GIN-Edge feature enhanced)``.
This pipeline accepts the following required arguments:
1) ``-teamsvecs``: The path to the teamsvecs.pkl and indexes.pkl files; e.g., ``-teamsvecs ../data/preprocessed/dblp/toy.dblp.v12.json/``
2) ``-model``: The embedding model; e.g., ``-model d2v, m2v, gs ...``

To generate GNN based embeddings, it is recommended to include additional arguments as follows:  

1) ``--agg``: The aggregation method used for the graph data; e.g., ``mean, none, max, min ...``
2) ``--d``: Embedding dimension; e.g., ``4, 8, 16, 32 ...``
3) ``--e``: Training epochs; e.g., ``5, 20, 100 ...``
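According to the comments in the parameter file shown later, the aggregation method merges multiple edges between the same source and destination node. The following sketch illustrates that reduction on weighted edge triples. The `(source, target, weight)` layout and the helper name `reduce_duplicate_edges` are assumptions for illustration only.

```python
from math import prod

# One reducer per aggregation option listed in the parameter file
REDUCERS = {
    "add": sum,
    "mean": lambda ws: sum(ws) / len(ws),
    "min": min,
    "max": max,
    "mul": prod,
}


def reduce_duplicate_edges(edges, agg):
    """Merge parallel edges; `agg=None` keeps the duplicates untouched.

    edges: list of (source, target, weight) triples
    """
    if agg is None:
        return list(edges)
    grouped = {}
    for src, dst, weight in edges:
        grouped.setdefault((src, dst), []).append(weight)
    # dicts preserve insertion order, so edges keep their first-seen order
    return [(src, dst, REDUCERS[agg](ws)) for (src, dst), ws in grouped.items()]


edges = [("s0", "m0", 1.0), ("s0", "m0", 3.0), ("s1", "m0", 2.0)]
merged_mean = reduce_duplicate_edges(edges, "mean")
merged_add = reduce_duplicate_edges(edges, "add")
```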

The neural network pipeline accepts three required lists of values:
1) ``-data``: list of paths to the raw data files, e.g., ``-data ./../data/raw/dblp/dblp.v12.json``, or the main file of a dataset, e.g., ``-data ./../data/raw/imdb/title.basics.tsv``
2) ``-domain``: list of domains of the raw data files that could be ``dblp``, ``imdb``, or `uspt`; e.g., ``-domain dblp imdb``.
3) ``-model``: list of baseline models that could be ``fnn``, ``bnn``; e.g., ``-model fnn bnn``.

If the input type is a dense vector from GNN methods, the following additional arguments are needed:
1) ``--emb_model``: The embedding model; e.g., ``--emb_model gs gat gatv2 han ...``
2) ``--emb_graph_type``: The collaboration graph type used for embedding generation; e.g., ``sm, stm``
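When launching many runs, it can help to assemble the command line programmatically. The helper below is a hypothetical convenience, not part of the codebase; it only uses flags that appear in this notebook and prefixes required arguments with `-` and the embedding-related ones with `--`.

```python
import shlex


def build_command(script, required, optional=None):
    """Assemble `python -u <script>` with `-flag` and `--flag` arguments."""
    cmd = ["python", "-u", script]
    for prefix, args in (("-", required), ("--", optional or {})):
        for flag, value in args.items():
            # a flag may take a single value or a list of values
            values = value if isinstance(value, (list, tuple)) else [value]
            cmd += [prefix + flag, *(str(v) for v in values)]
    return cmd


cmd = build_command(
    "main.py",
    required={
        "data": "../data/raw/dblp/toy.dblp.v12.json",
        "domain": ["dblp"],
        "model": ["fnn"],
    },
    optional={"emb_model": "gat", "emb_graph_type": "stm"},
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the `src` folder.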


Here is a brief explanation of the models:
- ``fnn``, ``bnn``, ``fnn_emb``, ``bnn_emb``: follow the standard machine learning training procedure.

In [1]:
%cd src/mdl/team2vec

C:\Users\Owner\Documents\Pycharm_Projects\OpeNTF\src\mdl\team2vec




In [2]:
!pwd

/c/Users/Owner/Documents/Pycharm_Projects/OpeNTF/src/mdl/team2vec


In [3]:
!python -u main.py -teamsvecs ./../../../data/preprocessed/dblp/toy.dblp.v12.json/ -model gnn.gat --agg mean --e 5 --d 8 --ns 5 --graph_type stm

Skipping for graph type : sm, 
Loading the data file ./../../../data/preprocessed/dblp/toy.dblp.v12.json//gnn/stm.undir.mean.data.pkl ...
mini batch loader for mode train
mini batch loader for mode val
mini batch loader for mode train
mini batch loader for mode val
Device: 'cpu'
Encoder(
  (model): GraphModule(
    (conv1): ModuleDict(
      (skill__to__team): GATConv((-1, -1), 8, heads=2)
      (member__to__team): GATConv((-1, -1), 8, heads=2)
      (team__rev_to__skill): GATConv((-1, -1), 8, heads=2)
      (team__rev_to__member): GATConv((-1, -1), 8, heads=2)
    )
    (conv2): ModuleDict(
      (skill__to__team): GATConv((-1, -1), 8, heads=2)
      (member__to__team): GATConv((-1, -1), 8, heads=2)
      (team__rev_to__skill): GATConv((-1, -1), 8, heads=2)
      (team__rev_to__member): GATConv((-1, -1), 8, heads=2)
    )
    (conv3): ModuleDict(
      (skill__to__team): GATConv((-1, -1), 8, heads=2)
      (member__to__team): GATConv((-1, -1), 8, heads=2)
      (team__rev_to__skill): 

In [4]:
%cd ../../
!pwd

C:\Users\Owner\Documents\Pycharm_Projects\OpeNTF\src
/c/Users/Owner/Documents/Pycharm_Projects/OpeNTF/src


In [5]:
!python -u main.py -data ../data/raw/dblp/toy.dblp.v12.json -domain dblp -model fnn --emb_model gat --emb_graph_type stm --emb_agg mean --emb_e 5 --emb_d 8 --emb_ns 5

Loading sparse matrices from ./../data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl ...
Loading indexes pickle from ./../data/preprocessed/dblp/toy.dblp.v12.json/indexes.pkl ...
It took 0.0 seconds to load from the pickles.
It took 0.002470731735229492 seconds to load the sparse matrices.
loaded expert-skill co-occurrence matrix es_vecs
Running for (dataset, model): (dblp, fnn) ... 

.............. starting learn .................

Fold 0/2, Epoch 0/24, Minibatch 0/0, Phase train, Running Loss train 65.11357116699219, Time 0.03322553634643555, Overall 2.863619804382324 
Fold 0/2, Epoch 0/24, Running Loss train 3.830210068646599, Time 0.03322553634643555, Overall 2.863619804382324 
Fold 0/2, Epoch 0/24, Minibatch 0/0, Phase valid, Running Loss valid 29.27532386779785, Time 0.0432736873626709, Overall 2.8736679553985596 
Fold 0/2, Epoch 0/24, Running Loss valid 3.2528137630886502, Time 0.0432736873626709, Overall 2.8736679553985596 
Fold 0/2, Epoch 1/24, Minibatch 0/0, Phase train, 

  0%|          | 0/5 [00:00<?, ?it/s]
100%|##########| 5/5 [00:00<?, ?it/s]

  0%|          | 0/5 [00:00<?, ?it/s]
100%|##########| 5/5 [00:00<?, ?it/s]

  0%|          | 0/5 [00:00<?, ?it/s]
100%|##########| 5/5 

## Setting Hyperparameters
`OpeNTF`'s codebase offers the following groups of hyperparameters for each neural team formation method:

### `model`
- Contains the baseline hyperparameters in the form of `'model-name' : { params }`, which allows the models to be integrated into the baseline with their unique parameters.
- Allows customizing which stages of the system are executed through `cmd`.
- Contains other training parameters for the models (e.g., temporal).

### `data`
- Contains parameters for manipulating datasets, including dataset filters (e.g., minimum team size) and bucket size for sparse matrix parallel generation.

### `fair`
- Contains parameters for the fairness metrics considered during team formation.

A snippet of the parameters used in `src/mdl/team2vec/params.py` is displayed as follows:

In [None]:
'''
During setting up the edge_types, if we want to include skill-skill or expert-expert connections,
the edge_type identifier "sm" or "stm" will have to be included as "sm.en" or "stm.en" [en = enhanced]
'''


settings = {
    'graph':{
        'edge_types':                   # this is an array holding the edge_type info in the form [('edge_type1', 'edge_type1_code'), ('edge_type2', 'edge_type2_code') ... ]
            # [('member', 'm')],
            # [([('skill', 'to', 'member')], 'sm')],
            # [([('skill', 'to', 'skill'), ('member', 'to', 'member'), ('skill', 'to', 'member')], 'sm')], # sm enhanced
            # [([('skill', 'to', 'team'), ('member', 'to', 'team')], 'stm')],
            # [([('skill', 'to', 'team'), ('member', 'to', 'team'), ('loc', 'to', 'team')], 'stml')],
            [([('skill', 'to', 'member')], 'sm'), ([('skill', 'to', 'team'), ('member', 'to', 'team')], 'stm'), ([('skill', 'to', 'team'), ('member', 'to', 'team'), ('loc', 'to', 'team')], 'stml')],  # sm, stm, stml
            # ([('skill', 'to', 'skill'), ('member', 'to', 'member'), ('skill', 'to', 'team'), ('member', 'to', 'team'), ('skill', 'to', 'member')], 'stm') # stm enhanced,
            # [([('skill', 'to', 'member')], 'sm'), ([('skill', 'to', 'team'), ('member', 'to', 'team')], 'stm')],
            # [([('skill', 'to', 'member')], 'sm'), ([('skill', 'to', 'team'), ('member', 'to', 'team'), ('location', 'to', 'team')], 'stml')],
            # [([('skill', 'to', 'skill'), ('member', 'to', 'member'), ('skill', 'to', 'member')], 'sm.en'), ([('skill', 'to', 'skill'), ('member', 'to', 'member'), ('skill', 'to', 'team'), ('member', 'to', 'team'), ('skill','to','member')], 'stm.en')], # sm stm strongly connected

        'custom_supervision' : False, # if false, it will take all the forward edge_types as supervision edges
        # 'supervision_edge_types': [([('skill', 'to', 'skill'), ('member', 'to', 'member'), ('skill', 'to', 'member')], 'sm'), ([('skill', 'to', 'skill'), ('member', 'to', 'member'), ('skill', 'to', 'team'), ('member', 'to', 'team'), ('skill', 'to', 'member')], 'stm')], # sm stm strongly connected
        'supervision_edge_types': [([('skill', 'to', 'member')], 'sm.en'), ([('skill', 'to', 'team'), ('member', 'to', 'team')], 'stm.en')],
        'dir': False,
        'dup_edge': ['add', 'mean', 'min', 'max', 'mul'],         #None: keep the duplicates, else: reduce by 'add', 'mean', 'min', 'max', 'mul'
    },
    'model': {
        'd' : 8,                    # embedding dim array
        'b' : 128,                  # batch_size for loaders
        'e' : 100,                  # num epochs
        'ns' : 5,                   # number of negative samples
        'lr': 0.001,
        'loader_shuffle': True,
        'num_workers': 0,
        'save_per_epoch': False,
        'pt' : 0,                   # 1 -> use pretrained d2v skill vectors as initial node features of graph data

        'gnn.gs': {
            'e' : 100,                # number of epochs
            'b' : 128,              # batch size
            'd' : 8,                # embedding dimension
            'ns' : 5,               # number of negative samples
            'h' : 2,                # number of attention heads (if applicable)
            'nn' : [30, 20],        # number of neighbors in each hop ([30, 20] -> 30 neighbors in first hop, 20 neighbors in second hop)
            'graph_types' : 'stm',   # graph type used for a single run (stm -> ste -> skill-team-expert)
            'agg' : 'mean',         # aggregation method used for merging multiple edges between the same source and destination node
            'dir' : False,          # whether the graph is directed
        },
        'gnn.gin': {
            'e': 100,
            'b': 128,
            'd': 8,
            'ns' : 5,
            'h': 2,
            'nn': [30, 20],
            'graph_types': 'stm',
            'agg': 'mean',
            'dir': False,
        },
        'gnn.gat': {
            'e': 100,
            'b': 128,
            'd': 8,
            'ns' : 5,
            'h': 2,
            'nn': [30, 20],
            'graph_types': 'stm',
            'agg': 'mean',
            'dir': False,
        },
    },
    'cmd' : ['graph', 'emb'],
}
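The settings above hold shared defaults (`d`, `b`, `e`, ...) next to model-specific blocks such as `'gnn.gat'`, and the command line can supply values like `--e 5 --d 8`. One plausible way to resolve the effective values is sketched below. The precedence (command line over model block over shared defaults) is an assumption made for this example, and `effective_params` is a hypothetical helper.

```python
def effective_params(model_settings, model_name, cli_overrides=None):
    """Overlay shared defaults, the model-specific block, and CLI values."""
    # shared defaults are the non-dict entries of the 'model' section
    shared = {k: v for k, v in model_settings.items() if not isinstance(v, dict)}
    params = {**shared, **model_settings.get(model_name, {})}
    # CLI values left unset (None) do not override anything
    params.update({k: v for k, v in (cli_overrides or {}).items() if v is not None})
    return params


# Trimmed copy of the 'model' section, for illustration
model_settings = {
    "d": 8,
    "e": 100,
    "lr": 0.001,
    "gnn.gat": {"e": 100, "d": 8, "h": 2, "nn": [30, 20]},
}
params = effective_params(model_settings, "gnn.gat", {"e": 5, "d": None})
print(params)
```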

## Structure and Inheritance

### Dataset Structure
<p align="center"><img src='dataset_hierarchy.png' width="500" ></p>

To integrate a new dataset into the baseline, follow the structure of the `team` class. Additional fields can be added, as in its derived classes. Ideally, only the `read_data()` function should be overridden.
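The pattern can be sketched as follows. The `Team` class here is a simplified, hypothetical stand-in (the real base class has more fields and a different constructor), and `Project` is an invented dataset that adds one extra field and overrides only `read_data()`.

```python
import csv
import io


class Team:
    """Hypothetical stand-in for the base class; the real one has more fields."""

    def __init__(self, id, members, skills):
        self.id = id
        self.members = members
        self.skills = skills

    @staticmethod
    def read_data(source):
        raise NotImplementedError


class Project(Team):
    """Hypothetical new dataset with one extra field, `year`."""

    def __init__(self, id, members, skills, year):
        super().__init__(id, members, skills)
        self.year = year

    @staticmethod
    def read_data(source):
        """Parse rows of `id,members,skills,year`; lists are ';'-separated."""
        teams = {}
        for row in csv.DictReader(source):
            teams[row["id"]] = Project(
                id=row["id"],
                members=row["members"].split(";"),
                skills=row["skills"].split(";"),
                year=int(row["year"]),
            )
        return teams


raw = io.StringIO("id,members,skills,year\np1,alice;bob,python;graphs,2024\n")
teams = Project.read_data(raw)
print(teams["p1"].members, teams["p1"].skills, teams["p1"].year)
```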



### Model Structure
<img src="./gnn_hierarchy.png" height=400px />

To integrate a new model into the GNN baseline, create the driver GNN model class and then plug that model into the `encoder` and `decoder` classes.
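The division of labor can be illustrated with a framework-free sketch: a driver model computes node embeddings, the encoder wraps whichever driver is selected by name, and the decoder scores a candidate link. All class names, the `MODELS` registry, and the dot-product scoring are assumptions for illustration; the actual classes are built on a deep learning framework and differ in detail.

```python
class NeighborSumGNN:
    """Hypothetical driver model: adds the neighbors' features to each node."""

    def __call__(self, features, neighbors):
        out = {}
        for node, h in features.items():
            out[node] = list(h)
            for nbr in neighbors.get(node, []):
                out[node] = [a + b for a, b in zip(out[node], features[nbr])]
        return out


MODELS = {"nsum": NeighborSumGNN}  # hypothetical registry of driver models


class Encoder:
    """Wraps the selected driver model and returns node embeddings."""

    def __init__(self, model_name):
        self.model = MODELS[model_name]()

    def __call__(self, features, neighbors):
        return self.model(features, neighbors)


class Decoder:
    """Scores a candidate edge as the dot product of its endpoint embeddings."""

    def __call__(self, z, edge):
        src, dst = edge
        return sum(a * b for a, b in zip(z[src], z[dst]))


features = {"s0": [1.0, 0.0], "t0": [0.0, 1.0]}
neighbors = {"s0": ["t0"], "t0": ["s0"]}
z = Encoder("nsum")(features, neighbors)
score = Decoder()(z, ("s0", "t0"))
print(score)
```

Adding a new model then amounts to writing one more driver class and registering it, while the encoder and decoder stay unchanged.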

# Additional Resources
- [`OpeNTF` codebase](https://github.com/fani-lab/OpeNTF/tree/main)
- [`OpeNTF on GNN` codebase](https://github.com/fani-lab/OpeNTF/tree/gnn)
- [`Adila` codebase](https://github.com/fani-lab/adila)
- [`vivaFemme` codebase](https://github.com/fani-lab/OpeNTF/tree/vivaFemme)
- [Streaming Training Strategy codebase](https://github.com/fani-lab/OpeNTF/tree/ecir24)
- [Tutorial Website and Materials](https://fani-lab.github.io/OpeNTF/tutorial/umap24/)
    - [`OpeNTF` paper](https://doi.org/10.1145/3511808.3557526)
    - [`Adila` paper](https://doi.org/10.1007/978-3-031-37249-0_9)
    - [`vivaFemme` paper](https://hosseinfani.github.io/res/papers/2024_BIAS_SIGIR_vivaFemme_Mitigating_Gender_Bias_in_Neural_Team_Recommendation_via_Female-Advocate_Loss_Regularization.pdf)
    - [Streaming Training Strategy paper](https://link.springer.com/chapter/10.1007/978-3-031-56027-9_20)

<img src="qr-code.png" height=300px />