<a href="https://colab.research.google.com/github/fani-lab/OpeNTF/blob/main/ipynb/gnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

`OpeNTF-GNN` via `PyG`

`OpeNTF` previously used traditional, non-graph embedding methods like `doc2vec` to learn skill embeddings as an alternative input to `1-hot` encoded skills. With graph neural networks (gnns) in `PyG`, it now integrates graph-based skill embeddings. The gnns capture the synergistic collaborative ties within the transformed graph data to provide better skill embeddings, or even to recommend experts for a team directly via link prediction.

**Expert (Member) Graph Structures**

<p align="center"><img src='https://raw.githubusercontent.com/fani-lab/OpeNTF/refs/heads/main/docs/graph_structures.png' width="400" ></p>

`OpeNTF` with gnns aims to cover as many graph structure variations as possible for a given set of team instances. It currently implements `heterogeneous`, `directed`, `unweighted` graph structures, including the `[[[skill, to, member]], sm]` bipartite, the `[[[skill, to, team], [member, to, team]], stm]` tripartite, and the `[[[skill, to, team], [member, to, team], [loc, to, team]], stml]` structures, as seen in the figure. The structure can be set like:

`"+data.embedding.model.gnn.graph.structure=[[[skill, to, team], [member, to, team], [loc, to, team]], stml]"`

(see [`src/mdl/emb/__config__.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/emb/__config__.yaml#L27) for more details)
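
To make the layout of such a structure value concrete, the sketch below assembles the override string from a list of `(source, relation, target)` triples plus a short tag. The helper `structure_override` is hypothetical and not part of `OpeNTF`; it only illustrates how the nested brackets are composed.

```python
# Hypothetical helper (not part of OpeNTF): builds the Hydra override string
# for a graph structure from (source, relation, target) edge-type triples.
KEY = "+data.embedding.model.gnn.graph.structure"

def structure_override(edge_types, tag):
    # One "[source, relation, target]" group per edge type, comma-separated.
    edges = ", ".join(f"[{s}, {r}, {t}]" for s, r, t in edge_types)
    return f"{KEY}=[[{edges}], {tag}]"

stml = [("skill", "to", "team"), ("member", "to", "team"), ("loc", "to", "team")]
print(structure_override(stml, "stml"))
```

The printed string is the same `stml` override shown above; passing `[("skill", "to", "member")]` with the tag `sm` yields the bipartite variant.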








---



**Transfer vs. End-to-End Learning with GNN**

Gnn methods on an expert graph can be used in either of the following ways:

<p align="center"><img src='https://raw.githubusercontent.com/fani-lab/OpeNTF/refs/heads/main/docs/transfer.png' width="500" ></p>

1.  **Transfer-based [[WISE24](https://doi.org/10.1007/978-981-96-0567-5_15), [IJCNN23](https://doi.org/10.1109/IJCNN54540.2023.10191717), [SIGIR21](https://doi.org/10.1145/3404835.3463105)]**: A gnn method is trained mainly to learn `skill` embeddings, overlooking the embeddings of other node types. The skill embeddings are then fed (transferred) into an underlying multilabel classifier, e.g., a non-variational feedforward neural net ([`src/mdl/fnn.py`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/fnn.py)) or a variational Bayesian neural net ([`src/mdl/bnn.py`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/bnn.py)). In this case, `OpeNTF` runs in embedding mode by setting `data.embedding.class_method` like

    - `data.embedding.class_method=mdl.emb.gnn.Gnn_n2v` for [Node2Vec](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.Node2Vec.html)
    - `data.embedding.class_method=mdl.emb.gnn.Gnn_m2v` for [MetaPath2Vec](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.MetaPath2Vec.html)
    - `data.embedding.class_method=mdl.emb.gnn.Gnn_gs` for [GraphSAGE](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.SAGEConv.html)


    (see [`src/__config__.yaml#L44`](https://github.com/fani-lab/OpeNTF/blob/main/src/__config__.yaml#L44) for more options)


    and the classifier model(s) are set by `models.instances` like

    `"models.instances=[mdl.fnn.Fnn, mdl.bnn.Bnn]"`

    (see [`src/__config__.yaml#L57`](https://github.com/fani-lab/OpeNTF/blob/main/src/__config__.yaml#L57) for more options)




```
python main.py  "cmd=[prep,train,test,eval]" \
                "models.instances=[mdl.fnn.Fnn, mdl.bnn.Bnn]" \
                data.domain=cmn.publication.Publication \
                data.source=../data/dblp/toy.dblp.v12.json \
                data.output=../output/dblp/toy.dblp.v12.json \
                ~data.filter \
                data.embedding.class_method=mdl.emb.gnn.Gnn_gs \
                "+data.embedding.model.gnn.graph.structure=[[[skill, to, team], [member, to, team], [loc, to, team]], stml]"
```
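
Conceptually, the transfer step swaps a team's sparse 1-hot skill vector for a dense vector built from the learned skill embeddings before the classifier sees it. The sketch below shows one way this could look. The toy embedding values and the choice of averaging are assumptions for illustration, not `OpeNTF`'s confirmed behaviour.

```python
# Toy skill embeddings standing in for what a gnn would learn (values made up).
skill_emb = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [1.0, 1.0],
}

def dense_input(onehot_skills, emb):
    """Average the embeddings of the skills switched on in a 1-hot vector.

    Averaging is an assumption of this sketch; other aggregations are possible.
    """
    active = [i for i, bit in enumerate(onehot_skills) if bit]
    if not active:
        raise ValueError("team has no skills")
    dim = len(next(iter(emb.values())))
    return [sum(emb[i][d] for i in active) / len(active) for d in range(dim)]

team = [1, 0, 1]                   # a team requiring skills 0 and 2
x = dense_input(team, skill_emb)   # dense input replacing the 1-hot vector
print(x)                           # [1.0, 0.5]
```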





---



<p align="center"><img src='https://raw.githubusercontent.com/fani-lab/OpeNTF/refs/heads/main/docs/e2e.png' width="500" ></p>

2.   **Graph Neural Team Recommendation (End-to-End) [[WSDM26, Under Review](https://)]**: A gnn method directly predicts expert-team links to recommend the top-k expert members of a team, skipping the underlying multilabel classifier, as shown above. In this case, `OpeNTF` runs in embedding mode by setting `data.embedding.class_method` as in the transfer-based approach, but the classifier model is fixed to `"models.instances=[mdl.emb.gnn.Gnn]"`




```
python main.py  "cmd=[prep,train,test,eval]" \
                "models.instances=[mdl.emb.gnn.Gnn]" \
                data.domain=cmn.publication.Publication \
                data.source=../data/dblp/toy.dblp.v12.json \
                data.output=../output/dblp/toy.dblp.v12.json \
                ~data.filter \
                data.embedding.class_method=mdl.emb.gnn.Gnn_gs \
                "+data.embedding.model.gnn.graph.structure=[[[skill, to, team], [member, to, team], [loc, to, team]], stml]"
```
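
The idea of recommending experts through link prediction can be sketched as scoring every candidate expert-team link and keeping the k highest-scoring experts. The example below uses a plain dot product between toy embeddings as the link score. The values and the dot-product scorer are assumptions for illustration, and the scoring inside `mdl.emb.gnn.Gnn` may differ.

```python
# Toy embeddings for one team node and four expert (member) nodes; the values
# are made up and stand in for what a trained gnn would produce.
team_emb = [1.0, 2.0]
member_emb = {
    "m0": [1.0, 0.0],
    "m1": [0.0, 1.0],
    "m2": [1.0, 1.0],
    "m3": [-1.0, 0.0],
}

def top_k_members(team, members, k):
    """Rank members by dot-product link score with the team; keep the top k."""
    def score(vec):
        return sum(a * b for a, b in zip(team, vec))
    ranked = sorted(members, key=lambda m: score(members[m]), reverse=True)
    return ranked[:k]

print(top_k_members(team_emb, member_emb, k=2))  # ['m2', 'm1']
```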






---



**Hyperparameters**

`OpeNTF` leverages [`hydra`](https://hydra.cc/) to manage model hyperparameters hierarchically:

*   [`src/__config__.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/__config__.yaml): `OpeNTF`'s main settings for the pipeline execution like `data.*`, `models.*`, `train.*`, `test.*`, `eval.*`
    *   [`src/mdl/__config__.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/__config__.yaml): models' training hyperparameters like `fnn.*`, `bnn.*`, `tntf.*`, `lr`, `batch_size`, ...
        * [`src/mdl/emb/__config__.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/emb/__config__.yaml): training hyperparameters for embedding methods including gnns like `n2v.*`, `m2v.*`, ..., `dim`, ...

To set these hyperparameters, either

- override them in the running command (recommended), or
- change the defaults in the `__config__.yaml` files


```
python main.py  "cmd=[prep,train,test,eval]" \
                "models.instances=[mdl.rnd.Rnd, mdl.fnn.Fnn, mdl.bnn.Bnn, mdl.emb.gnn.Gnn]" \
                data.domain=cmn.publication.Publication data.source=../data/dblp/toy.dblp.v12.json data.output=../output/dblp/toy.dblp.v12.json ~data.filter \
                data.embedding.class_method=mdl.emb.gnn.Gnn_gatv2 \
                +models.batch_size=2 +models.nsd=unigram_b
```
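
For intuition on how dotted overrides map onto the nested configuration, the sketch below applies a few of the overrides above to a plain Python dictionary. It is a simplified stand-in rather than `hydra`'s implementation: `apply_override` is a hypothetical helper, the defaults are toy values, and values are kept as strings.

```python
import copy

def apply_override(cfg, override):
    """Apply one command-line style override to a nested dict.

    Simplified: `key=value` and `+key=value` both set a dotted key, `~key`
    deletes it, and values stay strings (Hydra itself is stricter and also
    parses value types).
    """
    cfg = copy.deepcopy(cfg)
    if override.startswith("~"):
        *parents, leaf = override[1:].split(".")
        node = cfg
        for p in parents:
            node = node[p]
        del node[leaf]
        return cfg
    key, value = override.lstrip("+").split("=", 1)
    *parents, leaf = key.split(".")
    node = cfg
    for p in parents:
        node = node.setdefault(p, {})
    node[leaf] = value
    return cfg

defaults = {"data": {"filter": {"min_nteam": "1"}}, "models": {}}  # toy defaults
cfg = defaults
for o in ["~data.filter", "+models.batch_size=2", "+models.nsd=unigram_b"]:
    cfg = apply_override(cfg, o)
print(cfg)  # {'data': {}, 'models': {'batch_size': '2', 'nsd': 'unigram_b'}}
```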





---



**Setup & Quickstart**

From the [`quickstart`](https://colab.research.google.com/github/fani-lab/OpeNTF/blob/main/ipynb/quickstart.ipynb) script:


In [1]:
# set up python 3.8
!sudo apt-get update -y
!sudo apt-get install -y python3.8 python3.8-venv python3.8-distutils python3-pip
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 10
!python --version

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 https://cli.github.com/packages stable InRelease
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:10 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,289 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [3,526 kB

In [2]:
# get OpeNTF
!rm -R opentf/
!git clone https://github.com/Fani-Lab/opentf
!pip install --upgrade pip setuptools
!pip install -r opentf/requirements.txt

rm: cannot remove 'opentf/': No such file or directory
Cloning into 'opentf'...
remote: Enumerating objects: 22706, done.
remote: Counting objects: 100% (123/123), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 22706 (delta 56), reused 50 (delta 26), pack-reused 22583 (from 3)
Receiving objects: 100% (22706/22706), 1004.21 MiB | 24.56 MiB/s, done.
Resolving deltas: 100% (11259/11259), done.
Updating files: 100% (1595/1595), done.
Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 10.2 MB/s eta 0:00:00
Collecting setuptools
  Downloading setuptools-75.3.2-py3-none-any.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 15.9 MB/s eta 0:00:00
Installing collected packages: setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 68.1.2
    Not u

Collecting hydra-core==1.3.2 (from -r opentf/requirements.txt (line 3))
  Downloading hydra_core-1.3.2-py3-none-any.whl.metadata (5.5 kB)
Collecting scipy==1.10.1 (from -r opentf/requirements.txt (line 4))
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Collecting numpy==1.24.4 (from -r opentf/requirements.txt (line 5))
  Downloading numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting omegaconf<2.4,>=2.2 (from hydra-core==1.3.2->-r opentf/requirements.txt (line 3))
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting antlr4-python3-runtime==4.9.* (from hydra-core==1.3.2->-r opentf/requirements.txt (line 3))
  Downloading antlr4-python3-runtime-4.9.3.tar.gz (117 kB)
  Preparing metadata (setup.py) ... done
Collecting packaging (from hydra-core==1.3.2->-r opentf/requirements.txt (line 3))
  Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
Col

**Graph Neural Team Recommendation**

- Preprocess the raw data (`cmd=prep`) into the `teamsvecs` sparse matrices, each row of which is a team with its skills `teamsvecs['skill']` and members `teamsvecs['member']` as 1-hot vectors
- `data.domain`, `data.source`, and `data.output` point to the `toy.dblp` dataset, available in the codebase at [`OpeNTF/data/dblp`](https://github.com/fani-lab/OpeNTF/tree/main/data/dblp)
- No filtering (`~data.filter`) by min team size `data.filter.min_team_size` or by min number of teams per expert `data.filter.min_nteam`
- End-to-End GraphSAGE (`mdl.emb.gnn.Gnn_gs`) for team recommendation in the `train`, `test`, and `eval` steps on the `skill-team-expert-location` (`stml`) graph structure.
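
As a rough picture of the `teamsvecs` layout described in the first bullet, the sketch below builds one 0/1 row per team for skills and for members. The teams are made up, dense Python lists stand in for the sparse matrices, and `onehot_rows` is a hypothetical helper rather than `OpeNTF`'s preprocessing code.

```python
# Toy teams (made up): each team lists its skills and members.
teams = [
    {"skill": ["ml", "db"], "member": ["ann", "bob"]},
    {"skill": ["ml"], "member": ["bob", "cat"]},
]

def onehot_rows(teams, field):
    """Return (index, rows): a vocabulary index and one 0/1 row per team."""
    vocab = sorted({item for team in teams for item in team[field]})
    index = {item: i for i, item in enumerate(vocab)}
    rows = []
    for team in teams:
        row = [0] * len(vocab)
        for item in team[field]:
            row[index[item]] = 1
        rows.append(row)
    return index, rows

teamsvecs = {f: onehot_rows(teams, f)[1] for f in ("skill", "member")}
print(teamsvecs["skill"])   # [[1, 1], [0, 1]]        columns: db, ml
print(teamsvecs["member"])  # [[1, 1, 0], [0, 1, 1]]  columns: ann, bob, cat
```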


In [4]:
%cd opentf/src/
!python main.py "cmd=[prep,train,test,eval]" "models.instances=[mdl.emb.gnn.Gnn]" data.domain=cmn.publication.Publication data.source=../data/dblp/toy.dblp.v12.json data.output=../output/dblp/toy.dblp.v12.json ~data.filter data.embedding.class_method=mdl.emb.gnn.Gnn_gs "+data.embedding.model.gnn.graph.structure=[[[skill, to, team], [member, to, team], [loc, to, team]], stml]"


/content/opentf/src
[2025-11-06 21:21:38,170][cmn.team][INFO] - Loading teamsvecs matrices from ../output/dblp/toy.dblp.v12.json/teamsvecs.pkl ...
[2025-11-06 21:21:38,171][pkgmgr][INFO] - tqdm not found.
[2025-11-06 21:21:38,171][pkgmgr][INFO] - Installing tqdm...
[2025-11-06 21:21:39,239][pkgmgr][INFO] - Collecting tqdm==4.65.0
  Downloading tqdm-4.65.0-py3-none-any.whl.metadata (56 kB)
Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.65.0

[2025-11-06 21:21:39,244][cmn.team][INFO] - Loading indexes pickle from ../output/dblp/toy.dblp.v12.json/indexes.pkl ...
[2025-11-06 21:21:39,244][cmn.team][INFO] - Indexes pickle is loaded.
[2025-11-06 21:21:39,244][cmn.team][INFO] - Teamsvecs matrices and indexes for skills (31, 10), members (31, 13), and locations (31, 29) are loaded.
[2025-11-06 21:21:39,245][__main__][INFO] - Loading splits from ../output/dblp/toy.dblp.v12.json/splits.f3.r0.85.pkl ...
[2025-11-06 21:21:39,245][

In [8]:
!ls ../output/dblp/toy.dblp.v12.json

indexes.pkl		  splits.f3.r0.85.pkl  stml.mean.graph.pkl
prep.train.test.eval.log  stm.add.graph.pkl    stm.mean.graph.pkl
skillcoverage.pkl	  stm.dup.graph.pkl    teams.pkl
skill.docs.pkl		  stml.add.graph.pkl   teamsvecs.pkl
splits.f3.r0.85		  stml.dup.graph.pkl


In [10]:
!ls ../output/dblp/toy.dblp.v12.json/splits.f3.r0.85/gs.b1000.e100.ns5.lr0.001.es5.spe10.d128.add.stml.h128.nn30-20

f0.e0.pt			       f1.test.pred.eval.mean.csv
f0.pt				       f1.test.pred.eval.per_instance.csv
f0.test.e0.pred			       f2.e0.pt
f0.test.e0.pred.eval.mean.csv	       f2.pt
f0.test.e0.pred.eval.per_instance.csv  f2.test.e0.pred
f0.test.pred			       f2.test.e0.pred.eval.mean.csv
f0.test.pred.eval.mean.csv	       f2.test.e0.pred.eval.per_instance.csv
f0.test.pred.eval.per_instance.csv     f2.test.pred
f1.e0.pt			       f2.test.pred.eval.mean.csv
f1.pt				       f2.test.pred.eval.per_instance.csv
f1.test.e0.pred			       logs4tboard
f1.test.e0.pred.eval.mean.csv	       ntf.
f1.test.e0.pred.eval.per_instance.csv  test.pred.eval.mean.csv
f1.test.pred			       test.pred.eval.per_instance_mean.csv


In [11]:
import pandas as pd
pd.read_csv('/content/opentf/output/dblp/toy.dblp.v12.json/splits.f3.r0.85/gs.b1000.e100.ns5.lr0.001.es5.spe10.d128.add.stml.h128.nn30-20/test.pred.eval.mean.csv', index_col = 0)


|  | mean | std |
|---|---|---|
| P_2 | 0.2 | 0.2 |
| P_5 | 0.16 | 0.069282 |
| P_10 | 0.18 | 0.034641 |
| recall_2 | 0.166667 | 0.166667 |
| recall_5 | 0.333333 | 0.145297 |
| recall_10 | 0.811111 | 0.15396 |
| ndcg_cut_2 | 0.2 | 0.2 |
| ndcg_cut_5 | 0.270506 | 0.164425 |
| ndcg_cut_10 | 0.452995 | 0.163011 |
| map_cut_2 | 0.15 | 0.169148 |
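
To read the `P_k` and `recall_k` rows above, it may help to see the textbook definitions of precision and recall at cutoff k on a single toy ranking. This is a minimal sketch with made-up data; `OpeNTF`'s evaluation code may compute the metrics differently in detail.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended experts who are true team members."""
    return sum(1 for m in ranked[:k] if m in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of the true team members found within the top-k."""
    return sum(1 for m in ranked[:k] if m in relevant) / len(relevant)

ranked = ["m2", "m1", "m0", "m3", "m4"]   # toy ranking, best first
relevant = {"m1", "m4"}                   # toy ground-truth members

print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 2))     # 0.5
print(recall_at_k(ranked, relevant, 5))     # 1.0
```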
