<a href="https://colab.research.google.com/github/fani-lab/OpeNTF/blob/main/ipynb/nmt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="center"><img src='https://raw.githubusercontent.com/fani-lab/OpeNTF/cd22f8e183cacbb22f43c2c1e54948cd876975ac/docs/figs/opentf-openmt-logo.png' width="100" ></p>

`OpeNTF-Seq2Seq` via `OpenNMT-py`

`OpeNTF` previously viewed the team formation problem as a multi-label classification task and integrated feedforward neural classifiers to map the vector representation of the required skills in the input layer to the `1-hot` encoded vector of experts in the output layer. However, the problem can also be viewed as a `seq-to-seq` prediction or `translation` task that maps a variable-length input sequence of required skills onto a variable-length output sequence of predicted experts. This view leverages autoregression and global attention, which capture dependencies among experts that the independent per-expert probabilities of multi-label classification cannot. We integrated [`OpenNMT-py`](https://github.com/OpenNMT/OpenNMT-py) via a wrapper class [`mdl.nmt.Nmt`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/nmt.py) to utilize modern `transformers` and `encoder-decoder` recurrent models with `attention` mechanisms.



---



**Translative Team Recommendation [`[SIGIR25]`](https://dl.acm.org/doi/10.1145/3726302.3730259)**

<p align="center"><img src='https://raw.githubusercontent.com/fani-lab/OpeNTF/refs/heads/main/docs/figs/s2s.png' width="200" ></p>

To run `OpeNTF` in translation mode, set the model instance to `mdl.nmt.Nmt` in [`src/__config__.yaml#L59`](https://github.com/fani-lab/OpeNTF/blob/main/src/__config__.yaml#L59) or override it on the command line:

`"models.instances=[mdl.nmt.Nmt]"`


```
python main.py  "cmd=[prep,train,test,eval]" \
                "models.instances=[mdl.nmt.Nmt]" \
                data.domain=cmn.publication.Publication \
                data.source=../data/dblp/toy.dblp.v12.json \
                data.output=../output/dblp/toy.dblp.v12.json \
                ~data.filter
```




---



**Seq-to-Seq Model Selection and Hyperparameters**

In [`mdl.nmt.Nmt`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/nmt.py), we prepare the source and target sets as a parallel corpus, from a `source language` whose tokens are `skills` to a `target language` whose tokens are `experts`, and call `OpenNMT-py`'s executables by spawning a new process via Python's `subprocess`.
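As a rough illustration of the parallel-corpus idea, the sketch below turns each team's skill and member vectors into one source line and one target line. The token scheme (`s<idx>` for skills, `m<idx>` for experts) and the helper names are assumptions made for this example, not necessarily what `mdl.nmt.Nmt` writes to its `src-*.txt` and `tgt-*.txt` files.

```python
from typing import List, Sequence, Tuple


def team_to_parallel_lines(skill_row: Sequence[int], member_row: Sequence[int]) -> Tuple[str, str]:
    """Turn one team's 0/1 skill and member rows into a (source, target) sentence pair."""
    # Hypothetical token scheme: 's<idx>' for skills, 'm<idx>' for experts.
    src = " ".join(f"s{i}" for i, v in enumerate(skill_row) if v)
    tgt = " ".join(f"m{i}" for i, v in enumerate(member_row) if v)
    return src, tgt


def build_parallel_corpus(skills: Sequence[Sequence[int]],
                          members: Sequence[Sequence[int]]) -> Tuple[List[str], List[str]]:
    """Build aligned source/target line lists, one line per team."""
    pairs = [team_to_parallel_lines(s, m) for s, m in zip(skills, members)]
    return [p[0] for p in pairs], [p[1] for p in pairs]


# Toy data: 2 teams, 4 skills, 3 experts.
skills = [[1, 0, 1, 0], [0, 1, 0, 0]]
members = [[0, 1, 1], [1, 0, 0]]
src_lines, tgt_lines = build_parallel_corpus(skills, members)
print(src_lines)  # ['s0 s2', 's1']
print(tgt_lines)  # ['m1 m2', 'm0']
```

Line `i` of the source file and line `i` of the target file describe the same team, which is the alignment a translation toolkit expects from a parallel corpus.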

`OpeNTF` leverages [`OpenNMT-py`](https://opennmt.net/OpenNMT-py/quickstart.html) to manage model selection and hyperparameters. For instance, to pick a `transformer`, set the required hyperparameters in the config file [`src/mdl/__config__.nmt.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/__config__.nmt.yaml) as below:

```
encoder_type: transformer
decoder_type: transformer

position_encoding: False # input skills and output members have no particular order.

enc_layers: 4
dec_layers: 4

hidden_size: 128
transformer_ff: 512
attention_dropout: 0.2
heads: 8

beam_size: 10
n_best: 1
min_length: 2
max_length: 100
```
(see default settings for common models in [`src/mdl/__config__.nmt.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/__config__.nmt.yaml))

(see [`OpenNMT-py`](https://opennmt.net/OpenNMT-py/quickstart.html)'s docs for more details)
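To make the `subprocess` hand-off more concrete, here is a minimal sketch that assembles the command lines for `OpenNMT-py`'s console scripts (`onmt_build_vocab`, `onmt_train`, `onmt_translate`) from a per-fold config file. The helper names `onmt_commands` and `run_all` are hypothetical, the file names are borrowed from the output listing of this notebook, and the exact flags that `mdl.nmt.Nmt` passes may differ.

```python
import shutil
import subprocess
from typing import List


def onmt_commands(config: str, model: str, src_test: str, pred_out: str) -> List[List[str]]:
    """Assemble argv lists for vocabulary building, training, and translation."""
    return [
        ["onmt_build_vocab", "-config", config],
        ["onmt_train", "-config", config],
        ["onmt_translate", "-model", model, "-src", src_test, "-output", pred_out],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in its own process; skip it if the executable is missing."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            print(f"skipping (not installed): {' '.join(argv)}")
            continue
        subprocess.run(argv, check=True)  # raises CalledProcessError on failure


# Assumed file names for fold 0; adjust to the actual output folder.
cmds = onmt_commands("f0.config.yml", "f0._step_6.pt", "src-test.txt", "f0.test.pred")
for argv in cmds:
    print(" ".join(argv))
```

Passing each command as a list rather than a single shell string avoids quoting problems in paths, and `check=True` surfaces a failed training run instead of silently continuing to translation.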




---



**Setup & Quickstart**

From the [`quickstart`](https://colab.research.google.com/github/fani-lab/OpeNTF/blob/main/ipynb/quickstart.ipynb) script:


In [8]:
# set up python 3.8
!sudo apt-get update -y
!sudo apt-get install -y python3.8 python3.8-venv python3.8-distutils python3-pip
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 10
!python --version

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
python3-pip is already the newest version (22.0.2+dfsg-1ubuntu0.7).

In [9]:
# get OpeNTF
!rm -R opentf/
!git clone https://github.com/Fani-Lab/opentf
!pip install --upgrade pip setuptools
!pip install -r opentf/requirements.txt

rm: cannot remove 'opentf/': No such file or directory
Cloning into 'opentf'...
remote: Enumerating objects: 26923, done.
remote: Counting objects: 100% (98/98), done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 26923 (delta 61), reused 28 (delta 25), pack-reused 26825 (from 3)
Receiving objects: 100% (26923/26923), 1.32 GiB | 24.44 MiB/s, done.
Resolving deltas: 100% (13333/13333), done.
Updating files: 100% (4274/4274), done.


**Translative Neural Team Recommendation**

- Preprocessing raw data (`cmd=prep`) into the `teamsvecs` sparse matrices, where each row is a team with its skills (`teamsvecs['skill']`) and members (`teamsvecs['member']`) as 1-hot vectors
- `data.domain`, `data.source`, `data.output` from `toy.dblp` dataset, available at the codebase [`OpeNTF/data/dblp`](https://github.com/fani-lab/OpeNTF/tree/main/data/dblp)
- No filtering for min team size `data.filter.min_team_size` and min number of teams per expert `data.filter.min_nteam`
- Neural machine translation (`mdl.nmt.Nmt`) for team recommendation for `train`, `test`, and `eval` steps
- The model and its hyperparameters in [`src/mdl/__config__.nmt.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/__config__.nmt.yaml)
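The `teamsvecs` layout from the first bullet can be sketched as follows. This is an illustration only: it uses dense Python lists instead of the sparse matrices `OpeNTF` stores, and the raw team records, field names, and helper functions are made up for the example.

```python
from typing import Dict, List


def build_index(items_per_team: List[List[str]]) -> Dict[str, int]:
    """Map each distinct item to a column index, in first-seen order."""
    index: Dict[str, int] = {}
    for items in items_per_team:
        for item in items:
            index.setdefault(item, len(index))
    return index


def encode_rows(items_per_team: List[List[str]], index: Dict[str, int]) -> List[List[int]]:
    """Encode each team as a 0/1 row with a 1 in the column of every item it has."""
    rows = []
    for items in items_per_team:
        row = [0] * len(index)
        for item in items:
            row[index[item]] = 1
        rows.append(row)
    return rows


# Hypothetical raw teams (e.g., papers with keywords and authors).
teams = [
    {"skills": ["nlp", "ir"], "members": ["alice", "bob"]},
    {"skills": ["ir"], "members": ["carol"]},
]
skill_lists = [t["skills"] for t in teams]
member_lists = [t["members"] for t in teams]

teamsvecs = {
    "skill": encode_rows(skill_lists, build_index(skill_lists)),
    "member": encode_rows(member_lists, build_index(member_lists)),
}
print(teamsvecs["skill"])   # [[1, 1], [0, 1]]
print(teamsvecs["member"])  # [[1, 1, 0], [0, 0, 1]]
```

Each matrix has one row per team, so the shapes reported in the training log (for example, skills `(31, 10)` and members `(31, 13)`) read as 31 teams over 10 skills and 13 experts.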

In [10]:
# Due to uninstallation bug in colab for 'blinker'. In a container/server, OpeNTF installs it on-demand.
!pip install OpenNMT-py==3.3 --ignore-installed

Collecting OpenNMT-py==3.3
  Using cached OpenNMT_py-3.3-py3-none-any.whl.metadata (6.4 kB)
Collecting torch<2.1,>=1.13 (from OpenNMT-py==3.3)
  Using cached torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl.metadata (24 kB)
Collecting configargparse (from OpenNMT-py==3.3)
  Using cached configargparse-1.7.1-py3-none-any.whl.metadata (24 kB)
Collecting ctranslate2<4,>=3.2 (from OpenNMT-py==3.3)
  Using cached ctranslate2-3.24.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting tensorboard>=2.3 (from OpenNMT-py==3.3)
  Using cached tensorboard-2.14.0-py3-none-any.whl.metadata (1.8 kB)
Collecting flask (from OpenNMT-py==3.3)
  Using cached flask-3.0.3-py3-none-any.whl.metadata (3.2 kB)
Collecting waitress (from OpenNMT-py==3.3)
  Using cached waitress-3.0.0-py3-none-any.whl.metadata (4.2 kB)
Collecting pyonmttok<2,>=1.35 (from OpenNMT-py==3.3)
  Using cached pyonmttok-1.37.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting 

In [11]:
%cd opentf/src/
!python main.py "cmd=[prep,train,test,eval]" "models.instances=[mdl.rnd.Rnd, mdl.nmt.Nmt]" data.domain=cmn.publication.Publication data.source=../data/dblp/toy.dblp.v12.json data.output=../output/dblp/toy.dblp.v12.json ~data.filter


/content/opentf/src/opentf/src
[2026-01-22 15:45:35,236][cmn.team][INFO] - Loading teamsvecs matrices from ../output/dblp/toy.dblp.v12.json/teamsvecs.pkl ...
[2026-01-22 15:45:35,252][cmn.team][INFO] - Loading indexes pickle from ../output/dblp/toy.dblp.v12.json/indexes.pkl ...
[2026-01-22 15:45:35,253][cmn.team][INFO] - Indexes pickle is loaded.
[2026-01-22 15:45:35,253][cmn.team][INFO] - Teamsvecs matrices and indexes for skills (31, 10), members (31, 13), and locations (31, 29) are loaded.
[2026-01-22 15:45:35,254][__main__][INFO] - Loading splits from ../output/dblp/toy.dblp.v12.json/splits.f3.r0.85.pkl ...
[2026-01-22 15:45:35,256][cmn.team][INFO] - Loading member-skill co-occurrence matrix (13, 10) from ../output/dblp/toy.dblp.v12.json/splits.f3.r0.85/skillcoverage.pkl ...
[2026-01-22 15:45:39,808][__main__][INFO] - Training team recommender instance mdl.rnd.Rnd ...
[2026-01-22 15:45:39,808][__main__][INFO] - Testing team recommender instance mdl.rnd.Rnd ...
[

In [12]:
!ls ../output/dblp/toy.dblp.v12.json

c2g.pkl      indexes.pkl     splits.f3.r0.85	  stm.add.graph.pkl
females.csv  skill.docs.pkl  splits.f3.r0.85.pkl  teamsvecs.pkl


In [17]:
!ls ../output/dblp/toy.dblp.v12.json/splits.f3.r0.85/nmt.b1000.e100.lr0.001.es5.spe10.enctransformer

f0.config.yml			   f1.test.pred.eval.mean.csv
f0.src-train.txt		   f1.tgt-train.txt
f0.src-valid.txt		   f1.tgt-valid.txt
f0._step_6.pt			   f1.vocab.src
f0.test.e6.pred			   f1.vocab.tgt
f0.test.e6.pred_		   f2.config.yml
f0.test.e6.pred.eval.instance.csv  f2.src-train.txt
f0.test.e6.pred.eval.mean.csv	   f2.src-valid.txt
f0.test.pred			   f2._step_6.pt
f0.test.pred.eval.instance.csv	   f2.test.e6.pred
f0.test.pred.eval.mean.csv	   f2.test.e6.pred_
f0.tgt-train.txt		   f2.test.e6.pred.eval.instance.csv
f0.tgt-valid.txt		   f2.test.e6.pred.eval.mean.csv
f0.vocab.src			   f2.test.pred
f0.vocab.tgt			   f2.test.pred.eval.instance.csv
f1.config.yml			   f2.test.pred.eval.mean.csv
f1.src-train.txt		   f2.tgt-train.txt
f1.src-valid.txt		   f2.tgt-valid.txt
f1._step_6.pt			   f2.vocab.src
f1.test.e6.pred			   f2.vocab.tgt
f1.test.e6.pred_		   src-test.txt
f1.test.e6.pred.eval.instance.csv  test.pred.eval.instance_mean.csv
f1.test.e6.pred.eval.mean.csv	   test.pred.eval.mean.csv
f1.test.pred	

In [15]:
import pandas as pd
pd.read_csv('/content/opentf/output/dblp/toy.dblp.v12.json/splits.f3.r0.85/nmt.b1000.e100.lr0.001.es5.spe10.enctransformer/test.pred.eval.mean.csv', index_col = 0)


metrics,mean,std
P_2,0.166667,0.057735
P_5,0.146667,0.023094
P_10,0.153333,0.011547
recall_2,0.166667,0.057735
recall_5,0.333333,0.057735
recall_10,0.666667,0.057735
ndcg_cut_2,0.204382,0.0708
ndcg_cut_5,0.28813,0.0708
ndcg_cut_10,0.430639,0.0708
map_cut_2,0.166667,0.057735
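
The metric names follow the trec_eval convention, where `P_k` and `recall_k` are precision and recall over the top `k` predicted experts. As a reading aid, the sketch below computes both for a single team; the function names and the toy ranking are illustrative and are not taken from the `OpeNTF` evaluation code.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k predicted experts that are true team members."""
    hits = sum(1 for expert in ranked[:k] if expert in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the true team members found among the top-k predictions."""
    if not relevant:
        return 0.0
    hits = sum(1 for expert in ranked[:k] if expert in relevant)
    return hits / len(relevant)


# Toy example: a predicted ranking and the true members of one test team.
ranked = ["m3", "m7", "m1", "m9", "m2"]
relevant = {"m1", "m2", "m5"}

print(precision_at_k(ranked, relevant, 2))  # 0.0
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall_at_k(ranked, relevant, 5))     # 2 of 3 true members are in the top 5
```

Per-team values like these are then averaged over the test teams of each fold.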
