<a href="https://colab.research.google.com/github/fani-lab/OpeNTF/blob/main/nmt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="center"><img src='https://raw.githubusercontent.com/fani-lab/OpeNTF/cd22f8e183cacbb22f43c2c1e54948cd876975ac/docs/figs/opentf-openmt-logo.png' width="100" ></p>

`OpeNTF-Seq2Seq` via `OpenNMT-py`

`OpeNTF` previously framed team formation as a multi-label classification task, using feedforward neural classifiers to map the vector representation of the required skills in the input layer to the `1-hot` encoded vector of experts in the output layer. The problem can also be viewed as a `seq-to-seq` prediction or `translation` task that maps a variable-length input sequence of required skills onto a variable-length output sequence of predicted experts. This view leverages autoregressive decoding and global attention, which capture dependencies among experts that the independent per-expert probabilities of multi-label classification miss. We integrated [`OpenNMT-py`](https://github.com/OpenNMT/OpenNMT-py) via the wrapper class [`mdl.nmt.Nmt`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/nmt.py) to use modern `transformers` and `encoder-decoder` recurrent models with `attention` mechanisms.
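To make the two views concrete, the sketch below encodes one toy team both ways. The expert names and the helper functions are illustrative assumptions, not OpeNTF's actual encoding.

```python
# Toy vocabulary of experts; the names are hypothetical.
experts = ["e0", "e1", "e2", "e3"]
expert_index = {e: i for i, e in enumerate(experts)}

def to_multi_hot(team_members):
    """Multi-label view: a fixed-length 0/1 vector over all experts."""
    vec = [0] * len(experts)
    for m in team_members:
        vec[expert_index[m]] = 1
    return vec

def to_target_sequence(team_members):
    """Seq-to-seq view: a variable-length sequence of expert tokens."""
    return " ".join(team_members)

team = ["e1", "e3"]
multi_hot = to_multi_hot(team)
sequence = to_target_sequence(team)
print(multi_hot)  # [0, 1, 0, 1]
print(sequence)   # e1 e3
```

The vector always has one slot per expert in the vocabulary, while the sequence grows only with the size of the team.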



---



**Translative Team Recommendation [`[SIGIR25]`](https://dl.acm.org/doi/10.1145/3726302.3730259)**

<p align="center"><img src='https://raw.githubusercontent.com/fani-lab/OpeNTF/refs/heads/main/docs/figs/s2s.png' width="200" ></p>

To run `OpeNTF` in translation mode, set the model instance to `mdl.nmt.Nmt` in [`src/__config__.yaml#L59`](https://github.com/fani-lab/OpeNTF/blob/main/src/__config__.yaml#L59):

`"models.instances=[mdl.nmt.Nmt]"`


```
python main.py  "cmd=[prep,train,test,eval]" \
                "models.instances=[mdl.nmt.Nmt]" \
                data.domain=cmn.publication.Publication \
                data.source=../data/dblp/toy.dblp.v12.json \
                data.output=../output/dblp/toy.dblp.v12.json \
                ~data.filter
```




---



**Seq-to-Seq Model Selection and Hyperparameters**

In [`mdl.nmt.Nmt`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/nmt.py), we prepare the source and target sets as a parallel corpus, from a `source language` whose tokens are `skills` to a `target language` whose tokens are `experts`, and call `OpenNMT-py`'s executables by spawning a new process via Python's `subprocess`.
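The idea can be sketched as follows. The file names, the `s`/`m` token prefixes, and the config path are hypothetical; `onmt_train` is one of OpenNMT-py's command-line entry points, but the exact arguments the wrapper passes are not shown in this notebook, so the command is only built here and never executed.

```python
import os
import tempfile

# Hypothetical toy teams: (required skills, member experts).
teams = [
    (["s1", "s4"], ["m2", "m7"]),
    (["s3"], ["m1", "m2", "m5"]),
]

def write_parallel_files(teams, out_dir):
    """Write one team per line: skills to the source file, experts to the target file."""
    src_path = os.path.join(out_dir, "src.txt")  # hypothetical file name
    tgt_path = os.path.join(out_dir, "tgt.txt")  # hypothetical file name
    with open(src_path, "w") as src, open(tgt_path, "w") as tgt:
        for skills, members in teams:
            src.write(" ".join(skills) + "\n")
            tgt.write(" ".join(members) + "\n")
    return src_path, tgt_path

out_dir = tempfile.mkdtemp()
src_path, tgt_path = write_parallel_files(teams, out_dir)

# Command that could be handed to subprocess.run(...); not executed in this sketch.
# The config path is an assumption for illustration.
train_cmd = ["onmt_train", "-config", os.path.join(out_dir, "config.yaml")]

with open(src_path) as f:
    print(f.read())
print(train_cmd)
```

Line `i` of the source file and line `i` of the target file describe the same team, which is the alignment a parallel corpus requires.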

`OpeNTF` leverages [`OpenNMT-py`](https://opennmt.net/OpenNMT-py/quickstart.html) to manage model selection and hyperparameters. For instance, to pick a `transformer`, set the required hyperparameters in the config file [`src/mdl/__config__.nmt.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/__config__.nmt.yaml) as below:

```
encoder_type: transformer
decoder_type: transformer

position_encoding: False # input skills and output members have no particular order

enc_layers: 4
dec_layers: 4

hidden_size: 128
transformer_ff: 512
attention_dropout: 0.2
heads: 8

beam_size: 10
n_best: 1
min_length: 2
max_length: 100
```
(see default settings for common models in [`src/mdl/__config__.nmt.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/__config__.nmt.yaml))

(see [`OpenNMT-py`](https://opennmt.net/OpenNMT-py/quickstart.html)'s docs for more details)
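A few of these values constrain one another. In standard multi-head attention the model dimension is split evenly across heads, so `hidden_size` should be divisible by `heads` (here 128 / 8 = 16 per head). The sketch below runs some basic sanity checks on a plain dict that stands in for the YAML file; the helper `check_transformer_config` and the choice of checks are assumptions for illustration, not part of OpeNTF or OpenNMT-py.

```python
# Values copied from the config fragment above; a dict stands in for the YAML file.
config = {
    "enc_layers": 4, "dec_layers": 4,
    "hidden_size": 128, "transformer_ff": 512, "heads": 8,
    "beam_size": 10, "n_best": 1, "min_length": 2, "max_length": 100,
}

def check_transformer_config(cfg):
    """Return a list of problems found; an empty list means the basic checks pass."""
    problems = []
    if cfg["hidden_size"] % cfg["heads"] != 0:
        problems.append("hidden_size must be divisible by heads")
    if cfg["n_best"] > cfg["beam_size"]:
        problems.append("n_best cannot exceed beam_size")
    if cfg["min_length"] > cfg["max_length"]:
        problems.append("min_length cannot exceed max_length")
    return problems

problems = check_transformer_config(config)
head_dim = config["hidden_size"] // config["heads"]
print(problems, head_dim)  # [] 16
```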




---



**Setup & Quickstart**

From the [`quickstart`](https://colab.research.google.com/github/fani-lab/OpeNTF/blob/main/ipynb/quickstart.ipynb) script:


In [1]:
# set up python 3.8
!sudo apt-get update -y
!sudo apt-get install -y python3.8 python3.8-venv python3.8-distutils python3-pip
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 10
!python --version

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://cli.github.com/packages stable InRelease [3,917 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [83.8 kB]
Get:8 https://cli.github.com/packages stable/main amd64 Packages [356 B]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:12 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,878 kB]
Get:13 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,637 kB]
Get:14 

In [None]:
# get OpeNTF
!rm -R opentf/
!git clone https://github.com/Fani-Lab/opentf
!pip install --upgrade pip setuptools
!pip install -r opentf/requirements.txt

rm: cannot remove 'opentf/': No such file or directory
Cloning into 'opentf'...
remote: Enumerating objects: 26917, done.
remote: Counting objects: 100% (92/92), done.
remote: Compressing objects: 100% (69/69), done.


**Translative Neural Team Recommendation**

- Preprocess the raw data (`cmd=prep`) into the `teamsvecs` sparse matrix, where each row is a team with its skills `teamsvecs['skill']` and members `teamsvecs['member']` as 1-hot vectors
- `data.domain`, `data.source`, and `data.output` point to the `toy.dblp` dataset, available in the codebase at [`OpeNTF/data/dblp`](https://github.com/fani-lab/OpeNTF/tree/main/data/dblp)
- No filtering (`~data.filter`) on minimum team size `data.filter.min_team_size` or minimum number of teams per expert `data.filter.min_nteam`
- Neural machine translation (`mdl.nmt.Nmt`) for team recommendation in the `train`, `test`, and `eval` steps
- The model and its hyperparameters in [`src/mdl/__config__.nmt.yaml`](https://github.com/fani-lab/OpeNTF/blob/main/src/mdl/__config__.nmt.yaml)
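The run in this notebook turns filtering off with `~data.filter`. For intuition about what the two thresholds would do when enabled, here is a minimal sketch on toy data. The function name, the parameter names, the order in which the two filters are applied, and the data are all assumptions, not OpeNTF's implementation.

```python
from collections import Counter

def filter_teams(teams, min_members_per_team, min_teams_per_expert):
    """Drop experts who appear in too few teams, then drop teams that end up too small."""
    counts = Counter(m for members in teams.values() for m in members)
    kept = {}
    for team_id, members in teams.items():
        active = [m for m in members if counts[m] >= min_teams_per_expert]
        if len(active) >= min_members_per_team:
            kept[team_id] = active
    return kept

# Hypothetical teams: team id -> member experts.
teams = {
    "t1": ["m1", "m2"],
    "t2": ["m1", "m3"],
    "t3": ["m2", "m1", "m4"],
}
filtered = filter_teams(teams, min_members_per_team=2, min_teams_per_expert=2)
print(filtered)  # {'t1': ['m1', 'm2'], 't3': ['m2', 'm1']}
```

Here `m3` and `m4` each appear in a single team and are dropped, which leaves `t2` with one member and removes it as well.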

In [None]:
%cd opentf/src/
!python main.py "cmd=[prep,train,test,eval]" "models.instances=[mdl.nmt.Nmt]" data.domain=cmn.publication.Publication data.source=../data/dblp/toy.dblp.v12.json data.output=../output/dblp/toy.dblp.v12.json ~data.filter


In [None]:
!ls ../output/dblp/toy.dblp.v12.json

In [None]:
!ls ../output/dblp/toy.dblp.v12.json/splits.f3.r0.85/gs.b1000.e100.ns5.lr0.001.es5.spe10.d64.add.stml.h128.nn30-20

f0.e0.pt			       f1.test.pred.eval.mean.csv
f0.pt				       f1.test.pred.eval.per_instance.csv
f0.test.e0.pred			       f2.e0.pt
f0.test.e0.pred.eval.mean.csv	       f2.pt
f0.test.e0.pred.eval.per_instance.csv  f2.test.e0.pred
f0.test.pred			       f2.test.e0.pred.eval.mean.csv
f0.test.pred.eval.mean.csv	       f2.test.e0.pred.eval.per_instance.csv
f0.test.pred.eval.per_instance.csv     f2.test.pred
f1.e0.pt			       f2.test.pred.eval.mean.csv
f1.pt				       f2.test.pred.eval.per_instance.csv
f1.test.e0.pred			       logs4tboard
f1.test.e0.pred.eval.mean.csv	       ntf.
f1.test.e0.pred.eval.per_instance.csv  test.pred.eval.mean.csv
f1.test.pred			       test.pred.eval.per_instance_mean.csv
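
The listing appears to follow a per-fold naming pattern: an `f{fold}` prefix, an optional `e{epoch}` part for checkpoints, and `.pred` or `.eval.*.csv` suffixes for predictions and their evaluations. A small sketch for grouping such names by fold, assuming that reading of the pattern (it is inferred from the listing, not from OpeNTF's code):

```python
import re
from collections import defaultdict

# A few names copied from the listing above.
names = [
    "f0.pt", "f0.e0.pt", "f0.test.pred", "f0.test.pred.eval.mean.csv",
    "f1.pt", "f1.test.pred", "f2.pt", "f2.test.e0.pred",
    "test.pred.eval.mean.csv", "logs4tboard",
]

fold_pattern = re.compile(r"^f(\d+)\.")

def group_by_fold(names):
    """Map fold number -> file names; names without an f<digits>. prefix go under None."""
    groups = defaultdict(list)
    for name in names:
        match = fold_pattern.match(name)
        fold = int(match.group(1)) if match else None
        groups[fold].append(name)
    return dict(groups)

groups = group_by_fold(names)
fold_ids = sorted(k for k in groups if k is not None)
print(fold_ids)      # [0, 1, 2]
print(groups[None])  # ['test.pred.eval.mean.csv', 'logs4tboard']
```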


In [None]:
import pandas as pd
pd.read_csv('/content/opentf/output/dblp/toy.dblp.v12.json/splits.f3.r0.85/gs.b1000.e100.ns5.lr0.001.es5.spe10.d64.add.stml.h128.nn30-20/test.pred.eval.mean.csv', index_col = 0)


| metric | mean | std |
|---|---|---|
| P_2 | 0.166667 | 0.057735 |
| P_5 | 0.213333 | 0.083267 |
| P_10 | 0.186667 | 0.011547 |
| recall_2 | 0.155556 | 0.050918 |
| recall_5 | 0.488889 | 0.20367 |
| recall_10 | 0.844444 | 0.050918 |
| ndcg_cut_2 | 0.159124 | 0.0708 |
| ndcg_cut_5 | 0.346042 | 0.085861 |
| ndcg_cut_10 | 0.490005 | 0.02667 |
| map_cut_2 | 0.111111 | 0.053576 |
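
For reference, `P_k` and `recall_k` read as precision and recall at cutoff `k` over a ranked list of predicted experts. The sketch below implements those two standard definitions on a toy example; the expert ids are made up, and this is not the evaluation code OpeNTF runs.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predictions that are true team members."""
    hits = sum(1 for e in ranked[:k] if e in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of the true team members found in the top-k predictions."""
    hits = sum(1 for e in ranked[:k] if e in relevant)
    return hits / len(relevant)

ranked = ["m4", "m1", "m9", "m2", "m7"]  # hypothetical ranked predictions
relevant = {"m1", "m2", "m3"}            # hypothetical true team

p2 = precision_at_k(ranked, relevant, 2)
r5 = recall_at_k(ranked, relevant, 5)
print(p2, r5)
```

In this toy case one of the top two predictions is correct, so `p2` is 0.5, and two of the three true members appear in the top five, so `r5` is 2/3.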
