# What is OpeNTF?
OpeNTF is an open-source framework hosting large-scale training datasets and canonical neural team formation models that are trained using fairness-aware and time-sensitive methods.

## Prerequisites for OpeNTF
Before using OpeNTF, install the following libraries in addition to those listed in `requirements.txt`:

In [None]:
pip install torch==1.9.0
pip install pytrec-eval-terrier==0.5.2
pip install gensim==3.8.3

In [None]:
git clone --recursive https://github.com/Fani-Lab/opentf
cd opentf
pip install -r requirements.txt
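
After installation, it can help to confirm that the pinned versions are the ones actually present in the environment. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `check_pins` and the report format are assumptions made for illustration and are not part of OpeNTF.

```python
from importlib import metadata

# Versions pinned in the install cell above.
PINS = {"torch": "1.9.0", "pytrec-eval-terrier": "0.5.2", "gensim": "3.8.3"}

def check_pins(pins, get_version=metadata.version):
    """Hypothetical helper: map each package to (wanted, found, matches)."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (wanted, found, found == wanted)
    return report

if __name__ == "__main__":
    for name, (wanted, found, ok) in check_pins(PINS).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

The `get_version` parameter exists only so the lookup can be swapped out when testing; in normal use the default is enough.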

## Quickstart on OpeNTF

OpeNTF has the following required arguments:

- `-data`: the path to the input dataset.
- `-domain`: the domain the input dataset belongs to.
- `-model`: the neural team formation models to be used in the run.

Optional arguments include:
- `-attribute`: the set of our sensitive attributes (e.g., popularity).
- `-fairness`: fairness metrics for reranking algorithms, used to minimize popularity bias.
- `-np-ratio`: desired ratio of non-popular experts after reranking.
- `-k_max`: cutoff for the reranking algorithms.
- `-filter`: remove outliers, if needed.
- `-future`: predict future, if needed.
- `-exp_id`: ID of the experiment.
- `-output`: path of the baseline output.
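
To make the shape of these arguments concrete, here is a minimal `argparse` sketch that accepts the flags listed above. It is not OpeNTF's actual parser: the flag names come from the list, but the types, `nargs` settings, and defaults are assumptions chosen for illustration (for example, treating `-filter` and `-future` as 0/1 integers).

```python
import argparse

def build_parser():
    # Hypothetical parser: only the flag names are taken from the list above.
    p = argparse.ArgumentParser(description="Sketch of an OpeNTF-style command line")
    # Required arguments
    p.add_argument("-data", required=True, help="path to the input dataset")
    p.add_argument("-domain", required=True, help="domain of the dataset, e.g. dblp")
    p.add_argument("-model", required=True, nargs="+", help="one or more models, e.g. fnn bnn")
    # Optional arguments (types and defaults are assumptions)
    p.add_argument("-attribute", nargs="+", default=[], help="sensitive attributes")
    p.add_argument("-fairness", nargs="+", default=[], help="reranking algorithms")
    p.add_argument("-np-ratio", dest="np_ratio", type=float, default=None)
    p.add_argument("-k_max", type=int, default=None, help="reranking cutoff")
    p.add_argument("-filter", type=int, default=0, help="1 to remove outliers")
    p.add_argument("-future", type=int, default=0, help="1 to predict the future")
    p.add_argument("-exp_id", default=None, help="ID of the experiment")
    p.add_argument("-output", default=None, help="path of the baseline output")
    return p

if __name__ == "__main__":
    print(build_parser().parse_args())
```

With `nargs="+"`, a flag such as `-model fnn bnn` collects both names into a list, which matches how several models are passed in a single run.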

The following is a sample run of the OpeNTF codebase on a toy dataset, `toy.dblp.v12.json`, modelled after the DBLP dataset, which contains authorship and skill information for more than 4 million Computer Science research publications. Two neural models, feedforward (`fnn`) and Bayesian (`bnn`), are used in this quickstart.

In [1]:
%cd src
!python -u main.py -data ../data/raw/dblp/toy.dblp.v12.json -domain dblp -model fnn bnn -fairness det_greedy -attribute popularity

c:\Users\tea-n_\Documents\GitHub\OpeNTF\src
Loading sparse matrices from ./../data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl ...
Loading indexes pickle from ./../data/preprocessed/dblp/toy.dblp.v12.json/indexes.pkl ...
It took 0.010205984115600586 seconds to load from the pickles.
It took 0.020305871963500977 seconds to load the sparse matrices.
Running for (dataset, model): (dblp, fnn) ... 
Fold 0/2, Epoch 0/9, Minibatch 0/0, Phase train, Running Loss train 0.6056249141693115, Time 0.18520522117614746, Overall 4.156122207641602 
Fold 0/2, Epoch 0/9, Running Loss train 0.03562499495113597, Time 0.18520522117614746, Overall 4.156122207641602 
Fold 0/2, Epoch 0/9, Minibatch 0/0, Phase valid, Running Loss valid 0.5972533226013184, Time 0.19181203842163086, Overall 4.162729024887085 
Fold 0/2, Epoch 0/9, Running Loss valid 0.06636148028903538, Time 0.19181203842163086, Overall 4.162729024887085 
Fold 0/2, Epoch 1/9, Minibatch 0/0, Phase train, Running Loss train 0.5991436243057251,



## Setting Hyperparameters
OpeNTF's codebase exposes the following groups of hyperparameters, which can be set for each neural team formation method:

### `model`
- Contains the baseline hyperparameters in the form `'model-name': { params }`, so that each model is integrated into the baseline with its own parameters.
- Selects which stages of the pipeline are executed through `cmd` (e.g., `train`, `test`, `eval`).
- Contains other training parameters shared by the models, such as `nfolds` and `train_test_split`.

### `data`
- Contains parameters for manipulating datasets, including dataset filters (e.g., minimum team size) and bucket size for sparse matrix parallel generation.

### `fair`
- Contains parameters for the fairness-aware reranking step, including the reranking algorithms, the fairness and utility metrics, and the sensitive attributes.

A snippet of the parameters defined in `param.py` follows:

In [None]:
import random
import torch
import numpy as np

random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

np.random.seed(0)

settings = {
    'model':{
        'baseline': {
            'random': {
                'b': 128
            },
            'fnn':{
                'l': [100],  # list of number of nodes in each layer
                'lr': 0.001,  # learning rate
                'b': 128,  # batch size
                'e': 10,  # epoch
                'nns': 3,  # number of negative samples
                'ns': 'none',  # 'none', 'uniform', 'unigram', 'unigram_b'
                'loss': 'SL',  # 'SL'-> superloss, 'DP' -> Data Parameters, 'normal' -> Binary Cross Entropy
            },
            'bnn':{
                'l': [128],  # list of number of nodes in each layer
                'lr': 0.1,  # learning rate
                'b': 128,  # batch size
                'e': 5,  # epoch
                'nns': 3,  # number of negative samples
                'ns': 'unigram_b',  # 'uniform', 'unigram', 'unigram_b'
                's': 1,  # number of sample_elbo samples for bnn
                'loss': 'SL',  # 'SL'-> superloss, 'DP' -> Data Parameters, 'normal' -> Binary Cross Entropy
            },
        },
        'cmd': ['train', 'test', 'eval', 'fair'],  # 'train', 'test', 'eval', 'plot', 'agg', 'fair'
        'nfolds': 3,
        'train_test_split': 0.85,
        'step_ahead': 2,  # for now, teams in the last [step_ahead] time intervals form the test set
    },
    'data':{
        'domain': {
            'dblp':{},
            'uspt':{},
            'imdb':{},
        },
        'location_type': 'country', #should be one of 'city', 'state', 'country' and represents the location of members in teams (not the location of teams)
        'filter': {
            'min_nteam': 5,
            'min_team_size': 2,
        },
        'parallel': 1,
        'ncore': 0,# <= 0 for all
        'bucket_size': 1000
    },
    'fair': {'np_ratio': None,
              'fairness': ['det_greedy',],
              'k_max': None,
              'fairness_metrics': {'ndkl'},
              'utility_metrics': {'map_cut_2,5,10'},
              'eq_op': False,
              'mode': 0,
              'core': -1,
              'attribute': ['gender', 'popularity']},
}
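
As an illustration of how this nested dictionary can be consumed, the sketch below looks up the hyperparameters of the requested models and checks whether a pipeline stage is enabled. The helpers `select_models` and `stage_enabled` are hypothetical names introduced here, and the dictionary is a trimmed copy of the snippet above so that the example is self-contained.

```python
# Trimmed, illustrative copy of the `settings` dictionary shown above.
settings = {
    'model': {
        'baseline': {
            'fnn': {'l': [100], 'lr': 0.001, 'b': 128, 'e': 10},
            'bnn': {'l': [128], 'lr': 0.1, 'b': 128, 'e': 5, 's': 1},
        },
        'cmd': ['train', 'test', 'eval', 'fair'],
        'nfolds': 3,
    },
}

def select_models(settings, names):
    """Hypothetical helper: return {model name: hyperparameters} for the requested models."""
    baseline = settings['model']['baseline']
    unknown = [n for n in names if n not in baseline]
    if unknown:
        raise KeyError(f"no hyperparameters defined for: {unknown}")
    return {n: baseline[n] for n in names}

def stage_enabled(settings, stage):
    """Hypothetical helper: True if `stage` is listed in settings['model']['cmd']."""
    return stage in settings['model']['cmd']

if __name__ == "__main__":
    for name, params in select_models(settings, ['fnn', 'bnn']).items():
        print(name, "learning rate:", params['lr'], "epochs:", params['e'])
    print("plot stage enabled:", stage_enabled(settings, 'plot'))
```

Failing early on an unknown model name keeps a typo on the command line from surfacing later as a missing-key error in the middle of training.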

## Structure and Inheritance

### Dataset Structure
<p align="center"><img src='./src/cmn/dataset_hierarchy.png' width="500" ></p>

To integrate a new dataset into the baseline, follow the structure of the `Team` class. A derived class may add its own fields, as the existing derived classes do. Ideally, only the `read_data()` function should be overridden.



In [None]:
import json
from cmn.member import Member
from cmn.team import Team

class Review(Team):
    def __init__(self, id, title, year, fos, reviewers):
        super().__init__(id, reviewers, fos, year)
        self.title = title
    
    @staticmethod
    def read_data(datapath, output, index, filter, settings):
        try:
            return super(Review, Review).load_data(output, index)
        except (FileNotFoundError, EOFError) as e:
            print(f"Pickles not found! Reading raw data from {datapath} (progress in bytes) ...")
            teams = {}; candidates = {}

            with open(datapath, "r", encoding='utf-8') as jf:
                for line in jf:
                    try:
                        if not line: break
                        jsonline = json.loads(line.lower().lstrip(","))
                        id = jsonline['id']
                        title = jsonline['title']
                        year = jsonline['year']

                        # a team must have skills and members
                        try: fos = jsonline['fos']
                        except KeyError: continue
                        try: reviewers = jsonline['reviewers']
                        except KeyError: continue

                        members = []
                        for reviewer in reviewers:
                            member_id = reviewer['id']
                            member_name = reviewer['name'].replace(" ", "_")
                            if (idname := f'{member_id}_{member_name}') not in candidates:
                                candidates[idname] = Member(member_id, member_name)
                                candidates[idname].skills.update(set(reviewer['expertise']))
                            members.append(candidates[idname])
                            
                        team = Review(id, title, year, fos, members)
                        teams[team.id] = team
                    except json.JSONDecodeError as e:  # ideally should happen only for the last line ']'
                        print(f'JSONDecodeError: There has been error in loading json line `{line}`!\n{e}')
                        continue
                    except Exception as e:
                        raise e
            return super(Review, Review).read_data(teams, output, filter, settings)
        except Exception as e: raise e
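
The per-line handling in `read_data()` above assumes a file laid out as a JSON array with one object per line, so that lines may start with a comma and the bare `[` and `]` lines fail to decode and are skipped. The sketch below reproduces just that step on made-up records. The sample data and the helper name `parse_line` are assumptions for illustration; only the field names (`id`, `title`, `year`, `fos`, `reviewers`, `name`, `expertise`) are taken from the code above.

```python
import json

# Hypothetical input lines, laid out like a JSON array written one object per line.
raw_lines = [
    '[\n',
    '{"id": 1, "title": "Paper A", "year": 2020, "fos": [{"name": "ML"}], '
    '"reviewers": [{"id": 7, "name": "Ada Lovelace", "expertise": ["ML"]}]}\n',
    ',{"id": 2, "title": "Paper B", "year": 2021, "reviewers": []}\n',
    ']\n',
]

def parse_line(line):
    """Lowercase, drop a leading comma, decode; return None if undecodable or incomplete."""
    try:
        record = json.loads(line.lower().lstrip(","))
    except json.JSONDecodeError:
        return None  # e.g. the bare "[" and "]" lines
    # A team must have both skills ('fos') and members ('reviewers').
    if not isinstance(record, dict) or "fos" not in record or "reviewers" not in record:
        return None
    return record

records = [r for r in map(parse_line, raw_lines) if r is not None]

# Candidate keys are built as '<id>_<name with underscores>', as in the code above.
idnames = [f"{rv['id']}_{rv['name'].replace(' ', '_')}"
           for rec in records for rv in rec["reviewers"]]

if __name__ == "__main__":
    print(len(records), "usable record(s); candidates:", idnames)
```

Note that `line.lower()` lowercases values as well as keys, so titles and names are stored in lowercase.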

### Model Structure
![Class Diagram of the Model baseline.](./new-class-diagram.png)

To integrate a new model into the baseline, extend the `Ntf` class. Ideally, only the `learn()` method should be overridden, while `eval()` stays the same so that all models are compared fairly.
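
The design idea of overriding `learn()` while sharing `eval()` can be sketched without any deep-learning dependency. The classes below are hypothetical stand-ins, not the actual `Ntf` implementation: `BaseModel`, `MostFrequent`, the `rank` attribute, and the toy hit-rate metric are all assumptions made for illustration.

```python
from collections import Counter

class BaseModel:
    """Hypothetical stand-in for a shared base class (not the real `Ntf`)."""

    def learn(self, train_pairs):
        """Subclasses set self.rank: a function from a skill set to a ranked expert list."""
        raise NotImplementedError

    def eval(self, test_pairs, k=2):
        """Shared metric: fraction of true team members found in the top-k ranking."""
        hits = total = 0
        for skills, members in test_pairs:
            topk = set(self.rank(skills)[:k])
            hits += len(topk & set(members))
            total += len(members)
        return hits / total if total else 0.0

class MostFrequent(BaseModel):
    """Toy model: always ranks experts by how often they appear in training teams."""

    def learn(self, train_pairs):
        counts = Counter(m for _, members in train_pairs for m in members)
        ranking = [m for m, _ in counts.most_common()]
        self.rank = lambda skills: ranking  # ignores the skills on purpose

train = [({"ml"}, ["a", "b"]), ({"db"}, ["a"]), ({"ml"}, ["b", "c"])]
test = [({"ml"}, ["a", "c"])]

model = MostFrequent()
model.learn(train)
score = model.eval(test, k=2)

if __name__ == "__main__":
    print("hit rate at k=2:", score)
```

Because every subclass inherits the same `eval()`, any difference in the reported score comes from what `learn()` produced rather than from how the score was computed.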

In [None]:
import numpy as np
import keras
import pandas as pd

from mdl.ntf import Ntf
from mdl.cds import TFDataset
from cmn.team import Team
from cmn.tools import merge_teams_by_skills
from mdl.cds import SuperlossDataset
from mdl.superloss import SuperLoss

class Random(Ntf):
    def __init__(self):
        super(Random, self).__init__()
    
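    def init(self, input_size, output_size):
        # the sizes are accepted to match the call in learn(); this placeholder model does not use them yet
        self.model = keras.Sequential()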

    def learn(self, splits, indexes, vecs, params, prev_model, output):
        input_size = vecs['skill'].shape[1]
        output_size = len(indexes['i2c'])

        for foldidx in splits['folds'].keys():
            self.init(input_size=input_size, output_size=output_size)
            if prev_model: self.model = keras.saving.load_model(prev_model[foldidx])
            
            keras.saving.save_model(self.model, f"{output}/state_dict_model.f{foldidx}.pt")