# eXtreme Multi-label Ranking (XMR) with Transformers
In many XMC applications, XR-Transformer is able to yield better performance than XR-Linear due to better extraction of semantic information. However, unlike the linear models, the training hyper-parameters need to be carefully set to achieve the best performance. Naively using the default setting will often lead to sub-optimal results.

In this section, we discuss the crucial components of training a good XR-Transformer model.

### Install PECOS through Python PIP

In [None]:
! pip install libpecos

## 1. Overview: Multi-Resolution Fine-tuning

One important thing to note is that XR-Transformer leverages multi-resolution fine-tuning to allow tuning from easy to hard tasks. The training can be separated into three steps:

* **Step1**: Label features are computed (usually via PIFA) and used to build the preliminary hierarchical label tree (HLT) via hierarchical k-means.
* **Step2**: Fine-tune the transformer encoder on the chosen levels of the preliminary HLT.
* **Step3**: Concatenate final instance embeddings and sparse features and train the linear rankers on the refined HLT.
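To make Step1 more concrete, here is a minimal pure-Python sketch of PIFA (Positive Instance Feature Aggregation) as it is commonly described: each label embedding is the L2-normalized sum of the feature vectors of its positive instances. The function name `pifa_label_embeddings` and the dict-based sparse format are assumptions made for illustration; PECOS itself computes this on sparse matrices through `LabelEmbeddingFactory.create(..., method="pifa")`.

```python
import math
from collections import defaultdict


def pifa_label_embeddings(X, Y):
    """Sketch of PIFA label embeddings (illustrative, not the PECOS code).

    X: list of sparse instance features, each a {feature_id: value} dict.
    Y: list of label-id sets, one per instance.
    Returns {label_id: {feature_id: value}} where each vector has unit L2 norm.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for x, labels in zip(X, Y):
        for label in labels:
            # Aggregate the features of every instance that carries this label.
            for feat, val in x.items():
                sums[label][feat] += val

    embeddings = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(v * v for v in vec.values()))
        embeddings[label] = {f: v / norm for f, v in vec.items()} if norm > 0 else dict(vec)
    return embeddings


# Toy data: 3 instances, 2 features, 2 labels.
X = [{0: 1.0}, {1: 1.0}, {0: 1.0, 1: 1.0}]
Y = [{0}, {1}, {0, 1}]
emb = pifa_label_embeddings(X, Y)
print(emb[0])  # label 0 leans toward feature 0
```

Labels that share many positive instances end up with similar embeddings, which is what lets hierarchical k-means group them into the same cluster.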

<div> <br/><img src="imgs/pecos_xrtransformer.png" width="80%"/> </div>



## 2. Parameter structure

The CLI entry points `pecos.xmc.xtransformer.train`, `pecos.xmc.xtransformer.predict` and `pecos.xmc.xtransformer.encode` accept basic training and prediction parameters directly, but advanced users are encouraged to supply parameters as a JSON file.

You can generate a `.json` skeleton that lists all of the parameters, ready to edit and fill in, via:
```bash
python3 -m pecos.xmc.xtransformer.train --generate-params-skeleton &> params.json
```

After filling in the desired parameters in `params.json`, training can be run end to end via:
```bash
python3 -m pecos.xmc.xtransformer.train -t ${T_path} -x ${X_path} -y ${Y_path} -m ${model_dir} --params-path params.json
```
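If you prefer to edit the parameter file programmatically rather than by hand, something along these lines may help. It is only a sketch: `override_params` is a hypothetical helper, and the `skeleton` dict is a tiny stand-in for the real generated file, which contains many more fields. The `train_params` / `pred_params` top-level keys and the field names used here are the ones shown elsewhere in this tutorial.

```python
import json


def override_params(params, overrides):
    """Apply {"dotted.path": value} overrides to a nested dict in place.

    Hypothetical helper: refuses to create keys that are not already present,
    so a typo in a parameter name fails loudly instead of being ignored.
    """
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = params
        for key in parents:
            node = node[key]
        if leaf not in node:
            raise KeyError(f"unknown parameter: {dotted_key}")
        node[leaf] = value
    return params


# Stand-in for the content of params.json (the real skeleton is much larger).
skeleton = {
    "train_params": {
        "do_fine_tune": True,
        "only_encoder": False,
        "max_match_clusters": 32768,
    },
    "pred_params": {},
}

edited = override_params(
    skeleton,
    {"train_params.only_encoder": True, "train_params.max_match_clusters": 4096},
)
text = json.dumps(edited, indent=1)  # write this string back to params.json
print(text)
```

Failing on unknown keys is a deliberate choice in this sketch: a misspelled parameter would otherwise silently fall back to its default.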

The high-level structure of the training and prediction parameters for XR-Transformer:

In [1]:
import logging
from pecos.xmc.xtransformer.model import XTransformer
from pecos.utils import logging_util

LOGGER = logging.getLogger(__name__)

logging_util.setup_logging_config(level=1)

print(XTransformer.TrainParams.__doc__)
print(XTransformer.PredParams.__doc__)

Training Parameters of XTransformer.

        preliminary_indexer_params (HierarchicalKMeans.TrainParams): params to generate preliminary hierarchial label tree.
            ignored if clustering is given
        refined_indexer_params (HierarchicalKMeans.TrainParams): params to generate refined hierarchial label tree.
            ignored if fix_clustering is True
        matcher_params_chain (TransformerMatcher.TrainParams or list): chain of params for TransformerMatchers.
        ranker_params (XLinearModel.TrainParams): train params for linear ranker

        do_fine_tune (bool, optional): if False, skip fine-tuning steps and directly use pre-trained transformer models.
            Default True
        only_encoder (bool, optional): if True, skip linear ranker training. Default False
        fix_clustering (bool, optional): if True, use the same hierarchial label tree for fine-tuning and final prediction. Default false.
        max_match_clusters (int, optional): max number of clust

We provide the flexibility to control almost every aspect of XR-Transformer training. Let's cover the main components.

### 2.1 Specify the Label Hierarchy

The structure and construction of the preliminary HLT and the refined-HLT are controlled by `preliminary_indexer_params` and `refined_indexer_params`. 

In [2]:
from pecos.xmc.base import HierarchicalKMeans; print(HierarchicalKMeans.TrainParams.__doc__)

Training Parameters of Hierarchical K-means.

        nr_splits (int, optional): The out-degree of each internal node of the tree. Default is `16`.
        min_codes (int): The number of direct child nodes that the top level of the hierarchy should have.
        max_leaf_size (int, optional): The maximum size of each leaf node of the tree. Default is `100`.
        spherical (bool, optional): True will l2-normalize the centroids of k-means after each iteration. Default is `True`.
        seed (int, optional): Random seed. Default is `0`.
        kmeans_max_iter (int, optional): Maximum number of iterations for each k-means problem. Default is `20`.
        threads (int, optional): Number of threads to use. `-1` denotes all CPUs. Default is `-1`.
        


Here is an example of the parameters related to the label hierarchy in the `wiki10-31k` model:

In [3]:
import json
import pecos
import requests
import numpy as np
from pecos.utils import smat_util
from pecos.xmc import Indexer, LabelEmbeddingFactory

param_url = "https://raw.githubusercontent.com/amzn/pecos/mainline/examples/xr-transformer-neurips21/params/wiki10-31k/bert/params.json"
params = json.loads(requests.get(param_url).text)
    
wiki31k_train_params = XTransformer.TrainParams.from_dict(params["train_params"])
wiki31k_pred_params = XTransformer.PredParams.from_dict(params["pred_params"])

print(json.dumps(wiki31k_train_params.preliminary_indexer_params.to_dict(), indent=True))

{
 "__meta__": {
  "class_fullname": "pecos.xmc.base###HierarchicalKMeans.TrainParams"
 },
 "nr_splits": 16,
 "min_codes": 128,
 "max_leaf_size": 16,
 "spherical": true,
 "seed": 10001,
 "kmeans_max_iter": 20,
 "threads": -1
}


In [4]:
X_feat = smat_util.load_matrix("xmc-base/wiki10-31k/tfidf-attnxml/X.trn.npz", dtype=np.float32)
Y = smat_util.load_matrix("xmc-base/wiki10-31k/Y.trn.npz", dtype=np.float32)

with open("xmc-base/wiki10-31k/X.trn.txt", 'r') as fin:
    X_txt = [xx.strip() for xx in fin.readlines()]

preliminary_hlt = Indexer.gen(
    LabelEmbeddingFactory.create(Y, X_feat, method="pifa"),
    train_params=wiki31k_train_params.preliminary_indexer_params,
)

print(f"Preliminary HLT structure {[c.shape[0] for c in preliminary_hlt]}")

Preliminary HLT structure [128, 2048, 30938]


In this case the preliminary HLT has 3 levels (128-2048-30938).
Since we set `max_match_clusters` to `32768`, fine-tuning will happen on all 3 levels of the preliminary HLT.

The preliminary HLT is usually constructed such that:
* The initial fine-tuning task has low enough label resolution (i.e. < 1000 labels, in this case 128). This ensures the Transformer can start from a simple task to 'warm up'.
* The final fine-tuning task has high enough label resolution (controlled by `max_match_clusters`, in this case 32768). This is to ensure training efficiency.
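The interaction between the HLT level sizes and `max_match_clusters` can be sketched as follows. The rule encoded here (a level is used for fine-tuning when its number of clusters does not exceed `max_match_clusters`) is an assumption inferred from the description above, and `fine_tune_levels` is a hypothetical helper rather than a PECOS function.

```python
def fine_tune_levels(level_sizes, max_match_clusters):
    """Return the HLT level sizes on which the encoder would be fine-tuned.

    Assumed rule: keep every level whose cluster count is at most
    `max_match_clusters`; larger levels are left to the linear rankers.
    """
    return [n for n in level_sizes if n <= max_match_clusters]


hlt_levels = [128, 2048, 30938]  # structure printed for wiki10-31k above

print(fine_tune_levels(hlt_levels, 32768))  # all three levels
print(fine_tune_levels(hlt_levels, 4096))   # only the two coarser levels
```

Under this reading, lowering `max_match_clusters` trades some label resolution in the final fine-tuning task for a cheaper training run.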

### 2.2 Control fine-tuning at each level

At each level of the fine-tuning task, the user can independently specify training parameters such as `loss_function`, `batch_size`, etc.



In [5]:
from pecos.xmc.xtransformer.matcher import TransformerMatcher; print(TransformerMatcher.TrainParams.__doc__)

Training Parameters of MLModel

        model_shortcut (str): string of pre-trained model shortcut. Default 'bert-base-cased'
        negative_sampling (str): negative sampling types. Default tfn
        loss_function (str): type of loss function to use for transformer
            training. Default 'squared-hinge'
        bootstrap_method (str): algorithm to bootstrap text_model. If not None, initialize
            TransformerMatcher projection layer with one of:
                'linear' (default): linear model trained on final embeddings of parent layer
                'inherit': inherit weights from parent labels
        lr_schedule (str): learning rate schedule. See transformers.SchedulerType for details.
            Default 'linear'

        threshold (float): threshold to sparsify the model weights. Default 0.1
        hidden_dropout_prob (float): hidden dropout prob in deep transformer models. Default 0.1
        batch_size (int):  batch size for transformer training. Default 8
 

For the `wiki10-31k` model, we are fine-tuning the `bert-base-uncased` pre-trained model at 3 levels of the preliminary HLT:

In [6]:
print("="*10, f"matcher_params_chain[0] (len={len(wiki31k_train_params.matcher_params_chain)})", "="*10)
print(json.dumps(wiki31k_train_params.matcher_params_chain[0].to_dict(), sort_keys=True, indent=True))

{
 "__meta__": {
  "class_fullname": "pecos.xmc.xtransformer.matcher###TransformerMatcher.TrainParams"
 },
 "adam_epsilon": 1e-08,
 "batch_gen_workers": 16,
 "batch_size": 32,
 "bootstrap_method": "weighted-linear",
 "cache_dir": "",
 "checkpoint_dir": "",
 "cost_sensitive_ranker": true,
 "eval_by_true_shorlist": false,
 "gradient_accumulation_steps": 1,
 "hidden_dropout_prob": 0.1,
 "init_model_dir": "",
 "learning_rate": 5e-05,
 "logging_steps": 50,
 "loss_function": "weighted-squared-hinge",
 "lr_schedule": "linear",
 "max_active_matching_labels": 1000,
 "max_grad_norm": 1.0,
 "max_no_improve_cnt": -1,
 "max_num_labels_in_gpu": 65536,
 "max_steps": 1000,
 "model_shortcut": "bert-base-uncased",
 "negative_sampling": "tfn+man",
 "num_train_epochs": 10,
 "pre_tensorize_labels": true,
 "pre_tokenize": true,
 "save_steps": 200,
 "threshold": 0.001,
 "use_gpu": true,
 "warmup_steps": 100,
 "weight_decay": 0.0
}


Though the best parameters may vary a lot across tasks, there are some common notes you should always keep in mind:
* It's recommended to finish at least one epoch at each level. This allows the model to visit the label matrix at least once.
  * i.e. `max_steps * batch_size * num_gpus > num_instances` (if `max_steps` is null, it will be inferred from `num_train_epochs`)
* `model_shortcut` will only be used in the first fine-tuning layer, as the later ones will just continue on the same encoder.
* The learning rate and its schedule are controlled by `learning_rate`, `lr_schedule`, `warmup_steps` and `max_steps`. For more info, refer to: https://huggingface.co/docs/transformers/main_classes/optimizer_schedules
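The two numeric rules of thumb above can be checked with a few lines of arithmetic. This is a sketch: `min_steps_for_one_epoch` and `linear_schedule_lr` are hypothetical helpers, `num_instances=20_000` is a made-up dataset size, and the schedule follows the usual "linear warmup then linear decay" shape, which may differ in small details from what the training loop actually does.

```python
import math


def min_steps_for_one_epoch(num_instances, batch_size, num_gpus=1):
    """Smallest max_steps such that every instance is visited at least once."""
    return math.ceil(num_instances / (batch_size * num_gpus))


def linear_schedule_lr(step, base_lr, warmup_steps, max_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


# Values taken from matcher_params_chain[0] shown above.
batch_size, max_steps, warmup_steps, learning_rate = 32, 1000, 100, 5e-05
num_instances = 20_000  # hypothetical dataset size, for illustration only

needed = min_steps_for_one_epoch(num_instances, batch_size, num_gpus=1)
print(f"steps needed for one epoch: {needed}, configured: {max_steps}")

for step in (0, 50, 100, 550, 1000):
    lr = linear_schedule_lr(step, learning_rate, warmup_steps, max_steps)
    print(step, lr)
```

If `needed` exceeds `max_steps`, either raise `max_steps`, increase the batch size, or use more GPUs so that each level sees the whole training set at least once.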

#### 2.2.1 Use pre-trained models

There are two ways to provide a pre-trained Transformer encoder:
* **Download from the Hugging Face model hub** (https://huggingface.co/models): model name provided by `model_shortcut` (e.g. `bert-base-uncased` or `w11wo/javanese-distilbert-small`).
* **Load your custom model from local disk**: model path provided by `init_model_dir`. The model should be loadable through `TransformerMatcher.load()`.

Note that both `model_shortcut` and `init_model_dir` will only be used in the first fine-tuning layer, as the later ones will just continue on the final state from parent encoder.


Here is a simple example of constructing your own pre-trained model for XR-Transformer fine-tuning:

In [7]:
import os
import scipy.sparse as smat
from pecos.xmc.xtransformer.matcher import TransformerMatcher
from transformers import AutoTokenizer, AutoModelForSequenceClassification

init_model_dir = "work_dir/my_pre_trained_model"
os.makedirs("work_dir", exist_ok=True)

# example of using your own pre-trained model; here we use a huggingface model as an example
# (tokenizer and encoder are loaded from the same checkpoint so that they match)
my_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
my_encoder = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# do my own modification/tuning/etc
# ...

# save my own model to disk
my_tokenizer.save_pretrained(f"{init_model_dir}/text_tokenizer")
my_encoder.save_pretrained(f"{init_model_dir}/text_encoder")

# the saved directory (work_dir/my_pre_trained_model) can then be passed as `init_model_dir`.
# Sanity check: the directory should be loadable via TransformerMatcher.load()
matcher = TransformerMatcher.load(init_model_dir)
print(f"{matcher.__class__} model loaded with encoder_type={matcher.model_type} num_labels={matcher.nr_labels}")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

<class 'pecos.xmc.xtransformer.matcher.TransformerMatcher'> model loaded with encoder_type=distilbert num_labels=2


Or you could download our released encoders via:

In [8]:
%%bash
DATASET="wiki10-31k"
wget -q https://archive.org/download/xr-transformer-encoders/${DATASET}.tar.gz -O ${DATASET}_encoder.tar.gz
mkdir -p ./work_dir/xr-transformer-encoder
tar -zxf ./${DATASET}_encoder.tar.gz -C ./work_dir/xr-transformer-encoder

In [9]:
matcher = TransformerMatcher.load("./work_dir/xr-transformer-encoder/wiki10-31k/bert/text_encoder")
print(f"{matcher.__class__} model loaded with encoder_type={matcher.model_type} num_labels={matcher.nr_labels}")

<class 'pecos.xmc.xtransformer.matcher.TransformerMatcher'> model loaded with encoder_type=bert num_labels=30938


#### 2.2.2 Bootstrapping and Cost-Sensitive Learning

We provide three options to bootstrap the XMC head at the child level (i.e. $W^{(t+1)}$) from the parent level (i.e. $W^{(t)}$):
* `bootstrap_method=None`: No bootstrap; $W^{(t+1)}$ will be randomly initialized.
* `bootstrap_method='inherit'`: Bootstrap by inheriting the weight vector from the parent node.
* `bootstrap_method='linear'` (default): A linear model is trained on the final embeddings of the parent layer and used as the initial point for $W^{(t+1)}$.

In most cases the default linear bootstrapper gives a good enough initial point for the XMC heads.
Compared with the linear bootstrapper, the inherit bootstrapper has less memory/time overhead.
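The idea behind the `inherit` option can be illustrated with plain lists. This is a conceptual sketch under the assumption that each child node simply starts from a copy of its parent's weight vector; `inherit_bootstrap` and the `child_to_parent` mapping are hypothetical names, and the real implementation operates on model tensors and the cluster chain.

```python
def inherit_bootstrap(parent_weights, child_to_parent):
    """Initialize child-level weights W^(t+1) from parent-level weights W^(t).

    parent_weights: list of weight vectors, one per parent node.
    child_to_parent: child_to_parent[c] is the parent index of child node c.
    Each child gets its own copy, so later updates do not affect its siblings.
    """
    return [list(parent_weights[p]) for p in child_to_parent]


W_parent = [[0.1, -0.2], [0.3, 0.4]]   # 2 parent nodes, embedding dim 2
child_to_parent = [0, 0, 1]            # 3 child nodes
W_child = inherit_bootstrap(W_parent, child_to_parent)
print(W_child)
```

No extra training is involved in this scheme, which is consistent with the lower memory/time overhead mentioned above.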

XR-Transformer can take the magnitude of label strength into consideration via cost-sensitive learning.
This is available even when the input label matrix is binary. In this case, the cost (at non-leaf levels) will be inferred via label aggregation.

To use cost-sensitive fine-tuning, use the `weighted-` version of the loss functions, i.e.:

In [10]:
print([lf for lf in TransformerMatcher.LOSS_FUNCTION_TYPES.keys() if 'weighted-' in lf])

['weighted-hinge', 'weighted-squared-hinge']
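To illustrate how a cost can arise from a binary label matrix, the sketch below aggregates leaf labels into their clusters and uses the aggregated count as a per-cluster weight in a squared hinge loss. Both helpers are hypothetical, and the exact aggregation and any normalization applied by PECOS are not shown in this tutorial, so treat this as one plausible reading rather than the actual implementation.

```python
def aggregate_to_clusters(Y, label_to_cluster):
    """Sum label strengths per cluster for every instance.

    Y: list of {label_id: strength} dicts (strength is 1.0 for binary labels).
    label_to_cluster: label_to_cluster[l] is the cluster id of label l.
    """
    aggregated = []
    for row in Y:
        cluster_row = {}
        for label, strength in row.items():
            c = label_to_cluster[label]
            cluster_row[c] = cluster_row.get(c, 0.0) + strength
        aggregated.append(cluster_row)
    return aggregated


def weighted_squared_hinge(score, target, weight=1.0):
    """Squared hinge loss for a target in {-1, +1}, scaled by a cost weight."""
    margin = max(0.0, 1.0 - target * score)
    return weight * margin * margin


Y = [{0: 1.0, 1: 1.0, 2: 1.0}]  # one instance with three positive labels
label_to_cluster = [0, 0, 1]    # labels 0 and 1 share cluster 0
costs = aggregate_to_clusters(Y, label_to_cluster)
print(costs)  # cluster 0 collects two positives, cluster 1 collects one

# The same prediction error is penalized more on the heavier cluster.
loss_c0 = weighted_squared_hinge(0.5, +1, costs[0][0])
loss_c1 = weighted_squared_hinge(0.5, +1, costs[0][1])
print(loss_c0, loss_c1)
```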


### 2.3 Linear models with concatenated feature

The training of linear models is controlled by the `ranker_params`, which is of the same format as PECOS XR-Linear.

Users should pay special attention to the `threshold`, which controls the sparsification of the final linear models.
Unlike models trained on purely sparse features, linear models trained on sparse+dense concatenated features are more sensitive to sparsification.
Usually `threshold=0.01` or `0.001` is recommended for XR-Transformer.
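The effect of `threshold` can be seen on a toy weight vector. The sketch assumes that sparsification zeroes every weight whose magnitude is below the threshold (whether the comparison is strict is an implementation detail not covered here); `sparsify` and `nnz` are hypothetical helpers and the weights are made up.

```python
def sparsify(weights, threshold):
    """Zero out weights whose absolute value is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def nnz(weights):
    """Number of non-zero entries."""
    return sum(1 for w in weights if w != 0.0)


# Made-up weights spanning several orders of magnitude.
weights = [0.5, -0.05, 0.004, -0.0007]

for threshold in (0.1, 0.01, 0.001):
    kept = sparsify(weights, threshold)
    print(f"threshold={threshold}: nnz={nnz(kept)} -> {kept}")
```

With `threshold=0.1` only the largest weight survives, while `0.01` and `0.001` keep progressively more of the small weights, at the price of a larger model.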

In [11]:
from pecos.xmc.xtransformer.module import MLProblemWithText
prob = MLProblemWithText(X_txt, Y, X_feat=X_feat)

# disable fine-tuning, use pre-trained bert model from huggingface
wiki31k_train_params.do_fine_tune = False

# this will be slow on CPU only machine
xtf_pretrained = XTransformer.train(
    prob,
    clustering=preliminary_hlt,
    train_params=wiki31k_train_params,
    pred_params=wiki31k_pred_params,
)

X_feat_tst = smat_util.load_matrix("xmc-base/wiki10-31k/tfidf-attnxml/X.tst.npz", dtype=np.float32)
Y_tst = smat_util.load_matrix("xmc-base/wiki10-31k/Y.tst.npz", dtype=np.float32)

with open("xmc-base/wiki10-31k/X.tst.txt", 'r') as fin:
    X_txt_tst = [xx.strip() for xx in fin.readlines()]

P_pretrained = xtf_pretrained.predict(X_txt_tst, X_feat=X_feat_tst)
metrics = smat_util.Metrics.generate(Y_tst, P_pretrained, topk=10)
print(metrics)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForXMC: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForXMC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForXMC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


prec   = 85.22 82.55 77.26 72.15 67.42 63.13 59.33 56.08 53.02 50.24
recall = 5.05 9.76 13.58 16.74 19.41 21.68 23.64 25.41 26.92 28.22


In [12]:
# use fine-tuned bert model
wiki31k_train_params.matcher_params_chain[0].init_model_dir = "./work_dir/xr-transformer-encoder/wiki10-31k/bert/text_encoder"

# this will be slow on CPU only machine
xtf_fine_tuned = XTransformer.train(
    prob,
    clustering=preliminary_hlt,
    train_params=wiki31k_train_params,
    pred_params=wiki31k_pred_params,
)

P_fine_tuned = xtf_fine_tuned.predict(X_txt_tst, X_feat=X_feat_tst)
metrics = smat_util.Metrics.generate(Y_tst, P_fine_tuned, topk=10)
print(metrics)

prec   = 87.95 83.54 78.79 73.95 69.43 65.14 61.08 57.70 54.63 51.97
recall = 5.25 9.89 13.84 17.14 19.99 22.36 24.35 26.16 27.73 29.21
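The `prec` and `recall` rows printed above are top-k metrics for k = 1..10, reported as percentages. For reference, here is a small sketch of the standard definitions of precision@k and recall@k; `precision_recall_at_k` is a hypothetical helper written for this note and is not the `smat_util.Metrics` implementation.

```python
def precision_recall_at_k(true_labels, ranked_preds, k):
    """Average precision@k and recall@k (in percent) over all instances.

    true_labels: list of sets of relevant label ids.
    ranked_preds: list of label-id lists, best prediction first.
    """
    prec_sum = 0.0
    rec_sum = 0.0
    for truth, ranked in zip(true_labels, ranked_preds):
        hits = len(set(ranked[:k]) & truth)
        prec_sum += hits / k
        rec_sum += hits / len(truth) if truth else 0.0
    n = len(true_labels)
    return 100.0 * prec_sum / n, 100.0 * rec_sum / n


truth = [{1, 2, 3}, {4}]
preds = [[1, 9, 2], [5, 4, 6]]

for k in (1, 3):
    p, r = precision_recall_at_k(truth, preds, k)
    print(f"k={k}: prec={p:.2f} recall={r:.2f}")
```

Because recall divides by the number of true labels per instance, it stays low on datasets where each instance has many labels, even when precision is high.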


# (BETA) Distributed PECOS

`pecos.distributed` is a PECOS module that enables distributed training.

Currently the following sub-modules are implemented:

* Distributed X-Linear ([`pecos.distributed.xmc.xlinear`](xmc/xlinear/README.md))

We are working on distributed algorithms for more of the existing PECOS models. Please watch out for our newest releases.

## 1. Distributed XR-Linear

`pecos.distributed.xmc.xlinear` enables distributed training for PECOS XLinear model (`pecos.xmc.xlinear`).

### Prerequisites

* **Hardware**: 
    * A cluster of networked machines that can SSH to each other without a password.
      * IP address of every machine in the cluster is known.
    * Shared network disk mounted on all machines.
      * For accessing data and saving trained models.

* **Software**: Install the following software on **every** machine of your cluster
    * MPI and mpi4py
    
Due to hardware constraints during the tutorial, we only include a basic example in local mode here.

In [13]:
%%bash
# check the required dependencies
mpicc -v
which mpiexec
python3 -m pip install mpi4py
mpiexec -n 8 python3 -m mpi4py.bench helloworld

Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-libmpx --enable-libsanitizer --enable-gnu-indirect-function --enable-libcilkrts --enable-libatomic --enable-libquadmath --enable-libitm --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 7.3.1 20180712 (Red Hat 7.3.1-15) (GCC) 


/opt/amazon/openmpi/bin/mpiexec
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


You should consider upgrading via the '/home/ec2-user/repo/tutorial-env/bin/python3 -m pip install --upgrade pip' command.


Hello, World! I am process 0 of 8 on ip-[MASKED].ec2.internal.
Hello, World! I am process 1 of 8 on ip-[MASKED].ec2.internal.
Hello, World! I am process 2 of 8 on ip-[MASKED].ec2.internal.
Hello, World! I am process 3 of 8 on ip-[MASKED].ec2.internal.
Hello, World! I am process 4 of 8 on ip-[MASKED].ec2.internal.
Hello, World! I am process 5 of 8 on ip-[MASKED].ec2.internal.
Hello, World! I am process 6 of 8 on ip-[MASKED].ec2.internal.
Hello, World! I am process 7 of 8 on ip-[MASKED].ec2.internal.


### Basic Usage

Below is a simple showcase of the usage of `pecos.distributed.xmc.xlinear.train`.

In [14]:
%%bash
mpiexec -n 2 \
python3 -m pecos.distributed.xmc.xlinear.train \
-x xmc-base/wiki10-31k/tfidf-attnxml/X.trn.npz \
-y xmc-base/wiki10-31k/Y.trn.npz \
-m work_dir/dist_xlinear_model

08/05/2022 01:53:16 - INFO - pecos.utils.profile_util - psutil module installed, will print memory info.
08/05/2022 01:53:16 - INFO - pecos.utils.profile_util - psutil module installed, will print memory info.
08/05/2022 01:53:16 - INFO - __main__ - Started loading data on Rank 0 ... RSS 89.9 MB. Full mem info: pmem(rss=94277632, vms=805863424, shared=45461504, text=2732032, lib=0, data=156913664, dirty=0)
08/05/2022 01:53:16 - INFO - __main__ - Started loading data on Rank 1 ... RSS 89.7 MB. Full mem info: pmem(rss=94011392, vms=805867520, shared=45207552, text=2732032, lib=0, data=156917760, dirty=0)
08/05/2022 01:53:16 - INFO - __main__ - Done loading data on Rank 1. RSS 166.4 MB. Full mem info: pmem(rss=174432256, vms=884928512, shared=45535232, text=2732032, lib=0, data=235978752, dirty=0)
08/05/2022 01:53:16 - INFO - __main__ - Done loading data on Rank 0. RSS 166.5 MB. Full mem info: pmem(rss=174637056, vms=884924416, shared=45735936, text=2732032, lib=0, data=235974656, dirty=0

We didn't set up a multi-node cluster, so only a single machine is used here. In practice, you can store your machines' addresses in `hostfile` and run the distributed training via
```
mpiexec -f hostfile -n ${NUM_MACHINE} python3 -m pecos.distributed.xmc.xlinear.train [..]
```

The model trained in distributed mode is serialized in the same way as a model trained on a single node, so we can predict with it and evaluate it in the same way:

In [15]:
%%bash
python3 -m pecos.xmc.xlinear.predict \
-x xmc-base/wiki10-31k/tfidf-attnxml/X.tst.npz \
-y xmc-base/wiki10-31k/Y.tst.npz \
-m work_dir/dist_xlinear_model

==== evaluation results ====
prec   = 84.05 78.20 72.57 67.90 63.90 60.17 56.86 53.87 51.31 48.82
recall = 4.97 9.17 12.63 15.62 18.26 20.52 22.48 24.25 25.88 27.25


## Distributed Training Algorithm

Because the PECOS XR-Linear model is separable, we can split the original problem into multiple independent problems:
* **One meta-problem**: XMC problem to match an input X to K clusters
* **K sub-problems**: XMC problem to rank the labels in one of the K clusters for an input X.
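The split described above can be sketched on toy data. Given a label-to-cluster assignment, the meta-problem keeps only which clusters are relevant to each instance, while each sub-problem keeps the labels inside one cluster. `split_into_meta_and_sub_problems` is a hypothetical helper, and the sketch only tracks positive labels; how negatives are chosen for each sub-problem is outside its scope.

```python
from collections import defaultdict


def split_into_meta_and_sub_problems(Y, label_to_cluster):
    """Split an XMC label matrix into one meta-problem and per-cluster sub-problems.

    Y: list of label-id sets, one per instance.
    label_to_cluster: label_to_cluster[l] is the cluster id of label l.
    Returns (meta_Y, sub_problems):
      meta_Y[i]          set of clusters relevant to instance i
      sub_problems[c][i] labels of instance i that fall inside cluster c
    """
    meta_Y = []
    sub_problems = defaultdict(dict)
    for i, labels in enumerate(Y):
        clusters = set()
        for label in labels:
            c = label_to_cluster[label]
            clusters.add(c)
            sub_problems[c].setdefault(i, set()).add(label)
        meta_Y.append(clusters)
    return meta_Y, dict(sub_problems)


Y = [{0, 1}, {2}, {1, 3}]        # 3 instances, 4 labels
label_to_cluster = [0, 0, 1, 1]  # K = 2 clusters
meta_Y, sub_problems = split_into_meta_and_sub_problems(Y, label_to_cluster)
print(meta_Y)
print(sub_problems)
```

Once the split is made, the sub-problems share no labels, which is what allows them to be trained on different workers independently.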



<img src="imgs/dist-xlinear.png" width=600 height=600 />

In addition to distributed training, `pecos.distributed.xmc.xlinear` also has the following features:

* **Distributed Hierarchical Clustering**: We leverage the same meta/sub-problem split to build the hierarchical label tree. Since building label features for a huge dataset can be memory intensive for the meta node, we provide an option to use a simpler label embedding for meta-tree generation: `--meta-label-embedding-method pii`
* **Load Balancing**: Because of the long-tail distribution in most XMC problems, the workload to train each sub-problem varies a lot. To address that, the distributed training algorithm does load balancing when K > #workers. The number of sub-trees K can be controlled via `--min-n-sub-tree`.
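To see why having more sub-problems than workers helps, consider a simple greedy scheduler that always hands the next-largest sub-problem to the currently least-loaded worker. This is a generic longest-processing-time heuristic used here for illustration; it is not necessarily the strategy `pecos.distributed` uses, and `assign_subproblems` and the workload numbers are made up.

```python
import heapq


def assign_subproblems(workloads, num_workers):
    """Greedy load balancing: largest sub-problem first, least-loaded worker first.

    workloads: estimated cost of each sub-problem (e.g. its number of labels).
    Returns (assignment, loads) where assignment[w] lists the sub-problem ids
    given to worker w and loads[w] is that worker's total cost.
    """
    assignment = [[] for _ in range(num_workers)]
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    order = sorted(range(len(workloads)), key=lambda i: workloads[i], reverse=True)
    for sub_id in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(sub_id)
        heapq.heappush(heap, (load + workloads[sub_id], worker))
    loads = [sum(workloads[i] for i in subs) for subs in assignment]
    return assignment, loads


workloads = [9, 7, 6, 5, 4, 3]  # K = 6 sub-problems with uneven cost
assignment, loads = assign_subproblems(workloads, num_workers=2)
print(assignment, loads)
```

With K equal to the number of workers, each worker would be stuck with whatever single sub-problem it received; a larger K gives the scheduler smaller pieces to even out the long tail.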
