# Ensembling with Triton-FIL

## Introduction
This notebook walks step by step through training an ensemble of models and deploying it with Triton's FIL backend, building on the [Fraud Detection Notebook](https://github.com/triton-inference-server/fil_backend/blob/main/notebooks/categorical-fraud-detection/Fraud_Detection_Example.ipynb). We will use Triton's Python backend to combine the models. This requires a Python model that executes inference requests on each model in the ensemble. The notebook focuses on creating that Python model and on submitting requests to it with the Triton Python client.

__NOTE__: Currently, GPU support is not available for the Python backend. This is due to a DLPack update that has yet to be implemented. However, the process would be virtually the same, except the Python backend Tensor objects would have to be converted to DLPack with the methods found [here](https://github.com/triton-inference-server/fil_backend/blob/main/notebooks/categorical-fraud-detection/Fraud_Detection_Example.ipynb).

## Pre-Requisites
This notebook assumes that you have Docker plus a few Python dependencies, as in the Fraud Detection Notebook. In addition, it uses Treelite, because we will be deploying a Scikit-Learn model, which, unlike XGBoost models, is not natively supported by the FIL backend. To install all of these dependencies in a conda environment, you may use the following conda environment file:
```yaml
---
name: triton_ensemble_nb
channels:
  - conda-forge
  - nvidia
  - rapidsai
dependencies:
  - cudatoolkit=11.4
  - cudf=21.12
  - cuml=21.12
  - cupy
  - imbalanced-learn
  - jupyter
  - kaggle
  - matplotlib
  - numpy
  - pandas
  - pip
  - python=3.8
  - scikit-learn
  - pip:
      - treelite==2.3.0
      - tritonclient[all]
      - xgboost>=1.5,<1.6
```

In [1]:
TRITON_IMAGE = 'nvcr.io/nvidia/tritonserver:22.05-py3'

In [2]:
!docker pull {TRITON_IMAGE}

22.05-py3: Pulling from nvidia/tritonserver
Digest: sha256:a85daa2907f46e70b3782818a0331df62d9b4e0b1f15f1530b2a52c8c782d46d
Status: Image is up to date for nvcr.io/nvidia/tritonserver:22.05-py3
nvcr.io/nvidia/tritonserver:22.05-py3


## Fetching Training Data
As in the Fraud Detection Notebook, we will make use of data from the [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/overview) Kaggle competition.

**NOTE**: You will need to make sure that your Kaggle credentials are [available](https://github.com/Kaggle/kaggle-api#api-credentials) either through a kaggle.json file or via environment variables.

In [3]:
!kaggle competitions download -c ieee-fraud-detection
!unzip -u ieee-fraud-detection.zip
train_csv = 'train_transaction.csv'

ieee-fraud-detection.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  ieee-fraud-detection.zip


## Training Example Models
We will train two XGBoost models and one Scikit-Learn model for our ensemble. The first XGBoost training function is the same as the one in the Fraud Detection Notebook. The second is similar to the first, except that it applies random oversampling to the training data before fitting. The third model is a Scikit-Learn random forest classifier.
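To make the oversampling step concrete, here is a minimal NumPy sketch of what random oversampling does for a binary target. It is not imbalanced-learn's implementation; the function name `random_oversample` and its `ratio` argument are hypothetical, with `ratio` playing the role that `sampling_strategy=0.5` plays below: the minority class is resampled with replacement until it is half the size of the majority class.

```python
import numpy as np


def random_oversample(X, y, ratio=0.5, seed=0):
    """Duplicate random minority rows until minority/majority == ratio.

    Hypothetical helper for illustration only; assumes a binary target.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]

    # Number of minority rows we want after resampling
    target = int(ratio * counts.max())
    n_extra = max(target - counts.min(), 0)

    # Draw the extra rows from the minority class, with replacement
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)

    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]


# Toy data: 100 negatives and 10 positives
X_toy = np.arange(220).reshape(110, 2)
y_toy = np.array([0] * 100 + [1] * 10)
X_res, y_res = random_oversample(X_toy, y_toy, ratio=0.5)
print(np.bincount(y_res))
```

Only the training data is resampled; the test split keeps its original class balance so that evaluation reflects the real distribution.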

In [4]:
import cudf
import cupy as cp
from cuml.preprocessing import SimpleImputer, LabelEncoder
from sklearn.model_selection import train_test_split

SEED=0

In [5]:
# Reading data
data = cudf.read_csv(train_csv)

# Replace NaNs in data
nan_columns = data.columns[data.isna().any().to_pandas()]
float_nan_subset = data[nan_columns].select_dtypes(include='float64')

imputer = SimpleImputer(missing_values=cp.nan, strategy='mean')
data[float_nan_subset.columns] = imputer.fit_transform(float_nan_subset)

obj_nan_subset = data[nan_columns].select_dtypes(include='object')
data[obj_nan_subset.columns] = obj_nan_subset.fillna('UNKNOWN')

In [6]:
# Perform label encoding
cat_columns = data.select_dtypes(include='object')
for col in cat_columns.columns:
    data[col] = LabelEncoder().fit_transform(data[col])

In [7]:
# Split data into training and testing sets
X = data.drop('isFraud', axis=1)
y = data.isFraud.astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X.to_pandas(), y.to_pandas(), test_size=0.3, stratify=y.to_pandas(), random_state=SEED
)
# Copy data to avoid slowdowns due to fragmentation
X_train = X_train.copy()
X_test = X_test.copy()

In [8]:
import xgboost as xgb
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier

In [9]:
def train_model_logistic(num_trees, max_depth):
    model = xgb.XGBClassifier(
        tree_method='gpu_hist',
        enable_categorical=False,
        use_label_encoder=False,
        predictor='gpu_predictor',
        eval_metric='aucpr',
        objective='binary:logistic',
        max_depth=max_depth,
        n_estimators=num_trees
    )
    model.fit(
        X_train,
        y_train,
    )
    return model

In [10]:
def train_model_oversample(num_trees, max_depth):
    model = xgb.XGBClassifier(
        tree_method='gpu_hist',
        enable_categorical=False,
        use_label_encoder=False,
        predictor='gpu_predictor',
        eval_metric='aucpr',
        objective='binary:logistic',
        max_depth=max_depth,
        n_estimators=num_trees
    )
    
    oversample = RandomOverSampler(sampling_strategy=0.5) # Define oversampling strategy
    X_over, y_over = oversample.fit_resample(X_train, y_train)
    
    model.fit(
        X_over,
        y_over,
    )
    return model

In [11]:
def train_model_RFC(num_trees, max_depth):
    model = RandomForestClassifier(
        n_estimators=num_trees,
        max_depth=max_depth,
    )
    model.fit(
        X_train,
        y_train,
    )
    return model

In [12]:
model_logistic = train_model_logistic(1500, 14)

In [13]:
model_oversample = train_model_oversample(500, 14)

In [14]:
model_RFC = train_model_RFC(40, 16)

In [15]:
# Freeing up room on GPU
import gc
del data
del nan_columns
del float_nan_subset
del imputer
del obj_nan_subset
del cat_columns
del X
del y
gc.collect()

61

## Preparing Models for Deployment
We will prepare the models for deployment to Triton using the same process as in the previous notebook. First, we will serialize the models, then add the configuration files.

### Model Serialization
Once again, certain model formats are not natively supported by the FIL backend, so they must be directly serialized to Treelite's checkpoint format. Additionally, it is important to ensure that the correct filename is given to each model, depending on its format.
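As a small illustration of the naming rule, the sketch below maps the two model formats used in this notebook to the default filenames that appear in the serialization cells, and builds the `<repository>/<model>/<version>/<file>` path. The helper name `model_file_path` and the `DEFAULT_FILENAMES` table are hypothetical and only cover the formats used here.

```python
import os

# Default filenames for the two formats used in this notebook
DEFAULT_FILENAMES = {
    "xgboost_json": "xgboost.json",
    "treelite_checkpoint": "checkpoint.tl",
}


def model_file_path(repo_path, model_name, model_format, version=1):
    """Return the path where a serialized model should be written."""
    try:
        filename = DEFAULT_FILENAMES[model_format]
    except KeyError:
        raise ValueError(f"No default filename known for {model_format!r}")
    return os.path.join(repo_path, model_name, str(version), filename)


print(model_file_path("model_repository", "model_RFC", "treelite_checkpoint"))
```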

In [16]:
import os
import treelite
import pickle

# Create the model repository directory. The name of this directory is arbitrary.
REPO_PATH = os.path.abspath('model_repository')
os.makedirs(REPO_PATH, exist_ok=True)

# We will use the following variables to record information from the serialization
# process that we will require later
model_path = None
model_format = None

In [17]:
def serialize_model_xgb(model, model_name):
    model_dir = os.path.join(REPO_PATH, model_name) # Creating model repository
    version_dir = os.path.join(model_dir, '1') # Creating version 1 directory
    os.makedirs(version_dir, exist_ok=True)
    
    # This is the default filename for XGBoost models saved in json format. It is recommended
    # that you stick with the default to avoid additional configuration.
    model_file = os.path.join(version_dir, 'xgboost.json')
    model.save_model(model_file)
    
    return model_dir

In [18]:
def serialize_model_skl(model, model_name):
    model_dir = os.path.join(REPO_PATH, model_name)
    version_dir = os.path.join(model_dir, '1')
    os.makedirs(version_dir, exist_ok=True)
    
    # Since Treelite provides no compatibility guarantees between different versions, it is recommended that you
    # save models in Pickle or Joblib.
    archival_path = os.path.join(version_dir, 'model.pkl')
    with open(archival_path,"wb") as f:
        pickle.dump(model, f)
    
    # This is the default filename expected for Treelite checkpoint models. It is recommended
    # that you stick with the default to avoid additional configuration.
    model_file = os.path.join(version_dir, 'checkpoint.tl')
        
    tl_model = treelite.sklearn.import_model(model)
    tl_model.serialize(model_file)
    
    return model_dir

In [19]:
model_logistic_dir = serialize_model_xgb(model_logistic, 'model_logistic')
model_logistic_cpu_dir = serialize_model_xgb(model_logistic, 'model_logistic-cpu')
model_oversample_dir = serialize_model_xgb(model_oversample, 'model_oversample')
model_oversample_cpu_dir = serialize_model_xgb(model_oversample, 'model_oversample-cpu')
model_RFC_dir = serialize_model_skl(model_RFC, 'model_RFC')
model_RFC_cpu_dir = serialize_model_skl(model_RFC, 'model_RFC-cpu')



### The Configuration File
We will set up the configuration file in the same manner as the previous notebook, except we will add a parameter for model format, since we're using multiple formats (in this case, XGBoost and Scikit-Learn). 

Once again, you can read about the FIL backend's configuration options [here](https://github.com/triton-inference-server/fil_backend#configuration).

In [20]:
features = X_test.shape[1]
num_classes = cp.unique(y_test).size
bytes_per_sample = (features + num_classes) * 4
max_batch_size = 60_000_000 // bytes_per_sample
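As a worked example of the arithmetic above, assume 393 feature columns and 2 classes (both are assumptions about this dataset; check `X_test.shape[1]` and the number of unique labels in your own run). Each sample then needs one 4-byte float per input feature plus one per output value, and the batch size is capped so that a single batch stays within the 60,000,000-byte budget used in the cell above.

```python
features = 393       # assumed number of input columns
num_classes = 2      # assumed: fraud / not fraud

# 4 bytes per value because inputs and outputs are TYPE_FP32
bytes_per_sample = (features + num_classes) * 4
max_batch_size = 60_000_000 // bytes_per_sample

print(bytes_per_sample)  # 1580
print(max_batch_size)    # 37974
```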

In [21]:
def generate_config(model_dir, model_format, storage_type, deployment_type='gpu'):
    if deployment_type.lower() == 'cpu':
        instance_kind = 'KIND_CPU'
    else:
        instance_kind = 'KIND_GPU'
        
    config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [                                 
 {{  
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {features} ]                    
  }} 
]
output [
 {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ {num_classes} ]
  }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
  {{
    key: "model_type"
    value: {{ string_value: "{model_format}" }}
  }},
  {{
    key: "predict_proba"
    value: {{ string_value: "true" }}
  }},
  {{
    key: "output_class"
    value: {{ string_value: "true" }}
  }},
  {{
    key: "threshold"
    value: {{ string_value: "0.5" }}
  }},
  {{
    key: "storage_type"
    value: {{ string_value: "{storage_type}" }}
  }}
]

dynamic_batching {{
  max_queue_delay_microseconds: 100
}}"""
    config_path = os.path.join(model_dir, 'config.pbtxt')
    with open(config_path, 'w') as file_:
        file_.write(config_text)

    return config_path

In [22]:
generate_config(model_logistic_dir, deployment_type='gpu', model_format='xgboost_json', storage_type='AUTO')
generate_config(model_logistic_cpu_dir, deployment_type='cpu', model_format='xgboost_json', storage_type='AUTO')
generate_config(model_oversample_dir, deployment_type='gpu', model_format='xgboost_json', storage_type='SPARSE')
generate_config(model_oversample_cpu_dir, deployment_type='cpu', model_format='xgboost_json', storage_type='SPARSE')
generate_config(model_RFC_dir, deployment_type='gpu', model_format='treelite_checkpoint', storage_type='AUTO')
generate_config(model_RFC_cpu_dir, deployment_type='cpu', model_format='treelite_checkpoint', storage_type='AUTO')

'/home/nfs/enarvades/fil_backend/notebooks/ensembling/model_repository/model_RFC-cpu/config.pbtxt'

## Ensembling
We will use Triton's [Python backend](https://github.com/triton-inference-server/python_backend) to implement the actual ensembling. As a result, the ensembling runs on Triton rather than locally.

### The Python Model
Here is where we will implement the ensembling for our models. We are able to execute inference requests on our other models while executing the Python model. We will be building off of this [example](https://github.com/triton-inference-server/python_backend/tree/main/examples/bls) in order to implement our Python model. 

For the ensembling process, we must first execute requests on each of our models. For each model, we send an asynchronous request to Triton. The pending requests are collected in `inference_response_awaits`; once all of them have completed, their results are available in `inference_responses`.

Finally, we convert our output tensors to numpy and perform ensembling on the numpy arrays, just like we would do locally. The final ensembled tensor is then returned as a response.
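The fan-out and averaging pattern can be tried locally without Triton. In the sketch below, `fake_infer` and the `FAKE_OUTPUTS` table are hypothetical stand-ins for the asynchronous inference requests; only the `asyncio.gather` call and the NumPy averaging mirror what the Python model does.

```python
import asyncio

import numpy as np

# Made-up class probabilities for two samples from three stand-in models
FAKE_OUTPUTS = {
    "model_a": np.array([[0.9, 0.1], [0.2, 0.8]], dtype=np.float32),
    "model_b": np.array([[0.6, 0.4], [0.4, 0.6]], dtype=np.float32),
    "model_c": np.array([[0.3, 0.7], [0.0, 1.0]], dtype=np.float32),
}


async def fake_infer(model_name, batch):
    """Hypothetical stand-in for an asynchronous inference request."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return FAKE_OUTPUTS[model_name][: len(batch)]


async def ensemble(batch, model_names):
    # Start every request first, then wait for all of them together
    pending = [fake_infer(name, batch) for name in model_names]
    outputs = await asyncio.gather(*pending)
    # gather preserves order, so outputs[i] belongs to model_names[i]
    return np.mean(np.stack(outputs), axis=0)


batch = np.zeros((2, 3), dtype=np.float32)
# Inside Jupyter, use `await ensemble(...)` instead of asyncio.run
result = asyncio.run(ensemble(batch, list(FAKE_OUTPUTS)))
print(result)
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the Python model can index `inference_responses` positionally to recover each model's output.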

In [23]:
python_text = """import triton_python_backend_utils as pb_utils
import json
import asyncio
from torch.utils.dlpack import from_dlpack

class TritonPythonModel:
    def initialize(self, args): 
        self.model_config = json.loads(args['model_config'])

    async def execute(self, requests):

        responses = []

        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "input__0")
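            deployment_tensor = pb_utils.get_input_tensor_by_name(request, "deployment_type")
            # TYPE_STRING inputs arrive as a NumPy array of bytes objects, so extract
            # and decode the single element before comparing it.
            deployment_type = deployment_tensor.as_numpy().flatten()[0]
            if isinstance(deployment_type, bytes):
                deployment_type = deployment_type.decode("utf-8")

            GPU_models = ['model_logistic', 'model_oversample', 'model_RFC']
            CPU_models = ['model_logistic-cpu', 'model_oversample-cpu', 'model_RFC-cpu']

            models = GPU_models if deployment_type.upper() == "GPU" else CPU_models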

            inference_response_awaits = []

            for model_name in models:
                infer_request = pb_utils.InferenceRequest(
                    model_name=model_name,
                    requested_output_names=["output__0"],
                    inputs=[in_0])

                inference_response_awaits.append(infer_request.async_exec())

            inference_responses = await asyncio.gather(
                *inference_response_awaits)

            for infer_response in inference_responses:
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message())

            logistic_tensor = pb_utils.get_output_tensor_by_name(
                inference_responses[0], "output__0")

            oversample_tensor = pb_utils.get_output_tensor_by_name(
                inference_responses[1], "output__0")
            
            RFC_tensor = pb_utils.get_output_tensor_by_name(
                inference_responses[2], "output__0")

            def tensor_to_numpy(tensor):
                # CPU tensors convert directly; GPU tensors are moved to host
                # memory through DLPack and PyTorch first.
                if tensor.is_cpu():
                    return tensor.as_numpy()
                return from_dlpack(tensor.to_dlpack()).cpu().numpy()

            # Average the class probabilities of the three models
            ensemble = (tensor_to_numpy(logistic_tensor)
                        + tensor_to_numpy(oversample_tensor)
                        + tensor_to_numpy(RFC_tensor)) / 3
            ensembled_tensor = pb_utils.Tensor("output__0", ensemble)

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[ensembled_tensor])
            responses.append(inference_response)

        return responses

    def finalize(self):
        print('Cleaning up...')
        
"""

# We will create a new directory for this model called "model_ensemble".
python_dir = os.path.join(REPO_PATH, 'model_ensemble')
python_version_dir = os.path.join(python_dir, '1')
os.makedirs(python_version_dir, exist_ok=True)

# Next, we write out the python model to the directory we created. 
python_path = os.path.join(python_version_dir, 'model.py')
with open(python_path, 'w') as file_:
    file_.write(python_text)

### The Configuration File
The Python backend requires a configuration file in order to deploy on Triton. To use this backend, we set the `backend` field of this file to `"python"`.

In [38]:
features = X_test.shape[1]
num_classes = cp.unique(y_test).size

In [39]:
# No max_batch_size is set for this model, so the batch dimension is part of
# dims; -1 lets the number of rows vary between requests.
config_text = f"""name: "model_ensemble"
backend: "python"

input [
  {{
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ -1, {features} ]
  }}
]

input [
  {{
    name: "deployment_type"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }}
]

output [
  {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ -1, {num_classes} ]
  }}
]

parameters: {{
  key: "EXECUTION_ENV_PATH",
  value: {{string_value: "$$TRITON_MODEL_DIRECTORY/pytorch.tar.gz"}}
}}

instance_group [{{ kind: KIND_CPU }}]

"""

config_path = os.path.join(python_dir, 'config.pbtxt')
with open(config_path, 'w') as file_:
    file_.write(config_text)

### Starting the server
As in the previous notebook, we will start the server, wait until it comes online, and check the logs for warnings or errors.

In [40]:
!docker run --gpus all -d -p 8000:8000 -p 8001:8001 -p 8002:8002 -v {REPO_PATH}:/models --name tritonserver {TRITON_IMAGE} tritonserver --model-repository=/models

docker: Error response from daemon: Conflict. The container name "/tritonserver" is already in use by container "8fcce29e2f3b3fdc0d1437b902b1ff5547836bfe65c4678c37222346c3254ba9". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.


In [41]:
import tritonclient.grpc as triton_grpc
from tritonclient import utils as triton_utils
HOST = 'localhost'
PORT = 8001
TIMEOUT = 60

In [42]:
client = triton_grpc.InferenceServerClient(url=f'{HOST}:{PORT}')

In [43]:
import time
time.sleep(30) # Wait for server to come online
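A fixed sleep can be too short on a slow machine and needlessly long on a fast one. As an alternative, here is a minimal polling sketch. The helper `wait_until_ready` is hypothetical; it accepts any zero-argument callable that returns `True` once the server is up, and treats exceptions (for example, a refused connection) as "not ready yet".

```python
import time


def wait_until_ready(is_ready, timeout=60.0, interval=1.0):
    """Poll `is_ready()` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if is_ready():
                return True
        except Exception:
            # The server may not be accepting connections yet
            pass
        time.sleep(interval)
    return False


# Example with a stand-in check that succeeds on the third attempt
attempts = {"count": 0}


def ready_after_three_tries():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until_ready(ready_after_three_tries, timeout=5.0, interval=0.01))
```

With the client created above, a call such as `wait_until_ready(client.is_server_ready, timeout=TIMEOUT)` would poll the running server and put the otherwise unused `TIMEOUT` constant to work.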

In [44]:
!docker logs tritonserver


== Triton Inference Server ==

NVIDIA Release 22.05 (build 38317651)
Triton Server Version 2.22.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

  Using driver version 495.29.05 which has support for CUDA 11.5.  This container
  was built with CUDA 11.7 and will be run in Minor Version Compatibility mode.
  CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
  with this container but was unavailable:
  [[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
  See https://docs.nvidia.com/deploy/

## Submitting inference requests
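The process for submitting inference requests differs slightly from the previous notebook: in addition to the feature array `input__0`, each request to the ensemble must carry a `deployment_type` string input, which selects either the GPU or the CPU copies of the models.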

In [None]:
# Taking care of categorical features
import pandas as pd
import numpy as np
def convert_to_numpy(df):
    df = df.copy()
    cat_cols = df.select_dtypes('category').columns
    for col in cat_cols:
        df[col] = df[col].cat.codes
    for col in df.columns:
        df[col] =  pd.to_numeric(df[col], downcast='float')
    return df.values

In [None]:
np_data = convert_to_numpy(X_test)

In [None]:
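def triton_predict(model_name, arr, deployment_type='GPU'):
    # String inputs use the 'BYTES' datatype in the Triton client and are
    # passed as NumPy object arrays.
    deployment_arr = np.array([deployment_type.encode('utf-8')], dtype=np.object_)
    triton_inputs = [triton_grpc.InferInput('input__0', arr.shape, 'FP32'),
                     triton_grpc.InferInput('deployment_type', deployment_arr.shape, 'BYTES')]
    triton_inputs[0].set_data_from_numpy(arr)
    triton_inputs[1].set_data_from_numpy(deployment_arr)
    triton_output = triton_grpc.InferRequestedOutput('output__0')
    response = client.infer(model_name, inputs=triton_inputs, outputs=[triton_output])
    return response.as_numpy('output__0')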

In [None]:
triton_result = triton_predict('model_ensemble', np_data[0:5])
local_result = (model_logistic.predict_proba(X_test[0:5]) + 
                model_oversample.predict_proba(X_test[0:5]) + 
                model_RFC.predict_proba(X_test[0:5])) / 3

print("Result computed on Triton: ")
print(triton_result)

print("Result computed locally: ")
print(local_result)
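Rather than comparing the two printed arrays by eye, the results can be checked numerically. The sketch below is one way to do that; the helper `compare_predictions` and the tolerance of `1e-4` are assumptions, chosen because small floating-point differences between the served and local models are to be expected.

```python
import numpy as np


def compare_predictions(remote, local, atol=1e-4):
    """Return (within_tolerance, max_abs_difference) for two result arrays."""
    remote = np.asarray(remote, dtype=np.float64)
    local = np.asarray(local, dtype=np.float64)
    if remote.shape != local.shape:
        raise ValueError(f"Shape mismatch: {remote.shape} vs {local.shape}")
    max_diff = float(np.max(np.abs(remote - local)))
    return max_diff <= atol, max_diff


# Toy arrays standing in for triton_result and local_result
remote_toy = np.array([[0.98, 0.02], [0.10, 0.90]], dtype=np.float32)
local_toy = np.array([[0.98001, 0.01999], [0.10, 0.90]])
print(compare_predictions(remote_toy, local_toy))
```

In the notebook, `compare_predictions(triton_result, local_result)` would perform the same check on the real outputs.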

In [32]:
# Shut down the server
!docker rm -f tritonserver

tritonserver


## Conclusion

We have demonstrated how to use Triton's FIL backend together with its Python backend to serve an ensemble of models. While the example focuses on averaging, any ensembling technique could be implemented with the steps above. Furthermore, any type of model can be used, including XGBoost, cuML, Scikit-Learn, LightGBM, and any format that can be converted to Treelite's checkpoint format. If your Python model requires additional dependencies, the Python backend supports [custom Python execution environments](https://github.com/triton-inference-server/python_backend#using-custom-python-execution-environments).
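As one example of a different ensembling technique, the sketch below replaces the plain average with a weighted average of class probabilities. The function name `weighted_soft_vote` and the weights are hypothetical; in the Python model, the same computation would replace the line that divides the summed outputs by 3.

```python
import numpy as np


def weighted_soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities.

    probabilities: list of arrays, each of shape (n_samples, n_classes)
    weights: one non-negative weight per model (normalized internally)
    """
    stacked = np.stack(probabilities)  # (n_models, n_samples, n_classes)
    w = np.asarray(weights, dtype=stacked.dtype)
    w = w / w.sum()
    # Contract the model axis: result has shape (n_samples, n_classes)
    return np.tensordot(w, stacked, axes=1)


p_first = np.array([[1.0, 0.0]])
p_second = np.array([[0.0, 1.0]])
print(weighted_soft_vote([p_first, p_second], weights=[3, 1]))
```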

For more information, we recommend viewing the [FIL backend documentation](https://github.com/triton-inference-server/fil_backend#triton-inference-server-fil-backend) as well as the [Python backend documentation](https://github.com/triton-inference-server/python_backend).