# Enabling Deep Learning in Gravitational Wave Physics With Inference-as-a-Service
## Alec Gunny

## Deep Learning in Gravitational Wave Astronomy
<img src="images/gw-dl-use-cases.png" height="auto" width="980px" style="top:10px;"></img>

## Overview
- [Introduction](#Introduction)
- A naive model
- Inference-as-a-service concepts
- Results on LIGO data
- Next steps

# Introduction
## Deep learning inference - concepts and challenges

### Deep Learning Inference
- Model has been trained - how do we use it to do the task it was trained for?
    - i.e. make **inferences** about the likelihood of some quantity conditioned on new data
- What are the demands of the environment in which the task needs to be performed?
    - Do inferences need to be returned quickly once the data becomes available?
        - Low-**latency** or **online** inference
        - E.g. event detection for MMA triggers
    - Do we need to perform a lot of inferences and only care how long it takes them _all_ to finish?
        - High-**throughput** or **offline** inference
        - E.g. searches through archival data or model validation
    - What are our constraints in the total **expense** we can afford to incur for processing?
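
The difference between the two regimes can be made concrete with a little arithmetic. The sketch below uses made-up timings (not measurements from this talk) and an idealized assumption that concurrent instances do not slow each other down; the function name is hypothetical.

```python
def latency_and_throughput(batch_size, seconds_per_batch, concurrent_instances=1):
    """Return (latency in s, throughput in inferences/s) for an idealized server.

    Assumes each instance processes one batch at a time and that
    instances do not interfere with one another, which real
    hardware only approximates.
    """
    latency = seconds_per_batch
    throughput = batch_size * concurrent_instances / seconds_per_batch
    return latency, throughput


# hypothetical numbers, chosen only for illustration
online = latency_and_throughput(batch_size=1, seconds_per_batch=0.002)
offline = latency_and_throughput(batch_size=512, seconds_per_batch=0.05)

print(f"online:  {online[0] * 1e3:.0f} ms latency, {online[1]:.0f} inferences/s")
print(f"offline: {offline[0] * 1e3:.0f} ms latency, {offline[1]:.0f} inferences/s")
```

Under these assumed numbers, the large-batch configuration answers each request more slowly but completes far more inferences per second, which is the trade an offline search would usually accept.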

### Deep Learning Inference - Requirements
- The set of operations that constitute the model
- The optimized (trained) set of model parameters or "weights"
- Hardware capable of meeting latency/throughput/expense requirements
    - CPUs, GPUs, TPUs, FPGAs
- Data
- Pipeline which feeds data into the model and does something with the corresponding outputs
    - Model is just a _component_ of this pipeline

### Deep Learning Inference - Challenges
<div class="float200 padded">
    <img src="images/hardware-logos.png" width="300px" height="auto" class="right unpadded"/>
    <p class="left">Access to and familiarity with accelerated hardware</p>
</div>

<div>
    <img src="images/framework-logos.png" width="300px" height="auto" class="left padded" />
    <p class="right">Managing, leveraging, and translating across deep learning software stacks</p>
</div>

<div></div>
<div></div>
<div class="float100">
    <img src="images/distributing-nns.png" width="250px" height="auto" class="right unpadded"/>
    <p class="left">Distributing updated models to dependent users and applications</p>
</div>

### Not just inference problems
- Large-scale offline validation required to quantify advantages of new research and automate model re-training
    - Faster processing times mean faster iteration on novel ideas
- Deployment on "real", uncontrolled data critical to:
    - identifying failure modes and improving understanding of model behavior
    - evaluating the correlation between validation metrics and true success criterion
        - Identify and remove signal leakage in training pipelines
        - Align training and test settings - can our algorithm optimize our latency/throughput/expense cost function better than its alternative?
- O4 run will collect !X! amount of data per day
    - Answering these questions now will ensure that we have the right combination of systems/algorithms in place to extract as much physics as possible

# The traditional deep learning inference model
### Building an example pipeline using PyTorch
- Illustrate the challenges outlined above
- Motivate the requirements of an improved paradigm

Begin with a few imports

In [1]:
import time
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

import numpy as np
import torch

# some functionality will necessarily be
# buried in here. If you're interested,
# feel free to take a look
import utils

Define our model: a simple multi-layer perceptron

In [2]:
class MLP(torch.nn.Module):
    def __init__(self, input_size, hidden_sizes):
        super().__init__()

        self.layers = torch.nn.ModuleList()
        for size in hidden_sizes:
            self.layers.append(torch.nn.Linear(input_size, size))
            self.layers.append(torch.nn.ReLU())
            input_size = size

        self.layers.append(torch.nn.Linear(input_size, 1))
        self.layers.append(torch.nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

During training:
- instantiate an instance of this model then optimize its parameters
- export these optimized parameters for use later

In [None]:
INPUT_SIZE = 64
HIDDEN_SIZES = [256, 128, 64]
model = MLP(INPUT_SIZE, HIDDEN_SIZES)

# typically do some training here
# for i in range(num_epochs):
#    for x in dataset:
#        do_a_gradient_step(model, x)
# now our model has optimized parameters

# export these parameter values somewhere
torch.save(model.state_dict(), "model.pt")

At inference time, load in these optimized model weights and use them to map inputs to outputs

In [3]:
# build the operations that constitute a model
inference_model = MLP(INPUT_SIZE, HIDDEN_SIZES).cuda(0)

# set the values of the parameters of these
# operations to their optimized values
inference_model.load_state_dict(torch.load("model.pt"))

# create an array on the CPU
x = np.random.randn(INPUT_SIZE).astype("float32")
with torch.no_grad():
    # move it on to the GPU
    x = torch.from_numpy(x).cuda(0)

    # use the model to infer predictions on the GPU
    y = inference_model(x)

# move these predictions back to the CPU
# for downstream processing
y.cpu().numpy()

array([0.5128155], dtype=float32)

#### Inference on a dataset
- Generally interested in using the model for many thousands or millions of inferences
- Start with the simplest case: a dataset that can fit into memory at once

In [4]:
N = 5 * 10**5  # number of observations in our dataset
dataset = np.random.randn(N, 64).astype("float32")

@torch.no_grad()
def do_some_inference(model, dataset, batch_size=8, device_index=0):
    # move the data to the GPU in bulk
    gpu_dataset = torch.from_numpy(dataset).cuda(device_index)

    # iterate through it in batches and yield predictions
    dataset = torch.utils.data.TensorDataset(gpu_dataset)
    for [x] in torch.utils.data.DataLoader(dataset, batch_size=batch_size):
        y = model(x)
        yield y.cpu().numpy()

# run through the dataset and get a rough time estimate
%time outputs = [y for y in do_some_inference(inference_model, dataset)]

CPU times: user 17.8 s, sys: 198 ms, total: 18 s
Wall time: 18 s


Seems to work pretty fast, but how well are we utilizing the GPU?

In [5]:
with utils.GpuUtilProgress(gpu_ids=0) as progbar:
    task_id = progbar.add_task("[cyan]Inference", total=N)

    outputs = []
    for y in do_some_inference(inference_model, dataset):
        outputs.append(y)
        progbar.update(task_id, advance=len(y))

output = np.concatenate(outputs, axis=0)


Yikes, only around 20%!
- GPUs are expensive, can we improve things via parallel execution (assuming we can't change the batch size)?
- First attempt: Naive (and sloppy) implementation using threading:

In [6]:
q = Queue()

def task(dataset_chunk, device_index):
    # for each inference task, create a copy of the
    # model on the indicated GPU device
    model = MLP(INPUT_SIZE, HIDDEN_SIZES).cuda(device_index)
    model.load_state_dict(torch.load("model.pt"))

    # iterate through our dataset and send
    # the results back to the main thread
    for y in do_some_inference(
        model, dataset_chunk, device_index=device_index
    ):
        q.put(y)


Run this task on multiple parallel threads (if we tried to use processes, Torch would complain):

In [None]:
num_jobs = 4

with utils.GpuUtilProgress(0) as progbar:
    task_id = progbar.add_task(f"Inference with {num_jobs} jobs", total=N)

    # create a pool of threads to do inference in parallel
    with ThreadPoolExecutor(num_jobs) as pool:
        # split the dataset into chunks and submit
        # inference on them as tasks to the pool
        [pool.submit(task, x, 0) for x in np.array_split(dataset, num_jobs)]

        # collect results from the queue as they arrive
        outputs = []
        while not progbar.finished:
            outputs.extend(q.get())
            progbar.update(task_id, completed=len(outputs))

It looks like things actually got worse!
- Extracting good GPU performance is rarely simple or intuitive
- What next?

After spending a few hours perusing the PyTorch documentation and experimenting, we come up with the following basic functional implementation:

In [7]:
def parallel_inference(X, num_gpus, jobs_per_gpu, progbar):
    num_jobs = num_gpus * jobs_per_gpu
    task_id = progbar.add_task(
        f"[cyan]{num_gpus} GPUs/{num_jobs} jobs",
        total=len(X),
        start=False
    )

    # we need special queue and value objects
    # specific to process spawning
    smp = torch.multiprocessing.get_context("spawn")
    q = smp.Queue()
    sync = smp.Value("d", 0.0)

    # pass a bunch of arguments into each
    # process that we need to spawn
    # note that we have to pass copies of some
    # of our local functions that live in `utils`,
    # since functions defined in __main__ can't
    # be unpickled inside the spawned processes
    args = (
        X,  # the full dataset
        utils.MLP,  # the module class to use for inference
        [INPUT_SIZE, HIDDEN_SIZES],  # arguments to initialize the module
        utils.do_some_inference,  # the inference function to use
        q,  # the queue to put the results in
        sync,  # a task synchronizer
        jobs_per_gpu,
        num_gpus
    )

    # spawn parallel jobs across all GPUs.
    # We have to host the `parallel_inference_task` function
    # in a separate module for the same pickling reason
    # described above
    procs = torch.multiprocessing.spawn(
        utils.parallel_inference_task,
        args=args,
        nprocs=num_jobs,
        join=False
    )

    # wait to synchronize until all models load
    # so that we can compare throughput better
    while sync.value < num_jobs:
        time.sleep(0.01)

    # increment the synchronizer one more
    # time to kick off the jobs
    sync.value += 1
    progbar.start_task(task_id)

    # collect all the (unordered) outputs
    outputs = []
    while not (progbar.finished and procs.join(0.01)):
        try:
            # try to get the next result in
            # in the queue and increment everything
            outputs.extend(q.get_nowait())
            progbar.update(task_id, completed=len(outputs))
        except Empty:
            time.sleep(0.01)

    # concatenate the outputs and return
    return np.stack(outputs, axis=0)
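
The worker side of this handshake lives in `utils` and is not shown here. As a rough illustration of the pattern only, the sketch below reproduces the "report ready, wait for the start signal, then stream results to a queue" logic using threads and a trivial stand-in model. The `Counter` class and `worker` function are hypothetical and are not the contents of `utils.parallel_inference_task`.

```python
import threading
import time
from queue import Queue


class Counter:
    """Minimal thread-safe stand-in for a shared multiprocessing value."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
            self.value += 1


def worker(rank, chunk, q, sync, num_jobs):
    # stand-in for building the model and loading its weights
    def model(x):
        return 2 * x

    # signal that this worker is ready, then wait for the
    # parent to bump the counter once more as a start signal
    sync.increment()
    while sync.value <= num_jobs:
        time.sleep(0.001)

    for x in chunk:
        q.put((rank, model(x)))


num_jobs = 3
data = list(range(12))
chunks = [data[i::num_jobs] for i in range(num_jobs)]
q, sync = Queue(), Counter()

threads = [
    threading.Thread(target=worker, args=(i, chunks[i], q, sync, num_jobs))
    for i in range(num_jobs)
]
for t in threads:
    t.start()

# parent side: wait for every worker to report ready, then release them
while sync.value < num_jobs:
    time.sleep(0.001)
sync.increment()

for t in threads:
    t.join()
results = sorted(q.get()[1] for _ in range(len(data)))
print(results)
```

Threads are used here only to keep the sketch short and runnable anywhere; the version above needs separate processes and the `spawn` context.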

How do GPU usage and time-to-completion scale with the number of jobs/GPUs?

In [8]:
with utils.GpuUtilProgress([0, 1]) as progbar:
    y = parallel_inference(dataset, num_gpus=1, jobs_per_gpu=2, progbar=progbar)
    y = parallel_inference(dataset, num_gpus=1, jobs_per_gpu=4, progbar=progbar)
    y = parallel_inference(dataset, num_gpus=2, jobs_per_gpu=4, progbar=progbar)


Things seem to scale pretty well! But this is still far from ideal:
- Framework specific
    - No help if we want to extend to other frameworks
    - Torch is pretty unique in having this functionality at all
- The code is complicated and required a lot of non-physics expertise to build
    - Non-trivial to reconstruct for new applications
- Extremely contrived example, breaks down in most real use cases
    - Explore a few cases to show how

### Throughput too low
#### _The constraints of our use case demand that we further reduce processing time by an order of magnitude_
- Not obvious how to simply extend this code to multi-node
- Scaling not dynamic
    - Have to pick a level of parallelism and hope the resources are available to use it

### Throughput too high
####  _Data generation process is slow, needs to be parallelized to saturate GPU throughput_
- Low GPU utilization with local resources. Not obvious how this code can:
    - Scale to multiple clients
    - Allow other users to leverage spare cycles

### Model ensembling
#### _Connecting multiple models in a single pipeline_
<div class="center">
    <img src="images/model_sharing.png" height="auto" width="400px" class="center" />
    <img src="images/model_ensemble.png" height="auto" width="400px" class="center" />
</div>

### Model ensembling
Naive implementation

In [12]:
@torch.no_grad()
def do_some_multi_model_inference(models, dataset, batch_size=8, device_index=0):
    gpu_dataset = torch.from_numpy(dataset).cuda(device_index)
    dataset = torch.utils.data.TensorDataset(gpu_dataset)

    for [x] in torch.utils.data.DataLoader(dataset, batch_size=batch_size):
        for model in models:
            x = model(x)
        yield x.cpu().numpy()

noise_remover = utils.NoiseRemovalModel(INPUT_SIZE, [32, 16]).cuda(0)
models = [noise_remover, inference_model]

with utils.GpuUtilProgress(0) as progbar:
    task_id = progbar.add_task("[cyan]Ensemble inference", total=N)
    for y in do_some_multi_model_inference(models, dataset):
        progbar.update(task_id, advance=len(y))


## Model Ensembling

Once again, it _works_, but scaling it up is non-trivial:

- Models may require different levels of parallelism to keep any one model from bottlenecking the other (see figure)
- Most efficient implementation would have models executing asynchronously, with tensors passed between GPUs
- If the models utilize different frameworks, this problem becomes considerably harder

<figure>
    <img src="images/bottleneck-both.png" height="auto" width="800px"/>
    <figcaption>Model 1 throughput too high for model 2, need to run more concurrent instances of model 2 to maximize throughput</figcaption>
</figure>
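
To put a number on the bottleneck in the figure: if the per-instance throughput of each model is known, the number of concurrent instances of the slower model needed to keep up is a ceiling division. The throughputs below are hypothetical, and the sketch assumes instances scale linearly, which is optimistic on a shared GPU.

```python
import math


def instances_needed(upstream_throughput, per_instance_throughput):
    """Smallest number of downstream instances that keeps up with upstream."""
    return math.ceil(upstream_throughput / per_instance_throughput)


# hypothetical throughputs in inferences per second
model_1_throughput = 4000
model_2_per_instance = 1500

n = instances_needed(model_1_throughput, model_2_per_instance)
print(n, "instances of model 2; combined capacity", n * model_2_per_instance)
```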

## Distribution
#### _Who do you want to use your model?_

- How much expertise should someone need to have to utilize your model in their pipeline?
- How much do they need to know about how your model is implemented?
- How will they be kept up-to-date when you retrain the model or improve the architecture?
    - Will these updates change their pipeline?
- What if they don't have access to accelerators?

# Inference-as-a-Service
### An alternative paradigm

#### Takeaways so far:
- Efficiently scheduling cross-platform, multi-GPU, multi-model asynchronous DL inference is hard
- Inference is just one piece of your pipeline. Really even just one line:
```python
y = model(x)
```
    How and where `model(x)` happens is largely irrelevant to everything else

So:
- Manage this piece separately to hide these details
- Scale it to meet the rate at which you can generate `x`s or how quickly you need `y`s
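
One way to picture this separation is a pipeline that depends only on an `infer` method and never on where that method executes. The sketch below is illustrative: `InferenceBackend`, `LocalBackend`, `RemoteBackend`, and the `send` transport are all hypothetical names, and the "remote" backend simply calls a local function in place of a network request.

```python
from typing import Callable, Protocol, Sequence


class InferenceBackend(Protocol):
    def infer(self, x: Sequence[float]) -> Sequence[float]: ...


class LocalBackend:
    """Runs a model callable in the same process as the pipeline."""

    def __init__(self, model: Callable):
        self.model = model

    def infer(self, x):
        return self.model(x)


class RemoteBackend:
    """Stand-in for a client that would send `x` to a service.

    `send` is a hypothetical transport function; a real client
    would serialize `x` and make a network request here.
    """

    def __init__(self, send: Callable):
        self.send = send

    def infer(self, x):
        return self.send(x)


def run_pipeline(backend: InferenceBackend, dataset):
    # the pipeline only knows about `infer`, not where it executes
    return [backend.infer(x) for x in dataset]


def double(x):
    return [2 * v for v in x]


dataset = [[1.0, 2.0], [3.0, 4.0]]
local = run_pipeline(LocalBackend(double), dataset)
remote = run_pipeline(RemoteBackend(double), dataset)
print(local == remote)
```

Because `run_pipeline` is unchanged between the two calls, swapping the backend does not touch any pipeline code.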

## Inference-as-a-Service
The **inference-as-a-service** (IaaS) paradigm addresses these issues
- Out-of-the-box software optimized for efficiently executing complex asynchronous workloads across devices
    - Hardware _and_ framework agnostic
- Exposes models for inference to **client** pipelines via standardized APIs
    - Pipeline code stays the same even as the model changes or the service moves
    - Centralized model repositories keep all clients on the same page
- Containerization makes deployments portable to meet workload demands
    - Minimizes environment management overhead
    - Integration with container management services like Kubernetes leads to easy scaling

Traditional pipeline pseudocode
```python
import pickle

# need to define the architecture somewhere
# that users can get access to it
from model_zoo import MyModel

# load in some parameters to make sure we
# initialize the model correctly
with open("path/to/init/args.pickle", "rb") as f:
    args = pickle.load(f)
model = MyModel(**args)

# load in the latest checkpoint we know of
model.load_weights("path/to/latest/weights.h5")

for x in dataset:
    # this syntax will differ depending on the framework
    x = move_array_to_gpu(x)
    y = model(x)
    y = move_output_to_cpu(y)
    do_downstream_postprocessing(y)
```

IaaS
```python
import tritonclient.grpc as triton

# connect to the service at some url
# pipeline never needs to touch model itself
client = triton.InferenceServerClient("0.0.0.0:8001")

# build a protobuf message representing the input
metadata = client.get_model_metadata("my_model").inputs[0]
input = triton.InferInput(
    metadata.name, metadata.shape, metadata.datatype
)

for x in dataset:
    input.set_data_from_numpy(x)

    # use the latest available version of the model
    y = client.infer("my_model", inputs=[input])
    do_downstream_postprocessing(y)
```

The difference is that, _as is_, this code gets you:
- As much scale as you want
- on whatever hardware you want
- using whatever backend framework you want
- wherever you want
- and can receive updates without interrupting service

## Deployment scenarios
- IaaS represents a _software_ model for managing inference execution on heterogeneous _hardware_
- Not tied to any _particular_ hardware platform or deployment location
    - Tune to meet the needs of each use case

<div class="center">
    <img src="images/triton-ldg.png" height="auto" width="350px" class="left" />
    <img src="images/triton-cloud.png" height="auto" width="350px" class="right" />
</div>

### [Triton Inference Server](https://github.com/triton-inference-server/server)
Off-the-shelf inference service developed and maintained by NVIDIA.
- Efficient scheduling of GPU resources
- Multiple framework backend support
    - Pre-built containers released monthly to simplify dependency management
- Dynamic model versioning and ensemble scheduling
- Separately installable [client libraries](https://github.com/triton-inference-server/client)

![Find a triton image](images/triton-logo.png "Find a triton image")

## End-to-end example
- A working example starting from model export and working up to multi-node ensemble inference in the cloud
- Discuss key concepts of IaaS, Triton, and cloud computing along the way

## Exporting models
Things we need for each model we want to use for inference
> - The set of operations that constitute the model
> - The optimized (trained) set of model parameters or "weights"

Triton expects these objects to be stored in a single location: the **model repository**
- Local filesystem or cloud storage location
- Expected to have fixed structure denoting different models and their various versions
- Models are loaded from the repository into memory
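
The fixed structure places each model in its own directory, with a `config.pbtxt` alongside numbered version subdirectories. A minimal sketch that builds such a layout with placeholder files follows; the helper `add_version` is hypothetical and is not part of Triton or `quiver`, and the files it writes are empty stand-ins rather than real models.

```python
import tempfile
from pathlib import Path


def add_version(repo: Path, model_name: str, payload: bytes = b"") -> Path:
    """Write a placeholder model file into the next numbered version directory."""
    model_dir = repo / model_name
    model_dir.mkdir(parents=True, exist_ok=True)

    # the config sits next to the version directories
    config = model_dir / "config.pbtxt"
    if not config.exists():
        config.write_text(f'name: "{model_name}"\n')

    versions = [int(p.name) for p in model_dir.iterdir() if p.name.isdigit()]
    version_dir = model_dir / str(max(versions, default=0) + 1)
    version_dir.mkdir()
    path = version_dir / "model.onnx"
    path.write_bytes(payload)
    return path


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "my-repo"
    add_version(repo, "my-model")
    add_version(repo, "my-model")
    layout = sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*"))
    print("\n".join(layout))
```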

## Model repository
<div class="center">
    <img src="images/repo-outline.png" height="auto" width="800px"/>
</div>

- Repo conventions can be tedious and redundant
    - Built around protobuf - syntax tricky to learn
- We've built a wrapper library `quiver` to simplify these headaches:

In [13]:
from gravswell import quiver as qv

# represents a model repo on the local filesystem
repo = qv.ModelRepository("my-repo")

# add a blank model entry to the repo
entry = repo.add("my-model", platform=qv.Platform.ONNX)

# see how this looks visually
utils.print_tree("my-repo")

print("Model config:")
entry.config

name: "my-model"
platform: "onnxruntime_onnx"

Export an initial version of our model
- Combines the operations and weights into a single representation using [ONNX](https://github.com/onnx)
- Specify the names and sizes of the tensors going in and out of the model
    - Output sizes are inferred by running the model on input tensors of the specified shape

In [14]:
# move the model back to the CPU for output shape inference
inference_model.to("cpu")

# export this version of the model
export_path = entry.export_version(
    inference_model,
    input_shapes={"x": (None, INPUT_SIZE)},
    output_names=["y"]
)

# inputs and outputs are dynamically added to the config
entry.config

name: "my-model"
platform: "onnxruntime_onnx"
input {
  name: "x"
  data_type: TYPE_FP32
  dims: -1
  dims: 64
}
output {
  name: "y"
  data_type: TYPE_FP32
  dims: -1
  dims: 1
}

Now what does our model repo look like?

In [15]:
utils.print_tree("my-repo")

my-repo/
    my-model/
        1/
            model.onnx
        config.pbtxt


- For subsequent versions of our model, we don't need to specify the input shapes or output names again.
    - If we do, they will be compared against the config to make sure they match

In [16]:
# do_some_more_training(model, new_train_dataset)
# export this even-more-optimized version 2.0 of the model
entry.export_version(inference_model)

# what does this look like now?
utils.print_tree("my-repo")

my-repo/
    my-model/
        1/
            model.onnx
        2/
            model.onnx
        config.pbtxt


## Parallel Inference
- Triton can host multiple copies of our model on a single GPU at once
    - Allows for easy parallel execution
    - Can be scaled per-model, per-GPU
    - Described in config **instance group**

In [17]:
# host 4 parallel instances of the model on each available GPU
entry.config.add_instance_group(count=4)
entry.config

name: "my-model"
platform: "onnxruntime_onnx"
input {
  name: "x"
  data_type: TYPE_FP32
  dims: -1
  dims: 64
}
output {
  name: "y"
  data_type: TYPE_FP32
  dims: -1
  dims: 1
}
instance_group {
  count: 4
  kind: KIND_GPU
}

## Cloud storage
- Local filesystem not helpful if we choose to host our inference service on a remote server
- Use a **cloud storage** bucket on Google Cloud to host the model repository
- Accessible via web request APIs from anywhere (with authentication)
- `quiver` supports this natively, with no code changes

In [19]:
# get rid of local repository
repo.delete()

# use gs:// prefix to indicate Google Cloud bucket
repo = qv.ModelRepository("gs://ligo-quiver-demo")

# everything else proceeds as normal
entry = repo.add("my-model", platform=qv.Platform.ONNX)
entry.config.add_instance_group(count=4)
export_path = entry.export_version(
    inference_model,
    input_shapes={"x": (None, INPUT_SIZE)},
    output_names=["y"]
)

gs://ligo-quiver-demo/my-model/config.pbtxt
gs://ligo-quiver-demo/my-model/1/


Use Google's command line utility to confirm bucket contents

In [None]:
! gsutil ls gs://ligo-quiver-demo/my-model/*

## Deploying on the cloud
- Now that our model has been exported to a model repository, we can deploy a Triton server instance that loads our model from it
- To do this we'll use a **Kubernetes** cluster deployed on Google Cloud
    - Reserve a pool of resources sitting on servers in a Google data center
    - Schedule workloads on those resources using **Docker** containers
        - Behave like extremely lightweight VMs containing our Triton environment and executable
    - Kubernetes intelligently manages deployed clusters and exposes them to external requests

## Deploying on the cloud
We've built another library, `cloudbreak`, to help make managing these deployments simpler:

In [20]:
from gravswell.cloudbreak import google as cb

# object for creating/destroying clusters associated
# with a particular **project** (id used for billing)
# in a particular **zone** (identifies a datacenter
# location). We'll use the same one this VM is running
# in, somewhere in Virginia
manager = cb.ClusterManager(project="gunny-multi-instance-dev", zone="us-east4-b")


In [None]:
from google.cloud import container_v1 as container

# create a description of a newly initialized
# cluster. Give it a single vanilla node to run
# the kubernetes manager on
cluster_config = container.Cluster(
    name="ligo-demo-cluster",
    node_pools=[container.NodePool(
        name="default-pool",
        initial_node_count=1,
        config=container.NodeConfig()
    )]
)

# initialize this "blank" cluster in the
# specified zone, with its resource usage
# billed to the specified project
cluster = manager.create_resource(cluster_config)

# we're going to use GPUs on this cluster, so
# run a utility script which makes sure that
# the NVIDIA drivers will be installed on any
# GPU-enabled nodes we might create
cluster.deploy_gpu_drivers()

Now that we have a cluster with a node dedicated to running our Kubernetes deployment, attach some GPU-enabled nodes that will be controlled by Kubernetes.

In [21]:
# describe what these GPU-enabled nodes should look like
# and how many we want
node_pool_config = container.NodePool(
    name="tritonserver-t4-pool",
    initial_node_count=2,
    config=cb.create_gpu_node_pool_config(
        vcpus=16,
        gpus=4,
        gpu_type="t4"
    )
)

# add these nodes to our cluster
node_pool = cluster.create_resource(node_pool_config)


## Deploying Triton with Kubernetes
- Kubernetes operates by using config YAML files to describe the desired _state_ of a deployment. This includes:
    - The name of the deployment
    - The Docker container(s) needed to run it
    - The command to run inside those containers
    - What types of node are acceptable for deploying on
    - How to expose the deployment to requests
- Let's take a look at our Triton deployment file below:

In [3]:
from IPython.display import Code
Code(filename="triton.yaml")

- `cloudbreak.Cluster` object takes care of making the deployment request
- Fills in the `{{ .Values.* }}` wildcards for flexibility
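
The contents of `triton.yaml` are not reproduced here, so the following is only a sketch of how wildcards of this form could be filled in. The template fragment is hypothetical and much shorter than a real deployment file, and `fill_values` is an illustrative helper, not the `cloudbreak` implementation.

```python
import re

# hypothetical fragment; the real triton.yaml is not reproduced here
template = """\
metadata:
  name: {{ .Values.name }}
spec:
  containers:
  - image: nvcr.io/nvidia/tritonserver:{{ .Values.tag }}-py3
    resources:
      limits:
        nvidia.com/gpu: {{ .Values.gpus }}
"""


def fill_values(text, **values):
    """Replace each `{{ .Values.key }}` with `values[key]`."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value provided for {key!r}")
        return str(values[key])

    return re.sub(r"\{\{\s*\.Values\.(\w+)\s*\}\}", substitute, text)


rendered = fill_values(template, name="tritonserver", tag="20.11", gpus=4)
print(rendered)
```

Raising on a missing key, rather than leaving the wildcard in place, is a design choice in this sketch so that an incomplete deployment request fails early.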

In [22]:
cluster.deploy(
    "triton.yaml",
    name="tritonserver",
    tag="20.11",
    bucket="ligo-quiver-demo",
    gpus=4,
    vcpus=15  # at least some CPU has to go to running kubernetes
)
cluster.k8s_client.wait_for_deployment("tritonserver")


## Quick recap
Now we have a model exported to a Google Cloud Storage bucket, that's being loaded into memory by a Triton server instance, which is running on a Google Cloud server node, being managed by Kubernetes, and exposed to external requests by a load balancer.

Got all that?

## Quick recap
In one place, the code to implement all this looks like:
```python
from gravswell import quiver as qv
from gravswell.cloudbreak import google as cb
from google.cloud import container_v1 as container

# make a repo and export
repo = qv.ModelRepository("gs://ligo-quiver-demo")
entry = repo.add("my-model", platform=qv.Platform.ONNX)
entry.config.add_instance_group(count=4)
export_path = entry.export_version(
    inference_model,
    input_shapes={"x": (None, INPUT_SIZE)},
    output_names=["y"]
)

# build a cluster and make it GPU friendly
manager = cb.ClusterManager(project="gunny-multi-instance-dev", zone="us-east4-b")
cluster_config = container.Cluster(
    name="ligo-demo-cluster",
    node_pools=[container.NodePool(
        name="default-pool",
        initial_node_count=1,
        config=container.NodeConfig()
    )]
)
cluster = manager.create_resource(cluster_config)
cluster.deploy_gpu_drivers()

# add GPU-enabled nodes to the cluster
node_pool_config = container.NodePool(
    name="tritonserver-t4-pool",
    initial_node_count=2,
    config=cb.create_gpu_node_pool_config(
        vcpus=16,
        gpus=4,
        gpu_type="t4"
    )
)
node_pool = cluster.create_resource(node_pool_config)

# deploy our Triton application to these GPU nodes
cluster.deploy(
    "triton.yaml",
    name="tritonserver",
    tag="20.11",
    bucket="ligo-quiver-demo",
    gpus=4,
    vcpus=15
)
cluster.k8s_client.wait_for_deployment("tritonserver")
```

So now we're ready to make inference requests to this model!

In [23]:
import tritonclient.grpc as triton

# get the IP address at which the server is deployed
ip = cluster.k8s_client.wait_for_service("tritonserver")

# create a client which communicates with the server
client = triton.InferenceServerClient(f"{ip}:8001")
assert client.is_server_live()

# load our model into memory explicitly
# (can also have this be done automatically
# when the server starts)
client.load_model("my-model")
assert client.is_model_ready("my-model")


- Let's define a new inference function for making requests to the model
- This will be subsumed by `stillwater`, another library in the works for asynchronous IaaS inference

In [24]:
batch_size = 8

def do_some_iaas_inference(model_name):
    metadata = client.get_model_metadata(model_name).inputs[0]
    shape = [i if i != -1 else batch_size for i in metadata.shape]
    input = triton.InferInput(metadata.name, shape, metadata.datatype)

    with utils.GpuUtilProgress() as progbar:
        submit_task_id = progbar.add_task("Submitting requests", total=N)
        infer_task_id = progbar.add_task("Inferences completed", total=N)

        def callback(result, error):
            # called by the client once the server responds
            y = result.as_numpy("y")
            progbar.update(infer_task_id, advance=len(y))

        num_batches = len(dataset) // batch_size
        for x in np.split(dataset, num_batches):
            input.set_data_from_numpy(x)
            client.async_infer(model_name, inputs=[input], callback=callback)
            progbar.update(submit_task_id, advance=len(x))

        while not progbar.finished:
            time.sleep(0.1)
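
Because responses to asynchronous requests can arrive in any order, a pipeline that cares about ordering has to tag each request and reassemble the results. The sketch below shows that bookkeeping with a hypothetical `FakeAsyncClient` that simulates variable latency using a thread pool. It is not the Triton client, and it hands the request id to the callback directly, which a real client library may expose differently.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeAsyncClient:
    """Hypothetical stand-in for a client with a callback-style async API."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(workers)

    def async_infer(self, x, callback, request_id):
        def job():
            time.sleep(random.uniform(0, 0.01))  # simulate variable latency
            callback(request_id, sum(x), None)

        self.pool.submit(job)


batches = [[i, i + 1] for i in range(20)]
results, errors = {}, {}
lock = threading.Lock()
done = threading.Event()


def callback(request_id, result, error):
    # runs on a worker thread, so guard the shared dictionaries
    with lock:
        if error is not None:
            errors[request_id] = error
        else:
            results[request_id] = result
        if len(results) + len(errors) == len(batches):
            done.set()


fake_client = FakeAsyncClient()
for i, x in enumerate(batches):
    fake_client.async_infer(x, callback, request_id=i)

done.wait(timeout=5)
fake_client.pool.shutdown()

# reassemble the responses in the order the requests were made
ordered = [results[i] for i in range(len(batches))]
print(ordered[:5])
```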


## Inference time

In [None]:
do_some_iaas_inference("my-model")

## Ensemble inference
- Adding in our "noise removal" model and piping its output to the input of `"my-model"`.
- Start by creating a new entry for it in our model repo

In [25]:
noise_entry = repo.add("noise-remover", platform=qv.Platform.ONNX)
noise_entry.config.add_instance_group(count=4)
noise_remover.to("cpu")
export_path = noise_entry.export_version(
    noise_remover,
    input_shapes={"noisy": (None, INPUT_SIZE)},
    output_names=["cleaned"]
)
! gsutil ls gs://ligo-quiver-demo/*

gs://ligo-quiver-demo/my-model/
gs://ligo-quiver-demo/noise-remover/


## Ensemble inference
- `quiver` makes piping models together straightforward
    - `EnsembleModel` represents a meta-model that just schedules execution of `step`s consisting of other models
        - Doesn't live on any particular GPU; schedules inference across all model instances across _all_ GPUs

In [27]:
ensemble_entry = repo.add("end-to-end", platform=qv.Platform.ENSEMBLE)
ensemble_entry.add_input(noise_entry.inputs["noisy"])
ensemble_entry.pipe(noise_entry.outputs["cleaned"], entry.inputs["x"])
ensemble_entry.add_output(entry.outputs["y"])
export_path = ensemble_entry.export_version(None)
! gsutil ls gs://ligo-quiver-demo/*

## Ensemble Inference
Use the client to load the end-to-end model onto the server, then make requests to it

In [29]:
client.load_model("end-to-end")
assert client.is_model_ready("noise-remover")  # ensemble loads all its "step" models
assert client.is_model_ready("end-to-end")

do_some_iaas_inference("end-to-end")


## Multi-node inference
- Kubernetes makes scaling up more server instances seamless
- Hoping to roll this behavior into `cloudbreak`

In [40]:
import kubernetes

app_client = kubernetes.client.AppsV1Api(cluster.k8s_client._client)
body = app_client.read_namespaced_deployment("tritonserver", namespace="default")
body.spec.replicas = 2

response = app_client.patch_namespaced_deployment_scale(
    "tritonserver", namespace="default", body=body
)
cluster.k8s_client.wait_for_deployment("tritonserver")

## Multi-node inference
Since both nodes are exposed by the same `LoadBalancer`, the same client pointing at one IP address can make requests to all 8 available GPUs

In [43]:
do_some_iaas_inference("end-to-end")


# IaaS results on LIGO data
## Running real online and offline pipelines using Triton

## Challenges with streaming time series

- DeepClean and BBHnet produce inferences on overlapping **kernels** of data which represent a fixed length **snapshot** of the time series at one moment in time
    - Including full kernel with _every_ request creates enormous I/O load that bottlenecks pipeline
- Sample frames at some **inference sampling rate** $r \leq f_s$, where $f_s$ is the sample rate of the time series

<img src="images/snapshotter_overlap.png" height="auto" width="600px"/>
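
A quick calculation shows why sending the full kernel with every request is costly. The sketch below compares it against sending only the samples that are new since the previous kernel. The sample rate, kernel length, and inference sampling rate are hypothetical values chosen for illustration, not the settings used in the experiments.

```python
def streaming_io(sample_rate, kernel_length, inference_rate):
    """Compare samples sent per second with full kernels vs. new samples only.

    sample_rate: samples per second of the time series (f_s)
    kernel_length: kernel duration in seconds
    inference_rate: inferences per second (r), assumed to divide sample_rate
    """
    kernel_size = int(sample_rate * kernel_length)
    stride = sample_rate // inference_rate  # new samples per inference
    full_kernels = kernel_size * inference_rate
    updates_only = stride * inference_rate
    return kernel_size, stride, full_kernels, updates_only


# hypothetical values, chosen only for illustration
size, stride, full, updates = streaming_io(
    sample_rate=4096, kernel_length=1.0, inference_rate=1024
)
print(f"kernel: {size} samples, stride: {stride} samples")
print(f"full kernels: {full} samples/s, updates only: {updates} samples/s")
```

The ratio between the two is `kernel_size / stride`, so the overhead of resending overlapping data grows as the inference sampling rate approaches the sample rate.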

## Challenges with streaming time series

- Introduce a model on server which maintains the most recent snapshot as a "state"
    - Connected to other models via an ensemble
    - Only need to stream state updates of _new_ data
    - Improves I/O, but introduces a serial step which can limit parallelism
- Support for adding a snapshotter to any model will soon be introduced to `quiver`

<div class="center">
    <img src="images/snapshotter_action.png" height="auto" width="180px" class="unpadded" />
</div>
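
The state update itself is simple to sketch: keep the most recent samples in a fixed-length buffer, append each update, and let the oldest samples fall off. The `Snapshotter` class below is a hypothetical pure-Python illustration of that idea, not the server-side model used in the experiments, and initializing the state with zeros is an assumption.

```python
from collections import deque


class Snapshotter:
    """Keeps the most recent `kernel_size` samples as state."""

    def __init__(self, kernel_size):
        # start from zeros until enough real data has arrived
        self.state = deque([0.0] * kernel_size, maxlen=kernel_size)

    def update(self, new_samples):
        # extending a full deque silently drops the oldest samples
        self.state.extend(new_samples)
        return list(self.state)


snapshotter = Snapshotter(kernel_size=8)
stream = [float(i) for i in range(1, 13)]
stride = 4

for start in range(0, len(stream), stride):
    kernel = snapshotter.update(stream[start:start + stride])
    print(kernel)
```

Each update depends on the state left by the previous one, which is the serial step that can limit parallelism.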

## Experimental measurements
- IaaS deployments fundamentally trade off latency, throughput, and expense
    - Constraints of each problem define a cost surface in these variables
    - Each point associated with a server resource usage/level of parallelism configuration
    - Measure the variables of interest as a function of configuration
        - Use cost function to map to cost and find optimal operating point

<figure>
    <img src="images/cost-surface.png" height="auto" width="800px"/>
    <figcaption>Surface of constant cost in "inference space"</figcaption>
</figure>
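
As a sketch of the last step, finding the operating point amounts to evaluating a cost function over the measured configurations and taking the minimum. Everything below is hypothetical: the measurements are invented, and the cost function (expense per unit throughput, subject to latency and throughput constraints) is just one plausible choice.

```python
# hypothetical measurements: (gpus, instances per gpu) ->
# (latency in ms, throughput in inferences/s, expense in $/hour)
measurements = {
    (1, 1): (2.0, 500.0, 1.0),
    (1, 4): (5.0, 1600.0, 1.0),
    (2, 4): (5.5, 3000.0, 2.0),
    (4, 4): (6.0, 5500.0, 4.0),
}


def cost(latency, throughput, expense, max_latency=10.0, min_throughput=2000.0):
    """Expense per unit throughput, infinite if a constraint is violated."""
    if latency > max_latency or throughput < min_throughput:
        return float("inf")
    return expense / throughput


best = min(measurements, key=lambda config: cost(*measurements[config]))
print("optimal configuration (gpus, instances per gpu):", best)
```

Changing the constraints changes the answer: tightening `max_latency` or raising `min_throughput` rules out configurations before expense is even considered.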

## Offline Pipelines
- For each server, assign $k$ clients
    - With $n$ servers, each client is assigned $\frac{1}{nk}$ of the total dataset to process
    - Each client has an associated snapshotter state, must be routed to same server
- Inference sampling rate has no bearing on data generation rate
    - Determines the _number_ of frames needed to process for a given length of time
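
The assignment described above can be sketched as a contiguous split of the dataset into $nk$ shares, with each client pinned to a single server. The function name and the choice to give any remainder to the first few clients are assumptions made for this illustration.

```python
def assign_clients(num_samples, num_servers, clients_per_server):
    """Split sample indices into n * k contiguous shares, one per client.

    Returns a list of (server_index, start, stop) tuples. Each client is
    pinned to one server so that its snapshotter state stays in one place.
    """
    num_clients = num_servers * clients_per_server
    base, extra = divmod(num_samples, num_clients)

    assignments, start = [], 0
    for client in range(num_clients):
        # the first `extra` clients take one additional sample each
        stop = start + base + (1 if client < extra else 0)
        assignments.append((client // clients_per_server, start, stop))
        start = stop
    return assignments


assignments = assign_clients(num_samples=1000, num_servers=2, clients_per_server=3)
for server, start, stop in assignments:
    print(f"server {server}: samples [{start}, {stop})")
```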

## DeepClean Offline
- Processed ~1 month of data during O3
<div class="center padded">
    <img src="images/dc-offline.png" height="auto" width="1200px" />
</div>

## DeepClean + BBHnet Offline
Implemented using several different frameworks:
<div class="center">
    <img src="images/ensemble.png" height="auto" width="1000px" />
</div>

## DeepClean + BBHnet Offline
- Processed ~27 hours of data during O2
- Strong scaling + elastic demand makes cloud ideal environment
    - Optimal scale $\rightarrow\infty$
    - Trade-off shifts to prediction quality from higher $r$ vs. expense, need to quantify


<img src="images/e2e-offline.png" height="800px" width="1200px" class="padded"/>

## HEPCloud
Framework from HEP community for sustained, high-throughput inference using Condor APIs
<div class="center">
    <img src="images/hepcloud.png" height="auto" width="800px" />
</div>

## DeepClean Online
- Run on shared memory data replay frames
- Inference sampling rate $r$ dictates _average_ data generation rate
    - Frames become available in 1 second increments, true data generation rate is roughly a square wave
- One data stream = one snapshotter
    - Serial update causes queue build-up, no benefit to extra scale downstream
    - Optimizing this step highest priority

<img src="images/dc-online.png" height="auto" width="1200px" class="padded"/>

## DeepClean Online
- Overlapping outputs due to streaming time series
    - "Fully online" inference scheme causes negative impacts on PSD
    - Can trade off some latency for improved quality through aggregation
    - Working on improving trade-off via better training

<img src="images/dc-output-overlap.png" height="auto" width="300px" class="padded" />
<img src="images/psd-latency.png" height="auto" width="500px" />
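
The aggregation scheme is not spelled out here, so the following is only a sketch that assumes simple averaging: each sample is covered by several overlapping kernels, and averaging over more of them requires waiting for later kernels to arrive. The function `aggregate` and the toy outputs are hypothetical.

```python
def aggregate(outputs, stride, num_kernels):
    """Average each sample over the first `num_kernels` kernels containing it.

    outputs[i] holds one prediction per sample for the kernel that
    starts at sample i * stride. Waiting for more kernels adds
    roughly (num_kernels - 1) * stride samples of latency.
    """
    kernel_size = len(outputs[0])
    total = (len(outputs) - 1) * stride + kernel_size

    result = []
    for t in range(total):
        # index range of the kernels that contain sample t
        first = max(0, -(-(t - kernel_size + 1) // stride))
        last = min(len(outputs) - 1, t // stride)

        # keep only the earliest `num_kernels` of them
        last = min(last, first + num_kernels - 1)
        values = [outputs[i][t - i * stride] for i in range(first, last + 1)]
        result.append(sum(values) / len(values))
    return result


# toy outputs: kernel i predicts the constant value i everywhere,
# which makes it easy to see which kernels contributed
outputs = [[float(i)] * 4 for i in range(3)]

fully_online = aggregate(outputs, stride=2, num_kernels=1)
averaged = aggregate(outputs, stride=2, num_kernels=2)
print(fully_online)
print(averaged)
```

With `num_kernels=1` each sample is taken from the first kernel in which it appears, which corresponds to the lowest-latency case; larger values blend in later kernels at the cost of waiting for them.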

# Next Steps
## Taking IaaS into production at LIGO