docs: Add distributed services doc (#4485)
* Add distributed services doc
* fix format
* ci: auto fixes from pre-commit.ci (https://pre-commit.ci)

Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

1 parent 5ca2085 · commit a13a913 · 3 changed files with 245 additions and 2 deletions.
====================
Distributed Services
====================

BentoML provides a flexible framework for deploying machine learning models as Services. While a single Service often suffices, more complex scenarios benefit from multiple Services running in a distributed way.

This document provides guidance on creating and deploying a BentoML project with distributed Services.

||
Single and distributed Services | ||
------------------------------- | ||
|
||
Using a single BentoML Service in ``service.py`` is typically sufficient for most use cases. This approach is straightforward, easy to manage, and works well when you only need to deploy a single model and the API logic is simple. | ||
|
||
In deployment, a BentoML Service run as processes in a container. If you define multiple Services, they run as processes in different containers. This distributed approach is useful when dealing with more complex scenarios, such as: | ||
|
||
- **Pipelining CPU and GPU processing for better throughput**: Distributing tasks between CPU and GPU can enhance throughput. Certain preprocessing or postprocessing tasks might be more efficiently handled by the CPU, while the GPU focuses on model inference. | ||
- **Optimizing resource utilization and scalability**: Distributed Services can run on different instances, allowing for independent scaling and efficient resource usage. This flexibility is important in handling varying loads and optimizing specific resource demands. | ||
- **Asymmetrical GPU requirements**: Different models might have varied GPU requirements. Distributing these models across Services helps you allocate resources more efficiently and cost-effectively. | ||
- **Handling complex workflows**: For applications involving intricate workflows, like sequential processing, parallel processing, or the composition of multiple models, you can create multiple Services to modularize these processes if necessary, improving maintainability and efficiency. | ||
|
||
Interservice communication
--------------------------

Distributed Services support complex, modular architectures through interservice communication. Different Services can interact with each other using the ``bentoml.depends()`` function, which allows for direct method calls between Services as if they were local class functions. Key features of interservice communication include:

- **Automatic service discovery and routing**: When Services are deployed, BentoML handles the discovery of Services, routes requests appropriately, and manages payload serialization and deserialization.
- **Arbitrary dependency chains**: Services can form dependency chains of any length, enabling intricate Service orchestration.
- **Diamond-shaped dependencies**: Multiple Services can depend on a single downstream Service, maximizing Service reuse.

The following is an example of two distributed Services with different hardware requirements, where one Service depends on the other using ``bentoml.depends()``.

- ``SDXLControlNetService``: A resource-intensive Service that requires GPU support for image generation.
- ``ControlNet``: A Service designed to handle incoming requests and route them appropriately. It calls a method of the ``SDXLControlNetService`` Service.

This is the ``service.py`` file:

.. code-block:: python

    from __future__ import annotations

    import typing as t

    import cv2
    import numpy as np
    import PIL
    from PIL.Image import Image as PIL_Image
    import torch
    from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
    from pydantic import BaseModel

    import bentoml

    CONTROLNET_MODEL_ID = "diffusers/controlnet-canny-sdxl-1.0"
    VAE_MODEL_ID = "madebyollin/sdxl-vae-fp16-fix"
    BASE_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"


    @bentoml.service(
        traffic={"timeout": 600},
        workers=1,
        resources={
            "gpu": "1",
            "gpu_type": "nvidia-l4",
        }
    )
    class SDXLControlNetService:

        def __init__(self) -> None:
            if torch.cuda.is_available():
                self.device = "cuda"
                self.dtype = torch.float16
            else:
                self.device = "cpu"
                self.dtype = torch.float32

            self.controlnet = ControlNetModel.from_pretrained(
                CONTROLNET_MODEL_ID,
                torch_dtype=self.dtype,
            )
            self.vae = AutoencoderKL.from_pretrained(
                VAE_MODEL_ID,
                torch_dtype=self.dtype,
            )
            self.pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
                BASE_MODEL_ID,
                controlnet=self.controlnet,
                vae=self.vae,
                torch_dtype=self.dtype,
            ).to(self.device)

        @bentoml.api
        async def generate(
            self,
            prompt: str,
            arr: np.ndarray[t.Any, np.dtype[np.uint8]],
            **kwargs,
        ):
            image = PIL.Image.fromarray(arr)
            return self.pipe(prompt, image=image, **kwargs).to_tuple()


    class Params(BaseModel):
        prompt: str
        negative_prompt: t.Optional[str]
        controlnet_conditioning_scale: float = 0.5
        num_inference_steps: int = 25


    @bentoml.service(
        traffic={"timeout": 600},
        workers=8,
        resources={"cpu": "1"}
    )
    class ControlNet:
        # Pass the dependent Service class as an argument
        controlnet_service = bentoml.depends(SDXLControlNetService)

        @bentoml.api
        async def generate(self, image: PIL_Image, params: Params) -> PIL_Image:
            arr = np.array(image)
            arr = cv2.Canny(arr, 100, 200)
            arr = arr[:, :, None]
            arr = np.concatenate([arr, arr, arr], axis=2)
            params_d = params.dict()
            prompt = params_d.pop("prompt")
            # Invoke a class-level function of another Service
            res = await self.controlnet_service.generate(
                prompt,
                arr=arr,
                **params_d,
            )
            return res[0][0]

.. note::

   You can find this example in the `BentoControlNet <https://github.com/bentoml/BentoControlNet/>`_ project.

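
In ``ControlNet.generate`` above, the validated parameters are split into the prompt and the remaining keyword arguments before calling the dependent Service. This dict-splitting pattern can be checked on its own with a standard-library dataclass standing in for the pydantic model (the names below are illustrative, not part of BentoML):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Params:
    # Stand-in for the pydantic model in service.py
    prompt: str
    negative_prompt: Optional[str] = None
    controlnet_conditioning_scale: float = 0.5
    num_inference_steps: int = 25

params = Params(prompt="a bento box")
params_d = asdict(params)
prompt = params_d.pop("prompt")  # becomes the positional argument

print(prompt)            # -> a bento box
print(sorted(params_d))  # remaining fields are forwarded as **kwargs
```

The same split works with pydantic's ``.dict()`` (or ``.model_dump()`` in pydantic v2), since both return a plain dictionary.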
To declare a dependency, call ``bentoml.depends()`` with the dependent Service class as an argument. This creates a direct link between Services, facilitating easy method invocation. This example uses the following code to achieve this:

.. code-block:: python

    class ControlNet:
        controlnet_service = bentoml.depends(SDXLControlNetService)

Once a dependency is declared, invoking methods on the dependent Service is similar to calling a local method. In other words, Service ``A`` can call Service ``B`` as if Service ``A`` were invoking a class-level function on Service ``B``. This abstracts away the complexities of network communication, serialization, and deserialization. In this example, the Service ``ControlNet`` invokes the ``generate`` function of ``SDXLControlNetService`` as below:

.. code-block:: python

    res = await self.controlnet_service.generate(prompt, arr=arr, **params_d)

Using ``bentoml.depends()`` is the recommended way to create a BentoML project with distributed Services. It enhances modularity, as you can develop reusable, loosely coupled Services that can be maintained and scaled independently.
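
The calling pattern, where a dependent Service method is awaited like a local coroutine, can be sketched without BentoML using plain ``asyncio``. The ``depends`` helper below is a toy stand-in for ``bentoml.depends()``, not its actual implementation (BentoML proxies the call over the network to a separate container):

```python
import asyncio

def depends(service_cls):
    # Toy stand-in: return a lazily created local instance of the
    # dependent Service. BentoML instead resolves the deployed Service
    # and handles routing plus (de)serialization transparently.
    return service_cls()

class TextModel:
    async def generate(self, prompt: str) -> str:
        return f"generated: {prompt}"

class Gateway:
    # Declare the dependency at class level, as in service.py
    text_model = depends(TextModel)

    async def handle(self, prompt: str) -> str:
        # Awaiting the dependent Service reads like a local method call
        return await self.text_model.generate(prompt)

result = asyncio.run(Gateway().handle("hello"))
print(result)  # -> generated: hello
```

The point of the sketch is the call site: ``Gateway`` never deals with transport details, which is exactly the abstraction ``bentoml.depends()`` provides.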

``bentofile.yaml``
------------------

For projects with multiple Services, the ``service`` field in ``bentofile.yaml`` should reference the primary Service that handles user requests. For example:

.. code-block:: yaml

    service: "service:ControlNet" # ControlNet is the Service that receives user requests
    labels:
      owner: bentoml-team
      project: gallery
    ...

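
The ``service`` value follows an ``import-path:ClassName`` convention; here ``service`` refers to ``service.py`` and ``ControlNet`` to the class defined in it. Splitting such a target string can be sketched with the standard library (the helper name is illustrative, not a BentoML API):

```python
def split_service_target(target: str) -> tuple[str, str]:
    # Split an "import_path:ClassName" target into its two parts.
    # Illustrative helper, not part of BentoML's public API.
    module, _, attr = target.partition(":")
    if not module or not attr:
        raise ValueError(f"expected 'module:ClassName', got {target!r}")
    return module, attr

print(split_service_target("service:ControlNet"))  # -> ('service', 'ControlNet')
```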
Deploy distributed Services
---------------------------

Deploying a project with distributed Services in BentoML is similar to deploying a single Service, with some nuances in setting custom configurations.

To set custom configurations for each Service, we recommend using a separate configuration file and referencing it in the BentoML CLI command or Python API for deployment.

The following example file defines custom configurations for both Services in the BentoControlNet project. You set the configurations of each Service in the ``services`` field. Refer to :doc:`/bentocloud/how-tos/create-deployments` for the available configuration fields.

.. code-block:: yaml

    # config-file.yaml
    name: "deployment-name"
    description: "This project creates an image generation application based on users' requirements."
    envs: # Optional. Environment variables specified here are applied to all Services
      - name: "AA"
        value: "aa"
    services: # Add the configs of each Service under this field
      SDXLControlNetService: # Service one
        instance_type: "gpu.l4.1"
        scaling:
          max_replicas: 2
          min_replicas: 1
          policy:
            metrics:
              - type: "cpu | memory | gpu | qps"
                value: "string"
            scale_down_behavior: "disabled | stable | fast"
            scale_up_behavior: "disabled | stable | fast"
        envs: # Environment variables specific to Service one
          - name: "BB"
            value: "bb"
        deployment_strategy: "RollingUpdate"
        config_overrides:
          traffic:
            # float in seconds
            timeout: 700
            max_concurrency: 20
            external_queue: true
          resources:
            cpu: "400m"
            memory: "1Gi"
          workers:
            - gpu: 1
      ControlNet: # Service two
        instance_type: "cpu.1"
        scaling:
          max_replicas: 5
          min_replicas: 1

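
A quick sanity check one might run on such a config before deploying, e.g. that each Service's ``min_replicas`` does not exceed ``max_replicas``, can be written with the standard library (the checker is illustrative, not part of BentoML):

```python
def check_scaling(services: dict) -> list[str]:
    # Return the names of Services whose scaling bounds are inconsistent.
    # Illustrative helper, not a BentoML API.
    bad = []
    for name, cfg in services.items():
        scaling = cfg.get("scaling", {})
        lo = scaling.get("min_replicas", 1)
        hi = scaling.get("max_replicas", lo)
        if lo > hi:
            bad.append(name)
    return bad

# Mirrors the scaling sections of the two Services above
services = {
    "SDXLControlNetService": {"scaling": {"max_replicas": 2, "min_replicas": 1}},
    "ControlNet": {"scaling": {"max_replicas": 5, "min_replicas": 1}},
}
print(check_scaling(services))  # -> []
```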
To deploy this project to :doc:`BentoCloud </bentocloud/get-started>`, you can choose either the BentoML CLI or the Python API:

.. tab-set::

    .. tab-item:: BentoML CLI

        .. code-block:: bash

            bentoml deploy . -f config-file.yaml

    .. tab-item:: Python API

        .. code-block:: python

            import bentoml

            bentoml.deployment.create(bento="./path_to_your_project", config_file="config-file.yaml")