# Getting Started with Ray Serve
Adapted from [Getting Started](https://docs.ray.io/en/latest/serve/getting_started.html).

This tutorial walks you through writing and testing a Ray Serve application. It shows you how to:

* convert a machine learning model to a Ray Serve deployment
* test a Ray Serve application locally over HTTP
* compose multiple machine learning models into a single application

We’ll use two models in this tutorial:
* [HuggingFace’s TranslationPipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TranslationPipeline) as a text-translation model
* [HuggingFace’s SummarizationPipeline](https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/pipelines#transformers.SummarizationPipeline) as a text-summarizer model

After deploying those two models, we’ll test them with HTTP requests.

## Text Translation Model (before Ray Serve)
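First, we download the [google-t5/t5-small](https://huggingface.co/google-t5/t5-small/tree/main) model from HuggingFace. The clone sets `GIT_LFS_SKIP_SMUDGE=1`, so large files arrive as small Git LFS pointers and the weights are pulled separately below.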

In [20]:
! cd model-cache && GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/google-t5/t5-small

Cloning into 't5-small'...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


remote: Enumerating objects: 100, done.
remote: Total 100 (delta 0), reused 0 (delta 0), pack-reused 100 (from 1)
Receiving objects: 100% (100/100), 970.79 KiB | 1.31 MiB/s, done.
Resolving deltas: 100% (43/43), done.


In [21]:
! ls -ltrh model-cache/t5-small

total 2.2M
-rw-r--r-- 1 ubuntu ubuntu 8.3K May 26 16:04 README.md
-rw-r--r-- 1 ubuntu ubuntu 1.2K May 26 16:04 config.json
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 rust_model.ot
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 pytorch_model.bin
drwxr-xr-x 2 ubuntu ubuntu 4.0K May 26 16:04 onnx
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 model.safetensors
-rw-r--r-- 1 ubuntu ubuntu  147 May 26 16:04 generation_config.json
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 flax_model.msgpack
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 tf_model.h5
-rw-r--r-- 1 ubuntu ubuntu 774K May 26 16:04 spiece.model
-rw-r--r-- 1 ubuntu ubuntu 1.4M May 26 16:04 tokenizer.json
-rw-r--r-- 1 ubuntu ubuntu 2.3K May 26 16:04 tokenizer_config.json



In [22]:
! cd model-cache/t5-small && \
git lfs pull --include="model.safetensors"

Downloading LFS objects:   0% (0/1), 0 B | 0 B/s                                


Downloading LFS objects: 100% (1/1), 242 MB | 13 MB/s                           

In [23]:
! ls -ltrh model-cache/t5-small

total 233M
-rw-r--r-- 1 ubuntu ubuntu 8.3K May 26 16:04 README.md
-rw-r--r-- 1 ubuntu ubuntu 1.2K May 26 16:04 config.json
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 rust_model.ot
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 pytorch_model.bin
drwxr-xr-x 2 ubuntu ubuntu 4.0K May 26 16:04 onnx
-rw-r--r-- 1 ubuntu ubuntu  147 May 26 16:04 generation_config.json
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 flax_model.msgpack
-rw-r--r-- 1 ubuntu ubuntu  134 May 26 16:04 tf_model.h5
-rw-r--r-- 1 ubuntu ubuntu 774K May 26 16:04 spiece.model
-rw-r--r-- 1 ubuntu ubuntu 1.4M May 26 16:04 tokenizer.json
-rw-r--r-- 1 ubuntu ubuntu 2.3K May 26 16:04 tokenizer_config.json
-rw-r--r-- 1 ubuntu ubuntu 231M May 26 16:05 model.safetensors
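
The 134-byte entries in the first listing are consistent with Git LFS pointer files; after `git lfs pull`, `model.safetensors` holds the actual 231M of weights. A minimal sketch for checking this from Python, assuming the standard LFS pointer format (a small text file whose first line starts with `version https://git-lfs.github.com/spec/v1`). The helper name `is_lfs_pointer` is hypothetical, not part of any library:

```python
from pathlib import Path

# First bytes of every Git LFS pointer file (assumed: LFS spec v1).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like a Git LFS pointer, not real content."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER


model_dir = Path("model-cache/t5-small")  # path used in this tutorial
if model_dir.is_dir():
    for file in sorted(model_dir.iterdir()):
        if file.is_file():
            kind = "LFS pointer" if is_lfs_pointer(file) else "real file"
            print(f"{file.name}: {kind}")
```

If a weight file is still reported as a pointer, the pipeline will most likely fail to load it, so this is a cheap check to run before constructing the model.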



Now let’s take a look at our text-translation model. Here’s its code:

In [26]:
from transformers import pipeline


class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr",
                              model="./model-cache/t5-small")

    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)
        print(f"Raw output: {model_output}")

        # Post-process output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator = Translator()

translation = translator.translate("Hello world!")
print(translation)

Raw output: [{'translation_text': 'Bonjour monde!'}]
Bonjour monde!


The Python file, called `model.py`, uses the `Translator` class to translate English text to French.

The `self.model` variable inside `Translator`’s `__init__` method stores a function that uses the [t5-small model](https://huggingface.co/google-t5/t5-small/tree/main) to translate text.

When `self.model` is called on English text, it returns the French translation inside a list of dictionaries formatted as `[{"translation_text": "..."}]`.

The `Translator`’s `translate` method extracts the translated text by taking the first list element and reading its `translation_text` key.
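
The indexing step can be tried without loading a model. The sketch below hard-codes a stand-in for the raw output printed above; `extract_translations` is a hypothetical helper, and it assumes the pipeline returns one dictionary per input text:

```python
# Stand-in for the raw pipeline output shown above; no model is loaded here.
model_output = [{"translation_text": "Bonjour monde!"}]

# Same two-step lookup as Translator.translate: list index, then dict key.
translation = model_output[0]["translation_text"]
print(translation)


def extract_translations(output: list[dict]) -> list[str]:
    """Collect the translated text from every result dictionary."""
    return [item["translation_text"] for item in output]


print(extract_translations(model_output))
```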

You can copy-paste this script and run it locally. It translates `"Hello world!"` into `"Bonjour monde!"`.

Keep in mind that the `TranslationPipeline` is an example ML model for this tutorial. You can follow along using arbitrary models from any Python framework. Check out our tutorials on scikit-learn, PyTorch, and TensorFlow for more info and examples:

* [Serve ML Models (TensorFlow, PyTorch, Scikit-Learn, others)](https://docs.ray.io/en/latest/serve/tutorials/serve-ml-models.html#serve-ml-models-tutorial)

## Converting to a Ray Serve Application
In this section, we’ll deploy the text translation model using Ray Serve, so it can be scaled up and queried over HTTP. We’ll start by converting `Translator` into a Ray Serve deployment.

First, we open a new Python file and import `ray` and `ray.serve`. After these imports, we include our model code from above:

In [2]:
from starlette.requests import Request

import ray
from ray import serve

from transformers import pipeline


@serve.deployment(num_replicas=1,
                  ray_actor_options={"num_cpus": 0.5, "num_gpus": 0})
class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")
        print("Initialized!")

    def translate(self, text: str) -> str:
        print("Translating ...")
              
        # Run inference
        model_output = self.model(text)

        # Post-process output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation

    async def __call__(self, http_request: Request) -> str:
        english_text: str = await http_request.json()
        return self.translate(english_text)



The `Translator` class has two modifications:

1. It has a decorator, `@serve.deployment`.

2. It has a new method, `__call__`.

The decorator converts `Translator` from a Python class into a Ray Serve `Deployment` object.

Each deployment stores a single Python function or class that you write and uses it to serve requests. You can scale and configure each of your deployments independently using parameters in the `@serve.deployment` decorator. The example configures a few common parameters:

* `num_replicas`: an integer that determines how many copies of our deployment process run in Ray. Requests are load balanced across these replicas, allowing you to scale your deployments horizontally.

* `ray_actor_options`: a dictionary containing configuration options for each replica.

    * `num_cpus`: a float representing the logical number of CPUs each replica should reserve. You can make this a fraction to pack multiple replicas together on a machine with fewer CPUs than replicas.

    * `num_gpus`: a float representing the logical number of GPUs each replica should reserve. You can make this a fraction to pack multiple replicas together on a machine with fewer GPUs than replicas.

All these parameters are optional, so feel free to omit them.
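
To make the fractional-resource point concrete, here is a rough back-of-the-envelope sketch. It models only logical CPU accounting (each replica reserves `num_cpus` out of a node's logical CPU count) and ignores memory, GPUs, and any other actors sharing the node. `max_replicas_per_node` is a hypothetical helper, not part of Ray:

```python
import math


def max_replicas_per_node(node_cpus: float, num_cpus_per_replica: float) -> int:
    """Upper bound on replicas that fit on one node by logical CPU count alone."""
    if num_cpus_per_replica <= 0:
        raise ValueError("num_cpus_per_replica must be positive for this estimate")
    # A small epsilon guards against floating-point error, e.g. 0.3 / 0.1.
    return math.floor(node_cpus / num_cpus_per_replica + 1e-9)


# With num_cpus=0.5, as in the decorator above, a 4-CPU node fits up to 8 replicas.
print(max_replicas_per_node(node_cpus=4, num_cpus_per_replica=0.5))  # 8
print(max_replicas_per_node(node_cpus=4, num_cpus_per_replica=1))    # 4
```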

Deployments receive Starlette `HTTP` request objects. By default, the deployment class’s `__call__` method is called on this request object. The return value is sent back in the HTTP response body.

This is why `Translator` needs a new `__call__` method. The method processes the incoming HTTP request by reading its JSON data and forwarding it to the `translate` method. The translated text is returned and sent back through the HTTP response. You can also use Ray Serve’s FastAPI integration to avoid working with raw HTTP requests. Check out [FastAPI HTTP Deployments](https://docs.ray.io/en/latest/serve/http-guide.html#serve-fastapi-http) for more info about FastAPI with Serve.
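
One way to see how `__call__` ties the HTTP layer to `translate` without starting Serve is to drive the method by hand. The sketch below is an illustration only: `FakeRequest` and `EchoTranslator` are hypothetical stand-ins (no Starlette, no model), and it assumes only that the request object exposes an async `json()` method, as the deployment code relies on:

```python
import asyncio


class FakeRequest:
    """Hypothetical stand-in for starlette.requests.Request, for local testing."""

    def __init__(self, payload):
        self._payload = payload

    async def json(self):
        # The real Request.json() parses the HTTP body; here we return it directly.
        return self._payload


class EchoTranslator:
    """Hypothetical class with the same method layout as Translator, minus the model."""

    def translate(self, text: str) -> str:
        return f"<fr>{text}</fr>"  # placeholder instead of real inference

    async def __call__(self, http_request) -> str:
        english_text: str = await http_request.json()
        return self.translate(english_text)


result = asyncio.run(EchoTranslator()(FakeRequest("Hello world!")))
print(result)  # <fr>Hello world!</fr>
```

Keeping `translate` synchronous and free of HTTP details, as the tutorial does, means the model logic can be unit-tested on its own while `__call__` stays a thin adapter.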

Next, we need to `bind` our `Translator` deployment to arguments that will be passed into its constructor. This defines a Ray Serve application that we can run locally or deploy to production (you’ll see later that applications can consist of multiple deployments). Since `Translator`’s constructor doesn’t take in any arguments, we can call the deployment’s `bind` method without passing anything in:

In [3]:
translator_app = Translator.bind()

## Connect to the Ray Cluster
Connect to the Ray cluster through Ray Client, using the address set in `RAY_ADDRESS`.

In [4]:
import os
import ray
from ray import serve

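# Packages and environment variables intended for the Ray workers.
# Note: this dictionary is not passed to ray.init() below, so it has no effect as written.
runtime_env = {
    "pip": [
        "ipython==8.12.3",
        "transformers==4.38.2",
    ],
    "env_vars": {"TF_WARNINGS": "none"},
}

# Point ray.init() at the cluster's Ray Client endpoint.
os.environ["RAY_ADDRESS"] = "ray://localhost:10001"


def r_init():
    """Reconnect to the cluster, dropping any existing connection first."""
    ray.shutdown()
    ray.init()


r_init()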

2024-05-26 18:32:04,954	INFO worker.py:1429 -- Using address ray://localhost:10001 set in the environment variable RAY_ADDRESS
SIGTERM handler is not set because current thread is not the main thread.
    Ray: 2.23.0
    Python: 3.11.9
This process on Ray Client was started with:
    Ray: 2.23.0
    Python: 3.11.8



In [5]:
ray.runtime_context.get_runtime_context().get_job_id()

'0e000000'

## Start the Ray Serve Proxy and Controller
SSH into the Ray cluster head node and run this command:
```bash
serve start --http-host=0.0.0.0 --proxy-location=HeadOnly
```

In [6]:
serve.status()

ServeStatus(proxies={'02cb3c849ec98e7acfbbe84c6ad3c42854231dc30b814a9b6ba4ad08': <ProxyStatus.HEALTHY: 'HEALTHY'>}, applications={}, target_capacity=None)

## Deploy the translator_app application
Use `ray.serve.run` to deploy `translator_app` onto the Ray cluster so that it becomes available at `http://<ray-cluster-host>:8000/translate`.

Reference API:
* [ray.serve.run](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.run.html)

In [7]:
app_name = "eng_to_fr"

# Reuse app_name so the later serve.delete(app_name) targets the same application.
ray.serve.run(target=translator_app,
              name=app_name,
              route_prefix="/translate")

The new client HTTP config differs from the existing one in the following fields: ['host', 'location']. The new HTTP config is ignored.
2024-05-26 18:32:15,498	INFO handle.py:126 -- Created DeploymentHandle 'sp6k5kt0' for Deployment(name='Translator', app='eng_to_fr').
2024-05-26 18:32:15,499	INFO handle.py:126 -- Created DeploymentHandle '2bosd1c8' for Deployment(name='Translator', app='eng_to_fr').
2024-05-26 18:32:19,526	INFO handle.py:126 -- Created DeploymentHandle 'g5pfku8c' for Deployment(name='Translator', app='eng_to_fr').
2024-05-26 18:32:19,527	INFO api.py:584 -- Deployed app 'eng_to_fr' successfully.
2024-05-26 18:32:31,538	INFO handle.py:126 -- Created DeploymentHandle '53nf07ws' for Deployment(name='Translator', app='eng_to_fr').
2024-05-26 18:32:31,538	INFO handle.py:126 -- Created DeploymentHandle '68pwf35w' for Deployment(name='Translator', app='eng_to_fr').
2024-05-26 18:32:31,539	INFO handle.py:126 -- Created DeploymentHandle 'jcl61gjg' for Deployment(name='Translato

DeploymentHandle(deployment='Translator')

In [10]:
import requests

text = "An apricot tree trunk yields excellent wood"

response = requests.post("http://ray-cluster-serve.mlnow.frenoid.com:30080/translate", json=text)

response.text

"Un tronc d'arbre apricot donne un excellent bois"

In [11]:
import requests

text = "Michael's son misses me dearly. They must make haste to avoid missing the train."

response = requests.post("http://ray-cluster-serve.mlnow.frenoid.com:30080/translate", json=text)

response.text

"Le fils de Michael me manque cher, ils doivent s'efforcer d'éviter de manquer le train."

In [12]:
serve.delete(app_name)

2024-05-26 18:38:39,779	INFO pow_2_scheduler.py:260 -- Got updated replicas for Deployment(name='Translator', app='eng_to_fr'): set().
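
A note on the `requests.post(..., json=text)` calls above: passing `json=text` makes the client serialize the Python string as a JSON document, which is why `await http_request.json()` in `__call__` yields a plain `str`. A minimal sketch of that round trip using only the standard library (no request is sent, and `json.dumps` here only approximates what `requests` does internally):

```python
import json

text = "An apricot tree trunk yields excellent wood"

# Roughly what the client places in the request body: a quoted JSON string.
body = json.dumps(text)
print(body)

# What the deployment recovers after parsing the body as JSON.
decoded = json.loads(body)
print(type(decoded).__name__, decoded == text)
```

To send structured input instead, a client could post a dictionary, and `__call__` would then need to read the relevant key from the parsed JSON.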


## Composing Multiple Models
Ray Serve allows you to compose multiple deployments into a single Ray Serve application. This makes it easy to combine multiple machine learning models along with business logic to serve a single request. We can use parameters like `autoscaling_config`, `num_replicas`, `num_cpus`, and `num_gpus` to independently configure and scale each deployment in the application.

For example, let’s deploy a machine learning pipeline with two steps:

1. Summarize English text
2. Translate the summary into French

`Translator` already performs step 2. We can use [HuggingFace’s SummarizationPipeline](https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/pipelines#transformers.SummarizationPipeline) to accomplish step 1. Here’s an example of the `SummarizationPipeline` that runs locally: