In [None]:
import ray
from ray.air.config import ScalingConfig, RunConfig
from ray.train.xgboost import XGBoostTrainer
from ray.train.xgboost import XGBoostPredictor
from ray import tune
from ray.tune import Tuner, TuneConfig
from ray import serve
import requests, json
from starlette.requests import Request
import numpy as np

# Ray Serve

<div class="alert alert-block alert-info">
    
__Roadmap to Serve introduction__

1. Implement a simple service
1. Understand key concepts of Ray Serve including __deployments__
1. Observe a running Serve __application__
</div>

Key principles behind Ray and its libraries are:
* Performance
* Developer experience and simplicity

# Ray Serve

Ray Serve is a framework for serving ML applications.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/serve_architecture.png' width=700/>

# Deployments

A `Deployment` is the fundamental user-facing element of Serve.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment.png' width=600/>

<div class="alert alert-block alert-info">
    
__Roadmap to initial chat app on Serve__
    
1. Discover Serve deployments via a Hello World example
1. Replace the placeholder "Hello World" logic with a Hugging Face transformers chatbot
1. Reserve GPU resources for our chatbot service
</div>

## Our First Service

Let’s jump right in and get something simple up and running on Ray Serve.

In [None]:
@serve.deployment
class Chat:
    def __init__(self, msg: str):
        self._msg = msg # initial state

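    async def __call__(self, request: Request) -> dict:
        # The client below passes an already-serialized JSON string via `json=`,
        # so request.json() yields a str that must be decoded a second time.
        data = await request.json()
        data = json.loads(data)
        return {"result": self.get_response(data["input"])}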
    
    def get_response(self, message: str) -> str:
        return self._msg + message

handle = serve.run(Chat.bind(msg="Yes... "), name='hello_world')

We can test it as an HTTP endpoint:

In [None]:
sample_json = '{ "input" : "hello" }'
requests.post("http://localhost:8000/", json=sample_json).json()
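
The extra `json.loads` in `Chat.__call__` is needed because `sample_json` is already a JSON string, and passing it through `json=` serializes it a second time. The sketch below imitates that round trip with only the standard `json` module. `wire_body` and `first_decode` are illustrative names, and the two `json` calls stand in for what the HTTP client and `request.json()` do rather than reproducing their internals.

```python
import json

sample_json = '{ "input" : "hello" }'

# Client side: serializing a str (rather than a dict) produces a quoted JSON string.
wire_body = json.dumps(sample_json)

# Server side, first decode: we get the original str back, not a dict.
first_decode = json.loads(wire_body)
assert isinstance(first_decode, str)

# Second decode: only now do we have a dict with an "input" key.
data = json.loads(first_decode)
print(data["input"])
```

If the client sent a dict instead of a pre-serialized string, a single decode on the server side would be enough.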

<div class="alert alert-block alert-success">
    
__Lab activity: implement a web service with Ray Serve__
    
The following function calculates the approximate monthly loan payment for a car.
    
```python
def monthly_payment(total_price, rate, years_of_loan):
    n = 365.25  # compounding periods per year (daily compounding)
    total_paid = total_price * ((1 + (rate / 100.0) / n) ** (n * years_of_loan))
    per_month = total_paid / (12 * years_of_loan)
    return per_month
```
   
<br/>
Deploy this calculator as a web service with Ray Serve!
    
</div>
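
Before wrapping the calculator in a deployment, it can help to sanity-check it as plain Python. The sketch below repeats the function from the lab and calls it with made-up inputs (a 20,000 loan at 5% over 5 years, chosen only for illustration). With a zero rate, the result should reduce to the price divided by the number of months.

```python
def monthly_payment(total_price, rate, years_of_loan):
    n = 365.25  # compounding periods per year
    total_paid = total_price * ((1 + (rate / 100.0) / n) ** (n * years_of_loan))
    return total_paid / (12 * years_of_loan)

# Hypothetical inputs, for illustration only.
payment = monthly_payment(20_000, 5, 5)
print(f"{payment:.2f}")

# With no interest, the payment is simply price / months.
zero_rate = monthly_payment(12_000, 0, 1)
assert abs(zero_rate - 1_000) < 1e-9
```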


## Key APIs and concepts

Using Ray Serve, a single Ray cluster can host multiple __applications__.

__Applications__ are coarse-grained chunks of functionality *which can be independently upgraded* (i.e., without impacting other applications on the same cluster).

An __application__ is made up of one or more __deployments__.

A __deployment__ is a smaller component which can
* specify its own hardware and other resource requirements (like GPUs)
* specify its own runtime environments (like libraries)
* scale independently (including autoscaling)
* maintain state (e.g., models)

We can use __deployments__ to achieve *separation of concerns* -- e.g., separating different models, chunks of business logic, or data conversion.

__Ingress deployments__ are typically accessed via HTTP, while other supporting deployments are typically accessed at runtime via a Python `ServeHandle` -- allowing any Serve component (or Ray code) to interact directly with other components as needed.

We create a __deployment__ by applying the `@serve.deployment` decorator to a regular Python class or function. We create and start an __application__ by calling `serve.run` on a deployment (typically an ingress deployment).

### Demo: calling a component from Python via a ServeHandle 

In [None]:
response = handle.get_response.remote('hello')
response

To maximize performance, remote calls return immediately with an object reference rather than the value itself (a bit like futures or promises in some frameworks). Our response string here is one such reference. To block until the result is ready and then retrieve it, we can use `ray.get(...)`.

In [None]:
ray.get(response)
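
The object-reference pattern resembles futures in Python's standard library. The sketch below is an analogy only and uses `concurrent.futures` rather than Ray: `submit` returns immediately with a placeholder, and `result()` plays the blocking role that `ray.get(...)` plays above. The `get_response` function here is a local stand-in for the deployment method.

```python
from concurrent.futures import ThreadPoolExecutor

def get_response(message: str) -> str:
    # Local stand-in for Chat.get_response; not a Ray call.
    return "Yes... " + message

with ThreadPoolExecutor(max_workers=1) as executor:
    # Returns a Future right away, much as .remote() returns an object reference.
    future = executor.submit(get_response, "hello")
    # Blocks until the value is ready, much as ray.get(...) does.
    result = future.result()

print(result)
```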

### Demo: observing application and deployment status

In [None]:
! serve status

In [None]:
serve.list_deployments()

Check the Ray dashboard as well to see more information.

In [None]:
serve.delete('hello_world')