# R API Serving Examples

In this example, we demonstrate how to quickly compare the runtimes of three methods for serving a model from an R-hosted REST API. The following SageMaker examples discuss each method in detail:

* **Plumber**
  * Website: [https://www.rplumber.io](https://www.rplumber.io)
  * SageMaker Example: [r_serving_with_plumber](../r_serving_with_plumber)
* **RestRserve**
  * Website: [https://restrserve.org](https://restrserve.org)
  * SageMaker Example: [r_serving_with_restrserve](../r_serving_with_restrserve)
* **FastAPI** (reticulated from Python)
  * Website: [https://fastapi.tiangolo.com](https://fastapi.tiangolo.com)
  * SageMaker Example: [r_serving_with_fastapi](../r_serving_with_fastapi)
 
We will reuse the Docker images from each of these examples. Each one is configured to serve a small XGBoost model that has already been trained on the classical Iris dataset.

## Building Docker Images for Serving

First, we will build each Docker image from the provided SageMaker examples.

### Plumber Serving Image

In [None]:
!cd .. && docker build -t r-plumber -f r_serving_with_plumber/Dockerfile r_serving_with_plumber

### RestRserve Serving Image

In [None]:
!cd .. && docker build -t r-restrserve -f r_serving_with_restrserve/Dockerfile r_serving_with_restrserve

### FastAPI Serving Image

In [None]:
!cd .. && docker build -t r-fastapi -f r_serving_with_fastapi/Dockerfile r_serving_with_fastapi

## Launch Serving Containers

Next, we will launch each serving container. The containers will be launched on the following ports to avoid port collisions on your local machine or SageMaker Notebook instance:

In [None]:
ports = {
    "plumber": 5000,
    "restrserve": 5001,
    "fastapi": 5002,
}

In [None]:
!bash launch.sh

In [None]:
!docker container list
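A container can appear in the list before its server accepts requests. A minimal sketch of a readiness check follows, assuming each server answers on the `/ping` route used later in this notebook. The helper `wait_until_ready` and its parameters are hypothetical and not part of the original examples.

```python
import time


def wait_until_ready(probe, timeout=30.0, interval=0.5, sleep=time.sleep):
    """Call probe() until it returns a truthy value or the timeout expires.

    probe is any zero-argument callable; exceptions it raises (for example a
    refused connection) are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # the server is not accepting connections yet
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Hypothetical usage against the ports defined above (needs running containers):
# import requests
# for name, port in ports.items():
#     ok = wait_until_ready(
#         lambda: requests.get(f"http://127.0.0.1:{port}/ping", timeout=2).ok
#     )
#     print(name, "ready" if ok else "not ready")
```

Passing the probe as a callable keeps the retry loop independent of the HTTP client, so the same helper works with `requests` or with a session.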

## Define Simple Client

In [None]:
import requests
from tqdm import tqdm
import pandas as pd

In [None]:
def get_predictions(examples, instance=requests, port=5000):
    payload = {"features": examples}
    return instance.post(f"http://127.0.0.1:{port}/invocations", json=payload)

In [None]:
def get_health(instance=requests, port=5000):
    # return the response so callers can inspect the status code
    return instance.get(f"http://127.0.0.1:{port}/ping")

## Define Example Inputs

Next, we define example inputs from the classical [Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset.
* Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In [None]:
column_names = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Label"]
# reading directly from an s3:// path requires the s3fs package to be installed
iris = pd.read_csv(
    "s3://sagemaker-sample-files/datasets/tabular/iris/iris.data", names=column_names
)

In [None]:
iris_features = iris[["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]]

In [None]:
example = iris_features.values[:1].tolist()

In [None]:
many_examples = iris_features.values[:100].tolist()

## Testing

Now it's time to test how each API server performs under stress.

We will test two use cases:
* **New Requests**: In this scenario, we test how quickly the server can respond with predictions when each client request establishes a new connection with the server. This simulates the server's ability to handle real-time requests. We could make this more realistic by creating an asynchronous environment that tests the server's ability to fulfill concurrent rather than sequential requests.
* **Keep Alive / Reuse Session**: In this scenario, we test how quickly the server can respond with predictions when each client request uses a session to keep its connection to the server alive between requests. This simulates the server's ability to handle sequential batch requests from the same client.

For each of the two use cases, we will test the performance in the following situations:

* 1000 requests of a single example
* 1000 requests of 100 examples
* 1000 pings for health status
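The `tqdm` loops in this notebook report total time and iterations per second. To compare servers on per-request latency as well, the timings could be collected explicitly. The following is a minimal sketch under that assumption; `time_calls` and `summarize` are hypothetical helpers that do not appear in the original examples, and the 95th percentile uses the simple nearest-rank method.

```python
import math
import statistics
import time


def time_calls(fn, n=1000):
    """Call fn() n times sequentially and return each call's latency in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize a non-empty list of latencies (seconds) in milliseconds."""
    ordered = sorted(latencies)
    # nearest-rank 95th percentile: the smallest value with >= 95% of samples at or below it
    p95_index = math.ceil(95 * len(ordered) / 100) - 1
    return {
        "count": len(ordered),
        "mean_ms": 1000 * statistics.mean(ordered),
        "median_ms": 1000 * statistics.median(ordered),
        "p95_ms": 1000 * ordered[p95_index],
        "max_ms": 1000 * ordered[-1],
    }


# Hypothetical usage with the client defined above (needs running containers):
# stats = summarize(time_calls(lambda: get_predictions(example, port=ports["plumber"])))
```

Keeping the raw latencies makes it possible to look at tail behavior, which an average throughput figure hides.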

## New Requests

### Plumber

In [None]:
# verify the prediction output
get_predictions(example, port=ports["plumber"]).json()

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(example, port=ports["plumber"])

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(many_examples, port=ports["plumber"])

In [None]:
for i in tqdm(range(1000)):
    get_health(port=ports["plumber"])

### RestRserve

In [None]:
# verify the prediction output
get_predictions(example, port=ports["restrserve"]).json()

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(example, port=ports["restrserve"])

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(many_examples, port=ports["restrserve"])

In [None]:
for i in tqdm(range(1000)):
    get_health(port=ports["restrserve"])

### FastAPI

In [None]:
# verify the prediction output
get_predictions(example, port=ports["fastapi"]).json()

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(example, port=ports["fastapi"])

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(many_examples, port=ports["fastapi"])

In [None]:
for i in tqdm(range(1000)):
    get_health(port=ports["fastapi"])

## Keep Alive (Reuse Session)

Now, let's test how each server performs when every request reuses the same session connection.

In [None]:
# reuse the session for each post and get request
instance = requests.Session()

### Plumber

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(example, instance=instance, port=ports["plumber"])

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(many_examples, instance=instance, port=ports["plumber"])

In [None]:
for i in tqdm(range(1000)):
    get_health(instance=instance, port=ports["plumber"])

### RestRserve

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(example, instance=instance, port=ports["restrserve"])

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(many_examples, instance=instance, port=ports["restrserve"])

In [None]:
for i in tqdm(range(1000)):
    get_health(instance=instance, port=ports["restrserve"])

### FastAPI

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(example, instance=instance, port=ports["fastapi"])

In [None]:
for i in tqdm(range(1000)):
    _ = get_predictions(many_examples, instance=instance, port=ports["fastapi"])

In [None]:
for i in tqdm(range(1000)):
    get_health(instance=instance, port=ports["fastapi"])

## Stop All Serving Containers

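Finally, we will shut down the serving containers we launched for the tests. Note that the command below kills every running container on the machine, not only the three serving containers.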

In [None]:
!docker kill $(docker ps -q)

## Conclusion

In this example, we demonstrated how to run a simple performance benchmark across three R model serving solutions. We leave the choice of serving solution to the reader, since it may be appropriate to customize the benchmark in the following ways:

* Update the serving example to serve a specific model
* Perform the tests across multiple instance types
* Modify the serving example and client to test asynchronous requests
* Deploy the serving examples to SageMaker Endpoints to test within an autoscaling environment
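As one possible starting point for the concurrent-request customization, the client could issue requests from a thread pool. This is a sketch using thread-based concurrency from the standard library rather than a true asynchronous client, and `run_concurrently` with its `max_workers` default is a hypothetical helper, not part of the original examples.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(fn, n_requests=1000, max_workers=8):
    """Run fn() n_requests times on a thread pool.

    Returns (results, elapsed_seconds), with results in submission order.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn) for _ in range(n_requests)]
        # result() re-raises any exception raised inside fn
        results = [future.result() for future in futures]
    elapsed = time.perf_counter() - start
    return results, elapsed


# Hypothetical usage with the client defined above (needs running containers):
# responses, seconds = run_concurrently(
#     lambda: get_predictions(example, port=ports["plumber"])
# )
```

The usage example calls `get_predictions` without a shared session on purpose: we would not assume that a single `requests.Session` is safe to share across threads, so a per-thread session would be the cautious choice when testing keep-alive connections concurrently.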

For more information on serving your models in custom containers on SageMaker, please see our [support documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-main.html) for the latest updates and best practices.