# BentoML - Production ready machine learning

* BentoML: open source model serving library
* Module 6: developed a model for credit approval
    * What needs to be done next?
    * How can the model be used by people?
    * This can be done as a webservice
        * Module 5: wrap into flask app
        * This works well in development, but real-world scenarios bring more factors to consider (especially when many more people are going to use it)
    * Goal of this module:
        * Build and deploy an ML model at scale
        * Customize your service to fit your use case
        * Make your service production ready
* What is 'Production ready'?
    * Scalability
    * Operational efficiency
    * Repeatability (CI/CD)
    * Flexibility
    * Resiliency
    * Ease of use

"BentoML makes it easy to **create** and **package** your ML service for production"

# 7.2 Building a Prediction Service
* Use the model from module 6 of the 2022 course (copied from GitHub)
* With BentoML we can save the model in the way that is recommended for its framework and version
* See end of notebook: module6_2022.ipynb

In [1]:
!pip install bentoml


In [2]:
!bentoml --version

bentoml, version 1.0.7


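```
import bentoml

# `model` is the trained XGBoost model and `dv` the fitted dictionary vectorizer,
# both created earlier in the module 6 notebook
bentoml.xgboost.save_model("credit_risk_model", model,
                           custom_objects={
                               "dictVectorizer": dv
                           })
```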

* ```custom_objects``` lets us save additional objects that the model needs, in this case the dictionary vectorizer

Output:
```Model(tag="credit_risk_model:4kf5u7coewdndaoi", path="/home/frauke/bentoml/models/credit_risk_model/4kf5u7coewdndaoi/")```

* A unique tag is created each time ```save_model``` is called
* The model is saved at a specific path

### Create a Service

* The service definition is saved as 'service.py'
* Call the service from the terminal: ```bentoml serve service.py:svc```
* We then have a service running at ```localhost:3000```
* We can use ```bentoml serve service.py:svc --reload``` to reload the service automatically whenever we change it
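
Once the service is running at ```localhost:3000```, it can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. The endpoint name `classify`, the payload fields, and the assumption that the service replies with JSON are illustrative and not taken from the notes; adjust them to whatever `service.py` actually defines.

```python
import json
import urllib.request


def build_request(application, base_url="http://localhost:3000", endpoint="classify"):
    """Build a POST request that carries one application as a JSON body.

    `endpoint` is an assumed name; use the API function name from service.py.
    """
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}",
        data=json.dumps(application).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify_remote(application, timeout_s=5.0, **request_options):
    """Send the application to the running service and decode the JSON reply."""
    request = build_request(application, **request_options)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `classify_remote({"income": 100, "amount": 1000})` would send one (hypothetical) application and return the decoded response.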

# 7.3 Deploying your Prediction Service

* bentoml provides a command line tool:

```bentoml models list```: lists all models saved.

Output:

| Tag |                                Module |          Size |       Creation Time |
|-----|---------------------------------------|---------------|---------------------|
| credit_risk_model:4kf5u7coewdndaoi | bentoml.xgboost | 195.66 KiB | 2022-10-17 16:13:33 |
| credit_risk_model:fxjqrmcoekdndaoi | bentoml.xgboost | 195.27 KiB | 2022-10-17 15:47:02 |
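
Each tag in this listing has the form `name:version`. When a model is referenced as `credit_risk_model:latest`, BentoML picks the most recently created version. The sketch below only illustrates that rule on the two rows listed above; `resolve_latest` is a hypothetical helper written for this note, not a BentoML function.

```python
from datetime import datetime

# (tag, creation time) pairs copied from the `bentoml models list` output above
LISTING = [
    ("credit_risk_model:4kf5u7coewdndaoi", "2022-10-17 16:13:33"),
    ("credit_risk_model:fxjqrmcoekdndaoi", "2022-10-17 15:47:02"),
]


def resolve_latest(name, listing):
    """Return the tag of the most recently created version of `name`."""
    candidates = []
    for tag, created in listing:
        model_name, _, version = tag.partition(":")
        if model_name == name:
            created_at = datetime.strptime(created, "%Y-%m-%d %H:%M:%S")
            candidates.append((created_at, version))
    if not candidates:
        raise LookupError(f"no saved model named {name!r}")
    _, newest_version = max(candidates)  # compares creation times first
    return f"{name}:{newest_version}"


latest_tag = resolve_latest("credit_risk_model", LISTING)
print(latest_tag)  # credit_risk_model:4kf5u7coewdndaoi
```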
    
```bentoml models get credit_risk_model:fxjqrmcoekdndaoi```: gives detailed information about the model

```
name: credit_risk_model                                                                                                             
version: fxjqrmcoekdndaoi                                                                                                           
module: bentoml.xgboost                                                                                                             
labels: {}                                                                                                                          
options:                                                                                                                            
  model_class: Booster                                                                                                              
metadata: {}                                                                                                                        
context:                                                                                                                            
  framework_name: xgboost                                                                                                           
  framework_versions:                                                                                                               
    xgboost: 1.6.2                                                                                                                  
  bentoml_version: 1.0.7                                                                                                            
  python_version: 3.8.3                                                                                                             
signatures:                                                                                                                         
  predict:                                                                                                                          
    batchable: false                                                                                                                
api_version: v2                                                                                                                     
creation_time: '2022-10-17T13:47:02.302464+00:00'  
```

### How to build our bento

* We need to create a 'bentofile.yaml'
* See documentation for complete list of possible parameters: https://docs.bentoml.org/en/latest/concepts/bento.html
* It not only specifies things about the model itself, but also about the environment
* build the bento: ```bentoml build``` in terminal
* Output: ```Successfully built Bento(tag="credit_risk_classifier:tcr675covor57p7e")```
* Going to ```~/bentoml/bentos/credit_risk_classifier/tcr675covor57p7e``` we can see the files stored in the bento
![bento1.png](bento1.png)
* A Dockerfile is generated automatically (it can be customized)
* A bento is a standardized way of collecting everything needed for an ML service in one place
* If we containerize it, we have a single image to deploy
* To actually containerize it, go back to the folder where service.py and bentofile.yaml are stored, then run in the terminal: ```bentoml containerize credit_risk_classifier:tcr675covor57p7e```
* Output: ```Successfully built docker image for "credit_risk_classifier" with tags "credit_risk_classifier:tcr675covor57p7e"
To run your newly built Bento container, pass "credit_risk_classifier:tcr675covor57p7e" to "docker run". For example: "docker run -it --rm -p 3000:3000 credit_risk_classifier:tcr675covor57p7e serve --production".```
* Run ```docker run -it --rm -p 3000:3000 credit_risk_classifier:tcr675covor57p7e``` to start the container. Then we can go to ```localhost:3000``` to see our service

# 7.4 Sending, Receiving and Validating Data

* At the moment the service returns a response even when the input data is invalid, for example when an entry is missing or has a wrong name. To reject such requests we validate the input with the ```pydantic``` library
* A list of input and output descriptors can be found in the documentation: https://docs.bentoml.org/en/latest/reference/api_io_descriptors.html
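
To make the idea concrete, here is a minimal sketch of the kind of check described above: rejecting a request when an entry is missing or misnamed. It deliberately uses a standard-library dataclass as a stand-in so that it runs without extra dependencies; in the actual service this role is played by a ```pydantic``` model, which also checks types. The field names are hypothetical examples, not the full feature list of the credit model.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CreditApplication:
    # Hypothetical subset of the input fields, for illustration only
    seniority: int
    income: float
    amount: int


def parse_application(payload):
    """Reject payloads with missing or misnamed entries, else build the record."""
    expected = {f.name for f in fields(CreditApplication)}
    received = set(payload)
    missing = sorted(expected - received)
    unexpected = sorted(received - expected)
    if missing or unexpected:
        raise ValueError(
            f"missing fields: {missing}; unexpected fields: {unexpected}"
        )
    return CreditApplication(**payload)


valid = parse_application({"seniority": 3, "income": 100.0, "amount": 1000})
print(valid)

try:
    parse_application({"seniority": 3, "incom": 100.0, "amount": 1000})
except ValueError as error:
    print("rejected:", error)
```

The second call fails because `incom` is misspelled, which is exactly the case the unvalidated service would silently accept.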

# 7.5 High-Performance Serving

* Test the service with a large amount of traffic
* Tool to send traffic to the service: ```Locust```
    * ```pip install locust```
    * We need to create a ```locustfile```
        * Contains a data sample
        * Defines a class that inherits from ```HttpUser```
        * In this class we need to create a ```task```; we call it ```classify```
        * Define a random waiting time to simulate irregular traffic
        * To start the locust process: ```locust -H http://localhost:3000```
* Optimizations
    * Use ```async``` to parallelize the requests; otherwise all requests are processed one after another
    * ```async``` parallelizes at the endpoint level
    * Traditionally, web scaling is done by replicating the entire process
    * It is much more efficient to send multiple inputs to a model at once than to send inputs one at a time
    * Combining inputs into batches is called ```micro-batching```
    * We have to enable this when we save the model:
    ```
    bentoml.xgboost.save_model("credit_risk_model", model,
                               custom_objects={
                                   "dictVectorizer": dv
                               },
                               signatures={  # model signatures for micro-batching
                                   "predict": {
                                       "batchable": True,
                                       "batch_dim": 0
                                   }
                               })
    ```
    * ```signatures``` defines which methods of the model are batchable
    * We then need to run ```bentoml serve --production``` in the terminal
    * This tells BentoML that we want more than one process for our web service
    * For more details see the documentation: https://docs.bentoml.org/en/latest/guides/batching.html#architecture
    * Pay particular attention to the parameters ```max_batch_size``` and ```max_latency_ms```
    * These parameters can be changed by creating a ```bentoconfiguration.yaml``` file:
    ```
    runners:
      batching:
        max_batch_size: 100
        max_latency_ms: 500
    ```
    * In this file we can also specify, for example, whether we want to run on a GPU
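
The effect of ```async``` can be illustrated without BentoML. In the sketch below, `classify` is a hypothetical endpoint whose model call is simulated by a short sleep; while one request waits, the event loop can start working on the next one. In a real service the waiting would happen on the runner call (BentoML runners offer an `async_run` variant for this), so treat this purely as an illustration of the concurrency pattern.

```python
import asyncio
import time


async def classify(application):
    """Hypothetical endpoint: the sleep stands in for awaiting the model runner."""
    await asyncio.sleep(0.1)
    return {"id": application["id"], "status": "scored"}


async def handle_sequentially(applications):
    # Each request finishes before the next one starts
    return [await classify(application) for application in applications]


async def handle_concurrently(applications):
    # All requests are in flight at once; results keep the input order
    return await asyncio.gather(*(classify(a) for a in applications))


def timed(handler, applications):
    start = time.perf_counter()
    results = asyncio.run(handler(applications))
    return results, time.perf_counter() - start


applications = [{"id": i} for i in range(5)]
_, sequential_s = timed(handle_sequentially, applications)
results, concurrent_s = timed(handle_concurrently, applications)
print(f"sequential: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

The sequential variant needs roughly five times the simulated model latency, while the concurrent one needs roughly one.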

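Micro-batching itself can be sketched in a few lines. The example below is a simplified illustration, not BentoML's dispatcher (which also takes latency into account via ```max_latency_ms```): queued inputs are stacked along dimension 0, matching `batch_dim: 0`, in groups of at most `max_batch_size`, and the model is called once per group. `predict_batch` is a stand-in model that sums each row.

```python
def run_micro_batched(rows, predict_batch, max_batch_size):
    """Score `rows` with one model call per batch instead of one call per row."""
    predictions = []
    for start in range(0, len(rows), max_batch_size):
        batch = rows[start:start + max_batch_size]  # rows stacked along dim 0
        predictions.extend(predict_batch(batch))    # one prediction per row
    return predictions


batch_sizes_seen = []


def predict_batch(batch):
    """Stand-in model: records the batch size and returns the sum of each row."""
    batch_sizes_seen.append(len(batch))
    return [sum(row) for row in batch]


rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
predictions = run_micro_batched(rows, predict_batch, max_batch_size=2)
print(predictions)       # [3, 7, 11, 15, 19]
print(batch_sizes_seen)  # [2, 2, 1]
```

Five inputs are served with three model calls instead of five; with a real model the saving comes from the framework processing a whole batch in one vectorized call.
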
# 7.6 Bento Production Deployment
* Build a bento, create a Docker container, deploy the Docker container to AWS (ECS)
* Build the bento: ```bentoml build```
* Containerize: ```bentoml containerize credit_risk_classifier:3xlt5ocpp2ledn4h```
* AWS.com: log in, create an 'Elastic Container Registry'
    * After our container is built (previous step), we have to push it to the Container Registry on AWS
    * We can use the ```aws``` CLI command to interface with our cloud
    * Install the AWS CLI
    * For the AWS CLI to work, we need our "Access" and "Secret" keys. These can be found under the link "Security Credentials" -> Access Keys. (Note that for real production cases there are more secure ways, e.g. "IAM User")
    * To specify the access and secret key in the terminal use ```aws configure```
    * To check whether we are connected: ```aws s3 ls``` shows which buckets are available
    * Tag our created Docker image for our new repo (on AWS): ```docker tag credit_risk_classifier:3xlt5ocpp2ledn4h ... ```
    * Push that image: ```docker push ...```
    * 'Elastic Container Registry' is the place to store the image; to run it we use 'Elastic Container Service'
    * There we first need a 'Cluster' (click on 'Clusters' on the left-hand side, then 'Create cluster')
    * There are different types of clusters that we can create. For this example we use 'Networking only', which makes the most sense for an entry-level AWS account (no GPUs).
    * Configure task and container definitions:
        * Task definition name: credit-risk-classifier
        * Task role: -
        * Operating system family: Linux
        * Task memory (GB): 0.5GB
        * Task CPU (vCPU): 0.25 vCPU
    * These values are small, but they keep us within the free tier!
    * Press the 'Add Container' button (go back to the Container Registry first and copy the URI of the image):
        * Container name: credit-risk-classifier-container
        * Image: ```<Image URI>```
        * Soft limit: 256 (a soft limit lets the application burst above it for short periods of time, in contrast to a hard limit; for a larger model a hard limit makes sense)
        * Port mappings: 3000 (we need to expose this port)
        * When done, click "Add"
    * Go back to 'Clusters' and click on our cluster 'credit-risk-classifier-cluster'
        * Go to 'Tasks'
        * Click 'Run new Task' and specify the options
            * Launch type: Fargate
            * Operating system family: Linux
            * Family: credit-risk-classifier-task
            * Revision: 2
            * Cluster VPC: select default
            * Subnets: ...
            * Security Group: Click 'Edit'
                * Select 'Inbound rules for security group': Type: Cluster ... , Port: 3000
                * 'Save'
            * 'Create'
    * Now our task is running!
    * We can reach the service at the public IP: ```<IP>:3000```
    * Try 'locust' to see how it scales
    
* Alternatives:
    * ECS, SageMaker: can host notebooks, a bit more expensive, can use GPUs
    * Google
    * Azure
* Share Bento
    * In the terminal: ```bentoml export```
    * Exports the bento to a local file or to a remote destination such as an S3 bucket

# 7.7 - (Optional) Advanced Example: Deploying Stable Diffusion Model

* https://github.com/bentoml/stable-diffusion-bentoml
* Clone the repo: ```git clone https://github.com/bentoml/stable-diffusion-bentoml.git``` 
* Move into the repo: ```cd stable-diffusion-bentoml```
* Create a virtual env: ```python3 -m venv venv```
* Activate the env: ```. venv/bin/activate```
* Update pip: ```pip install -U pip```
* Install requirements: ```pip install -r requirements.txt```

**There are different ways to create a Stable Diffusion bento**
* Download a pre-built bento
* Build from Stable Diffusion models -> We will use this approach

Choose a Stable Diffusion model: fp16 (for GPUs with less than 10GB VRAM). It is a little smaller than the fp32 model
* Move to the directory: ```cd fp16/```
* Pull down the model and put it into the 'models' directory: ```curl https://s3.us-west-2.amazonaws.com/bentoml.com/stable_diffusion_bentoml/sd_model_v1_4_fp16.tgz | tar zxf - -C models/``` (this takes a while)
* Move into the 'models' directory
* Use ```tree``` to get an overview of the model

take a look at the bento: ```vim service.py```
* We are instantiating a custom runner:
 ```stable_diffusion_runner = bentoml.Runner(StableDiffusionRunnable, name='stable_diffusion_runner', max_batch_size=10)```

* create a service: ```svc = bentoml.Service("stable_diffusion_fp16", runners=[stable_diffusion_runner])```

* The API takes JSON as input and returns an image
* Stable Diffusion has a second API, which takes text and an image as input and returns an image; this is also in our service. Its input is a bit more complicated, as it is 'multipart' (text + image)

* The custom runner is defined as ```StableDiffusionRunnable```
    * It inherits from the ```bentoml.Runnable``` class
    * We need to define ```SUPPORTED_RESOURCES``` (where the model should run, i.e. what type of hardware, such as a GPU) and ```SUPPORTS_CPU_MULTI_THREADING``` (if True, multiple threads are created for the runner)
    * ```__init__```: initializes where the model is and the device it runs on (here: cuda, i.e. NVIDIA GPUs), and makes sure everything we need to make a prediction is in place
    * ```txt2img```: runnable method with some preprocessing defined
    * ```img2img```: runnable method with some (more complicated) preprocessing defined
    
Take a look at the bentofile:
* Include the files and packages we need
* Docker options: the ```cuda_version``` is important here if we want to run on an NVIDIA GPU!

The configuration.yaml controls attributes of the runner
* In this case only the timeout is increased, because the model needs at least a few seconds to run

* Build the bento: ```bentoml build```
* We can't serve it locally without a GPU (it would take a long time)
* Run ```pip install bentoctl```
* Different services are supported as deployment targets. We will use AWS EC2
* Move to ```stable-diffusion-bentoml/bentoctl``` and open ```deployment_config.yaml```; here aws-ec2 is defined together with a GPU instance ('g4dn.xlarge', one of the cheaper GPU instances)
* Install the operator: ```bentoctl operator install aws-ec2```
* Generate all files we need based on deployment_config.yaml: ```bentoctl generate -f deployment_config.yaml``` creates ```./main.tf``` and ```bentoctl.tfvars```
* Install Terraform: https://learn.hashicorp.com/tutorials/terraform/install-cli
* Terraform allows us to interface with AWS and deploy our bento to EC2; this is done using ```main.tf``` and ```bentoctl.tfvars```
* Build the Docker container that we will deploy to EC2 and push it to a repository in our AWS account: ```bentoctl build -b stable_diffusion_fp16:latest```
* EC2 has a few special requirements for deploying Docker; for targets that don't need any special packaging (like ECS, Kubernetes) ```bentoml containerize``` is sufficient
* Create the EC2 machine that hosts the model: ```bentoctl apply -f deployment_config.yaml```

To create the specified resources, run this after the build command:
```$ bentoctl apply```

To clean up all the resources created and delete the registry, run:
```$ bentoctl destroy```


