## Walkthrough of model deployment as an ML web service on Kubernetes

This notebook outlines steps for deploying a machine learning model as a simple custom-built REST API prediction service to a Kubernetes instance.

It is composed of the following sections:
 1. Prepare environment
 2. Test the model
 3. Run the service locally with Flask
 4. Run the service using Docker
 5. Run the service on a Kubernetes instance
 6. Autoscaling and load-testing the service on Kubernetes

Note: this notebook assumes you are running on a Windows device and have the Docker, kubectl and Helm CLIs installed. Linux users will need to adjust the syntax of the curl commands.


### Prepare environment

**Import libraries**

In [1]:
from yaml import load, Loader
import pandas as pd
import os, glob
import requests
import json
import joblib

**Load config and chosen models**

Load configuration

In [2]:
with open('config.yaml','r') as config_file:
    config = load(config_file, Loader=Loader)

docker_registry = config['DOCKER_REGISTRY']
service_name = config['SERVICE_NAME']
api_version = config['API_VERSION']
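# build the path with os.path.join to avoid invalid backslash escapes such as '\e' and '\m'
model_repo = os.path.join('..', 'experimentation', 'models')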

Copy latest model to deployment directory

In [3]:
latest_model = sorted(os.listdir(model_repo))[-1]
latest_model_path = os.path.join(model_repo,latest_model)

!copy "{latest_model_path}" .

        1 file(s) copied.
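
The `!copy` shell command above is specific to the Windows command prompt. As a portable alternative, the same step can be sketched in pure Python with `shutil`. The helper name `copy_latest_model` is hypothetical and, like the cell above, it assumes that model filenames sort lexicographically in chronological order (for example, because they carry a timestamp).

```python
import os
import shutil


def copy_latest_model(model_repo, dest_dir="."):
    """Copy the lexicographically last file in model_repo into dest_dir."""
    # Only consider regular files (the notebook cell takes every directory entry)
    candidates = sorted(
        name for name in os.listdir(model_repo)
        if os.path.isfile(os.path.join(model_repo, name))
    )
    if not candidates:
        raise FileNotFoundError(f"no model files found in {model_repo!r}")
    latest_model_path = os.path.join(model_repo, candidates[-1])
    # copy2 also preserves file metadata such as the modification time
    return shutil.copy2(latest_model_path, dest_dir)
```

Calling `copy_latest_model(model_repo)` would then stand in for the `!copy` line on any operating system.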


### Test the model

Import data for testing

In [3]:
test_df = pd.read_csv("../experimentation/datasets/test.csv")
test_entry = test_df[test_df.Fare.notna()].copy()

Load in the ML model and call the predict method on the data.

In [4]:
# load in file with .pkl extension as the model
ml_model = joblib.load(glob.glob('*.pkl')[0])
predictions = ml_model.predict(test_entry)
predictions

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

### Run the prediction service locally using Flask

Run the Flask app. The service will be served at http://127.0.0.1:5000/titanic/v0.0.1/predict

In [5]:
# run the service locally
!python api.py

^C


Running the app in a cell blocks the kernel until it is interrupted. Alternatively, run the service from a separate command prompt / shell and test the web service from here using curl.

In [6]:
# this will need to be run from a separate kernel / terminal to that running the web service
!curl -X POST -H "Content-Type:application/json" --data "{\"PassengerId\":[892],\"Pclass\":[3],\"Name\":[\"Kelly, Mr. James\"],\"Sex\":[\"male\"],\"Age\":[34.5],\"SibSp\":[0],\"Parch\":[0],\"Fare\":[7.8292],\"Embarked\":[\"S\"]}" http://127.0.0.1:5000/titanic/v0.0.1/predict

{"predictions":[0]}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   167  100    20  100   147   1435  10551 --:--:-- --:--:-- --:--:-- 12846
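
The escaped quotes in the curl command above are specific to the Windows command prompt. As an OS-independent sketch, the same request can be built with Python's standard library. The helper names `build_predict_request` and `send_predict_request` are hypothetical; the URL and payload are copied from the curl call above.

```python
import json
import urllib.request

URL = "http://127.0.0.1:5000/titanic/v0.0.1/predict"

# Column-oriented payload: each feature maps to a list with one value per passenger
PAYLOAD = {
    "PassengerId": [892], "Pclass": [3], "Name": ["Kelly, Mr. James"],
    "Sex": ["male"], "Age": [34.5], "SibSp": [0], "Parch": [0],
    "Fare": [7.8292], "Embarked": ["S"],
}


def build_predict_request(url=URL, payload=PAYLOAD):
    """Build (but do not send) a JSON POST request for the prediction service."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}, method="POST"
    )


def send_predict_request(request=None, timeout=10):
    """Send the request and return the parsed JSON body (service must be running)."""
    request = request or build_predict_request()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

As with the curl cell, `send_predict_request()` must be called from a separate kernel or terminal to the one running the web service.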


### Containerise the prediction service using Docker

**Build the docker image**

Create a relevant tag that includes the image repository, a name for the service and its version. Build the image and tag it with the relevant tag.

In [13]:
tag = f'{docker_registry}/{service_name}:{api_version}'
!docker build -t {tag} .

#1 [internal] load build definition from Dockerfile
#1 sha256:3653b7f4eb55c89c4ca666c0fefffb0333f8f8ac5ee2edfbcbb32b34f45053ee
#1 transferring dockerfile: 32B done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 sha256:d45d1a578aaa317b91817517a57aedad5a13ca5c8f968a3ebcf9160a01480f57
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/python:3.9-slim
#3 sha256:3425157df499c84dd49181e5611a11caeed16adf15a5ddbcfa4c3002c56d3d27
#3 DONE 1.5s

#4 [1/5] FROM docker.io/library/python:3.9-slim@sha256:f4efbe5d1eb52c221fded79ddf18e4baa0606e7766afe2f07b0b330a9e79564a
#4 sha256:9ce0d84a404c9ac604ef98baa1f1065d5a70e321684b314c01df3d72c5a89693
#4 DONE 0.0s

#6 [internal] load build context
#6 sha256:98489ef9bda79a123558bad715b9a37b5e733c2152a44f8c85f7d73a65b6d3a9
#6 transferring context: 210B 0.0s done
#6 DONE 0.0s

#5 [2/5] RUN mkdir /app
#5 sha256:f8977e52fc2da4995e347b7fb878eedc812cb1c54dc0d42a662aa3db7b518aba
#5 CACHED

#7 [3/5] COPY config.yaml api.p
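
As a small sketch of how the image tag and the service URL relate to the configuration, the helpers below compose both from three values. The `config.yaml` file is not shown in this notebook, so the values used here are assumptions inferred from the image name and URL that appear elsewhere in it. The URL pattern is likewise inferred rather than taken from `api.py`, and the function names are hypothetical.

```python
def image_tag(docker_registry, service_name, api_version):
    """Compose a Docker image tag of the form <registry>/<name>:<version>."""
    return f"{docker_registry}/{service_name}:{api_version}"


def predict_url(service_name, api_version, host="127.0.0.1", port=5000):
    """Compose the prediction endpoint URL (pattern inferred, not confirmed)."""
    return f"http://{host}:{port}/{service_name}/v{api_version}/predict"


# Assumed values; the real ones live in config.yaml
config = {
    "DOCKER_REGISTRY": "edlongbottom/mlwebservice",
    "SERVICE_NAME": "titanic",
    "API_VERSION": "0.0.1",
}
tag = image_tag(config["DOCKER_REGISTRY"], config["SERVICE_NAME"], config["API_VERSION"])
url = predict_url(config["SERVICE_NAME"], config["API_VERSION"])
print(tag)
print(url)
```

Deriving both strings from the same configuration keeps the image version and the API version in the URL from drifting apart.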

**Run the service on Docker**

Run the image as a container locally and map container port 5000 to localhost port 5000 for testing.

In [17]:
!docker run --rm -p 5000:5000 --name test-ml-model edlongbottom/mlwebservice/titanic:0.0.1

^C


**Test the service**

Use curl or the Python requests module to test the prediction web service.

In [7]:
# again, this must be executed from a separate kernel/terminal as the kernel is occupied running the previous cell
!curl -X POST -H "Content-Type:application/json" --data "{\"PassengerId\":[892],\"Pclass\":[3],\"Name\":[\"Kelly, Mr. James\"],\"Sex\":[\"male\"],\"Age\":[34.5],\"SibSp\":[0],\"Parch\":[0],\"Fare\":[7.8292],\"Embarked\":[\"S\"]}" http://127.0.0.1:5000/titanic/v0.0.1/predict

{"predictions":[0]}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   167  100    20  100   147   1279   9403 --:--:-- --:--:-- --:--:-- 11928


**Tear down**

Once testing is complete, stop and remove the Docker container. The `docker rm` step isn't required if the `--rm` flag was included in the `docker run` command: the container is removed automatically as soon as it stops, so `docker rm` reports an error, as shown below.

In [8]:
!docker stop test-ml-model
!docker rm test-ml-model

test-ml-model


Error: No such container: test-ml-model


### Deploy the prediction service to Kubernetes

Push the built image to Docker Hub so it is available remotely (you may need to log in to Docker first and create the repository if you haven't already).

In [None]:
!docker push {tag}

**Configure a kubernetes cluster** 

At this point, a Kubernetes cluster is required and your kubectl CLI must be configured to use the chosen cluster as its current context. Docker Desktop or Minikube can be used to spin up a cluster locally, or alternatively you could provision a cluster through a cloud provider (for example, AKS from Azure).

I am using Minikube here, which needs to be started to spin up a Kubernetes cluster:

`minikube start`

I can then use the kubectl CLI to run some basic commands to check the cluster is running ok:
 - the first command provides information on the node
 - the second command provides info on all kubernetes objects in all namespaces on this cluster

In [13]:
!kubectl get nodes -o wide
print("\n")
!kubectl get all --all-namespaces     

NAME       STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION                      CONTAINER-RUNTIME
minikube   Ready    control-plane,master   22h   v1.22.3   192.168.49.2   <none>        Ubuntu 20.04.2 LTS   5.10.16.3-microsoft-standard-WSL2   docker://20.10.8


NAMESPACE              NAME                                             READY   STATUS    RESTARTS        AGE
kube-system            pod/coredns-78fcd69978-fph8x                     1/1     Running   2 (4m55s ago)   22h
kube-system            pod/etcd-minikube                                1/1     Running   2 (4m55s ago)   22h
kube-system            pod/kube-apiserver-minikube                      1/1     Running   2 (4m55s ago)   22h
kube-system            pod/kube-controller-manager-minikube             1/1     Running   2 (4m55s ago)   22h
kube-system            pod/kube-proxy-n9dvx                             1/1     Running   2 (4m55s ago)   22h
kube-system  

**Deploy the prediction service using Helm**

Once you have a cluster set up and you are connected to it, deploy the Docker image to Kubernetes using Helm. The Helm chart is included under the deployment folder. Let's spin up a single instance of the web service first (using the `helm-ml-serving-single` chart).

This chart includes the following files, which collectively create Namespace, Service (of type LoadBalancer) and Deployment objects on Kubernetes:

 - `templates/deployment.yaml`
 - `templates/service.yaml`
 - `templates/namespace.yaml`
 - `Chart.yaml`
 - `values.yaml`
 
These objects could be deployed one by one declaratively as YAML files. However, things can get complicated when you have complex applications with many objects and services that all reference each other. Helm is useful as a package manager, keeping all variables in one file (`values.yaml`) and version information in another (`Chart.yaml`). The Helm CLI is used to deploy, upgrade and roll back charts.

In [14]:
!helm upgrade --install mlwebservice-titanic helm-ml-serving-single

Release "mlwebservice-titanic" does not exist. Installing it now.
NAME: mlwebservice-titanic
LAST DEPLOYED: Thu Jan 20 15:14:28 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None


Confirm the deployment was successful by checking the pods in the model-serving namespace (you may need to wait a minute).

In [16]:
!kubectl get pods -n model-serving

NAME                                               READY   STATUS    RESTARTS   AGE
mlwebservice-titanic-deployment-6b8c7c5dcc-hmghx   1/1     Running   0          2m48s
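
Rather than re-running the cell until the pod shows `Running`, readiness can also be checked programmatically. The sketch below inspects the JSON that `kubectl get pods -n model-serving -o json` emits and reports whether every pod has a `Ready` condition with status `True`. The function name `all_pods_ready` is hypothetical, and the sample document is hand-written and trimmed to the fields used here.

```python
import json


def all_pods_ready(pod_list):
    """Return True if the pod list is non-empty and every pod reports Ready=True."""
    items = pod_list.get("items", [])
    if not items:
        return False
    for pod in items:
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            return False
    return True


# Trimmed, hand-written sample in the shape of `kubectl get pods -o json`
sample = json.loads("""
{"items": [{"metadata": {"name": "mlwebservice-titanic-deployment-6b8c7c5dcc-hmghx"},
            "status": {"phase": "Running",
                       "conditions": [{"type": "Ready", "status": "True"}]}}]}
""")
print(all_pods_ready(sample))
```

In the notebook, the JSON could be obtained with `subprocess.run(["kubectl", "get", "pods", "-n", "model-serving", "-o", "json"], capture_output=True, text=True)` and passed through `json.loads` before calling the function in a polling loop.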


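Test the web service using curl. This assumes port 5000 of the service is reachable on localhost; with Minikube this may require a tunnel (see the `minikube service` command in the autoscaling section) or a port-forward.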

In [17]:
!curl -X POST -H "Content-Type:application/json" --data "{\"PassengerId\":[892],\"Pclass\":[3],\"Name\":[\"Kelly, Mr. James\"],\"Sex\":[\"male\"],\"Age\":[34.5],\"SibSp\":[0],\"Parch\":[0],\"Fare\":[7.8292],\"Embarked\":[\"S\"]}" http://127.0.0.1:5000/titanic/v0.0.1/predict

{"predictions":[0]}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   167  100    20  100   147   1223   8995 --:--:-- --:--:-- --:--:-- 11133


**Tear down**

Remove the service when not in use

In [18]:
!helm uninstall mlwebservice-titanic

release "mlwebservice-titanic" uninstalled


### Introduce autoscaling and stress-test the service

Kubernetes has a Horizontal Pod Autoscaler (HPA) to allow you to scale up the number of pods to meet demands on the application. Depending on the resource requirements of your pods and the spec of the node, you may also want to introduce more nodes to cope with demand. Here, we will look at scaling the number of pods only.

**Enable Metrics Server**

HPA queries a Metrics Server to measure resource utilisation such as CPU / RAM. Minikube has a metrics server that must be launched using the following command. It will launch a deployment object which you can view using kubectl.

In [19]:
!minikube addons enable metrics-server

  - Using image k8s.gcr.io/metrics-server/metrics-server:v0.4.2
* The 'metrics-server' addon is enabled


In [20]:
!kubectl get deployment metrics-server -n kube-system

NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           22h


**Deploy the prediction service**

We can now deploy the second version of our Helm chart (`helm-ml-serving-multi`), which adds `autoscale.yaml` to create an HPA object. This allows the deployment to scale between 1 and 3 replicas of the model based on CPU utilization. The `deployment.yaml` has also been amended to include resource requests and limits for the pod. Keeping these low forces Kubernetes to scale up the number of pods as soon as CPU utilization exceeds the target.

In [55]:
!helm upgrade --install mlwebservice-titanic helm-ml-serving-multi

Release "mlwebservice-titanic" has been upgraded. Happy Helming!
NAME: mlwebservice-titanic
LAST DEPLOYED: Thu Jan 20 16:51:31 2022
NAMESPACE: default
STATUS: deployed
REVISION: 4
TEST SUITE: None


Use kubectl to get information on the auto-scaler and the load-balancer:

In [57]:
!kubectl get hpa -n model-serving

NAME                       REFERENCE                                    TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
mlwebservice-titanic-hpa   Deployment/mlwebservice-titanic-deployment   200%/20%   1         3         1          19s
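
To interpret the `TARGETS` column above: the Kubernetes documentation gives the HPA's core rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, bounded by the minimum and maximum replica counts. The sketch below applies that rule to the values shown. The function name is hypothetical, and the real controller also applies a tolerance band, stabilisation windows and pod-readiness handling that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_utilisation, target_utilisation,
                     min_replicas=1, max_replicas=3):
    """Simplified HPA rule: scale in proportion to the metric ratio, then clamp."""
    raw = math.ceil(current_replicas * current_utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, raw))


# Values from the `kubectl get hpa` output: 200% observed against a 20% target
print(desired_replicas(1, 200, 20))  # ceil(1 * 200 / 20) = 10, clamped to 3
```

With one replica at 200% of a 20% target, the unclamped result is 10, so the deployment would be expected to grow to the maximum of 3 pods.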


In [58]:
!kubectl get svc -n model-serving

NAME                           TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service-mlwebservice-titanic   LoadBalancer   10.109.149.55   <pending>     5000:32276/TCP   82m


If using Minikube locally, run the following command to set up a tunnel from a random port on localhost to the LoadBalancer service:

`minikube service service-mlwebservice-titanic -n model-serving`

Then, send a request using the port which was assigned (mine was `58100`)

In [54]:
# define the URL (localhost + port); if you are using the tunnel, replace 5000
# with the port that `minikube service` assigned (mine was 58100)
port = 5000
url = f"http://127.0.0.1:{port}/titanic/v0.0.1/predict"
headers = {"Content-Type": "application/json"}
# same payload as the curl examples above (including 'Parch')
body = {'PassengerId':[892],'Pclass':[3],'Name':['Kelly, Mr. James'],'Sex':['male'],
        'Age':[34.5],'SibSp':[0],'Parch':[0],'Fare':[7.8292],'Embarked':['S']}

# send a POST request to the Flask API
response = requests.post(url=url, data=json.dumps(body), headers=headers)
print(response.json())

{'predictions': [0]}


**Stress test the service**

The service can be stress-tested by repeatedly sending requests to it to simulate traffic. See `stress-test.py` for a script that uses a while loop and the requests module to fire requests at the prediction service.

First, run the dashboard so you can monitor the deployment and how it scales as it is subjected to traffic:

`minikube dashboard`

Then, run the python script to test the service:

`python stress-test.py`

By viewing the dashboard, you should be able to see the number of pods increase to meet the demand from the stress test.
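
`stress-test.py` is not reproduced in this notebook. As a rough, hypothetical sketch of the same idea, the function below calls a sender a fixed number of times from a small thread pool and tallies the outcomes. The name `fire_requests` and its parameters are illustrative, and unlike the while loop described above it stops after `n_requests` so that it always terminates.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fire_requests(send, n_requests=1000, workers=8):
    """Call `send` n_requests times concurrently; count return values and errors."""
    def attempt(_):
        try:
            return send()
        except Exception as exc:  # count failures rather than aborting the run
            return type(exc).__name__

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(attempt, range(n_requests)))


# Offline demonstration with a stand-in sender that always reports HTTP 200
print(fire_requests(lambda: 200, n_requests=20, workers=4))
```

Against the running service, `send` could be something like `lambda: requests.post(url=url, data=json.dumps(body), headers=headers).status_code`, reusing the `url`, `body` and `headers` defined in the request cell earlier in this section.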

**Tear down resources**

In [60]:
!helm uninstall mlwebservice-titanic

release "mlwebservice-titanic" uninstalled


In [61]:
!minikube delete

* Deleting "minikube" in docker ...
* Deleting container "minikube" ...
* Removing C:\Users\eddlo\.minikube\machines\minikube ...
* Removed all traces of the "minikube" cluster.


### Load testing using Locust