
Serving Models stored in S3 buckets using Datashim Datasets

(Note: credit to @zioproto, whose YAML for deploying TGI in Kubernetes provided the basis for the TGI deployment shown in this example.)

Large language models are among the most prominent cloud workloads of the day. This example demonstrates how Datashim reduces the friction of loading models from an S3 bucket. We use the open-source Text Generation Inference (TGI) server from HuggingFace as the inference service that loads the model and makes it available for prompt inputs.

Note

In this tutorial we will assume you are working in the datashim-demo namespace. To create this namespace and set it as the default for your current context, run:

kubectl create namespace datashim-demo
kubectl config set-context --current --namespace=datashim-demo

Prerequisites

Apart from a Kubernetes cluster with Datashim installed, there are no prerequisites for this tutorial: it includes instructions to provision a local S3 endpoint and store a model in it. If you already have an S3 endpoint and a model, feel free to skip the optional steps, but make sure to update the values in the YAMLs, as they all reference the setup provided here.

(OPTIONAL) Creating a local object storage endpoint

The YAML we provide provisions a local MinIO instance using hardcoded credentials.

Caution

Do not use this for any real production workloads!

From this folder, simply run:

kubectl create namespace minio
kubectl apply -f minio.yaml
kubectl wait pod --for=condition=Ready -n minio --timeout=-1s minio 
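The endpoint created by this YAML is the one the Dataset will later point at (http://minio.minio.svc.cluster.local:9000). To confirm it exists before moving on, you can list the resources in the minio namespace:

kubectl get svc,pod -n minio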

Creating a Dataset

To access the model data, we first create a Secret containing the credentials for the bucket that holds it, and then a Dataset object that links the bucket configuration to those credentials.

Important

Make sure your active namespace is labelled with monitor-pods-datasets=enabled so that Datashim can mount volumes in the pods during the tutorial. Using datashim-demo as the namespace, run:

kubectl label namespace datashim-demo monitor-pods-datasets=enabled
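You can double-check that the label took effect with:

kubectl get namespace datashim-demo --show-labels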

Run

kubectl apply -f minio-secret.yaml
kubectl apply -f minio-dataset.yaml

This applies the following Secret and Dataset:

---
apiVersion: v1
kind: Secret
metadata:
  name: model-weights-secret
stringData:
  accessKeyID: "ACCESS_KEY"
  secretAccessKey: "SECRET_KEY"
---
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: model-weights
spec:
  local:
    provision: "true"
    bucket: my-model
    endpoint: http://minio.minio.svc.cluster.local:9000
    secret-name: model-weights-secret
    type: COS
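Once the Dataset is admitted, Datashim provisions a PersistentVolumeClaim with the same name, backed by the bucket; this is the volume that will later be mounted into labelled pods. A quick sanity check, assuming that provisioning behaviour:

kubectl get dataset model-weights
kubectl get pvc model-weights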

(OPTIONAL) Adding a model in the object storage

In this tutorial we will use the FLAN-T5-Base model as the set of weights to be served. To load it into our MinIO instance, run:

kubectl apply -f download-flan-t5-base-to-minio.yaml
kubectl wait --for=condition=complete job/download-flan --timeout=-1s

These commands create the following download Job and wait for its completion:

apiVersion: batch/v1
kind: Job
metadata:
  name: download-flan
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        dataset.0.id: "model-weights"
        dataset.0.useas: "mount"
    spec:
      containers:
        - image: alpine/git
          command: ["sh", "-c"]
          args:
            - cd /tmp && git clone https://huggingface.co/google/flan-t5-base/
              && cp -r flan-t5-base /mnt/datasets/model-weights/flan-t5-base/
          imagePullPolicy: IfNotPresent
          name: git
      restartPolicy: Never

Note

Cloning with git directly into /mnt/datasets/model-weights/flan-t5-base/ would fail on OpenShift due to its default security policies, which is why the Job clones into /tmp and then copies the files over. Errors such as cp: can't preserve permissions in the pod logs can be safely ignored.
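Before starting the inference server, it can be worth confirming that the weights actually reached the bucket. One way, sketched here, is to check the Job logs and then list the Dataset mount from a throwaway pod carrying the same dataset labels (the pod name list-weights is arbitrary):

kubectl logs job/download-flan
kubectl run list-weights --rm -i --restart=Never --image=alpine \
  --labels="dataset.0.id=model-weights,dataset.0.useas=mount" \
  -- ls /mnt/datasets/model-weights/flan-t5-base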

Creating the TGI pod

As mentioned above, we will use TGI to serve the model. Run

kubectl apply -f tgi+service.yaml

This creates the following Pod and Service:

apiVersion: v1
kind: Pod
metadata:
  name: text-generation-inference
  labels:
    run: text-generation-inference
    dataset.0.id: "model-weights"
    dataset.0.useas: "mount"
spec:
  containers:
    - name: text-generation-inference
      image: ghcr.io/huggingface/text-generation-inference:1.3.4
      env:
        - name: RUST_BACKTRACE
          value: "1"
      command:
        - "text-generation-launcher"
        - "--model-id"
        - "/mnt/datasets/model-weights/flan-t5-base/"
        - "--sharded"
        - "false"
        - "--port"
        - "8080"
        - "--huggingface-hub-cache"
        - "/tmp"
      ports:
        - containerPort: 8080
          name: http
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
  restartPolicy: Never
---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-inference
spec:
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    run: text-generation-inference
  type: ClusterIP

The key lines are the labels starting with dataset.0., which declare the model-weights Dataset as an input to the TGI pod, and the --model-id command argument, which tells TGI to load the model weights from a specific directory. In this example, that directory points to the volume where the bucket will be mounted (/mnt/datasets/model-weights) and where the model weights will therefore be found.
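To see what Datashim did with those labels, you can inspect the pod once it has been admitted: the mutating webhook should have injected a volume backed by the model-weights PersistentVolumeClaim and a corresponding mount under /mnt/datasets (the commands below simply read back the pod spec):

kubectl get pod text-generation-inference -o jsonpath='{.spec.volumes}'
kubectl get pod text-generation-inference -o jsonpath='{.spec.containers[0].volumeMounts}'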

We can wait for TGI to be ready using the command:

kubectl wait pod --for=condition=Ready text-generation-inference --timeout=-1s

We can also monitor the pod by following its logs:

kubectl logs -f text-generation-inference

If all goes well, you will see output similar to the following:

...
2024-03-04T16:37:56.587319Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-04T16:38:06.594424Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-04T16:38:09.479212Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-03-04T16:38:09.496633Z INFO shard-manager: text_generation_launcher: Shard ready in 22.918174777s rank=0
2024-03-04T16:38:09.593469Z INFO text_generation_launcher: Starting Webserver
2024-03-04T16:38:09.659675Z WARN text_generation_router: router/src/main.rs:194: no pipeline tag found for model /mnt/datasets/model-weights/flan-t5-base/
2024-03-04T16:38:09.663104Z INFO text_generation_router: router/src/main.rs:213: Warming up model
2024-03-04T16:38:11.273900Z WARN text_generation_router: router/src/main.rs:224: Model does not support automatic max batch total tokens
2024-03-04T16:38:11.273926Z INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 16000
2024-03-04T16:38:11.273934Z INFO text_generation_router: router/src/main.rs:247: Connected
2024-03-04T16:38:11.273942Z WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0

This indicates that the service has been set up successfully and is ready to reply to prompts.

Validating the deployment

We can now forward a local port to the TGI pod:

kubectl port-forward --address localhost pod/text-generation-inference 8888:8080

And run an inference request against it with:

curl -s http://localhost:8888/generate -X POST -d '{"inputs":"The square root of x is the cube root of y. What is y to the power of 2, if x = 4?", "parameters":{"max_new_tokens":1000}}'  -H 'Content-Type: application/json' | jq -r .generated_text

We should see the following output:

x = 4 * 2 = 8 x = 16 y = 16 to the power of 2
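TGI also serves a streaming variant of this endpoint at /generate_stream, which returns tokens as server-sent events as they are produced. A quick test, assuming that endpoint is available in the image used here:

curl -s http://localhost:8888/generate_stream -X POST -d '{"inputs":"What is the square root of 16?", "parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'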