# Kubeflow and Prompt Tuning

This notebook demonstrates how Kubeflow can be leveraged to prompt tune a foundation LLM and serve it. Hugging Face's [PEFT](https://huggingface.co/docs/peft/index) library is used to prompt tune the open source LLM.
```
🤗 PEFT, or Parameter-Efficient Fine-Tuning (PEFT), is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model’s parameters. PEFT methods only fine-tune a small number of (extra) model parameters, significantly decreasing computational and storage costs because fine-tuning large-scale PLMs is prohibitively costly. Recent state-of-the-art PEFT techniques achieve performance comparable to that of full fine-tuning.
```
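To make "a small number of (extra) model parameters" concrete for the prompt tuning used in this notebook, here is a back-of-the-envelope sketch. It assumes that prompt tuning trains one embedding vector per virtual token, that `bigscience/bloomz-560m` has a hidden size of 1024, and that the base model has roughly 559 million parameters. The last two figures are assumptions taken from the published model configuration, so verify them for the model you use.

```python
# Rough parameter arithmetic for prompt tuning (a sketch, not a measurement).
NUM_VIRTUAL_TOKENS = 8            # matches num_virtual_tokens in this notebook
HIDDEN_SIZE = 1024                # ASSUMPTION: embedding width of bloomz-560m
APPROX_BASE_PARAMS = 559_000_000  # ASSUMPTION: approximate size of the frozen base model

# Prompt tuning learns one embedding vector per virtual token.
trainable = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
fraction = trainable / (APPROX_BASE_PARAMS + trainable)

print(f"trainable prompt parameters: {trainable}")
print(f"share of all parameters: {fraction:.6%}")
```

Under these assumptions only a few thousand values are trained while the base model stays frozen, which is why the published artifact is so small.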
This notebook creates a Kubeflow pipeline with the following steps:
- Reads the user's Hugging Face token from an existing Kubernetes secret
- Downloads an open source large language model (LLM)
- Trains a prompt tuning configuration for that model on the open source `twitter_complaints` dataset
- Publishes the trained configuration to Hugging Face
- Runs inference on the LLM with the new prompt tuning configuration
- Tests the model
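
The steps above run strictly one after another. As a rough illustration, the ordering can be written as a dependency graph and sorted with the standard library. The step names mirror the component functions defined later in this notebook, except `kserve_launcher`, which is a label chosen here for the KServe component.

```python
from graphlib import TopologicalSorter

# Each key runs only after every step listed in its value set has finished.
dependencies = {
    "get_hf_token": set(),
    "prompt_tuning_bloom": {"get_hf_token"},
    "deploy_modelmesh_custom_runtime": {"prompt_tuning_bloom"},
    "inference_svc": {"deploy_modelmesh_custom_runtime"},
    "kserve_launcher": {"inference_svc"},  # illustrative label, not a function name
    "test_modelmesh_model": {"kserve_launcher"},
}

# static_order() yields the steps with every prerequisite first.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Kubeflow Pipelines derives the same kind of graph from task outputs and explicit `.after()` calls.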



## Run through the Notebook

After installing kfp, you must restart the kernel for the new version to take effect. After the restart, run all cells.
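
If you are unsure whether the pinned version is actually installed, a small standard-library check such as the sketch below can help. `installed_version` is a hypothetical helper written for this example. It reads the installed package metadata, so it reports what is on disk rather than what an already running kernel has imported.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


EXPECTED_KFP = "2.3.0"  # the version pinned by the install cell
found = installed_version("kfp")
if found != EXPECTED_KFP:
    print(f"kfp {EXPECTED_KFP} expected, found {found}; install it and restart the kernel")
```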

In [13]:
!pip install kfp==2.3.0
# !pip install kfp 




In [14]:
import kfp

print(kfp.__version__)

2.3.0


In [15]:
import kfp
import kfp.dsl as dsl
from kfp.dsl import component
from kfp.compiler import Compiler
import kfp.components as comp

### Pipeline Input Parameters

- `peft_model_server_image`: a prebuilt custom runtime image that serves a Hugging Face LLM together with its prompt tuning configuration. The code that runs inside that image is shown below. The `_load_model` function loads a pretrained LLM and applies the PEFT prompt tuning configuration trained in the prompt tuning tutorial. The tokenizer is loaded alongside the model, so the server can encode and decode raw string inputs from inference requests, and users do not have to preprocess their input into tensor bytes.

```python
from typing import List

from mlserver import MLModel, types
from mlserver.codecs import decode_args

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

class PeftModelServer(MLModel):
  async def load(self) -> bool:
      self._load_model()
      self.ready = True
      return self.ready

  @decode_args
  async def predict(self, content: List[str]) -> List[str]:
      return self._predict_outputs(content)

  def _load_model(self):
      model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
      peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
      self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
      config = PeftConfig.from_pretrained(peft_model_id)
      self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
      self.model = PeftModel.from_pretrained(self.model, peft_model_id)
      self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
      return

  def _predict_outputs(self, content: List[str]) -> List[str]:
      output_list = []
      for input in content:
        inputs = self.tokenizer(
            f'{self.text_column} : {input} Label : ',
            return_tensors="pt",
        )
        with torch.no_grad():
          inputs = {k: v for k, v in inputs.items()}
          outputs = self.model.generate(
              input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
          )
          outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
        output_list.append(outputs[0])
      return output_list
```
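
The prompt format matters because the server must build its input the same way the training data was formatted. The sketch below isolates that string handling without loading a model. `build_prompt` mirrors the f-string in `_predict_outputs`, while `extract_label` is a hypothetical helper added here to show how the label could be pulled out of the generated text, which includes the prompt itself.

```python
TEXT_COLUMN = "Tweet text"  # default value of DATASET_TEXT_COLUMN_NAME in the server


def build_prompt(tweet: str, text_column: str = TEXT_COLUMN) -> str:
    """Format a raw tweet the same way _predict_outputs does."""
    return f"{text_column} : {tweet} Label : "


def extract_label(generated: str) -> str:
    """Hypothetical helper: keep only the text after the last 'Label :' marker."""
    return generated.rsplit("Label :", 1)[-1].strip()


prompt = build_prompt("@example my order never arrived")
# Stand-in for a decoded model output; a real one would come from model.generate.
fake_output = prompt + "complaint"
print(extract_label(fake_output))
```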
If you would like to build your own KServe ModelMesh custom runtime image, you can do so with this Dockerfile:
```
# TODO: choose appropriate base image, install Python, MLServer, and
# dependencies of your MLModel implementation
FROM python:3.8-slim-buster
RUN pip install mlserver peft transformers datasets
# ...

# The custom `MLModel` implementation should be on the Python search path
# instead of relying on the working directory of the image. If using a
# single-file module, this can be accomplished with:
COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
ENV PYTHONPATH=/opt/

# environment variables to be compatible with ModelMesh Serving
# these can also be set in the ServingRuntime, but this is recommended for
# consistency when building and testing
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=peft-model

# With this setting, the implementation field is not required in the model
# settings which eases integration by allowing the built-in adapter to generate
# a basic model settings file
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer

CMD mlserver start ${MLSERVER_MODELS_DIR}
```


```bash 
docker build -t <CONTAINER_REGISTRY_NAME>/peft-model-server:latest .
docker push <CONTAINER_REGISTRY_NAME>/peft-model-server:latest
```


In [None]:
peft_model_server_image="quay.io/aipipeline/peft-model-server:latest"
modelmesh_namespace="modelmesh-serving"
modelmesh_servicename="modelmesh-serving"
pipeline_out_file="llm-prompt_tuning_pipeline.yaml"
kserv_component="https://raw.githubusercontent.com/kubeflow/pipelines/release-2.0.1/components/kserve/component.yaml"

In [18]:
@component(
    packages_to_install=["peft", "transformers", "datasets", "torch", "tqdm"],
    base_image='python:3.10'
)
def prompt_tuning_bloom(huggingface_name: str, peft_model_publish_id: str, model_name_or_path: str, num_epochs: int, hf_token: str):
    from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
    from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
    import torch
    from datasets import load_dataset
    import os
    from torch.utils.data import DataLoader
    from tqdm import tqdm
    import base64

    peft_config = PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        prompt_tuning_init=PromptTuningInit.TEXT,
        num_virtual_tokens=8,
        prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
        tokenizer_name_or_path=model_name_or_path,
    )

    dataset_name = "twitter_complaints"
    text_column = "Tweet text"
    label_column = "text_label"
    max_length = 64
    lr = 3e-2
    batch_size = 8

    dataset = load_dataset("ought/raft", dataset_name)
    dataset["train"][0]

    classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
    dataset = dataset.map(
        lambda x: {"text_label": [classes[label] for label in x["Label"]]},
        batched=True,
        num_proc=1,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    def preprocess_function(examples):
        batch_size = len(examples[text_column])
        inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
        targets = [str(x) for x in examples[label_column]]
        model_inputs = tokenizer(inputs)
        labels = tokenizer(targets)
        for i in range(batch_size):
            sample_input_ids = model_inputs["input_ids"][i]
            label_input_ids = labels["input_ids"][i] + [tokenizer.pad_token_id]
            model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
            labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
            model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])
        for i in range(batch_size):
            sample_input_ids = model_inputs["input_ids"][i]
            label_input_ids = labels["input_ids"][i]
            model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
                max_length - len(sample_input_ids)
            ) + sample_input_ids
            model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
                "attention_mask"
            ][i]
            labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
            model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
            model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
            labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    
    processed_datasets = dataset.map(
        preprocess_function,
        batched=True,
        num_proc=1,
        remove_columns=dataset["train"].column_names,
        load_from_cache_file=False,
        desc="Running tokenizer on dataset",
    )

    train_dataset = processed_datasets["train"]
    eval_dataset = processed_datasets["train"]


    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=False
    )
    eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=False)

    model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
    model = get_peft_model(model, peft_config)
    # print_trainable_parameters() prints its summary itself and returns None
    model.print_trainable_parameters()

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(len(train_dataloader) * num_epochs),
    )

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for step, batch in enumerate(tqdm(train_dataloader)):
            batch = {k: v for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.detach().float()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        eval_loss = 0
        eval_preds = []
        for step, batch in enumerate(tqdm(eval_dataloader)):
            batch = {k: v for k, v in batch.items()}
            with torch.no_grad():
                outputs = model(**batch)
            loss = outputs.loss
            eval_loss += loss.detach().float()
            eval_preds.extend(
                tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
            )

        eval_epoch_loss = eval_loss / len(eval_dataloader)
        eval_ppl = torch.exp(eval_epoch_loss)
        train_epoch_loss = total_loss / len(train_dataloader)
        train_ppl = torch.exp(train_epoch_loss)
        print("epoch=%s: train_ppl=%s train_epoch_loss=%s eval_ppl=%s eval_epoch_loss=%s" % (epoch, train_ppl, train_epoch_loss, eval_ppl, eval_epoch_loss))

    from huggingface_hub import login
    login(token=base64.b64decode(hf_token).decode())

    peft_model_id = "{}/{}".format(huggingface_name, peft_model_publish_id)
    model.save_pretrained("output_dir") 
    model.push_to_hub(peft_model_id, use_auth_token=True)
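
The least obvious part of the training component is `preprocess_function`, which joins prompt and label tokens, hides the prompt from the loss, and pads on the left. The following sketch reproduces that logic on plain Python lists for a single example. The token ids and `PAD_ID = 0` are made-up placeholders; the real code uses the tokenizer's ids and converts the results to tensors.

```python
IGNORE_INDEX = -100  # label value used in the component to hide positions from the loss
PAD_ID = 0           # placeholder pad token id for this toy example


def build_example(prompt_ids, label_ids, max_length):
    """Join prompt and label ids, mask the prompt in the labels, then left-pad."""
    label_ids = label_ids + [PAD_ID]  # mirrors appending tokenizer.pad_token_id
    input_ids = prompt_ids + label_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + label_ids
    attention_mask = [1] * len(input_ids)

    pad = max_length - len(input_ids)  # zero or negative means no padding is added
    input_ids = ([PAD_ID] * pad + input_ids)[:max_length]
    attention_mask = ([0] * pad + attention_mask)[:max_length]
    labels = ([IGNORE_INDEX] * pad + labels)[:max_length]
    return input_ids, attention_mask, labels


ids, mask, labels = build_example(prompt_ids=[11, 12, 13], label_ids=[21], max_length=8)
print(ids)     # [0, 0, 0, 11, 12, 13, 21, 0]
print(mask)    # [0, 0, 0, 1, 1, 1, 1, 1]
print(labels)  # [-100, -100, -100, -100, -100, -100, 21, 0]
```

Only the label tokens and the trailing end marker contribute to the loss; padding and prompt positions are ignored.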

In [19]:
@component(
    packages_to_install=["kubernetes"],
    base_image='python:3.10'
)
def deploy_modelmesh_custom_runtime(huggingface_name: str, peft_model_publish_id: str, model_name_or_path: str, server_name: str, namespace: str, image: str):
    import kubernetes.config as k8s_config
    import kubernetes.client as k8s_client
    from kubernetes.client.exceptions import ApiException

    def create_custom_object(group, version, namespace, plural, manifest):
        cfg = k8s_client.Configuration()
        cfg.verify_ssl=False
        cfg.host = "https://kubernetes.default.svc"
        cfg.api_key_prefix['authorization'] = 'Bearer'
        cfg.ssl_ca_cert = '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
        with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
            lines = f.readlines()
            for l in lines:
                cfg.api_key['authorization'] = "{}".format(l)
                break
        with k8s_client.ApiClient(cfg) as api_client:
            capi = k8s_client.CustomObjectsApi(api_client)
            try:
                res = capi.create_namespaced_custom_object(group=group,
                                                           version=version, namespace=namespace,
                                                           plural=plural, body=manifest)
            except ApiException as e:
                # object already exists
                if e.status != 409:
                    raise
    custom_runtime_manifest = {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "ServingRuntime",
        "metadata": {
            "name": "{}-server".format(server_name),
            "namespace": namespace
        },
        "spec": {
            "supportedModelFormats": [
            {
                "name": "peft-model",
                "version": "1",
                "autoSelect": True
            }
            ],
            "multiModel": True,
            "grpcDataEndpoint": "port:8001",
            "grpcEndpoint": "port:8085",
            "containers": [
            {
                "name": "mlserver",
                "image": image,
                "env": [
                {
                    "name": "MLSERVER_MODELS_DIR",
                    "value": "/models/_mlserver_models/"
                },
                {
                    "name": "MLSERVER_GRPC_PORT",
                    "value": "8001"
                },
                {
                    "name": "MLSERVER_HTTP_PORT",
                    "value": "8002"
                },
                {
                    "name": "MLSERVER_LOAD_MODELS_AT_STARTUP",
                    "value": "true"
                },
                {
                    "name": "MLSERVER_MODEL_NAME",
                    "value": "peft-model"
                },
                {
                    "name": "MLSERVER_HOST",
                    "value": "127.0.0.1"
                },
                {
                    "name": "MLSERVER_GRPC_MAX_MESSAGE_LENGTH",
                    "value": "-1"
                },
                {
                    "name": "PRETRAINED_MODEL_PATH",
                    "value": model_name_or_path
                },
                {
                    "name": "PEFT_MODEL_ID",
                    "value": "{}/{}".format(huggingface_name, peft_model_publish_id),
                }
                ],
                "resources": {
                "requests": {
                    "cpu": "500m",
                    "memory": "4Gi"
                },
                "limits": {
                    "cpu": "5",
                    "memory": "5Gi"
                }
                }
            }
            ],
            "builtInAdapter": {
            "serverType": "mlserver",
            "runtimeManagementPort": 8001,
            "memBufferBytes": 134217728,
            "modelLoadingTimeoutMillis": 90000
            }
        }
    }
    create_custom_object(group="serving.kserve.io", version="v1alpha1",
                         namespace=namespace, plural="servingruntimes",
                         manifest=custom_runtime_manifest)
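
The `env` section of the manifest repeats the same name/value structure many times. As one possible way to keep it readable, a small helper could build that list from a dictionary. `to_env_list` is a hypothetical function written for this sketch and is not part of the Kubernetes client.

```python
def to_env_list(settings: dict) -> list:
    """Convert {"NAME": value} into the [{"name": ..., "value": ...}] list
    used in a container spec. Values are converted to strings because
    Kubernetes expects environment variable values to be strings."""
    return [{"name": name, "value": str(value)} for name, value in settings.items()]


mlserver_env = to_env_list({
    "MLSERVER_MODELS_DIR": "/models/_mlserver_models/",
    "MLSERVER_GRPC_PORT": 8001,
    "MLSERVER_HTTP_PORT": 8002,
})
print(mlserver_env)
```

The resulting list could be assigned to the `"env"` key of the container entry in place of the hand-written one.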

In [20]:
@component(
    base_image='python:3.10'
)
def inference_svc(model_name: str, namespace: str) -> str :

    inference_service = '''
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: {}
  namespace: {}
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: {}-server
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
'''.format(model_name, namespace, model_name)

    return inference_service
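
Note that the `runtime` field must match the ServingRuntime name created by the deploy component, which is the model name followed by `-server`. The sketch below renders the same manifest with named placeholders, which makes that coupling easier to see than positional `{}` fields. `render_inference_service` is a hypothetical name used only in this example.

```python
INFERENCE_SERVICE_TEMPLATE = """\
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: {model_name}
  namespace: {namespace}
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: {model_name}-server
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
"""


def render_inference_service(model_name: str, namespace: str) -> str:
    """Fill the template; the runtime name is derived from the model name."""
    return INFERENCE_SERVICE_TEMPLATE.format(model_name=model_name, namespace=namespace)


rendered = render_inference_service("vml-demo", "modelmesh-serving")
print(rendered)
```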

In [21]:
@component(
    packages_to_install=["transformers", "peft", "torch", "requests"],
    base_image='python:3.10'
)
def test_modelmesh_model(service: str,  namespace: str, model_name: str, input_tweet: str):
    import requests
    import base64
    import json

    url = "http://%s.%s:8008/v2/models/%s/infer" % (service, namespace, model_name)
    input_json = {
        "inputs": [
            {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": [input_tweet]
            }
        ]
    }

    x = requests.post(url, json = input_json)

    print(x.text)
    respond_dict = json.loads(x.text)
    inference_result = respond_dict["outputs"][0]["data"][0]
    base64_bytes = inference_result.encode("ascii")
  
    string_bytes = base64.b64decode(base64_bytes)
    inference_result = string_bytes.decode("ascii")
    print("inference_result: %s " % inference_result)
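
The request and response handling in the test component can be tried without a running cluster. The sketch below builds the same V2 inference payload and decodes a base64-encoded `BYTES` output the way the component does. The response body is fabricated for illustration, and `build_v2_request` and `decode_v2_text` are hypothetical helper names.

```python
import base64
import json


def build_v2_request(text: str) -> dict:
    """Build the V2 inference payload that the test component posts."""
    return {
        "inputs": [
            {"name": "content", "shape": [1], "datatype": "BYTES", "data": [text]}
        ]
    }


def decode_v2_text(response_body: str) -> str:
    """Extract the first output value and base64-decode it."""
    payload = json.loads(response_body)
    encoded = payload["outputs"][0]["data"][0]
    return base64.b64decode(encoded).decode("utf-8")


request_body = build_v2_request("hello")

# Fabricated response, shaped like the fields the component reads.
fake_text = "Tweet text : hello Label : complaint"
fake_body = json.dumps(
    {"outputs": [{"data": [base64.b64encode(fake_text.encode("utf-8")).decode("ascii")]}]}
)
print(decode_v2_text(fake_body))
```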

### Read huggingface-secret 

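Read the Kubernetes secret that holds your Hugging Face token with write access. The component below expects a secret named `huggingface-secret` in the `kubeflow-user-example-com` namespace, with the token stored under the `token` key.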
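
Kubernetes returns secret values base64-encoded, which is why the training component calls `base64.b64decode` on the token before logging in. A minimal round trip with a placeholder value (not a real token) looks like this:

```python
import base64

placeholder_token = "hf_placeholder_token"  # illustrative value, not a real token

# What the Kubernetes API hands back in secret.data["token"]:
stored_value = base64.b64encode(placeholder_token.encode()).decode()

# What the training component does before passing the token to login():
recovered = base64.b64decode(stored_value).decode()

print(stored_value != placeholder_token, recovered == placeholder_token)
```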

In [22]:
@component(
    packages_to_install=["kubernetes"],
    base_image='python:3.10'
)
def get_hf_token() -> str:
    from kubernetes import client, config

    config.load_incluster_config()
    core_api = client.CoreV1Api()
    secret = core_api.read_namespaced_secret(name="huggingface-secret", namespace="kubeflow-user-example-com")
    return secret.data["token"]

In [None]:
# Define your pipeline function
@dsl.pipeline(
    name="Serving LLM with Prompt tuning",
    description="A Pipeline for Serving Prompt Tuning LLMs on Modelmesh"
)
def prompt_tuning_pipeline(
    huggingface_name: str = "difince",
    peft_model_publish_id: str = "bloomz-560m_PROMPT_TUNING_CAUSAL_LM",
    model_name_or_path: str = "bigscience/bloomz-560m",
    model_name: str = "vml-demo",
    input_tweet: str = "@nationalgridus I have no water and the bill is current and paid. Can you do something about this?",
    test_served_llm_model: str ="true",
    num_epochs: int = 50
):
    hf_token_task = get_hf_token()
    prompt_tuning_llm = prompt_tuning_bloom( huggingface_name=huggingface_name, 
                                             peft_model_publish_id=peft_model_publish_id, 
                                             model_name_or_path=model_name_or_path,
                                             num_epochs=num_epochs,
                                             hf_token=hf_token_task.output)
    deploy_modelmesh_custom_runtime_task = deploy_modelmesh_custom_runtime(huggingface_name=huggingface_name,
                                                                           peft_model_publish_id=peft_model_publish_id, 
                                                                           model_name_or_path=model_name_or_path,
                                                                           server_name=model_name, namespace=modelmesh_namespace,
                                                                           image=peft_model_server_image)
    deploy_modelmesh_custom_runtime_task.after(prompt_tuning_llm)

    inference_svc_task = inference_svc(model_name=model_name, namespace=modelmesh_namespace)
    inference_svc_task.after(deploy_modelmesh_custom_runtime_task)
    inference_svc_task.set_caching_options(False)
    
    kserve_launcher_op = comp.load_component_from_url(kserv_component)
    serve_llm_with_peft_task = kserve_launcher_op(action="apply", inferenceservice_yaml=inference_svc_task.output)
    serve_llm_with_peft_task.after(inference_svc_task)
    serve_llm_with_peft_task.set_caching_options(False)

    with dsl.If(test_served_llm_model == 'true'):
        test_modelmesh_model_task = test_modelmesh_model(service=modelmesh_servicename, namespace=modelmesh_namespace, 
                                                         model_name=model_name, input_tweet=input_tweet).after(serve_llm_with_peft_task)
        test_modelmesh_model_task.set_caching_options(False)

In [24]:
Compiler().compile(
    pipeline_func=prompt_tuning_pipeline,
    package_path=pipeline_out_file  # file name defined with the other pipeline inputs
)

kfp_client=kfp.Client()

run = kfp_client.create_run_from_pipeline_func(
    prompt_tuning_pipeline,
    arguments={}
)

run_id = run.run_id
print("Run ID: ", run_id)

Run ID:  255c0c06-7420-4419-8b54-469c3a1a0a4b
