# Kubeflow RAG Demo
This demo walks through a simple RAG application using documents from the [Introduction to AI/ML Toolkits course](https://training.linuxfoundation.org/training/introduction-to-ai-ml-toolkits-with-kubeflow-lfs147/).
This example shows two main ways to build a Retrieval Augmented Generation application. The first uses a [custom predictor](https://kserve.github.io/website/0.8/modelserving/v1beta1/custom/custom_model/) and the second uses [NVIDIA NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/meta/containers/llama-2-7b-chat). In the future, we will discuss a third option based on the [KServe vLLM](https://docs.vllm.ai/en/latest/deployment/integrations/kserve.html) docs. At the time of authoring this demo, we were using a GCP install running Kubeflow 1.8, and the deployed KServe version did not support the new vLLM runtime.

## Requirements
If you want to fully test the RAG application on a multi-tenant cluster, you will need to adjust the `VirtualService` that exposes the cluster and the path that Gradio expects to be served at. More details follow as we go, but you will need a container build environment and a registry to host the resulting images. You can also build the `Transformer` and `Predictor` images using the Dockerfiles provided. Otherwise, you only need a Kubeflow 1.8 cluster with KServe 0.11 on it. If you are at a training, you should already have been granted access to a cluster. Note that if you require a different `VirtualService`, you will need either admin access to the cluster or an admin who can apply the manifests for you.

## Architecture and Flow
### Step 1: Ingest Curated Documents Into An Object Store
We use a [Kubeflow Notebook](https://www.kubeflow.org/docs/components/notebooks/) to ingest documents from our local document folder into [MinIO](https://min.io/). Note that due to MinIO's AGPL requirements, we are running an older version. Future updates will include distributed ingestion as well as support for multiple object stores (i.e., using the object store of choice for pipelines). [SeaweedFS](https://github.com/seaweedfs/seaweedfs) has been considered as an option.


In [None]:
!pip install minio

In [None]:
import os
from minio import Minio
import requests

In [None]:
client = Minio("minio-service.kubeflow.svc.cluster.local:9000",
    access_key="minio",
    secret_key="minio123",
    secure=False,           
)

In [None]:
type(client)

In [None]:
# List all buckets
buckets = client.list_buckets()
for bucket in buckets:
    print(bucket.name, bucket.creation_date)

Notice the `-kfp` bucket. This bucket is used for marshalling with Kubeflow.

In [None]:
# Change this to whatever you'd like if using a multi-tenant environment.
# This is a shared MinIO, so you will overwrite each other's document storage if you fail to do so.
bucket_name = "sanfranai" 
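
On a shared MinIO, one way to avoid collisions is to derive the bucket name from each user's name. The sketch below is only an illustration: the helper `tenant_bucket_name` and the `rag-docs` prefix are hypothetical and not part of this repo. It assumes the usual S3-style naming rules (lowercase letters, digits, and hyphens, 3 to 63 characters).

```python
import re

def tenant_bucket_name(user: str, prefix: str = "rag-docs") -> str:
    """Build an S3-style bucket name from a user name (hypothetical helper)."""
    # Lowercase, then replace every run of disallowed characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", user.lower()).strip("-")
    if not slug:
        raise ValueError("user must contain at least one letter or digit")
    # Keep within the 63-character limit and avoid ending on a hyphen.
    return f"{prefix}-{slug}"[:63].rstrip("-")

example_name = tenant_bucket_name("christensenc3526")
print(example_name)
```

Assigning the result to `bucket_name` instead of the hard-coded value gives every user a separate bucket.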

In [None]:
# List objects in the bucket
# This will error out if the bucket doesn't exist with "The specified bucket does not exist"
objects = client.list_objects(bucket_name, recursive=True)
for obj in objects:
    print(obj.object_name)


In [None]:
client.bucket_exists(bucket_name)
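
The two cells above can be combined so that listing never fails on a missing bucket. This is a minimal sketch: `list_object_names` is a hypothetical helper name, and it relies only on the `bucket_exists` and `list_objects` calls already used in this notebook.

```python
def list_object_names(client, bucket_name):
    """Return the object names in a bucket, or an empty list if the bucket is missing."""
    # Checking first avoids the "specified bucket does not exist" error when listing.
    if not client.bucket_exists(bucket_name):
        print(f"Bucket {bucket_name!r} does not exist yet")
        return []
    # list_objects() returns a lazy iterator, so collect the names here.
    return [obj.object_name for obj in client.list_objects(bucket_name, recursive=True)]
```

Calling `list_object_names(client, bucket_name)` before the first upload should simply return `[]`.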

In [None]:
def upload_files(bucket_name, file_location, client):
    found = False  # Initialize 'found' before the try block
    print("Current working directory:", os.getcwd())
    print("Listing directories in the current working directory:", os.listdir("."))
    print(f"Checking existence of {file_location}: ", os.path.exists(file_location))

    try:
        found = client.bucket_exists(bucket_name)
    except Exception as e:
        print("error trying to search for MinIO Bucket:", e)
        return  # Return early since we cannot proceed without knowing if the bucket exists

    if not found:
        try:
            client.make_bucket(bucket_name)
            print("Created bucket", bucket_name)
        except Exception as e:
            print("Failed to create bucket:", e)
            return  # Return early since we cannot proceed if the bucket cannot be created
    else:
        print("Bucket", bucket_name, "exists, we won't attempt to create one")
        
    # Ensure the directory exists
    if not os.path.isdir(file_location):
        print(f"The directory {file_location} does not exist.")
        return

    # Iterate through all files in the directory
    for file_name in os.listdir(file_location):
        # Construct the full file path
        source_file = os.path.join(file_location, file_name)
        # Check if it's a file and not a directory
        if os.path.isfile(source_file):
            try:
                # Upload the file
                client.fput_object(bucket_name, file_name, source_file)
                print(f"Successfully uploaded {file_name} to bucket {bucket_name}.")
            except Exception as e:
                print(f"Failed to upload {file_name}: {e}")


In [None]:
upload_files(bucket_name,"./documentation",client)
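
Note that `upload_files` only uploads files at the top level of the folder. If your documents live in subfolders, a sketch like the following could be used to enumerate them. `iter_upload_targets` is a hypothetical helper, not part of this repo, and using the relative path as the object name is an assumption.

```python
import os

def iter_upload_targets(root):
    """Yield (object_name, local_path) for every file under root, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for file_name in sorted(filenames):
            local_path = os.path.join(dirpath, file_name)
            # Object names use forward slashes regardless of the local path separator.
            relative = os.path.relpath(local_path, root)
            yield relative.replace(os.sep, "/"), local_path
```

Each pair can then be passed to `client.fput_object(bucket_name, object_name, local_path)`, the same call `upload_files` uses.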

In [None]:
# List objects in the bucket
objects = client.list_objects(bucket_name, recursive=True)
for obj in objects:
    print(obj.object_name)

### Step 2: Deploy Vector Store and Inferencing Container (TileDB) 
The next step is to use [TileDB Vector Search + Langchain](https://github.com/TileDB-Inc/TileDB-Vector-Search) to build a vector store. We will serve the vector store as an `InferenceService`. The `InferenceService` will ingest the data from `MinIO` and start the `TileDB` vector database. The `InferenceService` uses `all-MiniLM-L6-v2` for embeddings. Note that this model is good for some generic RAG tasks, but more specialized RAG workflows (think Life Sciences) will need a specialized embedding model.

In [None]:
!pwd

In [None]:
# Assuming /home/jovyan/tiledb_demo/notebooks as the directory 
# If you are in a multi-tenant cluster, please make sure you view the manifest and update it with your desired values. 
!kubectl apply -f ../manifests/core/minio_secret_key.yml
!kubectl apply -f ../manifests/core/vector_db_isvc.yml

In [None]:
!kubectl get inferenceservices vectorstore
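
Instead of re-running the `kubectl get` cell by hand, readiness can be polled from Python. This is a sketch under assumptions: `isvc_ready` and `wait_until` are hypothetical helper names, `kubectl` is assumed to be on the notebook's `PATH`, and the 300-second default mirrors the "up to 5 minutes" estimate rather than any guarantee.

```python
import subprocess
import time

def isvc_ready(name, namespace=None):
    """Return True when the InferenceService's Ready condition is "True"."""
    cmd = ["kubectl", "get", "inferenceservice", name,
           "-o", 'jsonpath={.status.conditions[?(@.type=="Ready")].status}']
    if namespace:
        cmd += ["-n", namespace]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() == "True"

def wait_until(check, timeout_s=300, interval_s=10,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() every interval_s seconds until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_until(lambda: isvc_ready("vectorstore"))` returns `True` once the service is ready and `False` if the timeout passes first.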

Wait until `kubectl get inferenceservices vectorstore` reports `True` for `READY`. This can take up to 5 minutes. You should also see a URL for the vector store. If you are using your own images, you will need to update `vector_db_isvc.yml` to use your image and ensure your Kubeflow cluster has [access to the registry](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/). Once the vector store is up and running, you can run `kubectl get pods` and find a pod with a name similar to `vectorstore-predictor-00001-deployment-`. Notice the pod has created an `in-memory` index. [TileDB](https://python.langchain.com/docs/integrations/vectorstores/tiledb/) can use object storage as backing for the vector embeddings (these are 1D arrays), but we are using in-memory for simplicity in this first iteration. Once the `InferenceService` is ready, run the cell below to validate that you can get a prediction. Make sure to add your `namespace` in the proper spot, and use the URL from the `!kubectl get inferenceservices vectorstore` command above.

In [None]:
data = {
  "instances": [{
    "input": "When was Kubeflow open sourced?",
    "num_docs": 6  # number of documents to retrieve
  }]
}

URL = "http://vectorstore-predictor.christensenc3526.svc.cluster.local/v1/models/vectorstore:predict"  # Adjust path as necessary

response = requests.post(URL, json=data, verify=False)  # 'verify=False' for self-signed certs
#print(response)
#print(response.json())
print(response.text)
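
Since the namespace is the only part of the URL that changes between users, the URL can be built rather than edited by hand. The helpers below are a hypothetical sketch. `current_namespace` assumes the notebook runs in a pod with the standard service account mount and returns `None` otherwise.

```python
def predict_url(service, namespace, model, scheme="http"):
    """Build the cluster-local v1 predict URL for a service in a namespace."""
    host = f"{service}.{namespace}.svc.cluster.local"
    return f"{scheme}://{host}/v1/models/{model}:predict"

def current_namespace(path="/var/run/secrets/kubernetes.io/serviceaccount/namespace"):
    """Read the namespace of the pod this notebook runs in, or None if unavailable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None

print(predict_url("vectorstore-predictor", "christensenc3526", "vectorstore"))
```

With your own namespace substituted, the result can be assigned to `URL` in place of the hard-coded string.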

### Step 3: Deploy a Custom Model and Transformer
This section deploys a custom `predictor` and a `transformer` for our end-user application to use.
We will serve an `orca-mini-3b` model for generation, and the `transformer` will retrieve documents from the vector store and provide them as context.

In [None]:
!kubectl apply -f ../manifests/Inference/CPU/llm_isvc_custom.yml

In [None]:
!kubectl get inferenceservice  llm 

Once the above line reports `READY` `True`, adjust below to be your namespace and run the command. You should see an `llm` model.

In [None]:
!curl -X GET http://llm-predictor.christensenc3526.svc.cluster.local/v1/models

In [None]:
!curl -X GET http://llm.christensenc3526.svc.cluster.local/v1/models

We can now test the model by sending a prediction! 

In [None]:
URL = "http://llm-transformer.christensenc3526.svc.cluster.local/v1/models/llm:predict"

In [None]:
data = {
  "instances": [{
      "system": "You are an AI assistant. You will be given a task. You must generate a detailed answer.",
      "instruction": "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.",
      "input": "What is Kubeflow?",
      "max_tokens": 5000,
      "top_k": 100,
      "top_p": 0.4,
      "num_docs": 3,
      "temperature": 0.2
  }]
}
response = requests.post(URL, json=data,verify=False)

In [None]:
print(response)
#print(response.json())
print(response.text)

Note that the above is running on a CPU, so it is going to be SLOW. We are going to fix that when we use NVIDIA NIM to deploy a Llama model.
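
To try different questions without retyping the whole request body, the payload can be wrapped in a small function. `build_rag_request` is a hypothetical helper. The field names and default values are copied from the request used in this step, and what the transformer does with each field is not verified here.

```python
def build_rag_request(question, num_docs=3, max_tokens=5000,
                      temperature=0.2, top_k=100, top_p=0.4):
    """Return a one-instance request body in the shape used in this step."""
    if not question.strip():
        raise ValueError("question must not be empty")
    return {
        "instances": [{
            "system": "You are an AI assistant. You will be given a task. "
                      "You must generate a detailed answer.",
            "instruction": "Use the following pieces of context to answer the "
                           "question at the end. If you don't know the answer, "
                           "just say that you don't know, don't try to make up "
                           "an answer.",
            "input": question,
            "max_tokens": max_tokens,
            "top_k": top_k,
            "top_p": top_p,
            "num_docs": num_docs,  # number of documents to retrieve as context
            "temperature": temperature,
        }]
    }
```

A request then becomes `requests.post(URL, json=build_rag_request("What is Kubeflow?"))`.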

### Step 4: Deploying the Frontend Application

The frontend application is [Gradio](https://www.gradio.app/) served from a [Kubernetes deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). The manifest by default serves the frontend at `/professorflow`. This is also configured at the Gradio level. If you are in a multi-tenant environment, you will need to adjust Gradio's configuration at `tiledb_demo/dockerfiles/frontends/frontend/src/app.py` and then build and push the container to a registry. You will then need to update  `tiledb_demo/manifests/core/frontend/frontend.yml` with the appropriate image. 
The `frontend.yml` has several manifests within it. 
1. `demo-deployment` is a Kubernetes deployment that will deploy our frontend application
2. `demo` is a [service](https://kubernetes.io/docs/concepts/services-networking/service/) for routing requests to your deployment container. 
3. `frontend-virtual-service` is a [virtual service](https://istio.io/latest/docs/reference/config/networking/virtual-service/) that will ensure we can route requests to our cluster using [Istio](https://istio.io). 
4. `demo-service-external-access` is an [AuthorizationPolicy](https://istio.io/latest/docs/reference/config/security/authorization-policy/) that will allow access externally to our demo service. 
5.  **This manifest requires admin access**. 

Before applying the manifests, ensure they reflect the path where you are intending to serve `gradio` 

In [None]:
!kubectl apply -f ../manifests/core/frontend/frontend.yml

In [None]:
!kubectl get deployment demo-deployment 

Once the above shows `READY 1/1`, visit the URL your application is being served at (for example, https://kubeflow.endpoints.sanfranai25.cloud.goog/professorflow). You should see a Gradio interface! Now, enter a prompt asking a question about Kubeflow and wait. Note: this will take a while because we are using CPU for inferencing, but we will fix this!

## Serving a LLAMA-2-7b-chat with NVIDIA NIM
We will be serving a Llama model using KServe. We will update our `InferenceService` to serve a new custom `transformer` and route to the NVIDIA `predictor`. This will improve our inferencing times. The detailed instructions can be found [here](https://github.com/NVIDIA/nim-deploy/blob/main/kserve/README.md). You will need to request an `NGC_API_KEY` and use it to create a [Kubernetes secret](https://kubernetes.io/docs/concepts/configuration/secret/) as well as a registry secret. Details on how to do that are [here](https://docs.nvidia.com/ai-enterprise/deployment/spark-rapids-accelerator/latest/appendix-ngc.html). You will also need to deploy a valid [NVIDIA device plugin](https://github.com/NVIDIA/k8s-device-plugin). If you are using L4 GPUs on GCP, you may need to reference [this guide](https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues/356). You can [check the catalog](https://catalog.ngc.nvidia.com/orgs/nim/teams/meta/containers/llama-3.1-8b-instruct) for valid NVIDIA images.

You will need to deploy:
1. An `inference-pvc` to store the model. 
2. A [llm server runtime](https://kserve.github.io/website/0.8/modelserving/servingruntimes/)
3. The updated `InferenceService` with a new transformer.


In [None]:
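# `!export` runs in a throwaway subshell, so the variable would be lost; %env keeps it for later cells.
%env NGC_API_KEY=...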

In [None]:
!kubectl create secret generic nvidia-nim-secrets   --from-literal=NGC_API_KEY="$NGC_API_KEY"

In [None]:
!kubectl create secret docker-registry ngc-secret -n admin --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=${NGC_API_KEY}

In [103]:
!kubectl apply -f ../manifests/Inference/GPU/nvidia-inference/llm_inference_nvidia.yml
!kubectl apply -f ../manifests/Inference/GPU/nvidia-inference/inference-pvc.yaml
!kubectl apply -f ../manifests/Inference/GPU/nvidia-inference/llm_server_runtime_nvidia.yaml

inferenceservice.serving.kserve.io/llm unchanged
persistentvolumeclaim/nvidia-pvc unchanged
servingruntime.serving.kserve.io/llama-2-7b-chat unchanged


In [106]:
!kubectl get inferenceservice llm

NAME   URL                                                                     READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
llm    http://llm.christensenc3526.kubeflow.endpoints.sanfranai25.cloud.goog   False          100                              llm-predictor-00001   24h


Run this command from your terminal once the above command reports `READY` `True`. Update it with your namespace.

NOTE: The container can take a while to be ready.
```
curl -X POST http://llm-predictor.christensenc3526.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-2-7b-chat",
    "messages": [{"role": "user", "content": "What is Kubeflow?"}],
    "temperature": 0.5,
    "top_p": 1,
    "max_tokens": 256,
    "stream": false
  }'
``` 
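
The same request can be sent from the notebook instead of the terminal. The sketch below only builds the body and reads the reply. `build_chat_request` and `extract_answer` are hypothetical helpers, and `extract_answer` assumes the endpoint returns an OpenAI-style chat completion (`choices[0].message.content`), which you should confirm against your deployed image.

```python
def build_chat_request(question, model="meta/llama-2-7b-chat",
                       temperature=0.5, top_p=1, max_tokens=256):
    """Return a chat completion body matching the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": False,
    }

def extract_answer(response_json):
    """Pull the assistant text out of an OpenAI-style chat completion (assumed shape)."""
    return response_json["choices"][0]["message"]["content"]
```

With `requests` already imported, a call would look like `requests.post(url, json=build_chat_request("What is Kubeflow?"))`, followed by `extract_answer(response.json())`.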

Now, use the same Gradio app and run a query! You should notice a quicker and higher-quality response (if you have GPU nodes). When you are done, delete the `inferenceservice` and your cluster should scale down.