### Introduction
This notebook illustrates and automates the Continuous Deployment process for bringing the popular open-source large language models backend [vLLM](https://docs.vllm.ai/) into SAP AI Core. Running Llama 2, Mistral, Mixtral, LLaVa, Gemma, and other large language models in SAP AI Core through BYOM(Bring Your Own Model) approach. <br/>

### Prerequisites
Before running this notebook, please assure you have perform the [Prerequisites](../../README.md)<br/><br/>

If the configuration of vllm scenario is created through SAP AI Launchpad instead of running [00-init-config.ipynb](../00-init-config.ipynb), please manually update the configuration_id in [env.json](env.json)
```json
{
    "configuration_id": "<YOUR_CONFIGURATION_ID_OF_VLLM_SCENARIO>",
    "deployment_id": "<WILL_BE_UPDATED_BY_THIS_NOTEBOOK>"
}
```
 
### The high-level flow of this Continuous Deployment process:
- Build a custom docker image adapted for SAP AI Core<br/>
- Push the docker image to docker hub<br/>
- Connect to SAP AI Core via SDK<br/>
- Create a deployment<br/>
- Check the status and logs of the deployment<br/>


#### 1.Build a custom docker image adapted for SAP AI Core
Please refer to [Dockerfile](Dockerfile) for details.

In [2]:
%%sh
# 0.Login to docker hub
docker login -u <YOUR_DOCKER_USER> -p <YOUR_DOCKER_ACCESS_TOKEN>

# 1.Build the docker image
docker build \
		--platform=linux/amd64 \
		-t docker.io/<YOUR_DOCKER_USER>/vllm-openai:ai-core .

#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 650B 0.0s done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/vllm/vllm-openai:latest
#3 ...

#4 [auth] vllm/vllm-openai:pull token for registry-1.docker.io
#4 DONE 0.0s

#3 [internal] load metadata for docker.io/vllm/vllm-openai:latest
#3 DONE 3.5s

#5 [1/2] FROM docker.io/vllm/vllm-openai:latest@sha256:4aea20de3b421f7775cfdc6468a04a29d0fcfc3603ad3b18aab4ef1f4652769d
#5 DONE 0.0s

#6 [2/2] RUN mkdir -p /nonexistent/ &&    mkdir -p /hf-home/ &&     chown -R nobody:nogroup /nonexistent /hf-home/ &&     chmod -R 770 /nonexistent/ /hf-home/
#6 CACHED

#7 exporting to image
#7 exporting layers done
#7 writing image sha256:571d9541e6a7c5208c277740a0e03ccdb98bdecbc113c2e9ec8997a8d8994848 done
#7 naming to docker.io/yatsea/vllm-openai:ai-core done
#7 DONE 0.0s


#### 2.Push the docker image to docker hub

In [3]:
%%sh
# 2.Push the docker image to docker hub to be used by deployment in SAP AI Core
docker push docker.io/<YOUR_DOCKER_USER>/vllm-openai:ai-core

The push refers to repository [docker.io/yatsea/vllm-openai]
a653179dc050: Preparing
12947d07ed5f: Preparing
067e9baa9a0d: Preparing
214b6ff61148: Preparing
0fd53473730f: Preparing
ca336086e060: Preparing
c501b4875b93: Preparing
674396d66abf: Preparing
600c676771a0: Preparing
6ac15100dff6: Preparing
40f0eb1871b9: Preparing
8d113b7b997c: Preparing
cd77f58b80cd: Preparing
e4b1bddcbe63: Preparing
765423415d69: Preparing
7b9433fba79b: Preparing
256d88da4185: Preparing
6ac15100dff6: Waiting
40f0eb1871b9: Waiting
8d113b7b997c: Waiting
cd77f58b80cd: Waiting
ca336086e060: Waiting
c501b4875b93: Waiting
674396d66abf: Waiting
600c676771a0: Waiting
e4b1bddcbe63: Waiting
765423415d69: Waiting
7b9433fba79b: Waiting
256d88da4185: Waiting
0fd53473730f: Layer already exists
12947d07ed5f: Layer already exists
067e9baa9a0d: Layer already exists
214b6ff61148: Layer already exists
a653179dc050: Layer already exists
c501b4875b93: Layer already exists
674396d66abf: Layer already exists
ca336086e060: Layer al

#### 3.Initiate an SAP AI Core SDK client
- resource_group loaded from [../config.json](../config.json)
- ai_core_sk(service key) loaded from [../config.json](../config.json)

In [4]:
import requests, json, time, datetime
from datetime import datetime
from ai_core_sdk.ai_core_v2_client import AICoreV2Client

In [5]:
# load the configuration from ../config.json 
with open("../config.json") as f:
    config = json.load(f)

resource_group = config.get("resource_group", "default")
print( "resource group: ", resource_group)

resource group:  oss-llm


In [6]:
# Initiate an AI Core SDK client with the information of service key
ai_core_sk = config["ai_core_service_key"]
base_url = ai_core_sk.get("serviceurls").get("AI_API_URL") + "/v2/lm"
client = AICoreV2Client(base_url=ai_core_sk.get("serviceurls").get("AI_API_URL")+"/v2",
                        auth_url=ai_core_sk.get("url")+"/oauth/token",
                        client_id=ai_core_sk.get("clientid"),
                        client_secret=ai_core_sk.get("clientsecret"),
                        resource_group=resource_group)

In [7]:
# Prepare the http header which will be used later through request.
token = client.rest_client.get_token()
headers = {
    "Authorization": token,
    "ai-resource-group": resource_group,
    "Content-Type": "application/json",
}

#### 4.Create a deployment for llama.cpp scenario
To create a deployment in SAP AI Core, it requires the corresponding resource_group and configuration_id
- resource_group loaded from [../config.json](../config.json)
- configuration_id of  loaded from [env.json](env.json)

In [8]:
# resource_group: The target resource group to create the deployment
# configuration_id: The target configuration to create the deployment, which is created in ../00-init-config.ipynb 
with open("./env.json") as f:
    env = json.load(f)

configuration_id = env["configuration_id"]
print("configuration id:", configuration_id)

configuration id: 6a640f56-df21-487e-855b-5e1cde04bd9c


**Helper function**
- get the current UTC time in yyyy-mm-dd hh:mm:ss format, to be used to filter deployments logs

In [9]:
# Helper function to get the current time in UTC, used to filter deployments logs
def get_current_time():  
    current_time = datetime.utcnow()
    # Format current time in the desired format
    formatted_time = current_time.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return formatted_time

**Helper function**
- Write back the configuration value back to configuration json file

In [10]:
# Helper function to write back the configuration value back to configuration json file
def update_json_file(file_path, key, value):
    # Load the JSON configuration file
    with open(file_path, 'r') as file:
        config = json.load(file)

    # Update the value
    config[key] = value

    # Write the updated configuration back to the file
    with open(file_path, 'w') as file:
        json.dump(config, file, indent=4)
        print(f"{file_path} updated. {key}: {value}")

**Create a deployment for llama.cpp in SAP AI Core**
- configuration_id
- resource_group
<br/><br/>
The created deployment id will be written back to [env.json](env.json), which will be used in
- [02-vllm.ipynb](02-vllm.ipynb)to test the inference of open-source llms with llama.cpp server in SAP AI Core
- [04-cleanup.ipynb](04-cleanup.ipynb) to stop and delete the deployment and clean up the resource.

In [11]:
# Create a Deployment in SAP AI Core
print("Creating deployment.")
response = client.deployment.create(
    configuration_id=configuration_id,
    resource_group=resource_group
)

# last_check_time will be used to check the deployment status continuously afterwards
# set initial last_check_time right after creating deployment
last_check_time = get_current_time()
deployment_start_time = datetime.now()

deployment_id = response.id
status = response.status
update_json_file("env.json", "deployment_id", deployment_id)
print("Deployment Result:\n", response.__dict__)

Creating deployment.
env.json updated. deployment_id: d2205e6da6740a73
Deployment Result:
 {'id': 'd2205e6da6740a73', 'message': 'Deployment scheduled.', 'deployment_url': '', 'status': <Status.UNKNOWN: 'UNKNOWN'>, 'ttl': None}


#### 5.Check the status and logs of the deployment

In [13]:
print("5.Checking deployment status.")
deployment_url = f"{base_url}/deployments/{deployment_id}"
deployment_log_url = f"{deployment_url}/logs?start="
interval_s = 20

while status != "RUNNING" and status != "DEAD":
    current_time = get_current_time()
    #check deployment status
    response = requests.get(url=deployment_url, headers=headers)
    resp = response.json()
    
    status = resp['status']
    print(f'...... Deployment Status at {current_time}......', flush=False)
    print(f"Deployment status: {status}")

    #retrieve deployment logs
    response_log = requests.get(url=f"{deployment_log_url}{last_check_time}", headers=headers)
    last_check_time = current_time
    print(f"Deployment logs: {response_log.text}")

    # Sleep for 60 secs to avoid overwhelming the API with requests
    time.sleep(interval_s)

deployment_end_time = datetime.now()
duration_in_min = (deployment_end_time - deployment_start_time) / 60

if status == "RUNNING":
    print("Deployment is up and running now!")
else:
    print(f"Deployment {deployment_id} failed!")   

print(f"Deployment duration: {duration_in_min} mins")

5.Checking deployment status.
Deployment is up and running now!
Deployment duration: 0:02:12.083366 mins
