### Introduction
In this notebook, we will test Mistral-7B with [vLLM](https://docs.vllm.ai/) in SAP AI Core. You can also run Llama 2, Mixtral, Gemma, and other [supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).

### Prerequisites
Before running this notebook, please ensure you have completed the [Prerequisites](../../README.md) and [01-deployment.ipynb](01-deployment.ipynb). As a result, a deployment of the vLLM scenario should be running in SAP AI Core.<br/><br/>

If the configuration and deployment were created through SAP AI Launchpad, please manually update the configuration_id and deployment_id in [env.json](env.json):
```json
{
    "configuration_id": "<YOUR_CONFIGURATION_ID_OF_VLLM_SCENARIO>",
    "deployment_id": "<YOUR_DEPLOYMENT_ID_BASED_ON_CONFIG_ABOVE>"
}
```
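
Because a forgotten placeholder in env.json only surfaces later as a failed API call, it can help to validate the file when loading it. The following is a minimal sketch, not part of the original notebook; `load_env` is a hypothetical helper name, and the check simply assumes that unset values are empty or still start with `<` as in the template above.

```python
import json


def load_env(path="env.json"):
    """Load env.json and fail early if an ID is missing or still a placeholder.

    Hypothetical helper for illustration; the key names come from the
    env.json template shown above.
    """
    with open(path) as f:
        env = json.load(f)
    for key in ("configuration_id", "deployment_id"):
        value = str(env.get(key, ""))
        # Template placeholders look like "<YOUR_...>", so reject them.
        if not value or value.startswith("<"):
            raise ValueError(f"{key} in {path} is not set")
    return env
```

Calling `load_env("./env.json")` in place of the plain `json.load` would raise a `ValueError` before any request is sent.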
 
### The high-level flow:
- Load configuration info
- Connect to SAP AI Core via SDK
- Check the status and logs of the deployment
- Run inference on the model with the OpenAI-compatible chat completion API


#### 1. Load config info
- resource_group is loaded from [config.json](../config.json)
- deployment_id (created in 01-deployment.ipynb) is loaded from [env.json](env.json)

In [1]:
import requests, json
from ai_api_client_sdk.ai_api_v2_client import AIAPIV2Client

In [2]:
# Load the configuration files.
# resource_group comes from config.json; deployment_id comes from env.json (created in 01-deployment.ipynb).
with open("../config.json") as f:
    config = json.load(f)

with open("./env.json") as f:
    env = json.load(f)

deployment_id = env["deployment_id"]
resource_group = config.get("resource_group", "default")
print("deployment id: ", deployment_id, " resource group: ", resource_group)

deployment id:  d2205e6da6740a73  resource group:  oss-llm


#### 2. Initiate connection to SAP AI Core

In [3]:
aic_sk = config["ai_core_service_key"]
base_url = aic_sk["serviceurls"]["AI_API_URL"] + "/v2/lm"
ai_api_client = AIAPIV2Client(
    base_url= base_url,
    auth_url=aic_sk["url"] + "/oauth/token",
    client_id=aic_sk['clientid'],
    client_secret=aic_sk['clientsecret'],
    resource_group=resource_group)

In [4]:
token = ai_api_client.rest_client.get_token()
headers = {
        "Authorization": token,
        'ai-resource-group': resource_group,
        "Content-Type": "application/json"}

#### 3. Check the deployment status

In [5]:
# Check deployment status before sending an inference request
deployment_url = f"{base_url}/deployments/{deployment_id}"
response = requests.get(url=deployment_url, headers=headers)
resp = response.json()
status = resp['status']

deployment_log_url = f"{base_url}/deployments/{deployment_id}/logs"
if status == "RUNNING":
    print(f"Deployment-{deployment_id} is running. Ready for inference request")
else:
    print(f"Deployment-{deployment_id} status: {status}. Not yet ready for inference request")
    # Retrieve deployment logs from {{apiurl}}/v2/lm/deployments/{{deploymentid}}/logs
    response = requests.get(deployment_log_url, headers=headers)
    print('Deployment Logs:\n', response.text)


Deployment-d2205e6da6740a73 is running. Ready for inference request
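
The check above inspects the status once. When a deployment is still starting, a small polling loop can be more convenient. The sketch below is an illustration rather than part of the original notebook: `wait_until_running` is a hypothetical helper, the timeout and interval defaults are arbitrary, and the set of terminal statuses (`DEAD`, `STOPPED`) is an assumption that should be checked against the SAP AI Core documentation.

```python
import time


def wait_until_running(fetch_status, timeout_s=600, interval_s=10,
                       terminal=("DEAD", "STOPPED"), sleep=time.sleep):
    """Poll fetch_status() until it returns "RUNNING".

    fetch_status: zero-argument callable returning the status string.
    terminal: statuses treated as unrecoverable (assumed values).
    """
    waited = 0
    while True:
        status = fetch_status()
        if status == "RUNNING":
            return status
        if status in terminal:
            raise RuntimeError(f"Deployment ended in status {status}")
        if waited >= timeout_s:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(interval_s)
        waited += interval_s
```

With the variables from the cell above, it could be called as `wait_until_running(lambda: requests.get(deployment_url, headers=headers).json()["status"])`.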


#### 4. Inference with the completion and chat completion APIs
Tested as of 16 March 2024 with:
- vLLM v0.3.3, base Docker image: [vllm/vllm-openai:v0.3.3](https://hub.docker.com/layers/vllm/vllm-openai/v0.3.3/images/sha256-4aea20de3b421f7775cfdc6468a04a29d0fcfc3603ad3b18aab4ef1f4652769d?context=explore)
- SAP AI Core [resource plan](https://help.sap.com/docs/sap-ai-core/sap-ai-core-service-guide/choose-resource-plan-c58d4e584a5b40a2992265beb9b6be3c)

##### Test summary for TheBloke/Mistral-7B-Instruct-v0.2
- SAP AI Core resource plan: infer-S (CUDA out-of-memory for the unquantized model; works well for the AWQ-quantized model)
- The unquantized model TheBloke/Mistral-7B-Instruct-v0.2 failed to load with a CUDA out-of-memory error; it appears to require more GPU VRAM than the plan provides.
- The suggested options below were tried without success:
    ```sh
    --gpu-memory-utilization 0.95 #(also tried from 0.4 to 0.95) 
    --enforce-eager
    --max-model-len 2048
    --max-num-batched-tokens 2048
    --max-num-seqs 2048
    ```
- This is likely a bug, according to [this issue](https://github.com/vllm-project/vllm/issues/2248).
- Hence we switched to the AWQ-quantized model (TheBloke/Mistral-7B-Instruct-v0.2-AWQ), which works well with the same configuration as above.<br/>
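
A back-of-the-envelope estimate of weight memory helps put these out-of-memory observations in context. The sketch below counts model weights only and ignores the KV cache, activations, and quantization metadata, so real usage is higher. The parameter counts (about 7.2 billion for Mistral-7B and about 46.7 billion for Mixtral-8x7B) are approximations assumed here, not figures from this notebook, and the GPU memory of each resource plan should be taken from the SAP AI Core documentation.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Approximate parameter counts (assumptions for illustration).
models = [("Mistral-7B", 7.2e9), ("Mixtral-8x7B", 46.7e9)]

for name, params in models:
    fp16 = weight_memory_gib(params, 16)   # unquantized half precision
    awq4 = weight_memory_gib(params, 4)    # 4-bit AWQ quantization
    print(f"{name}: fp16 ~{fp16:.1f} GiB, 4-bit AWQ ~{awq4:.1f} GiB")
```

The fourfold gap between 16-bit and 4-bit weights is consistent with the unquantized model failing on a plan where the AWQ variant loads, although it does not rule out the bug referenced above.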

##### Test summary for **TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ**
- SAP AI Core resource plan: infer-L (CUDA out-of-memory)
- The AWQ model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ failed to load with a CUDA out-of-memory error; it appears to require more GPU VRAM than the plan provides.
- The suggested options below were tried without success:
    ```sh
    --gpu-memory-utilization 0.95 #(also tried from 0.4 to 0.95) 
    --enforce-eager
    --max-model-len 512
    --max-num-batched-tokens 512
    --max-num-seqs 512
    ```
- This is likely a bug, according to [this issue](https://github.com/vllm-project/vllm/issues/2248).
<br/><br/>

**Important**: <br/>
Please choose your target model by its [Hugging Face](https://huggingface.co) model ID.

In [10]:
model = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
#model = "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ" #not working

deployment = ai_api_client.deployment.get(deployment_id)
inference_base_url = f"{deployment.deployment_url}"
openai_chat_api_endpoint = f"{inference_base_url}/v1/chat/completions"
openai_completion_api_endpoint = f"{inference_base_url}/v1/completions"

In [11]:
# List models
endpoint = f"{inference_base_url}/v1/models"
print(endpoint)

response = requests.get(url=endpoint, headers=headers)
print('Result:', response.text)

https://api.ai.prod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/d2205e6da6740a73/v1/models
Result: {"object":"list","data":[{"id":"TheBloke/Mistral-7B-Instruct-v0.2-AWQ","object":"model","created":1711597462,"owned_by":"vllm","root":"TheBloke/Mistral-7B-Instruct-v0.2-AWQ","parent":null,"permission":[{"id":"modelperm-c20740c70a124f208331e62e530bb4e5","object":"model_permission","created":1711597462,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
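
The `/v1/models` response can also be used to confirm that the chosen `model` ID is actually served before sending prompts. This is a minimal sketch with a hypothetical helper name (`served_model_ids`); the sample string is a shortened form of the response shown above.

```python
import json


def served_model_ids(models_response_text):
    """Return the model IDs listed in an OpenAI-style /v1/models response."""
    payload = json.loads(models_response_text)
    return [entry["id"] for entry in payload.get("data", [])]


# Shortened sample of the response above, for illustration only.
sample = ('{"object":"list","data":[{"id":"TheBloke/Mistral-7B-Instruct-v0.2-AWQ",'
          '"object":"model","owned_by":"vllm"}]}')
ids = served_model_ids(sample)
print(ids)
```

In the notebook, `served_model_ids(response.text)` followed by a check such as `model in ids` would catch a mismatch between the deployed model and the `model` variable.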


#### 4.1 Sample#1: Test OpenAI-compatible API for Chat Completion
Now let's test its [OpenAI-compatible API for Chat Completion](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) with a basic chain-of-thought sample. This is the same API interface as the Chat Completion API of GPT-3.5/4 in SAP Generative AI Hub.

In [20]:
# Let's try its OpenAI-compatible chat completion API
sys_msg = "You are a helpful AI assistant"
#user_msg = "Why is the sky blue?"
user_msg = "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?Let's thinks step by step."
json_data = { 
  "model": model, 
  "messages": [
            # The fine-tuned Mistral instruct model doesn't accept a system message.
            # Please refer to https://github.com/vllm-project/vllm/discussions/2112
            # {
            #     "role": "system",
            #     "content": sys_msg
            # },
            {
                "role": "user",
                "content": user_msg
            }
        ]
}

response = requests.post(openai_chat_api_endpoint, headers=headers, json=json_data)
print('Result:', response.text)

Result: {"id":"cmpl-0725f1ceafb64301912218e8d4ced5ad","object":"chat.completion","created":9457,"model":"TheBloke/Mistral-7B-Instruct-v0.2-AWQ","choices":[{"index":0,"message":{"role":"assistant","content":" Roger initially has 5 tennis balls. He then buys 2 cans, each containing 3 tennis balls. So, he gets 2 * 3 = <<2*3=6>>6 tennis balls from the cans.\n\nTherefore, Roger now has 5 (his initial balls) + 6 (new balls from cans) = <<5+6=11>>11 tennis balls."},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":53,"total_tokens":142,"completion_tokens":89}}
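
The raw response is verbose, and usually only the assistant text and token usage are needed. The sketch below shows one way to pull them out of a chat completion response; `extract_reply` is a hypothetical helper, and the field names are the ones visible in the result above.

```python
def extract_reply(chat_response):
    """Pick the assistant text and basic metadata from a chat completion dict."""
    choice = chat_response["choices"][0]
    usage = chat_response.get("usage", {})
    return {
        # The content above starts with a space, so strip surrounding whitespace.
        "content": choice["message"]["content"].strip(),
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Shortened, hand-written sample in the same shape as the result above.
sample = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": " 11 tennis balls."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 53, "total_tokens": 142, "completion_tokens": 89},
}
reply = extract_reply(sample)
print(reply["content"])
```

In the notebook this would be called as `extract_reply(response.json())`.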


#### 4.2 Sample#2: Write a haiku about running vllm in AI Core

In [14]:
# Let's test its OpenAI-compatible chat completion API by writing a haiku
user_msg = "Write a haiku for running vllm in AI Core"
json_data = {
  "model": model,
  "messages": [
            {
                "role": "user",
                "content": user_msg
            }
        ]
}

response = requests.post(openai_chat_api_endpoint, headers=headers, json=json_data)
print('Result:', response.json())

Result: {'id': 'cmpl-e3d38647b9d241ccbb10159c92f7eb94', 'object': 'chat.completion', 'created': 8020, 'model': 'TheBloke/Mistral-7B-Instruct-v0.2-AWQ', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' Silicon heart beats,\n\nData streams in endless dance,\nVlm runs, learning grows.'}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 21, 'total_tokens': 44, 'completion_tokens': 23}}
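
Both samples so far send only `model` and `messages`, so the server's default sampling settings apply. The sketch below shows a hypothetical `build_chat_payload` helper that sets `temperature` and `max_tokens`, which are standard fields of the OpenAI chat completion request. It also folds an optional system message into the user turn, which is one possible workaround (an assumption, not the notebook's method) for chat templates that reject the `system` role.

```python
def build_chat_payload(model, user_msg, system_msg=None,
                       temperature=0.0, max_tokens=256):
    """Build an OpenAI-style chat completion request body.

    If system_msg is given, it is prepended to the user message instead of
    being sent with the "system" role (workaround, see lead-in).
    """
    content = f"{system_msg}\n\n{user_msg}" if system_msg else user_msg
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,  # 0.0 makes the output close to deterministic
        "max_tokens": max_tokens,    # upper bound on generated tokens
    }


payload = build_chat_payload(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    "Write a haiku for running vllm in AI Core",
    system_msg="You are a helpful AI assistant",
)
print(payload["messages"][0]["content"])
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, exactly like `json_data` in the cells above.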


#### 4.3 Sample#3: Customer Message Processing 
In our sample [btp-industry-use-cases/04-customer-interaction-gpt4](https://github.com/SAP-samples/btp-industry-use-cases/tree/main/04-customer-interaction-gpt4), GPT-3.5/4 is used to process customer messages in customer interactions and to produce output in a JSON schema using plain prompting. The tasks are:
- Summarize the customer message into a title and a short description
- Analyze the sentiment of the customer message
- Extract the entities from the customer message, such as customer, product, order number, etc.

Let's see if the same scenario can be achieved with Mistral-7B.
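
Model replies are not guaranteed to be clean JSON: they may carry leading whitespace or surrounding text, and they may include entities with empty values even when the prompt asks to omit them. The following is a minimal post-processing sketch under those assumptions; `parse_customer_message` is a hypothetical helper, and the sample reply is hand-written for illustration.

```python
import json


def parse_customer_message(content):
    """Parse the JSON object in a model reply and drop empty entities."""
    # Take the outermost {...} span in case the model added extra text.
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in model reply")
    result = json.loads(content[start:end + 1])
    # Keep only entities whose value is non-empty after trimming.
    result["entities"] = [
        e for e in result.get("entities", [])
        if str(e.get("value", "")).strip()
    ]
    return result


# Hand-written sample reply, for illustration only.
sample_reply = (' {"sentiment": "Negative", "title": "Coffee machine issue", '
                '"entities": [{"field": "product_name", "value": "coffee machine"}, '
                '{"field": "customer_no", "value": ""}]}')
parsed = parse_customer_message(sample_reply)
print(parsed["sentiment"], parsed["entities"])
```

A truncated reply will still raise an error from `json.loads`, which is a useful signal to increase `max_tokens` or retry.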


In [15]:
# Let's test its OpenAI-compatible chat completion API by processing a customer message:
# summarization, sentiment analysis, and entity extraction, with the output as JSON
user_msg = r'''
You are an AI assistant to process the input text. Here are your tasks on the text.
1.Apply Sentiment Analysis
2.Generate a title less than 100 characters,and summarize the text into a short description less than 200 characters
3.Extract the entities such as customer,product,order,delivery,invoice etc from the text Here is a preliminary list of the target entity fields and description. Please extract all the identifiable entities even not in the list below. Don't include any field with unknown value. \
-customer_no: alias customer number, customer id, account id, account number which could be used to identify a customer.
-customer_name: customer name, account name
-customer_phone: customer contact number. -product_no: product number, product id
-product_name
-order_no: sales order number, order id
-order_date 
-delivery_no: delivery number, delivery id
-delivery_date: delivery date, shipping date
-invoice_no: alias invoice number, invoice id, receipt number, receipt id etc. which can be used to locate a invoice.
-invoice_date: invoice date, purchase date
-store_name
-store_location
etc.
    
For those fields not in list must follow the Snakecase name conversation like product_name, no space allow. 

Output expected in JSON format as below: 
{\"sentiment\":\"{{Positive/Neutral/Negative}}\",\"title\":\"{{The generated title based on the input text less than 100 characters}}\",\"summary\":\"{{The generated summary based on the input text less than 300 characters}}\",\"entities\":[{\"field\":\"{{the extracted fields such as product_name listed above}}\",\"value\":\"{{the extracted value of the field}}\"}]}

Input text: 
Everything was working fine one day I went to make a shot of coffee it stopped brewing after 3 seconds Then I tried the milk frother it stopped after 3 seconds again I took it back they fixed it under warranty but it’s happening again I don’t see this machine lasting more then 2 years to be honest I’m spewing I actually really like the machine It’s almost like it’s losing pressure somewhere, they wouldn’t tell my what the problem was when they fixed it.. Purchased at Harvey Norman for $1,349. \
Product is used: Several times a week
 
JSON:
'''

json_data = { 
  "model": model,
  "response_format": {"type": "json_object"}, #JSON mode
  "messages": [
            {
                "role": "user",
                "content": user_msg
            }
        ]
}

response = requests.post(url=openai_chat_api_endpoint, headers=headers, json=json_data)
print('Result:', response.text)

Result: {"id":"cmpl-b514ea60b7ac4b2a9a96b84546d7604a","object":"chat.completion","created":8097,"model":"TheBloke/Mistral-7B-Instruct-v0.2-AWQ","choices":[{"index":0,"message":{"role":"assistant","content":" {\n\"sentiment\": \"Negative\",\n\"title\": \"Issues with Coffee Machine: Brewing and Milk Frothing Stop After Short Time\",\n\"summary\": \"The coffee machine stopped brewing and milk frothing after a few seconds. It was fixed under warranty but the issue recurred. Machine is suspected to have pressure loss problem. Customer expresses dissatisfaction and doubts machine's longevity. Purchased at Harvey Norman for $1,349.\",\n\"entities\": [\n{\"field\": \"product_name\", \"value\": \"coffee machine\"},\n{\"field\": \"customer_no\", \"value\": \"\"},\n{\"field\": \"customer_name\", \"value\": \"\"},\n{\"field\": \"customer_phone\", \"value\": \"\"},\n{\"field\": \"product_no\", \"value\": \"\"},\n{\"field\": \"order_no\", \"value\": \"\"},\n{\"field\": \"order_date\", \"value\": \"\