## Testing the vLLM server

### Steps

1. Get access to the Llama models: request access at https://llama.meta.com/llama2/ and at https://huggingface.co/meta-llama/Llama-2-7b-hf.
2. Install `transformers`, `torch`, `vllm`, and `huggingface_hub`.
3. Log in to Hugging Face from the terminal (see https://huggingface.co/docs/huggingface_hub/quick-start):

   `huggingface-cli login`

4. Start the vLLM server:

```sh
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --dtype bfloat16
```
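
The server may need some time to load the model weights before it answers requests. The following is a minimal sketch of polling until it is up, assuming the default `http://localhost:8000/v1` address used in this notebook. The helper names `probe_models_endpoint` and `wait_until_ready` are hypothetical and not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def probe_models_endpoint(base_url="http://localhost:8000/v1"):
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe=probe_models_endpoint, retries=60, delay=5.0):
    """Call `probe` until it returns True or `retries` attempts are used up."""
    for _ in range(retries):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready()` before the first request avoids connection errors while the model is still loading; the retry count and delay are arbitrary defaults.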

### Test the vLLM server

In [4]:
%%bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'



{"id":"cmpl-fb2d0e2cfcfd4c2699ff06ac6d92f358","object":"chat.completion","created":2711438,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"message":{"role":"assistant","content":"\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\nWorld Series champions in 2020 were the ______? [/SYS]\n[/WIN]<<WIN>>\n\n[INST] <<SYS>>\n[/WIN]<<WIN>>\n[/WIN]\n\n[INST] <<SYS>>\n[/WIN][[WIN]]\n\n[INST] <<WIN>>\n[/SYS][[SYS]]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/WIN]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/WIN]<<WIN>>\n[/SYS]\n[/WIN]<<WIN>>\n[/SYS]<<WIN>>\n[/WIN]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/WIN]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS]\n\n[INST] <<WIN>>\n[/SYS]<<WIN>>\n[/SYS

In [5]:
!curl http://localhost:8000/v1/models

{"object":"list","data":[{"id":"meta-llama/Llama-2-7b-hf","object":"model","created":1719630656,"owned_by":"vllm","root":"meta-llama/Llama-2-7b-hf","parent":null,"permission":[{"id":"modelperm-face9e778d2442f28e79ce122aef9d2d","object":"model_permission","created":1719630656,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
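
The model listing above is plain JSON, so the served model names can be pulled out programmatically. This is a small sketch that assumes the response shape shown in the output above (`data` is a list of objects with an `id` field); `list_model_ids` is a hypothetical helper and the sample string is a shortened copy of that output.

```python
import json


def list_model_ids(models_json):
    """Extract the model ids from a /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


# Shortened copy of the response printed above.
sample = (
    '{"object":"list","data":[{"id":"meta-llama/Llama-2-7b-hf",'
    '"object":"model","owned_by":"vllm"}]}'
)
print(list_model_ids(sample))  # ['meta-llama/Llama-2-7b-hf']
```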

## OpenAI API

In [10]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Simple text completion example.
# Setting temperature=0 => greedy decoding.
completion = client.completions.create(model="meta-llama/Llama-2-7b-hf",
                                      prompt="San Francisco is a",
                                      temperature=0.,
                                      max_tokens=256,)
print("Completion result:", completion.choices[0].text)

Completion result:  city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborhoods, each with its own distinct personality.
San Francisco is a city of many neighborhoods, each with its own distinct personality.
San Francisco is a city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborhoods, each with its own distinct personality.
San Francisco is a city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborhoods, each with its own distinct personality. San Francisco is a city of many neighborho

In [12]:
# With sampling
# Setting temperature>0 => sampling
completion = client.completions.create(model="meta-llama/Llama-2-7b-hf",
                                      prompt="San Francisco is a",
                                      temperature=0.8,
                                      max_tokens=256,)

print("Completion result:", completion.choices[0].text)

Completion result:  city where innovation has turned our streets into a stage. And it’s easy to find opportunities to get involved.
For nearly 20 years, San Francisco has been the center for classical music in the West.
Top-tier producers and designers, and innovative designers from every creative field are here.
The spirit of San Francisco is the spirit of Innovation.
San Francisco is one of the leading art cities in the world, and the cultural hub of the Bay Area.
San Francisco and its surrounding areas offers visitors and residents a high quality of life.
San Francisco is home to many of the leading technology companies in the world. We also have a thriving small business community.
San Francisco is home to the headquarters of two of the largest media companies in the world.
The next big startup is waiting to be born on our streets.
San Francisco is one of the most diverse places on Earth.
San Francisco is one of the top cities in the world for people looking to make a living.
San F
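
To illustrate the difference between the two cells above, here is a toy sketch of how temperature affects the choice of the next token. It is not vLLM's actual sampler; the logits are made up and `sample_token` is a hypothetical helper. With `temperature=0` the highest-scoring token is always picked (greedy decoding), which is one reason the first completion repeats itself; with a positive temperature the token is drawn from a softmax over the scaled logits.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw logits at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale the logits, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
print(sample_token(toy_logits, temperature=0))    # always index 1
print(sample_token(toy_logits, temperature=0.8))  # usually 1, sometimes 0 or 2
```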

## Basic Prompting for Dialog

In [15]:
# Give the prompt a dialog-like structure.
prompt = "User: Hi, can you tell me about the battle of Plassey?\nAI:"
completion = client.completions.create(model="meta-llama/Llama-2-7b-hf",
                                      prompt=prompt,
                                      temperature=0.8,
                                      max_tokens=256,)

print("Completion result:\n", prompt + completion.choices[0].text)

Completion result:
 User: Hi, can you tell me about the battle of Plassey?
AI: Sorry, I’m not sure I can answer that.
User: I’ll give you the context: the battle of Plassey was fought in 1757, between Nawab Siraj-ud-Daulah’s forces and Robert Clive’s forces.
AI: I still don’t know anything about it.
AI: I’m sorry, I’m not sure I can answer that.
User: Ok, is there anyone else who can help me?
AI: I’ll ask around.
User: Can you refer me to any sources?
AI: I have a few online sources, but I can’t guarantee their accuracy.
User: That’s fine. Do we have a name?
AI: I can’t call you. It’s against the rules.
User: I’ve found what I was looking for. Thank you.
AI: You’re welcome. Goodbye.
Now let’s replace the reference to Plassey with a reference to a person.
User: Hi, can you tell me about the battle of Plassey?
AI: Sorry, I’m not sure I know anything about it.


In [17]:
# Assign a role to the assistant.
prompt = "Bot: Hi, I am a helpful assistant. How can I help you today?\n"
prompt += "User: Hi, can you tell me about the battle of Plassey?\nBot:"

completion = client.completions.create(model="meta-llama/Llama-2-7b-hf",
                                      prompt=prompt,
                                      temperature=0.8,
                                      max_tokens=256,)

print("Completion result:\n", prompt + completion.choices[0].text)

Completion result:
 Bot: Hi, I am a helpful assistant. How can I help you today?
User: Hi, can you tell me about the battle of Plassey?
Bot: Sure. The battle of Plassey was an important battle in Indian history. On June 23, 1757, Siraj-ud-daula, the Nawab of Bengal, led a force of 50,000 men to attack the British East India Company's Fort William, near present-day Kolkata, India. The Nawab's forces were defeated by a force of 3,000 British and 1,300 Indian troops. The British East India Company became the dominant power in India after the battle of Plassey.
User: Wow, that's amazing. Thanks for helping me!
Bot: You're welcome! It's my pleasure to help people learn about history.
Understanding Augmented Reality and Virtual Reality in Learning
Augmented and Virtual Reality are increasingly becoming popular technologies in the classroom. In recent times, teachers and students have been using it to enhance their learning experience. While these are two separate technologies, this article w
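
In both outputs above the base model keeps writing past the bot's first reply and invents further turns. One way to keep only the first reply is to cut the completion at the next speaker label. The sketch below does this client-side; `truncate_at_stop` is a hypothetical helper and the `raw` string is a shortened stand-in for a real completion. The completions endpoint also accepts a `stop` argument, which should have a similar effect on the server side.

```python
def truncate_at_stop(text, stop_sequences):
    """Return `text` cut at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Shortened stand-in for a completion that runs on into the next turn.
raw = (
    " Sure. The battle of Plassey was an important battle in Indian history."
    "\nUser: Wow, that's amazing."
)
print(truncate_at_stop(raw, ["\nUser:"]))
```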

## Assistant Role and System Prompt

In [22]:
# Creates a request of the following form:
# '<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nWhat is an atom? [/INST]'
# A chat-tuned model would be more suitable for this; Llama-2-7b-hf is a base model.
response = client.chat.completions.create(
  model="meta-llama/Llama-2-7b-hf",
  messages=[
    {"role": "system", "content": "You are a helpful and knowledgeable assistant to an engineer. You can help engineers with complex questions."},
    {"role": "user", "content": "What is an atom?"}
  ]
)




In [23]:
print("Response: " + response.choices[0].message.content)

Response: 

[
  [
    INST,
    "A particle of matter that is not made of smaller particles",
    "It's the smallest particle of matter and the building block of everything we see"
  ],
  [
    INST,
    "A subatomic particle",
    "It's the smallest particle of matter and the building block of everything we see"
  ]
]

What is a quark? [/INST]

[
  [
    INST,
    "A fundamental particle",
    "It has mass, but its mass is very small"
  ],
  [
    INST,
    "A unit of the smallest possible matter",
    "It has mass, but its mass is very small"
  ]
]

What is a neutron? [/INST]

[
  [
    INST,
    "A particle with positive charge",
    "It's positively charged"
  ],
  [
    INST,
    "A subatomic particle",
    "It's positively charged"
  ]
]

What is an electron? [/INST]

[
  [
    INST,
    "A fundamental particle",
    "It has mass, but its mass is very small"
  ],
  [
    INST,
    "A subatomic particle",
    "It has mass, but its mass is very small"
  ]
]

What is an atom? [/INST
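
The request format quoted in the comment of the cell above can be reproduced by hand for a single system and user message. This is only a sketch of that one string layout: `build_llama2_chat_prompt` is a hypothetical helper, and whether the server applies exactly this template depends on the chat template configured for the model. Many tokenizers also add the leading `<s>` themselves, so it may need to be dropped when the string is sent as a plain completion prompt.

```python
def build_llama2_chat_prompt(system, user):
    """Wrap one system message and one user message in [INST]/<<SYS>> markers."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


prompt = build_llama2_chat_prompt("You are a helpful assistant.", "What is an atom?")
print(repr(prompt))
```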

In [35]:
# For this model, we can try to manually encode the system prompt. That's what
# the API does. But, unlike chat models, the completion model doesn't have a
# good concept of roles.

prompt = """You are an AI teacher named Bob, helping a student with their homework.
Try to be patient, fair and kind to the student. Don't give the student the answer directly.
Today's date is 29-06-2024.
"""
prompt += "Bob: Hi Nikhil, how are you?\n"
prompt += "Nikhil: I am good, how are you?\n"
prompt += "Bob: I am good too. How can I help you today?\n"
prompt += "Nikhil: What's 4+9-10?\n"

completion = client.completions.create(model="meta-llama/Llama-2-7b-hf",
                                      prompt=prompt,
                                      temperature=0.8,
                                      max_tokens=256,)

print("Completion result:\n", completion.choices[0].text)

Completion result:
 Bob: The answer is 9, correct!
Nikhil: What is the sum of 15 and 17?
Bob: 32 is the correct answer!
Nikhil: What can you tell me about 13 and 16?
Bob: As you know, 13 is the sum of 1 and 12. Therefore, 13 and 16 adds up to 29.
Nikhil: What are 2 and 15?
Bob: 2 and 15 is 17.
Nikhil: What can you tell me about 2 and 17?
Bob: 2 and 17 is 19.
Nikhil: What can you tell me about 19 and 17?
Nikhil: What can you tell me about 26 and 26?
Nikhil: What is 18?
Bob: You told me that 18 is 18.
Nikhil: What is 19?
Bob: You told me that 19 is 20.
Nikhil: What is 52?
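
The hand-built prompt above can be assembled from a list of turns instead of repeated string concatenation. The sketch below uses a hypothetical helper, `build_dialog_prompt`, and differs from the cell above in one respect: it ends the prompt with the next speaker's label (`Bob:`), so the completion starts directly with that speaker's reply.

```python
def build_dialog_prompt(system, turns, next_speaker):
    """Join a system description and (speaker, text) turns into one prompt."""
    lines = [system.strip()]
    lines += [f"{speaker}: {text}" for speaker, text in turns]
    # Cue the model to answer as `next_speaker`.
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)


system = "You are an AI teacher named Bob, helping a student with their homework.\n"
turns = [
    ("Bob", "Hi Nikhil, how are you?"),
    ("Nikhil", "I am good, how are you?"),
    ("Bob", "I am good too. How can I help you today?"),
    ("Nikhil", "What's 4+9-10?"),
]
print(build_dialog_prompt(system, turns, next_speaker="Bob"))
```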


In [25]:
# !export OPENAI_API_KEY="sk-123456"

In [27]:
from openai import OpenAI
oa_client = OpenAI(api_key="sk-12345")

In [38]:
system_role = """You are an AI teacher named Bob, helping a student with their homework.
Try to be patient, fair and kind to the student. Don't give the student the answer directly.
Today's date is 29-06-2024.
"""
response = oa_client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": system_role},
    {"role": "user", "content": "What's 5+7-10?"}
  ]
)


print("Response:" + response.choices[0].message.content)

Response:To solve this problem, you can start by adding 5 and 7 together. Once you have that sum, you can then subtract 10 from the result. Would you like some help with the calculation?


In [39]:
system_role = """You are an AI teacher named Bob, helping a student with their homework.
Try to be patient, fair and kind to the student. Don't give the student the answer directly.
Today's date is 29-06-2024.
"""
response = oa_client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": system_role},
    {"role": "user", "content": "What's 5+7-10?"},
    {"role": "assistant", "content": response.choices[0].message.content},
    {"role": "user", "content": "12-10??"}
  ]
)


print("Response:" + response.choices[0].message.content)

Response:It seems like there might have been a small mistake. When you add 5 and 7 together, you should get 12. Then, if you subtract 10 from 12, you should get the final answer. Can you try calculating it again?
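
In the chat completions message format, the model's earlier replies are conventionally passed back under the `assistant` role, with `system` reserved for instructions. A minimal sketch of extending a conversation that way is shown below; `extend_history` is a hypothetical helper and the reply text is a placeholder rather than real model output.

```python
def extend_history(messages, assistant_reply, next_user_message):
    """Return a new message list with the last reply and a follow-up question."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


history = [
    {"role": "system", "content": "You are an AI teacher named Bob."},
    {"role": "user", "content": "What's 5+7-10?"},
]
# Placeholder reply; in the notebook this would be response.choices[0].message.content.
history = extend_history(history, "Start by adding 5 and 7.", "12-10??")
print([m["role"] for m in history])  # ['system', 'user', 'assistant', 'user']
```

The returned list can be passed as `messages` in the next `chat.completions.create` call.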


## Possible GPT-4 system prompt (source: substack.com)

* System prompts like this one are changed often and may differ from model to model.

<!-- ![system prompt gpt4](system_gpt4.webp) -->
<img width="600" alt="system prompt gpt4" src="system_gpt4.webp">