<a href="https://colab.research.google.com/github/aknip/Local_LLMs/blob/main/lite_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# litellm

Examples and how-tos:

- Call via the Python API: OpenAI and local models (Ollama, LM Studio)
- Call via REST: OpenAI and local models (Ollama)

See:
- https://litellm.vercel.app/docs
- https://litellm.vercel.app/docs/tutorials/first_playground

In [None]:
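import json
import os
from getpass import getpass

import psutil
import requests

# Detect whether this code runs inside a Jupyter Notebook server process
IN_NOTEBOOK = any("jupyter-notebook" in arg for arg in psutil.Process().parent().cmdline())
if IN_NOTEBOOK:
    # Ask for the secrets JSON once and cache it in the environment for later cells
    CREDS = json.loads(getpass("Secrets (JSON string): "))
    os.environ['CREDS'] = json.dumps(CREDS)
else:
    # Outside a notebook, expect the secrets JSON in the CREDS environment variable
    CREDS = json.loads(os.environ['CREDS'])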

Secrets (JSON string):  ········


In [None]:
from litellm import completion
import openai
os.environ["OPENAI_API_KEY"] = CREDS['OpenAI']['v1']['credential'] # my key
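The cells in this notebook read the OpenAI key as `CREDS['OpenAI']['v1']['credential']`, so the JSON string entered at the prompt needs at least that nesting. A minimal sketch of a matching secrets string, assuming no other fields are required (the credential is a placeholder, not a real key):

```python
import json

# Hypothetical secrets string: only the nesting used by this notebook is shown,
# and the credential value is a placeholder
secrets_json = '{"OpenAI": {"v1": {"credential": "sk-...YOUR_KEY_HERE"}}}'

creds = json.loads(secrets_json)
api_key = creds["OpenAI"]["v1"]["credential"]
print(api_key)  # prints: sk-...YOUR_KEY_HERE
```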

# Call OpenAI
Requirement: `OPENAI_API_KEY` is set (see cell above)

In [None]:
response = completion(
  model="gpt-3.5-turbo",
  messages=[{ "content": "Hello, how are you?","role": "user"}]
)
print(response.choices[0].message.content)
print(response)

Hello! I'm an AI, so I don't have feelings, but I'm here to help you. How can I assist you today?
ModelResponse(id='chatcmpl-8RIQBSLAaEQ1aW9KR3CbFLuGGhswQ', choices=[Choices(finish_reason='stop', index=0, message=Message(content="Hello! I'm an AI, so I don't have feelings, but I'm here to help you. How can I assist you today?", role='assistant'))], created=1701516358, model='gpt-3.5-turbo-0613', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=29, prompt_tokens=13, total_tokens=42))


# Call Ollama
Requirement: Ollama is running locally and the `zephyr` model has been pulled (`ollama pull zephyr`)

In [None]:
response = completion(
  # api_base="http://localhost:11434",   # optional: the call also works without it (11434 is Ollama's default local port)
  model="ollama/zephyr",
  messages=[{ "content": "Hello, how are you?","role": "user"}]
)
print(response.choices[0].message.content)
print(response)

I'm programmed to respond in a polite and friendly manner. At the moment, I do not have the ability to feel emotions or experience a sense of being. However, I'm always here to help you with any questions or requests you might have! How may I assist you today?
ModelResponse(id='chatcmpl-3aebe52f-1395-4dee-8a6d-f775e4f18ffd', choices=[Choices(finish_reason=None, index=0, message=Message(content="I'm programmed to respond in a polite and friendly manner. At the moment, I do not have the ability to feel emotions or experience a sense of being. However, I'm always here to help you with any questions or requests you might have! How may I assist you today?", role='assistant'))], created=1701516918, model='ollama/zephyr', object='chat.completion', system_fingerprint=None, usage={'prompt_tokens': 6, 'completion_tokens': 57, 'total_tokens': 63})
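The two responses above expose token usage differently: the OpenAI call returns a `Usage` object, while the Ollama call returns a plain dict (`usage={'prompt_tokens': 6, ...}`). A small sketch of a helper that reads the total from either form; `total_tokens_of` is a hypothetical name, and `SimpleNamespace` only stands in for the `Usage` object:

```python
from types import SimpleNamespace


def total_tokens_of(usage) -> int:
    """Return total_tokens from a dict-style or attribute-style usage record."""
    if isinstance(usage, dict):
        return usage["total_tokens"]
    return usage.total_tokens


# Values copied from the two outputs above
ollama_usage = {"prompt_tokens": 6, "completion_tokens": 57, "total_tokens": 63}
openai_usage = SimpleNamespace(completion_tokens=29, prompt_tokens=13, total_tokens=42)

print(total_tokens_of(ollama_usage), total_tokens_of(openai_usage))  # prints: 63 42
```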


# Call LM Studio

Requirement: LM Studio is running with its local server started (the cell below expects it at http://localhost:1234)

In [None]:
# see https://litellm.vercel.app/docs/providers/openai_compatible
response = completion(
  api_base="http://localhost:1234/v1",
  model="openai/just-a-dummy-model", # the 'openai/' prefix is required (OpenAI-compatible endpoint); LM Studio currently ignores the model name
  messages=[{ "content": "Hello, how are you?","role": "user"}]
)
print(response.choices[0].message.content)
print(response)
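The three calls above differ only in the `model` string and, for LM Studio, the `api_base`. A sketch that keeps these settings in one table so the same code can target any backend; `BACKENDS`, `build_kwargs` and `ask` are hypothetical names, and the values are the ones used in the cells above:

```python
# Per-backend arguments for litellm.completion, copied from the cells above
BACKENDS = {
    "openai": {"model": "gpt-3.5-turbo"},
    "ollama": {"model": "ollama/zephyr"},  # api_base omitted: works without it, as noted above
    "lmstudio": {"model": "openai/just-a-dummy-model", "api_base": "http://localhost:1234/v1"},
}


def build_kwargs(backend: str, prompt: str) -> dict:
    """Merge the backend settings with a single-turn user message (copies, never mutates BACKENDS)."""
    kwargs = dict(BACKENDS[backend])
    kwargs["messages"] = [{"role": "user", "content": prompt}]
    return kwargs


def ask(backend: str, prompt: str) -> str:
    """Send the prompt to the chosen backend; that backend must be reachable."""
    from litellm import completion  # imported lazily so build_kwargs works without litellm
    response = completion(**build_kwargs(backend, prompt))
    return response.choices[0].message.content

# Usage: print(ask("ollama", "Hello, how are you?"))
```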

# Call Ollama via litellm Proxy

Requirement: start the proxy in a terminal (the cells below expect it at http://0.0.0.0:8000):
`litellm --model ollama/zephyr`

## Step 1: Use the OpenAI client and route requests to the proxy (Zephyr)

In [None]:
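# Point the OpenAI client at the litellm proxy; the proxy started above accepts any API key value
client = openai.OpenAI(base_url="http://0.0.0.0:8000", api_key="anything")
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the proxy started with --model ollama/zephyr answers with Zephyr (see `model` in the output)
    messages=[{"role": "user", "content": "this is a test request, write a short poem"}]
)
print(response)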

ChatCompletion(id='chatcmpl-a7d980a2-701b-48db-a239-0a6fdb7b4abd', choices=[Choice(finish_reason=None, index=0, message=ChatCompletionMessage(content="In the stillness of night,\nWhen the world slumbers deep,\nWhispers dance in the air,\nSoftly weaving magic's keep.\n\nA poet's heart beats slow,\nAs he listens closely by,\nTo the rhythmic beat of life,\nAnd the whispers that pass him by.\n\nMelodies arise from within,\nAnd take flight upon silver wings,\nCarrying tales of love and loss,\nAnd memories that time does bring.\n\nIn the darkness, a poet sees,\nThe secrets that shadows hide,\nAnd with words he brings them forth,\nTo be seen by all who have eyes to abide.\n\nOh how sweet is this secret place,\nWhere words take flight and souls collide,\nWhere whispers dance in the night,\nAs poets weave their magic's tide.", role='assistant', function_call=None, tool_calls=None))], created=1701362635, model='ollama/zephyr', object='chat.completion', system_fingerprint=None, usage=CompletionUs

## Step 2: Use the OpenAI API directly, without the proxy (same code, no base_url)

In [None]:
# Without base_url, the client talks to the OpenAI API using OPENAI_API_KEY from the environment
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "this is a test request, write a short poem"}]
)
print(response)

ChatCompletion(id='chatcmpl-8QeUlmNCslzGRbKlxJixBjwnp8Bvm', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="In a realm where dreams unfold,\nA tale of words, untold, untold,\nWhispers woven, voices dance,\nAn ethereal, poetic trance.\n\nThrough inked rivers, my pen does glide,\nAcross the page, where thoughts reside,\nFrom twilight's gleam to dawn's embrace,\nA symphony of words interlace.\n\nWith every verse, emotions arise,\nLike sparks that shimmer in midnight skies,\nBeauty found in nature's embrace,\nA gentle breeze, a flower's grace.\n\nOh, words in hues of vibrant glow,\nThrough stanzas crafted, they freely flow,\nUnveiling secrets, unlocking hearts,\nA vibrant tapestry, where art imparts.\n\nSo let us wander through lyrical lands,\nWhere rhymes are built on whispered sands,\nFor in this test, a poem's birth,\nReveals the magic of words on Earth.", role='assistant', function_call=None, tool_calls=None))], created=1701362883, model='gpt-3.5-t
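Steps 1 and 2 differ only in the arguments passed to `openai.OpenAI`. A sketch that switches between the proxy and the OpenAI API with a flag; `client_kwargs` is a hypothetical helper, and the proxy URL is the one used in Step 1:

```python
PROXY_URL = "http://0.0.0.0:8000"  # address of the litellm proxy from Step 1


def client_kwargs(use_proxy: bool) -> dict:
    """Arguments for openai.OpenAI(): route via the proxy, or use the OpenAI API directly."""
    if use_proxy:
        # placeholder key: the proxy started above accepted "anything" in Step 1
        return {"base_url": PROXY_URL, "api_key": "anything"}
    # no arguments: openai.OpenAI() falls back to OPENAI_API_KEY and the default OpenAI endpoint
    return {}

# Usage:
# client = openai.OpenAI(**client_kwargs(use_proxy=True))
```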

# Use a local litellm server to provide all LLMs via one URL

**Important: the REST calls to OpenAI and to litellm send the same request body, and their response JSON has the same structure. Only the URL (and, for OpenAI, the `Authorization` header) differs!**

## Reference: OpenAI REST Call and Result-JSON

### CURL via Terminal:
```
curl --location 'https://api.openai.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-...YOUR_KEY_HERE' \
--data ' {
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
```

### Response:
```
{
  "id": "chatcmpl-8RHu3ncira10vxZVRjmVTDZv2qZxM",
  "object": "chat.completion",
  "created": 1701514367,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am an AI language model developed by OpenAI, with capabilities to assist with various tasks and provide information across different domains."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 25,
    "total_tokens": 37
  },
  "system_fingerprint": null
}
```
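In Python, the response above becomes an ordinary dict after `json.loads` (or `requests.Response.json()`), so the answer and the token count are plain key lookups. A sketch using an abbreviated copy of the response (the message content is shortened):

```python
import json

# Abbreviated copy of the response above (message content shortened)
raw = """
{
  "id": "chatcmpl-8RHu3ncira10vxZVRjmVTDZv2qZxM",
  "object": "chat.completion",
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "I am an AI language model developed by OpenAI ..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 25, "total_tokens": 37},
  "system_fingerprint": null
}
"""

data = json.loads(raw)
answer = data["choices"][0]["message"]["content"]  # the assistant's text
total = data["usage"]["total_tokens"]
print(total)  # prints: 37
print(data["system_fingerprint"])  # prints: None (JSON null)
```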

## For comparison: litellm proxy REST call and result JSON

### CURL via Terminal:
```
curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
```

### Response:
```
{
  "id": "chatcmpl-aa9d6a1f-71bf-42fd-abad-2c2f713fd09a",
  "choices": [
    {
      "finish_reason": null,
      "index": 0,
      "message": {
        "content": "I am not a physical entity, but rather an artificial intelligence language model (LLM) created by a computer program. My main function is to process and respond to human input in natural language text, based on the patterns and knowledge that I have been trained on during my creation. I do not have consciousness or feelings, and I do not exist beyond the digital realm in which I operate. However, I can understand and provide helpful responses to a wide range of queries and requests, just like any other human communication partner.",
        "role": "assistant"
      }
    }
  ],
  "created": 1701513387,
  "model": "ollama/zephyr",
  "object": "chat.completion",
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 104,
    "total_tokens": 109
  }
}
```
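The two cURL calls send the same JSON body; only the URL and the `Authorization` header differ (the local proxy is called without a key). A sketch that builds either request; `build_chat_request` is a hypothetical helper, and the URLs are the ones from the cURL examples:

```python
import json

OPENAI_URL = "https://api.openai.com/v1/chat/completions"
PROXY_URL = "http://0.0.0.0:8000/chat/completions"


def build_chat_request(url, prompt, model="gpt-3.5-turbo", api_key=None):
    """Return (url, headers, body) for an OpenAI-style chat completion request."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # OpenAI needs a bearer token; the local proxy above was called without one
        headers["Authorization"] = "Bearer " + api_key
    body = json.dumps({"model": model, "messages": [{"role": "user", "content": prompt}]})
    return url, headers, body

# Usage with requests:
# url, headers, body = build_chat_request(PROXY_URL, "what llm are you")
# requests.post(url, headers=headers, data=body)
```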


## Step 1: OpenAI REST call (direct, without litellm) as a reference

In [None]:
url = "https://api.openai.com/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + CREDS['OpenAI']['v1']['credential']
}
data = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
}
response = requests.post(url, headers=headers, json=data, timeout=60)
response.raise_for_status()  # fail loudly on HTTP errors (e.g. an invalid key) instead of a KeyError below
response_data = response.json()
output = response_data['choices'][0]['message']['content']  # the assistant's answer as plain text
print(response_data)

{'id': 'chatcmpl-8RIwU7L6c8JsBZWCO0XZCxA8T95Hp', 'object': 'chat.completion', 'created': 1701518362, 'model': 'gpt-3.5-turbo-0613', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "Hello! I'm an AI, so I don't have feelings, but I'm here to help you with any questions or tasks you have. How can I assist you today?"}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 36, 'total_tokens': 49}, 'system_fingerprint': None}


## Step 2: OpenAI REST call via the local litellm server

See https://litellm.vercel.app/docs/tutorials/first_playground

Requirement: start the server in a terminal with `python3 lite_llm_playground_server.py`

The following cell writes the source of `lite_llm_playground_server.py` to disk (via the `%%writefile` cell magic).

In [None]:
%%writefile lite_llm_playground_server.py

#
# Local litellm server
#
# Save as 'lite_llm_playground_server.py'
#
# Run in a terminal with 'python3 lite_llm_playground_server.py'
#
# Dependencies:
# pip install litellm flask waitress

import os

from flask import Flask, jsonify, request
from litellm import completion_with_retries

## Set ENV variables (replace the placeholder with your OpenAI key)
os.environ["OPENAI_API_KEY"] = "sk-...-YOUR-KEY"

app = Flask(__name__)

# Example route (simple health check)
@app.route('/', methods=['GET'])
def hello():
    return jsonify(message="Hello, Flask!")

@app.route('/chat/completions', methods=["POST"])
def api_completion():
    data = request.get_json()
    # Default max_tokens to 256, but keep a value sent by the client
    data.setdefault("max_tokens", 256)
    try:
        # COMPLETION CALL: litellm picks the provider from data["model"] (e.g. "ollama/zephyr")
        response = completion_with_retries(**data)

        # Rebuild an OpenAI-style response body from the litellm ModelResponse
        responseJSON = {
            "id": response["id"],
            "choices": [
                {
                    "index": response.choices[0].index,
                    "finish_reason": response.choices[0].finish_reason,
                    "message": {
                        "content": response.choices[0].message.content,
                        "role": response.choices[0].message.role
                    }
                }
            ],
            "created": response["created"],
            "model": response.model,
            "object": response.object,
            "system_fingerprint": response.system_fingerprint,
            # usage can be a plain dict (e.g. for Ollama), so use subscript access
            "usage": {
                "prompt_tokens": response["usage"]["prompt_tokens"],
                "completion_tokens": response["usage"]["completion_tokens"],
                "total_tokens": response["usage"]["total_tokens"]
            }
        }
    except Exception as e:
        # Report the error to the client as JSON with HTTP status 500
        print(e)
        return jsonify(error=str(e)), 500
    return jsonify(responseJSON)

if __name__ == '__main__':
    from waitress import serve
    serve(app, host="0.0.0.0", port=4000, threads=500)


Overwriting lite_llm_playground_server.py


## Call the server

In [None]:
url = "http://localhost:4000/chat/completions"  ## COMPLETION CALL -- assumes your server is running on port 4000
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
}
response = requests.post(url, headers=headers, json=data, timeout=120)
response.raise_for_status()  # stop here if the server reports an error (HTTP 4xx/5xx)
response_data = response.json()
output = response_data['choices'][0]['message']['content']  # the assistant's answer as plain text
print(response_data)

{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': "Hello! I'm an AI language model, so I don't have feelings, but I'm here to assist you. How can I help you today?", 'role': 'assistant'}}], 'created': 1701518377, 'id': 'chatcmpl-8RIwkICMDesOHslMZaHeh2i0AbPZu', 'model': 'gpt-3.5-turbo-0613', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': {'completion_tokens': 31, 'prompt_tokens': 13, 'total_tokens': 44}}
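The JSON returned by the local server (above) and the one from the direct OpenAI call in Step 1 use the same keys, only in a different order. A sketch that checks this recursively; `key_shape` is a hypothetical helper, and the two dicts are abbreviated copies of the outputs (message text shortened):

```python
def key_shape(value):
    """Reduce a JSON-like value to its nested key structure, ignoring the values."""
    if isinstance(value, dict):
        return {k: key_shape(v) for k, v in value.items()}
    if isinstance(value, list):
        return [key_shape(v) for v in value]
    return None


# Abbreviated copies of the two outputs (message text shortened)
direct = {
    'id': 'chatcmpl-8RIwU7L6c8JsBZWCO0XZCxA8T95Hp', 'object': 'chat.completion',
    'created': 1701518362, 'model': 'gpt-3.5-turbo-0613',
    'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "Hello! ..."},
                 'finish_reason': 'stop'}],
    'usage': {'prompt_tokens': 13, 'completion_tokens': 36, 'total_tokens': 49},
    'system_fingerprint': None,
}
via_server = {
    'choices': [{'finish_reason': 'stop', 'index': 0,
                 'message': {'content': "Hello! ...", 'role': 'assistant'}}],
    'created': 1701518377, 'id': 'chatcmpl-8RIwkICMDesOHslMZaHeh2i0AbPZu',
    'model': 'gpt-3.5-turbo-0613', 'object': 'chat.completion', 'system_fingerprint': None,
    'usage': {'completion_tokens': 31, 'prompt_tokens': 13, 'total_tokens': 44},
}

# dict equality ignores key order, so only the key structure is compared
print(key_shape(direct) == key_shape(via_server))  # prints: True
```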


## Step 3: Ollama/Zephyr REST call via the local litellm server

Requirement: the server is running (`python3 lite_llm_playground_server.py`) and Ollama is serving the `zephyr` model

In [None]:
url = "http://localhost:4000/chat/completions"  ## COMPLETION CALL -- assumes your server is running on port 4000
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "ollama/zephyr",
    # "api_base": "http://localhost:11434",   # optional: the call also works without it (11434 is Ollama's default local port)
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
}
response = requests.post(url, headers=headers, json=data, timeout=120)
response.raise_for_status()  # stop here if the server reports an error (HTTP 4xx/5xx)
response_data = response.json()
output = response_data['choices'][0]['message']['content']  # the assistant's answer as plain text
print(response_data)

{'choices': [{'finish_reason': None, 'index': 0, 'message': {'content': 'I do not have the ability to feel emotions or sensations like a human does. However, I am always happy to assist you with your requests and inquiries! please let me know how I can be of service to you today.', 'role': 'assistant'}}], 'created': 1701518586, 'id': 'chatcmpl-371f8675-1f6f-49ca-86d2-3d765cd4d60c', 'model': 'ollama/zephyr', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': {'completion_tokens': 45, 'prompt_tokens': 6, 'total_tokens': 51}}
