## TurboML LLM Tutorial
TurboML can spin up LLM servers that expose an OpenAI-compatible API. We currently support models
in the GGUF format, as well as non-GGUF models that can be converted to GGUF. In the latter
case, you choose the quantization type to use.

## Set up the environment and install TurboML's SDK
We use `turboml-installer` to set up the environment for TurboML's SDK.

In [3]:
!pip install -q turboml-installer
import turboml_installer ; turboml_installer.install_on_colab()

📦 Installing...
🩹 Patching environment...
⏲ Done in 0:02:09
🔁 Restarting kernel...


The kernel should now be restarted with TurboML's SDK installed.

## Log in to your TurboML instance

You can replace this snippet with the one shown on your TurboML homepage.

In [2]:
import turboml as tb
tb.init(
  backend_url="http://34.93.248.93:8500",
  api_key="tb_Cn9wlAPvO09Qsak7rVfYqivOxdh9Tz88_17ef9683"
)


In [None]:
LlamaServerRequest = tb.llm.LlamaServerRequest
HuggingFaceSpec = LlamaServerRequest.HuggingFaceSpec
ServerParams = LlamaServerRequest.ServerParams

## Choose a model
Let's use a Llama 3.2 quant already in the GGUF format.

In [1]:
hf_spec = HuggingFaceSpec(
    hf_repo_id="hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
    select_gguf_file="llama-3.2-1b-instruct-q4_k_m.gguf",
)


## Spawn a server
Spawning a server returns a `server_id`, which you use to reference the server later, and a `server_relative_url`,
which you use to reach it. The call is synchronous, so it can take a while to return while we retrieve (and, if needed, convert) your model.

In [None]:
response = tb.llm.spawn_llm_server(
    LlamaServerRequest(
        source_type=LlamaServerRequest.SourceType.HUGGINGFACE,
        hf_spec=hf_spec,
        server_params=ServerParams(
            threads=-1,
            seed=-1,
            context_size=0,
            flash_attention=False,
        ),
    )
)
response

In [None]:
server_id = response.server_id
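
Because `server_relative_url` is relative to the backend address, the two have to be joined before a client can use them. The following is a minimal sketch of a slash-safe join. The helper name `join_server_url` and the example values are hypothetical, and whether TurboML's values carry leading or trailing slashes is an assumption, not something this tutorial states.

```python
def join_server_url(base_url: str, relative_url: str) -> str:
    """Join a backend address and a server-relative path with exactly one slash."""
    return f"{base_url.rstrip('/')}/{relative_url.lstrip('/')}"


# Placeholder values for illustration only; real values would come from your
# TurboML configuration and from `response.server_relative_url`.
print(join_server_url("http://localhost:8500/", "/llm/server-1"))
```

Stripping the slashes on both sides keeps the result stable whichever form the inputs take.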

### Interacting with the LLM

Our LLM is exposed through an OpenAI-compatible API, so we can use the OpenAI SDK, or any
other compatible tool, to interact with it.

In [None]:
%pip install openai

In [None]:
from openai import OpenAI

base_url = tb.common.env.CONFIG.TURBOML_BACKEND_SERVER_ADDRESS
server_url = f"{base_url}/{response.server_relative_url}"

client = OpenAI(base_url=server_url, api_key="-")


# Use a separate name so the spawn `response` from earlier is not overwritten.
chat_response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Hello there how are you doing today?",
        }
    ],
    model="-",
)

print(chat_response)
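
Printing the whole completion object shows every field at once. In the OpenAI chat completion schema, the generated text sits at `choices[0].message.content`. The sketch below pulls out just that field. The helper name `first_reply_text` is hypothetical, and the example uses a stand-in object with the same attribute layout so that it runs without a live server.

```python
from types import SimpleNamespace


def first_reply_text(completion) -> str:
    """Return the assistant text of the first choice in a chat completion."""
    return completion.choices[0].message.content


# Stand-in with the same attribute layout as a chat completion object;
# with a running server you would pass the object returned by
# `client.chat.completions.create(...)` instead.
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(role="assistant", content="Doing well, thanks!")
        )
    ]
)
print(first_reply_text(fake_completion))
```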

In [None]:
embeddings = (
    client.embeddings.create(input=["Hello there how are you doing today?"], model="-")
    .data[0]
    .embedding
)
len(embeddings), embeddings[:5]
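
Embedding vectors like the one above are usually compared rather than inspected directly, and cosine similarity is a common choice for that. The following is a minimal pure-Python sketch. The toy vectors stand in for real embeddings, and nothing here is specific to TurboML.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

With real embeddings, you would pass two `.embedding` lists returned by `client.embeddings.create`.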

## Stop the server

In [None]:
tb.llm.stop_llm_server(server_id)
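
If a cell fails between spawning and stopping, the server keeps running. One way to guard against that is to wrap the two calls in a context manager so the stop call always runs. This is only a sketch: `running_server` is a hypothetical helper, not part of TurboML's SDK, and it takes the spawn and stop functions as arguments so it can be exercised without a backend.

```python
from contextlib import contextmanager


@contextmanager
def running_server(spawn, stop, request):
    """Spawn a server, yield its response, and always stop it afterwards."""
    response = spawn(request)
    try:
        yield response
    finally:
        # Runs even if the body of the `with` block raises.
        stop(response.server_id)


# Intended use with the calls from this tutorial (requires a TurboML backend):
#
# with running_server(tb.llm.spawn_llm_server, tb.llm.stop_llm_server, request) as response:
#     ...  # talk to the server here
```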