In [None]:
import time

import lingua

# Getting Started

Documentation on how to interact with the large models is available [here](https://lingua-sdk.readthedocs.io/en/latest/getting_started.html). The SDK is on GitHub [here](https://github.com/VectorInstitute/lingua-sdk), and the underlying code is [here](https://github.com/VectorInstitute/lingua).

First, we connect to the service through which we'll interact with the LLMs and see which models are available to us.

In [None]:
# Establish a client connection to the Lingua service
client = lingua.Client(gateway_host="llm.cluster.local", gateway_port=3001)

Show all supported models

In [None]:
client.models

Show all model instances that are currently active

In [None]:
client.model_instances

Let's start by querying the OPT-175B model; we'll try other models below. First, get a handle to the model.

In [None]:
opt_model = client.load_model("OPT-175B")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while opt_model.state != "ACTIVE":
    time.sleep(1)
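
The loop above waits indefinitely if the model never reaches the "ACTIVE" state. A minimal sketch of a bounded wait follows. The helper name `wait_until_active` and its parameters are hypothetical (they are not part of the Lingua SDK); it only assumes that the state can be read through a zero-argument callable.

```python
import time


def wait_until_active(get_state, timeout_s=600.0, poll_interval_s=1.0):
    """Poll get_state() until it returns "ACTIVE" or the timeout elapses.

    get_state is any zero-argument callable that returns the current state
    string. This helper is illustrative and not part of the SDK.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"model still in state {state!r} after {timeout_s} s"
            )
        time.sleep(poll_interval_s)
```

With the handle from the cell above, this could be called as `wait_until_active(lambda: opt_model.state)`.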

Next, we configure how the model generates text. The important parameters are:

* `max_tokens`: sets the number of tokens the model generates before halting generation.
* `top_k`: Range: 0 to vocabulary size. At each generation step, this is the number of tokens to select from, with relative probabilities given by their likelihoods. Setting this to 1 is "greedy decoding." If `top_k` is set to zero, then nucleus sampling is used instead (i.e., `top_p` below).
* `top_p`: Range: 0.0-1.0, nucleus sampling. At each generation step, the tokens with the largest probabilities, adding up to `top_p`, are sampled from relative to their likelihoods.
* `rep_penalty`: Range: >= 1.0. This attempts to decrease the likelihood of tokens in a generation process if they have been generated before. A value of 1.0 means no penalty, and larger values increasingly penalize repeated tokens. 1.2 has been reported as a good default value.
* `temperature`: Range: >= 0.0. This value sharpens or flattens the softmax calculation that produces probabilities over the vocabulary. As temperature goes to zero, only the largest probabilities remain non-zero (approaching greedy decoding). As it approaches infinity, the distribution spreads out evenly over the vocabulary.
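
To make these parameters concrete, here is a small pure-Python sketch that applies each of them to a toy four-token vocabulary. It illustrates the general techniques only and is not how the Lingua service is implemented. All function names are hypothetical, and the repetition penalty shown is one common formulation (divide positive logits by the penalty, multiply negative ones), assumed here for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be > 0 here."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_rep_penalty(logits, generated_ids, penalty=1.0):
    """Lower the scores of tokens that were already generated."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
sharp = softmax(logits, temperature=0.5)  # mass concentrates on token 0
flat = softmax(logits, temperature=5.0)   # mass spreads out
greedy = top_k_filter(softmax(logits), k=1)     # only token 0 survives
nucleus = top_p_filter(softmax(logits), p=0.8)  # tokens 0 and 1 survive
penalized = apply_rep_penalty(logits, generated_ids=[0], penalty=1.2)
```

Sampling then draws the next token from the surviving tokens according to their renormalized probabilities.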

In [None]:
# top_p is a probability mass, so it must lie in [0.0, 1.0].
generation_config = {"max_tokens": 5, "top_k": 4, "top_p": 1.0, "rep_penalty": 1.0, "temperature": 0.5}
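
Because each parameter has a documented range, it can help to check a configuration before sending it. The following is a minimal sketch under stated assumptions: `validate_generation_config` is a hypothetical helper (not part of the Lingua SDK), the ranges are the ones listed above, and the lower bound of 1 for `max_tokens` is an assumption.

```python
def validate_generation_config(config, vocab_size=None):
    """Return a list of problems; an empty list means no range is violated."""
    problems = []
    # Assumption: generating fewer than one token is never meaningful.
    if "max_tokens" in config and config["max_tokens"] < 1:
        problems.append("max_tokens must be at least 1")
    if "top_k" in config:
        k = config["top_k"]
        if k < 0 or (vocab_size is not None and k > vocab_size):
            problems.append("top_k must be between 0 and the vocabulary size")
    if "top_p" in config and not 0.0 <= config["top_p"] <= 1.0:
        problems.append("top_p must be between 0.0 and 1.0")
    if "rep_penalty" in config and config["rep_penalty"] < 1.0:
        problems.append("rep_penalty must be at least 1.0")
    if "temperature" in config and config["temperature"] < 0.0:
        problems.append("temperature must be non-negative")
    return problems


# A hypothetical config with an out-of-range top_p value.
example_config = {"max_tokens": 5, "top_k": 4, "top_p": 1.5, "rep_penalty": 1.0, "temperature": 0.5}
example_problems = validate_generation_config(example_config)
```

Here `example_problems` contains a single entry that flags `top_p`.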

Let's try a basic prompt