> ℹ️ Adapted from the Guidance documentation https://guidance.readthedocs.io

We will be learning about constrained decoding using the `guidance` library.

In [1]:
import guidance

`guidance` allows us to load models from several different sources,
such as Hugging Face Transformers, OpenAI, and LlamaCpp.

In [7]:
# Mistral download link:
# https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q8_0.gguf
# Put this model in the current directory.
# model = guidance.models.LlamaCpp("mistral-7b-instruct-v0.2.Q8_0.gguf", n_gpu_layers=-1, n_ctx=4096)
model = guidance.models.Transformers("meta-llama/Llama-3.2-1b") # Llama-3.2-1B (gated access)
# model = guidance.models.Transformers("gpt2") # Use this one if you don't have resources to download/run a big model

In `guidance` we use the `+` operator to give the model a prompt.

In [8]:
llm = model + "Taylor Swift is"

# Temperature sampling and the diversity-coherence tradeoff

We can then generate from the model using the `gen` function.
You can pass decoding parameters like `temperature` here. 

> 📝 The library does not support other truncation sampling parameters like top-p or top-k yet, but top-p appears to be on the roadmap.

In [None]:
llm + guidance.gen(max_tokens=100, temperature=1.0)

Compare the above generation with the greedy generation (temperature 0) below:

In [10]:
llm + guidance.gen(max_tokens=100, temperature=0)

This illustrates the diversity-coherence trade-off. Whereas the first generation likely drifted into an incoherent ramble, the second is likely highly repetitive.
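To see why temperature has this effect, here is a minimal sketch of temperature-scaled softmax in plain Python. It is not how `guidance` implements sampling internally; the function name and the example logits are hypothetical, and treating temperature 0 as a pure argmax is an assumption that mirrors the usual meaning of greedy decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the argmax).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token (more repetitive, more predictable text), while higher temperatures flatten the distribution (more diverse, less coherent text).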

# Forcing QA behavior in base LLMs
LLM developers often use post-training (instruction fine-tuning and reward modeling) in order to elicit chatbot-like behavior from LLMs.
On the other hand, base LLMs often struggle with tasks like question answering.
Nevertheless, we can use `guidance` to force base language models to adhere to a QA template.

In [12]:
query = "Who won the last Kentucky derby and by how much?"
lm = model + f'''\
Q: {query}
A: {guidance.gen(name="answer", stop=["Q:", "A:"], temperature=0.8, max_tokens=100)}'''

You can use the `name` keyword argument to capture the generation and retrieve it afterwards by indexing the result with that name.

In [13]:
lm["answer"]

'The Kentucky Derby returns to the small towns of Lexington and Jefferson this summer. The Kentucky Derby was held in April of 1999 in Nashua, New Hampshire, and the last two years are seen as the best two-year run of the NMD.\n'
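The `stop` argument is what keeps the model from inventing further `Q:`/`A:` turns. As a rough illustration of its effect on the captured text, the sketch below truncates a string at the earliest stop string. This is only a post-hoc approximation with a hypothetical helper name and made-up input; the library itself halts generation as it goes rather than trimming afterwards.

```python
def truncate_at_stop(text, stop_strings):
    """Return text up to (not including) the earliest occurrence of any stop string."""
    cut = len(text)
    for s in stop_strings:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Hypothetical raw continuation from a base model that keeps writing new turns.
raw = " Nobody knows.\nQ: Who won in 1999?\nA: Unknown."
answer = truncate_at_stop(raw, ["Q:", "A:"])
print(repr(answer))
```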

# Enforcing valid JSON outputs
When integrating LLMs into larger systems it is often desirable to obtain *structured* outputs in the form of JSON objects. For instance, if generating characters for a game, one might want a JSON object containing the character information.

In [14]:
import json
import jsonschema

In [15]:
character_schema = """{
    "type": "object",
    "properties": {
        "description" : { "type" : "string" },
        "name" : { "type" : "string" },
        "age" : { "type" : "integer" },
        "armour" : { "type" : "string", "enum" : ["leather", "chainmail", "plate"] },
        "weapon" : { "type" : "string", "enum" : ["sword", "axe", "mace", "spear", "bow", "crossbow"] },
        "class" : { "type" : "string" },
        "mantra" : { "type" : "string" },
        "strength" : { "type" : "integer" },
        "quest_items" : { "type" : "array", "items" : { "type" : "string" } }
    }
}
"""
character_schema_obj = json.loads(character_schema)

Without constraints, language models can struggle to adhere to JSON 
schemas, even with prompting.

In [17]:
llm = model + (
    "Character descriptions follow this JSON schema:\n\n"
    f"{character_schema}\n\n"
    "Here is a JSON for a character with the description"
    ' "A quick and nimble fighter":'
) + guidance.gen(max_tokens=50, name="json output")
try:
    json.loads(llm["json output"])
except json.JSONDecodeError:
    print("Failed: Invalid JSON")

Failed: Invalid JSON
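Parsing is only the first hurdle: an output can be valid JSON and still violate the schema. A full validator such as the `jsonschema` package covers the whole specification. The sketch below is a deliberately tiny, standard-library-only stand-in that handles just the `type`, `enum`, `items`, and `properties` keywords used in the character schema. The helper name and the example objects are hypothetical.

```python
import json

# Maps JSON Schema type names to Python types (only the ones used in this notebook).
PY_TYPES = {"string": str, "integer": int, "array": list, "object": dict}

def check_flat_schema(instance, schema):
    """Return a list of violations for a small subset of JSON Schema keywords."""
    errors = []

    def check(value, rule, path):
        expected = PY_TYPES[rule["type"]]
        # bool is a subclass of int in Python, but it is not a JSON integer.
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(f"{path}: expected {rule['type']}")
            return
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{path}: {value!r} not in enum")
        if rule["type"] == "array" and "items" in rule:
            for i, item in enumerate(value):
                check(item, rule["items"], f"{path}[{i}]")
        if rule["type"] == "object":
            for key, sub in rule.get("properties", {}).items():
                if key in value:
                    check(value[key], sub, f"{path}.{key}")

    check(instance, schema, "$")
    return errors

schema = json.loads(
    '{"type": "object", "properties": {'
    '"age": {"type": "integer"}, '
    '"weapon": {"type": "string", "enum": ["sword", "bow"]}}}'
)
good = json.loads('{"age": 30, "weapon": "bow"}')
bad = json.loads('{"age": "thirty", "weapon": "laser"}')
print(check_flat_schema(good, schema))
print(check_flat_schema(bad, schema))
```

Both example objects parse as JSON, but only the first satisfies the schema, which is the gap that constrained generation is meant to close.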


The `guidance` library supports template-based enforcement of JSON schemas for this purpose.

In [25]:
model + "A quick and nimble fighter\n" + guidance.json(schema=character_schema_obj, name="next character", temperature=1.0)

Because generation is constrained at every step, the resulting output conforms to the schema by construction rather than by luck.
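The core idea behind this kind of constrained decoding can be sketched with a toy example: at each step, choices that would make the output invalid are masked out, and the model picks only among the rest. The sketch below works character by character on an `enum` constraint with a hypothetical scoring function standing in for the model. It makes no claim about how `guidance` implements this, and real systems operate on tokens and full grammars.

```python
def constrained_greedy(score, allowed, max_steps=20):
    """Greedily build a string, only choosing characters that keep the
    output a prefix of at least one allowed string."""
    out = ""
    for _ in range(max_steps):
        if out in allowed:
            return out  # a complete allowed value has been produced
        # Characters that extend `out` toward at least one allowed string.
        valid = {s[len(out)] for s in allowed
                 if s.startswith(out) and len(s) > len(out)}
        # Masking step: take the best-scoring character among the valid ones only.
        out += max(sorted(valid), key=lambda ch: score(out, ch))
    return out

def toy_score(prefix, ch):
    # Hypothetical stand-in for model logits: this "model" prefers late letters.
    return ord(ch)

allowed = ["leather", "chainmail", "plate"]
result = constrained_greedy(toy_score, allowed)
print(result)
```

However odd the scoring function's preferences are, the result is always one of the allowed values, because invalid continuations are never eligible in the first place.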