## Guided Generation Test
Original code is from here: https://github.com/dottxt-ai/outlines

There were some questionable things in this code, so I worked with Google Gemini
to clean up the code and explain it a little better.

According to Gemini the code was also out-of-date...

In [1]:
# Install the core libraries.
# - outlines: The structured generation library
# - transformers: The interface for downloading models
# - accelerate: Helper for managing GPU device placement
# - torch: The PyTorch backend that runs the model
!pip install -q outlines transformers accelerate torch


In [2]:
import torch
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

# TinyLlama is small (1.1B), standard, and widely supported.
# It doesn't require "trust_remote_code=True".
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {MODEL_NAME}...")

# Load the model weights
llm = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="cuda",          # Load directly to GPU
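    # Recent transformers releases rename this argument to `dtype` and warn
    # that `torch_dtype` is deprecated (see the warning in the cell output).
    torch_dtype=torch.float16,  # Use half-precision (standard for Llama models)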
    attn_implementation="eager" # Force standard math (Prevents "Flash Attention" crashes on T4)
)

# Load the tokenizer (converts text to numbers)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Wrap the model with Outlines
# This attaches the "Finite State Machine" engine that enforces your JSON schema.
guided_model = outlines.from_transformers(llm, tokenizer)

print("Model loaded and wrapped successfully.")


Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...


`torch_dtype` is deprecated! Use `dtype` instead!

Model loaded and wrapped successfully.


In [6]:
from enum import Enum
from pydantic import BaseModel
from typing import List

# 1. Define the Schema
# This acts as the "stencil" for the model. It cannot generate anything outside these rules.
class TicketPriority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    urgent = "urgent"

class ServiceTicket(BaseModel):
    priority: TicketPriority
    category: str
    requires_manager: bool
    summary: str
    action_items: List[str]

# 2. The Input Data
customer_email = """
Subject: Critical - Cannot access my account after payment

I paid for the premium plan 3 hours ago and still can't access any features.
I've tried logging out and back in multiple times. This is unacceptable as I
have a client presentation in an hour and need the analytics dashboard.
Please fix this immediately or refund my payment.
"""

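# 3. The Prompt
# TinyLlama's chat model expects this specific turn format.
# The system turn only describes the extraction task; the JSON structure
# is enforced by Outlines during generation, not requested in the prompt.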
prompt = f"""<|system|>
You are a helpful assistant. Extract the support ticket details from the user email.
</s>
<|user|>
{customer_email}
</s>
<|assistant|>
"""

# 4. Execution
print("Analyzing email...")

# The magic happens here:
# The model tries to predict the next tokens, but 'outlines' filters out 
# any token that doesn't fit the ServiceTicket JSON structure.
ticket = guided_model(
    prompt,
    ServiceTicket,
    max_new_tokens=1024,
    repetition_penalty=1.15 
)

# guided_model returns a JSON string; parse it into a ServiceTicket instance.
ticket_json = ServiceTicket.model_validate_json(ticket)

# 5. Output
print("\n--- TICKET CREATED ---")
print(f"Priority:      {ticket_json.priority.value.upper()}")
print(f"Category:      {ticket_json.category}")
print(f"Needs Manager: {ticket_json.requires_manager}")
print(f"Summary:       {ticket_json.summary}")
print("Actions:")
for item in ticket_json.action_items:
    print(f"- {item}")

Analyzing email...

--- TICKET CREATED ---
Priority:      URGENT
Category:      support
Needs Manager: True
Summary:       Repeatedly unable to access account after payment
Actions:
- critical
- cannotaccessaccountafterpayment
- unavailablefeatures
- unresolvedissue
- urgentissues
- urgentlyneedsattention
- urgentissues
- urgentissueswithoutsolution
- urgentissueswithouttimeline
- urgentissueswithin1day
- urgentissueswithin5days
- urgentissueswithin7days
- urgentissueswithin1week
- urgentissueswithin1month
- urgentissueswithin1year
- urgentissueswithin1hour
- urgentissueswithin1minute
- urgentissueswithin1second
- urgentissueswithin1minuteand1second
- urgentissueswithin1minuteand3seconds
- urgentissueswithin1minuteand5seconds
- urgentissueswithin1minuteand7seconds
- urgentissueswithin1minuteand9seconds
- urgentissueswithin1minuteand11seconds
- urgentissueswithin1minuteand13seconds
- urgentissueswithin1minuteand15seconds
- urgentissueswithin1minuteand17seconds
- urgentissueswithin1minut

Gemini's explanation for the latest weird output:

This is a classic example of what we call a "Semantic Repetition Loop" facilitated by "Syntactic Enabling."
It is hilarious to look at, but it reveals a fundamental limitation in how Outlines interacts with smaller language models like TinyLlama.
Here is the breakdown of why the "Guardrails" (Outlines) failed to stop the "Drunk Driver" (TinyLlama), and actually helped it stay in the loop.
### 1. The Cause: "The Loop" (TinyLlama's Fault)

TinyLlama (1.1B) is a very small model. When a small model is uncertain, or once it has produced a repetitive token pattern (like `urgentissueswithin...`), that pattern in its own context makes the same continuation look even more likely, and generation gets "stuck."

- The model generated `urgentissueswithin1minute`.
- It looked back at its own context, saw that pattern, and decided the most probable next token was `and`.
- Then it saw a number.

It fell into a rhythmic trap. To the model, this wasn't garbage; it was a high-probability sequence.
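
The "rhythmic trap" can be illustrated with a toy greedy decoder. This is a minimal sketch, not TinyLlama: the `NEXT_TOKEN_PROBS` table is a made-up bigram model invented for illustration, and `greedy_decode` is a hypothetical helper. It also ignores sampling settings such as `repetition_penalty`.

```python
# Toy illustration only: a hand-written bigram table, not a real language model.
# Each entry maps a token to the (invented) probabilities of the next token.
NEXT_TOKEN_PROBS = {
    "urgent": {"issues": 0.9, "<end>": 0.1},
    "issues": {"within": 0.8, "<end>": 0.2},
    "within": {"1minute": 0.7, "<end>": 0.3},
    "1minute": {"and": 0.6, "<end>": 0.4},
    "and": {"urgent": 0.9, "<end>": 0.1},
}

def greedy_decode(start, max_tokens):
    """Always pick the single most probable next token (greedy decoding)."""
    tokens = [start]
    while len(tokens) < max_tokens:
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break
        tokens.append(best)
    return tokens

output = greedy_decode("urgent", max_tokens=12)
print(" ".join(output))
```

In this toy table, stopping is never the most probable choice, so the cycle only ends when the token budget runs out. That would be consistent with the truncated last item in the output above, although the notebook does not confirm why the real run stopped.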
### 2. The Enabler: "The Valid Container" (Outlines' Fault)

This is the specific interaction you asked about. You might wonder: why didn't Outlines stop this?

Outlines enforces syntax, not sanity. Look at your schema: `action_items: List[str]`. Outlines converted this into a rule that looks roughly like this:

"The output must start with `[`. Then it must contain a string. Then a comma `,`. Then another string. Repeat until `]`."

As the model started spewing `"urgentissueswithin1minute"`, Outlines checked its rules:

- Is `"urgentissueswithin1minute"` a string? Yes.
- Is it valid JSON? Yes.
- Does the schema limit the number of items in the list? No.

So, Outlines effectively said: "This is perfectly valid data according to the rules you gave me. Keep going!"

The library forced the model to stay on the road (a valid JSON list), but it didn't stop the model from driving 500 miles past the destination.
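
A rough way to see "syntax, not sanity" is to compare an unbounded list pattern with a bounded one. This is a simplified stand-in written with Python's `re` module; the pattern Outlines actually compiles from a schema is more involved, and the names `UNBOUNDED_LIST`, `BOUNDED_LIST`, and `make_list` are invented for this sketch.

```python
import json
import re

# Simplified JSON string: quotes around any characters except quotes/backslashes.
STRING = r'"[^"\\]*"'

# Unbounded list: one string, then any number of ", string" repetitions.
UNBOUNDED_LIST = re.compile(rf'\[{STRING}(?:, {STRING})*\]')

# Bounded list: one string, then at most two more (1 to 3 items in total).
BOUNDED_LIST = re.compile(rf'\[{STRING}(?:, {STRING}){{0,2}}\]')

def make_list(n):
    """Serialize n repetitive items, similar in spirit to the looping output."""
    return json.dumps([f"urgentissueswithin{i}seconds" for i in range(n)])

short_list = make_list(3)
runaway_list = make_list(500)

print(bool(UNBOUNDED_LIST.fullmatch(short_list)))    # unbounded pattern, 3 items
print(bool(UNBOUNDED_LIST.fullmatch(runaway_list)))  # unbounded pattern, 500 items
print(bool(BOUNDED_LIST.fullmatch(short_list)))      # bounded pattern, 3 items
print(bool(BOUNDED_LIST.fullmatch(runaway_list)))    # bounded pattern, 500 items
```

The unbounded pattern accepts both lists, because nothing in it counts items. Only the bounded pattern rejects the 500-item list.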
### 3. The Solution: Constrain the Container

Since we can't make TinyLlama smarter (it is what it is), we must make the schema stricter. We need to close the loophole that allows unbounded lists.

Instead of `List[str]`, we should force the model to provide a fixed or capped number of items. This breaks the loop because Outlines will force the model to write `]` after the 3rd item, essentially "cutting the mic."
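
A minimal sketch of a stricter schema, assuming Pydantic v2 (the notebook already calls `model_validate_json`, a v2 method). The class name `BoundedServiceTicket` and the limits (1 to 3 items, 120 characters per item) are illustrative choices, not part of the original notebook. It also assumes that Outlines honors the `minItems`, `maxItems`, and `maxLength` keywords of the generated JSON schema, which is worth confirming against the installed version.

```python
from enum import Enum
from typing import Annotated, List

from pydantic import BaseModel, Field, ValidationError

class TicketPriority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    urgent = "urgent"

# Hypothetical limit: each action item is a short string.
ActionItem = Annotated[str, Field(max_length=120)]

class BoundedServiceTicket(BaseModel):
    priority: TicketPriority
    category: str
    requires_manager: bool
    summary: str
    # Between 1 and 3 items; exported as minItems/maxItems in the JSON schema.
    action_items: List[ActionItem] = Field(min_length=1, max_length=3)

schema = BoundedServiceTicket.model_json_schema()
print(schema["properties"]["action_items"])

# Independently of generation, Pydantic itself rejects a list that is too long.
try:
    BoundedServiceTicket(
        priority="urgent",
        category="support",
        requires_manager=True,
        summary="Cannot access account after payment",
        action_items=["a", "b", "c", "d"],
    )
except ValidationError as err:
    print(err.errors()[0]["type"])
```

Passing `BoundedServiceTicket` instead of `ServiceTicket` to `guided_model(...)` would be the corresponding change in the notebook. Capping the string length as well guards against a single item that loops internally, which the list cap alone would not prevent.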