## Guided Generation Test
Original code is from here: https://github.com/dottxt-ai/outlines

Some parts of that code looked questionable, so I worked with Google Gemini
to clean it up and explain it a little better.

According to Gemini, the code was also out of date.

In [1]:
# Install the core libraries.
# - outlines: The structured generation library
# - transformers: The interface for downloading and running models
# - accelerate: Helper for managing GPU device placement
# - torch: The tensor backend the model runs on
!pip install -q outlines transformers accelerate torch
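
Since the original code was reported to be out of date, it can help to record which versions the notebook actually ran against. The `outlines.from_transformers` call used further down belongs, as far as I know, to the newer Outlines API, so an older install may not have it. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of any of these packages:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the libraries installed by the cell above.
for name, version in installed_versions(
    ["outlines", "transformers", "accelerate", "torch"]
).items():
    print(f"{name}: {version or 'not installed'}")
```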


In [2]:
import torch
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

# TinyLlama is small (1.1B), standard, and widely supported.
# It doesn't require "trust_remote_code=True".
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {MODEL_NAME}...")

# Load the model weights
llm = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="cuda",          # Load directly onto the GPU
    torch_dtype=torch.float16,  # Half precision; recent transformers releases rename this argument to `dtype`
    attn_implementation="eager" # Use the plain attention implementation (avoids "Flash Attention" crashes on a T4)
)

# Load the tokenizer (converts text to numbers)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Wrap the model with Outlines
# This attaches the "Finite State Machine" engine that enforces your JSON schema.
guided_model = outlines.from_transformers(llm, tokenizer)

print("Model loaded and wrapped successfully.")


Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...


`torch_dtype` is deprecated! Use `dtype` instead!

Model loaded and wrapped successfully.
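
The comment in the cell above mentions a finite state machine that enforces the schema. As rough intuition only, the idea can be sketched with a toy decoder: at each step, tokens that would leave the set of valid outputs are masked out before the highest-scoring one is picked. This is not how Outlines is implemented internally; `ALLOWED`, `VOCAB`, `fake_scores`, and the other names are made up for illustration, and the "model" is a stand-in scoring function.

```python
# Toy illustration of constrained decoding. Not the Outlines implementation.
ALLOWED = ["low", "medium", "high", "urgent"]  # the only valid outputs
VOCAB = ["l", "o", "w", "m", "e", "d", "i", "u", "h", "g", "r", "n", "t", "x", "<eos>"]


def allowed_next_tokens(prefix):
    """Return the tokens that keep `prefix` on a path to some allowed string."""
    permitted = set()
    for word in ALLOWED:
        if word.startswith(prefix):
            if len(word) == len(prefix):
                permitted.add("<eos>")  # the word is complete, so stopping is valid
            else:
                permitted.add(word[len(prefix)])
    return permitted


def fake_scores(prefix):
    """Stand-in for model logits: scores 'x' highest, others by character code."""
    return {tok: (1000.0 if tok == "x" else float(ord(tok[0]))) for tok in VOCAB}


def constrained_greedy_decode():
    """Greedy decoding where invalid tokens are filtered out at every step."""
    prefix = ""
    while True:
        permitted = allowed_next_tokens(prefix)
        scores = fake_scores(prefix)
        best = max((t for t in VOCAB if t in permitted), key=lambda t: scores[t])
        if best == "<eos>":
            return prefix
        prefix += best


# Without the mask, the stand-in model would pick the invalid token "x".
unconstrained_first = max(VOCAB, key=lambda t: fake_scores("")[t])
print("unconstrained first token:", unconstrained_first)
print("constrained result:", constrained_greedy_decode())
```

The mask guarantees that every emitted token is valid, but it does not decide when the output ends; that still depends on the model and on the token budget.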


In [5]:
from enum import Enum
from pydantic import BaseModel
from typing import List

# 1. Define the Schema
# This acts as the "stencil" for the model. It cannot generate anything outside these rules.
class TicketPriority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    urgent = "urgent"

class ServiceTicket(BaseModel):
    priority: TicketPriority
    category: str
    requires_manager: bool
    summary: str
    action_items: List[str]

# 2. The Input Data
customer_email = """
Subject: Critical - Cannot access my account after payment

I paid for the premium plan 3 hours ago and still can't access any features.
I've tried logging out and back in multiple times. This is unacceptable as I
have a client presentation in an hour and need the analytics dashboard.
Please fix this immediately or refund my payment.
"""

# 3. The Prompt
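# TinyLlama's chat model expects this specific <|system|> / <|user|> / <|assistant|> format.
# The system message only asks for the ticket details; the JSON structure itself
# is enforced by Outlines, not by the prompt.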
prompt = f"""<|system|>
You are a helpful assistant. Extract the support ticket details from the user email.
</s>
<|user|>
{customer_email}
</s>
<|assistant|>
"""

# 4. Execution
print("Analyzing email...")

# The magic happens here:
# The model tries to predict the next tokens, but 'outlines' filters out
# any token that doesn't fit the ServiceTicket JSON structure.
# Note: max_new_tokens caps the output length. If the cap is reached before the
# JSON object is closed, the returned string is incomplete and validation fails.
raw_ticket = guided_model(
    prompt,
    ServiceTicket,
    max_new_tokens=512
)

# Parse the raw JSON string into a typed ServiceTicket instance.
ticket = ServiceTicket.model_validate_json(raw_ticket)

# 5. Output
print("\n--- TICKET CREATED ---")
print(f"Priority:      {ticket.priority.value.upper()}")
print(f"Category:      {ticket.category}")
print(f"Needs Manager: {ticket.requires_manager}")
print(f"Summary:       {ticket.summary}")
print("Actions:")
for item in ticket.action_items:
    print(f"- {item}")

Analyzing email...


ValidationError: 1 validation error for ServiceTicket
  Invalid JSON: EOF while parsing a string at line 1 column 2287 [type=json_invalid, input_value='{"priority": "urgent", "... ticket from the system', input_type=str]
    For further information visit https://errors.pydantic.dev/2.12/v/json_invalid
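
The error reports an unterminated string at column 2287, and the quoted input ends mid-sentence without a closing quote. That suggests generation hit `max_new_tokens=512` before the model closed the JSON object: constrained decoding keeps each emitted token valid, but it cannot finish the object once the token budget runs out. A minimal stdlib sketch for checking whether raw output is complete JSON before handing it to Pydantic; `try_parse_json` is a hypothetical helper name, not part of Outlines or Pydantic:

```python
import json


def try_parse_json(raw):
    """Return (data, None) if `raw` is complete JSON, else (None, reason)."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as err:
        return None, f"{err.msg} (char {err.pos} of {len(raw)})"


# A cut-off object shaped like the one in the traceback: the last string never closes.
truncated = '{"priority": "urgent", "summary": "ticket from the system'
data, reason = try_parse_json(truncated)
if data is None:
    print(f"Incomplete output, possibly truncated: {reason}")

# A complete object parses normally.
complete = '{"priority": "urgent", "summary": "Cannot access account"}'
data, reason = try_parse_json(complete)
print(data["priority"])
```

Possible mitigations, untested here: raise `max_new_tokens`, or bound the free-text fields in the schema (for example with Pydantic's `Field(max_length=...)`) so that strings are forced to close. Whether Outlines honors those length constraints may depend on the installed version.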