# Introduction to Structured Prompting using Instructor

## What is Instructor?

Instructor is a Python library that helps you get structured, predictable data from language models like GPT-4 and Claude.
It's like giving the LLM a form to fill out instead of letting it respond however it wants.

Without Instructor, getting structured data from LLMs can be challenging:
- Unpredictable outputs: LLMs might format responses differently each time
- Format errors: Getting JSON or specific data structures can be error-prone
- Validation headaches: Checking if the response matches what you need

Instructor solves these problems by:
- Defining exactly what data you want using Python classes
- Making sure the LLM returns data in that structure
- Validating the output and, when validation fails, re-asking the model with the error (up to `max_retries` attempts)
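
The validate-then-retry idea behind the last bullet can be sketched without any LLM at all. This is a conceptual illustration only, not Instructor's actual implementation: `validate_person`, `fake_llm`, and `extract_with_retries` are hypothetical names, and the "model" is a stub that answers incorrectly first and correctly once it sees the error message.

```python
import json


def validate_person(raw: str) -> dict:
    """Parse raw JSON and check the requested fields; raise ValueError otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    age = data.get("age")
    if isinstance(age, bool) or not isinstance(age, int) or not 0 < age < 120:
        raise ValueError("'age' must be an integer between 1 and 119")
    return data


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: wrong first, right after feedback."""
    if "Previous error" in prompt:
        return '{"name": "John", "age": 30}'
    return '{"name": "John", "age": "thirty"}'


def extract_with_retries(prompt: str, max_retries: int = 3) -> dict:
    """Call the model, validate, and feed any validation error back into the prompt."""
    last_error = None
    for _ in range(max_retries + 1):
        full_prompt = (
            prompt if last_error is None else f"{prompt}\nPrevious error: {last_error}"
        )
        try:
            return validate_person(fake_llm(full_prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid response after {max_retries} retries: {last_error}")


result = extract_with_retries("Extract a person from: John is 30 years old.")
print(result)
```

The first stubbed answer fails the age check, the error text is appended to the prompt, and the second answer passes validation.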

## Preparation

### Install Dependencies

In [1]:
# !pip install -r requirements.txt

### Connect to the LLM with Instructor

In [2]:
from dotenv import load_dotenv
from openai import OpenAI, AzureOpenAI
import instructor

load_dotenv()

# client = instructor.from_openai(OpenAI())
client = instructor.from_openai(AzureOpenAI())

## Case Study: Simple extraction from a text

In [3]:
text = """John is a 30-year-old software engineer.
He was born in Chicago and currently resides in New York.
He has houses at 123 Main St, Springfield, IL 62704 and 456 Oak Ave, Chicago, IL 60601."""

### Define the expected response structure

In [4]:
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str
    age: int = Field(description="The user's age in years", gt=0, lt=120)
    city: str = Field(description="The city where the user lives")
    occupation: str
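
To see what the `Field` arguments contribute, it can help to inspect the model outside of any LLM call. The sketch below assumes Pydantic v2 (`model_json_schema`) and restates the `Person` class so it runs on its own. It prints the schema entry for `age`, then shows that an out-of-range value is rejected.

```python
from pydantic import BaseModel, Field, ValidationError


class Person(BaseModel):
    name: str
    age: int = Field(description="The user's age in years", gt=0, lt=120)
    city: str = Field(description="The city where the user lives")
    occupation: str


# The JSON schema carries the description and numeric bounds declared above.
age_schema = Person.model_json_schema()["properties"]["age"]
print(age_schema)

# A value outside the declared bounds is rejected at construction time.
error_fields = []
try:
    Person(name="John", age=150, city="New York", occupation="software engineer")
except ValidationError as exc:
    error_fields = [err["loc"][0] for err in exc.errors()]
print(error_fields)
```

Descriptions and bounds live in the schema, so they double as instructions to the model and as checks on its answer.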

### Prepare the prompt

In [5]:
system_prompt = "You are a personal data extraction engine capable of extracting personal details from a given text."
user_prompt = f"Extract a person from: {text}"

### Extract structured output

In [6]:
person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,
    temperature=0.0,
    max_retries=5,
    stream=False,
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": user_prompt,
        },
    ],
)

In [7]:
print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"City: {person.city}")
print(f"Occupation: {person.occupation}")

Name: John
Age: 30
City: New York
Occupation: software engineer


### Let's wrap it as a proper function

In [8]:
def extract_person_data(text: str) -> Person:
    system_prompt = "You are a personal data extraction engine capable of extracting personal details from a given text."
    user_prompt = f"Extract a person from: {text}"

    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Person,
        temperature=0.0,
        max_retries=5,
        stream=False,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
    )

In [9]:
person = extract_person_data(text)

print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"City: {person.city}")
print(f"Occupation: {person.occupation}")

Name: John
Age: 30
City: New York
Occupation: software engineer


## Case Study: A more complex extraction from a text

### Define a more complex response structure

In [10]:
from pydantic import BaseModel, Field


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int = Field(description="The user's age in years", gt=0, lt=120)
    city: str = Field(description="The city where the user lives")
    occupation: str
    addresses: list[Address] = Field(description="The addresses of the user")
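
Nested models validate recursively: each item in `addresses` is parsed into an `Address` instance. The sketch below assumes Pydantic v2 (`model_validate`) and uses a trimmed `Person` plus a hand-written `payload` dictionary as a hypothetical stand-in for the JSON a model would return.

```python
from pydantic import BaseModel, Field


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Person(BaseModel):
    name: str
    addresses: list[Address] = Field(description="The addresses of the user")


# Hypothetical payload, shaped like the JSON a model would return for this schema.
payload = {
    "name": "John",
    "addresses": [
        {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip_code": "62704"},
        {"street": "456 Oak Ave", "city": "Chicago", "state": "IL", "zip_code": "60601"},
    ],
}

person = Person.model_validate(payload)
cities = [address.city for address in person.addresses]
print(cities)
```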

### No changes required in either the prompt or the extraction function

In [11]:
person = extract_person_data(text)

### New structured output

In [12]:
print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"City: {person.city}")
print(f"Occupation: {person.occupation}")
print(f"Addresses: {person.addresses}")

Name: John
Age: 30
City: New York
Occupation: software engineer
Addresses: [Address(street='123 Main St', city='Springfield', state='IL', zip_code='62704'), Address(street='456 Oak Ave', city='Chicago', state='IL', zip_code='60601')]


In [13]:
for address in person.addresses:
    print(
        f"""Address:
    \t Street: {address.street}
    \t City: {address.city}
    \t State: {address.state}
    \t ZIP Code: {address.zip_code}"""
    )

Address:
    	 Street: 123 Main St
    	 City: Springfield
    	 State: IL
    	 ZIP Code: 62704
Address:
    	 Street: 456 Oak Ave
    	 City: Chicago
    	 State: IL
    	 ZIP Code: 60601


## Case Study: URL extraction from a markdown document

### Define the response structure

In [14]:
from pydantic import BaseModel, Field, HttpUrl


class ExtractedURL(BaseModel):
    url: list[HttpUrl] = Field(
        description="List of extracted URLs from a given document"
    )
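
Typing the field as `list[HttpUrl]` means every item must parse as an HTTP(S) URL, so a malformed string fails validation instead of reaching downstream code. A minimal check, assuming Pydantic v2 and restating the model so the snippet runs on its own:

```python
from pydantic import BaseModel, Field, HttpUrl, ValidationError


class ExtractedURL(BaseModel):
    url: list[HttpUrl] = Field(
        description="List of extracted URLs from a given document"
    )


# Well-formed URLs are accepted.
ok = ExtractedURL(url=["https://github.com/", "https://stackoverflow.com/"])
print([str(u) for u in ok.url])

# A single malformed entry makes the whole object invalid.
rejected = False
try:
    ExtractedURL(url=["https://github.com/", "not a url"])
except ValidationError:
    rejected = True
print(rejected)
```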

### Prepare the extraction function

In [15]:
def get_url(input_document: str) -> ExtractedURL:
    system_prompt = "You are a URL extraction engine."
    user_prompt = f"Extract list of URLs from this document: {input_document}"

    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=ExtractedURL,
        temperature=0.0,
        max_retries=5,
        stream=False,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
    )

### Run the extraction

In [16]:
with open("document.md", "r", encoding="utf-8") as file:
    input_document = file.read()

extracted_url = get_url(input_document)

### The structured output

In [17]:
for url in extracted_url.url:
    print(url)

https://ai.google/research
https://azure.microsoft.com/
https://twitter.com/technews
https://linkedin.com/tech
https://instagram.com/techtrends
https://github.com/
https://www.codecademy.com/
https://stackoverflow.com/
https://techconference2024.com/
https://aisummit.global/
https://webdevconf.org/
https://techportal.com/
https://twitter.com/techportal


## Case Study: Single Label Classification

### Define the response structure

In [18]:
from pydantic import BaseModel, Field
from typing import Literal


class ClassificationResponse(BaseModel):
    """
    A few-shot example of text classification:

    Examples:
    - "Buy cheap watches now!": SPAM
    - "Meeting at 3 PM in the conference room": NOT_SPAM
    - "You've won a free iPhone! Click here": SPAM
    - "Can you pick up some milk on your way home?": NOT_SPAM
    - "Increase your followers by 10000 overnight!": SPAM
    """

    chain_of_thought: str = Field(
        description="The chain of thought that led to the prediction.",
    )
    label: Literal["SPAM", "NOT_SPAM"] = Field(
        description="The predicted class label.",
    )
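
The `Literal` annotation is what restricts the label to a closed set. The sketch below uses a trimmed, hypothetical `Label` model (not the full `ClassificationResponse`) to show that a value outside the set is rejected by Pydantic.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class Label(BaseModel):
    label: Literal["SPAM", "NOT_SPAM"]


# A label from the allowed set is accepted as-is.
accepted = Label(label="SPAM").label
print(accepted)

# Anything else raises a validation error.
rejected = False
try:
    Label(label="MAYBE_SPAM")
except ValidationError:
    rejected = True
print(rejected)
```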

### Prepare the classification function

In [19]:
def classify(input_text: str) -> ClassificationResponse:
    system_prompt = "You are a text classification engine capable of classifying a given text as SPAM or NOT_SPAM."
    user_prompt = f"Classify the following text: <text>{input_text}</text>"

    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=ClassificationResponse,
        temperature=0.0,
        max_retries=5,
        stream=False,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
    )

### Run the classification

In [20]:
for text in [
    "Hey Jason! You're awesome",
    "I am a nigerian prince and I need your help.",
]:
    prediction = classify(text)
    print(f"Text: {text}, Predicted Label: {prediction.label}")
    print(f"Chain of thought: {prediction.chain_of_thought}")
    print("=" * 120)

Text: Hey Jason! You're awesome, Predicted Label: NOT_SPAM
Chain of thought: The text is a friendly message expressing appreciation towards someone named Jason. It does not contain any promotional content, misleading information, or requests for personal information, which are common characteristics of spam. Therefore, it is classified as NOT_SPAM.
Text: I am a nigerian prince and I need your help., Predicted Label: SPAM
Chain of thought: The text claims to be from a 'Nigerian prince' seeking help, which is a common trope in spam messages that attempt to solicit money or personal information from the recipient. This type of message is often associated with scams, making it highly likely to be classified as SPAM.


## Case Study: Multi Label Classification

### Define the response structure

In [21]:
from typing import List, Literal

from pydantic import BaseModel, Field


class MultiClassPrediction(BaseModel):
    """
    Class for a multi-label prediction.

    Examples:
    - "My account is locked": ["TECH_ISSUE"]
    - "I can't access my billing info": ["TECH_ISSUE", "BILLING"]
    - "When do you close for holidays?": ["GENERAL_QUERY"]
    - "My payment didn't go through and now I can't log in": ["BILLING", "TECH_ISSUE"]
    """

    chain_of_thought: str = Field(
        description="The chain of thought that led to the prediction.",
    )

    class_labels: List[Literal["TECH_ISSUE", "BILLING", "GENERAL_QUERY"]] = Field(
        description="The predicted class labels for the support ticket.",
    )

### Prepare the classification function

In [22]:
def multi_classify(input_text: str) -> MultiClassPrediction:
    user_prompt = (
        f"Classify the following support ticket: <ticket>{input_text}</ticket>"
    )

    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=MultiClassPrediction,
        temperature=0.0,
        max_retries=5,
        stream=False,
        messages=[
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
    )

### Run the classification

In [23]:
ticket = "My account is locked and I can't access my billing info."
prediction = multi_classify(ticket)

print(f"Ticket: {ticket}")
print(f"Predicted Labels: {prediction.class_labels}")
print(f"Chain of thought: {prediction.chain_of_thought}")

Ticket: My account is locked and I can't access my billing info.
Predicted Labels: ['TECH_ISSUE', 'BILLING']
Chain of thought: The user mentions that their account is locked, which indicates a technical issue. Additionally, they state that they cannot access their billing information, which relates to billing concerns. Therefore, the ticket involves both a technical issue and a billing issue.


## Case Study: Document Segmentation

### Prepare the document

In [24]:
from trafilatura import fetch_url, extract


url = "https://sebastianraschka.com/blog/2025/understanding-reasoning-llms.html"
downloaded = fetch_url(url)
document = extract(downloaded)
document

"Understanding Reasoning LLMs\nMethods and Strategies for Building and Refining Reasoning Models\nIn this article, I will describe the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic.\nIn 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., “specializations”).\nThe development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications. Because transfo

### Document preprocessing

In [25]:
def doc_with_lines(document: str) -> tuple[str, dict]:
    document_lines = document.split("\n")
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document_lines):
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line
    return document_with_line_numbers, line2text
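
A quick run on a three-line sample shows what the preprocessing produces. The function is restated here so the snippet is self-contained, and the `sample` text is made up for illustration.

```python
def doc_with_lines(document: str) -> tuple[str, dict]:
    """Prefix every line with its index and keep an index-to-text lookup."""
    document_lines = document.split("\n")
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document_lines):
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line
    return document_with_line_numbers, line2text


sample = "Title\nFirst paragraph\nSecond paragraph"
numbered, lookup = doc_with_lines(sample)
print(numbered)
print(lookup)
```

The numbered string is what goes into the prompt. The lookup is kept locally so the section boundaries returned by the model can be mapped back to text.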

In [26]:
document_with_line_numbers, line2text = doc_with_lines(document)

# print(document_with_line_numbers)
# print(line2text)

### Define the response structure

In [27]:
from pydantic import BaseModel, Field


class Section(BaseModel):
    title: str = Field(description="main topic of this section of the document")
    start_index: int = Field(description="line number where the section begins")
    end_index: int = Field(description="line number where the section ends")


class StructuredDocument(BaseModel):
    """obtains meaningful sections, each centered around a single concept/topic"""

    sections: list[Section] = Field(description="a list of sections of the document")

### Prepare the extraction function

In [28]:
def get_structured_document(document_with_line_numbers: str) -> StructuredDocument:
    system_prompt = """You are a world class educator working on organizing your lecture notes.
    Read the document below and extract a StructuredDocument object from it where each section of the document is centered around a single concept/topic that can be taught in one lesson.
    Each line of the document is marked with its line number in square brackets (e.g. [1], [2], [3], etc). Use the line numbers to indicate section start and end."""
    user_prompt = f"Process this document: {document_with_line_numbers}"

    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=StructuredDocument,
        temperature=0.0,
        max_retries=5,
        stream=False,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
    )

### Run the extraction

In [29]:
structured_doc = get_structured_document(document_with_line_numbers)
structured_doc

StructuredDocument(sections=[Section(title='Introduction to Reasoning Models', start_index=0, end_index=12), Section(title='Defining Reasoning Models', start_index=13, end_index=18), Section(title='When to Use Reasoning Models', start_index=19, end_index=22), Section(title='Overview of the DeepSeek Training Pipeline', start_index=23, end_index=27), Section(title='DeepSeek-R1 Model Variants', start_index=28, end_index=30), Section(title='Techniques for Building Reasoning Models', start_index=31, end_index=33), Section(title='Inference-Time Scaling', start_index=34, end_index=42), Section(title='Pure Reinforcement Learning (RL)', start_index=44, end_index=53), Section(title='Supervised Fine-Tuning and Reinforcement Learning (SFT + RL)', start_index=54, end_index=60), Section(title='Model Distillation', start_index=62, end_index=78), Section(title='Conclusion on Reasoning Model Strategies', start_index=89, end_index=99), Section(title='Thoughts on DeepSeek R1', start_index=100, end_index=

In [30]:
structured_doc.sections

[Section(title='Introduction to Reasoning Models', start_index=0, end_index=12),
 Section(title='Defining Reasoning Models', start_index=13, end_index=18),
 Section(title='When to Use Reasoning Models', start_index=19, end_index=22),
 Section(title='Overview of the DeepSeek Training Pipeline', start_index=23, end_index=27),
 Section(title='DeepSeek-R1 Model Variants', start_index=28, end_index=30),
 Section(title='Techniques for Building Reasoning Models', start_index=31, end_index=33),
 Section(title='Inference-Time Scaling', start_index=34, end_index=42),
 Section(title='Pure Reinforcement Learning (RL)', start_index=44, end_index=53),
 Section(title='Supervised Fine-Tuning and Reinforcement Learning (SFT + RL)', start_index=54, end_index=60),
 Section(title='Model Distillation', start_index=62, end_index=78),
 Section(title='Conclusion on Reasoning Model Strategies', start_index=89, end_index=99),
 Section(title='Thoughts on DeepSeek R1', start_index=100, end_index=114),
 Section(ti
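
Nothing in the schema forces the sections to cover every line, so a coverage check is a reasonable sanity step. The sketch below uses a hypothetical helper, `find_gaps`, and assumes `end_index` is inclusive. The ranges are copied from the first twelve sections printed above.

```python
def find_gaps(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return (first_missing, last_missing) line spans not covered by any section."""
    gaps = []
    ordered = sorted(ranges)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


# (start_index, end_index) pairs copied from the output above.
ranges = [
    (0, 12), (13, 18), (19, 22), (23, 27), (28, 30), (31, 33),
    (34, 42), (44, 53), (54, 60), (62, 78), (89, 99), (100, 114),
]
gaps = find_gaps(ranges)
print(gaps)
```

Lines 43, 61, and 79 through 88 belong to no section in this run, so their text would be missing from the segments built in the next step.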

### Postprocess

In [31]:
def get_sections_text(structured_doc, line2text):
    segments = []
    for s in structured_doc.sections:
        contents = []
        # end_index is the last line of the section, so include it in the range
        for line_id in range(s.start_index, s.end_index + 1):
            contents.append(line2text.get(line_id, ''))
        segments.append(
            {
                "title": s.title,
                "content": "\n".join(contents),
                "start": s.start_index,
                "end": s.end_index,
            }
        )
    return segments
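
The boundary handling is easy to get wrong, so here is a small check on made-up data. It assumes `end_index` names the last line of a section (inclusive), which the contiguous ranges in the model output (0 to 12, then 13 to 18) suggest. `section_text` is a hypothetical helper, not part of the notebook.

```python
def section_text(start: int, end: int, line2text: dict[int, str]) -> str:
    """Join the lines from start to end, treating end as inclusive."""
    return "\n".join(line2text.get(i, "") for i in range(start, end + 1))


line2text = {0: "Intro", 1: "More intro", 2: "Methods", 3: "Details"}
first = section_text(0, 1, line2text)
second = section_text(2, 3, line2text)
print(repr(first))
print(repr(second))
```

With inclusive ends, a plain `range(start, end)` would stop one line short and drop the last line of every section.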

In [32]:
segments = get_sections_text(structured_doc, line2text)

In [33]:
segments[0]

{'title': 'Introduction to Reasoning Models',
 'content': 'Understanding Reasoning LLMs\nMethods and Strategies for Building and Refining Reasoning Models\nIn this article, I will describe the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic.\nIn 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., “specializations”).\nThe development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specializatio