# Introduction to Structured Prompting using Instructor

## What is Instructor?

Instructor is a Python library that helps you get structured, predictable data from language models such as GPT-4 and Claude.
It's like giving the LLM a form to fill out instead of letting it respond however it wants.

Without Instructor, getting structured data from LLMs can be challenging:
- Unpredictable outputs: LLMs might format responses differently each time
- Format errors: Getting JSON or specific data structures can be error-prone
- Validation headaches: Checking if the response matches what you need

Instructor solves these problems by:
- Letting you define exactly what data you want using Python classes (Pydantic models)
- Making sure the LLM returns data in that structure
- Validating the output and, when validation fails, re-asking the model (up to `max_retries` times)
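
The validate-and-retry idea in the last bullet can be sketched without any LLM at all. The snippet below is a conceptual illustration only, not Instructor's actual implementation: `ask_model` is a hypothetical callable standing in for a real model call, and the hand-written checks stand in for the validation that Pydantic performs.

```python
import json
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


def parse_person(raw: str) -> Person:
    """Parse and validate one raw reply; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("name must be a string")
    age = data.get("age")
    if isinstance(age, bool) or not isinstance(age, int) or not 0 < age < 120:
        raise ValueError("age must be an integer between 1 and 119")
    return Person(name=data["name"], age=age)


def extract_with_retries(ask_model, prompt: str, max_retries: int = 3) -> Person:
    """Call ask_model until its reply validates, feeding the last error back."""
    last_error = None
    for _ in range(max_retries):
        # ask_model is a hypothetical stand-in for a real LLM call.
        raw = ask_model(prompt, last_error)
        try:
            return parse_person(raw)
        except ValueError as exc:
            last_error = str(exc)
    raise RuntimeError(f"no valid reply after {max_retries} attempts: {last_error}")


# A fake model that answers badly once, then correctly.
replies = iter(['{"name": "John", "age": "thirty"}', '{"name": "John", "age": 30}'])
person = extract_with_retries(lambda prompt, error: next(replies), "Extract a person")
print(person)
```

The first reply fails validation because `age` is a string, so the loop asks again and the second reply is accepted.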

## Preparation

### Install Dependencies

In [1]:
# !pip install -r requirements.txt

### Connect to the LLM with Instructor

In [2]:
from dotenv import load_dotenv
from openai import OpenAI, AzureOpenAI
import instructor

load_dotenv()

# client = instructor.from_openai(OpenAI())
client = instructor.from_openai(AzureOpenAI())
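
`AzureOpenAI()` is called here without arguments, so it relies on settings loaded from `.env`. As a hedged sketch, the check below reports which of the environment variables that the `openai` SDK normally reads for Azure are missing. The helper name `missing_env_vars` is made up for this example, and an Azure AD token can be used instead of the API key. Note also that with Azure the `model` argument used later usually refers to your deployment name.

```python
import os

# Environment variables the openai SDK's AzureOpenAI client reads when no
# explicit arguments are passed (an Azure AD token can replace the API key).
REQUIRED_AZURE_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_VERSION",
)


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_AZURE_VARS)
if missing:
    print("Missing settings in .env:", ", ".join(missing))
```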

## Case Study: Simple extraction from a text

In [3]:
text = """John is a 30 years old software engineer. 
He was born in Cicago and currently resides in New York.
He has houses at 123 Main St, Springfield, IL 62704 and 456 Oak Ave, Chicago, IL 60601."""

### Define the expected response structure

In [4]:
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str
    age: int = Field(description="The user's age in years", gt=0, lt=120)
    city: str = Field(description="The city where the user lives")
    occupation: str
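
To see what this class gives you before any LLM is involved, the sketch below (assuming Pydantic v2) prints the JSON schema generated from the model, which is roughly the information Instructor passes along to describe the expected structure, and shows the `gt`/`lt` constraints rejecting an out-of-range age.

```python
from pydantic import BaseModel, Field, ValidationError


class Person(BaseModel):
    name: str
    age: int = Field(description="The user's age in years", gt=0, lt=120)
    city: str = Field(description="The city where the user lives")
    occupation: str


# The schema lists required fields plus each field's type, description and bounds.
schema = Person.model_json_schema()
print(schema["required"])
print(schema["properties"]["age"])

# Out-of-range values are rejected at construction time.
age_rejected = False
try:
    Person(name="John", age=150, city="New York", occupation="software engineer")
except ValidationError as exc:
    age_rejected = True
    print(exc.errors()[0]["msg"])
```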

### Prepare the prompt

In [5]:
system_prompt = "You are a personal data extraction engine which capable to extract several personal details from a given text."
user_prompt = f"Extract a person from: {text}"

### Extract structured data

In [6]:
person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,
    temperature=0.0,
    max_retries=5,
    stream=False,
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": user_prompt,
        },
    ],
)

In [7]:
print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"City: {person.city}")
print(f"Occupation: {person.occupation}")

Name: John
Age: 30
City: New York
Occupation: software engineer


## Case Study: A more complex extraction from a text

### Define a more complex response structure

In [8]:
from pydantic import BaseModel, Field


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int = Field(description="The user's age in years", gt=0, lt=120)
    city: str = Field(description="The city where the user lives")
    occupation: str
    addresses: list[Address] = Field(description="The addresses of the user")
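
Nested models are validated recursively. As a small sketch (Pydantic v2 assumed, with a trimmed-down `Person` and a hand-written `payload` standing in for a model reply), each dictionary in `addresses` becomes an `Address` instance, and `model_dump()` converts the result back to plain Python data.

```python
from pydantic import BaseModel, Field


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int = Field(gt=0, lt=120)
    addresses: list[Address]


# Hand-written payload standing in for what the model might return.
payload = {
    "name": "John",
    "age": 30,
    "addresses": [
        {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip_code": "62704"},
    ],
}

person = Person.model_validate(payload)
print(type(person.addresses[0]).__name__)  # nested dicts become Address objects
print(person.model_dump() == payload)      # and round-trip back to plain dicts
```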

### No changes in the prompt

In [9]:
print(f"system prompt: {system_prompt}")
print("=" * 150)
print(f"user prompt: {user_prompt}")

system prompt: You are a personal data extraction engine which capable to extract several personal details from a given text.
user prompt: Extract a person from: John is a 30 years old software engineer. 
He was born in Cicago and currently resides in New York.
He has houses at 123 Main St, Springfield, IL 62704 and 456 Oak Ave, Chicago, IL 60601.


### No changes in the extraction method

In [10]:
# Extract structured data
person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,
    temperature=0.0,
    max_retries=5,
    stream=False,
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": user_prompt,
        },
    ],
)

### New structured response

In [11]:
print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"City: {person.city}")
print(f"Occupation: {person.occupation}")
print(f"Addresses: {person.addresses}")

Name: John
Age: 30
City: New York
Occupation: software engineer
Addresses: [Address(street='123 Main St', city='Springfield', state='IL', zip_code='62704'), Address(street='456 Oak Ave', city='Chicago', state='IL', zip_code='60601')]


In [12]:
for address in person.addresses:
    print(
        f"""Address:
    \t Street: {address.street}
    \t City: {address.city}
    \t State: {address.state}
    \t ZIP Code: {address.zip_code}"""
    )

Address:
    	 Street: 123 Main St
    	 City: Springfield
    	 State: IL
    	 ZIP Code: 62704
Address:
    	 Street: 456 Oak Ave
    	 City: Chicago
    	 State: IL
    	 ZIP Code: 60601


## Case Study: URL extraction from a markdown document

### Input Document

In [13]:
with open("document.md", "r", encoding="utf-8") as file:
    input_document = file.read()

### Define the response structure

In [14]:
from pydantic import BaseModel, Field, HttpUrl


class ExtractedURL(BaseModel):
    url: list[HttpUrl] = Field(description="List of extracted URLs from a given document")
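
Typing the field as `list[HttpUrl]` means every item must parse as an http(s) URL. A minimal sketch of that behaviour, assuming Pydantic v2 (the helper `is_valid_url_list` is made up for this example):

```python
from pydantic import BaseModel, Field, HttpUrl, ValidationError


class ExtractedURL(BaseModel):
    url: list[HttpUrl] = Field(description="List of extracted URLs from a given document")


def is_valid_url_list(candidates: list[str]) -> bool:
    """Return True if every candidate passes HttpUrl validation."""
    try:
        ExtractedURL(url=candidates)
    except ValidationError:
        return False
    return True


print(is_valid_url_list(["https://github.com/", "https://stackoverflow.com/"]))
print(is_valid_url_list(["https://github.com/", "not a url"]))
print(is_valid_url_list(["ftp://example.com/file.txt"]))  # wrong scheme for HttpUrl
```

If the model returns something that is not a URL, validation fails and Instructor can re-ask instead of handing you a malformed string.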

### Prepare the prompt

In [15]:
system_prompt = "You are a URL extraction engine."
user_prompt = f"Extract list of URLs from this document: {input_document}"

### Run the extraction

In [16]:
extracted_url = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ExtractedURL,
    temperature=0.0,
    stream=False,
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": user_prompt,
        },
    ],
)

### The structured output

In [17]:
for url in extracted_url.url:
    print(url)

https://ai.google/research
https://azure.microsoft.com/
https://twitter.com/technews
https://linkedin.com/tech
https://instagram.com/techtrends
https://github.com/
https://www.codecademy.com/
https://stackoverflow.com/
https://techconference2024.com/
https://aisummit.global/
https://webdevconf.org/
https://techportal.com/
https://twitter.com/techportal
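
Schema validation only guarantees that each item is a well-formed URL, not that it actually appears in the document. One possible sanity check, sketched here with the standard library only, is to compare the extracted list against a plain regex scan. The pattern and the helper names `normalize` and `unsupported_urls` are assumptions made for this example, and the regex is deliberately simple rather than a full URL grammar.

```python
import re

# Simple pattern: http(s):// followed by anything that is not whitespace,
# a bracket, or a quote. Good enough for a sanity check on Markdown text.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]\"']+")


def normalize(url: str) -> str:
    """Drop trailing punctuation and a trailing slash so equivalent URLs match."""
    return url.rstrip(".,;:!?").rstrip("/")


def unsupported_urls(document: str, extracted: list[str]) -> list[str]:
    """Return extracted URLs that do not literally appear in the document."""
    found = {normalize(u) for u in URL_PATTERN.findall(document)}
    return [u for u in extracted if normalize(u) not in found]


sample = "See [GitHub](https://github.com/) and https://stackoverflow.com/."
print(unsupported_urls(sample, ["https://github.com/", "https://example.org/"]))
```

In this notebook it could be called as `unsupported_urls(input_document, [str(u) for u in extracted_url.url])`. The normalization step matters because `HttpUrl` may add a trailing slash to URLs that lack one.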
