# Exercise 2 - Structured information extraction
The previous exercise used standard prompt techniques to extract information from a house description. However, this approach was ad hoc and not very robust. 

In this exercise, we will again extract information from the house description, but this time into a structured data model using Pydantic. This will allow us to:
- Validate if the extracted information is correct.
- Easily extend the model to extract more information.
- Easily parse raw strings into our structured Pydantic model.

Before we start, let's import the necessary libraries and create a client for the Azure OpenAI API.

In [None]:
from llm_in_production.openai_utils import get_openai_client
import dotenv
import os
import pandas as pd
from typing import Literal
import pydantic
import datetime


dotenv.load_dotenv()
client = get_openai_client()

## Introduction to Pydantic
Before we dive into the exercise, let's first look at [Pydantic](https://docs.pydantic.dev/latest/). Pydantic is a library that allows you to create structured data models in Python. These models can be used to validate data and to parse raw data into a structured model. Pydantic is used in many Python libraries, including FastAPI and SQLModel.

The syntax for creating a Pydantic model is very similar to the syntax for creating a dataclass. You create a Pydantic model as follows:
- Create a class that inherits from `pydantic.BaseModel`.
- Add fields to the class. Each field is a class attribute with a type annotation. Pydantic supports most standard Python types, for example `int`, `str`, `bool`, `float`, and `datetime.datetime`. Pydantic also supports more advanced types such as `typing.List` and `typing.Dict`, and even union types using the `|` operator. For example, `int | None` means that the field can be either an integer or `None`.


In [None]:
# Here we define a Pydantic model.
class User(pydantic.BaseModel):
    id: int
    # Optionally, you can add additional information to the fields, such as descriptions.
    name: str = pydantic.Field(description="The first and last name of the user")
    gender: Literal["unknown", "male", "female", "other"] = pydantic.Field(
        "unknown",  # The default value of the field.
        description="The gender of the user",
    )
    # `| None` means the value may be None. Without a default, the field must still be provided.
    signup: datetime.datetime | None = pydantic.Field(description="The date and time the user signed up")

# We can create an instance of the model by passing in the correct types.
user = User(id=123, name='John', gender="male", signup=datetime.datetime(2021, 1, 26, 9, 32, 15))
print(user)

The nice thing about Pydantic is that it can also parse raw JSON into a Pydantic model. For example, we can parse the following JSON into a Pydantic model:

In [None]:
# The model_validate_json method takes a raw JSON string as input and returns an instance of the model.
user = User.model_validate_json('{"id": 123, "name": "John", "signup": "2021-01-26 09:32:15"}')
print(user)

Pydantic can also generate a JSON schema using the `model_json_schema` method. This is very useful since LLMs are very good at understanding these types of schemas and can generate JSON that matches them. For example, we can generate the JSON schema for our `User` model using the code below.

Have a look at this JSON and try to link the different parts of the schema to the fields in the `User` model.


In [None]:
print(User.model_json_schema())

## Exercise 2a: Demo of Pydantic text extraction
Let's put all our Pydantic knowledge together in the following text extraction prompt.

In the cell below, we create a Pydantic model that describes a user's details and can emit the corresponding JSON schema.

In [None]:
class Address(pydantic.BaseModel):
    street: str
    number: int
    city: str
    country: str

class User(pydantic.BaseModel):
    id: int = pydantic.Field(description="The id of the user")
    name: str = pydantic.Field(description="The first and last name of the user")
    gender: Literal['unknown', "male", "female", "other"] = pydantic.Field("unknown", description="The gender of the user") 
    signup: datetime.datetime | None = pydantic.Field(None, description="The date and time the user signed up")
    address: Address # You can also use nested models in Pydantic
    # YOUR CODE HERE START: Add a new field to the model
    # YOUR CODE HERE END


In the next cell, the following happens:
- We write a system prompt, which explains that the LLM's responses must follow the `User` JSON schema.
- We write a small description of a person to send to the LLM.
- We ask the LLM to generate a JSON object.
- The model responds with this JSON object in raw string format.
- We then parse this raw string into a Pydantic model using the `User.model_validate_json` method.
- Finally, we print the parsed model.

Do the following:
1. Run the code as it is. Does the parsed user match the description?
2. Change the user message. Does the parsed user still match the description?
3. Add a new field to the `User` model created above (e.g. `job: str`). Does the model still parse the JSON including the new attribute correctly?

In [None]:

system_prompt = f"""
The user sends a description of a User. It's your task to extract details from the description.
Your response must be valid JSON parsable by Pydantic using the following schema:
{User.model_json_schema()}
This should not be a JSON Schema, but a JSON object that can be parsed by Pydantic based on the above schema.
""".strip()

print("System prompt:")
print(system_prompt)
print("#" * 80 + "\n")

user_message = f"""
John, a compassionate and dedicated male living at 123 Main Street, New York, NY 10030, USA, is a lawyer who is passionate about helping others. Recently, he signed up for a local volunteering program on January 26, 2021. When he signed up, he received ID number 42. Throughout the year, he selflessly contributed his time and skills to various projects, making a profound impact on his community. John's commitment inspired other lawyer colleagues to join him, creating a ripple effect of kindness and compassion.
""".strip()

print("User message:")
print(user_message)
print("#" * 80 + "\n")

response = client.chat.completions.create(
    model=os.environ["GPT_35_CHAT_MODEL_NAME"],
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ],
    temperature=0.0,
)


message = response.choices[0].message.content
print("Response:")
print(message)
print("#" * 80 + "\n")

user = User.model_validate_json(message)
print("Parsed user:")
print(user)

## Validation errors and error handling
Besides parsing JSON, Pydantic also validates that the raw JSON has the correct types and satisfies any other constraints. If the JSON is invalid, Pydantic raises a `ValidationError`. The exception also gives you a clear list of which fields are invalid and why.

In [None]:
try:
    user = User.model_validate_json('{"id": "NaN", "name": "John", "signup": "2021-01-26 09:32:15"}')
except pydantic.ValidationError as e:
    for error in e.errors():
        print(f"Errors: {error}")
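
Each entry returned by `e.errors()` is a dictionary with, among others, the keys `loc` (the path to the offending field), `msg` (a readable explanation), and `type` (a machine-readable error code). As a rough sketch, these can be condensed into one line per problem. The `Account` model and the `format_errors` helper below are hypothetical names introduced only for this illustration.

```python
import pydantic


class Account(pydantic.BaseModel):  # hypothetical model, for illustration only
    id: int
    name: str


def format_errors(exc: pydantic.ValidationError) -> str:
    """Turn a ValidationError into one readable line per problem."""
    lines = []
    for error in exc.errors():
        # "loc" is a tuple path to the offending field, "msg" a readable reason.
        location = ".".join(str(part) for part in error["loc"])
        lines.append(f"- {location}: {error['msg']} (type={error['type']})")
    return "\n".join(lines)


try:
    # "id" cannot be parsed as an integer and "name" is missing entirely.
    Account.model_validate_json('{"id": "NaN"}')
except pydantic.ValidationError as e:
    summary = format_errors(e)
    print(summary)
```

A compact summary like this is shorter than the full error dictionaries, which can help keep a follow-up prompt small.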

These error messages are very useful since they tell us exactly what is wrong with the JSON. They are also in a format that is easy for an LLM to understand. We can leverage this by letting the LLM fix its own mistakes, which is exactly what we do in the cell below.

Here, we:
- First, create a system prompt with the task description and the JSON schema for the expected response.
- We then reuse the previous user description as a user message.
- We imagine receiving an invalid response, by hard-coding a response with missing attributes.
- When we parse the incorrect response into a Pydantic model, we get a `ValidationError`.
- We then use the `e.errors()` method to get a list of errors.
- We then create a new user message that tells the LLM what the errors are and how to fix them.
- The LLM then responds with a new JSON string that does not have the errors.


Please run the cell below to see this in action. Also, try adding other errors to the incorrect response, such as the following, and see if the LLM can fix them:
- Other missing attributes.
- Incorrect types.
- etc.

In [None]:
system_prompt = f"""
The user sends a description of a User. It's your task to extract details from the description.
Your response must be valid JSON parsable by Pydantic using the following schema:
{User.model_json_schema()}
This should not be a JSON Schema, but a JSON object that can be parsed by Pydantic based on the above schema.
""".strip()

user_message = f"""
John, a compassionate and dedicated male living at 123 Main Street, New York, NY 10030, USA, is a lawyer who is passionate about helping others. Recently, he signed up for a local volunteering program on January 26, 2021. When he signed up, he received ID number 42. Throughout the year, he selflessly contributed his time and skills to various projects, making a profound impact on his community. John's commitment inspired other lawyer colleagues to join him, creating a ripple effect of kindness and compassion.
""".strip()

# This hardcoded response is incorrect since:
# - The id is missing.
# - The name is missing.
incorrect_response = """
{
    "gender": "male",
    "signup": "2021-01-26T00:00:00",
    "address": {
        "street": "123 Main Street",
        "number": 0,
        "city": "New York",
        "country": "USA"
    },
    "job": "lawyer"
}
"""

# Parse the incorrect response into a Pydantic model
# This will raise a pydantic.ValidationError
try:
    User.model_validate_json(incorrect_response)
except pydantic.ValidationError as e:
    # We then ask the LLM to repair the response by telling it what the errors are.
    fix_request_user_message = f"Your response is incorrect! Respond only with JSON that does not have these errors: {e.errors()}"
    print("Fix request user prompt:")
    print(fix_request_user_message)
    print("#" * 80)
    
    # We then include the past message for context since it 
    # needs the description to find the correct name and id.
    response = client.chat.completions.create(
        model=os.environ["GPT_35_CHAT_MODEL_NAME"],
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": incorrect_response},
            {"role": "user", "content": fix_request_user_message},
        ],
        temperature=0.0,
    )
    # We then try to parse the (fixed) response again.
    message = response.choices[0].message.content
    print("Response after fixing:")
    print(message)
    print("#" * 80)
    user = User.model_validate_json(message)
    print("Parsed user after fixing:")
    print(user)
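
The repair step above runs once, and the second parse can still fail. One way to generalize it is a small retry loop, sketched below under some assumptions: `call_llm` is a hypothetical callable that takes the message list and returns the raw reply string (for example, a thin wrapper around `client.chat.completions.create`), and `Person` is a made-up model. A fake LLM stands in for the real API so the sketch runs offline.

```python
import pydantic


class Person(pydantic.BaseModel):  # hypothetical model, for illustration only
    id: int
    name: str


def extract_with_retries(call_llm, messages, model_cls, max_retries=2):
    """Call the LLM, validate its reply, and feed errors back until it parses.

    `call_llm` is an assumed callable: it receives the message list and
    returns the raw reply string.
    """
    messages = list(messages)  # copy, so the caller's list is not modified
    for attempt in range(max_retries + 1):
        reply = call_llm(messages)
        try:
            return model_cls.model_validate_json(reply)
        except pydantic.ValidationError as e:
            if attempt == max_retries:
                raise  # give up and surface the last validation error
            # Add the failed reply and the errors so the LLM can repair its answer.
            messages.append({"role": "assistant", "content": reply})
            messages.append({
                "role": "user",
                "content": f"Respond only with JSON that does not have these errors: {e.errors()}",
            })


# A fake LLM that fails once and then returns a repaired answer.
replies = iter(['{"name": "John"}', '{"id": 42, "name": "John"}'])
person = extract_with_retries(
    lambda msgs: next(replies),
    [{"role": "user", "content": "John received id number 42."}],
    Person,
)
print(person)
```

Capping the number of retries keeps cost and latency bounded when the model keeps producing invalid output.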

## Bring Pydantic to the house listing problem
We have now seen that we can use Pydantic to:
- Tell an LLM what kind of JSON we want it to generate, using the `model_json_schema` method.
- Parse raw JSON into a Pydantic model using the `model_validate_json` method.
- Get clear error messages when the JSON is invalid.

Let's now use these properties to make a more robust information extraction system for our house listing problem.
In this problem, although we are mainly interested in extracting the value, we are also interested in additional information, such as:
- The thoughts and reasoning that the LLM used to extract the value. As a side benefit, writing out its reasoning first (chain of thought) can also improve the LLM's answer.
- The quotes from the description that were used to extract the value.

There are different types of features that can be extracted, for example:
- Boolean features: Is a certain feature present in the description? For example, does the house have a garden or a balcony? A house either has this feature or not.
- Digit features: A digit feature is a feature that has a numeric value. For example, the number of bedrooms or the price of the house.
- String features: A string feature is a feature that has a string value. For example, the name of the neighborhood or the name of the real estate agent, etc.

For example, we are interested in the following information:
- Whether the house is pet-friendly (boolean feature)
- The number of bedrooms (digit feature)
- The neighborhood name (string feature)

So an output might look something like this:
```json
{
  "pets_allowed": {
    "thoughts": "The description mentions that pets are not allowed.",
    "quotes": [
      "Pets not allowed"
    ],
    "value": false
  },
  "number_of_bedrooms": {
    "thoughts": "The description mentions the number of bedrooms is 2.",
    "quotes": [
      "Number of bedrooms: 2 (the 2nd bedroom has no windows)"
    ],
    "value": 2
  },
  "neighborhood": {
    "thoughts": "The description mentions the neighborhood is Nieuw-Sloten.",
    "quotes": [
      "Neighborhood: Nieuw-Sloten"
    ],
    "value": "Nieuw-Sloten"
  }
}
```
It can also happen that some information is not present in the description. In that situation, we want something like this:
```json
{
  "pets_allowed": {
    "thoughts": "The description does not mention anything about pets.",
    "quotes": [],
    "value": null
  },
  "number_of_bedrooms": {
    "thoughts": "The description does not mention anything about bedrooms.",
    "quotes": [],
    "value": null
  },
  "neighborhood": {
    "thoughts": "The description does not mention anything about neighborhoods.",
    "quotes": [],
    "value": null
  }
}
```

## Exercise 2b: Create a pydantic model for the structured house information

Above, we gave examples of information we may want to extract, such as the number of bedrooms or the neighborhood of the property. In the future, we might also want to extract more information from the description, such as the price of the house or the name of the real estate agent.

To make this easier, we will create a `DigitFeature`, `StringFeature`, and `BooleanFeature` class that we can use to extract the different types of features. If we ever have a new feature that we want to extract, we can simply add a new attribute to the `HouseFeatures` model with the correct type of feature.

In the cell below, we have already created the `DigitFeature` and `StringFeature` classes. The response format, `HouseFeatures`, is defined in the cell after that.

Your first task is as follows:
1. Complete the `BooleanFeature` class. You can use the `DigitFeature` and `StringFeature` classes as an example. The `BooleanFeature` class should have the following properties:
    - `quotes`: A list of strings that contains the exact sentences that were used to extract the value. If the feature was not found in the description, this property should be an empty list.
    - `value`: A boolean that indicates if the feature is present in the description. If the feature was not found in the description, this property should be `None`.


In [None]:
class DigitFeature(pydantic.BaseModel):
    thoughts: str = pydantic.Field(
        description="Here you write your thoughts and reasoning about your observations that will help you determine the value"
    )
    quotes: list[str] = pydantic.Field(
        description="The exact sentences where you got the answer from. This property is an empty list if value is None."
    )
    value: int | None = pydantic.Field(
        default=None, description="The exact integer value in the sentences. This property is None if the feature was not found in the description. It must be None if no quotes are found."
    )
    
    @pydantic.model_validator(mode='after')
    def check_if_no_quotes_value_is_none(self) -> 'DigitFeature':
        if len(self.quotes) == 0 and self.value is not None:
            raise ValueError('If there are no quotes, the value must be None.')
        return self
        


class StringFeature(pydantic.BaseModel):
    thoughts: str = pydantic.Field(
        description="Here you write your thoughts and reasoning about your observations that will help you determine the value"
    )
    quotes: list[str] = pydantic.Field(
        description="The exact sentences where you got the answer from. This property is an empty list if value is None."
    )
    value: str | None = pydantic.Field(description="The exact value in the sentences. This property is None if the feature was not found in the description. It must be None if no quotes are found.")
    
    @pydantic.model_validator(mode='after')
    def check_if_no_quotes_value_is_none(self) -> 'StringFeature':
        if len(self.quotes) == 0 and self.value is not None:
            raise ValueError('If there are no quotes, the value must be None.')
        return self
    


class BooleanFeature(pydantic.BaseModel):
    # YOUR CODE HERE START: Complete the pydantic model by adding the correct fields and descriptions
    # YOUR CODE HERE END


Your second task is as follows:

2. Add a `pets_allowed` attribute to the `HouseFeatures` class with the correct type and description. You can use the `number_of_bedrooms` and `neighborhood` attributes as an example.

Run the cell and check if all tests pass.

In [None]:

class HouseFeatures(pydantic.BaseModel):
    """Extracted house listing features"""
    number_of_bedrooms: DigitFeature = pydantic.Field(description="The number of bedrooms in the house")
    neighborhood: StringFeature = pydantic.Field(description="The name of the neighborhood where the house is located. Do not include the city name.")
    # YOUR CODE HERE START: Add the boolean feature for pets_allowed and add a description with pydantic.Field
    # YOUR CODE HERE END
    
# Below are some automated tests to check if your Pydantic model is correct.
raw_feature = {"thoughts": "The description mentions a garden.", "quotes": ["The house has a garden"], "value": True}
feature = BooleanFeature.model_validate(raw_feature)
assert feature.quotes == raw_feature["quotes"]
assert feature.value == raw_feature["value"]

raw_feature = {"thoughts": "The description is empty.", "quotes": [], "value": None}
feature = BooleanFeature.model_validate(raw_feature)
assert feature.quotes == raw_feature["quotes"]
assert feature.value == raw_feature["value"]

raw_feature = {"thoughts": "This response is invalid.", "quotes": [], "value": True}
try:
    feature = BooleanFeature.model_validate(raw_feature)
    assert False, "It looks like the model does not raise an error if the quotes are empty but the value is not None"
except pydantic.ValidationError as e:
    pass


raw_house_featues = {
    "pets_allowed": {
        "thoughts": "The description mentions that pets are not allowed.",
        "quotes": ["Pets not allowed"],
        "value": False,
    },
    "number_of_bedrooms": {
        "thoughts": "The description mentions the number of bedrooms.",
        "quotes": ["Number of bedrooms: 2 (the 2nd bedroom has no windows)"],
        "value": 2,
    },
    "neighborhood": {
        "thoughts": "The description mentions the neighborhood.",
        "quotes": ["Neighborhood: Nieuw-Sloten"],
        "value": "Nieuw-Sloten",
    },
}
feature = HouseFeatures.model_validate(raw_house_featues)
assert feature.pets_allowed.quotes == raw_house_featues["pets_allowed"]["quotes"], "It looks like the quotes are not parsed correctly for the pets_allowed feature"
assert feature.pets_allowed.value == raw_house_featues["pets_allowed"]["value"], "It looks like the value is not parsed correctly for the pets_allowed feature"
assert feature.number_of_bedrooms.quotes == raw_house_featues["number_of_bedrooms"]["quotes"], "It looks like the quotes are not parsed correctly for the number_of_bedrooms feature"
assert feature.number_of_bedrooms.value == raw_house_featues["number_of_bedrooms"]["value"], "It looks like the value is not parsed correctly for the number_of_bedrooms feature"
assert feature.neighborhood.quotes == raw_house_featues["neighborhood"]["quotes"], "It looks like the quotes are not parsed correctly for the neighborhood feature"
assert feature.neighborhood.value == raw_house_featues["neighborhood"]["value"], "It looks like the value is not parsed correctly for the neighborhood feature"

## Exercise 2c: Explore the JSON schema
We now have a Pydantic model that describes how to structure the information we extract from the description.

Now, the big question is, is it informative enough for the LLM to generate the correct JSON?

To answer this question, let's first explore the JSON schema ourselves.
Run the cell below to inspect the JSON schema.

Read through the JSON schema and try to link the different parts of the schema to the fields in the `HouseFeatures` model.
Then ask yourself the following questions:
- What happens if you:
    - Change the `description` of a field?
    - Change the `default` of a field?
    - Change the docstring of a class?
- Are the descriptions of the fields correct?
- Do the descriptions contain enough information to tell the LLM what kind of value we want to extract?

Note: there is a lot of information in the JSON schema. Don't worry if you don't understand everything. Mainly focus on the `description` fields and `type` fields.

In [None]:
HouseFeatures.model_json_schema()

## Exercise 2d: Extracting information from the descriptions
Now that we have a Pydantic model that describes the structure of the extracted information, let's use it to extract information from the descriptions.

In the cell below, we have already created a function that extracts information from a description using the LLM. Notice that we also request [JSON mode](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/json-mode?tabs=python#json-mode-support) through the `response_format` argument, so the model output is always syntactically valid JSON. However, JSON mode does not guarantee that the output matches any specific schema, so the system prompt must still describe the schema. The API also requires the word "JSON" to appear somewhere in the messages when JSON mode is enabled.

Your task is as follows:
1. Write the prompt that tells the LLM what kind of JSON we want to generate. You can use the `HouseFeatures.model_json_schema()` method to generate the JSON schema.
2. Parse the response from the LLM into a `HouseFeatures` model using the `HouseFeatures.model_validate_json` method.
3. Run the cell and check if all tests pass. If not, try to improve the system prompt.

In [None]:
def extract_info(description):
    
    # YOUR CODE HERE START: Replace this with your system prompt. You can use the system prompt above as an example.
    # YOUR CODE HERE END
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": description},
    ]
    response = client.chat.completions.create(
        model=os.environ["GPT_35_CHAT_MODEL_NAME"],
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0.0,
    )

    message = response.choices[0].message.content
    # YOUR CODE HERE START: Open a `try:` block and parse the message into a HouseFeatures model named `parsed_house_features`
    # YOUR CODE HERE END
    except pydantic.ValidationError as e:
        print("LLM response:")
        print(message)
        print("Errors:")
        print(e.errors())
        print("Asking the LLM to fix the errors:")
        messages.append({
            "role": "assistant", 
            "content": message
        })
        messages.append({
            "role": "user", 
            # Write a user message that tells the LLM about the errors in its response (`e.errors()`)
            # and asks it to respond with JSON that does not have these errors.
            # YOUR CODE HERE START:
            # YOUR CODE HERE END
        })
        response = client.chat.completions.create(
            model=os.environ["GPT_35_CHAT_MODEL_NAME"],
            messages=messages,
            response_format={"type": "json_object"},
            temperature=0.0,
        )
        message = response.choices[0].message.content
        print("LLM response after fixing:")
        print(message)
        # YOUR CODE HERE START: Replace this with your code to parse the message into a HouseFeatures model named `parsed_house_features`
        # YOUR CODE HERE END
    return parsed_house_features


df = pd.read_csv("houses.csv")

# Here we map the pets_allowed column to a boolean value to match our new boolean feature pydantic model
expected_pets_allowed_value = {
    "allowed": True,
    "not_allowed": False,
    "unknown": None,
}
df["pets_allowed"] = df["pets_allowed"].map(expected_pets_allowed_value)

# Here we loop through all the rows in the dataset and check if the extracted information is correct
for i, row in df.iterrows():
    description = row["description"]
    info = extract_info(description)
    
    assert info.pets_allowed.value == row["pets_allowed"], f"For row {i}, expected `{row['pets_allowed']}` but got `{info.pets_allowed.value}` in `{info}`"
    assert info.neighborhood.value == row["neighborhood"], f"For row {i}, expected `{row['neighborhood']}` but got `{info.neighborhood.value}` in `{info}`"
    assert info.number_of_bedrooms.value == row["bedrooms"], f"For row {i}, expected `{row['bedrooms']}` but got `{info.number_of_bedrooms.value}` in `{info}`"
    print(f"✅ Row {i} is correct: {info}")
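
One caveat when comparing against a CSV: pandas loads empty cells as `NaN`, and `NaN` never compares equal to `None` (or even to itself). Whether `houses.csv` contains empty cells is not known here, so the following is only a sketch of a hypothetical helper, `values_match`, that treats `NaN` in the dataset as "missing".

```python
import math


def values_match(extracted, expected) -> bool:
    """Compare an extracted value with a dataset value, treating NaN as missing."""
    # pandas represents empty CSV cells as float NaN, and NaN != NaN in Python.
    if isinstance(expected, float) and math.isnan(expected):
        expected = None
    return extracted == expected


print(values_match(None, float("nan")))  # True
print(values_match(2, 2.0))              # True
print(values_match(None, "Centrum"))     # False
```

If the assertions above fail only on rows with unknown values, replacing `==` with a helper like this is one thing to try.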

## Optional exercise 2e: Extract additional information
The descriptions contain a lot more information. For example: 
- Is there a balcony? 
- Is there a garden? 
- What is the price of the house?
- What is the house type (apartment, house, etc.)?
- What is the city name?
- etc.


Your task is as follows:
1. Inspect the descriptions in the `houses.csv` file.
2. Identify additional information that you want to extract.
3. Update the `HouseFeatures` model to include the new information.
4. Update the `extract_info` function to extract the new information.
5. Check if the information is extracted correctly.

Here is some code to inspect the descriptions in the `houses.csv` file.


In [None]:
house_idx = 0
row = df.iloc[house_idx]
print("House:", house_idx)
print("City:", row["city"])
print("Price:", row["price"])
print("Surface area:", row["surface_area"])
print("Bedrooms:", row["bedrooms"])
print("#" * 80 + "\n")
print("Description:\n" + row["description"])

Here is some boilerplate code to get you started.

In [None]:
class HouseFeaturesAdvanced(pydantic.BaseModel):
    """Extracted house listing features"""
    # YOUR CODE HERE START: Add the features you want to extract, each with a description via pydantic.Field
    # YOUR CODE HERE END

def extract_info_advanced(description):    
    # YOUR CODE HERE START: Replace this with your system prompt. You can use the system prompt above as an example.
    # YOUR CODE HERE END
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": description},
    ]
    response = client.chat.completions.create(
        model=os.environ["GPT_35_CHAT_MODEL_NAME"],
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0.0,
    )

    message = response.choices[0].message.content
    # YOUR CODE HERE START: Replace this with your code to parse the message into a HouseFeaturesAdvanced model named `parsed_house_features`
    # YOUR CODE HERE END    
    return parsed_house_features

house_idx = 0
row = df.iloc[house_idx]
print("House:", house_idx)
print("City:", row["city"])
print("Price:", row["price"])
print("Surface area:", row["surface_area"])
print("Bedrooms:", row["bedrooms"])
print("#" * 80 + "\n")
print("Description:\n" + row["description"])
print("#" * 80 + "\n")
info = extract_info_advanced(row["description"])
print(info)