# 1.4 Extraction - TODO!
(Adapted from https://python.langchain.com/docs/tutorials/extraction)

## Setup

### Install dependencies

In [None]:
%pip install python-dotenv~=1.0 docarray~=0.40.0 pypdf~=5.1 --upgrade --quiet
%pip install langchain~=0.3.7 langchain_openai~=0.2.6 langchain_community~=0.3.5 --upgrade --quiet

# If running locally, you can do this instead:
#%pip install -r ../requirements.txt

### Load environment variables

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# If running in Google Colab, you can use this code instead:
# from google.colab import userdata
# os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
# os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")
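
Before creating the chat model, it can help to confirm that the credentials were actually loaded. The following is a minimal sketch, not part of the original tutorial: the helper `missing_env_vars` is a hypothetical name, and it assumes the two variables shown in the Colab snippet above are the ones your setup needs.

```python
import os

# Variables referenced in the Colab snippet above (assumed to be the required ones).
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If anything is reported as missing, check your `.env` file (or Colab secrets) before continuing.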

### Set up the chat model

In [None]:
from langchain_openai import AzureChatOpenAI
api_version = "2024-10-01-preview"
llm = AzureChatOpenAI(deployment_name="gpt-4o", temperature=0.0, api_version=api_version)

## The Schema

First, we need to describe what information we want to extract from the text.

We'll use Pydantic to define an example schema to extract personal information.

In [None]:
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is `Optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )
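
The comments in the schema say that the doc-string and the field descriptions are passed to the LLM. One way to see what metadata is available is to print the JSON schema that Pydantic generates for the model. This is a minimal sketch assuming Pydantic v2 (`model_json_schema`); the exact payload LangChain builds for the tool call may differ in shape.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


schema = Person.model_json_schema()

# The class doc-string becomes the schema-level description.
print(schema["description"])

# Each Field description is attached to its property.
for field_name, spec in schema["properties"].items():
    print(f"{field_name}: {spec['description']}")

# No field is required, so a person with no known attributes is still valid.
print(schema.get("required", []))
print(Person())
```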

There are two best practices when defining a schema:

1. Document the **attributes** and the **schema** itself: This information is sent to the LLM and is used to improve the quality of information extraction.
2. Do not force the LLM to make up information! Above, we used `Optional` for the attributes, allowing the LLM to output `None` if it doesn't know the answer.
 
<br/>

##### _IMPORTANT_ 
> For best performance, document the schema well and make sure the model isn't forced to return results if there's no information to be extracted in the text.

<br/>

## The Extractor

Let's create an information extractor using the schema we defined above.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

_**NOTE:**_ This only works with models that support function/tool calling.

Please review [the documentation](https://python.langchain.com/docs/concepts/tool_calling) for a list of some models that can be used with this API.

In [None]:
runnable = prompt | llm.with_structured_output(schema=Person)

Let's test it out:

In [None]:
text = "Alan Smith is 6 feet tall and has blond hair."
runnable.invoke({"text": text})

##### IMPORTANT 
> Extraction is Generative 🤯
> 
> LLMs are generative models, so they can do some pretty cool things like correctly extract the height of the person in meters
> even though it was provided in feet!

<br/>
We can see the LangSmith trace here: https://smith.langchain.com/public/44b69a63-3b3b-47b8-8a6d-61b46533f015/r
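
Because the conversion from feet to meters is generated by the model rather than computed, it is worth sanity-checking. The sketch below works out the expected value for the example above and compares an extracted string against it. The helper names and the 0.05 m tolerance are illustrative assumptions, not part of the tutorial.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


def feet_to_meters(feet: float) -> float:
    """Convert a length in feet to meters."""
    return feet * FEET_TO_METERS


def is_close_to_expected(extracted, expected_m: float, tolerance: float = 0.05) -> bool:
    """Check an extracted height (a string such as "1.83") against the expected value.

    Returns False when the value is missing or is not a plain number.
    """
    try:
        return abs(float(extracted) - expected_m) <= tolerance
    except (TypeError, ValueError):
        return False


expected = feet_to_meters(6)
print(round(expected, 2))  # 1.83
print(is_close_to_expected("1.83", expected))
```

Since `height_in_meters` is declared as a string in the schema, the model may also return values with units attached (for example `"1.83 m"`), which this simple check would treat as a mismatch.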

## Multiple Entities

In **most cases**, you should be extracting a list of entities rather than a single entity.

This can be easily achieved using Pydantic by nesting models inside one another.

In [None]:
from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ⬆ Doc-string for the entity Person.
    # This doc-string is sent to the chat model as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is `Optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

##### IMPORTANT
> Extraction might not be perfect here. Read on to see how **Reference Examples** can improve the quality of extraction, and see the **guidelines** section!


In [None]:
runnable = prompt | llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
runnable.invoke({"text": text})

##### TIP
> When the schema accommodates the extraction of **multiple entities**, it also allows the model to extract **no entities** if no relevant information
> is in the text by providing an empty list. 
> 
> This is usually a **good** thing! It allows specifying **required** attributes on an entity without necessarily forcing the model to detect this entity.


We can see the LangSmith trace here: https://smith.langchain.com/public/7173764d-5e76-45fe-8496-84460bd9cdef/r
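
The point in the tip above can be checked without calling the LLM. The sketch below uses a hypothetical variation of the schema, `StrictPerson` and `StrictData`, in which `name` is required. It assumes Pydantic v2 (`model_validate`). An empty list validates, while an entity that lacks the required attribute is rejected.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class StrictPerson(BaseModel):
    """Information about a person (hypothetical variant with a required name)."""

    name: str = Field(description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )


class StrictData(BaseModel):
    """Extracted data about people."""

    people: List[StrictPerson]


# No people mentioned in the text: an empty list is a valid result.
empty = StrictData.model_validate({"people": []})
print(empty.people)  # []

# A person without the required `name` does not pass validation.
try:
    StrictData.model_validate({"people": [{"hair_color": "black"}]})
except ValidationError as exc:
    print(f"Rejected: {exc.error_count()} validation error(s)")
```

The required attribute constrains each extracted entity, but the list wrapper still lets the model report that it found nothing.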

# TODO: Review

## Exercise - Reference Examples

Improve the quality of the extraction by using reference examples. See **[https://python.langchain.com/docs/how_to/extraction_examples/](https://python.langchain.com/docs/how_to/extraction_examples/)**.





# TODO: Remove

## Next steps

Now that you understand the basics of extraction with LangChain, you're ready to proceed to the rest of the how-to guides:

- [Add Examples](https://python.langchain.com/docs/how_to/extraction_examples): Learn how to use **reference examples** to improve performance.
- [Handle Long Text](https://python.langchain.com/docs/how_to/extraction_long_text): What should you do if the text does not fit into the context window of the LLM?
- [Use a Parsing Approach](https://python.langchain.com/docs/how_to/extraction_parse): Use a prompt-based approach to extract with models that do not support **tool/function calling**.