<a href="https://colab.research.google.com/github/chrispoole70/langchain-tutorials/blob/main/extraction/extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Build an Extraction Chain](https://python.langchain.com/docs/tutorials/extraction/)

In [None]:
%pip install --upgrade langchain-core

In [None]:
%pip install -qU "langchain[openai]"

In [37]:
import os
from pprint import pprint
from typing import List, Optional

from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.utils.function_calling import tool_example_to_messages
from pydantic import BaseModel, Field

In [8]:
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

Define the schema for the information you want to extract from the unstructured data

In [4]:
class Person(BaseModel):
  """Information about a person."""

  name: Optional[str] = Field(default=None, description='The name of the person')
  hair_color: Optional[str] = Field(default=None, description='The color of the person\'s hair if known')
  height_in_meters: Optional[str] = Field(default=None, description='Height measured in meters')
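
The class docstring and each field's `description` are more than documentation: they end up in the JSON schema that Pydantic generates, and that schema is what gets passed to the model. As a quick check, the sketch below prints where those strings land. It assumes Pydantic v2, where `model_json_schema()` is available, and it redefines `Person` only so that it runs on its own.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description='The name of the person')
    hair_color: Optional[str] = Field(default=None, description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(default=None, description='Height measured in meters')


# Build the JSON schema (assumes Pydantic v2)
schema = Person.model_json_schema()

# The class docstring becomes the schema-level description
print(schema['description'])

# Each Field(description=...) becomes the description of that property
for field_name, field_schema in schema['properties'].items():
    print(f"{field_name}: {field_schema['description']}")
```

Clear, specific descriptions matter because they are the only guidance the model gets about what each attribute means.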

Create a prompt template

In [17]:
prompt_template = ChatPromptTemplate.from_messages(
    [
        ('system', 'You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute\'s value.'),
        ('human', '{text}')
    ]
)

Create a chat model and assign the output schema to it

In [10]:
llm = init_chat_model('gpt-4o-mini', model_provider='openai')

In [11]:
structured_llm = llm.with_structured_output(schema=Person)

In [43]:
type(structured_llm).__name__

'RunnableSequence'

In [14]:
len(structured_llm.steps)

2

Format the prompt template with unstructured data about a person

In [15]:
text = 'Alan Smith is 6 feet tall and has blond hair.'

In [18]:
prompt = prompt_template.invoke({'text': text})
prompt

ChatPromptValue(messages=[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.", additional_kwargs={}, response_metadata={}), HumanMessage(content='Alan Smith is 6 feet tall and has blond hair.', additional_kwargs={}, response_metadata={})])

Send the input message to the LLM

In [19]:
response = structured_llm.invoke(prompt)

The LLM extracted the person's name and hair color, and converted their height from feet to meters.

In [20]:
response

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')
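
The unit conversion is done by the model itself, so it is worth sanity-checking against a deterministic calculation. The helper below is a hypothetical illustration, not part of the tutorial. It returns a string because the schema declares `height_in_meters` as `Optional[str]`.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


def feet_to_meters(feet: float) -> str:
    """Convert feet to meters, formatted with two decimals as a string."""
    return f"{feet * FEET_TO_METERS:.2f}"


# 6 ft * 0.3048 m/ft = 1.8288 m, which rounds to 1.83
print(feet_to_meters(6))
```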

## Multiple Entities

In most cases, your schema should hold a list of entities rather than a single entity, because a text may mention several of them (or none). If the unstructured data describes multiple people, then one way to extract all of them is to nest `Person` inside another Pydantic class.

In [22]:
class Data(BaseModel):
  """Extracted data about people."""

  people: List[Person]
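
To see how the nested schema behaves without calling an LLM, the sketch below validates a hand-written dictionary against `Data`. The payload is an assumption made up for illustration, the field descriptions are trimmed for brevity, and Pydantic v2 (`model_validate`) is assumed.

```python
from typing import List, Optional

from pydantic import BaseModel


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


class Data(BaseModel):
    """Extracted data about people."""

    people: List[Person]


# Hand-written payload standing in for what a model might return
raw = {
    'people': [
        {'name': 'Jeff', 'hair_color': 'black', 'height_in_meters': '1.83'},
        {'name': 'Anna', 'hair_color': 'black'},  # height omitted on purpose
    ]
}

data = Data.model_validate(raw)
print(len(data.people))                 # number of extracted people
print(data.people[1].height_in_meters)  # a missing attribute falls back to None

# An empty list is also valid, which lets the model report "no people found"
empty = Data(people=[])
print(empty.people)
```

Wrapping the list in its own class means a text with no people can still produce a valid result instead of forcing the model to invent one.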

In [23]:
structured_llm = llm.with_structured_output(schema=Data)

Use unstructured data about multiple people

In [24]:
text = 'My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me.'

In [25]:
prompt = prompt_template.invoke({'text': text})
prompt

ChatPromptValue(messages=[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.", additional_kwargs={}, response_metadata={}), HumanMessage(content='My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me.', additional_kwargs={}, response_metadata={})])

In [26]:
response = structured_llm.invoke(prompt)

The output contains information about two people, with Anna's hair color inferred from Jeff's

In [27]:
response

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])

## Reference Examples

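Give the LLM examples of inputs and their expected outputs. One way to send several input/output pairs to an LLM is the `tool_example_to_messages` function, which converts each pair into a sequence of messages: the human input, an AI message containing the tool call, a tool message confirming that the tool was called correctly, and optionally a final AI response.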
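
The shape of the messages produced for a single example can be sketched in plain Python. The function below is a hypothetical stand-in that only mimics the sequence (human, AI tool call, tool confirmation, optional AI response) using plain dictionaries. It is not the library's implementation, and the dictionary keys are assumptions chosen for readability.

```python
import json
import uuid
from typing import Any, Dict, List, Optional


def sketch_example_messages(
    text: str,
    tool_name: str,
    tool_args: Dict[str, Any],
    ai_response: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Mimic the message sequence for one example (illustrative only)."""
    call_id = str(uuid.uuid4())  # ties the tool call to its confirmation
    messages: List[Dict[str, Any]] = [
        {'role': 'user', 'content': text},
        {
            'role': 'assistant',
            'content': '',
            'tool_calls': [
                {
                    'id': call_id,
                    'type': 'function',
                    'function': {'name': tool_name, 'arguments': json.dumps(tool_args)},
                }
            ],
        },
        {'role': 'tool', 'tool_call_id': call_id, 'content': 'You have correctly called this tool.'},
    ]
    if ai_response:
        messages.append({'role': 'assistant', 'content': ai_response})
    return messages


msgs = sketch_example_messages(
    'The ocean is vast and blue.', 'Data', {'people': []}, 'Detected no people.'
)
for m in msgs:
    # The tool-call message has empty content, so show its arguments instead
    print(m['role'], '->', m['content'] or m['tool_calls'][0]['function']['arguments'])
```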

In [30]:
examples = [
    (
        'The ocean is vast and blue. It\'s more than 20,000 feet deep.',
        Data(people=[])
    ),
    (
        'Fiona traveled far from France to Spain.',
        Data(people=[Person(name='Fiona', hair_color=None, height_in_meters=None)])
    )
]

In [39]:
for txt, tool_call in examples:
  print(f'Text: {txt}')
  print(f'Tool call: {tool_call}')

Text: The ocean is vast and blue. It's more than 20,000 feet deep.
Tool call: people=[]
Text: Fiona traveled far from France to Spain.
Tool call: people=[Person(name='Fiona', hair_color=None, height_in_meters=None)]


In [32]:
messages = []

for txt, tool_call in examples:
  if tool_call.people:
    ai_response = 'Detected people.'
  else:
    ai_response = 'Detected no people.'

  # Convert each (text, tool call) pair into human, AI tool-call, tool, and AI response messages
  messages.extend(tool_example_to_messages(input=txt, tool_calls=[tool_call], ai_response=ai_response))


In [38]:
for message in messages:
  pprint(message)

HumanMessage(content="The ocean is vast and blue. It's more than 20,000 feet deep.", additional_kwargs={}, response_metadata={})
AIMessage(content='', additional_kwargs={'tool_calls': [{'id': '17968c68-f73f-4d20-be5e-6353e7c7b7f3', 'type': 'function', 'function': {'name': 'Data', 'arguments': '{"people":[]}'}}]}, response_metadata={}, tool_calls=[{'name': 'Data', 'args': {'people': []}, 'id': '17968c68-f73f-4d20-be5e-6353e7c7b7f3', 'type': 'tool_call'}])
ToolMessage(content='You have correctly called this tool.', tool_call_id='17968c68-f73f-4d20-be5e-6353e7c7b7f3')
AIMessage(content='Detected no people.', additional_kwargs={}, response_metadata={})
HumanMessage(content='Fiona traveled far from France to Spain.', additional_kwargs={}, response_metadata={})
AIMessage(content='', additional_kwargs={'tool_calls': [{'id': '74cc3254-f739-4dea-addf-c92f2e786301', 'type': 'function', 'function': {'name': 'Data', 'arguments': '{"people":[{"name":"Fiona","hair_color":null,"height_in_meters":null

Without examples, an LLM sometimes returns incorrect output. For the text below, the LLM sometimes mistakes "earth" for a person's name.

In [41]:
message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon.",
}

structured_llm = llm.with_structured_output(schema=Data)
structured_llm.invoke([message_no_extraction])

Data(people=[])

Including examples of correct input/output pairs should make the LLM more accurate.

In [42]:
structured_llm.invoke(messages + [message_no_extraction])

Data(people=[])
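
Because the failure only happens sometimes, a single call cannot show whether the examples helped. One rough way to compare the two setups is to repeat the call and count how often any person is returned for a text that contains none. The sketch below assumes a hypothetical `extract` callable that returns a list of names. The stub never calls an LLM and exists only so the code runs on its own.

```python
from typing import Callable, List


def false_positive_rate(
    extract: Callable[[str], List[str]], text: str, trials: int = 5
) -> float:
    """Fraction of trials in which `extract` returns any names for `text`."""
    hits = sum(1 for _ in range(trials) if extract(text))
    return hits / trials


# Hypothetical stand-in for something like
# `[p.name for p in structured_llm.invoke(...).people]`; it always returns no names.
def stub_extract(text: str) -> List[str]:
    return []


rate = false_positive_rate(
    stub_extract, 'The solar system is large, but earth has only 1 moon.'
)
print(rate)
```

With a real model, the same function could be run once with and once without the example messages to compare the two rates.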