**WIP for sections**

**What**
- Extract structured information from input text.
- Note: models that support function/tool calling are often preferable for this task, as they tend to return structured output more reliably

**Use cases**
- Extract Data From Calls & Video Transcripts/Interviews - [link](https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/Working%20With%20Call%20or%20Video%20Transcripts.ipynb)
- Generate a synthetic dataset for NER

**Approaches**
- Write your own prompt + output_parser
- (If supported) leverage the function-calling feature of the model
- Leverage libraries like Kor
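
To make the first approach concrete, here is a minimal, library-free sketch of "prompt + output parser". It is not the LangChain implementation used later in this notebook: the `Person` dataclass, `build_prompt`, `parse_person`, and the hard-coded `fake_reply` are all hypothetical stand-ins so the idea can run without an LLM call.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Person:
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


def build_prompt(text: str) -> str:
    """Ask for one JSON object whose keys are the dataclass field names."""
    keys = ", ".join(f.name for f in fields(Person))
    return (
        "Extract the following attributes from the text and reply with a "
        f"single JSON object with keys: {keys}. Use null for unknown values.\n\n"
        f"Text: {text}"
    )


def parse_person(raw_reply: str) -> Person:
    """Parse the model reply, keeping only keys that the schema defines."""
    data = json.loads(raw_reply)
    allowed = {f.name for f in fields(Person)}
    return Person(**{k: v for k, v in data.items() if k in allowed})


# Hypothetical model reply, hard-coded so the sketch runs offline.
fake_reply = (
    '{"name": "Alan Smith", "hair_color": "blond", '
    '"height_in_meters": "1.83", "age": null}'
)
person = parse_person(fake_reply)
print(build_prompt("Alan Smith is 6 feet tall and has blond hair."))
print(person)
```

The parser drops the unexpected `age` key instead of failing, which is one design choice among several; a stricter parser could raise on unknown keys.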

**Challenges**
- Getting reliably structured output from LLMs: LLMs are stochastic, so results are not deterministic, which is a problem for business use
- Incomplete responses when the output hits the token limit, which then cause parsing errors
- Long text/files
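
One common way to deal with long text is to split it into overlapping chunks, run the extraction on each chunk, and merge the results. The sketch below is only an illustration under stated assumptions: `split_with_overlap` and `merge_people` are hypothetical helpers, chunks are measured in characters (not tokens), and merged records are keyed by the person's name, which will not hold if two people share a name.

```python
from typing import Dict, List, Optional


def split_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into character windows; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def merge_people(
    per_chunk: List[List[Dict[str, Optional[str]]]]
) -> List[Dict[str, Optional[str]]]:
    """Merge per-chunk person dicts by name, keeping the first non-null value per field."""
    merged: Dict[str, Dict[str, Optional[str]]] = {}
    for people in per_chunk:
        for person in people:
            name = person.get("name")
            if name is None:
                continue  # nothing to key the record on
            record = merged.setdefault(name, {})
            for key, value in person.items():
                if record.get(key) is None:
                    record[key] = value
    return list(merged.values())


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap reduces the chance that a fact is cut in half at a chunk boundary, at the cost of some duplicated extraction work that the merge step has to reconcile.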


**Going Deeper - Research**

**Examples**

**Others**
- langchain-extract [github](https://github.com/langchain-ai/langchain-extract)
- kor library [github](https://github.com/eyurtsev/kor)


In [2]:
from dotenv import load_dotenv
from pathlib import Path

dotenv_path = Path("/home/ubuntu/config.env")

load_dotenv(dotenv_path=dotenv_path)

True

### simple

1. define the schema to extract
2. construct extraction_chain
3. invoke

In [35]:
# 1. define the schema

from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height measured in meters")

In [36]:
# 2. a. construct prompt
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    (
        "system", 
        "You are an expert at extracting relevant information from text. "
        "If you do not know the value of an attribute asked to extract, "
        "return null for the attribute's value."
        "\n{format_instructions}",
    ),
    (
        "human",
        "{text}",
    )
])

prompt.pretty_print()


You are an expert at extracting relevant information from text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.
{format_instructions}


{text}


In [37]:
# 2. b. define the model
import boto3
from langchain_community.chat_models import BedrockChat

llm = BedrockChat(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    # model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    client=boto3.client("bedrock-runtime"),
    model_kwargs={"temperature": 0.0, "max_tokens":512}
)

In [38]:
# 2. c. define the output parser
from langchain.output_parsers import PydanticOutputParser

output_parser = PydanticOutputParser(pydantic_object=Person)
print(output_parser.get_format_instructions())

prompt = prompt.partial(format_instructions=output_parser.get_format_instructions())
# prompt.pretty_print()

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Information about a person.", "properties": {"name": {"title": "Name", "description": "The name of the person", "type": "string"}, "hair_color": {"title": "Hair Color", "description": "The color of the person's hair if known", "type": "string"}, "height_in_meters": {"title": "Height In Meters", "description": "Height measured in meters", "type": "string"}}, "required": ["name", "hair_color", "height_in_meters"]}
```


In [39]:
# 2. d. construct the extraction chain
extract_chain = prompt | llm | output_parser

In [40]:
# 3. invoke
text = "Alan Smith is 6 feet tall and has blond hair."
extract_chain.invoke({"text": text})

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')

In [41]:
# notice the issue: chain should be able to extract 2 Person entities.
text = "Alan Smith is 6 feet tall and has blond hair. Aastha has beautiful black hair."
extract_chain.invoke({"text": text})

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')

In [46]:
# we will redefine the schema to get list of Person entities
from typing import List
class Persons(BaseModel):
    "Extracted data about people"
    people: List[Person]

output_parser = PydanticOutputParser(pydantic_object=Persons)
prompt = prompt.partial(format_instructions=output_parser.get_format_instructions())
extract_chain = prompt | llm | output_parser

text = "Alan Smith is 6 feet tall and has blond hair. Aastha has beautiful black hair."
extract_chain.invoke({"text": text}) # Voila !

Persons(people=[Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83'), Person(name='Aastha', hair_color='black', height_in_meters=None)])

In [48]:
# notice the issue: quality of extractions for complicated inputs
text = "Aastha has beautiful black hair. Alan Smith and has blond hair and used to be 6 feet tall as teenager, but grew 4 more inches after graduating from Princeton."

extract_chain.invoke({"text": text}) # Alan's height is missed (None)

Persons(people=[Person(name='Aastha', hair_color='black', height_in_meters=None), Person(name='Alan Smith', hair_color='blond', height_in_meters=None)])

### **Next steps**
1. Add examples - observation: extraction quality is not reliable
2. Handle long text
3. Handle files
4. Use a better parsing approach - observation: frequent parsing errors
5. Guidelines

### add examples

In [83]:
# one way to improve extraction quality is to provide reference examples to the LLM
# 1. a. start by editing the previous prompt
# 1. b. create examples -> convert to list of Messages and feed it to prompt

from langchain_core.prompts import MessagesPlaceholder
from langchain.schema import AIMessage, HumanMessage

prompt = ChatPromptTemplate.from_messages([
    (
        "system", 
        "You are an expert at extracting relevant information from text. "
        "If you do not know the value of an attribute asked to extract, "
        "return null for the attribute's value."
        "\n{format_instructions}\n"
        "Below are examples: "
    ),
    MessagesPlaceholder(variable_name="examples"),
    (
        "human",
        "{text}",
    )
])

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        Person(name=None, height_in_meters=None, hair_color=None),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Person(name="Fiona", height_in_meters=None, hair_color=None),
    ),
    (
        "Alan Smith was 5 feet tall as teenager, but grew 1 more feet after graduating from Princeton.",
        Person(name="Alan Smith", height_in_meters="1.83", hair_color=None),
    )
]

examples_messages = []
for ex in examples:
    examples_messages.append(HumanMessage(content=ex[0], role="H"))
    examples_messages.append(AIMessage(content=ex[1].json(), role="A"))

print(examples_messages)

prompt = prompt.partial(examples=examples_messages)
# prompt.pretty_print()

[HumanMessage(content="The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.", role='H'), AIMessage(content='{"name": null, "hair_color": null, "height_in_meters": null}', role='A'), HumanMessage(content='Fiona traveled far from France to Spain.', role='H'), AIMessage(content='{"name": "Fiona", "hair_color": null, "height_in_meters": null}', role='A'), HumanMessage(content='Alan Smith was 5 feet tall as teenager, but grew 1 more feet after graduating from Princeton.', role='H'), AIMessage(content='{"name": "Alan Smith", "hair_color": null, "height_in_meters": "1.83"}', role='A')]
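
Note that the examples above serialize a single `Person`, while the chain parses a `Persons` object with a `people` list. It may be more consistent to show the model examples in the same wrapper shape it is asked to produce. A minimal sketch of that idea, assuming plain dicts stand in for the pydantic models and `(role, content)` tuples stand in for `HumanMessage`/`AIMessage` (both are simplifications for illustration):

```python
import json
from typing import Dict, List, Optional, Tuple

PersonDict = Dict[str, Optional[str]]

# (input text, people expected in the output); an empty list means "no person found"
raw_examples: List[Tuple[str, List[PersonDict]]] = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        [],
    ),
    (
        "Fiona traveled far from France to Spain.",
        [{"name": "Fiona", "hair_color": None, "height_in_meters": None}],
    ),
]


def to_example_messages(
    examples: List[Tuple[str, List[PersonDict]]]
) -> List[Tuple[str, str]]:
    """Turn each example into a human turn followed by an AI turn in the wrapper shape."""
    messages = []
    for text, people in examples:
        messages.append(("human", text))
        messages.append(("ai", json.dumps({"people": people})))
    return messages


for role, content in to_example_messages(raw_examples):
    print(role, "->", content)
```

With this shape, a text that mentions no person maps to an empty `people` list rather than a person whose fields are all null.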


In [85]:
# 2. redefine the extract_chain; everything else remains the same
from typing import List
class Persons(BaseModel):
    "Extracted data about people"
    people: List[Person]

output_parser = PydanticOutputParser(pydantic_object=Persons)
prompt = prompt.partial(format_instructions=output_parser.get_format_instructions())
extract_chain = prompt | llm | output_parser

In [86]:
# text = "Alan Smith is 6 feet tall and has blond hair. Aastha has beautiful black hair."
text = "Aastha has beautiful black hair. Alan Smith and has blond hair and used to be 6 feet tall as teenager, but grew 4 more inches after graduating from Princeton."
extract_chain.invoke({"text": text}) # Voila !

Persons(people=[Person(name='Aastha', hair_color='black', height_in_meters=None), Person(name='Alan Smith', hair_color='blond', height_in_meters='1.93')])
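
The extracted heights can be sanity-checked with plain arithmetic: 6 ft is 6 × 0.3048 m ≈ 1.83 m, and 6 ft 4 in is 6 × 0.3048 m + 4 × 0.0254 m ≈ 1.93 m. A small sketch of such a check follows; `imperial_to_meters` and `height_matches` are hypothetical helper names, and the tolerance is an arbitrary choice.

```python
FEET_TO_METERS = 0.3048    # exact by definition
INCHES_TO_METERS = 0.0254  # exact by definition


def imperial_to_meters(feet: float, inches: float = 0.0) -> float:
    """Convert feet and inches to meters, rounded to two decimals."""
    return round(feet * FEET_TO_METERS + inches * INCHES_TO_METERS, 2)


def height_matches(extracted: str, feet: float, inches: float = 0.0,
                   tolerance: float = 0.02) -> bool:
    """Check an extracted height string against the expected conversion."""
    return abs(float(extracted) - imperial_to_meters(feet, inches)) <= tolerance


print(imperial_to_meters(6))        # 6 ft
print(imperial_to_meters(6, 4))     # 6 ft 4 in
print(height_matches("1.93", 6, 4))
```

A check like this is one cheap form of verification step: it flags extractions where the model copied the teenage height instead of applying the later growth.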

### handle files

In [88]:
# extract from file types - html, pdf, plain text etc
# parse content based on file type that user uploads - MimeTypeBasedParser

In [1]:
! pip install python-magic beautifulsoup4 lxml --quiet
! pip install -qU  "amazon-textract-caller>=0.2.0"
! pip install -qU amazon-textract-textractor

In [41]:
# data - html content
import requests

response = requests.get("https://en.wikipedia.org/wiki/Car")
html_data = response.content
html_data[:20]


b'<!DOCTYPE html>\n<htm'

In [38]:
# data - pdf content
import boto3
from langchain.document_loaders.parsers.pdf import AmazonTextractPDFParser
from langchain_community.document_loaders import Blob  # needed for Blob.from_path below
from textractor.data.text_linearization_config import TextLinearizationConfig

pdf_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"

parser = AmazonTextractPDFParser(
    # textract_features=["TABLES", "LAYOUT"], # some issue with tc.Textract_Features("TABLES")
    client=boto3.client("textract", region_name="us-east-2"),  # us-east-1 gives InvalidS3ObjectException. learning: the S3 object's region and the Textract region must match
    linearization_config=TextLinearizationConfig(
        hide_header_layout=True,
        hide_footer_layout=True,
        hide_figure_layout=True,
    ),
)

blob = Blob.from_path(path=pdf_path)
parser.parse(blob=blob)

[Document(page_content='LayoutParser: A Unified Toolkit for Deep\n\n\nLearning Based Document Image Analysis\n\n\nZejiang Shen¹ (X), Ruochen Zhang2, Melissa Dell³ Benjamin Charles Germain\n\n\nLee4, Jacob Carlson3 and Weining Li5\n\n\n1\n\n\nAllen Institute for AI\n\n\nshannons@allenai.org\n\n\n2\n\n\nBrown University\n\n\nruochen_zhang@brown.edu\n\n\n3\n\n\nHarvard University\n\n\n{melissadell,jacobcarlson}@fas.harvard.edu\n\n\n4\n\n\nUniversity of Washington\n\n\nbcgl@cs.washington.edu\n\n\n5\n\n\nUniversity of Waterloo\n\n\nw422li@uwaterloo.ca\n\n\nAbstract. Recent advances in document image analysis (DIA) have been\n\n\nprimarily driven by the application of neural networks. Ideally, research\n\n\noutcomes could be easily deployed in production and extended for further\n\n\ninvestigation. However, various factors like loosely organized codebases\n\n\nand sophisticated model configurations complicate the easy reuse of im-\n\n\nportant innovations by a wide audience. Though there hav

In [39]:
# data - text

with open("../data/state_of_the_union.txt") as file:
    text_data = file.read()

print(text_data[:20])

Madam Speaker, Madam


In [43]:
# mime parser for all 3 types of data

import magic
from langchain.document_loaders.parsers import (
    BS4HTMLParser, PDFMinerParser
)
from langchain.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders import Blob
from langchain.document_loaders.parsers.generic import MimeTypeBasedParser

pdf_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"

textract_parser = AmazonTextractPDFParser(
    # textract_features=["TABLES", "LAYOUT"], # some issue with tc.Textract_Features("TABLES")
    client=boto3.client("textract", region_name="us-east-2"),  # us-east-1 gives InvalidS3ObjectException. learning: the S3 object's region and the Textract region must match
    linearization_config=TextLinearizationConfig(
        hide_header_layout=True,
        hide_footer_layout=True,
        hide_figure_layout=True,
    ),
)

handlers = {
    # "application/pdf": PDFMinerParser(),
    "application/pdf": textract_parser,
    "text/plain": TextParser(),
    "text/html": BS4HTMLParser(features="html.parser"),
}

mime_parser = MimeTypeBasedParser(
    handlers=handlers,
    fallback_parser=None
)

mime = magic.Magic(mime=True)


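def parse_data(data=None, mime_type=None, pdf_path=None):
    """Parse in-memory data, or a PDF at `pdf_path`, into a list of Documents."""
    if pdf_path is None:
        # detect the MIME type first, so the blob is created with it
        if mime_type is None:
            mime_type = mime.from_buffer(data)
        blob = Blob.from_data(data=data, mime_type=mime_type)
    else:
        blob = Blob.from_path(path=pdf_path)
        mime_type = "application/pdf"

    # pick the parser registered for this MIME type
    parser = handlers[mime_type]
    docs = parser.parse(blob=blob)
    return docs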
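
The dispatch idea behind the handlers dictionary can be sketched with the standard library alone. This is not the LangChain `MimeTypeBasedParser`: it swaps in `mimetypes.guess_type` (extension-based guessing) instead of `python-magic` (content sniffing), and `parse_file`, `HANDLERS`, and the two tiny parsers are hypothetical. The point it illustrates is failing with a clear message when no handler is registered, rather than a bare `KeyError`.

```python
import mimetypes
from html.parser import HTMLParser
from typing import Callable, Dict


class _TextExtractor(HTMLParser):
    """Collect text nodes; note that script/style contents are not filtered out."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def parse_plain(data: bytes) -> str:
    return data.decode("utf-8")


def parse_html(data: bytes) -> str:
    extractor = _TextExtractor()
    extractor.feed(data.decode("utf-8"))
    extractor.close()
    return "".join(extractor.parts)


HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": parse_plain,
    "text/html": parse_html,
}


def parse_file(filename: str, data: bytes) -> str:
    """Pick a handler from the file extension and fail loudly if none is registered."""
    mime_type, _ = mimetypes.guess_type(filename)
    handler = HANDLERS.get(mime_type)
    if handler is None:
        raise ValueError(
            f"Unsupported MIME type {mime_type!r} for {filename!r}; "
            f"supported: {sorted(HANDLERS)}"
        )
    return handler(data)


print(parse_file("page.html", b"<p>Hi <b>there</b></p>"))
```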

In [46]:
# docs = parse_data(data=html_data)
# docs = parse_data(data=text_data)
docs = parse_data(pdf_path=pdf_path)
print(docs[0].page_content)

LayoutParser: A Unified Toolkit for Deep


Learning Based Document Image Analysis


Zejiang Shen¹ (X), Ruochen Zhang2, Melissa Dell³ Benjamin Charles Germain


Lee4, Jacob Carlson3 and Weining Li5


1


Allen Institute for AI


shannons@allenai.org


2


Brown University


ruochen_zhang@brown.edu


3


Harvard University


{melissadell,jacobcarlson}@fas.harvard.edu


4


University of Washington


bcgl@cs.washington.edu


5


University of Waterloo


w422li@uwaterloo.ca


Abstract. Recent advances in document image analysis (DIA) have been


primarily driven by the application of neural networks. Ideally, research


outcomes could be easily deployed in production and extended for further


investigation. However, various factors like loosely organized codebases


and sophisticated model configurations complicate the easy reuse of im-


portant innovations by a wide audience. Though there have been on-going


efforts to improve reusability and simplify deep learning (DL) model


develop

### parsing

In [None]:
# 1. add one of the lines below to the prompt, before {format_instructions}
# option1: Wrap the output in `json` tags
# option2: "Make sure to wrap the answer in ```json and ``` tags",

# 2. custom parser
import json
import re
from typing import List

from langchain.schema import AIMessage


def extract_json(message: AIMessage) -> List[dict]:
    """Extract JSON content embedded between ```json and ``` tags in a model reply.

    Parameters:
        message (AIMessage): The model reply whose content contains the JSON blocks.

    Returns:
        list: A list of parsed JSON objects, one per ```json block.
    """
    text = message.content
    # Match everything between a ```json opener and the next ``` (non-greedy)
    pattern = r"```json(.*?)```"

    # Find all non-overlapping matches; re.DOTALL lets "." span newlines
    matches = re.findall(pattern, text, re.DOTALL)

    # Parse each block after stripping leading and trailing whitespace
    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception as exc:
        raise ValueError(f"Failed to parse: {message}") from exc

# 3. Use Kor library
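
Another option, when the model does not wrap its answer in tags, is to scan the reply for the first decodable JSON objects. The sketch below uses only the standard library's `json.JSONDecoder.raw_decode`; `find_json_objects` is a hypothetical helper, not part of LangChain or Kor.

```python
import json
from typing import Any, List


def find_json_objects(text: str) -> List[Any]:
    """Return every JSON object that can be decoded starting at some '{' in text."""
    decoder = json.JSONDecoder()
    found = []
    index = 0
    while True:
        start = text.find("{", index)
        if start == -1:
            break
        try:
            # raw_decode returns the parsed value and the index where it ended
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            index = start + 1  # not valid from this brace, try the next one
            continue
        found.append(obj)
        index = end  # skip past the object we just decoded
    return found


reply = 'Sure! Here is the result:\n{"people": [{"name": "Aastha"}]}\nLet me know.'
print(find_json_objects(reply))

truncated = '{"people": [{"name": "Alan'
print(find_json_objects(truncated))
```

An empty result is a signal that the reply was probably cut off (for example by `max_tokens`), so the caller can retry with a larger limit. One caveat: if the outer object is truncated but an inner object is complete, the inner object is returned on its own, so the caller should still validate the shape.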

### guidelines

- set temperature=0
- work on prompting
    - schema - if it is too big, break it into multiple smaller schemas -> run separate extractions and merge the results
    - reference examples - if there are too many, use semantic search to pick the relevant ones
- start evaluating and benchmarking early
    - use langsmith
    - langchain-benchmarks
    - if aiming for perfect quality -> plan for a human in the loop
- add verification/correction steps -> ask LLMs to verify or correct the results
- if extracting info from a single structured source, using an LLM is probably not a good idea -> traditional web scraping will be much cheaper and more reliable.
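
As a rough illustration of picking relevant reference examples, the sketch below ranks examples by token overlap (Jaccard similarity) with the input text. This is a lexical stand-in, not true semantic search: a real setup would likely compare embeddings instead. `tokenize`, `jaccard`, and `select_examples` are hypothetical names.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_examples(query: str, examples: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k examples whose input text overlaps most with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        examples,
        key=lambda ex: jaccard(query_tokens, tokenize(ex[0])),
        reverse=True,
    )
    return ranked[:k]


examples = [
    ("The ocean is vast and blue.", '{"people": []}'),
    ("Fiona traveled far from France to Spain.", '{"people": [{"name": "Fiona"}]}'),
    (
        "Alan Smith was 5 feet tall as teenager, but grew 1 more feet after graduating from Princeton.",
        '{"people": [{"name": "Alan Smith", "height_in_meters": "1.83"}]}',
    ),
]
query = "Bob used to be 6 feet tall as teenager, but grew 4 more inches after graduating from Princeton."
best = select_examples(query, examples, k=1)
print(best[0][0])
```

Selecting only the closest examples keeps the prompt short while still showing the model the tricky pattern (here, a height that changes over time).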