<a href="https://colab.research.google.com/github/andreabenevenut/LLM_workshop/blob/main/notebooks/2_Parsing_structured_information_via_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Output Parsers



One compelling use case for large language models (LLMs) is extracting pertinent information from text and parsing it into structured formats.

Textual data comes in many forms (articles, research papers, social media posts, and more), often with critical insights buried in unstructured content. Extracting this information manually is time-consuming and error-prone. This is where language models, with their natural language understanding capabilities, can play an important role.

# 0: SET UP

In [2]:
!git clone https://github.com/andreabenevenut/LLM_workshop

Cloning into 'LLM_workshop'...
remote: Enumerating objects: 63, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 63 (delta 20), reused 43 (delta 11), pack-reused 0
Receiving objects: 100% (63/63), 33.04 KiB | 593.00 KiB/s, done.
Resolving deltas: 100% (20/20), done.


In [3]:
!pip install -r "/content/LLM_workshop/requirements.txt"

Collecting langchain==0.0.340 (from -r /content/LLM_workshop/requirements.txt (line 1))
  Downloading langchain-0.0.340-py3-none-any.whl (2.0 MB)
Collecting openai==1.3.5 (from -r /content/LLM_workshop/requirements.txt (line 2))
  Downloading openai-1.3.5-py3-none-any.whl (220 kB)
Collecting chromadb==0.4.18 (from -r /content/LLM_workshop/requirements.txt (line 3))
  Downloading chromadb-0.4.18-py3-none-any.whl (502 kB)
Collecting tiktoken==0.5.1 (from -r /content/LLM_workshop/requirements.txt (line 4))
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)

In [4]:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [5]:
from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.indexes.vectorstore import VectorstoreIndexCreator

# 1: JSON Output Parser
Suppose that we have a piece of text and we want to extract information from it in a specific format.

One of the most popular formats is JSON.

To do that, we simply need to specify the desired output schema and use one of the models that LangChain provides for interacting with LLMs.

In [6]:
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

In [7]:
sample_text = """
  Alex (25 years old) is 1.83 meters tall and loves playing basketball. He comes from Orlando and has quite a big family (5 siblings). He has a red Audi and works in the financial sector.

  His friend, Claudia, is one year younger than him and has the passion for travelling. She has a dog named Nellie and owns a green BMW. Claudia is very outgoing, sporty and has a good sense of humor.

  Ben, Claudia's dad, is 63 and is into peotry and music.
  His sophisticated silver Volvo S90 gleams in the driveway, reflecting his appreciation for elegance and refinement. He worked as a librarian for almost 40 years at the local school in town.
  """

In [8]:
# LLM
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# Schema
schema = {
    "properties": {
        "name": {"type": "string"},
        "hobby": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Extraction chain
chain = create_extraction_chain(schema, llm)

chain.run(sample_text)

[{'name': 'Alex', 'hobby': 'playing basketball', 'age': 25},
 {'name': 'Claudia', 'hobby': 'travelling', 'age': 24},
 {'name': 'Ben', 'hobby': 'poetry and music', 'age': 63}]
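
The schema lists `name` and `age` as required and gives each property a type. As a minimal sketch of how the returned records could be checked against such a schema, the helper below uses only the standard library. `check_record`, `JSON_TYPES`, and `person_schema` are hypothetical names introduced here for illustration; they are not part of LangChain.

```python
from typing import List

# Hypothetical mapping from JSON-schema type names to Python types.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_record(record: dict, schema: dict) -> List[str]:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []
    # Every required key must be present.
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    # Every returned key must be declared and have the declared type.
    for key, value in record.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected field: {key}")
            continue
        expected = JSON_TYPES.get(spec["type"])
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{key} should be {spec['type']}, got {type(value).__name__}")
    return problems


# Same shape as the schema used in the cell above.
person_schema = {
    "properties": {
        "name": {"type": "string"},
        "hobby": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

good = {"name": "Alex", "hobby": "playing basketball", "age": 25}
bad = {"hobby": "chess", "age": "24"}  # made-up record: no name, age as a string

print(check_record(good, person_schema))
print(check_record(bad, person_schema))
```

A check like this is useful as a safety net, since the model's output is not guaranteed to respect the schema.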

In [9]:
chain

LLMChain(prompt=ChatPromptTemplate(input_variables=['input'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template="Extract and save the relevant entities mentioned in the following passage together with their properties.\n\nOnly extract the properties mentioned in the 'information_extraction' function.\n\nIf a property is not present and is not required in the function parameters, do not include it in the output.\n\nPassage:\n{input}\n"))]), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7c0235f5c5e0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7c0235f7eb00>, temperature=0.0, openai_api_key='<REDACTED>', openai_proxy=''), output_parser=JsonKeyOutputFunctionsParser(key_name='info'), llm_kwargs={'functions': [{'name': 'information_extraction', 'description': 'Extracts the relevant information from the passage.', 'parameters': {'type': 'ob

### Multiple entity types

It is also possible to define multiple entities and require our LLM to extract information about each entity separately. To do that, we can make use of prefixes that will indicate the entity name.

Suppose that we want to differentiate between people and cars.

We can add `person_` and `car_` prefixes to each property.

In [10]:
schema = {
    "properties": {
        "person_name": {"type": "string"},
        "person_hobby": {"type": "string"},
        "person_age": {"type": "integer"},
        "car_model": {"type": "string"},
        "car_color": {"type": "string"},
    },
    "required": ["person_name", "person_age", "car_model"],
}

chain = create_extraction_chain(schema, llm)

chain.run(sample_text)

[{'person_name': 'Alex',
  'person_age': 25,
  'person_hobby': 'playing basketball',
  'car_model': 'Audi',
  'car_color': 'red'},
 {'person_name': 'Claudia',
  'person_age': 24,
  'person_hobby': 'travelling',
  'car_model': 'BMW',
  'car_color': 'green'},
 {'person_name': 'Ben',
  'person_age': 63,
  'person_hobby': 'poetry and music',
  'car_model': 'Volvo S90',
  'car_color': 'silver'}]
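
Because the entity is encoded only in the key prefix, the result is still one flat dict per person. The sketch below shows one way to regroup such a record into one dict per entity using plain Python. `split_by_prefix` is a hypothetical helper written for this example, not a LangChain function.

```python
from typing import Dict, Iterable


def split_by_prefix(record: dict, prefixes: Iterable[str]) -> Dict[str, dict]:
    """Group the keys of a flat record by entity prefix.

    For example, 'car_color' ends up as grouped['car']['color'].
    Keys that match none of the prefixes are dropped.
    """
    grouped: Dict[str, dict] = {prefix: {} for prefix in prefixes}
    for key, value in record.items():
        for prefix in grouped:
            marker = prefix + "_"
            if key.startswith(marker):
                grouped[prefix][key[len(marker):]] = value
                break
    return grouped


# First record from the output above.
flat = {
    "person_name": "Alex",
    "person_age": 25,
    "person_hobby": "playing basketball",
    "car_model": "Audi",
    "car_color": "red",
}

nested = split_by_prefix(flat, ["person", "car"])
print(nested["person"])
print(nested["car"])
```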

### Extra information

The power of functions (relative to using parsers alone) lies in the ability to perform semantic extraction.

In particular, *we can ask for things that are not explicitly enumerated in the schema*.

Suppose we want additional, loosely specified information about each person's family and dog.

We can add free-form string placeholders for this unstructured extraction, here `person_family` and `person_dog`.

In [11]:
schema = {
    "properties": {
        "person_name": {"type": "string"},
        "person_hobby": {"type": "string"},
        "person_age": {"type": "integer"},
        "person_family": {"type": "string"},
        "person_dog": {"type": "string"},
        "car_model": {"type": "string"},
        "car_color": {"type": "string"},
    },
    "required": ["person_name", "person_age", "car_model"],
}

chain = create_extraction_chain(schema, llm)

output = chain.run(sample_text)
output

[{'person_name': 'Alex',
  'person_age': 25,
  'person_hobby': 'playing basketball',
  'person_family': '5 siblings',
  'car_model': 'Audi',
  'car_color': 'red'},
 {'person_name': 'Claudia',
  'person_age': 24,
  'person_hobby': 'travelling',
  'person_dog': 'Nellie',
  'car_model': 'BMW',
  'car_color': 'green'},
 {'person_name': 'Ben',
  'person_age': 63,
  'person_hobby': 'poetry and music',
  'car_model': 'Volvo S90',
  'car_color': 'silver'}]

In [12]:
import pandas as pd
df = pd.DataFrame(output)
df

|   | person_name | person_age | person_hobby | person_family | car_model | car_color | person_dog |
|---|---|---|---|---|---|---|---|
| 0 | Alex | 25 | playing basketball | 5 siblings | Audi | red | |
| 1 | Claudia | 24 | travelling | | BMW | green | Nellie |
| 2 | Ben | 63 | poetry and music | | Volvo S90 | silver | |
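
Optional fields are simply absent from a record when the model finds nothing to extract, and `pd.DataFrame` fills those gaps with `NaN`. As a rough sketch of the same idea without pandas, the hypothetical helper `fill_missing` below gives every record the same keys so that downstream code does not need to handle missing ones.

```python
def fill_missing(records, fill_value=None):
    """Return copies of the records that all share the same keys.

    Columns are collected in first-seen order; absent values get `fill_value`.
    """
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return [{col: record.get(col, fill_value) for col in columns} for record in records]


# Shortened versions of two records from the output above.
records = [
    {"person_name": "Alex", "person_family": "5 siblings"},
    {"person_name": "Claudia", "person_dog": "Nellie"},
]

rows = fill_missing(records)
for row in rows:
    print(row)
```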


### Pydantic

Pydantic is a data validation and settings management library for Python.

It allows you to create data classes with attributes that are automatically validated when you instantiate an object.
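
To make the validation behavior concrete, here is a minimal sketch that is independent of LangChain. The `Pet` class and its fields are made up for illustration, and the example assumes the `pydantic` package is installed.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Pet(BaseModel):  # hypothetical model, used only for this example
    name: str
    age: int
    owner: Optional[str] = None  # optional field with an explicit default


# Valid data: the object is created and its attributes are typed.
pet = Pet(name="Nellie", age=3)
print(pet.name, pet.age, pet.owner)

# Invalid data: "three" cannot be interpreted as an integer,
# so Pydantic raises a ValidationError at instantiation time.
rejected = False
try:
    Pet(name="Nellie", age="three")
except ValidationError as err:
    rejected = True
    print("rejected with", len(err.errors()), "error(s)")
```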

Let's define a class with attributes annotated with types.

In [13]:
from typing import Optional, List
from langchain.chains import create_extraction_chain_pydantic
from langchain.pydantic_v1 import BaseModel

# Pydantic data class
class Properties(BaseModel):
    person_name: str
    person_hobby: List[str]
    person_age: int
    person_family: Optional[str]
    person_dog: Optional[str]
    car_model: Optional[str]
    car_color: Optional[str]

# Extraction
chain = create_extraction_chain_pydantic(pydantic_schema=Properties, llm=llm, verbose=True)

# Run
output = chain.run(sample_text)



> Entering new LLMChain chain...
Prompt after formatting:
Human: Extract and save the relevant entities mentioned in the following passage together with their properties.

Only extract the properties mentioned in the 'information_extraction' function.

If a property is not present and is not required in the function parameters, do not include it in the output.

Passage:

  Alex (25 years old) is 1.83 meters tall and loves playing basketball. He comes from Orlando and has quite a big family (5 siblings). He has a red Audi and works in the financial sector.
  
  His friend, Claudia, is one year younger than him and has the passion for travelling. She has a dog named Nellie and owns a green BMW. Claudia is very outgoing, sporty and has a good sense of humor.
  
  Ben, Claudia's dad, is 63 and is into peotry and music.
  His sophisticated silver Volvo S90 gleams in the driveway, reflecting his appreciation for elegance and refinement. He worked as a librarian for alm

In [14]:
output

[Properties(person_name='Alex', person_hobby=['playing basketball'], person_age=25, person_family='big', person_dog=None, car_model='Audi', car_color='red'),
 Properties(person_name='Claudia', person_hobby=['travelling'], person_age=24, person_family='', person_dog='Nellie', car_model='BMW', car_color='green'),
 Properties(person_name='Ben', person_hobby=['poetry', 'music'], person_age=63, person_family='', person_dog=None, car_model='Volvo S90', car_color='silver')]

In [15]:
df = pd.DataFrame([vars(obj) for obj in output])
df

|   | person_name | person_hobby | person_age | person_family | person_dog | car_model | car_color |
|---|---|---|---|---|---|---|---|
| 0 | Alex | [playing basketball] | 25 | big | | Audi | red |
| 1 | Claudia | [travelling] | 24 | | Nellie | BMW | green |
| 2 | Ben | [poetry, music] | 63 | | | Volvo S90 | silver |


## Format instructions for a chain

Output parsers are classes that help structure language model responses.

As shown above, they are used to parse the output of the OpenAI function calls in `create_extraction_chain`.

However, they can also be used independently of function calling.
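
Without function calling, the usual pattern is to put format instructions into the prompt and then parse the plain-text reply. The sketch below illustrates that round trip with the standard library only. `build_format_instructions` and `parse_reply` are hypothetical helpers, not LangChain's implementation, and the model reply is a hard-coded stand-in, so no LLM is called.

```python
import json


def build_format_instructions(fields: dict) -> str:
    """Describe the expected JSON reply; `fields` maps key -> short description."""
    lines = [f'  "{name}": {desc}' for name, desc in fields.items()]
    return "Reply with a single JSON object with these keys:\n{\n" + ",\n".join(lines) + "\n}"


def parse_reply(reply: str, fields: dict) -> dict:
    """Extract the JSON object from a model reply and check that all keys are present."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


fields = {"pet_name": "string", "score": "integer from 1 to 10"}

# This text would be appended to the prompt sent to the model.
instructions = build_format_instructions(fields)
print(instructions)

# Stand-in for a model response that wraps the JSON in extra chatter.
fake_reply = 'Sure! Here you go:\n{"pet_name": "Grumpykins", "score": 9}'
parsed = parse_reply(fake_reply, fields)
print(parsed)
```

Locating the JSON between the first `{` and the last `}` is a deliberately simple choice that tolerates surrounding text; a production parser would need to be stricter.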

### Pydantic

Just as above, let's parse a generation based on a Pydantic data class.

In [29]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import Sequence
from langchain.chains import LLMChain



llm = ChatOpenAI(temperature=1, model="gpt-3.5-turbo")

# Define your desired data structure.
class PetInfo(BaseModel):
    pet_name: str = Field(description="This is the name of the pet")
    reasoning: str = Field(description="These are the reasons for the score")
    likelihood_of_success: int = Field(description="This is an integer score between 1-10")

    # You can add custom validation logic easily with Pydantic.
    @validator('likelihood_of_success')
    def check_score(cls, field):
        # Reject scores outside the 1-10 range promised in the field description.
        if not 1 <= field <= 10:
            raise ValueError("Badly formed Score")
        return field


class PetNames(BaseModel):
    pet_names: Sequence[PetInfo] = Field(..., description="The pet names")


# Set up a parser + inject instructions into the prompt template.
pydantic_parser = PydanticOutputParser(pydantic_object=PetNames)

format_instructions = pydantic_parser.get_format_instructions()


my_prompt = PromptTemplate(
    input_variables = ["pet_description", "format_instructions"],
    template = """
    You are a wizard at inventing adorable and hilarious names for pets!
    Your talent lies in crafting names that are both charming and amusing.
    Take the pet's description below, enclosed by triple backticks, and use it as inspiration to create 4 awesome names for the pet.

    pet description: ```{pet_description}```

    After crafting the new pet names, rate their potential success on a scale of 1 to 10 based on how catchy and memorable they are!

    {format_instructions}
    """
)

pet_chain = LLMChain(llm=llm, prompt=my_prompt)

In [30]:
answer = pet_chain({"pet_description": "My cat is 10 years old, it is quite grumpy if I do not give him food, sleeps a lot. He has a gray hair and blue eyes.",
                    "format_instructions": format_instructions})
output = answer['text']
parsed_output = pydantic_parser.parse(output)
parsed_output

PetNames(pet_names=[PetInfo(pet_name='Grumpykins', reasoning="The name reflects the cat's grumpy nature and adds a cute and playful twist with the suffix 'kins'.", likelihood_of_success=9), PetInfo(pet_name='Sleepy Smokey', reasoning="The name combines the cat's love for sleeping with its gray hair, creating a memorable and endearing name.", likelihood_of_success=8), PetInfo(pet_name='Blueberry Grump', reasoning="The name incorporates the cat's blue eyes and grumpy personality, resulting in a fun and catchy name.", likelihood_of_success=7), PetInfo(pet_name='Sir Purrington', reasoning="The name adds a touch of sophistication and humor to the cat's grumpy demeanor, making it both charming and amusing.", likelihood_of_success=9)])

In [32]:
df = pd.DataFrame([dict(obj) for obj in parsed_output.pet_names])
df


|   | pet_name | reasoning | likelihood_of_success |
|---|---|---|---|
| 0 | Grumpykins | The name reflects the cat's grumpy nature and ... | 9 |
| 1 | Sleepy Smokey | The name combines the cat's love for sleeping ... | 8 |
| 2 | Blueberry Grump | The name incorporates the cat's blue eyes and ... | 7 |
| 3 | Sir Purrington | The name adds a touch of sophistication and hu... | 9 |


## EXERCISE