<a href="https://colab.research.google.com/github/andreabenevenut/LLM_workshop/blob/main/notebooks/2_Parsing_structured_information_via_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Output Parsers



One compelling use case for large language models (LLMs) is extracting pertinent information from text and parsing it into structured formats.

Textual data comes in various forms (articles, research papers, social media posts, and more) and often contains critical insights buried in unstructured content. Extracting this information manually can be time-consuming and error-prone. This is where language models with natural language understanding capabilities can play an important role.

# 0: SET UP

In [None]:
!git clone https://github.com/andreabenevenut/LLM_workshop

In [None]:
!pip install -r "/content/LLM_workshop/requirements.txt"

In [None]:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [None]:
from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.indexes.vectorstore import VectorstoreIndexCreator

# 1: JSON Output Parser
Suppose that we have a piece of text and we want to extract information from it in a specific format.

One of the most popular formats is JSON.

To do that, we simply need to specify the desired output schema and use one of the chat models that LangChain provides to interact with LLMs.

In [None]:
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

In [None]:
sample_text = """
  Alex (25 years old) is 1.83 meters tall and loves playing basketball. He comes from Orlando and has quite a big family (5 siblings). He has a red Audi and works in the financial sector.

  His friend, Claudia, is one year younger than him and has the passion for travelling. She has a dog named Nellie and owns a green BMW. Claudia is very outgoing, sporty and has a good sense of humor.

  Ben, Claudia's dad, is 63 and is into poetry and music.
  His sophisticated silver Volvo S90 gleams in the driveway, reflecting his appreciation for elegance and refinement. He worked as a librarian for almost 40 years at the local school in town.
  """

In [None]:
# LLM
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# Schema
schema = {
    "properties": {
        "name": {"type": "string"},
        "hobby": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Extraction chain
chain = create_extraction_chain(schema, llm)

chain.run(sample_text)

In [None]:
chain
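Because the model's output is not guaranteed to follow the schema, it can help to check each extracted record before using it. The following is a minimal sketch in plain Python, assuming the chain returns a list of dictionaries keyed by the schema's property names. `validate_record` and `JSON_TO_PYTHON` are hypothetical helpers written for this illustration, not part of LangChain, and the records are hand-written rather than real model output.

```python
# Hypothetical helper for illustration; it is not part of LangChain.
JSON_TO_PYTHON = {"string": str, "integer": int}

schema = {
    "properties": {
        "name": {"type": "string"},
        "hobby": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}


def validate_record(record, schema):
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []
    # Every required field must be present and not None.
    for key in schema.get("required", []):
        if record.get(key) is None:
            problems.append(f"missing required field: {key}")
    # Every value that is present must match the declared type.
    for key, value in record.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected field: {key}")
            continue
        if value is None:
            continue
        expected = JSON_TO_PYTHON[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{key}: expected {spec['type']}, got {type(value).__name__}")
    return problems


# Hand-written records for illustration, not actual model output.
records = [
    {"name": "Alex", "hobby": "basketball", "age": 25},
    {"name": "Ben", "age": "63"},
    {"hobby": "music"},
]
for record in records:
    print(record, "->", validate_record(record, schema) or "ok")
```

The sketch only covers the two types used in this notebook (`string` and `integer`); a full JSON Schema validator would handle many more cases.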

### Multiple entity types

It is also possible to define multiple entities and require our LLM to extract information about each entity separately. To do that, we can make use of prefixes that will indicate the entity name.

Suppose that we want to differentiate between people and cars.

We can add `person_` and `car_` prefixes to each property:

In [None]:
schema = {
    "properties": {
        "person_name": {"type": "string"},
        "person_hobby": {"type": "string"},
        "person_age": {"type": "integer"},
        "car_model": {"type": "string"},
        "car_color": {"type": "string"},
    },
    "required": ["person_name", "person_age", "car_model"],
}

chain = create_extraction_chain(schema, llm)

chain.run(sample_text)
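With prefixed property names, each extracted record is a flat dictionary that mixes person and car fields. A small post-processing step can regroup them by entity. The sketch below assumes records shaped like the schema above; `split_by_prefix` is a hypothetical helper written for this illustration, and the record is hand-written rather than real model output.

```python
def split_by_prefix(record, prefixes=("person", "car")):
    """Group the fields of one flat record by their entity prefix."""
    grouped = {prefix: {} for prefix in prefixes}
    for key, value in record.items():
        # "person_name" -> prefix "person", field "name"
        prefix, _, field = key.partition("_")
        if prefix in grouped and field:
            grouped[prefix][field] = value
    return grouped


# Hand-written record for illustration, not actual model output.
record = {
    "person_name": "Alex",
    "person_age": 25,
    "car_model": "Audi",
    "car_color": "red",
}
grouped = split_by_prefix(record)
print(grouped["person"])
print(grouped["car"])
```

Keys that do not start with one of the known prefixes are silently dropped in this sketch; whether to keep or report them is a design choice.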

### Extra information

The power of function calling (relative to using parsers alone) lies in the ability to perform semantic extraction.

In particular, *we can ask for things that are not explicitly enumerated in the schema*.

Suppose we want unspecified additional information about each person's family and dog.

We can add loosely typed string fields as placeholders for unstructured extraction, here `person_family` and `person_dog`.

In [None]:
schema = {
    "properties": {
        "person_name": {"type": "string"},
        "person_hobby": {"type": "string"},
        "person_age": {"type": "integer"},
        "person_family": {"type": "string"},
        "person_dog": {"type": "string"},
        "car_model": {"type": "string"},
        "car_color": {"type": "string"},
    },
    "required": ["person_name", "person_age", "car_model"],
}

chain = create_extraction_chain(schema, llm)

output = chain.run(sample_text)
output

In [None]:
import pandas as pd
df = pd.DataFrame(output)
df

### Pydantic

Pydantic is a data validation and settings management library for Python.

It allows you to create data classes with attributes that are automatically validated when you instantiate an object.

Let's define a class with attributes annotated with types.

In [None]:
from typing import Optional, List
from langchain.chains import create_extraction_chain_pydantic
from langchain.pydantic_v1 import BaseModel

# Pydantic data class
class Properties(BaseModel):
    person_name: str
    person_hobby: List[str]
    person_age: int
    person_family: Optional[str]
    person_dog: Optional[str]
    car_model: Optional[str]
    car_color: Optional[str]

# Extraction
chain = create_extraction_chain_pydantic(pydantic_schema=Properties, llm=llm, verbose=True)

# Run
output = chain.run(sample_text)

In [None]:
output

In [None]:
df = pd.DataFrame([vars(obj) for obj in output])
df

## Format instructions for a chain

Output parsers are classes that help structure language model responses.

As shown above, they are used to parse the output of the OpenAI function calls in `create_extraction_chain`.

However, they can also be used independently of function calling.
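To make the idea concrete, here is a rough sketch of what an output parser does when function calling is not involved: it takes the raw text of a model reply, locates the JSON payload, and decodes it. This is not LangChain's implementation; `parse_json_output` is a hypothetical helper written for this illustration, and the reply is hand-written rather than real model output.

```python
import json
import re

# Three backticks, built programmatically so this snippet stays fence-safe.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_output(text):
    """Extract and decode the JSON object contained in a model reply."""
    # Models often wrap JSON in a Markdown code fence; prefer its contents.
    fenced = FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text
    # Fall back to the outermost pair of braces in the remaining text.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])


# Hand-written reply for illustration, not actual model output.
reply = (
    "Here are the names:\n"
    + FENCE + "json\n"
    + '{"pet_names": [{"pet_name": "Sir Naps-a-Lot", "likelihood_of_success": 8}]}\n'
    + FENCE
)
parsed = parse_json_output(reply)
print(parsed["pet_names"][0]["pet_name"])
```

A real parser would additionally validate the decoded data against the target data class, which is the part Pydantic handles in the next section.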

### Pydantic

Just as above, let's parse a generation based on a Pydantic data class.

In [None]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import Sequence
from langchain.chains import LLMChain



llm = ChatOpenAI(temperature=1, model="gpt-3.5-turbo")

# Define your desired data structure.
class PetInfo(BaseModel):
    pet_name: str = Field(description="This is the name of the pet")
    reasoning: str = Field(description="These are the reasons for the score")
    likelihood_of_success: int = Field(description="This is an integer score between 1-10")

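    # You can add custom validation logic easily with Pydantic.
    @validator('likelihood_of_success')
    def check_score(cls, field):
        # Enforce the 1-10 range promised in the field description.
        if not 1 <= field <= 10:
            raise ValueError("Badly formed score: expected an integer between 1 and 10")
        return field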


class PetNames(BaseModel):
    pet_names: Sequence[PetInfo] = Field(..., description="The pet names")


# Set up a parser + inject instructions into the prompt template.
pydantic_parser = PydanticOutputParser(pydantic_object=PetNames)

format_instructions = pydantic_parser.get_format_instructions()


my_prompt = PromptTemplate(
    input_variables = ["pet_description", "format_instructions"],
    template = """
    You are a wizard at inventing adorable and hilarious names for pets!
    Your talent lies in crafting names that are both charming and amusing.
    Take the pet's description below, enclosed by triple backticks, and use it as inspiration to create 4 awesome names for the pet.

    pet description: ```{pet_description}```

    After crafting the new pet names, rate their potential success on a scale of 1 to 10 based on how catchy and memorable they are!

    {format_instructions}
    """
)

pet_chain = LLMChain(llm=llm, prompt=my_prompt)

In [None]:
answer = pet_chain({"pet_description": "My cat is 10 years old. He is quite grumpy if I do not give him food, and he sleeps a lot. He has gray hair and blue eyes.",
                    "format_instructions": format_instructions})
output = answer['text']
parsed_output = pydantic_parser.parse(output)
parsed_output

In [None]:
df = pd.DataFrame([dict(obj) for obj in parsed_output.pet_names])
df


## EXERCISE