# 📖 TABLE OF CONTENTS

- [1. Introduction]()
- [2. Installing Dependencies]()
- [3. Mount Google Drive & Load API Keys]()
- [4. Output Parsers]()
  - [1. `PydanticOutputParser`]()
    - [Multiple Outputs Example]()
  - [2. `CommaSeparatedListOutputParser`]()
  - [3. `StructuredOutputParser`]()
- [5. Fixing Errors]()
  - [1. OutputFixingParser]()
  - [2. RetryOutputParser]()
- [6. Conclusion]()

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 1. Introduction

Language models generate only text, but production code needs a predictable data structure. For example, imagine you are building a thesaurus application that suggests substitute words based on context. LLMs can easily generate many suggestions. Here is a sample output from ChatGPT listing several words close in meaning to the term “behavior”:

In [None]:
Here are some substitute words for "behavior":

Conduct
Manner
Demeanor
Attitude
Disposition
Deportment
Etiquette
Protocol
Performance
Actions

The problem is that there is no reliable way to extract the relevant information from this string programmatically. You might suggest splitting the response by newline and ignoring the first two lines. However, there is no guarantee that the response will have the same format every time: the list might be numbered, or there could be no introduction line.
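To make the fragility concrete, here is a minimal sketch of the newline-splitting approach. The helper name `naive_parse` and the numbered sample response are assumptions made up for illustration.

```python
def naive_parse(response: str) -> list[str]:
    """Split on newlines and drop the first two lines (intro + blank line)."""
    return [line.strip() for line in response.split("\n")[2:] if line.strip()]


# Format seen above: an introduction line, a blank line, then one word per line.
with_intro = 'Here are some substitute words for "behavior":\n\nConduct\nManner\nDemeanor'

# Hypothetical variation: no introduction line, numbered items.
numbered = "1. Conduct\n2. Manner\n3. Demeanor"

print(naive_parse(with_intro))  # ['Conduct', 'Manner', 'Demeanor']
print(naive_parse(numbered))    # ['3. Demeanor']
```

The second call silently loses two of the three words, which is exactly the kind of failure that is hard to notice in production.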

Output parsers let you define a data structure that states precisely what you expect from the output. For the word-suggestion application, we can ask for a list of words, or for a combination of fields such as a word and an explanation of why it fits. The parser then extracts the expected information for you.

This lesson covers the different types of parsers and how to troubleshoot parsing errors.

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 2. Installing Dependencies

In [1]:
!pip3 install langchain==0.0.208 deeplake openai==0.27.8 python-dotenv tiktoken

Collecting langchain==0.0.208
  Downloading langchain-0.0.208-py3-none-any.whl.metadata (13 kB)
Collecting deeplake
  Downloading deeplake-3.9.15.tar.gz (607 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting openai==0.27.8
  Downloading openai-0.27.8-py3-none-any.whl.metadata (13 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain==0.0.208)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl.metadata (22 kB)
Collecting langchainplus-sdk>=0.0.13 (from langchain==0.0.208)
  Downloading langchainplus_s

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 3. Mount Google Drive & Load API Keys

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


All the API keys are stored in the file `llm_env`. Its contents are as follows:

ACTIVELOOP_TOKEN=<[Your Activeloop API Key](https://app.activeloop.ai/register)>

OPENAI_API_KEY=<[Your OpenAI API Key](https://platform.openai.com/)>

GOOGLE_API_KEY=<[Your Google API Key](https://console.cloud.google.com/apis/credentials)>

GOOGLE_CSE_ID=<[Your Google Custom Search Engine ID](https://programmablesearchengine.google.com/controlpanel/create)>

HUGGINGFACEHUB_API_TOKEN=<[Your Hugging Face Access Token](https://huggingface.co/settings/tokens)>

In [2]:
from dotenv import load_dotenv

# Load API Keys for Deep Lake Vector Database, OpenAI, Google & Hugging Face
load_dotenv('/content/drive/MyDrive/ancilcleetus-github/llm_env')

True

In [None]:
import os

print(f"os.environ['ACTIVELOOP_TOKEN']: \n{os.environ['ACTIVELOOP_TOKEN']}")
print(f"os.environ['OPENAI_API_KEY']: \n{os.environ['OPENAI_API_KEY']}")
print(f"os.environ['GOOGLE_API_KEY']: \n{os.environ['GOOGLE_API_KEY']}")
print(f"os.environ['GOOGLE_CSE_ID']: \n{os.environ['GOOGLE_CSE_ID']}")
print(f"os.environ['HUGGINGFACEHUB_API_TOKEN']: \n{os.environ['HUGGINGFACEHUB_API_TOKEN']}")
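Printing full keys in a notebook makes them easy to leak when the notebook is shared. A minimal alternative, assuming you only need to confirm that each key was loaded, is to print a masked form. The helper `mask_secret` is hypothetical and not part of `python-dotenv`.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return the value with all but the last `visible` characters replaced by '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


for name in ["ACTIVELOOP_TOKEN", "OPENAI_API_KEY", "GOOGLE_API_KEY",
             "GOOGLE_CSE_ID", "HUGGINGFACEHUB_API_TOKEN"]:
    value = os.environ.get(name)
    # Report missing keys instead of raising KeyError.
    print(f"{name}: {mask_secret(value) if value else 'NOT SET'}")
```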

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 4. Output Parsers

We will introduce three classes in this section. While the Pydantic parser is the most powerful and flexible wrapper, it is useful to know the other options for less complicated problems. We will implement the thesaurus application in each subsection to better understand the details of each approach.

## 1. `PydanticOutputParser`

This class instructs the model to generate its output in JSON format and then extracts the information from the response. The parsed result is an instance of the schema class you define, so a field declared as a list can be indexed directly without worrying about formatting.

This class uses the Pydantic library, which helps define and validate data structures in Python. It lets us describe each expected output with a name, type, and description. In the thesaurus example, we need a variable that can store multiple suggestions. This is easily done by defining a class that inherits from Pydantic's `BaseModel` class.

In [4]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List

# Define your desired data structure.
class Suggestions(BaseModel):
    words: List[str] = Field(description="list of substitute words based on context")

    # Throw error in case of receiving a numbered-list from API
    @validator('words')
    def not_start_with_number(cls, field):
        for item in field:
            if item[0].isnumeric():
                raise ValueError("The word can not start with numbers!")
        return field

parser = PydanticOutputParser(pydantic_object=Suggestions)

As always, we start by importing the necessary libraries and then create the `Suggestions` schema class. There are two essential parts to this class:

1. **Expected Outputs:** Each output is defined by declaring a variable with the desired type, such as a list of strings (`: List[str]`) in the sample code, or a single string (`: str`) if you expect just one word or sentence as the response. You should also write a short explanation using the `Field` function's `description` argument to help the model during inference. (We will see an example with multiple outputs later in the lesson.)

2. **Validators:** It is possible to declare functions that validate the formatting. In the sample code, we ensure that the first character is not a number. The function's name is unimportant, but the `@validator` decorator must receive the name of the variable you want to validate (like `@validator('words')`). Note that the `field` argument inside the validator function will be a list if you declared the variable as one.
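To see what the validator accepts and rejects, the same rule can be exercised as a plain function outside Pydantic. This is only an illustrative sketch: the sample lists are made up, and the extra empty-string guard is an assumption that is not present in the class above.

```python
def not_start_with_number(words: list[str]) -> list[str]:
    """Raise ValueError if any word starts with a digit; otherwise return the list."""
    for item in words:
        # `item and ...` skips empty strings, which would otherwise raise IndexError.
        if item and item[0].isnumeric():
            raise ValueError("The word can not start with numbers!")
    return words


print(not_start_with_number(["conduct", "manner"]))  # ['conduct', 'manner']

try:
    not_start_with_number(["1. conduct", "2. manner"])
except ValueError as err:
    print(f"Rejected: {err}")  # Rejected: The word can not start with numbers!
```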

We will pass the created class to the `PydanticOutputParser` wrapper to make it a LangChain parser object. The next step is to prepare the prompt.

In [5]:
from langchain.prompts import PromptTemplate

template = """
Offer a list of suggestions to substitute the specified target_word based on the presented context.
{format_instructions}
target_word={target_word}
context={context}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["target_word", "context"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

model_input = prompt.format_prompt(
    target_word="behaviour",
    context="The behaviour of the students in the classroom was disruptive and made it difficult for the teacher to conduct the lesson."
)

As discussed in previous lessons, the `template` variable is a string that can contain named placeholders written as `{variable_name}`. The template outlines our expectations for the model, including the formatting instructions from the parser and the inputs. The `PromptTemplate` receives the template string along with a declaration of how each placeholder is filled. A placeholder is either 1) one of the `input_variables`, whose value is supplied later through the `.format_prompt()` function, or 2) one of the `partial_variables`, whose value is set immediately.
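The difference between the two kinds of placeholders can be mimicked with plain string formatting. The sketch below is an analogy, not how `PromptTemplate` is implemented, and the instruction text is a made-up stand-in for `parser.get_format_instructions()`.

```python
from functools import partial

template = (
    "Offer a list of suggestions to substitute the specified target_word "
    "based on the presented context.\n"
    "{format_instructions}\n"
    "target_word={target_word}\n"
    "context={context}\n"
)

# "Partial" variable: bound once, when the prompt object is created.
fill_prompt = partial(template.format, format_instructions="Answer in JSON with a 'words' key.")

# "Input" variables: supplied later, once per request.
prompt_text = fill_prompt(
    target_word="behaviour",
    context="The behaviour of the students was disruptive.",
)
print(prompt_text)
```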

In [7]:
from langchain.llms import OpenAI

# Initialize LLM
model = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0.0)

output = model(model_input.to_string())

parser.parse(output)

Suggestions(words=['conduct', 'manage', 'handle', 'oversee', 'supervise'])

The parser object's `parse()` function converts the model's string response into the format we specified. The result's `words` field is a list that you can index and use in your applications.
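Conceptually, parsing comes down to locating the JSON object in the completion, decoding it, and checking it against the schema. The following standard-library sketch approximates that idea. It is not LangChain's implementation, and `parse_suggestions`, the dataclass, and the sample completion are assumptions for illustration.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Suggestions:
    words: list[str]


def parse_suggestions(completion: str) -> Suggestions:
    """Take the span from the first '{' to the last '}', decode it, and check 'words'."""
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the completion")
    data = json.loads(match.group())
    words = data.get("words")
    if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
        raise ValueError("'words' must be a list of strings")
    return Suggestions(words=words)


# Made-up completion with some chatter around the JSON payload.
completion = 'Sure! {"words": ["conduct", "manner", "demeanor"]} Hope this helps.'
result = parse_suggestions(completion)
print(result.words[0])    # conduct
print(len(result.words))  # 3
```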

### Multiple Outputs Example

Here is sample code for a Pydantic class that processes multiple outputs. It asks the model to suggest a list of words and to present the reasoning behind each suggestion.

Replace the `template` variable and the `Suggestions` class with the following code to run this example. The template change asks the model to present its reasoning, and the `Suggestions` class declares a new output named `reasons`. The new validator function also modifies the output so that every reason ends with a period, which shows that validators can be used to manipulate output as well as to reject it.

In [None]:
template = """
Offer a list of suggestions to substitute the specified target_word based on the presented context and the reasoning for each word.
{format_instructions}
target_word={target_word}
context={context}
"""

In [None]:
class Suggestions(BaseModel):
    words: List[str] = Field(description="list of substitute words based on context")
    reasons: List[str] = Field(description="the reasoning of why this word fits the context")

    @validator('words')
    def not_start_with_number(cls, field):
      for item in field:
        if item[0].isnumeric():
          raise ValueError("The word can not start with numbers!")
      return field

    @validator('reasons')
    def end_with_dot(cls, field):
      for idx, item in enumerate( field ):
        if item[-1] != ".":
          field[idx] += "."
      return field
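The `end_with_dot` validator indexes `item[-1]`, which would raise an `IndexError` on an empty string. A slightly more defensive variant, shown here as a plain function for illustration, uses `str.endswith` and builds a new list instead of mutating the input. The name `ensure_trailing_dot` and the sample data are assumptions.

```python
def ensure_trailing_dot(reasons: list[str]) -> list[str]:
    """Return a new list in which every reason ends with a period."""
    return [r if r.endswith(".") else r + "." for r in reasons]


reasons = ["refers to the way someone acts", "already punctuated."]
print(ensure_trailing_dot(reasons))
# ['refers to the way someone acts.', 'already punctuated.']
```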

Full code is as below:

In [8]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List

# Define your desired data structure.
class Suggestions(BaseModel):
    words: List[str] = Field(description="list of substitute words based on context")
    reasons: List[str] = Field(description="the reasoning of why this word fits the context")

    @validator('words')
    def not_start_with_number(cls, field):
      for item in field:
        if item[0].isnumeric():
          raise ValueError("The word can not start with numbers!")
      return field

    @validator('reasons')
    def end_with_dot(cls, field):
      for idx, item in enumerate( field ):
        if item[-1] != ".":
          field[idx] += "."
      return field

parser = PydanticOutputParser(pydantic_object=Suggestions)

In [9]:
from langchain.prompts import PromptTemplate

template = """
Offer a list of suggestions to substitute the specified target_word based on the presented context and the reasoning for each word.
{format_instructions}
target_word={target_word}
context={context}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["target_word", "context"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

model_input = prompt.format_prompt(
    target_word="behaviour",
    context="The behaviour of the students in the classroom was disruptive and made it difficult for the teacher to conduct the lesson."
)

In [10]:
from langchain.llms import OpenAI

# Initialize LLM
model = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0.0)

output = model(model_input.to_string())

parser.parse(output)

Suggestions(words=['conduct', 'manage', 'handle', 'oversee'], reasons=["These words all imply a sense of control and authority, which is lacking in the original context. They also suggest a more active role in guiding the students' actions.", 'These words all suggest a more organized and structured approach to the situation, which contrasts with the disruptive behaviour of the students.', 'These words all imply a sense of responsibility and leadership, which is lacking in the original context. They also suggest a more proactive approach to addressing the issue.', "These words all suggest a more authoritative and assertive approach to managing the students' behaviour, which may be necessary in this situation."])

## 2. `CommaSeparatedListOutputParser`

As the name suggests, this class manages comma-separated outputs. It handles one specific case: receiving a list of outputs from the model.

In [11]:
from langchain.output_parsers import CommaSeparatedListOutputParser

parser = CommaSeparatedListOutputParser()

In [12]:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# Prepare the Prompt
template = """
Offer a list of suggestions to substitute the word '{target_word}' based on the following text: {context}.
{format_instructions}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["target_word", "context"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

model_input = prompt.format(
  target_word="behaviour",
  context="The behaviour of the students in the classroom was disruptive and made it difficult for the teacher to conduct the lesson."
)

# Loading OpenAI API
model = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0.0)

# Send the Request
output = model(model_input)
parser.parse(output)

['1. Conduct\n2. Manner\n3. Demeanor\n4. Conducting\n5. Attitude\n6. Conductance\n7. Deportment\n8. Etiquette\n9. Performance\n10. Actions']

Although most of the sample code has been explained in the previous subsection, two parts deserve attention. First, we tried a new format for the prompt's template to show a different way of writing a prompt. Second, we used `.format()` instead of `.format_prompt()` to generate the model's input. The main difference from the previous subsection's code is that we no longer need to call the `.to_string()` method, since the prompt is already a string.

The parser is meant to return a list of words, which has some overlap with the `PydanticOutputParser` results and more variety. Note, however, that in the run shown above the model answered with a numbered, newline-separated list rather than comma-separated values, so the parser returned a list containing a single string. This parser does not validate the response format. In addition, requesting reasoning for each word is not possible with the `CommaSeparatedListOutputParser` class.
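The output above is a list holding one long string. That is what you get when text without commas is split on a comma delimiter. The sketch below imitates this with `str.split`. Treating `", "` as the delimiter is an assumption about the parser's internals, and `split_list_items` is a hypothetical fallback, not a LangChain function.

```python
import re


def split_on_commas(text: str) -> list[str]:
    """Imitate a comma-separated parser: strip, then split on ', '."""
    return text.strip().split(", ")


def split_list_items(text: str) -> list[str]:
    """Hypothetical fallback: accept commas or newlines and drop '1.'-style numbering."""
    parts = re.split(r",|\n", text)
    cleaned = [re.sub(r"^\s*\d+[.)]\s*", "", part).strip() for part in parts]
    return [item for item in cleaned if item]


comma_output = "conduct, manner, demeanor"
numbered_output = "1. Conduct\n2. Manner\n3. Demeanor"

print(split_on_commas(comma_output))      # ['conduct', 'manner', 'demeanor']
print(split_on_commas(numbered_output))   # ['1. Conduct\n2. Manner\n3. Demeanor']
print(split_list_items(numbered_output))  # ['Conduct', 'Manner', 'Demeanor']
```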

## 3. `StructuredOutputParser`

This is the first output parser implemented by the LangChain team. While it can process multiple outputs, it only supports text and does not provide options for other data types, such as lists or integers. It can be used when you want to receive a single response from the model, for example, only one substitute word in the thesaurus application.

In [None]:
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

response_schemas = [
    ResponseSchema(name="words", description="A substitute word based on context"),
    ResponseSchema(name="reasons", description="the reasoning of why this word fits the context.")
]

parser = StructuredOutputParser.from_response_schemas(response_schemas)

This class offers no real advantage: the `PydanticOutputParser` class provides validation and more flexibility for complex tasks, and the `CommaSeparatedListOutputParser` covers simpler applications.
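For comparison with the Pydantic result, the result of this parser is a plain dictionary keyed by the schema names rather than a typed object. The standard-library sketch below mimics that shape under the assumption that the model wraps its JSON in a fenced `json` snippet. `parse_structured` and the sample reply are made up and do not reproduce LangChain's code.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_structured(reply: str, expected_keys: list[str]) -> dict:
    """Decode the JSON inside a fenced snippet and check that every expected key exists."""
    opening = FENCE + "json"
    start = reply.index(opening) + len(opening)
    end = reply.index(FENCE, start)
    data = json.loads(reply[start:end])
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    return data


# Made-up model reply in the assumed format.
reply = FENCE + 'json\n{"words": "conduct", "reasons": "It names how someone acts."}\n' + FENCE
parsed = parse_structured(reply, ["words", "reasons"])
print(parsed["words"])                   # conduct
print(type(parsed["reasons"]).__name__)  # str
```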

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 5. Fixing Errors

Parsers are powerful tools for dynamically extracting information from the model's response and validating it to some extent. Still, they do not guarantee a usable response. Imagine that you have deployed your application and the model's response to a user's request is incomplete, causing the parser to throw an error. That is far from ideal. In the following subsections, we introduce two classes that act as fail-safes. They add a layer on top of the model's response to help fix the errors.

**Note**

The following approaches work with the `PydanticOutputParser` class since it is the only one with a validation method.

## 1. OutputFixingParser

This method tries to fix a parsing error by looking at the model's response and the original parser. It uses a Large Language Model (LLM) to solve the issue. We will use the same `gpt-3.5-turbo-instruct` model to be consistent with the rest of the lesson, but it is possible to pass any supported model. Let's start by defining the Pydantic data schema and showing a sample error that could occur.

In [13]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

# Define your desired data structure.
class Suggestions(BaseModel):
    words: List[str] = Field(description="list of substitute words based on context")
    reasons: List[str] = Field(description="the reasoning of why this word fits the context")

parser = PydanticOutputParser(pydantic_object=Suggestions)

missformatted_output = '{"words": ["conduct", "manner"], "reasoning": ["refers to the way someone acts in a particular situation.", "refers to the way someone behaves in a particular situation."]}'

parser.parse(missformatted_output)

OutputParserException: Failed to parse Suggestions from completion {"words": ["conduct", "manner"], "reasoning": ["refers to the way someone acts in a particular situation.", "refers to the way someone behaves in a particular situation."]}. Got: 1 validation error for Suggestions
reasons
  field required (type=value_error.missing)

As you can see in the error message, the parser correctly identified an error in our sample response (`missformatted_output`) since we used the word `reasoning` instead of the expected `reasons` key. The `OutputFixingParser` class could easily fix this error.

In [14]:
from langchain.llms import OpenAI
from langchain.output_parsers import OutputFixingParser

model = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0.0)

outputfixing_parser = OutputFixingParser.from_llm(parser=parser, llm=model)
outputfixing_parser.parse(missformatted_output)

Suggestions(words=['conduct', 'manner'], reasons=['refers to the way someone acts in a particular situation.', 'refers to the way someone behaves in a particular situation.'])

The `from_llm()` function takes the old parser and a language model as input parameters. It then initializes a new parser that can fix output errors. In this case, it successfully identified the misnamed key and changed it to the one we defined.
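The idea behind the wrapper can be sketched without any API call: try the base parser, and only on failure ask a model to rewrite the completion, then parse again. Everything below is an illustrative stand-in. `fake_llm` is a hard-coded function playing the role of the model, and the prompt wording is invented rather than taken from LangChain.

```python
import json


def base_parse(text: str) -> dict:
    """Minimal stand-in for the Pydantic parser: require 'words' and 'reasons'."""
    data = json.loads(text)
    for key in ("words", "reasons"):
        if key not in data:
            raise ValueError(f"field required: {key}")
    return data


def fake_llm(prompt: str) -> str:
    """Pretend model: always returns the completion with the misnamed key corrected."""
    return '{"words": ["conduct", "manner"], "reasons": ["acts.", "behaves."]}'


def parse_with_fix(text: str, llm=fake_llm) -> dict:
    try:
        return base_parse(text)
    except ValueError as err:
        # The fixing prompt carries the bad completion and the error,
        # but not the prompt that originally produced the completion.
        fix_prompt = f"Completion:\n{text}\nError:\n{err}\nReturn a corrected completion."
        return base_parse(llm(fix_prompt))


bad = '{"words": ["conduct", "manner"], "reasoning": ["acts.", "behaves."]}'
print(parse_with_fix(bad)["reasons"])  # ['acts.', 'behaves.']
```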

However, fixing the issues using this class is not always possible. Here is an example of using `OutputFixingParser` class to resolve an error with a missing key.

In [15]:
missformatted_output = '{"words": ["conduct", "manner"]}'

outputfixing_parser = OutputFixingParser.from_llm(parser=parser, llm=model)

outputfixing_parser.parse(missformatted_output)

Suggestions(words=['conduct', 'manner'], reasons=['These words both describe a way of behaving or carrying oneself.'])

Looking at the output, it is evident that the model recognized that the `reasons` key was missing from the response but lacked the context of the desired outcome. It created a list with one entry, whereas we expect one reason per word. This is why we sometimes need the retry parsers described in the next subsection.
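Nothing in the schema checks that the two lists line up. If one reason per word matters to your application, a small post-parse check can make that expectation explicit. This is an assumed addition, not something the lesson's schema enforces, and `check_aligned` is a hypothetical helper.

```python
def check_aligned(words: list[str], reasons: list[str]) -> None:
    """Raise ValueError unless there is exactly one reason per word."""
    if len(words) != len(reasons):
        raise ValueError(f"Expected {len(words)} reasons, got {len(reasons)}")


# Values taken from the OutputFixingParser result above: two words, one reason.
words = ["conduct", "manner"]
reasons = ["These words both describe a way of behaving or carrying oneself."]

try:
    check_aligned(words, reasons)
except ValueError as err:
    print(err)  # Expected 2 reasons, got 1
```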

## 2. RetryOutputParser

In some cases, the parser needs access to both the output and the prompt to process the full context, as demonstrated in the previous section.

In [16]:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

# Define data structure.
class Suggestions(BaseModel):
    words: List[str] = Field(description="list of substitute words based on context")
    reasons: List[str] = Field(description="the reasoning of why this word fits the context")

parser = PydanticOutputParser(pydantic_object=Suggestions)

# Define prompt
template = """
Offer a list of suggestions to substitute the specified target_word based on the presented context and the reasoning for each word.
{format_instructions}
target_word={target_word}
context={context}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["target_word", "context"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

model_input = prompt.format_prompt(target_word="behaviour", context="The behaviour of the students in the classroom was disruptive and made it difficult for the teacher to conduct the lesson.")

# Define Model
model = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0.0)

Now, we can fix the same `missformatted_output` using the `RetryWithErrorOutputParser` class. As in the previous subsection, it receives the old parser and a model to create the new parser object. The difference is that the fix is performed by the `parse_with_prompt` function, which requires both the output and the prompt.

In [17]:
from langchain.output_parsers import RetryWithErrorOutputParser

missformatted_output = '{"words": ["conduct", "manner"]}'

retry_parser = RetryWithErrorOutputParser.from_llm(parser=parser, llm=model)

retry_parser.parse_with_prompt(missformatted_output, model_input)

Suggestions(words=['conduct', 'manner'], reasons=['These words both convey a sense of control and order, which is the opposite of disruptive behaviour in a classroom setting.', 'Both words also imply a level of professionalism and respect, which is important in a classroom environment.'])

The output shows that the `RetryWithErrorOutputParser` can fix the issue that the `OutputFixingParser` could not. With the prompt available, the model generated one reason for each word.

The best practice for incorporating these techniques in production is to catch the parsing error with a `try: ... except: ...` block. We first call the plain parser and, only if it raises, attempt to fix the output in the `except` section using the classes above. Because the fixing parsers make extra LLM calls, this limits the number of API calls and avoids unnecessary costs.
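A minimal sketch of that pattern follows. It is written against plain callables so it runs without an API key: `primary` stands for `parser.parse`, and each entry in `fallbacks` stands for a call such as `outputfixing_parser.parse`. The function name `parse_with_fallbacks` and the stub parsers are assumptions for illustration.

```python
from typing import Any, Callable, Sequence


def parse_with_fallbacks(
    text: str,
    primary: Callable[[str], Any],
    fallbacks: Sequence[Callable[[str], Any]] = (),
) -> Any:
    """Try the cheap parser first; call each costly fallback only if the previous step fails."""
    try:
        return primary(text)
    except Exception as first_error:  # in LangChain code, catch the parser's exception type instead
        last_error = first_error
    for fallback in fallbacks:
        try:
            return fallback(text)
        except Exception as err:
            last_error = err
    raise last_error


calls = []  # records which stubs ran, to show that fallbacks are skipped on success


def strict_parser(text: str) -> list[str]:
    calls.append("strict")
    if "," not in text:
        raise ValueError("expected comma-separated values")
    return text.split(", ")


def fixing_parser(text: str) -> list[str]:
    calls.append("fixing")  # stands in for an extra LLM call
    return text.split()


print(parse_with_fallbacks("conduct, manner", strict_parser, [fixing_parser]))  # ['conduct', 'manner']
print(parse_with_fallbacks("conduct manner", strict_parser, [fixing_parser]))   # ['conduct', 'manner']
print(calls)  # ['strict', 'strict', 'fixing']
```

The recorded calls show that the fallback ran only for the second input, which is where the cost saving comes from.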

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 6. Conclusion

We learned how to validate the language model's response, which is always a string, and extract information from it in an easy-to-use format. Additionally, we reviewed LangChain's fail-safe procedures for improving the consistency of the output. Combining these approaches will help us write more reliable applications in production environments.

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)