<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_05_4_custom_parsers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 5: LangChain: Data Extraction**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 5 Material

* Part 5.1: Structured Output Parser [[Video]](https://www.youtube.com/watch?v=62CSR141VRE) [[Notebook]](t81_559_class_05_1_langchain_data.ipynb)
* Part 5.2: Other Parsers (CSV, JSON, Pandas, Datetime) [[Video]](https://www.youtube.com/watch?v=VXm8gPzU3qc) [[Notebook]](t81_559_class_05_2_parsers.ipynb)
* Part 5.3: Pydantic parser [[Video]](https://www.youtube.com/watch?v=dc4fn-W60hg) [[Notebook]](t81_559_class_05_3_pydantic.ipynb)
* **Part 5.4: Custom Output Parser** [[Video]](https://www.youtube.com/watch?v=jBpkAblQC_U) [[Notebook]](t81_559_class_05_4_custom_parsers.ipynb)
* Part 5.5: Output-Fixing Parser [[Video]](https://www.youtube.com/watch?v=_txWiLjf4bo) [[Notebook]](t81_559_class_05_5_output_fixing_parsers.ipynb)

# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab. If it is, the code loads the OpenAI API key from the CoLab secrets and installs the required libraries.

In [None]:
import os

try:
    from google.colab import drive, userdata
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai

Note: using Google CoLab


# 5.4: Custom Output Parsers

In some scenarios, you might want to create a custom parser that converts the model output into exactly the format your application expects.

There are two ways to create a custom parser:

* Using **RunnableLambda** or **RunnableGenerator** in LCEL - This is the recommended approach for most cases.
* Inheriting from one of the base classes for output parsing - This is the more challenging method.

The differences between these approaches are mostly superficial, primarily involving which callbacks are triggered (e.g., on_chain_start vs. on_parser_start) and how a runnable lambda vs. a parser is visualized in a tracing platform like LangSmith.

I suggest using runnable lambdas and runnable generators for parsing.

The following code creates the chat model used in the rest of this part.



In [None]:
from langchain_openai import ChatOpenAI

MODEL = 'gpt-4o-mini'
TEMPERATURE = 0.0

# Initialize the OpenAI chat model; the API key is read from OPENAI_API_KEY
llm = ChatOpenAI(
    model=MODEL,
    temperature=TEMPERATURE,
    n=1
)

In this section, we will create a simple parser that inverts the case of the model's output.

For example, if the model outputs "Hello World," the parser will transform it to "hELLO wORLD."

In [None]:
from langchain_core.messages import AIMessage

def parse(ai_message: AIMessage) -> str:
    """Parse the AI message."""
    return ai_message.content.swapcase()


chain = llm | parse
chain.invoke("hello")

'hELLO! hOW CAN i ASSIST YOU TODAY?'
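
The parse function above receives the complete message. When a chain is streamed, the same transformation can be written as a generator that handles one chunk at a time, which is the kind of function RunnableGenerator is meant to wrap. The following is a minimal sketch that runs without an API key. FakeChunk is a hypothetical stand-in for AIMessageChunk, introduced here only so the generator can be exercised offline.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for AIMessageChunk; only .content is used."""
    content: str


def streaming_parse(chunks: Iterable[FakeChunk]) -> Iterator[str]:
    """Invert the case of each chunk as it arrives."""
    for chunk in chunks:
        yield chunk.content.swapcase()


# Simulate a model reply that arrives in three pieces.
pieces = [FakeChunk("Hello"), FakeChunk(" World"), FakeChunk("!")]
for text in streaming_parse(pieces):
    print(repr(text))
```

In a real chain, the assumption is that this generator would be wrapped with RunnableGenerator (from langchain_core.runnables) and composed as llm | RunnableGenerator(streaming_parse), so that chain.stream(...) yields the converted chunks one by one.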

## Inheriting from Parsing Base Classes

Another way to implement a parser is by inheriting from BaseOutputParser, BaseGenerationOutputParser, or another base parser depending on your needs.

We generally do not recommend this approach for most use cases, as it requires more code without offering significant benefits.

The simplest type of output parser extends the BaseOutputParser class and must implement the following methods:

* **parse**: Takes the string output from the model and parses it.
* **(optional) _type**: Identifies the name of the parser.

When the output from the chat model or LLM is malformed, the parser can raise an OutputParserException to indicate that parsing failed due to bad input. Using this exception allows code that uses the parser to handle failures consistently.

Since BaseOutputParser implements the Runnable interface, any custom parser you create this way will become a valid LangChain Runnable, benefiting from automatic async support, batch interface, logging support, and more.

Here's a simple parser that can parse a string representation of a boolean (e.g., YES or NO) and convert it into the corresponding boolean type.

In [None]:
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import BaseOutputParser


class BooleanOutputParser(BaseOutputParser[bool]):
    """Custom parser to interpret 'YES'/'NO' strings as boolean values."""

    true_val: str = "YES"
    false_val: str = "NO"

    def parse(self, text: str) -> bool:
        """
        Parse the input text and return a boolean value.

        Args:
            text (str): The input text to parse.

        Returns:
            bool: True if text matches true_val, False if it matches false_val.

        Raises:
            OutputParserException: If the text does not match true_val or false_val.
        """
        cleaned_text = text.strip().upper()
        if cleaned_text not in (self.true_val.upper(), self.false_val.upper()):
            raise OutputParserException(
                f"BooleanOutputParser expected output value to be either "
                f"{self.true_val} or {self.false_val} (case-insensitive). "
                f"Received {cleaned_text}."
            )
        return cleaned_text == self.true_val.upper()

    @property
    def _type(self) -> str:
        """
        Return the type of the parser.

        Returns:
            str: The type of the parser.
        """
        return "boolean_output_parser"


In [None]:
parser = BooleanOutputParser()
parser.invoke("YES")

True

In [None]:
try:
    parser.invoke("MEOW")
except Exception as e:
    print(f"Triggered an exception of type: {type(e)}")

Triggered an exception of type: <class 'langchain_core.exceptions.OutputParserException'>


In [None]:
parser = BooleanOutputParser(true_val="OKAY")
parser.invoke("OKAY")

True

In [None]:
parser.batch(["OKAY", "NO"])

[True, False]

In [None]:
await parser.abatch(["OKAY", "NO"])

[True, False]

In [None]:
llm.invoke("say either OKAY or NO")

AIMessage(content='OKAY', response_metadata={'token_usage': {'completion_tokens': 2, 'prompt_tokens': 13, 'total_tokens': 15}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-b969e2cc-df17-4c98-a890-703318c5a4c2-0')

In [None]:
chain = llm | parser
chain.invoke("say either OKAY or NO")

True
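
The chain succeeds here because the model replied with exactly OKAY. The parse method only strips whitespace, so a reply such as "Okay." would raise an exception. One possible way to tolerate such replies is sketched below as a plain function. The name parse_boolean is hypothetical, and ValueError stands in for OutputParserException so that the example runs without LangChain installed.

```python
import string


def parse_boolean(text: str, true_val: str = "YES", false_val: str = "NO") -> bool:
    """Compare text to true_val/false_val, ignoring case and surrounding punctuation."""
    # Remove whitespace plus any leading/trailing punctuation or quotes.
    cleaned = text.strip().strip(string.punctuation + string.whitespace).upper()
    if cleaned == true_val.upper():
        return True
    if cleaned == false_val.upper():
        return False
    raise ValueError(f"Expected {true_val} or {false_val}, received {cleaned!r}.")


print(parse_boolean(" Okay. ", true_val="OKAY"))
print(parse_boolean('"no"'))
```

The same cleaning step could be moved into the parse method of BooleanOutputParser. Whether to be this lenient is a design choice, since a strict parser surfaces unexpected model output earlier.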

### Stripping Non-Python Text

Large Language Models (LLMs) like GPT-4 are capable of generating text that seamlessly intermixes code and explanatory descriptions. While this can be useful for learning and documentation purposes, it poses a challenge when you need to extract and execute only the code from such mixed-content output. To address this, we will implement a simple function that strips everything except the Python code from a given text string.

This approach uses a regular expression to find the fenced code blocks that the model typically places around its code and discards the surrounding descriptive text. Because both Python code and natural language vary so widely, a method like this can never be perfect. It relies on the model's formatting habits and on heuristic patterns, so it may occasionally miss code or keep text that is not code.
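
To illustrate why pattern-based heuristics misfire, the sketch below applies a line-level filter to a reply that contains no code fences at all. The pattern CODE_LINE and the function keep_code_lines are hypothetical names used only for this illustration; they are not the method used in the rest of this part.

```python
import re

# Hypothetical heuristic: a line "looks like" Python if it is indented,
# starts with a common keyword, is a simple assignment, or is a bare call.
CODE_LINE = re.compile(
    r"^(?:\s+\S"
    r"|(?:import|from|def|class|for|while|if|elif|else|try|except|with|return)\b"
    r"|[A-Za-z_][\w.]*\s*(?:=|\+=|-=)\s*\S"
    r"|[A-Za-z_][\w.]*\(.*\)\s*$)"
)


def keep_code_lines(text: str) -> str:
    """Keep only the lines that match the heuristic above."""
    kept = [line for line in text.splitlines() if CODE_LINE.match(line)]
    return "\n".join(kept)


sample = """Here is a short example:
import math
def area(r):
    return math.pi * r ** 2
print(area(2))
for example, this sentence is prose but starts with a keyword.
"""
print(keep_code_lines(sample))
```

The first sentence is dropped correctly, but the last line is kept because it begins with the word "for". Misclassifications like this are the reason a line-level filter is fragile.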

In the next section, we will explore how another LLM can help strip non-Python text, potentially offering a more sophisticated and accurate solution. The following sample contains a mixture of LLM commentary and generated code.

In [None]:
# Example usage
mixed_text = """
Yes, you can estimate the value of Pi using various methods in Python. One
common approach is the Monte Carlo method. Here's a simple example:

```python
import random

def estimate_pi(num_samples):
    inside_circle = 0

    for _ in range(num_samples):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        distance = x**2 + y**2

        if distance <= 1:
            inside_circle += 1

    pi_estimate = (inside_circle / num_samples) * 4
    return pi_estimate

num_samples = 1000000
pi_estimate = estimate_pi(num_samples)
print(f"Estimated value of Pi: {pi_estimate}")
```

This code uses the Monte Carlo method to estimate Pi by generating random points
within a unit square and checking how many fall inside a quarter circle. The
ratio of points inside the circle to the total points, multiplied by 4, gives an
estimate of Pi.

Would you like to explore other methods or need further explanation on this
approach?

"""

We now provide a function to strip the non-Python text. The extract_python_code function uses a regular expression to locate blocks of Python code enclosed in triple backticks. It calls re.findall with a pattern that matches the text between an opening fence of three backticks followed by the word python and the next closing fence of three backticks. The re.DOTALL flag lets the regular expression match newline characters inside the code block, so a single match can span multiple lines. The matched code blocks are then joined into a single string, and the strip method removes any leading or trailing whitespace. The result isolates the Python code from the surrounding text so that it can be used on its own.

In [None]:
import re

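def extract_python_code(mixed_text: str) -> str:
    """Return the code from every python-tagged fenced block in mixed_text."""
    # (.*?) is non-greedy, so each match stops at the nearest closing fence
    code_blocks = re.findall(r'```python(.*?)```', mixed_text, re.DOTALL)
    return "\n".join(code_blocks).strip()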

The following shows how we can use extract_python_code to extract the Python code.

In [None]:
python_code = extract_python_code(mixed_text)
print(python_code)

import random

def estimate_pi(num_samples):
    inside_circle = 0

    for _ in range(num_samples):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        distance = x**2 + y**2

        if distance <= 1:
            inside_circle += 1

    pi_estimate = (inside_circle / num_samples) * 4
    return pi_estimate

num_samples = 1000000
pi_estimate = estimate_pi(num_samples)
print(f"Estimated value of Pi: {pi_estimate}")
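
The pattern used by extract_python_code only matches fences tagged exactly python. As a hedged variation, the sketch below also accepts the shorter py tag in any letter case. The names FENCE, BLOCK, and extract_code_blocks are illustrative and not part of the original notebook.

```python
import re

# Built programmatically so that no literal fence appears in this example.
FENCE = "`" * 3

# Assumption: the model tags its code fences with "python" or "py" (any case).
BLOCK = re.compile(
    FENCE + r"(?:python|py)[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code_blocks(mixed_text: str) -> str:
    """Join the bodies of all matching fenced blocks; return '' when none are found."""
    blocks = [body.strip() for body in BLOCK.findall(mixed_text)]
    return "\n\n".join(blocks)


sample = (
    "First snippet:\n"
    f"{FENCE}Python\nx = 1\n{FENCE}\n"
    "Second snippet, tagged differently:\n"
    f"{FENCE}py\nprint(x)\n{FENCE}\n"
)
print(extract_code_blocks(sample))
```

The tag is deliberately required rather than optional. If untagged fences were accepted, the closing fence of a block in another language could be mistaken for the start of a Python block.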


### Creating a Code Output Parser

We now wrap extract_python_code in a custom output parser so that a chain returns only the Python code from the model's reply.

In [None]:
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import BaseOutputParser

class CodeOutputParser(BaseOutputParser[str]):
    """Custom code parser."""

    def parse(self, text: str) -> str:
        """Return only the Python code found in the model output."""
        return extract_python_code(text)

    @property
    def _type(self) -> str:
        return "CodeOutputParser"

As demonstrated here, only the Python code is output.

In [None]:
from IPython.display import Code, display

parser = CodeOutputParser()
chain = llm | parser
result = chain.invoke("Can I create Python code to estimate the value of Pi?")
display(Code(result, language='python'))
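
Because the extracted text is only as reliable as the model's formatting, one optional safeguard is to confirm that it is syntactically valid Python before using it. The sketch below uses the standard library ast module for that check. The function name check_python_syntax is hypothetical and not part of the original notebook.

```python
import ast
from typing import Optional


def check_python_syntax(code: str) -> Optional[str]:
    """Return None when code parses as Python, otherwise a short error description."""
    if not code.strip():
        return "no code was extracted"
    try:
        ast.parse(code)  # parses only; the code is never executed
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


print(check_python_syntax("x = 1\nprint(x)"))
print(check_python_syntax(""))
print(check_python_syntax("def broken(:\n    pass"))
```

A custom parser could call a check like this inside parse and raise OutputParserException when a problem is reported, so that malformed replies fail in the same consistent way as in the boolean parser.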