
Introduce valid_values to Text which alters the prompt, in order to limit which values are returned by predict_and_parse #84

Closed
smwitkowski opened this issue Mar 23, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@smwitkowski
Contributor

Hey - thanks for creating kor! I'm eager to start using it in my day job, in particular for aspect-based sentiment analysis.

I'm looking at reviews on food items and would like to label each review with an "aspect" if that aspect is mentioned in the review. You could begin to imagine which aspects are most relevant for this use case, flavor and texture are two that come to mind immediately. I want to limit this labeling exercise to only aspects that I am interested in.

I expect that including these instructions in the prompt would be sufficient, and I can think of two ways this could be incorporated into kor.

The most straightforward way is to allow an end user to alter the prompt directly, but I'd prefer the second solution listed below.

The second seems like the better long-term solution: AbstractSchemaNode could accept a new parameter, valid_values, indicating which values are valid for a given key defined in the attribute.

from kor.nodes import Object, Text

schema = Object(
    id="review_aspect",
    description="Extracts aspects from a review.",
    attributes=[
        Text(
            id="aspect",
            description="Aspects mentioned in the review",
            examples=[("Taste was fine it was just a weird texture.", ["Flavor", "Texture"])],
            valid_values=["Flavor", "Texture"],  # proposed new parameter
            many=True,
        )
    ],
)

Then, that valid_values would be passed to generate_instruction_segment along with the node, and the prompt would be updated with instructions restricting which values may be returned for aspect.

kor/kor/prompts.py, lines 89 to 93 in c3066c1:

def generate_instruction_segment(self, node: AbstractSchemaNode) -> str:
    """Generate the instruction segment of the extraction."""
    type_description = self.type_descriptor.describe(node)
    instruction_segment = self.encoder.get_instruction_segment()
    return f"{self.prefix}\n\n{type_description}\n\n{instruction_segment}"

Happy to help contribute to this if it seems helpful!

@eyurtsev
Owner

Hello! Thanks so much for trying out the library and leaving feedback!

I'm planning on handling this use case via a dedicated Enum type (called Selection in the code, as it mirrors HTML input forms).

The node is here: https://github.com/eyurtsev/kor/blob/main/kor/nodes.py#L161, but support for it might not be plumbed through the entire code at the moment (not sure). I'll update this issue as soon as I release support.

The first pass will only update the information that gets output in the prompt to help guide the LLM as to which values are seen as valid.

There's going to be a separate (and larger) effort to hook up standard validation frameworks, so it's easy to specify what constitutes a valid extraction.

Let me know if you have any thoughts / concerns.

Also curious: do you extract content from HTML, PDFs, or raw text? And what is the length of your typical content?

@eyurtsev eyurtsev self-assigned this Mar 23, 2023
@eyurtsev eyurtsev added the enhancement New feature or request label Mar 23, 2023
@smwitkowski
Contributor Author

smwitkowski commented Mar 23, 2023

The node is here: https://github.com/eyurtsev/kor/blob/main/kor/nodes.py#L161, but support for it might not be plumbed through the entire code at the moment (not sure). I'll update this issue as soon as I release support.

Awesome, thanks for sharing! I can confirm that after switching to Selection and Option I'm now (so far) only getting the aspects I've defined. Here's the code I used to define the schema.

from kor.nodes import Object, Option, Selection

aspect_options = [
    Option(
        id="flavor",
        description="Flavor",
        examples=["<EXAMPLES REDACTED>"],
    ),
    Option(
        id="texture",
        description="Texture",
        examples=["<EXAMPLES REDACTED>"],
    ),
]

schema = Object(
    id="review_aspect",
    description="Extracts aspects from a review.",
    attributes=[
        Selection(
            id="aspect",
            description="Aspects mentioned in the review",
            options=aspect_options,
            many=True,
        )
    ],
)
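
For completeness, this is roughly how I'm calling it (assuming the create_extraction_chain / predict_and_parse usage works the way I think it does; the model name just reflects what I'm running):

from langchain.chat_models import ChatOpenAI
from kor import create_extraction_chain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = create_extraction_chain(llm, schema)

# The parsed extraction ends up under "data" in the returned dict.
result = chain.predict_and_parse(text="Taste was fine it was just a weird texture.")
print(result["data"])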

I couldn't find information on the different node types in the documentation. Is this something I could add?

The first pass will only update the information that gets output in the prompt to help guide the LLM as to which values are seen as valid.

There's going to be a separate (and larger) effort to hook up standard validation frameworks, so it's easy to specify what constitutes a valid extraction.

I'm curious, could you tell me more about that? Is it that the schema is built in a way where the valid options are only implied by what's included, and that in the future you mean to add more explicit instruction in the prompt itself about what counts as a valid extraction?

Also curious: do you extract content from HTML, PDFs, or raw text? And what is the length of your typical content?

I'm primarily focused on raw text. The length varies; sometimes it's only a few words and other times it's a paragraph of roughly 500 words. The text is very similar to reviews on Amazon items.

I may find myself working with HTML in the future, but it would only be the text contained within an HTML span, not the actual HTML itself.

I don't plan on working with any PDFs at this time.

@eyurtsev
Owner

I couldn't find information on the different node types in the documentation. Is this something I could add?

Yes, please do!

I'm curious, could you tell me more about that? Is it that the schema is built in a way where the valid options are only implied by what's included, and that in the future you mean to add more explicit instruction in the prompt itself about what counts as a valid extraction?

Currently the schema is designed to support two aspects of prompt generation: (1) the input/output examples, and (2) generating the schema portion of the instructions.

Input/output examples:

The schema is designed to support a convenient way to specify extraction examples on individual fields. During prompt generation, the schema is traversed (across any level of nesting) to aggregate the examples and produce an appropriate prompt. I don't know how well providing examples on individual fields works in terms of extraction quality (though I'm betting that quality will improve rapidly with newer LLMs).

Schema in the instruction:

The schema is scanned to generate a type definition (e.g., in TypeScript). I will likely add other type definitions if they help with extraction (e.g., for tabular extraction, generating a Postgres-style schema may help the model figure out what's required).

Both of these aspects of prompt generation only control the inputs to the LLM; they don't help validate the outputs. As LLMs become better, I'm betting they'll understand the schema better and make fewer mistakes.

Regardless, in the meantime validation needs to happen on the output of the LLM. I haven't implemented anything yet, so at the moment it's up to users of the library to do so. Roughly, what I'm thinking is to have folks define their schema using their favorite Python libraries (e.g., pydantic or marshmallow), with a bit of utility code that maps from something like pydantic to kor's internal representation of objects. This also means that the LLM output can be easily validated using either pydantic or marshmallow.
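
As a rough illustration of the output-side validation (plain pydantic, nothing kor-specific; the model and field names are made up for the example):

from enum import Enum
from typing import List
from pydantic import BaseModel, ValidationError

class Aspect(str, Enum):
    FLAVOR = "Flavor"
    TEXTURE = "Texture"

class ReviewAspects(BaseModel):
    aspect: List[Aspect] = []

# `parsed` stands in for whatever the LLM returned for one document.
parsed = {"aspect": ["Flavor", "Mouthfeel"]}

try:
    validated = ReviewAspects(**parsed)
except ValidationError as exc:
    # "Mouthfeel" is not a valid Aspect, so this raises and can be handled here.
    print(exc)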

@eyurtsev
Owner

Here's a PR that exposes the Selection and Option nodes: #85. Looks like the existing code is already working correctly, at least for the TypeScript descriptor.

@eyurtsev
Owner

@smwitkowski Working on a pydantic adapter here: #86

Shared a screenshot there showing the conversion from pydantic into the internal object; validation isn't hooked up yet, but it shouldn't be difficult to add.

I'll need to work out some details (e.g., whether self-referential types are allowed).

@smwitkowski
Contributor Author

Both of these aspects of prompt generation only control the inputs to the LLM; they don't help validate the outputs. As LLMs become better, I'm betting they'll understand the schema better and make fewer mistakes.

Regardless, in the meantime validation needs to happen on the output of the LLM.

Yes to all of this. I also expect that including explicit instructions in the prompt, in addition to the schema, would help ensure the LLM understands what is "valid" and what is not.

I ran my example today on ~10K documents and got back many aspects that were not defined. While a better LLM would likely remove this issue (FWIW I used gpt-3.5-turbo, not gpt-4), it's quite costly to default to higher-end LLMs today.

@eyurtsev
Owner

I also expect that including explicit instructions in the prompt, in addition to the schema, would help ensure the LLM understands what is "valid" and what is not.

That's probably right.

If you make modifications to the prompt that seem to help, I'd be very interested to know what they are.

Kor will likely need a benchmark dataset at some point to help with prompt experimentation.

FWIW, at the moment all the important instructions are in the system message, which OpenAI says the model doesn't pay as much attention to.

Another possibility is doing a second pass over the extraction with an LLM to correct deviations from the schema.
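
A cheaper stopgap than a second LLM pass is to simply drop anything outside the allowed set after parsing. For example (this is a workaround on the caller's side, not something kor does for you):

VALID_ASPECTS = {"flavor", "texture"}

def keep_valid_aspects(parsed: dict) -> dict:
    """Drop any extracted aspect that isn't in the allowed set."""
    # Adjust the key/nesting to match your schema's output structure.
    aspects = parsed.get("aspect", [])
    parsed["aspect"] = [a for a in aspects if a.lower() in VALID_ASPECTS]
    return parsed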


@eyurtsev
Owner

PR merged into main. Going to close this issue for now.

@pedrocr83

Hi there, do we have an enum or valid_values attribute for the Text object we can use?

@eyurtsev
Owner

@pedrocr83 The easiest way to achieve this is using pydantic. https://eyurtsev.github.io/kor/validation.html

It supports enums as well as arbitrary validation logic using pydantic field validators.
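
Roughly like this (from memory; see the linked page for the exact signatures):

from enum import Enum
from typing import List
from langchain.chat_models import ChatOpenAI
from pydantic import BaseModel
from kor import create_extraction_chain, from_pydantic

class Aspect(str, Enum):
    FLAVOR = "Flavor"
    TEXTURE = "Texture"

class ReviewAspects(BaseModel):
    aspect: List[Aspect]

# from_pydantic converts the model into a kor schema plus a validator
# that rejects anything outside the Aspect enum.
schema, validator = from_pydantic(ReviewAspects)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = create_extraction_chain(llm, schema, validator=validator)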

@eyurtsev eyurtsev reopened this Oct 19, 2023