Fix MultiQueryRetriever Prompt Rewording Inconsistent Formatting Error #276

Merged
merged 3 commits into from
Jan 19, 2024

davidgxue
Collaborator

@davidgxue davidgxue commented Jan 19, 2024

Description

  • See MultiQueryRetriever Increasing Error Rate due to Prompt Rewording Format #272 for details on why the error occurs; that explanation is not repeated in this PR.
  • This PR addresses a problem in the LineListOutputParser used by LangChain's MultiQueryRetriever: it does not clean up enough edge cases, such as empty lines that the LLM generates between reworded prompt lines. These cause an error downstream: the empty line and the other 2 reworded prompts are each sent downstream for vector DB document retrieval, and the empty string errors out during this phase.
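The failure mode can be illustrated with a minimal sketch. It assumes the parser splits the LLM output on newlines without filtering; `naive_split` and the sample questions are hypothetical stand-ins, not the LangChain code itself.

```python
def naive_split(text: str) -> list[str]:
    """Hypothetical stand-in: split LLM output on newlines, no filtering."""
    return text.strip().split("\n")


# Illustrative LLM output: two reworded prompts separated by a blank line.
llm_output = "What is an Airflow DAG?\n\nHow do I define a DAG in Airflow?\n"

queries = naive_split(llm_output)

# The blank line survives as an empty string and would be sent downstream
# as its own retrieval query alongside the two real ones.
empty_queries = [q for q in queries if not q.strip()]
print(queries)
print(f"{len(empty_queries)} empty query would reach the vector DB")
```

With this input, the list holds three entries, one of which is the empty string that triggers the downstream error.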

Technical Changes

  • Add CustomLineListOutputParser, a modified, less error-prone implementation of LangChain's LineListOutputParser. It removes empty lines and also limits the maximum number of lines (queries) that the multi-query retriever can generate through rewording.
  • Manually replace the output_parser inside the llm_chain of the multi-query retriever with the newly added CustomLineListOutputParser after the retriever is initialized.
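A rough sketch of the two changes, written without importing LangChain so it runs on its own. The class body, the `max_lines` parameter and its default, and the `SimpleNamespace` stand-in for the retriever are all assumptions for illustration; the actual implementation in this PR may differ.

```python
from types import SimpleNamespace


class CustomLineListOutputParser:
    """Sketch of a parser that drops blank lines and caps the query count."""

    def __init__(self, max_lines: int = 2):
        # max_lines is a hypothetical parameter name and default.
        self.max_lines = max_lines

    def parse(self, text: str) -> list[str]:
        # Strip each line, discard empty or whitespace-only ones.
        lines = [line.strip() for line in text.splitlines()]
        non_empty = [line for line in lines if line]
        # Keep at most max_lines reworded queries.
        return non_empty[: self.max_lines]


# Stand-in for an already initialized multi-query retriever; only the
# attribute path llm_chain.output_parser mirrors what the PR describes.
retriever = SimpleNamespace(llm_chain=SimpleNamespace(output_parser=None))

# Swap the parser in after initialization.
retriever.llm_chain.output_parser = CustomLineListOutputParser(max_lines=2)

parsed = retriever.llm_chain.output_parser.parse("q1\n\n   \nq2\nq3\n")
print(parsed)
```

Filtering before truncating means blank lines never count toward the limit, so the cap applies only to usable queries.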

Tests

  1. Verified that previous user prompts that caused downstream errors due to an empty line between the generated reworded prompts no longer fail. I tried almost all of the errored prompts I could find, and this PR appears to fix the issue. See one example below.
  2. Ran a sanity check on the original test question set. No errors or quality degradation.

closes #272

@davidgxue davidgxue self-assigned this Jan 19, 2024
@davidgxue davidgxue added this to the 0.3.0 milestone Jan 19, 2024
Collaborator

@pankajastro pankajastro left a comment

LGTM, but we should address @Lee-W's suggestions.

cloudflare-pages bot commented Jan 19, 2024

Deploying with Cloudflare Pages

Latest commit: 8f23a2d
Status: ✅  Deploy successful!
Preview URL: https://51ba7499.ask-astro.pages.dev
Branch Preview URL: https://fix-multiquery-lines-split-e.ask-astro.pages.dev

@davidgxue
Collaborator Author

Addressed @Lee-W's comments. Merging this to main. The ruff auto-add import change in the pyproject file will go in a separate PR so it does not bloat this one.

@davidgxue davidgxue merged commit f825c21 into main Jan 19, 2024
8 checks passed
@davidgxue davidgxue deleted the fix_multiquery_lines_split_error branch January 19, 2024 21:22
Development

Successfully merging this pull request may close these issues.

MultiQueryRetriever Increasing Error Rate due to Prompt Rewording Format
4 participants