Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix pipeline with join node #7510

Merged
merged 22 commits into from
Apr 11, 2024
Merged

fix: Fix pipeline with join node #7510

merged 22 commits into from
Apr 11, 2024

Conversation

sjrl
Copy link
Contributor

@sjrl sjrl commented Apr 9, 2024

Related Issues

  • fixes #issue-number

Proposed Changes:

We have started to develop multi prompt node pipelines for clients. Some of the most popular ones include rewriting the query during the pipeline execution. Here is a simple example using a PromptNode as a spell checker for the incoming user query.

from haystack import Pipeline
from haystack.nodes import BM25Retriever, EmbeddingRetriever, PromptNode, Shaper, JoinDocuments, PromptTemplate
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)
dicts = [{"content": "The capital of Germany is Berlin."}, {"content": "The capital of France is Paris."}]
document_store.write_documents(dicts)

query_prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="",
    default_prompt_template=PromptTemplate("You are a spell checker. Given a user query return the same query with all spelling errors fixed.\nUser Query: {query}\nSpell Checked Query:")
)
shaper = Shaper(
    func="join_strings",
    inputs={"strings": "results"},
    outputs=["query"],
)
qa_prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="",
    default_prompt_template=PromptTemplate("Answer the user query. Query: {query}")
)
sparse_retriever = BM25Retriever(
    document_store=document_store,
    top_k=2
)
dense_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="intfloat/e5-base-v2",
    model_format="sentence_transformers",
    top_k=2
)
document_store.update_embeddings(dense_retriever)

pipeline = Pipeline()
pipeline.add_node(component=query_prompt_node, name="QueryPromptNode", inputs=["Query"])
pipeline.add_node(component=shaper, name="ListToString", inputs=["QueryPromptNode"])
pipeline.add_node(component=sparse_retriever, name="BM25", inputs=["ListToString"])
pipeline.add_node(component=dense_retriever, name="Embedding", inputs=["ListToString"])
pipeline.add_node(
    component=JoinDocuments(join_mode="concatenate"), name="Join", inputs=["BM25", "Embedding"]
)
pipeline.add_node(component=qa_prompt_node, name="QAPromptNode", inputs=["Join"])

out = pipeline.run(query="What is the captial of Grmny?", debug=True)
print(out["invocation_context"])
# Current v1.x branch
# {'query': 'What is the captial of Grmny?',  <-- Original Query!!
#   'results': ['The capital of Germany is Berlin.'],
#   'prompts': ['Answer the user query. Query: What is the captial of Grmny?'],  <-- Original Query!!
# After this PR
# {'query': 'What is the capital of Germany?',  <-- Rewritten Query!!
#   'results': ['The capital of Germany is Berlin.'],
#   'prompts': ['Answer the user query. Query: What is the capital of Germany?'],  <-- Rewritten Query!!

The issue is that the query used in the second prompt node (qa_prompt_node) is the original query and not the one rewritten by the first prompt node (query_prompt_node). This has to do with how information the Pipeline.run method passes through JoinNodes specifically.

This PR fixes this issue by updating the Pipeline.run method to properly pass node_output information through a JoinNode.

How did you test it?

Notes for the reviewer

Checklist

@sjrl sjrl changed the base branch from main to v1.x April 9, 2024 07:28
@sjrl sjrl added 1.x and removed type:documentation Improvements on the docs 2.x Related to Haystack v2.0 labels Apr 9, 2024
@github-actions github-actions bot added the type:documentation Improvements on the docs label Apr 9, 2024
@sjrl sjrl marked this pull request as ready for review April 9, 2024 09:48
@sjrl sjrl requested review from a team as code owners April 9, 2024 09:48
@sjrl sjrl requested review from dfokina and anakin87 and removed request for a team April 9, 2024 09:48
@sjrl
Copy link
Contributor Author

sjrl commented Apr 9, 2024

It seems like the mypy error is unrelated to changes made in this PR

haystack/nodes/file_converter/image.py:122: error: Argument 1 to "_image_to_text" of "ImageToTextConverter" has incompatible type "Image"; expected "PpmImageFile"  [arg-type]

@anakin87
Copy link
Member

anakin87 commented Apr 9, 2024

@sjrl I'm trying to fix the mypy error.

I would prefer that this PR be reviewed by someone who knows the 1.x Pipeline run logic better than I do.

haystack/pipelines/base.py Outdated Show resolved Hide resolved
@sjrl
Copy link
Contributor Author

sjrl commented Apr 10, 2024

@tstadel I added a invocation_context.copy() to both the PromptNode and the Shaper to avoid overwriting values in the _debug dict. Checked that with the changes that running the example in the main description and printing

out["_debug"]["QueryPromptNode"]
{'input': {'query': 'What is the captial of Grmny?', 'debug': True},
 'output': {'results': ['What is the capital of Germany?'],
  'invocation_context': {'query': 'What is the captial of Grmny?',
   'results': ['What is the capital of Germany?'],
   'prompts': ['You are a spell checker. Given a user query return the same query with all spelling errors fixed.\nUser Query: What is the captial of Grmny?\nSpell Checked Query:']}},
 'runtime': {'prompts_used': ['You are a spell checker. Given a user query return the same query with all spelling errors fixed.\nUser Query: What is the captial of Grmny?\nSpell Checked Query:']},
 'exec_time_ms': 540.78}

gives the expected debug output. Without the shaper change in particular you get

out["_debug"]["QueryPromptNode"]
{'input': {'query': 'What is the captial of Grmny?', 'debug': True},
 'output': {'results': ['What is the capital of Germany?'],
  'invocation_context': {'query': 'What is the capital of Germany?',  <--- HERE: This is wrong
   'results': ['What is the capital of Germany?'],
   'prompts': ['You are a spell checker. Given a user query return the same query with all spelling errors fixed.\nUser Query: What is the captial of Grmny?\nSpell Checked Query:']}},
 'runtime': {'prompts_used': ['You are a spell checker. Given a user query return the same query with all spelling errors fixed.\nUser Query: What is the captial of Grmny?\nSpell Checked Query:']},
 'exec_time_ms': 535.65}

@sjrl sjrl requested a review from tstadel April 10, 2024 11:05
Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional invocation_context fix and test! Looking very good!

Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tstadel thanks for your help!

I left a comment about the release note.
When all is good, feel free to merge.

@tstadel
Copy link
Member

tstadel commented Apr 10, 2024

Nice! I just checked. This PR will fix the issue regarding batch debug output of nodes previous to JoinNode reported in #2962 (comment) 🎉

sjrl and others added 4 commits April 10, 2024 15:50
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
@sjrl sjrl merged commit 08007df into v1.x Apr 11, 2024
53 checks passed
@sjrl sjrl deleted the fix-pipeline-with-join-node branch April 11, 2024 08:15
@sjrl sjrl added this to the 1.25.3 milestone Apr 22, 2024
vblagoje pushed a commit that referenced this pull request Apr 23, 2024
* Update Pipeline.run method to pass on more info when using join node

* Add release notes

* Update docs and update _arun method as well

* Added support for unique keys

* Simplify the method

* fix mypy error

* ignore mypy in image converter

* Ignore mypy warning

* Update _debug as well when there are more than two streams being joined.

* Add copy to invocation_context instead of overwriting

* Fix the copy

* Add copy to Shaper as well

* Add another test to make sure query stays changed

* Updated test to make sure that it does only pass with the new changes

* Add to run_batch as well

* Fix type annotations

* Make join nodes work when no docs are provided

* Ignore pylint error

* Update haystack/nodes/other/join_docs.py

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

* Expand on release note

* Fix test

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants