bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node #3170
Conversation
|
@JeffRisberg apologies for the latency here. This is on my radar and I'll come back to you in a couple of days. |
|
Hello @JeffRisberg, thank you for this tricky fix. Much appreciated! I tested out your change and it seems safe to me. However, could you add some tests? They should be added to |
|
@ZanSara I have added a test case as requested, and all PR checks ran successfully in the last 24-48 hours. Is there anything else you need from me? |
|
Hey @JeffRisberg ! Thanks for the ping, I lost sight of this PR. I'll review it shortly and be back with some feedback. |
ZanSara
left a comment
Looks good! Thank you and sorry again for the delay
Related Issues
Proposed Changes:
There is a subtle bug in the pipeline execution code when a pipeline includes a join node followed by another join node.
We have a pipeline that has four retrievers. They are joined in pairs by two JoinDocuments nodes, followed by another JoinDocuments node that uses the results of the prior joins.
However, not all results from the retrievers are processed and returned by the final JoinDocuments node. Documents are lost.
The pipeline is built correctly, because all nodes are connected correctly in the DiGraph of class Pipeline.
However, the code at line 526 of pipelines/base.py builds up a list of inputs. It assumes that the parameters dict does not have a key called "inputs" for the new node.
When a join node is called, though, it does have a parameter key called "inputs".
This value is returned from the execution of the node.
Hence the second join node in the chain receives inputs that include the inputs of the prior node.
As a result, the number of inputs is not equal to the number of weights in the join, and the documents are not joined together correctly.
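The failure mode described above can be reproduced without Haystack in a few lines. The sketch below is a simplified model, not the actual code in pipelines/base.py or nodes/base.py: `run_node`, `enqueue`, and `final_join_inputs` are hypothetical helpers, and retriever results are reduced to short strings. It assumes the pipeline keeps the first output queued for a downstream node as-is and only wraps outputs in an "inputs" list once a second one arrives.

```python
def run_node(arguments, result):
    """Hypothetical node run: return the node's own result plus every
    argument it did not overwrite (the pre-fix pass-through behavior)."""
    output = dict(result)
    for k, v in arguments.items():
        if k not in output:
            output[k] = v  # "inputs" leaks into the output here
    return output


def enqueue(queue, node_name, node_output):
    """Assumed model of how outputs are collected for a downstream join node."""
    existing = queue.get(node_name)
    if existing is None:
        queue[node_name] = node_output
    elif "inputs" not in existing:
        queue[node_name] = {"inputs": [existing, node_output]}
    else:
        # The leaked "inputs" key is mistaken for an already-built input list.
        existing["inputs"].append(node_output)


def final_join_inputs():
    """Run two first-level joins and return what the final join would receive."""
    queue = {}
    for retrievers in (["r1", "r2"], ["r3", "r4"]):
        arguments = {"inputs": [{"documents": [r]} for r in retrievers]}
        merged = {"documents": list(retrievers)}
        enqueue(queue, "final_join", run_node(arguments, merged))
    return queue["final_join"]["inputs"]


buggy_inputs = final_join_inputs()
print(len(buggy_inputs))  # 3 entries, although the final join has only 2 upstream nodes
```

In this model the final join sees three entries where two weights would be expected, which matches the mismatch between inputs and weights described above.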
How did you test it?
There is a test located at https://github.com/JeffRisberg/HaystackPipelineTest
Notes for the reviewer
I determined this by putting a breakpoint in the run() method of the JoinNode class and checking whether the inputs were correct.
The solution is at line 258 in nodes/base.py:
# add "extra" args that were not used by the node and are not inputs
for k, v in arguments.items():
    if k not in output.keys() and k != "inputs":
        output[k] = v
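To see the effect of the extra `k != "inputs"` condition in isolation, here is a minimal standalone sketch. The function name `add_unused_arguments` and the sample `arguments` dict (including the "query" key) are illustrative assumptions, not part of the Haystack code base. Only the loop body mirrors the change above.

```python
def add_unused_arguments(output, arguments):
    """Copy arguments the node did not produce itself into its output,
    skipping "inputs" so it cannot leak to downstream nodes."""
    for k, v in arguments.items():
        if k not in output and k != "inputs":
            output[k] = v
    return output


# Hypothetical arguments as a join node might receive them.
arguments = {
    "inputs": [{"documents": ["r1"]}, {"documents": ["r2"]}],
    "query": "example query",
}
output = add_unused_arguments({"documents": ["r1", "r2"]}, arguments)
print(sorted(output))  # ['documents', 'query']
```

Other unused arguments such as "query" are still passed through. Only "inputs" is dropped, so a downstream join node builds its own input list from scratch.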
Checklist