v2.0.0-beta.8

Pre-release

@ZanSara ZanSara released this 22 Feb 10:08
· 261 commits to main since this release
088aa50
Release Notes

v2.0.0-beta.8

Highlights

Introducing a flexible and dynamic approach to creating NLP pipelines with Haystack's new PipelineTemplate class!

This innovative feature utilizes Jinja templated YAML files, allowing users to effortlessly construct and customize complex data processing pipelines for various NLP tasks. From question answering and document indexing to custom pipeline requirements, the PipelineTemplate simplifies configuration and enhances adaptability. Users can now easily override default components or integrate custom settings with simple, straightforward code.

For example, the following pipeline template can be used to create an indexing pipeline:

from haystack.components.embedders import SentenceTransformersDocumentEmbedder 
from haystack.templates import PipelineTemplate, PipelineType 
pt = PipelineTemplate(PipelineType.INDEXING, template_params={"use_pdf_file_converter": True}) 
pt.override("embedder", SentenceTransformersDocumentEmbedder(progress_bar=True)) 
pipe = pt.build() 
result = pipe.run(data={"sources": ["some_local_dir/and_text_file.txt", "some_other_local_dir/and_pdf_file.pdf"]}) 
print(result) 

In the above example, a PipelineType.INDEXING enum is used to create a pipeline with a custom instance of SentenceTransformersDocumentEmbedder and the PDF file converter enabled.

The pipeline is then run on a list of local files, and the result (the number of indexed documents) is printed. Of course, we could have used the same PipelineTemplate class to create any other pre-defined pipeline, or even a custom pipeline with custom components and settings. Similarly, the following pipeline template can be used to create a pre-defined RAG pipeline:

from haystack.templates import PipelineTemplate, PipelineType 
pipe = PipelineTemplate(PipelineType.RAG).build() 
result = pipe.run(query="What's the meaning of life?") 
print(result)

_templateSource loads template content from various inputs, including strings, files, predefined templates, and URLs. The class provides mechanisms to load templates dynamically and ensure they contain valid Jinja2 syntax.

⬆️ Upgrade Notes

  • Adopt the new framework-agnostic device management in Sentence Transformers Embedders.

    Before this change:

    from haystack.components.embedders import SentenceTransformersTextEmbedder 
    embedder = SentenceTransformersTextEmbedder(device="cuda:0") 

    After this change:

    from haystack.utils.device import ComponentDevice, Device 
    from haystack.components.embedders import SentenceTransformersTextEmbedder 
    device = ComponentDevice.from_single(Device.gpu(id=0))
    # or device = ComponentDevice.from_str("cuda:0")
    embedder = SentenceTransformersTextEmbedder(device=device)
  • Adopt the new framework-agnostic device management in Local Whisper Transcriber.

    Before this change:

    from haystack.components.audio import LocalWhisperTranscriber  
    transcriber = LocalWhisperTranscriber(device="cuda:0") 

    After this change:

    from haystack.utils.device import ComponentDevice, Device
    from haystack.components.audio import LocalWhisperTranscriber
    device = ComponentDevice.from_single(Device.gpu(id=0))
    # or device = ComponentDevice.from_str("cuda:0")
    transcriber = LocalWhisperTranscriber(device=device)

🚀 New Features

  • Add FilterRetriever. It retrieves documents that match the filters provided either at init or at runtime.

  • Add LostInTheMiddleRanker. It reorders documents based on the "Lost in the Middle" order, a strategy that places the most relevant paragraphs at the beginning or end of the context, while less relevant paragraphs are positioned in the middle.
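
    The interleaving itself is simple to picture. Below is a minimal plain-Python sketch of the strategy, assuming the input documents arrive sorted most-relevant first; this illustrates the ordering, not the component's actual implementation:

    ```python
    def lost_in_the_middle_order(ranked):
        # ranked: documents sorted most-relevant first. Even ranks fill the
        # front, odd ranks fill the back in reverse, so the least relevant
        # documents end up in the middle of the context.
        return ranked[0::2] + ranked[1::2][::-1]

    # Rank 1 stays first, rank 2 goes last, and rank 5 lands in the middle.
    print(lost_in_the_middle_order([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
    ```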

  • Add support for the Mean Reciprocal Rank (MRR) metric to StatisticalEvaluator. MRR is the reciprocal rank of the first prediction that matches a label, averaged over all labels.
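
    As a rough illustration of the metric (a plain-Python sketch, not the evaluator's actual code), assuming each label comes with a ranked list of predictions:

    ```python
    def mean_reciprocal_rank(labels, predictions):
        # predictions[i] is the ranked list of outputs produced for labels[i];
        # a label that never appears contributes 0 to the mean.
        total = 0.0
        for label, preds in zip(labels, predictions):
            for rank, pred in enumerate(preds, start=1):
                if pred == label:
                    total += 1.0 / rank
                    break
        return total / len(labels)

    # "a" is found at rank 2 (1/2), "b" at rank 1 (1/1): mean is 0.75.
    print(mean_reciprocal_rank(["a", "b"], [["x", "a"], ["b", "y"]]))
    ```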

  • Introducing the OutputAdapter component which enables seamless data flow between pipeline components by adapting the output of one component to match the expected input of another using Jinja2 template expressions. This addition opens the door to greater flexibility in pipeline configurations, facilitating custom adaptation rules and exemplifying a structured approach to inter-component communication.
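
    To make the adaptation idea concrete without pulling in Jinja2, here is a sketch that uses the standard library's string.Template as a stand-in for a template expression; the function name and parameters are illustrative, not part of the component's API:

    ```python
    from string import Template

    def adapt_output(output, template, target_input):
        # Render the upstream component's output dict through the template,
        # then hand the result to the downstream component under the input
        # name it expects.
        rendered = Template(template).substitute(output)
        return {target_input: rendered}

    # Reshape a generator-style {"reply": ...} output into the {"text": ...}
    # input another component expects.
    print(adapt_output({"reply": "42"}, "The answer is $reply.", "text"))
    ```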

  • Add is_greedy argument to the @component decorator. This flag changes the behaviour of `Component`s whose inputs have a `Variadic` type when running inside a `Pipeline`.

    Variadic `Component`s that are marked as greedy will run as soon as they receive their first input. If not marked as greedy, they will instead wait as long as possible before running, to make sure they receive as many inputs as possible from their senders.

    The flag is ignored for all other `Component`s, even if set explicitly.

  • Remove the old evaluation API in favor of a Component based API. We now have SASEvaluator and StatisticalEvaluator replacing the old API.

  • Introduced JsonSchemaValidator to validate the JSON content of ChatMessage against a provided JSON schema. Valid messages are emitted through the 'validated' output, while messages failing validation are sent via the 'validation_error' output, along with useful error details for troubleshooting.
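
    The routing behaviour can be sketched in plain Python. The snippet below uses a simple required-keys check in place of full JSON Schema validation, and the function name is illustrative rather than the component's API:

    ```python
    import json

    def route_message(content, required_keys):
        # Emit ("validated", payload) when the JSON parses and contains the
        # required keys, otherwise ("validation_error", details) so the error
        # can be used for troubleshooting downstream.
        try:
            payload = json.loads(content)
        except json.JSONDecodeError as err:
            return "validation_error", str(err)
        missing = [key for key in required_keys if key not in payload]
        if missing:
            return "validation_error", f"missing keys: {missing}"
        return "validated", payload

    print(route_message('{"name": "RAG"}', ["name"]))  # routed to "validated"
    print(route_message("not json", []))               # routed to "validation_error"
    ```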

  • Add a new parameter called meta_value_type to the MetaFieldRanker that parses the meta value into the specified data type, as long as the meta value is a string. The supported values for meta_value_type are "float", "int", "date", or None. If None is passed, no parsing is done. For example, if we specify meta_value_type="date", then the meta value "date": "2015-02-01" is parsed into a datetime object.
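
    The parsing rule can be sketched as follows; the function and its names are hypothetical illustrations, not the ranker's internals:

    ```python
    from datetime import datetime

    def parse_meta_value(value, meta_value_type=None):
        # Leave the value untouched when no type is requested or when it is
        # not a string; otherwise convert it into the requested type.
        if meta_value_type is None or not isinstance(value, str):
            return value
        if meta_value_type == "float":
            return float(value)
        if meta_value_type == "int":
            return int(value)
        if meta_value_type == "date":
            return datetime.fromisoformat(value)
        raise ValueError(f"Unsupported meta_value_type: {meta_value_type!r}")

    print(parse_meta_value("2015-02-01", "date"))  # datetime(2015, 2, 1, 0, 0)
    ```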

  • Add TextCleaner Component to clean a list of strings. It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers. This is mostly useful to clean generator predictions before evaluation.
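
    A plain-Python sketch of those cleaning steps (the function name, parameters, and defaults are illustrative, not the component's actual signature):

    ```python
    import re
    import string

    def clean_texts(texts, remove_regexps=(), to_lowercase=True,
                    remove_punctuation=True, remove_numbers=True):
        # Apply each step in turn to every string in the list.
        cleaned = []
        for text in texts:
            for pattern in remove_regexps:
                text = re.sub(pattern, "", text)
            if to_lowercase:
                text = text.lower()
            if remove_punctuation:
                text = text.translate(str.maketrans("", "", string.punctuation))
            if remove_numbers:
                text = text.translate(str.maketrans("", "", string.digits))
            cleaned.append(text)
        return cleaned

    print(clean_texts(["Answer: 42!"]))  # ['answer ']
    ```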

⚡️ Enhancement Notes

  • Add __repr__ to all Components to print their I/O. This can also be useful in Jupyter notebooks, as it will be shown as a cell output if it's the last expression in a cell.
  • Add a new Pipeline.show() method to generate the pipeline image inline when run in a Jupyter notebook. If called outside a notebook it will raise a PipelineDrawingError. Pipeline.draw() has also been simplified and its engine argument removed; all images are now generated using Mermaid.
  • Customize Pipeline.__repr__() to return a nice text representation of the pipeline. When run in a Jupyter notebook it will instead behave like Pipeline.show().
  • Change Pipeline.run() to check whether max_loops_allowed has been reached. If we attempt to run a Component that has already run max_loops_allowed times, a PipelineMaxLoops exception is raised.
  • Merge the `Pipeline` definitions into a single `Pipeline` class. The class in the haystack.pipeline package has been deleted; only haystack.core.pipeline exists now.
  • Enhanced the OpenAPIServiceConnector to support dynamic authentication handling. Service credentials are now provided at each run invocation, eliminating the need to pre-configure a known set of service authentications. New services, each with its own authentication, can therefore be introduced on the fly. This both simplifies the initial setup of the OpenAPIServiceConnector and makes authentication against different OpenAPI services more transparent and straightforward.

🐛 Bug Fixes

  • Add the api_base_url attribute to OpenAITextEmbedder. Previously, it was used only for initialization and was not serialized.
  • Previously, when the same input reference was used by different components, the Pipeline run logic behaved unexpectedly. This has been fixed by deep-copying the inputs before passing them to the components.