
Releases: deepset-ai/haystack

v1.22.0-rc3 (Pre-release)

05 Nov 17:13 · 3ad66b5

Release Notes

v1.22.0-rc2

Bug Fixes

  • Adds LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py, ensuring that they are included in JSON schema generation.

v1.22.0-rc1

Upgrade Notes

  • This update enables all Pinecone index types to be used, including Starter. Previously, the Pinecone Starter index type couldn't be used as a document store. Due to the limitations of this index type (https://docs.pinecone.io/docs/starter-environment), fetching documents is capped at Pinecone's query vector limit (10,000 vectors). Accordingly, if the number of documents in the index exceeds this limit, some PineconeDocumentStore functions will be limited.
  • Removes the audio, ray, onnx, and beir extras from the all extras group.

New Features

  • Add experimental support for asynchronous Pipeline runs.

Enhancement Notes

  • Added support for Apple Silicon GPU acceleration through PyTorch's "mps" backend, enabling better performance on Apple M1 hardware.
  • Document writer returns the number of documents written.
  • Added support for using on_final_answer through the Agent callback_manager.
  • Add asyncio support to the OpenAI invocation layer.
  • PromptNode can now be run asynchronously by calling the arun method (see the sketch after this list).
  • Add search_engine_kwargs param to WebRetriever so it can be propagated to WebSearch. This is useful, for example, to pass the engine id when using Google Custom Search.
  • Upgrade Transformers to the latest version 4.34.1. This version adds support for the new Mistral, Persimmon, BROS, ViTMatte, and Nougat models.
  • Make JoinDocuments return only the document with the highest score if there are duplicate documents in the list.
  • Add list_of_paths argument to utils.convert_files_to_docs to allow a list of file paths to be converted, instead of, or in addition to, the current dir_path argument.
  • Optimize particular methods of PineconeDocumentStore (delete_documents and _get_vector_count).
  • Update the deepset Cloud SDK to the new endpoint format for saving new pipeline configs.
  • Add alias names for Cohere embed models for easier mapping between names.
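
As an illustration of the new async support, here is a minimal sketch of calling PromptNode.arun (the model name and prompt are placeholders; an OpenAI API key is assumed to be set in the environment):

import asyncio
import os

from haystack.nodes import PromptNode

async def main():
    # arun mirrors PromptNode.run, but awaits the async invocation layer
    prompt_node = PromptNode("gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])
    result = await prompt_node.arun(query="What is the capital of France?")
    print(result)

asyncio.run(main())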

Deprecation Notes

  • Deprecate OpenAIAnswerGenerator in favor of PromptNode. OpenAIAnswerGenerator will be removed in Haystack 1.23 (a sketch of the replacement follows this list).
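
As a hedged sketch of the migration, a PromptNode configured with the built-in question-answering prompt template can take over the role of OpenAIAnswerGenerator (the model choice and documents are illustrative):

import os

from haystack import Document
from haystack.nodes import PromptNode

# A PromptNode with a QA prompt template replaces OpenAIAnswerGenerator
answer_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo-instruct",
    api_key=os.environ["OPENAI_API_KEY"],
    default_prompt_template="question-answering",
)
result, _ = answer_node.run(
    query="What is Haystack?",
    documents=[Document(content="Haystack is an open source NLP framework.")],
)
print(result["answers"])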

Bug Fixes

  • Fixed the bug that prevented the correct usage of ChatGPT invocation layer in 1.21.1. Added async support for ChatGPT invocation layer.
  • Added a document_store.update_embeddings call to pipeline examples so that embeddings are calculated for newly added documents.
  • Remove the unsupported medium and finance-sentiment models from the list of supported Cohere embed models.

Haystack 2.0 preview

  • Add AzureOCRDocumentConverter to convert files of different types using Azure's Document Intelligence Service.
  • Add ByteStream type to send raw binary data across components in a pipeline.
  • Introduce ChatMessage data class to facilitate structured handling and processing of message content within LLM chat interactions.
  • Adds ChatMessage templating in PromptBuilder
  • Adds HTMLToDocument component to convert HTML to a Document.
  • Adds SimilarityRanker, a component that ranks a list of Documents based on their similarity to the query.
  • Introduce the StreamingChunk dataclass for efficiently handling chunks of data streamed from a language model, encapsulating both the content and associated metadata for systematic processing.
  • Adds TopPSampler, a component that selects documents based on the cumulative probability of the Document scores using top-p (nucleus) sampling.
  • Add dumps, dump, loads and load methods to save and load pipelines in YAML format (see the sketch after this list).
  • Adopt Hugging Face token instead of the deprecated use_auth_token. Add this parameter to ExtractiveReader and SimilarityRanker to allow loading private models. The token is handled properly during serialization: if it is a string (a potentially valid token), it is not serialized.
  • Add mime_type field to ByteStream dataclass.
  • The Document dataclass checks if id_hash_keys is None or empty in __post_init__. If so, it uses the default factory to set a default valid value.
  • Rework Document.id generation: if an id is not explicitly set, it is generated using all Document fields' values; score is not used.
  • Change Document's embedding field type from numpy.ndarray to List[float].
  • Fixed a bug that caused TextDocumentSplitter and DocumentCleaner to ignore id_hash_keys and create Documents with duplicate ids if the documents differed only in their metadata.
  • Fix TextDocumentSplitter failing when run with an empty list
  • Better management of the API key in GPT Generator. The API key is never serialized. The api_base_url parameter is now actually used (previously it was ignored).
  • Add a minimal version of HuggingFaceLocalGenerator, a component that can run Hugging Face models locally to generate text.
  • Migrate RemoteWhisperTranscriber to OpenAI SDK.
  • Add OpenAI Document Embedder. It computes embeddings of Documents using OpenAI models. The embedding of each Document is stored in the embedding field of the Document.
  • Add the TextDocumentSplitter component for Haystack 2.0 that splits a Document with long text into multiple Documents with shorter texts. This ensures that the texts match the maximum length that the language models in Embedders or other components can process.
  • Refactor OpenAIDocumentEmbedder to enrich documents with embeddings instead of recreating them.
  • Refactor SentenceTransformersDocumentEmbedder to enrich documents with embeddings instead of recreating them.
  • Remove "api_key" from serialization of AzureOCRDocumentConverter and SerperDevWebSearch.
  • Remove array field from Document dataclass.
  • Remove id_hash_keys field from Document dataclass. id_hash_keys has also been removed from Components that were using it:
    • DocumentCleaner
    • TextDocumentSplitter
    • PyPDFToDocument
    • AzureOCRDocumentConverter
    • HTMLToDocument
    • TextFileToDocument
    • TikaDocumentConverter
  • Enhanced file routing capabilities with the introduction of ByteStream handling, and improved clarity by renaming the router to FileTypeRouter.
  • Rename MemoryDocumentStore to InMemoryDocumentStore, MemoryBM25Retriever to InMemoryBM25Retriever, and MemoryEmbeddingRetriever to InMemoryEmbeddingRetriever.
  • Renamed ExtractiveReader's input from document to documents to match its type List[Document].
  • Rename SimilarityRanker to TransformersSimilarityRanker, as there will be more similarity rankers in the future.
  • Allow specifying stopwords to stop text generation for HuggingFaceLocalGenerator.
  • Add basic telemetry to Haystack 2.0 pipelines
  • Added DocumentCleaner, which removes extra whitespace, empty lines, headers, etc. from Documents containing text. Useful as a preprocessing step before splitting into shorter text documents.
  • Add TextLanguageClassifier component so that an input string, for example a query, can be routed to different components based on the detected language.
  • Upgrade canals to 0.9.0 to support variadic inputs for Joiner c...
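
To illustrate the new YAML serialization methods on the preview Pipeline, here is a minimal sketch (the haystack.preview import path reflects the preview package of this release; component wiring is omitted):

from haystack.preview import Pipeline

pipeline = Pipeline()
# ... add components and connect them here ...

yaml_str = pipeline.dumps()           # serialize the pipeline to a YAML string
restored = Pipeline.loads(yaml_str)   # rebuild an equivalent pipeline from YAML

with open("pipeline.yaml", "w") as f:
    pipeline.dump(f)                  # dump/load work with file-like objects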

v1.22.0-rc1 (Pre-release)

30 Oct 14:38 · 0fb3b82

Release notes: identical to the v1.22.0-rc1 notes included under v1.22.0-rc3 above.

v1.21.2

06 Oct 08:02 · 188262c

🐛 Bug Fixes

  • Fixed the bug that prevented the correct usage of ChatGPT invocation layer in 1.21.1.
    Added async support for ChatGPT invocation layer.

v1.21.1

04 Oct 11:10 · d9e9925

✨ Enhancements

  • Added experimental support for asynchronous Pipeline run.
  • Added asyncio support to the OpenAI invocation layer.
  • PromptNode can now be run asynchronously by calling the arun method.

⏰ Deprecations

  • Deprecated OpenAIAnswerGenerator in favor of PromptNode. OpenAIAnswerGenerator will be removed in Haystack v1.23.0.

v1.21.0

27 Sep 12:08 · 29acd3c

⭐ Highlights

🚀 Support for gpt-3.5-turbo-instruct

We are happy to announce that Haystack now supports OpenAI's new gpt-3.5-turbo-instruct model! Simply provide the model name in the PromptNode to use it:

import os
from haystack.nodes import PromptNode

pn = PromptNode("gpt-3.5-turbo-instruct", api_key=os.environ.get("OPENAI_API_KEY"))

2️⃣ Preview Installation Extra

Excited about the upcoming Haystack 2.0? We have introduced a new installation extra called preview which you can install to try out the Haystack 2.0 preview! This extra also makes Haystack's core dependencies leaner and thus speeds up installation. If you would like to start experiencing the new Haystack 2.0 components and pipeline design right away, run:

pip install farm-haystack[preview]

⚡️ WeaviateDocumentStore Performance

We fixed a bottleneck in WeaviateDocumentStore that was slowing down indexing. The fix led to a notable performance improvement, reducing the indexing time for one million documents by a factor of six!

🐣 PineconeDocumentStore Robustness

The PineconeDocumentStore now uses metadata instead of namespaces for the distinction between documents with embeddings, documents without embeddings, and labels. This is a breaking change and it makes the PineconeDocumentStore more robust to use in Haystack pipelines. If you want to retrieve all documents with an embedding, specify the metadata instead of the namespace as follows:

from haystack.document_stores.pinecone import DOCUMENT_WITH_EMBEDDING
# docs = doc_store.get_all_documents(namespace="vectors") # old way using namespaces
docs = doc_store.get_all_documents(type_metadata=DOCUMENT_WITH_EMBEDDING)

Additionally, if you want to retrieve all documents without an embedding, specify the metadata instead of the namespace:

# docs = doc_store.get_all_documents(namespace="no-vectors") # old way using namespaces
docs = doc_store.get_all_documents(type_metadata="no-vector")

⬆️ Upgrade Notes

  • SklearnQueryClassifier is removed and users should switch to the more powerful TransformersQueryClassifier instead. #5447

  • Refactor PineconeDocumentStore to use metadata instead of namespaces for the distinction between documents with embeddings, documents without embeddings, and labels.

✨ Enhancements

  • ci: Fix typos discovered by codespell running in pre-commit.

  • Support OpenAI's new gpt-3.5-turbo-instruct model

🐛 Bug Fixes

  • Fix EntityExtractor output not JSON serializable.

  • Fix model_max_length not being set in the Tokenizer in DefaultPromptHandler.

  • Fixed a bottleneck in Weaviate document store which was slowing down the indexing.

  • The gpt-35-turbo-16k model from Azure now integrates correctly.

  • Upgrades tiktoken to 0.5.1 to account for a breaking release.

👁️ Haystack 2.0 preview

  • Add the AnswerBuilder component for Haystack 2.0 that creates Answer objects from the string output of Generators.

  • Adds LinkContentFetcher component to Haystack 2.0. LinkContentFetcher fetches content from a given URL and
    converts it into a Document object, which can then be used within the Haystack 2.0 pipeline.

  • Add MetadataRouter, a component that routes documents to different edges based on the content of their fields.

  • Adds support for PDF files to the Document converter via the pypdf library.

  • Adds SerperDevWebSearch component to retrieve URLs from the web. See https://serper.dev/ for more information.

  • Add TikaDocumentConverter component to convert files of different types to Documents.

  • This adds an ExtractiveReader for v2, intended as a replacement where FARMReader would previously have been used for inference. The confidence scores are calculated differently from FARMReader because each span is considered an independent binary classification task.

  • Introduce GPTGenerator, a class that can generate completions using OpenAI Chat models like GPT3.5 and GPT4.

  • Remove id parameter from Document constructor as it was ignored and a new one was generated anyway.
    This is a backwards incompatible change.

  • Add generators module for LLM generator components.

  • Adds GPT4Generator, an LLM component based on GPT35Generator.

  • Add embedding_retrieval method to MemoryDocumentStore, which retrieves the relevant Documents given a query embedding. It will be used by the MemoryEmbeddingRetriever.

  • Rename MemoryRetriever to MemoryBM25Retriever. Add MemoryEmbeddingRetriever, which takes a query embedding as input and retrieves the most relevant Documents from a MemoryDocumentStore.

  • Adds a proposal for an extended Document class in Haystack 2.0.

  • Adds the implementation of said class.

  • Add OpenAI Text Embedder.
    It is a component that uses OpenAI models to embed strings into vectors.

  • Revert #5826 and optionally take the id in the Document
    class constructor.

  • Create a dedicated dependency list for the preview package, farm-haystack[preview].
    Using haystack-ai is still the recommended way to test Haystack 2.0.

  • Add PromptBuilder component to render prompts from template strings (see the sketch after this list).

  • Add prefix and suffix attributes to SentenceTransformersDocumentEmbedder.
    They can be used to add a prefix and suffix to the Document text before
    embedding it. This is necessary to take full advantage of modern embedding
    models, such as E5.

  • Add support for dates in filters.

  • Add UrlCacheChecker to support web retrieval pipelines. It checks whether documents from a given list of URLs are already present in the store and, if so, returns them. All URLs with no matching documents are returned on a separate connection.
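
A hedged sketch of PromptBuilder rendering a Jinja2 template string (the import path is an assumption based on the preview package layout of this release):

from haystack.preview.components.builders.prompt_builder import PromptBuilder

# The template is rendered with the keyword arguments passed to run()
builder = PromptBuilder(template="Answer the question. Question: {{ question }}")
result = builder.run(question="What is Haystack?")
print(result["prompt"])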

v1.20.1

12 Sep 13:16

Changelog

  • fix: temporary pin tiktoken #5774

Full Changelog: v1.20.0...v1.20.1

v1.20.0

04 Sep 14:36

⭐ Highlights


🪄 LostInTheMiddleRanker and DiversityRanker

We are excited to introduce two new rankers to Haystack: LostInTheMiddleRanker and DiversityRanker!

LostInTheMiddleRanker is based on the research paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. It reorders documents according to the "Lost in the Middle" strategy, which places the most relevant paragraphs at the beginning and end of the context, while less relevant paragraphs are positioned in the middle. This ranker can be used in Retrieval-Augmented Generation (RAG) pipelines. Here is an example of how to use it:

import os

from haystack import Pipeline
from haystack.nodes import PromptNode, TopPSampler, WebRetriever
from haystack.nodes.ranker import DiversityRanker, LostInTheMiddleRanker

search_key = os.environ["SERPERDEV_API_KEY"]  # a search engine key is assumed
prompt_node = PromptNode("gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])  # any PromptNode works here

web_retriever = WebRetriever(api_key=search_key, top_search_results=5, mode="preprocessed_documents", top_k=50)

sampler = TopPSampler(top_p=0.97)
diversity_ranker = DiversityRanker()
litm_ranker = LostInTheMiddleRanker(word_count_threshold=1024)

pipeline = Pipeline()
pipeline.add_node(component=web_retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=sampler, name="Sampler", inputs=["Retriever"])
pipeline.add_node(component=diversity_ranker, name="DiversityRanker", inputs=["Sampler"])
pipeline.add_node(component=litm_ranker, name="LostInTheMiddleRanker", inputs=["DiversityRanker"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["LostInTheMiddleRanker"])

In this example, we have positioned the LostInTheMiddleRanker as the last component before the PromptNode. This is because the LostInTheMiddleRanker is designed to be used in combination with other rankers. It is recommended to place it towards the end of the pipeline (as the last ranker), so that it can reorder the documents that have already been ranked by other rankers.

DiversityRanker is a tool that helps to increase the diversity of a set of documents. It uses sentence-transformer models to calculate semantic embeddings for each document and then ranks them in a way that ensures that each subsequent document is the least similar to the ones that have already been selected. This results in a list where each document contributes the most to the overall diversity of the selected set.

We'll reuse the same example from the LostInTheMiddleRanker to point out that the DiversityRanker can be used in combination with other rankers. It is recommended to place it in the pipeline after the similarity ranker but before the LostInTheMiddleRanker. Note that DiversityRanker is typically used in generative RAG pipelines to ensure that the generated answer is drawn from a diverse set of documents. This setup is typical for Long-Form Question Answering (LFQA) tasks. Check out the Enhancing RAG Pipelines in Haystack: Introducing DiversityRanker and LostInTheMiddleRanker article on the Haystack Blog for details.

📰 New release note management

We have implemented a new release note management system, reno. From now on, every contributor is responsible for adding release notes for the feature or bugfix they're introducing in Haystack in the same Pull Request containing the code changes. The goal is to encourage detailed and accurate notes for every release, especially when it comes to complex features or breaking changes.

See how to work with the new release notes in our Contribution Guide.

⬆️ Upgrade Notes


  • If you're a Haystack contributor, you need a new tool called reno to manage the release notes.
    Please run pip install -e .[dev] to ensure you have reno available in your environment.

  • The OpenSearch custom_query syntax changes: the old filter placeholders for custom_query are no longer supported.
    Replace your custom filter expressions with the new ${filters} placeholder:

    Old:

      retriever = BM25Retriever(
        custom_query="""
          {
              "query": {
                  "bool": {
                      "should": [{"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                      ],
                      "filter": [
                          {"terms": {"year": ${years}}},
                          {"terms": {"quarter": ${quarters}}},
                          {"range": {"date": {"gte": ${date}}}}
                      ]
                  }
              }
          }
        """
      )
    
      retriever.retrieve(
          query="What is the meaning of life?",
          filters={"years": [2019, 2020], "quarters": [1, 2, 3], "date": "2019-03-01"}
      )

    New:

      retriever = BM25Retriever(
        custom_query="""
          {
              "query": {
                  "bool": {
                      "should": [{"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                      ],
                      "filter": ${filters}
                  }
              }
          }
        """
      )
    
      retriever.retrieve(
          query="What is the meaning of life?",
          filters={"year": [2019, 2020], "quarter": [1, 2, 3], "date": {"$gte": "2019-03-01"}}
      )
  • This update impacts only those who have created custom invocation layers by subclassing PromptModelInvocationLayer.
    Previously, the invoke() method in your custom layer received all prompt template parameters (such as query,
    documents, etc.) as keyword arguments. With this change, these parameters are no longer passed in as keyword
    arguments. If you have implemented such a custom layer, you may need to update your code to accommodate
    this change; a sketch follows this list.
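
A minimal sketch of such an update, assuming a custom layer that reads the already-rendered prompt instead of individual template parameters (the class name and model check are illustrative, and additional base-class overrides may be required):

from haystack.nodes.prompt.invocation_layer import PromptModelInvocationLayer

class EchoInvocationLayer(PromptModelInvocationLayer):
    def invoke(self, *args, **kwargs):
        # Template parameters such as "query" or "documents" no longer arrive
        # as keyword arguments; read the rendered prompt instead.
        prompt = kwargs.get("prompt")
        return [f"echo: {prompt}"]  # replace with a real model call

    def _ensure_token_limit(self, prompt):
        return prompt  # no truncation in this sketch

    @classmethod
    def supports(cls, model_name_or_path, **kwargs):
        return model_name_or_path == "echo-model"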

🥳 New Features


  • The LostInTheMiddleRanker can be used like other rankers in Haystack. After initializing LostInTheMiddleRanker with the desired parameters, it can be used to rank/reorder a list of documents based on the "Lost in the Middle" order - the most relevant documents are located at the top and bottom of the returned list, while the least relevant documents are found in the middle. We advise using this ranker in combination with other rankers and placing it towards the end of the pipeline.

  • The DiversityRanker can be used like other rankers in Haystack and it can be particularly helpful in cases where you have highly relevant yet similar sets of documents. By ensuring a diversity of documents, this new ranker facilitates a more comprehensive utilization of the documents and, particularly in RAG pipelines, potentially contributes to more accurate and rich model responses.

  • When using custom_query in BM25Retriever along with OpenSearch or Elasticsearch, we added support for dynamic filters, like in regular queries. With this change, you can pass filters at query time without having to modify the custom_query: instead of defining filter expressions and field placeholders, all you have to do is set the ${filters} placeholder in your custom_query, analogous to the ${query} placeholder.
    For example:

      {
          "query": {
              "bool": {
                  "should": [{"multi_match": {
                      "query": ${query},                 // mandatory query placeholder
                      "type": "most_fields",
                      "fields": ["content", "title"]}}
                  ],
                  "filter": ${filters}                 // optional filters placeholder
              }
          }
      }
  • DeepsetCloudDocumentStore supports searching multiple fields in sparse queries. This enables you to search meta fields as well when using BM25Retriever. For example, set search_fields=["content", "title"] to search the title meta field along with the document content.

  • Rework DocumentWriter to remove DocumentStoreAwareMixin. Now we require a generic DocumentStore when initialising the writer.

  • Rework MemoryRetriever to remove DocumentStoreAwareMixin. Now we require a MemoryDocumentStore when initialising the retriever.

  • Introduced the allowed_domains parameter in WebRetriever for domain-specific searches, enabling "talk to a website" and "talk to docs" scenarios (see the sketch after this list).
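
A minimal sketch of a domain-restricted WebRetriever (the environment variable and domain are placeholders):

import os

from haystack.nodes import WebRetriever

web_retriever = WebRetriever(
    api_key=os.environ["SERPERDEV_API_KEY"],
    allowed_domains=["docs.haystack.deepset.ai"],  # only search these domains
    mode="preprocessed_documents",
)
docs = web_retriever.retrieve(query="How do I initialize PromptNode?")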

✨ Enhancements


  • The WebRetriever now employs an enhanced caching mechanism that caches web page content based on search engine results rather than the query.

  • Upgrade transformers to the latest version 4.32.1 so that Haystack benefits from Llama and T5 bugfixes: https://github.com/huggingface/transformers/releases/tag/v4.32.1

  • Upgrade Transformers to the latest version 4.32.0.
    This version adds support for GPTQ quantization and integrates MPT models.

  • Add top_k parameter to the DiversityRanker init method.

  • Enable setting the max_length value when running PromptNodes using local HF text2text-generation models.

  • Enable passing use_fast to the underlying transformers' pipeline

  • Enhance FileTypeClassifier to detect media file types like mp3, mp4, mpeg, m4a, and similar.

  • Minor PromptNode HFLocalInvocationLayer test improvements

  • Several minor enhancements for LinkContentFetcher:

    • Dynamic content handler resolution
    • Custom User-Agent header (optional, minimize blocking)
    • PDF support
    • Register new content handlers
  • If LinkContentFetcher encounters a block or receives any response code other than HTTPStatus.OK, it returns the search engine snippet as content, if available.

  • Allow loading Tokenizers for prompt models not natively supported ...


v1.20.0-rc1 (Pre-release)

30 Aug 12:29

Release notes: identical to the v1.20.0 notes above.

v1.19.0

26 Jul 16:22 · cf25763

⭐️ Highlights

🔎 Elasticsearch 8 support

We are thrilled to share that Haystack now supports the latest version of Elasticsearch, Elasticsearch 8, as a Document Store backend. To use Haystack with Elasticsearch 8, just install the new elasticsearch8 extra:

pip install farm-haystack[elasticsearch8]

Importing ElasticsearchDocumentStore from haystack.document_stores will automatically choose the correct Document Store based on the version of the installed Elasticsearch client.

🗂️ RecentnessRanker

We're excited to introduce a new feature to Haystack – a document recentness ranking component! We recognized the importance of ranking documents based on their recentness, especially in scenarios where timely information is critical. For instance, when searching through technical documentation for software releases or news articles, it's essential to prioritize the most up-to-date information. 👇

from haystack.nodes import RecentnessRanker

ranker = RecentnessRanker(
    date_meta_field="date",  # Key pointing to the date field in the metadata.
    ranking_mode="score",
    weight=0.5,  # A 0.5 weight means content relevance and age are averaged.
)

For more details, check out the documentation.

🧠 Improved support for Anthropic Claude

We're thrilled to announce an important update to Haystack's Anthropic Claude support! This update follows the latest improvements in Anthropic Claude models, notably support for Claude 2 and their humongous context window sizes.

Moreover, we've integrated Claude models into our example scripts, making it easier for users to test these cutting-edge models. For instance, check out the updated examples/link_content_blog_post_summary.py script for a demo of Claude summarizing blog posts directly from hyperlinks.

We still support both the older models (e.g., claude-v1) and the new Claude models. For more details, see the Anthropic Claude documentation.

🚀 Support for Llama 2 on AWS SageMaker

We are excited to share that Haystack now supports models of the Llama 2 family deployed to AWS SageMaker! Once you’ve deployed your Llama 2 models (including the chat variant) in AWS SageMaker, use them with PromptNode by simply providing the inference endpoint name, your aws_profile_name and aws_custom_attributes👇

from haystack.nodes import PromptNode

prompt_node = PromptNode(
    model_name_or_path="sagemaker-llama-2-endpoint-name",
    model_kwargs={
        "aws_profile_name": "my_aws_profile_name",
        "aws_custom_attributes": {"accept_eula": True},
    },
)
result = prompt_node("Berlin is the capital of")
print(result)

# or the Llama 2 chat model
prompt_node = PromptNode(
    model_name_or_path="sagemaker-llama-2-chat-endpoint-name",
    model_kwargs={
        "aws_profile_name": "my_aws_profile_name",
        "aws_custom_attributes": {"accept_eula": True},
    },
)
chat_conversation = [[
    {"role": "user", "content": "what is the recipe of mayonnaise?"},
]]
result = prompt_node(chat_conversation)
print(result)

For more details on model deployment, check out the documentation.

🎉 Now using transformers 4.31.0

With this release, Haystack depends on the latest version of the transformers library, allowing support for Llama 2.

🚫 SklearnQueryClassifier deprecation

Starting from version 1.19, SklearnQueryClassifier is being deprecated and will be removed from Haystack as of version 1.21. We recommend using the more powerful TransformersQueryClassifier instead. See the announcement for more details.

What's Changed

Pipeline

  • feat: globally disable progress bars by @ZanSara in #5207
  • Add cpu-remote-inference Docker image by @vblagoje in #5225
  • fix: Support isolated node eval in run_batch in Generators by @bogdankostic in #5291
  • feat: support OpenAI-Organization for authentication by @anakin87 in #5292
  • docs: Small documentation updates to dense.py by @sjrl in #5305
  • test: Refactor some retriever tests into unit tests by @sjrl in #5306
  • feat: Add support for meta fields that are lists when using embed_meta_fields by @sjrl in #5307
  • refactor: Extract link retrieval from WebRetriever, introduce LinkContentFetcher by @vblagoje in #5227
  • fix: update WebRetriever docstrings and default mode by @dfokina in #5352
  • added hybrid search example by @nickprock in #5376

DocumentStores

  • fix: Allow filtering on list fields in InMemoryDocumentStore with all operators by @bogdankostic in #5208
  • Fix: FAISSDocumentStore - make write_documents properly work in combination w update_embeddings by @anakin87 in #5221
  • bug: fix for pinecone not working for per document updates by @vblagoje in #5110
  • fix: avoid conflicts with opensearch / elasticsearch magic attributes during bulk requests by @tstadel in #5113
  • ci: Add unit test for Elasticsearch8 by @bogdankostic in #5300
  • feat: Check version of Elasticsearch server and add support for Elasticsearch <= 7.5 by @bogdankostic in #5320

Documentation

  • feat: BM25 retrieval for MemoryDocumentStore by @vblagoje in #5151
  • fix: install inference in REST API tests by @ZanSara in #5252
  • fix: import_utils fetch_archive_from_http - improve url parsing for fetching archive from http by @malte-aws in #5199
  • fix: Improve robustness of get_task HF pipeline invocations by @MichelBartels in #5284
  • feat: introduce Store protocol (v2) by @ZanSara in #5259
  • fix: num_return_sequences should be less than num_beams, not top_k by @faaany in #5280
  • Revert "fix: num_return_sequences should be less than num_beams, not top_k" by @julian-risch in #5434
  • chore: deprecate SklearnQueryClassifier by @anakin87 in #5324
  • fix: Run HFLocalInvocationLayer.supports even if inference packages are not installed by @MichelBartels in #5308
  • fix: a small bug in StopWordsCriteria by @faaany in #5316
  • chore: fix typo in base.py by @eltociear in #5356
  • feat: extend pipeline.add_component to support stores by @ZanSara in #5261
  • proposal: Add RecentnessRanker component by @elundaeva in #5289
  • feat: Add embed_meta_fields to Ranker nodes by @sjrl in #5361
  • feat: Recentness Ranker by @elundaeva in #5301
  • feat: Update Anthropic Claude support with the latest models, new streaming API, context window sizes by @vblagoje in #5406
  • feat: Enable Support for Meta LLama-2 Models in Amazon Sagemaker by @vblagoje in #5437

Other Changes

…

v1.19.0-rc3 (Pre-release)

26 Jul 15:25