
v0.10.0

@tholor released this 16 Sep 08:31

⭐ Highlights

🚀 Making Pipelines more scalable

You can now easily scale and distribute Haystack Pipelines thanks to the new integration of the Ray framework (https://ray.io/).
Ray distributes a Pipeline's components across a cluster of machines, and the individual components can be scaled independently. For instance, an extractive QA Pipeline deployment can have three replicas of the Reader and a single replica of the Retriever, enabling efficient resource utilization through horizontal scaling. You can use Ray via the new RayPipeline class (#1255).

To set the number of replicas, add replicas to the node's entry in the pipeline YAML config:

components:
    ...

pipelines:
  - name: ray_query_pipeline
    type: RayPipeline
    nodes:
      - name: ESRetriever
        replicas: 2  # number of replicas to create on the Ray cluster
        inputs: [ Query ]

A RayPipeline can currently only be created from a YAML Pipeline config:

from haystack.pipeline import RayPipeline
pipeline = RayPipeline.load_from_yaml(path="my_pipelines.yaml", pipeline_name="my_query_pipeline")
pipeline.run(query="What is the capital of Germany?")

See the docs for more details.

😍 Making Pipelines more user-friendly

The old Pipeline design came with a couple of flaws:

  • Impossible to route certain parameters (e.g. top_k) to dedicated nodes
  • Incorrect parameters in pipeline.run() are silently swallowed
  • Hard to understand what is in **kwargs when working with node.run() methods
  • Hard to debug

We tackled those with a big refactoring of the Pipeline class and changed how data is passed between nodes (#1321).
This now comes with a few breaking changes:

Component params like top_k and no_ans_boost for Pipeline.run() must now be passed in a params dict:

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component-specific top_k parameters like top_k_reader and top_k_retriever are now replaced by a single top_k. To disambiguate, params can be "targeted" to a specific node:

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})

See the breaking changes section and the docs for details.

📈 Better evaluation metric for QA: Semantic Answer Similarity (SAS)

The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In our recent EMNLP paper, we proposed "SAS", a cross-encoder-based metric for the estimation of semantic answer similarity. We compared it to seven existing metrics and found that it correlates better with human judgement. See our paper #1338

You can use it in Haystack like this:

...
# initialize the node with a SAS model
eval_reader = EvalAnswers(sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# define a pipeline 
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalDocuments", inputs=["ESRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalDocuments"])
p.add_node(component=eval_reader, name="EvalAnswers", inputs=["QAReader"])
...
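
After indexing your evaluation documents, you run the pipeline on labelled queries and read the metrics from the eval nodes. The snippet below is only a rough sketch: the labels object and the exact run() arguments are assumptions based on the new params API introduced in this release; Tutorial 5 shows the precise call.

# run the eval pipeline on one labelled query (rough sketch, see Tutorial 5)
results = p.run(
    query="Who is the father of Arya Stark?",
    labels=labels,  # assumed: the labels for this query, e.g. fetched from the document store
    params={"ESRetriever": {"top_k": 10}, "QAReader": {"top_k": 5}},
)

# the eval nodes accumulate metrics over all queries run through the pipeline
eval_retriever.print()
eval_reader.print()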

See our updated Tutorial 5 for a full example.

🤯 New nodes: Doc Classifier, Re-Ranker, QuestionGenerator & more

More nodes, more use cases:

  • FARMClassifier node for Document Classification: tag a document at indexing time or add a class downstream in your inference pipeline #1265
  • SentenceTransformersRanker: Re-Rank your documents after retrieval to maximize the relevance of your results. This implementation uses the popular sentence-transformer models #1209
  • QuestionGenerator: Question Answering systems are trained to find an answer given a question and a document; but with the recent advances in generative NLP, there are now models that can read a document and suggest questions that can be answered by that document. All this power is available to you now via the QuestionGenerator class.
    QuestionGenerator models can be trained using Question Answering datasets. Instead of predicting answers, the QuestionGenerator takes the document as input and is trained to output the questions. This can be useful when you want to add "autosuggest" questions in your search bar or accelerate labeling processes. See docs (#1267). A short usage sketch follows this list.
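
As a taste of how the new nodes fit together, here is a minimal sketch (model names are illustrative, the retriever is assumed to be defined already, and import paths follow the 0.10 module layout; see each node's docs for the exact constructor arguments):

from haystack.pipeline import Pipeline
from haystack.ranker import SentenceTransformersRanker
from haystack.question_generator import QuestionGenerator

# re-rank retrieved documents with a pre-trained cross-encoder
ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["Retriever"])

# suggest questions that a given document can answer
question_generator = QuestionGenerator()
questions = question_generator.generate("Arya Stark is the daughter of Eddard Stark.")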

🔭 Better support for OpenSearch

We now support approximate nearest neighbour (ANN) search in OpenSearch (#1225) and fixed some initialization issues.
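
As a rough sketch of initializing the store for ANN search (the class name, import path, and constructor parameters shown here are assumptions modelled on the Elasticsearch document store; check the Document Store docs for the exact class and arguments for your OpenSearch setup):

from haystack.document_store import OpenSearchDocumentStore

# assumed parameters: a local OpenSearch instance and cosine similarity for ANN search
document_store = OpenSearchDocumentStore(host="localhost", port=9200, index="document", similarity="cosine")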

📑 New Tutorials

  • Tutorial 13: Question Generation (#1364)
  • Tutorial 14: Query Classifier (#1364)

⚠️ Breaking Changes

probability field removed from results #1340

Having two fields, probability and score, in the answers / documents returned from nodes often caused confusion.
From now on there is only one field, called score, which is in the range [0,1]. In QA results, this field is populated with the old probability value, so you can simply switch to it. These fields have changed in both the Python and REST APIs.

Old:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 14.684528350830078,
      "probability": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}

New:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}

Removed Finder #1326

After being deprecated a few months ago, Finder is now gone - R.I.P.

Params in Pipeline.run() #1321

Component params like top_k and no_ans_boost for Pipeline.run() must now be passed in a params dict.

Old:

pipeline.run(query="Why?", top_k_retriever=10, no_ans_boost=0.5)

New:

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component-specific top_k parameters like top_k_reader and top_k_retriever are now replaced by a single top_k. To disambiguate, params can be "targeted" to a specific node.

Old:

pipeline.run(query="Why?", top_k_retriever=10, top_k_reader=5)

New:

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})

Also, custom nodes must not have **kwargs in their run methods anymore and should only return the data (e.g. answers) they produce themselves.
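
For illustration, a minimal custom node under the new design could look like the following sketch (the class and its parameters are made up for the example, and the BaseComponent import path is an assumption; adapt it to your setup):

from haystack.schema import BaseComponent

class QueryLogger(BaseComponent):
    # toy node: logs the incoming query and passes execution on unchanged
    outgoing_edges = 1

    def run(self, query):  # explicit parameters instead of **kwargs
        print(f"Received query: {query}")
        # return only the data this node produces itself, plus the name of the outgoing edge
        return {"query_logged": True}, "output_1"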

🤓 Detailed Changes

Crawler

  • Serialize crawler output to JSON #1284
  • Add Crawler support for indexing pipeline #1360

Converter

  • Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR #1349

Preprocessor

  • Add PreProcessor optional language parameter. #1160
  • Improve preprocessing logging #1263
  • Make PreProcessor.process() work on lists of documents #1163

Pipeline

  • Add Ray integration for Pipelines #1255
  • MostSimilarDocumentsPipeline introduced #1413
  • QoL function: access certain nodes in pipeline #1441
  • Refactor replicas config for Ray Pipelines #1378
  • Add simple docs2answer node to allow FAQ style QA / Doc search in API #1361
  • Allow for batch indexing when using Pipelines fix #1168 #1231

Document Stores

  • Implement OpenSearch ANN #1225
  • Bump Weaviate version to 1.7.0 #1412
  • Catch Elastic's search_phase_execution and raise with descriptive message. #1371
  • Fix behavior of delete_documents() with filters for Milvus #1354
  • delete_all_documents() replaced by delete_documents() #1377
  • Support OpenDistro init #1334
  • Integrate filters with knn queries in OpenDistroElasticsearchDocumentStore #1301
  • feat: add support for elastic search to connect without any authentication #1294
  • Raise warning when labels are overwritten #1257
  • Fix SQLAlchemy relationship warnings #1289
  • Added explicit refresh call during refresh_type is false in update em… #1259
  • Add id in write_labels() for SQLDocumentStore #1253
  • ElasticsearchDocumentStore get_label_count() bug fixed. #1252
  • SQLDocumentStore get_label_count() index bug fixed. #1251

Retriever

  • Adding multi gpu support for DPR inference #1414
  • Ensure num_hard_negatives is 0 when embedding passages #1402
  • global_loss_buffer_size to the DensePassageRetriever, fix exceeds max_size #1245

Summarizer

  • Transformer summarizer truncation bug fixed #1309

Document Classifier

  • Add FARMClassifier node for Document Classification #1265

Re-Ranker

  • Add SentenceTransformersRanker with pre-trained Cross-Encoder #1209

Reader

  • Use Reader's device by default #1208

Generator

  • Add QuestionGenerator #1267

Evaluation

  • Add new QA eval metric: Semantic Answer Similarity (SAS) #1338

REST API

  • Fix handling of filters in Search REST API #1431
  • Add support for Dense Retrievers in REST API Indexing Pipeline #1430
  • Add Header in sample REST API Search Request #1293
  • Fix convert integer CONCURRENT_REQUEST_PER_WORKER #1247
  • Env var CONCURRENT_REQUEST_PER_WORKER #1235
  • Small UI and REST API fixes #1223
  • Add scaffold for defining custom components for Pipelines #1205

Docker

  • Update DocumentStore env in docker-compose #1450
  • Enable docker-compose for GPUs & Add public UI image #1406
  • Fix tesseract installation in Dockerfile #1405

User Interface

  • Allow multiple files to upload for Haystack UI #1323
  • Add faq annotation #1333
  • Upgrade streamlit #1279

Documentation and Tutorials

  • new docs version for 0.9.0 #1217
  • Added functionality for Google Colab usecase in Crawler Module #1436
  • Update sentence transformer model in FAQ tutorial #1401
  • crawler api docs updated. #1388
  • Add support for no Docker envs in Tutorial 13 #1365
  • Rag tutorial fixes #1375
  • Editing docs read.me for new docs website workflow #1372
  • Add query classifier usage docs #1348
  • Adding tutorial 13 and 14 #1364
  • Remove Finder from tutorials #1329
  • Tutorial1 remove finder class from import #1328
  • Update docstring for RAG #1149
  • Update README.md for tutorial 13 Question Generation #1325
  • add query classifier colab and jupyter notebook #1324
  • Remove pipeline eval example script #1297
  • Change variable names in tutorials #1286
  • Add links to tutorial 12 to readme #1274
  • Encapsulate tutorial code in method #1266
  • Fix Links #1199

Misc

  • Improve document stores unit test parametrization #1202
  • Version tag added to Haystack #1216
  • Add type ignore to resolve mypy errors #1427
  • Bump pillow from 8.2.0 to 8.3.2 #1423
  • Add sentence-transformers as mandatory dependency and remove from dev… #1387
  • Adjust WeaviateDocumentStore import #1379
  • Update test documentation in readme #1355
  • Add tests for Crawler #1339
  • Suppress FAISS logs & apex warnings #1315
  • Pin Weaviate version #1306
  • Relax typing for meta data #1224

🙏 Big thanks to all contributors! ❤️

A big thank you to all the contributors for this release: @prikmm @akkefa @MichelBartels @hammer @ramgarg102 @bishalgaire @MarkusSagen @dfhssilva @srevinsaju @demarant @mosheber @MichaelBitard @guillim @vblagoje @stefanondisponibile @cambiumproject @bobvanluijt @tanay1337 @Timoeller @annagruendler @PiffPaffM @oryx1729 @bogdankostic @brandenchan @shahrukhx01 @julian-risch @tholor

We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!