
v0.10.0

@tholor released this 16 Sep 08:31

⭐ Highlights

🚀 Making Pipelines more scalable

You can now easily scale and distribute Haystack Pipelines thanks to the new integration of the Ray framework (https://ray.io/).
Ray distributes a Pipeline's components across a cluster of machines, and the individual components can be scaled independently. For instance, an extractive QA Pipeline deployment can have three replicas of the Reader and a single replica of the Retriever, enabling efficient resource utilization through horizontal scaling. You can use Ray via the new RayPipeline class (#1255).

To set the number of replicas, add replicas to the node's entry in the pipeline YAML config:

components:
    ...

pipelines:
  - name: ray_query_pipeline
    type: RayPipeline
    nodes:
      - name: ESRetriever
        replicas: 2  # number of replicas to create on the Ray cluster
        inputs: [ Query ]

A RayPipeline can currently only be created from a YAML Pipeline config:

from haystack.pipeline import RayPipeline
pipeline = RayPipeline.load_from_yaml(path="my_pipelines.yaml", pipeline_name="my_query_pipeline")
pipeline.run(query="What is the capital of Germany?")

See the docs for more details.

😍 Making Pipelines more user-friendly

The old Pipeline design came with a couple of flaws:

  • Impossible to route certain parameters (e.g. top_k) to dedicated nodes
  • Incorrect parameters in pipeline.run() are silently swallowed
  • Hard to understand what is in **kwargs when working with node.run() methods
  • Hard to debug

We tackled those with a big refactoring of the Pipeline class and changed how data is passed between nodes (#1321).
This now comes with a few breaking changes:

Component params like top_k and no_ans_boost for Pipeline.run() must now be passed in a params dict:

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component-specific top_k parameters like top_k_reader and top_k_retriever are now replaced by a single top_k. To disambiguate, params can be "targeted" to a specific node:

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})

See the breaking changes section and the docs for details.

📈 Better evaluation metric for QA: Semantic Answer Similarity (SAS)

The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In our recent EMNLP paper, we proposed "SAS", a cross-encoder-based metric for the estimation of semantic answer similarity. We compared it to seven existing metrics and found that it correlates better with human judgement. See our paper #1338

You can use it in Haystack like this:

...
# initialize the node with a SAS model
eval_reader = EvalAnswers(sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# define a pipeline 
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalDocuments", inputs=["ESRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalDocuments"])
p.add_node(component=eval_reader, name="EvalAnswers", inputs=["QAReader"])
...
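
After indexing your evaluation documents, you run the pipeline on labelled queries and read the metrics from the eval nodes. The snippet below is only a rough sketch: the labels object and the exact run() arguments are assumptions based on the new params API introduced in this release; Tutorial 5 shows the precise call.

# run the eval pipeline on one labelled query (rough sketch, see Tutorial 5)
results = p.run(
    query="Who is the father of Arya Stark?",
    labels=labels,  # assumed: the labels for this query, e.g. fetched from the document store
    params={"ESRetriever": {"top_k": 10}, "QAReader": {"top_k": 5}},
)

# the eval nodes accumulate metrics over all queries run through the pipeline
eval_retriever.print()
eval_reader.print()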

See our updated Tutorial 5 for a full example.

🤯 New nodes: Doc Classifier, Re-Ranker, QuestionGenerator & more

More nodes, more use cases:

  • FARMClassifier node for Document Classification: tag a document at indexing time or add a class downstream in your inference pipeline #1265
  • SentenceTransformersRanker: Re-Rank your documents after retrieval to maximize the relevance of your results. This implementation uses the popular sentence-transformer models #1209
  • QuestionGenerator: Question Answering systems are trained to find an answer given a question and a document; but with the recent advances in generative NLP, there are now models that can read a document and suggest questions that can be answered by that document. All this power is available to you now via the QuestionGenerator class.
    QuestionGenerator models can be trained using Question Answering datasets. Instead of predicting answers, the QuestionGenerator takes the document as input and is trained to output the questions. This can be useful when you want to add "autosuggest" questions in your search bar or accelerate labeling processes. See docs (#1267). A short usage sketch follows this list.
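
As a taste of how the new nodes fit together, here is a minimal sketch (model names are illustrative, the retriever is assumed to be defined already, and import paths follow the 0.10 module layout; see each node's docs for the exact constructor arguments):

from haystack.pipeline import Pipeline
from haystack.ranker import SentenceTransformersRanker
from haystack.question_generator import QuestionGenerator

# re-rank retrieved documents with a pre-trained cross-encoder
ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["Retriever"])

# suggest questions that a given document can answer
question_generator = QuestionGenerator()
questions = question_generator.generate("Arya Stark is the daughter of Eddard Stark.")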

🔭 Better support for OpenSearch

We now support approximate nearest neighbour (ANN) search in OpenSearch (#1225) and fixed some initialization issues.
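
As a rough sketch of initializing the store for ANN search (the class name, import path, and constructor parameters shown here are assumptions modelled on the Elasticsearch document store; check the Document Store docs for the exact class and arguments for your OpenSearch setup):

from haystack.document_store import OpenSearchDocumentStore

# assumed parameters: a local OpenSearch instance and cosine similarity for ANN search
document_store = OpenSearchDocumentStore(host="localhost", port=9200, index="document", similarity="cosine")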

📑 New Tutorials

  • Tutorial 13: Question Generation (#1364)
  • Tutorial 14: Query Classifier (#1364)

⚠️ Breaking Changes

probability field removed from results #1340

Having two fields, probability and score, in the answers / documents returned from nodes often caused confusion.
From now on there is only one field, called score, which is in the range [0,1]. In QA results, this field is populated with the old probability value, so you can simply switch to it. These fields have changed in both the Python and REST APIs.

Old:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 14.684528350830078,
      "probability": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}

New:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}

Removed Finder #1326

After being deprecated a few months ago, Finder is now gone - R.I.P.

Params in Pipeline.run() #1321

Component params like top_k and no_ans_boost for Pipeline.run() must now be passed in a params dict.

Old:

pipeline.run(query="Why?", top_k_retriever=10, no_ans_boost=0.5)

New:

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component-specific top_k parameters like top_k_reader and top_k_retriever are now replaced by a single top_k. To disambiguate, params can be "targeted" to a specific node.

Old:

pipeline.run(query="Why?", top_k_retriever=10, top_k_reader=5)

New:

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})

Also, custom nodes must not have **kwargs in their run methods anymore and should only return the data (e.g. answers) they produce themselves.
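
For illustration, a minimal custom node under the new design could look like the following sketch (the class and its parameters are made up for the example, and the BaseComponent import path is an assumption; adapt it to your setup):

from haystack.schema import BaseComponent

class QueryLogger(BaseComponent):
    # toy node: logs the incoming query and passes execution on unchanged
    outgoing_edges = 1

    def run(self, query):  # explicit parameters instead of **kwargs
        print(f"Received query: {query}")
        # return only the data this node produces itself, plus the name of the outgoing edge
        return {"query_logged": True}, "output_1"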

🤓 Detailed Changes

Crawler

  • Serialize crawler output to JSON #1284
  • Add Crawler support for indexing pipeline #1360

Converter

  • Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR #1349

Preprocessor

  • Add PreProcessor optional language parameter. #1160
  • Improve preprocessing logging #1263
  • Make PreProcessor.process() work on lists of documents #1163

Pipeline

  • Add Ray integration for Pipelines #1255
  • MostSimilarDocumentsPipeline introduced #1413
  • QoL function: access certain nodes in pipeline #1441
  • Refactor replicas config for Ray Pipelines #1378
  • Add simple docs2answer node to allow FAQ style QA / Doc search in API #1361
  • Allow for batch indexing when using Pipelines fix #1168 #1231

Document Stores

  • Implement OpenSearch ANN #1225
  • Bump Weaviate version to 1.7.0 #1412
  • Catch Elastic's search_phase_execution and raise with descriptive message. #1371
  • Fix behavior of delete_documents() with filters for Milvus #1354
  • delete_all_documents() replaced by delete_documents() #1377
  • Support OpenDistro init #1334
  • Integrate filters with knn queries in OpenDistroElasticsearchDocumentStore #1301
  • feat: add support for elastic search to connect without any authentication #1294
  • Raise warning when labels are overwritten #1257
  • Fix SQLAlchemy relationship warnings #1289
  • Added explicit refresh call during refresh_type is false in update em… #1259
  • Add id in write_labels() for SQLDocumentStore #1253
  • ElasticsearchDocumentStore get_label_count() bug fixed. #1252
  • SQLDocumentStore get_label_count() index bug fixed. #1251

Retriever

  • Adding multi gpu support for DPR inference #1414
  • Ensure num_hard_negatives is 0 when embedding passages #1402
  • global_loss_buffer_size to the DensePassageRetriever, fix exceeds max_size #1245

Summarizer

  • Transformer summarizer truncation bug fixed #1309

Document Classifier

  • Add FARMClassifier node for Document Classification #1265

Re-Ranker

  • Add SentenceTransformersRanker with pre-trained Cross-Encoder #1209

Reader

  • Use Reader's device by default #1208

Generator

  • Add QuestionGenerator #1267

Evaluation

  • Add new QA eval metric: Semantic Answer Similarity (SAS) #1338

REST API

  • Fix handling of filters in Search REST API #1431
  • Add support for Dense Retrievers in REST API Indexing Pipeline #1430
  • Add Header in sample REST API Search Request #1293
  • Fix convert integer CONCURRENT_REQUEST_PER_WORKER #1247
  • Env var CONCURRENT_REQUEST_PER_WORKER #1235
  • Small UI and REST API fixes #1223
  • Add scaffold for defining custom components for Pipelines #1205

Docker

  • Update DocumentStore env in docker-compose #1450
  • Enable docker-compose for GPUs & Add public UI image #1406
  • Fix tesseract installation in Dockerfile #1405

User Interface

  • Allow multiple files to upload for Haystack UI #1323
  • Add faq annotation #1333
  • Upgrade streamlit #1279

Documentation and Tutorials

  • new docs version for 0.9.0 #1217
  • Added functionality for Google Colab usecase in Crawler Module #1436
  • Update sentence transformer model in FAQ tutorial #1401
  • crawler api docs updated. #1388
  • Add support for no Docker envs in Tutorial 13 #1365
  • Rag tutorial fixes #1375
  • Editing docs read.me for new docs website workflow #1372
  • Add query classifier usage docs #1348
  • Adding tutorial 13 and 14 #1364
  • Remove Finder from tutorials #1329
  • Tutorial1 remove finder class from import #1328
  • Update docstring for RAG #1149
  • Update README.md for tutorial 13 Question Generation #1325
  • add query classifier colab and jupyter notebook #1324
  • Remove pipeline eval example script #1297
  • Change variable names in tutorials #1286
  • Add links to tutorial 12 to readme #1274
  • Encapsulate tutorial code in method #1266
  • Fix Links #1199

Misc

  • Improve document stores unit test parametrization #1202
  • Version tag added to Haystack #1216
  • Add type ignore to resolve mypy errors #1427
  • Bump pillow from 8.2.0 to 8.3.2 #1423
  • Add sentence-transformers as mandatory dependency and remove from dev… #1387
  • Adjust WeaviateDocumentStore import #1379
  • Update test documentation in readme #1355
  • Add tests for Crawler #1339
  • Suppress FAISS logs & apex warnings #1315
  • Pin Weaviate version #1306
  • Relax typing for meta data #1224

🙏 Big thanks to all contributors! ❤️

A big thank you to all the contributors for this release: @prikmm @akkefa @MichelBartels @hammer @ramgarg102 @bishalgaire @MarkusSagen @dfhssilva @srevinsaju @demarant @mosheber @MichaelBitard @guillim @vblagoje @stefanondisponibile @cambiumproject @bobvanluijt @tanay1337 @Timoeller @annagruendler @PiffPaffM @oryx1729 @bogdankostic @brandenchan @shahrukhx01 @julian-risch @tholor

We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!