
v0.9.0

@julian-risch julian-risch released this 21 Jun 16:50
9e4d7bf

⭐ Highlights

Long-Form Question Answering (LFQA)

Haystack now provides LFQA with a Seq2SeqGenerator for generative QA and a Retribert Retriever thanks to community member @vblagoje. #1086
If you would like to ask questions where the answer is not a short phrase explicitly given in one of the documents but a more elaborate answer, then LFQA is interesting for you. These elaborate answers are generated by combining information from multiple relevant documents.

Document Re-Ranking

For pure "semantic document search" use cases that do not need question answering functionality but only document ranking, there is now a new type of node: Ranker. While the Retriever is a perfect fit for document retrieval, we can further improve its results with the Ranker. #1025
To this end, the Ranker uses a pre-trained model to calculate the semantic similarity of the question and each of the top-k retrieved documents. Documents with a high semantic similarity are ranked higher. The combination is especially powerful if you pair a sparse retriever, e.g., the BM25-based ElasticsearchRetriever, with a dense Ranker.
A pipeline with a Retriever and a Ranker can be set up in just a few lines of code:

...
retriever = ElasticsearchRetriever(document_store=document_store)
ranker = FARMRanker(model_name_or_path="deepset/gbert-base-germandpr-reranking")

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])
...
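
To make the two-stage idea concrete, here is a minimal, library-free sketch of retrieve-then-rerank. It does not use Haystack: `sparse_retrieve`, `toy_similarity`, and `rerank` are hypothetical helpers, keyword overlap stands in for BM25, and a bag-of-words cosine stands in for the pre-trained model that a real Ranker would use.

```python
import math
from collections import Counter


def keyword_overlap(query, doc):
    """Number of distinct query terms that also occur in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def sparse_retrieve(query, docs, top_k):
    """Stage 1 (stand-in for BM25): keep the top_k docs by keyword overlap."""
    scored = [(keyword_overlap(query, d), d) for d in docs]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [d for _, d in hits[:top_k]]


def toy_similarity(query, doc):
    """Stand-in for a learned model: cosine similarity of word counts."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def rerank(query, candidates, similarity=toy_similarity):
    """Stage 2: reorder the retrieved candidates by the similarity score."""
    return sorted(candidates, key=lambda d: similarity(query, d), reverse=True)


DOCS = [
    "capital of France and capital of Germany and many other facts about Europe",
    "Paris is the capital of France",
    "Bananas are yellow",
]
QUERY = "capital of France"

candidates = sparse_retrieve(QUERY, DOCS, top_k=2)
reranked = rerank(QUERY, candidates)
```

In this toy data both matching documents tie in the first stage, and the second stage moves the shorter, more focused document to the top. The first stage is cheap and runs over all documents, while the costlier scoring only runs on the top-k candidates.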

Weaviate

Thanks to a contribution by our community member @venuraja79, Weaviate is now integrated into Haystack as another DocumentStore. #1064
It allows a combination of vector search and scalar filtering, i.e., you can filter for a certain tag and do dense retrieval on that subset. After starting a Weaviate server with Docker, it's as simple as:

from haystack.document_store import WeaviateDocumentStore
document_store = WeaviateDocumentStore()

Haystack uses Weaviate 1.4.0, the most recent version at the time of this release, and updating embeddings has also been optimized. #1181
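
As a rough illustration of what "scalar filtering plus vector search" means, the following library-free sketch filters documents by metadata first and then ranks the remaining subset by cosine similarity. It is not Weaviate's or Haystack's implementation; the document layout (`meta`, `embedding`) and the function `filtered_vector_search` are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_vector_search(query_vec, docs, filters, top_k=1):
    """Keep docs whose metadata matches every filter, then rank by cosine."""
    subset = [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in filters.items())
    ]
    subset.sort(key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return subset[:top_k]


# Hypothetical documents with a scalar "tag" field and a 2-d embedding.
DOCS = [
    {"id": "a", "meta": {"tag": "news"}, "embedding": [1.0, 0.0]},
    {"id": "b", "meta": {"tag": "blog"}, "embedding": [0.9, 0.1]},
    {"id": "c", "meta": {"tag": "blog"}, "embedding": [0.0, 1.0]},
]

best_overall = filtered_vector_search([1.0, 0.0], DOCS, filters={})
best_blog = filtered_vector_search([1.0, 0.0], DOCS, filters={"tag": "blog"})
```

Without a filter the closest vector wins regardless of its tag. With the `tag` filter, only the matching subset is ranked.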

Query Classifier

Some search applications need to distinguish between incoming keyword queries and longer textual questions. With the new QueryClassifier node, contributed by @shahrukhx01, you can route only the longer questions to the Reader branch, which maximizes the accuracy of results and minimizes computation effort and cost, and route keyword queries to a Document Retriever. #1099
You could use it as shown in this exemplary pipeline:
[Image: example pipeline with a QueryClassifier node]
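
The routing decision itself can be sketched without any library. The heuristic below is only an illustration and is not how the QueryClassifier baseline models work: `classify_query`, `route`, the word list, and the branch names are all hypothetical.

```python
# Hypothetical list of words that typically start a natural-language question.
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "is", "are", "does", "do", "can",
}


def classify_query(query):
    """Label a query as 'question' or 'keyword' with a simple heuristic."""
    text = query.strip()
    tokens = text.lower().split()
    if not tokens:
        return "keyword"
    if text.endswith("?") or tokens[0] in QUESTION_WORDS:
        return "question"
    return "keyword"


def route(query):
    """Send questions to the costlier Reader branch, keywords to retrieval only."""
    if classify_query(query) == "question":
        return "retriever+reader"
    return "retriever-only"


routes = {q: route(q) for q in ["arya stark father", "Who is the father of Arya Stark?"]}
```

A trained classifier would replace `classify_query`. The surrounding routing logic stays the same: one decision node with two outgoing branches.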

New Tutorials

  1. Tutorial 11: Pipelines #991
  2. Tutorial 12: Generative QA with LFQA #1086

⚠️ Breaking Changes

  • Remove Python 3.6 support #1059
  • Refactor REST APIs to use Pipelines #922
  • Bump to FARM 0.8.0, torch 1.8.1 and transformers 4.6.1 #1192

🤓 Detailed Changes

Connector

  • Add crawler to get texts from websites #775

Preprocessor

  • Add white space normalization warning #1022
  • Preserve whitespace during PreProcessor.split() #1121
  • Fix equality check in preprocessor #969

Pipeline

  • Add validation for root node in Pipeline #987
  • Fix passing a list as parameter value in Pipeline YAML #952
  • Add export of Pipeline YAML config #1003
  • Add config to JoinDocuments node to allow yaml export in pipelines #1134

Document Stores

  • Integrate Weaviate as another DocumentStore #957 #1064
  • Add OpenDistro init #1101
  • Rename the delete_all_documents() method of all document stores to delete_documents() #1047
  • Fix Elasticsearch connection for non-admin users #1028
  • Fix update_embeddings() for FAISSDocumentStore #978
  • Feature: Enable AWS Elasticsearch IAM connection #965
  • Fix optional FAISS import #971
  • Make FAISS import conditional #970
  • Benchmark milvus #850
  • Improve Milvus HNSW Performance #1127
  • Update Milvus benchmarks #1128
  • Upgrade milvus to 1.1.0 #1066
  • Update tests for FAISSDocumentStore #999
  • Add L2 support for FAISS HNSW #1138
  • Improve the speed of FAISSDocumentStore.delete_documents() #1095
  • Add options for handling duplicate documents (skip, fail, overwrite) #1088
  • Update Embeddings - Use update instead of replace #1181
  • Improve the progress bar in update_embeddings() + Fix filters in update_embeddings() #1063
  • Using text hash as id to prevent document duplication #1000
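
Two of the items above, hashing the text to derive a document id and choosing how to handle duplicates (skip, fail, overwrite), fit together naturally. The sketch below shows one way this could work. It is not Haystack's implementation: `TinyDocumentStore`, `DuplicateDocumentError`, the `duplicate_documents` parameter name, and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib


class DuplicateDocumentError(ValueError):
    """Raised when a duplicate is written with duplicate_documents='fail'."""


class TinyDocumentStore:
    """In-memory store that uses a hash of the text as the document id."""

    def __init__(self):
        self.docs = {}

    @staticmethod
    def text_id(text):
        # Identical text always yields the same id, so duplicates collide.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def write(self, text, meta=None, duplicate_documents="overwrite"):
        if duplicate_documents not in ("skip", "fail", "overwrite"):
            raise ValueError(f"Unknown option: {duplicate_documents!r}")
        doc_id = self.text_id(text)
        if doc_id in self.docs:
            if duplicate_documents == "skip":
                return doc_id  # keep the stored version untouched
            if duplicate_documents == "fail":
                raise DuplicateDocumentError(f"Document {doc_id} already exists")
        # New document, or an existing one in 'overwrite' mode.
        self.docs[doc_id] = {"text": text, "meta": meta or {}}
        return doc_id
```

Because the id is derived from the text, writing the same text twice never creates a second entry. The option only decides whether the stored metadata is kept, replaced, or the write is rejected.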

Retriever

  • DPR Training parameter #989
  • Removed single_model_path; added infer_tokenizer to dpr load() #1060
  • Integrate sentence transformers into benchmarks #843
  • added use_amp to the train method, in order to use mixed precision training #1048

Ranker

  • Re-ranking component for document search without QA #1025
  • Remove quickfix from reader and ranker #1196
  • Distinguish labels for calculating similarity scores #1124

Query Classifier

  • Fix typo in Query Classifier Exception Message #1190
  • Add QueryClassifier incl. baseline models #1099

Reader

  • Filtering duplicate answers #1021
  • Add ONNXRuntime support #157
  • Remove unused function _get_pseudo_prob #1201

Generator

  • Integrate LFQA with Haystack - inferencing #1086

Evaluation Nodes

  • Reduce precision in pipeline eval print functions #943
  • Fix division by zero error in EvalRetriever #938
  • Add evaluation nodes for Pipelines #904
  • Add More top_k handling to EvalDocuments #1133
  • Prevent merge of same questions on different documents during evaluation #1119

REST API

  • adding root_path option #982
  • Add PDF converter dependencies Docker #1107
  • Disable Gunicorn preload option #960

User Interface

  • change file-upload response to sidebar #1018
  • Add File Upload Functionality in UI #995
  • Streamlit UI Evaluation mode #920
  • Fix evaluation mode in UI #1024
  • Fix typo in streamlit UI #1106

Documentation and Tutorials

  • Add about sections to Tutorial 12 #1195
  • Tutorial update #1166
  • Documentation update #1162
  • Add FAQ page #1151
  • Refresh API docs #1152
  • Add docu of confidence scores and calibration method #1131
  • Adding indentation to markup files #947
  • Update preprocessing.md #1087
  • Add badges to readme #1136
  • Regen api docs #1015
  • Docs: Add usage information details for AWS Elasticsearch Service #1008
  • Add tutorial pages #1013
  • Pipelines tutorial #991
  • knowledge graph documentation #979
  • knowledge graph example #934
  • Add Milvus to the retriever / document store table #931
  • New docs version #964
  • Update Documentation #976
  • update api markdown files and add markdown file for ranker #1198
  • Reformat FAQ page #1177
  • Minor change with a link to the Weaviate docs #1180
  • Add links to GitHub Discussion and SO #984
  • Update milvus links and docstrings #959
  • Fixed link to dpr #962
  • Removed comma from last item in json list #1114
  • Fixing inconsistency #926

Misc

  • Squad tools #1029
  • Bugfix setting of device by defaulting to "cpu" #1182
  • Fixing issues caused due to mypy upgrade #1165
  • Remove Duplicate Benchmark Run #1132
  • Fixing grpcio-tools to version of colab's pre-installed grpcio #1113
  • Update farm version #936

🙏 Big thanks to all contributors! ❤️

A big thank you to all the contributors for this release: @PiffPaffM @oryx1729 @jacksbox @guillim @Timoeller @aantti @tholor @brandenchan @julian-risch @bhadreshpsavani @akkefa @mosheber @lalitpagaria @Avi777 @MichaelBitard @AlviseSembenico @shahrukhx01 @venuraja79 @bobvanluijt @vblagoje @cvgoudar

We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!