
Releases: deepset-ai/haystack

v1.5.0

02 Jun 15:37
4ca331c

⭐ Highlights

Generative Pseudo Labeling

Dense retrievers excel when finetuned on a labeled dataset of the target domain. However, such datasets rarely exist and are costly to create from scratch with human annotators. Generative Pseudo Labeling solves this dilemma by creating labels automatically for you, which makes it a super fast and low-cost alternative to manual annotation. Technically speaking, it is an unsupervised approach for domain adaptation of dense retrieval models. Given a corpus of unlabeled documents from that domain, it automatically generates queries on that corpus and then uses a cross-encoder model to create pseudo labels for these queries. The pseudo labels can be used to adapt retriever models to that domain. Here is a code example that shows how to do that in Haystack:

from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.question_generator.question_generator import QuestionGenerator
from haystack.nodes.label_generator.pseudo_label_generator import PseudoLabelGenerator

# Initialize any document store and fill it with documents from your domain - no labels needed.
document_store = InMemoryDocumentStore()
document_store.write_documents(...) 

# Calculate and store a dense embedding for each document
retriever = EmbeddingRetriever(document_store=document_store, 
                               embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b", 
                               max_seq_len=200)
document_store.update_embeddings(retriever)

# Use the new PseudoLabelGenerator to automatically generate labels and train the retriever on them
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1", max_length=64, split_length=200, batch_size=12)
psg = PseudoLabelGenerator(qg, retriever)
output, _ = psg.run(documents=document_store.get_all_documents()) 
retriever.train(output["gpl_labels"])

#2388

Batch Processing with Query Pipelines

Every query pipeline now has a run_batch() method, which allows you to pass multiple queries to the pipeline at once.
Together with a list of queries, you can either provide a single list of documents or a list of lists of documents. In the first case, answers are returned for each query-document pair. In the second case, each query is applied to its corresponding list of documents based on its index in the list. A third option is to have a list containing a single query, which is then applied to each list of documents separately.
Here is an example with a pipeline:

from haystack.pipelines import ExtractiveQAPipeline
...
pipe = ExtractiveQAPipeline(reader, retriever)
predictions = pipe.pipeline.run_batch(
        queries=["Who is the father of Arya Stark?","Who is the mother of Arya Stark?"], params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
    )

And here is an example with a single reader node:

from haystack.nodes import FARMReader
from haystack.schema import Document

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
result = reader.predict_batch(
    queries=["1st sample query", "2nd sample query"],
    documents=[[Document(content="sample doc1"), Document(content="sample doc2")],
               [Document(content="sample doc3"), Document(content="sample doc4")]],
)

# result has the form:
# {"queries": ["1st sample query", "2nd sample query"],
#  "answers": [[<answers from doc1 and doc2>], [<answers from doc3 and doc4>]], ...}

#2481 #2575

Pipeline Evaluation with Advanced Label Scopes

Typically, a predicted answer is considered correct if it matches the gold answer in the set of evaluation labels. Similarly, a retrieved document is considered correct if its ID matches the gold document ID in the labels. Sometimes however, these simple definitions of "correctness" are not sufficient and you want to further specify the "scope" within which an answer or a document is considered correct.
For this reason, EvaluationResult.calculate_metrics() accepts the parameters answer_scope and document_scope.

As an example, you might consider an answer to be correct only if it stems from a specific context of surrounding words. You can specify answer_scope="context" in calculate_metrics() in that case. See the updated docstrings with a description of the different label scopes or the updated tutorial on evaluation.

...
document_store.add_eval_data(
        filename="data/tutorial5/nq_dev_subset_v2.json",
        preprocessor=preprocessor,
    )
...
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
metrics = eval_result.calculate_metrics(answer_scope="context")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
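
Document metrics can be restricted in the same way via the document_scope parameter; a small sketch, assuming the "document_id" scope value described in the docstrings:

# Count a retrieved document as correct only if its ID matches the gold document ID
# ("document_id" is assumed here to be one of the supported document_scope values)
metrics = eval_result.calculate_metrics(document_scope="document_id")
print(f'Retriever - Recall (single hit): {metrics["Retriever"]["recall_single_hit"]}')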

#2482

Support of DeBERTa Models

Haystack now supports DeBERTa models! These kinds of models come with some smart architectural improvements over BERT and RoBERTa, such as encoding the relative and absolute position of a token in the input sequence. Only the following three lines are needed to train a DeBERTa reader model on the SQuAD 2.0 dataset. Compared to a RoBERTa model trained on that dataset, you can expect a boost in F1-score from ~84% to ~88% ("microsoft/deberta-v3-large" even gets you to an F1-score as high as ~92%).

from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="microsoft/deberta-v3-base")
reader.train(data_dir="data/squad20", train_filename="train-v2.0.json", dev_filename="dev-v2.0.json", save_dir="my_model")

#2097

⚠️ Breaking Changes

Other Changes

Pipeline

DocumentStores

  • Make DeepsetCloudDocumentStore work with non-existing index by @bogdankostic in #2513
  • [Weaviate] Exit the while loop when we query less documents than available by @masci in #2537
  • Fix knn params for aws managed opensearch by @tstadel in #2581
  • Fix number of returned values in get_metadata_values_by_key by @bogdankostic in #2614

Retriever

Documentation

Other Changes

Read more

v1.4.0

05 May 10:48
081b886

⭐ Highlights

Logging Evaluation Results to MLflow

Logging and comparing the evaluation results of multiple different pipeline configurations is much easier now thanks to the newly implemented MLflowTrackingHead. With our public MLflow instance, you can log evaluation metrics and metadata about the pipeline, the evaluation set, and the corpus. Here is an example log file. If you have your own MLflow instance, you can even store the pipeline YAML file and the evaluation set as artifacts. In Haystack, all you need is the execute_eval_run() method:

eval_result = Pipeline.execute_eval_run(
    index_pipeline=index_pipeline,
    query_pipeline=query_pipeline,
    evaluation_set_labels=labels,
    corpus_file_paths=file_paths,
    corpus_file_metas=file_metas,
    experiment_tracking_tool="mlflow",
    experiment_tracking_uri="http://localhost:5000",
    experiment_name="my-query-pipeline-experiment",
    experiment_run_name="run_1",
    pipeline_meta={"name": "my-pipeline-1"},
    evaluation_set_meta={"name": "my-evalset"},
    corpus_meta={"name": "my-corpus"}.
    add_isolated_node_eval=True,
    reuse_index=False
)

#2337

Filtering Answers by Confidence in FARMReader

The FARMReader got a parameter confidence_threshold to filter out predictions below this threshold.
The threshold is disabled by default but can be set between 0 and 1 when initializing the FARMReader:

from haystack.nodes import FARMReader
model = "deepset/roberta-base-squad2"
reader = FARMReader(model, confidence_threshold=0.5)

#2376

Deprecating Milvus1DocumentStore & Renaming ElasticsearchRetriever

The Milvus1DocumentStore is deprecated in favor of the newer Milvus2DocumentStore. Besides big architectural changes that impact performance and reliability, Milvus version 2.0 supports filtering by scalar data types.
For Haystack users this means you can now run a query using vector similarity and filter on metadata at the same time! See the Milvus documentation for more details if you need to migrate from Milvus1DocumentStore to Milvus2DocumentStore. #2495

The ElasticsearchRetriever node works not only with the ElasticsearchDocumentStore but also with the OpenSearchDocumentStore, so it is only logical to rename it. It is now called
BM25Retriever after the underlying BM25 ranking function. For the same reason, ElasticsearchFilterOnlyRetriever is now called FilterRetriever. The deprecated names and the new names both work, but we will drop support for the deprecated names in a future release. An overview of the different DocumentStores in Haystack can be found here. #2423 #2461
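
Switching to the new name is a one-line change; for example (assuming an already initialized document store):

from haystack.nodes import BM25Retriever

# Works with ElasticsearchDocumentStore as well as OpenSearchDocumentStore
retriever = BM25Retriever(document_store=document_store)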

Fixing Evaluation Discrepancies

The evaluation of pipeline nodes with pipeline.eval(add_isolated_node_eval=True) and alternatively with retriever.eval() and reader.eval() gave slightly different results due to a bug in handling no_answers. This bug is fixed now and all different ways to run the evaluation give the same results. #2381

⚠️ Breaking Changes

  • Change return types of indexing pipeline nodes by @bogdankostic in #2342
  • Upgrade weaviate-client to 3.3.3 and fix get_all_documents by @ZanSara in #1895
  • Align TransformersReader defaults with FARMReader by @julian-risch in #2490
  • Change default encoding for PDFToTextConverter from Latin 1 to UTF-8 by @ZanSara in #2420
  • Validate YAML files without loading the nodes by @ZanSara in #2438

Other Changes

Pipeline

  • Add tests for missing __init__ and super().__init__() in custom nodes by @ZanSara in #2350
  • Forbid usage of *args and **kwargs in any node's __init__ by @ZanSara in #2362
  • Change YAML version exception into a warning by @ZanSara in #2385
  • Make sure that debug=True and params={'debug': True} behaves the same way by @ZanSara in #2442
  • Add support for positional args in pipeline.get_config() by @tstadel in #2478
  • enforce same index values before and after saving/loading eval dataframes by @tstadel in #2398

DocumentStores

  • Fix sparse retrieval with filters returns results without any text-match by @tstadel in #2359
  • EvaluationSetClient for deepset cloud to fetch evaluation sets and la… by @FHardow in #2345
  • Update launch script for Milvus from 1.x to 2.x by @ZanSara in #2378
  • Use ElasticsearchDocumentStore.get_all_documents in ElasticsearchFilterOnlyRetriever.retrieve by @adri1wald in #2151
  • Fix and use delete_index instead of delete_documents in tests by @tstadel in #2453
  • Update docs of DeepsetCloudDocumentStore by @tholor in #2460
  • Add support for aliases in elasticsearch document store by @ZeJ0hn in #2448
  • fix dot_product metric by @jamescalam in #2494
  • Deprecate Milvus1DocumentStore by @bogdankostic in #2495
  • Fix OpenSearchDocumentStore's __init__ by @ZanSara in #2498

Retriever

  • Rename dataset to evaluation_set when logging to mlflow by @tstadel in #2457
  • Linearize tables in EmbeddingRetriever by @MichelBartels in #2462
  • Print warning in EmbeddingRetriever if sentence-transformers model used with different model format by @mpangrazzi in #2377
  • Add flag to disable scaling scores to probabilities by @tstadel in #2454
  • changing the name of the retrievers from es_retriever to retriever by @TuanaCelik in #2487
  • Replace dpr with embeddingretriever tut14 by @mkkuemmel in #2336
  • Support conjunctive queries in sparse retrieval by @tstadel in #2361
  • Fix: Auth token not passed for EmbeddingRetriever by @mathislucka in #2404
  • Pass use_auth_token to sentence transformers EmbeddingRetriever by @MichelBartels in #2284

Reader

  • Fix TableReader for tables without rows by @bogdankostic in #2369
  • Match answer sorting in QuestionAnsweringHead with FARMReader by @tstadel in #2414
  • Fix reader.eval() and reader.eval_on_file() output by @tstadel in #2476
  • Raise error if torch-scatter is not installed or wrong version is installed by @MichelBartels in #2486

Documentation

Other Changes

New Contributors

Read more

v1.3.0

23 Mar 16:46
bf71f03

⭐ Highlights

Pipeline YAML Syntax Validation

The syntax of pipeline configurations as defined in YAML files can now be validated. If the validation fails, erroneous components/parameters are identified to make it simple to fix them. Here is a code snippet to manually validate a file:

from pathlib import Path
from haystack.pipelines.config import validate_yaml
validate_yaml(Path("rest_api/pipeline/pipelines.haystack-pipeline.yml"))

Your IDE can also take care of the validation when you edit a pipeline YAML file. The suffix *.haystack-pipeline.yml tells your IDE that this YAML contains a Haystack pipeline configuration and enables some checks and autocompletion features if the IDE is configured that way (YAML plugin for VSCode, Configuration Guide for PyCharm). The schema used for validation can be found in SchemaStore pointing to the schema files for the different Haystack versions. Note that an update of the Haystack version might sometimes require small changes to the pipeline YAML files. You can set version: 'unstable' in the pipeline YAML to circumvent the validation or set it to the latest Haystack version if the components and parameters that you use are compatible with the latest version. #2226

Pinecone DocumentStore

We added another DocumentStore to Haystack: PineconeDocumentStore! 🎉 Pinecone is a fully managed service for very large scale dense retrieval. To this end, embeddings and metadata are stored in a hosted Pinecone vector database while the document content is stored in a local SQL database. This separation simplifies infrastructure setup and maintenance. In order to use this new document store, all you need is an API key, which you can obtain by creating an account on the Pinecone website. #2254

import os
from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key=os.environ["PINECONE_API_KEY"])

BEIR Integration

Fresh from the 🍻 cellar, Haystack now has an integration with our favorite BEnchmarking Information Retrieval tool BEIR. It contains preprocessed datasets for zero-shot evaluation of retrieval models in 17 different languages, which you can use to benchmark your pipelines. For example, a DocumentSearchPipeline can now be evaluated by calling Pipeline.eval_beir() after having installed Haystack with the BEIR dependency via pip install farm-haystack[beir]. Cheers! #2333

from haystack.pipelines import DocumentSearchPipeline, Pipeline
from haystack.nodes import TextConverter, ElasticsearchRetriever
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

text_converter = TextConverter()
document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)

index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])

query_pipeline = DocumentSearchPipeline(retriever=retriever)

ndcg, _map, recall, precision = Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
)

⚠️ Breaking Changes

  • Make Milvus2DocumentStore compatible with pymilvus>=2.0.0 by @MichelBartels in #2126
  • Set provider parameter when instantiating onnxruntime.InferenceSession and make device a torch.device in internal methods by @cjb06776 in #1976

Pipeline

Models

  • Update LFQA with the latest LFQA seq2seq and retriever models by @vblagoje in #2210

DocumentStores

Documentation

Tutorials

Other Changes

New Contributors

❤️ Big thanks to all contributors and the whole community!

v1.2.0

23 Feb 16:02
d21b6a5

⭐ Highlights

Brownfield Support of Existing Elasticsearch Indices

You have an existing Elasticsearch index from other projects and now want to try out Haystack? The newly added method es_index_to_document_store provides brownfield support of existing Elasticsearch indices by converting each of the records in the provided index to Haystack Document objects and writing them to the specified DocumentStore.

from haystack.document_stores import InMemoryDocumentStore
from haystack.document_stores.utils import es_index_to_document_store

document_store = es_index_to_document_store(
    document_store=InMemoryDocumentStore(), #or any other Haystack DocumentStore
    original_index_name="existing_index",
    original_content_field="content",
    original_name_field="name",
    included_metadata_fields=["date_field"],
    index="new_index",
)

It can even be used on a regular basis in order to add new records of the Elasticsearch index to the DocumentStore! #2229

Tapas Reader With Scores

The new model class TapasForScoredQA introduced in #1997 supports Tapas Reader models that return confidence scores. When you load a Tapas Reader model, Haystack automatically infers whether the model supports confidence scores and chooses the correct model class under the hood. The returned answers are sorted first by a general table score and then by answer span scores. To try it out, just use one of the new TableReader models:

reader = TableReader(model_name_or_path="deepset/tapas-large-nq-reader", max_seq_len=512) #or
reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader", max_seq_len=512)

Extended Meta Data Filtering

We extended the filter capabilities of all (*) document stores to support more complex filter expressions than previously. Besides simple selections on multiple fields, you can now use more complex comparison expressions and connect these using boolean operators. If you have used MongoDB before, the new syntax should look familiar. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name.

Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value.

If no logical operator is provided, "$and" is used as default operation.
If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

Therefore, we don't have any breaking changes and you can keep on using your existing filter expressions.

Example:

filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}

(*) FAISSDocumentStore and MilvusDocumentStore currently do not support filters during search.
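
The resulting filter dictionary can then be passed to any method that accepts a filters argument; a minimal sketch:

# Fetch only the documents matching the filter expression
matching_docs = document_store.get_all_documents(filters=filters)

# Or restrict retrieval to that subset at query time
results = retriever.retrieve(query="Which articles cover the economy?", filters=filters)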

Code Style and Linting

In addition to mypy, which we already had for static type checking, we now use pylint for linting, and the Haystack code base now complies with Black formatting standards. As a result, the code is formatted in a consistent way and easier to read. If you would like to contribute to Haystack, you don't need to worry about that, though: our CI will automatically format your code changes correctly. Our contributor guidelines give more details in case you would like to run the checks locally. #2115 #2130

Installation with fewer dependencies

Installing Haystack has become easier and faster thanks to optional dependencies. From now on, there is no need to install all dependencies if you don't need them. For example, pip3 install farm-haystack will install the latest release together with only a small subset of packages required for basic Pipelines with an ElasticsearchDocumentStore. As another example, if you are experimenting with FAISSDocumentStore in a Colab notebook, you can install Haystack from the master branch together with the FAISS dependency by running: !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]. The installation guide reflects these updates and the full list of subsets of dependencies can be found here. Keep in mind, though, that this system works best with pip versions above 22. #1994

⚠️ Known Issues

Installing Haystack with all dependencies results in heavy pip backtracking that might never finish.
This is due to a dependency conflict that was introduced by a new release of one of our sub-dependencies.
To circumvent this problem, install Haystack like this:

pip install farm-haystack[all] "azure-core<1.23"

This might also be needed for other non-default dependencies (e.g. farm-haystack[dev] "azure-core<1.23").
See #2280 for more information.

⚠️ Breaking Changes

  • Improve dependency management by @ZanSara in #1994
  • Make ui and rest proper packages by @ZanSara in #2098
  • Add aiorwlock to 'ray' extra & fix maximum version for some dependencies by @ZanSara in #2140

🤓 Detailed Changes

Pipeline

Models

DocumentStores

Read more

v1.1.0

20 Jan 16:24
c6f23dc

⭐ Highlights

Model Distillation for Reader Models

With the new model distillation features, you don't need to choose between accuracy and speed! Now you can compress a large reader model (teacher) into a smaller model (student) while retaining most of the teacher's performance. For example, deepset/tinybert-6l-768d-squad2 is twice as fast as bert-base with an F1 reduction of only 2%.

To distil your own model, just follow these steps:

  1. Call python augment_squad.py --squad_path <your dataset> --output_path <output> --multiplication_factor 20 where augment_squad.py is our data augmentation script.
  2. Run student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json") where student is a small model and teacher is a highly accurate, larger reader model.
  3. Run student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json") with the same teacher and student.

For more information on what kinds of students and teachers you can use and on model distillation in general, just take a look at this guide.
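
A minimal sketch of steps 2 and 3, assuming FARMReader is used for both student and teacher (the model names are illustrative, not prescribed by these notes):

from haystack.nodes import FARMReader

# Small student model to be distilled and a larger, more accurate teacher (illustrative model names)
student = FARMReader(model_name_or_path="huawei-noah/TinyBERT_General_6L_768D")
teacher = FARMReader(model_name_or_path="deepset/roberta-large-squad2")

# Step 2: distil the intermediate layers on the augmented dataset
student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json")

# Step 3: distil the prediction layer on the original (non-augmented) dataset
student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json")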

Integrated vs. Isolated Pipeline Evaluation Modes

When you evaluate a pipeline, you can now use two different evaluation modes and create an automatic report that shows the results of both. The integrated evaluation (default) shows what result quality users will experience when running the pipeline. The isolated evaluation mode additionally shows what the maximum result quality of a node could be if it received the perfect input from the preceding node. Thereby, you can find out whether the retriever or the reader in an ExtractiveQAPipeline is the bottleneck.

eval_result = pipeline.eval(labels=eval_labels, add_isolated_node_eval=True)
pipeline.print_eval_report(eval_result)
================== Evaluation Report ==================
=======================================================
                      Query
                        |
                      Retriever
                        |
                        | recall_single_hit:   ...
                        |
                      Reader
                        |
                        | f1 upper  bound:   0.78
                        | f1:   0.65
                        |
                      Output

As the upper-bound F1-score of the reader differs a lot from its actual F1-score in this report, you would need to improve the predictions of the retriever node to achieve the full performance of this pipeline. Our updated evaluation tutorial lists all the steps to generate an evaluation report with all the metrics you need and the upper bounds of each individual node. The guide explains the two evaluation modes in detail.

Row-Column-Intersection Model for TableQA

Now you can use a Row-Column-Intersection model on your own tabular data. To try it out, just replace the declaration of your TableReader:

from haystack.nodes import RCIReader

reader = RCIReader(row_model_name_or_path="michaelrglass/albert-base-rci-wikisql-row",
                   column_model_name_or_path="michaelrglass/albert-base-rci-wikisql-col")

The RCIReader requires two separate models: one for rows and one for columns. Working on each column and row separately allows it to be used on much larger tables. It is also able to return meaningful confidence scores, unlike the TableReader.
Please note, however, that it currently does not support aggregations over multiple cells and that it is a bit slower than other approaches.

Advanced File Converters

Given a file (PDF or DOCX), there are now two file converters in Haystack to extract text and tables:
the ParsrConverter, based on the open-source Parsr tool by axa-group and introduced into Haystack in this release, and the AzureConverter, which we improved on. Both of them return a list of dictionaries containing one dictionary for each table detected in the file and one dictionary containing the text of the file. This format matches the document format and can be used right away for TableQA (see the guide).

from haystack.nodes import ParsrConverter

converter = ParsrConverter()
docs = converter.convert(file_path="samples/pdf/sample_pdf_1.pdf")

⚠️ Breaking Changes

  • Custom id hashing on documentstore level by @ArzelaAscoIi in #1910
  • Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL by @ZanSara in #1990

🤓 Detailed Changes

Pipeline

Models

DocumentStores

REST API

  • Rely api healthcheck on status code rather than json decoding by @fabiolab in #1871
  • Bump version in REST api by @tholor in #1875

UI / Demo

Documentation

Other Changes

Read more

1.0.0

08 Dec 08:05
8cb513c

🎁 Haystack 1.0

We worked hard to bring you an early Christmas present: 1.0 is out! In the last months, we re-designed many essential parts of Haystack, introduced new features, and simplified many user-facing methods. We believe Haystack is now much easier to use and a solid base for many exciting upcoming features that we plan. This release is a major milestone on our journey with you, the community, and we want to thank you again for all the great contributions, discussions, questions, and bug reports that helped us to build a better Haystack. This journey has just started 🚀

⭐ Highlights

Improved Evaluation of Pipelines

Evaluation helps you find out how well your system is doing on your data. This includes Pipeline level evaluation to ensure that the system's output is really what you're after, but also Node level evaluation so that you can figure out whether it's your Reader or Retriever that is holding back the performance.

In this release, evaluation is much simpler and cleaner to perform. All the functionality is now baked into the Pipeline class and you can kick off the process by providing Label or MultiLabel objects to the Pipeline.eval() method.

eval_result = pipeline.eval(
    labels=labels,
    params={"Retriever": {"top_k": 5}},
)

The output is an EvaluationResult object which stores each Node's prediction for each sample in a Pandas DataFrame, so you can easily inspect granular predictions and potential mistakes without re-running the whole thing. There is an EvaluationResult.calculate_metrics() method which returns the relevant metrics for your evaluation, and you can print a convenient summary report via the new Pipeline.print_eval_report() method.

metrics = eval_result.calculate_metrics()

pipeline.print_eval_report(eval_result)

If you'd like to start evaluating your own systems on your own data, check out our Evaluation Tutorial!

Table QA

A lot of valuable information is stored in tables - we've heard this again and again from the community. While they are an efficient structured data format, it hasn't been possible to search for table contents using traditional NLP techniques. But now, with the new TableTextRetriever and TableReader our users have all the tools they need to query for relevant tables and perform Question Answering.

The TableTextRetriever is the result of our team's research into table retrieval methods, which you can read about in this paper that was presented at EMNLP 2021. Behind the scenes, it uses three transformer-based encoders: one for text passages, one for tables, and one for the query. However, in Haystack, you can swap it in for any other dense retrieval model and start working with tables. The TableReader is built upon the TAPAS model and, when handed Documents containing tables, it can return a single cell as an answer or perform an aggregation operation on a set of cells to form a final answer.

from haystack.nodes import TableTextRetriever, TableReader

retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
    embed_meta_fields=["title", "section_title"]
)
reader = TableReader(
    model_name_or_path="google/tapas-base-finetuned-wtq",
    max_seq_len=512
)

Have a look at the Table QA documentation if you'd like to learn more or dive into the Table QA tutorial to start unlocking the information in your table data.

Improved Debugging of Pipelines & Nodes

We've made debugging much simpler and also more informative! As long as your node receives a boolean debug argument, it can propagate its input, output or even some custom information to the output of the pipeline. It is now a built-in feature of all existing nodes and can also easily be inherited by your custom nodes.

result = pipeline.run(
        query="Who is the father of Arya Stark?",
        params={
            "debug": True
        }
    )
{'ESRetriever': {'input': {'debug': True,
                           'query': 'Who is the father of Arya Stark?',
                           'root_node': 'Query',
                           'top_k': 1},
                 'output': {'documents': [<Document: {'content': "\n===In the Riverlands===\nThe Stark army reaches the Twins, a bridge strong", ...}>]
                            ...}

To find out more about this feature, check out debugging. To learn how to define custom debug information, have a look at custom debugging.

FARM Migration

Those of you following Haystack from its first days will know that Haystack first evolved out of the FARM framework. While FARM is designed to handle diverse NLP models and tasks, Haystack gives full end-to-end support to search and question answering use cases with a focus on coordinating all components that take a proof-of-concept into production.

Haystack has always relied on FARM for much lower-level processing and modeling. To reduce the implementation overhead and simplify debugging, we have migrated the relevant parts of FARM into the new haystack/modeling package.

⚠️ Breaking Changes & Migration Guide

Migration to v1.0

With the release of v1.0, we decided to make some bold changes.
We believe this has brought a significant improvement in usability and makes the project more future-proof.
While this does come with a few breaking changes, we do our best to guide you on how to go from v0.x to v1.0.
For more details see the Migration Guide and if you need more guidance, just reach out via Slack.

New Package Structure & Changed Imports

Due to the ever-increasing number of Nodes and Document Stores being integrated into Haystack,
we felt the need to implement a repository structure that makes it easier to navigate to what you're looking for. We've also shortened the length of the imports.

haystack.document_stores

  • All Document Stores can now be directly accessed from here
  • Note the pluralization of document_store to document_stores

haystack.nodes

  • This directory directly contains any class that can be used as a node
  • This includes File Converters and PreProcessors

haystack.pipelines

  • This contains all the base, custom and pre-made pipeline classes
  • Note the pluralization of pipeline to pipelines

haystack.utils

  • Any utility functions

➡️ For the large majority of imports, the old style still works but this will be deprecated in future releases!
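
For illustration, here are a few imports under the new structure (the same paths used in the code examples elsewhere in these notes):

from haystack.document_stores import ElasticsearchDocumentStore, InMemoryDocumentStore
from haystack.nodes import FARMReader, EmbeddingRetriever, TextConverter
from haystack.pipelines import Pipeline, ExtractiveQAPipeline
from haystack.schema import Document, Answer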

Primitive Objects

Instead of relying on dictionaries, Haystack now standardizes more of the inputs and outputs of Nodes using primitive classes such as Document, Answer, Label, and MultiLabel.

With these, there is now support for data structures beyond text and the REST API schema is built around their structure.
Using these classes also allows for the autocompletion of fields in your IDE.

Tip: To see examples of these primitive classes being returned, have a look at Ready-Made Pipelines.

Many of the fields in these classes have also been renamed or removed.
You can see a more comprehensive list of them in this Github issue.
Below, we will go through a few cases that are likely to impact established workflows.

Input Document Format

This dictionary schema used to be the recommended way to prepare your data to be indexed.
Now we strongly recommend using our dedicated Document class as a replacement.
The text field has been renamed to content to accommodate cases where it holds data other than text,
for example in Table QA.


v0.x:

doc = {
	'text': 'DOCUMENT_TEXT_HERE',
	'meta': {'name': DOCUMENT_NAME, ...}
}

v1.0:

doc = Document(
    content='DOCUMENT_TEXT_HERE',
    meta={'name': DOCUMENT_NAME, ...}
)

From here, you can take the same steps to write Documents into your Document Store.

document_store.write_documents([doc])

Response format of Reader

All Reader Nodes now return Answer objects instead of dictionaries.


v0.x:

[
    {
        'answer': 'Fang',
        'score': 13.26807975769043,
        'probability': 0.9657130837440491,
        'context': """Криволапик (Kryvolapyk, kryvi lapy "crooked paws")
            ===Fang (Hagrid's dog)===
            *Chinese (PRC): 牙牙 (ya2 ya) (from 牙 "tooth", 牙,"""
    }
]

v1.0:

[
    <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9946763813495636, 'co...
Read more

v0.10.0

16 Sep 08:31

⭐ Highlights

🚀 Making Pipelines more scalable

You can now easily scale and distribute Haystack Pipelines thanks to the new integration of the Ray framework (https://ray.io/).
Ray allows distributing a Pipeline's components across a cluster of machines. The individual components of a Pipeline can be independently scaled. For instance, an extractive QA Pipeline deployment can have three replicas of the Reader and a single replica for the Retriever. It enables efficient resource utilization by horizontally scaling Components. You can use Ray via the new RayPipeline class (#1255)

To set the number of replicas, add replicas in the YAML config for the node in a pipeline:

components:
    ...

pipelines:
  - name: ray_query_pipeline
    type: RayPipeline
    nodes:
      - name: ESRetriever
        replicas: 2  # number of replicas to create on the Ray cluster
        inputs: [ Query ]

A RayPipeline currently can only be created with a YAML Pipeline config:

from haystack.pipeline import RayPipeline
pipeline = RayPipeline.load_from_yaml(path="my_pipelines.yaml", pipeline_name="my_query_pipeline")
pipeline.run(query="What is the capital of Germany?")

See docs for more details

😍 Making Pipelines more user-friendly

The old Pipeline design came with a couple of flaws:

  • Impossible to route certain parameters (e.g. top_k) to dedicated nodes
  • Incorrect parameters in pipeline.run() are silently swallowed
  • Hard to understand what is in **kwargs when working with node.run() methods
  • Hard to debug

We tackled those with a big refactoring of the Pipeline class and changed how data is passed between nodes #1321.
This comes now with a few breaking changes:

Component params like top_k, no_ans_boost for Pipeline.run() must be passed in a params dict

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})

See breaking changes section and the docs for details

📈 Better evaluation metric for QA: Semantic Answer Similarity (SAS)

The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In our recent EMNLP paper, we proposed "SAS", a cross-encoder-based metric for the estimation of semantic answer similarity. We compared it to seven existing metrics and found that it correlates better with human judgement. See our paper #1338

You can use it in Haystack like this:

...
# initialize the node with a SAS model
eval_reader = EvalAnswers(sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# define a pipeline 
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalDocuments", inputs=["ESRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalDocuments"])
p.add_node(component=eval_reader, name="EvalAnswers", inputs=["QAReader"])
...

See our updated Tutorial 5 for a full example.

🤯 New nodes: Doc Classifier, Re-Ranker, QuestionGenerator & more

More nodes, more use cases:

  • FARMClassifier node for Document Classification: tag a document at indexing time or add a class downstream in your inference pipeline #1265
  • SentenceTransformersRanker: Re-Rank your documents after retrieval to maximize the relevance of your results. This implementation uses the popular sentence-transformer models #1209
  • QuestionGenerator: Question Answering systems are trained to find an answer given a question and a document; but with the recent advances in generative NLP, there are now models that can read a document and suggest questions that can be answered by that document. All this power is available to you now via the QuestionGenerator class.
    QuestionGenerator models can be trained using Question Answering datasets. Instead of predicting answers, the QuestionGenerator takes the document as input and is trained to output the questions. This can be useful when you want to add "autosuggest" questions in your search bar or accelerate labeling processes. See docs (#1267)

🔭 Better support for OpenSearch

We now support Approximate nearest neighbour (ANN) search in OpenSearch (#1225) and fixed some initialization issues.

📑 New Tutorials

⚠️ Breaking Changes

probability field removed from results #1340

Having two fields, probability and score, in answers/documents returned from nodes often caused confusion.
From now on, there is only one field called score, which is in the range [0,1]. In QA results, this field is populated with the old probability value, so you can simply switch to this one. These fields have changed in both the Python and REST APIs.

Old:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 14.684528350830078,
      "probability": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}

New:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}

Removed Finder #1326

After being deprecated a few months ago, Finder is now gone - R.I.P

Params in Pipeline.run() #1321

Component params like top_k, no_ans_boost for Pipeline.run() must be passed in a params dict

Old:

pipeline.run(query="Why?", top_k_retriever=10, no_ans_boost=0.5)

New:

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.
Old:

pipeline.run(query="Why?", top_k_retriever=10, top_k_reader=5)

New:

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})

Also, custom nodes must not have **kwargs in their run methods anymore and should only return the data (e.g. answers) they produce themselves.

🤓 Detailed Changes

Crawler

  • Serialize crawler output to JSON #1284
  • Add Crawler support for indexing pipeline #1360

Converter

  • Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR #1349

Preprocessor

  • Add PreProcessor optional language parameter. #1160
  • Improve preprocessing logging #1263
  • Make PreProcessor.process() work on lists of documents #1163

Pipeline

  • Add Ray integration for Pipelines #1255
  • MostSimilarDocumentsPipeline introduced #1413
  • QoL function: access certain nodes in pipeline #1441
  • Refactor replicas config for Ray Pipelines #1378
  • Add simple docs2answer node to allow FAQ style QA / Doc search in API #1361
  • Allow for batch indexing when using Pipelines fix #1168 #1231

Document Stores

  • Implement OpenSearch ANN [#12...
Read more

v0.9.0

21 Jun 16:50
9e4d7bf

⭐ Highlights

Long-Form Question Answering (LFQA)

Haystack now provides LFQA with a Seq2SeqGenerator for generative QA and a Retribert Retriever thanks to community member @vblagoje. #1086
If you would like to ask questions where the answer is not a short phrase explicitly given in one of the documents but a more elaborate answer, then LFQA is interesting for you. These elaborate answers are generated by combining information from multiple relevant documents.
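
A minimal sketch of such a generative pipeline; the import paths and the generator model are assumptions based on the LFQA tutorial rather than code from these notes:

# Sketch only: pre-1.0 import paths and the yjernite/bart_eli5 model are assumptions.
from haystack.generator.transformers import Seq2SeqGenerator
from haystack.pipeline import GenerativeQAPipeline

generator = Seq2SeqGenerator(model_name_or_path="yjernite/bart_eli5")
pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)  # retriever: e.g. a Retribert-based dense retriever
result = pipe.run(query="Why do airplanes leave contrails in the sky?")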

Document Re-Ranking

For pure "semantic document search" use cases that do not need question answering functionality but only document ranking, there is now a new type of node: Ranker. While the Retriever is a perfect fit for document retrieval, we can further improve its results with the Ranker. #1025
To this end, the Ranker uses a pre-trained model to calculate the semantic similarity of the question and each of the top-k retrieved documents. Documents with a high semantic similarity are ranked higher. The combination of a Retriever and a Ranker is especially powerful if you combine a sparse retriever, e.g., an ElasticsearchRetriever based on BM25, with a dense Ranker.
A pipeline with a Ranker and a Retriever can be set up in just a few lines of code:

...
retriever = ElasticsearchRetriever(document_store=document_store)
ranker = FARMRanker(model_name_or_path="deepset/gbert-base-germandpr-reranking")

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])
...

Weaviate

Thanks to a contribution by our community member @venuraja79, Weaviate is integrated into Haystack as another DocumentStore #1064
It allows a combination of vector search and scalar filtering, i.e., you can filter for a certain tag and do dense retrieval on that subset. After starting a Weaviate server with Docker, it's as simple as:

from haystack.document_store import WeaviateDocumentStore
document_store = WeaviateDocumentStore()

Haystack uses the most recent Weaviate version 1.4.0, and the updating of embeddings has also been optimized #1181

Query Classifier

Some search applications need to distinguish between keyword queries and longer textual questions. If you want to route longer questions to the Reader branch (to maximize the accuracy of results) and keyword queries to a Document Retriever only (to minimize computation effort and cost), you can now do that with a QueryClassifier node, thanks to a contribution by @shahrukhx01. #1099
You could use it, for example, as the entry point of a query pipeline that routes each incoming query to the right branch, as sketched below.
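
A minimal sketch of such a routing pipeline, assuming the TransformersQueryClassifier variant and the output_1/output_2 routing described in the QueryClassifier docs (retriever and reader initialization elided, as in the Ranker example above):

...
# Sketch only: class name and output routing are assumptions based on the QueryClassifier docs.
query_classifier = TransformersQueryClassifier()

p = Pipeline()
p.add_node(component=query_classifier, name="QueryClassifier", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])  # natural-language questions
p.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_2"])    # keyword queries
p.add_node(component=reader, name="QAReader", inputs=["DPRRetriever"])
...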

New Tutorials

  1. Tutorial 11: Pipelines #991
  2. Tutorial 12: Generative QA with LFQA #1086

⚠️ Breaking Changes

  • Remove Python 3.6 support #1059
  • Refactor REST APIs to use Pipelines #922
  • Bump to FARM 0.8.0, torch 1.8.1 and transformers 4.6.1 #1192

🤓 Detailed Changes

Connector

  • Add crawler to get texts from websites #775

Preprocessor

  • Add white space normalization warning #1022
  • Preserve whitespace during PreProcessor.split() #1121
  • Fix equality check in preprocessor #969

Pipeline

  • Add validation for root node in Pipeline #987
  • Fix passing a list as parameter value in Pipeline YAML #952
  • Add export of Pipeline YAML config #1003
  • Add config to JoinDocuments node to allow yaml export in pipelines #1134

Document Stores

  • Integrate Weaviate as another DocumentStore #957 #1064
  • Add OpenDistro init #1101
  • Rename all document stores delete_all_documents() method to delete_documents #1047
  • Fix Elasticsearch connection for non-admin users #1028
  • Fix update_embeddings() for FAISSDocumentStore #978
  • Feature: Enable AWS Elasticsearch IAM connection #965
  • Fix optional FAISS import #971
  • Make FAISS import conditional #970
  • Benchmark milvus #850
  • Improve Milvus HNSW Performance #1127
  • Update Milvus benchmarks #1128
  • Upgrade milvus to 1.1.0 #1066
  • Update tests for FAISSDocumentStore #999
  • Add L2 support for FAISS HNSW #1138
  • Improve the speed of FAISSDocumentStore.delete_documents() #1095
  • Add options for handling duplicate documents (skip, fail, overwrite) #1088
  • Update Embeddings - Use update instead of replace #1181
  • Improve the progress bar in update_embeddings() + Fix filters in update_embeddings() #1063
  • Using text hash as id to prevent document duplication #1000

Retriever

  • DPR Training parameter #989
  • Removed single_model_path; added infer_tokenizer to dpr load() #1060
  • Integrate sentence transformers into benchmarks #843
  • added use_amp to the train method, in order to use mixed precision training #1048

Ranker

  • Re-ranking component for document search without QA #1025
  • Remove quickfix from reader and ranker #1196
  • Distinguish labels for calculating similarity scores #1124

Query Classifier

  • Fix typo in Query Classifier Exception Message #1190
  • Add QueryClassifier incl. baseline models #1099

Reader

  • Filtering duplicate answers #1021
  • Add ONNXRuntime support #157
  • Remove unused function _get_pseudo_prob #1201

Generator

  • Integrate LFQA with Haystack - inferencing #1086

Evaluation Nodes

  • Reduce precision in pipeline eval print functions #943
  • Fix division by zero error in EvalRetriever #938
  • Add evaluation nodes for Pipelines #904
  • Add More top_k handling to EvalDocuments #1133
  • Prevent merge of same questions on different documents during evaluation #1119

REST API

  • adding root_path option #982
  • Add PDF converter dependencies Docker #1107
  • Disable Gunicorn preload option #960

User Interface

  • change file-upload response to sidebar #1018
  • Add File Upload Functionality in UI #995
  • Streamlit UI Evaluation mode #920
  • Fix evaluation mode in UI #1024
  • Fix typo in streamlit UI #1106

Documentation and Tutorials

  • Add about sections to Tutorial 12 #1195
  • Tutorial update #1166
  • Documentation update #1162
  • Add FAQ page #1151
  • Refresh API docs #1152
  • Add docu of confidence scores and calibration method #1131
  • Adding indentation to markup files #947
  • Update preprocessing.md #1087
  • Add badges to readme [#1136](...
Read more

v0.8.0

13 Apr 15:04

⭐ Highlights

This is a major Haystack release with many new features. The release blog post has a detailed summary. Below are the top highlights:

Milvus Document Store

Milvus is an open-source vector database. With the MilvusDocumentStore contributed by @lalitpagaria, embedding based Retrievers like the DensePassageRetriever or EmbeddingRetriever can use production-ready Milvus servers for large-scale deployments.
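
A minimal sketch, assuming a running Milvus server and the pre-1.0 import path used in the Weaviate example of the v0.9.0 notes above:

from haystack.document_store import MilvusDocumentStore  # pre-1.0 import path (assumption)

# Assumes a Milvus server is already running, e.g. started via docker
document_store = MilvusDocumentStore()
document_store.write_documents(docs)         # docs: your list of Haystack Documents
document_store.update_embeddings(retriever)  # retriever: e.g. a DensePassageRetriever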

Knowledge Graph

An experimental integration for KnowledgeGraphs is introduced using GraphDB. The GraphDBKnowledgeGraph stores Triples and executes SPARQL queries. It can be integrated with Text2SparqlRetriever to convert natural language queries to SPARQL.

Pipeline configuration with YAML

The Pipelines can now be configured with YAML. This enables easier sharing of query & indexing configuration, reproducible setups, A/B testing of Pipelines, and moving from development to the production environment.
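
For example, a query pipeline defined in such a YAML file can be loaded and run in a few lines (a sketch; the file and pipeline names are placeholders):

from pathlib import Path
from haystack.pipeline import Pipeline  # pre-1.0 import path (assumption)

# Placeholder file and pipeline names: point these at your own YAML config
pipeline = Pipeline.load_from_yaml(Path("pipelines.yaml"), pipeline_name="my_query_pipeline")
result = pipeline.run(query="Why did the revenue change?")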

REST APIs

The REST APIs are revamped to use Pipelines for querying and indexing files. The YAML configurations are in rest_api/pipelines.yaml. The new API endpoints are more generic to accommodate custom Pipeline configurations.

Confidence Scores

The answers now have a probability score that is better calibrated to the model's confidence. It ranges from 0 to 1, where 0 signifies very low confidence and 1 very high confidence.

Web Crawler

A Selenium based web crawler is now part of Haystack, thanks to @DIVYA-19 for the contribution. It takes as input a list of URLs and converts extracted text to Haystack Documents.

⚠️ Breaking Changes

REST APIs

The REST APIs got a major revamp with this release.

  • /doc-qa & /faq-qa endpoints are replaced with a more generic POST /query endpoint. This new endpoint uses Pipelines under-the-hood, that can be configured at rest_api/pipeline.yaml.

  • The new /query endpoint expects a single query per request instead of a list of query strings.
    The new request format is:

    {
        "query": "Why did the revenue change?"
    }

    and the response looks like this:

    {
        "query": "Why did the revenue change?",
        "answers": [
            {
                "answer": "rapid technological change and evolving industry standards",
                "question": null,
                "score": 0.543937623500824,
                "probability": 0.014070278964936733,
                "context": "tion process. The market for our products is intensely competitive and is characterized by rapid technological change and     evolving industry standards.",
                "offset_start": 91,
                "offset_end": 149,
                "offset_start_in_doc": 511,
                "offset_end_in_doc": 569,
                "document_id": "f30273b2-4d49-40d8-8824-43b3b6a0ea57",
                "meta": {
                    "_split_id": "7"
                }
            },
            {
                 // other answers
            }
        ]
    }
  • The /doc-qa-feedback & /faq-qa-feedback endpoints are replaced with a new generic /feedback endpoint.

Created At Timestamp

Previously, all documents/labels in SQLDocumentStore and FAISSDocumentStore had a field called created to store the creation timestamp, while ElasticsearchDocumentStore did not have any timestamp field. Now, all document stores have a created_at field for documents and labels.

RAGenerator

The top_k_answers parameter in the RAGenerator is renamed to top_k for consistency across Haystack components.

Custom Query for Elasticsearch

The placeholder terms in custom_query should not have quotes around them. See more details here.

🤓 Detailed Changes

Pipeline

Document Store

  • Fixes elasticsearch auth #871 (@grafke)
  • Allow more options for elasticsearch client (auth, multiple hosts) #845 (@tholor)
  • Fix ElasticsearchDocumentStore.query_by_embedding() #823 (@oryx1729)
  • Introduce incremental updates for embeddings in document stores #812 (@oryx1729)
  • Add method to get metadata values for a key from Elasticsearch #776 (@oryx1729)
  • Fix refresh behaviour for Elasticsearch delete #794 (@oryx1729)
  • Milvus integration #771 (@lalitpagaria)
  • Add flag for use of window queries in SQLDocumentStore #768 (@oryx1729)
  • Remove quotes around placeholders in Elasticsearch custom query #762 (@oryx1729)
  • Fix delete_all_documents for the SQLDocumentStore #761 (@oryx1729)

Retriever

Modeling

REST API

File Converter

  • Add Markdown file convertor #875 (@lalitpagaria)
  • Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813 (@tholor)

Crawler

Knowledge Graph

Annotation Tool

Search UI

  • Fix UI when API returns fewer answers than expected #828(@tholor)

CI

Misc Fixes

Read more

v0.7.0

21 Jan 17:42

⭐ Highlights

New Slack Channel

As many people in the community asked us for it, we decided to open a Slack channel!
Join us and ask questions, show what you've built with Haystack, and simply exchange with like-minded folks!

👉 https://haystack.deepset.ai/community/join

Optimizing Memory + CPU consumption of documentstores for large datasets (#733)

Interacting with large datasets can be challenging for the local memory. Therefore, we ...

  1. ... add batch_size parameters for most methods of the document store that allow loading only smaller chunks of documents at a time
  2. ... add a get_all_documents_generator() method that "streams" documents one by one from your document store.
    Both help to lower the memory footprint significantly, especially when calling methods like update_embeddings() on datasets with more than a million documents. A short sketch of both options follows below.
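
A minimal sketch of both options (the parameter values and the processing step are illustrative):

# Stream documents one by one instead of loading the whole index into memory
for doc in document_store.get_all_documents_generator(batch_size=1_000):
    handle(doc)  # handle() is a placeholder for your own processing

# Recompute embeddings in smaller chunks to keep memory usage low
document_store.update_embeddings(retriever, batch_size=10_000)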

Add Simple Demo UI (#671)

Thanks to our community member @tanmaylaud, we now have a great and simple UI that allows you to easily try your search pipelines. Ask questions, see the results, change basic config params, debug the API response and give your colleagues a better flavor of what you are building ...


Support for summarization models (#698)

Thanks to another community contribution from @lalitpagaria we now also support summarization models like PEGASUS in Haystack. You can use them ...

... standalone:

docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
                    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by
                    the shutoffs which were expected to last through at least midday tomorrow.")]

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
summary = summarizer.predict(documents=docs, generate_single_summary=False)

... as a node in your pipeline:

...
pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Retriever"])

... by simply calling a predefined pipeline that first retrieves and then summarizes the resulting docs:

...
pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
pipe.run(query="Why are the blackouts happening?")

We see many interesting use cases around search for it. For example, running semantic document search and displaying the summary of docs as a "preview" in the results.

New Tutorials

  1. Wonder how to train a DPR retriever on your own domain dataset? Check out this new tutorial!
  2. Proper preprocessing (Cleaning, Splitting etc.) of docs can have a big impact on your performance. Check out this new tutorial to learn more about it.

⚠️ Breaking Changes

Dropping index_buffer_size from FAISSDocumentStore

We removed the arg index_buffer_size from the init of FAISSDocumentStore. "Buffering" is now handled via the new batch_size arguments that you can pass to most methods like write_documents(), update_embeddings() and get_all_documents().

Renaming of Preprocessor arg

Old:

PreProcessor(..., split_stride=5)

New:

PreProcessor(..., split_overlap=5)

🤓 Detailed Changes

Preprocessing / File Conversion

  • Using PreProcessor functions on eval data #751

DocumentStore

  • Support filters for DensePassageRetriever + InMemoryDocumentStore #754
  • use Path class in add_eval_data of haystack.document_store.base.py #745
  • Make batchwise adding of evaluation data possible #717
  • Change signature and docstring for ca_certs parameter #730
  • Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config #728
  • Fix SQLite errors in tests #723
  • Add support for custom embedding field for InMemoryDocumentStore #640
  • Using Columns names instead of ORM to get all documents #620

Other

  • Generate docstrings and deploy to branches to Staging (Website) #731
  • Script for releasing docs #736
  • Increase FARM to Version 0.6.2 #755
  • Reduce memory consumption of fetch_archive_from_http #737
  • Add links to more resources #746
  • Fix Tutorial 9 #734
  • Adding a guard that prevents the tutorial from being executed in every subprocess on windows #729
  • Add ID to Label schema #727
  • Automate docstring and tutorial generation with every push to master #718
  • Pass custom label index name to REST API #724
  • Correcting pypi download badge #722
  • Fix GPU docker build #703
  • Remove sourcerer.io widget #702
  • Haystack logo is not visible on github mobile app #697
  • Update pipeline documentation and readme #693
  • Enable GPU args in tutorials #692
  • Add docs v0.6.0 #689

Big thanks to all contributors ❤️ !

@Rob192 @antoniolanza1996 @tanmaylaud @lalitpagaria @Timoeller @tanaysoni @bogdankostic @aantti @brandenchan @PiffPaffM @julian-risch