# Long-Form Question Answering

Follow this tutorial to learn how to build and use a pipeline for Long-Form Question Answering (LFQA). LFQA is a variety of the generative question answering task. LFQA systems query large document stores for relevant information and then use this information to generate accurate, multi-sentence answers. In a regular question answering system, the retrieved documents related to the query (context passages) act as source tokens for extracted answers. In an LFQA system, context passages provide the context the system uses to generate original, abstractive, long-form answers.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

You can double check whether the GPU runtime is enabled with the following command:

In [1]:
%%bash

nvidia-smi

Wed Jan  4 13:38:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
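
If you prefer to run the same check from Python rather than from a shell cell, a minimal sketch using only the standard library could look like the following. It assumes that a working `nvidia-smi` on the `PATH` is a reasonable proxy for an enabled GPU runtime; the helper name `gpu_runtime_available` is hypothetical and not part of Haystack.

```python
import shutil
import subprocess


def gpu_runtime_available() -> bool:
    """Return True if `nvidia-smi` is on the PATH and exits successfully."""
    # Assumption: a usable nvidia-smi implies that a GPU runtime is enabled.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        completed = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


print("GPU runtime detected" if gpu_runtime_available() else "No GPU runtime detected")
```
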

To start, install the latest version of Haystack from the `main` branch on GitHub with `pip`:

In [2]:
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-22.3.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 68.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-22.3.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab,faiss]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-s1rnujr4/farm-haystack_583840abb1744ecdb0e47c494cbd43f0
  Resolved https://github.com/deepset-ai/haystack.git to commit a2c160e7d8e706cd8184eb984db5882350d9d876
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements t

  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-s1rnujr4/farm-haystack_583840abb1744ecdb0e47c494cbd43f0


## Logging

Before importing Haystack, we configure how log messages are displayed and which log level is used.
Example log message:
`INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt`
The default log level in `basicConfig` is WARNING, so the explicit `level` parameter is not strictly necessary, but it makes the level easy to change:

In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
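
To see why these two settings interact the way they do, the sketch below reproduces the same configuration against an in-memory buffer. It relies only on the standard `logging` module; the logger name `some_other_library` and the message texts are made up for illustration.

```python
import io
import logging

# Capture log output in a string buffer so the effect is easy to inspect.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s - %(name)s -  %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.WARNING)  # threshold for loggers without their own level
logging.getLogger("haystack").setLevel(logging.INFO)  # more verbose for Haystack only

# A logger below "haystack" inherits the INFO level from its parent.
logging.getLogger("haystack.utils.preprocessing").info("Converting example.txt")
# A logger outside the "haystack" hierarchy falls back to the root level (WARNING).
logging.getLogger("some_other_library").info("this message is filtered out")

captured = buffer.getvalue()
print(captured)
```

Only the message from the `haystack.*` logger should appear in the captured output, because the INFO message from the unrelated logger is below the root threshold.
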

### Document Store

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried when searching for answers.
The default flavour of `FAISSDocumentStore` is "Flat", but it can also be set to "HNSW" for
faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.
For more information on which index suits your use case, see https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

In [4]:
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(embedding_dim=128, faiss_index_factory_str="Flat")
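
To give an intuition for what a "Flat" index does, here is a small pure-Python sketch of exhaustive search: the query is compared against every stored vector, so the result is exact but the cost grows linearly with the number of vectors. This is not how FAISS is implemented (FAISS is heavily optimized), and the function names, toy vectors, and the choice of inner product as the similarity are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(
    query: Sequence[float], vectors: List[Sequence[float]], top_k: int = 2
) -> List[Tuple[int, float]]:
    """Score the query against every vector and return the top_k (index, score) pairs."""
    scored = [(index, dot(query, vector)) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings"; the tutorial itself uses 128 dimensions.
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
hits = flat_search([1.0, 0.2, 0.0], vectors, top_k=2)
print(hits)
```

An approximate index such as HNSW avoids scoring every vector, which is where the speed-up and the small loss in accuracy come from.
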

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://docs.haystack.deepset.ai/docs/telemetry


### Cleaning & indexing documents

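As in the previous tutorials, we convert and index a set of text files in our DocumentStore. The tutorial normally downloads some Game of Thrones articles; in this run the download is commented out and the files are read from a local `data/custom` directory instead.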

In [12]:
from haystack.utils import convert_files_to_docs, fetch_archive_from_http, clean_wiki_text


# Let's first get some files that we want to use.
# The download of the Game of Thrones articles is commented out here;
# a local directory with custom text files is used instead.
# doc_dir = "data/tutorial12"
# s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt12.zip"
# fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
doc_dir = "data/custom"
# Convert the files to Haystack Documents, split into paragraphs
docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Now, let's write the documents to our DocumentStore.
document_store.write_documents(docs)

INFO:haystack.utils.preprocessing:Converting data/custom/Julian_Assange.txt
INFO:haystack.utils.preprocessing:Converting data/custom/Ramakrishna.txt
INFO:haystack.utils.preprocessing:Converting data/custom/Swami_Vivekananda.txt


Writing Documents:   0%|          | 0/93 [00:00<?, ?it/s]
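
To illustrate what splitting into paragraphs means for indexing, the sketch below turns one text into several small documents by splitting on blank lines. It is only a rough approximation under that assumption and not Haystack's actual implementation; the function name `split_into_paragraph_docs` and the sample text are hypothetical.

```python
from typing import Dict, List


def split_into_paragraph_docs(text: str, name: str) -> List[Dict]:
    """Split a text on blank lines and wrap each non-empty paragraph as a document dict."""
    docs = []
    for paragraph in text.split("\n\n"):
        if paragraph.strip():  # skip empty chunks produced by repeated blank lines
            docs.append({"content": paragraph, "meta": {"name": name}})
    return docs


sample = "== Early life ==\nFirst paragraph.\n\n\n\n== Career ==\nSecond paragraph."
paragraph_docs = split_into_paragraph_docs(sample, "example.txt")
for doc in paragraph_docs:
    print(doc["meta"]["name"], "->", repr(doc["content"]))
```

Smaller passages tend to work better for dense retrieval, since each embedding then represents a single focused piece of text.
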

### Initialize Retriever and Reader/Generator

#### Retriever

We use a `DensePassageRetriever` and call `update_embeddings` to compute the document embeddings and index them in the `FAISSDocumentStore`.



In [13]:
from haystack.nodes import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="vblagoje/dpr-question_encoder-single-lfqa-wiki",
    passage_embedding_model="vblagoje/dpr-ctx_encoder-single-lfqa-wiki",
)

document_store.update_embeddings(retriever)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.document_stores.faiss:Updating embeddings for 2450 docs...


Updating Embedding:   0%|          | 0/2450 [00:00<?, ? docs/s]

Create embeddings:   0%|          | 0/2464 [00:00<?, ? Docs/s]

Before we blindly use the `DensePassageRetriever`, let's test it empirically to make sure a simple search actually finds the relevant documents.

In [15]:
from haystack.utils import print_documents
from haystack.pipelines import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(retriever)
res = p_retrieval.run(query="who is Vivekananda?", params={"Retriever": {"top_k": 10}})
print_documents(res, max_text_len=512)


Query: who is Vivekananda?

{   'content': '\n'
               "=== Parliament of the World's Religions ===\n"
               "The Parliament of the World's Religions opened on 11 September "
               "1893 at the Art Institute of Chicago, as part of the World's "
               'Columbian Exposition. On this day, Vivekananda gave a brief '
               'speech representing India and Hinduism. He was initially '
               'nervous, bowed to Saraswati (the Hindu goddess of learning) '
               'and began his speech with "Sisters and brothers of America!". '
               'At these words, Vivekananda received a two-minute standing '
               'ovation from the crowd of seven thousand. Acc...',
    'name': 'Swami_Vivekananda.txt'}

{   'content': '\n'
               '== Teachings and philosophy ==\n'
               'While synthesizing and popularizing various strands of '
               'Hindu-thought, most notably classical yoga and (Advaita) '
               'V

#### Reader/Generator

Similar to previous tutorials, we now initialize our reader/generator.

Here we use a `Seq2SeqGenerator` with the *vblagoje/bart_lfqa* model (see: https://huggingface.co/vblagoje/bart_lfqa).



In [16]:
from haystack.nodes import Seq2SeqGenerator


generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


### Pipeline

With a Haystack `Pipeline`, you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline` that combines a retriever and a reader/generator to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).

In [17]:
from haystack.pipelines import GenerativeQAPipeline

pipe = GenerativeQAPipeline(generator, retriever)
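
The following toy sketch mimics the retrieve-then-generate flow in plain Python. It may help explain why the `params` dictionary passed to `run` in this tutorial is keyed by node name (`"Retriever"`). All classes here (`ToyRetriever`, `ToyGenerator`, `ToyPipeline`) are hypothetical stand-ins: retrieval is plain word overlap and "generation" is string concatenation, so this is not Haystack's implementation.

```python
import re
from typing import Dict, List, Optional


def words(text: str) -> set:
    """Lowercased word set, used for a crude overlap score."""
    return set(re.findall(r"\w+", text.lower()))


class ToyRetriever:
    def __init__(self, documents: List[str]):
        self.documents = documents

    def run(self, query: str, top_k: int = 2) -> List[str]:
        # Rank documents by how many words they share with the query.
        ranked = sorted(
            self.documents, key=lambda d: len(words(query) & words(d)), reverse=True
        )
        return ranked[:top_k]


class ToyGenerator:
    def run(self, query: str, documents: List[str]) -> str:
        # A real generator would write an abstractive answer; this only concatenates.
        return f"Answer to '{query}' based on {len(documents)} passage(s): " + " ".join(documents)


class ToyPipeline:
    def __init__(self, generator: ToyGenerator, retriever: ToyRetriever):
        self.nodes = {"Retriever": retriever, "Generator": generator}

    def run(self, query: str, params: Optional[Dict[str, dict]] = None) -> dict:
        params = params or {}
        # Each node only receives the parameters stored under its own name.
        documents = self.nodes["Retriever"].run(query, **params.get("Retriever", {}))
        answer = self.nodes["Generator"].run(query, documents, **params.get("Generator", {}))
        return {"query": query, "answers": [answer], "documents": documents}


toy_pipe = ToyPipeline(
    ToyGenerator(),
    ToyRetriever(
        [
            "Arya Stark wields a sword named Needle.",
            "FAISS indexes dense vectors.",
            "Sansa Stark is the sister of Arya.",
        ]
    ),
)
toy_result = toy_pipe.run("Who is Arya Stark?", params={"Retriever": {"top_k": 2}})
print(toy_result["answers"][0])
```
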

## Voilà! Ask a question!

In [21]:
# pipe.run(
#     query="How did Arya Stark's character get portrayed in a television adaptation?", params={"Retriever": {"top_k": 3}}
# )

pipe.run(
    query="why vivekananda rebelled against Ramakrishna ?", params={"Retriever": {"top_k": 5}}
)

{'query': 'why vivekananda rebelled  against Ramakrishna ?',
 'answers': [<Answer {'answer': "Vivekananda rebelled against Ramakrishna because he didn't like the way he was being treated. He didn't want to be a part of a monastic order, he wanted to live his life as a free man. He wanted to be able to do what he wanted, and not be bound by the rules of the order.", 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_id': None, 'meta': {'doc_ids': ['5cc91b096718db5a1b852ff4b3c444da', '88094ab98a156c1790824ac88da9df4c', '86c2bc5511976fe69f0c29e8ffd3526', '17cff40e080b9b2cc0c50a525cbbc440', '557423284967059e4311c7ac868cbad9'], 'doc_scores': [0.5615968776830655, 0.5612102090059436, 0.5600596466662053, 0.5585489118063406, 0.5570334590045094], 'content': ['\n=== Vivekananda ===\nAmong the Europeans who were influenced by Ramakrishna was Principal Dr. William Hastie of the Scottish Church College, Kolkata. In the course of e
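
The raw result dictionary is verbose, so in practice you may want to pull out only the fields you need. The helper below is a minimal sketch that relies on the structure visible in the output above (an `answers` list whose items expose `answer` and a `meta` dictionary with `doc_ids`). The name `summarize_result` is hypothetical, and the example uses a stand-in object with made-up values instead of a real Haystack `Answer` so that it runs without a model.

```python
from types import SimpleNamespace


def summarize_result(result: dict) -> dict:
    """Return the query, the first generated answer, and the number of source passages."""
    top_answer = result["answers"][0]
    return {
        "query": result["query"],
        "answer": top_answer.answer,
        "num_source_docs": len(top_answer.meta.get("doc_ids", [])),
    }


# Stand-in for the pipeline output; with the real pipeline you would pass
# the dictionary returned by pipe.run(...) instead.
fake_result = {
    "query": "example question?",
    "answers": [SimpleNamespace(answer="example answer", meta={"doc_ids": ["a", "b"]})],
}
summary = summarize_result(fake_result)
print(summary)
```
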

In [11]:
pipe.run(query="Why is Arya Stark an unusual character?", params={"Retriever": {"top_k": 3}})

{'query': 'Why is Arya Stark an unusual character?',
 'answers': [<Answer {'answer': 'Arya is the third child of Lord Eddard Stark and his wife Lady Catelyn Stark. She is the only one out of her full-siblings to inherit the Stark features and is said to resemble her late aunt Lyanna in both looks and temperament. Unlike her sister Sansa, who favors activities traditionally befitting a noblewoman and expresses disdain for outdoor activities, Arya shows no interest in dancing, singing and sewing, and revels in fighting and exploring. She wields a smallsword named Needle, and is trained in the Braavosi style of sword fighting by Syrio Forel.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_id': None, 'meta': {'doc_ids': ['2ee56bdd46dfd30b23f91bcc046456a4', 'a64bb94eab347d5cc10686c16b52a4dd', '50c34729eb25b43fe5953f90e3f492b8'], 'doc_scores': [0.5678043629384324, 0.5643267524623801, 0.5635598183959124], 'content': ["

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.

Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Discord](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)