In [1]:
import pathlib
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)


# https://haystack.deepset.ai/tutorials/01_basic_qa_pipeline

Haystack is an end-to-end, production-ready NLP framework that allows rapid development with popular models from Hugging Face. In this tutorial we will cover the key components of Haystack and how to interact with them.

Haystack provides the components needed to handle the data, pass it into a text pipeline, retrain the model, and interact with the model object.

## Data Handling

For data handling, Haystack uses the _DocumentStore_ module. The _DocumentStore_ acts as a database that stores the text and metadata in a format that can be queried quickly whenever the model needs to retrieve data. There are multiple types of document stores, such as _Elasticsearch_, FAISS (Meta's similarity search library), OpenSearch, SQL and a few more. If you want to learn more about the capabilities of the document stores, check out the [Haystack documentation](https://docs.haystack.deepset.ai/docs/document_store). Before it can be used, the _DocumentStore_ needs to be initialized. In the following example we will use the in-memory _DocumentStore_, which is the simplest option.
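To make the idea of a document store more concrete, here is a minimal sketch of a toy store in plain Python. It is not Haystack's implementation: the class `ToyDocumentStore` and its methods are hypothetical stand-ins, and deriving each id from an MD5 hash of the content is an assumption of this sketch. It only illustrates the two ideas described above, storing text together with metadata and querying it again, plus the skipping of duplicates, which is similar in spirit to the "Duplicate Documents" messages that show up in the indexing log.

```python
import hashlib


class ToyDocumentStore:
    """Hypothetical, simplified stand-in for a document store (not Haystack code)."""

    def __init__(self):
        # Maps a content-derived id to the stored document.
        self._docs = {}

    def write_documents(self, documents):
        """Store documents given as dicts with 'content' and optional 'meta'.

        Returns the number of documents actually written; duplicates are skipped.
        """
        written = 0
        for doc in documents:
            # Assumption of this sketch: the id is a hash of the content.
            doc_id = hashlib.md5(doc["content"].encode("utf-8")).hexdigest()
            if doc_id in self._docs:
                continue  # same content already stored
            self._docs[doc_id] = {
                "id": doc_id,
                "content": doc["content"],
                "meta": doc.get("meta", {}),
            }
            written += 1
        return written

    def get_all_documents(self, filters=None):
        """Return stored documents whose metadata matches every filter entry."""
        filters = filters or {}
        return [
            doc
            for doc in self._docs.values()
            if all(doc["meta"].get(key) == value for key, value in filters.items())
        ]


# Toy documents; the metadata keys are made up for illustration.
toy_store = ToyDocumentStore()
n_written = toy_store.write_documents(
    [
        {"content": "HBOS plc was a banking and insurance company.", "meta": {"sector": "banking"}},
        {"content": "AstraZeneca is a pharmaceutical company.", "meta": {"sector": "pharma"}},
        {"content": "HBOS plc was a banking and insurance company.", "meta": {"sector": "banking"}},
    ]
)
print(n_written)
print([doc["content"] for doc in toy_store.get_all_documents({"sector": "pharma"})])
```

A real document store adds indexing structures on top of this so that retrieval stays fast as the collection grows.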

In [2]:
# Haystack Imports
from haystack.document_stores import InMemoryDocumentStore
from haystack.utils import fetch_archive_from_http
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
from haystack.nodes import BM25Retriever
from haystack.nodes import FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers

In [3]:
document_store = InMemoryDocumentStore(use_bm25=True)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0


Next we import the documents as a set of text files. If you have your own dataset, you need to convert it from its original format to plain text files. In principle you can fetch the data from most databases or cloud storage services. In our case we will download the Wikipedia articles of the FTSE-listed companies from an AWS S3 bucket belonging to our lab and save them locally as `.txt` files. The data had previously been scraped from Wikipedia and cleaned. The texts are then fed into the _TextIndexingPipeline_, which converts them into Haystack document objects. Of course, you can also add your own `.txt` files to the index.
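If your data does not yet exist as text files, the conversion step can be as small as the following sketch. It uses only the standard library; the helper `write_text_files` and the example records are hypothetical and simply stand in for rows fetched from a database or bucket.

```python
import pathlib
import tempfile


def write_text_files(records, output_dir):
    """Write each (name, text) record to <output_dir>/<name>.txt and return the paths."""
    output_dir = pathlib.Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in records:
        # Keep file names filesystem-friendly: replace non-alphanumeric characters.
        safe_name = "".join(ch if ch.isalnum() else "_" for ch in name)
        path = output_dir / f"{safe_name}.txt"
        path.write_text(text.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder records, made up for illustration.
records = [
    ("Example Company A", "Example Company A is a placeholder article."),
    ("Example Company B", "Example Company B is another placeholder article."),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_paths = write_text_files(records, tmp_dir)
    print([path.name for path in file_paths])
```

The returned list of paths has the same shape as the `file_list` that is passed to the indexing pipeline in the next cell.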

In [4]:
# Directory that holds the downloaded .txt files.
data_path = pathlib.Path.cwd().parent.joinpath("text_dir")

# Uncomment to download and unpack the archive; replace the placeholder with your own URL.
# fetch_archive_from_http(
#     url="<YOUR S3 BUCKET>",
#     output_dir=data_path
# )

print("Data download was completed!")

# Collect every file in the data directory.
file_list = list(data_path.iterdir())

# Feed the files into the indexing pipeline.
text_pipeline = TextIndexingPipeline(document_store)
text_pipeline.run_batch(file_paths=file_list)

INFO - haystack.pipelines.base -  It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


Data download was completed!


Converting files:   0%|          | 0/755 [00:00<?, ?it/s]

Preprocessing:   0%|          | 0/755 [00:00<?, ?docs/s]

INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '83eafa6329d30d1a8384666265ef6f4a' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'cbc6c11a8dd8ffd09ef535738a004b29' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '17e329b5a7705aad6932d5aa3d1b0a46' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'd1c854be4332bdf8cb81c24699666224' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'c57a30b519ba93b45039556b54634216' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'bfc1955641f36c490f3ee11f5ad43849' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'c0caacef110d445e7e6de68d279d8520'

Updating BM25 representation...:   0%|          | 0/8845 [00:00<?, ? docs/s]

{'documents': [<Document: {'content': "HBOS plc was a banking and insurance company in the United Kingdom, a wholly owned subsidiary of the Lloyds Banking Group, having been taken over in January 2009. It was the holding company for Bank of Scotland plc, which operated the Bank of Scotland and Halifax brands in the UK, as well as HBOS Australia and HBOS Insurance & Investment Group Limited, the group's insurance division. HBOS was formed by the 2001 merger of Halifax plc and the Bank of Scotland. The formation of HBOS was heralded as creating a fifth force in British banking as it created a company of comparable size and stature to the established Big Four UK retail banks. It was also the UK's largest mortgage lender. The HBOS Group Reorganisation Act 2006 saw the transfer of Halifax plc and Capital Bank plc to the Bank of Scotland, which had by then become a registered public limited company, Bank of Scotland plc. Although officially HBOS was not an acronym of any specific words, it i

A retriever acts as the search component. The Haystack retriever filters all available documents down to those that are likely to be relevant for answering a given question. In our case we will apply the BM25 retriever, but you can check the [retriever documentation](https://docs.haystack.deepset.ai/docs/retriever) to identify the right one for your use case.
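To give a feeling for what a keyword-based retriever such as the `BM25Retriever` in the next cell does, here is a minimal sketch of BM25 ranking in plain Python. It is not Haystack's implementation: the functions `bm25_scores` and `retrieve`, the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency is dampened and normalized by document length.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, documents, top_k=2):
    """Return the top_k (document, score) pairs, best match first."""
    ranked = sorted(
        zip(documents, bm25_scores(query, documents)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


# Toy documents for illustration only.
toy_documents = [
    "astrazeneca is a pharmaceutical company",
    "hbos was a banking and insurance company",
    "the bank of scotland operated retail branches",
]

for document, score in retrieve("pharmaceutical company", toy_documents):
    print(f"{score:.3f}  {document}")
```

Because the scoring relies only on term statistics, this kind of retrieval is cheap compared with running a neural model over every document, which is why it is used as the first filtering stage.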

In [5]:
retriever = BM25Retriever(document_store=document_store)

Next, the reader also needs to be initialized. The reader is in effect the scanner that extracts the highest-ranking answer candidates from the retrieved documents. Readers tend to be much slower than retrievers since they are based on complex deep learning models. For more information on the reader you can refer to the [documentation](https://docs.haystack.deepset.ai/docs/reader#models), but generally you can select any BERT-based model that is available on [Hugging Face](https://huggingface.co/models). In our case we will use the `deepset/roberta-base-squad2` reader since it is a solid model out of the box. When initializing the reader you can also choose whether to use a GPU to accelerate the computation. In our case we will not use the GPU, since our dataset is not huge and we can accept longer compute times.

In [6]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0


Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -   * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Auto-detected model language: english
INFO - haystack.modeling.model.language_model -  Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0


Lastly, we need to build the pipeline that connects the retriever with the reader. Combining the two provides both speed and accuracy: the fast retriever narrows down the documents, and the slower reader only processes that shortlist. Different Haystack pipelines serve different purposes; refer to the [documentation](https://docs.haystack.deepset.ai/docs/pipelines) when selecting one.

In [7]:
pipeline = ExtractiveQAPipeline(reader, retriever)

Now that we have built the complete pipeline, we can start using it by asking questions about the dataset. To do so, we run our query through the pipeline with some configured parameters. The `top_k` parameter specifies how many results each stage returns: the retriever passes its top results on to the reader, and the reader returns its top answers. You can adjust these parameters to tune the pipeline. A retriever `top_k` of 10 generally provided good performance as a default for Haystack. Keep in mind that the speed of retrieval depends on the complexity of the model and the infrastructure you are running it on.

In [16]:
answer_prediction = pipeline.run(
    query="What type of company is Astra Zeneca?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

print_answers(answer_prediction, details="minimum")



Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]


Query: What type of company is Astra Zeneca?
Answers:
[   {   'answer': 'pharmaceutical',
        'context': 'he publicly listed companies: ICI and Zeneca—Zeneca would '
                   'later go onto merge with Astra AB, forming the current '
                   'pharmaceutical company, AstraZeneca. '},
    {   'answer': 'independent',
        'context': 'es, seeds and biological products were all transferred '
                   'into a new and independent company called Zeneca. Zeneca '
                   'subsequently merged with Astra AB to f'},
    {   'answer': 'British multinational',
        'context': 'Zeneca (officially Zeneca Group PLC) was a British '
                   'multinational pharmaceutical company headquartered in '
                   'London, United Kingdom. It was formed in June'},
    {   'answer': 'British-Swedish multinational pharmaceutical and '
                  'biotechnology',
        'context': 'AstraZeneca plc () is a British-Swedish multinational 

The highest-ranking answer in this case was "pharmaceutical". You can also look through the other answers, compute quality measurements from the predictions, and then fine-tune the model.
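One simple way to look through the other answers is to filter them by their confidence score. The sketch below assumes that the prediction is a dictionary with an `"answers"` list whose items expose `answer` and `score` attributes, as the answer objects returned by the pipeline do in Haystack 1.x. The helper `confident_answers` is hypothetical, and the stand-in prediction uses made-up scores so that the snippet runs without Haystack installed.

```python
from types import SimpleNamespace


def confident_answers(prediction, min_score=0.5):
    """Return (answer, score) pairs with score >= min_score, best first."""
    answers = prediction.get("answers", [])
    kept = [
        (item.answer, item.score)
        for item in answers
        if item.score is not None and item.score >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in for a pipeline result; the scores are invented for illustration.
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="independent", score=0.2),
        SimpleNamespace(answer="pharmaceutical", score=0.9),
    ]
}

print(confident_answers(fake_prediction, min_score=0.5))
```

With a real result you would pass `answer_prediction` instead of `fake_prediction` and choose a threshold that fits your data.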

Following the steps above, you can apply the pipeline to your own data and integrate and deploy it into your production solution for basic Q&A. Of course, the answers are not as well phrased as those of the recent ChatGPT, but for a simple prototype on your own datasets this approach can go a long way, trading off speed, accuracy and resource requirements. From here you can start experimenting with the components, testing different retrievers and combining them with readers. You can access the Git repository via this [link](https://github.com/drpochs/qanda_haystack).