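Initialise the DocumentStore. Setting use_bm25=True makes the store maintain the BM25 representation that the Retriever relies on later.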

In [71]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

Use TextIndexingPipeline to convert the text files into Haystack Document objects and write them to the DocumentStore.

In [72]:
doc_dir = "data/bulletins"

In [73]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

files_to_index = [os.path.join(doc_dir, file) for file in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store=document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)

{'documents': [<Document: {'content': 'Main points\n\nTotal turnover generated by UK services industries increased by £478.1 billion (19.0%) from £2,511.0 billion in 2020 to £2,989.1 billion in 2021.\n\nThe wholesale and retail trade sector, including the repair of motor vehicles and motorcycles, generated the most turnover in 2021 at £1,408.9 billion, an increase of £266.0 billion (23.3%) since 2020.\n\nTurnover generated by businesses outside their main industrial classification (referred to as off-diagonal turnover), was estimated at £454.2 billion in 2021, representing 15.2% of total turnover (£2,989.1 billion), compared with 15.8% in 2020.\n\nThe largest proportional increase in turnover by service type was from the creative, arts and entertainment industries, which showed an increase of 126.2% between 2020 and 2021.\n\nTotal production (manufacturing) turnover generated by businesses within UK services industries was estimated at £71.6 billion in 2021 compared with £61.8 billion 
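The list comprehension above passes every entry in doc_dir to the indexing pipeline, including any subdirectories or non-text files. A minimal sketch of a stricter file listing follows. The helper name list_text_files and the ".txt" suffix are assumptions for illustration; the notebook does not state which extension the bulletins use.

```python
from pathlib import Path


def list_text_files(directory, suffix=".txt"):
    """Return the sorted paths of regular files in `directory` that end in `suffix`.

    Hypothetical helper: the ".txt" default is an assumption about how the
    bulletins are stored, not something the notebook confirms.
    """
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix == suffix  # skip subdirectories and other file types
    )
```

If the bulletins are plain .txt files, files_to_index = list_text_files(doc_dir) could stand in for the list comprehension. Sorting also makes the indexing order reproducible between runs.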

Initialise a Retriever. The Retriever narrows the DocumentStore down to the documents most relevant to the question, ranking them with the BM25 algorithm.

In [74]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
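To give a feel for what BM25 ranking does, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration only, not Haystack's implementation: the function name bm25_scores, the whitespace tokenisation, and the k1 and b defaults are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each string in `docs` against `query` with Okapi BM25.

    Illustrative sketch: lower-cased whitespace tokenisation and the
    k1/b defaults are assumptions, not Haystack's settings.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # a term missing from the document contributes nothing
            # Smoothed inverse document frequency; the "+ 1" keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalisation.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores
```

Rare query terms get a higher weight, repeated terms give diminishing returns, and long documents are penalised relative to the average length. A document that shares no terms with the query scores zero.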

Initialise the Reader. The Reader scans the retrieved documents and extracts the text span that best answers the question. Here it uses deepset/roberta-base-squad2, a RoBERTa question answering model.

In [75]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

Run the ready-made ExtractiveQAPipeline, which chains the Retriever and the Reader, to find answers to our questions in the indexed bulletins.

In [76]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

**Asking a question**

The pipeline is now ready to answer questions. Pass a query to the pipeline's run function to get a prediction. The params argument sets a top_k value for each node: how many documents the Retriever passes on, and how many answers the Reader returns.


In [77]:
query = "What proportion of firms think their costs will go up?"

In [78]:
prediction = pipe.run(
    query=query, 
    params={
        "Retriever": {"top_k": 1}, # pick single document
        "Reader": {"top_k": 1} # only one answer
    }
)

Print the answers. With details="all", every field of each Answer is shown, including its score, context and offsets.

In [79]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="all"
)

'Query: What proportion of firms think their costs will go up?'
'Answers:'
[   <Answer {'answer': 'more than one in six (18%) trading businesses expected to raise the prices of goods or services they sell in June 2023, down from 23% for May 2023. While the proportion of trading businesses that reported they expect prices will stay the same increased by 7 percentage points to 60%', 'type': 'extractive', 'score': 0.024097830057144165, 'context': 'more than one in six (18%) trading businesses expected to raise the prices of goods or services they sell in June 2023, down from 23% for May 2023. While the proportion of trading businesses that reported they expect prices will stay the same increased by 7 percentage points to 60%', 'offsets_in_document': [{'start': 722, 'end': 1004}], 'offsets_in_context': [{'start': 0, 'end': 282}], 'document_ids': ['d7ebe3df07971ba4333e9ca06b5126a8'], 'meta': {'_split_id': 3}}>]
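The answer above comes back with a score of roughly 0.02, so it is worth checking scores before trusting a result. Below is a minimal sketch of a filter over the prediction's "answers" list, using the answer and score fields visible in the output. The helper name confident_answers and the 0.5 threshold are assumptions, and the example objects are made-up stand-ins rather than real pipeline output.

```python
from types import SimpleNamespace


def confident_answers(prediction, min_score=0.5):
    """Return (answer, score) pairs whose score is at least `min_score`.

    Hypothetical helper: the 0.5 default is an arbitrary threshold chosen
    for illustration, not a recommended value.
    """
    return [
        (ans.answer, ans.score)
        for ans in prediction.get("answers", [])
        if ans.score is not None and ans.score >= min_score
    ]


# Made-up stand-ins that mimic the `answer` and `score` attributes of Answer objects.
example_prediction = {
    "answers": [
        SimpleNamespace(answer="low-scoring span", score=0.02),
        SimpleNamespace(answer="high-scoring span", score=0.91),
    ]
}
print(confident_answers(example_prediction))
```

Calling confident_answers(prediction) on the real result would show whether any answer clears the chosen threshold. When none does, raising the Retriever's top_k so that the Reader sees more than one document is one thing to try.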
