<a href="https://colab.research.google.com/github/UCREL/Session_2_Question-Answering-Information-Retrieval/blob/main/USS_QA_IR_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UCREL SUMMER SCHOOL USS 2024

## Session 2: Tutorial on QA and IR

- **Goal**: After completing this tutorial, you'll have learned how to build QA-IR pipelines that use extractive and generative models alongside retrievers to answer questions.

> This tutorial uses Haystack 2.0. To learn more, visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).



## Extractive Question Answering

What is extractive question answering? In short, extractive models pull verbatim answers out of text. This suits use cases where accuracy is paramount and you need to know exactly where in the text the answer came from.
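To make "verbatim" concrete, here is a minimal sketch in plain Python (not Haystack) of what an extractive answer amounts to: a character span inside the source text. The `Span` class and `locate` helper are hypothetical names used only for illustration, and plain string search stands in for the model that a real reader uses to predict the span.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical container for an extracted answer: text plus character offsets."""
    text: str
    start: int
    end: int


def locate(answer: str, source: str) -> Optional[Span]:
    """Return the first occurrence of `answer` in `source` as a Span, or None if absent."""
    start = source.find(answer)
    if start == -1:
        return None
    return Span(text=answer, start=start, end=start + len(answer))


source = "The Roman writer Pliny the Elder, writing in the first century AD"
span = locate("Roman writer", source)

# The offsets point back into the source, so the answer can always be traced.
print(span)
print(source[span.start:span.end])
```

Because the answer is a slice of the source, it can never contain words that the source does not contain, which is the property that separates extractive from generative QA.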

In this tutorial you'll create a pipeline that extracts answers to questions, based on the provided documents.

To get data into the extractive pipeline, you'll also build an indexing pipeline to ingest the [Wikipedia pages of Seven Wonders of the Ancient World dataset](https://en.wikipedia.org/wiki/Wonders_of_the_World).

## Preparing the Colab Environment

- Log in with your Google account
- Enable GPU Runtime in Colab (T4 is ok as well)
- Read and run the notebook cell by cell

# Installation


In [None]:
%%bash

pip install -q haystack-ai accelerate deepeval-haystack "sentence-transformers>=2.2.0" "datasets>=2.6.1" &> /dev/null

# Complete Pipeline

## Load data into the `DocumentStore`

Before you can use this data in the extractive pipeline, you'll use an indexing pipeline to fetch it, process it, and load it into the document store.


The data has already been cleaned and preprocessed, so turning it into Haystack `Documents` is fairly straightforward.

Using an `InMemoryDocumentStore` here keeps things simple. However, this general approach would work with [any document store that Haystack 2.0 supports](https://docs.haystack.deepset.ai/docs/document-store).

The `SentenceTransformersDocumentEmbedder` transforms each `Document` into a vector. Here we've used [`sentence-transformers/multi-qa-mpnet-base-dot-v1`](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1). You can substitute any embedding model you like, as long as you use the same one in your extractive pipeline.

Lastly, the `DocumentWriter` writes the vectorized documents to the `DocumentStore`.


In [None]:
from datasets import load_dataset
from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.readers import ExtractiveReader
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter


dataset = load_dataset("bilgeyucel/seven-wonders", split="train")

documents = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]

model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()

indexing_pipeline.add_component(instance=SentenceTransformersDocumentEmbedder(model=model), name="embedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
indexing_pipeline.connect("embedder.documents", "writer.documents")

indexing_pipeline.run({"documents": documents})

## Build an Extractive QA Pipeline

Your extractive QA pipeline will consist of three components: an embedder, retriever, and reader.

- The `SentenceTransformersTextEmbedder` turns a query into a vector, using the same embedding model defined above.

- Vector search allows the retriever to efficiently return relevant documents from the document store. Retrievers are tightly coupled with document stores; thus, you'll use an `InMemoryEmbeddingRetriever` to go with the `InMemoryDocumentStore`.

- The `ExtractiveReader` returns answers to that query, as well as their location in the source document, and a confidence score.
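The retrieval step above can be sketched without any model: once the query and the documents are vectors, retrieval is a ranking by similarity. The sketch below assumes dot-product similarity (the embedding model's name suggests it was tuned for it) and uses made-up three-dimensional vectors. `dot`, `retrieve`, `toy_store`, and `toy_query` are illustrative names, not Haystack API.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: List[float], store: Dict[str, List[float]], top_k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top_k (doc_id, score) pairs."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional embeddings; a real model produces hundreds of dimensions.
toy_store = {
    "pyramid": [0.9, 0.1, 0.0],
    "colossus": [0.1, 0.8, 0.2],
    "lighthouse": [0.0, 0.3, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

for doc_id, score in retrieve(toy_query, toy_store, top_k=2):
    print(f"{doc_id}: {score:.2f}")
```

This also shows why the query and the documents must share one embedding model: scores are only comparable when both vectors live in the same space.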


In [None]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.readers import ExtractiveReader
from haystack.components.embedders import SentenceTransformersTextEmbedder


retriever = InMemoryEmbeddingRetriever(document_store=document_store)
reader = ExtractiveReader()
reader.warm_up()

extractive_qa_pipeline = Pipeline()

extractive_qa_pipeline.add_component(instance=SentenceTransformersTextEmbedder(model=model), name="embedder")
extractive_qa_pipeline.add_component(instance=retriever, name="retriever")
extractive_qa_pipeline.add_component(instance=reader, name="reader")

extractive_qa_pipeline.connect("embedder.embedding", "retriever.query_embedding")
extractive_qa_pipeline.connect("retriever.documents", "reader.documents")

Try extracting some answers.

In [None]:
query = "Who was Pliny the Elder?"
output = extractive_qa_pipeline.run(
    data={"embedder": {"text": query}, "retriever": {"top_k": 3}, "reader": {"query": query, "top_k": 2}}
)


In [None]:
for i, answer in enumerate(output["reader"]["answers"]):
    print(f"Answer {i}: {answer.data}")
    # The final "no answer" entry has data=None and no document, so skip its context.
    if answer.data:
        context = answer.document.content.replace("\n", " ")
        print(f"Context {i}: {context}")
        print(f"Score {i}: {answer.score}\n")


Answer 0: Roman writer
Context 0: The Roman writer Pliny the Elder, writing in the first century AD, argued that the Great Pyramid had been raised, either "to prevent the lower classes from remaining unoccupied", or as a measure to prevent the pharaoh's riches from falling into the hands of his rivals or successors.[60] Pliny does not speculate as to the pharaoh in question, explicitly noting that "accident [has] consigned to oblivion the names of those who erected such stupendous memorials of their vanity".[61] In pondering how the stones could be transported to such a vast height he gives two explanations: That either vast mounds of nitre and salt were heaped up against the pyramid which were then melted away with water redirected from the river. Or, that "bridges" were constructed, their bricks afterwards distributed for erecting houses of private individuals, arguing that the level of the river is too low for canals to ever bring water up to the pyramid. Pliny also recounts how "in

## `ExtractiveReader`: a closer look

Here's an example answer:
```python
[ExtractedAnswer(query='Who was Pliny the Elder?', score=0.8306006193161011, data='Roman writer', document=Document(id=bb2c5f3d2e2e2bf28d599c7b686ab47ba10fbc13c07279e612d8632af81e5d71, content: 'The Roman writer Pliny the Elder, writing in the first century AD, argued that the Great Pyramid had...', meta: {'url': 'https://en.wikipedia.org/wiki/Great_Pyramid_of_Giza', '_split_id': 16}
```

The confidence score ranges from 0 to 1. Higher scores mean the model has more confidence in the answer's relevance.

The Reader sorts the answers by probability score, highest first. You can limit the number of answers the Reader returns with the optional `top_k` parameter.

By default, the Reader is initialized with `no_answer=True`. With this setting, it also returns an `ExtractedAnswer` with no text, whose score is the probability that none of the returned answers is correct.

```python
ExtractedAnswer(query='Who was Pliny the Elder?', score=0.04606167031102615, data=None, document=None, context=None, document_offset=None, context_offset=None, meta={})]}}
```

A score of `0.04606167031102615` means the model is fairly confident that the provided answers are correct in this case. You can disable this behavior and return only answers by setting the `no_answer` parameter to `False` when initializing your `ExtractiveReader`.
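One way to build intuition for the no-answer score is to assume it is the product of `1 - score` over the returned answers, which treats the answer scores as independent probabilities. The sketch below works under that assumption; check the `ExtractiveReader` documentation for the exact formula in your Haystack version. The second score, `0.73`, is a hypothetical value chosen for illustration, not taken from the run above.

```python
import math
from typing import List


def no_answer_score(answer_scores: List[float]) -> float:
    """Probability that every returned answer is wrong, treating the scores as independent."""
    return math.prod(1.0 - s for s in answer_scores)


# 0.83 is rounded from the example above; 0.73 is a hypothetical second score.
scores = [0.83, 0.73]
print(f"{no_answer_score(scores):.4f}")
```

Under this reading, the more confident the individual answers are, the closer the no-answer score gets to zero.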


# Sentence Level Retrieval

In [75]:
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetrieval
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore

splitter = DocumentSplitter(split_by="word", split_length=2)
text = (
        "This is a text with some words. There is a second sentence. And there is also a third sentence. "
        "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])


rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetrieval(document_store=doc_store, window_size=1))
rag.connect("bm25_retriever", "sentence_window_retriever")

output = rag.run({'bm25_retriever': {"query":"third"}})
print(output['sentence_window_retriever']['context_windows'][0])

is alsoa thirdsentence. It 
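The idea behind the output above is that the BM25 retriever finds the single chunk that matches the query, and the window retriever then returns that chunk together with its neighbours. Here is a minimal plain-Python sketch of that logic. `split_words` and `context_window` are hypothetical helpers, a substring check stands in for BM25, and chunks are joined with spaces, so whitespace may differ from what Haystack prints.

```python
from typing import List


def split_words(text: str, split_length: int) -> List[str]:
    """Group whitespace-separated words into chunks of `split_length` words."""
    words = text.split()
    return [" ".join(words[i:i + split_length]) for i in range(0, len(words), split_length)]


def context_window(chunks: List[str], hit: int, window_size: int) -> str:
    """Join the chunk at index `hit` with up to `window_size` neighbours on each side."""
    lo = max(0, hit - window_size)
    hi = min(len(chunks), hit + window_size + 1)
    return " ".join(chunks[lo:hi])


sample = (
    "This is a text with some words. There is a second sentence. "
    "And there is also a third sentence. It also contains a fourth sentence."
)
chunks = split_words(sample, split_length=2)

# Stand-in for the BM25 step: pick the first chunk that contains the query term.
hit = next(i for i, chunk in enumerate(chunks) if "third" in chunk)
print(context_window(chunks, hit, window_size=1))
```

Small chunks make the match precise, while the window restores enough surrounding text for a reader or generator to work with.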


# Reranking Pipeline

In [None]:
from haystack import Pipeline
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.rankers import TransformersSimilarityRanker

# Assuming 'document_store' is already defined and populated with your dataset

retriever = InMemoryBM25Retriever(document_store=document_store)
reranker = TransformersSimilarityRanker(model="cross-encoder/ms-marco-MiniLM-L-6-v2")  # Replace with your desired cross-encoder model

reranking_pipeline = Pipeline()
reranking_pipeline.add_component(instance=retriever, name="retriever")
reranking_pipeline.add_component(instance=reranker, name="ranker")
reranking_pipeline.connect("retriever.documents", "ranker.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1e96383c70>
🚅 Components
  - retriever: InMemoryBM25Retriever
  - ranker: TransformersSimilarityRanker
🛤️ Connections
  - retriever.documents -> ranker.documents (List[Document])
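The pipeline above is a two-stage design: a cheap retriever casts a wide net, then a more expensive ranker reorders the few candidates and keeps the best. The sketch below illustrates that pattern in plain Python under loud assumptions: term overlap stands in for BM25, Jaccard similarity stands in for the cross-encoder, and the corpus is a set of toy sentences. None of the names are Haystack API.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def terms(text: str) -> set:
    """Lowercased set of whitespace-separated tokens with basic punctuation stripped."""
    return {t.strip(".,?!").lower() for t in text.split()}


def overlap(query: str, doc: str) -> float:
    """Cheap first-stage score: number of query terms that occur in the document."""
    return float(len(terms(query) & terms(doc)))


def jaccard(query: str, doc: str) -> float:
    """Stand-in for the cross-encoder: shared terms divided by all distinct terms."""
    q, d = terms(query), terms(doc)
    return len(q & d) / len(q | d)


def top_k_by(scorer: Scorer, query: str, docs: List[str], top_k: int) -> List[str]:
    """Sort documents by `scorer` (highest first) and keep the best `top_k`."""
    return sorted(docs, key=lambda doc: scorer(query, doc), reverse=True)[:top_k]


corpus = [
    "The Colossus of Rhodes was a statue, and Rhodes was raided long after it fell.",
    "Rhodes was raided in 653 by an Arab force.",
    "The Great Pyramid was raised in Egypt.",
    "The lighthouse stood in Alexandria.",
]
toy_question = "When was Rhodes raided?"

candidates = top_k_by(overlap, toy_question, corpus, top_k=3)    # stage 1: broad and cheap
reranked = top_k_by(jaccard, toy_question, candidates, top_k=2)  # stage 2: narrow and precise
print(reranked)
```

The first stage cannot tell the two Rhodes sentences apart, while the second stage promotes the shorter, more focused one. This mirrors why the retriever's `top_k` should be at least as large as the ranker's.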

In [None]:
query = "When was Rhodes raided?"
output = reranking_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
                                   "ranker": {"query": query, "top_k": 2}})

In [None]:
for i, document in enumerate(output['ranker']['documents']):
  d = document.to_dict()
  print(f"Document {i}: {d['content']}")
  print(f"Link {i}: {d['url']}")
  print(f"Score {i}: {d['score']}\n")

Document 0: In 653, an Arab force under Muslim general Muawiyah I raided Rhodes, and according to the Chronicle of Theophanes the Confessor,[7] the remains of the statue constituted part of the booty, being melted down and sold to a Jewish merchant of Edessa who loaded the bronze onto 900 camels.[8] The same story is recorded by Bar Hebraeus, writing in Syriac in the 13th century in Edessa[25] (after the Arab pillage of Rhodes): "And a great number of men hauled on strong ropes which were tied around the brass Colossus which was in the city and pulled it down. And they weighed from it three thousand loads of Corinthian brass, and they sold it to a certain Jew from Emesa" (the Syrian city of Homs).[26]
Ultimately, Theophanes is the sole source of this account, and all other sources can be traced to him.[27] As Theophanes' source was Syriac, it may have had vague information about a raid and attributed the statue's demise to it, not knowing much more. Or the Arab destruction and the purp

Let's add a reader to the pipeline.

In [None]:
reranking_pipeline = Pipeline()

retriever = InMemoryBM25Retriever(document_store=document_store)
reranker = TransformersSimilarityRanker(model="cross-encoder/ms-marco-MiniLM-L-6-v2")  # Replace with your desired cross-encoder model
reader = ExtractiveReader()
reader.warm_up()

reranking_pipeline.add_component(instance=retriever, name="retriever")
reranking_pipeline.add_component(instance=reranker, name="ranker")
reranking_pipeline.add_component(instance=reader, name="reader")
reranking_pipeline.connect("retriever.documents", "ranker.documents")
reranking_pipeline.connect("ranker.documents", "reader.documents")

query = "When was Rhodes raided?"
output = reranking_pipeline.run(
    data={
        "retriever": {"query": query, "top_k": 3},
        "ranker": {"query": query, "top_k": 2},
        "reader": {"query": query, "top_k": 2},
    }
)

In [None]:
for i, answer in enumerate(output["reader"]["answers"]):
    print(f"Answer {i}: {answer.data}")
    # The final "no answer" entry has data=None and no document, so skip its context.
    if answer.data:
        context = answer.document.content.replace("\n", " ")
        print(f"Context {i}: {context}")
        print(f"Score {i}: {answer.score}\n")


Answer 0: 653
Context 0: In 653, an Arab force under Muslim general Muawiyah I raided Rhodes, and according to the Chronicle of Theophanes the Confessor,[7] the remains of the statue constituted part of the booty, being melted down and sold to a Jewish merchant of Edessa who loaded the bronze onto 900 camels.[8] The same story is recorded by Bar Hebraeus, writing in Syriac in the 13th century in Edessa[25] (after the Arab pillage of Rhodes): "And a great number of men hauled on strong ropes which were tied around the brass Colossus which was in the city and pulled it down. And they weighed from it three thousand loads of Corinthian brass, and they sold it to a certain Jew from Emesa" (the Syrian city of Homs).[26] Ultimately, Theophanes is the sole source of this account, and all other sources can be traced to him.[27] As Theophanes' source was Syriac, it may have had vague information about a raid and attributed the statue's demise to it, not knowing much more. Or the Arab destruction

# Generative QA

Large language models (LLMs) are best suited to generative QA. In this example we use Flan-T5-Large because of computational limits.

In [87]:
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceLocalGenerator
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-large",
    task="text2text-generation",
    generation_kwargs={"max_new_tokens": 100, "temperature": 0.9})

generator.warm_up()


query = "When was Rhodes raided?"

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}
"""
pipe = Pipeline()

pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", generator)
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

res=pipe.run({
    "prompt_builder": {
        "query": query
    },
    "retriever": {
        "query": query
    }
})

Token indices sequence length is longer than the specified maximum sequence length for this model (2760 > 512). Running this sequence through the model will result in indexing errors
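The warning above says the rendered prompt is 2760 tokens long while this model accepts 512, so most of the retrieved context cannot be used as intended. One simple mitigation is to pass a smaller `top_k` to the retriever at run time, as the earlier cells do. Another, sketched below, is to keep only as many ranked documents as fit a budget. The sketch assumes word counts are an acceptable rough proxy for token counts; exact counts would come from the model's tokenizer. `fit_to_budget` and `max_words` are hypothetical names, not Haystack API.

```python
from typing import List


def fit_to_budget(docs: List[str], max_words: int) -> List[str]:
    """Keep documents in rank order until adding the next one would exceed `max_words`."""
    kept: List[str] = []
    used = 0
    for doc in docs:
        n = len(doc.split())
        if used + n > max_words:
            break
        kept.append(doc)
        used += n
    return kept


# Three dummy "documents" of 200, 150, and 300 words, already in rank order.
ranked_docs = ["alpha " * 200, "beta " * 150, "gamma " * 300]
print(len(fit_to_budget(ranked_docs, max_words=400)))
```

Stopping at the first document that does not fit, rather than skipping it, preserves the ranking: lower-ranked text never displaces higher-ranked text.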


In [88]:
print(res['llm']['replies'][0])

In 653


### API Call (FREE LLMs)

[Log in](https://huggingface.co/login) to your Hugging Face account ([create](https://huggingface.co/join) one if you don't have one already).

[Generate a new API key](https://huggingface.co/settings/tokens/new) of READ type, then copy and paste it into the code below.




In [96]:
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret
generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
                                    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                    token=Secret.from_token("your_key_here"))

You can use other models, e.g., `mistralai/Mistral-7B-v0.1`.

For Mistral and other models, you need to go to the [model page](https://huggingface.co/mistralai/Mistral-7B-v0.1) and agree to the terms of use before using them.


In [97]:
query = "When was Rhodes raided?"

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}
"""
pipe = Pipeline()

pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", generator)
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

res=pipe.run({
    "prompt_builder": {
        "query": query
    },
    "retriever": {
        "query": query
    }
})

In [85]:
print(res['llm']['replies'][0])



Answer: In 653, an Arab force under Muslim general Muawiyah I raided Rhodes.


# TASK: Bring Your Own Knowledge Base and Build Your Chatbot
Add PDF files to your Colab space and query them.

In [None]:
file_names=['file1.pdf','file2.pdf','file3.pdf']

In [None]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
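# PDFMinerToDocument requires the pdfminer.six package (pip install pdfminer.six),
# which the installation cell at the top of this notebook does not include.
from haystack.components.converters import PDFMinerToDocument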
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", PDFMinerToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": file_names}})

In [None]:
#
# COMPLETE THIS
# BY ADDING YOUR
# QA PIPELINE
#