### Weaviate Tutorial

This notebook follows the Weaviate tutorial in [this Colab notebook](https://colab.research.google.com/github/semi-technologies/weaviate-examples/blob/main/harrypotter-qa-haystack-weaviate/COLAB-HarryPotter-QA-Haystack-Weaviate.ipynb).

In [23]:
from haystack.document_stores import WeaviateDocumentStore
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import EmbeddingRetriever
from haystack.nodes import FARMReader
from haystack.utils import launch_weaviate
from haystack.utils import clean_wiki_text
from haystack.utils import print_answers

import pandas as pd

### Load in the Data

The tutorial uses the [Harry Potter Wiki](https://harrypotter.fandom.com/wiki/Main_Page), which its authors have loaded into an S3 bucket as a CSV file.

In [2]:
harry_potter_df = pd.read_csv("https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/harry_potter_wiki.csv")

In [3]:
harry_potter_df.shape

(13674, 6)

harry_potter_df.head()

In [4]:
harry_potter_df["text"][0][0:100]

'Gryffindor\n\nGryffindor is one of the four Houses of Hogwarts School of Witchcraft and Wizardry and w'

The tutorial cleans the data with `clean_wiki_text`, a Haystack utility function.

In [5]:
help(clean_wiki_text)

Help on function clean_wiki_text in module haystack.utils.cleaning:

clean_wiki_text(text: str) -> str
    Clean wikipedia text by removing multiple new lines, removing extremely short lines,
    adding paragraph breaks and removing empty paragraphs
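
To make the docstring above more concrete, here is a minimal sketch of that kind of cleaning. `simple_clean` and its `min_line_chars` threshold are hypothetical names made up for illustration; this is not Haystack's implementation and its rules may well differ.

```python
import re


def simple_clean(text: str, min_line_chars: int = 3) -> str:
    """Hypothetical, simplified stand-in for a wiki-text cleaner (not Haystack's code)."""
    lines = [line.strip() for line in text.split("\n")]
    # Keep blank lines (they mark paragraph breaks) but drop extremely short lines
    kept = [line for line in lines if line == "" or len(line) >= min_line_chars]
    # Collapse runs of blank lines into a single paragraph break
    collapsed = re.sub(r"\n{2,}", "\n\n", "\n".join(kept))
    # Remove leading/trailing empty paragraphs
    return collapsed.strip()


sample = "Gryffindor\n\n\n\nx\nGryffindor is one of the four Houses.\n\n\n"
print(repr(simple_clean(sample)))
```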



### Convert the Data into Required Format

One of the key data structures used in Haystack is the `Document`. It encapsulates the content of a document along with its associated metadata, and typically contains:
- The text of the document.
- Metadata like the document's name, source, or any other custom fields.
- Optionally, embeddings that represent the content in a dense vector format.

The Document class is used within the Haystack framework for various tasks like indexing, retrieval, and answering questions. It provides a standardized way to handle documents across different stages of the information retrieval and question-answering processes.
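
The shape described above can be sketched with a plain dataclass. `SimpleDocument` is a hypothetical stand-in written for this note, not Haystack's `Document` class; it only mirrors the three fields listed (content, metadata, optional embedding) and the dictionary layout used in the next cell.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the fields described above (not Haystack's class)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None  # filled in later by an embedding model

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "SimpleDocument":
        # Same keys as the dictionaries built for the document store: 'content' and 'meta'
        return cls(content=d["content"], meta=d.get("meta", {}), embedding=d.get("embedding"))


doc = SimpleDocument.from_dict({
    "content": "Each goal is worth ten points.",
    "meta": {"name": "Chaser", "url": "https://harrypotter.fandom.com/wiki/Chaser"},
})
print(doc.meta["name"], doc.embedding)
```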

In [6]:
harry_potter_dicts = [
    # row['name'] is used because row.name would return the row's index label
    {'content': clean_wiki_text(row.text), 'meta': {'name': row['name'], 'url': row.url}}
    for ix, row in harry_potter_df.iterrows()
]

### Loading the Data into a Vector Database

A *vector database* is a database designed to efficiently store and retrieve high-dimensional data. Vector databases are often used for similarity search in applications such as natural language processing, where text is modelled as high-dimensional vectors. [Weaviate](https://weaviate.io/developers/weaviate) is an open-source vector database.
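
As a rough illustration of what similarity search means, the sketch below does a brute-force nearest-neighbour lookup over a toy in-memory store. The three-dimensional vectors and the `top_k` helper are invented for illustration; a real vector database would use much higher-dimensional embeddings and an approximate index rather than scanning every vector.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, made up for illustration only
toy_store = {
    "Quidditch": [0.9, 0.1, 0.0],
    "Golden Snitch": [0.8, 0.3, 0.1],
    "Potions": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print([name for name, _ in results])
```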

There are multiple ways to host a Weaviate vector database, such as self-hosting it in a container or using the managed service. To follow the tutorial referenced above, this notebook uses a local instance.

Note that the Docker daemon needs to be running for the cell below to work.

In [7]:
launch_weaviate()

In [63]:
document_store = WeaviateDocumentStore(recreate_index=True)

In [64]:
document_store.write_documents(documents=harry_potter_dicts, batch_size=100)

Document id 48e81d9a67fd4e0221485586711cc5f0 is not in uuid format. Such ids will be replaced by uuids, in this case c03efb9b-d5fc-9825-35ad-918a543aa525.
No embedding found in Document object being written into Weaviate. A dummy embedding is being supplied so that indexing can still take place. This embedding should be overwritten in order to perform vector similarity searches.
            multi-threading. Setting `batch_size` in `client.batch.configure()`  to an int value will enabled automatic
            batching. See:
            https://weaviate.io/developers/weaviate/current/restful-api-references/batch.html#example-request-1
13700it [00:40, 337.69it/s]                                                                                                      


### Inspecting the Database

Weaviate provides a [console](https://console.semi.technology/console/query) which can be used to inspect the database. Because the database has been spun up locally, the default URL is `http://localhost:8080`. This can be specified when connecting the Weaviate Console to a local instance.

#### Example Queries

GraphQL can be used to query the Weaviate DB. One option to test this out is through the Weaviate Console mentioned above.

- Example GraphQL query to get the name and content of the first 5 records:
```
{
  Get {
    Document (
       limit: 5
    )
    {
      name
      content
    }
  }
}
```
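
The same query can also be sent from Python instead of the console. This is a sketch only: it assumes the local instance is reachable at `http://localhost:8080` and exposes its GraphQL endpoint at `/v1/graphql`; `build_graphql_request` and `run_query` are helper names introduced here for illustration.

```python
import json
import urllib.request

WEAVIATE_URL = "http://localhost:8080"  # default local address mentioned above


def build_graphql_request(limit: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the GraphQL query."""
    query = "{ Get { Document(limit: %d) { name content } } }" % limit
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        WEAVIATE_URL + "/v1/graphql",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(limit: int = 5) -> dict:
    """Send the query; this only works while the local Weaviate instance is running."""
    with urllib.request.urlopen(build_graphql_request(limit)) as response:
        return json.load(response)


request = build_graphql_request(limit=5)
print(request.full_url)
# With the database running, the records could then be fetched with:
# print(run_query()["data"]["Get"]["Document"])
```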

### Adding Vectors into the DB

Now that the documents are in the vector DB, the next step is to create an embedding for each one and add it to the store, which enables features like efficient similarity search.

*Aside* - Haystack uses the concepts of a **Reader** and a **Retriever**. 

The **Reader** is a model which reads the contents of a set of given documents and, given a question, extracts relevant short passages or answers from them.

The **Retriever** is a model which can quickly find a relevant set of documents from a large corpus given a query through techniques such as vector search. It can also be used to create the vectors to go into the Vector DB
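
The division of labour between the two can be sketched with a toy pipeline. Everything here is a deliberately crude stand-in: word overlap replaces both the vector search and the neural reader, the corpus sentences are made up, and `toy_retrieve` and `toy_read` are hypothetical names, not Haystack functions.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-cased set of words with basic punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


def toy_retrieve(query: str, corpus: Dict[str, str], top_k: int = 2) -> List[str]:
    """Retriever stand-in: rank whole documents by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda name: len(q & tokenize(corpus[name])), reverse=True)
    return ranked[:top_k]


def toy_read(query: str, documents: List[str]) -> str:
    """Reader stand-in: pick the single sentence that best matches the query."""
    q = tokenize(query)
    sentences = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & tokenize(s)))


# Made-up toy corpus, for illustration only
corpus = {
    "Chaser": "Chasers try to get the Quaffle through the goal hoops. Each goal is worth ten points.",
    "Seeker": "The Seeker tries to catch the Golden Snitch. Catching the Golden Snitch ends the game.",
    "Potions": "Potions is a core subject taught at Hogwarts.",
}
question = "Who tries to catch the Golden Snitch?"
retrieved = toy_retrieve(question, corpus, top_k=2)                 # cheap pass over the whole corpus
answer = toy_read(question, [corpus[name] for name in retrieved])   # closer look at a few documents
print(retrieved, "->", answer)
```

The point of the split is cost: the retriever narrows a large corpus down to a handful of candidates cheaply, so the more expensive reader only has to process those.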

In the tutorial, an [`EmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/retriever#embedding-retrieval-recommended) model is used to create the embeddings.

In [65]:
MODEL_FORMAT = "sentence_transformers"
EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

In [66]:
retriever = EmbeddingRetriever(
    document_store=document_store, 
    model_format=MODEL_FORMAT,
    embedding_model=EMBEDDING_MODEL
)

You seem to be using sentence-transformers/multi-qa-mpnet-base-dot-v1 model with the cosine function instead of the recommended dot_product. This can be set when initializing the DocumentStore
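
The warning above matters because cosine similarity and dot product can rank the same vectors differently: cosine ignores vector length, while dot product rewards it. A small sketch with made-up two-dimensional vectors (not real embeddings):

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 1.0]
short_aligned = [0.5, 0.5]  # same direction as the query, small magnitude
long_offaxis = [3.0, 0.0]   # different direction, large magnitude

# Dot product prefers the long vector, cosine prefers the aligned one
print("dot:   ", dot(query, short_aligned), dot(query, long_offaxis))
print("cosine:", round(cosine(query, short_aligned), 3), round(cosine(query, long_offaxis), 3))
```

A model trained with dot-product scoring, as the `-dot-` in this model's name suggests, may therefore retrieve differently when the store compares vectors by cosine.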


In [67]:
document_store.update_embeddings(retriever)

WeaviateDocumentStoreError: Query results contain errors: [{'locations': [{'column': 6, 'line': 1}], 'message': 'explorer: list class: search: invalid pagination params: query maximum results exceeded', 'path': ['Get', 'Document']}]
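
One plausible reading of this error is that Weaviate caps how far a query may page into a class (the `QUERY_MAXIMUM_RESULTS` setting, assumed here to be 10,000), while the corpus has 13,674 rows. The sketch below only illustrates that arithmetic; `MAX_RESULTS`, `PAGE_SIZE`, and `pages` are assumptions for illustration, not Haystack or Weaviate internals.

```python
from typing import Iterator, Tuple

MAX_RESULTS = 10_000  # assumed value of Weaviate's QUERY_MAXIMUM_RESULTS
TOTAL_DOCS = 13_674   # rows in harry_potter_df
PAGE_SIZE = 1_000     # hypothetical page size used while iterating over documents


def pages(total: int, page_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that together cover `total` documents."""
    for offset in range(0, total, page_size):
        yield offset, min(page_size, total - offset)


# Pages whose offset + limit would reach past the assumed cap
blocked = [(offset, limit) for offset, limit in pages(TOTAL_DOCS, PAGE_SIZE)
           if offset + limit > MAX_RESULTS]
print(blocked[0], len(blocked))
```

If that reading is right, raising the cap on the Weaviate container or indexing fewer documents should avoid the error, though neither was tried here.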

### Creating the QA Pipeline

In [51]:
# Creating a Reader component as defined above
READER_MODEL = "deepset/tinyroberta-squad2"
reader = FARMReader(model_name_or_path=READER_MODEL, use_gpu=True)

In [19]:
question_answering_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

### Trying Out the Question Answering System

In [34]:
QUESTION = "How many points is catching the Golden Snitch worth?"

In [35]:
prediction = question_answering_pipeline.run(query=QUESTION, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.60it/s]
Inferencing Samples: 100%|███████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.83s/ Batches]


In [36]:
print_answers(prediction)

'Query: How many points is catching the Golden Snitch worth?'
'Answers:'
[   <Answer {'answer': 'ten', 'type': 'extractive', 'score': 0.0017297114245593548, 'context': 'empt to get it through the goal hoops past the Keeper. Each goal is worth ten points. This makes them similar to the forwards in football, as the game', 'offsets_in_document': [{'start': 272, 'end': 275}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_ids': ['39c43bff-d817-681e-e218-dac405cccb43'], 'meta': {'name': 'Chaser', 'url': 'https://harrypotter.fandom.com/wiki/Chaser'}}>,
    <Answer {'answer': '170*to 20', 'type': 'extractive', 'score': 0.0008021701360121369, 'context': 'idditch World Cup - BULGARIA VERSUS NORWAY Quarter-final: Bulgaria won 170*to 20, one of the biggest upsets of the tournament.\n\n\n==Behind the scenes==', 'offsets_in_document': [{'start': 1019, 'end': 1028}], 'offsets_in_context': [{'start': 71, 'end': 80}], 'document_ids': ['48ebacf8-9dec-8cba-1ff1-6c036684b0f4'], 'meta': {'nam

### Trying Out a Different Similarity Metric

In [44]:
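SIMILARITY_METRIC = "dot_product"
# Note: recreate_index=True drops any existing index, so this store starts empty
# until write_documents is called on it again
document_store_dot_product = WeaviateDocumentStore(similarity=SIMILARITY_METRIC, recreate_index=True)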

In [45]:
retriever = EmbeddingRetriever(
    document_store=document_store_dot_product, 
    model_format=MODEL_FORMAT,
    embedding_model=EMBEDDING_MODEL
)

You seem to be using sentence-transformers/multi-qa-mpnet-base-dot-v1 model with the dot function instead of the recommended dot_product. This can be set when initializing the DocumentStore


In [56]:
document_store_dot_product.update_embeddings(retriever)

In [57]:
question_answering_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

In [58]:
QUESTION = "How many points is catching the Golden Snitch worth?"

In [59]:
prediction = question_answering_pipeline.run(query=QUESTION, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.26it/s]


In [60]:
print_answers(prediction)

'Query: How many points is catching the Golden Snitch worth?'
'Answers:'
[]
