### Weaviate Tutorial

This notebook follows the Weaviate tutorial in [this Colab notebook](https://colab.research.google.com/github/semi-technologies/weaviate-examples/blob/main/harrypotter-qa-haystack-weaviate/COLAB-HarryPotter-QA-Haystack-Weaviate.ipynb).

In [15]:
from haystack.document_stores import WeaviateDocumentStore
from haystack.utils import launch_weaviate
from haystack.utils import clean_wiki_text
import pandas as pd

### Load in the Data

The tutorial uses the [Harry Potter Wiki](https://harrypotter.fandom.com/wiki/Main_Page), which has been loaded into an S3 bucket as a CSV file.

In [2]:
harry_potter_df = pd.read_csv("https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/harry_potter_wiki.csv")

In [3]:
harry_potter_df.shape

(13674, 6)

harry_potter_df.head()

In [4]:
harry_potter_df["text"][0][0:100]

'Gryffindor\n\nGryffindor is one of the four Houses of Hogwarts School of Witchcraft and Wizardry and w'

The tutorial uses `clean_wiki_text`, a Haystack utility function, to clean up the data.

In [5]:
help(clean_wiki_text)

Help on function clean_wiki_text in module haystack.utils.cleaning:

clean_wiki_text(text: str) -> str
    Clean wikipedia text by removing multiple new lines, removing extremely short lines,
    adding paragraph breaks and removing empty paragraphs
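
To make the docstring more concrete, here is a rough sketch of this kind of cleaning. It is not Haystack's implementation: `rough_clean` and its `min_line_length` threshold are hypothetical names chosen for illustration, and the real `clean_wiki_text` may behave differently.

```python
import re


def rough_clean(text: str, min_line_length: int = 4) -> str:
    """Hypothetical approximation of wiki-text cleaning (not Haystack's code)."""
    # Strip surrounding whitespace from every line.
    lines = [line.strip() for line in text.split("\n")]
    # Drop very short lines, but keep blank lines as paragraph separators.
    kept = [line for line in lines if line == "" or len(line) >= min_line_length]
    # Collapse runs of three or more newlines into a single paragraph break.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return cleaned.strip()


sample = "Gryffindor\n\n\n\nok\nGryffindor is one of the four Houses of Hogwarts.\n\n\n"
print(repr(rough_clean(sample)))
```

The threshold of 4 characters is an arbitrary assumption; it is only there to show the idea of filtering out stray fragments.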



### Convert the Data into Required Format

One of the key data structures in Haystack is the `Document`, which encapsulates the content of a document along with its associated metadata. It typically contains:
- The text of the document.
- Metadata like the document's name, source, or any other custom fields.
- Optionally, embeddings that represent the content in a dense vector format.

The Document class is used within the Haystack framework for various tasks like indexing, retrieval, and answering questions. It provides a standardized way to handle documents across different stages of the information retrieval and question-answering processes.
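
As a mental model of the three parts listed above, the sketch below uses a simplified stand-in class. `SimpleDocument` is a hypothetical name and is not Haystack's `Document` class; the example URL is a placeholder as well.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for illustration; not Haystack's Document."""
    content: str                                        # the text of the document
    meta: Dict[str, Any] = field(default_factory=dict)  # name, source, custom fields
    embedding: Optional[List[float]] = None             # dense vector, added later


doc = SimpleDocument(
    content="Gryffindor is one of the four Houses of Hogwarts.",
    meta={"name": "Gryffindor", "url": "https://example.com/Gryffindor"},  # placeholder URL
)
print(doc.meta["name"], doc.embedding)
```

The embedding starts out empty here because, as in this notebook, documents are often written to the store first and embedded in a later step.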

In [23]:
harry_potter_dicts = [
    # row['name'] is used because row.name would return the row's index label
    {'content': clean_wiki_text(row.text), 'meta': {'name': row['name'], 'url': row.url}}
    for ix, row in harry_potter_df.iterrows()
]

### Loading the Data into a Vector Database

A *vector database* is a database designed to efficiently store and retrieve high-dimensional data. Vector databases are often used for similarity search in applications such as natural language processing, where text is modelled as high-dimensional vectors. [Weaviate](https://weaviate.io/developers/weaviate) is an open-source vector database.
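
To illustrate what similarity search means, here is a minimal brute-force sketch using cosine similarity over a toy in-memory index. The vectors, the `toy_index` dictionary, and the `nearest` helper are all made up for illustration; this is not how Weaviate works internally, since production vector databases typically rely on approximate nearest-neighbour indexes rather than scanning every vector.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: List[float], index: Dict[str, List[float]], top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top_k matches."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional "embeddings"; real text embeddings have hundreds of dimensions.
toy_index = {
    "Gryffindor": [0.9, 0.1, 0.0],
    "Slytherin": [0.8, 0.2, 0.1],
    "Quidditch": [0.0, 0.1, 0.9],
}
results = nearest([1.0, 0.0, 0.0], toy_index)
print(results)
```

The two house vectors point in nearly the same direction as the query, so they rank above the unrelated "Quidditch" vector.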

There are multiple ways to host a Weaviate vector database, such as self-hosting it in a container or using Weaviate's managed service. To follow the tutorial referenced above, this notebook uses a local instance.

Note that the Docker daemon needs to be running for the cell below to work.

In [18]:
launch_weaviate()

Unable to find image 'semitechnologies/weaviate:latest' locally
latest: Pulling from semitechnologies/weaviate
579b34f0a95b: Already exists
bf5dbc62c20f: Pulling fs layer
f5dd2b338fac: Pulling fs layer
626b3aa8d35f: Pulling fs layer
64adedbebeae: Pulling fs layer
64adedbebeae: Waiting
bf5dbc62c20f: Download complete
bf5dbc62c20f: Pull complete
626b3aa8d35f: Verifying Checksum
626b3aa8d35f: Download complete
64adedbebeae: Verifying Checksum
64adedbebeae: Download complete
f5dd2b338fac: Verifying Checksum
f5dd2b338fac: Download complete
f5dd2b338fac: Pull complete
626b3aa8d35f: Pull complete
64adedbebeae: Pull complete
Digest: sha256:a63841845be2b818d822c1164a3fcaf2ca4ab604d30646a4f59660977e4768a6
Status: Downloaded newer image for semitechnologies/weaviate:latest


a88f9a91fa4342063009d8870bdc6747bc97099f5da8e8aef0248d39ea4bdc14


In [19]:
document_store = WeaviateDocumentStore()

In [22]:
document_store.write_documents(documents=harry_potter_dicts, batch_size=100)

Document id 48e81d9a67fd4e0221485586711cc5f0 is not in uuid format. Such ids will be replaced by uuids, in this case c03efb9b-d5fc-9825-35ad-918a543aa525.
No embedding found in Document object being written into Weaviate. A dummy embedding is being supplied so that indexing can still take place. This embedding should be overwritten in order to perform vector similarity searches.
            multi-threading. Setting `batch_size` in `client.batch.configure()`  to an int value will enabled automatic
            batching. See:
            https://weaviate.io/developers/weaviate/current/restful-api-references/batch.html#example-request-1
13700it [00:25, 542.03it/s]                                                                                                      
