# Loading data into the Weaviate vector database

Prerequisites:

1. Weaviate running on AKS
    This notebook assumes you have a Weaviate instance running. https://weaviate.io/developers/weaviate/installation
2. Azure OpenAI endpoint
    Provision a commercial Azure OpenAI endpoint with two deployments: an embedding model and an LLM. https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/create-resource?pivots=web-portal

This demo takes research PDFs from blob storage, loads and splits them into chunks, embeds the chunks with the given embedding model, and writes them to Weaviate.

Below is a list of the necessary packages:
- weaviate-client is for connecting to Weaviate from Python.
- tiktoken is a dependency of OpenAIEmbeddings.
- openai is for connecting to your Azure OpenAI (AOAI) instance.
- langchain is a language model integration framework.
- pymupdf is a PDF parsing library.

In [None]:
%pip install weaviate-client
%pip install tiktoken
%pip install openai[datalib]
%pip install langchain
%pip install pymupdf

## Load and split data archive PDFs from blob storage

This notebook uses the dataset below as an example. You can substitute your own dataset.

In [None]:
BLOB_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=demodatasetsp;AccountKey=QVFgIKPiWB+8f0mH+F7fidVLG7wq1S3WhtAqXOWaMWtr6fZ4frhVgmUzgBSdkmw4VsjoEAo7C2Hn+ASt2Cc5HA==;EndpointSuffix=core.windows.net"
BLOB_SAS_TOKEN="?sv=2022-11-02&ss=bf&srt=sco&sp=rltfx&se=2024-10-02T01:02:07Z&st=2023-08-03T17:02:07Z&spr=https&sig=gLxStXFSY6X29OPpPDpBEhoQDdtJNDrMVExNYJ%2BhmBQ%3D"
BLOB_CONTAINER_NAME = "arxivcs"
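
If you substitute your own storage account, one option is to read the settings from environment variables instead of pasting them into the notebook. The following is a minimal sketch under that assumption; `read_setting` and the variable names are hypothetical and not part of the original notebook.

```python
import os
from typing import Optional


def read_setting(name: str, default: Optional[str] = None) -> str:
    """Return the environment variable `name`, or `default` if it is unset.

    Raises RuntimeError when the variable is unset and no default is given.
    """
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable name; falls back to the example container used above.
BLOB_CONTAINER_NAME = read_setting("BLOB_CONTAINER_NAME", "arxivcs")

# Secrets get no default, so a missing value fails loudly:
# BLOB_CONNECTION_STRING = read_setting("BLOB_CONNECTION_STRING")
```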

In [None]:
%pip install unstructured
%pip install "unstructured[pdf]"

In [None]:
# Load PDFs from the arxivcs storage container.

from langchain.document_loaders import AzureBlobStorageContainerLoader

loader = AzureBlobStorageContainerLoader(
    conn_str=BLOB_CONNECTION_STRING,
    container=BLOB_CONTAINER_NAME,
    prefix="000",  # only load blobs whose names start with "000"
)

docs = loader.load_and_split()
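
`load_and_split()` loads each blob and then cuts the text into smaller chunks so that each chunk can be embedded separately. To illustrate the idea only, here is a minimal sketch of fixed-size splitting with overlap. It is not LangChain's splitter, which is more elaborate and prefers to break on separators such as paragraph boundaries; the function name and the default sizes are assumptions made for this example.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    chunk boundary still appears in full in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```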

## Set the embedding parameters

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    deployment="YOUR_DEPLOYMENT",
    model="YOUR_EMBEDDING_MODEL",
    openai_api_base="YOUR_URL",
    openai_api_type="azure",
    openai_api_key="YOUR_KEY",
    chunk_size=16,  # maximum number of texts sent per embedding request
)
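
Note that `chunk_size` here is the maximum number of texts sent to the embedding endpoint in one request, not the size of a document chunk. The value 16 presumably reflects a per-request input limit on the Azure deployment; that is an assumption. The sketch below shows the batching idea in plain Python, with a hypothetical `batched` helper standing in for what the embeddings class does internally.

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# 40 hypothetical document chunks would be sent in three requests.
chunks = [f"chunk {i}" for i in range(40)]
batch_sizes = [len(batch) for batch in batched(chunks)]
print(batch_sizes)  # [16, 16, 8]
```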

## Write to authenticated Weaviate

Note: To use the internal IP of the VM hosting the Docker instance, your compute instance has to be in the same VNet.
Example: https://10.0.0.4:8080

For an AKS cluster, navigate to the resource -> Services and ingresses -> click on your Weaviate service, then use the endpoint, which will look like the example above.

In [None]:
from langchain.vectorstores import Weaviate
import weaviate

WEAVIATE_URL = "YOUR_URL"  # example: "http://10.244.3.20:8080"
WEAVIATE_API_KEY = "YOUR_KEY"

client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY),
)
# by_text=False makes searches use the query's embedding vector instead of Weaviate's text search.
vectorstore = Weaviate.from_documents(docs, embeddings, client=client, by_text=False)

Verify that the documents were loaded. Each document has the following format:

    Document(
        page_content="Content",
        metadata={"metadata"},
    ),
    
You can query on either the page content or the metadata.

The following is just an example query using similarity search. Queries can be run in other ways as well, for example by favoring relevance or variety in the results, or by limiting the number of retrieved documents. https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore

In [None]:
query = "What do you know about quantum mechanics?"
vectorstore_output = vectorstore.similarity_search(query)
vectorstore_output
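
Conceptually, a similarity search embeds the query with the same embedding model and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that ranking step with cosine similarity on toy two-dimensional vectors. It is not how Weaviate implements the search (Weaviate uses an index rather than comparing against every vector), and the helper names, the toy vectors, and `k=2` are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k stored vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" standing in for three stored chunks and one query.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]
best = top_k(query_vector, doc_vectors, k=2)
print([index for index, _ in best])  # [0, 2]
```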

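If needed, uncomment the final cell to clear the loaded data. Note that `client.schema.delete_all()` deletes every class in the Weaviate schema along with its objects, not only the data loaded by this notebook.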

In [None]:
#client.schema.delete_all()