# Writing data to the Vector Database

Prerequisites:

1. A running Weaviate instance. See https://weaviate.io/developers/weaviate/installation
2. A commercial Azure OpenAI endpoint that is provisioned and has a model deployed.

This demo loads PDF files from Azure Blob Storage, generates embeddings for them, and writes them to Weaviate.

Below is a list of the necessary packages:
- weaviate-client connects to Weaviate from Python.
- tiktoken is a dependency of OpenAIEmbeddings.
- openai connects to your Azure OpenAI (AOAI) instance.
- langchain is a language model integration framework.

In [None]:
%pip install weaviate-client
%pip install tiktoken
%pip install openai[datalib]
%pip install langchain
%pip install pymupdf

## Load data from arXiv PDFs in blob storage

In [None]:
import os

# Read credentials from the environment instead of hardcoding them (variable names are assumptions).
BLOB_CONNECTION_STRING = os.environ["BLOB_CONNECTION_STRING"]
BLOB_SAS_TOKEN = os.environ["BLOB_SAS_TOKEN"]
BLOB_CONTAINER_NAME = "arxivcs"
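An Azure Storage connection string is a semicolon-separated list of `key=value` settings. The following is a minimal sketch, in plain Python, of checking that such a string contains the settings a loader would need before using it. The helper names `parse_connection_string` and `missing_keys` are hypothetical, and the example value is a placeholder, not a real credential.

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split an Azure Storage connection string into a dict of its settings."""
    settings = {}
    for part in conn_str.split(";"):
        if not part:
            continue  # tolerate a trailing semicolon
        # Split on the first "=" only: base64 account keys end in "=" padding.
        key, _, value = part.partition("=")
        settings[key] = value
    return settings


def missing_keys(conn_str: str, required=("AccountName", "AccountKey")) -> list:
    """Return the required settings that are absent or empty."""
    settings = parse_connection_string(conn_str)
    return [key for key in required if not settings.get(key)]


# Placeholder value for illustration only.
example = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=abc123==;EndpointSuffix=core.windows.net"
)
print(missing_keys(example))
```

Running a check like this before creating the loader can turn a confusing authentication failure into a clear message about which setting is missing.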

In [None]:
%pip install unstructured
%pip install "unstructured[pdf]"

In [None]:
# Load a single PDF (left commented out for reference)
# from langchain.document_loaders import AzureBlobStorageFileLoader

# loader = AzureBlobStorageFileLoader(
#     conn_str=BLOB_CONNECTION_STRING,
#     container=BLOB_CONTAINER_NAME,
#     blob_name="0001/0001001v1.pdf",
# )

# docs = loader.load()

In [None]:
# Load the PDFs in the arxivcs container whose blob names start with "000" (the first 10000 PDFs).
# To load more, repeat with the prefix "001", then "01".

from langchain.document_loaders import AzureBlobStorageContainerLoader

loader = AzureBlobStorageContainerLoader(
    conn_str=BLOB_CONNECTION_STRING,
    container=BLOB_CONTAINER_NAME,
    prefix="000",
)

docs = loader.load()
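The `prefix` argument selects blobs by a plain string match on the start of the blob name. A small sketch of why the prefixes "000", "001", and "01" pick out different batches follows. The blob names are hypothetical, modeled on the `0001/0001001v1.pdf` pattern used above, and `select_by_prefix` is an illustrative helper, not part of any library.

```python
def select_by_prefix(blob_names, prefix):
    """Return the blob names that start with the given prefix."""
    return [name for name in blob_names if name.startswith(prefix)]


# Hypothetical blob names following the "folder/file.pdf" pattern.
blob_names = [
    "0001/0001001v1.pdf",
    "0009/0009012v1.pdf",
    "0010/0010003v2.pdf",
    "0101/0101005v1.pdf",
]

for prefix in ("000", "001", "01"):
    print(prefix, select_by_prefix(blob_names, prefix))
```

Because the three prefixes do not overlap, loading them one after another does not load any blob twice.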

## Set the embedding parameters

In [None]:
import os

from langchain.embeddings.openai import OpenAIEmbeddings

# chunk_size is set to the maximum of 16 to satisfy API restrictions.
# The API key is read from the environment instead of being hardcoded (variable name is an assumption).
embeddings = OpenAIEmbeddings(
    deployment="textembedding",
    model="text-embedding-ada-002",
    openai_api_base="https://aoaivbd.openai.azure.com/",
    openai_api_type="azure",
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    chunk_size=16,
)
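In `OpenAIEmbeddings`, `chunk_size` is the maximum number of texts sent in one embedding request, not the length of each text. The sketch below shows this kind of batching in plain Python. It illustrates the idea only and is not LangChain's implementation; `batched` is a hypothetical helper.

```python
def batched(items, batch_size=16):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 40 placeholder texts are split into requests of at most 16 texts each.
texts = [f"text {i}" for i in range(40)]
batch_sizes = [len(batch) for batch in batched(texts)]
print(batch_sizes)  # [16, 16, 8]
```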

## Write to authenticated Weaviate

Note: To use the internal IP of the VM hosting the Docker instance, the compute instance must be in the same VNet.
Example: https://10.0.0.4:8080

For an AKS cluster, navigate to the resource, open Services and ingresses, click your Weaviate service, and use its endpoint, which will look like the example above.
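Before writing data, it can help to confirm that the endpoint is reachable from the compute instance. The sketch below uses only the standard library and Weaviate's readiness endpoint, `/v1/.well-known/ready`. It assumes that endpoint can be reached without an API key, which may not hold for every deployment. The helper names are hypothetical.

```python
import urllib.request


def readiness_url(base_url: str) -> str:
    """Build the readiness-check URL for a Weaviate base URL."""
    return base_url.rstrip("/") + "/v1/.well-known/ready"


def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(readiness_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all subclasses of OSError.
        return False


# Example call (not executed here); replace the address with your own endpoint:
# is_ready("https://10.0.0.4:8080")
```

A `False` result from inside the notebook usually points to a networking issue, such as the compute instance sitting outside the VNet, rather than a problem with the data.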

In [None]:
import os

import weaviate
from langchain.vectorstores import Weaviate

WEAVIATE_URL = "http://10.244.3.20:8080"
# Read the API key from the environment instead of hardcoding it (variable name is an assumption).
WEAVIATE_API_KEY = os.environ["WEAVIATE_API_KEY"]

client = weaviate.Client(url=WEAVIATE_URL, auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY))
vectorstore = Weaviate.from_documents(docs, embeddings, client=client, by_text=False)

In [None]:
# Run a query to verify the write. This still needs to be expanded to use the data with an LLM.
query = "What do you know about quantum mechanics?"
doc_queried = vectorstore.similarity_search(query)
doc_queried
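Conceptually, `similarity_search` embeds the query and returns the stored documents whose vectors are closest to it (four by default in LangChain). The sketch below illustrates that ranking with cosine similarity on toy two-dimensional vectors. It assumes cosine similarity as the metric and does an exact comparison, whereas the real search runs inside Weaviate against its index. The function names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy vectors standing in for real embeddings.
doc_vecs = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
print(top_k([1.0, 0.0], doc_vecs, k=2))  # ['a', 'b']
```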