## Embed your Google Drive Docs in a Weaviate Vector Database with Unstructured!


Author: Nina Lopatina from Unstructured

Nina's X handle: [@NinaLopatina](https://x.com/ninalopatina)

Nina's LinkedIn: https://www.linkedin.com/in/ninalopatina

Last updated: 07.09.24

Weaviate content sections borrowed from @mariakhalusova

Do you have files in Google Drive that you want to parse, embed, and import into your Weaviate vector database for RAG? If so, this notebook walks you through every step.

Here are the initial non-code steps:

A. Sign up for your [Unstructured API key](https://app.unstructured.io/), which comes with a 2-week free trial for up to 1000 pages per day. You can find your API credentials in your dashboard.

B. Decide on your [source connector](https://docs.unstructured.io/api-reference/ingest/source-connectors/overview). This notebook uses the [Google Drive source connector](https://docs.unstructured.io/api-reference/ingest/source-connectors/google-drive), but feel free to use the connector of your choice. If you use the Google Drive connector, set up your [Google Drive service account](https://support.google.com/a/answer/7378726?hl=en) or locate the JSON key file with your login info. Make sure you share the Google Drive directory where your data is stored with the service account email address.

C. Sign up to get your [Weaviate](https://weaviate.io/) URL and API Key after you create a cluster.

D. Decide on which embeddings to use, and obtain the appropriate API Token as needed (in this notebook we are using HuggingFace for embedding generation).

Store any private API keys in a .env file in your Google Drive.
_______________




1. Starting with the code below, install all the necessary libraries.

In [None]:
!pip install -U -q "unstructured[pdf, google-drive, weaviate, embed-huggingface]" python-dotenv #langchain-community httpx


2. [Mount your Google Drive locally](https://colab.research.google.com/notebooks/io.ipynb) to load your dotenv file and to store your .json files locally in case you want to reference them later. A pop-up will ask you to connect to your Google Drive.

  The files to process are pulled via a connector that uses a service account, which allows processing of native Google Docs files in addition to the standard file formats that can be saved in your Drive.

  The secret parameters to set in your .env file are:
  
  UNSTRUCTURED_API_KEY

  UNSTRUCTURED_PARTITION_ENDPOINT

  WEAVIATE_URL
  
  WEAVIATE_API_KEY

  HF_TOKEN
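
Before running the later cells, it can help to confirm that every key listed above was actually loaded. The following is a minimal sketch, assuming the variable names listed above; `missing_env_vars` is a hypothetical helper (not part of python-dotenv or Unstructured), and the URL in the example mapping is a placeholder.

```python
import os

# Names of the secret parameters this notebook expects in the .env file.
REQUIRED_KEYS = [
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_PARTITION_ENDPOINT",
    "WEAVIATE_URL",
    "WEAVIATE_API_KEY",
    "HF_TOKEN",
]


def missing_env_vars(required=REQUIRED_KEYS, env=None):
    """Return the required names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
# (placeholder values, for illustration only).
example_env = {"WEAVIATE_URL": "https://example.weaviate.network", "HF_TOKEN": ""}
print(missing_env_vars(env=example_env))
```

Calling `missing_env_vars()` with no arguments after `dotenv.load_dotenv(...)` checks the real environment; an empty list means all keys are present.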

  

### Note that in this notebook, you are sharing your Google Drive with the Colab notebook itself, not with Unstructured or Weaviate.

#### If you prefer not to share your Google Drive, you can access your .env and Drive .json files in another fashion, e.g. by downloading this notebook as a .ipynb and running it locally with local directory access.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import dotenv

dotenv.load_dotenv('/content/drive/MyDrive/.env')

True

2b. Set your HF token


In [None]:
from huggingface_hub import notebook_login

notebook_login()


3. Set the additional, non-secret parameters here, where they are easier to modify in a notebook.

In [None]:
os.environ['GCP_INGEST_SERVICE_KEY_FILE'] = '/content/drive/MyDrive/secret/your-account-key-json-here.json' # The json you downloaded for your account key
os.environ['GOOGLE_DRIVE_FOLDER_ID'] = 'your-folder-id-here' # The folder where your unstructured data is contained
os.environ['COLLECTION_NAME'] = 'UnstructuredDemo'
os.environ['EMBEDDING_MODEL'] = 'sentence-transformers/all-MiniLM-L6-v2'
os.environ['EMBEDDING_NAME'] = 'title_vector'
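
The Drive folder has to be shared with the service account's email address, and the folder ID has to be copied out of the folder's URL. As a minimal sketch, assuming a standard service account key file (which contains `type` and `client_email` fields) and a folder URL of the form `.../drive/folders/<id>`, the two values can be read programmatically. Both helper names are hypothetical, and the URL below is made up for illustration.

```python
import json
import re


def service_account_email(key_file_path):
    """Read the `client_email` field from a Google service account key file."""
    with open(key_file_path, encoding="utf-8") as f:
        key = json.load(f)
    if key.get("type") != "service_account":
        raise ValueError("Not a service account key file")
    return key["client_email"]


def folder_id_from_url(url):
    """Extract the ID from a URL of the form .../drive/folders/<id>."""
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"No folder ID found in {url!r}")
    return match.group(1)


# Hypothetical folder URL, for illustration only.
print(folder_id_from_url("https://drive.google.com/drive/folders/abc123_XYZ?usp=sharing"))
```

With the variables above set, `service_account_email(os.getenv("GCP_INGEST_SERVICE_KEY_FILE"))` returns the address to share the folder with.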

4. Connect to Weaviate using [Weaviate Cloud](https://console.weaviate.cloud/), [Weaviate Embedded](https://weaviate.io/developers/weaviate/installation/embedded), or [locally](https://weaviate.io/developers/weaviate/installation/docker-compose) and configure your Weaviate Schema.

In [None]:
# Weaviate Cloud

import weaviate

# Set these environment variables
URL = os.getenv("WEAVIATE_URL")
APIKEY = os.getenv("WEAVIATE_API_KEY")

# Connect to your WCD instance
client = weaviate.connect_to_wcs(
    cluster_url=URL,
    auth_credentials=weaviate.auth.AuthApiKey(APIKEY),
    headers={
        "X-HuggingFace-Api-Key": os.getenv("HF_TOKEN")
    }
)

client.is_ready()

True

In [None]:
import weaviate.classes.config as wc
from weaviate.classes.config import Configure

client.collections.create(
    name=os.getenv("COLLECTION_NAME"),

    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name=os.getenv('EMBEDDING_NAME'),
            model=os.getenv('EMBEDDING_MODEL'),
        )
    ],

    # Weaviate can infer schema, but it is considered best practice to define it upfront
    properties=[
        wc.Property(name="type", data_type=wc.DataType.TEXT),
        wc.Property(name="element_id", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="embeddings", data_type=wc.DataType.NUMBER_ARRAY, skip_vectorization=True),
        wc.Property(name="metadata", data_type=wc.DataType.OBJECT, nested_properties=[
            wc.Property(name="filename", data_type=wc.DataType.TEXT),
            wc.Property(name="filetype", data_type=wc.DataType.TEXT),
            wc.Property(name="languages", data_type=wc.DataType.TEXT_ARRAY),
            wc.Property(name="page_number",  data_type=wc.DataType.TEXT, skip_vectorization=True),

        ])
    ],
)

<weaviate.collections.collection.Collection at 0x7fa0d2d031f0>
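
Re-running the creation cell typically fails once the collection already exists. A minimal sketch of a guard, assuming the v4 Python client's `client.collections.exists`, `get`, and `create` methods; `get_or_create_collection` is a hypothetical helper name, not part of the Weaviate client.

```python
def get_or_create_collection(client, name, **create_kwargs):
    """Return the collection `name`, creating it only if it does not exist yet.

    `create_kwargs` are passed through to `client.collections.create`
    (for example `vectorizer_config=...` and `properties=...`).
    """
    if client.collections.exists(name):
        return client.collections.get(name)
    return client.collections.create(name=name, **create_kwargs)
```

It would be called with the same arguments as the cell above, for example `get_or_create_collection(client, os.getenv("COLLECTION_NAME"), properties=[...])`. Note that an existing collection is returned as-is, so schema changes still require deleting and recreating it.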

#### Note: there is currently a bug in processing native Google Docs, Sheets, and Slides files (.doc, .xlsx, .ppt, etc. work fine). As a temporary workaround, you can use the V1 SDK code, or download and re-upload your files.

5. Set up Unstructured API access and process the documents as per our [Google Drive source connector](https://docs.unstructured.io/open-source/ingest/source-connectors/google-drive) documentation or another connector of your choice, e.g. [local](https://unstructured-53-docs-21-v2-sources.mintlify.app/api-reference/ingest/source-connectors/local). Set up the [Weaviate destination connector](https://docs.unstructured.io/api-reference/ingest/destination-connector/weaviate).

  At the end of this workflow, your unstructured documents have been extracted, chunked, embedded, and loaded into your Weaviate DB!

In [None]:
#All of the imports
from unstructured.ingest.v2.interfaces import ProcessorConfig
from unstructured.ingest.v2.pipeline.pipeline import Pipeline
from unstructured.ingest.v2.processes.chunker import ChunkerConfig
from unstructured.ingest.v2.processes.connectors.google_drive import (
    GoogleDriveAccessConfig,
    GoogleDriveIndexerConfig,
    GoogleDriveConnectionConfig,
    GoogleDriveDownloaderConfig,
)
from unstructured.ingest.v2.processes.connectors.weaviate import (
    WeaviateUploaderConfig,
    WeaviateConnectionConfig,
    WeaviateAccessConfig,
    WeaviateUploadStagerConfig,
)
import os

from unstructured.ingest.v2.processes.embedder import EmbedderConfig
from unstructured.ingest.v2.processes.partitioner import PartitionerConfig

In [None]:
Pipeline.from_configs(
    context=ProcessorConfig(
        tqdm=True,
        reprocess=False,
        verbose=False,
        #output_dir='local-output-to-weaviate',
        num_processes=10,  # when processing a large number of documents via the Unstructured API, set a larger number of workers/processes here
    ),
    source_connection_config=GoogleDriveConnectionConfig(
        access_config=GoogleDriveAccessConfig(
            service_account_key=os.getenv("GCP_INGEST_SERVICE_KEY_FILE"),
        ),
        drive_id=os.getenv("GOOGLE_DRIVE_FOLDER_ID"),
    ),
    indexer_config=GoogleDriveIndexerConfig(),
    downloader_config=GoogleDriveDownloaderConfig(),
    partitioner_config=PartitionerConfig(
        strategy="fast",  # or "hi_res" for images
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_by_api=True,
        partition_endpoint=os.getenv("UNSTRUCTURED_PARTITION_ENDPOINT"),
        split_pdf_concurrency_level=10,  # Modify split_pdf_concurrency_level to set the number of parallel requests; the max is 35
    ),
    chunker_config=ChunkerConfig(chunking_strategy="by_title"),
    embedder_config=EmbedderConfig(embedding_provider="langchain-huggingface"),
    destination_connection_config=WeaviateConnectionConfig(
        access_config=WeaviateAccessConfig(
            access_token=os.getenv("WEAVIATE_API_KEY"),
        ),
        host_url=os.getenv("WEAVIATE_URL"),
        class_name=os.getenv("COLLECTION_NAME"),
    ),
    stager_config=WeaviateUploadStagerConfig(),
    uploader_config=WeaviateUploaderConfig(),
).run()

## Time to Search!


### Aggregate query

In [None]:
# count how many chunks are in the database

documents = client.collections.get(os.getenv("COLLECTION_NAME"))
response = documents.aggregate.over_all(total_count=True)

print(response.total_count)

98


### Hybrid search (mix of keyword and vector search)
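
Hybrid search blends a keyword (BM25) score with a vector score, and the `alpha` parameter sets the balance: 0 is keyword only, 1 is vector only. To make that concrete, here is a simplified sketch in which each score list is min-max normalized and then combined as `alpha * vector + (1 - alpha) * keyword`. This is an illustration under those assumptions, not Weaviate's actual fusion implementation; the function names and the sample scores are made up.

```python
def min_max_normalize(scores):
    """Scale scores to [0, 1]; a constant list maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_scores(keyword_scores, vector_scores, alpha=0.5):
    """Blend normalized scores: alpha=0 is keyword only, alpha=1 is vector only."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    docs = set(kw) | set(vec)
    fused = {
        doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in docs
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for three chunks (higher is better in both lists).
keyword = {"a": 2.0, "b": 1.0, "c": 0.0}
vector = {"a": 0.2, "b": 0.9, "c": 0.6}
for doc, score in fuse_scores(keyword, vector, alpha=0.5):
    print(doc, round(score, 3))
```

With these toy numbers, chunk `b` ranks first at `alpha=0.5` because it scores reasonably well on both signals, while `alpha=0` would put the keyword favorite `a` on top.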

In [None]:
import json

documents = client.collections.get(os.getenv("COLLECTION_NAME"))

response = documents.query.hybrid(
    query="challenges with RAG evaluation",
    alpha=0.5, # equal weighting of BM25 and vector search
    return_properties=['text'],
    auto_limit=2  # autocut after 2 jumps
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "text": "The issue of hallucination in LLMs has been explored in multiple contexts and models [Ji et al., 2023, Kaddour et al., 2023]. As a response, RAG systems have been shown to reduce hallucination [Shuster et al., 2021, Kang et al., 2023]. Previous works have explored automated RAG evaluation frameworks in various settings [Es et al., 2023a, Hoshi et al., 2023, Saad-Falcon et al., 2023a, Zhang et al., 2024]. For example, some studies use LLMs to evaluate the faithfulness, answer relevance, and"
}
{
  "text": "context relevance of RAG systems by using GPT-3.5 as an evaluator [Es et al., 2023b, Saad-Falcon et al., 2023b]. In another study, the authors propose metrics such as noise robustness, negative rejection, information integration, and counterfactual robustness [Chen et al., 2024b]. Multiple studies have shown that RAG can mislead LLMs in the presence of complex or misleading search results and that such models can still make mistakes even when given the correct response [F

### Vector Search
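
Vector search embeds the query and ranks stored chunks by how close their vectors are to it. The following is a simplified brute-force sketch using cosine similarity over toy 3-dimensional vectors. It only illustrates the ranking idea: Weaviate itself uses an approximate nearest-neighbor index rather than a full scan, and the helper names and vectors here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vector, chunk_vectors, k=5):
    """Return the k (chunk_id, similarity) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vec))
        for chunk_id, vec in chunk_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings, for illustration only.
chunks = {
    "chunk-1": [1.0, 0.0, 0.0],
    "chunk-2": [0.6, 0.8, 0.0],
    "chunk-3": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, chunks, k=2))
```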

In [None]:
documents = client.collections.get(os.getenv("COLLECTION_NAME"))

response = documents.query.near_text(
    query="challenges with RAG evaluation",
    return_properties=['text'],
    limit=5  # limit to 5
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "text": "The issue of hallucination in LLMs has been explored in multiple contexts and models [Ji et al., 2023, Kaddour et al., 2023]. As a response, RAG systems have been shown to reduce hallucination [Shuster et al., 2021, Kang et al., 2023]. Previous works have explored automated RAG evaluation frameworks in various settings [Es et al., 2023a, Hoshi et al., 2023, Saad-Falcon et al., 2023a, Zhang et al., 2024]. For example, some studies use LLMs to evaluate the faithfulness, answer relevance, and"
}
{
  "text": "context relevance of RAG systems by using GPT-3.5 as an evaluator [Es et al., 2023b, Saad-Falcon et al., 2023b]. In another study, the authors propose metrics such as noise robustness, negative rejection, information integration, and counterfactual robustness [Chen et al., 2024b]. Multiple studies have shown that RAG can mislead LLMs in the presence of complex or misleading search results and that such models can still make mistakes even when given the correct response [F

To use OpenAI embeddings instead, run this pip install and swap the blocks below in for the corresponding blocks above:

In [None]:
!pip install -U -q "unstructured[openai]"

In [None]:
# Weaviate Cloud

import weaviate

# Set these environment variables
URL = os.getenv("WEAVIATE_URL")
APIKEY = os.getenv("WEAVIATE_API_KEY")

# Connect to your WCD instance
client = weaviate.connect_to_wcs(
    cluster_url=URL,
    auth_credentials=weaviate.auth.AuthApiKey(APIKEY),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")  # Replace with your OpenAI key
    }
)

client.is_ready()

In [None]:
import weaviate.classes.config as wc

client.collections.create(
    name="UnstructuredDemo",

    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai( # specify the vectorizer and model type you're using
        model="ada",
        model_version="002",
        type_="text"
    ),
    generative_config=wc.Configure.Generative.openai(
        model="gpt-4"  # Optional - Defaults to `gpt-3.5-turbo`
    ),

    # Weaviate can infer schema, but it is considered best practice to define it upfront
    properties=[
        wc.Property(name="type", data_type=wc.DataType.TEXT),
        wc.Property(name="element_id", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="embeddings", data_type=wc.DataType.NUMBER_ARRAY, skip_vectorization=True),
        wc.Property(name="metadata", data_type=wc.DataType.OBJECT, nested_properties=[
            wc.Property(name="filename", data_type=wc.DataType.TEXT),
            wc.Property(name="filetype", data_type=wc.DataType.TEXT),
            wc.Property(name="languages", data_type=wc.DataType.TEXT_ARRAY),
            wc.Property(name="page_number",  data_type=wc.DataType.TEXT, skip_vectorization=True),

        ])
    ],
)

In [None]:
# Swap this in for the embedder_config argument of Pipeline.from_configs above
embedder_config=EmbedderConfig(
        embedding_provider="langchain-openai",
        embedding_api_key=os.getenv("OPENAI_API_KEY"), # the embeddings model should match the one defined for Weaviate collection, in this case the default is text-embedding-ada-002