## Embed your Google Drive Docs in a DataStax Astra Vector Database with Unstructured!


Author: Nina Lopatina from Unstructured

Nina's X handle: [@NinaLopatina](https://x.com/ninalopatina)

Nina's LinkedIn: https://www.linkedin.com/in/ninalopatina

Last updated: 06.28.24

Do you have files in Google Drive that you want to parse, embed, and load into your Astra database for retrieval-augmented generation (RAG)? If so, this notebook walks you through every step.

Here are the initial non-code steps:

A. Sign up for your [Unstructured API key](https://app.unstructured.io/) with a two-week free trial for up to 1,000 documents. You can find your API credentials in your dashboard.

B. Create a [Google Drive service account](https://support.google.com/a/answer/7378726?hl=en) or locate the JSON key file for an existing one. Make sure you share the Google Drive folder that holds your data with the service account's email address.

C. Sign up for [AstraDB](https://www.google.com/url?q=https%3A%2F%2Fwww.datastax.com%2Flp%2Fastra-registration) to get your database endpoint and token.

D. Decide which embeddings to use, and obtain the appropriate API token if the provider needs one (the pipeline in this notebook uses Hugging Face embeddings; the OpenAI dependencies are installed as well in case you prefer that provider).

Set up any private API keys in a .env file in your Google Drive.
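Step B depends on sharing the folder with the service account's email address. As a minimal sketch, assuming your key file is the standard JSON that Google Cloud generates for a service account (which carries `type` and `client_email` fields), you could read that address as follows. The function name `service_account_email` and the example path are hypothetical.

```python
import json
from pathlib import Path


def service_account_email(key_path):
    """Return the client_email stored in a Google service account key file.

    Assumes the standard key layout, where "type" is "service_account"
    and "client_email" holds the address to share your Drive folder with.
    """
    key = json.loads(Path(key_path).read_text())
    if key.get("type") != "service_account":
        raise ValueError(f"{key_path} does not look like a service account key")
    return key["client_email"]


# Hypothetical usage; replace the path with your own key file:
# print(service_account_email("your-account-key-json-here.json"))
```

Share the Drive folder with the address this returns before running the pipeline.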
_______________




1. Starting with the code below, install all the necessary libraries.

In [None]:
!pip install unstructured[all-docs] unstructured[astra] unstructured[openai] unstructured[embed-huggingface] langchain-community httpx python-dotenv

Collecting unstructured[all-docs]
  Downloading unstructured-0.14.9-py3-none-any.whl (2.1 MB)
Collecting langchain-community
  Downloading langchain_community-0.2.6-py3-none-any.whl (2.2 MB)
Collecting httpx
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Collecting filetype (from unstructured[all-docs])
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting python-magic (from unstructured[all-docs])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Collecting emoji (from unstructured[all-docs])
  Downloa

2. [Mount your Google Drive locally](https://colab.research.google.com/notebooks/io.ipynb). A pop-up will ask you to connect to your Google Drive. Mounting lets you load your .env file and store your .json files locally in case you want to reference them later.

  The files themselves are pulled through a connector that uses a service account, which allows processing of Google Docs files in addition to the standard file formats you can save in your Drive.

  The secret parameters to set in your .env file are:
  
  UNSTRUCTURED_API_KEY

  UNSTRUCTURED_PARTITION_ENDPOINT
  
  ASTRA_DB_TOKEN

  ASTRA_DB_ENDPOINT

  

### Note that in this notebook, you are sharing your Google Drive with the Colab notebook itself, not with Unstructured or DataStax.
  If you prefer not to share your Drive with the notebook, you can access your .env and Drive .json files another way, e.g. by downloading this notebook as a .ipynb and running it locally with local directory access.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import dotenv

dotenv.load_dotenv('/content/drive/MyDrive/.env')

True
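A `True` return from `load_dotenv` does not guarantee that every key the pipeline needs is present. As a small sketch, a check like the one below could confirm that the four secrets listed above are set before moving on; the helper name `missing_secrets` is an illustrative choice, not part of any library.

```python
import os

# The four secrets this notebook expects in the .env file.
REQUIRED_SECRETS = (
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_PARTITION_ENDPOINT",
    "ASTRA_DB_TOKEN",
    "ASTRA_DB_ENDPOINT",
)


def missing_secrets(env=None, required=REQUIRED_SECRETS):
    """Return the required names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_secrets()
if missing:
    print("Missing from .env:", ", ".join(missing))
else:
    print("All required secrets are set.")
```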

3. Next, set the additional parameters that are not secret, which are easier to modify directly in the notebook.

In [None]:
os.environ['GCP_INGEST_SERVICE_KEY_FILE'] = '/content/drive/MyDrive/secret/your-account-key-json-here.json' # The json you downloaded for your account key
os.environ['GOOGLE_DRIVE_FOLDER_ID'] = 'your-folder-id-here' # The folder where your unstructured data is contained
os.environ['COLLECTION_NAME'] = 'podcast_VDB'
# This value depends on the embedding model you choose. For the current default model
# per provider, the values are 384 for Hugging Face and 1536 for OpenAI.
os.environ['EMBEDDING_DIMENSION'] = '384'
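The embedding dimension has to agree with the embedding provider used in the pipeline below. One way to keep the two in sync is a small lookup built from the two values quoted in the comment above. This is only a sketch: the `langchain-openai` provider name and the `embedding_dimension` helper are assumptions for illustration, so check them against your installed version of `unstructured`.

```python
import os

# Dimensions quoted in this notebook for each provider's default model.
# "langchain-huggingface" is the provider used in the pipeline below;
# "langchain-openai" is an assumed name for the OpenAI provider.
DEFAULT_DIMENSIONS = {
    "langchain-huggingface": 384,
    "langchain-openai": 1536,
}


def embedding_dimension(provider):
    """Look up the vector size for a provider, failing loudly if unknown."""
    try:
        return DEFAULT_DIMENSIONS[provider]
    except KeyError:
        raise ValueError(f"No known dimension for provider {provider!r}") from None


# Environment variables must be strings, so convert before storing.
os.environ["EMBEDDING_DIMENSION"] = str(embedding_dimension("langchain-huggingface"))
```

If you switch to a non-default model, replace the lookup with that model's actual output size.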

#### Note that there is temporarily a bug in processing native Google Docs, Sheets, and Slides files in Google Drive (.doc, .xlsx, .ppt, etc. work fine). As a temporary workaround, you can use the V1 SDK code, or download your files and re-upload them.

4. Set up Unstructured API access and process the documents as described in our [Google Drive source connector](https://docs.unstructured.io/open-source/ingest/source-connectors/google-drive) documentation, and set up the [Astra destination connector](https://docs.unstructured.io/open-source/ingest/destination-connectors/astra). Note that these docs will shortly be updated for our new Serverless API.

  At the end of this workflow, your unstructured documents will have been extracted, chunked, embedded, and loaded into your Astra DB!

In [None]:
# All of the imports
import os

from unstructured.ingest.v2.interfaces import ProcessorConfig
from unstructured.ingest.v2.pipeline.pipeline import Pipeline
from unstructured.ingest.v2.processes.chunker import ChunkerConfig
from unstructured.ingest.v2.processes.connectors.astra import (
    AstraAccessConfig,
    AstraConnectionConfig,
    AstraUploaderConfig,
    AstraUploadStagerConfig,
)
from unstructured.ingest.v2.processes.connectors.google_drive import (
    GoogleDriveAccessConfig,
    GoogleDriveConnectionConfig,
    GoogleDriveDownloaderConfig,
    GoogleDriveIndexerConfig,
)
from unstructured.ingest.v2.processes.embedder import EmbedderConfig
from unstructured.ingest.v2.processes.partitioner import PartitionerConfig

In [None]:
Pipeline.from_configs(
    context=ProcessorConfig(tqdm=True, reprocess=True, verbose=False),
    # Source: the Google Drive folder shared with the service account
    source_connection_config=GoogleDriveConnectionConfig(
        access_config=GoogleDriveAccessConfig(
            service_account_key=os.getenv("GCP_INGEST_SERVICE_KEY_FILE"),
        ),
        drive_id=os.getenv("GOOGLE_DRIVE_FOLDER_ID"),
    ),
    indexer_config=GoogleDriveIndexerConfig(),
    downloader_config=GoogleDriveDownloaderConfig(),
    # Partition the documents through the Unstructured API
    partitioner_config=PartitionerConfig(
        strategy="fast",  # use "hi_res" for documents with images
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_by_api=True,
        partition_endpoint=os.getenv("UNSTRUCTURED_PARTITION_ENDPOINT"),
    ),
    chunker_config=ChunkerConfig(chunking_strategy="by_title"),
    embedder_config=EmbedderConfig(embedding_provider="langchain-huggingface"),
    # Destination: the Astra DB collection
    destination_connection_config=AstraConnectionConfig(
        access_config=AstraAccessConfig(
            token=os.getenv("ASTRA_DB_TOKEN"),
            api_endpoint=os.getenv("ASTRA_DB_ENDPOINT"),
        )
    ),
    stager_config=AstraUploadStagerConfig(),
    uploader_config=AstraUploaderConfig(
        collection_name=os.getenv("COLLECTION_NAME"),
        # Must match the output size of the embedding model chosen above
        embedding_dimension=int(os.getenv("EMBEDDING_DIMENSION")),
        requested_indexing_policy={"deny": ["metadata"]},
    ),
).run()

2024-06-28 19:33:26,016 MainProcess INFO     Created index with configs: {"extensions": null, "recursive": false}, connection configs: {"access_config": "***REDACTED***", "drive_id": "1cF_wp5Mkuiyvrcee0KKBWWmLcyPj8BiP"}
2024-06-28 19:33:26,021 MainProcess INFO     Created download with configs: {"download_dir": null}, connection configs: {"access_config": "***REDACTED***", "drive_id": "1cF_wp5Mkuiyvrcee0KKBWWmLcyPj8BiP"}
2024-06-28 19:33:26,026 MainProcess INFO     Created partition with configs: {"strategy": "fast", "ocr_languages": null, "encoding": null, "additional_partition_args": null, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructuredapp.io", "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-06-28 19:33:26,031 MainProcess INFO     Created chunk with configs: {"chunking_str