# Transforming Unstructured Data from a Google Drive Folder into RAG-Ready Data in Databricks Volume

Do you have any unstructured data lying around in Google Drive that you'd like to use for RAG?

Let's ingest and preprocess it with Unstructured, then load the results into a Databricks volume.

Prerequisites:

A. Get your [free Unstructured API key](https://unstructured.io/api-key-free) or [paid Unstructured API key](https://unstructured.io/api-key-hosted).

B. Create a [Google Drive service account](https://developers.google.com/workspace/guides/create-credentials#service-account) and get your key (as a JSON file).

C. Make sure the email address for the service account is given access to the Google Drive folder you will be ingesting from.

D. Create a [Databricks Workspace](https://docs.databricks.com/en/workspace/workspace-details.html) (you'll need your hostname, username, and password) and [UC Volume](https://docs.databricks.com/en/connect/unity-catalog/volumes.html).
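The Google Drive connector configured later in this notebook takes a folder ID, which is the part of the folder URL that follows `/folders/`. If you only have the URL at hand, a small helper along these lines could pull the ID out. The function name `folder_id_from_url` and the example URL are hypothetical, and the sketch assumes the URL follows the `https://drive.google.com/drive/folders/{folder-id}` pattern.

```python
from urllib.parse import urlparse


def folder_id_from_url(url: str) -> str:
    """Return the folder ID from a Google Drive folder URL.

    Assumes a URL of the form
    https://drive.google.com/drive/folders/{folder-id},
    optionally followed by a query string such as ?usp=sharing.
    """
    # Split the URL path into its non-empty segments
    parts = [p for p in urlparse(url).path.split("/") if p]
    # The folder ID is the segment right after "folders"
    if "folders" not in parts or parts.index("folders") + 1 >= len(parts):
        raise ValueError(f"Not a Google Drive folder URL: {url!r}")
    return parts[parts.index("folders") + 1]


# Hypothetical URL, for illustration only
print(folder_id_from_url("https://drive.google.com/drive/folders/abc123?usp=sharing"))
```

The query string is dropped because only the URL path is inspected.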


## Setup


In [None]:
# Install the necessary libraries
!pip install "unstructured[all-docs]" httpx "unstructured[databricks-volumes,gdrive,docx]" python-dotenv

Here we're mounting Google Drive **only to load an `.env` file** with the environment variables. The documents that we intend to process are going to be ingested through an Unstructured source connector.
Feel free to load your environment variables in any other way you prefer.
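One alternative that avoids mounting Drive is to set the variables directly from Python. The sketch below is only an illustration: `set_env` is a hypothetical helper, and `EXAMPLE_SETTING` is a made-up variable name standing in for the real ones.

```python
import os
from typing import Mapping


def set_env(values: Mapping[str, str], overwrite: bool = False) -> list[str]:
    """Copy values into os.environ and return the names that were written."""
    written = []
    for name, value in values.items():
        # Keep an existing value unless the caller asks to overwrite it
        if overwrite or name not in os.environ:
            os.environ[name] = value
            written.append(name)
    return written


# EXAMPLE_SETTING is a made-up name; substitute the variables you need
set_env({"EXAMPLE_SETTING": "example-value"})
```

For secrets, you could pass in values read interactively with `getpass.getpass()` so that they never appear in the notebook itself.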

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import dotenv

dotenv.load_dotenv('/content/drive/MyDrive/.env')
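Because the cells that follow read every setting with `os.getenv`, a missing variable silently becomes `None` and may only surface later as a connection error. A quick check such as the one sketched here can catch that early. The helper `missing_env_vars` is hypothetical; the variable names are the ones the later cells in this notebook read.

```python
import os

# Variable names read by the later cells in this notebook
REQUIRED_VARS = [
    "UNSTRUCTURED_API_KEY",
    "YOUR_SERVICE_ACCOUNT_KEY",
    "DATABRICKS_HOST",
    "DATABRICKS_USERNAME",
    "DATABRICKS_PASSWORD",
    "DATABRICKS_CATALOG",
    "DATABRICKS_VOLUME",
]


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```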

In [None]:
# Optional cell to reduce the amount of logs
import logging

logger = logging.getLogger("unstructured.ingest")
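# Drop the first handler on the root logger, if one is installed
# (the guard avoids an IndexError when there are no root handlers)
if logger.root.handlers:
    logger.root.removeHandler(logger.root.handlers[0])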

In [None]:
# Imports
from unstructured.ingest.connector.databricks_volumes import (
    DatabricksVolumesAccessConfig,
    DatabricksVolumesWriteConfig,
    SimpleDatabricksVolumesConfig,
)
from unstructured.ingest.connector.google_drive import (
    GoogleDriveAccessConfig,
    SimpleGoogleDriveConfig,
)

from unstructured.ingest.interfaces import (
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import GoogleDriveRunner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.databricks_volumes import (
    DatabricksVolumesWriter,
)

In [None]:
# The Writer configures the connection to Databricks Volume
def get_writer() -> Writer:
    return DatabricksVolumesWriter(
        connector_config=SimpleDatabricksVolumesConfig(
            host=os.getenv("DATABRICKS_HOST"),
            access_config=DatabricksVolumesAccessConfig(
                username=os.getenv("DATABRICKS_USERNAME"), password=os.getenv("DATABRICKS_PASSWORD")
            ),
        ),
        write_config=DatabricksVolumesWriteConfig(
            catalog=os.getenv("DATABRICKS_CATALOG"),
            volume=os.getenv("DATABRICKS_VOLUME"),
        ),
    )


writer = get_writer()

# GoogleDriveRunner configures ingestion parameters for Google Drive
runner = GoogleDriveRunner(
    processor_config=ProcessorConfig(
        output_dir="google-drive-ingest-output",
        num_processes=2,
    ),

    read_config=ReadConfig(),

    partition_config=PartitionConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),

    ),

    connector_config=SimpleGoogleDriveConfig(
        access_config=GoogleDriveAccessConfig(service_account_key=os.getenv("YOUR_SERVICE_ACCOUNT_KEY")),
        recursive=True,
        # The drive_id is the last part of the URL for your Google Drive folder, e.g.:
        # https://drive.google.com/drive/folders/{folder-id}
        # Replace the placeholder below with your own folder ID.
        drive_id="GOOGLE_DRIVE_FOLDER_ID",
    ),

    writer=writer,
    writer_kwargs={},
)

runner.run()


2024-05-31 20:45:28,975 MainProcess DEBUG    updating download directory to: /root/.cache/unstructured/ingest/google_drive/85da14bf81
2024-05-31 20:45:28,979 MainProcess INFO     running pipeline: DocFactory -> Reader -> Partitioner -> Writer -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "google-drive-ingest-output", "num_processes": 2, "raise_on_error": false}
2024-05-31 20:45:29,062 MainProcess INFO     Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "google-drive-ingest-output", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/google_drive/85da14bf81", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"drive_id": "1YnEi8MP1dVUcHkd