
# Delta Lake I/O with Pandas DataFrames outside a Databricks Environment

![](https://img.shields.io/badge/Databricks-FF3621.svg?style=for-the-badge&logo=Databricks&logoColor=white)
![](https://img.shields.io/badge/Delta-003366.svg?style=for-the-badge&logo=Delta&logoColor=white)
![](https://img.shields.io/badge/pandas-150458.svg?style=for-the-badge&logo=pandas&logoColor=white)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dotlas/databricks_helpers/blob/main/notebooks/pandas_delta/pandas_delta.ipynb)


In this notebook, we showcase utility functions, built on top of existing third-party open source Python libraries, that read Pandas DataFrames from, or write them to, **Delta Lake on Databricks, from within or outside a Databricks environment**. The Delta Lake tables can live in [Unity Catalog](https://www.databricks.com/product/unity-catalog) or simply in the default `hive_metastore`.

## Requirements

### Databricks
* A Databricks Workspace & Workspace Access Token
* At least one runnable cluster within the workspace
* Workspace attached to a metastore for Delta Lake

### Packages

This process relies heavily on the [databricks-sql-python](https://github.com/databricks/databricks-sql-python) library, which provides a [SQLAlchemy](https://sqlalche.me/) interface for writing data. `databricks-sql-python` is an open source Python package maintained by Databricks, and `SQLAlchemy` is used because it is the database toolkit that Pandas relies on for its SQL I/O.


* `databricks-sql-connector`
* `sqlalchemy == 1.4.41`
* `pandas < 2.0`

### Infra

A cluster must be running in the Databricks workspace from which Delta Lake will be accessed. This cluster acts as an intermediary: it accepts connections and data from outside Databricks and writes the data into Delta Lake.

> To add data to Unity Catalog, the cluster must be configured with access to Unity Catalog.

![](./assets/unity_catalog_cluster.png)

In [None]:
pip install "pandas<2.0" databricks-sql-connector sqlalchemy==1.4.41 -q

In [None]:
import os

import pandas as pd
from sqlalchemy import types as sql_types
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine

# databricks imports
from databricks import sql as databricks_sql


### Setup User Inputs

When running this notebook on Databricks, `CLUSTER HTTP PATH` and `WORKSPACE HOSTNAME` can be inferred. When running it outside Databricks, start a cluster first, then copy these values into this notebook and use them as parameters.

Use the `HTTP_PATH` value from the cluster configuration page (JDBC/ODBC tab) for the `CLUSTER HTTP PATH` variable, like so:

![](https://i.stack.imgur.com/qDotH.png)


**Fill in the values for the 3 parameters in the cell below when running this notebook outside a Databricks environment**

In [None]:
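# Check if notebook is running inside databricks environment.
# The Databricks runtime sets DATABRICKS_RUNTIME_VERSION on its clusters, whereas
# a generic "SPARK" match would also be true on any machine with a local Spark install.
DATABRICKS_ENV = "DATABRICKS_RUNTIME_VERSION" in os.environ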

if DATABRICKS_ENV:
    dbutils.widgets.removeAll()
    dbutils.widgets.text("WORKSPACE ACCESS TOKEN", "")
    dbutils.widgets.text("WORKSPACE HOSTNAME", "")
    dbutils.widgets.text("CLUSTER HTTP PATH", "")

# INPUT VALUES HERE

# The workspace access token. Usually of the form dapi*******
databricks_workspace_access_token: str = (
    getArgument("WORKSPACE ACCESS TOKEN")
    if DATABRICKS_ENV
    else "<INPUT WORKSPACE ACCESS TOKEN HERE>"
)

# server hostname like dbc-xxxx.cloud.databricks.com
# do not prefix with https:// or add a / at the end
databricks_server_hostname: str = (
    getArgument("WORKSPACE HOSTNAME")
    if DATABRICKS_ENV
    else "<INPUT WORKSPACE HOSTNAME HERE>"
)

# the http path from the cluster configuration -> JDBC/ODBC tab
databricks_cluster_jdbc_http_path: str = (
    getArgument("CLUSTER HTTP PATH")
    if DATABRICKS_ENV
    else "<INPUT CLUSTER HTTP PATH HERE>"
)
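
Hardcoding the access token in a notebook makes it easy to commit it to version control by accident. As a minimal sketch, the three parameters could instead be read from environment variables when running outside Databricks. The variable names `DATABRICKS_TOKEN`, `DATABRICKS_SERVER_HOSTNAME` and `DATABRICKS_HTTP_PATH`, and both helper functions, are assumptions made for illustration; they are not defined anywhere else in this notebook. The sketch also strips a leading `https://` and any trailing `/` from the hostname, in line with the comment in the cell above.

```python
import os
from typing import Mapping


def normalize_hostname(raw: str) -> str:
    """Strip the URL scheme and trailing slashes so only the bare hostname remains."""
    host = raw.strip()
    for scheme in ("https://", "http://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
    return host.rstrip("/")


def load_connection_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the three connection parameters from a mapping of environment variables.

    The variable names are hypothetical; rename them to match your own setup.
    Missing variables come back as empty strings.
    """
    return {
        "access_token": env.get("DATABRICKS_TOKEN", ""),
        "server_hostname": normalize_hostname(
            env.get("DATABRICKS_SERVER_HOSTNAME", "")
        ),
        "http_path": env.get("DATABRICKS_HTTP_PATH", ""),
    }
```

Passing the environment in as a mapping keeps the helper easy to try out with a plain dictionary before pointing it at `os.environ`.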


### Infer & Assert Inputs

In [None]:
if DATABRICKS_ENV:
    # if notebook is running on databricks environment, then infer parameters
    if not databricks_cluster_jdbc_http_path:
        # spark works without imports within databricks environment
        cluster_id: str = spark.conf.get(
            "spark.databricks.clusterUsageTags.clusterId",
        )  # type: ignore
        workspace_id: str = spark.conf.get(
            "spark.databricks.clusterUsageTags.clusterOwnerOrgId",
        )  # type: ignore
        databricks_cluster_jdbc_http_path = (
            f"sql/protocolv1/o/{workspace_id}/{cluster_id}"
        )

    if not databricks_server_hostname:
        databricks_server_hostname = spark.conf.get("spark.databricks.workspaceUrl")

assert databricks_workspace_access_token, "Databricks Workspace Access Token Missing"
assert databricks_server_hostname, "Databricks Hostname Missing"
assert databricks_cluster_jdbc_http_path, "Cluster JDBC path Missing"
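
Note that the assertions above only catch empty strings. Outside Databricks, an unedited placeholder such as `<INPUT WORKSPACE ACCESS TOKEN HERE>` is non-empty and would pass. A stricter check might look like the following sketch; the helper names `is_filled` and `missing_parameters` are hypothetical and only illustrate the idea.

```python
def is_filled(value: str) -> bool:
    """Return True when value is non-empty and not an unedited <INPUT ...> placeholder."""
    stripped = value.strip()
    if not stripped:
        return False
    return not (stripped.startswith("<INPUT") and stripped.endswith(">"))


def missing_parameters(params: dict) -> list:
    """Return the names of parameters that are empty or still placeholders."""
    return [name for name, value in params.items() if not is_filled(value)]
```

For example, `missing_parameters({"token": "<INPUT WORKSPACE ACCESS TOKEN HERE>", "host": "dbc-xxxx.cloud.databricks.com"})` returns `["token"]`, which can then be reported in a single error message before any connection is attempted.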


### Setup Connection

We will create a SQLAlchemy engine from the credentials required to connect to the cluster and workspace.

In [None]:
databricks_sqlalchemy_url: str = (
    "databricks://token:"
    + databricks_workspace_access_token
    + "@"
    + databricks_server_hostname
    + "?http_path="
    + databricks_cluster_jdbc_http_path
)

databricks_alch_engine: Engine = create_engine(databricks_sqlalchemy_url)
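
The cell above concatenates the raw values into the connection URL. That works for typical tokens, but a value containing a reserved character such as `@` or `?` would produce a URL that parses incorrectly. A minimal sketch of a more defensive builder follows, using only the standard library. The function name `build_databricks_url` is hypothetical, and the sketch assumes that SQLAlchemy decodes percent-encoded credentials when it parses the URL, which is its documented behaviour for passwords with special characters.

```python
from urllib.parse import quote


def build_databricks_url(
    access_token: str, server_hostname: str, http_path: str
) -> str:
    """Assemble a databricks:// URL with the same layout as the cell above."""
    # Percent-encode the token in case it contains reserved characters like "@" or "/"
    safe_token = quote(access_token, safe="")
    # Keep "/" readable in the HTTP path and encode any other reserved character
    safe_http_path = quote(http_path, safe="/")
    return (
        f"databricks://token:{safe_token}@{server_hostname}"
        f"?http_path={safe_http_path}"
    )
```

For ordinary alphanumeric tokens the result is identical to the concatenated string, so the returned value can be passed to `create_engine` in the same way.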


Verify that the connection works by listing the catalogs on Databricks.

In [None]:
catalogs = pd.read_sql("SHOW CATALOGS", databricks_alch_engine)
catalogs


### Run Queries

In [None]:
catalog_name: str = "samples"
schema_name: str = "nyctaxi"
table_name: str = "trips"

In [None]:
df: pd.DataFrame = pd.read_sql(
    f"SELECT * FROM {catalog_name}.{schema_name}.{table_name} LIMIT 100",
    databricks_alch_engine,
)

df.head()
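
The query above interpolates the catalog, schema and table names directly into the SQL string. That is fine for fixed values like the ones here, but it becomes risky if the names ever come from user input. The sketch below shows one conservative way to guard against that: it accepts only plain identifiers (letters, digits and underscores) and wraps each part in backticks, which is how Databricks SQL quotes identifiers. The helper name `qualified_table_name` and the strictness of the pattern are assumptions chosen for illustration.

```python
import re

# Deliberately strict: letters, digits and underscores, not starting with a digit
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Return `catalog`.`schema`.`table` after checking each part is a plain identifier."""
    parts = (catalog, schema, table)
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Unsupported identifier: {part!r}")
    return ".".join(f"`{part}`" for part in parts)
```

With the values used in this notebook, `qualified_table_name("samples", "nyctaxi", "trips")` produces a name that can be dropped into the `SELECT` statement in place of the three separate placeholders.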