# Options for Accessing Data in Databricks from Outside Databricks

May 08, 2024

Eumar Assis, Sr. Solutions Architect

Databricks

## Databricks Connect V2

Requires a Databricks compute cluster.

This is my preferred approach given its simplicity. It is lightweight for developers.

Works for accessing catalog/tables that are the product of a Delta Share.

Note: although Databricks compute is required, this approach may reduce compute usage on the client side (for example, in SageMaker) because processing is offloaded to Databricks.

https://docs.databricks.com/en/dev-tools/databricks-connect/index.html

In [None]:
%pip install databricks-connect

In [40]:
from databricks.connect import DatabricksSession

#spark = DatabricksSession.builder.profile("./profile.databrickscfg").getOrCreate()

spark = DatabricksSession.builder.remote(
  host       = "",
  token      = "",
  cluster_id = ""
).getOrCreate()


df = spark.read.table("samples.nyctaxi.trips")
df.show(5)

+--------------------+---------------------+-------------+-----------+----------+-----------+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
+--------------------+---------------------+-------------+-----------+----------+-----------+
| 2016-02-14 16:52:13|  2016-02-14 17:16:04|         4.94|       19.0|     10282|      10171|
| 2016-02-04 18:44:19|  2016-02-04 18:46:00|         0.28|        3.5|     10110|      10110|
| 2016-02-17 17:13:57|  2016-02-17 17:17:55|          0.7|        5.0|     10103|      10023|
| 2016-02-18 10:36:07|  2016-02-18 10:41:45|          0.8|        6.0|     10022|      10017|
| 2016-02-22 14:14:41|  2016-02-22 14:31:52|         4.51|       17.0|     10110|      10282|
+--------------------+---------------------+-------------+-----------+----------+-----------+
only showing top 5 rows
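The cell above leaves `host`, `token`, and `cluster_id` blank. One way to avoid hard-coding them is to read them from environment variables. This is a minimal sketch, not part of the original notebook: `connect_settings_from_env` and `create_session` are hypothetical helpers, and the sketch assumes the values are exported as `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, and `DATABRICKS_CLUSTER_ID`.

```python
import os
from typing import Dict, Mapping

# Assumed mapping from builder arguments to environment variable names.
ENV_NAMES = {
    "host": "DATABRICKS_HOST",
    "token": "DATABRICKS_TOKEN",
    "cluster_id": "DATABRICKS_CLUSTER_ID",
}


def connect_settings_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect host, token and cluster_id, failing early if any is missing."""
    settings = {key: env.get(var, "").strip() for key, var in ENV_NAMES.items()}
    missing = [ENV_NAMES[key] for key, value in settings.items() if not value]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return settings


def create_session(env: Mapping[str, str] = os.environ):
    """Build a session with the same remote() call used in the cell above."""
    # Imported lazily so the settings helper works without databricks-connect.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.remote(**connect_settings_from_env(env)).getOrCreate()
```

With the three variables set, `spark = create_session()` would replace the hand-filled `remote(...)` call.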



## Databricks SDK | Table data

Databricks compute cluster required.

Works for accessing catalog/tables that are the product of a Delta Share.

Note: although Databricks compute is required, this approach may reduce compute usage on the client side (for example, in SageMaker) because processing is offloaded to Databricks.

https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/index.html

In [None]:
%pip install databricks-sdk

In [None]:
import os

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

cluster_id = os.environ["TEST_DEFAULT_CLUSTER_ID"]

context = w.command_execution.create(cluster_id=cluster_id, language=compute.Language.PYTHON).result()

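# execute() returns a waiter; .result() blocks until the command has finished.
text_results = w.command_execution.execute(cluster_id=cluster_id,
                                           context_id=context.id,
                                           language=compute.Language.PYTHON,
                                           command="print(1)").result()
print(text_results.results.data)

# Clean up the execution context once it is no longer needed.
w.command_execution.destroy(cluster_id=cluster_id, context_id=context.id)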
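The command in the cell above only prints `1`. To pull table data through the same execution context, the `command` string can carry PySpark code instead. The following is a sketch under assumptions: `build_preview_command` is a hypothetical helper (not part of the SDK), it assumes the remote Python context exposes a `spark` session as a notebook does, and it only accepts simple three-part names made of letters, digits, and underscores.

```python
import re

# Conservative identifier check; names with hyphens or backticks are rejected.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_preview_command(table: str, limit: int = 5) -> str:
    """Return Python source that prints the first rows of a table as JSON."""
    parts = table.split(".")
    if len(parts) != 3 or not all(_NAME.match(part) for part in parts):
        raise ValueError(f"Expected <catalog>.<schema>.<table>, got {table!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # The generated code runs on the cluster, where `spark` is assumed to exist.
    return (
        f'print(spark.read.table("{table}")'
        f'.limit({limit}).toPandas().to_json(orient="records"))'
    )
```

The returned string would be passed as `command=build_preview_command("samples.nyctaxi.trips")` in place of `"print(1)"`.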

## Delta Sharing Scenario | COVID dataset

https://github.com/delta-io/delta-sharing?tab=readme-ov-file#accessing-shared-data


No Databricks compute cluster required

Unity Catalog in the provider Workspace acts as the Delta Sharing Server

In [None]:
%pip install delta_sharing

In [7]:

import os
import delta_sharing

# Point to the profile file. It can be a file on the local file system or a file on a remote storage.
profile_file = "./open-datasets.share"
#profile_file = "https://raw.githubusercontent.com/delta-io/delta-sharing/main/examples/open-datasets.share"

# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
print("########### All Available Tables #############")
print(client.list_all_tables())

# Create a url to access a shared table.
# A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
table_url = profile_file + "#delta_sharing.default.owid-covid-data"

# Fetch 10 rows from a table and convert it to a Pandas DataFrame. This can be used to read sample data from a table that cannot fit in the memory.
print("########### Loading 10 rows from delta_sharing.default.owid-covid-data as a Pandas DataFrame #############")
data = delta_sharing.load_as_pandas(table_url, limit=10)

# Print the sample.
print("########### Show the fetched 10 rows #############")
print(data)

# Load a table as a Pandas DataFrame. This can be used to process tables that can fit in the memory.
print("########### Loading delta_sharing.default.owid-covid-data as a Pandas DataFrame #############")
data = delta_sharing.load_as_pandas(table_url)

# Do whatever you want with your shared data!
print("########### Show Data #############")
print(data[data["iso_code"] == "USA"].head(10))

########### All Available Tables #############
[Table(name='COVID_19_NYT', share='delta_sharing', schema='default'), Table(name='boston-housing', share='delta_sharing', schema='default'), Table(name='flight-asa_2008', share='delta_sharing', schema='default'), Table(name='lending_club', share='delta_sharing', schema='default'), Table(name='nyctaxi_2019', share='delta_sharing', schema='default'), Table(name='nyctaxi_2019_part', share='delta_sharing', schema='default'), Table(name='owid-covid-data', share='delta_sharing', schema='default')]
########### Loading 10 rows from delta_sharing.default.owid-covid-data as a Pandas DataFrame #############
########### Show the fetched 10 rows #############
  iso_code continent     location        date  total_cases  new_cases  \
0      AFG      Asia  Afghanistan  2020-02-24          1.0        1.0   
1      AFG      Asia  Afghanistan  2020-02-25          1.0        0.0   
2      AFG      Asia  Afghanistan  2020-02-26          1.0        0.0   
3     
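The table URL used above is the profile file path, a `#`, and the `<share-name>.<schema-name>.<table-name>` triple. A small helper can assemble it and catch malformed parts early. This is an illustrative sketch; `make_table_url` is a hypothetical name, not a `delta_sharing` function.

```python
def make_table_url(profile_file: str, share: str, schema: str, table: str) -> str:
    """Build '<profile-file>#<share>.<schema>.<table>' for delta_sharing loaders."""
    for label, value in (("share", share), ("schema", schema), ("table", table)):
        # A '.' or '#' inside a part would make the URL ambiguous.
        if not value or "." in value or "#" in value:
            raise ValueError(f"Invalid {label} name: {value!r}")
    return f"{profile_file}#{share}.{schema}.{table}"
```

For the open dataset above, `make_table_url("./open-datasets.share", "delta_sharing", "default", "owid-covid-data")` produces the same string as the hand-built `table_url`.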

## Delta Sharing Scenario | Eumar Dataset


No Databricks compute cluster required

Unity Catalog in the provider Workspace acts as the Delta Sharing Server

In [9]:
# Reference: https://github.com/delta-io/delta-sharing?tab=readme-ov-file#accessing-shared-data
### DELTA SHARING TO OPEN | No  Databricks Credentials ###

import delta_sharing

# Point to the profile file. It can be a file on the local file system or a file on a remote storage.
profile_file = "./eumar_delta_share_credential.share"

# Create a url to access a shared table.
# A table path is the profile file path followed by `#` and the fully qualified name of a table
# (`<share-name>.<schema-name>.<table-name>`).
table_url = profile_file + "#eumar_test_share.eumar_default.candy_dataset"

# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
print ("##### Delta Sharing Tables ####")
print(client.list_all_tables())


print ("##### Read with Pandas Connector ####")

# Fetch 10 rows from a table and convert it to a Pandas DataFrame. This can be used to read sample data 
# from a table that cannot fit in the memory.
#delta_sharing.load_as_pandas(table_url, limit=10)

# Load a table as a Pandas DataFrame. This can be used to process tables that can fit in the memory.
delta_sharing.load_as_pandas(table_url)

# If the code is running with PySpark, you can use `load_as_spark` to load the table as a Spark DataFrame.
#delta_sharing.load_as_spark(table_url)



##### Delta Sharing Tables ####
[Table(name='candy_dataset', share='eumar_test_share', schema='eumar_default')]
##### Read with Pandas Connector ####


Unnamed: 0,DATE,price
0,1972-01-01,74.6385
1,1972-02-01,62.5554
2,1972-03-01,57.5046
3,1972-04-01,56.2333
4,1972-05-01,56.6197
...,...,...
620,2023-09-01,103.9538
621,2023-10-01,114.1864
622,2023-11-01,109.6462
623,2023-12-01,113.1631
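The `.share` credential file used above is a JSON profile. Before handing it to `SharingClient`, it can help to confirm that the expected fields are present. This is a minimal sketch, assuming a bearer-token profile with the `shareCredentialsVersion`, `endpoint`, and `bearerToken` fields described in the Delta Sharing profile format; `parse_profile` is a hypothetical helper.

```python
import json
from typing import Any, Dict

# Fields a bearer-token profile is assumed to contain.
REQUIRED_FIELDS = ("shareCredentialsVersion", "endpoint", "bearerToken")


def parse_profile(text: str) -> Dict[str, Any]:
    """Parse profile JSON and raise if a required field is missing or empty."""
    profile = json.loads(text)
    if not isinstance(profile, dict):
        raise ValueError("Profile must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if not profile.get(name)]
    if missing:
        raise ValueError("Profile is missing fields: " + ", ".join(missing))
    return profile
```

A typical call would read the file first, for example `parse_profile(open(profile_file, encoding="utf-8").read())`.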


## Databricks SDK | Accessing Volume Files

No Databricks compute cluster required

https://databricks-sdk-py.readthedocs.io/en/latest/dbutils.html

https://databricks-sdk-py.readthedocs.io/en/latest/workspace/files/files.html


In [None]:



from databricks.sdk import WorkspaceClient

# Authenticates using the credentials in .databrickscfg
w = WorkspaceClient()

VOLUME_PATH = "/Volumes/eumar_tests/eumar_default/eumar_volume"


# Variables for controlling how many files to print
max_prints = 10
current_print = 0

print ("==== NON-DELTA SHARE ====")
print (f"Print using list_directory_contents('{VOLUME_PATH}')")
for query in w.files.list_directory_contents(VOLUME_PATH):

    print(query.as_dict())

    current_print = current_print + 1
    
    if current_print >= max_prints:
        break


print(f"Print using dbutils.fs.ls('{VOLUME_PATH}')")
dbutils = w.dbutils
files_in_root = dbutils.fs.ls(VOLUME_PATH)
print(f'number of files in {VOLUME_PATH}: {len(files_in_root)}')
current_print = 0
for file in files_in_root:

    print(file)

    current_print = current_print + 1
    
    if current_print >= max_prints:
        break


print ("==== ACCESSING SHARED TABLE  ====")

# list_summaries returns an iterator, so loop over it to print each table summary.
for summary in w.tables.list_summaries(catalog_name="demo_climate_assessment_share", schema_name_pattern="demo"):
    print(summary)
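The listing loops above cap their output with a manual counter. The same limit can be expressed with `itertools.islice`, which stops pulling from a lazy iterator after the first `limit` items. This is a sketch; `first_n` is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def first_n(entries: Iterable[T], limit: int) -> List[T]:
    """Return at most `limit` items without consuming the rest of the iterator."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return list(islice(entries, limit))
```

Applied to the cell above, `for entry in first_n(w.files.list_directory_contents(VOLUME_PATH), max_prints): print(entry.as_dict())` would replace the counter and `break`.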


## Accessing non-Delta Sharing | Databricks SQL Connector

Requires a Databricks SQL warehouse (or cluster)

Works for accessing catalog/tables that are the product of a Delta Share.

https://docs.databricks.com/en/dev-tools/python-sql-connector.html

In [None]:
%pip install databricks-sql-connector

In [None]:
from databricks import sql
import os

connection = sql.connect(
                        server_hostname = "",
                        http_path = "",
                        access_token = "<PAT>")

cursor = connection.cursor()

cursor.execute("SELECT * from eumar_test_share.eumar_default.candy_dataset")
print(cursor.fetchall())

cursor.close()
connection.close()
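`cursor.fetchall()` in the cell above prints raw row tuples. Since the connector follows the Python DB-API (PEP 249), column names are available on `cursor.description`, so rows can be turned into dictionaries. The sketch below uses the standard-library `sqlite3` module as a stand-in cursor so it runs without a Databricks connection; `rows_as_dicts` and `demo` are hypothetical helpers, and the sample row is copied from the candy dataset output shown earlier.

```python
import sqlite3
from typing import Any, Dict, List


def rows_as_dicts(cursor) -> List[Dict[str, Any]]:
    """Pair each fetched row with the column names from cursor.description."""
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def demo() -> List[Dict[str, Any]]:
    """Run the helper against an in-memory SQLite table (stand-in data source)."""
    connection = sqlite3.connect(":memory:")
    try:
        cursor = connection.cursor()
        cursor.execute("CREATE TABLE candy_dataset (DATE TEXT, price REAL)")
        cursor.execute("INSERT INTO candy_dataset VALUES ('1972-01-01', 74.6385)")
        cursor.execute("SELECT DATE, price FROM candy_dataset")
        return rows_as_dicts(cursor)
    finally:
        connection.close()
```

With the Databricks cursor, the call would be `rows_as_dicts(cursor)` right after `cursor.execute(...)`, in place of `print(cursor.fetchall())`.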

## Accessing non-Delta Sharing | SageMaker JDBC on Data Wrangler

Requires a Databricks Compute Cluster

Specific to SageMaker

https://aws.amazon.com/blogs/machine-learning/prepare-data-from-databricks-for-machine-learning-using-amazon-sagemaker-data-wrangler/


## Accessing non-Delta Sharing | Open Apache Hive Metastore API

No Databricks compute required.

- Works with AWS Athena and EMR, but not SageMaker
- Private Preview
- Still requires access to underlying storage

Private Preview: https://www.databricks.com/blog/extending-databricks-unity-catalog-open-apache-hive-metastore-api

In [None]:
from pyspark.sql import SparkSession

hive_jar_folder_location = ""
your_catalog = ""
databricks_host = ""


spark = SparkSession.builder \
   .appName('PySpark Session on Local Machine') \
   .master('local[*]') \
   .config("spark.sql.hive.metastore.jars", hive_jar_folder_location) \
   .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
   .config("spark.hadoop.hive.metastore.server.thrift.http.path", "/api/2.0/unity-hms-proxy/metadata") \
   .config("spark.hadoop.hive.metastore.client.thrift.transport.mode", "http") \
   .config("spark.sql.catalogImplementation", "hive") \
   .config("spark.sql.hive.metastore.version", "3.1.2") \
   .config("spark.hadoop.hive.metastore.client.http.additional.headers", f"X-Databricks-Catalog-Name={your_catalog}") \
   .config("spark.hadoop.hive.metastore.client.auth.mode", "<PAT>") \
   .config("spark.hadoop.hive.metastore.uris", databricks_host) \
   .getOrCreate()

# range() is a table-valued function in Spark SQL, so it is selected FROM.
spark.sql("select * from range(10)").show()