# Databricks Connect

This guide shows how to **manage a Databricks workspace** using the Python API, and how to **access your data** in a SQL Warehouse by connecting through a Compute cluster.

## Requirements

### Databricks

This notebook uses data stored in a Databricks data warehouse. Accessing the data requires access to Databricks and to the specific data catalog. The following link provides useful information on accessing Databricks remotely:
- https://docs.databricks.com/en/dev-tools/databricks-connect/python/index.html

Once these requirements are fulfilled, complete the following steps:

#### 1. Find the Databricks host
- The **host** is the URL of the Databricks web interface as shown in the browser. Example:
    - https://[hostID].azuredatabricks.net/

**IMPORTANT:** The full **host** (the URL) is needed later, not only the hostID.
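A URL copied from the browser often carries a path, query string, or fragment that is not part of the host. The sketch below trims such a URL down to the scheme and hostname using only the standard library. The function name `normalize_host` and the example URL are hypothetical and only serve as an illustration.

```python
from urllib.parse import urlsplit

def normalize_host(browser_url):
    """Reduce a URL copied from the browser to 'https://<hostname>'."""
    parts = urlsplit(browser_url.strip())
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"Expected a full https:// URL, got: {browser_url!r}")
    return f"{parts.scheme}://{parts.netloc}"

# Hypothetical workspace URL, for illustration only
host = normalize_host("https://example.azuredatabricks.net/sql/warehouses?o=1#tab")
print(host)
```

Passing only the hostID (without `https://`) raises a `ValueError`, which catches the mistake described above before it reaches the configuration step.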

#### 2. Generate an access token in the Databricks web interface
- User Settings -> Developer -> Access tokens (Manage) -> Generate new token
- Specify Comment and Lifetime (leave empty for no expiration date)
- Save the generated **token** (you will not be able to see it again)

#### 3. Get the SQL Warehouse ID
- Look for a warehouse in the SQL Warehouses menu
- Save the ID in the warehouse Overview: Name (ID: **warehouse_ID**)

#### 4. Set up a Compute cluster in the Databricks web interface that has access to the warehouse
- In Compute, create or find a cluster with one of these Runtime versions:
    - 13.0 or later
    - 13.0 ML or later
- Select node -> More ... -> View JSON
- Save the value of **"cluster_id"**
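If the JSON shown by View JSON is copied into a string or a file, the cluster ID can be read programmatically instead of by eye. This is a minimal sketch using the standard library. The helper name `read_cluster_json` and the example values are hypothetical, and it assumes the JSON also contains a `spark_version` field describing the Runtime.

```python
import json

def read_cluster_json(text):
    """Return (cluster_id, spark_version) from a cluster's JSON definition."""
    cluster = json.loads(text)
    # spark_version is read with .get() because this sketch only assumes it is present
    return cluster["cluster_id"], cluster.get("spark_version")

# Hypothetical JSON, shortened for illustration
example_json = '{"cluster_id": "0000-000000-example0", "spark_version": "13.3.x-scala2.12"}'
cluster_id, spark_version = read_cluster_json(example_json)
print(cluster_id, spark_version)
```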

#### 5. Install databricks-connect in your Python environment
**Install** the databricks-connect package version that matches the Runtime of the Compute cluster used in Databricks. Example:
- pip install databricks-connect==13.2

Note: databricks-connect 13.x and later requires Python 3.10 or later, so it is recommended to build the analysis environment with Python 3.10+.

To **upgrade** this package, use the *--upgrade* option:
- pip install --upgrade "databricks-connect==14.3.*"

Note: the *.\** suffix ensures that the latest patch release of the specified version is installed.
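The version pin can be derived from the cluster's Runtime string rather than typed by hand. The sketch below assumes the Runtime string begins with `major.minor` (for example `14.3.x-scala2.12`). The helper `connect_requirement` is hypothetical, and the Python check simply mirrors the 3.10+ recommendation above.

```python
import re
import sys

def connect_requirement(spark_version):
    """Build a pip requirement such as 'databricks-connect==14.3.*'."""
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"Unrecognized Runtime version: {spark_version!r}")
    major, minor = match.groups()
    return f"databricks-connect=={major}.{minor}.*"

requirement = connect_requirement("14.3.x-scala2.12")
print(f'pip install --upgrade "{requirement}"')

# Warn early if the local interpreter is older than the recommended version
if sys.version_info < (3, 10):
    print("Warning: Python 3.10+ is recommended for databricks-connect 13.x+")
```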

## 1. Configure Databricks Connection

To configure the connection, we need the following information:
- Databricks host
- Access token
- Compute cluster ID

Optional:
- Warehouse ID can be included in the Config object for easier access

In [None]:
from databricks.sdk.core import Config

config = Config(
    profile      = 'access', # arbitrary config profile name
    host         = '----',   # fill this in: full workspace URL
    token        = '----',   # fill this in: access token
    cluster_id   = '----',   # fill this in: Compute cluster ID
    warehouse_id = '----'    # fill this in (optional): SQL Warehouse ID
)
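
To keep the token out of the notebook, the same values can be read from environment variables. The sketch below collects them into a dictionary that could be passed on as `Config(**settings)`. The helper name `settings_from_env` is hypothetical, and the variable names follow the `DATABRICKS_*` convention, which is assumed here to match your environment.

```python
import os

REQUIRED = {
    "host": "DATABRICKS_HOST",
    "token": "DATABRICKS_TOKEN",
    "cluster_id": "DATABRICKS_CLUSTER_ID",
}
OPTIONAL = {"warehouse_id": "DATABRICKS_WAREHOUSE_ID"}

def settings_from_env(env=None):
    """Collect connection settings from environment variables."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    settings = {key: env[var] for key, var in REQUIRED.items()}
    # Optional values are included only when they are set
    settings.update({key: env[var] for key, var in OPTIONAL.items() if env.get(var)})
    return settings

# Illustration with a plain dict standing in for os.environ
example_env = {
    "DATABRICKS_HOST": "https://example.azuredatabricks.net",
    "DATABRICKS_TOKEN": "example-token",
    "DATABRICKS_CLUSTER_ID": "0000-000000-example0",
}
settings = settings_from_env(example_env)
print(sorted(settings))
```

With real environment variables set, `config = Config(**settings_from_env())` would replace the hard-coded values.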

## 2. Manage and Start the SQL Warehouse and Compute Cluster

### 1. Create a WorkspaceClient

For more information, refer to the Python API on the WorkspaceClient class:
- https://databricks-sdk-py.readthedocs.io/en/latest/clients/workspace.html

In [None]:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(config=config)

### 2. (optional) Get general workspace information

In [None]:
# Print clusters
print([[i.cluster_name, i.cluster_id] for i in w.clusters.list()])

# Print warehouses
print([[i.name, i.id] for i in w.warehouses.list()])

# Print catalogs (1st level)
print([i.full_name for i in w.catalogs.list()])

# Print schemas in catalogs (2nd level)
print([i.full_name for i in w.schemas.list(catalog_name='medications')])

Cluster information (Fill in cluster_id)

In [None]:
w.clusters.get(cluster_id='----').as_dict()

Warehouse information (Fill in id)

In [None]:
w.warehouses.get(id='----').as_dict()

Catalog and schema information (Fill in name)

In [None]:
w.catalogs.get(name='----').as_dict()

In [None]:
w.schemas.get(full_name='----.----').as_dict() # full_name in the format: catalog.schema

Table information (Fill in names)

In [None]:
# tables.list() returns a lazy iterator, so it must be consumed to see the tables
[t.full_name for t in w.tables.list(catalog_name='----', schema_name='----')]

In [None]:
catalog_name = '----'
schema_name = '----'
table_name = '----'

# Fetch the table metadata once and reuse it
table = w.tables.get(full_name='%s.%s.%s' % (catalog_name, schema_name, table_name))

# Print table comments
print(table.comment)

# Print table columns and data types
[[c['name'], c['type_name']] for c in table.as_dict()['columns']]

### 3. Start clusters

In [None]:
# First, request the Compute cluster start without waiting for it to finish.
w.clusters.start(cluster_id=config.cluster_id)

# Second, start the SQL Warehouse and wait until it is running. Fill in the warehouse ID.
w.warehouses.start_and_wait(id='----')

# Lastly, wait until the Compute cluster reports the 'RUNNING' state as well.
w.clusters.wait_get_cluster_running(cluster_id=config.cluster_id)
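
The start sequence can also be wrapped in a function with an explicit timeout. This sketch assumes that `start()` on both services returns a waiter object whose `result()` accepts a `timeout` as a `datetime.timedelta`. The helper `start_all`, the `FakeWait` class, and the stand-in client are hypothetical and exist only so the example runs without a workspace.

```python
import datetime
from types import SimpleNamespace

def start_all(w, cluster_id, warehouse_id, timeout_minutes=20):
    """Request both starts first, then block until both report a result."""
    timeout = datetime.timedelta(minutes=timeout_minutes)
    cluster_wait = w.clusters.start(cluster_id=cluster_id)
    warehouse_wait = w.warehouses.start(id=warehouse_id)
    return cluster_wait.result(timeout=timeout), warehouse_wait.result(timeout=timeout)

class FakeWait:
    """Stand-in for the SDK waiter object (illustration only)."""
    def __init__(self, value):
        self.value = value
    def result(self, timeout=None):
        return self.value

# Stand-in client so the sketch runs without a Databricks workspace
fake_w = SimpleNamespace(
    clusters=SimpleNamespace(start=lambda cluster_id: FakeWait("cluster RUNNING")),
    warehouses=SimpleNamespace(start=lambda id: FakeWait("warehouse RUNNING")),
)
result = start_all(fake_w, "cl-example", "wh-example", timeout_minutes=10)
print(result)
```

With a real client, the call would be `start_all(w, config.cluster_id, '----')`, with the warehouse ID filled in.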

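### 4. (optional) Check whether the warehouse and cluster are running

For this, we define two functions that wait for the warehouse and the cluster to reach the expected state and then print it. Note that each call blocks until that state is reached or the wait times out.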

In [None]:
def checkDatabricksStopped(w, warehouse_id=None, cluster_id=None):
    """Wait for the warehouse and/or cluster to stop, then print the state."""
    if warehouse_id is not None:
        try:
            print('Warehouse: ' + w.warehouses.wait_get_warehouse_stopped(id=warehouse_id).as_dict()['state'])
        except Exception:
            print('Warehouse might be running')
    if cluster_id is not None:
        try:
            print('Cluster: ' + w.clusters.wait_get_cluster_terminated(cluster_id=cluster_id).as_dict()['state'])
        except Exception:
            print('Cluster might be running')

def checkDatabricksRunning(w, warehouse_id=None, cluster_id=None):
    """Wait for the warehouse and/or cluster to run, then print the state."""
    if warehouse_id is not None:
        try:
            print('Warehouse: ' + w.warehouses.wait_get_warehouse_running(id=warehouse_id).as_dict()['state'])
        except Exception:
            print('Warehouse might be stopped')
    if cluster_id is not None:
        try:
            print('Cluster: ' + w.clusters.wait_get_cluster_running(cluster_id=cluster_id).as_dict()['state'])
        except Exception:
            print('Cluster might be terminated')

In [None]:
checkDatabricksRunning(w, warehouse_id='----', cluster_id=config.cluster_id)
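
Because the functions above wait for a target state, a quick status lookup may be more convenient in some cases. The sketch below reads the current state once via `w.warehouses.get` and `w.clusters.get`, as used earlier in this guide, and returns immediately. The helper `current_states` and the stand-in client are hypothetical. The state is assumed to be an enum with a `value` attribute, with a fallback for plain strings.

```python
from types import SimpleNamespace

def _state_name(state):
    """Return the enum's value if there is one, else the object itself."""
    return getattr(state, "value", state)

def current_states(w, warehouse_id=None, cluster_id=None):
    """Look up the current state of each resource without waiting."""
    states = {}
    if warehouse_id is not None:
        states["warehouse"] = _state_name(w.warehouses.get(id=warehouse_id).state)
    if cluster_id is not None:
        states["cluster"] = _state_name(w.clusters.get(cluster_id=cluster_id).state)
    return states

# Stand-in client so the sketch runs without a Databricks workspace
fake_w = SimpleNamespace(
    warehouses=SimpleNamespace(get=lambda id: SimpleNamespace(state="STOPPED")),
    clusters=SimpleNamespace(get=lambda cluster_id: SimpleNamespace(state="RUNNING")),
)
states = current_states(fake_w, warehouse_id="wh-example", cluster_id="cl-example")
print(states)
```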

## 3. Query and export data from Databricks

**IMPORTANT**: To connect via a Spark session, the Compute cluster **AND** the SQL Warehouse must be running in Databricks. **Please stop these resources after the data has been accessed** if automatic termination is not set up for them.
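
Once the data has been exported, both resources can be stopped from Python as well. This is a minimal sketch, assuming `w.warehouses.stop` stops a warehouse and `w.clusters.delete` terminates (but does not permanently remove) a cluster. The helper `stop_all` and the stand-in client are hypothetical and only make the example runnable offline.

```python
from types import SimpleNamespace

def stop_all(w, cluster_id=None, warehouse_id=None):
    """Request a stop for the warehouse and termination for the cluster."""
    if warehouse_id is not None:
        w.warehouses.stop(id=warehouse_id)
    if cluster_id is not None:
        # Assumption: 'delete' terminates the cluster and keeps its configuration
        w.clusters.delete(cluster_id=cluster_id)

# Stand-in client that records the calls instead of contacting Databricks
calls = []
fake_w = SimpleNamespace(
    warehouses=SimpleNamespace(stop=lambda id: calls.append(("warehouse", id))),
    clusters=SimpleNamespace(delete=lambda cluster_id: calls.append(("cluster", cluster_id))),
)
stop_all(fake_w, cluster_id="cl-example", warehouse_id="wh-example")
print(calls)
```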

### 1. Create a Spark session on Databricks

In [None]:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

### 2. Show description of data tables

The SQL DESCRIBE statement returns information about each column: its name, data type, and comment. This is usually faster than querying the table itself.

In [None]:
spark.sql(""" DESCRIBE TABLE catalog.database.table """).show()
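
Catalog, schema, or table names that contain special characters need to be quoted. The sketch below builds a fully qualified name with backtick quoting, doubling any backtick inside a name. The helper names `quote_identifier` and `qualified_name` are hypothetical.

```python
def quote_identifier(name):
    """Wrap a name in backticks, escaping embedded backticks by doubling them."""
    return "`" + name.replace("`", "``") + "`"

def qualified_name(catalog, schema, table):
    """Join the three name levels into catalog.schema.table form."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))

describe_sql = f"DESCRIBE TABLE {qualified_name('catalog', 'database', 'table')}"
print(describe_sql)
```

The resulting string can be passed to `spark.sql(describe_sql).show()`.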

### 3. Query from SQL Warehouse and convert to Pandas DataFrame

Note: Both the Spark and Pandas DataFrames can be used to create an AMLDataVariant object.

In [None]:
df_in_spark = spark.sql(""" SELECT * FROM catalog.database.table """)

df = df_in_spark.toPandas()
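
Since toPandas() collects the whole result into local memory, it can help to select only the needed columns and limit the row count while exploring. The sketch below builds such a query. The helper `preview_query`, the column names, and the default limit are hypothetical.

```python
def preview_query(table, columns=None, limit=1000):
    """Build a SELECT statement restricted to some columns and a row limit."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    column_list = ", ".join(columns) if columns else "*"
    return f"SELECT {column_list} FROM {table} LIMIT {limit}"

# Hypothetical column names, for illustration only
query = preview_query("catalog.database.table", columns=["id", "name"], limit=100)
print(query)
```

The query can then be run as `spark.sql(query).toPandas()`.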