# Develop Databricks Projects using VS Code
## Disclaimer

 * The product may change or may never be released;
 * While we will not charge separately for this product right now, we may charge for it in the future. You will still incur charges for DBUs;
 * There's no formal support or SLAs for the preview, so please reach out to your account or other contact with any questions or feedback; and
 * We may terminate the preview or your access at any time.

 Place this notebook into a workspace folder and execute it to develop on that folder using a VS Code remote tunnel.

## Requirements:
 * Unity Catalog (UC)
 * DBR/MLR >= 14.x
 * Single user cluster

## How to use
 * Place this notebook into the workspace folder you want to develop in using VS Code
 * Configure
   * Run the cell called "Widgets"
   * Configure the IDE using the widgets at the top of the notebook
 * Click "Run all" each time the cluster is restarted.
 * Follow instructions in the output of the last cell to open the IDE
   * Open project folder in the IDE
   * Select the Python environment by executing the `Python: Select Interpreter` command and choosing the item labeled `python.defaultInterpreterPath`.
   * When using notebooks, click `select kernel`, select `Python environments...`, and then select the highlighted environment
 * Click "Interrupt" to close the tunnel once you are done developing

### Features
 * Authentication with `GitHub` and `Microsoft Entra ID` (can be configured in a widget)
   * Microsoft authentication currently requires the VS Code Insiders channel
 * Full debugger support for Python files
 * Jupyter notebooks
 * Debugger works in notebooks
 * Access to all cluster libraries
 * Full Python LSP including code completion for `spark`
 * UC support
 * Auto-termination after 10m (can be configured using a widget)
 * Toggle between `stable` and `insider` VS Code channels
 * Customized Jupyter kernel to support
   * `spark` and `dbutils` globals
   * `%sql`
   * Better table output rendering

### Limitations
 * Requires UC and DBR >= 14.x (because of SparkConnect)
 * Only SparkConnect and DBUtils from SparkConnect
 * No dbutils widgets
 * Notebooks written in VS Code don't show up in Databricks (needs DBR >= 17.2)
 * Notebooks written in Databricks can't be opened in VS Code (needs DBR >= 16)
 * No git support in VS Code. Code needs to be committed from Databricks (WIP)
 * VS Code
   * Extensions defined in `.vscode/extensions.json` will be installed automatically. Manually installed extensions must be added to `.vscode/extensions.json` or they will be lost on cluster termination
   * Python virtual environment needs to be selected manually using `Python: Select Interpreter` command


## Download and install

First, download and install the VS Code tunnel CLI.

In [0]:
# Generate dbutils widget dropdown
dbutils.widgets.dropdown("Provider", "microsoft", ["github", "microsoft"])
dbutils.widgets.dropdown("Duration", "10m", ["10m", "30m", "1h", "4h"])
dbutils.widgets.dropdown("VS Code Channel", "stable", ["stable", "insider"])
dbutils.widgets.dropdown("Create example.ipynb", "Yes", ["Yes", "No"])

import os

PROVIDER = dbutils.widgets.get("Provider")
DURATION = dbutils.widgets.get("Duration")
CHANNEL = dbutils.widgets.get("VS Code Channel")
CREATE_EXAMPLE_NOTEBOOK = dbutils.widgets.get("Create example.ipynb") == "Yes"
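The `Duration` value is later handed unchanged to the `timeout` command, which in GNU coreutils accepts the suffixes `s`, `m`, `h`, and `d`. If the same limit is needed as a number in Python, a minimal sketch could look like the following. `duration_to_seconds` is a hypothetical helper for illustration and is not part of this notebook.

```python
# Hypothetical helper (not part of the notebook): convert a duration such as
# "10m" or "1h" into seconds, mirroring the suffixes GNU `timeout` accepts.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(duration: str) -> int:
    if len(duration) < 2:
        raise ValueError(f"Unsupported duration: {duration!r}")
    value, unit = duration[:-1], duration[-1]
    if unit not in _UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"Unsupported duration: {duration!r}")
    return int(value) * _UNIT_SECONDS[unit]


# The four options offered by the "Duration" widget.
for option in ["10m", "30m", "1h", "4h"]:
    print(option, "->", duration_to_seconds(option), "seconds")
```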

In [0]:
import os
from os.path import expanduser
import io
import requests
import tarfile

code_path = expanduser("~/code")

def rm_file(path):
    if os.path.exists(path):
        os.remove(path)


def download_code(channel):
    rm_file(f"{code_path}/code")
    rm_file(f"{code_path}/code-insiders")

    download_url = (
        f"https://code.visualstudio.com/sha/download?build={channel}&os=cli-alpine-x64"
    )
    response = requests.get(download_url)

    response.raise_for_status()  # Check that the request was successful

    # Create a file-like object from the response content
    file_like_object = io.BytesIO(response.content)

    # Open the tar file
    with tarfile.open(fileobj=file_like_object, mode="r:gz") as tar:
        tar.extractall(code_path)

download_code(CHANNEL)
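The cell above extracts the downloaded archive directly from memory. The same `io.BytesIO` plus `tarfile` pattern can be tried without network access, as in the sketch below. The archive here is a small stand-in built locally, and `make_tarball` and `extract_tarball` are hypothetical helper names, not part of the notebook.

```python
import io
import os
import tarfile
import tempfile


def make_tarball(name: str, payload: bytes) -> bytes:
    """Build a small .tar.gz in memory (a stand-in for the downloaded archive)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        info.mode = 0o755  # mark the stand-in file as executable
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def extract_tarball(content: bytes, target_dir: str) -> list:
    """Extract gzip-compressed tar bytes into target_dir and return the member names."""
    with tarfile.open(fileobj=io.BytesIO(content), mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names


with tempfile.TemporaryDirectory() as tmp:
    archive = make_tarball("code", b"#!/bin/sh\necho stub\n")
    extracted = extract_tarball(archive, tmp)
    cli_path = os.path.join(tmp, "code")
    print(extracted, os.path.isfile(cli_path))
```

Checking that the expected binary exists after extraction is a cheap way to catch a changed archive layout before the tunnel is started.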

## Configure

Extract notebook properties into environment variables

In [0]:
from dbruntime.databricks_repl_context import get_context
from databricks.sdk import WorkspaceClient
import os
import hashlib

VERSION="0.4"

os.environ["DATABRICKS_SDK_UPSTREAM"] = "databricks_vscode_tunnel"
os.environ["DATABRICKS_SDK_UPSTREAM_VERSION"] = VERSION

ctx = get_context()
w = WorkspaceClient()

os.environ["DATABRICKS_HOST"] = ctx.workspaceUrl
os.environ["DATABRICKS_TOKEN"] = ctx.apiToken
os.environ["DATABRICKS_CLUSTER_ID"] = ctx.clusterId

def get_notebook_dir():
    path = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) # type: ignore
    return path if path.startswith("/Workspace") else "/Workspace" + path

DATABRICKS_USER_NAME = w.current_user.me().user_name

notebook_dir = get_notebook_dir()
os.environ["DATABRICKS_NOTEBOOK_DIR"] = notebook_dir

project_name = os.path.basename(notebook_dir).replace(" ", "-")
# Avoid shadowing the built-in `hash`
name_hash = hashlib.sha256(f"{ctx.clusterId}-{project_name}".encode()).hexdigest()[:6]
DATABRICKS_ENV_NAME = project_name[:13] + "-" + name_hash
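The tunnel name is derived from the folder name and the cluster ID: up to 13 characters of the folder name, a dash, and 6 hex characters of a SHA-256 digest, so it is at most 20 characters long. The sketch below reproduces that derivation as a standalone function. `make_env_name` is a hypothetical name and the cluster ID and path are placeholder values.

```python
import hashlib
import os


def make_env_name(cluster_id: str, notebook_dir: str) -> str:
    """Derive a short, stable name from the cluster ID and the project folder."""
    project_name = os.path.basename(notebook_dir).replace(" ", "-")
    digest = hashlib.sha256(f"{cluster_id}-{project_name}".encode()).hexdigest()[:6]
    return project_name[:13] + "-" + digest


# Placeholder values for illustration only.
name = make_env_name(
    "0000-000000-example",
    "/Workspace/Users/someone@example.com/my long project name",
)
print(name, len(name))
```

Because the digest covers both the cluster ID and the folder name, the same folder opened from a different cluster is very likely to get a different tunnel name.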

## Configure Jupyter Kernel

Expose `spark` and `dbutils` globals and add support for `%sql`.

In [0]:
from os.path import expanduser

init_script = """
\"""Entry point for launching an IPython kernel with databricks feature support.

This file is based on the kernel launcher from ipykernel[1]. In this launcher we initialize a
connection to Spark to be used both by user code and by Databricks features, perform the
Databricks setup that is required on kernel start, and launch the kernel app so the Jupyter
client can connect.

[1] https://github.com/ipython/ipykernel/blob/v5.2.1/ipykernel_launcher.py
\"""

from dbruntime.DatasetInfo import UserNamespaceDict
from dbruntime.PipMagicOverrides import PipMagicOverrides
from dbruntime.display import displayHTML
from dbruntime.monkey_patches import apply_monkey_patches
from dbruntime import UserNamespaceInitializer
from dbruntime.IPythonShellHooks import load_ipython_hooks


from IPython.core.getipython import get_ipython
from IPython.display import display
from dbruntime.IPythonShellHooks import IPythonShellHook

user_namespace_initializer = UserNamespaceInitializer.getOrCreate()
entry_point = user_namespace_initializer.get_spark_entry_point()

sc = user_namespace_initializer.localSparkHandles["sc"]
spark = user_namespace_initializer.localSparkHandles["spark"]
dbutils = user_namespace_initializer.dbutils
user_ns = UserNamespaceDict(
    user_namespace_initializer.get_namespace_globals(),
    entry_point.getDriverConf(),
    entry_point,
)

shell = get_ipython()
apply_monkey_patches(entry_point, sc, spark, display, displayHTML, dbutils)
shell.register_magics(PipMagicOverrides(entry_point, sc._conf, user_ns))

class UserNamespaceCommandHook(IPythonShellHook):
    def __init__(self, user_ns):
        self.user_ns = user_ns

    def pre_run_cell(self, info):
        self.user_ns.reset_new_dataframes()

    def post_run_cell(self, result):
        new_dataframe_info = self.user_ns.get_new_dataframe_infos_json()
        if new_dataframe_info:
            data = {"text/plain": new_dataframe_info}
            display(data, raw=True)
            
load_ipython_hooks(shell, UserNamespaceCommandHook(user_ns))


import warnings
from typing import List
from IPython.core.formatters import BaseFormatter

def register_magics():
    def warn_for_dbr_alternative(magic: str):
        # Magics that are not supported on Databricks but work in jupyter notebooks.
        # We show a warning, prompting users to use a databricks equivalent instead.
        local_magic_dbr_alternative = {"%%sh": "%sh"}
        if magic in local_magic_dbr_alternative:
            warnings.warn(
                "\\n" + magic
                + " is not supported on Databricks. This notebook might fail when running on a Databricks cluster.\\n"
                  "Consider using "
                + local_magic_dbr_alternative[magic]
                + " instead."
            )

    def throw_if_not_supported(magic: str):
        # These are magics that are supported on dbr but not locally.
        unsupported_dbr_magics = ["%r", "%scala"]
        if magic in unsupported_dbr_magics:
            raise NotImplementedError(
                magic
                + " is not supported for local Databricks Notebooks."
            )

    def is_cell_magic(lines: List[str]):
        def get_cell_magic(lines: List[str]):
            if len(lines) == 0:
                return
            if lines[0].strip().startswith("%%"):
                return lines[0].split(" ")[0].strip()
            
        def handle(lines: List[str]):
            cell_magic = get_cell_magic(lines)
            if cell_magic is None:
                return lines
            warn_for_dbr_alternative(cell_magic)
            throw_if_not_supported(cell_magic)
            return lines

        is_cell_magic.handle = handle
        return get_cell_magic(lines) is not None

    def is_line_magic(lines: List[str]):
        def get_line_magic(lines: List[str]):
            if len(lines) == 0:
                return
            if lines[0].strip().startswith("%"):
                return lines[0].split(" ")[0].strip().strip("%")
            
        def handle(lines: List[str]):
            lmagic = get_line_magic(lines)
            if lmagic is None:
                return lines
            # get_line_magic strips the leading "%", so add it back for the lookups.
            warn_for_dbr_alternative("%" + lmagic)
            throw_if_not_supported("%" + lmagic)

            if lmagic == "md" or lmagic == "md-sandbox":
                lines[0] = (
                    "%%markdown" +
                    lines[0].partition("%" + lmagic)[2]
                )
                return lines

            if lmagic == "sql":
                lines = lines[1:]
                spark_string = (
                    "global _sqldf\\n"
                    + "_sqldf = spark.sql('''"
                    + "".join(lines).replace("'", "\\\\'")
                    + "''')\\n"
                    + "display(_sqldf)\\n"
                )
                return spark_string.splitlines(keepends=True)

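            if lmagic == "python":
                return lines[1:]

            # Leave any other line magic (for example %timeit) untouched so the
            # transformer always returns a list of lines.
            return lines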

        is_line_magic.handle = handle
        return get_line_magic(lines) is not None
        

    def parse_line_for_databricks_magics(lines: List[str]):
        if len(lines) == 0:
            return lines
        
        lines = [line for line in lines 
                    if line.strip() != "# Databricks notebook source" and \\
                    line.strip() != "# COMMAND ----------"
                ]
        lines = ''.join(lines).strip().splitlines(keepends=True)

        for magic_check in [is_cell_magic, is_line_magic]:
            if magic_check(lines):
                return magic_check.handle(lines)

        return lines

    ip = get_ipython()
    ip.input_transformers_cleanup.append(parse_line_for_databricks_magics)


def register_formatters():
    from pyspark.sql import DataFrame

    def df_html(df):
        return df.toPandas().to_html()

    html_formatter = get_ipython().display_formatter.formatters["text/html"]
    html_formatter.for_type(DataFrame, df_html)

    get_ipython().display_formatter.active_types.append('application/vnd.databricks.v1+datasetinfo')
    get_ipython().display_formatter.formatters['application/vnd.databricks.v1+datasetinfo'] = get_ipython().display_formatter.formatters['text/plain'].__class__()
    get_ipython().display_formatter.formatters['application/vnd.databricks.v1+datasetinfo'].enabled = True

register_magics()
register_formatters()
"""
 
os.makedirs(expanduser("~/.ipython/profile_default/startup"), exist_ok=True)
with open(expanduser("~/.ipython/profile_default/startup/init_script.py"), "w") as f:
    f.write(init_script)
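The `%sql` support in the startup script works by rewriting the cell text into Python before IPython runs it. The sketch below isolates that rewrite as a plain function so it can be inspected outside a kernel. `sql_cell_to_python` is a hypothetical name, and like the original it assumes the query starts on the line after `%sql`.

```python
from typing import List


def sql_cell_to_python(lines: List[str]) -> List[str]:
    """Rewrite a cell whose first line is %sql into Python source lines."""
    if not lines or lines[0].strip().split(" ")[0] != "%sql":
        return lines
    # Drop the %sql line and escape single quotes for the '''...''' literal.
    query = "".join(lines[1:]).replace("'", "\\'")
    source = (
        "global _sqldf\n"
        "_sqldf = spark.sql('''" + query + "''')\n"
        "display(_sqldf)\n"
    )
    return source.splitlines(keepends=True)


cell = ["%sql\n", "\n", "select * from samples.nyctaxi.trips limit 10"]
print("".join(sql_cell_to_python(cell)))
```

The generated code only runs inside a kernel where `spark` and `display` are defined, which is what the startup script arranges.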


## Persist environment

IDE extensions are ephemeral on disk and don't survive cluster restarts. Write `.vscode/extensions.json` so we can re-create them.

In [0]:
import os
import sys
import json

VSCODE_EXTENSIONS = [
    "ms-python.python",
    "ms-toolsai.jupyter",
    "donjayamanne.python-environment-manager",
    "databricks.databricks"
]

def persist_settings(extensions):
    os.chdir(get_notebook_dir())
    
    if not os.path.exists(".vscode/extensions.json"):
        os.makedirs(".vscode", exist_ok=True)
        with open(".vscode/extensions.json", "w") as f:
            f.write("""{
        "recommendations": %s
    }
    """ % json.dumps(extensions))

    if not os.path.exists(".vscode/settings.json"):
        os.makedirs(".vscode", exist_ok=True)
        with open(".vscode/settings.json", "w") as f:
            f.write("{}")

    # Read the current settings, then rewrite the file with the interpreter path set
    with open(".vscode/settings.json", "r") as f:
        data = json.load(f)

    data["python.defaultInterpreterPath"] = sys.executable
    with open(".vscode/settings.json", "w") as f:
        json.dump(data, f)

persist_settings(VSCODE_EXTENSIONS)
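Extensions installed by hand are lost on cluster termination unless they are listed in `.vscode/extensions.json`. A minimal sketch of adding one to the `recommendations` list is shown below. `add_recommended_extension` is a hypothetical helper, the extension ID is a placeholder, and the sketch assumes the file is plain JSON without comments.

```python
import json
import os
import tempfile


def add_recommended_extension(project_dir: str, extension_id: str) -> list:
    """Add extension_id to .vscode/extensions.json, keeping existing entries."""
    vscode_dir = os.path.join(project_dir, ".vscode")
    path = os.path.join(vscode_dir, "extensions.json")
    os.makedirs(vscode_dir, exist_ok=True)

    data = {}
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)

    recommendations = data.setdefault("recommendations", [])
    if extension_id not in recommendations:
        recommendations.append(extension_id)

    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    return recommendations


with tempfile.TemporaryDirectory() as project:
    add_recommended_extension(project, "ms-python.python")
    # "example.some-extension" is a placeholder ID, not a real extension.
    print(add_recommended_extension(project, "example.some-extension"))
```

Calling the helper twice with the same ID leaves the list unchanged, so it is safe to run on every cluster start.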

In [0]:
import os
from os.path import expanduser
import shutil

def symlink_force(source, target):
    try:
        os.remove(target)
    except FileNotFoundError:
        pass

    os.symlink(source, target, target_is_directory=False)

def persist_login_token(user_name):
    # Define the base paths
    user_base_path = f"/Workspace/Users/{user_name}/.vscode/cli"
    root_base_path_vscode = expanduser("~/.vscode/cli")
    root_base_path_vscode_insiders = expanduser("~/.vscode-insiders/cli")

    # Create directories if they don't exist
    os.makedirs(user_base_path, exist_ok=True)
    os.makedirs(root_base_path_vscode, exist_ok=True)
    os.makedirs(root_base_path_vscode_insiders, exist_ok=True)

    # Path to the token file
    token_file_path = os.path.join(user_base_path, "token.json")

    # Ensure the token file exists
    open(token_file_path, 'a').close()

    # Create symbolic links
    symlink_force(token_file_path, os.path.join(root_base_path_vscode, "token.json"))
    symlink_force(token_file_path, os.path.join(root_base_path_vscode_insiders, "token.json"))

persist_login_token(DATABRICKS_USER_NAME)
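The login token survives restarts because the CLI's `token.json` is a symlink to a file in the user's workspace folder. The sketch below shows the same link-and-replace idea in a temporary directory, with stand-in paths instead of the real workspace and home locations.

```python
import os
import tempfile


def symlink_force(source: str, target: str) -> None:
    """Create (or replace) a symlink at target that points to source."""
    try:
        os.remove(target)
    except FileNotFoundError:
        pass
    os.symlink(source, target)


with tempfile.TemporaryDirectory() as tmp:
    persistent = os.path.join(tmp, "workspace", "token.json")  # stand-in for the workspace copy
    local = os.path.join(tmp, "home", "token.json")  # stand-in for ~/.vscode/cli/token.json
    os.makedirs(os.path.dirname(persistent))
    os.makedirs(os.path.dirname(local))
    open(persistent, "a").close()  # make sure the link target exists

    symlink_force(persistent, local)
    symlink_force(persistent, local)  # running it again replaces the existing link

    with open(local, "w") as f:  # a write through the link...
        f.write('{"demo": true}')
    with open(persistent) as f:  # ...lands in the persistent file
        print(f.read())
```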

## Example Notebook

Optionally create an example notebook to use in VS Code.

In [0]:
def write_example_notebook():
    example = r"""{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use Spark\n",
    "\n",
    "* directly use `spark` global\n",
    "* Spark through Databricks Connect (just like in shared clusters)\n",
    "* set breakpoints and use step debugging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.table(\"samples.nyctaxi.trips\")\n",
    "\n",
    "display(df.limit(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DBUtils\n",
    "\n",
    "* Supports common subset of DBUtils features\n",
    "  * `dbutils.fs`\n",
    "  * `dbutils.secrets`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Files:\")\n",
    "for file in dbutils.fs.ls(\"/\")[:5]:\n",
    "    print(file.path)\n",
    "\n",
    "print()\n",
    "print(\"Secret Scopes\")\n",
    "for scope in dbutils.secrets.listScopes()[:5]:\n",
    "    print(scope.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SQL\n",
    "\n",
    "Execute SQL using `%sql`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "sql"
    }
   },
   "outputs": [],
   "source": [
    "%sql\n",
    "\n",
    "select * from samples.nyctaxi.trips limit 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GPUs (MLR only)\n",
    "\n",
    "* Leverage GPUs\n",
    "* Use ML libraries such as `pytorch` from ML Runtimes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "for i in range(torch.cuda.device_count()):\n",
    "   print(torch.cuda.get_device_properties(i).name)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pythonEnv-bea1ee52-5030-4e6d-a542-e3495d74414d",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
    """

    os.chdir(get_notebook_dir())
    if not os.path.exists("example.ipynb"):
        with open("example.ipynb", "w") as f:
            f.write(example)

if CREATE_EXAMPLE_NOTEBOOK:
    write_example_notebook()
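The example notebook above is written from a raw JSON string. As an alternative, a notebook can be assembled as a dictionary and serialized with `json.dumps`, which avoids hand-escaping quotes. This is only a sketch: `make_notebook` is a hypothetical helper, and it assumes that a notebook with empty `metadata` is acceptable to the editor.

```python
import json


def make_notebook(cells):
    """Build a minimal nbformat 4 notebook from (cell_type, source) pairs."""
    nb_cells = []
    for cell_type, source in cells:
        cell = {
            "cell_type": cell_type,
            "metadata": {},
            "source": source.splitlines(keepends=True),
        }
        if cell_type == "code":
            # Code cells additionally carry an execution count and outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    return {"cells": nb_cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 2}


notebook = make_notebook([
    ("markdown", "# SQL\n\nExecute SQL using `%sql`"),
    ("code", "%sql\n\nselect * from samples.nyctaxi.trips limit 10"),
])
text = json.dumps(notebook, indent=1)
print(text)
```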

## Start Tunnel

Start the VS Code tunnel

In [0]:
import subprocess
from os.path import expanduser

def start_tunnel(channel, timeout, provider, extensions, tunnel_name):
    if channel == "insider":
        cli = expanduser("~/code/code-insiders")
    else:
        cli = expanduser("~/code/code")

    # Check if the user is logged in
    user_status = subprocess.getoutput(f"{cli} tunnel user show")
    if user_status == "not logged in":
        subprocess.run(
            [cli, "tunnel", "user", "login", "--provider", provider], check=True
        )

    # Prepare the extensions argument
    ext_args = []
    for ext in extensions:
        ext_args.extend(["--install-extension", ext])

    # Kill the tunnel after a specified duration
    print(f"Killing tunnel after {timeout}")

    # Run the subprocess and forward stdout and stderr
    command = [
        "timeout",
        f"{timeout}",
        cli,
        "tunnel",
        *ext_args,
        "--accept-server-license-terms",
        "--name",
        tunnel_name,
    ]

    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )

    # Forward stdout and stderr to the main process
    for line in process.stdout:
        print(line, end="")  # Forward stdout line to main process stdout

    print("Tunnel closed!")

start_tunnel(CHANNEL, DURATION, PROVIDER, VSCODE_EXTENSIONS, DATABRICKS_ENV_NAME)

Killing tunnel after 30m
*
* Visual Studio Code Server
*
* By using the software, you agree to
* the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
* the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
*
[2025-01-17 14:45:34] warn Command-line options will not be applied until the existing tunnel exits.
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 35267 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 34373 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 33651 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 35059 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 44269 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 40201 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 45337 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding port 41381 (public=false)
[2025-01-17 14:45:34] info [rpc.0] Forwarding p

com.databricks.backend.common.rpc.CommandCancelledException
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$5(SequenceExecutionState.scala:136)
	at scala.Option.getOrElse(Option.scala:189)
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$3(SequenceExecutionState.scala:136)
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$3$adapted(SequenceExecutionState.scala:133)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at com.databricks.spark.chauffeur.SequenceExecutionState.cancel(SequenceExecutionState.scala:133)
	at com.databricks.spark.chauffeur.ExecContextState.cancelRunningSequence(ExecContextState.scala:728)
	at com.databricks.spark.chauffeur.ExecContextState.$anonfun$cancel$1(ExecContextState.scala:446)
	at scala.Option.getOrElse(Option.scala:189)
	at com.databricks.spark.chauffeur.ExecContextState.cancel(ExecContextState.scala:446)
	at com.databricks.spark.chauffeur.ExecutionContextManagerV1.can