# Workspace Migration ✈️

## Requirements

### Databricks

* Two Databricks workspaces and a workspace access token for each
* At least one runnable cluster within any workspace

> Note: The words `job` and `workflow` are used interchangeably throughout.

In [None]:
import json

import requests
from typing import Optional, Callable

## Steps 📊


### 1. Fetch workflow / cluster configurations 📬

We fetch all the workflows and clusters present in your workspace. Each fetched workflow config also contains the config of every task in the workflow and the corresponding job cluster configs.  

### 2. Parse Information 🧩

In this step we parse the fetched config info. The main thing to keep in mind is that the cluster config returned in step 1 includes fields that are only populated after the cluster is initialized. These fields must be removed; otherwise, reusing the same config to create the cluster or workflow later will throw an error. You can also add custom logic here. For example, you can attach a webhook notification ID to a workflow, or associate an existing all-purpose compute cluster with a workflow.  
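
The field removal described above can be sketched as a small helper. This is a minimal sketch, not the notebook's own implementation: `strip_runtime_fields` and `RUNTIME_ONLY_FIELDS` are hypothetical names, and the listed field names are an assumption about what a cluster listing returns, so check them against a real response from your workspace.

```python
from copy import deepcopy

# Assumption: fields that the API fills in once a cluster exists.
# Adjust this set after inspecting an actual response from your workspace.
RUNTIME_ONLY_FIELDS = {
    "cluster_id",
    "state",
    "state_message",
    "start_time",
    "terminated_time",
    "last_state_loss_time",
    "creator_user_name",
    "default_tags",
}


def strip_runtime_fields(cluster_info: dict, extra_fields: frozenset = frozenset()) -> dict:
    """Return a copy of the config without runtime-populated fields."""
    to_drop = RUNTIME_ONLY_FIELDS | set(extra_fields)
    cleaned = deepcopy(cluster_info)  # leave the fetched config untouched
    for field in to_drop:
        cleaned.pop(field, None)
    return cleaned


example = {"cluster_id": "abc", "cluster_name": "etl", "state": "TERMINATED", "num_workers": 2}
print(strip_runtime_fields(example))
```

Working on a copy keeps the fetched config available for comparison if a create call is rejected later.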

### 3. Create new workflow / cluster 👶🏽

Using the parsed info we create workflows/clusters in the new workspace.
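
One way to approach this step is to collect failures instead of stopping at the first one, so that a single rejected config does not block the rest. The sketch below assumes a hypothetical helper, `create_all`, that receives the HTTP call as a parameter (`post`); this lets it be tried out without a workspace. The URL and token in the example are placeholders.

```python
import json
from typing import Callable, Dict, List, Tuple


def create_all(
    configs: List[dict],
    create_url: str,
    token: str,
    post: Callable[..., Tuple[int, str]],
) -> List[Tuple[str, int, str]]:
    """POST every config and return (label, status_code, body) for each failure."""
    headers: Dict[str, str] = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    failures = []
    for config in configs:
        status, body = post(create_url, headers=headers, data=json.dumps(config))
        if status not in (200, 201):
            # Job settings use "name"; cluster configs use "cluster_name".
            label = config.get("name") or config.get("cluster_name") or "<unnamed>"
            failures.append((label, status, body))
    return failures


def fake_post(url: str, headers: dict, data: str) -> Tuple[int, str]:
    """Stand-in for the real HTTP call: rejects configs without a name."""
    payload = json.loads(data)
    return (200, "{}") if payload.get("name") else (400, "missing name")


print(create_all([{"name": "nightly"}, {}], "https://example.invalid/api/2.1/jobs/create", "token", fake_post))
```

In the notebook, `post` could be a thin wrapper that calls `requests.post` and returns `(response.status_code, response.text)`.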


### Set up workspace urls and access tokens
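
Widget values are free text, so it can help to normalise them before building API URLs; a trailing slash, for instance, would yield `//api/...` once a path is appended. The sketch below is illustrative only: `normalize_workspace_url` and `auth_headers` are hypothetical helper names, and the example URL is a placeholder.

```python
def normalize_workspace_url(raw_url: str) -> str:
    """Trim whitespace and trailing slashes; default to https when no scheme is given."""
    url = raw_url.strip().rstrip("/")
    if not url:
        raise ValueError("workspace URL is empty")
    if not url.startswith(("http://", "https://")):
        url = f"https://{url}"
    return url


def auth_headers(token: str) -> dict:
    """Build the bearer-token header used by the REST calls."""
    token = token.strip()
    if not token:
        raise ValueError("workspace token is empty")
    return {"Authorization": f"Bearer {token}"}


base = normalize_workspace_url(" https://example.cloud.databricks.com/ ")
print(f"{base}/api/2.1/jobs/list")
```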


In [None]:
dbutils.widgets.removeAll()

dbutils.widgets.text("old_workspace_url", "")
old_workspace_url: str = getArgument("old_workspace_url")

dbutils.widgets.text("old_workspace_token", "")
old_workspace_token: str = getArgument("old_workspace_token")

dbutils.widgets.text("new_workspace_url", "")
new_workspace_url: str = getArgument("new_workspace_url")

dbutils.widgets.text("new_workspace_token", "")
new_workspace_token: str = getArgument("new_workspace_token")


query_params = {
    "LIST_JOBS_LIMIT": 100,  # max limit
    "EXPAND_TASKS": "true",  # provides the complete config info for each job
}

In [None]:
def paginate(
    can_paginate: bool,
    next_page_token: Optional[str],
    url: str,
    workspace_token: str,
    function_to_call: Callable,
) -> None:
    """
    Paginates to the next page if possible
    input:
        can_paginate [bool]: Whether more results are available.
        next_page_token [str]: Token passed as the `page_token` query param to fetch the next page.
        url [str]: Url used to list the needed info.
        workspace_token [str]: Databricks workspace access token.
        function_to_call [Callable]: Function that gets called with the paginated url to paginate further.
    output:
        None
    """

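    if next_page_token and can_paginate:
        # Drop the page_token left over from the previous call before adding the new one.
        if "&page_token" in url:
            url = url[: url.find("&page_token")]
        elif "?page_token" in url:
            url = url[: url.find("?page_token")]

        # Use "?" when the url has no query string yet (e.g. the clusters list url).
        separator = "&" if "?" in url else "?"
        url = f"{url}{separator}page_token={next_page_token}"

        function_to_call(url, workspace_token)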

## List Clusters 
#### Fetches all clusters in the current workspace and their respective configs
<a href="https://docs.databricks.com/api/workspace/clusters/list">API Docs</a>


In [None]:
def getAllClusters(list_clusters_url: str, workspace_token: str) -> None:
    """
    Fetches all the clusters and metadata about them.
    input:
        list_clusters_url [str]: Databricks API used to fetch all the clusters.
        workspace_token [str]: Databricks workspace access token.
    output:
        None
    """

    response = requests.get(
        list_clusters_url,
        headers={"Authorization": f"Bearer {workspace_token}"},
    )
    assert response.status_code == 200

    response_data = response.json()

    for cluster_info in response_data.get("clusters", []):
        clusters.append(cluster_info)

    paginate(
        response_data.get("has_more", False),
        response_data.get("next_page_token"),
        list_clusters_url,
        workspace_token,
        getAllClusters,
    )


clusters = []  # holds all clusters' info
List_clusters_url = str(old_workspace_url + "/api/2.0/clusters/list")
getAllClusters(List_clusters_url, old_workspace_token)

## Filter and Parse info

In [None]:
def filterClusters(cluster_info: dict) -> bool:
    """Filter clusters based on custom logic"""
    return True


def parseClusters(cluster_info: dict) -> dict:
    """Modifies the cluster config.
    input:
        cluster_info [dict]: Dict containing all the config info about the cluster.
    output:
        dict : parsed result in accordance with the `create cluster` api payload."""
    if cluster_info.get("aws_attributes"):
        cluster_info.pop("aws_attributes")
    if cluster_info.get("cluster_id"):
        cluster_info.pop("cluster_id")

    # add more custom parsing logic if needed
    return cluster_info


filtered_clusters = []

# filter
for cluster_info in clusters:
    if filterClusters(cluster_info):
        filtered_clusters.append(cluster_info)

# parse
for idx in range(len(filtered_clusters)):
    cluster_info = filtered_clusters[idx]
    parsed_cluster_info = parseClusters(cluster_info)
    filtered_clusters[idx] = parsed_cluster_info

clusters = filtered_clusters

## Create new cluster
#### Use the parsed info as payload to create clusters in the new workspace
<a href="https://docs.databricks.com/api/workspace/clusters/create">API Docs</a>



In [None]:
for cluster_info in clusters:
    response = requests.post(
        f"{new_workspace_url}/api/2.0/clusters/create",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {new_workspace_token}",
        },
        data=json.dumps(cluster_info),
    )
    assert response.status_code in {
        200,
        201,
    }

## List Workflows 
#### Fetches all workflows in the current workspace and their respective configs
<a href="https://docs.databricks.com/api/workspace/jobs/list">API Docs</a>
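
The same paging can also be written as a loop rather than a recursive call. The sketch below is one possible shape, under stated assumptions: `iter_jobs` and `fetch_json` are hypothetical names, `fetch_json` stands in for an authenticated GET that returns the decoded JSON body, and the two fake pages replace real API responses.

```python
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode


def iter_jobs(
    base_url: str,
    fetch_json: Callable[[str], dict],
    limit: int = 100,
    expand_tasks: bool = True,
) -> Iterator[dict]:
    """Yield the settings of each job, following next_page_token until exhausted."""
    page_token = None
    while True:
        params: Dict[str, str] = {
            "limit": str(limit),
            "expand_tasks": str(expand_tasks).lower(),
        }
        if page_token:
            params["page_token"] = page_token
        page = fetch_json(f"{base_url}/api/2.1/jobs/list?{urlencode(params)}")
        for job in page.get("jobs", []):
            yield job.get("settings", {})
        page_token = page.get("next_page_token")
        if not (page.get("has_more") and page_token):
            break


# Two fake pages stand in for real API responses.
pages = {
    None: {"jobs": [{"settings": {"name": "a"}}], "has_more": True, "next_page_token": "p2"},
    "p2": {"jobs": [{"settings": {"name": "b"}}], "has_more": False},
}


def fake_fetch(url: str) -> dict:
    """Return the fake page that matches the page_token in the url, if any."""
    token = url.split("page_token=")[1] if "page_token=" in url else None
    return pages[token]


print([job["name"] for job in iter_jobs("https://example.invalid", fake_fetch)])
```

Building the query string with `urlencode` avoids hand-editing the URL between pages, and a loop does not grow the call stack when a workspace holds many pages of jobs.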


In [None]:
def getAllJobs(list_jobs_url: str, workspace_token: str) -> None:
    """
    Fetches all the jobs and metadata about them.
    input:
        list_jobs_url [str]: Databricks API used to fetch all the jobs.
        workspace_token [str]: Databricks workspace access token.
    output:
        None
    """

    response = requests.get(
        list_jobs_url,
        headers={"Authorization": f"Bearer {workspace_token}"},
    )
    assert response.status_code == 200

    response_data = response.json()

    for job in response_data.get("jobs", []):
        jobs.append(job.get("settings"))

    paginate(
        response_data.get("has_more", False),
        response_data.get("next_page_token"),
        list_jobs_url,
        workspace_token,
        getAllJobs,
    )


jobs = []  # holds all jobs' info
List_jobs_url = str(
    old_workspace_url
    + "/api/2.1/jobs/list?"
    + f"limit={query_params.get('LIST_JOBS_LIMIT')}"
    + f"&expand_tasks={query_params.get('EXPAND_TASKS')}"
)
getAllJobs(List_jobs_url, old_workspace_token)

## Filter and Parse info
#### Some of the parsing we can do 
1. Add a new webhook notification ID
2. Attach an existing all-purpose compute cluster to the workflow
3. Tag an existing task if the new task (from the workflow) depends on it
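
The first two options above could look roughly like the sketch below. It is a hedged illustration, not part of the original notebook: `customize_workflow` is a hypothetical helper, the IDs are placeholders, and the layout of `webhook_notifications` and the task-level `existing_cluster_id` key are assumptions about the job settings payload that should be checked against the Jobs API reference.

```python
from copy import deepcopy
from typing import Optional


def customize_workflow(
    workflow_info: dict,
    failure_webhook_id: Optional[str] = None,
    existing_cluster_id: Optional[str] = None,
) -> dict:
    """Return a copy of the job settings with optional webhook and cluster changes."""
    updated = deepcopy(workflow_info)

    if failure_webhook_id:
        # Assumption: notification destinations are referenced by ID under
        # webhook_notifications.on_failure in the job settings.
        webhooks = updated.setdefault("webhook_notifications", {})
        targets = webhooks.setdefault("on_failure", [])
        if {"id": failure_webhook_id} not in targets:
            targets.append({"id": failure_webhook_id})

    if existing_cluster_id:
        for task in updated.get("tasks", []):
            # Only re-point tasks that currently run on a job cluster.
            if "job_cluster_key" in task or "new_cluster" in task:
                task.pop("job_cluster_key", None)
                task.pop("new_cluster", None)
                task["existing_cluster_id"] = existing_cluster_id
        # No task references the job clusters any more, so drop them.
        updated.pop("job_clusters", None)

    return updated


job = {
    "name": "nightly",
    "job_clusters": [{"job_cluster_key": "main", "new_cluster": {}}],
    "tasks": [{"task_key": "load", "job_cluster_key": "main"}],
}
print(customize_workflow(job, failure_webhook_id="hook-1", existing_cluster_id="cluster-1"))
```

A helper like this could be called from inside `parseWorkflows`, with the IDs taken from the new workspace.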

In [None]:
def filterWorkflows(workflow_info: dict) -> bool:
    """Filter Workflow based on custom logic"""
    return True


def parseWorkflows(workflow_info: dict) -> dict:
    """Modifies the workflow config.
    input:
        workflow_info [dict]: Dict containing all the config info about the workflow.
    output:
        dict : parsed result in accordance with the `create job` api payload."""
    for cluster_info in workflow_info.get(
        "job_clusters", []
    ):  # the same parsing applies to the cluster config payload.
        # `new_cluster` may be missing, so fall back to an empty dict.
        cluster_info.get("new_cluster", {}).pop("aws_attributes", None)

    # add more custom parsing logic if needed
    return workflow_info


filtered_jobs = []

# filter
for workflow_info in jobs:
    if filterWorkflows(workflow_info):
        filtered_jobs.append(workflow_info)

# parse
for idx in range(len(filtered_jobs)):
    workflow_info = filtered_jobs[idx]
    parsed_workflow_info = parseWorkflows(workflow_info)
    filtered_jobs[idx] = parsed_workflow_info

jobs = filtered_jobs

## Create Workflow
#### Use the parsed info to create workflow in new workspace
<a href="https://docs.databricks.com/api/workspace/jobs/create">API Docs</a>


In [None]:
for workflow_info in jobs:
    response = requests.post(
        url=f"{new_workspace_url}/api/2.1/jobs/create",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {new_workspace_token}",
        },
        data=json.dumps(workflow_info),
    )
    assert response.status_code in {
        200,
        201,
    }