

# Requirements
## Databricks
* A Databricks Workspace & Workspace Access Token
* At least one runnable cluster within the workspace

## Packages
This process relies on the `pandas` package for data manipulation.
* <a href="https://pypi.org/project/pandas/">pandas</a>



### Import necessary libs

In [None]:
import json

import requests
from typing import Optional, Callable
import re
from collections import defaultdict

## Steps 📊


### 1. Fetch Job Configurations 📬

We fetch every workflow in your workspace. Each fetched workflow config also contains the config of its individual tasks and their respective job cluster configs. See the [Databricks API documentation](https://docs.databricks.com/api/workspace/jobs/list).
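
As a rough illustration of the fetch step, the sketch below builds the list URL and follows `next_page_token` in a loop rather than recursively. The names `build_list_jobs_url`, `iter_jobs`, and `fetch_page` are hypothetical and not part of this notebook; `fetch_page` stands in for whatever performs the authenticated GET and returns the decoded JSON.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def build_list_jobs_url(workspace_url: str, page_token: Optional[str] = None) -> str:
    """Build the jobs/list URL with the query parameters used in this notebook."""
    params = {"limit": 100, "expand_tasks": "true"}
    if page_token:
        params["page_token"] = page_token
    return f"{workspace_url.rstrip('/')}/api/2.1/jobs/list?{urlencode(params)}"


def iter_jobs(workspace_url: str, fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every job, requesting pages until no next_page_token is returned."""
    page_token = None
    while True:
        page = fetch_page(build_list_jobs_url(workspace_url, page_token))
        yield from page.get("jobs", [])
        page_token = page.get("next_page_token")
        if not page_token:
            break
```

In the notebook, `fetch_page` could be something like `lambda url: requests.get(url, headers={"Authorization": f"Bearer {workspace_token}"}).json()`. Passing it in as an argument keeps the paging logic easy to exercise without a live workspace.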

### 2. Parse Information 🧩

In this step we parse the fetched config. The cluster config contains some fields that are populated only after the cluster is initialized, yet step 1 fetches them anyway. These fields must be removed; otherwise, reusing the config to create the workflow later throws an error. You can also add custom logic here. For example, you can attach a webhook notification ID to a workflow of your choice, or associate an existing all-purpose compute cluster with a workflow.
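
The custom logic mentioned above might look like the following sketch. The helper `customize_job` is hypothetical, and the field names it touches (`webhook_notifications`, `on_failure`, `existing_cluster_id`, `job_cluster_key`, `new_cluster`) reflect my reading of the Jobs API 2.1 payload, so verify them against the API documentation before relying on this.

```python
from copy import deepcopy
from typing import Optional


def customize_job(
    settings: dict,
    webhook_id: Optional[str] = None,
    existing_cluster_id: Optional[str] = None,
) -> dict:
    """Return a modified copy of one job's settings; the input is left untouched."""
    updated = deepcopy(settings)

    if webhook_id:
        # Assumed shape: {"webhook_notifications": {"on_failure": [{"id": ...}]}}
        notifications = updated.setdefault("webhook_notifications", {})
        on_failure = notifications.setdefault("on_failure", [])
        if {"id": webhook_id} not in on_failure:
            on_failure.append({"id": webhook_id})

    if existing_cluster_id:
        # Point every task at the existing cluster instead of a job cluster.
        for task in updated.get("tasks", []):
            task.pop("job_cluster_key", None)
            task.pop("new_cluster", None)
            task["existing_cluster_id"] = existing_cluster_id

    return updated
```

Working on a deep copy means the fetched settings stay available for comparison after the changes are applied.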

### 3. Save Configuration to JSON 💾

We then save the config to a file. If you have a mounted S3 bucket or Azure Data Lake Storage, you can specify the path directly and `dbutils` takes care of the rest. If you are running the notebook locally, change the code to use Python's built-in `open` function instead.
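
For the local case, a minimal sketch using the built-in `open` function could look like this. The helper name `save_jobs_locally` is hypothetical and not part of the notebook.

```python
import json
from pathlib import Path


def save_jobs_locally(jobs: dict, path: str) -> Path:
    """Write the jobs dict as JSON to a local file, creating parent folders as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "w", encoding="utf-8") as fh:
        json.dump(jobs, fh, indent=2)
    return target
```

Note that JSON object keys are always strings, so integer job IDs used as dict keys come back as strings when the file is read again.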


### Set up workspace URLs and access tokens


In [None]:
dbutils.widgets.removeAll()

dbutils.widgets.text("workspace_url", "")
workspace_url: str = getArgument("workspace_url")

dbutils.widgets.text("workspace_token", "")
workspace_token: str = getArgument("workspace_token")

dbutils.widgets.text("backup_file_path", "./databricks_jobs_config.json")
backup_file_path: str = getArgument("backup_file_path")

query_params = {
    "LIST_JOBS_LIMIT": 100,  # max limit
    "EXPAND_TASKS": "true",  # provides the complete config info for each job
}

In [None]:
def paginate(
    can_paginate: bool,
    next_page_token: Optional[str],
    url: str,
    workspace_token: str,
    function_to_call: Callable,
) -> None:
    """
    Paginates to the next page if possible.
    input:
        can_paginate [bool]: Whether the API reported that more results are available.
        next_page_token [str]: Token needed in the URL query param to paginate to the next page.
        url [str]: URL used to list the needed info.
        workspace_token [str]: Databricks workspace access token.
        function_to_call [Callable]: Function that gets called with the paginated URL to paginate further.
    output:
        None
    """

    # Stop when the API reports no further pages or supplies no token.
    if not can_paginate or not next_page_token:
        return

    # Replace any page token already present in the URL with the new one.
    if "&page_token" in url:
        url = url[: url.find("&page_token")]
    url = f"{url}&page_token={next_page_token}"

    # Use the supplied callback rather than a hard-coded function name.
    function_to_call(url, workspace_token)

## List workflows 
#### Fetches all workflows in the current workspace and their respective configs
<a href="https://docs.databricks.com/api/workspace/jobs/list">API Docs</a>


In [1]:
def getAllJobs(list_jobs_url: str, workspace_token: str) -> None:
    """
    Fetches all the jobs and metadata about them.
    input:
        list_jobs_url [str]: Databricks API URL used to fetch all the jobs.
        workspace_token [str]: Databricks workspace access token.
    output:
        None
    """

    response = requests.get(
        list_jobs_url,
        headers={"Authorization": f"Bearer {workspace_token}"},
    )
    response.raise_for_status()  # raise on 4xx/5xx responses instead of relying on assert

    response_data = response.json()

    for job in response_data.get("jobs", []):
        jobs[job.get("job_id")] = job.get("settings")

    paginate(
        response_data.get("has_more", False),
        response_data.get("next_page_token"),
        list_jobs_url,
        workspace_token,
        getAllJobs,
    )


import copy

jobs = {}  # holds all jobs' info
List_jobs_url = f"{workspace_url}/api/2.1/jobs/list?limit={query_params.get('LIST_JOBS_LIMIT')}&expand_tasks={query_params.get('EXPAND_TASKS')}"
getAllJobs(List_jobs_url, workspace_token)
# Deep copy so that later in-place parsing does not alter the original snapshot.
jobs_original = copy.deepcopy(jobs)

## Parse the fetched data

#### This is needed because the cluster config in each task contains properties specific to the current workspace that are populated after cluster initialization, so they must be removed.


In [None]:
def parseJobs(job_info: dict) -> dict:
    """
    Modifies the job config.
    input:
        job_info [dict]: Dict containing all the config info about the job.
    output:
        dict : parsed result in accordance with the `create job` API payload.
    """

    # The same parsing applies to a standalone cluster config payload.
    for cluster_info in job_info.get("job_clusters", []):
        # `new_cluster` may be missing; fall back to an empty dict and pop safely.
        (cluster_info.get("new_cluster") or {}).pop("aws_attributes", None)

    return job_info


for job_id in jobs.keys():
    jobs[job_id] = parseJobs(jobs[job_id])

### Write to disk
#### Write the resulting JSON config to a storage location of your choice

In [2]:
dbutils.fs.put(backup_file_path, json.dumps(jobs), overwrite=True)