# Back Up Your Databricks Workflows 🗃

## Requirements

### Databricks

* A Databricks Workspace & Workspace Access Token
* At least one runnable cluster within the workspace


### Parameters

| Parameter Name | Parameter Description | Example Value |
| --- | --- | --- |
| `backup_file_path` | The file path (prefix) to the destination where the backup file will be stored. **Don't include filename in path**. | `s3://my-databricks-backups/jobs` |
| `workspace_token` | The workspace token used to interface with the Databricks REST API | `dapi****` |
| `workspace_url` | The URL of the workspace where the jobs exist, and for which the token was generated | `https://dbc-***.cloud.databricks.com` |


### Steps

#### Fetch Job Configurations

This step fetches every workflow in your workspace. Each fetched workflow config also contains the config of its individual tasks and their respective job cluster configs. See the [Databricks API documentation](https://docs.databricks.com/api/workspace/jobs/list).

#### Parse Information 

This step parses the fetched config. The main thing to keep in mind is that the cluster config returned in step 1 contains some fields that are only populated after the cluster is initialized. These fields must be removed, otherwise creating a workflow from the same config later will throw an error. You can also add custom logic here. For example, you can attach a webhook notification ID to a workflow, or associate an existing all-purpose compute cluster with a workflow.
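The cleanup and custom logic described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's own code: `strip_cluster_fields` and `attach_failure_webhook` are hypothetical helper names, the default field list only mirrors the `aws_attributes` removal performed later in this notebook, and the `webhook_notifications` / `on_failure` shape is an assumption about the Jobs API payload that should be checked against the API reference.

```python
import copy


def strip_cluster_fields(job_settings: dict, fields=("aws_attributes",)) -> dict:
    """Return a copy of the job settings with the given cluster fields removed."""
    cleaned = copy.deepcopy(job_settings)  # leave the caller's dict untouched
    for job_cluster in cleaned.get("job_clusters", []):
        new_cluster = job_cluster.get("new_cluster") or {}
        for field in fields:
            new_cluster.pop(field, None)  # no error if the field is absent
    return cleaned


def attach_failure_webhook(job_settings: dict, notification_id: str) -> dict:
    """Return a copy of the job settings with a failure webhook appended.

    Assumption: webhooks live under webhook_notifications.on_failure as a
    list of {"id": ...} entries.
    """
    updated = copy.deepcopy(job_settings)
    hooks = updated.setdefault("webhook_notifications", {})
    hooks.setdefault("on_failure", []).append({"id": notification_id})
    return updated
```

Both helpers return copies, so the originally fetched config stays available for comparison.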

#### Save Configuration to JSON 💾

Finally, the config is saved to a file. If you have a mounted S3 bucket or Azure Data Lake Storage, you can directly specify the path and `dbutils` will take care of the rest. If you are running the notebook locally, you will need to change the code and use Python's built-in `open` function instead.
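For the local case, a minimal sketch using the built-in `open` might look like the following. `save_backup_locally` is a hypothetical helper name, and the dated file name only imitates the `YYYYMMDD.json` naming used later in this notebook.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_backup_locally(jobs: dict, backup_dir: str) -> Path:
    """Write the jobs dict to <backup_dir>/<YYYYMMDD>.json and return the path."""
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    file_name = datetime.now(timezone.utc).strftime("%Y%m%d") + ".json"
    target = target_dir / file_name
    # Mode "x" refuses to overwrite an existing file, similar in spirit to
    # overwrite=False in the dbutils call used by the notebook.
    with open(target, "x", encoding="utf-8") as handle:
        json.dump(jobs, handle, indent=2)
    return target
```

Note that JSON object keys are always strings, so integer job IDs come back as string keys when the file is read again.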

### Imports

In [0]:
from collections import defaultdict
from datetime import datetime
import json
import re
from typing import Optional, Callable

import requests

## Inputs


In [0]:
dbutils.widgets.removeAll()

dbutils.widgets.text("workspace_url", "")
dbutils.widgets.text("workspace_token", "")
dbutils.widgets.text("backup_file_path", "")

In [0]:
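# dbutils.widgets.get is the current replacement for the older getArgument helper.
workspace_url: str = dbutils.widgets.get("workspace_url")
workspace_token: str = dbutils.widgets.get("workspace_token")
backup_file_path: str = dbutils.widgets.get("backup_file_path")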

query_params = {
    "LIST_JOBS_LIMIT": 100,  # max limit
    "EXPAND_TASKS": "true",  # provides the complete config info for each job
}


## Utility Functions

In [0]:
def paginate(
    can_paginate: bool,
    next_page_token: Optional[str],
    url: str,
    workspace_token: str,
    function_to_call: Callable,
) -> None:
    """
    Paginates to the next page if possible
    input:
        can_paginate [bool]: Whether there are more pages to fetch.
        next_page_token [str]: Token needed in url query param to paginate to next page.
        url [str]: Url used to list the needed info.
        workspace_token [str]: Databricks workspace access token.
        function_to_call [Callable]: Function that gets called with the paginated url to paginate further.
    output:
        None
    """

    if next_page_token and can_paginate:
        if "&page_token" in url:
            url = f"{url[:url.find('&page_token')]}&page_token={next_page_token}"
        else:
            url = f"{url}&page_token={next_page_token}"

        function_to_call(url, workspace_token)
    else:
        return

In [0]:
def getAllJobs(list_jobs_url: str, workspace_token: str) -> None:
    """
    Fetches all the jobs and metadata about them.
    input:
        list_jobs_url [str]: Databricks API used to fetch all the jobs.
        workspace_token [str]: Databricks workspace access token.
    output:
        None. Results are stored in the module-level `jobs` dict.
    """
    response = requests.get(
        list_jobs_url,
        headers={"Authorization": f"Bearer {workspace_token}"},
    )
    response.raise_for_status()  # fail fast on any HTTP error response

    response_data = response.json()

    for job in response_data.get("jobs", []):
        jobs[job.get("job_id")] = job.get("settings")

    paginate(
        can_paginate=response_data.get("has_more", False),
        next_page_token=response_data.get("next_page_token"),
        url=list_jobs_url,
        workspace_token=workspace_token,
        function_to_call=getAllJobs,
    )
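The recursive `paginate` / `getAllJobs` pair can also be written as a single loop. The sketch below is an alternative, not the notebook's method: `collect_jobs` and `fetch_page` are hypothetical names, and `fetch_page` is assumed to take a page token (or `None` for the first page) and return the decoded JSON response.

```python
from typing import Callable, Optional


def collect_jobs(fetch_page: Callable[[Optional[str]], dict]) -> dict:
    """Collect job settings keyed by job_id across all pages."""
    jobs = {}
    page_token = None  # None requests the first page
    while True:
        page = fetch_page(page_token)
        for job in page.get("jobs", []):
            jobs[job["job_id"]] = job.get("settings")
        page_token = page.get("next_page_token")
        # Stop when the API reports no more pages or gives no token.
        if not (page.get("has_more") and page_token):
            return jobs
```

In the notebook, `fetch_page` could wrap the `requests.get` call from `getAllJobs` and pass the token as the `page_token` query parameter. A loop avoids deep recursion in workspaces with many pages and returns the result instead of writing to a global.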

## List workflows 

Fetches all workflows in the current workspace along with their respective configs.

<a href="https://docs.databricks.com/api/workspace/jobs/list">API Docs</a>

In [0]:
jobs: dict[int, dict] = {}  # holds all jobs' info

list_jobs_url = str(
    workspace_url
    + "/api/2.1/jobs/list?"
    + f"limit={query_params.get('LIST_JOBS_LIMIT')}"
    + f"&expand_tasks={query_params.get('EXPAND_TASKS')}"
)

getAllJobs(list_jobs_url=list_jobs_url, workspace_token=workspace_token)
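As a possible alternative to concatenating the query string by hand, the URL can be assembled with `urllib.parse.urlencode`. `build_list_jobs_url` is a hypothetical helper; the endpoint path and the two parameters are the ones used above.

```python
from urllib.parse import urlencode


def build_list_jobs_url(workspace_url: str, limit: int = 100, expand_tasks: bool = True) -> str:
    """Build the jobs list URL, tolerating a trailing slash in workspace_url."""
    query = urlencode({"limit": limit, "expand_tasks": str(expand_tasks).lower()})
    return f"{workspace_url.rstrip('/')}/api/2.1/jobs/list?{query}"
```

Stripping the trailing slash guards against a `workspace_url` widget value such as `https://dbc-***.cloud.databricks.com/`, which would otherwise produce a double slash in the path.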

## Parse the fetched data

This is needed because the cluster config in each task contains some properties specific to the current workspace, which are populated after cluster initialization and therefore need to be removed.

In [0]:
def parseJobs(job_info: dict) -> dict:
    """
    Modifies the job config.
    input:
        job_info [dict]: Dict containing all the config info about the job.
    output:
        dict : parsed result in accordance with the `create job` api payload.
    """

    # The same cleanup applies to a standalone cluster config payload.
    for cluster_info in job_info.get("job_clusters", []):
        # pop with a default avoids errors when the key or cluster is missing
        cluster_info.get("new_cluster", {}).pop("aws_attributes", None)

    return job_info

In [0]:
for job_id in jobs.keys():
    jobs[job_id] = parseJobs(jobs[job_id])
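Since the parsed settings are meant to match the create-job payload, a later restore could start from something like the sketch below. `build_create_requests` is a hypothetical helper, and the `/api/2.1/jobs/create` path is assumed by analogy with the list endpoint used above; verify it against the Jobs API reference before relying on it.

```python
import json


def build_create_requests(backup_json: str, workspace_url: str) -> list:
    """Turn a backup JSON string into (url, payload) pairs, one per job."""
    # Assumption: the create endpoint sits next to the list endpoint.
    create_url = workspace_url.rstrip("/") + "/api/2.1/jobs/create"
    backup = json.loads(backup_json)  # keys are the original job IDs as strings
    return [(create_url, settings) for settings in backup.values()]
```

Each pair could then be sent with `requests.post(url, json=payload, headers={"Authorization": f"Bearer {workspace_token}"})`. The original job IDs are not reused, because the target workspace assigns new ones.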


## Backup Job Config

Write the resulting config JSON to the storage location of your choice.

In [0]:
assert len(jobs) > 0, "No Jobs Found"

In [0]:
# Backup file name is the current UTC date, e.g. 20240131.json
backup_file_path_modded: str = f"{backup_file_path}/{datetime.utcnow().strftime('%Y%m%d')}.json"
backup_file_path_modded

In [0]:
# overwrite=False prevents clobbering an existing backup for the same date
store_flag: bool = dbutils.fs.put(
    backup_file_path_modded, json.dumps(jobs), overwrite=False
)

if not store_flag:
    raise ValueError("Unable to Write Jobs Backup")