# Back Up Your Databricks Workflows 🗃

## Requirements

### Databricks

* At least one runnable cluster within the workspace


### Parameters

| Parameter Name | Parameter Description | Example Value |
| --- | --- | --- |
| `backup_file_path` | The path prefix of the destination where the backup file will be stored. **Do not include a filename in the path**; the notebook appends `<YYYYMMDD>.json` itself. | `s3://my-databricks-backups/jobs` |


### Steps

#### Fetch Job Configurations

We fetch all the workflows present in your workspace. Each fetched workflow config also contains the config of every task in the workflow and the respective job cluster configs. See the [Databricks API documentation](https://docs.databricks.com/api/workspace/jobs/list).

#### Parse Information 

In this step we parse the fetched config info. Keep in mind that the cluster config returned in step 1 contains some fields that are populated only after the cluster is initialized. These fields must be removed; otherwise, creating a workflow from the same config later will throw an error. You can also add custom logic here. For example, you can attach a webhook notification ID to a workflow, or associate an existing all-purpose compute cluster with a workflow.
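
As a minimal sketch of the removal step, the helper below drops a configurable set of top-level keys from a cluster config dictionary. The function name `strip_keys`, the key names in `RUNTIME_ONLY_KEYS`, and the sample values are illustrative assumptions, not a list confirmed by the Databricks API; replace them with the fields your workspace actually rejects.

```python
from copy import deepcopy

# Hypothetical keys to drop; adjust to the fields your workspace rejects.
RUNTIME_ONLY_KEYS = {"cluster_id", "state", "start_time"}


def strip_keys(cluster_config: dict, keys_to_drop: set[str]) -> dict:
    """Return a copy of cluster_config without the listed top-level keys."""
    cleaned = deepcopy(cluster_config)
    for key in keys_to_drop:
        # pop with a default so missing keys are silently ignored
        cleaned.pop(key, None)
    return cleaned


# Made-up cluster config used only for illustration.
example_config = {
    "spark_version": "13.3.x-scala2.12",
    "num_workers": 2,
    "cluster_id": "abc",
    "state": "RUNNING",
}
cleaned_config = strip_keys(example_config, RUNTIME_ONLY_KEYS)
```

Returning a copy instead of mutating in place keeps the originally fetched config available for comparison or debugging.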

#### Save Configuration to JSON 💾

Finally, we save the config to a file. If you have a mounted S3 bucket or Azure Data Lake Storage, you can specify the path directly, as `dbutils` will take care of the rest. If you are running the notebook locally, you will need to change the code and use Python's built-in `open` function instead.
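
For the local case, a sketch along these lines could stand in for the `dbutils` write. The function name `save_jobs_locally`, the sample job entry, and the file name are assumptions made for illustration. Opening the file in `"x"` mode fails if the file already exists, which roughly mirrors the `overwrite=False` behaviour used in this notebook.

```python
import json
import tempfile
from pathlib import Path


def save_jobs_locally(jobs: dict, destination: Path) -> Path:
    """Write the jobs dictionary as JSON, refusing to overwrite an existing file."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # "x" raises FileExistsError instead of silently replacing an old backup
    with open(destination, "x", encoding="utf-8") as handle:
        json.dump(jobs, handle, indent=2)
    return destination


# Demo with a throwaway directory and a made-up job entry.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = save_jobs_locally(
        {123: {"name": "example-job"}}, Path(tmp_dir) / "jobs" / "20240101.json"
    )
    restored = json.loads(target.read_text(encoding="utf-8"))
```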

### Imports

In [0]:
from collections import defaultdict
from datetime import datetime
import json
import re
from typing import Optional, Callable

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings

## Inputs


In [0]:
dbutils.widgets.removeAll()
dbutils.widgets.text("backup_file_path", "")
backup_file_path: str = dbutils.widgets.get("backup_file_path")

w = WorkspaceClient()

query_params = {
    "LIST_JOBS_LIMIT": 100,  # max limit
    "EXPAND_TASKS": True,  # the SDK expects a bool; provides the complete config info for each job
}

## List workflows 

Fetches all workflows in the current workspace along with their respective configs.

In [0]:
jobs: dict[int, dict] = {}

# Use the SDK's built-in paginator
for job in w.jobs.list(expand_tasks=query_params["EXPAND_TASKS"], limit=query_params["LIST_JOBS_LIMIT"]):
    jobs[job.job_id] = job.settings.as_dict()

## Parse the fetched data

This step is needed because the cluster config in each task contains some properties specific to the current workspace, which are populated after cluster initialization and therefore need to be removed.

In [0]:
def parse_jobs(job_info: JobSettings) -> dict:
    """
    input:
        job_info [JobSettings]: JobSettings object from the SDK.
    output:
        dict : Parsed dictionary with workspace-specific cluster attributes removed.
    """
    job_dict = job_info.as_dict()

    for cluster_info in job_dict.get("job_clusters", []):
        new_cluster = cluster_info.get("new_cluster", {})
        # Drop the workspace-specific attributes; a missing key is ignored
        new_cluster.pop("aws_attributes", None)

    return job_dict

In [0]:
for job_id, job_settings in jobs.items():
    parsed = parse_jobs(JobSettings.from_dict(job_settings))
    jobs[job_id] = parsed
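
The custom logic mentioned in the steps could hook in at this point. As a hedged sketch, the helper below attaches a webhook notification ID to the failure notifications of a job settings dictionary. The function name `add_failure_webhook` and the ID value are hypothetical, and the `webhook_notifications` / `on_failure` / `id` layout is an assumption about the Jobs API settings structure; check it against the API reference before relying on it.

```python
from copy import deepcopy


def add_failure_webhook(job_settings: dict, webhook_id: str) -> dict:
    """Return a copy of job_settings with webhook_id registered under on_failure."""
    updated = deepcopy(job_settings)
    notifications = updated.setdefault("webhook_notifications", {})
    on_failure = notifications.setdefault("on_failure", [])
    # Avoid adding the same webhook twice
    if not any(hook.get("id") == webhook_id for hook in on_failure):
        on_failure.append({"id": webhook_id})
    return updated


# Made-up settings and webhook ID, for illustration only.
original_settings = {"name": "example-job"}
with_webhook = add_failure_webhook(original_settings, "hypothetical-webhook-id")
with_webhook_again = add_failure_webhook(with_webhook, "hypothetical-webhook-id")
```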


## Backup Job Config

Write the obtained config JSON to the storage location of your choice.

In [0]:
assert len(jobs) > 0, "No Jobs Found"

In [0]:
# Name the backup after the current UTC date, e.g. <prefix>/20240131.json
backup_file_path_modded: str = backup_file_path.rstrip("/") + "/" + datetime.utcnow().strftime("%Y%m%d") + ".json"
backup_file_path_modded
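
Because the file name is derived from the date alone, there is one backup file per UTC day, and a second run on the same day targets the same path. The naming rule can be expressed as a small pure function, which makes it easy to check in isolation. The name `build_backup_path` is a hypothetical helper, not part of the notebook.

```python
from datetime import date


def build_backup_path(prefix: str, day: date) -> str:
    """Join the path prefix and a YYYYMMDD file name, tolerating a trailing slash."""
    return f"{prefix.rstrip('/')}/{day.strftime('%Y%m%d')}.json"


example_path = build_backup_path("s3://my-databricks-backups/jobs/", date(2024, 1, 31))
# example_path == "s3://my-databricks-backups/jobs/20240131.json"
```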

In [0]:
# dbutils.fs.put returns True on success; overwrite=False refuses to replace an existing backup
store_flag: bool = dbutils.fs.put(
    backup_file_path_modded, json.dumps(jobs), overwrite=False
)

if not store_flag:
    raise ValueError("Unable to Write Jobs Backup")
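
One detail worth knowing when reading the backup later: JSON object keys are always strings, so the integer job IDs used as dictionary keys come back as strings after a round trip. A minimal sketch with made-up job entries shows the effect and one way to restore integer keys.

```python
import json

# Made-up job entries keyed by integer job ID, as in the notebook's jobs dict.
sample_jobs = {101: {"name": "nightly-etl"}, 202: {"name": "weekly-report"}}

serialized = json.dumps(sample_jobs)
restored_raw = json.loads(serialized)  # keys are now strings

# Convert the keys back to integers if the original IDs are needed
restored_jobs = {int(job_id): settings for job_id, settings in restored_raw.items()}
```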