# Workflow Calendar 📆
## Requirements
### Databricks
* A Databricks Workspace & Workspace Access Token
* At least one runnable cluster within the workspace
* At least one scheduled job in Databricks workflows

### Packages
This notebook relies on three packages: `cron-schedule-triggers` to interpret the Quartz cron schedule expressions, `pandas` for data manipulation, and `plotly` for visualization.
* <a href="https://pypi.org/project/cron-schedule-triggers/">cron-schedule-triggers</a>
* <a href="https://pypi.org/project/pandas/">pandas</a>
* <a href="https://pypi.org/project/plotly/">plotly</a>



## Imports

In [None]:
import requests
from typing import Optional, Callable
import pandas as pd
import datetime
import re

from cstriggers.core.trigger import QuartzCron
from datetime import timedelta
import plotly.express as px


import plotly.graph_objects as go
import plotly.figure_factory as ff

## Input Data

> Provide the date values in `YYYY-MM-DD` format

In [None]:
dbutils.widgets.removeAll()

dbutils.widgets.text("start_date", "2023-10-01")

start_date: datetime.datetime = datetime.datetime.strptime(
    getArgument("start_date"), "%Y-%m-%d"
)

dbutils.widgets.text("end_date", "2023-11-05")

end_date: datetime.datetime = datetime.datetime.strptime(
    getArgument("end_date"), "%Y-%m-%d"
)

dbutils.widgets.text("databricks_url", "")
databricks_url: str = getArgument("databricks_url")

dbutils.widgets.text("databricks_workspace_token", "")
databricks_workspace_token: str = getArgument("databricks_workspace_token")

headers: dict = {"Authorization": f"Bearer {databricks_workspace_token}"}

In [None]:
query_params: dict = {
    "LIST_JOBS_LIMIT": 100,  # max limit
    "LIST_RUNS_LIMIT": 25,  # max limit
    "EXPAND_RUNS": "true",
    "EXPAND_TASKS": "true",
}

In [None]:
def paginate(
    can_paginate: bool,
    next_page_token: Optional[str],
    url: str,
    function_to_call: Callable,
) -> None:
    """
    Paginates to the next page if possible
    input:
        can_paginate [bool]: Whether there are additional pages to fetch.
        next_page_token [str]: Token needed in url query param to paginate to next page.
        url [str]: Url used to list the needed info.
        function_to_call [Callable]: Function that gets called with the paginated url to paginate further.
    output:
        None
    """

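    # Stop when the API reports no further pages or provides no token.
    if not (can_paginate and next_page_token):
        return

    if "&page_token" in url:
        url = f"{url[:url.find('&page_token')]}&page_token={next_page_token}"
    else:
        url = f"{url}&page_token={next_page_token}"
    # Call the function that was passed in, so the helper works for jobs and runs alike.
    function_to_call(url)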
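
The pagination step above rewrites the `page_token` query parameter by slicing the URL string. As a minimal sketch of the same idea using the standard library's URL parser, the helper below replaces or appends the token. The function name `with_page_token` and the example workspace URL are assumptions made for illustration; they are not part of this notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_token(url: str, next_page_token: str) -> str:
    """Return `url` with its `page_token` query parameter set to `next_page_token`."""
    parts = urlsplit(url)
    # Drop any stale page_token and keep every other parameter in order.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "page_token"]
    query.append(("page_token", next_page_token))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical workspace URL, used only for illustration.
first_page = (
    "https://example.cloud.databricks.com/api/2.1/jobs/list"
    "?limit=100&expand_tasks=true"
)
second_page = with_page_token(first_page, "abc")
third_page = with_page_token(second_page, "def")
print(third_page)
```

Parsing the query string this way does not depend on `page_token` being the last parameter in the URL.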

## Steps 📊

### 1. Fetch Workflows and Runs 🏃‍♂️

This notebook begins by fetching all the [workflows](https://docs.databricks.com/api/workspace/jobs/list) in your Databricks workspace. It also retrieves information about the [runs](https://docs.databricks.com/api/workspace/runs/list) that have occurred within a specified date range, which is provided by the user.

### 2. Parse the fetched info 🧩
Each scheduled workflow defines its schedule as a `quartz_cron_expression`, from which we generate the dates of its upcoming runs.
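
To make the shape of a Quartz cron expression concrete, here is a small sketch that splits one into its named fields. Quartz expressions have six mandatory fields (seconds, minutes, hours, day of month, month, day of week) and an optional seventh (year). The helper name `split_quartz_cron` and the sample expression are assumptions for illustration; the notebook itself delegates parsing to `cron-schedule-triggers`.

```python
QUARTZ_FIELDS = (
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",  # optional seventh field
)


def split_quartz_cron(expression: str) -> dict:
    """Map each whitespace-separated value of a Quartz cron expression to its field name."""
    values = expression.split()
    if len(values) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(values)}")
    return dict(zip(QUARTZ_FIELDS, values))


# Sample expression: every day at 09:30:00.
fields = split_quartz_cron("0 30 9 * * ?")
print(fields)
```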

### 3. Visualizations 📈

The notebook provides three insightful visualizations:

- **First Scheduled Run of All Workflows**: Visualizes the first scheduled run of each workflow since the start date.

- **Scheduled Runs Between Start and End Date**: Shows all scheduled runs that occurred within the specified date range.

- **All Runs Since Start Date with Time Taken**: Displays all runs that have occurred since the start date, plotting them along with their execution time for performance analysis.



## List workflows 
#### Fetches all workflows in the current workspace and their respective configs
<a href="https://docs.databricks.com/api/workspace/jobs/list">API Docs</a>


In [None]:
def getAllJobs(list_jobs_url: str) -> None:
    """
    Fetches all the jobs and metadata about them.
    input:
        list_jobs_url [str]: Databricks API used to fetch all the jobs.
    output:
        None
    """

    response = requests.get(
        list_jobs_url,
        headers=headers,
    )
    assert response.status_code == 200

    response_data = response.json()

    for job in response_data.get("jobs", []):
        if job.get("settings", {}).get("schedule"):
            jobs[job.get("job_id")] = {
                "name": job.get("settings", {}).get("name"),
                "quartz_cron_expression": job.get("settings", {})
                .get("schedule", {})
                .get("quartz_cron_expression")
                .lower(),
            }

    paginate(
        response_data.get("has_more", False),
        response_data.get("next_page_token"),
        list_jobs_url,
        getAllJobs,
    )


jobs = {}  # holds all jobs' info

list_jobs_url: str = (
    databricks_url 
    + "/api/2.1/jobs/list" 
    + f"?limit={query_params.get('LIST_JOBS_LIMIT')}" 
    + f"&expand_tasks={query_params['EXPAND_TASKS']}"
)

getAllJobs(list_jobs_url)

## Parse the fetched data
#### Interpret the cron expression and calculate the next run.  
#### You can also categorize workflows based on their title; the category determines the colour of each workflow in the plots.

In [None]:
def categorizeWorkflow(workflow_title: str) -> str:
    """Return the category of a workflow. Add custom grouping logic here:
    workflows are grouped by category, and each category gets its own
    colour in the plot.
    input:
        workflow_title : str
    output:
        category : str
    """

    category = workflow_title  # add custom logic to categorize the workflow
    return category


for job_id, job_info in jobs.items():
    cron_expression = job_info["quartz_cron_expression"]

    cron_obj = QuartzCron(
        schedule_string=cron_expression,
        start_date=start_date,  # This is the start date based on which the next scheduled run is generated. You can change it as per your needs.
    )

    next_scheduled_run = cron_obj.next_trigger(isoformat=False)
    # print(next_scheduled_run)
    jobs[job_id]["next_scheduled_run"] = next_scheduled_run
    jobs[job_id]["workflow_category"] = categorizeWorkflow(jobs[job_id]["name"])
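
As one possible shape for the custom grouping logic, the sketch below derives the category from the part of the title before the first separator. It assumes a hypothetical naming convention such as `ETL_daily_sales`; the function name `categorize_by_prefix` and the `"uncategorized"` fallback are illustrative choices, not something this notebook defines.

```python
def categorize_by_prefix(workflow_title: str, separator: str = "_") -> str:
    """Use the text before the first separator as the category (assumed naming convention)."""
    prefix, found, _ = workflow_title.partition(separator)
    # Fall back to a shared bucket when the title has no usable prefix.
    return prefix.lower() if found and prefix else "uncategorized"


print(categorize_by_prefix("ETL_daily_sales"))
print(categorize_by_prefix("adhoc"))
```

Grouping by a shared prefix keeps the legend short, since workflows in the same group share one colour instead of each title getting its own.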

## Jitter workflows
#### Workflows are sometimes scheduled so close together that their points overlap in the visualization, so we jitter them slightly to keep the plot readable.

In [None]:
def jitterPoints(df: pd.DataFrame) -> pd.DataFrame:
    """If two workflows have schedules too close to each other,
    this function moves them slightly apart
    so that the visualization is neat."""
    # Initialize a flag to keep track of whether any adjustments were made
    adjusted = True
    max_iterations = 2  # Set a maximum number of iterations, increase if you have a lot of conflicting workflow schedules.
    jitter_minutes = 10  # adjust based on need

    iteration = 0
    while adjusted and iteration < max_iterations:
        adjusted = False

        for i in range(1, len(df)):
            diff = df["start_datetime"].iloc[i] - df["start_datetime"].iloc[i - 1]

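            if diff <= timedelta(minutes=jitter_minutes):
                # Move the current event to jitter_minutes after the previous one.
                # Positional assignment on the frame avoids pandas chained assignment.
                df.iloc[i, df.columns.get_loc("start_datetime")] = df[
                    "start_datetime"
                ].iloc[i - 1] + timedelta(minutes=jitter_minutes)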
                adjusted = True

        iteration += 1
    return df
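
The jitter idea can also be sketched without pandas, on a plain list of datetimes. This is a simplified variant under stated assumptions, not the notebook's function: it uses a single forward pass, a strict `<` comparison, and a hypothetical name `jitter_times`.

```python
from datetime import datetime, timedelta


def jitter_times(times, min_gap=timedelta(minutes=10)):
    """Return sorted copies of `times` in which neighbours are at least `min_gap` apart."""
    spaced = []
    for current in sorted(times):
        # Compare against the already adjusted previous value, so one pass is enough.
        if spaced and current - spaced[-1] < min_gap:
            current = spaced[-1] + min_gap
        spaced.append(current)
    return spaced


runs = [
    datetime(2023, 10, 1, 9, 0),
    datetime(2023, 10, 1, 9, 2),
    datetime(2023, 10, 1, 9, 5),
    datetime(2023, 10, 1, 12, 0),
]
spaced = jitter_times(runs)
print([t.strftime("%H:%M") for t in spaced])
```

The three runs clustered around 09:00 end up ten minutes apart, while the 12:00 run is left untouched. Note that the plotted times are then no longer the exact scheduled times.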

## Helper Function
#### Used to generate X axis tick values

In [4]:
def generateXAxisTickTexts() -> list:
    """Helper function used to generate x axis tick values"""
    temp = list(range(1, 13)) + list(range(1, 13)) #12 hour clock entries 
    temp = temp[-1:] + temp[:-1] #right shifting 
    for idx in range(len(temp)): #filling the AM/PM value as its a 12 hour format
        if idx < len(temp) // 2:
            temp[idx] = f"{temp[idx]} AM"
        else:
            temp[idx] = f"{temp[idx]} PM"
    return temp
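
For comparison, the same 24 labels can be derived arithmetically from the hour values 0 to 23, which makes the pairing between each tick position and its label explicit. The names `hour_label`, `tick_values`, and `tick_texts` are assumptions for this sketch.

```python
def hour_label(hour: int) -> str:
    """Format an hour of the day (0-23) as a 12-hour clock label."""
    suffix = "AM" if hour < 12 else "PM"
    # Hours 0 and 12 are both shown as 12 on a 12-hour clock.
    return f"{hour % 12 or 12} {suffix}"


tick_values = list(range(24))
tick_texts = [hour_label(hour) for hour in tick_values]
print(list(zip(tick_values, tick_texts))[:3])
```

Here hour 0 maps to `12 AM` and hour 12 to `12 PM`, so a point plotted at `x = 13.5` falls between the `1 PM` and `2 PM` ticks.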

## Plot the result


In [5]:
# Adjust the plot dimensions here
PLOT_HEIGHT = 700
PLOT_WIDTH = 2000
POINT_SIZE = 15

events = [
    {
        "name": job_info["name"],
        "start_datetime": job_info["next_scheduled_run"],
        "workflow_category": job_info["workflow_category"],
    }
    for job_info in jobs.values()
]

df = pd.DataFrame(events)

df["start_datetime"] = pd.to_datetime(df["start_datetime"])

# Sort DataFrame by 'start_datetime'
df.sort_values(by="start_datetime", inplace=True)

# jitter nearby points
df = jitterPoints(df)


# Increase the size of all points by adjusting the marker size
point_size = POINT_SIZE  # Adjust the size as needed

# Create an interactive scatter plot using Plotly Express
fig = px.scatter(
    df,
    x=df["start_datetime"].dt.hour
    + df["start_datetime"].dt.minute / 60
    + df["start_datetime"].dt.second / 3600,
    y=df["start_datetime"].dt.strftime("%Y/%m/%d"),
    # y= df['start_datetime'].dt.strftime('%d-%m-%y'),
    color="workflow_category",  # Color points by 'workflow_category' column
    hover_name="name",  # Display event name on hover
    labels={"x": "Time of Day (12-hour format)", "y": "Date"},
    title=f"Workflow's first run since {start_date.strftime('%Y-%m-%d')}",
    template="plotly_white",
)


# Customize the appearance of the plot
fig.update_layout(
    xaxis=dict(
        tickmode="array",
        tickvals=list(range(24)),  # hours 0-23, one per label ("12 AM" is hour 0)
        ticktext=generateXAxisTickTexts(),
    ),
    yaxis=dict(
        tickmode="array",
        tickvals=list(
            range(
                0,
                int((df["start_datetime"].iloc[-1] - df["start_datetime"].iloc[0]).days)
                + 10,
            )
        ),
    ),
    showlegend=True,
    legend_title_text="Workflow Category",
    height=PLOT_HEIGHT,  # Height of the plot
    width=PLOT_WIDTH,  # Width of the plot
)

# Increase the marker size for all points
fig.update_traces(marker=dict(size=point_size))

# Show the interactive plot
fig.show()

## Calculate all the scheduled runs 
#### Using `start_date` and `end_date`, we calculate all the scheduled runs within the date range.
#### `cron-schedule-triggers` generates each subsequent scheduled run, starting from the given `start_date`.

In [None]:
all_scheduled_runs = []
for job_id, job_info in jobs.items():
    cron_expression = job_info["quartz_cron_expression"]

    cron_obj = QuartzCron(
        schedule_string=cron_expression,
        start_date=start_date,
    )

    next_scheduled_run = cron_obj.next_trigger(isoformat=False)
    runs = []
    while next_scheduled_run <= end_date:
        runs.append(next_scheduled_run)
        next_scheduled_run = cron_obj.next_trigger(isoformat=False)

    for run in runs:
        all_scheduled_runs.append(
            {
                "name": jobs[job_id]["name"],
                "start_datetime": run,
                "workflow_category": jobs[job_id]["workflow_category"],
            }
        )

## Plot the result

In [6]:
# Adjust the plot dimensions here
PLOT_HEIGHT = 700
PLOT_WIDTH = 2000
POINT_SIZE = 15


df = pd.DataFrame(all_scheduled_runs)

df["start_datetime"] = pd.to_datetime(df["start_datetime"])

# Sort DataFrame by 'start_datetime'
df.sort_values(by="start_datetime", inplace=True)

# jitter nearby points
df = jitterPoints(df)

# Increase the size of all points by adjusting the marker size
point_size = POINT_SIZE  # Adjust the size as needed

# Create an interactive scatter plot using Plotly Express
fig = px.scatter(
    df,
    x=df["start_datetime"].dt.hour
    + df["start_datetime"].dt.minute / 60
    + df["start_datetime"].dt.second / 3600,
    y=df["start_datetime"].dt.strftime("%Y/%m/%d"),
    color="workflow_category",  # Color points by 'workflow_category' column
    hover_name="name",  # Display event name on hover
    labels={"x": "Time of Day (12-hour format)", "y": "Date"},
    title=f"All Workflow runs scheduled from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}",
    template="plotly_white",
)

# Customize the appearance of the plot
fig.update_layout(
    xaxis=dict(
        tickmode="array",
        tickvals=list(range(24)),  # hours 0-23, one per label ("12 AM" is hour 0)
        ticktext=generateXAxisTickTexts(),
    ),
    yaxis=dict(
        tickmode="array",
        tickvals=list(
            range(
                0,
                int((df["start_datetime"].iloc[-1] - df["start_datetime"].iloc[0]).days)
                + 10,
            )
        ),
    ),
    showlegend=True,
    legend_title_text="Workflow category",
    height=PLOT_HEIGHT,  # Height of the plot
    width=PLOT_WIDTH,  # Width of the plot
)

# Increase the marker size for all points
fig.update_traces(marker=dict(size=point_size))

# Show the interactive plot
fig.show()

## List workflow runs
#### Fetch all workflow runs that have taken place since the given start date, parsing only the info needed for plotting.
<a href="https://docs.databricks.com/api/workspace/jobs/listruns">API Docs</a>




In [None]:
all_runs_info = []


def getAllRuns(list_runs_url: str) -> None:
    """
    Fetches all the runs of a given workflow and metadata about them.
    input:
        list_runs_url [str]: Databricks API used to fetch all the runs belonging to a given job.
    output:
        None
    """

    response = requests.get(
        list_runs_url,
        headers=headers,
    )
    assert response.status_code == 200

    response_data = response.json()
    pattern = r"job_id=([\w-]+)"
    matched = re.search(pattern, list_runs_url)
    job_id = int(matched.group(1))

    if "runs" in response_data:
        for run_info in response_data["runs"]:
            if (
                "start_time" in run_info
                and "end_time" in run_info
                and run_info["end_time"]
            ):
                all_runs_info.append(
                    {
                        "Task": jobs[job_id]["name"],
                        "Start": datetime.datetime.fromtimestamp(
                            run_info["start_time"] / 1000
                        ),
                        "Finish": datetime.datetime.fromtimestamp(
                            run_info["end_time"] / 1000
                        ),
                        "Duration": (
                            datetime.datetime.fromtimestamp(run_info["end_time"] / 1000)
                            - datetime.datetime.fromtimestamp(
                                run_info["start_time"] / 1000
                            )
                        ).total_seconds()
                        / 3600,
                        "workflow_category": jobs[job_id]["workflow_category"],
                    }
                )

    paginate(
        response_data.get("has_more", False),
        response_data.get("next_page_token"),
        list_runs_url,
        getAllRuns,
    )


job_ids = list(jobs.keys())
list_runs_urls = [
    # start_time_from is sent as integer epoch milliseconds, hence the int() cast.
    f"{databricks_url}/api/2.1/jobs/runs/list?job_id={job_id}&limit={query_params.get('LIST_RUNS_LIMIT')}&expand_tasks={query_params.get('EXPAND_RUNS')}&start_time_from={int(start_date.timestamp() * 1000)}"
    for job_id in job_ids
]

for url in list_runs_urls:
    getAllRuns(url)
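
The run records carry `start_time` and `end_time` as epoch milliseconds. The sketch below shows the conversion in isolation, in both directions. It uses UTC so the result does not depend on the machine's time zone, whereas the cell above uses local time. The helper names and the sample record are hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> int:
    """Convert a datetime to integer epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def parse_run(run_info: dict) -> dict:
    """Turn the millisecond timestamps of one run record into datetimes and a duration."""
    start = datetime.fromtimestamp(run_info["start_time"] / 1000, tz=timezone.utc)
    finish = datetime.fromtimestamp(run_info["end_time"] / 1000, tz=timezone.utc)
    return {
        "Start": start,
        "Finish": finish,
        "Duration": (finish - start).total_seconds() / 3600,  # hours
    }


# Hypothetical record: a run starting 2023-10-01 00:00 UTC that lasts 90 minutes.
sample_run = {"start_time": 1696118400000, "end_time": 1696123800000}
parsed = parse_run(sample_run)
print(parsed["Start"].isoformat(), parsed["Duration"])
```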

## Plot the result

In [7]:
# Adjust accordingly
PLOT_HEIGHT = 1500
PLOT_WIDTH = 2000

runs_df = pd.DataFrame(all_runs_info)

runs_df["Start"] = pd.to_datetime(runs_df["Start"])
runs_df["Finish"] = pd.to_datetime(runs_df["Finish"])

runs_df["Duration"] = (
    runs_df["Finish"] - runs_df["Start"]
).dt.total_seconds() / 3600  # Duration in hours

# Create a new column 'Day' representing the day for each task
runs_df["Day"] = runs_df["Start"].dt.date
runs_df.head()

# Extract task, start, and end dates
tasks = runs_df["Task"].tolist()
start_dates = runs_df["Start"].tolist()
end_dates = runs_df["Finish"].tolist()

# Create the Gantt chart
fig = ff.create_gantt(
    runs_df,
    title="Task Duration Gantt Chart",
)

fig.update_layout(
    height=PLOT_HEIGHT,
    width=PLOT_WIDTH,
    plot_bgcolor="white",
    paper_bgcolor="white",
    yaxis=dict(showgrid=True, gridcolor="lightgray"),
    xaxis=dict(showgrid=True, gridcolor="lightgray"),
)

fig.show()