# Analysis of DP0.2 pipetask execution

### Introduction

This notebook summarizes the CPU and memory consumption of pipeline tasks, as observed at FrDF during the processing for Data Preview 0.2.

### Source data
The source data for this analysis comes from the task execution records that the pipelines themselves collect and store in the Butler repository. Quentin Le Boulc'h pre-processed those records to generate `.csv` files, which are located under `/sps/lsst/users/lsstprod/DP02/profiling/step*`.

To prepare this analysis, we converted those `.csv` files to `.parquet` format and stored them in the directory `data/pipetasks`, using the tools in the directory `preprocess` included in this repository.

Each pipetask of the LSST Science Pipelines records information about its execution. Among the metrics recorded are the time spent in its execution and the memory it used. Those metrics are recorded in the Butler repository as task metadata datasets named `*_metadata` (e.g. `assembleCoadd_metadata` or `makeWarp_metadata`) in YAML format. In this particular case, for DP0.2 Quentin extracted and processed those YAML files using Jim Chiang's [gather_resource_info.py](https://github.com/LSSTDESC/gen3_workflow/blob/master/python/desc/gen3_workflow/gather_resource_info.py) module. See further details in Quentin's https://gitlab.in2p3.fr/rubin-lsst/dp02-analysis.

Among the metrics collected for each task we have:

* CPU time: collected using Python's [time.process_time()](https://docs.python.org/3/library/time.html#time.process_time), which returns *"the sum of the system and user CPU time of the current process. It does not include time elapsed during sleep."* This is a real value in seconds.
* Elapsed time (in seconds)
* Maximum [resident set size](https://en.wikipedia.org/wiki/Resident_set_size): this value is obtained via Python's [resource.getrusage()](https://docs.python.org/3/library/resource.html#resource.getrusage) and represents the maximum value of RSS recorded for the process (in kilobytes). We interpret this as the maximum amount of RAM the process executing a given pipeline task has been allocated.
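
The three metrics listed above can be gathered with the Python standard library alone. The sketch below is only an illustration and is not the code used by the Science Pipelines: the helper `measure` and the keys of the dictionary it returns are hypothetical names. It assumes a Unix-like system, where `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, so the value is normalized to kilobytes.

```python
import resource
import sys
import time


def measure(func, *args, **kwargs):
    """Run `func` and return its result with CPU time, elapsed time and max RSS.

    Hypothetical helper for illustration only.
    """
    start_cpu = time.process_time()
    start_wall = time.perf_counter()
    result = func(*args, **kwargs)
    cpu_time = time.process_time() - start_cpu
    elapsed_time = time.perf_counter() - start_wall

    # Maximum resident set size of this process since it started
    max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        # macOS reports bytes, Linux reports kilobytes
        max_rss /= 1024

    metrics = {
        "cpu_time": cpu_time,          # seconds
        "elapsed_time": elapsed_time,  # seconds
        "peak_RSS_kb": max_rss,        # kilobytes
    }
    return result, metrics


# Sleeping consumes elapsed time but almost no CPU time
_, metrics = measure(time.sleep, 0.2)
print(metrics)
```

Since the maximum RSS is a high-water mark for the whole process, it never decreases between successive calls made within the same process.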

In [1]:
import glob
import math
import os
import pathlib
import shutil
import sys
from typing import Tuple, List

In [2]:
import polars as pl

# Set the maximum length to display for string columns
_ = pl.Config.set_fmt_str_lengths(50)

In [3]:
import pandas as pd
import numpy as np

In [4]:
import bokeh
import bokeh.plotting as bkh
import bokeh.models as bkhmodels
bkh.output_notebook()

In [5]:
import IPython.display
print_md = IPython.display.Markdown

In [6]:
def build_dataframe_from_dir(dir: str) -> pl.DataFrame:
    """Read all .parquet files in directory 'dir' and build a single dataframe.
    
    Parameters
    ----------
    dir : `str`
       directory where the input .parquet files are located. All the files
       in that directory are read to build the dataframe
       
    Returns
    -------
    df : `polars.DataFrame` or `None`
       a dataframe where each row contains the information of one task, or
       `None` if the directory contains no .parquet files.
    """
    print(f'Loading data files in directory {dir} ...')
    df = None
    for file in pathlib.Path(dir).glob('*.parquet'):
        this_df = pl.read_parquet(file)
        df = this_df if df is None else df.vstack(this_df)
                      
    return df

In [7]:
# Load the data files for each step of interest and aggregate it in a single dataframe
data_dir = '../../data/pipetasks'
paths = sorted(pathlib.Path(data_dir).glob('step*'))

df = None
for path in paths:
    df = build_dataframe_from_dir(path) if df is None else df.vstack(build_dataframe_from_dir(path)) 

# Add a column 'cpu_efficiency' and compute 'peak_RSS' column in gigabytes instead of kilobytes as reported
# by Python's resource.getrusage
df = df.with_columns([
    (pl.col("cpu_time") / pl.col("elapsed_time")).alias("cpu_efficiency"),
    (pl.col("peak_RSS") / 1e6),
])

Loading data files in directory ../../data/pipetasks/step1 ...
Loading data files in directory ../../data/pipetasks/step2 ...
Loading data files in directory ../../data/pipetasks/step3 ...
Loading data files in directory ../../data/pipetasks/step4 ...
Loading data files in directory ../../data/pipetasks/step5 ...
Loading data files in directory ../../data/pipetasks/step6 ...
Loading data files in directory ../../data/pipetasks/step7 ...


In [8]:
def clean_output_dir(directory):
    """Remove all .html and .png files from `directory`
    """
    if not os.path.exists(directory):
        return
    
    to_remove = glob.glob(os.path.join(directory, 'images', '*.png')) + glob.glob(os.path.join(directory, 'html', '*.html'))
    for name in to_remove:
        os.remove(name)

In [9]:
# Create an output directory for plots created by this notebook
output_dir = os.path.join('.', 'results')
os.makedirs(output_dir, exist_ok=True)

clean_output_dir(output_dir)

## Overview

In [10]:
total_elapsed_time_hours = pl.sum(df.get_column("elapsed_time")) / 3_600
total_cpu_time_hours = pl.sum(df.get_column("cpu_time")) / 3_600
global_cpu_efficiency = 100 * (total_cpu_time_hours / total_elapsed_time_hours)

overview = f"""
There were **{df.shape[0]:,} pipetasks** which consumed **{total_elapsed_time_hours:,.0f} elapsed hours ({total_cpu_time_hours:,.0f} CPU hours**)
for a global CPU efficiency of **{global_cpu_efficiency:.1f}%**.
"""
print_md(overview)


There were **57,903,740 pipetasks** which consumed **2,347,306 elapsed hours (1,728,248 CPU hours**)
for a global CPU efficiency of **73.6%**.
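
The global figure is a ratio of totals (total CPU time divided by total elapsed time), which is not the same as averaging the per-quantum `cpu_efficiency` column. The sketch below illustrates the difference on two hypothetical quanta whose values are invented for the example and are not taken from the DP0.2 data. It then applies the same formula to the rounded totals printed in the overview.

```python
# Hypothetical (cpu_time, elapsed_time) pairs in seconds, not DP0.2 data:
# one short, mostly idle quantum and one long, CPU-bound quantum.
quanta = [(0.2, 1.0), (90.0, 100.0)]

total_cpu = sum(cpu for cpu, _ in quanta)
total_elapsed = sum(elapsed for _, elapsed in quanta)

# Ratio of totals: each quantum is weighted by its elapsed time
ratio_of_totals = total_cpu / total_elapsed

# Mean of per-quantum ratios: every quantum weighs the same
mean_of_ratios = sum(cpu / elapsed for cpu, elapsed in quanta) / len(quanta)

print(f"ratio of totals: {ratio_of_totals:.1%}")
print(f"mean of ratios:  {mean_of_ratios:.1%}")

# Same formula applied to the rounded totals reported in the overview
reported = 100 * 1_728_248 / 2_347_306
print(f"from reported totals: {reported:.1f}%")
```

The ratio of totals is dominated by long-running quanta, which is why it is the quantity used here to describe how well the allocated CPU slots were used overall.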


## Per step analysis

In [11]:
df.head(5)

| task (str) | band (str) | visit (str) | cpu_time (f32) | elapsed_time (f32) | peak_RSS (f64) | run (u16) | step (u16) | cpu_efficiency (f32) |
| ---- | ---- | ---- | -----: | -----: | -----: | -----: | -----: | -----: |
| writeSourceTable | r | 1171832 | 0.232792 | 0.78534 | 1.329064 | 9 | 1 | 0.296421 |
| isr | r | 1155569 | 24.685934 | 25.170607 | 2.668844 | 9 | 1 | 0.980744 |
| calibrate | r | 1182632 | 36.576054 | 38.942017 | 1.32892 | 9 | 1 | 0.939244 |
| writeSourceTable | z | 1174511 | 0.21872 | 0.793542 | 1.329056 | 9 | 1 | 0.275625 |
| transformSourceTable | z | 1174424 | 0.142618 | 1.665406 | 1.328892 | 9 | 1 | 0.085635 |


In [12]:
def get_column(df:pl.DataFrame, column:str) -> List:
    """Return a list with the values of column `column` from data frame `df`
    """
    return df.get_column(column).to_list()

In [13]:
# Create a dataframe for summarizing per-step metrics, i.e.
# - quanta: number of quanta executed in this step
# - pipetasks: number of different pipetasks executed in this step
# - elapsed_time_hours: elapsed time spent in this step (hours)
# - cpu_time_hours: CPU time spent in this step (hours)
# - cpu_efficiency: CPU efficiency
# - max_RSS: maximum RSS for any task in this step (gigabytes)
per_step_df = df.groupby("step", maintain_order=True).agg(
    [
        pl.col("task").n_unique().alias("pipetasks"),
        pl.col("task").count().alias("quanta"),
        (pl.col("elapsed_time")/3_600).sum().alias("elapsed_time_hours"),
        (pl.col("cpu_time")/3_600).sum().alias("cpu_time_hours"),
        (pl.col("cpu_time").sum() / pl.col("elapsed_time").sum()).alias("cpu_efficiency"),
        pl.col("peak_RSS").max().alias("max_RSS"),
    ])

In [14]:
per_step_df

| step (u16) | pipetasks (u32) | quanta (u32) | elapsed_time_hours (f64) | cpu_time_hours (f64) | cpu_efficiency (f32) | max_RSS (f64) |
| ---- | -----: | -----: | -----: | -----: | -----: | -----: |
| 1 | 5 | 14156932 | 92729.945437 | 81092.351006 | 0.884075 | 28.735664 |
| 2 | 5 | 99264 | 21118.748554 | 711.333131 | 0.033683 | 4.189256 |
| 3 | 42 | 4378987 | 934581.951709 | 855426.070201 | 0.913383 | 171.928144 |
| 4 | 6 | 21307370 | 1226600.0 | 773067.175264 | 0.639746 | 6.26334 |
| 5 | 12 | 12505857 | 54567.039126 | 13051.466559 | 0.239318 | 63.225148 |
| 6 | 4 | 5455322 | 17728.701126 | 4908.955586 | 0.270815 | 1.386728 |
| 7 | 3 | 8 | 5.299142 | 2.366513 | 0.446584 | 2.782916 |


In [15]:
def save_figure(fig, output_dir, filename, title):
    """Save figure in formats HTML and PNG
    """
    # Ensure output directories exists
    os.makedirs(output_dir, exist_ok=True)
    image_dir = os.path.join(output_dir, 'images')
    os.makedirs(image_dir, exist_ok=True)
    html_dir = os.path.join(output_dir, 'html')
    os.makedirs(html_dir, exist_ok=True)
    
    # Save PNG
    png_filename = os.path.join(image_dir, f'{filename}.png')
    _ = bokeh.io.export_png(fig, filename=png_filename)
    
    # Save HTML
    html_filename = os.path.join(html_dir, f'{filename}.html')
    bkh.output_file(filename=html_filename, title=title)
    bkh.save(fig)
    
    # Reset the output file so the html file is not overwritten by
    # future calls to save()
    bokeh.io.reset_output()
    bokeh.io.output_notebook()

In [16]:
def make_figure_per_step_task_count(steps: List[int], quanta: List[int], annotation: str = None) -> bkh.Figure:
    """Return a figure representing the number of tasks on each step
    """
    # Build the data source
    sorted_by_step = sorted(zip(steps, quanta))
    steps, quanta = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps, 
        'quanta': quanta,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(quanta)*100)
    fig = bkh.figure(
        x_axis_label = 'step',
        y_range = y_range,
        y_axis_label = 'quanta',
        y_axis_type = 'log',
        plot_width = 800,
        plot_height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Execution of quanta per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    
    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
        
    # Add vertical bar for quanta counts
    bars = fig.vbar(x='steps', top='quanta', bottom=1, width=0.8, color='tan', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=5.9, y=max(quanta)*4.2, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)

    
    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('quanta', '@quanta{0,0}'), 
         ], renderers=[bars], mode='mouse')) 

    return fig

In [17]:
total_tasks = per_step_df.select(pl.col('quanta')).sum().row(0)[0]
task_count_per_step_fig = make_figure_per_step_task_count(
    steps = get_column(per_step_df, 'step'), 
    quanta = get_column(per_step_df, 'quanta'),
    annotation = f"total quanta: {total_tasks/1e6:,.1f}M",
)
bkh.show(task_count_per_step_fig)

# Export this figure
save_figure(task_count_per_step_fig, output_dir=output_dir, filename='quanta-per-step', title='DP0.2 Quanta per step')

In [18]:
def make_figure_per_step_elapsed(steps: List[int], elapsed_time: List[float], annotation: str = None) -> bkh.Figure:
    """Return a figure representing the execution time spent on each step
    """
    # Build the data source
    sorted_by_step = sorted(zip(steps, elapsed_time))
    steps, elapsed_time = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps, 
        'elapsed_time': elapsed_time,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(elapsed_time)*100)
    fig = bkh.figure(
        x_axis_label = 'step',
        y_range = y_range,
        y_axis_label = 'hours',
        y_axis_type = 'log',
        plot_width = 800,
        plot_height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Elapsed time per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a vertical bar for elapsed time
    bars = fig.vbar(x='steps', top='elapsed_time', bottom=1, width=0.8, color='teal', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=4.5, y=max(elapsed_time)*5, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)

    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('elapsed time', '@elapsed_time{0,0} hours'), 
         ], renderers=[bars], mode='mouse')) 

    return fig

In [19]:
def make_figure_per_step_efficiency(steps: List[int], cpu_efficiency: List[float], annotation: str = None) -> bkh.Figure:
    """Return a figure representing the CPU efficiency per step
    """
    # Build the data source
    sorted_by_step = sorted(zip(steps, cpu_efficiency))
    steps, cpu_efficiency = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps,
        'cpu_efficiency': cpu_efficiency,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0., 1.)
    fig = bkh.figure(
        x_axis_label = 'step',
        y_range = y_range,
        plot_width = 800,
        plot_height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU efficiency per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LinearAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.formatter = bokeh.models.NumeralTickFormatter(format='0%')
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a vertical bar for CPU efficiency
    bars = fig.vbar(x='steps', top='cpu_efficiency', bottom=0.0, width=0.8, color='thistle', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=4.8, y=0.85, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)
       
    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('CPU efficiency', '@cpu_efficiency{%0}'), 
         ], 
         renderers=[bars], mode='mouse')) 

    return fig

In [20]:
# Generate a figure for elapsed time vs. step
total_elapsed = per_step_df.select(pl.col('elapsed_time_hours')).sum().row(0)[0]
elapsed_per_step_fig = make_figure_per_step_elapsed(
    steps = get_column(per_step_df, 'step'), 
    elapsed_time = get_column(per_step_df, 'elapsed_time_hours'), 
    annotation = f"aggregated elapsed time: {total_elapsed/1e6:,.1f}M hours",
)

# Generate a figure for CPU efficiency vs. step
total_elapsed = per_step_df.select(pl.col('elapsed_time_hours')).sum().row(0)[0]
total_cpu = per_step_df.select(pl.col('cpu_time_hours')).sum().row(0)[0]
total_efficiency = 100.0 * total_cpu/total_elapsed
efficiency_per_step_fig = make_figure_per_step_efficiency(
    steps = get_column(per_step_df, 'step'), 
    cpu_efficiency = get_column(per_step_df, 'cpu_efficiency'), 
    annotation = f"aggregated CPU efficiency: {total_efficiency:,.0f}%",
)
bkh.show(bokeh.layouts.row(elapsed_per_step_fig, efficiency_per_step_fig))

# Export these figures
save_figure(elapsed_per_step_fig, output_dir=output_dir, filename='elapsed-per-step', title='DP0.2 Elapsed time per step')
save_figure(efficiency_per_step_fig, output_dir=output_dir, filename='cpu-efficiency-per-step', title='DP0.2 CPU efficiency per step')

In [21]:
def make_figure_per_step_memory(steps: List[int], max_rss: List[float], annotation: str = None) -> bkh.Figure:
    """Return a figure representing the maximum RSS observed on each step
    """
    # Build the data source
    sorted_by_step = sorted(zip(steps, max_rss))
    steps, max_rss = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps, 
        'max_rss': max_rss,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(max_rss)*2)
    fig = bkh.figure(
        x_axis_label = 'step',
        y_range = y_range,
        y_axis_label = 'gigabyte',
        y_axis_type = 'log',
        plot_width = 800,
        plot_height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Maximum RSS per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a vertical bar for maximum RSS
    bars = fig.vbar(x='steps', top='max_rss', bottom=1, width=0.8, color='palegoldenrod', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=4.5, y=max(max_rss)*5, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)

    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('max RSS', '@max_rss{0,0} GB'), 
         ], renderers=[bars], mode='mouse')) 

    return fig

In [22]:
# Generate a figure for peak RSS vs. step
peak_rss_per_step_fig = make_figure_per_step_memory(
    steps = get_column(per_step_df, 'step'), 
    max_rss = get_column(per_step_df, 'max_RSS'), 
)
bkh.show(peak_rss_per_step_fig)

# Export this figure
save_figure(peak_rss_per_step_fig, output_dir=output_dir, filename='peak-rss-per-step', title='DP0.2 Maximum RSS per step')

## Per step pipetask execution details

In [23]:
# Build a new dataframe per step and for each pipetask within a step 
# compute its number of quanta, its elapsed time and its peak_RSS
steps = sorted(df.select(pl.col('step')).unique().get_column('step').to_list())

step_dfs = {}
for step in steps:
    df_step = df.filter(pl.col('step') == step).groupby('task').agg(
        [
            pl.col("task").count().alias("quanta"),
            (pl.col("elapsed_time")/3_600).sum().alias("elapsed_time_hours"),
            pl.col('peak_RSS').max().alias('RSS_max'),
        ]
    )
    step_dfs[step] = df_step

In [24]:
# Generate a table with pipetask details for each step and write
# the same information into a CSV file for export
table = f"""
| step | task | quanta | elapsed time (hours) | max RSS (GB) |
| ---- | ---- | -----: | -------------------: | -----------: |
"""
csv_separator = ','
csv_output = csv_separator.join(('step', 'pipetask', 'quanta', 'elapsed_time_hours', 'peak_RSS_gb'))
for step in sorted(step_dfs.keys()):
    step_df = step_dfs[step].sort('task')
    for row_index in range(step_df.height):
        row = step_df.row(row_index)
        csv_output += f'\n{step}{csv_separator}' + csv_separator.join((str(v) for v in row))
        step_out = f'**{step}**' if row_index == 0 else ''
        task, quanta, elapsed, peak_RSS = row
        table += f'| {step_out} | {task} | {quanta:,} | {elapsed:,.1f} | {peak_RSS:,.1f} |\n'
        
# Write to CSV file
with open('pipetasks.csv', 'w') as f:
    f.write(csv_output)

print_md(table)


| step | task | quanta | elapsed time (hours) | max RSS (GB) |
| ---- | ---- | -----: | -------------------: | -----------: |
| **1** | calibrate | 2,805,180 | 25,503.0 | 28.7 |
|  | characterizeImage | 2,870,512 | 39,028.9 | 2.7 |
|  | isr | 2,870,894 | 24,837.6 | 2.7 |
|  | transformSourceTable | 2,805,173 | 2,358.8 | 2.7 |
|  | writeSourceTable | 2,805,173 | 1,001.7 | 2.7 |
| **2** | TE3 | 19,853 | 5,688.1 | 4.2 |
|  | TE4 | 19,852 | 5,689.6 | 4.2 |
|  | consolidateSourceTable | 19,853 | 1,182.2 | 1.4 |
|  | consolidateVisitSummary | 19,853 | 2,881.3 | 0.3 |
|  | nsrcMeasVisit | 19,853 | 5,677.6 | 4.1 |
| **3** | AB1 | 44,100 | 120.5 | 0.8 |
|  | AD1_design | 900 | 20.2 | 11.4 |
|  | AD2_design | 900 | 21.1 | 11.4 |
|  | AD3_design | 900 | 19.5 | 11.4 |
|  | AF1_design | 900 | 20.3 | 11.3 |
|  | AF2_design | 900 | 21.5 | 11.4 |
|  | AF3_design | 900 | 20.0 | 11.4 |
|  | AM1 | 900 | 20.2 | 11.4 |
|  | AM2 | 900 | 20.9 | 11.4 |
|  | AM3 | 900 | 19.9 | 11.4 |
|  | PA1 | 900 | 19.4 | 11.4 |
|  | PF1_design | 900 | 19.6 | 11.3 |
|  | assembleCoadd | 44,094 | 48,477.7 | 3.1 |
|  | consolidateObjectTable | 149 | 12.7 | 18.4 |
|  | deblend | 7,344 | 26,072.6 | 17.9 |
|  | detection | 44,094 | 3,832.3 | 3.1 |
|  | forcedPhotCoadd | 44,064 | 466,996.4 | 4.4 |
|  | healSparsePropertyMaps | 899 | 1,171.0 | 2.9 |
|  | makeWarp | 3,959,410 | 118,777.1 | 8.4 |
|  | matchCatalogsPatch | 44,100 | 12,979.9 | 15.6 |
|  | matchCatalogsPatchMultiBand | 7,350 | 11,755.0 | 53.0 |
|  | matchCatalogsTract | 900 | 2,871.4 | 171.9 |
|  | matchObjectToTruth | 149 | 27.6 | 4.4 |
|  | measure | 44,064 | 229,842.0 | 9.4 |
|  | mergeDetections | 7,344 | 155.7 | 0.5 |
|  | mergeMeasurements | 7,344 | 202.7 | 6.2 |
|  | modelPhotRepGal1 | 900 | 20.5 | 11.4 |
|  | modelPhotRepGal2 | 900 | 19.5 | 11.4 |
|  | modelPhotRepGal3 | 900 | 19.5 | 11.4 |
|  | modelPhotRepGal4 | 900 | 19.6 | 11.3 |
|  | modelPhotRepStar1 | 900 | 20.1 | 11.3 |
|  | modelPhotRepStar2 | 900 | 19.4 | 11.4 |
|  | modelPhotRepStar3 | 900 | 19.6 | 11.4 |
|  | modelPhotRepStar4 | 900 | 19.5 | 11.3 |
|  | psfPhotRepStar1 | 900 | 20.0 | 11.4 |
|  | psfPhotRepStar2 | 900 | 19.6 | 11.4 |
|  | psfPhotRepStar3 | 900 | 19.9 | 11.3 |
|  | psfPhotRepStar4 | 900 | 19.2 | 11.4 |
|  | selectGoodSeeingVisits | 44,100 | 164.3 | 0.3 |
|  | templateGen | 44,094 | 10,088.5 | 3.1 |
|  | transformObjectTable | 7,344 | 85.1 | 19.7 |
|  | writeObjectTable | 7,344 | 490.5 | 19.7 |
| **4** | forcedPhotCcd | 4,320,956 | 559,030.6 | 6.1 |
|  | forcedPhotDiffim | 4,320,163 | 514,967.3 | 6.3 |
|  | getTemplate | 2,782,954 | 83,011.9 | 3.6 |
|  | imageDifference | 2,782,569 | 61,927.0 | 6.3 |
|  | transformDiaSourceCat | 2,781,653 | 2,632.1 | 6.3 |
|  | writeForcedSourceTable | 4,319,075 | 5,008.5 | 6.3 |
| **5** | TE1 | 930 | 437.0 | 47.3 |
|  | TE2 | 930 | 434.2 | 47.3 |
|  | consolidateAssocDiaSourceTable | 155 | 2.9 | 1.6 |
|  | consolidateForcedSourceOnDiaObjectTable | 155 | 8.1 | 63.2 |
|  | consolidateFullDiaObjectTable | 155 | 2.3 | 1.7 |
|  | drpAssociation | 7,595 | 931.2 | 0.8 |
|  | drpDiaCalculation | 7,595 | 347.2 | 0.8 |
|  | forcedPhotCcdOnDiaObjects | 4,157,718 | 14,246.4 | 1.0 |
|  | forcedPhotDiffOnDiaObjects | 4,157,718 | 11,921.5 | 1.0 |
|  | transformForcedSourceOnDiaObjectTable | 7,595 | 6,036.1 | 4.2 |
|  | transformForcedSourceTable | 7,593 | 19,815.4 | 9.8 |
|  | writeForcedSourceOnDiaObjectTable | 4,157,718 | 384.8 | 1.0 |
| **6** | consolidateDiaSourceTable | 19,828 | 220.1 | 0.6 |
|  | consolidateRecalibratedSourceTable | 19,828 | 603.5 | 1.4 |
|  | transformRecalibratedSourceTable | 2,707,833 | 3,638.9 | 0.6 |
|  | writeRecalibrateSourceTable | 2,707,833 | 13,266.2 | 0.6 |
| **7** | consolidateHealSparsePropertyMaps | 6 | 1.8 | 2.8 |
|  | makeCcdVisitTable | 1 | 1.7 | 2.5 |
|  | makeVisitTable | 1 | 1.7 | 0.3 |


## Per task analysis

In [25]:
# Create a dataframe with details about each pipetask
per_task_df = df.groupby("task", maintain_order=True).agg(
    [
        pl.col("task").count().alias("task_count"),
        (pl.col("cpu_time")/3_600).sum().alias("cpu_time_hours"),
        (pl.col("elapsed_time")/3_600).sum().alias("elapsed_time_hours"),
        (pl.col("cpu_time").sum()/pl.col("elapsed_time").sum()).alias("cpu_efficiency"),
        pl.col('peak_RSS').min().alias('RSS_min'),
        pl.col('peak_RSS').max().alias('RSS_max'),
        pl.col('peak_RSS').mean().alias('RSS_mean'),
        pl.col('peak_RSS').std().alias('RSS_std'),
        pl.col('peak_RSS').quantile(0.05).alias('RSS_p05'),
        pl.col('peak_RSS').quantile(0.50).alias('RSS_p50'),
        pl.col('peak_RSS').quantile(0.95).alias('RSS_p95'),
    ]
)

In [26]:
task_types = per_task_df.height
total_elapsed_hours = per_task_df.select('elapsed_time_hours').sum().row(0)[0]
total_cpu_hours = per_task_df.select('cpu_time_hours').sum().row(0)[0]

overview = f"""
There were **{task_types:,} kinds of pipetasks** which consumed in aggregate **{total_elapsed_hours:,.0f} elapsed hours ({total_cpu_hours:,.0f} CPU hours**)
"""
print_md(overview)


There were **77 kinds of pipetasks** which consumed in aggregate **2,347,309 elapsed hours (1,728,260 CPU hours**)


In [27]:
per_task_df.sort(['elapsed_time_hours', 'task_count'], reverse=True)

| task (str) | task_count (u32) | cpu_time_hours (f64) | elapsed_time_hours (f64) | cpu_efficiency (f32) | RSS_min (f64) | RSS_max (f64) | RSS_mean (f64) | RSS_std (f64) | RSS_p05 (f64) | RSS_p50 (f64) | RSS_p95 (f64) |
| ---- | -----: | -----: | -----: | -----: | -----: | -----: | -----: | -----: | -----: | -----: | -----: |
| forcedPhotCcd | 4320956 | 360562.168082 | 559030.556208 | 0.646047 | 0.789324 | 6.105668 | 2.659361 | 0.727462 | 1.499 | 2.516116 | 3.620152 |
| forcedPhotDiffim | 4320163 | 322497.048844 | 514967.258536 | 0.625869 | 0.78728 | 6.26334 | 3.38019 | 0.207599 | 3.114832 | 3.37374 | 3.70822 |
| forcedPhotCoadd | 44064 | 465498.998973 | 466996.387273 | 0.996798 | 0.671344 | 4.414136 | 3.19682 | 0.532958 | 1.901612 | 3.296496 | 3.828436 |
| measure | 44064 | 225247.558957 | 229842.01584 | 0.980006 | 0.686332 | 9.366016 | 4.514935 | 1.75243 | 2.049336 | 4.097084 | 7.474968 |
| makeWarp | 3959410 | 100977.831742 | 118777.081407 | 0.850294 | 0.656996 | 8.369464 | 5.258593 | 1.609001 | 2.826792 | 5.168052 | 7.852832 |
| getTemplate | 2782954 | 30238.845204 | 83011.898205 | 0.364829 | 0.78728 | 3.613108 | 2.563435 | 0.246486 | 2.217896 | 2.56594 | 2.946296 |
| imageDifference | 2782569 | 55623.283681 | 61926.955453 | 0.898126 | 0.78728 | 6.26334 | 3.373445 | 0.21765 | 3.107744 | 3.369776 | 3.701564 |
| assembleCoadd | 44094 | 20804.817916 | 48477.700096 | 0.429163 | 1.318912 | 3.11066 | 1.956814 | 0.411765 | 1.528784 | 1.806616 | 2.630604 |
| characterizeImage | 2870512 | 36628.95093 | 39028.856148 | 0.938618 | 0.87666 | 2.72008 | 1.111754 | 0.125878 | 0.979472 | 1.023248 | 1.329052 |
| deblend | 7344 | 25959.366437 | 26072.617217 | 0.995654 | 3.021084 | 17.893952 | 13.26697 | 1.914749 | 8.09158 | 13.784692 | 15.052644 |


## Pipetask execution time

In [28]:
def make_figure_execution_time(tasks: List[str], elapsed_time: List[float], cpu_time: List[float], cpu_efficiency: List[float]) -> bkh.Figure:
    """Return a figure representing the execution time spent on each task category.
    """
    # Build the data source
    sorted_by_elapsed = sorted(zip(elapsed_time, cpu_time, cpu_efficiency, tasks), reverse=True)
    elapsed_time, cpu_time, cpu_efficiency, tasks = zip(*sorted_by_elapsed)    
    total_elapsed = sum(elapsed_time)
    elapsed_percentage = [100.0 * v/total_elapsed for v in elapsed_time]
    elapsed_cumulated = np.cumsum(elapsed_percentage)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks, 
        'elapsed_time': elapsed_time,
        'elapsed_percentage': elapsed_percentage,
        'elapsed_cumulated': elapsed_cumulated,
        'cpu_time': cpu_time,
        'cpu_efficiency': cpu_efficiency,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(elapsed_time)*5)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        y_axis_label = 'hours',
        y_axis_type = 'log',
        plot_width = 1_600,
        plot_height = 800,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Elapsed and CPU time spent by pipetask", text_font_size="18pt"), 'above')

    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_orientation = math.pi/3
    fig.xaxis.major_label_text_font_size = "11pt"
    
    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Annotation: band to show the tasks consuming a given percentage of elapsed time
    threshold_percent = 90
    right = np.argmax(elapsed_cumulated >= threshold_percent) + 1
    box = bkhmodels.BoxAnnotation(left=0, right=right, fill_alpha=0.1, fill_color='lightcoral')
    fig.add_layout(box)
    
    # Annotation: label to show a message on the meaning of the band
    label = bkhmodels.Label(x=1, y=max(elapsed_time)*2, x_units='data', y_units='data',
                 text=f'{threshold_percent:.0f}% of elapsed time', text_font_size='12pt', text_color='crimson')
    fig.add_layout(label)
    
    overall_cpu_efficiency = 100.0 * sum(cpu_time) / sum(elapsed_time)
    efficiency = bkhmodels.Label(x=len(tasks)//2.5, y=max(elapsed_time)/5, x_units='data', y_units='data',
                 text=f'\n aggregated CPU efficiency: {overall_cpu_efficiency:.0f}% \n', text_font_size='12pt',
                 text_color='dimgray', text_alpha=0.8,
                 background_fill_color='white', background_fill_alpha=1.0,
                 border_line_color='dimgray', border_line_alpha=0.5)
    fig.add_layout(efficiency)

    # Add a dash for elapsed time and a vertical bar for CPU time
    dashes = fig.dash(x='tasks', y='elapsed_time', color='crimson', size=15, line_width=2, source=source, legend_label='elapsed time')
    bars = fig.vbar(x='tasks', top='cpu_time', bottom=0.001, width=0.8, color='steelblue', alpha=0.7, source=source, legend_label='CPU time')
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('Task', '@tasks'), 
            ('Elapsed time', '@elapsed_time{0,0} h'), 
            ('CPU time', '@cpu_time{0,0} h'), 
            ('CPU efficiency', '@cpu_efficiency{%0}'), 
            ('Elapsed time (% of total)', '@elapsed_percentage{0.2f} %'), 
            ('Cumulated elapsed time', '@elapsed_cumulated{0.2f} %'),
        ], renderers=[bars, dashes], mode='mouse')) 

    return fig

In [29]:
# Select the tasks with non-zero CPU time
per_task_df_non_zero = per_task_df.filter(pl.col('cpu_time_hours') > 0.0)
task_elapsed_fig = make_figure_execution_time(
    tasks = get_column(per_task_df_non_zero, 'task'), 
    elapsed_time = get_column(per_task_df_non_zero, 'elapsed_time_hours'),
    cpu_time = get_column(per_task_df_non_zero, 'cpu_time_hours'),
    cpu_efficiency = get_column(per_task_df_non_zero, 'cpu_efficiency'),
)
bkh.show(task_elapsed_fig)

# Export this figure
save_figure(task_elapsed_fig, output_dir=output_dir, filename='elapsed-cpu-per-pipetask', title='DP0.2 Elapsed and CPU time per pipetask')

In [30]:
def make_figure_cpu_efficiency(tasks: List[str], cpu_efficiency: List[float]) -> bkh.Figure:
    """Return a figure representing the CPU efficiency per pipetask.
    """
    # Build the data source
    sorted_by_efficiency = sorted(zip(cpu_efficiency, tasks), reverse=True)
    cpu_efficiency, tasks = zip(*sorted_by_efficiency)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks, 
        'cpu_efficiency': cpu_efficiency,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0, 1)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        plot_width = 1_600,
        plot_height = 800,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU efficiency by pipetask", text_font_size="18pt"), 'above')

    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LinearAxis(y_range_name="y_axis_right"), 'right')

    # Axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_orientation = math.pi/3
    fig.xaxis.major_label_text_font_size = "11pt"

    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.formatter = bokeh.models.NumeralTickFormatter(format='0%')
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
     
    # Add bars
    palette = ['crimson', 'crimson', 'crimson', 'sandybrown', 'mediumseagreen']
    color_map = bokeh.transform.linear_cmap(field_name='cpu_efficiency', palette=palette, low=0, high=max(cpu_efficiency))
    bars = fig.vbar(x='tasks', top='cpu_efficiency', bottom=0, width=0.8, source=source, color=color_map, fill_color=color_map, alpha=0.6)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('pipetask', '@tasks'), ('CPU efficiency', '@cpu_efficiency{%0}')], renderers=[bars], mode='mouse'))

    return fig

In [31]:
task_cpu_efficiency_fig = make_figure_cpu_efficiency(
    tasks = get_column(per_task_df, 'task'),
    cpu_efficiency = get_column(per_task_df, 'cpu_efficiency')
)
bkh.show(task_cpu_efficiency_fig)

# Export this figure
save_figure(task_cpu_efficiency_fig, output_dir=output_dir, filename='cpu-efficiency-per-pipetask', title='DP0.2 CPU efficiency per pipetask')

## Pipetask memory consumption

In [32]:
def make_figure_rss_max(tasks: List[str], rss_max: List[float]) -> bkh.Figure:
    """Return a figure representing the peak RSS per pipetask.
    """
    # Build the data source
    sorted_by_rss = sorted(zip(rss_max, tasks), reverse=True)
    rss_max, tasks = zip(*sorted_by_rss)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks, 
        'rss_max': rss_max,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0.1, max(rss_max)*5)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        y_axis_label = 'gigabyte',
        y_axis_type = 'log',
        plot_width = 1_600,
        plot_height = 800,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory – Processing for Data Preview 0.2 at FrDF (v23.0.1)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Peak RSS by pipetask kind", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = math.pi/3
    
    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')

    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.formatter = bokeh.models.NumeralTickFormatter(format='0,0')
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
    
    # Add vertical bars
    # bars = fig.vbar(x='tasks', top='rss_max', bottom=0, width=0.8, color='mediumturquoise', alpha=0.7, source=source)
    circles = fig.circle(x='tasks', y='rss_max', color='crimson', size=10, alpha=0.7, source=source)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('Task', '@tasks'), ('Max RSS', '@rss_max{0.2f} GB')], renderers=[circles], mode='mouse'))
    
    return fig

In [33]:
task_peak_rss_fig = make_figure_rss_max(
    tasks=per_task_df.get_column('task').to_list(),
    rss_max=per_task_df.get_column('RSS_max').to_list()
)
bkh.show(task_peak_rss_fig)

In [34]:
def make_figure_rss_quantiles(tasks: List[str], rss_min: List[float], rss_max: List[float], rss_mean: List[float], rss_pct_low: List[float], rss_pct_high: List[float], label: str = None, note: str = None) -> bkh.Figure:
    """Return a figure representing the RSS distribution for each pipetask.
    """
    # Build the data source: sort tasks by rss_max
    sorted_by_max = sorted(zip(rss_max, rss_min, rss_mean, rss_pct_low, rss_pct_high, tasks), reverse=True)
    rss_max, rss_min, rss_mean, rss_pct_low, rss_pct_high, tasks = zip(*sorted_by_max)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks,
        'rss_max': rss_max,
        'rss_min': rss_min,
        'rss_mean': rss_mean,
        'rss_pct_low': rss_pct_low,
        'rss_pct_high': rss_pct_high,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0.1, max(rss_max)*3)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        x_axis_label = 'pipetask',
        y_axis_label = 'gigabyte',
        y_axis_type = 'log',
        plot_width = 1_600,
        plot_height = 1_000,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Memory consumption by pipetask", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = math.pi/2.5
    
    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')

    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
        
    # Annotation
    if label is not None:
        annotation_label = bkhmodels.Label(x=len(tasks)//2.2, y=max(rss_max)/1.2, x_units='data', y_units='data',
                     text=label, text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(annotation_label)

    
    # Add glyphs
    dashes_max = fig.dash(x='tasks', y='rss_max', color='indianred', size=15, line_width=3, source=source, legend_label='max')
    mean = fig.circle(x='tasks', y='rss_mean', size=8, color='mediumseagreen', source=source, legend_label='mean')
    dashes_min = fig.dash(x='tasks', y='rss_min', color='steelblue', size=15, line_width=3, source=source, legend_label='min')
    whisker = bkhmodels.Whisker(base='tasks', upper='rss_pct_high', lower='rss_pct_low', source=source)
    fig.add_layout(whisker)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[
        ('Task', '@tasks'),
        ('RSS min', '@rss_min{0.2f} GB'),
        ('RSS percentile low', '@rss_pct_low{0.2f} GB'),
        ('RSS mean', '@rss_mean{0.2f} GB'),
        ('RSS percentile high', '@rss_pct_high{0.2f} GB'),
        ('RSS max', '@rss_max{0.2f} GB'),
    ], renderers=[mean, dashes_max, dashes_min], mode='mouse'))
    
    # Add note below the figure
    if note is not None:
        fig.add_layout(bkhmodels.Title(text=note, text_font_style='italic'), 'below')
    
    return fig

In [35]:
# Select the tasks we want to include in our figure
max_rss_lower_bound = 1  # gigabytes
per_task_df_non_zero = per_task_df.filter((pl.col('RSS_p05') >= 0.1) & (pl.col('RSS_max') >= max_rss_lower_bound))

task_rss_quantiles_fig = make_figure_rss_quantiles(
    tasks = get_column(per_task_df_non_zero, 'task'),
    rss_max = get_column(per_task_df_non_zero, 'RSS_max'),
    rss_min = get_column(per_task_df_non_zero, 'RSS_min'),
    rss_mean = get_column(per_task_df_non_zero, 'RSS_mean'),
    rss_pct_low = get_column(per_task_df_non_zero, 'RSS_p05'),
    rss_pct_high = get_column(per_task_df_non_zero, 'RSS_p95'),
    label=f'\n Pipetasks with max RSS ≥ {max_rss_lower_bound} GB \n (whiskers show 5th to 95th percentiles) \n',
#    note=f'NOTE: whiskers show 5th to 95th percentiles.',
)
bkh.show(task_rss_quantiles_fig)

# Export this figure
save_figure(task_rss_quantiles_fig, output_dir=output_dir, filename='memory-consumption-per-pipetask', title='DP0.2 Memory consumption per pipetask')

## Memory distribution for pipetask consuming most of the compute time

In [36]:
def make_figure_big_task_consumers(tasks: List[str], elapsed_time_pct: List[float], rss_min: List[float], rss_max: List[float], rss_mean: List[float],
                                   rss_pct_low: List[float], rss_pct_high: List[float], label: str = None, note: str = None) -> bkh.Figure:
    """Return a figure representing the RSS distribution for each pipetask and its consumption of compute time.
    """
    # Build the data source: sort tasks by share of elapsed time, in decreasing order
    sorted_by_elapsed = sorted(zip(elapsed_time_pct, rss_min, rss_max, rss_mean, rss_pct_low, rss_pct_high, tasks), reverse=True)
    elapsed_time_pct, rss_min, rss_max, rss_mean, rss_pct_low, rss_pct_high, tasks = zip(*sorted_by_elapsed)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks,
        'elapsed_time_pct': elapsed_time_pct,
        'elapsed_time_cumulated': np.cumsum(elapsed_time_pct),
        'rss_min': rss_min,
        'rss_max': rss_max,
        'rss_mean': rss_mean,
        'rss_pct_low': rss_pct_low,
        'rss_pct_high': rss_pct_high,
    })
    
    # Build and configure the figure
    left_y_range = bokeh.models.Range1d(0.1, max(rss_max)*2)
    fig = bkh.figure(
        x_range = tasks,
        y_range = left_y_range,
        y_axis_type = 'log',
        y_axis_label = 'gigabyte',
        plot_width = 1_600,
        plot_height = 1_200,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Memory used by the most compute-intensive pipetasks", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_size = "11pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = math.pi/2.5
    
    # Add secondary Y axis
    bottom_right_y_axis = min(elapsed_time_pct) / 2
    top_right_axis = max(elapsed_time_pct) * 1.2
    fig.extra_y_ranges = {"y_axis_right": bokeh.models.Range1d(bottom_right_y_axis, top_right_axis)}
    fig.add_layout(bokeh.models.LinearAxis(y_range_name='y_axis_right', axis_label='total elapsed time'), 'right')
    fig.yaxis[1].formatter = bokeh.models.NumeralTickFormatter(format='0%')

    # Format Y axis
    # fig.yaxis[1].ticker = bokeh.models.tickers.LogTicker()
    for y_axis in fig.yaxis:
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
        
    # Annotation
    if label is not None:
        annotation_label = bkhmodels.Label(x=len(tasks)//2.2, y=max(rss_max)/1.2, x_units='data', y_units='data',
                     text=label, text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(annotation_label)

    
    # Add glyphs
    dashes_max = fig.dash(x='tasks', y='rss_max', color='indianred', size=15, line_width=3, source=source, legend_label='max RSS')
    mean = fig.circle(x='tasks', y='rss_mean', size=8, color='mediumseagreen', source=source, legend_label='mean RSS')
    dashes_min = fig.dash(x='tasks', y='rss_min', color='steelblue', size=15, line_width=3, source=source, legend_label='min RSS')
    bars = fig.vbar(x='tasks', top='elapsed_time_pct', bottom=bottom_right_y_axis, width=0.8, color='tan', alpha=0.3, source=source, legend_label='elapsed time', y_range_name='y_axis_right')
    
    whisker = bkhmodels.Whisker(base='tasks', upper='rss_pct_high', lower='rss_pct_low', source=source)
    fig.add_layout(whisker)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Mute the glyphs of a legend entry when it is clicked
    fig.legend.click_policy = 'mute'

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[
        ('Task', '@tasks'),
        ('RSS max', '@rss_max{0.2f} GB'),
        ('RSS percentile high', '@rss_pct_high{0.2f} GB'),
        ('RSS mean', '@rss_mean{0.2f} GB'),
        ('RSS percentile low', '@rss_pct_low{0.2f} GB'),
        ('RSS min', '@rss_min{0.2f} GB'),
        ('elapsed time', '@elapsed_time_pct{0.0%}'),
        ('cumulated elapsed time', '@elapsed_time_cumulated{0.0%}'),
    ], renderers=[bars, mean, dashes_max, dashes_min], mode='mouse'))
    
    # Add note below the figure
    if note is not None:
        fig.add_layout(bkhmodels.Title(text=note, text_font_style='italic'), 'below')
    
    return fig

In [37]:
# Select the tasks which, in aggregate, account for a given share of the total elapsed time
task_time_consumers_df = per_task_df.filter(pl.col('cpu_time_hours') > 0.0)
tasks = get_column(task_time_consumers_df, 'task')
elapsed_time = get_column(task_time_consumers_df, 'elapsed_time_hours')

# Sort by elapsed time in decreasing order
sorted_by_elapsed_time = sorted(zip(elapsed_time, tasks), reverse=True)
elapsed_time, tasks = zip(*sorted_by_elapsed_time)

# Compute the percentage of total time each kind of pipetask spent in execution
elapsed_percentage = [100.0 * v/sum(elapsed_time) for v in elapsed_time]
elapsed_cumulated = np.cumsum(elapsed_percentage)

cumulated_elapsed_threshold = 98 # percent
time_consumers_tasks = np.array(tasks)[elapsed_cumulated <= cumulated_elapsed_threshold].tolist()

# Build a dataframe with the big consumer tasks and their memory usage.
# Columns used below:
#    task, elapsed_time_hours, elapsed_time_pct, RSS_min, RSS_mean, RSS_max, RSS_p05, RSS_p95
time_consumers_df = (
    per_task_df.with_column(
        # Add column 'elapsed_time_pct' with the percentage of total time consumed by each kind of pipetask
        (pl.col('elapsed_time_hours') / pl.col('elapsed_time_hours').sum()).alias('elapsed_time_pct')
    )
    .filter(
        # Select only the tasks consuming the most
        pl.col('task').is_in(time_consumers_tasks)
    )
    .sort(by='elapsed_time_hours', reverse=True)
)

# Build the plot
task_rss_consumers_fig = make_figure_big_task_consumers(
    tasks = get_column(time_consumers_df, 'task'),
    elapsed_time_pct = get_column(time_consumers_df, 'elapsed_time_pct'),
    rss_max = get_column(time_consumers_df, 'RSS_max'),
    rss_min = get_column(time_consumers_df, 'RSS_min'),
    rss_mean = get_column(time_consumers_df, 'RSS_mean'),
    rss_pct_low = get_column(time_consumers_df, 'RSS_p05'),
    rss_pct_high = get_column(time_consumers_df, 'RSS_p95'),
    # label=f'\n Pipetasks which consume in aggregate {cumulated_elapsed_threshold}% of elapsed time.\n',
    note = f'NOTE: the pipetasks shown consume in aggregate {cumulated_elapsed_threshold}% of the total elapsed time of the DP0.2 campaign. Whiskers show 5th to 95th RSS percentiles.',
)
bkh.show(task_rss_consumers_fig)

# Export this figure
save_figure(task_rss_consumers_fig, output_dir=output_dir, filename='memory-by-compute-intensive-pipetasks', title='DP0.2 Memory by most compute-intensive pipetasks')
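
The selection performed in the cell above (rank tasks by elapsed time, accumulate their percentages, keep those under the threshold) can be sketched with the standard library only. This is a minimal illustration, not the notebook's implementation: the function name `select_top_consumers` and the task names and values in `example` are hypothetical.

```python
from itertools import accumulate
from typing import Dict, List


def select_top_consumers(elapsed_hours: Dict[str, float], threshold_percent: float) -> List[str]:
    """Return the tasks whose cumulative share of elapsed time stays within the threshold."""
    # Rank task names by elapsed time, largest consumer first
    ranked = sorted(elapsed_hours, key=elapsed_hours.get, reverse=True)
    total = sum(elapsed_hours.values())
    # Cumulative percentage of the total after each task in the ranking
    cumulated = accumulate(100.0 * elapsed_hours[task] / total for task in ranked)
    return [task for task, share in zip(ranked, cumulated) if share <= threshold_percent]


# Hypothetical elapsed times in hours (illustrative values, not DP0.2 measurements)
example = {'taskA': 60.0, 'taskB': 25.0, 'taskC': 10.0, 'taskD': 5.0}
selected = select_top_consumers(example, threshold_percent=90)
print(selected)
```

With a `<=` comparison, the task that crosses the threshold is left out, so the selected tasks account for at most the requested share of elapsed time.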

## Memory distribution per pipetask

In [38]:
df

| task (str) | band (str) | visit (str) | cpu_time (f32) | elapsed_time (f32) | peak_RSS (f64) | run (u16) | step (u16) | cpu_efficiency (f32) |
| :--------- | :--------- | :---------- | -------------: | -----------------: | -------------: | --------: | ---------: | -------------------: |
| writeSourceTable | r | 1171832 | 0.232792 | 0.78534 | 1.329064 | 9 | 1 | 0.296421 |
| isr | r | 1155569 | 24.685934 | 25.170607 | 2.668844 | 9 | 1 | 0.980744 |
| calibrate | r | 1182632 | 36.576054 | 38.942017 | 1.32892 | 9 | 1 | 0.939244 |
| writeSourceTable | z | 1174511 | 0.21872 | 0.793542 | 1.329056 | 9 | 1 | 0.275625 |
| transformSourceTable | z | 1174424 | 0.142618 | 1.665406 | 1.328892 | 9 | 1 | 0.085635 |
| characterizeImage | i | 1185851 | 49.158764 | 50.42981 | 1.32892 | 9 | 1 | 0.974796 |
| isr | y | 1056411 | 24.167212 | 28.213358 | 2.68126 | 9 | 1 | 0.856588 |
| writeSourceTable | r | 1185197 | 0.224889 | 1.236427 | 1.329068 | 9 | 1 | 0.181886 |
| isr | y | 1176106 | 23.32202 | 24.606878 | 2.66482 | 9 | 1 | 0.947785 |
| isr | y | 1177175 | 33.599365 | 36.804726 | 2.668596 | 9 | 1 | 0.912909 |


In [39]:
def make_figure_rss_histogram_per_task(hist: List[float], edges: List[float], task: str, annotation: str=None) -> bkh.Figure:
    """Return a figure with a histogram of the RSS for a given task
    """
    # Build and configure the figure
    width, height = 1_000, 600
    y_range = bokeh.models.Range1d(1, max(hist)*15)
    fig = bkh.figure(
        x_range = (edges[0], edges[-1]),
        x_axis_label = 'gigabyte',
        y_axis_label = 'frequency',
        y_range = y_range,
        y_axis_type = 'log',
        plot_width = width,
        plot_height = height,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text=f"Distribution of memory usage by {task} pipetask", text_font_size="16pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.xaxis.formatter = bokeh.models.NumeralTickFormatter(format='0,.f')

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a histogram
    quad = fig.quad(top=hist, bottom=1, left=edges[:-1], right=edges[1:],
           fill_color="orange", line_color="black", alpha=0.3)
    
    # Add annotation label
    if annotation is not None:
        # TODO: fix the issue with the location of the annotation in the plot
        label = bkhmodels.Label(x=int(width*0.75), y=int(height*0.60),
                                x_units='screen', y_units='screen',
                                x_offset=15, y_offset=15,
                                text=f'{annotation}', text_font_size='10pt',
                                text_color='dimgray', text_alpha=0.8,
                                background_fill_color='white', background_fill_alpha=1.0,
                                border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)
        
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[
        ('Interval', 'from @left{0.1f} to @right{0.1f} GB'),
        ('Frequency', '@top{0,} quanta'),
    ], renderers=[quad,], mode='mouse'))
       
    # Hide toolbar
    fig.toolbar.autohide = True

    return fig

In [40]:
def describe_column(df, column) -> Tuple[float, float, float, float]:
    """Return the (min, max, mean, std) of the values in ``column`` of ``df``."""
    # Aggregate directly with select(): this yields a single-row dataframe
    return df.select(
        [
            pl.col(column).min().alias('min'),
            pl.col(column).max().alias('max'),
            pl.col(column).mean().alias('mean'),
            pl.col(column).std().alias('std'),
        ]
    ).row(0)

memory_distribution_figures = []
bins = 20
for task in sorted(time_consumers_tasks):
    # Collect the histogram data for this task
    task_df = df.filter(
        (pl.col('task') == task) & pl.col('peak_RSS').is_not_null()
    ).select(pl.col('peak_RSS'))
    
    # Compute descriptive statistics
    min_value, max_value, mean_value, std_value = describe_column(task_df, 'peak_RSS')
        
    # Compute histogram
    sizes = task_df.get_column('peak_RSS').to_list()
    hist, edges = np.histogram(sizes, density=False, bins=bins)
    count = len(sizes)
    
    annotation = f' N: {count:,} \n min: {min_value:.2f} \n mean: {mean_value:.2f} \n std: {std_value:.2f} \n max: {max_value:.2f} '
    
    # Plot the histogram
    fig = make_figure_rss_histogram_per_task(hist, edges, task, annotation=annotation)
    bkh.show(fig)
    
    # Export this figure
    filename = f'memory-distribution-{task}'
    memory_distribution_figures.append(filename)
    save_figure(fig, output_dir=output_dir, filename=filename, title=f'DP0.2 Memory distribution by pipetask {task}')
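
As a rough cross-check of the statistics shown in each histogram annotation, the same four values can be computed with Python's `statistics` module. The sketch below uses the peak RSS of the four `isr` rows visible in the dataframe preview above. The helper name `describe_values` is hypothetical, and the sketch assumes that the `std` reported by the notebook is the sample standard deviation.

```python
import statistics
from typing import List, Tuple


def describe_values(values: List[float]) -> Tuple[float, float, float, float]:
    """Return (min, max, mean, sample standard deviation) of the values."""
    return min(values), max(values), statistics.mean(values), statistics.stdev(values)


# Peak RSS (GB) of the four `isr` rows shown in the dataframe preview
isr_peak_rss = [2.668844, 2.68126, 2.66482, 2.668596]
min_value, max_value, mean_value, std_value = describe_values(isr_peak_rss)

# Same layout as the annotation built in the loop above
annotation = (
    f' N: {len(isr_peak_rss):,} \n min: {min_value:.2f} \n mean: {mean_value:.2f} '
    f'\n std: {std_value:.2f} \n max: {max_value:.2f} '
)
print(annotation)
```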

In [54]:
def publish(results_dir, publication_top_dir):
    """Publish the results of this notebook under ``publication_top_dir``.
    """
    # Create the publication directories and remove .png and .html files
    # that may exist in them
    publication_dir = os.path.join(publication_top_dir, 'pipetasks')    
    os.makedirs(publication_dir, exist_ok=True)
    
    dst_images_dir = os.path.join(publication_dir, 'images')
    os.makedirs(dst_images_dir, exist_ok=True)
    for f in glob.glob(os.path.join(dst_images_dir, '*.png')):
        os.remove(f)
    
    dst_html_dir = os.path.join(publication_dir, 'html')
    os.makedirs(dst_html_dir, exist_ok=True)
    for f in glob.glob(os.path.join(dst_html_dir, '*.html')):
        os.remove(f)
    
    # Publish PNG files
    src_image_dir = os.path.join(results_dir, 'images')
    for src_file in glob.glob(os.path.join(src_image_dir, '*.png')):
        dst_file = os.path.join(publication_dir, 'images', os.path.basename(src_file))
        shutil.copy(src_file, dst_file)
    
    # Publish HTML files
    src_html_dir = os.path.join(results_dir, 'html')
    for src_file in glob.glob(os.path.join(src_html_dir, '*.html')):
        dst_file = os.path.join(publication_dir, 'html', os.path.basename(src_file))
        shutil.copy(src_file, dst_file)
        
    # Publish main HTML file
    src_index_html = os.path.join(results_dir, 'pipetasks.html')
    dst_index_html = os.path.join(publication_dir,  os.path.basename(src_index_html))
    shutil.copy(src_index_html, dst_index_html)

In [55]:
# Render the HTML template
import jinja2

environment = jinja2.Environment(loader=jinja2.FileSystemLoader('./templates'))
template = environment.get_template('pipetasks-template.html')

pipetasks_html = os.path.join(output_dir, 'pipetasks.html')
with open(pipetasks_html, mode="w", encoding="utf-8") as f:
    context = {
        'figures': memory_distribution_figures,
    }
    f.write(template.render(context))

In [56]:
publish(results_dir=output_dir, publication_top_dir='/sps/lsst/users/fabio/web/rubin-dp0.2-at-frdf')

## Categorize tasks per memory consumption

In [44]:
def categorize_tasks_per_memory(df: pl.DataFrame, categories: dict = {'small': 5, 'medium': 20, 'high': None}) -> dict:
    """Return a dictionary with the elapsed time (in hours) spent by each memory category of tasks.

    Tasks are classified by their max RSS (in GB) using the thresholds in ``categories``.
    The keys of the returned dictionary are human-readable descriptions of each category.
    """    
    result = {}
    key = f"0 GB ≤ max RSS ≤ {categories['small']} GB"
    elapsed = df.filter(
        (pl.col('RSS_max') >= 0) & (pl.col('RSS_max') <= categories['small'])
    ).select('elapsed_time_hours').sum().row(0)[0]
    result[key] = elapsed
    
    key = f"{categories['small']} GB < max RSS ≤ {categories['medium']} GB"
    elapsed = df.filter(
        (pl.col('RSS_max') > categories['small']) & (pl.col('RSS_max') <= categories['medium'])
    ).select('elapsed_time_hours').sum().row(0)[0]
    result[key] = elapsed

    key = f"max RSS > {categories['medium']} GB"
    elapsed = df.filter(
        pl.col('RSS_max') > categories['medium']
    ).select('elapsed_time_hours').sum().row(0)[0]
    result[key] = elapsed

    return result

In [45]:
def make_figure_task_category_by_memory(categories: List[str], elapsed_time: List[float]) -> bkh.Figure:
    """Return a figure representing the elapsed time spent by each of the categories of tasks
    """
    # Build the data source
    total_elapsed = sum(elapsed_time)
    source = bkhmodels.ColumnDataSource(data={
        'category': categories, 
        'elapsed_time': [v/total_elapsed for v in elapsed_time], 
    })
    
    # Build and configure the figure
    fig = bkh.figure(
        x_range = bokeh.models.Range1d(0, 1.0),
        y_range = categories,
        plot_width = 800,
        plot_height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin Observatory French Data Facility – processing for Data Preview 0.2 (v23)", text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Elapsed time spent per category of pipetask", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.xaxis.formatter = bokeh.models.NumeralTickFormatter(format='0%')

    # Format Y axis
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
     
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add horizontal bars
    bars = fig.hbar(y='category', right='elapsed_time', left=0, height=0.8, source=source, fill_color='wheat', alpha=0.7, line_color='black')
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('Elapsed time', '@elapsed_time{%0}')], renderers=[bars], mode='mouse'))

    return fig

In [46]:
task_categories = categorize_tasks_per_memory(per_task_df)

task_categories_fig = make_figure_task_category_by_memory(
    categories=list(task_categories.keys()),
    elapsed_time=list(task_categories.values())
)
bkh.show(task_categories_fig) 

## Overview of resource consumption by kind of task

The table below summarizes the CPU and memory consumption by kind of task. Tasks are presented in alphabetical order, not in the order they were executed.

In [None]:
per_task_df = df.groupby("task", maintain_order=True).agg(
    [
        pl.col("task").count().alias("task_count"),
        pl.col("task").n_unique().alias("task_kinds"),
        (pl.col("cpu_time")/3_600).sum().alias("cpu_time_hours"),
        (pl.col("elapsed_time")/3_600).sum().alias("elapsed_time_hours"),
        (pl.col("cpu_time").sum()/pl.col("elapsed_time").sum()).alias("cpu_efficiency"),
        pl.col("peak_RSS").max().alias("max_peak_RSS"),
    ]
)

In [None]:
# Generate a table with a summary of the characteristics of each task kind
summary = f"""
| Pipetask     | Number of tasks | Cumulated elapsed time (h) | Cumulated CPU time (h) | Overall CPU efficiency      | Max RSS (GB) |
| -----------: | --------------: | -------------------------: | ---------------------: | --------------------------: | -----------: |
"""

for task in sorted(per_task_df.get_column("task")):
    task_info = per_task_df.filter(pl.col("task") == task)
    task_count = task_info["task_count"][0]
    elapsed_time = task_info['elapsed_time_hours'][0]
    cpu_time = task_info["cpu_time_hours"][0]
    cpu_efficiency = task_info['cpu_efficiency'][0]
    max_rss = task_info['max_peak_RSS'][0]
    summary += f'| `{task}` | {task_count:,} | {math.ceil(elapsed_time):,.0f} | {math.ceil(cpu_time):,.0f} | {cpu_efficiency:.2f} | {max_rss:.0f} |\n'    

# Summarize the dataframe
total_task_count = per_task_df.select("task_count").sum()[0,0]
total_elapsed_time = per_task_df.select("elapsed_time_hours").sum()[0,0]
total_cpu_time = per_task_df.select("cpu_time_hours").sum()[0,0]
total_cpu_efficiency = (per_task_df.select("cpu_time_hours").sum() / per_task_df.select("elapsed_time_hours").sum())[0,0]
total_max_rss = "n/a"

summary += f'| **Total** | **{total_task_count:,}** | **{math.ceil(total_elapsed_time):,.0f}** | **{math.ceil(total_cpu_time):,.0f}** | **{total_cpu_efficiency:.2f}** | **{total_max_rss}** |\n'
print_md(summary)
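
The "Overall CPU efficiency" column is the ratio of summed CPU time to summed elapsed time, which is not the same as averaging the per-quantum efficiencies. The sketch below illustrates the difference using the three `writeSourceTable` rows from the dataframe preview earlier in this notebook. The function name `aggregated_cpu_efficiency` is hypothetical and introduced only for this illustration.

```python
from typing import List, Tuple


def aggregated_cpu_efficiency(quanta: List[Tuple[float, float]]) -> float:
    """Return sum(CPU time) / sum(elapsed time) over (cpu_time, elapsed_time) pairs."""
    total_cpu = sum(cpu for cpu, _ in quanta)
    total_elapsed = sum(elapsed for _, elapsed in quanta)
    return total_cpu / total_elapsed


# (cpu_time, elapsed_time) in seconds for the three `writeSourceTable` rows of the preview
write_source_table = [(0.232792, 0.78534), (0.21872, 0.793542), (0.224889, 1.236427)]

aggregated = aggregated_cpu_efficiency(write_source_table)
mean_of_ratios = sum(cpu / elapsed for cpu, elapsed in write_source_table) / len(write_source_table)
print(f'aggregated: {aggregated:.3f}, mean of per-quantum ratios: {mean_of_ratios:.3f}')
```

The aggregated ratio weights each quantum by its elapsed time, so long-running quanta dominate the result, which is usually what matters when sizing a processing campaign.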

## Tasks categorization by memory consumption

The table below categorizes tasks by their peak memory usage and presents the CPU time spent executing each category. Three classes are used: small-, medium- and high-memory tasks.

In [None]:
def categorize_tasks_per_memory(df: pd.DataFrame, categories: dict = {'small': 5, 'medium': 20, 'high': None}) -> pd.DataFrame:
    """Return a data frame with details about categories of tasks according to ``categories``.

    The returned data frame is indexed by category and contains the CPU time spent in each category (in seconds)
    as well as the fraction of the total CPU time spent by each category of task.
    """
    # Determine the CPU time consumed by three classes of tasks: small-memory, medium-memory and high-memory
    total_cpu_time = df['cpu_time'].sum()

    # Compute the CPU time spent running each category of tasks
    is_small_memory_task = df['maxRSS'] <= categories['small']
    is_medium_memory_task = (df['maxRSS'] > categories['small']) & (df['maxRSS'] <= categories['medium'])
    is_high_memory_task = ~is_small_memory_task & ~is_medium_memory_task
    cpu_time = [
        df[is_small_memory_task]['cpu_time'].sum(),
        df[is_medium_memory_task]['cpu_time'].sum(),
        df[is_high_memory_task]['cpu_time'].sum(),
    ]
    
    # Compute the fraction of the total CPU time spent by each category of tasks
    cpu_time_percent = [v/total_cpu_time for v in cpu_time]
    
    # Build the resulting dataframe
    data = {
        'category': categories.keys(),
        'cpu_time': cpu_time,
        'cpu_time_percent': cpu_time_percent,
    }
    
    out_df = pd.DataFrame.from_records(data=data)
    out_df.set_index('category', inplace=True)
    return out_df

In [None]:
def generate_task_categories_summary(df: pd.DataFrame, categories: dict = {'small': 5, 'medium': 20, 'high': None}) -> str:
    """Return a Markdown-formatted table with a summary of the percentage of CPU time spent in each category of task.
    """
    # Summarize
    summary = f"""
| category of task    |    CPU time                                         | comments                                                                      |
| ------------------- | --------------------------------------------------: | ----------------------------------------------------------------------------- |
| **small-memory**    | {100. * df.loc['small', 'cpu_time_percent']:.2f}%   | 0 < max RSS ≤ {categories['small']:.0f} GB                                    |
| **medium-memory**   | {100. * df.loc['medium', 'cpu_time_percent']:.2f}%  | {categories['small']:.0f} < max RSS ≤ {categories['medium']:.0f} GB           |
| **high-memory**     | {100. * df.loc['high', 'cpu_time_percent']:.2f}%    | max RSS > {categories['medium']:.0f} GB                                       |
    """
    return summary

In [None]:
# Categorize tasks by their max RSS and generate a summary
task_categories = {
    'small': 5,
    'medium': 20,
    'high': None
}
categories_df = categorize_tasks_per_memory(df, categories=task_categories)
summary = generate_task_categories_summary(categories_df, categories=task_categories)
print_md(summary)

In [None]:
def make_task_category_figure(df: pd.DataFrame) -> bkh.Figure:
    """Return a figure representing the CPU time spent on each task category
    """
    # Build the data source
    categories = [f"{v}-memory tasks" for v in df.index.values]
    source = bkhmodels.ColumnDataSource(data={
        'category': categories,
        'percentage': df['cpu_time_percent'].values,
        'percentage_tooltips': [100.0 * v for v in df['cpu_time_percent'].values],
        'color': ['mediumseagreen', 'gold', 'crimson'],
    })
    
    # Build and configure the figure
    fig = bkh.figure(
        x_axis_label = 'CPU time',
        x_range = bokeh.models.Range1d(0, 1.0),
        y_range=categories,
        plot_width=600,
        plot_height=400,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin processing at FrDF for Data Preview 0.2 (v23.0.1)", text_font_style="italic", text_font_size="12pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU time spent per task category", text_font_size="18pt"), 'above')

    # Axes
    fig.xaxis.axis_label_text_font_size = "14pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_size = "14pt"
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
    
    # Add bars
    bars = fig.hbar(y='category', left=0, right='percentage', height=0.5, color='color', source=source)
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('task category', '@category'), ('CPU time', '@percentage_tooltips{0.2f} %')], renderers=[bars], mode='mouse'))
    
    # Add formatters
    fig.xaxis[0].formatter = bokeh.models.NumeralTickFormatter(format="0%")

    return fig

In [None]:
category_fig = make_task_category_figure(categories_df)
bkh.show(category_fig)

print_md(f"""
The figure above presents the fraction of CPU time spent executing each category of tasks. Hover over the bars for details about each category. Tasks are categorized as small-, medium- and high-memory.
Small-memory tasks are those requiring up to {task_categories['small']} GB. Medium-memory tasks are those
requiring more than {task_categories['small']} GB and up to {task_categories['medium']} GB. High-memory tasks are those requiring more than {task_categories['medium']} GB.
""")

In [None]:
def summarize_cpu_time_per_task_category(df: pd.DataFrame, categories: dict = {'small': 5, 'medium': 20, 'high': None}) -> pd.DataFrame:
    """Return a data frame, indexed by task type, with the share of CPU time spent in each memory category.

    The columns 'small', 'medium' and 'high' contain, for each task type, the fraction (0 to 1) of that
    task type's CPU time consumed by tasks in the corresponding category.
    """
    
    out_df = pd.DataFrame()
    for name, group in df.groupby('task'):
        # Compute the CPU spent per task category, for this group
        is_small_memory_task = group['maxRSS'] <= categories['small']
        is_medium_memory_task = (group['maxRSS'] > categories['small']) & (group['maxRSS'] <= categories['medium'])
        is_high_memory_task = ~is_small_memory_task & ~is_medium_memory_task
        
        group_cpu_time = group['cpu_time'].sum()
        row = { 'task': name }
        row['small'] = group[is_small_memory_task]['cpu_time'].sum() / group_cpu_time
        row['medium'] = group[is_medium_memory_task]['cpu_time'].sum() / group_cpu_time
        row['high'] = group[is_high_memory_task]['cpu_time'].sum() / group_cpu_time
        out_df = pd.concat([out_df, pd.DataFrame.from_records(data=[row,], columns=row.keys())])

    out_df.set_index('task', inplace=True)
    out_df.sort_values(by='task', ascending=True, inplace=True)
    return out_df

In [None]:
def make_figure_category_within_task_type(df: pd.DataFrame) -> bkh.Figure:
    """Return a figure representing the CPU time spent on each task category, for each kind of task
    """
    # Sort by the share of CPU time in the small, then medium, then high categories so that vertical reading makes sense
    df = df.sort_values(by=['small', 'medium', 'high'], ascending=True)

    # Build the data source
    source = bkhmodels.ColumnDataSource(data={
        'task': df.index.values,
        'small': df['small'].values,
        'medium': df['medium'].values,
        'high': df['high'].values,
    })
    
    # Build and configure the figure
    fig = bkh.figure(
        x_axis_label = 'CPU time',
        y_axis_label = 'task',
        x_range = bokeh.models.Range1d(0, 1.0),
        y_range = df.index.values,
        plot_width = 1200,
        plot_height = 1200,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin processing at FrDF for Data Preview 0.2 (v23.0.1)", text_font_style="italic", text_font_size="12pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU time spent per task category", text_font_size="18pt"), 'above')

    # Axes
    fig.xaxis.axis_label_text_font_size = "14pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_size = "14pt"
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
    fig.xaxis[0].formatter = bokeh.models.NumeralTickFormatter(format="0%")
    
    # Legend
    fig.add_layout(bokeh.models.Legend(), 'right')
    
    # Add bars
    bars = fig.hbar_stack(['small', 'medium', 'high'], y='task', color=['mediumseagreen', 'gold', 'crimson'], height=0.7, source=source, legend_label=['small-memory', 'medium-memory', 'high-memory'])
    
    # Add tooltips
    # fig.add_tools(bkhmodels.HoverTool(tooltips=[('task', '@task'), ('small memory', '@small{0.2f} %')], renderers=[bars], mode='mouse'))
    # fig.add_tools(bkhmodels.HoverTool(tooltips=[('task', '@task'),], renderers=[bars], mode='mouse'))
    return fig

In [None]:
tasks_df = summarize_cpu_time_per_task_category(df, task_categories)
all_tasks_fig = make_figure_category_within_task_type(tasks_df)
bkh.show(all_tasks_fig)
print_md(f"""
The figure above presents a more detailed view of the share of CPU time spent by each task category for every kind of pipeline task. Tasks are categorized as small-, medium- and high-memory.
Small-memory tasks are those requiring up to {task_categories['small']} GB. Medium-memory tasks are those
requiring more than {task_categories['small']} GB and up to {task_categories['medium']} GB. High-memory tasks are those requiring more than {task_categories['medium']} GB.
""")

In [None]:
def build_boxplot_dataframe(df: pd.DataFrame, column: str) -> Tuple[pd.DataFrame, pd.Series]:
    """Build a dataframe with data to populate a figure composed of box plots for each
    kind of task based on the values of the column ``column``.
    
    Parameters
    ----------
    df : pd.DataFrame
       dataframe which contains one row for each executed task.
       
    column: str
       name of the column in ``df`` to be used as criterion for building the boxplot
       data.
       
    Returns
    -------
    first: pd.DataFrame
       a dataframe with columns 'task', 'q0', 'q1', 'q2', 'q3', 'q4', 'lower' and 'upper'.
       The columns 'qN' contain the value of the corresponding quartile. The columns
       'lower' and 'upper' contain the values for the lower and upper whiskers of the
       box plot.
       This dataframe is indexed by task name.
       
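    second: pd.Series
       the values of ``column`` that fall outside the whiskers of their task's box plot.
       The first index level of this series is the task name.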
       
    Notes
    -----
    See also: https://en.wikipedia.org/wiki/Box_plot
    """
    # Build a new dataframe where each row contains values computed for each kind of task.
    # Those values are used later for building a figure with a boxplot per task type.
    # The values are the minimum, the three quartiles and the maximum of ``column``,
    # as well as the lower and upper whisker limits derived from the inter-quartile range.
    out_df = pd.DataFrame()
    
    for name, group in df.groupby('task'):
        # Compute the minimum, the 25th, 50th and 75th percentiles and the maximum for this group
        q0, q1, q2, q3, q4 = np.percentile(group[column].dropna(), (0.0, 25.0, 50.0, 75.0, 100.0))
                
        # Compute inter-quartile range and values for lower and upper whiskers
        iqr = q3 - q1
        lower = max(q1 - 1.5*iqr, q0)
        upper = min(q3 + 1.5*iqr, q4)

        # Append a new row to the resulting dataframe for this kind of task
        row = {
            'task': name, 
            'q0': q0, 
            'q1': q1, 
            'q2': q2,
            'q3': q3,
            'q4': q4,
            'lower': lower,
            'upper': upper,
        }
        out_df = pd.concat([out_df, pd.DataFrame.from_records(data=[row,], columns=row.keys())])

    # Set the dataframe index to the task name
    out_df.set_index('task', inplace=True)
    
    # Select outliers
    def select_outliers(group):
        """Return outliers for the given DataFrame group. It expects a single group per task.

        Outliers are those data points below the task's lower limit or above its upper limit.
        """
        task = group.name
        lower = out_df.loc[task]['lower']
        upper = out_df.loc[task]['upper']
        is_outlier = (group[column] < lower) | (group[column] > upper)
        return group[is_outlier][column]

    outliers = df.groupby('task').apply(select_outliers).dropna()
    return out_df, outliers

In [None]:
def make_figure_boxplot_rss(df: pd.DataFrame, unit: str, outliers: pd.Series) -> bkh.Figure:
    """Build a figure with one box plot per kind of task
    """
    # Sort the dataframe by median value
    df = df.sort_values(by='q2', ascending=False)
    
    # Retrieve the names of the tasks
    tasks = df.index.values
    
    # Build and configure the figure
    fig = bkh.figure(
        x_axis_label = 'max RSS (gigabyte)',
        y_axis_label = 'pipeline task',
        background_fill_color = "#f4f3f3",
        y_range = tasks,
        plot_width = 1200,
        plot_height = 1400
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin processing at FrDF for Data Preview 0.2 (v23.0.1)", text_font_style="italic", text_font_size="12pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Memory Consumption by Pipeline Tasks", text_font_size="18pt"), 'above')

    # Axes
    fig.xaxis.axis_label_text_font_size = "14pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_size = "14pt"
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
    
    # Data source
    data_source = bkhmodels.ColumnDataSource({
        'task': df.index.values,
        'q0': df['q0'].values,
        'q1': df['q1'].values,
        'q2': df['q2'].values,
        'q3': df['q3'].values,
        'q4': df['q4'].values,
        'lower': df['lower'].values,
        'upper': df['upper'].values,
    })

    # Stems
    line_color = 'black'
    fig.segment(x0='q3', y0='task', x1='upper', y1='task', source=data_source, line_color=line_color)
    fig.segment(x0='lower', y0='task', x1='q1', y1='task', source=data_source, line_color=line_color)

    # Boxes
    box_fill_color = 'tan'
    box_height = 0.8
    boxes = fig.hbar(y='task', height=box_height, right='q3', left='q1', source=data_source, fill_color=box_fill_color, line_color=box_fill_color)

    # Add tooltips for boxes
    fig.add_tools(
        bkhmodels.HoverTool(
            tooltips=[('task', '@task'), ('min', f'@q0{{0.2}} {unit}'), ('median', f'@q2{{0.}} {unit}'),  ('max', f'@q4{{0.}} {unit}')],
            renderers=[boxes],
            mode='mouse'
        )
    )

    # Whiskers
    whisker_height = box_height * 0.50
    fig.rect(x='lower', y='task', width=0.001, height=whisker_height, source=data_source, line_color=line_color, fill_color=line_color)
    fig.rect(x='upper', y='task', width=0.001, height=whisker_height, source=data_source, line_color=line_color, fill_color=line_color)

    # Median
    median_color = 'red'
    fig.rect(x='q2', y='task', width=0.001, height=box_height, source=data_source, line_color=median_color, fill_color=median_color)
    
    # Outliers
    if not outliers.empty:
        outliers_data = bkhmodels.ColumnDataSource({
            'x': list(outliers.values),
            'task': list(outliers.index.get_level_values(0)),
        })
        outliers_color = '#F38630' # 'darksalmon'
        circles = fig.circle(x='x', y='task', source=outliers_data, size=6, color=outliers_color, fill_alpha=0.6)
        fig.add_tools(bkhmodels.HoverTool(tooltips=[('task', '@task'), ('max RSS', f'$x{{0.1}} {unit}')], renderers=[circles], mode='mouse'))

    return fig

In [None]:
rss_boxplot_df, rss_outliers = build_boxplot_dataframe(df, column='maxRSS')
box_plot_rss_fig = make_figure_boxplot_rss(rss_boxplot_df, 'GB', rss_outliers)
bkh.show(box_plot_rss_fig)
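
The box plot above relies on quartiles and on whiskers placed at most 1.5 times the inter-quartile range away from the box, clamped to the observed minimum and maximum. The following sketch reproduces that rule on a handful of made-up values using only the standard library. `statistics.quantiles` with `method="inclusive"` is used here as a stand-in for `np.percentile`, which should give the same quartiles with its default linear interpolation.

```python
from statistics import quantiles

# Hypothetical peak RSS values (GB) for one kind of task
values = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 12.0]

# First, second (median) and third quartiles, linearly interpolated
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

# Whisker limits: 1.5 x IQR beyond the box, but never beyond the observed extremes
iqr = q3 - q1
lower = max(q1 - 1.5 * iqr, min(values))
upper = min(q3 + 1.5 * iqr, max(values))

# Points outside the whisker limits are drawn individually as outliers
outliers = [v for v in values if v < lower or v > upper]
print(q1, q2, q3, lower, upper, outliers)
```

With this rule the upper whisker ends at the limit itself whenever outliers exist, whereas some box plot conventions draw it at the most extreme data point inside the limit.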

## CPU efficiency

In [None]:
# Compare the CPU efficiency of step 4 tasks, discriminating among tasks by run.
# Some runs were executed after downloading the input data to the worker's local disk, while others
# accessed the input data directly from the remote server
df_step_4 = df[df['step'] == 4]

run = 4
is_group_1 = df_step_4['run'] <= run
is_group_2 = ~(df_step_4['run'] <= run)
cpu_efficiency_group_1 = df_step_4[is_group_1]['cpu_efficiency']
cpu_efficiency_group_1_mean, cpu_efficiency_group_1_std = cpu_efficiency_group_1.mean(), cpu_efficiency_group_1.std()
overall_cpu_efficiency_group_1 = df_step_4[is_group_1]['cpu_time'].sum() / df_step_4[is_group_1]['utc_time'].sum()

cpu_efficiency_group_2 = df_step_4[is_group_2]['cpu_efficiency']
cpu_efficiency_group_2_mean, cpu_efficiency_group_2_std = cpu_efficiency_group_2.mean(), cpu_efficiency_group_2.std()
overall_cpu_efficiency_group_2 = df_step_4[is_group_2]['cpu_time'].sum() / df_step_4[is_group_2]['utc_time'].sum()

summary = f"""
### CPU efficiency of step 4 tasks:

| run         | overall                                          | mean per task                               | std per task                                |
| ----------- | -----------------------------------------------: | ------------------------------------------: | ------------------------------------------: |
| **≤ {run}** | {100.0 * overall_cpu_efficiency_group_1:.1f}%    | {100.0 * cpu_efficiency_group_1_mean:.1f}%  | {100.0 * cpu_efficiency_group_1_std:.1f}%   |
| **> {run}** | {100.0 * overall_cpu_efficiency_group_2:.1f}%    | {100.0 * cpu_efficiency_group_2_mean:.1f}%  | {100.0 * cpu_efficiency_group_2_std:.1f}%   |

({df_step_4.shape[0]:,} tasks)
"""
print_md(summary)
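
The table above reports two different notions of CPU efficiency. The "overall" value is the ratio of the summed CPU time to the summed elapsed time, so long-running tasks weigh more, while the "mean per task" value averages the individual ratios and gives every task the same weight. The sketch below illustrates the difference on made-up numbers, assuming the per-task efficiency is the CPU time divided by the elapsed time.

```python
from statistics import mean, stdev

# Hypothetical (CPU time, elapsed time) pairs, in seconds, one per task
tasks = [(90.0, 100.0), (5.0, 10.0), (400.0, 500.0)]

# Efficiency of each individual task
per_task_efficiency = [cpu / elapsed for cpu, elapsed in tasks]

# Overall efficiency: ratio of sums, weighted by elapsed time
overall = sum(cpu for cpu, _ in tasks) / sum(elapsed for _, elapsed in tasks)

# Unweighted mean and sample standard deviation of the per-task values
mean_per_task = mean(per_task_efficiency)
std_per_task = stdev(per_task_efficiency)

print(f"overall: {100 * overall:.1f}%")
print(f"mean per task: {100 * mean_per_task:.1f}% (std {100 * std_per_task:.1f}%)")
```

In this example the short, inefficient task pulls the per-task mean below the overall value, because it contributes little elapsed time to the sums.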

In [None]:
# Estimate the CPU efficiency distribution of one kind of task (matchCatalogsTract) with a Gaussian kernel density estimate
from scipy.stats import gaussian_kde

efficiencies = df[df['task'] == 'matchCatalogsTract']['cpu_efficiency']
pdf = gaussian_kde(efficiencies)
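
To make explicit what a Gaussian kernel density estimate computes, here is a minimal one-dimensional sketch written with the standard library only. It is not the SciPy implementation: it assumes Scott's rule for the bandwidth (which is also SciPy's documented default) and uses made-up efficiency values expressed as fractions.

```python
import math
from statistics import stdev

def gaussian_kde_1d(samples):
    """Return a function estimating the density of ``samples`` with Gaussian kernels."""
    n = len(samples)
    # Scott's rule: bandwidth = sample standard deviation * n^(-1/5)
    bandwidth = stdev(samples) * n ** (-1 / 5)

    def density(x):
        # Average of Gaussian bumps centered on each sample
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density

# Hypothetical per-task CPU efficiencies (fractions between 0 and 1)
efficiencies = [0.62, 0.70, 0.71, 0.75, 0.78, 0.80, 0.83, 0.85, 0.90, 0.95]
pdf = gaussian_kde_1d(efficiencies)

# Evaluate the density on a regular grid and check that it integrates to about 1
step = 0.01
grid = [i * step for i in range(-50, 201)]
area = sum(pdf(x) for x in grid) * step
print(f"area under the estimated density: {area:.3f}")
```

The evaluation grid should cover the range of the data: a grid expressed in percent would miss a density estimated from fractions.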

In [None]:
# The density estimate `pdf` was computed from the matchCatalogsTract tasks
tasks = ('matchCatalogsTract',)

x = np.linspace(start=0, stop=100, num=10)
source = bokeh.models.ColumnDataSource(data=dict(x=x))

def ridge(category, data, scale=100):
    """Return (category, offset) pairs placing ``data`` relative to the baseline of ``category``."""
    return list(zip([category]*len(data), scale*data))

p = bokeh.plotting.figure(y_range=tasks, width=900, x_range=(-5, 105), toolbar_location=None)

y = ridge('matchCatalogsTract', pdf(x))
source.add(y, 'matchCatalogsTract')
p.patch('x', 'matchCatalogsTract', color='blue', alpha=0.6, line_color="black", source=source)

bokeh.plotting.show(p)

# TODO

* Categorize tasks per memory consumption
* Add a plot to show tasks ordered by memory consumption
* Add a plot to show tasks by CPU elapsed time
* Add a plot to show tasks by CPU efficiency

* For each kind of task, display the 98th percentile of CPU efficiency using horizontal bars or alternatively a [ridge plot](https://docs.bokeh.org/en/latest/docs/gallery/ridgeplot.html) for showing the distribution of CPU efficiency
* Improve comments on the notebook for preparing for publication
* Create git repo and publish the results