# Analysis of HSC PDR2 pipetask execution

### Introduction

This notebook summarizes the CPU and memory consumption of pipeline tasks, as observed at the French Data Facility (FrDF) while executing processing for Data Preview 0.2.
More info on the run [here](https://rtn-063.lsst.io/).

### Source data
Source data for this analysis is obtained from the records about task execution collected by the pipelines themselves and recorded in the butler repository. Those records are stored in `parquet` format.
The data are located under `/sps/lsst/dataproducts/hsc/hsc_pdr2/u/lsstprod/gather-resource-usage/`.
There is a `step_*/` folder for each step of the pipeline, containing:
 + one `*_resource_usage/` folder per task, with the metrics of each execution of that task,
 + one `ResourceUsageSummary/` folder per step, with the runtime and memory distributions of each task.
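
As an illustration of this layout, here is a minimal sketch that builds a throwaway directory tree and separates summary files from per-task files with a recursive glob. The names `step1_example`, `isr_resource_usage`, `calibrate_resource_usage` and `data.parq` are made up for the example and are not the real file names; the helper `split_parquet_paths` is hypothetical.

```python
import pathlib
import tempfile


def split_parquet_paths(data_dir):
    """Return (summary_files, per_task_files) found under data_dir."""
    root = pathlib.Path(data_dir)
    # Files located in a folder whose name contains 'ResourceUsageSummary'
    summary = sorted(root.glob("**/*ResourceUsageSummary*/*.parq"))
    # Every other .parq file is assumed to hold per-task records
    per_task = sorted(set(root.glob("**/*.parq")) - set(summary))
    return summary, per_task


# Hypothetical tree: one step with two tasks and one summary folder
with tempfile.TemporaryDirectory() as tmp:
    step = pathlib.Path(tmp) / "step1_example"
    for sub in ("isr_resource_usage", "calibrate_resource_usage", "ResourceUsageSummary"):
        (step / sub).mkdir(parents=True)
        (step / sub / "data.parq").touch()

    summary, per_task = split_parquet_paths(tmp)
    summary_names = [p.parent.name for p in summary]
    # The task name is the part of the folder name before the first underscore
    task_names = [p.parent.name.split("_")[0] for p in per_task]
    print(summary_names, task_names)
```

Deriving the task name from the folder name only works as long as task names contain no underscore, which is an assumption of this sketch.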

Each pipetask of the LSST Science Pipelines records information about its execution. Among the metrics recorded are the time spent in its execution and the memory it used. Those metrics are recorded in the Butler repository as task metadata datasets named `*_metadata` (e.g. `assembleCoadd_metadata` or `makeWarp_metadata`) in YAML format.

??? In this particular case, for DP0.2 Quentin extracted and processed those YAML using Jim Chiang's [gather_resource_info.py](https://github.com/LSSTDESC/gen3_workflow/blob/master/python/desc/gen3_workflow/gather_resource_info.py) module. See further details in Quentin's https://gitlab.in2p3.fr/rubin-lsst/dp02-analysis.

Among the metrics collected for each task we have:
 * Runtime or Elapsed time (in seconds)
 * Maximum [resident set size](https://en.wikipedia.org/wiki/Resident_set_size): this value is obtained via Python's [resource.getrusage()](https://docs.python.org/3/library/resource.html#resource.getrusage) and represents the maximum value of RSS recorded for the process (in kilobytes). We interpret this as the maximum amount of RAM the process executing a given pipeline task has been allocated.
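
To make the memory metric concrete, the following sketch reads the peak RSS of the current process with `resource.getrusage()`. It is not the code used by the pipelines; the helper name `max_rss_gigabytes` and the 50 MB allocation are illustrative assumptions.

```python
import resource
import sys


def max_rss_gigabytes():
    """Peak resident set size of the current process, in gigabytes (1e9 bytes)."""
    max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return max_rss * bytes_per_unit / 1e9


before = max_rss_gigabytes()
payload = bytearray(50_000_000)  # request ~50 MB; this may or may not raise the peak
after = max_rss_gigabytes()
print(f"peak RSS before: {before:.3f} GB, after: {after:.3f} GB")
```

The reported value is a high-water mark: it never decreases during the life of the process, even after memory is released.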

Previous versions of the data also included the CPU time, but this metric is no longer available as of 2024-09-29:
 * CPU time: collected using Python's [time.process_time()](https://docs.python.org/3/library/time.html#time.process_time). It returns *"the sum of the system and user CPU time of the current process. It does not include time elapsed during sleep."* This is a real value in seconds.
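
For reference, CPU efficiency is the ratio of CPU time to elapsed time. The sketch below measures both for a toy workload; the helper `measure` and the function `workload` are hypothetical and only illustrate how the two clocks differ.

```python
import time


def measure(func):
    """Run func() and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def workload():
    sum(i * i for i in range(200_000))  # CPU-bound part
    time.sleep(0.2)                     # idle part, not counted as CPU time


elapsed, cpu = measure(workload)
efficiency = cpu / elapsed
print(f"elapsed: {elapsed:.3f} s, CPU: {cpu:.3f} s, efficiency: {efficiency:.0%}")
```

Because the sleep contributes to the elapsed time but not to the CPU time, the efficiency of this workload stays below 100%.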

In [1]:
import glob
import math
import os
import pathlib
import shutil
import sys
from typing import Tuple, List

In [2]:
import polars as pl
import polars.selectors as cs

# Set the maximum length to display for string columns
_ = pl.Config.set_fmt_str_lengths(50)

In [3]:
import pandas as pd
import numpy as np

In [4]:
import bokeh
import bokeh.plotting as bkh
import bokeh.models as bkhmodels
from bokeh.io import curdoc, output_notebook

curdoc().clear()

# Render Bokeh figures inline in the notebook
output_notebook()

In [5]:
import IPython.display
print_md = IPython.display.Markdown

In [6]:
def build_dataframe_from_dir(dir: str) -> pl.DataFrame:
    """Read all .parquet files in directory 'dir' and build a single dataframe.

    Parameters
    ----------
    dir : `str`
       directory where the input .parquet files are located. All the files
       in that directory are read to build the dataframe

    Returns
    -------
    build_dataframe_from_dir: polars.DataFrame
       a dataframe where each row contains the information of one task.
    """
    print(f"Loading data files in directory {dir} ...")
    df = None
    for subdir in pathlib.Path(dir).glob("*"):
        for file in pathlib.Path(subdir).glob("*.parq"):
            this_df = pl.read_parquet(file)
            df = (
                this_df
                if df is None
                else pl.concat([df, this_df], how="diagonal_relaxed")
            )

    return df


def build_dataframe_from_list(paths: list[pathlib.Path]) -> pl.DataFrame:
    """Read all .parquet files given in 'paths' and build a single dataframe.

    Parameters
    ----------
    paths : `list[pathlib.Path]`
       paths of the input .parq files. All the files are read to build the
       dataframe; step and task names are taken from the parent directories.

    Returns
    -------
    df : polars.DataFrame
       a dataframe where each row contains the information of one task.
    """
    print("Loading data files ...")
    df = None
    for file in paths:
        this_df = pl.read_parquet(file)
        # Pre-pend with step ID, and task ID if needed
        if "task" in this_df.columns:
            this_df = this_df.select(
                pl.lit(file.parents[1].name).alias("step"), pl.all()
            )
        else:
            this_df = this_df.select(
                pl.lit(file.parents[1].name).alias("step"),
                pl.lit(file.parents[0].name.split("_")[0]).alias("task"),
                pl.all(),
            )
        df = this_df if df is None else pl.concat([df, this_df], how="diagonal_relaxed")

    return df

In [150]:
# Load the data files for each step of interest and aggregate it in a single dataframe
#data_dir = '../../data/pipetasks'
data_dir = '/sps/lsst/dataproducts/hsc/hsc_pdr2/u/lsstprod/gather-resource-usage'

# Processing identifier
processing_label = "Rubin Observatory French Data Facility – HSC PDR2 Reprocessing and Operations Rehearsal for DRP (v27)"

# Processing ID
processing_ID = None

#paths = sorted(pathlib.Path(data_dir).glob('step*'))
# We use recursive globbing to match the files '**/' or .rglob
paths_RessourceUsageSummary = sorted(pathlib.Path(data_dir).glob('**/*ResourceUsageSummary*/*.parq'))
paths_individualTask = sorted(list(set(pathlib.Path(data_dir).glob('**/*.parq')) - set(paths_RessourceUsageSummary)))

print(f"There is {len(paths_RessourceUsageSummary)} step and {len(paths_individualTask)} individual tasks")
                                                        
df_all, df_summary = None, None
df_all = build_dataframe_from_list(paths_individualTask)
df_summary = build_dataframe_from_list(paths_RessourceUsageSummary)

# Convert the 'memory' column to gigabytes (the division by 1e9 assumes the stored values are in bytes)
# and add 'elapsed_time_pct', the fraction of the total run time spent in each quantum
df_all = df_all.with_columns([
    (pl.col("memory") / 1e9),
    (pl.col('run_time') / pl.col('run_time').sum()).alias('elapsed_time_pct'),
])

There is 9 step and 41 individual tasks
Loading data files ...
Loading data files ...


In [8]:
def clean_output_dir(directory):
    """Remove all .html and .png files from `directory`
    """
    if not os.path.exists(directory):
        return
    
    to_remove = glob.glob(os.path.join(directory, 'images', '*.png')) + glob.glob(os.path.join(directory, 'html', '*.html'))
    for name in to_remove:
        os.remove(name)

In [9]:
# Create an output directory for plots created by this notebook
output_dir = os.path.join('.', 'results')
os.makedirs(output_dir, exist_ok=True)

clean_output_dir(output_dir)

## Overview

In [10]:
total_elapsed_time_hours_IT = df_all.get_column("run_time").sum() / 3_600
total_elapsed_time_hours_RUS = df_summary.get_column("integrated_runtime_hrs").sum()

# CPU time is not available in this data set: -1 is a placeholder, so the
# CPU hours and CPU efficiency reported below are not meaningful
#total_cpu_time_hours = pl.sum(df.get_column("cpu_time")) / 3_600
total_cpu_time_hours = -1
#global_cpu_efficiency = 100 * (total_cpu_time_hours / total_elapsed_time_hours)
global_cpu_efficiency = 100 * (total_cpu_time_hours / total_elapsed_time_hours_IT)

overview = f"""
There were **{df_all.shape[0]:,} pipetasks** which consumed **{total_elapsed_time_hours_IT:,.0f} elapsed hours ({total_cpu_time_hours:,.0f} CPU hours)**
for a global CPU efficiency of **{global_cpu_efficiency:.1f}%**.
"""
print_md(overview)


There were **104,864 pipetasks** which consumed **1,617 elapsed hours (-1 CPU hours)**
for a global CPU efficiency of **-0.1%**.


In [11]:
df_all.head(5)
#df_all.n_unique("quanta")

step,task,band,instrument,day_obs,detector,physical_filter,visit,memory,init_time,run_time,group,exposure,skymap,tract,patch
str,str,str,str,i64,i64,str,i64,f64,f64,f64,str,i64,str,i64,i64
"""step1_20240828T132128Z""","""calibrate""","""g""","""HSC""",20150813,0,"""HSC-G""",38440,1.860194,0.043232,17.485948,,,,,
"""step1_20240828T132128Z""","""calibrate""","""g""","""HSC""",20151014,0,"""HSC-G""",42158,1.864344,0.03966,42.24704,,,,,
"""step1_20240828T132128Z""","""calibrate""","""g""","""HSC""",20151014,0,"""HSC-G""",42170,1.867305,0.040183,24.076729,,,,,
"""step1_20240828T132128Z""","""calibrate""","""g""","""HSC""",20151014,0,"""HSC-G""",42172,1.864229,0.03583,36.990868,,,,,
"""step1_20240828T132128Z""","""calibrate""","""g""","""HSC""",20151014,0,"""HSC-G""",42174,1.851531,0.046396,23.545905,,,,,


In [12]:
df_summary.head(5)

step,task,quanta,integrated_runtime_hrs,mem_GB_p000,mem_GB_p001,mem_GB_p005,mem_GB_p032,mem_GB_p050,mem_GB_p068,mem_GB_p095,mem_GB_p099,mem_GB_p100,runtime_s_p000,runtime_s_p001,runtime_s_p005,runtime_s_p032,runtime_s_p050,runtime_s_p068,runtime_s_p095,runtime_s_p099,runtime_s_p100
str,str,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""step1_20240828T132128Z""","""calibrate""",13529,167.770902,1.613598,1.724127,1.725945,1.734741,1.736198,1.738201,1.834927,2.084296,2.174904,9.536886,16.764526,20.926171,33.581326,43.019593,52.236053,76.045029,90.384598,116.709748
"""step1_20240828T132128Z""","""characterizeImage""",13766,272.800257,1.613598,1.724155,1.725948,1.734749,1.736217,1.738282,1.835623,2.089701,2.174904,17.084336,30.320538,38.064049,58.991123,70.88596,80.546032,109.542471,139.589732,257.292522
"""step1_20240828T132128Z""","""isr""",14208,133.382186,1.613598,1.724046,1.725856,1.734582,1.735981,1.737675,1.83255,1.834735,1.839458,15.695945,21.299464,24.703215,31.241011,32.306488,35.117577,44.508923,47.933342,62.964969
"""step1_20240828T132128Z""","""transformPreSourceTable""",13529,0.604893,1.613598,1.724127,1.725945,1.734741,1.736198,1.738201,1.834927,2.084296,2.174904,0.093049,0.103454,0.113912,0.152783,0.162058,0.172113,0.19786,0.222688,0.823765
"""step1_20240828T132128Z""","""writePreSourceTable""",13529,1.122836,1.613598,1.724127,1.725945,1.734741,1.736198,1.738201,1.834927,2.084296,2.174904,0.141746,0.174111,0.2055,0.266139,0.292306,0.324557,0.401867,0.451771,1.120655


## Check consistency
We check that the data in `df_all`, the DataFrame with one row per quantum, and in `df_summary`, the DataFrame with summarized data for each task, are consistent.

In [13]:
# Percentiles used in ResourceUsageSummary
percentiles = ["000", "001", "005", "032", "050", "068", "095", "099", "100"]

# Recompute the per-task summary metrics from df_all, with the same columns as df_summary:
# - quanta: number of quanta executed in this task
# - integrated_runtime_hrs: elapsed time spent in this task (hours)
# - mem_GB_pXXX: XXX-th percentile of the maximum RSS in the given task (in gigabytes)
# - runtime_s_pXXX: XXX-th percentile of the run time in the given task (in seconds)
df_summary_check = df_all.group_by("task", maintain_order=True).agg(
    [     
        pl.col("step").unique().get(0).alias("step"),
        pl.col("task").count().alias("quanta"),
        (pl.col("run_time").sum()/3_600).alias("integrated_runtime_hrs"),
    ] + [
        pl.col("memory").quantile((int(p))/100).alias(f"mem_GB_p{p}") for p in percentiles
    ] + [
        pl.col("run_time").quantile(int(p)/100).alias(f"runtime_s_p{p}") for p in percentiles
    ]
)

display(df_summary_check, df_summary)

task,step,quanta,integrated_runtime_hrs,mem_GB_p000,mem_GB_p001,mem_GB_p005,mem_GB_p032,mem_GB_p050,mem_GB_p068,mem_GB_p095,mem_GB_p099,mem_GB_p100,runtime_s_p000,runtime_s_p001,runtime_s_p005,runtime_s_p032,runtime_s_p050,runtime_s_p068,runtime_s_p095,runtime_s_p099,runtime_s_p100
str,str,u32,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""calibrate""","""step1_20240828T132128Z""",13529,167.770902,1.732588,1.851265,1.853219,1.862664,1.864229,1.866379,1.970242,2.238063,2.335285,9.536886,16.756969,20.925693,33.58139,43.019593,52.235669,76.055155,90.394904,116.709748
"""characterizeImage""","""step1_20240828T132128Z""",13766,272.800257,1.732588,1.85131,1.853223,1.862672,1.864249,1.866465,1.970995,2.243785,2.335285,17.084336,30.32414,38.063157,58.99247,70.887692,80.545272,109.549904,139.507073,257.292522
"""isr""","""step1_20240828T132128Z""",14208,133.382186,1.732588,1.851179,1.853121,1.862492,1.863995,1.865814,1.967686,1.970033,1.975103,15.695945,21.299271,24.703027,31.240946,32.306885,35.118042,44.510339,47.933537,62.964969
"""transformPreSourceTable""","""step1_20240828T132128Z""",13529,0.604893,1.732588,1.851265,1.853219,1.862664,1.864229,1.866379,1.970242,2.238063,2.335285,0.093049,0.103453,0.11391,0.152783,0.162058,0.172113,0.197861,0.222724,0.823765
"""writePreSourceTable""","""step1_20240828T132128Z""",13529,1.122836,1.732588,1.851265,1.853219,1.862664,1.864229,1.866379,1.970242,2.238063,2.335285,0.141746,0.174084,0.205478,0.26614,0.292306,0.324557,0.401929,0.452102,1.120655
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""photometricCatalogMatch""","""step8_20240909T130219Z""",1,0.002065,1.892135,1.892135,1.892135,1.892135,1.892135,1.892135,1.892135,1.892135,1.892135,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349
"""photometricRefCatObjectTract""","""step8_20240909T130219Z""",1,0.005757,0.82919,0.82919,0.82919,0.82919,0.82919,0.82919,0.82919,0.82919,0.82919,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211
"""plotPropertyMapTract""","""step8_20240909T130219Z""",5,0.124601,3.972592,3.972592,3.972592,3.975291,3.977843,3.982262,3.98524,3.98524,3.98524,64.289125,64.289125,64.289125,92.639373,94.022988,97.031368,100.580533,100.580533,100.580533
"""refCatObjectTract""","""step8_20240909T130219Z""",1,0.004115,0.770441,0.770441,0.770441,0.770441,0.770441,0.770441,0.770441,0.770441,0.770441,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182


step,task,quanta,integrated_runtime_hrs,mem_GB_p000,mem_GB_p001,mem_GB_p005,mem_GB_p032,mem_GB_p050,mem_GB_p068,mem_GB_p095,mem_GB_p099,mem_GB_p100,runtime_s_p000,runtime_s_p001,runtime_s_p005,runtime_s_p032,runtime_s_p050,runtime_s_p068,runtime_s_p095,runtime_s_p099,runtime_s_p100
str,str,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""step1_20240828T132128Z""","""calibrate""",13529,167.770902,1.613598,1.724127,1.725945,1.734741,1.736198,1.738201,1.834927,2.084296,2.174904,9.536886,16.764526,20.926171,33.581326,43.019593,52.236053,76.045029,90.384598,116.709748
"""step1_20240828T132128Z""","""characterizeImage""",13766,272.800257,1.613598,1.724155,1.725948,1.734749,1.736217,1.738282,1.835623,2.089701,2.174904,17.084336,30.320538,38.064049,58.991123,70.88596,80.546032,109.542471,139.589732,257.292522
"""step1_20240828T132128Z""","""isr""",14208,133.382186,1.613598,1.724046,1.725856,1.734582,1.735981,1.737675,1.83255,1.834735,1.839458,15.695945,21.299464,24.703215,31.241011,32.306488,35.117577,44.508923,47.933342,62.964969
"""step1_20240828T132128Z""","""transformPreSourceTable""",13529,0.604893,1.613598,1.724127,1.725945,1.734741,1.736198,1.738201,1.834927,2.084296,2.174904,0.093049,0.103454,0.113912,0.152783,0.162058,0.172113,0.19786,0.222688,0.823765
"""step1_20240828T132128Z""","""writePreSourceTable""",13529,1.122836,1.613598,1.724127,1.725945,1.734741,1.736198,1.738201,1.834927,2.084296,2.174904,0.141746,0.174111,0.2055,0.266139,0.292306,0.324557,0.401867,0.451771,1.120655
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""step8_20240909T130219Z""","""photometricCatalogMatch""",1,0.002065,1.762188,1.762188,1.762188,1.762188,1.762188,1.762188,1.762188,1.762188,1.762188,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349,7.435349
"""step8_20240909T130219Z""","""photometricRefCatObjectTract""",1,0.005757,0.772243,0.772243,0.772243,0.772243,0.772243,0.772243,0.772243,0.772243,0.772243,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211,20.724211
"""step8_20240909T130219Z""","""plotPropertyMapTract""",5,0.124601,3.699764,3.699865,3.700267,3.702944,3.704655,3.707618,3.710989,3.711433,3.711544,64.289125,65.423135,69.959175,93.026785,94.022988,96.189022,99.8707,100.438566,100.580533
"""step8_20240909T130219Z""","""refCatObjectTract""",1,0.004115,0.717529,0.717529,0.717529,0.717529,0.717529,0.717529,0.717529,0.717529,0.717529,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182,14.813182


In [14]:
# Do the check
from polars.testing import assert_frame_equal
# assert_frame_equal(df_summary, df_summary_check.select(df_summary.columns), check_dtypes=False)

# Differentiate and check
df_difference = df_summary.select(cs.numeric()) - df_summary_check.select(cs.numeric())
display(df_difference)
display(
    df_difference.select(pl.max("*")),
    df_difference.select(pl.mean("*")),
    df_difference.select(pl.min("*")),
)

quanta,integrated_runtime_hrs,mem_GB_p000,mem_GB_p001,mem_GB_p005,mem_GB_p032,mem_GB_p050,mem_GB_p068,mem_GB_p095,mem_GB_p099,mem_GB_p100,runtime_s_p000,runtime_s_p001,runtime_s_p005,runtime_s_p032,runtime_s_p050,runtime_s_p068,runtime_s_p095,runtime_s_p099,runtime_s_p100
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0,2.8422e-14,-0.11899,-0.127138,-0.127274,-0.127923,-0.12803,-0.128178,-0.135314,-0.153766,-0.160381,0.0,0.007557,0.000478,-0.000064,0.0,0.000384,-0.010126,-0.010306,0.0
0,-5.6843e-14,-0.11899,-0.127155,-0.127275,-0.127924,-0.128032,-0.128183,-0.135372,-0.154084,-0.160381,0.0,-0.003602,0.000892,-0.001347,-0.001732,0.000759,-0.007433,0.082659,0.0
0,-2.5580e-13,-0.11899,-0.127133,-0.127265,-0.12791,-0.128014,-0.128139,-0.135136,-0.135298,-0.135645,0.0,0.000193,0.000187,0.000065,-0.000397,-0.000465,-0.001416,-0.000195,0.0
0,9.9920e-16,-0.11899,-0.127138,-0.127274,-0.127923,-0.12803,-0.128178,-0.135314,-0.153766,-0.160381,0.0,0.000001,0.000003,-7.4000e-9,0.0,1.2936e-7,-0.000001,-0.000035,0.0
0,-1.9984e-15,-0.11899,-0.127138,-0.127274,-0.127923,-0.12803,-0.128178,-0.135314,-0.153766,-0.160381,0.0,0.000027,0.000023,-0.000001,0.0,1.6128e-7,-0.000062,-0.000331,0.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
0,0.0,-0.129947,-0.129947,-0.129947,-0.129947,-0.129947,-0.129947,-0.129947,-0.129947,-0.129947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,-0.056947,-0.056947,-0.056947,-0.056947,-0.056947,-0.056947,-0.056947,-0.056947,-0.056947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,-0.272827,-0.272727,-0.272325,-0.272347,-0.273188,-0.274644,-0.274251,-0.273807,-0.273696,0.0,1.13401,5.67005,0.387412,0.0,-0.842346,-0.709833,-0.141967,0.0
0,0.0,-0.052912,-0.052912,-0.052912,-0.052912,-0.052912,-0.052912,-0.052912,-0.052912,-0.052912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


quanta,integrated_runtime_hrs,mem_GB_p000,mem_GB_p001,mem_GB_p005,mem_GB_p032,mem_GB_p050,mem_GB_p068,mem_GB_p095,mem_GB_p099,mem_GB_p100,runtime_s_p000,runtime_s_p001,runtime_s_p005,runtime_s_p032,runtime_s_p050,runtime_s_p068,runtime_s_p095,runtime_s_p099,runtime_s_p100
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0,2.2737e-13,-0.024228,-0.0243,-0.024372,-0.024576,-0.024804,-0.025067,-0.025706,-0.026037,-0.02665,0.0,20.080258,5.67005,3.240546,0.0,12.030737,3.773793,10.683166,0.0


quanta,integrated_runtime_hrs,mem_GB_p000,mem_GB_p001,mem_GB_p005,mem_GB_p032,mem_GB_p050,mem_GB_p068,mem_GB_p095,mem_GB_p099,mem_GB_p100,runtime_s_p000,runtime_s_p001,runtime_s_p005,runtime_s_p032,runtime_s_p050,runtime_s_p068,runtime_s_p095,runtime_s_p099,runtime_s_p100
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.0,-9.8466e-15,-0.194649,-0.20255,-0.202155,-0.215185,-0.22361,-0.229425,-0.257986,-0.268442,-0.271042,0.0,0.312845,0.496448,0.004835,-0.067274,0.380775,-1.855184,-0.016146,0.0


quanta,integrated_runtime_hrs,mem_GB_p000,mem_GB_p001,mem_GB_p005,mem_GB_p032,mem_GB_p050,mem_GB_p068,mem_GB_p095,mem_GB_p099,mem_GB_p100,runtime_s_p000,runtime_s_p001,runtime_s_p005,runtime_s_p032,runtime_s_p050,runtime_s_p068,runtime_s_p095,runtime_s_p099,runtime_s_p100
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0,-2.7001e-13,-1.923552,-1.923552,-1.923552,-1.923552,-1.923552,-1.923552,-1.923552,-1.923552,-1.923552,0.0,-9.600759,-0.029145,-3.087319,-1.106885,-7.405115,-58.532981,-11.706596,0.0


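The quanta counts and the integrated runtimes agree (differences of at most about 1e-13 hours), but the percentiles differ in two ways:
 * The memory percentiles differ by a constant factor of about 1.0737 (for example 1.732588 versus 1.613598 at p000 for `calibrate`). This factor matches 2^30 / 10^9, which suggests that `ResourceUsageSummary` reports memory in GiB whereas this notebook divides the raw values by 1e9.
 * The runtime percentiles agree at p000 and p100 but differ in between, by up to tens of seconds for tasks with few quanta. This is consistent with a different quantile interpolation method: Polars' `quantile` defaults to `nearest`.

For the rest of the analysis, we use only the per-quantum DataFrame `df_all`.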
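
One way to probe the differences seen in the tables above is to redo the arithmetic on a few displayed values. The sketch below is a hedged check, not part of the original analysis: it assumes that the summary memory values are in GiB and that the summary percentiles use linear interpolation. The five run times of `plotPropertyMapTract` are reconstructed from the nearest-rank percentiles displayed for that task, and the `quantile` helper is a hypothetical stand-in for the library functions.

```python
import math


def quantile(values, q, interpolation="linear"):
    """Quantile of `values` at fraction q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    if interpolation == "nearest":
        # Pick the sample closest to the requested position
        return ordered[round(position)]
    # Linear interpolation between the two surrounding samples
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


# Unit check: p000 memory of 'calibrate' in the summary table, assumed to be in GiB
summary_mem = 1.613598
converted_mem = summary_mem * 2**30 / 1e9  # GiB -> GB (1e9 bytes)
print(f"summary value converted to GB: {converted_mem:.6f}")

# Interpolation check: run times (s) of the 5 'plotPropertyMapTract' quanta,
# reconstructed from the percentiles displayed above
runtimes = [64.289125, 92.639373, 94.022988, 97.031368, 100.580533]
p001_nearest = quantile(runtimes, 0.01, interpolation="nearest")
p001_linear = quantile(runtimes, 0.01, interpolation="linear")
p095_linear = quantile(runtimes, 0.95, interpolation="linear")
print(f"p001 nearest: {p001_nearest:.6f}, p001 linear: {p001_linear:.6f}")
```

With these assumptions, the converted memory value can be compared with the `mem_GB_p000` value recomputed from `df_all`, and the linear percentiles with the `runtime_s_p001` and `runtime_s_p095` values of the summary table.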

## Per step analysis

In [16]:
def get_column(df:pl.DataFrame, column:str) -> List:
    """Return a list with the values of column `column` from data frame `df`
    """
    return df.get_column(column).to_list()

In [17]:
# Create a dataframe for summarizing per-step metrics, i.e.
# - quanta: number of quanta executed in this step
# - pipetasks: number of different pipetasks executed in this step
# - elapsed_time_hours: elapsed time spent in this step (hours)
# - max_RSS: maximum RSS for any task in this step (gigabytes)
# CPU time and CPU efficiency are not computed because the CPU time is not available
df_per_step = df_all.group_by("step", maintain_order=True).agg(
    [
        pl.col("task").count().alias("quanta"),
        pl.col("task").n_unique().alias("pipetasks"),
        (pl.col("run_time").sum()/3_600).alias("elapsed_time_hours"),
        pl.col("memory").max().alias("max_RSS"),
    ]
)

#per_step_df = df_all.group_by("step", maintain_order=True).agg(
#    [
#        pl.col("task").n_unique().alias("pipetasks"),
#        pl.col("task").count().alias("quanta"),
#        (pl.col("elapsed_time")/3_600).sum().alias("elapsed_time_hours"),
#        (pl.col("cpu_time")/3_600).sum().alias("cpu_time_hours"),
#        (pl.col("cpu_time").sum() / pl.col("elapsed_time").sum()).alias("cpu_efficiency"),
#        pl.col("memory").max().alias("max_RSS"),
#
#])

In [18]:
df_per_step

step,quanta,pipetasks,elapsed_time_hours,max_RSS
str,u32,u32,f64,f64
"""step1_20240828T132128Z""",68561,5,575.681074,2.335285
"""step2b_20240904T090129Z""",76,2,1.352095,1.638654
"""step2c_20240904T112118Z""",3,3,0.937859,28.008501
"""step2d_20240904T130354Z""",26336,5,114.110038,1.436373
"""step2e_20240904T143621Z""",2,2,0.026105,2.40898
"""step3_20240904T145006Z""",5251,13,837.143551,18.108277
"""step4_20240905T072021Z""",4602,1,87.111164,1.406783
"""step7_20240909T075532Z""",5,1,0.029666,0.513073
"""step8_20240909T130219Z""",28,9,0.666128,20.463735


In [19]:
def save_figure(fig, output_dir, filename, title):
    """Save figure in HTML and PNG formats
    """
    # Ensure output directories exist
    os.makedirs(output_dir, exist_ok=True)
    image_dir = os.path.join(output_dir, 'images')
    os.makedirs(image_dir, exist_ok=True)
    html_dir = os.path.join(output_dir, 'html')
    os.makedirs(html_dir, exist_ok=True)
    
    # Save PNG
    png_filename = os.path.join(image_dir, f'{filename}.png')
    _ = bokeh.io.export_png(fig, filename=png_filename)
    
    # Save HTML
    html_filename = os.path.join(html_dir, f'{filename}.html')
    bkh.output_file(filename=html_filename, title=title)
    bkh.save(fig)
    
    # Reset the output file so the html file is not overwritten by
    # future calls to save()
    bokeh.io.reset_output()
    bokeh.io.output_notebook()

In [20]:
def make_figure_per_step_task_count(steps: List[str], quanta: List[int], annotation: str = None) -> bkh.figure:
    """Return a figure representing the number of tasks on each step
    """
    # Build the data source
    sorted_by_step = sorted(zip(steps, quanta))
    steps, quanta = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps, 
        'quanta': quanta,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(quanta)*100)
    fig = bkh.figure(
        x_axis_label = 'step',
        x_range = steps,
        y_range = y_range,
        y_axis_label = 'quanta',
        y_axis_type = 'log',
        width = 800,
        height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Execution of quanta per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = 1.5708/2
    
    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
        
    # Add vertical bar for quanta counts
    bars = fig.vbar(x='steps', top='quanta', bottom=1, width=0.8, color='tan', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=5.9, y=max(quanta)*4.2, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)

    
    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('quanta', '@quanta{0,0}'), 
         ], renderers=[bars], mode='mouse')) 

    return fig

In [21]:
total_tasks = df_per_step.get_column('quanta').sum()
task_count_per_step_fig = make_figure_per_step_task_count(
    steps = get_column(df_per_step, 'step'), 
    quanta = get_column(df_per_step, 'quanta'),
    # Show the exact count: in millions it would round to 0.1M
    annotation = f"total quanta: {total_tasks:,}",
)

# Show this figure
bkh.show(task_count_per_step_fig)
# Export this figure
save_figure(task_count_per_step_fig, output_dir=output_dir, filename='quanta-per-step', title='DP0.2 Quanta per step')

In [22]:
def make_figure_per_step_elapsed(steps: List[str], elapsed_time: List[float], annotation: str = None) -> bkh.figure:
    """Return a figure representing the execution time spent on each step
    """
    # Build the data source
    sorted_by_step = sorted(zip(steps, elapsed_time))
    steps, elapsed_time = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps, 
        'elapsed_time': elapsed_time,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(elapsed_time)*100)
    fig = bkh.figure(
        x_axis_label = 'step',
        x_range = steps,
        y_range = y_range,
        y_axis_label = 'hours',
        y_axis_type = 'log',
        width = 800,
        height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Elapsed time per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = 1.5708/2

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a vertical bar for elapsed time
    bars = fig.vbar(x='steps', top='elapsed_time', bottom=1, width=0.8, color='teal', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=4.5, y=max(elapsed_time)*5, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)

    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('elapsed time', '@elapsed_time{0,0} hours'), 
         ], renderers=[bars], mode='mouse')) 

    return fig

In [23]:
def make_figure_per_step_efficiency(steps: List[int], cpu_efficiency: List[float], annotation: str = None) -> bkh.figure:
    """Return a figure representing the CPU efficiency per step
    """
    # Build the data source, keeping each efficiency paired with its step
    sorted_by_step = sorted(zip(steps, cpu_efficiency))
    steps, cpu_efficiency = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps,
        'cpu_efficiency': cpu_efficiency,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0., 1.)
    fig = bkh.figure(
        x_axis_label = 'step',
        
        y_range = y_range,
        width = 800,
        height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU efficiency per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LinearAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.formatter = bokeh.models.NumeralTickFormatter(format='0%')
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a vertical bar per step for CPU efficiency
    bars = fig.vbar(x='steps', top='cpu_efficiency', bottom=0.0, width=0.8, color='thistle', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=4.8, y=0.85, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)
       
    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('CPU efficiency', '@cpu_efficiency{%0}'), 
         ], 
         renderers=[bars], mode='mouse')) 

    return fig

In [24]:
# Generate a figure for elapsed time vs. step
total_elapsed = df_per_step.select(pl.col('elapsed_time_hours')).sum().row(0)[0]
elapsed_per_step_fig = make_figure_per_step_elapsed(
    steps = get_column(df_per_step, 'step'), 
    elapsed_time = get_column(df_per_step, 'elapsed_time_hours'), 
    annotation = f"aggregated elapsed time: {total_elapsed/1e3:,.1f}k hours",
)

# Generate a figure for CPU efficiency vs. step
#total_elapsed = per_step_df.select(pl.col('elapsed_time_hours')).sum().row(0)[0]
#total_cpu = per_step_df.select(pl.col('cpu_time_hours')).sum().row(0)[0]
#total_efficiency = 100.0 * total_cpu/total_elapsed
#efficiency_per_step_fig = make_figure_per_step_efficiency(
#    steps = get_column(per_step_df, 'step'), 
#    cpu_efficiency = get_column(per_step_df, 'cpu_efficiency'), 
#    annotation = f"aggregated CPU efficiency: {total_efficiency:,.0f}%",
#)
#bkh.show(bokeh.layouts.row(elapsed_per_step_fig, efficiency_per_step_fig))

bkh.show(elapsed_per_step_fig)

# Export these figures
save_figure(elapsed_per_step_fig, output_dir=output_dir, filename='elapsed-per-step', title='DP0.2 Elapsed time per step')
#save_figure(efficiency_per_step_fig, output_dir=output_dir, filename='cpu-efficiency-per-step', title='DP0.2 CPU efficiency per step')

In [25]:
def make_figure_per_step_memory(steps: List[str], max_rss: List[float], annotation: str = None) -> bkh.figure:
    """Return a figure representing the maximum RSS observed on each step
    """
    # Build the data source
    sorted_by_step = sorted(zip(steps, max_rss))
    steps, max_rss = zip(*sorted_by_step)
    source = bkhmodels.ColumnDataSource(data={
        'steps': steps, 
        'max_rss': max_rss,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(max_rss)*2)
    fig = bkh.figure(
        x_axis_label = 'step',
        x_range = steps,
        y_range = y_range,
        y_axis_label = 'gigabyte',
        y_axis_type = 'log',
        width = 800,
        height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Maximum RSS per step", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = 1.5708/2

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a vertical bar per step for maximum RSS
    bars = fig.vbar(x='steps', top='max_rss', bottom=1, width=0.8, color='palegoldenrod', alpha=0.7, line_color='black', source=source)
    
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=4.5, y=max(max_rss)*5, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)

    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('step', '@steps'), 
            ('max RSS', '@max_rss{0,0} GB'), 
         ], renderers=[bars], mode='mouse')) 

    return fig

In [26]:
# Generate a figure for peak RSS vs. step
memory_per_step_fig = make_figure_per_step_memory(
    steps = get_column(df_per_step, 'step'), 
    max_rss = get_column(df_per_step, 'max_RSS'), 
)
bkh.show(memory_per_step_fig)

# Export this figure
save_figure(memory_per_step_fig, output_dir=output_dir, filename='peak-rss-per-step', title='DP0.2 Maximum RSS per step')

## Per step pipetask execution details

In [27]:
# Build a new dataframe per step and for each pipetask within a step 
# compute its number of quanta, its elapsed time and its memory
steps = sorted(df_all.select(pl.col('step')).unique().get_column('step').to_list())

step_dfs = {}
for step in steps:
    df_step = df_all.filter(pl.col('step') == step).group_by('task').agg(
        [
            pl.col("task").count().alias("quanta"),
            (pl.col("run_time")/3_600).sum().alias("elapsed_time_hours"),
            pl.col('memory').max().alias('RSS_max'),
            (pl.col('memory').quantile(0.75)-pl.col('memory').quantile(0.25)).alias("RSS_iqr"),
            (pl.col('memory').std()).alias('RSS_std'),
        ]
    )
    step_dfs[step] = df_step
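
The aggregation above can be illustrated without Polars. The following is a minimal sketch, assuming a handful of hypothetical quanta records (the values are invented for illustration), that computes the same kind of per-task summary: number of quanta, elapsed hours, maximum RSS and inter-quartile range. It uses `statistics.quantiles` with linear interpolation, which may not match the interpolation rule applied by `pl.col('memory').quantile(...)`.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical quanta records: (task, run_time in seconds, memory in GB)
records = [
    ("isr", 30.0, 1.8),
    ("isr", 36.0, 1.9),
    ("isr", 42.0, 2.0),
    ("makeWarp", 60.0, 1.0),
    ("makeWarp", 120.0, 2.0),
]

# Group the records by task name
by_task = defaultdict(list)
for task, run_time, memory in records:
    by_task[task].append((run_time, memory))

summary = {}
for task, rows in by_task.items():
    memories = [m for _, m in rows]
    # Quartiles with linear interpolation (this sketch uses two or more values per task)
    q1, _, q3 = quantiles(memories, n=4, method="inclusive")
    summary[task] = {
        "quanta": len(rows),
        "elapsed_time_hours": sum(t for t, _ in rows) / 3_600,
        "RSS_max": max(memories),
        "RSS_iqr": q3 - q1,
    }

for task, stats in summary.items():
    print(task, stats)
```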

In [28]:
# Generate a table with pipetask details for each step and write
# the same information into a CSV file for export
table = f"""
| step | task | quanta | elapsed time (hours) | max RSS (GB) | inter-quartile RSS (GB) | std RSS (GB) |
| ---- | ---- | -----: | -------------------: | -----------: | ----------------------: | -----------: |
"""
csv_separator = ','
csv_output = csv_separator.join(('step', 'pipetask', 'quanta', 'elapsed_time_hours', 'memory_gb', 'RSS_IQR', 'RSS_std'))
for step in sorted(step_dfs.keys()):
    step_df = step_dfs[step].sort('task')
    for row_index in range(step_df.height):
        row = step_df.row(row_index)
        csv_output += f'\n{step}{csv_separator}' + csv_separator.join((str(v) for v in row))
        step_out = f'**{step}**' if row_index == 0 else ''
        task, quanta, elapsed, memory, mem_iqr, mem_std = row
        if mem_std is None: mem_std = -1
        table += f'| {step_out} | {task} | {quanta:,} | {elapsed:,.1f} | {memory:,.1f} | {mem_iqr:,.3f} | {mem_std:,.3f}\n'
        
# Write to CSV file
with open(os.path.join(output_dir, 'pipetasks.csv'), 'w') as f:
    f.write(csv_output)

print_md(table)


| step | task | quanta | elapsed time (hours) | max RSS (GB) | inter-quartile RSS (GB) | std RSS (GB) |
| ---- | ---- | -----: | -------------------: | -----------: | ----------------------: | -----------: |
| **step1_20240828T132128Z** | calibrate | 13,529 | 167.8 | 2.3 | 0.052 | 0.076
|  | characterizeImage | 13,766 | 272.8 | 2.3 | 0.087 | 0.078
|  | isr | 14,208 | 133.4 | 2.0 | 0.006 | 0.043
|  | transformPreSourceTable | 13,529 | 0.6 | 2.3 | 0.052 | 0.076
|  | writePreSourceTable | 13,529 | 1.1 | 2.3 | 0.052 | 0.076
| **step2b_20240904T090129Z** | gbdesAstrometricFit | 60 | 1.3 | 1.6 | 0.353 | 0.242
|  | isolatedStarAssociation | 16 | 0.1 | 0.7 | 0.050 | 0.031
| **step2c_20240904T112118Z** | fgcmBuildFromIsolatedStars | 1 | 0.0 | 7.7 | 0.000 | -1.000
|  | fgcmFitCycle | 1 | 0.9 | 28.0 | 0.000 | -1.000
|  | fgcmOutputProducts | 1 | 0.0 | 7.6 | 0.000 | -1.000
| **step2d_20240904T130354Z** | consolidateSourceTable | 126 | 0.2 | 1.0 | 0.223 | 0.134
|  | finalizeCharacterization | 128 | 70.8 | 1.4 | 0.112 | 0.068
|  | transformSourceTable | 12,978 | 0.5 | 0.4 | 0.010 | 0.006
|  | updateVisitSummary | 126 | 10.7 | 0.8 | 0.013 | 0.009
|  | writeRecalibratedSourceTable | 12,978 | 31.9 | 0.6 | 0.015 | 0.010
| **step2e_20240904T143621Z** | makeCcdVisitTable | 1 | 0.0 | 2.4 | 0.000 | -1.000
|  | makeVisitTable | 1 | 0.0 | 2.4 | 0.000 | -1.000
| **step3_20240904T145006Z** | assembleCoadd | 405 | 21.8 | 4.3 | 0.180 | 0.303
|  | consolidateObjectTable | 1 | 0.0 | 18.1 | 0.000 | -1.000
|  | deblend | 81 | 31.4 | 4.8 | 0.672 | 0.578
|  | detection | 405 | 16.4 | 1.4 | 0.024 | 0.017
|  | forcedPhotCoadd | 405 | 450.5 | 4.0 | 0.707 | 0.580
|  | healSparsePropertyMaps | 5 | 1.2 | 3.8 | 0.817 | 0.427
|  | makeWarp | 2,815 | 59.8 | 2.1 | 0.413 | 0.288
|  | measure | 405 | 252.9 | 3.7 | 0.536 | 0.518
|  | mergeDetections | 81 | 0.6 | 0.5 | 0.013 | 0.011
|  | mergeMeasurements | 81 | 0.4 | 1.8 | 0.350 | 0.264
|  | selectDeepCoaddVisits | 405 | 0.5 | 0.6 | 0.092 | 0.061
|  | transformObjectTable | 81 | 0.6 | 1.6 | 0.189 | 0.146
|  | writeObjectTable | 81 | 0.8 | 9.2 | 1.794 | 1.349
| **step4_20240905T072021Z** | forcedPhotCcd | 4,602 | 87.1 | 1.4 | 0.162 | 0.120
| **step7_20240909T075532Z** | consolidateHealSparsePropertyMaps | 5 | 0.0 | 0.5 | 0.002 | 0.002
| **step8_20240909T130219Z** | analyzeMatchedVisitCore | 16 | 0.5 | 20.5 | 10.072 | 7.222
|  | analyzeObjectTableCore | 1 | 0.0 | 6.2 | 0.000 | -1.000
|  | analyzeObjectTableSurveyCore | 1 | 0.0 | 1.1 | 0.000 | -1.000
|  | catalogMatchTract | 1 | 0.0 | 1.9 | 0.000 | -1.000
|  | photometricCatalogMatch | 1 | 0.0 | 1.9 | 0.000 | -1.000
|  | photometricRefCatObjectTract | 1 | 0.0 | 0.8 | 0.000 | -1.000
|  | plotPropertyMapTract | 5 | 0.1 | 4.0 | 0.007 | 0.005
|  | refCatObjectTract | 1 | 0.0 | 0.8 | 0.000 | -1.000
|  | validateObjectTableCore | 1 | 0.0 | 1.0 | 0.000 | -1.000
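
The `-1.000` entries in the last column come from the `mem_std is None` fallback in the cell above. They appear only for tasks with a single quantum, which is consistent with the sample standard deviation being undefined for a single value. The sketch below reproduces that convention with the standard library; the helper name `std_or_sentinel` is hypothetical and not part of the notebook.

```python
from statistics import StatisticsError, stdev


def std_or_sentinel(values, sentinel=-1.0):
    """Return the sample standard deviation, or `sentinel` when it is undefined."""
    try:
        return stdev(values)
    except StatisticsError:
        # Raised when fewer than two values are supplied
        return sentinel


single = std_or_sentinel([28.0])             # one quantum: no spread can be estimated
several = std_or_sentinel([1.0, 2.0, 3.0])   # three quanta: regular sample std

print(single, several)
```

When reading the table, a value of `-1.000` should therefore be taken as "not applicable" rather than as a measured spread.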


## Per task analysis

In [29]:
# Create a dataframe with details about each pipetask
df_per_task = df_all.group_by("task", maintain_order=True).agg(
    [
        pl.col("task").count().alias("task_count"),
        (pl.col("run_time")/3_600).sum().alias("elapsed_time_hours"),
        pl.col('memory').min().alias('RSS_min'),
        pl.col('memory').max().alias('RSS_max'),
        pl.col('memory').mean().alias('RSS_mean'),
        pl.col('memory').std().alias('RSS_std'),
        pl.col('memory').quantile(0.05).alias('RSS_p05'),
        pl.col('memory').quantile(0.50).alias('RSS_p50'),
        pl.col('memory').quantile(0.95).alias('RSS_p95'),
        (pl.col('memory').quantile(0.75)-pl.col('memory').quantile(0.25)).alias("RSS_iqr"),
    ]
)

In [30]:
task_types = df_per_task.height
total_elapsed_hours = df_per_task.select('elapsed_time_hours').sum().row(0)[0]
#total_cpu_hours = df_per_task.select('cpu_time_hours').sum().row(0)[0]
total_cpu_hours = -1  # placeholder: CPU time is not available in df_per_task

overview = f"""
There were **{task_types:,} kinds of pipetasks** which consumed in aggregate **{total_elapsed_hours:,.0f} elapsed hours ({total_cpu_hours:,.0f} CPU hours**)
"""
print_md(overview)


There were **41 kinds of pipetasks** which consumed in aggregate **1,617 elapsed hours (-1 CPU hours**)


In [31]:
df_per_task.sort(['elapsed_time_hours', 'task_count'], descending=True)[:10]

task,task_count,elapsed_time_hours,RSS_min,RSS_max,RSS_mean,RSS_std,RSS_p05,RSS_p50,RSS_p95,RSS_iqr
str,u32,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""forcedPhotCoadd""",405,450.520525,1.271026,4.041474,3.166687,0.579597,1.880367,3.286925,3.896869,0.706621
"""characterizeImage""",13766,272.800257,1.732588,2.335285,1.897912,0.077734,1.853223,1.864249,1.970995,0.086766
"""measure""",405,252.883013,1.275191,3.693711,3.01413,0.51801,1.774531,3.169948,3.589526,0.535769
"""calibrate""",13529,167.770902,1.732588,2.335285,1.896656,0.075569,1.853219,1.864229,1.970242,0.052445
"""isr""",14208,133.382186,1.732588,1.975103,1.883975,0.043117,1.853121,1.863995,1.967686,0.005689
"""forcedPhotCcd""",4602,87.111164,0.641524,1.406783,1.070343,0.120346,0.883073,1.062605,1.262907,0.161698
"""finalizeCharacterization""",128,70.806853,1.161986,1.436373,1.317475,0.068107,1.209164,1.303474,1.4262272,0.112452
"""makeWarp""",2815,59.838098,0.789139,2.107859,1.669589,0.287761,1.116328,1.806442,1.95652,0.412787
"""writeRecalibratedSourceTable""",12978,31.88944,0.497066,0.561394,0.517276,0.010427,0.502383,0.516116,0.535921,0.014959
"""deblend""",81,31.401998,2.578502,4.837626,4.209757,0.578263,2.890662,4.414087,4.779254,0.672244



In [32]:
def make_figure_outliers_per_task(x: pl.Series, y: pl.Series, annotation: str = None) -> bkh.figure:
    """Return a scatter plot of y versus x for the outlier quanta of a task
    """
    # Build the data source
    x_name, y_name = x.name, y.name
    source = bkhmodels.ColumnDataSource(data={
        x_name: x.to_list(), 
        y_name: y.to_list(),
    })
    print(x.to_list())
    print(y.to_list())
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(y)*1.1)
    fig = bkh.figure(
        x_axis_label = x_name,
        #x_range = steps,
        #y_range = y_range,
        y_axis_label = y_name,
        #y_axis_type = 'log',
        width = 800,
        height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text=f"{y_name} vs. {x_name} for outlier quanta", text_font_size="18pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = 1.5708/2

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    #fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        #y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add a scatter plot
    circle = fig.scatter(x=x_name, y=y_name, size=20, source=source)
    # Add annotation label
    if annotation is not None:
        label = bkhmodels.Label(x=4.5, y=max(y)*5, x_units='data', y_units='data',
                     text=f'\n {annotation} \n', text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)

    # Hide toolbar
    fig.toolbar.autohide = True

    # Add tooltips
    #fig.add_tools(bkhmodels.HoverTool(
    #    tooltips=[
    #        ('step', '@steps'), 
    #        ('max RSS', '@max_rss{0,0} GB'), 
    #     ], renderers=[circle], mode='mouse')) 

    return fig

In [33]:
# Select the tasks with an RSS inter-quartile range greater than 0.5 GB
df_tmp = df_per_task.sort(['RSS_iqr', 'task_count'], descending=True).filter(pl.col('RSS_iqr') > 0.5)

for t in df_tmp['task'].to_list():
    low_thresh, high_thresh = df_tmp.filter(pl.col('task') == t)['RSS_p05'], df_tmp.filter(pl.col('task') == t)['RSS_p95']
    # Keep the quanta whose memory lies outside the 5th to 95th percentile range
    df_outliers = df_all.filter((pl.col('task') == t) & ((pl.col("memory") < low_thresh) | (pl.col("memory") > high_thresh)))
    if len(df_outliers) == 0: continue  # no quantum passed the filter, go to the next task
    # Remove the columns that contain only nulls
    df_outliers = df_outliers[[s.name for s in df_outliers if not (s.null_count() == df_outliers.height)]]
    # Collect the columns holding a single repeated value
    cols_to_drop = []
    for c in df_outliers.get_columns():
        if len(c.unique()) == 1: cols_to_drop.append(c.name)
    df_data = df_outliers.group_by(cols_to_drop).agg()  # one row with the values shared by all outliers
    df_outliers = df_outliers.drop(cols_to_drop)  # keep only the columns that vary between outliers
    display(df_data)
    display(df_outliers)
    for x in df_outliers.get_columns():
        for y in df_outliers.get_columns():
            if x.name == y.name: continue
            #cur_fig = make_figure_outliers_per_task(x, y, annotation="no annot\n please")
            #bkh.show(cur_fig)
    

step,task,instrument,skymap
str,str,str,str
"""step8_20240909T130219Z""","""analyzeMatchedVisitCore""","""HSC""","""hsc_rings_v1"""


memory,init_time,run_time,tract
f64,f64,f64,i64
0.55935,0.001395,0.545653,9459
20.463735,0.00136,344.350804,9704


step,task,skymap,tract
str,str,str,i64
"""step3_20240904T145006Z""","""writeObjectTable""","""hsc_rings_v1""",9461


memory,init_time,run_time,patch
f64,f64,f64,i64
4.054237,0.000353,25.260405,0
3.243942,0.000265,15.462625,1
4.196319,0.000392,21.798547,2
4.327793,0.000368,27.066737,8
9.152283,0.000368,52.46107,60
8.913154,0.000262,38.070453,69
8.770863,0.000252,37.22936,72
8.910868,0.000288,39.289984,80


step,task,skymap,tract
str,str,str,i64
"""step3_20240904T145006Z""","""forcedPhotCoadd""","""hsc_rings_v1""",9461


band,memory,init_time,run_time,patch
str,f64,f64,f64,i64
"""g""",1.705181,0.062467,430.541281,0
"""g""",1.558442,0.055472,311.917842,1
"""g""",1.661223,0.064237,473.0547,7
"""g""",4.012585,0.064882,5448.748275,60
"""g""",3.910164,0.063551,4249.729586,66
…,…,…,…,…
"""z""",3.936625,0.046108,5358.174137,59
"""z""",4.041474,0.064401,9063.883401,60
"""z""",3.993817,0.063176,7341.019736,64
"""z""",3.922887,0.062055,5632.858129,66


step,task,skymap,tract
str,str,str,i64
"""step3_20240904T145006Z""","""deblend""","""hsc_rings_v1""",9461


memory,init_time,run_time,patch
f64,f64,f64,i64
2.579476,0.007913,216.340406,0
2.578502,0.011372,264.3442,1
2.78793,0.010111,297.577372,7
2.685706,0.01256,341.475985,8
4.810179,0.011866,2168.876055,55
4.786881,0.012539,2484.014178,60
4.793577,0.012134,2157.556995,70
4.837626,0.011565,2333.049258,72


step,task,skymap,tract
str,str,str,i64
"""step3_20240904T145006Z""","""measure""","""hsc_rings_v1""",9461


band,memory,init_time,run_time,patch
str,f64,f64,f64,i64
"""g""",1.703879,0.023069,190.074785,0
"""g""",1.277268,0.031167,198.880431,1
"""g""",1.757753,0.033203,575.266418,2
"""g""",1.589654,0.023088,181.271675,8
"""g""",3.595682,0.033528,2473.717281,50
…,…,…,…,…
"""z""",1.664582,0.034048,709.911251,7
"""z""",3.633242,0.033702,3536.94333,55
"""z""",3.617399,0.033056,4435.682248,60
"""z""",3.599454,0.03184,3860.573802,63

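The outlier selection used above keeps, for each task, the quanta whose memory falls below the 5th percentile or above the 95th percentile. A minimal sketch of that rule on hypothetical data (patch numbers and memory values are invented) is given below. It relies on `statistics.quantiles`; Polars' `quantile` may use a different interpolation rule, so the thresholds can differ slightly.

```python
from statistics import quantiles

# Hypothetical (patch, memory in GB) records for a single task
quanta = [(0, 1.0), (1, 3.0), (2, 3.1), (3, 3.2), (4, 3.3), (5, 3.4),
          (6, 3.5), (7, 3.6), (8, 3.7), (9, 3.8), (10, 9.0)]

memories = [memory for _, memory in quanta]

# 19 cut points dividing the data into 20 groups: the first one is the
# 5th percentile and the last one is the 95th percentile
cuts = quantiles(memories, n=20, method="inclusive")
low_thresh, high_thresh = cuts[0], cuts[-1]

# Keep only the quanta outside the [p05, p95] interval
outliers = [(patch, memory) for patch, memory in quanta
            if memory < low_thresh or memory > high_thresh]

print(low_thresh, high_thresh, outliers)
```

With this rule, a bimodal memory distribution such as the ones shown above (low values for small patch numbers, high values for larger ones) produces outliers on both sides of the bulk.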

## Pipetask execution time

In [34]:
def make_figure_execution_time(tasks: List[str], elapsed_time: List[float], cpu_time: List[float], cpu_efficiency: List[float]) -> bkh.figure:
    """Return a figure representing the execution time spent on each task category.
    """
    # Build the data source
    sorted_by_elapsed = sorted(zip(elapsed_time, cpu_time, cpu_efficiency, tasks), reverse=True)
    elapsed_time, cpu_time, cpu_efficiency, tasks = zip(*sorted_by_elapsed)    
    total_elapsed = sum(elapsed_time)
    elapsed_percentage = [100.0 * v/total_elapsed for v in elapsed_time]
    elapsed_cumulated = np.cumsum(elapsed_percentage)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks, 
        'elapsed_time': elapsed_time,
        'elapsed_percentage': elapsed_percentage,
        'elapsed_cumulated': elapsed_cumulated,
        'cpu_time': cpu_time,
        'cpu_efficiency': cpu_efficiency,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(1, max(elapsed_time)*5)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        y_axis_label = 'hours',
        y_axis_type = 'log',
        width = 1_600,
        height = 800,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Elapsed and CPU time spent by pipetask", text_font_size="18pt"), 'above')

    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_orientation = math.pi/3
    fig.xaxis.major_label_text_font_size = "11pt"
    
    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Annotation: band to show the tasks consuming a given percentage of elapsed time
    threshold_percent = 90
    right = np.argmax(elapsed_cumulated >= threshold_percent) + 1
    box = bkhmodels.BoxAnnotation(left=0, right=right, fill_alpha=0.1, fill_color='lightcoral')
    fig.add_layout(box)
    
    # Annotation: label to show a message on the meaning of the band
    label = bkhmodels.Label(x=1, y=max(elapsed_time)*2, x_units='data', y_units='data',
                 text=f'{threshold_percent:.0f}% of elapsed time', text_font_size='12pt', text_color='crimson')
    fig.add_layout(label)
    
    overall_cpu_efficiency = 100.0 * sum(cpu_time) / sum(elapsed_time)
    efficiency = bkhmodels.Label(x=len(tasks)//2.5, y=max(elapsed_time)/5, x_units='data', y_units='data',
                 text=f'\n aggregated CPU efficiency: {overall_cpu_efficiency:.0f}% \n', text_font_size='12pt',
                 text_color='dimgray', text_alpha=0.8,
                 background_fill_color='white', background_fill_alpha=1.0,
                 border_line_color='dimgray', border_line_alpha=0.5)
    fig.add_layout(efficiency)

    # Add a dash for elapsed time and a vertical bar for CPU time
    dashes = fig.scatter(marker='dash', x='tasks', y='elapsed_time', color='crimson', size=15, line_width=2, source=source, legend_label='elapsed time')
    bars = fig.vbar(x='tasks', top='cpu_time', bottom=0.001, width=0.8, color='steelblue', alpha=0.7, source=source, legend_label='CPU time')
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(
        tooltips=[
            ('Task', '@tasks'), 
            ('Elapsed time', '@elapsed_time{0,0} h'), 
            ('CPU time', '@cpu_time{0,0} h'), 
            ('CPU efficiency', '@cpu_efficiency{%0}'), 
            ('Elapsed time (% of total)', '@elapsed_percentage{0.2f} %'), 
            ('Cumulated elapsed time', '@elapsed_cumulated{0.2f} %'),
        ], renderers=[bars, dashes], mode='mouse')) 

    return fig
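
The shaded band in this figure is positioned with `np.argmax(elapsed_cumulated >= threshold_percent) + 1`, that is, the number of tasks (sorted by decreasing elapsed time) needed to reach the threshold. The sketch below reproduces the computation in plain Python, using the rounded elapsed hours of the eight largest tasks from the per-task table above and the rounded aggregate of 1,617 hours; because of the rounding it is only an approximation of what the notebook would compute on the full data.

```python
from itertools import accumulate

# Elapsed hours of the eight largest tasks (rounded), in decreasing order
elapsed_hours = [450.5, 272.8, 252.9, 167.8, 133.4, 87.1, 70.8, 59.8]
total_hours = 1_617  # rounded aggregate over all pipetask kinds

threshold_percent = 90
cumulated_percent = [100.0 * c / total_hours for c in accumulate(elapsed_hours)]

# Number of tasks needed to reach the threshold, counted from the largest one
tasks_needed = next(
    i for i, pct in enumerate(cumulated_percent, start=1) if pct >= threshold_percent
)

print(tasks_needed, [round(p, 1) for p in cumulated_percent])
```

Note that `np.argmax` returns 0 when no element satisfies the condition, so the NumPy version assumes the threshold is always reached, which holds for any threshold up to 100%.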

In [35]:
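# No filtering on CPU time is applied: the 'cpu_time_hours' and 'cpu_efficiency' columns
# are not present in df_per_task, so this cell currently raises ColumnNotFoundError
df_per_task_non_zero = df_per_task#.filter(pl.col('cpu_time_hours') > 0.0)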
task_elapsed_fig = make_figure_execution_time(
    tasks = get_column(df_per_task_non_zero, 'task'), 
    elapsed_time = get_column(df_per_task_non_zero, 'elapsed_time_hours'),
    cpu_time = get_column(df_per_task_non_zero, 'cpu_time_hours'),
    cpu_efficiency = get_column(df_per_task_non_zero, 'cpu_efficiency'),
    #cpu_time = range(len(,
    #cpu_efficiency = None,
)
bkh.show(task_elapsed_fig)

# Export this figure
save_figure(task_elapsed_fig, output_dir=output_dir, filename='elapsed-cpu-per-pipetask', title='DP0.2 Elapsed and CPU time per pipetask')

ColumnNotFoundError: "cpu_time_hours" not found

In [36]:
def make_figure_cpu_efficiency(tasks: List[str], cpu_efficiency: List[float]) -> bkh.figure:
    """Return a figure representing the CPU efficiency per pipetask.
    """
    # Build the data source
    sorted_by_efficiency = sorted(zip(cpu_efficiency, tasks), reverse=True)
    cpu_efficiency, tasks = zip(*sorted_by_efficiency)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks, 
        'cpu_efficiency': cpu_efficiency,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0, 1)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        width = 1_600,
        height = 800,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU efficiency by pipetask", text_font_size="18pt"), 'above')

    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LinearAxis(y_range_name="y_axis_right"), 'right')

    # Axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_orientation = math.pi/3
    fig.xaxis.major_label_text_font_size = "11pt"

    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.formatter = bokeh.models.NumeralTickFormatter(format='0%')
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
     
    # Add bars
    palette = ['crimson', 'crimson', 'crimson', 'sandybrown', 'mediumseagreen']
    color_map = bokeh.transform.linear_cmap(field_name='cpu_efficiency', palette=palette, low=0, high=max(cpu_efficiency))
    bars = fig.vbar(x='tasks', top='cpu_efficiency', bottom=0, width=0.8, source=source, color=color_map, fill_color=color_map, alpha=0.6)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('pipetask', '@tasks'), ('CPU efficiency', '@cpu_efficiency{%0}')], renderers=[bars], mode='mouse'))

    return fig

In [37]:
task_cpu_efficiency_fig = make_figure_cpu_efficiency(
    tasks = get_column(df_per_task, 'task'),
    cpu_efficiency = get_column(df_per_task, 'cpu_efficiency')
)
bkh.show(task_cpu_efficiency_fig)

# Export this figure
save_figure(task_cpu_efficiency_fig, output_dir=output_dir, filename='cpu-efficiency-per-pipetask', title='DP0.2 CPU efficiency per pipetask')

ColumnNotFoundError: "cpu_efficiency" not found
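
For reference, the CPU efficiency plotted by these functions is a fraction between 0 and 1, and the aggregated value in `make_figure_execution_time` is the ratio of summed CPU time to summed elapsed time. The sketch below, based on purely hypothetical per-task totals since CPU time is not available in this dataset, shows why the aggregated efficiency is not the mean of the per-task efficiencies.

```python
# Hypothetical per-task totals in hours: (task, elapsed, cpu)
totals = [
    ("taskA", 100.0, 90.0),
    ("taskB", 10.0, 2.0),
]

# Efficiency of each task: CPU time divided by elapsed time
per_task = {task: cpu / elapsed for task, elapsed, cpu in totals}

# Aggregated efficiency: summed CPU time over summed elapsed time
aggregated = sum(cpu for _, _, cpu in totals) / sum(elapsed for _, elapsed, _ in totals)

# Unweighted mean of the per-task efficiencies, shown for comparison
mean_of_ratios = sum(per_task.values()) / len(per_task)

print(per_task, round(aggregated, 3), round(mean_of_ratios, 3))
```

The aggregated value is dominated by the tasks that accumulate the most elapsed time, which is the relevant quantity when estimating the overall use of the allocated CPU slots.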

## Pipetask memory consumption

In [38]:
def make_figure_rss_max(tasks: List[str], rss_max: List[float]) -> bkh.figure:
    """Return a figure representing the peak RSS per pipetask.
    """
    # Build the data source
    sorted_by_rss = sorted(zip(rss_max, tasks), reverse=True)
    rss_max, tasks = zip(*sorted_by_rss)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks, 
        'rss_max': rss_max,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0.1, max(rss_max)*5)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        y_axis_label = 'gigabyte',
        y_axis_type = 'log',
        width = 1_600,
        height = 800,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Peak RSS by pipetask kind", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = math.pi/3
    
    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')

    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.formatter = bokeh.models.NumeralTickFormatter(format='0,0')
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
    
    # Add one marker per pipetask
    # bars = fig.vbar(x='tasks', top='rss_max', bottom=0, width=0.8, color='mediumturquoise', alpha=0.7, source=source)
    circles = fig.scatter(x='tasks', y='rss_max', color='crimson', size=10, alpha=0.7, source=source)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('Task', '@tasks'), ('Max RSS', '@rss_max{0.2f} GB')], renderers=[circles], mode='mouse'))
    
    return fig

In [39]:
task_memory_fig = make_figure_rss_max(
    tasks=df_per_task.get_column('task').to_list(),
    rss_max=df_per_task.get_column('RSS_max').to_list()
)
bkh.show(task_memory_fig)

In [40]:
def make_figure_rss_quantiles(tasks: List[str], rss_min: List[float], rss_max: List[float], rss_mean: List[float], rss_pct_low: List[float], rss_pct_high: List[float], label: str = None, note: str = None) -> bkh.figure:
    """Return a figure representing the RSS distribution for each pipetask.
    """
    # Build the data source: sort tasks by rss_max
    sorted_by_max = sorted(zip(rss_max, rss_min, rss_mean, rss_pct_low, rss_pct_high, tasks), reverse=True)
    rss_max, rss_min, rss_mean, rss_pct_low, rss_pct_high, tasks = zip(*sorted_by_max)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks,
        'rss_max': rss_max,
        'rss_min': rss_min,
        'rss_mean': rss_mean,
        'rss_pct_low': rss_pct_low,
        'rss_pct_high': rss_pct_high,
    })
    
    # Build and configure the figure
    y_range = bokeh.models.Range1d(0.1, max(rss_max)*3)
    fig = bkh.figure(
        x_range = tasks,
        y_range = y_range,
        x_axis_label = 'pipetask',
        y_axis_label = 'gigabyte',
        y_axis_type = 'log',
        width = 1_600,
        height = 1_000,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Memory consumption by pipetask", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = math.pi/2.5
    
    # Add secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')

    # Format Y axis
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
        
    # Annotation
    if label is not None:
        annotation_label = bkhmodels.Label(x=len(tasks)//2.2, y=max(rss_max)/1.2, x_units='data', y_units='data',
                     text=label, text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(annotation_label)

    
    # Add glyphs
    dashes_max = fig.scatter(marker='dash', x='tasks', y='rss_max', color='indianred', size=15, line_width=3, source=source, legend_label='max')
    mean = fig.scatter(x='tasks', y='rss_mean', size=8, color='mediumseagreen', source=source, legend_label='mean')
    dashes_min = fig.scatter(marker='dash', x='tasks', y='rss_min', color='steelblue', size=15, line_width=3, source=source, legend_label='min')
    whisker = bkhmodels.Whisker(base='tasks', upper='rss_pct_high', lower='rss_pct_low', source=source)
    fig.add_layout(whisker)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[
        ('Task', '@tasks'),
        ('RSS min', '@rss_min{0.2f} GB'),
        ('RSS percentile low', '@rss_pct_low{0.2f} GB'),
        ('RSS mean', '@rss_mean{0.2f} GB'),
        ('RSS percentile high', '@rss_pct_high{0.2f} GB'),
        ('RSS max', '@rss_max{0.2f} GB'),
    ], renderers=[mean, dashes_max, dashes_min], mode='mouse'))
    
    # Add note below the figure
    if note is not None:
        fig.add_layout(bkhmodels.Title(text=note, text_font_style='italic'), 'below')
    
    return fig

In [41]:
# Select the tasks we want to include in our figure
max_rss_lower_bound = 0.5  # gigabytes
df_per_task_non_zero = df_per_task.filter((pl.col('RSS_p05') >= 0.1) & (pl.col('RSS_max') >= max_rss_lower_bound))

task_rss_quantiles_fig = make_figure_rss_quantiles(
    tasks = get_column(df_per_task_non_zero, 'task'),
    rss_max = get_column(df_per_task_non_zero, 'RSS_max'),
    rss_min = get_column(df_per_task_non_zero, 'RSS_min'),
    rss_mean = get_column(df_per_task_non_zero, 'RSS_mean'),
    rss_pct_low = get_column(df_per_task_non_zero, 'RSS_p05'),
    rss_pct_high = get_column(df_per_task_non_zero, 'RSS_p95'),
    label=f'\n Pipetasks with max RSS ≥ {max_rss_lower_bound} GB \n (whiskers show 5th to 95th percentiles) \n',
#    note=f'NOTE: whiskers show 5th to 95th percentiles.',
)
bkh.show(task_rss_quantiles_fig)

# Export this figure
save_figure(task_rss_quantiles_fig, output_dir=output_dir, filename='memory-consumption-per-pipetask', title='DP0.2 Memory consumption per pipetask')
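
The figure above draws whiskers from the 5th to the 95th percentile of the RSS of each pipetask, and keeps only the tasks whose 5th percentile is at least 0.1 GB and whose maximum is at least 0.5 GB. The sketch below illustrates that summary and selection in plain Python on made-up values. The helper names `summarize_rss` and `keep_task` are hypothetical, and the percentile interpolation used here may differ slightly from the one applied by the dataframe library in this notebook.

```python
import statistics

def summarize_rss(values_gb):
    """Return min, mean, max and the 5th/95th percentiles of a list of RSS values (GB)."""
    # n=20 yields 19 cut points located at 5%, 10%, ..., 95%
    cuts = statistics.quantiles(values_gb, n=20, method='inclusive')
    return {
        'RSS_min': min(values_gb),
        'RSS_mean': statistics.fmean(values_gb),
        'RSS_max': max(values_gb),
        'RSS_p05': cuts[0],
        'RSS_p95': cuts[-1],
    }

def keep_task(summary, p05_lower_bound=0.1, max_lower_bound=0.5):
    """Apply the same two lower bounds as the selection cell above."""
    return summary['RSS_p05'] >= p05_lower_bound and summary['RSS_max'] >= max_lower_bound

# Hypothetical data: 101 quanta using 0.00, 0.01, ..., 1.00 GB
toy = [i / 100 for i in range(101)]
toy_summary = summarize_rss(toy)
```

With these toy values the 5th percentile falls below 0.1 GB, so `keep_task(toy_summary)` rejects the task even though its maximum exceeds 0.5 GB.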

## Memory distribution for pipetasks consuming most of the compute time

In [42]:
def make_figure_big_task_consumers(tasks: List[str], elapsed_time_pct: List[float], rss_min: List[float], rss_max: List[float], rss_mean: List[float],
                                   rss_pct_low: List[float], rss_pct_high: List[float], label: str = None, note: str = None) -> bkh.figure:
    """Return a figure representing the RSS distribution for each pipetask and its consumption of compute time.
    """
    # Build the data source: sort tasks by their share of elapsed time, largest first
    sorted_by_elapsed = sorted(zip(elapsed_time_pct, rss_min, rss_max, rss_mean, rss_pct_low, rss_pct_high, tasks), reverse=True)
    elapsed_time_pct, rss_min, rss_max, rss_mean, rss_pct_low, rss_pct_high, tasks = zip(*sorted_by_elapsed)
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks,
        'elapsed_time_pct': elapsed_time_pct,
        'elapsed_time_cumulated': np.cumsum(elapsed_time_pct),
        'rss_min': rss_min,
        'rss_max': rss_max,
        'rss_mean': rss_mean,
        'rss_pct_low': rss_pct_low,
        'rss_pct_high': rss_pct_high,
    })
    
    # Build and configure the figure
    left_y_range = bokeh.models.Range1d(0.1, max(rss_max)*2)
    fig = bkh.figure(
        x_range = tasks,
        y_range = left_y_range,
        y_axis_type = 'log',
        y_axis_label = 'gigabyte',
        width = 1_600,
        height = 1_200,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Memory used by the most compute-intensive pipetasks", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_size = "11pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "11pt"
    fig.xaxis.major_label_orientation = math.pi/2.5
    
    # Add secondary Y axis
    bottom_right_y_axis = min(elapsed_time_pct) / 2
    top_right_axis = max(elapsed_time_pct) * 1.2
    fig.extra_y_ranges = {"y_axis_right": bokeh.models.Range1d(bottom_right_y_axis, top_right_axis)}
    fig.add_layout(bokeh.models.LinearAxis(y_range_name='y_axis_right', axis_label='total elapsed time'), 'right')
    fig.yaxis[1].formatter = bokeh.models.NumeralTickFormatter(format='0%')

    # Format Y axis
    # fig.yaxis[1].ticker = bokeh.models.tickers.LogTicker()
    for y_axis in fig.yaxis:
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"
        
    # Annotation
    if label is not None:
        annotation_label = bkhmodels.Label(x=len(tasks)//2.2, y=max(rss_max)/1.2, x_units='data', y_units='data',
                     text=label, text_font_size='12pt',
                     text_color='dimgray', text_alpha=0.8,
                     background_fill_color='white', background_fill_alpha=1.0,
                     border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(annotation_label)

    
    # Add glyphs
    dashes_max = fig.scatter(marker='dash', x='tasks', y='rss_max', color='indianred', size=15, line_width=3, source=source, legend_label='max RSS')
    mean = fig.scatter(x='tasks', y='rss_mean', size=8, color='mediumseagreen', source=source, legend_label='mean RSS')
    dashes_min = fig.scatter(marker='dash', x='tasks', y='rss_min', color='steelblue', size=15, line_width=3, source=source, legend_label='min RSS')
    bars = fig.vbar(x='tasks', top='elapsed_time_pct', bottom=bottom_right_y_axis, width=0.8, color='tan', alpha=0.3, source=source, legend_label='elapsed time', y_range_name='y_axis_right')
    
    whisker = bkhmodels.Whisker(base='tasks', upper='rss_pct_high', lower='rss_pct_low', source=source)
    fig.add_layout(whisker)
    
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Hide legend on click
    fig.legend.click_policy = 'mute'

    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[
        ('Task', '@tasks'),
        ('RSS max', '@rss_max{0.2f} GB'),
        ('RSS percentile high', '@rss_pct_high{0.2f} GB'),
        ('RSS mean', '@rss_mean{0.2f} GB'),
        ('RSS percentile low', '@rss_pct_low{0.2f} GB'),
        ('RSS min', '@rss_min{0.2f} GB'),
        ('elapsed time', '@elapsed_time_pct{0.0%}'),
        ('cumulated elapsed time', '@elapsed_time_cumulated{0.0%}'),
    ], renderers=[bars, mean, dashes_max, dashes_min], mode='mouse'))
    
    # Add note below the figure
    if note is not None:
        fig.add_layout(bkhmodels.Title(text=note, text_font_style='italic'), 'below')
    
    return fig

In [43]:
# Select the tasks which together account for a given share (threshold) of the total elapsed time
task_time_consumers_df = df_per_task#.filter(pl.col('cpu_time_hours') > 0.0)
tasks = get_column(task_time_consumers_df, 'task')
elapsed_time = get_column(task_time_consumers_df, 'elapsed_time_hours')

# Sort by elapsed time in decreasing order
sorted_by_elapsed_time = sorted(zip(elapsed_time, tasks), reverse=True)
elapsed_time, tasks = zip(*sorted_by_elapsed_time)

# Compute the percentage of total time each kind of pipetask spent in execution
elapsed_percentage = [100.0 * v/sum(elapsed_time) for v in elapsed_time]
elapsed_cumulated = np.cumsum(elapsed_percentage)

cumulated_elapsed_threshold = 98 # percent
time_consumers_tasks = np.array(tasks)[elapsed_cumulated <= cumulated_elapsed_threshold].tolist()

# Build a dataframe with big consumer tasks and their memory usage
# with (among others) the columns:
#    task, elapsed_time_hours, elapsed_time_pct, RSS_min, RSS_mean, RSS_max, RSS_p05, RSS_p95
time_consumers_df = (
    df_per_task.with_columns(
        # Add column 'elapsed_time_pct' with the fraction (0 to 1) of total time consumed by each kind of pipetask
        (pl.col('elapsed_time_hours') / pl.col('elapsed_time_hours').sum()).alias('elapsed_time_pct')
    )
    .filter(
        # Select only the tasks consuming the most
        pl.col('task').is_in(time_consumers_tasks)
    )
    .sort(by='elapsed_time_hours', descending=True)
)

# Build the plot
task_rss_consumers_fig = make_figure_big_task_consumers(
    tasks = get_column(time_consumers_df, 'task'),
    elapsed_time_pct = get_column(time_consumers_df, 'elapsed_time_pct'),
    rss_max = get_column(time_consumers_df, 'RSS_max'),
    rss_min = get_column(time_consumers_df, 'RSS_min'),
    rss_mean = get_column(time_consumers_df, 'RSS_mean'),
    rss_pct_low = get_column(time_consumers_df, 'RSS_p05'),
    rss_pct_high = get_column(time_consumers_df, 'RSS_p95'),
    # label=f'\n Pipetasks which consume in aggregate {cumulated_elapsed_threshold}% of elapsed time.\n',
    note = f'NOTE: the pipetasks shown consume in aggregate {cumulated_elapsed_threshold}% of the total elapsed time of the DP0.2 campaign. Whiskers show 5th to 95th RSS percentiles.',
)
bkh.show(task_rss_consumers_fig)

# Export this figure
save_figure(task_rss_consumers_fig, output_dir=output_dir, filename='memory-by-compute-intensive-pipetasks', title='DP0.2 Memory by most compute-intensive pipetasks')
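
The selection of the biggest time consumers in the cell above relies on a cumulative sum over tasks sorted by decreasing elapsed time. The sketch below reproduces that logic in plain Python on made-up numbers. The function name `select_top_consumers` and the toy task names are hypothetical.

```python
from itertools import accumulate

def select_top_consumers(elapsed_by_task, threshold_pct=98.0):
    """Return the task names whose cumulative share of elapsed time stays within threshold_pct."""
    # Sort by elapsed time, largest first
    ranked = sorted(elapsed_by_task.items(), key=lambda item: item[1], reverse=True)
    total = sum(hours for _, hours in ranked)
    shares = [100.0 * hours / total for _, hours in ranked]
    cumulated = accumulate(shares)
    return [task for (task, _), cum in zip(ranked, cumulated) if cum <= threshold_pct]

# Hypothetical elapsed times, in hours
toy_elapsed = {'taskA': 50.0, 'taskB': 30.0, 'taskC': 15.0, 'taskD': 5.0}
selected = select_top_consumers(toy_elapsed, threshold_pct=90.0)
```

Because the comparison is `<=`, the task that crosses the threshold is left out (`taskC` here, which would bring the cumulative share to 95%). The tasks retained therefore account for at most the threshold, and usually slightly less.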

## Memory vs execution time scatter plot 

In [172]:
from bokeh.palettes import Category20
from bokeh.transform import factor_cmap
from scipy.spatial import ConvexHull

def make_figure_correlation_memory_time(
    tasks: List[str], run_time: List[float], elapsed_time_pct: List[float], rss_max: List[float], 
    label: str = None
) -> bkh.figure:
    """Return a scatter plot figure with max RSS and compute time. Color by task name."""
    
    # Sort the data
    sorted_by_elapsed = sorted(zip(elapsed_time_pct, run_time, rss_max, tasks), reverse=True)
    elapsed_time_pct, run_time, rss_max, tasks = zip(*sorted_by_elapsed)
    
    # Create a data source
    source = bkhmodels.ColumnDataSource(data={
        'tasks': tasks,
        'run_time': run_time,
        'elapsed_time_pct': elapsed_time_pct,
        'rss_max': rss_max,
    })
    
    # Use counter to sort tasks by quantity
    from collections import Counter
    task_counter = dict(Counter(tasks).most_common())
    # Pick a categorical palette (Category20) with one color per task
    unique_tasks = list(task_counter.keys())  # Unique tasks, most frequent first
    palette = Category20[max(3, min(20, len(unique_tasks)))]  # Category20 provides palettes of 3 to 20 colors
    #palette = [Viridis256[i * 256 // len(unique_tasks)] for i in range(len(unique_tasks))] # Use infinite colors, but less visually striking
    color_map = factor_cmap('tasks', palette=palette, factors=unique_tasks)

    # Build and configure the figure
    left_y_range = bokeh.models.Range1d(min(rss_max)-0.1, max(rss_max) * 1.2)
    x_range = bokeh.models.Range1d(0.1, 10000)

    fig = bkh.figure(
        x_axis_type='log',
        x_axis_label='Elapsed time (s)',
        #x_range=x_range,
        y_range=left_y_range,
        #y_axis_type='log',
        y_axis_label='gigabyte',
        width=1_400,
        height=800,
        background_fill_color="#f4f3f3",
    )
    
    # Add title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Max memory used versus elapsed time, grouped by task", text_font_size="18pt"), 'above')

    # Configure axes
    for x_axis in fig.xaxis:
        x_axis.axis_label_text_font_size = "11pt"
        x_axis.axis_label_text_font_style = "bold"
        x_axis.major_label_text_font_size = "11pt"

    for y_axis in fig.yaxis:
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"

    # Add scatter points with dynamic color by task
    renderers_pt, renderers_ch = [], []
    for task, count in task_counter.items():
        # Extract data points for the current task
        task_x = [rt for t, rt, pctt, mem in zip(tasks, run_time, elapsed_time_pct, rss_max) if t == task]
        task_y = [mem for t, rt, pctt, mem in zip(tasks, run_time, elapsed_time_pct, rss_max) if t == task]
        task_pctt = [pctt for t, rt, pctt, mem in zip(tasks, run_time, elapsed_time_pct, rss_max) if t == task]
        
        # Create a data source for the task points
        task_source = bkhmodels.ColumnDataSource(data={
            'run_time': task_x,
            'rss_max': task_y,
        })

        # Plot task points
        renderers_pt.append(
            fig.scatter(
                marker='circle', x='run_time', y='rss_max', size=5,
                line_width=3, color=palette[unique_tasks.index(task)],
                source=task_source, legend_label=f'{task} / {count} calls / {sum(task_pctt):.2%}',
                muted_alpha=0.05,
                
            )
        )

        # Compute convex hull for the task's points if there are enough points (min 3)
        if len(task_x) >= 3:
            points = np.array(list(zip(task_x, task_y)))
            hull = ConvexHull(points)

            # Get the vertices of the hull and close the polygon by appending the first point
            hull_vertices = points[hull.vertices]
            hull_vertices = np.append(hull_vertices, [hull_vertices[0]], axis=0)

            # Create a ColumnDataSource for the hull vertices
            hull_source = bkhmodels.ColumnDataSource(data={
                'task_name' : [task for _ in hull_vertices],
                'x': hull_vertices[:, 0],
                'y': hull_vertices[:, 1]
            })
            # Plot the convex hull as a patch
            r = fig.patch(
                x='x', y='y', source=hull_source,
                color=palette[unique_tasks.index(task)], 
                fill_alpha=0.2, line_alpha=0.5, line_width=2
                )
            renderers_ch.append(r)

    # Configure toolbar and legend
    fig.toolbar.autohide = True
    fig.legend.click_policy = 'hide'
    fig.legend.label_text_font_size = '8pt'

    # Add tooltips
    # TODO: tooltips for the convex hull (patch) do not work
    #fig.add_tools(bkhmodels.HoverTool(
    #    tooltips=[('Task', '@{task_name}')],
    #    renderers=renderers_ch,
    #    mode='mouse')
    #)
    
    fig.add_tools(bkhmodels.HoverTool(tooltips=[
        ('RSS max', '@rss_max{0.2f} GB'),
        ('Elapsed time', '@run_time{0.1f} s'),
    ], renderers=renderers_pt, mode='mouse'))
    
    # Legend location
    #fig.add_layout(fig.legend[0],'right') # outside the plot
    fig.legend[0].location = 'top_left' # inside the plot

    # Add legend title
    if label is not None: fig.legend[0].title = label
    
    return fig
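
The function above computes each convex hull on the raw `(run_time, rss_max)` values, while the x axis of the figure is logarithmic. A polygon that is convex in linear coordinates is not necessarily convex once drawn on a log axis, and some points may then appear outside it. The sketch below shows one possible alternative, computing the hull on `log10(run_time)` and mapping the vertices back to the original values. It uses a plain-Python monotone chain algorithm instead of `scipy.spatial.ConvexHull`, and the helper names `convex_hull` and `hull_in_log_x` are hypothetical.

```python
import math

def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB (first two coordinates only)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one
    return lower[:-1] + upper[:-1]

def hull_in_log_x(run_time, rss_max):
    """Compute the hull in (log10(run_time), rss_max) space; run_time values must be > 0."""
    # Carry the original x value as a third coordinate to avoid round-trip rounding errors
    logged = [(math.log10(x), y, x) for x, y in zip(run_time, rss_max)]
    return [(x, y) for _, y, x in convex_hull(logged)]

# Hypothetical measurements: elapsed time in seconds, max RSS in GB
hull = hull_in_log_x([1, 10, 100, 10, 10], [1, 1, 1, 3, 1.5])
```

The vertices returned are expressed in the original units, so they could be passed to `fig.patch()` in the same way as the vertices produced in the function above.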

In [173]:
# Select the tasks to include
nb_top = 10

topn_task = df_summary.sort("quanta", descending=True)[:nb_top].select(pl.col("task")) # top n most executed task
df_all_non_zero = df_all.filter((pl.col('memory') >= 0.1) & (pl.col('task').is_in(topn_task)))

# Create and display the figure
task_correlation_memory_time_fig = make_figure_correlation_memory_time(
    tasks=get_column(df_all_non_zero, 'task'),
    run_time=get_column(df_all_non_zero, 'run_time'),
    elapsed_time_pct=get_column(df_all_non_zero, 'elapsed_time_pct'),
    rss_max=get_column(df_all_non_zero, 'memory'),
    label=f'{nb_top} most called pipetasks\nname / nb calls / % total time',
)

bkh.show(task_correlation_memory_time_fig)
save_figure(task_correlation_memory_time_fig, output_dir=output_dir, filename='memory-vs-runtime-top10', title='DP0.2 Memory vs Execution time, top10')

In [174]:
# Select the tasks to include
nb_from = nb_top
nb_top = 20

topn_task = df_summary.sort("quanta", descending=True)[nb_from:nb_top].select(pl.col("task")) # top n most executed task
df_all_non_zero = df_all.filter((pl.col('memory') >= 0.1) & (pl.col('task').is_in(topn_task)))

# Create and display the figure
task_correlation_memory_time_fig = make_figure_correlation_memory_time(
    tasks=get_column(df_all_non_zero, 'task'),
    run_time=get_column(df_all_non_zero, 'run_time'),
    elapsed_time_pct=get_column(df_all_non_zero, 'elapsed_time_pct'),
    rss_max=get_column(df_all_non_zero, 'memory'),
    label=f'{nb_from}-th to {nb_top}-th most called pipetasks\nname / nb calls / % total time',
)

bkh.show(task_correlation_memory_time_fig)
save_figure(task_correlation_memory_time_fig, output_dir=output_dir, filename='memory-vs-runtime-top20', title='DP0.2 Memory vs Execution time, top 10 to 20')

# Select the tasks to include
nb_from = nb_top
nb_top = 30

topn_task = df_summary.sort("quanta", descending=True)[nb_from:nb_top].select(pl.col("task")) # top n most executed task
df_all_non_zero = df_all.filter((pl.col('memory') >= 0.1) & (pl.col('task').is_in(topn_task)))

# Create and display the figure
task_correlation_memory_time_fig = make_figure_correlation_memory_time(
    tasks=get_column(df_all_non_zero, 'task'),
    run_time=get_column(df_all_non_zero, 'run_time'),
    elapsed_time_pct=get_column(df_all_non_zero, 'elapsed_time_pct'),
    rss_max=get_column(df_all_non_zero, 'memory'),
    label=f'{nb_from}-th to {nb_top}-th most called pipetasks\nname / nb calls / % total time',
)

bkh.show(task_correlation_memory_time_fig)
save_figure(task_correlation_memory_time_fig, output_dir=output_dir, filename='memory-vs-runtime-top30', title='DP0.2 Memory vs Execution time, top 20 to 30')

# Select the tasks to include
nb_from = nb_top
nb_top = 41

topn_task = df_summary.sort("quanta", descending=True)[nb_from:nb_top].select(pl.col("task")) # top n most executed task
df_all_non_zero = df_all.filter((pl.col('memory') >= 0.1) & (pl.col('task').is_in(topn_task)))

# Create and display the figure
task_correlation_memory_time_fig = make_figure_correlation_memory_time(
    tasks=get_column(df_all_non_zero, 'task'),
    run_time=get_column(df_all_non_zero, 'run_time'),
    elapsed_time_pct=get_column(df_all_non_zero, 'elapsed_time_pct'),
    rss_max=get_column(df_all_non_zero, 'memory'),
    label=f'{nb_from}-th to {nb_top}-th most called pipetasks\nname / nb calls / % total time',
)

bkh.show(task_correlation_memory_time_fig)
save_figure(task_correlation_memory_time_fig, output_dir=output_dir, filename='memory-vs-runtime-top41', title='DP0.2 Memory vs Execution time, top 30 to 41')
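
The four cells above repeat the same steps for consecutive rank windows (0 to 10, 10 to 20, 20 to 30 and 30 to 41). As a sketch of how those windows could be generated in a loop instead, the snippet below builds the `(nb_from, nb_top)` pairs from a list of boundaries. The helper `rank_windows` and the toy task names are hypothetical, and the dataframe filtering and plotting calls are intentionally left out.

```python
def rank_windows(boundaries):
    """Return (start, stop) index pairs for consecutive slices delimited by boundaries."""
    return list(zip(boundaries[:-1], boundaries[1:]))

# Tasks ranked by number of quanta, most executed first (hypothetical names)
ranked_tasks = [f'task{i:02d}' for i in range(41)]

windows = rank_windows([0, 10, 20, 30, 41])
# One group of tasks per figure, keyed like the exported file names (top10, top20, ...)
groups = {f'top{stop}': ranked_tasks[start:stop] for start, stop in windows}
```

Each entry of `groups` corresponds to the set of tasks shown in one of the scatter plots above.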

## Memory distribution per pipetask

In [143]:
df_all

| step | task | band | instrument | day_obs | detector | physical_filter | visit | memory | init_time | run_time | group | exposure | skymap | tract | patch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | i64 | i64 | str | i64 | f64 | f64 | f64 | str | i64 | str | i64 | i64 |
| step1_20240828T132128Z | calibrate | g | HSC | 20150813 | 0 | HSC-G | 38440 | 1.860194 | 0.043232 | 17.485948 | null | null | null | null | null |
| step1_20240828T132128Z | calibrate | g | HSC | 20151014 | 0 | HSC-G | 42158 | 1.864344 | 0.03966 | 42.24704 | null | null | null | null | null |
| step1_20240828T132128Z | calibrate | g | HSC | 20151014 | 0 | HSC-G | 42170 | 1.867305 | 0.040183 | 24.076729 | null | null | null | null | null |
| step1_20240828T132128Z | calibrate | g | HSC | 20151014 | 0 | HSC-G | 42172 | 1.864229 | 0.03583 | 36.990868 | null | null | null | null | null |
| step1_20240828T132128Z | calibrate | g | HSC | 20151014 | 0 | HSC-G | 42174 | 1.851531 | 0.046396 | 23.545905 | null | null | null | null | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| step8_20240909T130219Z | plotPropertyMapTract | r | null | null | null | null | null | 3.972592 | 0.000389 | 64.289125 | null | null | hsc_rings_v1 | 9461 | null |
| step8_20240909T130219Z | plotPropertyMapTract | y | null | null | null | null | null | 3.975291 | 0.000495 | 94.022988 | null | null | hsc_rings_v1 | 9461 | null |
| step8_20240909T130219Z | plotPropertyMapTract | z | null | null | null | null | null | 3.98524 | 0.000558 | 92.639373 | null | null | hsc_rings_v1 | 9461 | null |
| step8_20240909T130219Z | refCatObjectTract | null | null | null | null | null | null | 0.770441 | 0.000635 | 14.813182 | null | null | hsc_rings_v1 | 9461 | null |


In [144]:
def make_figure_rss_histogram_per_task(distribution: pl.Series, task: str, annotation: str = None, bins: int = 40) -> bkh.figure:
    """Return a figure with a histogram of the RSS (in gigabytes) for a given task.

    ``distribution`` must be a polars Series, since its ``quantile()`` method is used below.
    """
    # Compute histogram
    hist, edges = np.histogram(distribution, density=False, bins=bins)
    count = len(distribution)
    source = bokeh.models.ColumnDataSource(data=dict(base=[max(hist)],
                                                     upper=[distribution.quantile(0.95)], lower=[distribution.quantile(0.05)],
                                                     q1=[distribution.quantile(0.25)], q2=[distribution.quantile(0.50)], q3=[distribution.quantile(0.75)]
                                                    ))
    
    # Build and configure the figure
    width, height = 1_000, 600
    y_range = bokeh.models.Range1d(1, max(hist)*15)
    fig = bkh.figure(
        x_range = (edges[0], edges[-1]),
        x_axis_label = 'gigabyte',
        y_axis_label = 'frequency',
        y_range = y_range,
        y_axis_type = 'log',
        width = width,
        height = height,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text=f"Distribution of memory usage by {task} pipetask", text_font_size="16pt"), 'above')

    # Format X axis
    fig.xaxis.axis_label_text_font_size = "12pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.xaxis.formatter = bokeh.models.NumeralTickFormatter(format='0,.f')

    # Add and format secondary Y axis
    fig.extra_y_ranges = {"y_axis_right": y_range}
    fig.add_layout(bokeh.models.LogAxis(y_range_name="y_axis_right"), 'right')
    for y_axis in fig.yaxis:
        y_axis.ticker = bokeh.models.tickers.LogTicker()
        y_axis.axis_label_text_font_size = "14pt"
        y_axis.major_label_text_font_size = "12pt"
        y_axis.axis_label_text_font_style = "bold"


    
    # Add a histogram
    quad = fig.quad(top=hist, bottom=1, left=edges[:-1], right=edges[1:],
           fill_color="orange", line_color="black", alpha=0.3)
    ## Add descriptive statistics
    #whisker = bokeh.models.Whisker(base="base", upper="upper", lower="lower", dimension="width", source=source,
    #                               level="annotation", upper_units='screen', base_units='screen', lower_units='screen')
    #whisker.upper_head.size = whisker.lower_head.size = max(hist)
    #fig.add_layout(whisker)
    
    
    ## quantile boxes
    #fig.hbar("base", max(hist), "q2", "q3", source=source, color="red", line_color="black",)
    #fig.hbar("base", max(hist), "q1", "q2", source=source, color="red", line_color="black",)
    
    # Add annotation label
    if annotation is not None:
        # TODO: fix the issue with the location of the annotation in the plot
        label = bkhmodels.Label(x=int(width*0.75), y=int(height*0.60),
                                x_units='screen', y_units='screen',
                                x_offset=15, y_offset=-5,
                                text=f'{annotation}', text_font_size='10pt',
                                text_color='dimgray', text_alpha=0.8,
                                background_fill_color='white', background_fill_alpha=1.0,
                                border_line_color='dimgray', border_line_alpha=0.5)
        fig.add_layout(label)
        
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[
        ('Interval', 'from @left{0.1f} to @right{0.1f} GB'),
        ('Frequency', '@top{0,} quanta'),
    ], renderers=[quad, ], mode='mouse'))
       
    # Hide toolbar
    fig.toolbar.autohide = True

    return fig
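
The histogram function above bins the RSS values into equal-width intervals and draws each bar from `bottom=1` on a logarithmic y axis. As a consequence, a bin containing a single quantum has zero height and a bin with no quantum cannot be drawn at all. The sketch below illustrates the binning in plain Python on made-up values. The helper `equal_width_histogram` is hypothetical and only approximates what `np.histogram` does at bin edges.

```python
def equal_width_histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning [min(values), max(values)]."""
    low, high = min(values), max(values)
    if high == low:
        # Avoid a zero-width range when all values are identical
        low, high = low - 0.5, high + 0.5
    width = (high - low) / bins
    edges = [low + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts, edges

# Hypothetical RSS values, in GB
toy_values = [0.0, 0.5, 1.0, 1.5, 2.0, 2.0, 4.0]
counts, edges = equal_width_histogram(toy_values, bins=4)

# Bars with a visible height on a log axis starting at 1
visible = [c for c in counts if c > 1]
```

In this toy case the last bin holds only the maximum value, so it would not show up as a bar even though it is not empty. The hover tooltips and the annotation with N, min and max remain the reliable way to spot such isolated outliers.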

In [145]:
def describe_column(df: pl.DataFrame, column: str) -> Tuple[float, float, float, float]:
    """Return the (min, max, mean, std) of ``column`` in ``df``."""
    df_description = df.select(
        [
            pl.col(column).min().alias('min'),
            pl.col(column).max().alias('max'),
            pl.col(column).mean().alias('mean'),
            pl.col(column).std().alias('std'),
        ]
    )
    return df_description.row(0)

memory_distribution_figures = []
bins = 40
for task in sorted(time_consumers_tasks):
    # Collect the histogram data for this task
    task_df = df_all.filter(
        (pl.col('task') == task) & pl.col('memory').is_not_null()
    ).select(pl.col('memory'))
    
    # Compute descriptive statistics
    min_value, max_value, mean_value, std_value = describe_column(task_df, 'memory')
    
    annotation = f' N: {len(task_df):,} \n min: {min_value:.2f} \n mean: {mean_value:.2f} \n std: {std_value:.2f} \n max: {max_value:.2f} '
    
    # Plot the histogram
    fig = make_figure_rss_histogram_per_task(task_df.get_column('memory'), task, annotation=annotation)
    bkh.show(fig)
    
    # Export this figure
    filename = f'memory-distribution-{task}'
    memory_distribution_figures.append(filename)
    save_figure(fig, output_dir=output_dir, filename=filename, title=f'DP0.2 Memory distribution by pipetask {task}')
    

In [146]:
time_consumers_tasks = np.array(tasks)[elapsed_cumulated <= cumulated_elapsed_threshold].tolist()
print(time_consumers_tasks)

['forcedPhotCoadd', 'characterizeImage', 'measure', 'calibrate', 'isr', 'forcedPhotCcd', 'finalizeCharacterization', 'makeWarp', 'writeRecalibratedSourceTable', 'deblend', 'assembleCoadd']


# Publish the report

In [175]:
def publish(results_dir, publication_top_dir):
    """Publish the results of this notebook under publication_top_dir
    """
    # Create the publication directories and remove .png and .html files
    # that may exist in them
    publication_dir = os.path.join(publication_top_dir, 'pipetasks')    
    os.makedirs(publication_dir, exist_ok=True)
    
    dst_images_dir = os.path.join(publication_dir, 'images')
    os.makedirs(dst_images_dir, exist_ok=True)
    for f in glob.glob(os.path.join(dst_images_dir, '*.png')):
        os.remove(f)
    
    dst_html_dir = os.path.join(publication_dir, 'html')
    os.makedirs(dst_html_dir, exist_ok=True)
    for f in glob.glob(os.path.join(dst_html_dir, '*.html')):
        os.remove(f)
    
    # Publish PNG files
    src_image_dir = os.path.join(results_dir, 'images')
    for src_file in glob.glob(os.path.join(src_image_dir, '*.png')):
        dst_file = os.path.join(publication_dir, 'images', os.path.basename(src_file))
        shutil.copy(src_file, dst_file)
    
    # Publish HTML files
    src_html_dir = os.path.join(results_dir, 'html')
    for src_file in glob.glob(os.path.join(src_html_dir, '*.html')):
        dst_file = os.path.join(publication_dir, 'html', os.path.basename(src_file))
        shutil.copy(src_file, dst_file)
        
    # Publish main HTML file
    src_index_html = os.path.join(results_dir, 'pipetasks.html')
    dst_index_html = os.path.join(publication_dir,  os.path.basename(src_index_html))
    shutil.copy(src_index_html, dst_index_html)

In [176]:
# Render the HTML template
import os
import jinja2

environment = jinja2.Environment(loader=jinja2.FileSystemLoader('./templates'))
template = environment.get_template('pipetasks-template.html')

pipetasks_html = os.path.join(output_dir, 'pipetasks.html')
with open(pipetasks_html, mode="w", encoding="utf-8") as f:
    context = {
        'figures': memory_distribution_figures,
    }
    f.write(template.render(context))

In [177]:
# publication_dir='/sps/lsst/users/fabio/web/rubin-dp0.2-at-frdf'  # disabled: it was overridden by the next line
publication_dir='/sps/lsst/users/abernard/web/rubin-HSC-DRP2-at-frdf'

publish(results_dir=output_dir, publication_top_dir=publication_dir)

## Categorize tasks per memory consumption

In [62]:
def categorize_tasks_per_memory(df: pl.DataFrame, categories: dict = None) -> dict:
    """Return a dictionary mapping each memory category to the elapsed time (in hours) spent in it.

    ``categories`` gives the upper bounds (in GB) of the 'small' and 'medium' categories.
    Tasks whose max RSS exceeds the 'medium' bound fall in the last category.
    """
    if categories is None:
        categories = {'small': 5, 'medium': 20, 'high': None}
    result = {}
    key = f"0 GB ≤ max RSS ≤ {categories['small']} GB"
    elapsed = df.filter(
        (pl.col('RSS_max') >= 0) & (pl.col('RSS_max') <= categories['small'])
    ).select('elapsed_time_hours').sum().row(0)[0]
    result[key] = elapsed
    
    key = f"{categories['small']} GB < max RSS ≤ {categories['medium']} GB"
    elapsed = df.filter(
        (pl.col('RSS_max') > categories['small']) & (pl.col('RSS_max') <= categories['medium'])
    ).select('elapsed_time_hours').sum().row(0)[0]
    result[key] = elapsed

    key = f"max RSS > {categories['medium']} GB"
    elapsed = df.filter(
        pl.col('RSS_max') > categories['medium']
    ).select('elapsed_time_hours').sum().row(0)[0]
    result[key] = elapsed

    return result
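
The categorization above splits the kinds of tasks into three memory bands and sums the elapsed time spent in each band. The sketch below reproduces that logic in plain Python on made-up values, then derives the share of each band as plotted by the figure function that follows. The function name `categorize_elapsed_by_memory` and the toy values are hypothetical.

```python
def categorize_elapsed_by_memory(tasks, small=5, medium=20):
    """Sum elapsed hours per memory category; each task is a (max_rss_gb, elapsed_hours) pair."""
    result = {
        f'0 GB ≤ max RSS ≤ {small} GB': 0.0,
        f'{small} GB < max RSS ≤ {medium} GB': 0.0,
        f'max RSS > {medium} GB': 0.0,
    }
    keys = list(result)
    for max_rss, elapsed in tasks:
        if max_rss <= small:
            result[keys[0]] += elapsed
        elif max_rss <= medium:
            result[keys[1]] += elapsed
        else:
            result[keys[2]] += elapsed
    return result

# Hypothetical (max RSS in GB, elapsed time in hours) per kind of task
toy_tasks = [(2.0, 100.0), (4.5, 50.0), (18.0, 10.0), (28.0, 1.0)]
by_category = categorize_elapsed_by_memory(toy_tasks)

# Fraction of the total elapsed time spent in each category
total_hours = sum(by_category.values())
shares = {key: hours / total_hours for key, hours in by_category.items()}
```

Note that the band of a task is decided by its maximum RSS, so a task with a single memory-hungry quantum contributes all of its elapsed time to the higher band.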

In [63]:
def make_figure_task_category_by_memory(categories: List[str], elapsed_time: List[float]) -> bkh.figure:
    """Return a figure representing the elapsed time spent by each of the categories of tasks
    """
    # Build the data source
    total_elapsed = sum(elapsed_time)
    source = bkhmodels.ColumnDataSource(data={
        'category': categories, 
        'elapsed_time': [v/total_elapsed for v in elapsed_time], 
    })
    
    # Build and configure the figure
    fig = bkh.figure(
        x_range = bokeh.models.Range1d(0, 1.0),
        y_range = categories,
        width = 800,
        height = 600,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text=processing_label, text_font_style="italic", text_font_size="11pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Elapsed time spent per category of pipetask", text_font_size="18pt"), 'above')

    # Axis
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.xaxis.formatter = bokeh.models.NumeralTickFormatter(format='0%')

    # Format Y axis
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
     
    # Hide toolbar
    fig.toolbar.autohide = True
    
    # Add horizontal bars
    bars = fig.hbar(y='category', right='elapsed_time', left=0, height=0.8, source=source, fill_color='wheat', alpha=0.7, line_color='black')
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('Elapsed time', '@elapsed_time{0%}')], renderers=[bars], mode='mouse'))

    return fig

## Overview of resource consumption by kind of task

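The table below summarizes the elapsed time and memory consumption by kind of task. CPU time is not available in the resource usage records used here, so the CPU time and CPU efficiency columns contain placeholder zeros. Tasks are presented in alphabetical order, not in the order they were executed.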

In [430]:
df_per_task = df_all.group_by("task", maintain_order=True).agg(
    [
        pl.col("task").count().alias("task_count"),
        pl.col("task").n_unique().alias("task_kinds"),
        # CPU time is not available in the resource usage records: keep the columns as zero-filled placeholders
        # (pl.col("cpu_time")/3_600).sum().alias("cpu_time_hours"),
        (0*pl.col("run_time")).sum().alias("cpu_time"),
        (0*pl.col("run_time")/3_600).sum().alias("cpu_time_hours"),

        (pl.col("run_time")/3_600).sum().alias("elapsed_time_hours"),
        # (pl.col("cpu_time").sum()/pl.col("elapsed_time").sum()).alias("cpu_efficiency"),
        (0*pl.col("run_time").sum()/pl.col("run_time").sum()).alias("cpu_efficiency"),
        pl.col("memory").max().alias("RSS_max"),
    ]
)
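
The aggregation above groups the per-quantum records by task name and derives, for each kind of task, the number of quanta, the cumulated elapsed time in hours and the maximum RSS. The sketch below shows the same grouping in plain Python on made-up records. The function name `aggregate_per_task` and the toy values are hypothetical.

```python
from collections import defaultdict

def aggregate_per_task(records):
    """Group (task, run_time_seconds, memory_gb) records by task name."""
    per_task = defaultdict(lambda: {'task_count': 0, 'elapsed_time_hours': 0.0, 'RSS_max': 0.0})
    for task, run_time, memory in records:
        entry = per_task[task]
        entry['task_count'] += 1
        # run_time is recorded in seconds
        entry['elapsed_time_hours'] += run_time / 3_600
        entry['RSS_max'] = max(entry['RSS_max'], memory)
    return dict(per_task)

# Hypothetical records: (task, run time in seconds, max RSS in GB)
toy_records = [
    ('calibrate', 1_800.0, 1.86),
    ('calibrate', 5_400.0, 1.87),
    ('isr', 3_600.0, 1.20),
]
toy_per_task = aggregate_per_task(toy_records)
```

Here `toy_per_task['calibrate']` reports 2 quanta, 2 hours of cumulated elapsed time and a maximum RSS of 1.87 GB.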

In [431]:
# Generate a table with a summary of the characteristics of each task kind
summary = f"""
| Pipetask     | Number of tasks | Cumulated elapsed time (h) | Cumulated CPU time (h) | Overall CPU efficiency      | Max RSS (GB) |
| -----------: | --------------: | -------------------------: | ---------------------: | --------------------------: | -----------: |
"""

for task in sorted(df_per_task.get_column("task")):
    task_info = df_per_task.filter(pl.col("task") == task)
    task_count = task_info["task_count"][0]
    elapsed_time = task_info['elapsed_time_hours'][0]
    cpu_time = task_info["cpu_time_hours"][0]
    cpu_efficiency = task_info['cpu_efficiency'][0]
    max_rss = task_info['RSS_max'][0]
    summary += f'| `{task}` | {task_count:,} | {math.ceil(elapsed_time):,.0f} | {math.ceil(cpu_time):,.0f} | {cpu_efficiency:.2f} | {max_rss:.0f} |\n'    

# Summarize the dataframe
total_task_count = df_per_task.select("task_count").sum()[0,0]
total_elapsed_time = df_per_task.select("elapsed_time_hours").sum()[0,0]
total_cpu_time = df_per_task.select("cpu_time_hours").sum()[0,0]
total_cpu_efficiency = (df_per_task.select("cpu_time_hours").sum() / df_per_task.select("elapsed_time_hours").sum())[0,0]
total_max_rss = "n/a"

summary += f'| **Total** | **{total_task_count:,}** | **{math.ceil(total_elapsed_time):,.0f}** | **{math.ceil(total_cpu_time):,.0f}** | **{total_cpu_efficiency:.2f}** | **{total_max_rss}** |\n'
summary += "\nIgnore the CPU time columns: this information is currently not available in the resource usage records.\n"
print_md(summary)


| Pipetask     | Number of tasks | Cumulated elapsed time (h) | Cumulated CPU time (h) | Overall CPU efficiency      | Max RSS (GB) |
| -----------: | --------------: | -------------------------: | ---------------------: | --------------------------: | -----------: |
| `analyzeMatchedVisitCore` | 16 | 1 | 0 | 0.00 | 20 |
| `analyzeObjectTableCore` | 1 | 1 | 0 | 0.00 | 6 |
| `analyzeObjectTableSurveyCore` | 1 | 1 | 0 | 0.00 | 1 |
| `assembleCoadd` | 405 | 22 | 0 | 0.00 | 4 |
| `calibrate` | 13,529 | 168 | 0 | 0.00 | 2 |
| `catalogMatchTract` | 1 | 1 | 0 | 0.00 | 2 |
| `characterizeImage` | 13,766 | 273 | 0 | 0.00 | 2 |
| `consolidateHealSparsePropertyMaps` | 5 | 1 | 0 | 0.00 | 1 |
| `consolidateObjectTable` | 1 | 1 | 0 | 0.00 | 18 |
| `consolidateSourceTable` | 126 | 1 | 0 | 0.00 | 1 |
| `deblend` | 81 | 32 | 0 | 0.00 | 5 |
| `detection` | 405 | 17 | 0 | 0.00 | 1 |
| `fgcmBuildFromIsolatedStars` | 1 | 1 | 0 | 0.00 | 8 |
| `fgcmFitCycle` | 1 | 1 | 0 | 0.00 | 28 |
| `fgcmOutputProducts` | 1 | 1 | 0 | 0.00 | 8 |
| `finalizeCharacterization` | 128 | 71 | 0 | 0.00 | 1 |
| `forcedPhotCcd` | 4,602 | 88 | 0 | 0.00 | 1 |
| `forcedPhotCoadd` | 405 | 451 | 0 | 0.00 | 4 |
| `gbdesAstrometricFit` | 60 | 2 | 0 | 0.00 | 2 |
| `healSparsePropertyMaps` | 5 | 2 | 0 | 0.00 | 4 |
| `isolatedStarAssociation` | 16 | 1 | 0 | 0.00 | 1 |
| `isr` | 14,208 | 134 | 0 | 0.00 | 2 |
| `makeCcdVisitTable` | 1 | 1 | 0 | 0.00 | 2 |
| `makeVisitTable` | 1 | 1 | 0 | 0.00 | 2 |
| `makeWarp` | 2,815 | 60 | 0 | 0.00 | 2 |
| `measure` | 405 | 253 | 0 | 0.00 | 4 |
| `mergeDetections` | 81 | 1 | 0 | 0.00 | 0 |
| `mergeMeasurements` | 81 | 1 | 0 | 0.00 | 2 |
| `photometricCatalogMatch` | 1 | 1 | 0 | 0.00 | 2 |
| `photometricRefCatObjectTract` | 1 | 1 | 0 | 0.00 | 1 |
| `plotPropertyMapTract` | 5 | 1 | 0 | 0.00 | 4 |
| `refCatObjectTract` | 1 | 1 | 0 | 0.00 | 1 |
| `selectDeepCoaddVisits` | 405 | 1 | 0 | 0.00 | 1 |
| `transformObjectTable` | 81 | 1 | 0 | 0.00 | 2 |
| `transformPreSourceTable` | 13,529 | 1 | 0 | 0.00 | 2 |
| `transformSourceTable` | 12,978 | 1 | 0 | 0.00 | 0 |
| `updateVisitSummary` | 126 | 11 | 0 | 0.00 | 1 |
| `validateObjectTableCore` | 1 | 1 | 0 | 0.00 | 1 |
| `writeObjectTable` | 81 | 1 | 0 | 0.00 | 9 |
| `writePreSourceTable` | 13,529 | 2 | 0 | 0.00 | 2 |
| `writeRecalibratedSourceTable` | 12,978 | 32 | 0 | 0.00 | 1 |
| **Total** | **104,864** | **1,618** | **0** | **0.00** | **n/a** |

Ignore the CPU time columns: this information is currently not available in the resource usage records.

## Tasks categorization by memory consumption

The table below categorizes tasks by their peak memory usage and presents the CPU time spent executing each category. Three classes are used: small-, medium- and high-memory tasks.

In [432]:
def categorize_tasks_per_memory(df: pd.DataFrame, categories: dict = {'small': 5, 'medium': 20, 'high': None}) -> pd.DataFrame:
    """Return a data frame with details about categories of tasks according to ``categories``.

    The returned data frame is indexed by category and contains the CPU time spent in each category (in seconds)
    as well as the fraction of the total CPU time spent by each category of task.
    """
    # Determine the CPU time consumed by three classes of tasks: small-memory, medium-memory and high-memory
    total_cpu_time = df['cpu_time'].sum()

    # Compute the CPU time spent running each category of tasks
    is_small_memory_task = df['RSS_max'] <= categories['small']
    is_medium_memory_task = (df['RSS_max'] > categories['small']) & (df['RSS_max'] <= categories['medium'])
    is_high_memory_task = ~is_small_memory_task & ~is_medium_memory_task
    cpu_time = [
        df[is_small_memory_task]['cpu_time'].sum(),
        df[is_medium_memory_task]['cpu_time'].sum(),
        df[is_high_memory_task]['cpu_time'].sum(),
    ]
    
    # Compute the fraction of the total CPU time spent by each category of tasks
    cpu_time_percent = [v/total_cpu_time for v in cpu_time]
    
    # Build the resulting dataframe
    data = {
        'category': categories.keys(),
        'cpu_time': cpu_time,
        'cpu_time_percent': cpu_time_percent,
    }
    
    out_df = pd.DataFrame.from_records(data=data)
    out_df.set_index('category', inplace=True)
    return out_df

In [433]:
def generate_task_categories_summary(df: pd.DataFrame, categories: dict = {'small': 5, 'medium': 20, 'high': None}) -> str:
    """Return a Markdown-formatted table with a summary of the percentage of CPU time spent in each category of task.
    """
    # Summarize
    summary = f"""
| category of task    |    CPU time                                         | comments                                                                      |
| ------------------- | --------------------------------------------------: | ----------------------------------------------------------------------------- |
| **small-memory**    | {100. * df.loc['small', 'cpu_time_percent']:.2f}%   | 0 < max RSS ≤ {categories['small']:.0f} GB                                    |
| **medium-memory**   | {100. * df.loc['medium', 'cpu_time_percent']:.2f}%  | {categories['small']:.0f} < max RSS ≤ {categories['medium']:.0f} GB           |
| **high-memory**     | {100. * df.loc['high', 'cpu_time_percent']:.2f}%    | max RSS > {categories['medium']:.0f} GB                                       |
    """
    return summary

In [434]:
# Categorize tasks by their max RSS and generate a summary
task_categories = {
    'small': 5,
    'medium': 20,
    'high': None
}
categories_df = categorize_tasks_per_memory(df_all, categories=task_categories)
summary = generate_task_categories_summary(categories_df, categories=task_categories)
print_md(summary)

ColumnNotFoundError: "cpu_time" not found
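
The `ColumnNotFoundError` above is consistent with the note in the introduction: the resource usage records no longer carry a `cpu_time` column. As a minimal sketch, the same three-way categorization can be weighted by elapsed time instead. The example below uses only the standard library and made-up records. The field names `run_time` (assumed to be in seconds) and `memory` (assumed to be in GB) mirror the columns used earlier for `df_all`, while the values and the helpers `memory_category` and `categorize_by_memory` are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-task records; the values are illustrative only.
records = [
    {"task": "isr", "run_time": 30.0, "memory": 2.0},
    {"task": "deblend", "run_time": 1400.0, "memory": 5.0},
    {"task": "writeObjectTable", "run_time": 40.0, "memory": 9.0},
    {"task": "fgcmFitCycle", "run_time": 530.0, "memory": 28.0},
]

CATEGORY_NAMES = ("small", "medium", "high")


def memory_category(rss_gb, small=5, medium=20):
    """Return 'small', 'medium' or 'high' for a peak RSS value in GB."""
    if rss_gb <= small:
        return "small"
    if rss_gb <= medium:
        return "medium"
    return "high"


def categorize_by_memory(rows, small=5, medium=20):
    """Return {category: fraction of the total elapsed time}."""
    totals = defaultdict(float)
    for row in rows:
        totals[memory_category(row["memory"], small, medium)] += row["run_time"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        # Avoid a division by zero when no time was recorded at all
        return {name: 0.0 for name in CATEGORY_NAMES}
    return {name: totals[name] / grand_total for name in CATEGORY_NAMES}


shares = categorize_by_memory(records)
for name, share in shares.items():
    print(f"{name}-memory: {100.0 * share:.2f}%")
```

The thresholds are inclusive on the upper side, matching the `≤` comparisons used in `categorize_tasks_per_memory`.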

In [419]:
def make_task_category_figure(df: pd.DataFrame) -> bkh.figure:
    """Return a figure representing the CPU time spent on each task category
    """
    # Build the data source
    categories = [f"{v}-memory tasks" for v in df.index.values]
    source = bkhmodels.ColumnDataSource(data={
        'category': categories,
        'percentage': df['cpu_time_percent'].values,
        'percentage_tooltips': [100.0 * v for v in df['cpu_time_percent'].values],
        'color': ['mediumseagreen', 'gold', 'crimson'],
    })
    
    # Build and configure the figure
    fig = bkh.figure(
        x_axis_label = 'CPU time',
        x_range = bokeh.models.Range1d(0, 1.0),
        y_range=categories,
        width=600,
        height=400,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin processing at FrDF for Data Preview 0.2 (v23.0.1)", text_font_style="italic", text_font_size="12pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU time spent per task category", text_font_size="18pt"), 'above')

    # Axes
    fig.xaxis.axis_label_text_font_size = "14pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_size = "14pt"
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
    
    # Add bars
    bars = fig.hbar(y='category', left=0, right='percentage', height=0.5, color='color', source=source)
    
    # Add tooltips
    fig.add_tools(bkhmodels.HoverTool(tooltips=[('task category', '@category'), ('CPU time', '@percentage_tooltips{0.2f} %')], renderers=[bars], mode='mouse'))
    
    # Add formatters
    fig.xaxis[0].formatter = bokeh.models.NumeralTickFormatter(format="0%")

    return fig

In [420]:
category_fig = make_task_category_figure(categories_df)
bkh.show(category_fig)

print_md(f"""
The figure above presents the fraction of CPU time spent executing each category of tasks. Hover over the bars to get more details about each category. Tasks are categorized as small-, medium- and high-memory.
Small-memory tasks are those requiring up to {task_categories['small']} GB. Medium-memory tasks are those
requiring more than {task_categories['small']} GB and up to {task_categories['medium']} GB. High-memory tasks are those requiring more than {task_categories['medium']} GB.
""")

NameError: name 'categories_df' is not defined

In [421]:
def summarize_cpu_time_per_task_category(df: pd.DataFrame, categories: dict = {'small': 5, 'medium': 20, 'high': None}) -> pd.DataFrame:
    """Return a data frame with details about each task type. The columns of the dataframe are the percentage of CPU time spent for each task category.
    The returned data frame is indexed by task type.
    """
    
    out_df = pd.DataFrame()
    for name, group in df.groupby('task'):
        # Compute the CPU spent per task category, for this group
        is_small_memory_task = group['RSS_max'] <= categories['small']
        is_medium_memory_task = (group['RSS_max'] > categories['small']) & (group['RSS_max'] <= categories['medium'])
        is_high_memory_task = ~is_small_memory_task & ~is_medium_memory_task
        
        group_cpu_time = group['cpu_time'].sum()
        row = { 'task': name }
        row['small'] = group[is_small_memory_task]['cpu_time'].sum() / group_cpu_time
        row['medium'] = group[is_medium_memory_task]['cpu_time'].sum() / group_cpu_time
        row['high'] = group[is_high_memory_task]['cpu_time'].sum() / group_cpu_time
        out_df = pd.concat([out_df, pd.DataFrame.from_records(data=[row,], columns=row.keys())])

    out_df.set_index('task', inplace=True)
    out_df.sort_values(by='task', ascending=True, inplace=True)
    return out_df

In [422]:
def make_figure_category_within_task_type(df: pd.DataFrame) -> bkh.figure:
    """Return a figure representing the CPU time spent on each task category, for each kind of task
    """
    # Sort by the share of each memory category so that vertical reading makes sense
    df = df.sort_values(by=['small', 'medium', 'high'], ascending=True)

    # Build the data source
    source = bkhmodels.ColumnDataSource(data={
        'task': df.index.values,
        'small': df['small'].values,
        'medium': df['medium'].values,
        'high': df['high'].values,
    })
    
    # Build and configure the figure
    fig = bkh.figure(
        x_axis_label = 'CPU time',
        y_axis_label = 'task',
        x_range = bokeh.models.Range1d(0, 1.0),
        y_range = df.index.values,
        width = 1200,
        height = 1200,
        background_fill_color="#f4f3f3", 
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin processing at FrDF for Data Preview 0.2 (v23.0.1)", text_font_style="italic", text_font_size="12pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="CPU time spent per task category", text_font_size="18pt"), 'above')

    # Axes
    fig.xaxis.axis_label_text_font_size = "14pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_size = "14pt"
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
    fig.xaxis[0].formatter = bokeh.models.NumeralTickFormatter(format="0%")
    
    # Legend
    fig.add_layout(bokeh.models.Legend(), 'right')
    
    # Add bars
    bars = fig.hbar_stack(['small', 'medium', 'high'], y='task', color=['mediumseagreen', 'gold', 'crimson'], height=0.7, source=source, legend_label=['small-memory', 'medium-memory', 'high-memory'])
    
    # Add tooltips
    # fig.add_tools(bkhmodels.HoverTool(tooltips=[('task', '@task'), ('small memory', '@small{0.2f} %')], renderers=[bars], mode='mouse'))
    # fig.add_tools(bkhmodels.HoverTool(tooltips=[('task', '@task'),], renderers=[bars], mode='mouse'))
    return fig

In [423]:
tasks_df = summarize_cpu_time_per_task_category(df_all, task_categories)
all_tasks_fig = make_figure_category_within_task_type(tasks_df)
bkh.show(all_tasks_fig)
print_md(f"""
The figure above presents a more detailed view of the share of CPU time spent by each task category for every kind of pipeline task. Tasks are categorized as small-, medium- and high-memory.
Small-memory tasks are those requiring up to {task_categories['small']} GB. Medium-memory tasks are those
requiring more than {task_categories['small']} GB and up to {task_categories['medium']} GB. High-memory tasks are those requiring more than {task_categories['medium']} GB.
""")

AttributeError: 'DataFrame' object has no attribute 'groupby'

In [424]:
def build_boxplot_dataframe(df: pd.DataFrame, column: str) -> tuple[pd.DataFrame, pd.Series]:
    """Build a dataframe with data to populate a figure composed of box plots for each
    kind of task based on the values of the column ``column``.
    
    Parameters
    ----------
    df : pd.DataFrame
       dataframe which contains one row for each executed task.
       
    column: str
       name of the column in ``df`` to be used as criterion for building the boxplot
       data.
       
    Returns
    -------
    first: pd.DataFrame
       a dataframe with columns 'task', 'q0', 'q1', 'q2', 'q3', 'q4', 'lower' and 'upper'.
       The columns 'qN' contain the value of the corresponding quartile. The columns
       'lower' and 'upper' contain the values for the lower and upper whiskers of the
       box plot.
       This dataframe is indexed by task name.
       
    second: pd.Series
        values of ``column`` lying outside the whiskers, indexed by task name
        (first index level).
       
    Notes
    -----
    See also: https://en.wikipedia.org/wiki/Box_plot
    """
    # Build a new dataframe where each row contains values computed for each kind of task.
    # Those values are used later for building a figure with a boxplot per task type.
    # The values are the minimum, the three quartiles and the maximum of ``column``,
    # plus the lower and upper whisker limits derived from the inter-quartile range.
    out_df = pd.DataFrame()
    
    for name, group in df.group_by('task'):
        # Compute the minimum, the three quartiles and the maximum for this group
        q0, q1, q2, q3, q4 = np.percentile(group[column].dropna(), (0.0, 25.0, 50.0, 75.0, 100.0))
                
        # Compute inter-quartile range and values for lower and upper whiskers
        iqr = q3 - q1
        lower = max(q1 - 1.5*iqr, q0)
        upper = min(q3 + 1.5*iqr, q4)

        # Append a new row to the resulting dataframe for this kind of task
        row = {
            'task': name, 
            'q0': q0, 
            'q1': q1, 
            'q2': q2,
            'q3': q3,
            'q4': q4,
            'lower': lower,
            'upper': upper,
        }
        out_df = pd.concat([out_df, pd.DataFrame.from_records(data=[row,], columns=row.keys())])

    # Set the dataframe index to the task name
    out_df.set_index('task', inplace=True)
    
    # Select outliers
    def select_outliers(group):
        """Return outliers for the given DataFrame group. It expects a single group per task.

        Outliers are those data points below the task's lower limit or above its upper limit.
        """
        task = group.name
        lower = out_df.loc[task]['lower']
        upper = out_df.loc[task]['upper']
        is_outlier = (group[column] < lower) | (group[column] > upper)
        return group[is_outlier][column]

    outliers = df.group_by('task').apply(select_outliers).dropna()
    return out_df, outliers

In [425]:
def make_figure_boxplot_rss(df: pd.DataFrame, unit: str, outliers: pd.Series) -> bkh.figure:
    """Builds a figure with a boxplot per task
    """
    # Sort the dataframe by median value
    df = df.sort_values(by='q2', ascending=False)
    
    # Retrieve the names of the tasks
    tasks = df.index.values
    
    # Build and configure the figure
    fig = bkh.figure(
        x_axis_label = 'max RSS (gigabyte)',
        y_axis_label = 'pipeline task',
        background_fill_color = "#f4f3f3",
        y_range = tasks,
        width = 1200,
        height = 1400
    )

    # Title and subtitle
    fig.add_layout(bkhmodels.Title(text="Rubin processing at FrDF for Data Preview 0.2 (v23.0.1)", text_font_style="italic", text_font_size="12pt"), 'above')
    fig.add_layout(bkhmodels.Title(text="Memory Consumption by Pipeline Tasks", text_font_size="18pt"), 'above')

    # Axes
    fig.xaxis.axis_label_text_font_size = "14pt"
    fig.xaxis.axis_label_text_font_style = "bold"
    fig.xaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_size = "14pt"
    fig.yaxis.major_label_text_font_size = "12pt"
    fig.yaxis.axis_label_text_font_style = "bold"
    
    # Data source
    data_source = bkhmodels.ColumnDataSource({
        'task': df.index.values,
        'q0': df['q0'].values,
        'q1': df['q1'].values,
        'q2': df['q2'].values,
        'q3': df['q3'].values,
        'q4': df['q4'].values,
        'lower': df['lower'].values,
        'upper': df['upper'].values,
    })

    # Stems
    line_color = 'black'
    fig.segment(x0='q3', y0='task', x1='upper', y1='task', source=data_source, line_color=line_color)
    fig.segment(x0='lower', y0='task', x1='q1', y1='task', source=data_source, line_color=line_color)

    # Boxes
    box_fill_color = 'tan'
    box_height = 0.8
    boxes = fig.hbar(y='task', height=box_height, right='q3', left='q1', source=data_source, fill_color=box_fill_color, line_color=box_fill_color)

    # Add tooltips for boxes
    fig.add_tools(
        bkhmodels.HoverTool(
            tooltips=[('task', '@task'), ('min', f'@q0{{0.2}} {unit}'), ('median', f'@q2{{0.}} {unit}'),  ('max', f'@q4{{0.}} {unit}')],
            renderers=[boxes],
            mode='mouse'
        )
    )

    # Whiskers
    whisker_height = box_height * 0.50
    fig.rect(x='lower', y='task', width=0.001, height=whisker_height, source=data_source, line_color=line_color, fill_color=line_color)
    fig.rect(x='upper', y='task', width=0.001, height=whisker_height, source=data_source, line_color=line_color, fill_color=line_color)

    # Median
    median_color = 'red'
    fig.rect(x='q2', y='task', width=0.001, height=box_height, source=data_source, line_color=median_color, fill_color=median_color)
    
    # Outliers
    if not outliers.empty:
        outliers_data = bkhmodels.ColumnDataSource({
            'x': list(outliers.values),
            'task': list(outliers.index.get_level_values(0)),
        })
        outliers_color = '#F38630' # 'darksalmon'
        circles = fig.circle(x='x', y='task', source=outliers_data, size=6, color=outliers_color, fill_alpha=0.6)
        fig.add_tools(bkhmodels.HoverTool(tooltips=[('task', '@task'), ('max RSS', f'$x{{0.1}} {unit}')], renderers=[circles], mode='mouse'))

    return fig

In [426]:
rss_boxplot_df, rss_outliers = build_boxplot_dataframe(df_all, column='RSS_max')
box_plot_rss_fig = make_figure_boxplot_rss(rss_boxplot_df, 'GB', rss_outliers)
bkh.show(box_plot_rss_fig)

ColumnNotFoundError: "RSS_max" not found
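
The lookup fails because `df_all` exposes the peak RSS under the column `memory`; the name `RSS_max` only exists as an alias in the aggregated `df_per_task`. Independently of the dataframe library, the statistics behind each box can be sketched with the standard library alone. The helper `boxplot_stats` and the sample values below are hypothetical. The whisker rule (1.5 times the inter-quartile range, clamped to the observed minimum and maximum) follows `build_boxplot_dataframe`.

```python
import statistics


def boxplot_stats(values):
    """Return quartiles, whisker limits and outliers for a sequence of numbers."""
    data = sorted(values)
    if not data:
        raise ValueError("at least one value is required")
    q0, q4 = data[0], data[-1]
    if len(data) == 1:
        # Many kinds of task were executed only once: all quartiles coincide
        q1 = q2 = q3 = data[0]
    else:
        # method="inclusive" interpolates linearly between observed values
        q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

    # Whiskers: 1.5 * IQR beyond the box, clamped to the observed range
    iqr = q3 - q1
    lower = max(q1 - 1.5 * iqr, q0)
    upper = min(q3 + 1.5 * iqr, q4)
    outliers = [v for v in data if v < lower or v > upper]
    return {
        "q0": q0, "q1": q1, "q2": q2, "q3": q3, "q4": q4,
        "lower": lower, "upper": upper, "outliers": outliers,
    }


# Hypothetical peak RSS values (GB) for one kind of task
sample_rss_gb = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 30.0]
stats = boxplot_stats(sample_rss_gb)
print(stats)
```

Handling the single-value case explicitly matters here because several tasks in the summary table ran only once.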

## CPU efficiency

In [427]:
# Compare the CPU efficiency of step 4 tasks, split by run.
# Some runs were executed after downloading the input data to the worker's local disk, while others
# accessed the input data directly from the remote server
df_step_4 = df[df['step'] == 4]

run = 4
is_group_1 = df_step_4['run'] <= run
is_group_2 = ~(df_step_4['run'] <= run)
cpu_efficiency_group_1 = df_step_4[is_group_1]['cpu_efficiency']
cpu_efficiency_group_1_mean, cpu_efficiency_group_1_std = cpu_efficiency_group_1.mean(), cpu_efficiency_group_1.std()
overall_cpu_efficiency_group_1 = df_step_4[is_group_1]['cpu_time'].sum() / df_step_4[is_group_1]['utc_time'].sum()

cpu_efficiency_group_2 = df_step_4[is_group_2]['cpu_efficiency']
cpu_efficiency_group_2_mean, cpu_efficiency_group_2_std = cpu_efficiency_group_2.mean(), cpu_efficiency_group_2.std()
overall_cpu_efficiency_group_2 = df_step_4[is_group_2]['cpu_time'].sum() / df_step_4[is_group_2]['utc_time'].sum()

summary = f"""
### CPU efficiency of step 4 tasks:

| run         | overall                                          | mean per task                               | std per task                                |
| ----------- | -----------------------------------------------: | ------------------------------------------: | ------------------------------------------: |
| **≤ {run}** | {100.0 * overall_cpu_efficiency_group_1:.1f}%    | {100.0 * cpu_efficiency_group_1_mean:.1f}%  | {100.0 * cpu_efficiency_group_1_std:.1f}%   |
| **> {run}** | {100.0 * overall_cpu_efficiency_group_2:.1f}%    | {100.0 * cpu_efficiency_group_2_mean:.1f}%  | {100.0 * cpu_efficiency_group_2_std:.1f}%   |

({df_step_4.shape[0]:,} tasks)
"""
print_md(summary)

NameError: name 'df' is not defined
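
The cell fails because `df` is not defined in this version of the notebook, and CPU time is not recorded anyway. The distinction the table is meant to draw can still be illustrated on made-up numbers: the overall efficiency (total CPU time over total elapsed time) weights long tasks more heavily than the mean of the per-task efficiencies. In the sketch below, `cpu_time` and `elapsed_time` are hypothetical field names, both assumed to be in seconds.

```python
import statistics

# Hypothetical tasks: two short efficient ones and one long, mostly idle one
tasks = [
    {"cpu_time": 90.0, "elapsed_time": 100.0},
    {"cpu_time": 45.0, "elapsed_time": 50.0},
    {"cpu_time": 100.0, "elapsed_time": 1000.0},
]

# Efficiency of each task taken individually
per_task = [t["cpu_time"] / t["elapsed_time"] for t in tasks]
mean_efficiency = statistics.mean(per_task)
std_efficiency = statistics.stdev(per_task)  # sample standard deviation

# Efficiency of the group as a whole: long tasks dominate the ratio
overall_efficiency = (
    sum(t["cpu_time"] for t in tasks) / sum(t["elapsed_time"] for t in tasks)
)

print(f"overall: {100.0 * overall_efficiency:.1f}%")
print(f"mean per task: {100.0 * mean_efficiency:.1f}%")
print(f"std per task: {100.0 * std_efficiency:.1f}%")
```

Reporting both figures, as the table does, makes it visible when a few long-running tasks with poor efficiency dominate the total.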

In [428]:
# Estimate the CPU efficiency distribution of one kind of task with a Gaussian kernel density estimate
from scipy.stats import gaussian_kde

efficiencies = df[df['task'] == 'makeWarp']['cpu_efficiency']
pdf = gaussian_kde(efficiencies)

NameError: name 'df' is not defined

In [None]:
tasks = ('makeWarp',)

x = np.linspace(start=0, stop=100, num=10)
source = bokeh.models.ColumnDataSource(data=dict(x=x))

def ridge(category, data, scale=100):
    return list(zip([category]*len(data), scale*data))

p = bokeh.plotting.figure(y_range=tasks, width=900, x_range=(-5, 105), toolbar_location=None)

y = ridge('makeWarp', pdf(x))
source.add(y, 'makeWarp')
p.patch('x', 'makeWarp', color='blue', alpha=0.6, line_color="black", source=source)

bokeh.plotting.show(p)
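
The ridge plot cell above relies on `scipy.stats.gaussian_kde` to turn a sample of per-task CPU efficiencies into a smooth density. As a rough sketch of what such an estimator computes, the standard-library version below places a Gaussian kernel on each sample and picks the bandwidth with Scott's rule of thumb. The function name `gaussian_kde_1d` and the sample values are hypothetical, and efficiencies are assumed to be fractions between 0 and 1. It is an illustration, not a replacement for the SciPy implementation.

```python
import math
import statistics


def gaussian_kde_1d(samples, bandwidth=None):
    """Return a function estimating the density of ``samples`` (hypothetical helper)."""
    n = len(samples)
    if bandwidth is None:
        if n < 2:
            raise ValueError("at least two samples are needed to estimate a bandwidth")
        # Scott's rule of thumb for one-dimensional data
        bandwidth = statistics.stdev(samples) * n ** (-1.0 / 5.0)
    if bandwidth <= 0:
        raise ValueError("the bandwidth must be strictly positive")
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        # Sum of Gaussian kernels centred on each sample
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return pdf


# Hypothetical CPU efficiencies for one kind of task, as fractions
efficiencies = [0.55, 0.60, 0.62, 0.70, 0.71, 0.75, 0.80, 0.90]
pdf = gaussian_kde_1d(efficiencies)

# Evaluate on a grid wide enough to contain essentially all the density
step = 0.01
grid = [i * step for i in range(-100, 201)]
density = [pdf(x) for x in grid]
area = sum(density) * step
print(f"area under the estimated density: {area:.3f}")
```

If the recorded efficiencies are indeed fractions, the evaluation grid should span 0 to 1 rather than 0 to 100, and it needs enough points to resolve the kernel width.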

## TODO

* Categorize tasks per memory consumption
* Add a plot to show tasks ordered by memory consumption
* Add a plot to show tasks by CPU elapsed time
* Add a plot to show tasks by CPU efficiency

* For each kind of task, display the 98th percentile of CPU efficiency using horizontal bars or alternatively a [ridge plot](https://docs.bokeh.org/en/latest/docs/gallery/ridgeplot.html) for showing the distribution of CPU efficiency
* Improve the notebook's comments in preparation for publication
* Create git repo and publish the results