## Introduction

`ptool` aims to provide a comprehensive analysis of a pool that exists on different sites.
It is designed to surface the information needed either to clean up a pool or
to sync missing files from the same pool on the other site.
To perform the analysis, the tool requires the checksum (md5) of every
file in the pool at both sites. As this operation is time-consuming,
the workflow is divided into two phases.

The first phase gathers the checksums of all the files into a csv file.
The tool helps generate the run scripts required to gather the checksum information.
After the checksum files are created at the different sites, they can be transferred to
a single location (say, a local machine or either of the sites) to perform the analysis.
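
The first phase can be pictured with a minimal sketch like the one below. It is not the actual `checksums.py`: the function names `md5sum` and `write_checksums`, the `ignore` handling, and the CSV column layout (`checksum`, `fsize`, `mtime`, `fpath`) are all assumptions made for illustration.

```python
import csv
import hashlib
import os


def md5sum(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(pool_path, out_csv, ignore=()):
    """Walk pool_path and write one CSV row per file (hypothetical layout)."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["checksum", "fsize", "mtime", "fpath"])
        for root, dirs, files in os.walk(pool_path):
            # Prune ignored directories (e.g. ".git") in place.
            dirs[:] = [d for d in dirs if d not in ignore]
            for name in sorted(files):
                if name in ignore:
                    continue
                fpath = os.path.join(root, name)
                stat = os.stat(fpath)
                writer.writerow(
                    [md5sum(fpath), stat.st_size, stat.st_mtime, fpath]
                )
```

Reading files in chunks keeps memory use flat even for very large pool files, which is why the digest is updated incrementally rather than from a single `read()`.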

The second phase is the analysis. Its aim is to understand
the current state of the pool at both sites and to obtain actionable information,
for instance for a clean-up or a sync operation.


### Analysis part

In [1]:
import ptool

In [2]:
levante_csv = "checksum_fesom2_levante.csv"
albedo_csv = "checksum_fesom2_albedo.csv"

The first thing to do is to get an overview by calling the `summary` function.

In [3]:
ptool.summary(levante_csv, albedo_csv)

Table 1: Summary with respect to LEVANTE site

                 levante                      albedo
---------------  ---------------------------  --------------------------
pool             fesom2                       fesom2
checksum file    checksum_fesom2_levante.csv  checksum_fesom2_albedo.csv
prefix           /pool/data/AWICM/FESOM2/     /albedo/pool/FESOM2/
files            133387 (25.4 TB)             123946 (1.1 TB)
duplicate files  17720 (2.7 TB)               11463 (26.2 GB)
identical files  18101 (40.8 GB)              18101 (40.8 GB)
unique files     88196 (25.4 TB)              nan
renamed files    nan                          9 (896.2 MB)
modified files   nan                          27081 (314.6 MB)
----------------------------------------------------------------------
Table 2: Common directory mapping

    rparent_levante                 rparent_albedo
--  ------------------------------  ---------------------------
 0  MESHES_FESOM2.1/core2           CORE2
 1  MESHES_FE

In the above overview, notice that the `dist_` directories were also included. These can be omitted by providing the `ignore` argument.

In [4]:
ptool.summary(levante_csv, albedo_csv, ignore='dist_')

Table 1: Summary with respect to LEVANTE site

                 levante                      albedo
---------------  ---------------------------  --------------------------
pool             fesom2                       fesom2
checksum file    checksum_fesom2_levante.csv  checksum_fesom2_albedo.csv
prefix           /pool/data/AWICM/FESOM2/     /albedo/pool/FESOM2/
files            2791 (25.4 TB)               1359 (1.0 TB)
duplicate files  90 (896.5 GB)                218 (25.4 GB)
identical files  472 (37.8 GB)                472 (37.8 GB)
unique files     2299 (25.4 TB)               nan
renamed files    nan                          9 (896.2 MB)
modified files   nan                          6 (17.4 MB)
----------------------------------------------------------------------
Table 2: Common directory mapping

    rparent_levante        rparent_albedo
--  ---------------------  --------------------------
 0  MESHES_FESOM2.1/core2  CORE2
 1  MESHES/BOLD_RT_FIXED   HR
 2  MESHES/CORE2/figur

The `summary` function compares the `fesom2` pool on the `Levante` site to the `fesom2` pool on the `Albedo` site.
The order of the arguments matters: swapping them yields different results.
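
One reason the order matters is that quantities such as the unique files are computed relative to the first argument. The toy sketch below uses made-up checksums and plain set differences; it illustrates the asymmetry only and is not how `ptool` computes its summary.

```python
# Made-up checksums standing in for the files at each site.
levante = {"aa", "bb", "cc", "dd"}
albedo = {"aa", "ee"}

# Files present only at the first (left) site of each comparison.
unique_wrt_levante = levante - albedo
unique_wrt_albedo = albedo - levante

print(sorted(unique_wrt_levante))
print(sorted(unique_wrt_albedo))
```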

In [5]:
ptool.summary(albedo_csv, levante_csv, ignore='dist_')

Table 1: Summary with respect to ALBEDO site

                 albedo                      levante
---------------  --------------------------  ---------------------------
pool             fesom2                      fesom2
checksum file    checksum_fesom2_albedo.csv  checksum_fesom2_levante.csv
prefix           /albedo/pool/FESOM2/        /pool/data/AWICM/FESOM2/
files            1359 (1.0 TB)               2791 (25.4 TB)
duplicate files  218 (25.4 GB)               90 (896.5 GB)
identical files  474 (37.8 GB)               474 (37.8 GB)
modified files   36 (6.9 GB)                 nan
unique files     849 (999.3 GB)              nan
renamed files    nan                         9 (896.2 MB)
----------------------------------------------------------------------
Table 2: Common directory mapping

    rparent_albedo              rparent_levante
--  --------------------------  ---------------------
 0  forcing/CORE2               FORCING/CORE2
 1  hydrography                 INITIAL/phc3.

The important information from this summary is as follows:
In Table 1, the `modified files` are important to consider because they get overwritten in a (directory-level) sync operation unless they are omitted. The `renamed files` end up as duplicate files in a similar (directory-level) sync operation if they are not accounted for.

Table 2 shows the directory mapping of folders that are associated through matching identical, modified, or renamed files. The `rparent` in the header is the relative parent path of a file. Note that Table 2 always shows a one-to-one mapping of the associated folders, although in reality one-to-many or many-to-one folder associations are possible. In such cases, the tool chooses the pair of folders with the maximum count of associations (number of identical, renamed, and modified files).
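
The "maximum count" rule can be sketched with pandas as follows. The column names `rparent_left` and `rparent_right` and the sample rows are hypothetical, and the internal logic of `ptool` may differ; this only shows one way to keep, for each left folder, the right folder it shares the most associated files with.

```python
import pandas as pd

# Hypothetical file-level associations: one row per matched file pair.
pairs = pd.DataFrame(
    {
        "rparent_left": ["A", "A", "A", "B", "B"],
        "rparent_right": ["X", "X", "Y", "Y", "Y"],
    }
)

# Count how many associated files each (left, right) folder pair shares.
counts = (
    pairs.groupby(["rparent_left", "rparent_right"])
    .size()
    .reset_index(name="count")
)

# For each left folder, keep the right folder with the most associations.
best = (
    counts.sort_values("count", ascending=False, kind="mergesort")
    .drop_duplicates("rparent_left")
    .sort_values("rparent_left")
    .reset_index(drop=True)
)

print(best)
```

Here folder `A` is associated with both `X` (two files) and `Y` (one file), so only the `A` to `X` pairing survives.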

The next step is to get the associated files.

In [6]:
df_levante, df_levante_duplicates = ptool.read_csv(levante_csv, ignore='dist_')
df_albedo, df_albedo_duplicates = ptool.read_csv(albedo_csv, ignore='dist_')

Keep in mind that the order of the arguments matters.

In [7]:
cmp_LA = ptool.compare_compact(df_levante, df_albedo)
cmp_LA.head()

flag,rpath_left,rpath_right
identical,MESHES_FESOM2.1/core2/fesom.mesh.diag.nc,CORE2/fesom.mesh.diag.nc
identical,MESHES_FESOM2.1/core2/core2_griddes_elements.nc,CORE2/core2_griddes_elements.nc
identical,MESHES_FESOM2.1/core2/core2_griddes_nodes.nc,CORE2/core2_griddes_nodes.nc
identical,MESHES_FESOM2.1/core2/distances_126858_-180.0_...,CORE2/distances_126858_-180.0_180.0_60.0_90.0_...
identical,MESHES_FESOM2.1/core2/nod2d.out,CORE2/nod2d.out


In the above data-frame, `rpath` stands for the relative path to the file. The `flag` index holds the association names.

In [8]:
cmp_LA.index.unique()

Index(['identical', 'renamed', 'modified_latest_right', 'unique'], dtype='object', name='flag')

`modified_latest_right` means that the most recent version of the file (by timestamp) is in the right column. Modified files are files that have the same file name on both sites but whose checksums do not match.

In [9]:
cmp_LA.loc['modified_latest_right']

flag,rpath_left,rpath_right
modified_latest_right,MESHES_FESOM2.1/core2/pickle_mesh_py3_fesom2,CORE2/pickle_mesh_py3_fesom2
modified_latest_right,MESHES_FESOM2.1/core2/distances_126858_-180.0_...,CORE2/distances_126858_-180.0_180.0_-80.0_90.0...
modified_latest_right,MESHES_FESOM2.1/core2/inds_126858_-180.0_180.0...,CORE2/inds_126858_-180.0_180.0_-80.0_90.0_360_...
modified_latest_right,MESHES_FESOM2.1/core2/distances_126858_-180.0_...,CORE2/distances_126858_-180.0_180.0_-90.0_90.0...
modified_latest_right,MESHES_FESOM2.1/core2/inds_126858_-180.0_180.0...,CORE2/inds_126858_-180.0_180.0_-90.0_90.0_360_...
modified_latest_right,MESHES_FESOM2.1/core2/distances_126858_-180.0_...,CORE2/distances_126858_-180.0_180.0_-89.0_90.0...
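
Since modified files would be overwritten by a directory-level sync, it can help to collect them into an exclude list first. The sketch below builds a toy frame shaped like the `compare_compact` output (the rows are made up) and selects every flag that begins with `modified`. Which column to use, and whether the paths need adjusting to the transfer root, depends on the sync direction and is left as an assumption here.

```python
import pandas as pd

# Toy frame shaped like the compare_compact output (made-up rows).
cmp_df = pd.DataFrame(
    {
        "rpath_left": ["a/x.nc", "a/y.nc", "a/z.nc"],
        "rpath_right": ["b/x.nc", "b/y.nc", None],
    },
    index=pd.Index(
        ["identical", "modified_latest_right", "unique"], name="flag"
    ),
)

# Boolean mask over the index: True for any flag starting with "modified".
is_modified = cmp_df.index.str.startswith("modified")

# Paths to leave untouched during a sync (right-hand side chosen as an example).
exclude = cmp_df.loc[is_modified, "rpath_right"].dropna().tolist()

print(exclude)
```

The resulting list could, for example, be written one path per line and passed to a sync tool that accepts an exclude file.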


Some terminology:

- `identical`: both checksum and filename match
- `renamed`: checksum matches but the filename does not
- `modified`: filename matches but the checksum does not
- `unique`: files that are site-specific (i.e., not available on the other site)

- `rpath`: relative path to file
- `rparent`: relative parent path to file
- `fpath`: full path to file
- `fname`: only the filename
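
The four association flags defined above can be expressed as a small classification routine. This is a sketch under stated assumptions, not the `ptool` implementation: the `classify` function is hypothetical, each site is reduced to a mapping from relative path to checksum, and ties between several candidate matches are resolved by simply taking the first one.

```python
import os


def classify(left, right):
    """Classify files in `left` against `right`.

    Both arguments map a relative path to its checksum. Returns a dict
    mapping each left path to (flag, matching right path or None).
    """
    by_checksum = {}
    by_name = {}
    for rpath, checksum in right.items():
        by_checksum.setdefault(checksum, []).append(rpath)
        by_name.setdefault(os.path.basename(rpath), []).append(rpath)

    result = {}
    for rpath, checksum in left.items():
        fname = os.path.basename(rpath)
        same_sum = by_checksum.get(checksum, [])
        same_name_and_sum = [
            p for p in same_sum if os.path.basename(p) == fname
        ]
        if same_name_and_sum:
            result[rpath] = ("identical", same_name_and_sum[0])
        elif same_sum:
            result[rpath] = ("renamed", same_sum[0])
        elif fname in by_name:
            result[rpath] = ("modified", by_name[fname][0])
        else:
            result[rpath] = ("unique", None)
    return result


# Made-up paths and checksums for illustration.
left = {"m/a.nc": "c1", "f/b.nc": "c2", "m/c.nc": "c3", "m/d.nc": "c4"}
right = {"M/a.nc": "c1", "F/b_old.nc": "c2", "M/c.nc": "c9"}
flags = classify(left, right)
```

Checking the checksum before the filename means a file that matches on both is reported as `identical` rather than `modified`, which mirrors the order of the definitions in the list.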

Getting the list of `renamed` files:

In [10]:
cmp_LA.loc['renamed']

flag,rpath_left,rpath_right
renamed,FORCING/CORE2/t_10.2008.nc.bak,forcing/CORE2/t_10.2007.nc
renamed,FORCING/CORE2/u_10.2009.nc,forcing/CORE2/u_10.2009.23OCT2012.nc
renamed,FORCING/CORE2/u_10.2008.nc.bak,forcing/CORE2/u_10.2007.nc
renamed,FORCING/CORE2/ncar_rad.2009.nc,forcing/CORE2/ncar_rad.2009.23OCT2012.nc
renamed,FORCING/CORE2/q_10.2008.nc.bak,forcing/CORE2/q_10.2007.nc
renamed,FORCING/CORE2/slp.2008.nc,forcing/CORE2/slp.2008.23OCT2012.nc
renamed,FORCING/CORE2/u_10.2008_ori.nc,forcing/CORE2/u_10.2008.23OCT2012.nc
renamed,FORCING/CORE2/q_10.2009.nc,forcing/CORE2/q_10.2009.23OCT2012.nc
renamed,FORCING/CORE2/t_10.2008_ori.nc,forcing/CORE2/t_10.2008.23OCT2012.nc


Instead of `rpath_left` and `rpath_right`, the site names can be used as the column suffix.

In [11]:
cmp_LA = ptool.compare_compact(df_levante, df_albedo, relabel=True)

In [12]:
cmp_LA.loc['renamed']

flag,rpath_levante,rpath_albedo
renamed,FORCING/CORE2/t_10.2008.nc.bak,forcing/CORE2/t_10.2007.nc
renamed,FORCING/CORE2/u_10.2009.nc,forcing/CORE2/u_10.2009.23OCT2012.nc
renamed,FORCING/CORE2/u_10.2008.nc.bak,forcing/CORE2/u_10.2007.nc
renamed,FORCING/CORE2/ncar_rad.2009.nc,forcing/CORE2/ncar_rad.2009.23OCT2012.nc
renamed,FORCING/CORE2/q_10.2008.nc.bak,forcing/CORE2/q_10.2007.nc
renamed,FORCING/CORE2/slp.2008.nc,forcing/CORE2/slp.2008.23OCT2012.nc
renamed,FORCING/CORE2/u_10.2008_ori.nc,forcing/CORE2/u_10.2008.23OCT2012.nc
renamed,FORCING/CORE2/q_10.2009.nc,forcing/CORE2/q_10.2009.23OCT2012.nc
renamed,FORCING/CORE2/t_10.2008_ori.nc,forcing/CORE2/t_10.2008.23OCT2012.nc


It is also possible to get the full path instead of the relative path.

In [13]:
cmp_LA = ptool.compare_compact(df_levante, df_albedo, columns='fpath', relabel=True)

In [14]:
cmp_LA.head()

flag,fpath_levante,fpath_albedo
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...,/albedo/pool/FESOM2/CORE2/fesom.mesh.diag.nc
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...,/albedo/pool/FESOM2/CORE2/core2_griddes_elemen...
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...,/albedo/pool/FESOM2/CORE2/core2_griddes_nodes.nc
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...,/albedo/pool/FESOM2/CORE2/distances_126858_-18...
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...,/albedo/pool/FESOM2/CORE2/nod2d.out


Everything covered so far is also available from the command line interface (CLI).

### Command line interface

In [15]:
! ptool --help

Usage: ptool [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  compare    Compare csv files containing checksum to infer the status of...
  config     shows config information for a given site
  runscript  makes run script for job submission
  summary    Prints a short summary by analysing csv files.


In [16]:
! ptool summary --help

Usage: ptool summary [OPTIONS] LEFT RIGHT

  Prints a short summary by analysing csv files.

  LEFT: csv file containing checksums of all files in the pool for a given
  project and HPC site.

  RIGHT: similar file as LEFT but from different HPC site for the same
  project.

Options:
  --ignore TEXT  ignores directory and files
  --help         Show this message and exit.


In [17]:
! ptool summary --ignore dist_ checksum_fesom2_levante.csv checksum_fesom2_albedo.csv

Table 1: Summary with respect to LEVANTE site

                 levante                      albedo
---------------  ---------------------------  --------------------------
pool             fesom2                       fesom2
checksum file    checksum_fesom2_levante.csv  checksum_fesom2_albedo.csv
prefix           /pool/data/AWICM/FESOM2/     /albedo/pool/FESOM2/
files            2791 (25.4 TB)               1359 (1.0 TB)
duplicate files  90 (896.5 GB)                218 (25.4 GB)
identical files  472 (37.8 GB)                472 (37.8 GB)
unique files     2299 (25.4 TB)               nan
renamed files    nan                          9 (896.2 MB)
modified files   nan                          6 (17.4 MB)
----------------------------------------------------------------------
Table 2: Common directory mapping

    rparent_levante        rparent_albedo
--  ---------------------  --------------------------
 0  MESHES_FESOM2.1/core2  CORE2
 1  MESHES/BOLD_RT_FIXED   HR
 2  MESHES/CORE2/figur

In [18]:
! ptool compare --help

Usage: ptool compare [OPTIONS] LEFT RIGHT

  Compare csv files containing checksum to infer the status of data in these
  data pools. The results include, synced files at both HPC sites. unsynced
  files. directory mapping of synced files. filename mis-matches.

  LEFT: csv file containing checksums of all files in the pool for a given
  project and HPC site.

  RIGHT: similar file as LEFT but from different HPC site for the same
  project.

Options:
  -o, --outfile FILENAME  csv file to write results
  --fullpath              displays full path instead of relative path
  --ignore TEXT           ignores directory and files
  --help                  Show this message and exit.


In [19]:
! ptool compare -o LA.csv --fullpath --ignore dist_ checksum_fesom2_levante.csv checksum_fesom2_albedo.csv

Writing results as csv to file LA.csv
                                               fpath_levante                                       fpath_albedo
flag                                                                                                           
identical  /pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...       /albedo/pool/FESOM2/CORE2/fesom.mesh.diag.nc
identical  /pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...  /albedo/pool/FESOM2/CORE2/core2_griddes_elemen...
identical  /pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...   /albedo/pool/FESOM2/CORE2/core2_griddes_nodes.nc
identical  /pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...  /albedo/pool/FESOM2/CORE2/distances_126858_-18...
identical  /pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/...                /albedo/pool/FESOM2/CORE2/nod2d.out
...                                                      ...                                                ...
unique     /pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4....      

In [20]:
! grep modified LA.csv

modified_latest_right,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/pickle_mesh_py3_fesom2,/albedo/pool/FESOM2/CORE2/pickle_mesh_py3_fesom2
modified_latest_right,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/distances_126858_-180.0_180.0_-80.0_90.0_360_180_1,/albedo/pool/FESOM2/CORE2/distances_126858_-180.0_180.0_-80.0_90.0_360_180_1
modified_latest_right,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/inds_126858_-180.0_180.0_-80.0_90.0_360_180_1,/albedo/pool/FESOM2/CORE2/inds_126858_-180.0_180.0_-80.0_90.0_360_180_1
modified_latest_right,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/distances_126858_-180.0_180.0_-90.0_90.0_360_180_1,/albedo/pool/FESOM2/CORE2/distances_126858_-180.0_180.0_-90.0_90.0_360_180_1
modified_latest_right,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/inds_126858_-180.0_180.0_-90.0_90.0_360_180_1,/albedo/pool/FESOM2/CORE2/inds_126858_-180.0_180.0_-90.0_90.0_360_180_1
modified_latest_right,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/distances_126858_-180.0_180.

In [21]:
! grep renamed LA.csv

renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/t_10.2008.nc.bak,/albedo/pool/FESOM2/forcing/CORE2/t_10.2007.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/u_10.2009.nc,/albedo/pool/FESOM2/forcing/CORE2/u_10.2009.23OCT2012.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/u_10.2008.nc.bak,/albedo/pool/FESOM2/forcing/CORE2/u_10.2007.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/ncar_rad.2009.nc,/albedo/pool/FESOM2/forcing/CORE2/ncar_rad.2009.23OCT2012.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/q_10.2008.nc.bak,/albedo/pool/FESOM2/forcing/CORE2/q_10.2007.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/slp.2008.nc,/albedo/pool/FESOM2/forcing/CORE2/slp.2008.23OCT2012.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/u_10.2008_ori.nc,/albedo/pool/FESOM2/forcing/CORE2/u_10.2008.23OCT2012.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/q_10.2009.nc,/albedo/pool/FESOM2/forcing/CORE2/q_10.2009.23OCT2012.nc
renamed,/pool/data/AWICM/FESOM2/FORCING/CORE2/t_10.2008_ori.nc,/albedo/pool/FESO

In [22]:
! head LA.csv

flag,fpath_levante,fpath_albedo
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/fesom.mesh.diag.nc,/albedo/pool/FESOM2/CORE2/fesom.mesh.diag.nc
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/core2_griddes_elements.nc,/albedo/pool/FESOM2/CORE2/core2_griddes_elements.nc
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/core2_griddes_nodes.nc,/albedo/pool/FESOM2/CORE2/core2_griddes_nodes.nc
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/distances_126858_-180.0_180.0_60.0_90.0_360_180_1,/albedo/pool/FESOM2/CORE2/distances_126858_-180.0_180.0_60.0_90.0_360_180_1
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/nod2d.out,/albedo/pool/FESOM2/CORE2/nod2d.out
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/inds_126858_-180.0_180.0_60.0_90.0_360_180_1,/albedo/pool/FESOM2/CORE2/inds_126858_-180.0_180.0_60.0_90.0_360_180_1
identical,/pool/data/AWICM/FESOM2/MESHES_FESOM2.1/core2/inds_126858_-180.0_180.0_-89.0_90.0_360_180_1,/albedo/pool/FESOM2/CORE2/in
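
The csv written by `ptool compare -o` can also be loaded back into pandas for further filtering instead of using `grep`. The sketch below assumes the three-column header shown above and uses made-up sample rows in place of the real `LA.csv`. How `unique` rows are actually written is not shown in the output above, so the empty right-hand field is an assumption.

```python
import io

import pandas as pd

# Made-up rows mimicking the layout of LA.csv.
sample = io.StringIO(
    "flag,fpath_levante,fpath_albedo\n"
    "identical,/left/a.nc,/right/a.nc\n"
    "renamed,/left/b.nc,/right/b_old.nc\n"
    "unique,/left/c.nc,\n"
)

# Use the flag column as the index, mirroring the data-frame shown earlier.
df = pd.read_csv(sample, index_col="flag")

# Number of entries per association flag.
counts = df.index.value_counts()

# Rows for a single flag; passing a list keeps the result a DataFrame.
renamed = df.loc[["renamed"]]

print(counts)
print(renamed)
```

With the real file, `sample` would be replaced by the path `"LA.csv"`.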

In [23]:
! ptool --help

Usage: ptool [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  compare    Compare csv files containing checksum to infer the status of...
  config     shows config information for a given site
  runscript  makes run script for job submission
  summary    Prints a short summary by analysing csv files.


In [24]:
! ptool config --help

Usage: ptool config [OPTIONS] [SITE] [POOL]

  shows config information for a given site

Options:
  --all   Gets the whole config in the absence of arguments.
  --help  Show this message and exit.


In [25]:
! ptool config --all

albedo:
  extras:
  - module load analysis-toolbox/03.2023
  pool:
    fesom2:
      ignore:
      - .git
      output: ~/checksum_fesom2_albedo.csv
      path: /albedo/pool/FESOM2
  slurm:
    --mem: 100GB
    --qos: 12h
    -p: mpp
    -t: '3:00:00'
levante:
  extras:
  - module load python3/2023.01-gcc-11.2.0
  pool:
    fesom:
      ignore:
      - .git
      output: ~/checksum_fesom1_levante.csv
      path: /pool/data/AWICM/FESOM1
    fesom2:
      ignore:
      - .git
      - __pycache__
      output: ~/checksum_fesom2_levante.csv
      path: /pool/data/AWICM/FESOM2
  slurm:
    -A: ab0246
    -p: compute
    -t: '3:00:00'



In [26]:
! ptool config levante fesom2

levante:
  extras:
  - module load python3/2023.01-gcc-11.2.0
  pool:
    fesom2:
      ignore:
      - .git
      - __pycache__
      output: ~/checksum_fesom2_levante.csv
      path: /pool/data/AWICM/FESOM2
  slurm:
    -A: ab0246
    -p: compute
    -t: '3:00:00'



In [27]:
! ptool runscript --help

Usage: ptool runscript [OPTIONS] {levante|albedo} {fesom2|fesom}

  makes run script for job submission

Options:
  -f, --filename TEXT  name of the run script
  --help               Show this message and exit.


In [28]:
! ptool runscript levante fesom2

{'path': '/pool/data/AWICM/FESOM2', 'output': '~/checksum_fesom2_levante.csv', 'ignore': ['.git', '__pycache__']}
created 'fesom2_levante.sh'.
submit to slurm as 'sbatch fesom2_levante.sh' on levante


In [29]:
! cat fesom2_levante.sh

#!/bin/bash

#SBATCH -A ab0246
#SBATCH -p compute
#SBATCH -t 3:00:00

module load python3/2023.01-gcc-11.2.0

export POOL_SITE=levante
export POOL_NAME=fesom2
export POOL_CONF=ignore:NEWLINE- .gitNEWLINE- __pycache__NEWLINEoutput: ~/checksum_fesom2_levante.csvNEWLINEpath: /pool/data/AWICM/FESOM2NEWLINE

python checksums.py


---

Copy the `fesom2_levante.sh` and `checksums.py` files to `levante` and submit the shell script to Slurm. Because `checksums.py` runs independently (without requiring the `ptool` package to be installed), it is easy to deploy and to collect the checksums.
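
The `POOL_CONF` value in the run script carries the pool configuration on a single line, with `NEWLINE` standing in for line breaks. How `checksums.py` decodes it is not shown here; one plausible approach, sketched below with a hypothetical `decode_conf` helper, is to restore the line breaks before handing the text to a YAML parser.

```python
import os

# Value as it appears in the generated run script above.
EXAMPLE_CONF = (
    "ignore:NEWLINE- .gitNEWLINE- __pycache__NEWLINE"
    "output: ~/checksum_fesom2_levante.csvNEWLINE"
    "path: /pool/data/AWICM/FESOM2NEWLINE"
)


def decode_conf(raw):
    """Turn the single-line value back into its lines (hypothetical helper)."""
    return raw.replace("NEWLINE", "\n").splitlines()


# In a job the value would come from the environment; fall back to the example.
conf_lines = decode_conf(os.environ.get("POOL_CONF", EXAMPLE_CONF))
for line in conf_lines:
    print(line)
```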