## Examples of using the `GroupOperation` class - Manipulating a Group

Every operation that applies to a specific group is available through the ``GroupOperation`` class, imported from padocc's ``operations`` module as shown below.


In [1]:
from padocc.operations import GroupOperation
import logging

my_group = GroupOperation(
    'my_group',
    workdir='../../auto_testdata_dir', # The directory to create pipeline files.
    verbose=logging.INFO,
)
my_group


<PADOCC Group: my_group>



Get some general information about the group:

In [2]:
my_group.info()

Group: my_group
General Methods:
 > group.run() - Run a specific operation across part of the group.
 > group.init_from_file() - Initialise the group based on an input csv file
 > group.init_from_stac() - Initialise the group based on a STAC index
 > group.add_project() - Add an new project/dataset to this group
 > group.save_files() - Save any changes to any files in the group as part of an operation
 > group.check_writable() - Check if all directories are writable for this group.
Assessment methods:
 > group.summary_data() - Get a printout summary of data representations in this group
 > group.remove_projects() - Remove projects fitting some parameters from this group
 > group.progress_display() - Get a human-readable display of progress within the group.
 > group.progress_repr() - Get a dict version of the progress report (for AirFlow)


In [3]:
my_group.values()

Group: my_group
 - Workdir: ../../auto_testdata_dir
 - Groupdir: ../../auto_testdata_dir/groups/my_group
 - forceful: False
 - thorough: False
 - dryrun: False


## Initialise the group from a file
The group has been created but does not yet contain any projects, so we need to populate it from either a CSV file or a STAC index.

In [4]:
csv_file = '../../tests/data/myfile.csv'
my_group.init_from_file(csv_file)

INFO [group-operation]: Starting initialisation
INFO [group-operation]: Copying input file from relative path - resolved to /home/users/dwest77/cedadev/padocc/docs/source
INFO [group-operation]: Creating project directories
INFO [group-operation]: Creating directories/filelists for 1/2


INFO [group-operation]: Updated new status: init - Success
INFO [group-operation]: Creating directories/filelists for 2/2
INFO [group-operation]: Updated new status: init - Success
INFO [group-operation]: Created 12 files, 4 directories in group my_group
INFO [group-operation]: Written as group ID: my_group


The group has now been initialised. The CSV file we loaded contains two 'projects', each of which will produce a single dataset object at the end of the pipeline: an aggregation of multiple data files into a single product. We can view the contents of the CSV file the group was loaded from as follows:

In [5]:
print(my_group.datasets)

padocc-test-1,/home/users/dwest77/cedadev/padocc/tests/data/test1.txt,,
padocc-test-2,/home/users/dwest77/cedadev/padocc/tests/data/test2.txt,,
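
As a rough illustration of how that listing is structured, the sketch below parses the two rows shown above with Python's standard `csv` module. It assumes the first column is the project code and the second is the path to that project's file list; the meaning of the two trailing empty columns is not shown here, so they are ignored. The name `parse_group_csv` is hypothetical and not part of padocc.

```python
import csv
import io

# The two rows printed by my_group.datasets above, copied verbatim.
GROUP_CSV = """\
padocc-test-1,/home/users/dwest77/cedadev/padocc/tests/data/test1.txt,,
padocc-test-2,/home/users/dwest77/cedadev/padocc/tests/data/test2.txt,,
"""


def parse_group_csv(text):
    """Map each project code to its file-list path (hypothetical helper).

    Assumes column 0 is the project code and column 1 is the file list;
    further columns are ignored because their meaning is not shown here.
    """
    projects = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        projects[row[0]] = row[1]
    return projects


projects = parse_group_csv(GROUP_CSV)
for code, filelist in projects.items():
    print(f"{code}: {filelist}")
```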


Each project comes with a file listing all the data files under that project. In this case there are 5 netCDF files in each project, which we can list using:

In [7]:
project = my_group.get_project('padocc-test-1')
print(project.allfiles)

/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.0.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.1.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.2.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.3.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.4.nc
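
A file list like this is plain text with one path per line, so it can be inspected with the standard library alone. The sketch below takes the five paths printed above, counts the netCDF files, and groups them by parent directory. `summarise_filelist` is a hypothetical helper, not a padocc function.

```python
from collections import Counter
from pathlib import PurePosixPath

# The file list printed by project.allfiles above, copied verbatim.
FILELIST = """\
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.0.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.1.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.2.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.3.nc
/home/users/dwest77/cedadev/padocc/tests/data/rain/example1.4.nc
"""


def summarise_filelist(text):
    """Return (number of .nc files, Counter of their parent directories)."""
    paths = [PurePosixPath(line.strip()) for line in text.splitlines() if line.strip()]
    nc_files = [p for p in paths if p.suffix == ".nc"]
    parents = Counter(str(p.parent) for p in nc_files)
    return len(nc_files), parents


count, parents = summarise_filelist(FILELIST)
print(f"{count} netCDF files in {len(parents)} directory(ies)")
```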


## Run a group operation

We can now run a process on the group as a whole via the ``run`` method. Three main phases form the central section of the pipeline: ``scan``, ``compute`` and ``validate``. These can be run individually (recommended) or, if you are working with a single project, all together with ``all``.

In [3]:
my_group.run(
    'scan', 
    mode='kerchunk', # Default format
    repeat_id='main', # All projects
    proj_code=None,   # Or run a specific project.
    forceful=True,
)

INFO [group-operation]: Starting operation: 1/2 (padocc-test-1)
INFO [project-operation_0]: Starting scan-kerchunk operation for padocc-test-1
INFO [project-operation_0]: Starting scan-kerchunk operation for padocc-test-1
INFO [project-operation_0]: Determined 2 files to scan (out of 5)
INFO [project-operation_0]: Determined 2 files to scan (out of 5)
INFO [project-operation_0]: Starting scan process for Kerchunk cloud format
INFO [project-operation_0]: Starting scan process for Kerchunk cloud format
INFO [project-operation_0]: Starting computation for components of padocc-test-1
INFO [project-operation_0]: Starting computation for components of padocc-test-1
INFO [project-operation_0]: Loading cache file
INFO [project-operation_0]: Loading cache file
INFO [project-operation_0]: Loaded refs: 1/2
INFO [project-operation_0]: Loaded refs: 1/2
INFO [project-operation_0]: Loading cache file
INFO [project-operation_0]: Loading cache file
INFO [project-operation_0]: Loaded refs: 2/2
INFO [pro







1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  ds = open_dataset(
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  ds = open_dataset(
INFO [project-operation_1]: Determining concatenation dimensions
INFO [project-operation_1]: Found ['time'] concatenation dimensions.
INFO [project-operation_1]: Determining identical variables
INFO [project-operation_1]: Found ['latitude', 'longitude'] identical variables.
INFO [project-operation_1]: Concatenating to JSON format Ke
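
The three phases described above can also be chained from a small loop. The sketch below is one way to do that, assuming that `run` accepts `'compute'` and `'validate'` with the same keyword arguments shown for `'scan'` (only the `'scan'` call appears in this example, so treat the other two as an assumption). `run_phases` and `RecordingGroup` are hypothetical names; the helper only calls `group.run(...)`, so it works with any object exposing that method.

```python
PHASES = ("scan", "compute", "validate")


def run_phases(group, phases=PHASES, **run_kwargs):
    """Call group.run() once per phase, in order (hypothetical helper).

    Any exception raised by a phase propagates, so later phases are
    not attempted. Returns the list of phases that were run.
    """
    completed = []
    for phase in phases:
        group.run(phase, **run_kwargs)
        completed.append(phase)
    return completed


class RecordingGroup:
    """Stand-in for a GroupOperation so the sketch runs without padocc installed."""

    def __init__(self):
        self.calls = []

    def run(self, phase, **kwargs):
        # Record the call instead of doing any real work.
        self.calls.append((phase, kwargs))


demo = RecordingGroup()
done = run_phases(demo, mode="kerchunk", repeat_id="main", forceful=True)
print(done)
```

With a real group the call would look like `run_phases(my_group, mode='kerchunk', repeat_id='main', forceful=True)`. Since running phases individually is the recommended approach, a loop like this is mainly useful once each phase is known to work for the group.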







The first step is to scan part of the existing data to assess its viability for the chosen mode of aggregation. In this example, two of the files for each project were converted to Kerchunk references and combined, to check that the whole dataset can be converted. A detailed file of scan results has been created, which we can access through the same kind of project object we created earlier.

In [5]:
project = my_group.get_project('padocc-test-1')
print(project.detail_cfg)

addition: 0.064 %
chunks_per_file: '4.0'
driver: hdf5
estm_chunksize: 260.28 KB
estm_spatial_res: 254.56 deg
kerchunk_data: 3.34 KB
netcdf_data: 5.21 MB
num_files: 5
timings:
  concat_actual: null
  concat_estm: 0.014077
  convert_actual: null
  convert_estm: 0.014628
  validate_actual: null
  validate_estm: 0.00494
total_chunks: '20.00'
type: JSON
variable_count: 4
variables:
- latitude
- longitude
- p
- time
version_no: 1



There is a significant amount of information present here. The important elements are the estimated size of the Kerchunk file that will be created (``kerchunk_data`` in the output above) and the ``type``, which can be ``JSON`` or ``PARQ`` for Kerchunk.
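
Some of these figures can be cross-checked against each other. The sketch below copies a few of the values printed above into a plain dictionary and recomputes two of them: the chunk total from `chunks_per_file` and `num_files`, and the Kerchunk-to-netCDF size ratio. That this ratio corresponds to the `addition` field is an inference from the numbers (they agree when decimal units are assumed, 1 MB = 1000 KB), not something stated by padocc. `to_kb` is a hypothetical helper.

```python
# Values copied from the scan results printed above.
detail = {
    "addition": "0.064 %",
    "chunks_per_file": "4.0",
    "kerchunk_data": "3.34 KB",
    "netcdf_data": "5.21 MB",
    "num_files": 5,
    "total_chunks": "20.00",
}

# Decimal multipliers are an assumption; the output does not say
# which convention is used for KB and MB.
UNIT_TO_KB = {"KB": 1.0, "MB": 1000.0, "GB": 1000.0 ** 2}


def to_kb(size):
    """Convert a string such as '5.21 MB' to kilobytes."""
    value, unit = size.split()
    return float(value) * UNIT_TO_KB[unit]


# Chunks across the whole project: chunks per file times number of files.
total_chunks = float(detail["chunks_per_file"]) * detail["num_files"]

# Size of the Kerchunk references relative to the source netCDF data.
ratio_percent = 100 * to_kb(detail["kerchunk_data"]) / to_kb(detail["netcdf_data"])

print(f"total chunks: {total_chunks:.2f}")
print(f"kerchunk/netCDF size: {ratio_percent:.3f} %")
```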