# Assess a file for Kerchunkability with the Padocc pipeline
Take a single accepted input file and attempt to convert it to Kerchunk references using each of the known drivers.

In [1]:
from pipeline.compute import KerchunkConverter
from pipeline.scan import summarise_json

nfile     = '/badc/cmip6/data/CMIP6/C4MIP/CCCma/CanESM5/1pctCO2-rad/r1i1p1f1/AERmon/ps/gn/v20190429/ps_AERmon_CanESM5_1pctCO2-rad_r1i1p1f1_gn_185001-200012.nc'

converter = KerchunkConverter(bypass_driver=True)
refs      = converter.try_all_drivers(nfile)

The Kerchunk converter will fail for file types that cannot be Kerchunked. Some workarounds may be possible, but such files may instead need to be converted to Zarr.
If the conversion succeeds, we can assess the refs generated here.

In [2]:
volume, cpf, varchunks, ctype = summarise_json(refs, converter.ctype)

The summarised outputs are shown below.

In [3]:
print(f"""
Assessment for selected file:
    Size in bytes        : {volume}
    Chunks in file       : {cpf}
    Variables            : {list(varchunks.keys())}
    Kerchunk Driver Type : {ctype}
""")


Assessment for selected file:
    Size in bytes        : 47520998
    Chunks in file       : 5440
    Variables            : ['lat', 'lat_bnds', 'lon', 'lon_bnds', 'ps', 'time', 'time_bnds']
    Kerchunk Driver Type : hdf5
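
As a quick follow-up, the two numeric values above can be combined into an average chunk size, which gives a rough feel for how much data each reference points at. This is a minimal sketch that hard-codes the values printed above and assumes the reported size is the total number of bytes covered by the counted chunks; `average_chunk_bytes` is an illustrative name, not part of the pipeline.

```python
def average_chunk_bytes(volume: int, chunk_count: int) -> float:
    """Return the mean number of bytes per chunk."""
    if chunk_count <= 0:
        raise ValueError("chunk_count must be positive")
    return volume / chunk_count


# Values copied from the assessment output above.
volume = 47520998
cpf = 5440

avg = average_chunk_bytes(volume, cpf)
print(f"Average chunk size: {avg:.0f} bytes ({avg / 1024:.1f} KiB)")
```

An average of only a few KiB per chunk suggests many small reads, which is worth keeping in mind when judging how well the file suits Kerchunk.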



`varchunks` contains the chunk shape for each variable. A chunk is N-dimensional and covers either the whole array for a variable or some subsection of it.

In [4]:
for var, chunks in varchunks.items():
    print(var, chunks)

lat [64]
lat_bnds [64, 2]
lon [128]
lon_bnds [128, 2]
ps [1, 64, 128]
time [1]
time_bnds [1, 2]


From this we can see, for example, that the `lat` dimension is chunked in sets of 64 values.

In [5]:
refs['refs']['lat/.zarray']

'{"chunks":[64],"compressor":null,"dtype":"<f8","fill_value":9.969209968386869e+36,"filters":null,"order":"C","shape":[64],"zarr_format":2}'

Comparing this to the `shape` of the `lat` dimension, we can understand that no internal chunking is present. The whole 64-length array is considered one chunk.

In [6]:
refs['refs']['time/.zarray']

'{"chunks":[1],"compressor":null,"dtype":"<f8","fill_value":null,"filters":null,"order":"C","shape":[1812],"zarr_format":2}'

When we contrast this with the `time` array, we can see that the shape is 1812 but the chunk size is just 1, meaning every time value is considered its own chunk.
Other tools and functions exist within the pipeline to analyse Kerchunk refs:

In [12]:
from pipeline.utils import find_zarrays
import json

zarrays = find_zarrays(refs)
for z in zarrays.keys():
    zarray = json.loads(zarrays[z])
    print(f"{z}         Shape: {zarray['shape']}, Chunks: {zarray['chunks']}")

lat/.zarray         Shape: [64], Chunks: [64]
lat_bnds/.zarray         Shape: [64, 2], Chunks: [64, 2]
lon/.zarray         Shape: [128], Chunks: [128]
lon_bnds/.zarray         Shape: [128, 2], Chunks: [128, 2]
ps/.zarray         Shape: [1812, 64, 128], Chunks: [1, 64, 128]
time/.zarray         Shape: [1812], Chunks: [1]
time_bnds/.zarray         Shape: [1812, 2], Chunks: [1, 2]
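
For readers without the pipeline to hand, a listing like the one above can be approximated with plain Python. The sketch below is an assumption about what such a helper needs to do, not the actual `find_zarrays` implementation: it keeps every key in `refs['refs']` that ends in `/.zarray` and decodes the JSON string stored there. The `sample_refs` dictionary and the `collect_zarrays` name are hand-written stand-ins, using trimmed versions of the `lat` and `time` metadata shown earlier.

```python
import json


def collect_zarrays(refs: dict) -> dict:
    """Map each '<variable>/.zarray' key to its decoded metadata."""
    return {
        key: json.loads(value)
        for key, value in refs["refs"].items()
        if key.endswith("/.zarray")
    }


# Hand-written stand-in for the Kerchunk refs, limited to two variables.
sample_refs = {
    "refs": {
        ".zgroup": '{"zarr_format":2}',
        "lat/.zarray": '{"chunks":[64],"shape":[64],"zarr_format":2}',
        "time/.zarray": '{"chunks":[1],"shape":[1812],"zarr_format":2}',
        # Placeholder chunk reference in the form [path, offset, length].
        "time/0": ["file.nc", 0, 8],
    }
}

for key, meta in collect_zarrays(sample_refs).items():
    print(f"{key}  Shape: {meta['shape']}, Chunks: {meta['chunks']}")
```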


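From the shapes and chunk sizes listed above, there is a single chunk for each of `lat`, `lon`, `lat_bnds` and `lon_bnds`, and there are 1812 chunks for each of `ps`, `time` and `time_bnds`. Adding these up gives 4 + 3*1812 = 5440, which matches the chunk count of 5440 reported in the summary above. If the reported count were lower than this total, one possible cause would be chunks that are absent from the original file (for example, chunks that were never written because they hold only fill values), since Kerchunk only writes references for chunks it finds.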
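
The chunk totals can be checked with a few lines of Python. This sketch hard-codes the shapes and chunk sizes from the listing above and counts the chunks per variable as the product, over dimensions, of the shape divided by the chunk size (rounded up). `expected_chunks` is an illustrative helper, not part of the pipeline.

```python
import math

# (shape, chunks) pairs copied from the find_zarrays listing above.
layout = {
    "lat": ([64], [64]),
    "lat_bnds": ([64, 2], [64, 2]),
    "lon": ([128], [128]),
    "lon_bnds": ([128, 2], [128, 2]),
    "ps": ([1812, 64, 128], [1, 64, 128]),
    "time": ([1812], [1]),
    "time_bnds": ([1812, 2], [1, 2]),
}


def expected_chunks(shape, chunks):
    """Number of chunks needed to tile an array of `shape` with `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


per_variable = {
    name: expected_chunks(shape, chunks)
    for name, (shape, chunks) in layout.items()
}
total = sum(per_variable.values())

print(per_variable)
print("Total expected chunks:", total)
```

Comparing this total with the `Chunks in file` value from the summary is a quick way to spot variables whose chunks are missing from the refs.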