### Converting Argo data to parquet with dask

This notebook downloads Argo Core and BGC profiles and converts them to parquet, given:

* the local path `gdac_path` to the Argo index files (if they don't exist, they'll be downloaded to this folder),
* the path `outdir_nc` where the most recent Argo profile files are downloaded (this path must end with `GDAC/dac/`),
* the path `outdir_pqt` where the parquet database will be stored,
* the path `schema_path` to the parquet schemas (this should not need to be changed).

In [1]:
import argo_tools as at

gdac_path = '/vortexfs1/share/boom/data/nc2pqt_test/'
outdir_nc = '/vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/'
outdir_pqt = '/vortexfs1/share/boom/data/nc2pqt_test/pqt2/'
schema_path = '../schemas/'
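
Before starting a long download, it can help to sanity-check the paths. The helper below is a minimal sketch and not part of `argo_tools`; the name `check_paths` and its `create` flag are hypothetical. It only checks the `GDAC/dac/` suffix required above and that the folders exist (or creates them on request).

```python
from pathlib import Path


def check_paths(gdac_path, outdir_nc, outdir_pqt, create=False):
    """Return a list of problems found with the notebook's path settings.

    Hypothetical helper: it is not provided by argo_tools.
    """
    problems = []
    # The netCDF download folder is required to end with 'GDAC/dac/'.
    if not outdir_nc.endswith("GDAC/dac/"):
        problems.append(f"outdir_nc should end with 'GDAC/dac/': {outdir_nc}")
    named_paths = (
        ("gdac_path", gdac_path),
        ("outdir_nc", outdir_nc),
        ("outdir_pqt", outdir_pqt),
    )
    for name, path in named_paths:
        folder = Path(path)
        if create:
            # Create the folder (and its parents) if it is missing.
            folder.mkdir(parents=True, exist_ok=True)
        elif not folder.is_dir():
            problems.append(f"{name} does not exist: {path}")
    return problems
```

Calling `check_paths(gdac_path, outdir_nc, outdir_pqt)` returns an empty list when everything looks fine.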

#### Downloading Argo profiles and generating list of file paths

The following cell downloads the most recent version of the profiles from the GDAC and returns the list of paths to each stacked profile file (`*_prof.nc` and `*_Sprof.nc` files, for Core and BGC Argo respectively). For the Argo Core database, set `dataset='phy'`; for the BGC database, set `dataset='bgc'`.

If you already have the profiles stored somewhere, you can set the arguments `skip_downloads=False` and `dryrun=True` to simply generate the path list without downloading the profiles. Alternatively, you can generate the list of file paths yourself: just call it `flistPHY` or `flistBGC` and the rest of the notebook should work.

If you don't want the download to be multithreaded, set the argument `NPROC=1`.
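
If you build the file list yourself, a glob over the download folder may be enough. The sketch below assumes the layout `<outdir_nc>/<DAC name>/<WMO>/<WMO>_prof.nc` (or `_Sprof.nc`), which matches the paths printed later in this notebook. The function name `list_profile_files` is hypothetical, and whether `daskTools` expects plain strings in `flist` is also an assumption.

```python
from pathlib import Path


def list_profile_files(dac_root, dataset="bgc"):
    """List stacked profile files under dac_root.

    Hypothetical helper. Assumed layout:
    <dac_root>/<DAC name>/<WMO>/<WMO>_prof.nc   (Core, dataset='phy')
    <dac_root>/<DAC name>/<WMO>/<WMO>_Sprof.nc  (BGC,  dataset='bgc')
    """
    suffix = {"phy": "_prof.nc", "bgc": "_Sprof.nc"}[dataset]
    # '*_prof.nc' does not match '*_Sprof.nc', so the two sets stay separate.
    matches = Path(dac_root).glob(f"*/*/*{suffix}")
    return sorted(str(p) for p in matches)
```

For example, `flistBGC = list_profile_files(outdir_nc, dataset='bgc')` would stand in for the list returned by `at.argo_gdac`.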

##### BGC dataset

In [2]:
%%time
# bgc
wmos, df2, flistBGC = at.argo_gdac(gdac_path=gdac_path, dataset='bgc', save_to=outdir_nc, download_individual_profs=False, skip_downloads=False, dryrun=True, overwrite_profiles=True, NPROC=20, verbose=True, checktime=True)

  gdac_index = pd.read_csv(gdac_file,delimiter=',',header=8,parse_dates=['date','date_update'],


CPU times: user 11 s, sys: 814 ms, total: 11.8 s
Wall time: 13.9 s


#### File conversion

The conversion from netCDF to parquet uses the dask package to parallelize loading multiple datasets into memory and converting them, taking their in-memory size into account.

The new parquet files will be stored in the directory `outdir_pqt` that you specified earlier.
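
To illustrate what size-aware grouping can look like, here is a minimal sketch that greedily batches files so that each batch stays under a target size. It is not the `daskTools` implementation, which may work differently; the name `batch_by_size` is hypothetical, and on-disk size is used only as a rough proxy for in-memory size.

```python
def batch_by_size(sized_paths, target_bytes):
    """Greedily group (path, size) pairs into batches.

    A batch is closed as soon as adding the next file would exceed
    target_bytes. A single file larger than target_bytes gets its own batch.
    Hypothetical helper, not taken from daskTools.
    """
    batches = []
    current, current_size = [], 0
    for path, size in sized_paths:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

The input could be built with `[(f, os.path.getsize(f)) for f in flistBGC]`, after importing `os`.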

The next cell sets up the dask cluster. Adjust the input parameters for your machine; the available options are listed [here](https://distributed.dask.org/en/latest/api.html#client) (NB: `Client()` also accepts the arguments of `LocalCluster()`).

In [3]:
import dask
from dask.distributed import Client
client = Client(
    n_workers=10, 
    threads_per_worker=10, 
    processes=True, 
    memory_limit='auto'
)
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Status: running
Using processes: True
Scheduler comm: tcp://127.0.0.1:33343
Started: Just now
Workers: 10 (each with 10 threads and 27.13 GiB of memory)
Total threads: 100
Total memory: 271.27 GiB
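
If you are unsure which values to pass to `Client`, one possible starting point is to derive the number of workers from the available cores. The sketch below is only a heuristic (one thread per core), not a recommendation from the dask documentation; the function name `suggest_layout` is hypothetical. The keys it returns are the same arguments used in the cell above.

```python
import os


def suggest_layout(threads_per_worker=2, total_cores=None):
    """Suggest keyword arguments for dask.distributed.Client.

    Hypothetical helper: aims for roughly one thread per core.
    """
    cores = total_cores or os.cpu_count() or 1
    # Always keep at least one worker, even on very small machines.
    n_workers = max(1, cores // threads_per_worker)
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        "processes": True,
        "memory_limit": "auto",
    }


# Usage (requires dask.distributed):
# from dask.distributed import Client
# client = Client(**suggest_layout(threads_per_worker=4))
```

Watch the dashboard during a test run and adjust from there, since the best split between workers and threads depends on the workload.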




Here, we set the parameters needed for the conversions (e.g. the database name) and then execute the conversion.

In [5]:
from daskTools import daskTools

daskConverter = daskTools(
    db_type = "BGC",
    out_dir = outdir_pqt,
    flist = flistBGC,
    schema_path = schema_path
)

In [6]:
%%time
daskConverter.convert_to_parquet()

python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:426: nc4_find_nc_grp_h5: Assertion `my_h5 && my_h5->root_grp' failed.
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc && !((NC_FILE_INFO_T *)(nc)->dispatchdata) && path' failed.
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc && !((NC_FILE_INFO_T *)(nc)->dispatchdata) && path' failed.
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc && !((NC_FILE_INFO_T *)(nc)->dispatchdata) && path' failed.
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5A.c line 1327 in H5Aiterate2(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5VLint.c line 1749 in H5VL_vol_object(): invalid identifier type to function
    major: Invalid arguments to routine
    minor: Inappropriate type
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc

Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3901081/3901081_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3902128/3902128_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/1902621/1902621_Sprof.nc
Oops! <class 'ValueError'> occurred.
Fail to cast PROFILE_CHLA_QC[('N_PROF',)] from 'object' to <class 'str'>
Unique values: [b'A' b'B']
Oops! <class 'ValueError'> occurred.
Fail to cast: PROFILE_CHLA_QC 
Encountered unique values: [b'A' b'B']
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3902471/3902471_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/1901364/1901364_Sprof.nc
Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COMMENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                  

HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5F.c line 620 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: H5VLcallback.c line 3501 in H5VL_file_open(): failed to iterate over available VOL connector plugins
    major: Virtual Object Layer
    minor: Iteration failed
  #002: H5PLpath.c line 578 in H5PL__path_table_iterate(): can't iterate over plugins in plugin path '(null)'
    major: Plugin for dynamically loaded library
    minor: Iteration failed
  #003: H5PLpath.c line 620 in H5PL__path_table_iterate_process_path(): can't open directory: /usr/local/hdf5/lib/plugin
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
  #004: H5VLcallback.c line 3351 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #005: H5VLnative_file.c line 97 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to o

stored.
CPU times: user 8min 32s, sys: 43.3 s, total: 9min 15s
Wall time: 29min 24s
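
The `UnicodeDecodeError` messages in the output above suggest that some character fields contain bytes that are not valid UTF-8, stored as space-padded fixed-width strings. As an illustration only, the sketch below shows one tolerant way to decode such values. It is not how `daskTools` handles them; the function name `decode_padded` and the Latin-1 fallback are assumptions.

```python
def decode_padded(raw):
    """Decode a space-padded byte string to str.

    Tries UTF-8 first and falls back to Latin-1, which accepts any byte.
    Hypothetical helper, not taken from daskTools.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
    # Remove the trailing padding used by fixed-width fields.
    return text.rstrip()
```

Applied to the QC flags shown above, `[decode_padded(v) for v in (b'A', b'B')]` gives plain Python strings.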


##### Core dataset

In [7]:
%%time
# phy
wmos, df2, flistPHY = at.argo_gdac(gdac_path=gdac_path, dataset='phy', save_to=outdir_nc, download_individual_profs=False, skip_downloads=False, dryrun=True, overwrite_profiles=True, NPROC=1, verbose=True, checktime=True)

  gdac_index = pd.read_csv(gdac_file,delimiter=',',header=8,parse_dates=['date','date_update'],


CPU times: user 3min 25s, sys: 12.4 s, total: 3min 37s
Wall time: 4min 49s


In [8]:
from daskTools import daskTools

daskConverter = daskTools(
    db_type = "PHY",
    out_dir = outdir_pqt+'partitionPHY_300MB/',
    flist = flistPHY,
    schema_path = schema_path
)

In [12]:
len(flistBGC)

2252

In [9]:
%%time
daskConverter.convert_to_parquet()



[b'                                                                                                                                                                                                                                                                '
 b'BBP700_ADJUSTED is being filled with BBP700 directly in real time. Adjustment method may be enhanced in the future. RTQC_APPLIED 11110 RTQC_FAILED 00000                                                                                                        '
 b'BBP700_ADJUSTED is being filled with BBP700 directly in real time. Adjustment method may be enhanced in the future. RTQC_APPLIED 11110 RTQC_FAILED 00100                                                                                                        '
 b'BBP700_ADJUSTED is being filled with BBP700 directly in real time. Adjustment method may be enhanced in the future. RTQC_APPLIED 11111 RTQC_FAILED 00000                                                              



stored.
CPU times: user 3min 36s, sys: 14.3 s, total: 3min 50s
Wall time: 7min 6s


#### Done!

When we are done, we can shut down the dask cluster.

In [13]:
client.shutdown()

Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/6901472/6901472_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3901083/3901083_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/csiro/5901699/5901699_Sprof.nc
Oops! <class 'RuntimeError'> occurred.
Fail to cast: SCIENTIFIC_CALIB_DATE 
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/csiro/5901646/5901646_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/6901026/6901026_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/6902736/6902736_Sprof.nc
Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COMMENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                                                                                                    

2024-08-21 13:42:37,195 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/distributed/utils_comm.py", line 459, in retry_operation
    return await retry(
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-pack