### Converting Argo data to parquet with dask

This notebook downloads and converts Argo Core and BGC profiles, given:

* the local path `gdac_path` to the Argo index files (if they don't exist, they'll be downloaded to that folder),
* the path `outdir_nc` where the most recent Argo profile files will be downloaded (this path must end with `GDAC/dac/`),
* the path `outdir_pqt` where the parquet database will be stored.

In [1]:
import argo_tools as at

gdac_path = '/vortexfs1/share/boom/data/nc2pqt_test/'
outdir_nc = '/vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/'
outdir_pqt = '/vortexfs1/share/boom/data/nc2pqt_test/pqt2/'
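
Before launching a long download, it can help to check the three paths against the requirements listed above. The following is a minimal sketch, not part of `argo_tools`: the helper `check_argo_paths` and its `create` flag are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def check_argo_paths(gdac_path, outdir_nc, outdir_pqt, create=False):
    """Return a list of human-readable problems found in the three paths.

    Hypothetical helper (not part of argo_tools).
    """
    problems = []
    # The notebook requires the profile folder to end with GDAC/dac/
    if not outdir_nc.endswith("GDAC/dac/"):
        problems.append(f"outdir_nc must end with 'GDAC/dac/': {outdir_nc}")
    named_paths = [
        ("gdac_path", gdac_path),
        ("outdir_nc", outdir_nc),
        ("outdir_pqt", outdir_pqt),
    ]
    for name, path in named_paths:
        p = Path(path)
        if create:
            # Create missing folders instead of reporting them
            p.mkdir(parents=True, exist_ok=True)
        elif not p.is_dir():
            problems.append(f"{name} does not exist: {path}")
    return problems
```

Calling `check_argo_paths(gdac_path, outdir_nc, outdir_pqt)` returns an empty list when all three folders exist and `outdir_nc` has the expected suffix.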

#### Downloading Argo profiles and generating list of file paths

The following cell downloads the most recent version of the profiles from the GDAC and returns the list of paths to each stacked profile file (`*_prof.nc` and `*_Sprof.nc` files, for Core and BGC Argo respectively). For the Argo Core database, set `dataset='phy'`; for the BGC database, set `dataset='bgc'`.

If you already have the profiles stored locally, you can set the arguments `skip_downloads=False` and `dryrun=True` to generate the path list without downloading the profiles. Alternatively, you can generate the list of file paths yourself: name it `flistPHY` or `flistBGC` and the rest of the notebook should work.
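
If you prefer to build the list of file paths yourself, a sketch along these lines may be enough. It assumes the `<dac>/<wmo>/<wmo>_prof.nc` layout visible in the paths printed later in this notebook; `list_profile_files` is a hypothetical helper, not a function of `argo_tools`.

```python
from pathlib import Path


def list_profile_files(outdir_nc, dataset="bgc"):
    """Collect stacked profile files found under outdir_nc.

    Hypothetical helper; assumes the layout <dac>/<wmo>/<wmo><suffix>.
    """
    # Core files end with _prof.nc, BGC files with _Sprof.nc
    suffix = {"phy": "_prof.nc", "bgc": "_Sprof.nc"}[dataset]
    return sorted(str(p) for p in Path(outdir_nc).glob(f"*/*/*{suffix}"))
```

For example, `flistBGC = list_profile_files(outdir_nc, dataset="bgc")` would produce a list that the conversion cells below can consume.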

If you don't want the download to be multithreaded, set the argument `NPROC=1`.

##### BGC dataset

In [2]:
%%time
# bgc
wmos, df2, flistBGC = at.argo_gdac(gdac_path=gdac_path, dataset='bgc', save_to=outdir_nc, download_individual_profs=False, skip_downloads=False, dryrun=True, overwrite_profiles=True, NPROC=20, verbose=True, checktime=True)

  gdac_index = pd.read_csv(gdac_file,delimiter=',',header=8,parse_dates=['date','date_update'],


CPU times: user 11 s, sys: 814 ms, total: 11.8 s
Wall time: 13.9 s


#### File conversion

The conversion from netCDF to parquet uses the dask package to parallelize loading multiple datasets into memory and converting them, taking their in-memory size into account.

The new parquet files will be stored in the directory `outdir_pqt` that you specified earlier.
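
To illustrate what grouping files by size can look like, here is a small first-fit-decreasing sketch. It is not the `daskTools` implementation: `group_by_size` and its `target_mb` default are assumptions made for this example, and the sizes are supplied by the caller.

```python
def group_by_size(sizes_mb, target_mb=300.0):
    """Pack file indices into groups whose total size stays within target_mb.

    Hypothetical helper using first-fit decreasing; a file larger than
    target_mb ends up alone in its own group.
    """
    # Visit the largest files first
    order = sorted(range(len(sizes_mb)), key=lambda i: sizes_mb[i], reverse=True)
    groups, totals = [], []
    for i in order:
        for g, total in enumerate(totals):
            # Put the file in the first group that still has room
            if total + sizes_mb[i] <= target_mb:
                groups[g].append(i)
                totals[g] += sizes_mb[i]
                break
        else:
            # No existing group has room: open a new one
            groups.append([i])
            totals.append(sizes_mb[i])
    return groups
```

Each returned group is a list of indices into the original file list, so `[flist[k] for k in group]` gives the files that would be written together.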

The next cell sets up the dask cluster. Adjust the input parameters for your machine; the available options are listed [here](https://distributed.dask.org/en/latest/api.html#client) (NB: `Client()` also accepts the arguments of `LocalCluster()`).
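
If you are unsure which values to use, one possible starting point is to derive them from the number of CPUs. The helper below is a hypothetical sketch (`suggest_client_kwargs` is not part of dask), and the split between workers and threads is only a rule of thumb to adjust after watching the dashboard.

```python
import os


def suggest_client_kwargs(n_cpus=None, threads_per_worker=2):
    """Build keyword arguments for dask.distributed.Client from a CPU count.

    Hypothetical helper; the returned values are a starting point only.
    """
    n_cpus = n_cpus or os.cpu_count() or 1
    # Never ask for more threads per worker than there are CPUs
    threads = max(1, min(threads_per_worker, n_cpus))
    return {
        "n_workers": max(1, n_cpus // threads),
        "threads_per_worker": threads,
        "processes": True,
        "memory_limit": "auto",
    }


# Usage, assuming dask.distributed is installed:
# from dask.distributed import Client
# client = Client(**suggest_client_kwargs())
```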

In [3]:
import dask
from dask.distributed import Client
client = Client(
    n_workers=10, 
    threads_per_worker=10, 
    processes=True, 
    memory_limit='auto'
)
client

| Field | Value |
| --- | --- |
| Connection method | Cluster object |
| Cluster type | distributed.LocalCluster |
| Dashboard | http://127.0.0.1:8787/status |
| Workers | 10 |
| Total threads | 100 |
| Total memory | 271.27 GiB |
| Status | running |
| Using processes | True |
| Comm | tcp://127.0.0.1:33343 |
| Started | Just now |

| Comm | Total threads | Memory | Dashboard | Nanny | Local directory |
| --- | --- | --- | --- | --- | --- |
| tcp://127.0.0.1:42656 | 10 | 27.13 GiB | http://127.0.0.1:36195/status | tcp://127.0.0.1:45654 | `/tmp/dask-scratch-space/worker-zv5lbaf3` |
| tcp://127.0.0.1:44308 | 10 | 27.13 GiB | http://127.0.0.1:42052/status | tcp://127.0.0.1:34990 | `/tmp/dask-scratch-space/worker-lfi9xbrf` |
| tcp://127.0.0.1:38784 | 10 | 27.13 GiB | http://127.0.0.1:43935/status | tcp://127.0.0.1:38000 | `/tmp/dask-scratch-space/worker-zayiajwx` |
| tcp://127.0.0.1:39750 | 10 | 27.13 GiB | http://127.0.0.1:33425/status | tcp://127.0.0.1:40071 | `/tmp/dask-scratch-space/worker-zf7ks3_4` |
| tcp://127.0.0.1:38530 | 10 | 27.13 GiB | http://127.0.0.1:35858/status | tcp://127.0.0.1:33609 | `/tmp/dask-scratch-space/worker-z7xa5rp_` |
| tcp://127.0.0.1:38431 | 10 | 27.13 GiB | http://127.0.0.1:38889/status | tcp://127.0.0.1:37004 | `/tmp/dask-scratch-space/worker-cuvasjwb` |
| tcp://127.0.0.1:40219 | 10 | 27.13 GiB | http://127.0.0.1:33025/status | tcp://127.0.0.1:37623 | `/tmp/dask-scratch-space/worker-tru2wg7s` |
| tcp://127.0.0.1:35902 | 10 | 27.13 GiB | http://127.0.0.1:40884/status | tcp://127.0.0.1:40212 | `/tmp/dask-scratch-space/worker-ha7bx7kj` |
| tcp://127.0.0.1:32999 | 10 | 27.13 GiB | http://127.0.0.1:38899/status | tcp://127.0.0.1:46261 | `/tmp/dask-scratch-space/worker-f1sv411t` |
| tcp://127.0.0.1:44193 | 10 | 27.13 GiB | http://127.0.0.1:44582/status | tcp://127.0.0.1:36316 | `/tmp/dask-scratch-space/worker-bvndk5eu` |




Here, we set the parameters needed for the conversions (e.g. the database name) and then execute the conversion.

In [5]:
from daskTools import daskTools

daskConverter = daskTools(
    db_type = "BGC",
    out_dir = outdir_pqt,
    flist = flistBGC
)

In [6]:
%%time
daskConverter.convert_to_parquet()

python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:426: nc4_find_nc_grp_h5: Assertion `my_h5 && my_h5->root_grp' failed.
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc && !((NC_FILE_INFO_T *)(nc)->dispatchdata) && path' failed.
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc && !((NC_FILE_INFO_T *)(nc)->dispatchdata) && path' failed.
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc && !((NC_FILE_INFO_T *)(nc)->dispatchdata) && path' failed.
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5A.c line 1327 in H5Aiterate2(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5VLint.c line 1749 in H5VL_vol_object(): invalid identifier type to function
    major: Invalid arguments to routine
    minor: Inappropriate type
python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:326: nc4_nc4f_list_add: Assertion `nc

Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3901081/3901081_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3902128/3902128_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/1902621/1902621_Sprof.nc
Oops! <class 'ValueError'> occurred.
Fail to cast PROFILE_CHLA_QC[('N_PROF',)] from 'object' to <class 'str'>
Unique values: [b'A' b'B']
Oops! <class 'ValueError'> occurred.
Fail to cast: PROFILE_CHLA_QC 
Encountered unique values: [b'A' b'B']
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3902471/3902471_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/1901364/1901364_Sprof.nc
Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COMMENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                  

HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5F.c line 620 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: H5VLcallback.c line 3501 in H5VL_file_open(): failed to iterate over available VOL connector plugins
    major: Virtual Object Layer
    minor: Iteration failed
  #002: H5PLpath.c line 578 in H5PL__path_table_iterate(): can't iterate over plugins in plugin path '(null)'
    major: Plugin for dynamically loaded library
    minor: Iteration failed
  #003: H5PLpath.c line 620 in H5PL__path_table_iterate_process_path(): can't open directory: /usr/local/hdf5/lib/plugin
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
  #004: H5VLcallback.c line 3351 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #005: H5VLnative_file.c line 97 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to o

stored.
CPU times: user 8min 32s, sys: 43.3 s, total: 9min 15s
Wall time: 29min 24s
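
Several of the failures logged in this notebook are `UnicodeDecodeError`s raised while casting byte strings to `str`. Some Argo calibration strings contain bytes that are not valid UTF-8 (for example `\xb1`, which is `±` in Latin-1) or are padded with NUL bytes and spaces. The sketch below shows one tolerant way to decode such values; `decode_argo_bytes` is a hypothetical helper, and the Latin-1 fallback is an assumption about the files rather than something the Argo format guarantees.

```python
def decode_argo_bytes(value, encodings=("utf-8", "latin-1")):
    """Decode a fixed-width byte string, trying each encoding in turn.

    Hypothetical helper; NUL bytes and surrounding blanks are stripped.
    """
    for enc in encodings:
        try:
            return value.decode(enc).replace("\x00", "").strip()
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest
    return value.decode("utf-8", errors="replace").replace("\x00", "").strip()
```

With this fallback order, `b'r = 0.9998 (\xb1 0.00003)'` decodes to `r = 0.9998 (± 0.00003)` instead of raising.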


##### Core dataset

In [7]:
%%time
# phy
wmos, df2, flistPHY = at.argo_gdac(gdac_path=gdac_path, dataset='phy', save_to=outdir_nc, download_individual_profs=False, skip_downloads=False, dryrun=True, overwrite_profiles=True, NPROC=1, verbose=True, checktime=True)

  gdac_index = pd.read_csv(gdac_file,delimiter=',',header=8,parse_dates=['date','date_update'],


CPU times: user 3min 25s, sys: 12.4 s, total: 3min 37s
Wall time: 4min 49s


In [8]:
from daskTools import daskTools

daskConverter = daskTools(
    db_type = "PHY",
    out_dir = outdir_pqt+'partitionPHY_300MB/',
    flist = flistPHY
)

In [12]:
len(flistBGC)

2252

In [9]:
%%time
daskConverter.convert_to_parquet()



[b'                                                                                                                                                                                                                                                                '
 b'BBP700_ADJUSTED is being filled with BBP700 directly in real time. Adjustment method may be enhanced in the future. RTQC_APPLIED 11110 RTQC_FAILED 00000                                                                                                        '
 b'BBP700_ADJUSTED is being filled with BBP700 directly in real time. Adjustment method may be enhanced in the future. RTQC_APPLIED 11110 RTQC_FAILED 00100                                                                                                        '
 b'BBP700_ADJUSTED is being filled with BBP700 directly in real time. Adjustment method may be enhanced in the future. RTQC_APPLIED 11111 RTQC_FAILED 00000                                                              



stored.
CPU times: user 3min 36s, sys: 14.3 s, total: 3min 50s
Wall time: 7min 6s


#### Done!

When we are done, we can shut down the dask cluster.

In [13]:
client.shutdown()

Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/6901472/6901472_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/3901083/3901083_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/csiro/5901699/5901699_Sprof.nc
Oops! <class 'RuntimeError'> occurred.
Fail to cast: SCIENTIFIC_CALIB_DATE 
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/csiro/5901646/5901646_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/6901026/6901026_Sprof.nc
Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/coriolis/6902736/6902736_Sprof.nc
Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COMMENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                                                                                                    

2024-08-21 13:42:37,195 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/distributed/utils_comm.py", line 459, in retry_operation
    return await retry(
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-pack

In [15]:
%%time
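import numpy as np
import dask.dataframe as dd
import pyarrow as pa
import pyarrow.parquet as pq  # provides read_schema, used below

# NOTE: read_argo, translate_pq_to_pd, VARS_PHY and flist are not defined in this cell;
# they are assumed to come from earlier cells or from the project's own modules.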
schema_path = "/vortexfs1/share/boom/data/nc2pqt_test/pqt/data/metadata/ArgoPHY_schema.metadata"
PHY_schema = pq.read_schema(schema_path)
todrop = ["DOXY","DOXY_ADJUSTED","DOXY_ADJUSTED_QC","DOXY_ADJUSTED_ERROR","DOXY_QC"]
for name in todrop:
    idx = PHY_schema.get_field_index(name)
    PHY_schema = PHY_schema.remove(idx)
pd_dict = translate_pq_to_pd(PHY_schema)


# add the field we partition on to the schema
PHY_schema = PHY_schema.append(
    pa.field('JULD_D', 
             pa.from_numpy_dtype(np.dtype('datetime64[ns]'))
            )
)

chunk = 2000
partition_on_time = True
for j in range( int(np.ceil(len(flist)/chunk)) ):
    initchunk = j*chunk 
    endchunk = (j+1)*chunk
    if endchunk > len(flist):
        endchunk = len(flist)  

    df = [ read_argo(file,pd_dict,VARS_PHY,partition_on_time) for file in flist[initchunk:endchunk] ]
    df = dd.from_delayed(df) # creating unique df from list of df    
    # df = df.repartition(partition_size="100MB")
    name_function = lambda x: f"ArgoPHY_dask_{x}.parquet"
    df.to_parquet(
        outdir_pqt+'partitionYYYYMM/',
        engine="pyarrow",
        schema = PHY_schema,
        name_function = name_function,
        partition_on = 'JULD_D'
    )
# df.compute()
print("stored.")   

Exception ignored in: <function CachingFileManager.__del__ at 0x2aaad3754f70>
Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 250, in __del__
    self.close(needs_lock=False)
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 234, in close
    file.close()
  File "src/netCDF4/_netCDF4.pyx", line 2627, in netCDF4._netCDF4.Dataset.close
  File "src/netCDF4/_netCDF4.pyx", line 2590, in netCDF4._netCDF4.Dataset._close
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID
Exception ignored in: <function CachingFileManager.__del__ at 0x2aaad391c310>
Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/xarray/backends

Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/aoml/1900167/1900167_prof.nc
/vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/aoml/1900167/1900167_prof.nc is empty, discarding.
Oops! <class 'RuntimeError'> occurred.
Fail to cast FLOAT_SERIAL_NO[('N_PROF',)] from 'object' to <class 'str'>
Can't read unique values !
Oops! <class 'RuntimeError'> occurred.
Fail to cast FIRMWARE_VERSION[('N_PROF',)] from 'object' to <class 'str'>
Can't read unique values !
Oops! <class 'RuntimeError'> occurred.
Fail to cast WMO_INST_TYPE[('N_PROF',)] from 'object' to <class 'float'>
Can't read unique values !
Oops! <class 'RuntimeError'> occurred.
Fail to cast WMO_INST_TYPE[('N_PROF',)] from 'object' to <class 'int'>
Can't read unique values !
Oops! <class 'RuntimeError'> occurred.
Fail to cast JULD_QC[('N_PROF',)] from 'object' to <class 'str'>
Can't read unique values !
Oops! <class 'RuntimeError'> occurred.
Fail to cast JULD_QC[('N_PROF',)] from 'object' to <class 'int'>
Can't read unique values

Exception ignored in: <function CachingFileManager.__del__ at 0x2aaad3751e50>
Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 250, in __del__
    self.close(needs_lock=False)
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 234, in close
    file.close()
  File "src/netCDF4/_netCDF4.pyx", line 2627, in netCDF4._netCDF4.Dataset.close
  File "src/netCDF4/_netCDF4.pyx", line 2590, in netCDF4._netCDF4.Dataset._close
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID
Exception ignored in: <function CachingFileManager.__del__ at 0x2aaad375b280>
Traceback (most recent call last):
  File "/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/venv/venv3.9/lib/python3.9/site-packages/xarray/backends

Oops! <class 'ValueError'> occurred.
Fail to cast: SCIENTIFIC_CALIB_DATE 
Encountered unique values: [b'' b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00N/'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00N/A '
 b'\x00\x00\x00\x00N/A       ' b'\x00\x00N/A         ' b'    ' b'      '
 b'        ' b'          ' b'            ' b'              '
 b'            N/' b'            PR' b'            PS' b'            TE'
 b'          N/A ' b'          PRES' b'          PSAL' b'          TEMP'
 b'        N/A   ' b'        PRES_A' b'        PSAL_A' b'      N/A     '
 b'      PRES_ADJ' b'      PSAL_ADJ' b'    N/A       ' b'    PSAL_ADJUS'
 b'    TEMP_ADJUS' b'  N/A         ' b'  PSAL_ADJUSTE' b'  TEMP_ADJUSTE'
 b' - dP, where d' b' - dS         ' b' 5 dbar for Ap' b' PRESSURE (min'
 b' dP is SURFACE' b' dP, where dP ' b' dS           ' b' dbar for Apf-'
 b' from next cyc' b'(minus 5 dbar ' b')             ' b'),TEMP,PRES_AD'
 b',PRES,e_time,a' b',PRES_ADJUSTED' b',TEMP,PRES),TE' b',alpha,tau)   '




Oops! <class 'ValueError'> occurred.
Fail to cast: SCIENTIFIC_CALIB_DATE 
Encountered unique values: [b'' b'              ' b'            No' b'            Pr'
 b'            Th' b'          No s' b'          Pres' b'          The '
 b'        No sig' b'        Pressu' b'        The qu' b'      No signi'
 b'      Pressure' b'      The quot' b'    No signifi' b'    Pressures '
 b'    The quoted' b'  No significa' b'  Pressures ad' b'  The quoted e'
 b' accuracy in d' b' accuracy with' b' adjusted for ' b' at time of la'
 b' by using pres' b' calibration. ' b' dbar.        ' b' error is manu'
 b' error is max[' b' in PSS-78.   ' b' is manufactur' b' is max[0.01, '
 b' of laboratory' b' offset at the' b' pressure adju' b' respect to IT'
 b' salinity drif' b' sea surface. ' b' significant s' b' specified acc'
 b' surface. The ' b' the sea surfa' b' to ITS-90 at ' b', 1xWJO uncert'
 b'.             ' b'. Salinity adj' b'. The quoted e' b'0.01, 1xWJO un'
 b'01, 1xWJO unce' b'1xWJO uncertai' 




Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COEFFICIENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                                                                                                                                                '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS -0.3db                                                                                                                                                                                                          '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS 0db                                                                                                                                                                                                             '
 b'CONDUCTIVITY WAS NOT ADJUSTED. COEFFICIENT r F




Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COEFFICIENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                                                                                                                                                '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS -4.1db                                                                                                                                                                                                          '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS -4.2db                                                                                                                                                                                                          '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT I



Failed on /vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/bodc/3901581/3901581_prof.nc
/vortexfs1/share/boom/data/nc2pqt_test/GDAC/dac/bodc/3901581/3901581_prof.nc is empty, discarding.
Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COEFFICIENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'from cycle 130: r = 0.9998 (\xb1 0.00003), vertically averaged sS = -0.009 (\xb1 0.002)                                                                                                                                                                               '
 b'none                                                                                                                                                                                                                                                            '
 b'r = 0.9998 (\xb1 0.00003), vertically averaged dS = -0.009 (\xb1 0.002)                                             




Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COEFFICIENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                                                                                                                                                '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS -0.1db                                                                                                                                                                                                          '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS 0db                                                                                                                                                                                                             '
 b'COEFFICIENT r FOR CONDUCTIVITY IS 1.000149, +/




Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COEFFICIENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                                                                                                                                                '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS -0.3db                                                                                                                                                                                                          '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS 0db                                                                                                                                                                                                             '
 b'CONDUCTIVITY WAS NOT ADJUSTED. COEFFICIENT r F




Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COEFFICIENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
Unique values: [b'                                                                                                                                                                                                                                                                '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS -0.1 dbar                                                                                                                                                                                                       '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT IS -0.2 dbar                                                                                                                                                                                                       '
 b'ADDITIVE COEFFICIENT FOR PRESSURE ADJUSTMENT I

In [None]:
# %%time

# if not single_process:
#     import multiprocessing

# def xr2pqt(rank,files_list,loop_id):
#     df_list = []
#     df_memory = 0
#     counter = 0
#     rank_str = "#" + str(rank) + ": "
#     nb_files = len(files_list)
#     argo_file_fail = []
#     for argo_file in files_list:
#         counter += 1
#         if counter%10==0:
#             print(rank_str + "processing file " + str(counter) + " of " + str(nb_files))
            
#         try:
#             ds = xr.load_dataset(argo_file, engine="argo") #loading into memory the profile
#         except:
#             print(rank_str + 'Failed on ' + str(argo_file))
#             argo_file_fail.append(argo_file)
#             continue

#         # some floats don't have all the vars specified in VARS
#         invars = list(set(VARS) & set(list(ds.data_vars)))
#         df = ds[invars].to_dataframe()
#         # df = ds.to_dataframe()

#         if not df.empty:
#             # for col in VARS:
#             #     if col not in invars:
#             #         df[col] = np.nan #ensures that all parquet files have all the VARS as columns
#             # df_memory += df.memory_usage().sum()/(10**6) # tracking memory usage (in MB)
#             df_list.append(df)

#             df = None
#             ds = None
#             del df
#             del ds
#             gc.collect()

#     # store to parquet once a large enough dataframe has been created
#     print(rank_str + "Storing to parquet...")

#     try:
#         df_list = pd.concat(df_list, axis=0) # it automatically adds NaNs where needed
#     except:
#         print(rank_str + 'Could not concatenate pandas dataframes')
#         print(rank_str + 'Failed on ' + str(argo_file) + '. Caution: more files might be affected.')
#         print(rank_str + 'Data frames list:')
#         print(df_list)
#         return argo_file_fail    

#     df_memory = df_list.memory_usage().sum()/(1024**2)
#     if df_memory < 1e3:
#         print(rank_str + "In-memory filesize: " + "{:.2f}".format(df_memory) + " MB")
#     else:
#         print(rank_str + "In-memory filesize: " + "{:.2f}".format(df_memory/1024) + " GB")

#     parquet_filename = outdir_pqt + "test_profiles_levels_" + str(rank) + "_" + str(loop_id) + ".parquet"
#     df_list.to_parquet(parquet_filename)
#     print(rank_str + str(parquet_filename) + " stored.")

#     df_list = None
#     del df_list
#     gc.collect()

#     return argo_file_fail
    
# ############################################################################################################

# def poolParams(flist,nc_size_per_pqt):
#     size_flist = []
#     for f in flist:
#         try:
#             f_size = os.path.getsize(f)/1024**2
#             size_flist.append( f_size ) #size in MB
#         except:
#             if not os.path.isfile(f):
#                 gdac_root = 'https://usgodae.org/pub/outgoing/argo/dac/'
#                 fpath = os.path.join( *f.split(os.path.sep)[-3:] )
#                 response = at.get_func( gdac_root + fpath )
#                 if response.status_code == 404:
#                     print('File ' + f + ' returned 404 error from URL ' + str(gdac_root+fpath) + '. Skipping it.')
#                 else:
#                     print('File ' + f + ' likely present at URL ' + str(gdac_root+fpath) + '. You might want to check why it is not in the local drive.')
#             else:
#                 print('File ' + f + ' seems to exist in the local drive: not sure what is going on here.')
#             continue
            
#     size_tot = sum(size_flist)
#     NPROC = int(np.ceil(size_tot/nc_size_per_pqt))
#     size_per_proc = size_tot/NPROC

#     print('')
#     print('Size per processor (MB)')
#     print(size_per_proc)
#     print('')
    
#     ids_sort = np.argsort(np.array(size_flist))
    
#     chunks_ids = []
#     x = np.copy(ids_sort)
    
#     for j in range(NPROC):
#         chunk_ids = []
#         chunk_size = 0
#         while ((chunk_size<size_per_proc) and (len(x) > 0)):
#             if len(chunk_ids)%2 == 0:
#                 chunk_ids.append(x[-1])
#                 x = x[:-1]
#             else:
#                 chunk_ids.append(x[0])
#                 x = x[1:]
#             chunk_size = sum(np.asarray(size_flist)[chunk_ids])
#         print(chunk_size)
#         chunks_ids.append(chunk_ids)
    
#     if len(x) > 0:
#         warnings.warn(str(len(x)) + " files have not been assigned to a processor.")
    
#     print('')
#     chunks=[]
#     skip_proc = 0
#     total_memory = 0
#     for j,chunk_ids in enumerate(chunks_ids):
#         print('Size in processor ' + str(j) + ' (MB):')
#         size_proc = sum(np.asarray(size_flist)[chunk_ids])
#         total_memory += size_proc
#         print(size_proc)
#         if size_proc == 0:
#             skip_proc += 1
#             continue
#         chunk = [flist[k] for k in chunk_ids]
#         chunks.append(chunk)

#     NPROC -= skip_proc
        
#     print('')
#     print("Using " + str(NPROC) + " processors")
    
#     return NPROC, chunks, size_per_proc

# ########################################################

# def inMemorySize(flist):
#     mem = []
#     print('Parsing in-memory usage of Argo files.')
#     print('This might take a while, and if you have an old file with this information, passing its path could be faster.')
#     print('You can also decide to use in-disk memory to optimize file size conversion by calling def poolParams() instead, although it performs worse.')
#     for file in flist:
#         try:
#             ds = xr.open_dataset(file, engine="argo")
#             mem.append( ds.nbytes/(1024*1024) )
#         except:
#             print('skipping ' + file)
#             mem.append(-1)
            
#     with open( outdir_pqt + 'inmemory_file_size.json', 'w') as f:
#         json.dump(mem, f)

#     print('done.')
    
#     return mem

# def poolParamsMem(flist,inmem_size_per_pqt,fmem_path=None):

#     if fmem_path is None:
#         fmem = inMemorySize(flist)
#     else:    
#         with open(fmem_path, 'r') as f:
#             fmem = json.load(f)
            
#     size_tot = sum(fmem)
#     NPROC = int(np.ceil(size_tot/inmem_size_per_pqt))
#     size_per_proc = size_tot/NPROC

#     print('')
#     print('Size per processor (MB)')
#     print(size_per_proc)
#     print('')
    
#     ids_sort = np.argsort(np.array(fmem))
    
#     chunks_ids = []
#     x = np.copy(ids_sort)
    
#     for j in range(NPROC):
#         chunk_ids = []
#         chunk_size = 0
#         while ((chunk_size<size_per_proc) and (len(x) > 0)):
#             if len(chunk_ids)%2 == 0:
#                 chunk_ids.append(x[-1])
#                 x = x[:-1]
#             else:
#                 chunk_ids.append(x[0])
#                 x = x[1:]
#             chunk_size = sum(np.asarray(fmem)[chunk_ids])
#         print(chunk_size)
#         chunks_ids.append(chunk_ids)
    
#     if len(x) > 0:
#         warnings.warn(str(len(x)) + " files have not been assigned to a processor.")
    
#     print('')
#     chunks=[]
#     skip_proc = 0
#     total_memory = 0
#     for j,chunk_ids in enumerate(chunks_ids):
#         size_proc = sum(np.asarray(fmem)[chunk_ids])
#         total_memory += size_proc
#         print('In-memory size in processor ' + str(j) + ' (MB): ' + str(size_proc) )
#         # print(size_proc)
#         if size_proc == 0:
#             skip_proc += 1
#             continue
#         chunk = [flist[k] for k in chunk_ids]
#         chunks.append(chunk)

#     NPROC -= skip_proc
        
#     print('')
#     print("Using " + str(NPROC) + " processors")
    
#     return NPROC, chunks, size_per_proc

# ############################################################################################################

# # Metadata
# metadata_dir = outdir_pqt + "metadata/"
# Path(metadata_dir).mkdir(parents= True, exist_ok= True)
# parquet_filename = metadata_dir + "test_metadata.parquet"
# df2.to_parquet(parquet_filename)
# print(str(parquet_filename) + " stored.")

# # Profiles
# print("Processing " + str(len(flist)) + " files.")

# if not single_process:
#     nc_size_per_pqt = 40 # Empirically, 40 MB of average .nc file size gives in-memory sizes between 100-330 MB, which is what Dask recommends
#     NPROC, chunks,size_per_proc = poolParams(flist,nc_size_per_pqt)

# # if not single_process:
# #     inmem_size_per_pqt = 300
# #     NPROC, chunks,size_per_proc = poolParamsMem(flist,inmem_size_per_pqt)

# # fixing max nb of processes to prevent a bottleneck, likely due to disk I/O queuing operations and filling up the memory
# MAXPROC = 20
# if size_per_proc > 300:
#     MAXPROC = 20

# if NPROC > MAXPROC and not single_process:
#     print("Estimated number of processors might create bottleneck issues. Forcing to use " + str(MAXPROC) + " processors at a time.")
#     # force to use at most MAXPROC processes, by looping over chunks
#     full_loops = NPROC//MAXPROC  #nb of loops to use at most MAXPROC
#     RESPROC = NPROC%MAXPROC   #nb of residual processors after the loops
    
#     i_start = 0
#     i_end   = 0
#     failed_files = []
#     for full_loop in range(full_loops):
#         i_start = MAXPROC*full_loop
#         i_end   = MAXPROC*(full_loop+1)
#         pool_obj = multiprocessing.Pool(processes=MAXPROC)
#         failed_files.append( pool_obj.starmap(xr2pqt, [(rank, chunk, full_loop) for rank, chunk in enumerate(chunks[i_start:i_end])] ) )
#         pool_obj.close()

#     # multiprocessing across residual processor pool with NPROC<MAXPROC
#     if RESPROC > 0:
#         pool_obj = multiprocessing.Pool(processes=RESPROC)
#         failed_files.append( pool_obj.starmap(xr2pqt, [(rank, chunk, full_loop+1) for rank, chunk in enumerate(chunks[i_end:])] ) )  # start at i_end: the previous slices stop just before it
#         pool_obj.close()

# elif NPROC > 1 and not single_process:
#     failed_files = []
#     pool_obj = multiprocessing.Pool(processes=NPROC)
#     failed_files.append( pool_obj.starmap(xr2pqt, [(rank, chunk, 0) for rank, chunk in enumerate(chunks)] ) )
#     pool_obj.close()

# else:
#     failed_files = xr2pqt(0,flist,0)

# failed = []
# for f in failed_files:
#     for g in f:
#         if len(g) > 0:
#             for h in g:
#                 failed.append(h)
#                 print(h)

# print('Files that encountered an error and were not converted:')
# print(failed)

#### Converting all metadata to parquet

In [None]:
parquet_filename = outdir_pqt + "test_metadata.parquet"
df_list.to_parquet(parquet_filename)
print(str(parquet_filename) + " stored.")

In [None]:
def metadata2pqt(rank,files_list,refVARS):
    rank_str = "#" + str(rank) + ": "
    
    df_list = []
    df_memory = 0
    counter = 0
    nb_files = len(files_list)
    argo_file_fail = []
    for argo_file in files_list:
        counter += 1
        if counter%10==0:
            print(rank_str + "processing file " + str(counter) + " of " + str(nb_files))
            
        try:
            ds = xr.load_dataset(argo_file, engine="argo") #loading into memory the profile
        except:
            print(rank_str + 'Failed on ' + str(argo_file))
            argo_file_fail.append(argo_file)
            continue  # skip this file: ds is undefined (or stale) if loading failed
        
        d0 = ds[refVARS]

        # adding dimension = PLATFORM_NUMBER and its value to the current array
        d0=d0.assign_coords(PLATFORM_NUMBER=ds.PLATFORM_NUMBER.values[0])
        for da in d0.data_vars:
            d0[da]=d0[da].expand_dims(dim={"PLATFORM_NUMBER": 1}, axis=0)
        
        df = d0.to_dataframe()
        df_memory += df.memory_usage().sum()/(10**6) # tracking memory usage (in MB)
        df_list.append(df)

        if df_memory > 200:
            print(rank_str + "In-memory filesize: " + "{:.2f}".format(df_memory) + " MB. This is above the recommended size for parquet.")
    
    df_list = pd.concat(df_list)

    print(rank_str + "Returning list of dataframes to main processor.")

    return df_list, argo_file_fail

####################################

flist = glob.glob("/vortexfs1/share/boom/data/gdac_snapshot/202403-ArgoData/dac/*/*/*_Sprof.nc")
# refVARS = ['PARAMETER', 'SCIENTIFIC_CALIB_EQUATION', 'SCIENTIFIC_CALIB_COEFFICIENT', 'SCIENTIFIC_CALIB_COMMENT', 'SCIENTIFIC_CALIB_DATE']
refVARS = ["DATA_TYPE","FORMAT_VERSION","HANDBOOK_VERSION","REFERENCE_DATE_TIME","DATE_CREATION","DATE_UPDATE"]
nb_of_checks = len(flist)

NPROC = 100
if NPROC > nb_of_checks:
    NPROC = nb_of_checks
CHUNK_SZ = int(np.ceil(nb_of_checks/NPROC))
chunks = batched(flist,CHUNK_SZ)

print(CHUNK_SZ)

pool_obj = multiprocessing.Pool(processes=NPROC)
outputs = pool_obj.starmap(metadata2pqt, [(rank, chunk, refVARS) for rank, chunk in enumerate(chunks)])
pool_obj.close()

df_list = []
failed_files = []
for processor_output in outputs:
    df_list.append(processor_output[0])
    if len(processor_output[1])>0:
        for f in processor_output[1]:
            failed_files.append(f)
df_list = pd.concat(df_list)

parquet_filename = outdir_pqt + "test_metadata.parquet"
df_list.to_parquet(parquet_filename)
print(str(parquet_filename) + " stored.")

print("Failed files:")
print(failed_files)

#### Reading from parquet

There are a couple of ways to read parquet files. One is directly using pandas (make sure you have pyarrow, fastparquet or another suitable engine installed), the other is with Dask. Generally speaking, you'll want to use Dask if you need a large amount of data at the same time, so that you can benefit from its parallelization. You should avoid Dask and just go for pandas whenever the data fits in your RAM.

When reading parquet files with pandas, you can either specify the file name if you know which file you want, or the directory containing all the parquet files. In the latter case, if you apply any filter, pandas and pyarrow will sort through all the files in the folder, reading into memory only the subsets that satisfy your filter.

In [None]:
sel = [("PLATFORM_NUMBER", "==", 6901494)]
df = pd.read_parquet( glob.glob(outdir_pqt + "test_profiles*") , engine='pyarrow', filters = sel )
sorted(df.columns.to_list())

In [None]:
df["PSAL_ADJUSTED"]

In [None]:
# Example with metadata file
sel = [("PLATFORM_NUMBER", "==", 6990526)]
df = pd.read_parquet( parquet_filename , engine='pyarrow', filters = sel )
df[ ["DATA_TYPE", "DATE_CREATION"] ]

#### Testing conversion

The following cell performs integrity tests on the parquet files. As the number of floats, profiles, and variables is large, the check is performed over all the variables, but not all the files. For each variable in `VARS`, files are randomly selected from the input list (up to 5 per variable in each process, as set by `nb_of_checks_per_var`) and, for each of them, the selected variable is compared to the one obtained from the parquet file. Each of these checks can:
* succeed, if the nc and parquet variables are equal
* fail, if the nc and parquet variables are not equal
* be skipped, if the randomly selected file does not contain the selected variable

If a file is skipped, another one is randomly selected, until a minimum number of files that contain the selected variable is found. For each variable, no file can be randomly picked two or more times (it can happen across variables).

The variables `successes` and `fails` contain the file id and the name of the variable for which the check was successful or failed.
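
The selection rule described above (draw files at random, never the same one twice for a given variable, and keep drawing until enough files containing the variable have been found) can be sketched in isolation. This is not the notebook's implementation: it shuffles the candidates once instead of repeatedly calling `np.random.choice` on the remaining indices, and `pick_files_with_var` and the `file_vars` mapping are hypothetical names standing in for opening each netCDF file.

```python
import random

def pick_files_with_var(file_vars, target_var, n_checks, seed=None):
    """Draw files at random, without repeats, until n_checks contain target_var.

    file_vars is a hypothetical mapping {file name: set of variable names}.
    """
    rng = random.Random(seed)
    candidates = list(file_vars)
    rng.shuffle(candidates)  # walking a shuffled list never draws a file twice

    checked, skipped = [], []
    for name in candidates:
        if len(checked) == n_checks:
            break
        if target_var in file_vars[name]:
            checked.append(name)   # file contains the variable: it gets compared
        else:
            skipped.append(name)   # drawn, but lacks the variable: draw another
    return checked, skipped

toy = {
    "a.nc": {"PSAL", "DOXY"},
    "b.nc": {"PSAL"},
    "c.nc": {"PSAL", "DOXY"},
    "d.nc": {"TEMP"},
}
checked, skipped = pick_files_with_var(toy, "DOXY", n_checks=2, seed=0)
print(checked, skipped)
```

If fewer than `n_checks` files contain the variable, the loop simply runs out of candidates, which matches the stopping condition on `len(rand_idces) < len(flist)` in the cell below.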

In [None]:
def checkVars(rank, flist, VARS):

    rank_str = "#" + str(rank) + ": "
    
    rand_max = len(flist)
    nb_of_checks_per_var = np.min( [5,len(flist)] ) #int(np.ceil(rand_share*rand_max))
    nb_of_checks = nb_of_checks_per_var*len(VARS)

    print(rank_str + "Checking " + str(nb_of_checks) + " random files.")
    
    check_nb = 0
    successes = []
    fails = []
    skipped = []
    for v in range(len(VARS)):
    
        rand_idces = []
        target_var = VARS[v]
    
        r = 0
        while ((r < nb_of_checks_per_var) and (len(rand_idces) < len(flist) )):
            print(rank_str + "Check " + str(r) + " of " + str(nb_of_checks_per_var) )
            
            check_nb += 1
            rand_avail = np.arange(0,rand_max)[~np.isin(np.arange(0,rand_max), rand_idces)]
            rand_idx = np.random.choice( rand_avail )
            rand_idces.append(rand_idx)

            try:
                ref_ds = xr.load_dataset(flist[rand_idx], engine="argo")
            except:
                print(rank_str + 'Failed on ' + str(flist[rand_idx]))
                continue
                
            # print(rank_str + "Reading file " + flist[rand_idx] )
            ref_platform = ref_ds.PLATFORM_NUMBER.values[0]
        
            invars = list(set(VARS) & set(list(ref_ds.data_vars)))
            
            if target_var not in invars:
                ref_var = None
                del ref_var
                gc.collect()

                skipped.append( (rand_idx, target_var ) )
                # print(rank_str + "Current random file does not contain variable " + target_var + ", skipping this check.")
                continue
                
            print(rank_str + "Checking " + target_var + " in platform number " + str(ref_platform) + ".")
        
            dim0 = ref_ds.sizes["N_PROF"]
            dim1 = ref_ds.sizes["N_LEVELS"]
        
            if np.issubdtype(ref_ds[target_var].dtype, np.datetime64):
                ref_var = np.zeros( dim0*dim1, dtype='datetime64[ns]' )
            else:
                ref_var = np.zeros( dim0*dim1, dtype=np.float64 )
        
            for j in range(dim0):
                for k in range(dim1):
                    ref_id = j*dim1+k
                    if len(ref_ds[target_var].dims) > 1:
                        ref_var[ref_id] = ref_ds[target_var][j,k].values
                    else:
                        ref_var[ref_id] = ref_ds[target_var][j].values
        
            sel_pqt = [("PLATFORM_NUMBER", "==", ref_platform)]

            try:
                df_pqt = pd.read_parquet( outdir_pqt , engine='pyarrow', filters = sel_pqt )
            except:
                print("Loading parquet file failed for platform " + str(ref_platform) + "!")
                continue
    
            if target_var not in df_pqt.columns.tolist():
                fails.append( (flist[rand_idx], target_var ) )  # store the file name, as in the other success/fail records
                r += 1
                print(rank_str + "Warning: " + target_var + " not found in parquet file.")
                continue
        
            df_pqt_var = df_pqt[target_var].values
        
            success = np.array_equal(ref_var, df_pqt_var, equal_nan=True)

            ref_var = None
            df_pqt_var = None
            del ref_var
            del df_pqt_var
            gc.collect()
            
            if success:
                successes.append( (flist[rand_idx], target_var ) )
            else:
                fails.append( (flist[rand_idx], target_var ) )
    
            r += 1

    print(rank_str + "All checks in process done")
    print(rank_str +  str(len(successes)) + " checks were successful.")
    print(rank_str +  str(len(fails)) + " checks failed.")

    return successes, fails

############################################################################################################

nb_of_checks = len(flist)

NPROC = 20
CHUNK_SZ = int(np.ceil(nb_of_checks/NPROC))
chunks = batched(flist,CHUNK_SZ)

print(CHUNK_SZ)

# print(list(chunks))

print("Checking " + str(nb_of_checks) + " random files.")
print("")

# print([(rank, chunk) for rank, chunk in enumerate(chunks)])
pool_obj = multiprocessing.Pool(processes=NPROC)
outputs = pool_obj.starmap(checkVars, [(rank, chunk, VARS) for rank, chunk in enumerate(chunks)])
pool_obj.close()
print("")
print("All checks were done.")

successes = []
fails = []
for output in outputs:
    successes.append(output[0])
    fails.append(output[1])

print("Successful tests file names and variable name.")
print(successes)
print("Failed tests file names and variable name.")
print(fails)

#### Testing conversion

The following cell checks that the parquet files contain all the floats, by checking that all original platform numbers exist. Note: it *does not* check that the variables of the original float exist and are correct in the parquet file (see previous cell for this).
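
One detail worth keeping in mind for this kind of check: with the pyarrow engine, a filtered read that matches no rows typically returns an empty dataframe rather than raising an error, so presence is more safely tested on the values than on whether the read succeeds. A minimal sketch under that assumption, where `missing_platforms` is a hypothetical helper and the toy dataframe stands in for the parquet database:

```python
import pandas as pd

def missing_platforms(expected_platforms, df_pqt):
    """Return the expected platform numbers that do not appear in df_pqt."""
    found = set(df_pqt["PLATFORM_NUMBER"].unique().tolist())
    return sorted(set(expected_platforms) - found)

# Toy stand-in for the parquet database (values are illustrative only).
df_toy_pqt = pd.DataFrame({"PLATFORM_NUMBER": [6901494, 6901494, 6990526]})
missing = missing_platforms([6901494, 6990526, 1902304], df_toy_pqt)
print(missing)
```

On the real database, the same comparison could be fed by loading just that column, for instance with `pd.read_parquet(outdir_pqt, columns=["PLATFORM_NUMBER"])`, provided the column fits in memory.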

In [None]:
def checkPlatformNb(rank,flist):

    rank_str = "#" + str(rank) + ": "
    outdir_pqt = '/vortexfs1/share/boom/data/nc2pqt_test/PQT/'
    
    check_nb = 0
    successes = []
    fails = []
    failed_on_read = []

    for idx in range(len(flist)):
    
        check_nb += 1
        try:
            ref_ds = xr.load_dataset(flist[idx], engine="argo")
        except:
            print(rank_str + 'Failed on ' + str(flist[idx]))
            failed_on_read.append(flist[idx])
            continue
        
        ref_platform = ref_ds.PLATFORM_NUMBER.values[0]

        ref_ds = None
        del ref_ds
        gc.collect()
        
        sel_pqt = [("PLATFORM_NUMBER", "==", ref_platform)]
        try:
            df_pqt = pd.read_parquet( outdir_pqt , engine='pyarrow', filters = sel_pqt )
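        except:
            fails.append( (idx ) )
            continue

        # a filter that matches nothing returns an empty dataframe instead of raising
        if len(df_pqt) == 0:
            fails.append( (idx ) )
            continue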

        df_pqt = None
        del df_pqt
        gc.collect()
        successes.append( (idx ) )
        
        if check_nb%10 == 0:
            print(rank_str + "Check " + str(check_nb) + " of " + str(len(flist)) + " done.")

    print(rank_str + "All checks in process done")
    print(rank_str +  str(len(successes)) + " checks were successful.")
    print(rank_str +  str(len(fails)) + " checks failed.")
    if len(failed_on_read)>0:
        print(rank_str +  str(len(failed_on_read)) + " original Argo file(s) could not be loaded, likely due to errors in the original file. These files were likely never converted to parquet.")
        print("File list:")
        print(failed_on_read)
    else:
        print(rank_str +  str(len(failed_on_read)) + " original Argo file(s) could not be loaded.")

############################################################################################################

flist = glob.glob("/vortexfs1/share/boom/data/gdac_snapshot/202403-ArgoData/dac/coriolis/*/*_Sprof.nc")

nb_of_checks = len(flist)

NPROC = 20
CHUNK_SZ = int(np.ceil(nb_of_checks/NPROC))
chunks = batched(flist,CHUNK_SZ)

print(CHUNK_SZ)

# print(list(chunks))

print("Checking " + str(nb_of_checks) + " random files.")
print("")

pool_obj = multiprocessing.Pool(processes=NPROC)
pool_obj.starmap(checkPlatformNb, [(rank, chunk) for rank, chunk in enumerate(chunks)])
pool_obj.close()

print("")
print("All checks were done.")

### Example loading Sprof from snapshot
```
ds = xr.load_dataset('/vortexfs1/share/boom/data/gdac_snapshot/202403-ArgoData/dac/aoml/1902304/1902304_Sprof.nc')
df = ds.to_dataframe()
```

In [None]:

from dotenv import load_dotenv
load_dotenv()
import os
from pyarrow import fs
s3 = fs.S3FileSystem(region='us-east-1')


In [None]:
s3

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import Table

ds = xr.load_dataset('/vortexfs1/share/boom/data/gdac_snapshot/202403-ArgoData/dac/aoml/1902304/1902304_Sprof.nc',engine="argo")
df = ds[['DOXY','PRES','NITRATE','PLATFORM_NUMBER']].to_dataframe()

s3_filepath = 'data.parquet'

pq.write_to_dataset(
    Table.from_pandas(df),
    s3_filepath,
    filesystem=s3,
    use_dictionary=True,
    compression="snappy",
    version="2.4",
)

