# Creating a virtual Zarr store for MUR SST

This notebook uses VirtualiZarr and Icechunk to create a virtual Zarr store for MUR SST up to 2023-09-23. However, on some dates the encoding or chunking differs from the default (which I infer to be the settings used for the majority of the data), so those dates are written as native Zarr instead.

The default encoding is `[{'id': 'shuffle', 'elementsize': 2}, {'id': 'zlib', 'level': 6}]` and the default chunk shape is `(1, 1023, 2047)` (time, lat, lon) for the main data variable `analysed_sst`. Changes to these 2 settings are detailed below where you find the word "report".
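To make the filter list concrete, here is a minimal sketch of what `shuffle` (element size 2) followed by `zlib` (level 6) does to a chunk's bytes, using only the Python standard library. It assumes filters are applied in the listed order when writing and reversed when reading. The function names and the synthetic `raw` buffer are illustrative and are not part of this notebook's helpers.

```python
import struct
import zlib


def shuffle(data: bytes, elementsize: int) -> bytes:
    """Byte shuffle: gather byte 0 of every element, then byte 1, and so on."""
    return b"".join(data[i::elementsize] for i in range(elementsize))


def unshuffle(data: bytes, elementsize: int) -> bytes:
    """Inverse of shuffle: interleave the byte planes back into elements."""
    n = len(data) // elementsize
    out = bytearray(len(data))
    for i in range(elementsize):
        out[i::elementsize] = data[i * n:(i + 1) * n]
    return bytes(out)


def encode(raw: bytes) -> bytes:
    # Default order when writing: shuffle first, then zlib level 6.
    return zlib.compress(shuffle(raw, 2), 6)


def decode(stored: bytes) -> bytes:
    # Reading reverses the list: inflate first, then unshuffle.
    return unshuffle(zlib.decompress(stored), 2)


# Synthetic 2-byte integers standing in for one chunk (not real MUR SST data).
raw = struct.pack("<1000h", *range(1000))
stored = encode(raw)
print("round trip ok:", decode(stored) == raw)

# A chunk written in the other order (zlib, then shuffle) is not readable
# with the default decoding pipeline.
swapped = shuffle(zlib.compress(raw, 6), 2)
try:
    ok = decode(swapped) == raw
except (zlib.error, ValueError):
    ok = False
print("default pipeline reads swapped-order chunk:", ok)
```

Since decoding depends on the filter order, a chunk whose filters were applied in a different order cannot be read with the default pipeline.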

## Methodology

Mostly, going year by year, the steps are as follows:

1. Initiate the Icechunk repo or open the existing repo.

### For most data, append a virtual dataset

2. List the files on S3 from that year (see `helpers.list_mur_sst_files`). Then create a list of the corresponding `.dmrpp` files.
3. Create virtual representations of those files using VirtualiZarr's `dmrpp` reader and concatenate them into one virtual dataset.
4. Instantiate a writable icechunk session for the repo, and use the store accessor for that session to append the virtual dataset to the existing Icechunk store (`virtual_dataset.virtualize.to_icechunk(store, append_dim='time')`).
5. (Optional but recommended) Once you get a commit response (snapshot id), validate the newly written data with `helpers.validate_data`.

### For days with different encoding or chunking

2. Use `dask_write_zarr.update` to append those dates as native Zarr. This function resizes the arrays in the existing Zarr store to accommodate the new data, breaks up the files and the variables in each file into separate tasks, and writes that data to Zarr in parallel. This takes about 1 minute per day. This process could definitely be improved, as memory errors arose when trying to write 3 or more days at a time.
3. (Optional but recommended) Once you get a commit response, validate the newly written data.
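The two branches above amount to splitting a date range into consecutive runs of "virtual" days and "native Zarr" days. The sketch below shows one way to compute those runs, assuming the dates with non-default encoding are already known. `split_segments` is a hypothetical function, not part of `helpers` or `dask_write_zarr`; the 2003 example uses the one exceptional date reported later in this notebook.

```python
from datetime import date, timedelta


def split_segments(start: date, end: date, native_dates: set[date]):
    """Return (kind, first_day, last_day) runs covering start..end inclusive."""
    segments = []
    day = start
    while day <= end:
        kind = "zarr" if day in native_dates else "virtual"
        if segments and segments[-1][0] == kind:
            # Extend the current run by one day.
            segments[-1] = (kind, segments[-1][1], day)
        else:
            # Start a new run.
            segments.append((kind, day, day))
        day += timedelta(days=1)
    return segments


segments_2003 = split_segments(
    date(2003, 1, 1), date(2003, 12, 31), {date(2003, 9, 11)}
)
for kind, first, last in segments_2003:
    print(kind, first, last)
```

Each "virtual" run would then go through the append steps, and each "zarr" run through `dask_write_zarr.update`.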



## Todos for this workflow

- [x] Use dask for distributed writing of zarr (https://icechunk.io/icechunk-python/examples/dask_write/)
- [x] Add information on the datasets with different encodings
- [ ] Estimate total time to create 2004-2024 and associated cost
- [ ] Remove old stores (local and s3)
- [ ] Nice to have: Improve validation by reusing an already open icechunk store, since opening the store is slow.
- [ ] Use something other than dask, since it is still error-prone and slow when memory is not managed properly (the current limit on an instance with 60GB of memory is 2 days at a time, although this completes in 2 minutes).

## Todos for this dataset

- [ ] Validate with PO.DAAC
- [ ] If validated, publish and document in VEDA datastore
- [ ] Continue with a new store for 2023-09-04 to the current day.

# 0. Setup

## Make sure the required versions of icechunk, virtualizarr, xarray and zarr-python are installed

In [1]:
#!pip install git+https://github.com/zarr-developers/VirtualiZarr.git@ab/upgrade-icechunk#egg=VirtualiZarr[icechunk]
!pip list | grep -E '^(virtualizarr|icechunk|zarr|xarray)\s'

icechunk                  0.1.0a12
virtualizarr              1.2.1.dev20+g079e480
xarray                    2025.1.1
zarr                      3.0.1


## Import packages

In [2]:
import fsspec
import xarray as xr
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

import dask_write_zarr as dwz
import helpers

In [3]:
import zarr

zarr.config.set({
    'async': {'concurrency': 100, 'timeout': None},
    'threading': {'max_workers': None}
})

<donfig.config_obj.ConfigSet at 0x7fc8622ef850>

# 1. Start a dask cluster

The dask cluster helps parallelize both the generation of references and the computation used for validation.

In [97]:
from dask.distributed import Client
# for zarr
client = Client(n_workers=4, threads_per_worker=1)
#client = Client(n_workers=8)
client

Client: LocalCluster (distributed.LocalCluster), connection method: Cluster object
Dashboard: /user/abarciauskas-bgse/proxy/41577/status
Status: running, using processes: True
Workers: 4, total threads: 4, total memory: 60.62 GiB
Each worker: 1 thread, 15.16 GiB memory


In [96]:
#client.shutdown()

# 2. Initialize file stores for reading and writing

## 2a. Initialize a filesystem for accessing the MUR SST data files.

In [5]:
fs = fsspec.filesystem("s3", anon=False)

## 2b. Initialize the store we are writing to (icechunk).

**NOTE:** If you are only appending to the store, `overwrite` should be `False`.

If overwriting an existing s3 store, you may need to run the following lines:

<code>
!pip install awscli
!aws s3 rm --recursive s3://nasa-veda-scratch/icechunk/{store_name}
</code>

In [116]:
repo = helpers.find_or_create_icechunk_repo(
    store_name="MUR-JPL-L4-GLOB-v4.1-virtual-v4",
    store_type="s3",
    overwrite=False
)

## Optional: Check the current state of the store

### Verify the store by opening it with xarray

Note how long it takes to open as well.

In [100]:
%%time
import xarray as xr
session = repo.readonly_session(branch="main")
xds = xr.open_zarr(session.store, consolidated=False)

CPU times: user 10.3 s, sys: 2.44 s, total: 12.7 s
Wall time: 19.9 s


In [101]:
xds

Dataset summary (variable names were not preserved in this export). The output lists four dask-backed arrays, each with shape (7156, 17999, 36000):

Data type, Chunk shape, Chunk size, Total size, Dask chunks
float64, "(1, 1023, 2047)", 15.98 MiB, 33.74 TiB, 2318544
float64, "(1, 1023, 2047)", 15.98 MiB, 33.74 TiB, 2318544
float64, "(1, 1447, 2895)", 31.96 MiB, 33.74 TiB, 1209364
float32, "(1, 1447, 2895)", 15.98 MiB, 16.87 TiB, 1209364
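The chunk counts and sizes in this summary can be reproduced with a little arithmetic, which is a quick way to check that the store has the expected shape and chunking. This is a minimal sketch; `chunk_stats` is a hypothetical function, and the shapes, chunk shapes and item sizes are copied from the output above.

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, total size in TiB, chunk size in MiB)."""
    # Ceiling division per dimension, using integers only.
    n_chunks = math.prod(-(-s // c) for s, c in zip(shape, chunks))
    total_tib = math.prod(shape) * itemsize / 2**40
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    return n_chunks, total_tib, chunk_mib


shape = (7156, 17999, 36000)  # (time, lat, lon)

for chunks, itemsize in [((1, 1023, 2047), 8), ((1, 1447, 2895), 8), ((1, 1447, 2895), 4)]:
    n_chunks, total_tib, chunk_mib = chunk_stats(shape, chunks, itemsize)
    print(chunks, itemsize, n_chunks, f"{total_tib:.2f} TiB", f"{chunk_mib:.2f} MiB")
```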


## Check the snapshots

In [8]:
for snapshot in repo.ancestry(branch="main"):
    print(f"{snapshot.message}, snapshot_id: {snapshot.id}")

Repository initialized, snapshot_id: RH014HBQ7X9TSG5K9A3G


## Other useful tips

### Resetting the repo to a previous snapshot

In [9]:
#repo.reset_branch(branch="main", snapshot_id='PHXYBRNBJ7T70SS6DSVG')

### Inspecting a store with async generators

In [None]:
# async def fn():
#     return [item async for item in store.list_dir('/')]

# await fn()

# 3. Create initial store with data from 2002

## 3a. List, virtualize and concatenate datasets

This step uses the dmrpp reader of VirtualiZarr. This reader makes this process very fast since we don't actually have to open and read any of the original files.

In [10]:
mur_sst_files_2002 = helpers.list_mur_sst_files(start_date="2002-06-01", end_date="2002-12-31")
mur_sst_dmrpps_2002 = [f + '.dmrpp' for f in mur_sst_files_2002]
virtual_ds_2002 = helpers.create_virtual_ds(dmrpps=mur_sst_dmrpps_2002)

In [11]:
# sanity check
len(mur_sst_dmrpps_2002)

214
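The count can be cross-checked without touching S3, assuming there is exactly one file per day and that every file follows the naming pattern seen in the paths printed elsewhere in this notebook (date, then `090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc`). This is a sketch only: `expected_mur_sst_files` is a hypothetical stand-in and not the implementation of `helpers.list_mur_sst_files`.

```python
from datetime import date, timedelta

# Prefix and suffix copied from file paths printed in this notebook's outputs.
BUCKET_PREFIX = "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1"
SUFFIX = "090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"


def expected_mur_sst_files(start_date: str, end_date: str) -> list[str]:
    """Build one expected file path per day, start and end inclusive."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    n_days = (end - start).days + 1
    return [
        f"{BUCKET_PREFIX}/{start + timedelta(days=i):%Y%m%d}{SUFFIX}"
        for i in range(n_days)
    ]


files_2002 = expected_mur_sst_files("2002-06-01", "2002-12-31")
dmrpps_2002 = [f + ".dmrpp" for f in files_2002]
print(len(dmrpps_2002))
print(dmrpps_2002[0])
```

If the length differs from what the listing function returns, a day is probably missing or duplicated in the source bucket.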

## 3b. Write to icechunk

In [13]:
%%time
session = repo.writable_session("main")
store = session.store
virtual_ds_2002.virtualize.to_icechunk(store)
session.commit("Wrote 2002 data")

CPU times: user 2.27 s, sys: 111 ms, total: 2.38 s
Wall time: 2.64 s


'QGKV998ZHQDVWRKY72VG'

## 3c. Validate

In [14]:
%%time
helpers.validate_data(store, dates=["2002-06-01", "2002-12-31"], fs=fs)

Open icechunk store...
Computing icechunk store result...
Icechunk store result: 284.47378231035907
Opening original files...
Computing original files result
Result from original files: 284.47378231035907
CPU times: user 10.2 s, sys: 1.23 s, total: 11.5 s
Wall time: 4min 13s


# 4. 2003

One file in 2003 (2003-09-11) has a different encoding, so the list of 2003 files is split into 3 parts: the dates before it, the date itself, and the dates after it. All dates apart from 2003-09-11 are written as virtual datasets. The data for 2003-09-11 is written as native Zarr.

Run `helpers.check_codecs` with a list of virtual datasets to check that all codecs are the same.

## 4a. Discover files with different codecs

In [15]:
mur_sst_files_2003 = helpers.list_mur_sst_files(start_date="2003-01-01", end_date="2003-12-31")
mur_sst_files_2003_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2003]
vdss = [helpers.open_virtual(f) for f in mur_sst_files_2003_dmrpps]
helpers.check_codecs(vdss)

Codec(compressor=None, filters=[{'id': 'zlib', 'level': 6}])
s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20030911090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc



**Encoding report:**
There's just one different encoding this year: 2003-09-11, which doesn't have the expected shuffle operation. We will write that file as Zarr.
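The check itself boils down to comparing each file's filter list with the default. Below is a minimal sketch of that comparison on plain dictionaries. `find_non_default` is a hypothetical function and not the implementation of `helpers.check_codecs`. The 2003-09-11 entry mirrors the output above, while the 2003-09-10 entry is assumed to carry the default filters.

```python
DEFAULT_FILTERS = [{"id": "shuffle", "elementsize": 2}, {"id": "zlib", "level": 6}]


def find_non_default(filters_by_file: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Return only the files whose filter list differs from the default.

    List equality is order-sensitive, so a swapped filter order is also flagged.
    """
    return {
        path: filters
        for path, filters in filters_by_file.items()
        if filters != DEFAULT_FILTERS
    }


filters_by_file = {
    # Assumed to use the default filters (illustrative entry).
    "20030910090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc": [
        {"id": "shuffle", "elementsize": 2},
        {"id": "zlib", "level": 6},
    ],
    # Reported above: zlib only, no shuffle.
    "20030911090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc": [
        {"id": "zlib", "level": 6},
    ],
}

different = find_non_default(filters_by_file)
for path, filters in different.items():
    print(path, filters)
```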

## 4b. Write first set of files as virtual datasets using the DMRPP reader

In [16]:
mur_sst_files_2003_1 = helpers.list_mur_sst_files(start_date="2003-01-01", end_date="2003-09-10")
mur_sst_files_2003_1_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2003_1]
virtual_ds_2003_1 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2003_1_dmrpps)

In [17]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2003_1.virtualize.to_icechunk(store, append_dim='time')
session.commit("Wrote first part of 2003 data")

'YT2DG5DSBYECYMM3VY60'

In [18]:
%%time
session = repo.readonly_session(branch="main")
store = session.store
helpers.validate_data(store, dates=["2003-01-01", "2003-09-10"], fs=fs)

Open icechunk store...
Computing icechunk store result...
Icechunk store result: 284.3320804300873
Opening original files...
Computing original files result
Result from original files: 284.3320804300873
CPU times: user 11.9 s, sys: 1.51 s, total: 13.4 s
Wall time: 5min 5s


## 4c. Write data with different encoding as zarr

In [19]:
%%time
dwz.update(
    repo=repo,
    start_date="2003-09-11 09:00",
    end_date="2003-09-11 09:00",
    fs=fs,
    client=client
)

Opening files
Files opened
Mapping write tasks to dask client




Starting distributed commit
Distributed commit done
CPU times: user 5.32 s, sys: 672 ms, total: 6 s
Wall time: 1min 42s


In [20]:
%%time
session = repo.readonly_session(branch="main")
store = session.store
helpers.validate_data(store, dates=["2003-09-11", "2003-09-11"], fs=fs)

Open icechunk store...
Computing icechunk store result...
Icechunk store result: 285.29159611231097
Opening original files...
Computing original files result
Result from original files: 285.29159611231097
CPU times: user 887 ms, sys: 90.6 ms, total: 978 ms
Wall time: 14.2 s


## 4d. Write the rest of 2003 as virtual data

In [22]:
mur_sst_files_2003_2 = helpers.list_mur_sst_files(start_date="2003-09-12", end_date="2003-12-31")
mur_sst_files_2003_2_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2003_2]
virtual_ds_2003_2 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2003_2_dmrpps)

In [23]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2003_2.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote to end of 2003.")

'G8BHPTWYMTWD97V15Q7G'

## 4e. Validate

In [24]:
%%time
helpers.validate_data(store, dates=["2003-09-12", "2003-12-31"], fs=fs)

Open icechunk store...
Computing icechunk store result...
Icechunk store result: 283.9340395773743
Opening original files...
Computing original files result
Result from original files: 283.9340395773743
CPU times: user 5.83 s, sys: 716 ms, total: 6.55 s
Wall time: 2min


# 5. Append 2004

## 5a. List files

In [25]:
dates = ['2004-01-01', '2004-12-31']
mur_sst_files_2004 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2004_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2004]

In [26]:
len(mur_sst_files_2004_dmrpps)

366

## 5b. Write data

In [27]:
virtual_ds_2004 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2004_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2004.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2004 to store.")

'W2ESV77NNTJ5AADD431G'

## 5c. Validate data

In [28]:
%%time
helpers.validate_data(store, dates=dates, fs=fs)

Open icechunk store...
Computing icechunk store result...
Icechunk store result: 284.13816998533554
Opening original files...
Computing original files result
Result from original files: 284.13816998533554
CPU times: user 18.8 s, sys: 1.87 s, total: 20.7 s
Wall time: 6min 25s


# 6. Let's try 2 years! 2005-2006

## 6a. List files

In [9]:
dates = ['2005-01-01', '2006-12-31']
mur_sst_files_2005_2006 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2005_2006_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2005_2006]

In [10]:
len(mur_sst_files_2005_2006_dmrpps)

730

## 6b. Write data

In [11]:
virtual_ds_2005_2006 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2005_2006_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2005_2006.virtualize.to_icechunk(store, append_dim='time')

In [12]:
session.commit(f"Wrote 2005-2006 to store.")

'APJNASJV5VB0KYXNN200'

## 6c. Validate data

In [14]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 7. Let's try 5 years! 2007 through end of 2011

## 7a. List files

In [13]:
dates = ['2007-01-01', '2011-12-31']
mur_sst_files_2007_2011 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2007_2011_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2007_2011]

In [15]:
len(mur_sst_files_2007_2011_dmrpps)

1826

## 7b. Write data

In [16]:
virtual_ds_2007_2011 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2007_2011_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2007_2011.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2007-2011 to store.")

'ZRW8E4MY7GJ66QHRWP5G'

## 7c. Validate data

In [17]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 8. 2012

## 8a. List files

In [18]:
dates = ['2012-01-01', '2012-12-31']
mur_sst_files_2012 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2012_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2012]

In [19]:
len(mur_sst_files_2012_dmrpps)

366

## 8b. Write data

In [20]:
virtual_ds_2012 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2012_dmrpps)

In [21]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2012.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2012 to store.")

'4DZA38KYMVR39EKW4RNG'

## 8c. Validate data

In [22]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 9. 2013

## 9a. List files

In [23]:
dates = ['2013-01-01', '2013-12-31']
mur_sst_files_2013 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2013_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2013]

In [24]:
len(mur_sst_files_2013_dmrpps)

365

## 9b. Write data

In [25]:
virtual_ds_2013 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2013_dmrpps)

In [26]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2013.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2013 to store.")

'3X1PR4MSWQQ42KPTQM1G'

## 9c. Validate data

In [27]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 10. 2014

## 10a. List files

In [28]:
dates = ['2014-01-01', '2014-12-31']
mur_sst_files_2014 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2014_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2014]

In [29]:
len(mur_sst_files_2014_dmrpps)

365

## 10b. Write data

In [30]:
virtual_ds_2014 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2014_dmrpps)

In [31]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2014.virtualize.to_icechunk(store, append_dim='time')

In [32]:
session.commit(f"Wrote 2014 to store.")

'A6EAP944RQTYDW559P50'

## 10c. Validate data

In [33]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 11. 2015

## 11a. List files

In [34]:
dates = ['2015-01-01', '2015-12-31']
mur_sst_files_2015 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2015_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2015]

In [35]:
len(mur_sst_files_2015_dmrpps)

365

## 11b. Write data

In [36]:
virtual_ds_2015 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2015_dmrpps)

In [37]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2015.virtualize.to_icechunk(store, append_dim='time')

In [38]:
session.commit(f"Wrote 2015 to store.")

'P68R1FJD7FN38PKG6Y1G'

## 11c. Validate data

In [39]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 12. 2016

## 12a. List files

In [40]:
dates = ['2016-01-01', '2016-12-31']
mur_sst_files_2016 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2016_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2016]

In [41]:
len(mur_sst_files_2016_dmrpps)

366

## 12b. Write data

In [42]:
virtual_ds_2016 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2016_dmrpps)

In [43]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2016.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2016 to store.")

'7PJ4TPQ31MJ8H621GVZG'

In [44]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 13. 2017

## 13a. List files

In [45]:
%%time
dates = ['2017-01-01', '2017-12-31']
mur_sst_files_2017 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2017_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2017]

CPU times: user 2.21 ms, sys: 0 ns, total: 2.21 ms
Wall time: 2.19 ms


In [46]:
len(mur_sst_files_2017_dmrpps)

365

## 13b. Write data

In [47]:
%%time
virtual_ds_2017 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2017_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2017.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2017 to store.")

CPU times: user 23 s, sys: 689 ms, total: 23.7 s
Wall time: 41.7 s


'74KK4TSA62NARFF89VQ0'

## 13c. Validate data

In [48]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 14. 2018

## 14a. List files

In [49]:
%%time
dates = ['2018-01-01', '2018-12-31']
mur_sst_files_2018 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2018_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2018]

CPU times: user 1.67 ms, sys: 0 ns, total: 1.67 ms
Wall time: 1.66 ms


In [50]:
len(mur_sst_files_2018_dmrpps)

365

## 14b. Write data

In [51]:
%%time
virtual_ds_2018 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2018_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2018.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2018 to store.")

CPU times: user 23.6 s, sys: 763 ms, total: 24.4 s
Wall time: 42.4 s


'KSS8NH4RB6XHCK9KSEF0'

## 14c. Validate data

In [52]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 15. 2019

## 15a. List files

In [53]:
%%time
dates = ['2019-01-01', '2019-12-31']
mur_sst_files_2019 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2019_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2019]

CPU times: user 1.67 ms, sys: 0 ns, total: 1.67 ms
Wall time: 1.66 ms


In [54]:
len(mur_sst_files_2019_dmrpps)

365

## 15b. Write data

In [55]:
%%time
virtual_ds_2019 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2019_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2019.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2019 to store.")

CPU times: user 24.3 s, sys: 875 ms, total: 25.2 s
Wall time: 44.1 s


'WZBJGAX7QJ7VESNE4RJG'

## 15c. Validate data

In [56]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 16. 2020

## 16a. List files

In [57]:
%%time
dates = ['2020-01-01', '2020-12-31']
mur_sst_files_2020 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2020_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2020]

CPU times: user 1.71 ms, sys: 77 μs, total: 1.79 ms
Wall time: 1.77 ms


In [58]:
len(mur_sst_files_2020_dmrpps)

366

## 16b. Write data

In [59]:
%%time
virtual_ds_2020 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2020_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2020.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2020 to store.")

CPU times: user 24.6 s, sys: 836 ms, total: 25.4 s
Wall time: 43.7 s


'QGKT50J6EV1AK10QP9C0'

## 16c. Validate data

In [60]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

# 17. 2021

## 17a. List files with different encodings

In [61]:
mur_sst_files_2021 = helpers.list_mur_sst_files(start_date="2021-01-01", end_date="2021-12-31")
mur_sst_files_2021_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2021]
vdss = [helpers.open_virtual(f) for f in mur_sst_files_2021_dmrpps]
helpers.check_codecs(vdss)

Codec(compressor=None, filters=[{'id': 'shuffle', 'elementsize': 2}, {'id': 'zlib', 'level': 7}])
s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210220090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

Codec(compressor=None, filters=[{'id': 'shuffle', 'elementsize': 2}, {'id': 'zlib', 'level': 7}])
s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210221090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

Codec(compressor=None, filters=[{'id': 'zlib', 'level': 6}, {'id': 'shuffle', 'elementsize': 2}])
s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20211224090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

Codec(compressor=None, filters=[{'id': 'zlib', 'level': 6}, {'id': 'shuffle', 'elementsize': 2}])
s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20211225090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

Codec(compressor=None, filters=[{'id': 'zlib', 'level': 6}, {'id': 'shuffle', 'elementsize': 2}])
s3://podaac-ops-cumulus-protected/MUR-JPL-

**Encoding report:**
* 2021-02-20 and 2021-02-21 use `zlib` level `7`.
* 2021-12-24 to 2021-12-31 implement `shuffle` after `zlib` (the wrong order).

These dates will be written as zarr.

## 17a. List files from first period

In [62]:
%%time
dates = ['2021-01-01', '2021-02-19']
mur_sst_files_2021_1 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2021_1_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2021_1]

CPU times: user 603 μs, sys: 0 ns, total: 603 μs
Wall time: 599 μs


In [63]:
len(mur_sst_files_2021_1_dmrpps)

50

## 17b. Write data

In [65]:
%%time
virtual_ds_2021_1 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2021_1_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2021_1.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2021-01-01 to 2021-02-19 to store.")

CPU times: user 20.4 s, sys: 626 ms, total: 21 s
Wall time: 23.8 s


'MM87KZG60DZG8GHS8HH0'

In [64]:
# %%time
# helpers.validate_data(store, dates=dates, fs=fs)

## 17c. Write 2 special days as zarr

In [66]:
%%time
dwz.update(
    repo=repo,
    start_date="2021-02-20 09:00",
    end_date="2021-02-21 09:00",
    fs=fs,
    client=client
)

Opening files
Files opened
Mapping write tasks to dask client




Starting distributed commit
Distributed commit done
CPU times: user 34.7 s, sys: 2.21 s, total: 36.9 s
Wall time: 4min 7s


In [67]:
helpers.trim_dask_worker_memory(client)

{'tcp://127.0.0.1:34403': 1,
 'tcp://127.0.0.1:40051': 1,
 'tcp://127.0.0.1:40731': 1,
 'tcp://127.0.0.1:46695': 1}
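The `{worker_address: 1}` output has the same shape as the manual memory-trimming recipe in the Dask worker-memory documentation, which calls glibc's `malloc_trim` on every worker. The sketch below shows that recipe. It is an assumption that `helpers.trim_dask_worker_memory` does something similar; `trim_memory` and `trim_workers` are illustrative names.

```python
import ctypes


def trim_memory() -> int:
    """Ask glibc to return freed heap memory to the OS.

    Returns 1 if memory was released and 0 otherwise.
    """
    try:
        libc = ctypes.CDLL("libc.so.6")
        return libc.malloc_trim(0)
    except (OSError, AttributeError):
        # Not glibc (for example macOS or musl): nothing to trim.
        return 0


def trim_workers(client):
    """Run trim_memory on every worker of a dask.distributed Client.

    Client.run returns a dict mapping each worker address to the result.
    """
    return client.run(trim_memory)


# In this notebook the call would be: trim_workers(client)
print(trim_memory())
```

Calling this between batches only helps with memory that has been freed but not yet returned to the OS. It does not reduce memory that is still referenced by running tasks.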

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=["2021-02-20", "2021-02-21"], fs=fs)

## 17d. List files for 2021-02-22 to 2021-12-23

In [68]:
%%time
dates = ['2021-02-22', '2021-12-23']
mur_sst_files_2021_2 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2021_2_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2021_2]

CPU times: user 1.48 ms, sys: 0 ns, total: 1.48 ms
Wall time: 1.45 ms


In [69]:
len(mur_sst_files_2021_2_dmrpps)

305

## 17e. Write data

In [70]:
%%time
virtual_ds_2021_2 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2021_2_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2021_2.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote 2021-02-22 to 2021-12-23 to store.")

CPU times: user 26.2 s, sys: 866 ms, total: 27 s
Wall time: 40.2 s


'ZC8A2JK80QTV8AWBVJ30'

In [72]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

## 17f. Write the rest of the days as Zarr

In [77]:
%%time
dwz.update(
    repo=repo,
    start_date="2021-12-24 09:00",
    end_date="2021-12-25 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client




Starting distributed commit
Distributed commit done
CPU times: user 47.9 s, sys: 1.83 s, total: 49.7 s
Wall time: 4min 2s


{'tcp://127.0.0.1:34157': 1,
 'tcp://127.0.0.1:38767': 1,
 'tcp://127.0.0.1:38917': 1,
 'tcp://127.0.0.1:45499': 1}

In [78]:
%%time
dwz.update(
    repo=repo,
    start_date="2021-12-26 09:00",
    end_date="2021-12-27 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client




Starting distributed commit
Distributed commit done
CPU times: user 46.8 s, sys: 1.92 s, total: 48.7 s
Wall time: 3min 56s


{'tcp://127.0.0.1:34157': 1,
 'tcp://127.0.0.1:38767': 1,
 'tcp://127.0.0.1:38917': 1,
 'tcp://127.0.0.1:45499': 1}

In [79]:
%%time
dwz.update(
    repo=repo,
    start_date="2021-12-28 09:00",
    end_date="2021-12-29 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client




Starting distributed commit
Distributed commit done
CPU times: user 47.2 s, sys: 1.85 s, total: 49 s
Wall time: 4min 4s


{'tcp://127.0.0.1:34157': 1,
 'tcp://127.0.0.1:38767': 1,
 'tcp://127.0.0.1:38917': 1,
 'tcp://127.0.0.1:45499': 1}

In [80]:
%%time
dwz.update(
    repo=repo,
    start_date="2021-12-30 09:00",
    end_date="2021-12-31 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client




Starting distributed commit
Distributed commit done
CPU times: user 47.3 s, sys: 2.3 s, total: 49.6 s
Wall time: 3min 57s


{'tcp://127.0.0.1:34157': 1,
 'tcp://127.0.0.1:38767': 1,
 'tcp://127.0.0.1:38917': 1,
 'tcp://127.0.0.1:45499': 1}

* 1 day at a time: 4 workers, each with 1 thread and 15GB of memory. The maximum memory usage I saw was 25%. Some workers used more than 100% CPU. It took about 2 minutes.
* 2 days at a time: Same worker configuration. The maximum memory usage I saw was 40%, and workers often exceeded 100% CPU. It took about 3 minutes.
* 3 days at a time: Same worker configuration. Memory usage reached 80%, which seems to be the upper limit. 2 days at a time with 15GB per worker is probably safer.
  * Writing 2 more days after this caused many warnings of memory usage >80%. This is probably because some memory is not released back to the OS: https://distributed.dask.org/en/stable/worker-memory.html#memory-not-released-back-to-the-os.

Each array is at most 5.2GB (17999 * 36000 * 8 bytes), so I am not sure why the workers are using so much memory.
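As a back-of-the-envelope check of that figure, the sketch below computes the size of one decoded (lat, lon) array per day for a few element sizes and compares it with the 15.16 GiB each worker has. The grid size and worker memory come from outputs earlier in this notebook. The element sizes other than 8 bytes are included only for comparison, and `decoded_size_bytes` is a hypothetical function.

```python
LAT, LON = 17999, 36000               # grid size from the dataset shape
WORKER_MEMORY_BYTES = 15.16 * 2**30   # per-worker memory reported by the dask client


def decoded_size_bytes(itemsize: int, days: int = 1) -> int:
    """Size in bytes of one variable's decoded array for the given number of days."""
    return days * LAT * LON * itemsize


for itemsize, label in [(2, "2-byte"), (4, "float32"), (8, "float64")]:
    size = decoded_size_bytes(itemsize)
    share = size / WORKER_MEMORY_BYTES
    print(f"{label}: {size / 1e9:.2f} GB per day, {share:.0%} of one worker")
```

These figures cover a single decoded array only. They do not account for compressed input buffers, intermediate copies, or memory that the allocator has not yet returned to the OS.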

## 17g. Validate data

In [81]:
# %%time
# dates = ['2021-12-24', '2021-12-31']
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

# 18. 2022

## 18a. List files

In [None]:
%%time
dates = ['2022-01-01', '2022-12-31']
mur_sst_files_2022 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2022_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2022]

In [None]:
len(mur_sst_files_2022_dmrpps)

## 18b. Find files with different encodings

In [None]:
# vdss = [helpers.open_virtual(f) for f in mur_sst_files_2022_dmrpps]
# helpers.check_codecs(vdss)

**Encoding report:**
* 2022-01-01 to 2022-01-26 have `shuffle` after `zlib` (the reverse of the default order)
* 2022-11-09 has `zlib` level `7` (`zlib` level `6` is the default)
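
As a rough illustration of how a report like this can be produced, the sketch below compares each file's codec list against the default from the introduction. It is not the implementation of `helpers.check_codecs`; the function `find_nondefault_codecs` and the input dictionary are hypothetical, and the codec lists are written by hand rather than read from `.dmrpp` files.

```python
# Default codec pipeline for `analysed_sst`, as stated in the introduction.
DEFAULT_CODECS = [
    {"id": "shuffle", "elementsize": 2},
    {"id": "zlib", "level": 6},
]


def find_nondefault_codecs(codecs_by_date, default=DEFAULT_CODECS):
    """Return {date: codecs} for every date whose codec list differs from default.

    Lists are compared in order, so a swapped shuffle/zlib pair is flagged
    as well as a changed parameter such as the zlib level.
    """
    return {d: c for d, c in codecs_by_date.items() if c != default}


# Hypothetical input: one entry per date, hand-written for illustration.
example = {
    "2022-01-26": [{"id": "zlib", "level": 6}, {"id": "shuffle", "elementsize": 2}],
    "2022-01-27": DEFAULT_CODECS,
    "2022-11-09": [{"id": "shuffle", "elementsize": 2}, {"id": "zlib", "level": 7}],
}
flagged = find_nondefault_codecs(example)
print(sorted(flagged))  # ['2022-01-26', '2022-11-09']
```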

## 18c. Write 2022-01-01 to 2022-01-26 as Zarr

In [82]:
%%time
dates = ["2022-01-01", "2022-01-02"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 50.3 s, sys: 2.37 s, total: 52.7 s
Wall time: 4min 31s


{'tcp://127.0.0.1:34157': 1,
 'tcp://127.0.0.1:38767': 1,
 'tcp://127.0.0.1:38917': 1,
 'tcp://127.0.0.1:45499': 1}

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

In [102]:
%%time
dates = ["2022-01-03", "2022-01-04"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 37.6 s, sys: 1.59 s, total: 39.2 s
Wall time: 4min 12s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

In [103]:
%%time
dates = ["2022-01-05", "2022-01-06"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 38.3 s, sys: 2.58 s, total: 40.9 s
Wall time: 4min 8s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [104]:
%%time
dates = ["2022-01-07", "2022-01-08"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 39.2 s, sys: 1.84 s, total: 41 s
Wall time: 4min 20s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [105]:
%%time
dates = ["2022-01-09", "2022-01-10"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 36.5 s, sys: 2.71 s, total: 39.2 s
Wall time: 3min 48s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [106]:
%%time
dates = ["2022-01-11", "2022-01-12"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 37.6 s, sys: 2.05 s, total: 39.7 s
Wall time: 4min 18s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [107]:
%%time
dates = ["2022-01-13", "2022-01-14"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 38.8 s, sys: 2.18 s, total: 41 s
Wall time: 4min 28s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [108]:
%%time
dates = ["2022-01-15", "2022-01-16"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 39.2 s, sys: 2.51 s, total: 41.7 s
Wall time: 4min 35s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [109]:
%%time
dates = ["2022-01-17", "2022-01-18"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 37.1 s, sys: 2.26 s, total: 39.4 s
Wall time: 4min 1s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [110]:
%%time
dates = ["2022-01-19", "2022-01-20"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 37.3 s, sys: 1.92 s, total: 39.2 s
Wall time: 3min 56s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [111]:
%%time
dates = ["2022-01-21", "2022-01-22"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 38.1 s, sys: 1.87 s, total: 40 s
Wall time: 3min 59s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [112]:
%%time
dates = ["2022-01-23", "2022-01-24"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 37.2 s, sys: 1.4 s, total: 38.6 s
Wall time: 3min 57s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [113]:
%%time
dates = ["2022-01-25", "2022-01-26"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 37.3 s, sys: 1.46 s, total: 38.8 s
Wall time: 3min 49s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=["2022-01-01", "2022-01-26"], fs=fs)
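
The cells in this section repeat the same call for consecutive 2-day windows. A small generator could produce those windows instead of editing `dates` by hand. This is a sketch only: `date_batches` is a hypothetical helper that is not part of `helpers` or `dwz`, and the commented loop simply reuses the calls shown in the cells above.

```python
from datetime import date, timedelta


def date_batches(start: str, end: str, days_per_batch: int = 2):
    """Yield (first_day, last_day) ISO strings covering start..end inclusive."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        # The final batch is shortened if the range does not divide evenly.
        batch_end = min(current + timedelta(days=days_per_batch - 1), last)
        yield current.isoformat(), batch_end.isoformat()
        current = batch_end + timedelta(days=1)


batches = list(date_batches("2022-01-01", "2022-01-26"))
print(len(batches), batches[0], batches[-1])
# 13 ('2022-01-01', '2022-01-02') ('2022-01-25', '2022-01-26')

# Possible use with the calls from this notebook (not executed here):
# for first, last in batches:
#     dwz.update(repo=repo, start_date=f"{first} 09:00",
#                end_date=f"{last} 09:00", fs=fs, client=client)
#     helpers.trim_dask_worker_memory(client)
```

Keeping `days_per_batch=2` matches the memory observations above for 15GB workers.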

## 18d. Continue with 2022 until 2022-11-08

In [114]:
%%time
dates = ['2022-01-27', '2022-11-08']
mur_sst_files_2022 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2022_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2022]

CPU times: user 2.07 ms, sys: 0 ns, total: 2.07 ms
Wall time: 2.06 ms


In [117]:
%%time
virtual_ds_2022 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2022_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2022.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote {dates[0]} to {dates[1]} to store.")

CPU times: user 30 s, sys: 2.1 s, total: 32.1 s
Wall time: 43.8 s


'49R1F52S4BA8TK0YAF20'

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

## 18e. Add 2022-11-09 as Zarr

In [118]:
%%time
dates = ["2022-11-09", "2022-11-09"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 33.5 s, sys: 2.13 s, total: 35.7 s
Wall time: 2min 28s


In [119]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

## 18f. Finish 2022 as virtual refs

In [120]:
%%time
dates = ['2022-11-10', '2022-12-31']
mur_sst_files_2022_2 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2022_2_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2022_2]

CPU times: user 588 μs, sys: 32 μs, total: 620 μs
Wall time: 614 μs


In [121]:
len(mur_sst_files_2022_2_dmrpps)

52
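
A quick sanity check on a file count like this one is to compare it with the number of days in the requested range. The sketch below assumes exactly one MUR SST file per day; `expected_daily_files` is a hypothetical helper, not part of `helpers`.

```python
from datetime import date


def expected_daily_files(start: str, end: str) -> int:
    """Number of days from start to end inclusive (one file per day assumed)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days + 1


print(expected_daily_files("2022-11-10", "2022-12-31"))  # 52
```

A mismatch between this number and the length of the `.dmrpp` list would suggest a missing or duplicated file in the listing.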

In [122]:
%%time
virtual_ds_2022_2 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2022_2_dmrpps)
session = repo.writable_session("main")
store = session.store
virtual_ds_2022_2.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote {dates[0]} to {dates[1]} to store.")

CPU times: user 23.3 s, sys: 680 ms, total: 24 s
Wall time: 27.2 s


'0EK5PYGJDMF2148JN4G0'

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

In [123]:
helpers.trim_dask_worker_memory(client)

{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

# 19. 2023

**Chunking report:**

As with 2022, we break 2023 into date ranges written as virtual references and date ranges written as native Zarr, because the chunk shape changes on some dates. See https://forum.earthdata.nasa.gov/viewtopic.php?t=5909.

Since the different chunk shape seems to become the default from 2023-09-04 onward, this store ends on 2023-09-03 and a new store starts after that. So there will be one store from 2002-06-01 to 2023-09-03 and another starting on 2023-09-04 that will go to the present day.
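
The split into virtual and Zarr date ranges can be sketched as a grouping of consecutive dates by whether their chunk shape matches the default. This is an illustration under assumptions: `plan_segments` is a hypothetical helper, the input is a hand-written mapping rather than something read from the files, and the non-default shape `(1, 3600, 7200)` is made up for the example.

```python
from itertools import groupby

# Default chunk shape (time, lat, lon) for `analysed_sst`, from the introduction.
DEFAULT_CHUNKS = (1, 1023, 2047)


def plan_segments(chunks_by_date, default=DEFAULT_CHUNKS):
    """Group date-sorted entries into ('virtual' | 'zarr', first_date, last_date) runs."""
    items = sorted(chunks_by_date.items())  # ISO date strings sort chronologically
    segments = []
    for is_default, run in groupby(items, key=lambda kv: tuple(kv[1]) == default):
        run_dates = [d for d, _ in run]
        kind = "virtual" if is_default else "zarr"
        segments.append((kind, run_dates[0], run_dates[-1]))
    return segments


# Hypothetical input; the non-default shape below is invented for illustration.
example = {
    "2023-02-22": (1, 1023, 2047),
    "2023-02-23": (1, 1023, 2047),
    "2023-02-24": (1, 3600, 7200),
    "2023-02-25": (1, 3600, 7200),
    "2023-02-26": (1, 3600, 7200),
    "2023-02-27": (1, 3600, 7200),
    "2023-02-28": (1, 3600, 7200),
    "2023-03-01": (1, 1023, 2047),
}
for segment in plan_segments(example):
    print(segment)
# ('virtual', '2023-02-22', '2023-02-23')
# ('zarr', '2023-02-24', '2023-02-28')
# ('virtual', '2023-03-01', '2023-03-01')
```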

## 19a. First set of virtual data (2023-01-01 to 2023-02-23)

In [124]:
%%time
dates = ['2023-01-01', '2023-02-23']
mur_sst_files_2023_1 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2023_1_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2023_1]


CPU times: user 609 μs, sys: 33 μs, total: 642 μs
Wall time: 633 μs


In [125]:
len(mur_sst_files_2023_1_dmrpps)

54

In [126]:
virtual_ds_2023_1 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2023_1_dmrpps)

In [127]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2023_1.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote {dates[0]} to {dates[1]} to store.")

'PMPF1X2NHQB1NC3X5DW0'

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

## 19b. Append 2023-02-24 to 2023-02-28 as Zarr

In [128]:
%%time
dates = ["2023-02-24", "2023-02-25"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 31.4 s, sys: 1.14 s, total: 32.6 s
Wall time: 1min 34s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [129]:
%%time
dates = ["2023-02-26", "2023-02-27"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 32.6 s, sys: 1.14 s, total: 33.7 s
Wall time: 1min 49s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [130]:
%%time
dates = ["2023-02-28", "2023-02-28"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 28.2 s, sys: 1 s, total: 29.2 s
Wall time: 1min 25s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=["2023-02-24", "2023-02-28"], fs=fs)

## 19c. Write 2023-03-01 to 2023-04-21 as virtual refs

In [131]:
%%time
dates = ['2023-03-01', '2023-04-21']
mur_sst_files_2023_2 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2023_2_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2023_2]


CPU times: user 571 μs, sys: 30 μs, total: 601 μs
Wall time: 593 μs


In [132]:
len(mur_sst_files_2023_2_dmrpps)

52

In [133]:
virtual_ds_2023_2 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2023_2_dmrpps)

In [134]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2023_2.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote {dates[0]} to {dates[1]} to store.")

'5D5V4HGNDS81QB3P3PV0'

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=dates, fs=fs)

## 19d. Write 2023-04-22 as Zarr

In [135]:
%%time
dates = ["2023-04-22", "2023-04-22"]
dwz.update(
    repo=repo,
    start_date=f"{dates[0]} 09:00",
    end_date=f"{dates[1]} 09:00",
    fs=fs,
    client=client
)
helpers.trim_dask_worker_memory(client)

Opening files
Files opened
Mapping write tasks to dask client


  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)
  super().__init__(**codec_config)


Starting distributed commit
Distributed commit done
CPU times: user 27.5 s, sys: 1.01 s, total: 28.5 s
Wall time: 1min 13s


{'tcp://127.0.0.1:35289': 1,
 'tcp://127.0.0.1:41661': 1,
 'tcp://127.0.0.1:45397': 1,
 'tcp://127.0.0.1:46477': 1}

## 19e. Write 2023-04-23 to 2023-09-03 as virtual refs

In [136]:
%%time
dates = ['2023-04-23', '2023-09-03']
mur_sst_files_2023_3 = helpers.list_mur_sst_files(start_date=dates[0], end_date=dates[1])
mur_sst_files_2023_3_dmrpps = [f + '.dmrpp' for f in mur_sst_files_2023_3]

CPU times: user 0 ns, sys: 1.03 ms, total: 1.03 ms
Wall time: 1.01 ms


In [137]:
len(mur_sst_files_2023_3_dmrpps)

134

In [138]:
virtual_ds_2023_3 = helpers.create_virtual_ds(dmrpps=mur_sst_files_2023_3_dmrpps)

In [139]:
session = repo.writable_session("main")
store = session.store
virtual_ds_2023_3.virtualize.to_icechunk(store, append_dim='time')
session.commit(f"Wrote {dates[0]} to {dates[1]} to store.")

'3714221ZXFMCZNDSKMAG'

In [None]:
# %%time
# session = repo.readonly_session(branch="main")
# store = session.store
# helpers.validate_data(store, dates=['2023-04-22', '2023-09-03'], fs=fs)