# Why and How to Save NumPy Arrays with Zarr

This notebook tells you why and how to save your NumPy arrays with Zarr.

We'll look at:
1. The limitations of common ways of storing NumPy arrays
2. The benefits of using Zarr
3. How to combine Zarr with Dask for parallel read/write operations
4. Scaling out to a Coiled cluster to process larger-than-memory arrays

If you have any questions about the code, reach out to us [on Slack](https://join.slack.com/t/coiledcomputing/shared_invite/zt-112g7nu8y-XAVCE2rIBqv074DtUhjYAg).

## 1. Common Ways to Store NumPy Arrays (and their limitations)

### Create Dummy NumPy Arrays

Let's start by creating dummy NumPy arrays to work with. We'll use `np.random.rand` to generate two arrays filled with random numbers: a small 2D array and a large 3D array.

In [1]:
import numpy as np

In [2]:
# create dummy 2D array
array_XS = np.random.rand(3,2)
array_XS

array([[0.42693967, 0.25012671],
       [0.30218665, 0.05726088],
       [0.27580781, 0.18471068]])

In [3]:
# create dummy 3D array
array_L = np.random.rand(1000, 1000, 100)
array_L[:2,:2,:2]

array([[[0.9867651 , 0.3229474 ],
        [0.62364172, 0.36928141]],

       [[0.17005639, 0.45616798],
        [0.94785846, 0.63917985]]])

### Save NumPy Array to .txt

One way to store NumPy arrays is as .txt files. This works for 1- and 2-dimensional arrays, but `np.savetxt` raises an error for arrays with more dimensions.

The benefit of a .txt file is that it is human-readable.

In [4]:
np.savetxt('array_XS.txt', array_XS, delimiter=" ")

In [5]:
np.savetxt('array_L.txt', array_L, delimiter=" ")

ValueError: Expected 1D or 2D array, got 3D array instead

Check out [this blog post](https://mungingdata.com/numpy/save-numpy-text-txt/) for a more extensive tutorial of saving NumPy arrays to TXT.
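
If you do need a text file for a higher-dimensional array, one common workaround is to collapse it to 2D before saving and restore the shape after loading. The sketch below assumes you keep track of the original shape yourself; the array and file names are made up for illustration.

```python
import os
import tempfile

import numpy as np

# small 3D array standing in for array_L (hypothetical example data)
array_3d = np.arange(24, dtype=float).reshape(2, 3, 4)

path = os.path.join(tempfile.mkdtemp(), "array_3d.txt")

# collapse the last two axes so that np.savetxt receives a 2D array
np.savetxt(path, array_3d.reshape(array_3d.shape[0], -1), delimiter=" ")

# the text file does not store the shape, so we restore it manually
restored = np.loadtxt(path, delimiter=" ").reshape(array_3d.shape)

print(np.array_equal(array_3d, restored))
```

This keeps the file human-readable, but the shape has to be recorded somewhere else, which is one of the things a format like Zarr handles for you.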

### Save NumPy Array to .csv

You can use the same `np.savetxt` function to save your NumPy array to a CSV file. Make sure to set the `delimiter` keyword to `","`.

Just like .txt files, .csv files are human-readable. They also have the added benefit of easy loading into DataFrames using, for example, `pd.read_csv`.

In [6]:
np.savetxt('array_XS.csv', array_XS, delimiter=",")

As with .txt files, writing NumPy arrays to CSV only works for 1D and 2D arrays.

In [7]:
np.savetxt('array_L.csv', array_L, delimiter=",")

ValueError: Expected 1D or 2D array, got 3D array instead

You can load a NumPy array stored as .csv into a pandas DataFrame as follows:

In [8]:
import pandas as pd
df = pd.read_csv('array_XS.csv', header=None)
df

|   | 0 | 1 |
|---|---|---|
| 0 | 0.42694 | 0.250127 |
| 1 | 0.302187 | 0.057261 |
| 2 | 0.275808 | 0.184711 |


> NOTE: To load .txt or .csv files back into NumPy arrays, use `np.loadtxt()` (passing `delimiter=","` for CSV files). `np.load()` expects the binary .npy/.npz format and `np.fromfile()` reads raw binary data by default, so using either on a text file can raise errors or misread the data.

Check out [this blog post](https://crunchcrunchhuman.com/2021/12/25/numpy-save-csv-write/) for a more extensive tutorial on saving NumPy arrays to CSV.
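
The round trip described in the note above can be sketched as follows. This is a minimal example with a made-up file name, written to a temporary directory so it does not touch the files created earlier.

```python
import os
import tempfile

import numpy as np

array_2d = np.random.rand(3, 2)
csv_path = os.path.join(tempfile.mkdtemp(), "array_2d.csv")

np.savetxt(csv_path, array_2d, delimiter=",")

# the delimiter must match the one used when writing the file
loaded = np.loadtxt(csv_path, delimiter=",")

# savetxt's default format ("%.18e") keeps enough digits for float64 values
print(np.allclose(array_2d, loaded))
```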

### Save NumPy Array with np.save()

A third way to store NumPy arrays on disk is the native `np.save()` function, which stores arrays in the binary .npy file format.

This format can store NumPy arrays of any dimensionality. However, because it is binary, it is not human-readable.

In [9]:
# save small array to binary format
np.save('array_XS.npy', array_XS)

In [10]:
# save large array to binary format
np.save('array_L.npy', array_L)

> NOTE: You can also store NumPy arrays using the `np.ndarray.tofile()` method. However, this writes the raw bytes without recording shape, dtype or endianness, so the resulting files are platform-dependent and the method is not widely used. Read more about it [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tofile.html?highlight=tofile).
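
For completeness, here is a small sketch of reading an .npy file back with `np.load()`. The array and file name are hypothetical; `mmap_mode="r"` is an optional NumPy feature that memory-maps the file, which can help when you only need part of a large array.

```python
import os
import tempfile

import numpy as np

array_3d = np.random.rand(4, 5, 6)
npy_path = os.path.join(tempfile.mkdtemp(), "array_3d.npy")

np.save(npy_path, array_3d)

# shape and dtype are stored in the file header, so no reshaping is needed
loaded = np.load(npy_path)
print(loaded.shape, loaded.dtype)

# memory-map the file instead of reading all of it into memory,
# then copy only the slice we actually need
mapped = np.load(npy_path, mmap_mode="r")
first_block = np.array(mapped[:2, :2, :2])
```

Memory-mapping helps a single process read slices, but the file is still one uncompressed blob, which is where chunked formats come in.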

## 2. The Benefits of Zarr

We've seen that the three most common ways to store NumPy arrays each have their own shortcomings:
- TXT and CSV files can only contain 1- or 2-dimensional arrays, and
- the native NPY binary file format does not support parallel read/write.

This is why we recommend saving your NumPy arrays with Zarr. Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. Its design builds on ideas from the HDF5 format.


The benefits of Zarr:
1. Can **read and write data concurrently*** in n-dimensional compressed chunks
2. Has **multiple compression options** and levels built-in
3. Is safe to use in **multiprocessing and multithreading** setups
4. Stores **metadata within the file**, allowing for flexibility 
5. Supports **multiple backend data stores** (zip, S3, etc.)
6. Has been **widely adopted** across PyData libraries like Dask, TensorStore and xarray

(*) Note that Zarr supports concurrent reads and concurrent writes separately, but not concurrent reads and writes at the same time.
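
To make "chunked, compressed" more concrete, here is a toy illustration of the idea using only NumPy and the standard-library `zlib` module. It is not how Zarr is implemented: the chunk size, the `read_rows` helper and the in-memory dictionary are assumptions made for illustration. The point is that each chunk is compressed independently, so a reader only needs to decompress the chunks that overlap the requested region.

```python
import zlib

import numpy as np

data = np.arange(1_000, dtype=np.int64).reshape(100, 10)
chunk_rows = 25  # hypothetical chunk size along the first axis

# compress each block of rows independently and key it by chunk index
store = {}
for i, start in enumerate(range(0, data.shape[0], chunk_rows)):
    chunk = data[start:start + chunk_rows]
    store[i] = zlib.compress(chunk.tobytes())


def read_rows(start, stop):
    """Read rows [start, stop) by decompressing only the chunks that overlap them."""
    first, last = start // chunk_rows, (stop - 1) // chunk_rows
    parts = [
        np.frombuffer(zlib.decompress(store[i]), dtype=data.dtype).reshape(-1, data.shape[1])
        for i in range(first, last + 1)
    ]
    block = np.concatenate(parts)
    offset = first * chunk_rows
    return block[start - offset:stop - offset]


rows = read_rows(30, 40)  # only chunk 1 (rows 25 to 49) is decompressed
print(rows.shape)
```

Because the chunks are independent, different workers could also read or write different chunks at the same time, which is the property Dask relies on later in this notebook.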

Let’s see Zarr in action. Below, we’ll save the small and large arrays to .zarr and check the resulting file sizes.

In [11]:
import zarr

In [12]:
# save small numpy array to zarr
zarr.save('array_XS.zarr', array_XS)

In [13]:
# save large numpy array to zarr
zarr.save('array_L.zarr', array_L)

In [14]:
# let's get the total size on disk (human-readable) of the stored .zarr directory
! du -sh array_L.zarr

693M	array_L.zarr


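Storing `array_L` as Zarr noticeably reduces the size on disk (by roughly 10 to 15%, depending on whether `du` counts in MiB or MB), even with just the default out-of-the-box compression settings. Random floating-point data is hard to compress, so real-world arrays often shrink considerably more.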
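
As a rough sanity check on these numbers, the uncompressed size can be worked out from the shape and dtype of `array_L`. Whether `du -h` reports binary units (MiB) or decimal units (MB) depends on your system, so the sketch below computes the ratio under both assumptions.

```python
# array_L has shape (1000, 1000, 100) and dtype float64 (8 bytes per element)
raw_bytes = 1000 * 1000 * 100 * 8

zarr_reported = 693  # the "693M" reported by du above

# assumption 1: du counts in MiB (2**20 bytes)
ratio_binary = zarr_reported * 2**20 / raw_bytes
# assumption 2: du counts in MB (10**6 bytes)
ratio_decimal = zarr_reported * 10**6 / raw_bytes

print(f"uncompressed: {raw_bytes / 2**20:.0f} MiB ({raw_bytes / 10**6:.0f} MB)")
print(f"size relative to raw data (MiB assumption): {ratio_binary:.1%}")
print(f"size relative to raw data (MB assumption):  {ratio_decimal:.1%}")
```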

### Compression Options

Blosc is the default compressor used for creating Zarr arrays. You can tweak the settings of Blosc (or any other compatible compressor) by importing it from `numcodecs` and passing an instance to the `compressor` keyword argument. Read more about all the compression options in [the Zarr documentation](https://zarr.readthedocs.io/en/stable/tutorial.html#compressors).

In [17]:
from numcodecs import Blosc

In [18]:
zarr_array = zarr.array(
    data=array_L,
    chunks=True,  # infer chunk shape from the array
    compressor=Blosc(cname="lz4hc", clevel=9),  # set compression algorithm and level
)

In [19]:
zarr.save('array_L_comp.zarr', zarr_array)

### Loading NumPy Arrays with Zarr

You can now load any array stored as .zarr back into your Python session using `zarr.load()`.

In [22]:
# load in array from zarr
array_zarr = zarr.load('array_L.zarr')

When we load the .zarr file back into our Python session, it is loaded in as a regular NumPy array.

In [23]:
type(array_zarr)

numpy.ndarray

Zarr supports multiple backend data stores. This means you can also easily load .zarr files from cloud-based data stores, like Amazon S3:

In [24]:
# load small zarr array from S3
array_S = zarr.load("s3://coiled-datasets/synthetic-data/array-random-390KB.zarr")

In [25]:
array_S[:,0,0]

array([9.97862027e-01, 4.93188723e-01, 8.64042719e-01, 9.53425248e-01,
       5.92869742e-01, 1.98482100e-01, 3.78242997e-01, 9.78501028e-01,
       4.59202482e-01, 8.88982746e-01, 3.58056844e-01, 5.85341283e-01,
       7.85844688e-01, 9.11071794e-01, 5.39329780e-01, 8.61029864e-01,
       4.40726502e-01, 9.75751003e-01, 4.33597238e-01, 9.64823816e-01,
       3.31746564e-01, 2.79358177e-01, 3.08116047e-01, 8.42990623e-01,
       4.14747817e-01, 1.95971922e-01, 4.97401472e-01, 7.74970837e-01,
       6.08517834e-01, 3.06942774e-01, 6.55169935e-01, 3.26379108e-01,
       5.93332939e-01, 7.47182238e-01, 7.71864306e-01, 8.22604316e-01,
       9.17763146e-01, 9.32028668e-01, 2.58655304e-01, 9.09026001e-01,
       4.60414297e-01, 8.97946448e-01, 7.55121515e-01, 5.56243088e-01,
       6.03356205e-01, 7.66650339e-01, 9.65219838e-01, 9.90092537e-01,
       7.87905785e-01, 3.10036232e-01, 9.29806773e-01, 2.96195733e-01,
       7.15712402e-01, 8.75266389e-02, 2.43538328e-01, 9.13177378e-01,
       ...])

## 3. Parallel Read/Write from S3 with Dask

You can use Dask to read and write your large Zarr arrays in parallel. This is especially useful if you're working with larger-than-memory datasets.

Let's try to load a 370 GB .zarr file into our Python session directly.

In [None]:
array_XL = zarr.load("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

This will raise a `MemoryError`:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-5-7969a01a46fb> in <module>
----> 1 array_XL = zarr.load("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/convenience.py in load(store)
    361     store = normalize_store_arg(store)
    362     if contains_array(store, path=None):
--> 363         return Array(store=store, path=None)[...]
    364     elif contains_group(store, path=None):
    365         grp = Group(store=store, path=None)

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
    671 
    672         fields, selection = pop_fields(selection)
--> 673         return self.get_basic_selection(selection, fields=fields)
    674 
    675     def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
    797         else:
    798             return self._get_basic_selection_nd(selection=selection, out=out,
--> 799                                                 fields=fields)
    800 
    801     def _get_basic_selection_zd(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_nd(self, selection, out, fields)
    839         indexer = BasicIndexer(selection, self)
    840 
--> 841         return self._get_selection(indexer=indexer, out=out, fields=fields)
    842 
    843     def get_orthogonal_selection(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_selection(self, indexer, out, fields)
   1118         # setup output array
   1119         if out is None:
-> 1120             out = np.empty(out_shape, dtype=out_dtype, order=self._order)
   1121         else:
   1122             check_array_shape('out', out, out_shape)

MemoryError: Unable to allocate 373. GiB for an array with shape (10000, 10000, 500) and data type float64

Loading the same 370GB .zarr file into a Dask array works fine:

In [26]:
import dask.array as da

dask_array = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")
dask_array

|   | Array | Chunk |
|---|---|---|
| Bytes | 372.53 GiB | 190.73 MiB |
| Shape | (10000, 10000, 500) | (1000, 1000, 25) |
| Count | 2001 Tasks | 2000 Chunks |
| Type | float64 | numpy.ndarray |


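You can perform some basic computations on this dataset locally, because `da.from_zarr` is lazy: it reads only the array's metadata and builds a task graph over the chunks instead of loading the data. Loading the entire array into local memory will still fail, because your machine does not have enough RAM.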
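
The figures in the Dask summary follow directly from the array shape, the chunk shape and the dtype. A short worked check using only the standard library:

```python
import math

shape = (10000, 10000, 500)
chunks = (1000, 1000, 25)
itemsize = 8  # bytes per float64 element

# number of chunks = product of chunks per dimension
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
chunk_mib = math.prod(chunks) * itemsize / 2**20
total_gib = math.prod(shape) * itemsize / 2**30

print(n_chunks)            # 2000
print(f"{chunk_mib:.2f}")  # 190.73
print(f"{total_gib:.2f}")  # 372.53
```

Each chunk of roughly 190 MiB fits easily in memory on its own, which is why Dask can work through the array piece by piece even though the full 372.53 GiB cannot be allocated at once.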

## 4. Scale to Dask Cluster with Coiled

We'll need to run this on a remote cluster to access additional hardware resources.

To do this, we'll:
1. Spin up a Coiled cluster
2. Connect the cluster to Dask
3. Run computations over the entire 370GB array 

In [29]:
import coiled

In [32]:
# spin up coiled cluster
cluster = coiled.Cluster(
    name="create-synth-array",
    software="coiled-examples/numpy-zarr",  # specify Docker image to distribute to all workers
    n_workers=50,
    worker_cpu=4,
    worker_memory='24Gib',  # specify worker RAM
    backend_options={'spot': True},  # use AWS Spot instances (cheaper)
)

Found software environment build
Created fw rule: inbound [8786-8787] [0.0.0.0/0] []
Created FW rules: coiled-dask-rrpelgr71-93764-firewall
Created fw rule: cluster [0-65535] [None] [coiled-dask-rrpelgr71-93764-firewall -> coiled-dask-rrpelgr71-93764-firewall]
Created FW rules: coiled-dask-rrpelgr71-93764-cluster-firewall
Created fw rule: cluster [0-65535] [None] [coiled-dask-rrpelgr71-93764-cluster-firewall -> coiled-dask-rrpelgr71-93764-cluster-firewall]
Created scheduler VM: coiled-dask-rrpelgr71-93764-scheduler (type: t3.medium, ip: ['3.236.75.146'])


In [34]:
# connect Dask to cluster
from distributed import Client
client = Client(cluster)
client

Client
- Connection method: Cluster object
- Cluster type: coiled.Cluster
- Dashboard: http://3.236.75.146:8787

Cluster Info
- Workers: 44
- Total threads: 352
- Total memory: 1.33 TiB

Scheduler
- Comm: tls://10.4.9.218:8786
- Dashboard: http://10.4.9.218:8787/status
- Started: 1 minute ago

Workers
- 44 workers, each with Total threads: 8 and Memory: 31.01 GiB


In [35]:
da_1 = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

In [36]:
da_1

|   | Array | Chunk |
|---|---|---|
| Bytes | 372.53 GiB | 190.73 MiB |
| Shape | (10000, 10000, 500) | (1000, 1000, 25) |
| Count | 2001 Tasks | 2000 Chunks |
| Type | float64 | numpy.ndarray |


In [37]:
# transpose dask array
da_2 = da_1.T

In [38]:
da_2

|   | Array | Chunk |
|---|---|---|
| Bytes | 372.53 GiB | 190.73 MiB |
| Shape | (500, 10000, 10000) | (25, 1000, 1000) |
| Count | 4001 Tasks | 2000 Chunks |
| Type | float64 | numpy.ndarray |


In [43]:
%%time
# write transposed array to S3 in parallel
da_2.to_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB-T.zarr")

CPU times: user 2.52 s, sys: 264 ms, total: 2.78 s
Wall time: 1min 51s


> Note that for efficient parallel writing, the Dask array chunks should be aligned with the Zarr target.
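
One way to read "aligned" is that every boundary between Dask chunks should coincide with a boundary between Zarr chunks, so that no two tasks need to write to the same Zarr chunk. The helper below is a hypothetical sketch of that check and is not part of Dask or Zarr. It assumes chunk layouts in the form exposed by `dask_array.chunks` (a tuple of tuples) and `zarr_array.chunks` (a tuple of ints).

```python
from itertools import accumulate


def chunks_aligned(dask_chunks, zarr_chunks):
    """Return True if every interior Dask chunk boundary falls on a Zarr chunk boundary.

    dask_chunks: tuple of tuples of chunk sizes, one tuple per dimension
    zarr_chunks: tuple of ints, one chunk size per dimension
    """
    for sizes, zarr_size in zip(dask_chunks, zarr_chunks):
        # the end of the array is always a valid boundary, so drop the last value
        boundaries = list(accumulate(sizes))[:-1]
        if any(b % zarr_size for b in boundaries):
            return False
    return True


# Dask chunks of 2000 each span exactly two Zarr chunks of 1000: aligned
print(chunks_aligned(((2000, 2000, 1000),), (1000,)))  # True
# a boundary at 1500 cuts a Zarr chunk in half: not aligned
print(chunks_aligned(((1500, 1500, 2000),), (1000,)))  # False
```

In the example above, the Dask chunks come straight from the source Zarr array and are only transposed, so the written array can use the same chunk layout.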

Our Coiled cluster was requested with 50 Dask workers and 24 GiB of RAM each, all running a pre-built software environment containing the necessary dependencies. This gives us enough resources to comfortably transpose the array and write it back to S3.

Dask does all of this in parallel, without ever loading the array into our local memory. It loaded, transposed and saved an array of roughly 372 GiB back to S3 in less than 2 minutes.
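
A quick back-of-the-envelope comparison shows why the cluster has room to spare. The figures are taken from the `coiled.Cluster(...)` call and from the client summary above, which listed 44 workers with 31.01 GiB each at the time it was printed.

```python
array_gib = 10000 * 10000 * 500 * 8 / 2**30  # size of the float64 array in GiB

requested_gib = 50 * 24    # n_workers * worker_memory from the cluster request
reported_gib = 44 * 31.01  # workers * memory per worker from the client summary

print(f"array:     {array_gib:.2f} GiB")
print(f"requested: {requested_gib} GiB")
print(f"reported:  {reported_gib / 1024:.2f} TiB")
```

Either way, total cluster memory is more than three times the size of the array.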

## Summary

Let’s recap: 
- There are important limitations to many of the common ways of storing NumPy arrays.
- The Zarr file format offers powerful compression options, supports multiple data store backends, and can read/write your NumPy arrays in parallel.
- Dask allows you to use Zarr's parallel read/write capabilities to their full potential.
- Connecting Dask to an on-demand Coiled cluster allows for efficient computations over larger-than-memory datasets.


To get started with Coiled, [create a free account here](https://cloud.coiled.io/) using your GitHub credentials.