The L5Kit codebase uses a data format that consists of a set of numpy structured arrays. Conceptually, it is similar to a set of CSV files with records and columns, except that the data is stored as binary files instead of text. Structured arrays can be memory mapped directly from disk.

Structured arrays are stored in memory in an interleaved format: the fields of one “row” or “sample” are grouped together in memory. For example, if we store colors and whether we like them (as a boolean l), the layout is [r,g,b,l,r,g,b,l,r,g,b,l] and not [r,r,r,g,g,g,b,b,b,l,l,l]. Most ML applications require row-based access, and column-based operations are much less common, so this layout is a good fit.

In [2]:
import numpy as np

In [5]:
my_arr = np.zeros(3, dtype=[("color", (np.uint8,3)), ("label", bool)])

In [6]:
print(my_arr)

[([0, 0, 0], False) ([0, 0, 0], False) ([0, 0, 0], False)]


In [10]:
my_arr[0]["color"] = [0,218,130]
my_arr[0]["label"] = True
my_arr[1]["color"] = [245, 62, 255]
my_arr[1]["label"] = True

In [11]:
my_arr

array([([  0, 218, 130],  True), ([245,  62, 255],  True),
       ([  0,   0,   0], False)],
      dtype=[('color', 'u1', (3,)), ('label', '?')])

In [12]:
print(my_arr.tobytes())

b'\x00\xda\x82\x01\xf5>\xff\x01\x00\x00\x00\x00'
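The interleaved layout and the memory-mapping claim can both be checked with plain numpy. The following is a minimal sketch, not L5Kit's own loading code: the file name `colors.bin` and the variable names are made up for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

# Same record layout as above: three uint8 color channels plus one boolean.
color_dtype = np.dtype([("color", (np.uint8, 3)), ("label", bool)])

records = np.zeros(3, dtype=color_dtype)
records[0] = ([0, 218, 130], True)
records[1] = ([245, 62, 255], True)

# Each record occupies 4 contiguous bytes (r, g, b, l), so consecutive
# records are 4 bytes apart in memory.
print(records.itemsize)  # 4
print(records.strides)   # (4,)

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "colors.bin")
records.tofile(path)

# Map the raw bytes back as structured records without loading the whole file.
mapped = np.memmap(path, dtype=color_dtype, mode="r")
print(mapped[1]["color"], mapped[1]["label"])
```

Because the file holds nothing but the packed records, the dtype is all that is needed to interpret it, and numpy infers the number of records from the file size.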


In [13]:
import zarr

In [14]:
# create a new zarr array on disk (mode="w" overwrites any existing data at this path)
z = zarr.open("dataset.zarr", mode="w", shape=(500,), dtype=np.float32, chunks=(100,))

In [15]:
# We can write to it by assigning to it. This gets persisted on disk.
z[0:150] = np.arange(150)

In [16]:
print(z.info)

Type               : zarr.core.Array
Data type          : float32
Shape              : (500,)
Chunk shape        : (100,)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.DirectoryStore
No. bytes          : 2000 (2.0K)
No. bytes stored   : 577
Storage ratio      : 3.5
Chunks initialized : 2/5



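Zarr compresses the data on disk automatically: the 2000 bytes of array data take up only 577 bytes, a storage ratio of about 3.5 (roughly 71% smaller). Part of the saving comes from the three chunks that were never written and therefore take no space.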

Reading from a zarr array is as easy as slicing from it like you would any numpy array. The return value is an ordinary numpy array. Zarr takes care of determining which chunks to read from:

In [17]:
print(z[:10])

[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]


In [18]:
print(z[::20])

[  0.  20.  40.  60.  80. 100. 120. 140.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
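To make the chunk lookup concrete, here is a small sketch of how a 1-D slice maps onto chunk indices. It only illustrates the idea and is not zarr's actual implementation; the helper `chunks_touched` is a hypothetical name.

```python
def chunks_touched(start, stop, step, chunk_len):
    """Return the sorted indices of the chunks that a 1-D slice reads from."""
    return sorted({i // chunk_len for i in range(start, stop, step)})


# An array of 500 elements stored in chunks of 100, as in the example above.
print(chunks_touched(0, 10, 1, 100))    # [0]
print(chunks_touched(0, 500, 20, 100))  # [0, 1, 2, 3, 4]
print(chunks_touched(90, 210, 1, 100))  # [0, 1, 2]
```

Only elements 0 to 149 were written, so chunks 2 to 4 were never initialized. Reads from those chunks return the array's fill value, which is why the strided read above shows zeros after 140.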


## Zarr - Dask Docs

https://docs.dask.org/en/stable/array-creation.html#zarr

In [22]:
import dask.array as da

In [39]:
my_arr = da.arange(0,50,chunks=5)

In [30]:
my_arr

         Array      Chunk
Bytes    240 B      40 B
Shape    (30,)      (5,)
Count    6 Tasks    6 Chunks
Type     int64      numpy.ndarray


In [34]:
my_arr.to_zarr('test.zarr')

In [40]:
my_arr.shape

(50,)

In [41]:
my_arr.chunksize

(5,)

In [50]:
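# create a zarr array backed by a zip file
store = zarr.ZipStore("zipped.zarr")
z = zarr.create(shape=my_arr.shape, chunks=my_arr.chunksize, dtype=float, store=store)

In [51]:
my_arr.to_zarr(z)
# a ZipStore has to be closed after writing, otherwise the zip file is left incomplete
store.close()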

In [52]:
# store unzipped
my_arr.to_zarr('unzipped.zarr')