# Data exploration

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### Quick note on the project directory

The project root directory `~/3dcorrection` is structured as follows:
* `data/` contains raw and preprocessed data.
    * `raw/` is a symbolic link to the same repo for all candidates. DO NOT TOUCH IT!
    * `processed/` will be created when the data is preprocessed and will contain all transformed data.

In [2]:
from kosmoss import CACHE_DATA_PATH, CONFIG, PROCESSED_DATA_PATH
from kosmoss.utils import save_metadata, prime_factors, purgedirs

### The 3D Correction Use-Case

The European Centre for Medium-range Weather Forecasts (ECMWF) has developed a series of models that provide the most accurate parametrization schemes currently available. Among those, SPARTACUS delivers **radiation** predictions over the globe. Because it is computationally demanding, a simpler, degraded model called TRIPLECLOUD was developed to satisfy the constraints of the production environment.

As in most climate models, the globe is split into blocks to leverage hardware acceleration. The immediate consequence is that spatial correlation is lost in exchange for a gain in parallelization.

The unit block is a column that expresses values along the vertical dimension over a set of levels. Each level is

Now let's load the raw data we'll be using throughout this hands-on. Take a look at the [source notebook](https://git.ecmwf.int/projects/MLFET/repos/maelstrom-radiation/browse/climetlab_maelstrom_radiation/radiation.py) for more info on the variables.

## Download dataset

The data has already been downloaded for you with the `climetlab` library, provided by the ECMWF. We'll just load it.

In [3]:
import climetlab as cml
import dask
import dask.array as da
import numpy as np
import os
import os.path as osp
from pprint import pprint

step = CONFIG['timestep']

cml.settings.set("cache-directory", CACHE_DATA_PATH)
cmlds = cml.load_dataset(
    'maelstrom-radiation', 
    dataset='3dcorrection', 
    raw_inputs=False, 
    timestep=list(range(0, 3501, step)), 
    minimal_outputs=False,
    patch=list(range(0, 16, 1)),
    hr_units='K d-1',
)

By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/licences/general/ If you do not agree with such terms, do not download the data. 


                                                

| # | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks | Type | Chunk type |
|---|---|---|---|---|---|---|---|---|
| 1 | (1085440, 17) | (16960, 6) | 70.39 MiB | 397.50 kiB | 1729 | 384 | float32 | numpy.ndarray |
| 2 | (1085440, 137, 27) | (16960, 137, 12) | 14.96 GiB | 106.36 MiB | 6080 | 1024 | float32 | numpy.ndarray |
| 3 | (1085440, 138, 2) | (16960, 138, 1) | 1.12 GiB | 8.93 MiB | 768 | 128 | float32 | numpy.ndarray |
| 4 | (1085440, 138, 1) | (16960, 138, 1) | 571.41 MiB | 8.93 MiB | 320 | 64 | float32 | numpy.ndarray |
| 5 | (1085440, 136, 1) | (16960, 136, 1) | 563.12 MiB | 8.80 MiB | 320 | 64 | float32 | numpy.ndarray |
| 6 | (1085440,) | (16960,) | 4.14 MiB | 66.25 kiB | 192 | 64 | float32 | numpy.ndarray |
| 7 | (1085440,) | (16960,) | 4.14 MiB | 66.25 kiB | 192 | 64 | float32 | numpy.ndarray |
| 8 | (1085440, 138) | (16960, 138) | 571.41 MiB | 8.93 MiB | 192 | 64 | float32 | numpy.ndarray |
| 9 | (1085440, 138) | (16960, 138) | 571.41 MiB | 8.93 MiB | 192 | 64 | float32 | numpy.ndarray |
| 10 | (1085440, 138) | (16960, 138) | 571.41 MiB | 8.93 MiB | 192 | 64 | float32 | numpy.ndarray |
| 11 | (1085440, 138) | (16960, 138) | 571.41 MiB | 8.93 MiB | 192 | 64 | float32 | numpy.ndarray |
| 12 | (1085440, 137) | (16960, 137) | 567.27 MiB | 8.86 MiB | 1664 | 64 | float32 | numpy.ndarray |
| 13 | (1085440, 137) | (16960, 137) | 567.27 MiB | 8.86 MiB | 1664 | 64 | float32 | numpy.ndarray |


Internally, CliMetLab checks that all of the requested pieces have been downloaded. To process the data, we need to convert it into a usable format. Xarray is a library for labelled n-dimensional arrays, widely used in scientific machine learning, that can delegate lazy, chunked computation to Dask. It reads netCDF4 files, a format layered over the popular HDF5 file format that adds metadata to carry additional information. There's already a lot of tech involved:

* Dask is a Python framework that 'provides advanced parallelism for analytics'. [More info](https://dask.org/)
* Xarray provides labelled arrays and an abstraction for HDF5/netCDF4, and can use Dask for lazy, chunked computation. [More info](https://docs.xarray.dev/en/stable/)
* HDF5 is both a file format and a library to process large, n-dimensional datasets. More info on [the format initiative](https://www.hdfgroup.org/solutions/hdf5/) and specifically on [the Python library](https://docs.h5py.org/en/stable/)
* netCDF4 is a file format built on HDF5 that provides additional metadata. [More info](https://unidata.github.io/netcdf4-python/)

In [None]:
xr_array = cmlds.to_xarray()
xr_array

To give you a sense of scale: we have taken only 3 instants (snapshots at a particular time), yet the data is already quite large for a DL use case.

In [4]:
print(f"num of instants: {3500 // step} /3500")
print(f"size: {xr_array.nbytes / float(1 << 30):,.0f} GB")

num of instants: 3 /3500
size: 21 GB
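To see where such sizes come from, here is a minimal sketch that recomputes the footprint of a single variable from its shape and dtype. It assumes the `col_inputs` shape and `float32` type that Dask reports elsewhere in this notebook; note that dividing by `1 << 30` yields gibibytes (GiB).

```python
import math

# Shape and dtype of `col_inputs`, as reported by Dask in this notebook.
shape = (1085440, 137, 27)
itemsize = 4  # bytes per float32 element

# Total size = number of elements * bytes per element.
n_bytes = math.prod(shape) * itemsize
gib = n_bytes / (1 << 30)  # same conversion as in the cell above
print(f"col_inputs: {gib:.2f} GiB")
```

This matches the 14.96 GiB that Dask displays for that array.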


The object returned by `to_xarray()` is an Xarray `Dataset` built from the CliMetLab dataset. Let's check the content of the downloaded file.

Each variable is backed by a Dask array, Dask's own structure to carry data, which is split into chunks. Most operations in Dask/Xarray are computed lazily: only when needed and, if possible, chunk by chunk, while the whole is treated and seen 'as if' it were one contiguous array.
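The idea of chunks and lazy evaluation can be sketched on a toy Dask array. The sizes below are illustrative only and are not taken from the dataset.

```python
import dask.array as da
import numpy as np

# A small array split into 4 chunks of 250 rows each (toy sizes).
a = da.ones((1000, 8), chunks=(250, 8), dtype=np.float32)
print(a.chunks)     # chunk sizes along each axis
print(a.numblocks)  # number of chunks along each axis

# Building an expression is lazy: this only extends the task graph.
total = (a * 2).sum()
print(type(total))

# compute() executes the graph, working chunk by chunk.
result = total.compute()
print(result)
```

Until `compute()` is called, `total` is still a Dask array describing the work to do, not a number.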

In [5]:
xr_array.sca_inputs

| | Array | Chunk |
|---|---|---|
| Bytes | 70.39 MiB | 397.50 kiB |
| Shape | (1085440, 17) | (16960, 6) |
| Count | 1729 Tasks | 384 Chunks |
| Type | float32 | numpy.ndarray |


In [6]:
xr_array.col_inputs

| | Array | Chunk |
|---|---|---|
| Bytes | 14.96 GiB | 106.36 MiB |
| Shape | (1085440, 137, 27) | (16960, 137, 12) |
| Count | 6080 Tasks | 1024 Chunks |
| Type | float32 | numpy.ndarray |


In [7]:
features = [
    'sca_inputs',
    'col_inputs',
    'hl_inputs',
    'inter_inputs',
    'flux_dn_sw',
    'flux_up_sw',
    'flux_dn_lw',
    'flux_up_lw',
]

for feat in features:
    print(f'{feat}: {xr_array[feat].data}')

sca_inputs: dask.array<concatenate, shape=(1085440, 17), dtype=float32, chunksize=(16960, 6), chunktype=numpy.ndarray>
col_inputs: dask.array<concatenate, shape=(1085440, 137, 27), dtype=float32, chunksize=(16960, 137, 12), chunktype=numpy.ndarray>
hl_inputs: dask.array<concatenate, shape=(1085440, 138, 2), dtype=float32, chunksize=(16960, 138, 1), chunktype=numpy.ndarray>
inter_inputs: dask.array<transpose, shape=(1085440, 136, 1), dtype=float32, chunksize=(16960, 136, 1), chunktype=numpy.ndarray>
flux_dn_sw: dask.array<concatenate, shape=(1085440, 138), dtype=float32, chunksize=(16960, 138), chunktype=numpy.ndarray>
flux_up_sw: dask.array<concatenate, shape=(1085440, 138), dtype=float32, chunksize=(16960, 138), chunktype=numpy.ndarray>
flux_dn_lw: dask.array<concatenate, shape=(1085440, 138), dtype=float32, chunksize=(16960, 138), chunktype=numpy.ndarray>
flux_up_lw: dask.array<concatenate, shape=(1085440, 138), dtype=float32, chunksize=(16960, 138), chunktype=numpy.ndarray>


## Flattened data

In [8]:
dataset_len = xr_array.dims['column']
print(f"dataset len: {dataset_len}")
print(f"prime factor decomposition: {prime_factors(dataset_len)}")

dataset len: 1085440
prime factor decomposition: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 53]
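The implementation of `prime_factors` in `kosmoss.utils` is not shown here. As a rough idea of what such a helper might do, here is a minimal trial-division sketch; the name `prime_factors_sketch` is hypothetical and is not the library's function.

```python
from typing import List


def prime_factors_sketch(n: int) -> List[int]:
    """Return the prime factors of n in ascending order, by trial division."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        # Divide out this factor as many times as it appears.
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:  # whatever remains is itself prime
        factors.append(n)
    return factors


factors = prime_factors_sketch(1085440)
print(factors)
```

Any product of a subset of these factors divides the dataset length evenly, which is what makes them useful for choosing a shard count.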


Sharding is a subtle balance. We're going to split a single file into multiple pieces. From a loading perspective, the more pieces the better; but from a file-system perspective, and even more so on NFS, storing a large number of files can congest the system.

Still, for the sake of training, we choose to favor a large number of files, which dramatically speeds up the data loading process.

In [9]:
num_shards = 53 * 2 ** 7
shard_size = dataset_len // num_shards
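The arithmetic behind this choice can be checked in isolation. The sketch below hard-codes the dataset length printed earlier and compares a few shard counts built from its prime factors; the alternative exponents are illustrative only.

```python
dataset_len = 1085440      # 2**12 * 5 * 53, as printed above
num_shards = 53 * 2 ** 7   # the choice made in this notebook

# divmod returns the shard size and what would be left over.
shard_size, remainder = divmod(dataset_len, num_shards)
print(num_shards, shard_size, remainder)

# Other shard counts built from the same prime factors also divide evenly.
for exponent in (5, 7, 9):
    n = 53 * 2 ** exponent
    print(f"{n} shards of {dataset_len // n} columns")
```

A remainder of zero means every shard holds exactly the same number of columns, so no shard needs special handling.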

Configure Dask to execute in a multithreaded environment, as opposed to a multiprocess one. This just makes the implicit explicit.

In [10]:
dask.config.set(scheduler='threads')

<dask.config.set at 0x7f8fc0a5d310>

In [11]:
data = {}
for feat in features:
    array = xr_array[feat].data
    array = da.reshape(array, shape=(array.shape[0], -1))
    data.update({feat: array})

In [12]:
x = da.concatenate([
    data['hl_inputs'],
    data['inter_inputs'],
    data['sca_inputs'],
    data['col_inputs']
], axis=-1)

y = da.concatenate([
    data['flux_dn_sw'],
    data['flux_up_sw'],
    data['flux_dn_lw'],
    data['flux_up_lw'],
], axis=-1)
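The flattening above keeps the column axis and merges all the others, so the width of `x` is the sum of the per-feature sizes. A minimal NumPy sketch, using a toy number of columns and the per-column shapes reported earlier in this notebook:

```python
import numpy as np

n = 4  # toy number of columns; the real dataset has 1085440

# Per-column shapes of the input features, as reported in the notebook.
feature_shapes = {
    'hl_inputs': (138, 2),
    'inter_inputs': (136, 1),
    'sca_inputs': (17,),
    'col_inputs': (137, 27),
}

flat = []
for name, shape in feature_shapes.items():
    array = np.zeros((n, *shape), dtype=np.float32)
    # Keep axis 0, merge every other axis into one.
    flat.append(array.reshape(array.shape[0], -1))

x = np.concatenate(flat, axis=-1)
print([f.shape[1] for f in flat], x.shape)
```

The widths add up to 276 + 136 + 17 + 3699 = 4128 values per column.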

In [13]:
x

| | Array | Chunk |
|---|---|---|
| Bytes | 16.69 GiB | 29.70 MiB |
| Shape | (1085440, 4128) | (16960, 459) |
| Count | 15425 Tasks | 1600 Chunks |
| Type | float32 | numpy.ndarray |


In [14]:
x_ = da.rechunk(x, chunks=(shard_size, *x.shape[1:]))
y_ = da.rechunk(y, chunks=(shard_size, *y.shape[1:]))

In [15]:
x_

| | Array | Chunk |
|---|---|---|
| Bytes | 16.69 GiB | 2.52 MiB |
| Shape | (1085440, 4128) | (160, 4128) |
| Count | 43009 Tasks | 6784 Chunks |
| Type | float32 | numpy.ndarray |


**Looks better!**
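What rechunking does can be seen on a toy array: chunks that were split along both axes are regrouped so that each one spans every feature and a fixed number of rows. The sizes below are illustrative, not the dataset's.

```python
import dask.array as da
import numpy as np

# Toy array whose chunks are split along both axes.
x = da.zeros((640, 12), chunks=(320, 5), dtype=np.float32)
print(x.numblocks)

# Rechunk: 160 rows per chunk, all features in a single chunk.
x_ = da.rechunk(x, chunks=(160, *x.shape[1:]))
print(x_.chunksize, x_.numblocks)
```

After rechunking there is a single block along the feature axis, so each chunk maps to one self-contained shard.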

In [16]:
out_dir = osp.join(PROCESSED_DATA_PATH, f'flattened-{step}')

x_path, y_path = purgedirs([
    osp.join(out_dir, 'x'), 
    osp.join(out_dir, 'y')
])
    
da.to_npy_stack(x_path, x_, axis=0)
da.to_npy_stack(y_path, y_, axis=0)
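To get a feel for what `da.to_npy_stack` leaves on disk, here is a small round-trip sketch on a toy array written to a temporary directory. The array and directory are illustrative and do not touch the project's data.

```python
import os
import tempfile

import dask.array as da
import numpy as np

# Toy array with 3 chunks of 2 rows along axis 0.
source = np.arange(24, dtype=np.float32).reshape(6, 4)
x = da.from_array(source, chunks=(2, 4))

with tempfile.TemporaryDirectory() as tmp:
    # One .npy file per chunk along axis 0, plus an `info` file.
    da.to_npy_stack(tmp, x, axis=0)
    files = sorted(os.listdir(tmp))
    print(files)

    # A single shard can be read on its own with plain NumPy...
    first_shard = np.load(os.path.join(tmp, '0.npy'))
    # ...or the whole stack can be reassembled as a Dask array.
    restored = da.from_npy_stack(tmp, mmap_mode=None).compute()
```

Being able to load one shard independently of the others is what makes this layout convenient for data loading during training.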

In [17]:
metadata_flattened = {
    "dtype": x_.dtype.name,
    "dataset_len": len(x_),
    "num_shards": len(x_.chunks[0]),
    "x_shape": x_.chunksize,
    "y_shape": y_.chunksize,
}
pprint(metadata_flattened)

save_metadata(
    step, 
    metadata_flattened, 
    'flattened'
)

{'dataset_len': 1085440,
 'dtype': 'float32',
 'num_shards': 6784,
 'x_shape': (160, 4128),
 'y_shape': (160, 552)}


## Feature engineering

In [18]:
data = {}
for feat in features:
    array = xr_array[feat].data
    array = da.rechunk(array, chunks=(shard_size, *array.shape[1:]))
    data.update({feat: array})

In [19]:
def broadcast_features(array: da.Array):
    a = da.repeat(array, 138, axis=-1)
    a = da.moveaxis(a, -2, -1)
    return a

def pad_tensor(array: da.Array):
    a = da.pad(array, ((0, 0), (1, 1), (0, 0)))
    return a
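The effect of these two helpers on shapes is easiest to see on small arrays. The sketch below reproduces the same steps with plain NumPy on a toy number of columns; the values are placeholders.

```python
import numpy as np

n = 3  # toy number of columns
hl = np.zeros((n, 138, 2), dtype=np.float32)
inter = np.ones((n, 136, 1), dtype=np.float32)
sca = np.arange(n * 17, dtype=np.float32).reshape(n, 17)

# Pad one zero level at each end of the vertical axis: 136 -> 138 levels.
inter_padded = np.pad(inter, ((0, 0), (1, 1), (0, 0)))

# Repeat the 17 scalars on each of the 138 levels: (n, 17, 1) -> (n, 17, 138),
# then move the level axis before the feature axis: (n, 138, 17).
sca_b = np.repeat(sca[..., np.newaxis], 138, axis=-1)
sca_b = np.moveaxis(sca_b, -2, -1)

x = np.concatenate([hl, inter_padded, sca_b], axis=-1)
print(inter_padded.shape, sca_b.shape, x.shape)
```

Every level of a column carries the same 17 scalar values, so all three inputs end up sharing the 138-level axis and can be concatenated along the feature axis.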

In [20]:
# still lazy
x = da.concatenate([
    data['hl_inputs'],
    pad_tensor(data['inter_inputs']),
    broadcast_features(data['sca_inputs'][..., np.newaxis])
], axis=-1)

y = da.concatenate([
    data['flux_dn_sw'][..., np.newaxis],
    data['flux_up_sw'][..., np.newaxis],
    data['flux_dn_lw'][..., np.newaxis],
    data['flux_up_lw'][..., np.newaxis],
], axis=-1)

edge = data['col_inputs']

print(f"x of shape: {x.shape}")
print(f"y of shape: {y.shape}")
print(f"edge of shape: {edge.shape}")

x of shape: (1085440, 138, 20)
y of shape: (1085440, 138, 4)
edge of shape: (1085440, 137, 27)


In [21]:
x

| | Array | Chunk |
|---|---|---|
| Bytes | 11.16 GiB | 1.41 MiB |
| Shape | (1085440, 138, 20) | (160, 136, 17) |
| Count | 240322 Tasks | 61056 Chunks |
| Type | float32 | numpy.ndarray |


### To a single HDF5 file

Saving to a single HDF5 file will let us experiment later with a toy MPI example that shards the file differently.

In [22]:
x_ = da.rechunk(x, chunks=(shard_size, *x.shape[1:]))
y_ = da.rechunk(y, chunks=(shard_size, *y.shape[1:]))
edge_ = da.rechunk(edge, chunks=(shard_size, *edge.shape[1:]))

In [23]:
out_file = osp.join(PROCESSED_DATA_PATH, f'features-{step}.h5')
if osp.isfile(out_file):
    os.remove(out_file)
    
x_.to_hdf5(out_file, '/x')
y_.to_hdf5(out_file, '/y')
edge_.to_hdf5(out_file, '/edge')

### To a stack of NumPy files

In [24]:
out_dir = osp.join(PROCESSED_DATA_PATH, f'features-{step}')

x_path, y_path, edge_path = purgedirs([
    osp.join(out_dir, 'x'), 
    osp.join(out_dir, 'y'), 
    osp.join(out_dir, 'edge')
])
    
# write the rechunked arrays, so the shards match the metadata saved below
da.to_npy_stack(x_path, x_, axis=0)
da.to_npy_stack(y_path, y_, axis=0)
da.to_npy_stack(edge_path, edge_, axis=0)

### Saving parameters for later use

In [25]:
metadata_features = {
    "dtype": x_.dtype.name,
    "dataset_len": len(x_),
    "num_shards": len(x_.chunks[0]),
    "x_shape": x_.chunksize,
    "y_shape": y_.chunksize,
    "edge_shape": edge_.chunksize,
}
pprint(metadata_features)

save_metadata(
    step, 
    metadata_features, 
    'features'
)

{'dataset_len': 1085440,
 'dtype': 'float32',
 'edge_shape': (160, 137, 27),
 'num_shards': 6784,
 'x_shape': (160, 138, 20),
 'y_shape': (160, 138, 4)}
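One possible later use of this metadata is to locate a sample inside the shards. The sketch below is an assumption about how a loader might do it, not part of `kosmoss`: the helper `locate` is hypothetical, and it assumes shard `i` is stored as `{i}.npy`.

```python
# Subset of the metadata printed above.
metadata = {
    'dataset_len': 1085440,
    'num_shards': 6784,
    'x_shape': (160, 138, 20),
}


def locate(index: int, metadata: dict) -> tuple:
    """Map a global column index to (shard number, offset inside the shard)."""
    if not 0 <= index < metadata['dataset_len']:
        raise IndexError(index)
    shard_size = metadata['x_shape'][0]  # columns per shard
    return divmod(index, shard_size)


shard, offset = locate(1000, metadata)
print(f"column 1000 -> file {shard}.npy, row {offset}")
```

Because every shard has the same size, a single integer division is enough and no lookup table is needed.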
