
# Data exploration

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### Quick note on the project directory

The project root directory `~/3dcorrection` is structured as follows:
* `data/` contains raw and preprocessed data. 
    * `raw/` is a symbolic link to a repo shared by all candidates. DO NOT TOUCH IT!
    * `processed/` is created when the data is preprocessed and holds all transformed data.

In [2]:
import os

root_path = os.path.join('/', 'root', 'bootcamps')

data_path = os.path.join(root_path, 'data')
raw_data_path = os.path.join(data_path, 'raw')
processed_data_path = os.path.join(data_path, 'processed')

### The 3D Correction Use-Case

The European Centre for Medium-Range Weather Forecasts (ECMWF) has developed a series of models that provide the most accurate parametrization schemes currently available. Among them, SPARTACUS delivers **radiation** predictions over the globe. Because SPARTACUS is computationally demanding, a simpler, degraded model called TRIPLECLOUD was developed to satisfy the constraints of the production environment. 

Like most climate models, it splits the globe into blocks to leverage hardware acceleration. The immediate consequence is that the spatial correlation between blocks is lost in exchange for parallelization. 

The unit block is a column that expresses values along the vertical dimension over a set of levels. Each level is

In [3]:
import os

# proxy setup
os.environ['http_proxy'] = 'http://129.183.4.13:8080'
os.environ['https_proxy'] = 'http://129.183.4.13:8080'
os.environ['no_proxy'] = 'yoda,129.183.101.5,172.16.118.13,naboo0,naboo5,nwadmin,172.16.118.137,172.16.118.134'

Now let's load the raw data we'll be using throughout this hands-on. Take a look at the [source code](https://git.ecmwf.int/projects/MLFET/repos/maelstrom-radiation/browse/climetlab_maelstrom_radiation/radiation.py) for more information on the variables.

In [4]:
import climetlab as cml
import numpy as np 

# setting up the cache directory so there is no need to re-download the data
cml.settings.set("cache-directory", raw_data_path)

step = 250

# loading the dataset. take a look at [the source]
cmlds = cml.load_dataset(
    # use-case
    'maelstrom-radiation', 
    
    # dataset name
    dataset='3dcorrection', 
    
    # make feature engineering
    raw_inputs=False, 
    
    # sample over time
    timestep=list(range(0, 3501, step)), 
    
    # full output
    minimal_outputs=False,
    
    # sample geographically
    patch=list(range(0, 16, 1)),
    
    # units
    hr_units='K d-1',
)

CliMetLab cache: orphan found: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-1000.nc
CliMetLab cache: orphan found: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-250.nc
CliMetLab cache: orphan found: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/.ipynb_checkpoints
CliMetLab cache: trying to free 325.1 GiB
Deleting entry {
    "path": "/home/research/research/modeling/weather-forecast/tripleclouds/data/raw/url-bb00f35901fb668827703fce316b80ec05e427cf424cfdd77e953bedc2e74829.nc",
    "owner": "url",
    "args": {
        "url": "https://storage.ecmwf.europeanweather.cloud/MAELSTROM_AP3/rad4NN_inputs_2020010100_1125c1.nc",
        "parts": null
    },
    "creation_date": "2022-01-28 08:55:02.270383",
    "flags": 0,
    "owner_data": {
        "content-length": "298869871",
        "accept-ranges": "bytes",
        "last-modified": "Wed, 17 Nov 2021 09:25:29 GMT",
        "x-rgw-object-type": "Normal"

By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/licences/general/ If you do not agree with such terms, do not download the data. 


  0%|          | 0/240 [00:00<?, ?it/s]

  0%|          | 0/240 [00:00<?, ?it/s]
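As a rough sanity check of the sampling arguments above, the sketch below counts the files implied by `timestep` and `patch`. It assumes one file per (timestep, patch) pair and 16,960 columns per file; the second figure is read off the chunk shapes printed elsewhere in this notebook, not taken from the CliMetLab documentation, and `n_files` is a hypothetical helper.

```python
def n_files(step, n_patches=16, last=3500):
    """Number of (timestep, patch) pairs for a given sampling step."""
    timesteps = list(range(0, last + 1, step))
    return len(timesteps) * n_patches


COLUMNS_PER_FILE = 16960  # assumption: taken from the dask chunk shapes

for step in (250, 1000):
    files = n_files(step)
    print(f"step={step}: {files} files, {files * COLUMNS_PER_FILE} columns")
```

With `step = 250` this gives 240 files, which matches the progress bar above, and 4,070,400 columns, which matches the `column` dimension reported for `data-250.nc`.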

In [5]:
type(cmlds)

climetlab_maelstrom_radiation.radiation.radiation

The returned object is a CliMetLab dataset that can be converted to an xarray `Dataset` with `to_xarray()`.

Let's check the content of the downloaded file:

In [7]:
import netCDF4
from pprint import pprint
step = 1000
path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(path, "r", format="NETCDF4") as file:
    print(f"file: {path}")
    print(f"size: {os.path.getsize(path)/float(1<<30):,.0f} GB")
    
    print("variables: ")
    pprint(list(file.variables.keys()))
    
    print("dimensions: ")
    pprint(file.dimensions)
    
    print(f"dataset size: {file.dimensions['column'].size}")
    
    print(file.variables['sca_inputs'].shape)

file: /root/bootcamps/data/raw/data-1000.nc
size: 21 GB
variables: 
['sca_inputs',
 'col_inputs',
 'hl_inputs',
 'pressure_hl',
 'inter_inputs',
 'lat',
 'lon',
 'flux_dn_sw',
 'flux_up_sw',
 'flux_dn_lw',
 'flux_up_lw',
 'hr_sw',
 'hr_lw']
dimensions: 
{'col_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'col_variable', size = 27,
 'column': <class 'netCDF4._netCDF4.Dimension'>: name = 'column', size = 1085440,
 'half_level': <class 'netCDF4._netCDF4.Dimension'>: name = 'half_level', size = 138,
 'hl_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'hl_variable', size = 2,
 'inter_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'inter_variable', size = 1,
 'level': <class 'netCDF4._netCDF4.Dimension'>: name = 'level', size = 137,
 'level_interface': <class 'netCDF4._netCDF4.Dimension'>: name = 'level_interface', size = 136,
 'p_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'p_variable', size = 1,
 'sca_variable': <class 'netCDF4._netCDF4.Dimension'>: 

In [16]:
da.pad(np.arange(10).reshape(2, -1), (0, 1))

| | Array | Chunk |
|---|---|---|
| Bytes | 144 B | 80 B |
| Shape | (3, 6) | (2, 5) |
| Count | 11 Tasks | 4 Chunks |
| Type | int64 | numpy.ndarray |


In [27]:
da.pad(np.arange(10).reshape(2, -1), 1)

| | Array | Chunk |
|---|---|---|
| Bytes | 224 B | 80 B |
| Shape | (4, 7) | (2, 5) |
| Count | 18 Tasks | 9 Chunks |
| Type | int64 | numpy.ndarray |


In [48]:
np.repeat(np.arange(30).reshape(2, 5, -1), (1, 1, 138), axis=-1)

array([[[ 0,  1,  2, ...,  2,  2,  2],
        [ 3,  4,  5, ...,  5,  5,  5],
        [ 6,  7,  8, ...,  8,  8,  8],
        [ 9, 10, 11, ..., 11, 11, 11],
        [12, 13, 14, ..., 14, 14, 14]],

       [[15, 16, 17, ..., 17, 17, 17],
        [18, 19, 20, ..., 20, 20, 20],
        [21, 22, 23, ..., 23, 23, 23],
        [24, 25, 26, ..., 26, 26, 26],
        [27, 28, 29, ..., 29, 29, 29]]])

Most operations in dask and xarray are computed lazily: nothing runs until the result is needed. Where possible, the computation is applied chunk by chunk, while the chunked array is still treated and seen as if it were one contiguous array.
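A minimal sketch of this behaviour, using a small toy array rather than the radiation data: the reduction is only described when `mean` is called and is executed chunk by chunk when `compute()` is called.

```python
import dask.array as da
import numpy as np

# A 4x3 array split into two chunks of 2 rows each.
x = da.from_array(np.arange(12, dtype=np.float32).reshape(4, 3), chunks=(2, 3))

lazy_mean = x.mean(axis=0)     # builds a task graph, computes nothing yet
print(type(lazy_mean))         # still a dask array
print(x.numblocks)             # number of chunks along each axis

result = lazy_mean.compute()   # runs the graph and returns a NumPy array
print(result)
```

The result is the same as calling `mean(axis=0)` on the full NumPy array, even though each chunk is reduced separately before the partial results are combined.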

In [61]:
d = dict()
d.update({'hello': 'world'})
d

{'hello': 'world'}

In [3]:
x = da.from_npy_stack(osp.join(config.processed_data_path, 'feats_npy', 'x'))
x

| | Array | Chunk |
|---|---|---|
| Bytes | 11.16 GiB | 107.81 MiB |
| Shape | (1085440, 138, 20) | (10240, 138, 20) |
| Count | 106 Tasks | 106 Chunks |
| Type | float32 | numpy.ndarray |


In [1]:
import climetlab as cml
import dask
import dask.array as da
from glob import glob
import numpy as np
import os.path as osp
import xarray as xr

import config

cml.settings.set("cache-directory", config.cache_data_path)

cmlds = cml.load_dataset(
    'maelstrom-radiation', 
    dataset='3dcorrection', 
    raw_inputs=False, 
    timestep=list(range(0, 3501, config.params['timestep'])), 
    minimal_outputs=False,
    patch=list(range(0, 16, 1)),
    hr_units='K d-1',
)

xr_array = cmlds.to_xarray()
xr_array

By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/licences/general/ If you do not agree with such terms, do not download the data. 


                                   

| Variable | Bytes | Chunk bytes | Shape | Chunk shape | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|---|
| sca_inputs | 70.39 MiB | 397.50 kiB | (1085440, 17) | (16960, 6) | 1729 | 384 | float32 |
| col_inputs | 14.96 GiB | 106.36 MiB | (1085440, 137, 27) | (16960, 137, 12) | 6080 | 1024 | float32 |
| hl_inputs | 1.12 GiB | 8.93 MiB | (1085440, 138, 2) | (16960, 138, 1) | 768 | 128 | float32 |
| pressure_hl | 571.41 MiB | 8.93 MiB | (1085440, 138, 1) | (16960, 138, 1) | 320 | 64 | float32 |
| inter_inputs | 563.12 MiB | 8.80 MiB | (1085440, 136, 1) | (16960, 136, 1) | 320 | 64 | float32 |
| lat | 4.14 MiB | 66.25 kiB | (1085440,) | (16960,) | 192 | 64 | float32 |
| lon | 4.14 MiB | 66.25 kiB | (1085440,) | (16960,) | 192 | 64 | float32 |
| flux_dn_sw | 571.41 MiB | 8.93 MiB | (1085440, 138) | (16960, 138) | 192 | 64 | float32 |
| flux_up_sw | 571.41 MiB | 8.93 MiB | (1085440, 138) | (16960, 138) | 192 | 64 | float32 |
| flux_dn_lw | 571.41 MiB | 8.93 MiB | (1085440, 138) | (16960, 138) | 192 | 64 | float32 |
| flux_up_lw | 571.41 MiB | 8.93 MiB | (1085440, 138) | (16960, 138) | 192 | 64 | float32 |
| hr_sw | 567.27 MiB | 8.86 MiB | (1085440, 137) | (16960, 137) | 1664 | 64 | float32 |
| hr_lw | 567.27 MiB | 8.86 MiB | (1085440, 137) | (16960, 137) | 1664 | 64 | float32 |


In [2]:
def broadcast_features(tensor):
    t = da.repeat(tensor, 138, axis=-1)
    t = da.moveaxis(t, -2, -1)
    return t

def pad_tensor(tensor):
    return da.pad(tensor, ((0, 0), (1, 1), (0, 0)))

features = [
    'sca_inputs',
    'col_inputs',
    'hl_inputs',
    'inter_inputs',
    'flux_dn_sw',
    'flux_up_sw',
    'flux_dn_lw',
    'flux_up_lw',
]

dataset_size = xr_array.dims['column']
shards = 53 * 2 ** 3
batch_size = dataset_size // shards

# all this is lazy
data = {}
for feat in features:
    shape = xr_array[feat].shape
    print(f"{feat}, shape: {shape}\n")
    array = da.rechunk(xr_array[feat].data, chunks=(batch_size, *shape[1:]))
    print(array)
    # var = da.reshape(data, (shape[0] * shape[1], *shape[2:]), merge_chunks=False)
    data.update({feat: array})

# still lazy
x = da.concatenate([
    data['hl_inputs'],
    pad_tensor(data['inter_inputs']),
    broadcast_features(data['sca_inputs'][..., np.newaxis])
], axis=-1)

y = da.concatenate([
    data['flux_dn_sw'][..., np.newaxis],
    data['flux_up_sw'][..., np.newaxis],
    data['flux_dn_lw'][..., np.newaxis],
    data['flux_up_lw'][..., np.newaxis],
], axis=-1)

# x.mean(axis=0).compute()

x.shape

# array.to_netcdf(osp.join(config.raw_data_path, f'data-{config.params["timestep"]}.nc'))

sca_inputs, shape: (1085440, 17)

dask.array<rechunk-merge, shape=(1085440, 17), dtype=float32, chunksize=(2560, 17), chunktype=numpy.ndarray>
col_inputs, shape: (1085440, 137, 27)

dask.array<rechunk-merge, shape=(1085440, 137, 27), dtype=float32, chunksize=(2560, 137, 27), chunktype=numpy.ndarray>
hl_inputs, shape: (1085440, 138, 2)

dask.array<rechunk-merge, shape=(1085440, 138, 2), dtype=float32, chunksize=(2560, 138, 2), chunktype=numpy.ndarray>
inter_inputs, shape: (1085440, 136, 1)

dask.array<rechunk-merge, shape=(1085440, 136, 1), dtype=float32, chunksize=(2560, 136, 1), chunktype=numpy.ndarray>
flux_dn_sw, shape: (1085440, 138)

dask.array<rechunk-merge, shape=(1085440, 138), dtype=float32, chunksize=(2560, 138), chunktype=numpy.ndarray>
flux_up_sw, shape: (1085440, 138)

dask.array<rechunk-merge, shape=(1085440, 138), dtype=float32, chunksize=(2560, 138), chunktype=numpy.ndarray>
flux_dn_lw, shape: (1085440, 138)

dask.array<rechunk-merge, shape=(1085440, 138), dtype=float32

(1085440, 138, 20)
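To see where the 20 input features come from, the sketch below repeats the same padding, broadcasting and concatenation steps with NumPy on a tiny stand-in dataset. The array contents are made up; only the shapes follow the notebook.

```python
import numpy as np

n, half_levels = 4, 138  # tiny stand-in for the 1,085,440 columns

hl_inputs = np.zeros((n, half_levels, 2), dtype=np.float32)
inter_inputs = np.zeros((n, half_levels - 2, 1), dtype=np.float32)
sca_inputs = np.arange(n * 17, dtype=np.float32).reshape(n, 17)

# Pad the 136 interface values with one zero row at each end: 136 -> 138.
inter_padded = np.pad(inter_inputs, ((0, 0), (1, 1), (0, 0)))

# Repeat every scalar along a new trailing axis, then swap the last two
# axes so each of the 138 half-levels carries the same 17 scalars.
sca_broadcast = np.repeat(sca_inputs[..., np.newaxis], half_levels, axis=-1)
sca_broadcast = np.moveaxis(sca_broadcast, -2, -1)

# 2 (hl_inputs) + 1 (inter_inputs) + 17 (sca_inputs) = 20 features.
x = np.concatenate([hl_inputs, inter_padded, sca_broadcast], axis=-1)
print(inter_padded.shape, sca_broadcast.shape, x.shape)
```

Note that `col_inputs` is rechunked above but not included in `x`: its 137 levels do not line up with the 138 half-levels without further processing.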

In [18]:
from dask import delayed
import torch
import torch_geometric as pyg


_directed_index = np.array([[*range(1, 138)], [*range(137)]])
_undirected_index = np.hstack((
    _directed_index, 
    _directed_index[[1, 0], :]
))
undirected_index = torch.tensor(_undirected_index, dtype=torch.long)

feats = da.concatenate((x, y), axis=-1)

# @delayed
def build_graphs(tensor):
    print(tensor)
    print('started building graphs')
    x, y = tensor[:20], tensor[20:]
    print(x, y)
    print(x.shape, y.shape)
    data_list = []
    for idx in range(x.shape[0]):
        x_ = torch.squeeze(torch.tensor(x[idx, ...]))
        y_ = torch.squeeze(torch.tensor(y[idx, ...]))

        # edge_attr = torch.squeeze(sca_inputs_[idx, ...])

        data = pyg.data.Data(
            x=x_,
            # edge_attr=edge_attr,
            edge_index=undirected_index,
            y=y_,
        )

        data_list.append(data)
    return data_list
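The edge index built above describes a chain graph: each of the 138 half-levels is linked to its vertical neighbour, in both directions. The NumPy-only sketch below reproduces the index and checks the sizes; `n_columns` is the dataset size reported by the notebook for `data-1000.nc`.

```python
import numpy as np

n_nodes = 138  # half-levels per column

# Directed edges i+1 -> i, then the reversed copies to make them undirected.
directed = np.array([np.arange(1, n_nodes), np.arange(n_nodes - 1)])
undirected = np.hstack((directed, directed[[1, 0], :]))
print(directed.shape, undirected.shape)

# Sizes once every column is collated into one big graph object.
n_columns = 1085440
print(n_columns * n_nodes, n_columns * undirected.shape[1])
```

The two totals, 149,790,720 nodes and 297,410,560 edges, agree with the collated `Data` object loaded from `data-1000.pt` at the end of this notebook.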

In [19]:
feats.map_blocks(build_graphs)

[]
started building graphs
[] []
(0, 0, 0) (0, 0, 0)
[[[1.]]]
started building graphs
[[[1.]]] []
(1, 1, 1) (0, 1, 1)


ValueError: `dtype` inference failed in `map_blocks`.

Please specify the dtype explicitly using the `dtype` kwarg.

Original error is below:
------------------------
IndexError('index 0 is out of bounds for axis 0 with size 0')

Traceback:
---------
  File "/usr/local/lib/python3.8/dist-packages/dask/array/core.py", line 449, in apply_infer_dtype
    o = func(*args, **kwargs)
  File "/tmp/ipykernel_234415/2391466746.py", line 25, in build_graphs
    y_ = torch.squeeze(torch.tensor(y[idx, ...]))
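
One reading of the failure above: `map_blocks` expects each block to map to an array, and it first calls the function on tiny dummy arrays to infer the output type (the empty and `[[[1.]]]` inputs printed above). A function that returns a Python list of graphs does not fit that contract. A possible workaround, sketched here under assumptions, is to turn each block into a `dask.delayed` task via `to_delayed()`. The helper `build_records` and the plain dictionaries are hypothetical stand-ins for `build_graphs` and `pyg.data.Data`, used so that the sketch runs without torch. Note that it slices along the last axis (`block[..., :n_inputs]`), since features and targets were concatenated with `axis=-1`.

```python
import dask
import dask.array as da
import numpy as np

# Toy stand-in: 4 columns, 3 levels, 3 input features + 2 targets.
feats = da.from_array(
    np.arange(4 * 3 * 5, dtype=np.float32).reshape(4, 3, 5),
    chunks=(2, 3, 5),
)


def build_records(block, n_inputs=3):
    """Hypothetical helper: one record per column of a NumPy block."""
    x, y = block[..., :n_inputs], block[..., n_inputs:]
    return [{"x": x[i], "y": y[i]} for i in range(block.shape[0])]


# One delayed task per chunk; each task receives the chunk as a NumPy array.
tasks = [dask.delayed(build_records)(b) for b in feats.to_delayed().ravel()]
per_block = dask.compute(*tasks)
records = [record for block in per_block for record in block]
print(len(records), records[0]["x"].shape, records[0]["y"].shape)
```

Each task returns an ordinary Python list, so no dtype inference is involved; the price is that the result is no longer a dask array.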


In [None]:
da.map_blocks(

In [4]:
graphs = build_graphs(x, y)

In [5]:
graphs.compute()

started building graphs


[Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138, 4]),
 Data(x=[138, 20], edge_index=[2, 274], y=[138

In [None]:
graphs

In [None]:
x

In [4]:
a = array.to_array()

  return da_func(*args, **kwargs)


In [5]:
a

| | Array | Chunk |
|---|---|---|
| Bytes | 118.33 PiB | 11.42 TiB |
| Shape | (13, 1085440, 17, 137, 27, 138, 2, 1, 136, 1) | (1, 16960, 6, 137, 12, 138, 1, 1, 136, 1) |
| Count | 495809 Tasks | 159744 Chunks |
| Type | float32 | numpy.ndarray |


In [None]:
import dask.array as da
from glob import glob
import numpy as np
import os.path as osp
import xarray as xr

import config

shards = glob(osp.join(config.processed_data_path, 'shards_h5', '*.h5'))

dataset = xr.open_mfdataset(shards, chunks=-1, combine="nested", concat_dim="concat_dim", parallel=True)

def broadcast_features(tensor):
    t = da.repeat(tensor, 138, axis=-1)
    t = da.moveaxis(t, 1, -1)
    return t

def pad_tensor(tensor):
    return da.pad(tensor, ((0, 0), (1, 1), (0, 0)))

features = [
    'sca_inputs',
    'col_inputs',
    'hl_inputs',
    'inter_inputs',
    'flux_dn_sw',
    'flux_up_sw',
    'flux_dn_lw',
    'flux_up_lw',
]

# all this is lazy
data = {}
for feat in features:
    shape = dataset[feat].shape
    var = da.reshape(dataset[feat].data, (shape[0] * shape[1], *shape[2:]), merge_chunks=False)
    data.update({feat: var})
    
# still lazy
x = da.concatenate([
    data['hl_inputs'],
    pad_tensor(data['inter_inputs'][..., np.newaxis]),
    broadcast_features(data['sca_inputs'][..., np.newaxis])
], axis=-1)
y = da.concatenate([
    data['flux_dn_sw'][..., np.newaxis],
    data['flux_up_sw'][..., np.newaxis],
    data['flux_dn_lw'][..., np.newaxis],
    data['flux_up_lw'][..., np.newaxis],
], axis=-1)

# still...
x_mean = da.mean(x, axis=0)
y_mean = da.mean(y, axis=0)
x_std = da.std(x, axis=0)
y_std = da.std(y, axis=0)

In [13]:
import dask.array as da
from glob import glob
import numpy as np
import os.path as osp
import xarray as xr

import config

shards = glob(osp.join(config.processed_data_path, 'shards_h5', '*.h5'))

dataset = xr.open_mfdataset(shards, chunks=-1, combine="nested", concat_dim="concat_dim", parallel=True)

def broadcast_features(tensor):
    t = da.repeat(tensor, 138, axis=-1)
    t = da.moveaxis(t, 1, -1)
    return t

def pad_tensor(tensor):
    return da.pad(tensor, ((0, 0), (1, 1), (0, 0)))

features = [
    'sca_inputs',
    'col_inputs',
    'hl_inputs',
    'inter_inputs',
    'flux_dn_sw',
    'flux_up_sw',
    'flux_dn_lw',
    'flux_up_lw',
]

# all this is lazy
data = {}
for feat in features:
    shape = dataset[feat].shape
    print(shape)
    var = da.reshape(dataset[feat].data, (shape[0] * shape[1], *shape[2:]), merge_chunks=False)
    print(str(var.shape) + '\n')
    data.update({feat: var})
    
# still lazy
x = da.concatenate([
    data['hl_inputs'],
    pad_tensor(data['inter_inputs'][..., np.newaxis]),
    broadcast_features(data['sca_inputs'][..., np.newaxis])
], axis=-1)
y = da.concatenate([
    data['flux_dn_sw'][..., np.newaxis],
    data['flux_up_sw'][..., np.newaxis],
    data['flux_dn_lw'][..., np.newaxis],
    data['flux_up_lw'][..., np.newaxis],
], axis=-1)

# still...
x_mean = da.mean(x, axis=0)
y_mean = da.mean(y, axis=0)
x_std = da.std(x, axis=0)
y_std = da.std(y, axis=0)

(106, 10240, 17)
(1085440, 17)

(106, 10240, 137, 27)
(1085440, 137, 27)

(106, 10240, 138, 2)
(1085440, 138, 2)

(106, 10240, 136)
(1085440, 136)

(106, 10240, 138)
(1085440, 138)

(106, 10240, 138)
(1085440, 138)

(106, 10240, 138)
(1085440, 138)

(106, 10240, 138)
(1085440, 138)



    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array.reshape(shape)

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    >>> array.reshape(shape, limit='128 MiB')
  coro.send(None)


In [59]:
x_mean.compute()

array([[1.9744342e+02, 0.0000000e+00, 0.0000000e+00, ..., 9.9013603e-01,
        9.7981578e-01, 1.4058698e+03],
       [2.0987785e+02, 2.0001588e+00, 2.9141670e-01, ..., 9.9013603e-01,
        9.7981578e-01, 1.4058698e+03],
       [2.1395828e+02, 3.1019697e+00, 2.9696614e-01, ..., 9.9013603e-01,
        9.7981578e-01, 1.4058698e+03],
       ...,
       [2.8695569e+02, 9.8171164e+04, 9.8881650e-01, ..., 9.9013603e-01,
        9.7981578e-01, 1.4058698e+03],
       [2.8706107e+02, 9.8426625e+04, 9.8975897e-01, ..., 9.9013603e-01,
        9.7981578e-01, 1.4058698e+03],
       [2.8745920e+02, 9.8660430e+04, 0.0000000e+00, ..., 9.9013603e-01,
        9.7981578e-01, 1.4058698e+03]], dtype=float32)
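
The per-feature means and standard deviations computed above are typically used to standardize the inputs. The sketch below shows that step with NumPy on random stand-in data. The padded rows of `inter_inputs` are constant zeros, so their standard deviation is zero; the small `eps` guard is an assumption added here to avoid dividing by zero, not something the notebook defines.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(200, 138, 20)).astype(np.float32)
x[:, 0, 2] = 0.0  # a constant entry, like the zero padding in the real data

x_mean = x.mean(axis=0)  # shape (138, 20), one value per level and feature
x_std = x.std(axis=0)

eps = 1e-6  # hypothetical guard against zero standard deviation
x_norm = (x - x_mean) / (x_std + eps)
print(x_norm.shape, np.isfinite(x_norm).all())
```

With dask arrays the same expression stays lazy until it is computed or written to disk.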

In [7]:
data

{'sca_inputs': dask.array<reshape, shape=(1085440, 17), dtype=float32, chunksize=(10240, 17), chunktype=numpy.ndarray>,
 'col_inputs': dask.array<reshape, shape=(1085440, 137, 27), dtype=float32, chunksize=(10240, 137, 27), chunktype=numpy.ndarray>,
 'hl_inputs': dask.array<reshape, shape=(1085440, 138, 2), dtype=float32, chunksize=(10240, 138, 2), chunktype=numpy.ndarray>,
 'inter_inputs': dask.array<reshape, shape=(1085440, 136), dtype=float32, chunksize=(10240, 136), chunktype=numpy.ndarray>,
 'flux_dn_sw': dask.array<reshape, shape=(1085440, 138), dtype=float32, chunksize=(10240, 138), chunktype=numpy.ndarray>,
 'flux_up_sw': dask.array<reshape, shape=(1085440, 138), dtype=float32, chunksize=(10240, 138), chunktype=numpy.ndarray>,
 'flux_dn_lw': dask.array<reshape, shape=(1085440, 138), dtype=float32, chunksize=(10240, 138), chunktype=numpy.ndarray>,
 'flux_up_lw': dask.array<reshape, shape=(1085440, 138), dtype=float32, chunksize=(10240, 138), chunktype=numpy.ndarray>}

In [13]:
print(col_inputs.shape)

(1085440, 137, 27)


In [14]:
dataset.sca_inputs

| | Array | Chunk |
|---|---|---|
| Bytes | 70.39 MiB | 680.00 kiB |
| Shape | (106, 10240, 17) | (1, 10240, 17) |
| Count | 424 Tasks | 106 Chunks |
| Type | float32 | numpy.ndarray |


In [6]:
dataset['hl_inputs'].reshape

| | Array | Chunk |
|---|---|---|
| Bytes | 1.12 GiB | 10.78 MiB |
| Shape | (106, 10240, 138, 2) | (1, 10240, 138, 2) |
| Count | 424 Tasks | 106 Chunks |
| Type | float32 | numpy.ndarray |


In [None]:
# Feature engineering, build an input x with 20 features
x = torch.cat([
    hl_inputs,
    inter_inputs_,
    sca_inputs_
], dim=-1)

# Feature engineering, build ground truth with 4 features
y = torch.cat([
    torch.unsqueeze(flux_dn_sw, -1),
    torch.unsqueeze(flux_up_sw, -1),
    torch.unsqueeze(flux_dn_lw, -1),
    torch.unsqueeze(flux_up_lw, -1),
], dim=-1)

In [4]:
array = da.to_numpy()
array

array([[[[3.56076725e-06, 3.13430775e-07, 4.00370947e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05],
         [3.95448751e-06, 2.29984124e-07, 4.00370918e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05],
         [4.26555061e-06, 3.08398512e-07, 4.00370947e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05],
         ...,
         [5.37940208e-03, 3.28712026e-08, 4.07382002e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05],
         [5.39439404e-03, 3.27458629e-08, 4.07385436e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05],
         [5.43346442e-03, 3.17613740e-08, 4.07380401e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05]],

        [[3.55422776e-06, 3.13292418e-07, 4.00370947e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05],
         [3.96148562e-06, 2.28677138e-07, 4.00370918e-04, ...,
          0.00000000e+00, 3.99999999e-06, 5.19615969e-05],
        

In [None]:
dataset_len = Parameter('dataset_len',
                      help="Dataset total length.",
                      default=2 ** 12 * 5 * 53)

num_shards = Parameter('num_shards',
                      help=f"Desired number of shards. Should be a divisor of {dataset_len}",
                      default=2 * 53)

timestep = Parameter('timestep',
                     help="Dataset timestep.",
                     default=1000)
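
The default values above can be checked with plain integer arithmetic. This sketch only uses the numbers from the parameter defaults; it does not depend on the flow framework.

```python
dataset_len = 2 ** 12 * 5 * 53  # total number of columns
num_shards = 2 * 53             # desired number of shards

# Every shard has the same length only if num_shards divides dataset_len.
shard_len, remainder = divmod(dataset_len, num_shards)
assert remainder == 0, "num_shards should divide dataset_len"
print(dataset_len, num_shards, shard_len)
```

This yields 1,085,440 columns split into 106 shards of 10,240 columns each, the same shapes as the `(106, 10240, ...)` arrays read back from the `shards_h5` files.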

Metaflow:
* well-built, simple-to-use library
* not bound to a technology (data format, framework)
* enforces code quality at development time, yet allows building on top

1 file

n files advantages:
* allows quick iteration
* heavily parallelizable
* works with large (10 TB+) datasets
* does not lock memory for training
* works when memory is limited
* cost: requires preprocessing to shard the file

on-the-fly data preprocessing
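
The "n files" approach in the notes above can be sketched as follows. This is an illustration with `.npy` files and a toy array, not the notebook's actual preprocessing (which writes HDF5 shards); `write_shards` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np


def write_shards(array, num_shards, out_dir):
    """Split `array` along axis 0 and save each piece as its own .npy file."""
    paths = []
    for i, shard in enumerate(np.array_split(array, num_shards, axis=0)):
        path = os.path.join(out_dir, f"shard-{i:04d}.npy")
        np.save(path, shard)
        paths.append(path)
    return paths


data = np.arange(24, dtype=np.float32).reshape(12, 2)
out_dir = tempfile.mkdtemp()
paths = write_shards(data, num_shards=4, out_dir=out_dir)

# Memory-map the shards so that only the parts being read occupy memory.
shards = [np.load(p, mmap_mode="r") for p in paths]
restored = np.concatenate(shards, axis=0)
print(len(paths), shards[0].shape, np.array_equal(restored, data))
```

Because each shard is an independent file, workers can process different shards in parallel without loading the full dataset.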

In [8]:
import cProfile

profile = cProfile.Profile()
profile.enable()
path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(path, "r", format="NETCDF4") as file:
    print(f"file: {path}")
    print(f"size: {os.path.getsize(path)/float(1<<30):,.0f} GB")
    
    print("variables: ")
    pprint(list(file.variables.keys()))
    
    print("dimensions: ")
    pprint(file.dimensions)
profile.disable()

file: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-250.nc
size: 77 GB
variables: 
['sca_inputs',
 'col_inputs',
 'hl_inputs',
 'pressure_hl',
 'inter_inputs',
 'lat',
 'lon',
 'flux_dn_sw',
 'flux_up_sw',
 'flux_dn_lw',
 'flux_up_lw',
 'hr_sw',
 'hr_lw']
dimensions: 
{'col_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'col_variable', size = 27,
 'column': <class 'netCDF4._netCDF4.Dimension'>: name = 'column', size = 4070400,
 'half_level': <class 'netCDF4._netCDF4.Dimension'>: name = 'half_level', size = 138,
 'hl_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'hl_variable', size = 2,
 'inter_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'inter_variable', size = 1,
 'level': <class 'netCDF4._netCDF4.Dimension'>: name = 'level', size = 137,
 'level_interface': <class 'netCDF4._netCDF4.Dimension'>: name = 'level_interface', size = 136,
 'p_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'p_variable', size = 1,
 'sca_var

In [13]:
int(len(os.sched_getaffinity(0)) * .8)

76
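
The expression above keeps 80% of the CPUs available to the process as a worker count. Since `os.sched_getaffinity` does not exist on every platform, a more portable variant might look like the sketch below; `usable_cpus` and `worker_count` are hypothetical helper names.

```python
import os


def usable_cpus():
    """CPUs this process may run on, falling back to the machine total."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def worker_count(fraction=0.8):
    """Keep a fraction of the usable CPUs, but never fewer than one."""
    return max(1, int(usable_cpus() * fraction))


print(usable_cpus(), worker_count())
```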

In [9]:
profile.print_stats()

         1738 function calls (1685 primitive calls) in 0.012 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 2611381482.py:15(<module>)
        1    0.000    0.000    0.000    0.000 2611381482.py:5(<module>)
        1    0.009    0.009    0.011    0.011 2611381482.py:6(<module>)
        3    0.000    0.000    0.000    0.000 codeop.py:142(__call__)
        6    0.000    0.000    0.000    0.000 compilerop.py:166(extra_flags)
        3    0.000    0.000    0.000    0.000 contextlib.py:108(__enter__)
        3    0.000    0.000    0.000    0.000 contextlib.py:117(__exit__)
        3    0.000    0.000    0.000    0.000 contextlib.py:238(helper)
        3    0.000    0.000    0.000    0.000 contextlib.py:82(__init__)
        1    0.000    0.000    0.000    0.000 genericpath.py:48(getsize)
        3    0.000    0.000    0.000    0.000 hooks.py:103(__call__)
        3    0.000    0.000 
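
The default listing is ordered by name, which makes the expensive calls hard to spot. A sketch of sorting by cumulative time with the standard `pstats` module is shown below; `work` is a hypothetical stand-in for the profiled cell.

```python
import cProfile
import io
import pstats


def work():
    """Stand-in workload for the profiled code."""
    return sum(i * i for i in range(10_000))


profile = cProfile.Profile()
profile.enable()
work()
profile.disable()

# Sort by cumulative time and keep only the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profile, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```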

In [16]:
step = 1000
path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(path, "r", format="NETCDF4") as file:
    print(f"file: {path}")
    print(f"size: {os.path.getsize(path)/float(1<<30):,.0f} GB")
    
    print("variables: ")
    pprint(list(file.variables.keys()))
    
    print("dimensions: ")
    pprint(file.dimensions)
    sca_inputs = file['sca_inputs'][:]
    
len(sca_inputs)

file: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-1000.nc
size: 21 GB
variables: 
['sca_inputs',
 'col_inputs',
 'hl_inputs',
 'pressure_hl',
 'inter_inputs',
 'lat',
 'lon',
 'flux_dn_sw',
 'flux_up_sw',
 'flux_dn_lw',
 'flux_up_lw',
 'hr_sw',
 'hr_lw']
dimensions: 
{'col_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'col_variable', size = 27,
 'column': <class 'netCDF4._netCDF4.Dimension'>: name = 'column', size = 1085440,
 'half_level': <class 'netCDF4._netCDF4.Dimension'>: name = 'half_level', size = 138,
 'hl_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'hl_variable', size = 2,
 'inter_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'inter_variable', size = 1,
 'level': <class 'netCDF4._netCDF4.Dimension'>: name = 'level', size = 137,
 'level_interface': <class 'netCDF4._netCDF4.Dimension'>: name = 'level_interface', size = 136,
 'p_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'p_variable', size = 1,
 'sca_va

1085440
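
Reading `sca_inputs` with `[:]` pulls all 1,085,440 columns into memory at once. If that becomes a constraint, the column axis could be read in slices instead. The sketch below only computes the slice bounds; `chunk_bounds` and the chunk size of 100,000 are hypothetical choices, and the commented line shows where a bound would be applied through ordinary slice indexing on the variable.

```python
def chunk_bounds(n_items, chunk_size):
    """Yield (start, stop) pairs that cover range(n_items) without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)

n_columns = 1085440  # size of the 'column' dimension printed above
bounds = list(chunk_bounds(n_columns, 100_000))  # arbitrary chunk size

# Inside the `with netCDF4.Dataset(...)` block one would then read, for example:
#     sca_chunk = file['sca_inputs'][start:stop]
print(len(bounds), bounds[0], bounds[-1])
```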

In [17]:
import torch

data, slices = torch.load(os.path.join(processed_data_path, f'data-{step}.pt'))

In [19]:
data

Data(x=[149790720, 20], edge_index=[2, 297410560], edge_attr=[149790720, 17], y=[149790720, 4])
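
The sizes in this `Data` object can be cross-checked against the dimensions printed from the raw file. The arithmetic below is a sanity check, not a description of the preprocessing code. Two things are inferred from the numbers alone: that each column contributes one node per half level, and that the edges form a bidirectional chain between adjacent half levels.

```python
n_columns = 1085440      # 'column' dimension of data-1000.nc
n_half_levels = 138      # 'half_level' dimension
n_nodes = 149790720      # first dimension of data.x
n_edges = 297410560      # second dimension of data.edge_index

# One node per (column, half level) pair.
assert n_columns * n_half_levels == n_nodes

# Assumed layout: 137 links between consecutive half levels, stored in both directions.
edges_per_column = 2 * (n_half_levels - 1)
assert n_columns * edges_per_column == n_edges

print(n_nodes // n_columns, n_edges // n_columns)
```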

In [23]:
import torch.nn.functional as F

def broadcast_features(tensor):
    # (column, feature) -> (column, 138, feature): repeat the per-column
    # values on each of the 138 half levels
    t = torch.unsqueeze(tensor, -1)
    t = t.repeat((1, 1, 138))
    t = t.moveaxis(1, -1)
    return t

def pad_tensor(tensor):
    # Zero-pad one entry on each side of the second-to-last axis
    # (136 level interfaces -> 138 half levels)
    return F.pad(tensor, (0, 0, 1, 1, 0, 0))

raw_path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(raw_path, "r", format="NETCDF4") as file:
    sca_inputs = torch.tensor(file['sca_inputs'][:])
    col_inputs = torch.tensor(file['col_inputs'][:])  # loaded but not used below
    hl_inputs = torch.tensor(file['hl_inputs'][:])
    inter_inputs = torch.tensor(file['inter_inputs'][:])

    flux_dn_sw = torch.tensor(file['flux_dn_sw'][:])
    flux_up_sw = torch.tensor(file['flux_up_sw'][:])
    flux_dn_lw = torch.tensor(file['flux_dn_lw'][:])
    flux_up_lw = torch.tensor(file['flux_up_lw'][:])

inter_inputs_ = pad_tensor(inter_inputs)
sca_inputs_ = broadcast_features(sca_inputs)
print(len(inter_inputs_))

# Inputs: concatenate along the feature (last) axis
x = torch.cat([
    hl_inputs,
    inter_inputs_,
    sca_inputs_
], dim=-1)

# Targets: the four flux fields stacked along a new last axis
y = torch.cat([
    torch.unsqueeze(flux_dn_sw, -1),
    torch.unsqueeze(flux_up_sw, -1),
    torch.unsqueeze(flux_dn_lw, -1),
    torch.unsqueeze(flux_up_lw, -1),
], dim=-1)

1085440
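
The two helpers above are easiest to verify on toy tensors. The sketch below repeats them with the level count passed as a parameter (an assumption; the original hard-codes 138) and checks the resulting shapes. The sizes 3, 5, and 4 are made up for illustration.

```python
import torch
import torch.nn.functional as F

def broadcast_features(tensor, n_levels):
    # (columns, features) -> (columns, n_levels, features)
    t = torch.unsqueeze(tensor, -1)      # (columns, features, 1)
    t = t.repeat((1, 1, n_levels))       # (columns, features, n_levels)
    return t.moveaxis(1, -1)             # (columns, n_levels, features)

def pad_tensor(tensor):
    # Add one zero row before and after along the second-to-last axis.
    return F.pad(tensor, (0, 0, 1, 1, 0, 0))

n_columns, n_levels, n_features = 3, 5, 4   # toy sizes, not the real dimensions
sca = torch.arange(n_columns * n_features, dtype=torch.float32)
sca = sca.reshape(n_columns, n_features)
inter = torch.ones(n_columns, n_levels - 2, 1)

sca_b = broadcast_features(sca, n_levels)
inter_p = pad_tensor(inter)

print(sca_b.shape, inter_p.shape)
```

After these two steps both tensors share the leading `(columns, n_levels)` axes, which is what allows the `torch.cat(..., dim=-1)` call on the feature axis.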


In [None]:
The following additional packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-9 dpkg-dev fakeroot g++ g++-9 gcc gcc-9 gcc-9-base
  libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl libasan5 libatomic1 libbinutils
  libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libfakeroot
  libfile-fcntllock-perl libgcc-9-dev libgdbm-compat4 libgdbm6 libgomp1 libisl22 libitm1 liblocale-gettext-perl
  liblsan0 libmpc3 libmpfr6 libperl5.30 libquadmath0 libstdc++-9-dev libtsan0 libubsan1 linux-libc-dev make
  manpages manpages-dev netbase patch perl perl-modules-5.30 xz-utils
Suggested packages:
  binutils-doc cpp-doc gcc-9-locales debian-keyring g++-multilib g++-9-multilib gcc-9-doc gcc-multilib autoconf
  automake libtool flex bison gdb gcc-doc gcc-9-multilib glibc-doc git bzr gdbm-l10n libstdc++-9-doc make-doc
  man-browser ed diffutils-doc perl-doc libterm-readline-gnu-perl | libterm-readline-perl-perl libb-debug-perl
  liblocale-codes-perl
The following NEW packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu build-essential cpp cpp-9 dpkg-dev fakeroot g++ g++-9 gcc
  gcc-9 gcc-9-base libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl libasan5 libatomic1
  libbinutils libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libfakeroot
  libfile-fcntllock-perl libgcc-9-dev libgdbm-compat4 libgdbm6 libgomp1 libisl22 libitm1 liblocale-gettext-perl
  liblsan0 libmpc3 libmpfr6 libperl5.30 libquadmath0 libstdc++-9-dev libtsan0 libubsan1 linux-libc-dev make
  manpages manpages-dev netbase patch perl perl-modules-5.30 xz-utils