MIT License

Copyright (c) 2022 alxyok

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

****************************************************************

### Quick note on the project directory

The main root dir `~/3dcorrection` is structured as follows:
* `data/` contains raw and preprocessed data.
    * `raw/` is a symbolic link to a directory shared by all candidates, DO NOT TOUCH IT!
    * `processed/` will be created when the data is preprocessed and will contain all transformed data

In [6]:
import os

root_path = os.path.join('/', 'home', 'mluser', 'bootcamp')

data_path = os.path.join(root_path, 'data')
raw_data_path = os.path.join(data_path, 'raw')
processed_data_path = os.path.join(data_path, 'processed')
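
A minimal sketch of how the layout above could be checked before preprocessing, assuming only the directory names given in this section. The helper name `prepare_dirs` is hypothetical, and the demo runs against a throwaway temporary tree rather than the real `raw/` link.

```python
import os
import tempfile

def prepare_dirs(root_path):
    """Return (raw, processed) paths; create processed/ only, never modify raw/."""
    data_path = os.path.join(root_path, 'data')
    raw = os.path.join(data_path, 'raw')
    processed = os.path.join(data_path, 'processed')
    if not os.path.exists(raw):
        raise FileNotFoundError(f"expected raw data at {raw}")
    os.makedirs(processed, exist_ok=True)
    return raw, processed

# demo on a temporary tree that mimics the layout (raw/ is a symlink)
with tempfile.TemporaryDirectory() as tmp:
    shared = os.path.join(tmp, 'shared_raw')
    os.makedirs(shared)
    os.makedirs(os.path.join(tmp, 'root', 'data'))
    os.symlink(shared, os.path.join(tmp, 'root', 'data', 'raw'))

    raw, processed = prepare_dirs(os.path.join(tmp, 'root'))
    demo_is_link = os.path.islink(raw)
    demo_processed_exists = os.path.isdir(processed)
    print(demo_is_link, demo_processed_exists)
```

Only `processed/` is ever created; the function reads `raw/` but does not write to it.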

### The 3D Correction Use-Case

The European Centre for Medium-Range Weather Forecasts (ECMWF) has developed a series of models that provide some of the most accurate parametrization schemes currently available. Among those, SPARTACUS delivers **radiation** predictions over the globe. Because it is computationally demanding, a simpler, degraded model called TRIPLECLOUDS was developed to satisfy the constraints of the production environment.

As in most climate models, the globe is split into blocks to leverage hardware acceleration. The immediate consequence is that the spatial correlation between blocks is lost in exchange for a gain in parallelization.

The unit block is a column that expresses values along the vertical dimension over a set of levels. Each level is

In [3]:
import os

# proxy setup
os.environ['http_proxy'] = 'http://129.183.4.13:8080'
os.environ['https_proxy'] = 'http://129.183.4.13:8080'
os.environ['no_proxy'] = 'yoda,129.183.101.5,172.16.118.13,naboo0,naboo5,nwadmin,172.16.118.137,172.16.118.134'

Now let's load the raw data we'll be using throughout this hands-on. Take a look at the [source module](https://git.ecmwf.int/projects/MLFET/repos/maelstrom-radiation/browse/climetlab_maelstrom_radiation/radiation.py) for more information on the variables.

In [4]:
import climetlab as cml
import numpy as np 

# setting up the cache directory so there is no need to re-download the data
cml.settings.set("cache-directory", raw_data_path)

step = 250

# load the dataset (see the source module linked above for the available options)
cmlds = cml.load_dataset(
    # use-case
    'maelstrom-radiation', 
    
    # dataset name
    dataset='3dcorrection', 
    
    # apply feature engineering instead of returning raw inputs
    raw_inputs=False, 
    
    # sample over time
    timestep=list(range(0, 3501, step)), 
    
    # full output
    minimal_outputs=False,
    
    # sample geographically
    patch=list(range(0, 16, 1)),
    
    # units
    hr_units='K d-1',
)

CliMetLab cache: orphan found: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-1000.nc
CliMetLab cache: orphan found: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-250.nc
CliMetLab cache: orphan found: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/.ipynb_checkpoints
CliMetLab cache: trying to free 325.1 GiB
Deleting entry {
    "path": "/home/research/research/modeling/weather-forecast/tripleclouds/data/raw/url-bb00f35901fb668827703fce316b80ec05e427cf424cfdd77e953bedc2e74829.nc",
    "owner": "url",
    "args": {
        "url": "https://storage.ecmwf.europeanweather.cloud/MAELSTROM_AP3/rad4NN_inputs_2020010100_1125c1.nc",
        "parts": null
    },
    "creation_date": "2022-01-28 08:55:02.270383",
    "flags": 0,
    "owner_data": {
        "content-length": "298869871",
        "accept-ranges": "bytes",
        "last-modified": "Wed, 17 Nov 2021 09:25:29 GMT",
        "x-rgw-object-type": "Normal"

By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/licences/general/ If you do not agree with such terms, do not download the data. 


  0%|          | 0/240 [00:00<?, ?it/s]

  0%|          | 0/240 [00:00<?, ?it/s]
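
The progress bar total of 240 matches the number of (timestep, patch) pairs requested in the call above. Whether CliMetLab actually fetches one item per pair is an assumption here; the sketch below only checks the arithmetic.

```python
step = 250
timesteps = list(range(0, 3501, step))  # same sampling as in the load_dataset call
patches = list(range(0, 16, 1))

n_items = len(timesteps) * len(patches)
print(len(timesteps), len(patches), n_items)  # prints: 15 16 240
```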

In [5]:
type(cmlds)

climetlab_maelstrom_radiation.radiation.radiation

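The returned object is a CliMetLab dataset (an instance of the class shown above). CliMetLab datasets can generally be converted to an xarray `Dataset` with `to_xarray()`.

Let's check the content of the downloaded file: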

In [12]:
import netCDF4
from pprint import pprint
step = 1000
path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(path, "r", format="NETCDF4") as file:
    print(f"file: {path}")
    print(f"size: {os.path.getsize(path)/float(1<<30):,.0f} GB")
    
    print("variables: ")
    pprint(list(file.variables.keys()))
    
    print("dimensions: ")
    pprint(file.dimensions)
    
    print(f"dataset size: {file.dimensions['column'].size}")

file: /home/mluser/bootcamp/data/raw/data-1000.nc
size: 21 GB
variables: 
['sca_inputs',
 'col_inputs',
 'hl_inputs',
 'pressure_hl',
 'inter_inputs',
 'lat',
 'lon',
 'flux_dn_sw',
 'flux_up_sw',
 'flux_dn_lw',
 'flux_up_lw',
 'hr_sw',
 'hr_lw']
dimensions: 
{'col_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'col_variable', size = 27,
 'column': <class 'netCDF4._netCDF4.Dimension'>: name = 'column', size = 1085440,
 'half_level': <class 'netCDF4._netCDF4.Dimension'>: name = 'half_level', size = 138,
 'hl_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'hl_variable', size = 2,
 'inter_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'inter_variable', size = 1,
 'level': <class 'netCDF4._netCDF4.Dimension'>: name = 'level', size = 137,
 'level_interface': <class 'netCDF4._netCDF4.Dimension'>: name = 'level_interface', size = 136,
 'p_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'p_variable', size = 1,
 'sca_variable': <class 'netCDF4._netCDF4.Dimensi

In [None]:
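from metaflow import Parameter

# 2 ** 12 * 5 * 53 = 1,085,440, the 'column' size of data-1000.nc
DATASET_LEN = 2 ** 12 * 5 * 53

dataset_len = Parameter('dataset_len',
                        help="Dataset total length.",
                        default=DATASET_LEN)

# interpolate the integer, not the Parameter object, in the help text
num_shards = Parameter('num_shards',
                       help=f"Desired number of shards. Should be a divisor of {DATASET_LEN}.",
                       default=2 * 53)

timestep = Parameter('timestep',
                     help="Dataset timestep.",
                     default=1000)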
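
The default values can be checked with a few lines of arithmetic. The sketch below uses plain integers rather than Metaflow parameters, and `shard_bounds` is a hypothetical helper introduced only for illustration.

```python
dataset_len = 2 ** 12 * 5 * 53   # 1,085,440 columns
num_shards = 2 * 53              # 106 shards

# the number of shards must divide the dataset length exactly
assert dataset_len % num_shards == 0
shard_size = dataset_len // num_shards

def shard_bounds(index, shard_size):
    """Half-open [start, stop) range of rows covered by shard `index`."""
    start = index * shard_size
    return start, start + shard_size

first = shard_bounds(0, shard_size)
last = shard_bounds(num_shards - 1, shard_size)
print(shard_size, first, last)
```

With these defaults every shard holds 10,240 columns, and the last shard ends exactly at row 1,085,440.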

Metaflow: a well-built, easy-to-use library that is not bound to a specific technology (data format, framework).

1 file vs. n files

Advantages of n files:
* allows quick iteration
* heavily parallelizable
* works with large (10 TB+) datasets
* does not lock memory for training
* works when memory is limited

Drawback of n files:
* requires a preprocessing step to shard the file
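
To illustrate the "n files" points above, here is a toy sketch that shards a small list into several files and then streams them back one at a time, so that only one shard is held in memory. It uses `pickle` and a temporary directory purely for illustration; the file naming and the helper names `write_shards` and `iter_shards` are assumptions, not the format used in this project.

```python
import os
import pickle
import tempfile

def write_shards(samples, num_shards, out_dir):
    """Split `samples` into `num_shards` equal, contiguous shard files."""
    assert len(samples) % num_shards == 0, "num_shards must divide the dataset length"
    size = len(samples) // num_shards
    paths = []
    for i in range(num_shards):
        path = os.path.join(out_dir, f'shard-{i}.pkl')
        with open(path, 'wb') as f:
            pickle.dump(samples[i * size:(i + 1) * size], f)
        paths.append(path)
    return paths

def iter_shards(paths):
    """Yield shards one by one; only the current shard is loaded."""
    for path in paths:
        with open(path, 'rb') as f:
            yield pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    paths = write_shards(list(range(12)), num_shards=4, out_dir=tmp)
    shards = list(iter_shards(paths))  # materialized here only for the demo
    print(shards)
```

Because each shard is an independent file, several workers could each process a different subset of `paths` in parallel.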

In [8]:
import cProfile

profile = cProfile.Profile()
profile.enable()
path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(path, "r", format="NETCDF4") as file:
    print(f"file: {path}")
    print(f"size: {os.path.getsize(path)/float(1<<30):,.0f} GB")
    
    print("variables: ")
    pprint(list(file.variables.keys()))
    
    print("dimensions: ")
    pprint(file.dimensions)
profile.disable()

file: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-250.nc
size: 77 GB
variables: 
['sca_inputs',
 'col_inputs',
 'hl_inputs',
 'pressure_hl',
 'inter_inputs',
 'lat',
 'lon',
 'flux_dn_sw',
 'flux_up_sw',
 'flux_dn_lw',
 'flux_up_lw',
 'hr_sw',
 'hr_lw']
dimensions: 
{'col_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'col_variable', size = 27,
 'column': <class 'netCDF4._netCDF4.Dimension'>: name = 'column', size = 4070400,
 'half_level': <class 'netCDF4._netCDF4.Dimension'>: name = 'half_level', size = 138,
 'hl_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'hl_variable', size = 2,
 'inter_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'inter_variable', size = 1,
 'level': <class 'netCDF4._netCDF4.Dimension'>: name = 'level', size = 137,
 'level_interface': <class 'netCDF4._netCDF4.Dimension'>: name = 'level_interface', size = 136,
 'p_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'p_variable', size = 1,
 'sca_var

In [13]:
# 80% of the CPU cores this process is allowed to run on
int(len(os.sched_getaffinity(0)) * .8)

76
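
The expression above presumably sizes a worker pool at 80% of the available cores; that intent is an assumption. Since `os.sched_getaffinity` only exists on some platforms (notably Linux), a slightly more portable sketch could look like this, where `usable_cpus` is a hypothetical helper:

```python
import os

def usable_cpus(fraction=0.8):
    """Number of workers as a fraction of the CPUs available to this process."""
    try:
        available = len(os.sched_getaffinity(0))  # respects CPU pinning where supported
    except AttributeError:
        available = os.cpu_count() or 1           # portable fallback
    return max(1, int(available * fraction))

n_workers = usable_cpus()
print(n_workers)
```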

In [9]:
profile.print_stats()

         1738 function calls (1685 primitive calls) in 0.012 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 2611381482.py:15(<module>)
        1    0.000    0.000    0.000    0.000 2611381482.py:5(<module>)
        1    0.009    0.009    0.011    0.011 2611381482.py:6(<module>)
        3    0.000    0.000    0.000    0.000 codeop.py:142(__call__)
        6    0.000    0.000    0.000    0.000 compilerop.py:166(extra_flags)
        3    0.000    0.000    0.000    0.000 contextlib.py:108(__enter__)
        3    0.000    0.000    0.000    0.000 contextlib.py:117(__exit__)
        3    0.000    0.000    0.000    0.000 contextlib.py:238(helper)
        3    0.000    0.000    0.000    0.000 contextlib.py:82(__init__)
        1    0.000    0.000    0.000    0.000 genericpath.py:48(getsize)
        3    0.000    0.000    0.000    0.000 hooks.py:103(__call__)
        3    0.000    0.000 
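
The listing above is ordered by name, which makes the expensive calls hard to spot. As a sketch, the standard `pstats` module can sort a profile by cumulative time and keep only the top entries. The `work` function below is a stand-in for the file-opening code and is not part of the notebook.

```python
import cProfile
import io
import pstats

def work():
    """Stand-in workload for the demo."""
    return sum(i * i for i in range(10_000))

demo_profile = cProfile.Profile()
demo_profile.enable()
result = work()
demo_profile.disable()

# write the report to a string buffer, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(demo_profile, stream=buffer)
stats.sort_stats('cumulative').print_stats(5)  # keep the 5 most expensive entries
report = buffer.getvalue()
print(report)
```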

In [16]:
step = 1000
path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(path, "r", format="NETCDF4") as file:
    print(f"file: {path}")
    print(f"size: {os.path.getsize(path)/float(1<<30):,.0f} GB")
    
    print("variables: ")
    pprint(list(file.variables.keys()))
    
    print("dimensions: ")
    pprint(file.dimensions)
    sca_inputs = file['sca_inputs'][:]
    
len(sca_inputs)

file: /home/research/research/modeling/weather_forecast/3dcorrection/data/raw/data-1000.nc
size: 21 GB
variables: 
['sca_inputs',
 'col_inputs',
 'hl_inputs',
 'pressure_hl',
 'inter_inputs',
 'lat',
 'lon',
 'flux_dn_sw',
 'flux_up_sw',
 'flux_dn_lw',
 'flux_up_lw',
 'hr_sw',
 'hr_lw']
dimensions: 
{'col_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'col_variable', size = 27,
 'column': <class 'netCDF4._netCDF4.Dimension'>: name = 'column', size = 1085440,
 'half_level': <class 'netCDF4._netCDF4.Dimension'>: name = 'half_level', size = 138,
 'hl_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'hl_variable', size = 2,
 'inter_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'inter_variable', size = 1,
 'level': <class 'netCDF4._netCDF4.Dimension'>: name = 'level', size = 137,
 'level_interface': <class 'netCDF4._netCDF4.Dimension'>: name = 'level_interface', size = 136,
 'p_variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'p_variable', size = 1,
 'sca_va

1085440

In [17]:
import torch

data, slices = torch.load(os.path.join(processed_data_path, f'data-{step}.pt'))

In [19]:
data

Data(x=[149790720, 20], edge_index=[2, 297410560], edge_attr=[149790720, 17], y=[149790720, 4])
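
The shapes of `x` and `edge_index` printed above are consistent with each column being stored as a chain graph whose nodes are the 138 half-levels, with edges in both directions between vertical neighbours. That reading is inferred from the numbers only. The sketch below checks the arithmetic and builds the edge list for one column in plain Python (`chain_edges` is a hypothetical helper).

```python
n_columns = 1085440  # 'column' dimension of data-1000.nc
n_nodes = 138        # 'half_level' dimension

def chain_edges(n):
    """Directed (source, target) edges for a bidirectional chain of n nodes."""
    forward = [(i, i + 1) for i in range(n - 1)]
    backward = [(j, i) for i, j in forward]
    return forward + backward

edges = chain_edges(n_nodes)
total_nodes = n_columns * n_nodes
total_edges = n_columns * len(edges)
print(total_nodes, total_edges)  # prints: 149790720 297410560
```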

In [23]:
import torch.nn.functional as F

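def broadcast_features(tensor):
    """Repeat per-column scalars along the vertical axis: (N, F) -> (N, 138, F)."""
    t = torch.unsqueeze(tensor, -1)   # (N, F, 1)
    t = t.repeat((1, 1, 138))         # (N, F, 138)
    t = t.moveaxis(1, -1)             # (N, 138, F)
    return t

def pad_tensor(tensor):
    """Zero-pad one entry on each side of the vertical axis: (N, 136, F) -> (N, 138, F)."""
    return F.pad(tensor, (0, 0, 1, 1, 0, 0))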
        
raw_path = os.path.join(raw_data_path, f'data-{step}.nc')
with netCDF4.Dataset(raw_path, "r", format="NETCDF4") as file:
    sca_inputs = torch.tensor(file['sca_inputs'][:])
    col_inputs = torch.tensor(file['col_inputs'][:])
    hl_inputs = torch.tensor(file['hl_inputs'][:])
    inter_inputs = torch.tensor(file['inter_inputs'][:])

    flux_dn_sw = torch.tensor(file['flux_dn_sw'][:])
    flux_up_sw = torch.tensor(file['flux_up_sw'][:])
    flux_dn_lw = torch.tensor(file['flux_dn_lw'][:])
    flux_up_lw = torch.tensor(file['flux_up_lw'][:])

inter_inputs_ = pad_tensor(inter_inputs)
sca_inputs_ = broadcast_features(sca_inputs)
print(len(inter_inputs_))

x = torch.cat([
    hl_inputs,
    inter_inputs_,
    sca_inputs_
], dim=-1)

y = torch.cat([
    torch.unsqueeze(flux_dn_sw, -1),
    torch.unsqueeze(flux_up_sw, -1),
    torch.unsqueeze(flux_dn_lw, -1),
    torch.unsqueeze(flux_up_lw, -1),
], dim=-1)

1085440
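
To make the shape bookkeeping in the cell above easier to follow, the sketch below repeats the same broadcast and padding steps on a tiny random batch. It uses NumPy instead of PyTorch so that it runs without the dataset. The sizes 138, 136, 2 and 1 are taken from the file dimensions printed earlier; the 17 scalar variables are an inference from the 20 features of `data.x` (20 - 2 - 1), since the printed dimension list is cut off.

```python
import numpy as np

n_col, n_hl, n_inter, n_sca = 4, 138, 136, 17
rng = np.random.default_rng(0)

sca = rng.random((n_col, n_sca))         # one value per column
hl = rng.random((n_col, n_hl, 2))        # defined on half-levels
inter = rng.random((n_col, n_inter, 1))  # defined on level interfaces

# repeat each scalar feature along the vertical axis: (col, sca) -> (col, hl, sca)
sca_b = np.repeat(sca[:, None, :], n_hl, axis=1)

# zero-pad one entry at each end of the vertical axis: 136 -> 138
inter_p = np.pad(inter, ((0, 0), (1, 1), (0, 0)))

# all three blocks now share the (col, hl) axes and can be concatenated
x = np.concatenate([hl, inter_p, sca_b], axis=-1)
print(sca_b.shape, inter_p.shape, x.shape)  # (4, 138, 17) (4, 138, 1) (4, 138, 20)
```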


In [None]:
The following additional packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-9 dpkg-dev fakeroot g++ g++-9 gcc gcc-9 gcc-9-base
  libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl libasan5 libatomic1 libbinutils
  libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libfakeroot
  libfile-fcntllock-perl libgcc-9-dev libgdbm-compat4 libgdbm6 libgomp1 libisl22 libitm1 liblocale-gettext-perl
  liblsan0 libmpc3 libmpfr6 libperl5.30 libquadmath0 libstdc++-9-dev libtsan0 libubsan1 linux-libc-dev make
  manpages manpages-dev netbase patch perl perl-modules-5.30 xz-utils
Suggested packages:
  binutils-doc cpp-doc gcc-9-locales debian-keyring g++-multilib g++-9-multilib gcc-9-doc gcc-multilib autoconf
  automake libtool flex bison gdb gcc-doc gcc-9-multilib glibc-doc git bzr gdbm-l10n libstdc++-9-doc make-doc
  man-browser ed diffutils-doc perl-doc libterm-readline-gnu-perl | libterm-readline-perl-perl libb-debug-perl
  liblocale-codes-perl
The following NEW packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu build-essential cpp cpp-9 dpkg-dev fakeroot g++ g++-9 gcc
  gcc-9 gcc-9-base libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl libasan5 libatomic1
  libbinutils libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libfakeroot
  libfile-fcntllock-perl libgcc-9-dev libgdbm-compat4 libgdbm6 libgomp1 libisl22 libitm1 liblocale-gettext-perl
  liblsan0 libmpc3 libmpfr6 libperl5.30 libquadmath0 libstdc++-9-dev libtsan0 libubsan1 linux-libc-dev make
  manpages manpages-dev netbase patch perl perl-modules-5.30 xz-utils