# Computing basic Stats with the CPU

Dask is a parallel computing library that can process arrays larger than memory by splitting them into chunks.

We're only interested in [NumPy](https://numpy.org/) (with `np.ndarray`) for this use case, but Dask is also compatible with [Pandas](https://pandas.pydata.org/) and its `pd.DataFrame` structure, with no extra tooling, so the approach applies to both.

Using the concept of chunks, we can perform simple operations such as basic stats, or more complex ones such as *Fast Fourier Transforms*. The full list of supported operations is available in the [Dask Reference API](https://docs.dask.org/en/latest/array-api.html).

In [1]:
from kosmoss import CONFIG, DATA_PATH, PROCESSED_DATA_PATH
from kosmoss.utils import timing

In [2]:
import dask
import dask.array as da
import numpy as np
import os.path as osp
import torch
from typing import Dict, List, Text

step = CONFIG['timestep']
num_workers = CONFIG['num_workers']
features_path = osp.join(PROCESSED_DATA_PATH, f'features-{step}')

Start by loading the data lazily with Dask.

In [3]:
x = da.from_npy_stack(osp.join(features_path, 'x'))
y = da.from_npy_stack(osp.join(features_path, 'y'))
edge = da.from_npy_stack(osp.join(features_path, 'edge'))

Let's recap the data chunking.

In [4]:
x

| | Array | Chunk |
|---|---|---|
| Bytes | 11.16 GiB | 1.68 MiB |
| Shape | (1085440, 138, 20) | (160, 138, 20) |
| Count | 6784 Tasks | 6784 Chunks |
| Type | float32 | numpy.ndarray |


## Loading in a single thread with pure NumPy

The `.npy` stacks are pure NumPy files, so we could load them with NumPy directly.

Still, this method has several drawbacks:

* Data loading is slow and single-threaded
* The entire file content must be loaded into memory at once
* The amount of available CPU memory can become a limiting factor

In [5]:
@timing
def compute_stats_mono(arrays: List[da.Array]) -> Dict[Text, torch.Tensor]:
    
    # Simulate pure NumPy
    num_workers = 1
    
    stats = {}
    for a in arrays:
        
        # Load data into memory
        a_ = a.compute(num_workers=num_workers)
        
        # Compute mean and standard-deviation for array
        a_mean = np.mean(a_, axis=0)
        a_std = np.std(a_, axis=0)
        
        name = a.name.split("/")[-1]
        stats.update({
            f'{name}_mean': torch.tensor(a_mean),
            f'{name}_std': torch.tensor(a_std)
        })
        
    return stats

Open `htop` in a side terminal and watch the memory grow.

In [6]:
stats = compute_stats_mono([x, y, edge])

OSError: [Errno 24] Too many open files
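
Errno 24 means the process hit its limit on open file descriptors. A plausible cause here (an assumption, not verified against this dataset) is that every chunk of an `.npy` stack is backed by its own file, so loading thousands of chunks at once exceeds the default limit. The sketch below shows one way to inspect and raise the soft limit from Python on Unix-like systems; `raise_open_file_limit` and its `target` value are hypothetical names chosen for illustration.

```python
import resource


def raise_open_file_limit(target: int = 8192) -> int:
    """Raise the soft open-file limit toward `target` and return the soft limit in effect.

    Unix only. The soft limit can never exceed the hard limit, so the
    target is clamped to it when the hard limit is finite.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Clamp the request to the hard limit, unless the hard limit is unlimited.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Nothing to do if the current soft limit is already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft

    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


print(raise_open_file_limit(8192))
```

With 6784 chunks per array, the limit would need to exceed the number of chunk files held open at the same time, plus some headroom for the interpreter itself. The shell equivalent is `ulimit -n`.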

## Multithreaded loading with Dask

Again, Dask evaluates most operations lazily. It builds a computational graph, a *Directed Acyclic Graph* (DAG), and executes it only when a result is actually requested, applying graph optimizations along the way when it can.

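Moreover, when `compute()` executes the DAG, reductions such as the mean and the standard deviation are evaluated chunk by chunk: Dask relies on mathematical identities to compute partial results on each chunk and then combine them, which lets it distribute the computation when possible.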
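
To make the idea concrete, here is a minimal pure-Python sketch of a chunked mean and standard deviation. It is not Dask's actual implementation, only an illustration of the principle: each chunk is reduced to a count, a sum and a sum of squares, and those partial results are enough to recover the global statistics. The function names are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple

Partial = Tuple[int, float, float]


def chunk_partials(chunk: Sequence[float]) -> Partial:
    """Reduce one chunk to (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(v * v for v in chunk)


def combine_mean_std(partials: Iterable[Partial]) -> Tuple[float, float]:
    """Combine per-chunk partials into the global mean and population std."""
    n, total, total_sq = 0, 0.0, 0.0
    for count, s, sq in partials:
        n += count
        total += s
        total_sq += sq
    mean = total / n
    # Var(X) = E[X^2] - E[X]^2, clamped at 0 to absorb rounding errors.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


data = [float(v) for v in range(10)]
# Split the data into chunks of 4 elements; the last chunk is shorter.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

# Each chunk could be reduced by a different worker before the final combine.
mean, std = combine_mean_std(chunk_partials(c) for c in chunks)
print(mean, std)
```

The result is the population standard deviation, which is also the default for `np.std` and `da.std`. The sum-of-squares identity is the simplest one to show, but it can lose precision on data with a large mean relative to its spread.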

In [None]:
@timing
def compute_stats_multi(arrays: List[da.Array]) -> Dict[Text, torch.Tensor]:
    
    # Scaling computation by increasing default number of workers
    num_workers = 16
    
    stats = {}
    for a in arrays:
        
        # Lazy evaluation
        a_mean = da.mean(a, axis=0)
        a_std = da.std(a, axis=0)
        
        # Compute mean and standard-deviation for current array
        m = a_mean.compute(num_workers=num_workers)
        s = a_std.compute(num_workers=num_workers)
        
        name = a.name.split("/")[-1]
        stats.update({
            f'{name}_mean': torch.tensor(m),
            f'{name}_std': torch.tensor(s)
        })
        
    return stats

In [None]:
stats = compute_stats_multi([x, y, edge])

You should observe a substantial reduction in computation time compared to the single-threaded version.
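
`compute_stats_multi` calls `compute()` once for the mean and once for the standard deviation, so the chunks of each array are loaded twice. A possible variation, sketched here on a small synthetic array rather than on the real features, is to pass both lazy results to `dask.compute` so that they are evaluated together and can share the work they have in common. The array shape, chunk size and worker count below are arbitrary values picked for illustration.

```python
import dask
import dask.array as da
import numpy as np

# Small synthetic array standing in for the real features (hypothetical data).
rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 4)).astype(np.float32)
a = da.from_array(values, chunks=(100, 4))

# Still lazy: these two lines only build task graphs.
a_mean = da.mean(a, axis=0)
a_std = da.std(a, axis=0)

# A single call evaluates both graphs together on the threaded scheduler.
m, s = dask.compute(a_mean, a_std, num_workers=4)

print(m.shape, s.shape)
```

Whether this actually improves on two separate calls depends on the scheduler and on how expensive loading a chunk is, so it is worth timing both variants on the real data.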

## Saving the Stats for later use

We will use this data to perform on-the-fly input normalization within the model itself with a Normalization layer.

`torch.save` uses the Python Pickle format to save data. You can save anything picklable, which is rarely a limitation since most pure-Python objects are pickle-serializable.
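
As a rough sketch of how such a file might be consumed later, the example below saves a small dictionary of tensors, loads it back and uses it to normalize a batch. The tensor values, the temporary file and the `normalize` helper are all made up for illustration; the actual Normalization layer of the model may differ.

```python
import os.path as osp
import tempfile

import torch

# Hypothetical stats, using the same key naming as the dictionaries built above.
stats = {
    "x_mean": torch.tensor([1.0, 2.0]),
    "x_std": torch.tensor([0.5, 2.0]),
}

# Round trip through torch.save / torch.load in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = osp.join(tmp_dir, "stats.pt")
    torch.save(stats, path)
    loaded = torch.load(path)


def normalize(x: torch.Tensor, mean: torch.Tensor, std: torch.Tensor,
              eps: float = 1e-8) -> torch.Tensor:
    """Standardize x feature-wise; eps avoids a division by zero."""
    return (x - mean) / (std + eps)


batch = torch.tensor([[1.0, 2.0], [1.5, 4.0]])
normalized = normalize(batch, loaded["x_mean"], loaded["x_std"])
print(normalized)
```

The mean and standard deviation broadcast over the leading batch dimension, which is why the stats are computed along `axis=0`.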

In [7]:
stats_path = osp.join(DATA_PATH, f"stats-features-{step}.pt")
torch.save(stats, stats_path)

NameError: name 'stats' is not defined

## Same for the Flattened dataset

We'll also need the stats for the flattened data. There is no need to compare computation times here, so we simply compute the stats and save them for later use.

In [8]:
flattened_path = osp.join(PROCESSED_DATA_PATH, f'flattened-{step}')

x = da.from_npy_stack(osp.join(flattened_path, 'x'))
y = da.from_npy_stack(osp.join(flattened_path, 'y'))

stats = compute_stats_multi([x, y])

stats_path = osp.join(DATA_PATH, f"stats-flattened-{step}.pt")
torch.save(stats, stats_path)

134908.61 ms
