### Tutorial 2 - How to extend DASF Datasets

In this tutorial, we show how you can extend DASF datasets so that they are loaded dynamically, independent of the host architecture.

For this scenario we will use the DASF `DatasetArray` class and show how to create such a dataset from a simple NPY file.
An array dataset is a dataset stored in a NumPy-like format: a single file that contains a serialized NumPy array. Note that if you are using a GPU, the data will be loaded as a CuPy array; otherwise it will be loaded as a NumPy array.

In [1]:
# Common imports
import numpy as np

# Dasf imports
from dasf.datasets import DatasetArray

To start, the first step is to create a NumPy array and save it to a NumPy file (`.npy`). We will create a simple cube of random data with shape (20, 20, 20) and save it to a file called `data.npy`.

In [5]:
# Here we generate a random array and save it in a numpy-like file
data = np.random.random((20, 20, 20))
np.save("data.npy", data)
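
As a quick sanity check before handing the file to DASF, the saved array can be read back with plain NumPy. The following is a minimal sketch, not part of the DASF API; it writes to a temporary directory (an assumption made to keep the example self-contained) instead of the `data.npy` file used in this tutorial.

```python
import os
import tempfile

import numpy as np

# Generate the same kind of random cube used in the tutorial.
data = np.random.random((20, 20, 20))

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.npy")
    np.save(path, data)
    # Read the file back into memory as a NumPy array.
    restored = np.load(path)

print(restored.shape, restored.dtype)
print(np.array_equal(restored, data))
```

If the round trip succeeds, the restored array has the same shape and values as the original.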

Once the array is saved in a NumPy-like file format, we can load it using the `DatasetArray` class. This class loads a NumPy file and exposes its contents as an array. As noted above, the array is a CuPy array on a GPU and a NumPy array otherwise.

The `name` parameter is optional and specifies the symbolic dataset name. This name will be used to identify the dataset in the DASF framework. The `root` parameter specifies the location of the file to be loaded.

In [6]:
dataset = DatasetArray(name="My Saved NPY", root="data.npy")
dataset

Dataset: name=My Saved NPY, root=, loaded=False, data shape=None

DASF datasets are lazy, which means that the data is not loaded immediately. The `load` function loads the dataset and returns the data. Again, the result is a CuPy array on a GPU and a NumPy array otherwise.

In [4]:
dataset.load()

Dataset: name=My Saved NPY, root=, loaded=True, data=(20, 20, 20)

Once it is loaded, we can slice the dataset and check the type of the data. If your machine has GPUs, the data is a CuPy array; otherwise it is a NumPy array.

In [5]:
type(dataset[:])

cupy.ndarray

What should I do if I'm using a GPU but I want to load a NumPy array?

All datasets have a protected load wrapper for each platform. The code detects which platform you are on and binds `load` to the corresponding protected method.

In other words, when you call `load` in a GPU environment, as we are doing here, you are in fact executing the protected method `_load_gpu`.

So, to load NumPy arrays, all you need to do is call `_load_cpu` directly.

In [6]:
dataset._load_cpu()

type(dataset[:2, :2, :2])

numpy.ndarray
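
The dispatch described above can be pictured with a small pure-Python sketch. This is not DASF's implementation: `SketchDataset`, its `use_gpu` flag, and the `backend` attribute are hypothetical names introduced only to illustrate how a public `load` can forward to a platform-specific protected method, and how calling the protected method directly overrides that choice.

```python
class SketchDataset:
    """Hypothetical stand-in for a dataset with per-platform loaders."""

    def __init__(self, use_gpu):
        # A real framework would detect the platform automatically; here it
        # is passed in explicitly so the sketch runs on any machine.
        self.use_gpu = use_gpu
        self.backend = None

    def _load_cpu(self):
        # Stand-in for loading the file as a NumPy array.
        self.backend = "numpy"
        return self

    def _load_gpu(self):
        # Stand-in for loading the file as a CuPy array.
        self.backend = "cupy"
        return self

    def load(self):
        # Pick the protected method that matches the platform.
        loader = self._load_gpu if self.use_gpu else self._load_cpu
        return loader()


gpu_dataset = SketchDataset(use_gpu=True)

gpu_dataset.load()
print(gpu_dataset.backend)  # cupy

# Calling the protected method directly bypasses the platform choice.
gpu_dataset._load_cpu()
print(gpu_dataset.backend)  # numpy
```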

If you need to handle a Dask array in a clustered environment, you can use the protected lazy methods named `_lazy_*`.

For datasets, the counterparts of `load` are `_lazy_load_cpu` and `_lazy_load_gpu`. Both return a Dask array, but with different metadata.

Let's see what it looks like.

In [7]:
dataset._lazy_load_cpu()

type(dataset[:2, :2, :2])

dask.array.core.Array

Now let's inspect the internal metadata array (`_meta`) of this Dask dataset.

In [8]:
type(dataset[:2, :2, :2]._meta)

numpy.ndarray
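
The same behaviour can be reproduced with Dask alone, which may help to see what `_meta` represents. The sketch below does not use DASF; it wraps an in-memory NumPy cube with `dask.array.from_array`, and the chunk size of `(10, 10, 10)` is an arbitrary assumption. In Dask, `_meta` is a small empty array whose type matches the type of the underlying chunks, so it tells you which kind of array a computation will produce without computing anything.

```python
import dask.array as da
import numpy as np

# Wrap an in-memory NumPy cube as a lazy Dask array.
cube = np.random.random((20, 20, 20))
lazy = da.from_array(cube, chunks=(10, 10, 10))

# Slicing stays lazy: the result is still a Dask array.
subset = lazy[:2, :2, :2]
print(type(subset).__name__)        # Array

# The metadata array reveals the type of the underlying chunks.
print(type(subset._meta).__name__)  # ndarray

# Only compute() materializes the values.
result = subset.compute()
print(type(result).__name__, result.shape)
```

With GPU-backed chunks, the metadata array would be a CuPy array instead, which is the difference between the two lazy loaders mentioned above.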