# lardon playground

**lardon** is a front-end for dynamically importing data from large files. It relies on the `numpy.memmap` interface to index large arrays without loading the entire corresponding file into memory. It is designed to be compatible with every format: callback functions or parser environments define how a given file is converted into a numpy array. It also provides machine learning-oriented features such as random indexing, joint handling of data and metadata, and scattering.
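
To make the memory-mapping idea concrete, here is a minimal sketch that uses plain numpy only. It does not call `lardon` and is not its implementation; the file name `big_array.npy` and the array size are assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np

# Write an array to disk in .npy format (file name is illustrative).
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "big_array.npy")
np.save(file_path, np.arange(1000 * 64, dtype=np.float32).reshape(1000, 64))

# mmap_mode="r" maps the file instead of reading it entirely:
# the returned object is a numpy.memmap that fetches data on demand.
mapped = np.load(file_path, mmap_mode="r")

# Slicing the memmap and converting to an array copies only this small
# window into memory, not the whole file.
chunk = np.asarray(mapped[10:12, :4])
print(type(mapped).__name__, mapped.shape, chunk.shape)
```

Indexing before converting to an in-memory array is what keeps the memory footprint proportional to the slice rather than to the file.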

## Simple parsing / loading

Here we briefly explain how to parse and load data. For parsing, `lardon` offers two different features:
- the `compile` function, which lists all the files to parse according to a valid extension or a valid regexp, and parses them into the destination folder
- the `LardonParser` environment, where you can register data and files progressively: either by generating the data yourself and registering it with the `register` function, or by iterating over the files detected by the parser.

### Using `compile`

In [9]:
import os, numpy as np

# first, generate some dumb data
data_path = "tests/dumb_dataset"
original_data_path = "tests/dumb_dataset/data"
n_examples = 10
data_shape = (5,7,13)
if not os.path.isdir(data_path):
    os.makedirs(data_path)
if not os.path.isdir(f"{original_data_path}"):
    os.makedirs(f"{original_data_path}")
for n in range(n_examples): 
    data = np.reshape(np.arange(np.prod(data_shape)), data_shape)
    np.save(f"{original_data_path}/dumb_{n}.npy", data)

In [10]:
import random
from lardon import compile

def dumb_callback(filepath: str):
    data = np.load(filepath)
    metadata = {'label': random.randrange(10)}
    # a callback returns the original data, plus optional metadata as a dictionary.
    # if you don't want any metadata, return an empty dictionary with dict()
    return data, metadata

parsed_path = data_path + "/parsing"
offline_list = compile(original_data_path, parsed_path, valid_exts = ['.npy'], callback=dumb_callback)

# the lardon package provides an `OfflineDataList`, that contains a list of elements called `OfflineEntry`.
# OfflineEntry imports the corresponding data when called.
offline_entry = offline_list.entries[0]
data = offline_entry()
print(data.shape)

# Indexing the OfflineDataList dynamically calls the OfflineEntry with the targeted indices, such that
# only the relevant parts of the files are loaded using memmap.
print(offline_list[0].shape, offline_list[0, 1:2].shape, offline_list[0, 1:2, 3:5].shape)


exporting files...: 100%|██████████| 10/10 [00:00<00:00, 14.04it/s]


(5, 7, 13)
(5, 7, 13) (1, 7, 13) (1, 2, 13)
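
The callback contract used above (return the array, plus a metadata dictionary) is what lets other formats be plugged in. As a sketch, a callback for comma-separated text files could look like the following; `csv_callback`, the `n_rows` and `source` metadata keys, and the example file are hypothetical and not part of `lardon`.

```python
import os
import tempfile

import numpy as np


def csv_callback(filepath: str):
    """Hypothetical callback: load a comma-separated text file as a 2-D array."""
    data = np.loadtxt(filepath, delimiter=",", ndmin=2)
    # Metadata is a plain dictionary; these keys are illustrative only.
    metadata = {"n_rows": data.shape[0], "source": os.path.basename(filepath)}
    return data, metadata


# Quick check of the callback on a temporary file.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "example.csv")
with open(csv_path, "w") as f:
    f.write("1,2,3\n4,5,6\n")

data, metadata = csv_callback(csv_path)
print(data.shape, metadata)
```

Following the `compile` call shown above, such a function would presumably be passed as `callback=csv_callback` together with `valid_exts=['.csv']`.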


### Using `LardonParser`

In [11]:
from lardon import LardonParser

with LardonParser(original_data_path, parsed_path, force=True, valid_exts=[".npy"], callback=dumb_callback) as parser:
    for f in parser.files:
        data = parser.callback(f)
        metadata = {'label': random.randrange(10)}
        parser.register(data, metadata, filename=f)




In [12]:
from lardon import LardonParser
import numpy as np
# with LardonParser, you can also parse data generated on-the-fly. In this case,
# pass None as original_data_path (parser.files will then be empty.)

generated_path = "tests/generated_dataset"
with LardonParser(None, generated_path, force=True) as parser:
    for freq in range(100,1000,100):
        data = np.sin( 2 * np.pi * freq * np.linspace(0., 1., 44100))
        parser.register(data, {'freq': freq}, filename=f"sin_{freq}")

### Loading data

In [20]:
from lardon import parse_folder

offline_data_list = parse_folder(parsed_path)
x, y = offline_data_list[0]
print(x.shape)

# you can also filter the loaded files with the `files` keyword argument.
files = ["dumb_3.npy", "dumb_4.npy"]
offline_data_list = parse_folder(parsed_path, files=files)
print(offline_data_list)


(5, 7, 13)
[
	OfflineEntry(selector: Selector(), file: tests/dumb_dataset/parsing/dumb_3.npy),
	OfflineEntry(selector: Selector(), file: tests/dumb_dataset/parsing/dumb_4.npy),
]


## Additional features

### Data batching

An `OfflineDataList` can hold entries of different sizes, so a call like `offline_data_list[:2]` can be ambiguous when the entries cannot be stacked into a single array. The behavior of `OfflineDataList` in that case can be set at loading time with the `batch_mode` argument, as follows:

In [29]:
import random, numpy as np
from lardon import parse_folder, LardonParser

path = "tests/various_size_dataset"
# generate fake dataset with various sizes

with LardonParser(None, path, force=True) as parser:
    for i in range(10):
        data = np.random.rand(random.randrange(1, 4), random.randrange(1, 4), 9)
        parser.register(data, {})

offline_list = parse_folder(path)
# by default, if data cannot be stacked, offline_list will return a list
print("without pad: ")
print([x.shape for x in offline_list[:4]])

# the batch_mode keyword sets how the offline list returns its
# values when the shapes are inconsistent.
offline_list = parse_folder(path, batch_mode="pad",
                            batch_args={'mode':'constant', 'constant_values':2})
batched_import = offline_list[:]
print("with pad : ")
print(batched_import.shape)

# batch_mode can also be set to "crop"; in this case every element is cropped
# to the shape of the smallest element
offline_list = parse_folder(path, batch_mode="crop",
                            batch_args={'mode':'constant', 'constant_values':2})
batched_import = offline_list[:]
print("with crop : ")
print(batched_import.shape)


without pad: 
[(1, 2, 9), (3, 2, 9), (3, 3, 9), (2, 3, 9)]
with pad : 
(10, 3, 3, 9)
with crop : 
(10, 1, 1, 9)
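
The two batching strategies can be illustrated with plain numpy. This is a sketch of the idea and not `lardon`'s implementation: the helpers `pad_batch` and `crop_batch` are hypothetical, and padding at the end of each axis is an assumption.

```python
import numpy as np


def pad_batch(arrays, constant_values=0):
    """Pad each array at the end of every axis up to the largest shape, then stack."""
    target = np.max([a.shape for a in arrays], axis=0)
    padded = []
    for a in arrays:
        pad_width = [(0, int(t - s)) for s, t in zip(a.shape, target)]
        padded.append(
            np.pad(a, pad_width, mode="constant", constant_values=constant_values)
        )
    return np.stack(padded)


def crop_batch(arrays):
    """Crop each array to the smallest size found along every axis, then stack."""
    target = np.min([a.shape for a in arrays], axis=0)
    window = tuple(slice(0, int(t)) for t in target)
    return np.stack([a[window] for a in arrays])


# Same shapes as the first four entries printed above.
shapes = [(1, 2, 9), (3, 2, 9), (3, 3, 9), (2, 3, 9)]
arrays = [np.ones(s) for s in shapes]
padded = pad_batch(arrays, constant_values=2)
cropped = crop_batch(arrays)
print(padded.shape)   # (4, 3, 3, 9)
print(cropped.shape)  # (4, 1, 2, 9)
```

Padding keeps all the data at the cost of filler values, while cropping keeps only the region shared by every entry.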


### Random slices and scattering

`lardon` also provides ways to pick random slices from files, automating a tedious routine in data handling. In addition, an `OfflineDataList` can be scattered along a given axis, which makes it possible to "flatten" a dataset.

In [36]:
import random, numpy as np
from lardon import parse_folder, LardonParser, randslice

path = "tests/long_datasets"

# generate fake dataset with long shapes 
with LardonParser(None, path, force=True) as parser:
    for i in range(2):
        data = np.arange(44100)
        parser.register(data, {})

offline_list = parse_folder(path)
for i in range(4):
    x = offline_list[0, randslice(4)]
    print(x)

(44100,)
[14706 14707 14708 14709]
[22131 22132 22133 22134]
[23096 23097 23098 23099]
[10269 10270 10271 10272]
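
The random windows printed above can be pictured as a contiguous slice whose start is drawn at random. The sketch below reproduces that idea with numpy only; `random_window` is a hypothetical helper, and how `randslice` actually picks its start is not specified here.

```python
import numpy as np


def random_window(length: int, total: int, rng: np.random.Generator) -> slice:
    """Return a contiguous slice of `length` items that fits inside `total` items."""
    # The upper bound of integers() is exclusive, so the last valid start
    # is total - length.
    start = int(rng.integers(0, total - length + 1))
    return slice(start, start + length)


rng = np.random.default_rng(0)
signal = np.arange(44100)
for _ in range(4):
    window = random_window(4, signal.shape[0], rng)
    print(signal[window])
```

Applied to a memory-mapped array, such a slice reads only the selected window from disk.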


In [42]:
import random, numpy as np
from lardon import parse_folder, OfflineDataList

path = "tests/dumb_dataset/parsing"
offline_list = parse_folder(path)
print(offline_list.shape)
offline_entries = offline_list.entries
scattered_entries = sum([x.scatter(0) for x in offline_entries], [])
scattered_list = OfflineDataList(scattered_entries)
print(scattered_list.shape)


(10, 2)
(20,)
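
Conceptually, scattering splits each entry into its sub-arrays along one axis, and concatenating the results flattens the dataset. The numpy sketch below illustrates this on in-memory arrays; the `scatter` function here is a hypothetical stand-in, not the `OfflineEntry.scatter` method, and the array shapes are chosen for illustration.

```python
import numpy as np


def scatter(array, axis=0):
    """Split an array into the list of its sub-arrays along `axis`."""
    return [np.take(array, i, axis=axis) for i in range(array.shape[axis])]


# Ten items of shape (5, 7, 13), scattered along axis 0.
dataset = [np.zeros((5, 7, 13)) for _ in range(10)]

# sum(..., []) concatenates the per-item lists, as in the cell above.
flat = sum([scatter(x, 0) for x in dataset], [])
print(len(flat), flat[0].shape)
```

Each item of the flattened list is one sub-array of an original entry, so the dataset length becomes the number of entries times the size of the scattered axis.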
