## Data Generation Process

The first step in the End2You pipeline is to generate `.hdf5` files from the raw input data.
The main file that needs to be defined is `input_file.csv`, a comma-separated file that lists the paths of the raw modality files (e.g. `.wav`, `.mp4`) and their corresponding label files, under the header `raw_file,label_file`. Each raw file and its label file must have the same name. Example of an `input_file.csv`:

``` 
raw_file,label_file
/path/to/data/file1.wav,/path/to/labels/file1.csv
/path/to/data/file2.wav,/path/to/labels/file2.csv
```

The `label_file` is a file whose header has the form `timestamp,label_1,label_2`, where `timestamp` is the time of each segment of the raw sequence and the remaining columns hold the corresponding labels (e.g. `label_1`, `label_2`, ...).

*Label file example: `file1.csv`*

```
time,label1,label2
0.00,0.24,0.14
0.04,0.20,0.18
```
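
As a quick sanity check, a label file in this layout can be read with Python's standard `csv` module. The following is a minimal sketch, assuming a comma delimiter and a single header row; `read_label_file` is a hypothetical helper written for illustration and is not part of End2You.

```python
import csv
import io


def read_label_file(handle, delimiter=','):
    """Hypothetical helper: return (header, rows) with every value parsed as float."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # first line holds the column names
    rows = [[float(value) for value in row] for row in reader if row]
    return header, rows


# In-memory copy of the example above. For a real file, pass
# open('/path/to/labels/file1.csv', newline='') instead.
example = io.StringIO("time,label1,label2\n0.00,0.24,0.14\n0.04,0.20,0.18\n")
header, rows = read_label_file(example)
```

Here `header` holds the column names and `rows` holds one list of floats per timestamp, which makes it easy to spot malformed lines before running the generator.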

Let's see how we can create `input_file.csv` when we have the paths to the raw audio files and the label files.


#### Create `input_file.csv`

In [None]:
import numpy as np

from pathlib import Path

In [None]:
audio_path = Path('/path/to/raw_files')
label_path = Path('/path/to/label/files')

In [None]:
# Sort both listings so that row i of each array refers to the same recording (relies on matching file names).
audio_files = np.array([str(x) for x in sorted(audio_path.glob('*'))]).reshape(-1, 1)
label_files = np.array([str(x) for x in sorted(label_path.glob('*'))]).reshape(-1, 1)

In [None]:
files_array = np.hstack([audio_files, label_files])

In [None]:
path2save_file = Path('/path/to/save/input_file.csv')
# comments='' stops np.savetxt from prefixing the header line with '# '.
np.savetxt(str(path2save_file), files_array, delimiter=',', fmt='%s', header='raw_file,label_file', comments='')
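
Pairing the two sorted listings row by row only works if every raw file has a label file with a matching name. The sketch below pairs files explicitly and reports any raw file without a label. It assumes that "same name" means the same file name without its extension; `pair_by_stem` is a hypothetical helper, not an End2You function.

```python
from pathlib import Path


def pair_by_stem(raw_paths, label_paths):
    """Hypothetical helper: match raw and label files on file name without extension."""
    labels = {Path(p).stem: str(p) for p in label_paths}
    pairs, missing = [], []
    for raw in raw_paths:
        stem = Path(raw).stem
        if stem in labels:
            pairs.append((str(raw), labels[stem]))
        else:
            missing.append(str(raw))  # raw file with no matching label file
    return pairs, missing


# The label paths are deliberately out of order to show that ordering does not matter.
pairs, missing = pair_by_stem(
    ['/path/to/data/file1.wav', '/path/to/data/file2.wav'],
    ['/path/to/labels/file2.csv', '/path/to/labels/file1.csv'],
)
```

If `missing` is non-empty, it is probably safer to fix the data folders before writing `input_file.csv`.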

#### Run generator 

In [None]:
from end2you.generation_process import GenerationProcess
from end2you.utils import Params

We use the class `Params` to elegantly define the parameters required for the generation process.
The parameters for the generation process are the following:
```
    - save_data_folder: Path to save the generated `.hdf5` files.
    - modality        : Modality to be used. Values [`audio`, `visual`, `audiovisual`].
    - input_file      : Path to input_file.csv.
    - delimiter       : Label file delimiter.
    - fieldnames      : If not provided, the label files are assumed to have a header; otherwise, provide
                        the header with this parameter.
    - exclude_cols    : Columns to exclude from the process. Takes a string of comma-separated
                        integer column indices (starting from 0), e.g. '0, 2' excludes the first and
                        third columns.
    - root_dir        : Path to save the output files of end2you.
    - log_file        : Name of log file.
```
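
To make the `exclude_cols` format concrete, the sketch below shows one way such a string could be interpreted and applied to a list of field names. It only illustrates the format described above; `parse_exclude_cols` and `drop_columns` are hypothetical helpers, and End2You's own parsing may differ.

```python
def parse_exclude_cols(exclude_cols):
    """Hypothetical helper: turn a string such as '0, 2' into [0, 2]."""
    if not exclude_cols:
        return []
    # int() tolerates surrounding whitespace, so '0, 2' and '0,2' behave the same.
    return [int(token) for token in exclude_cols.split(',') if token.strip()]


def drop_columns(fieldnames, excluded):
    """Hypothetical helper: keep the field names whose 0-based index is not excluded."""
    excluded = set(excluded)
    return [name for index, name in enumerate(fieldnames) if index not in excluded]


fieldnames = [name.strip() for name in 'file, timestamp, arousal, valence, liking'.split(',')]
kept = drop_columns(fieldnames, parse_exclude_cols('0'))
```

With `exclude_cols` set to `'0'`, the first column (`file`) is dropped and the remaining field names are kept in order.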

Example:

In [None]:
generator_params = Params(dict_params={
            'save_data_folder': '/path/to/save/hdf5/files',
            'modality': 'audio',
            'input_file': '/path/to/input_file.csv',
            'exclude_cols': '0',
            'delimiter': ';',
            'fieldnames': 'file, timestamp, arousal, valence, liking',
            'log_file': 'generation.log',
            'root_dir': '/path/to/save/output/files/of/end2you'
        })

In [None]:
generation = GenerationProcess(generator_params)
generation.start()