# Loading Data as a PyTorch DataLoader

## Basic DataLoader
- Single Speaker
- 3 Modalities

In [2]:
from data import Data
from tqdm import tqdm

In [3]:
common_kwargs = dict(path2data = 'pats/data',
                     speaker = ['bee'],
                     modalities = ['pose/data', 'audio/log_mel_512', 'text/bert'],
                     fs_new = [15, 15, 15],
                     batch_size = 4,
                     window_hop = 5)

In [4]:
data = Data(**common_kwargs)

100%|██████████| 547/547 [01:42<00:00,  7.18it/s]
100%|██████████| 65/65 [00:11<00:00,  6.01it/s]
100%|██████████| 84/84 [00:14<00:00,  4.19it/s]


`Data` has 3 DataLoader objects: `data.train`, `data.dev` and `data.test`. Let's sample a batch from `data.train`.

In [5]:
for batch in data.train:
  break

All elements of the dictionary have a "batch x time x feature" order. Let's look at the shapes of all the elements of the dictionary `batch`.

In [6]:
for key in batch.keys():
  if key != 'meta':
    print('{}: {}'.format(key, batch[key].shape))

pose/data: torch.Size([4, 64, 104])
audio/log_mel_512: torch.Size([4, 64, 128])
text/bert: torch.Size([4, 64, 768])
text/token_duration: torch.Size([4, 15])
text/token_count: torch.Size([4])
style: torch.Size([4, 64])
idx: torch.Size([4])


"pose/data" has 104 dimensions, which corresponds to 52 joints with XY coordinates. Let's reshape it into a more explicit format.

In [7]:
pose = batch['pose/data']
pose = pose.reshape(pose.shape[0], pose.shape[1], 2, -1)
print(pose.shape)

torch.Size([4, 64, 2, 52])
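
To see which features end up on which axis after this reshape, here is a minimal sketch on a dummy tensor (not real pose data). It only shows what `reshape` does: the first 52 features go to index 0 of the new axis and the last 52 to index 1. Whether index 0 holds X and index 1 holds Y is an assumption that the text above does not spell out.

```python
import torch

# Dummy batch with the shapes shown above: batch x time x feature.
flat = torch.arange(4 * 64 * 104, dtype=torch.float32).reshape(4, 64, 104)

# Same reshape as in the cell above.
pose = flat.reshape(flat.shape[0], flat.shape[1], 2, -1)

# The first 52 features land in pose[:, :, 0, :], the last 52 in pose[:, :, 1, :].
first_half_matches = torch.equal(pose[:, :, 0, :], flat[:, :, :52])
second_half_matches = torch.equal(pose[:, :, 1, :], flat[:, :, 52:])
print(pose.shape, first_half_matches, second_half_matches)
```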


Apart from the requested modalities (pose, audio and text), we get some extra elements. Let's quickly go through them.
- The shape of "text/bert" along time is the same as that of "pose/data", hence they are temporally aligned.
- The shape of "text/token_duration" implies that the maximum length of a sentence in this mini-batch is 15.
- "idx" refers to the index of each sample in the object of the `Data` class.
- "style" is the relative style id of the speakers in the dataset. In this case, all the values will be 0, since there is only one speaker.
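
One way to picture how token-level text can become temporally aligned with pose frames is to repeat each token's feature for the number of frames it spans. The sketch below uses made-up values and assumes that "text/token_duration" stores per-token frame counts; the `Data` class may implement this differently.

```python
import torch

# Hypothetical toy example: 3 tokens with 4-dimensional features.
token_feats = torch.arange(12, dtype=torch.float32).reshape(3, 4)

# Assumed per-token durations in frames (made-up values).
durations = torch.tensor([2, 1, 3])

# Repeat row i of token_feats durations[i] times along the time axis.
frame_feats = torch.repeat_interleave(token_feats, durations, dim=0)
print(frame_feats.shape)
```

The result has one row per frame (here 2 + 1 + 3 = 6 rows), so it can be stacked next to frame-level pose or audio features.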

## Multi-Speaker DataLoader

In [8]:
common_kwargs.update(dict(speaker=['bee', 'maher']))

In [9]:
data = Data(**common_kwargs)

100%|██████████| 1472/1472 [03:48<00:00,  9.90it/s]
100%|██████████| 186/186 [00:28<00:00,  7.14it/s]
100%|██████████| 226/226 [00:34<00:00,  7.22it/s]


In [10]:
for batch in data.train:
  break

This is the same as the Basic DataLoader, except that data from both speakers is sampled, which allows training a multi-speaker model.

## Other text features
In case we do not want to use fixed pre-trained embeddings, we can use "text/tokens" as a modality. These tokens represent the indices extracted by `BertTokenizer` from [HuggingFace](https://huggingface.co) and can be used to fine-tune transformer-based embeddings. In this example, we use `repeat_text=0`, which means the text/tokens modality is not repeated to align it with pose and/or audio.

In [11]:
common_kwargs.update(dict(modalities = ['pose/data', 'audio/log_mel_512', 'text/tokens'],
                         repeat_text = 0))

In [12]:
data = Data(**common_kwargs)

100%|██████████| 1472/1472 [02:53<00:00,  8.96it/s]
100%|██████████| 186/186 [00:21<00:00,  8.15it/s]
100%|██████████| 226/226 [00:27<00:00,  6.75it/s]


In [13]:
for batch in data.train:
  break

In [14]:
batch['text/tokens'].shape

torch.Size([4, 17])

In [15]:
batch['text/tokens']

tensor([[ 1005.,  1055.,  2256.,  2120.,  4676.,  2374.,  2272.,  2006.,  2017.,
          2293.,  1996.,  3565.,  4605.,  2017.,     0.,     0.,     0.],
        [ 2023.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
             0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.],
        [ 8394.,  1996.,  2279.,  2270., 13109.,  8649.,  4726.,  2045.,  2001.,
          1037.,  8448.,  2055.,  2023.,  2045.,  2001.,  1037., 28205.],
        [ 2552.,  2029.,  2003.,  3492., 11703., 22048.,  2516.,  1998.,  4171.,
          7659.,  1998.,  2024., 10892.,     0.,     0.,     0.,     0.]],
       dtype=torch.float64)
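
The tokens above come back as `float64`, while embedding layers expect integer indices. A minimal sketch of preparing them for an embedding lookup follows. It assumes that 0 is the padding index (consistent with the trailing zeros above) and uses a vocabulary size of 30522, which matches `bert-base-uncased`; both are assumptions about the tokenizer behind this dataset. The small embedding layer is only a stand-in for a real transformer.

```python
import torch

# A few token ids copied from the output above, padded with zeros.
tokens = torch.tensor([[1005., 1055., 0., 0.],
                       [2023., 0., 0., 0.]], dtype=torch.float64)

# Embedding lookups need integer indices.
token_ids = tokens.long()

# 1 for real tokens, 0 for padding (assumes padding id 0).
attention_mask = (token_ids != 0).long()

# Stand-in embedding layer; the vocabulary size is an assumption.
embedding = torch.nn.Embedding(num_embeddings=30522, embedding_dim=8, padding_idx=0)
embedded = embedding(token_ids)
print(token_ids.dtype, attention_mask.tolist(), embedded.shape)
```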

## DataLoaders with Samplers

In [16]:
common_kwargs.update(dict(style_iters=100))

In [17]:
data = Data(**common_kwargs)

100%|██████████| 1472/1472 [02:43<00:00,  7.95it/s]
100%|██████████| 186/186 [00:20<00:00,  8.44it/s]
100%|██████████| 226/226 [00:25<00:00,  8.77it/s]


In [18]:
for batch in tqdm(data.train):
  continue

100%|██████████| 100/100 [00:00<00:00, 428.28it/s]


This is the same as the Multi-Speaker DataLoader, except that the "style" element will now be 0 or 1 depending on which speaker the data belongs to. Because we use the `style_iters` argument, we can be sure that every batch contains both styles. The number of iterations per epoch is 100, which is the value of `style_iters`.

In [19]:
batch['style']

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1.,
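
Since "style" is constant along time for each sample, it can be reduced to one id per sample to check which speakers a batch contains. The sketch below runs on a dummy tensor that mimics the printed pattern (rows of 0, 1, 0, 1); the variable names are illustrative and not part of the `Data` API.

```python
import torch

# Dummy stand-in for batch['style']: batch x time, constant along time.
style = torch.tensor([0., 1., 0., 1.]).unsqueeze(1).repeat(1, 64)

# One style id per sample: take the first time step.
per_sample_style = style[:, 0].long()

# Confirm the id is constant over time, then list the styles present.
constant_over_time = bool((style == style[:, :1]).all())
styles_in_batch = torch.unique(per_sample_style).tolist()
print(constant_over_time, styles_in_batch)
```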

## Working with h5 files
In case these dataloaders do not suit your needs, it is possible to read individual interval files. We have created a class `HDF5` with many static methods to load data from these h5 files. 

**Caution** - Not closing h5 files properly can give persistent errors and may require a system restart.

**Caution-2** - It is recommended to ignore the intervals listed in `missing_intervals.h5`, as those intervals do not have complete data. The DataLoaders take care of that, but manually accessing h5 files does not.

In [20]:
from data import HDF5

In [21]:
h5 = HDF5.h5_open('pats/data/processed/bee/cmu0000025735.h5', 'r')
print(h5.keys())
for key in h5.keys():
  print('{}: {}'.format(key, h5[key].keys()))
h5.close()

<KeysViewHDF5 ['audio', 'pose', 'text']>
audio: <KeysViewHDF5 ['log_mel_400', 'log_mel_512', 'silence']>
pose: <KeysViewHDF5 ['confidence', 'data', 'normalize']>
text: <KeysViewHDF5 ['bert', 'meta', 'tokens', 'w2v']>


## Loading a key

In [22]:
data, h5 = HDF5.load('pats/data/processed/bee/cmu0000025735.h5', key='pose/data')
data = data[()]
h5.close()

In [23]:
data.shape

(292, 104)
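
The open, read, close pattern above can also be written with a context manager, which closes the file even if reading fails. The sketch below uses `h5py` directly on a throwaway file and does not go through the `HDF5` class; it assumes the interval files are ordinary HDF5 files readable with `h5py`, which the `KeysViewHDF5` output suggests but does not guarantee.

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.h5')  # hypothetical file, not part of the dataset
with h5py.File(path, 'w') as h5:
    h5.create_dataset('pose/data', data=np.zeros((292, 104)))

# The context manager closes the file even if reading raises an exception.
with h5py.File(path, 'r') as h5:
    pose = h5['pose/data'][()]  # copy into memory before the file closes

print(pose.shape)
```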

## Loading missing_intervals.h5

In [24]:
missing, h5 = HDF5.load('pats/data/missing_intervals.h5', key='intervals')
missing = missing[()]
h5.close()

In [25]:
missing

array(['115309', '147056', 'cmu0000022349', ..., '5227', '13510', '25204'],
      dtype=object)
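
When reading interval files by hand, this array can be used to skip incomplete intervals. A minimal sketch, assuming that the file name without its extension is the interval id; the second path below is hypothetical and only there to show an excluded file.

```python
from pathlib import Path

# Ids taken from the output above; the real array is much longer.
missing = {'115309', '147056', 'cmu0000022349'}

# Hypothetical list of interval files for one speaker.
interval_files = [
    'pats/data/processed/bee/cmu0000025735.h5',
    'pats/data/processed/bee/cmu0000022349.h5',
]

# Keep a file only if its stem (assumed to be the interval id) is not missing.
usable = [f for f in interval_files if Path(f).stem not in missing]
print(usable)
```

Converting the array to a `set` first keeps each membership check cheap when there are many intervals.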

## Loading Transcripts as a DataFrame

In [26]:
import pandas as pd

In [27]:
pd.read_hdf('pats/data/processed/bee/cmu0000025735.h5', key='text/meta')

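| Unnamed: 0 | Word | start_frame | end_frame |
|---|---|---|---|
| 0 | do | 0.0 | 7.0 |
| 1 | you | 7.0 | 9.0 |
| 2 | have | 9.0 | 10.0 |
| 3 | to | 10.0 | 12.0 |
| 4 | be | 12.0 | 15.0 |
| 5 | on | 15.0 | 21.0 |
| 6 | Sunday | 21.0 | 27.0 |
| 7 | candidate | 27.0 | 36.0 |
| 8 | Pete | 36.0 | 40.0 |
| 9 | buttigieg | 40.0 | 49.0 |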
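
With the transcript in a DataFrame, per-word quantities are easy to derive. As a small sketch, the number of frames each word spans can be computed from `start_frame` and `end_frame`; the rows are typed in by hand from the output above, and the column name `n_frames` is made up for this example.

```python
import pandas as pd

# First rows of the transcript shown above, typed in by hand.
meta = pd.DataFrame({
    'Word': ['do', 'you', 'have', 'to'],
    'start_frame': [0.0, 7.0, 9.0, 10.0],
    'end_frame': [7.0, 9.0, 10.0, 12.0],
})

# Number of frames each word spans.
meta['n_frames'] = (meta['end_frame'] - meta['start_frame']).astype(int)
print(meta[['Word', 'n_frames']])
```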
