# Loading Data as a PyTorch DataLoader

## Basic DataLoader
- Single Speaker
- 3 Modalities

In [64]:
from data import Data
from tqdm import tqdm

In [52]:
common_kwargs = dict(path2data = 'pats/data',
                     speaker = ['lec_cosmic'],
                     modalities = ['pose/data', 'audio/log_mel_512', 'text/bert'],
                     fs_new = [15, 15, 15],
                     batch_size = 4,
                     window_hop = 5)

In [24]:
data = Data(**common_kwargs)

100%|██████████| 65/65 [00:04<00:00, 16.23it/s]
100%|██████████| 9/9 [00:00<00:00, 14.09it/s]
100%|██████████| 7/7 [00:00<00:00, 18.26it/s]


`Data` has 3 DataLoader objects: `data.train`, `data.dev` and `data.test`. Let's sample a batch from `data.train`.

In [25]:
for batch in data.train:
  break

Each batch is a dictionary. The requested modalities follow a "batch x time x feature" order. Let's look at the shapes of all the elements of the dictionary `batch`.

In [32]:
for key in batch.keys():
  if key != 'meta':
    print('{}: {}'.format(key, batch[key].shape))

pose/data: torch.Size([4, 64, 104])
audio/log_mel_512: torch.Size([4, 64, 128])
text/bert: torch.Size([4, 64, 768])
text/token_duration: torch.Size([4, 17])
text/token_count: torch.Size([4])
style: torch.Size([4, 64])
idx: torch.Size([4])
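
The shapes printed above can be checked programmatically before a batch is fed to a model. The following is a minimal sketch, not part of the `Data` class: `check_time_alignment` is a hypothetical helper that works on plain shape tuples, and the shapes are copied from the output above.

```python
def check_time_alignment(shapes, modalities):
    """Return the shared (batch, time) sizes, or raise if the modalities disagree."""
    reference = tuple(shapes[modalities[0]][:2])
    for name in modalities:
        shape = tuple(shapes[name])
        if len(shape) != 3:
            raise ValueError('{} is not batch x time x feature: {}'.format(name, shape))
        if shape[:2] != reference:
            raise ValueError('{} has (batch, time) {} but expected {}'.format(
                name, shape[:2], reference))
    return reference

# Shapes copied from the printed output above.
shapes = {
    'pose/data': (4, 64, 104),
    'audio/log_mel_512': (4, 64, 128),
    'text/bert': (4, 64, 768),
}
modalities = ['pose/data', 'audio/log_mel_512', 'text/bert']
print(check_time_alignment(shapes, modalities))
```

With a real batch, the `shapes` dictionary could be built as `{k: tuple(v.shape) for k, v in batch.items() if k != 'meta'}`.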


"pose/data" has 104 dimensions, which correspond to 52 joints with XY coordinates. Let's reshape it to a more obvious format.

In [31]:
pose = batch['pose/data']
pose = pose.reshape(pose.shape[0], pose.shape[1], 2, -1)
print(pose.shape)

torch.Size([4, 64, 2, 52])
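
It may help to see what this reshape does to the last axis. On a row-major tensor, reshaping 104 values to `(2, 52)` puts the first 52 values in row 0 and the next 52 in row 1; it does not de-interleave `x, y, x, y, ...` pairs. The pure-Python sketch below illustrates this with a stand-in list. That row 0 holds the X coordinates and row 1 the Y coordinates is an assumption here, not something stated above.

```python
def split_last_axis(frame, parts):
    """Split a flat list into contiguous chunks, like a row-major reshape to (parts, -1)."""
    if len(frame) % parts != 0:
        raise ValueError('length must be divisible by parts')
    width = len(frame) // parts
    return [frame[i * width:(i + 1) * width] for i in range(parts)]

frame = list(range(104))  # stand-in for one time step of "pose/data"
first_half, second_half = split_last_axis(frame, 2)

# Under the assumed layout, joint j would be (first_half[j], second_half[j]).
print(len(first_half), first_half[0], second_half[0])
```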


Apart from the requested modalities (i.e. pose, audio and text), we get some extra elements. Let's quickly go through them.
- The shape of "text/bert" along time is the same as that of "pose/data", hence they are temporally aligned.
- The shape of "text/token_duration" implies that the longest sentence in this mini-batch has 17 tokens.
- "idx" refers to the index of the sample in the object of the `Data` class.
- "style" is the relative style id of the speakers in the dataset. In this case, all the values will be 0.
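
One plausible way for token-level features to end up on the same time axis as pose is to repeat each token's feature for as many frames as the token lasts. The sketch below shows that idea in plain Python. It is an assumption about the mechanism, not the `Data` implementation, and `repeat_to_frames` and the example values are hypothetical.

```python
def repeat_to_frames(token_features, durations):
    """Repeat each token feature `duration` times to build a frame-level sequence."""
    if len(token_features) != len(durations):
        raise ValueError('one duration is needed per token')
    frames = []
    for feature, duration in zip(token_features, durations):
        frames.extend([feature] * duration)
    return frames

tokens = ['do', 'you', 'have']   # stand-ins for per-token embeddings
durations = [7, 2, 1]            # frames spanned by each token (illustrative)
aligned = repeat_to_frames(tokens, durations)
print(len(aligned), aligned[6], aligned[7])
```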

## Multi-Speaker DataLoader

In [53]:
common_kwargs.update(dict(speaker=['lec_cosmic', 'lec_evol']))

In [54]:
data = Data(**common_kwargs)

100%|██████████| 197/197 [00:03<00:00, 50.41it/s]
100%|██████████| 49/49 [00:01<00:00, 47.31it/s]
100%|██████████| 151/151 [00:02<00:00, 60.29it/s]


In [55]:
for batch in data.train:
  break

This is the same as the Basic DataLoader, except that data from both speakers is sampled, which allows training a multi-speaker model.

## Other text features
If we do not want to use fixed pre-trained embeddings, we can use "text/tokens" as a modality. These tokens are the indices produced by `BertTokenizer` from [HuggingFace](https://huggingface.co) and can be used to fine-tune transformer-based embeddings. In this example we set `repeat_text=0`, so the text/tokens modality is not repeated to align it with pose and/or audio.

In [56]:
common_kwargs.update(dict(modalities = ['pose/data', 'audio/log_mel_512', 'text/tokens'],
                         repeat_text = 0))

In [57]:
data = Data(**common_kwargs)

100%|██████████| 197/197 [00:06<00:00, 32.74it/s]
100%|██████████| 49/49 [00:01<00:00, 32.11it/s]
100%|██████████| 151/151 [00:03<00:00, 39.60it/s]


In [58]:
for batch in data.train:
  break

In [59]:
batch['text/tokens'].shape

torch.Size([4, 13])

In [60]:
batch['text/tokens']

tensor([[12761.,  7366.,  2008.,  2003.,  2428.,  4187.,  2005.,  4824.,     0.,
             0.,     0.,     0.,     0.],
        [ 2172.,  2062.,  5294.,  2292.,  1005.,  1055.,  2360.,  2702.,  2335.,
          2062.,  5294.,  2009.,  2052.],
        [ 2428.,  3147.,  2437.,  2235.,  2515.,  1996.,  2590.,     0.,     0.,
             0.,     0.,     0.,     0.],
        [ 5072.,  1998., 27226.,  2021.,  2036.,  1996.,  4195.,     0.,     0.,
             0.,     0.,     0.,     0.]], dtype=torch.float64)
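
The tokens above are zero-padded and stored as `float64`, whereas embedding layers expect integer indices and usually a mask that marks the padded positions. The sketch below derives such a mask in plain Python from the values printed above. Treating `0` as the padding id is an assumption based on the trailing zeros in the output.

```python
# Token ids copied from the output above, written as integers.
token_ids = [
    [12761, 7366, 2008, 2003, 2428, 4187, 2005, 4824, 0, 0, 0, 0, 0],
    [2172, 2062, 5294, 2292, 1005, 1055, 2360, 2702, 2335, 2062, 5294, 2009, 2052],
    [2428, 3147, 2437, 2235, 2515, 1996, 2590, 0, 0, 0, 0, 0, 0],
    [5072, 1998, 27226, 2021, 2036, 1996, 4195, 0, 0, 0, 0, 0, 0],
]
PAD_ID = 0  # assumed padding id

# 1 marks a real token, 0 marks padding.
attention_mask = [[int(token != PAD_ID) for token in row] for row in token_ids]
lengths = [sum(row) for row in attention_mask]
print(lengths)
```

With the tensor from the batch, the equivalent would be `ids = batch['text/tokens'].long()` followed by `mask = (ids != 0).long()`.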

## DataLoaders with Samplers

In [61]:
common_kwargs.update(dict(style_iters=100))

In [62]:
data = Data(**common_kwargs)

100%|██████████| 197/197 [00:03<00:00, 59.95it/s]
100%|██████████| 49/49 [00:00<00:00, 68.05it/s]
100%|██████████| 151/151 [00:02<00:00, 69.11it/s]


In [66]:
for batch in tqdm(data.train):
  continue

100%|██████████| 100/100 [00:00<00:00, 405.04it/s]


This is the same as the Multi-Speaker DataLoader, except that the "style" element is now 0 or 1 depending on which speaker the data belongs to. Because we use the `style_iters` argument, every batch is guaranteed to contain both styles. The number of iterations per epoch is 100, which is the value of `style_iters`.
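
To make the idea concrete, here is a minimal sketch of a style-balanced sampler in plain Python. It is not the sampler used by `Data`: `balanced_batches` is a hypothetical function, and drawing an equal number of samples per style with replacement is an assumption. It only shows how a fixed number of iterations and a both-styles guarantee can be combined.

```python
import random

def balanced_batches(styles, batch_size, num_iters, seed=0):
    """Yield `num_iters` batches of indices with an equal count from each style."""
    rng = random.Random(seed)
    by_style = {}
    for index, style in enumerate(styles):
        by_style.setdefault(style, []).append(index)
    per_style = batch_size // len(by_style)
    for _ in range(num_iters):
        batch = []
        for indices in by_style.values():
            # Sample with replacement so a small speaker is never exhausted.
            batch.extend(rng.choices(indices, k=per_style))
        rng.shuffle(batch)
        yield batch

styles = [0] * 6 + [1] * 3  # style id of each sample (illustrative)
batches = list(balanced_batches(styles, batch_size=4, num_iters=100))
print(len(batches), [styles[i] for i in batches[0]])
```

Because the epoch length is set by `num_iters` rather than by the dataset size, the speaker with less data is oversampled.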

In [67]:
batch['style']

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1.,

In [None]:
!pip install torch
!pip install transformers

In [68]:
!pip install tqdm
!pip install joblib
!pip install youtube-dl

Collecting youtube-dl
  Downloading https://files.pythonhosted.org/packages/61/1c/a86837929eff24827b117d577584cc1a2a85dfdb5a91465d17c8b298f0d0/youtube_dl-2020.7.28-py2.py3-none-any.whl (1.8MB)
     |████████████████████████████████| 1.8MB 4.0MB/s eta 0:00:01
Installing collected packages: youtube-dl
Successfully installed youtube-dl-2020.7.28


In [None]:
import nltk
import h5py
import pandas as pd

## Working with h5 files
If these DataLoaders do not suit your needs, you can read individual interval files directly. The `HDF5` class provides many static methods to load data from these h5 files.

**Caution** - Not closing h5 files properly can give persistent errors and may require a system restart.

**Caution-2** - It is recommended to ignore intervals in `missing_intervals.h5`, as those intervals do not have complete data. The DataLoaders take care of that, but manually accessing h5 files does not.

In [69]:
from data import HDF5

In [72]:
h5 = HDF5.h5_open('pats/data/processed/bee/cmu0000025735.h5', 'r')
print(h5.keys())
for key in h5.keys():
  print('{}: {}'.format(key, h5[key].keys()))
h5.close()

<KeysViewHDF5 ['audio', 'pose', 'text']>
audio: <KeysViewHDF5 ['log_mel_400', 'log_mel_512', 'silence']>
pose: <KeysViewHDF5 ['confidence', 'data', 'normalize']>
text: <KeysViewHDF5 ['bert', 'meta', 'tokens', 'w2v']>
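
The cell above calls `h5.close()` only if every line before it succeeds. One way to make sure the file is closed even when reading fails is `contextlib.closing`, which calls `close()` on exit from the `with` block. The sketch below demonstrates this with `FakeHandle`, a hypothetical stand-in for the object returned by `HDF5.h5_open`, so that it runs without the dataset.

```python
from contextlib import closing

class FakeHandle:
    """Hypothetical stand-in for an open h5 handle; only keys() and close() are modeled."""
    def __init__(self):
        self.closed = False

    def keys(self):
        return ['audio', 'pose', 'text']

    def close(self):
        self.closed = True

handle = FakeHandle()
try:
    with closing(handle) as h5:
        top_level = list(h5.keys())
        raise RuntimeError('simulated failure while reading')
except RuntimeError:
    pass

# close() ran even though the block raised.
print(top_level, handle.closed)
```

With the real class, the same pattern would read `with closing(HDF5.h5_open(path, 'r')) as h5:`, assuming the returned object exposes `close()` as it does in the cell above.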


## Loading a key

In [75]:
data, h5 = HDF5.load('pats/data/processed/bee/cmu0000025735.h5', key='pose/data')
data = data[()]
h5.close()

In [76]:
data.shape

(292, 104)

## Loading missing_intervals.h5

In [80]:
missing, h5 = HDF5.load('pats/data/missing_intervals.h5', key='intervals')
missing = missing[()]
h5.close()

In [83]:
missing

array(['115309', '147056', 'cmu0000022349', ..., '5227', '13510', '25204'],
      dtype=object)
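
When h5 files are read by hand, the array above can be used to skip incomplete intervals. The sketch below assumes that an interval id matches the file name without its `.h5` extension, which the example path suggests but the text does not state. `keep_complete` is a hypothetical helper, and the second path is made up for illustration.

```python
from pathlib import Path

def keep_complete(paths, missing_ids):
    """Drop every path whose file stem appears in `missing_ids`."""
    missing = set(missing_ids)  # set lookup keeps the filter fast for long lists
    return [p for p in paths if Path(p).stem not in missing]

missing_ids = ['115309', '147056', 'cmu0000022349']
paths = [
    'pats/data/processed/bee/cmu0000025735.h5',
    'pats/data/processed/bee/cmu0000022349.h5',  # hypothetical path for a missing interval
]
complete = keep_complete(paths, missing_ids)
print(complete)
```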

## Loading Transcripts as a DataFrame

In [78]:
import pandas as pd

In [79]:
pd.read_hdf('pats/data/processed/bee/cmu0000025735.h5', key='text/meta')

Unnamed: 0,Word,start_frame,end_frame
0,do,0.0,7.0
1,you,7.0,9.0
2,have,9.0,10.0
3,to,10.0,12.0
4,be,12.0,15.0
5,on,15.0,21.0
6,Sunday,21.0,27.0
7,candidate,27.0,36.0
8,Pete,36.0,40.0
9,buttigieg,40.0,49.0
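
The `start_frame` and `end_frame` columns make it straightforward to compute how long each word lasts and which word is spoken at a given frame. The sketch below does this in plain Python on the rows shown above. Treating each span as half-open, `[start_frame, end_frame)`, is an assumption, and `word_at` is a hypothetical helper.

```python
# (Word, start_frame, end_frame) rows copied from the table above.
rows = [
    ('do', 0.0, 7.0), ('you', 7.0, 9.0), ('have', 9.0, 10.0),
    ('to', 10.0, 12.0), ('be', 12.0, 15.0), ('on', 15.0, 21.0),
    ('Sunday', 21.0, 27.0), ('candidate', 27.0, 36.0),
    ('Pete', 36.0, 40.0), ('buttigieg', 40.0, 49.0),
]

# Duration of each word in frames.
durations = [int(end - start) for _, start, end in rows]

def word_at(frame, rows):
    """Return the word whose [start_frame, end_frame) span contains `frame`, else None."""
    for word, start, end in rows:
        if start <= frame < end:
            return word
    return None

print(durations)
print(word_at(22, rows))
```

On the DataFrame itself, the durations would be `df['end_frame'] - df['start_frame']`.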
