## 1. Dump data for training and evaluation

### 1a. Chunked features

If you haven't dumped the features yet, go to the notebook [old feature extraction](4-feature-extraction.ipynb), section `#Varied-length-videos` (remove the `#` if you search with your browser's string matching).

_TODO_: add the procedure here to avoid jumping around the repo.

### 1b. JSON files

The format is the same as in notebook [charades notebook](11-charades-sta.ipynb).

We added the field:

- `annotation_id_didemo`: DiDeMo provides an annotation id, but it is only unique inside a subset.

_Implementation details and considerations:_

Given the continuous nature of untrimmed videos, it is a bit tricky to establish a 1-to-1 equivalence between this format and the original discrete data of DiDeMo. However, we do our best to replicate the insights from the [MCN paper](https://arxiv.org/pdf/1708.01641.pdf). In particular:

- The video `duration` is set to 30s to approximate the TEF features proposed by [MCN](https://github.com/LisaAnne/LocalizingMoments). Note that even with `duration == 30`, the continuous TEF features differ from those of the discrete setup, e.g. [5, 10] / 30 != [1, 1] / 6.

- Global features will be computed only for the existing clips of the video. Thus, `num_clips != duration != num_frames`.

- [DiDeMo dataset](https://github.com/LisaAnne/LocalizingMoments/tree/9b453d2af2255c9c7b2a3e5f3d345d2f06c2ec20)

    It corresponds to an older version of the code, but hopefully the data is the same.
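
To make the index-to-time conversion and the TEF mismatch mentioned in the list above concrete, here is a minimal sketch. The helper names `index_to_seconds` and `tef` are hypothetical and used only for illustration; the 5 s unit and the 30 s duration come from the description above.

```python
TIME_UNIT = 5   # seconds per DiDeMo segment
DURATION = 30   # seconds, fixed video duration
NUM_SEGMENTS = DURATION // TIME_UNIT  # 6 discrete segments


def index_to_seconds(start_idx, end_idx, unit=TIME_UNIT):
    """Map an inclusive segment-index pair to a [start, end] span in seconds."""
    return [start_idx * unit, (end_idx + 1) * unit]


def tef(start, end, length):
    """Normalize a [start, end] pair by the length of the video."""
    return [start / length, end / length]


# Second 5-second segment of the video, i.e. discrete annotation [1, 1]
span = index_to_seconds(1, 1)
continuous_tef = tef(*span, DURATION)
discrete_tef = tef(1, 1, NUM_SEGMENTS)

print(span)  # [5, 10]
print(continuous_tef)
print(discrete_tef)
```

The start values coincide (5 / 30 and 1 / 6), but the end values do not (10 / 30 versus 1 / 6): the continuous end marks where the segment finishes, while the discrete end is the index of the last segment.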

In [None]:
import json
from copy import deepcopy
from datetime import datetime
from pathlib import Path
import h5py
import numpy as np

import sys
sys.path.append('..')
from utils import get_git_revision_hash

# This time unit (seconds) must match the one in the original DiDeMo
# annotations, link in the description above.
DIDEMO_TIME_UNIT = 5 
MAX_TIME = 30

def update_instances_make_videos_dict(moments, offset=0):
    """Update (in-place) metadata from instances
    
    1. Transform annotations from index to time.
    2. Back up the annotation-id and create a new one.
    3. Remove unneeded fields `num_segments`, `dl_link`. Note that we can go
       back to them because we preserve the original `annotation_id`.
    4. Add field `time`, added because we weren't planning to merge both
       domains, untrimmed & trimmed videos.

    Args:
        moments (list of dict): raw data from DiDeMo
        offset (int): value assigned to the new `annotation_id` of the first
            moment; it increases by one for each following moment.

    Returns:
        videos (dict): maps each video-id in the subset to its metadata.
    """
    videos = {}
    for moment_i in moments:
        time_stamps = np.array(moment_i['times'])
        time_stamps *= DIDEMO_TIME_UNIT
        time_stamps[:, 1] += DIDEMO_TIME_UNIT
        moment_i['times'] = time_stamps.tolist()
        # DIDEMO_TIME_UNIT * 6 == 30s, which is the time-span that annotators
        # watched
        assert (time_stamps <= DIDEMO_TIME_UNIT * 6).all()
        
        moment_i['annotation_id_original'] = moment_i['annotation_id']
        moment_i['annotation_id'] = offset
        
        del moment_i['num_segments']
        del moment_i['dl_link']
        moment_i['time'] = None
        offset += 1
        
        video_id = moment_i['video']
        if video_id in videos:
            videos[video_id]['num_instances'] += 1
            continue

        videos[video_id] = {
            'num_instances': 1,
            # This is incorrect, but we follow the ICCV17 recipe for fair
            # comparison. Keep in mind, that we pool features accordingly.
            'num_frames': MAX_TIME * DIDEMO_TIME_UNIT,
            'duration': MAX_TIME
        }
    return videos

In [None]:
%%time
SUBSETS = ['train', 'val', 'test']
MODE = 'x'
CREATOR = 'EscorciaSSGR'
RAW_DATA_FMT = '../data/raw/{}_data.json'
OUTPUT_FMT = '../data/interim/didemo/{}.json'
if MODE == 'w':
    # Guard against overwriting existing dumps; remove it if you are sure
    raise RuntimeError('MODE="w" overwrites existing files, are you sure?')
assert SUBSETS == ['train', 'val', 'test']

offset = 0
for subset in SUBSETS:
    filename = Path(RAW_DATA_FMT.format(subset))
    output_file = Path(OUTPUT_FMT.format(subset))
    with open(filename, 'r') as fid:
        instances = json.load(fid)
        videos = update_instances_make_videos_dict(
            instances, offset)
        offset += len(instances)

    if not output_file.parent.is_dir():
        dirname = output_file.parent
        dirname.mkdir(parents=True)
        print(f'Create dir: {dirname}')

    print('Subset:', subset)
    print('\tNum videos:', len(videos))
    print('\tNum instances:', len(instances))
    with open(output_file, MODE) as fid:
        json.dump({'videos': videos,
                   'moments': instances,
                   'date': datetime.now().isoformat(),
                   'git_hash': get_git_revision_hash(),
                   'responsible': CREATOR,
                  },
                  fid)
    print('\tDumped file:', output_file)
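
Because the original DiDeMo annotation id is only unique inside a subset, the loop above threads `offset` through the subsets so that the new `annotation_id` is unique across all of them. A minimal sanity check could look like the sketch below. The helper `find_duplicate_ids` is hypothetical, and it assumes each dumped file was loaded into a dict with a `moments` list, as written by the cell above.

```python
from collections import Counter


def find_duplicate_ids(datasets):
    """Return annotation ids that appear more than once across subsets.

    `datasets` maps a subset name to the dict loaded from its JSON file.
    """
    counts = Counter(
        moment['annotation_id']
        for data in datasets.values()
        for moment in data['moments']
    )
    return sorted(id_ for id_, n in counts.items() if n > 1)


# Toy data standing in for the dumped files (not real DiDeMo content)
toy = {
    'train': {'moments': [{'annotation_id': 0}, {'annotation_id': 1}]},
    'val': {'moments': [{'annotation_id': 2}]},
    'test': {'moments': [{'annotation_id': 3}, {'annotation_id': 4}]},
}
duplicates = find_duplicate_ids(toy)
print(duplicates)  # []
```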

#### 1.b.1 Create partition for hyper-parameter search

Script to sample $p\%$ of the training set to avoid over-fitting during training. Although it is passed to our implementation as "validation" data, don't get confused: it comes from the training set. The purposes of this subset are twofold:

- Get an idea of the performance on the training set to study over/under-fitting.

- Validate that the training scheme translates into good performance.

_Why not do this directly in the training loop?_

During training, we compare two randomly sampled intervals of the video, whereas at inference we solve a retrieval problem over a single video. Given that the two schemes are different, we use this hack to keep the training loop simple.

In [None]:
import json
import random
from nb_utils import split_moments_dataset

filename = '../data/processed/didemo/train-03.json'
trial = '01'
seed = 1701

random.seed(seed)
_, val = split_moments_dataset(filename, ratio=0.85)

with open(f'../data/processed/didemo/train-03_{trial}.json', 'x') as fid:
    json.dump(val, fid)
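
`split_moments_dataset` lives in `nb_utils` and is not reproduced here. As a rough sketch of the idea only, the snippet below splits a list of moments at random. The function `split_moments`, its moment-level sampling, and the reading of `ratio` as the fraction kept in the first split are all assumptions, not the actual implementation.

```python
import random


def split_moments(moments, ratio=0.85, seed=1701):
    """Randomly split `moments` into two disjoint lists.

    Hypothetical helper: the first list keeps roughly `ratio` of the
    moments and the second one gets the rest.
    """
    rng = random.Random(seed)  # local RNG, keeps the global state untouched
    indices = list(range(len(moments)))
    rng.shuffle(indices)
    cut = int(round(ratio * len(indices)))
    first = [moments[i] for i in sorted(indices[:cut])]
    second = [moments[i] for i in sorted(indices[cut:])]
    return first, second


# Toy moments standing in for the training set
toy_moments = [{'annotation_id': i} for i in range(20)]
kept, held_out = split_moments(toy_moments)
print(len(kept), len(held_out))  # 17 3
```

Using a local `random.Random(seed)` makes the split reproducible without depending on the global seed set elsewhere in the notebook.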

#### 1.b.2 Untied JSON and HDF5 inputs

TL;DR: minor detail, kept for reference. Safe to skip unless you have problems loading data when launching training.

At some point, there was an undesired tie between the JSON and HDF5 files (inputs) required by our implementation:

- root `time_unit`. This is a property of the features; as such, it should reside in the HDF5 and not in the JSON.

- `videos/ith-video/num_clips`. This is a property of the ith-video; as such, we should grab it from the HDF5 instead of placing it in the JSON.

The following script was used to create the `*-03.json` files with metadata for training and evaluation.

```python
import json
from datetime import datetime

import sys
sys.path.append('..')
from utils import get_git_revision_hash

subsets = ['train', 'val', 'test']

for subset in subsets:
    file_src = f'../data/processed/didemo/{subset}-02.json'
    file_dst = f'../data/processed/didemo/{subset}-03.json'
    with open(file_src, 'r') as fr:
        data = json.load(fr)
    del data['time_unit']
    for video_id in data['videos']:
        del data['videos'][video_id]['num_clips']
    data['date'] = datetime.now().isoformat()
    data['git_hash'] = get_git_revision_hash()
    with open(file_dst, 'x') as fw:
        json.dump(data, fw)
```

We also updated the HDF5 file such that it contains a `metadata` [Group/Folder](http://docs.h5py.org/en/latest/high/group.html).

```bash
h5ls /home/escorciav/datasets/didemo/features/resnet152_max_cs-5.h5 | grep metadata
```

If the command above doesn't return anything, it means that you are using an old version of the data.
If you know the `FPS`, `CLIP_LENGTH` and `POOL`ing operation used to get those features, the following snippet will add the metadata required for the most recent version of our code.

```python
from datetime import datetime

import h5py

FPS = 5
CLIP_LENGTH = 5  # seconds
POOL = 'max'  # pooling operation over time
# verbose
COMMENTS = (f'ResNet152 trained on Imagenet-ILSVRC12, Pytorch model. '
            f'Extracted at {FPS} FPS with an image resolution of 320x240, '
            f'and {POOL} pooled over time every {CLIP_LENGTH} seconds.')
# Please add your name here to sign the file, i.e. assign yourself as
# responsible
CREATOR = 'EscorciaSSGR'
filename = f'/home/escorciav/datasets/didemo/features/resnet152_{POOL}_cs-{CLIP_LENGTH}.h5'

# Each clip must span at least one frame
assert CLIP_LENGTH * FPS >= 1
# Mode 'a' keeps the existing features and adds the metadata group
with h5py.File(filename, 'a') as fw:
    grp = fw.create_group('metadata')
    grp.create_dataset('time_unit', data=CLIP_LENGTH)
    grp.create_dataset('date', data=datetime.now().isoformat(),
                       dtype=h5py.special_dtype(vlen=str))
    grp.create_dataset('responsible', data=CREATOR,
                       dtype=h5py.special_dtype(vlen=str))
    grp.create_dataset('comments', data=COMMENTS,
                       dtype=h5py.special_dtype(vlen=str))
```
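
To check what ended up in the file, the metadata can be read back with h5py. The sketch below is self-contained: it writes a small temporary file that mimics the `metadata` group created above (the values are placeholders) and reads it back. The helper `read_metadata` is hypothetical. Depending on the h5py version, variable-length strings come back as `bytes` or `str`, so the helper handles both.

```python
import os
import tempfile

import h5py


def read_metadata(filename):
    """Return the content of the `metadata` group as a plain dict."""
    metadata = {}
    with h5py.File(filename, 'r') as fr:
        for key, dataset in fr['metadata'].items():
            value = dataset[()]  # read the scalar dataset
            if isinstance(value, bytes):
                value = value.decode('utf-8')
            metadata[key] = value
    return metadata


# Write a toy file with the same layout as the snippet above
tmp_dir = tempfile.mkdtemp()
toy_file = os.path.join(tmp_dir, 'toy_features.h5')
with h5py.File(toy_file, 'w') as fw:
    grp = fw.create_group('metadata')
    grp.create_dataset('time_unit', data=5)
    grp.create_dataset('responsible', data='someone',
                       dtype=h5py.special_dtype(vlen=str))

metadata = read_metadata(toy_file)
print(metadata)
```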

## 2 Word vectors

We extracted word vectors for ELMo and FastText. In the interest of time, we will document this step later.

### 2.1 ELMo

We take the output of the last layer as the word representation.

_TODO_: study each layer independently as well as all layers.

In [None]:
%%time
import json
import h5py

for subset in ['train', 'val', 'test']:

    h5_file = f'../data/interim/word-vectors/elmo/elmo_wordvec_didemo_{subset}.hdf5'
    output_file = f'../data/processed/didemo/{subset}_elmo-001.h5'

    txt_file = f'../data/interim/word-vectors/descriptions/didemo_descriptions_{subset}.txt'
    json_file = f'../data/processed/didemo/{subset}.json'
    # with open(json_file, 'r') as f_json, h5py.File(h5_file, 'r') as f_h5, h5py.File(txt_file, 'x') as fw:
    with h5py.File(h5_file, 'r') as f_h5, h5py.File(output_file, 'w') as fw:
        for i in range(len(f_h5) - 1):
            item = f_h5[f'{i}'][:]
            item = item.transpose(1, 0, 2)
            n = item.shape[0]

            # Take last layer
            item = item[:, -1, :].reshape(n, -1)
            fw.create_dataset(str(i), data=item, chunks=True)
        # cross-check
        # data = json.load(f_json)['moments']
        # for i, line in enumerate(f_txt):
            # line = line.strip()
            # if data[i]['description'] != line:
                # if '\n' in data[i]['description']:
                #     continue
                # elif data[i]['description'][-1] == ' ':
                #     continue

    # cross-check
    with open(json_file, 'r') as fid:
        assert len(json.load(fid)['moments']) == (i + 1)
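
The layer selection in the loop above can be checked on a toy array. The sketch assumes, as the `transpose(1, 0, 2)` call suggests, that each entry is stored as `(num_layers, num_words, dim)`; the sizes below are made up.

```python
import numpy as np

num_layers, num_words, dim = 3, 4, 8  # toy sizes, not the real ELMo ones
rng = np.random.RandomState(0)
item = rng.rand(num_layers, num_words, dim)

# Same steps as in the loop above
words_first = item.transpose(1, 0, 2)  # (num_words, num_layers, dim)
n = words_first.shape[0]
last_layer = words_first[:, -1, :].reshape(n, -1)  # (num_words, dim)

print(last_layer.shape)  # (4, 8)
# Indexing the last layer directly gives the same result
print(np.array_equal(last_layer, item[-1]))  # True
```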