# Workflow for feature extraction

Follow this to extract features and frames.

__Warning__ this notebook is a tutorial/guide, __do NOT__ run all the cells without reading them.

_Note_ this is a more detailed walkthrough of notebook 4, which should be deprecated at some point.

## 1. Create project folder

In [None]:
trial_name = 'didemo'
root_dir = '/home/escorciav/mnt/marla-scratch/moments-retrieval'

trial_dirname = f'{root_dir}/{trial_name}'
!mkdir -p $trial_dirname

## 2. Create a text file with all the video names

We need a single text file listing all the videos that need to be processed.

__Note__: In a nutshell, read annotations and gather all the videos of the dataset into a file.

- Do not overdo it.
- Keep it simple, e.g. no hyper-security; only assert things that would otherwise bubble up as errors.
- Create one cell per dataset.

### 2.a. DiDeMo

Requires the annotation files provided [here](https://github.com/LisaAnne/LocalizingMoments).

In [None]:
annotation_fmt = 'data/raw/{}_data.json'

import json
import random
import numpy as np
videos = []
for i in ['train', 'val', 'test']:
    annotation_file = annotation_fmt.format(i)
    with open(annotation_file, 'r') as fr:
        for instance in json.load(fr):
            videos.append(instance['video'])
videos = np.unique(videos).tolist()
# randomized to add entropy
random.shuffle(videos)

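The DiDeMo cell above can be wrapped into a small reusable function, which makes it easier to add one cell per dataset. This is a minimal sketch, assuming every annotation file is a JSON list of records with a `video` key (as in DiDeMo). `collect_video_names` and its `seed` argument are hypothetical names, not part of the original workflow.

```python
import json
import random


def collect_video_names(annotation_files, seed=None):
    """Return the shuffled list of unique video names found in the files.

    Assumes each file holds a JSON list of dicts with a 'video' key.
    """
    videos = set()
    for filename in annotation_files:
        with open(filename, 'r') as fr:
            for instance in json.load(fr):
                videos.add(instance['video'])
    # sort first so that the shuffle is reproducible for a given seed
    videos = sorted(videos)
    random.Random(seed).shuffle(videos)
    return videos
```

For DiDeMo it could be called as `collect_video_names([annotation_fmt.format(i) for i in ['train', 'val', 'test']])`.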
## 3. Make video list for frame extraction

- Make list of videos used for `video-utils/tools/batch_dump_frames.py`

- Get list of missing videos

__Notes__

1. Use hard links to link directly to the file and not to the path.

2. The cell below is an example of the two (sometimes three) outputs of this step:

    1. `missing-videos.txt`
    2. `videos.txt`
    3. make sure that all the videos are placed into a single folder.
    
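Note 1 can be illustrated with a minimal sketch, assuming a POSIX filesystem where both names live on the same device (hard links cannot cross filesystems). The file names are made up for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'video-original.mp4')
    hard = os.path.join(tmp, 'video-hard.mp4')
    soft = os.path.join(tmp, 'video-soft.mp4')
    with open(original, 'w') as fw:
        fw.write('fake video content')

    os.link(original, hard)     # hard link: a second name for the same file
    os.symlink(original, soft)  # symbolic link: a pointer to the path

    # renaming the original breaks the symbolic link but not the hard link
    os.rename(original, os.path.join(tmp, 'video-renamed.mp4'))
    hard_ok = os.path.exists(hard)
    soft_ok = os.path.exists(soft)

print(hard_ok, soft_ok)
```

This is why the renaming trick used later in this notebook (moving `videos` to `videos_original`) is safe with hard links.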
### 3.a. DiDeMo

Requires downloading the DiDeMo videos.

In [None]:
import os
from pathlib import Path
import random

videos_root = Path('/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/videos/')
with open(f'{trial_dirname}/missing-videos.txt', 'x') as fw_m, open(f'{trial_dirname}/videos.txt', 'x') as fw:
    for video_name in videos:
        if (videos_root / video_name).exists():
            fw.write(f'{video_name}\n')
        else:
            fw_m.write(f'{video_name}\n')

__[monitor]__ check what's going on

In [None]:
print('Number of videos missing')
!wc -l '{trial_dirname}/missing-videos.txt'
print('Number of videos')
!wc -l '{trial_dirname}/videos.txt'

__[debug]__

Date: September 23 2018

Apparently, 31 videos were "missing". However, those videos were in the filesystem under names that differ from the ones in the annotation file. We fixed that in the following way:

1. rename old `videos` folder as `videos_original`.

2. create a new folder called `videos`.

3. hardlink all the videos inside `videos_original` with their name in the JSON.

In [None]:
import glob
from pathlib import Path

videos_root = Path('/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/videos')
videos_root_dirname = videos_root.parent
!mv $videos_root $videos_root_dirname"/videos_original"
!mkdir -p $videos_root
for video_name in videos:
    if (videos_root_dirname / 'videos_original' / video_name).exists():
        !ln $videos_root_dirname/'videos_original'/$video_name $videos_root/$video_name
    else:
        pattern = videos_root_dirname / 'videos_original' / (Path(video_name).name[:-1] + '*')
        files = glob.glob(str(pattern))
        if len(files) != 1:
            # even trickier videos :S
            files = [i for i in files if Path(i).suffix != '.mp4']
        video_name_tricky = Path(files[0]).name
        !ln $videos_root_dirname/'videos_original'/$video_name_tricky $videos_root/$video_name

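The same fix can be written without shell escapes. This is a minimal sketch of a similar heuristic: try the exact name first, otherwise drop the last character of the name and look for a file sharing that prefix. Unlike the cell above, it does not guess when the match is ambiguous and reports the name instead. `find_source` and `link_videos` are hypothetical helpers.

```python
import glob
import os
from pathlib import Path


def find_source(video_name, src_dir):
    """Return the file in src_dir matching video_name, or None if unsure."""
    src_dir = Path(src_dir)
    exact = src_dir / video_name
    if exact.exists():
        return exact
    # drop the last character and look for files sharing that prefix
    prefix = glob.escape(str(src_dir / Path(video_name).name[:-1]))
    candidates = glob.glob(prefix + '*')
    return Path(candidates[0]) if len(candidates) == 1 else None


def link_videos(videos, src_dir, dst_dir):
    """Hard-link each video under its annotation name.

    Returns the names that could not be resolved unambiguously.
    """
    os.makedirs(dst_dir, exist_ok=True)
    unresolved = []
    for video_name in videos:
        source = find_source(video_name, src_dir)
        if source is None:
            unresolved.append(video_name)
        else:
            os.link(source, Path(dst_dir) / video_name)
    return unresolved
```

The returned list can be inspected by hand, which is usually feasible for a few dozen tricky videos.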
### [exceptional] 3?. Example for tricky cases

Sample code for when the videos are spread over multiple folders and we start from a unique video identifier without extension.

Given that we need to get this cruft done fast:

- We create hard links for the videos of interest. Basically, a DIY pre-processing replacement for a missing database connector in `video-utils`.

In [None]:
import os
import random

videos_root = '/home/escorciav/mnt/marla-scratch/datasets/mantis/videos'
video_dirnames = [
    f'{videos_root}/visual_favorable/',
    f'{videos_root}/visual_unfavorable/'
]

map_youtubeid2videos = {}
for dirname in video_dirnames:
    for basename in os.listdir(dirname):
        basename_noext = os.path.splitext(basename)[0]
        map_youtubeid2videos.update(
            [(basename_noext, {'basename': basename, 'root': dirname})]
        )

trial_video_dirname = f'{videos_root}/{trial_name}'
!mkdir -p $trial_video_dirname
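# NOTE: `video_list_file` is not defined in this notebook; point it to a text file with one video ID per line
with open(video_list_file, 'r') as fr, open(f'{trial_dirname}/missing-videos.txt', 'x') as fw: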
    for video_id in fr:
        video_id = video_id.strip()
        file_info = map_youtubeid2videos.get(video_id)
        if file_info:
            # make hard link
            basename, dirname = file_info['basename'], file_info['root']
            !ln $dirname/$basename $trial_video_dirname/$basename
        else:
            fw.write(f'{video_id}\n')
            
# video list is equivalent to listing the video folder of the trial ;)
video_list = os.listdir(trial_video_dirname)
# randomized to increase entropy
random.shuffle(video_list)
with open(f'{trial_dirname}/videos.txt', 'x') as fw:
    for i in video_list:
        fw.write(f'{i}\n')

## 4. Dumping frames

__Note__ from this point onwards there is no more dataset-specific code. The reason is that we need to track code changes with git and parameters in log files, so that we can go back if required.

1) Launch frame extraction with something similar to this:

```bash
python tools/batch_dump_frames.py \
  -i /mnt/scratch/moments-retrieval/didemo/videos.txt \
  -r /mnt/ssdscratch/datasets/didemo/videos/ \
  -o /mnt/ssdscratch/datasets/didemo/frames \
  -s /mnt/scratch/moments-retrieval/didemo/frame-dump.csv \
  -n 1 --verbose 5 --log INFO &> /mnt/scratch/moments-retrieval/didemo/frame-dump.log
```

For details about the meaning of the arguments run this:

```bash
python tools/batch_dump_frames.py -h
```
    
__Runtime notes__:

- Currently, this command is typed manually in tmux on a machine with 48 cores. Thus, don't forget to update the paths before executing it 😉

- How to fix `ModuleNotFoundError`?

    Devote a minute or two to understand the problem.

    In this case, there are two elegant solutions for this:

    a) Append the root folder of video-utils to the environment variable `PYTHONPATH`.

    ```bash
    cd [video-utils-folder]
    export PYTHONPATH=$PWD
    ```

    b) Send a pull-request to video-utils to make a package out of it ;)

    This is not to say that editing `sys.path` is incorrect. Indeed, that's the best way to get rid of the problem by sweeping it under the rug.

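Solution (a) can be checked from Python before launching a long job. This is a minimal sketch that assumes nothing about video-utils itself: it creates a throwaway module called `dummy_pkg_for_demo` (a made-up name) and shows that a fresh interpreter can only import it once its folder is on `PYTHONPATH`. `can_import` is a hypothetical helper.

```python
import os
import subprocess
import sys
import tempfile


def can_import(module_name, pythonpath=None):
    """Return True if a fresh interpreter can import module_name."""
    env = dict(os.environ)
    env.pop('PYTHONPATH', None)
    if pythonpath is not None:
        env['PYTHONPATH'] = str(pythonpath)
    result = subprocess.run(
        [sys.executable, '-c', f'import {module_name}'],
        env=env, cwd=tempfile.gettempdir(),
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


with tempfile.TemporaryDirectory() as repo_root:
    # stand-in for the root folder of the repository
    with open(os.path.join(repo_root, 'dummy_pkg_for_demo.py'), 'w') as fw:
        fw.write('VALUE = 1\n')
    without_path = can_import('dummy_pkg_for_demo')
    with_path = can_import('dummy_pkg_for_demo', pythonpath=repo_root)

print(without_path, with_path)
```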
2) count tricky videos

In [None]:
!grep "False" $trial_dirname/frame-dump.csv | wc -l

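For reference, a pure-Python counterpart of the `grep | wc -l` one-liner. It is a minimal sketch that only counts the lines containing the substring `False`, as `grep` does; it makes no assumption about the column layout of `frame-dump.csv`. `count_failed` is a hypothetical helper name.

```python
def count_failed(csv_file, needle='False'):
    """Count the lines of csv_file that contain needle."""
    with open(csv_file, 'r') as fr:
        return sum(needle in line for line in fr)
```

It could be called as `count_failed(f'{trial_dirname}/frame-dump.csv')`.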
### 4. monitor

__TLDR__: check info to monitor progress

Don't forget to update the variable `dirname` before executing it 😉

In [None]:
from pathlib import Path
dirname = Path('/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/frames/')
print(len([x for x in dirname.iterdir() if x.is_dir()]))
!wc -l $trial_dirname/videos.txt

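The same check can be turned into a single progress figure. A minimal sketch, assuming (as the cell above does) that the frame extraction creates one folder per video inside the output directory; `extraction_progress` is a hypothetical helper.

```python
from pathlib import Path


def extraction_progress(frames_dir, video_list_file):
    """Return (done, total): video folders found vs. videos listed."""
    done = sum(1 for x in Path(frames_dir).iterdir() if x.is_dir())
    with open(video_list_file, 'r') as fr:
        # ignore empty lines, e.g. a trailing newline at the end of the file
        total = sum(1 for line in fr if line.strip())
    return done, total
```

For example, `done, total = extraction_progress(dirname, f'{trial_dirname}/videos.txt')` followed by `print(f'{done}/{total}')`.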
### 4. debug

__TLDR__: check info below if you smell something fishy during frame extraction.

Small detour due to a bug in ffmpeg with multiple threads.

1) pick a small subset of 100 videos

In [None]:
!head -n 100 '{trial_dirname}/videos.txt' > '{trial_dirname}/videos-debug.txt'

ensure my unix knowledge is not rusty

In [None]:
!wc -l '{trial_dirname}/videos-debug.txt'
!head '{trial_dirname}/videos-debug.txt'

2) launch with single thread

```bash
python tools/batch_dump_frames.py \
    -i /mnt/scratch/datasets/mantis/trial-03/videos-debug.txt \
    -r /mnt/scratch/datasets/mantis/videos/trial-03/ \
    -o /mnt/ssdscratch/datasets/mantis/trial-03-debug-1 \
    -s /mnt/scratch/datasets/mantis/trial-03/frame-dump_video-debug_single-thread.csv \
    -n 1 --verbose 5 --log DEBUG
```

3) launch with multiple threads

```bash
python tools/batch_dump_frames.py \
    -i /mnt/scratch/datasets/mantis/trial-03/videos-debug.txt \
    -r /mnt/scratch/datasets/mantis/videos/trial-03/ \
    -o /mnt/ssdscratch/datasets/mantis/trial-03-debug-2 \
    -s /mnt/scratch/datasets/mantis/trial-03/frame-dump_video-debug_multi-thread.csv \
    -n -1 --verbose 5 --log DEBUG
```

4) compare number of frames per video over multiple runs

In [None]:
import os
from pathlib import Path
# NOTE: these folders must match the `-o` arguments of the two runs being compared
dirname1 = Path('/home/escorciav/mnt/marla-ssdscratch/datasets/mantis/trial-03-debug-single-thread-1/')
dirname2 = Path('/home/escorciav/mnt/marla-ssdscratch/datasets/mantis/trial-03-debug-single-thread-2/')
for k in os.listdir(dirname1):
    # same number of frames per video in both runs
    assert len(os.listdir(dirname1 / k)) == len(os.listdir(dirname2 / k)), f'frame count mismatch for {k}'

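An assertion stops at the first discrepancy. When hunting a bug it can help to list every video that differs. A minimal sketch, assuming one folder of frames per video in both runs; `frame_count_mismatches` is a hypothetical helper.

```python
import os
from pathlib import Path


def frame_count_mismatches(dirname1, dirname2):
    """Return {video: (n_frames_run1, n_frames_run2)} for videos that differ.

    A video missing from the second run is reported with a count of None.
    """
    dirname1, dirname2 = Path(dirname1), Path(dirname2)
    mismatches = {}
    for video in sorted(os.listdir(dirname1)):
        n1 = len(os.listdir(dirname1 / video))
        other = dirname2 / video
        n2 = len(os.listdir(other)) if other.is_dir() else None
        if n1 != n2:
            mismatches[video] = (n1, n2)
    return mismatches
```

An empty dictionary means that both runs produced the same number of frames for every video of the first run.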
## 5. Feature extraction

1) Dump list of videos that were successfully extracted

__Note__: if frames were extracted successfully for all the videos, skip the cell below and just run the following command:

```bash
cp videos.txt feature.txt
```

In [None]:
import os
import random
from pathlib import Path
dirname = Path(f'{trial_dirname}/frames')
video_list = os.listdir(dirname)
# randomized to increase entropy
random.shuffle(video_list)
with open(f'{trial_dirname}/feature.txt', 'x') as fw:
    for i in video_list:
        fw.write(f'{i}\n')

2) Split list above among GPUs

In [None]:
!wc -l '{trial_dirname}/feature.txt'
!split -n l/4 '{trial_dirname}/feature.txt' '{trial_dirname}/feature.txt.part'
# the next line is important because sometimes people split by bytes instead of lines ;)
!tail -n 2 '{trial_dirname}/feature.txt.part'*

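If `split` is not at hand, the list can be divided in Python. A minimal sketch, assuming contiguous and nearly equal chunks are good enough; the chunk boundaries are not guaranteed to match those of `split -n l/4`, which balances the parts by size in bytes while keeping lines whole. `split_lines` is a hypothetical helper.

```python
def split_lines(lines, n_parts):
    """Split lines into n_parts contiguous chunks of nearly equal length."""
    base, extra = divmod(len(lines), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # the first `extra` chunks get one additional line
        stop = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:stop])
        start = stop
    return chunks
```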
3) Command for feature extraction

```bash
python pthv_models.py \
  -j 11 -b 512 -if --reduce \
  --arch resnet152 --layer-index -2 -h5dn resnet152 \
  -r /mnt/ssdscratch/datasets/didemo/frames \
  -f /mnt/scratch/moments-retrieval/didemo/feature.txt.partaa \
  -o /mnt/ssdscratch/datasets/didemo/features-partaa &> /mnt/scratch/moments-retrieval/didemo/feature-extraction.log.partaa
```

__Reminders__:

- Don't forget to set the `--resize` optional argument if you did not resize the frames with ffmpeg.

__Runtime notes__:

- Currently, this command is typed manually in tmux on a single machine with 4 GPUs.

- Don't forget to update the paths before executing it 😉

- Don't forget to set `CUDA_VISIBLE_DEVICES`. The current version hard-codes the device to 0.

    ```bash
    export CUDA_VISIBLE_DEVICES=1
    ```
    
    Executing the above line tricks PyTorch into believing that GPU 1 is device 0.

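When the jobs are launched from Python rather than typed in tmux, the variable can be set per child process instead of being exported globally. A minimal sketch, using a trivial child command that only echoes the variable; it would have to be replaced with the real extraction command. `env_for_gpu` is a hypothetical helper.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_id):
    """Copy the current environment and expose a single physical GPU."""
    env = dict(os.environ)
    env['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    return env


# a CUDA program started with one of these environments would see
# the selected physical GPU as its device 0
seen = []
for gpu_id in range(4):
    out = subprocess.run(
        [sys.executable, '-c',
         'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'],
        env=env_for_gpu(gpu_id), stdout=subprocess.PIPE, check=True)
    seen.append(out.stdout.decode().strip())
print(seen)
```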
4) Pack HDF5s

__Warning__: Make sure batch size during this step is the same as in step 3.

```bash
python pack_features.py -h5dn resnet152 -b 512 \
  -d /mnt/ssdscratch/datasets/didemo/features-partaa \
  -i /mnt/scratch/moments-retrieval/didemo/feature.txt.partaa-img.csv \
  -o /mnt/ssdscratch/datasets/didemo/features-partaa.hdf5 &> /mnt/scratch/moments-retrieval/didemo/packed-features.log.partaa
```

_Note_: It's normal for the log file to be empty or to contain only useless warnings. The warnings would mean that you are not using the latest (or the appropriate?) branch.

5) Merge HDF5s

```bash
python merge_hdf5.py \
  --filename /mnt/ssdscratch/datasets/didemo/resnet152_5fps_320x240.h5 \
  --files /mnt/ssdscratch/datasets/didemo/features-parta*.hdf5 &> /mnt/scratch/moments-retrieval/didemo/merge-features.log
```

__[monitor]__ check that everything looks OK.

[credits](https://support.hdfgroup.org/HDF5/Tutor/cmdtoolview.html)

In [None]:
# NOTE: the path below must match the `--filename` passed to merge_hdf5.py in step 5
print('Number of videos')
!h5ls '/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/resnet_5fps_320x240.h5' | wc -l
print('Few examples')
!h5ls '/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/resnet_5fps_320x240.h5' | head -n 2
# To inspect the file
!h5dump -H -A 0 '/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/resnet_5fps_320x240.h5' | head
# To inspect a given group
# print('Content of a given group')
# !h5dump -g "/--8xIYGTgEQ" -H -A 0 '{dirname}/resnet_5fps_320x240.hdf5'

__[debug]__ 

TLDR: making sure the merge function did not make a mess.

In [None]:
import glob
import random
import h5py
import numpy as np

file1 = '/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/resnet_5fps_320x240.h5'
files = glob.glob('/home/escorciav/mnt/marla-ssdscratch/datasets/didemo/features-parta*.hdf5')
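# open in read-only mode; older h5py versions default to a writable mode
with h5py.File(file1, 'r') as f1:
    videos = list(f1.keys())
    random.shuffle(videos)
    video_name = videos[0]
    feat1 = f1[video_name][:]
found = False
for i in files:
    with h5py.File(i, 'r') as f2:
        if video_name in f2:
            # read the dataset into memory before comparing
            feat2 = f2[video_name][:]
            np.testing.assert_array_almost_equal(feat1, feat2)
            found = True
# a video that is absent from every part file would otherwise pass silently
assert found, f'{video_name} not found in any of the part files'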