# Feature extraction from frames

1) Preparing inputs for [video-utils](https://git.corp.adobe.com/escorcia/video-utils) tools

- Create a file with all the video names
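One way to build that list, as a minimal sketch. It assumes the raw videos sit in a single directory (the `video_dir` argument is hypothetical, since these notes do not say where the videos live) and writes one file name per line. The helper name `write_video_list` is also made up for illustration.

```python
from pathlib import Path


def write_video_list(video_dir, output_file):
    """Write the sorted names of all regular files in video_dir, one per line."""
    names = sorted(p.name for p in Path(video_dir).iterdir() if p.is_file())
    output = Path(output_file)
    output.parent.mkdir(parents=True, exist_ok=True)
    # A trailing newline keeps the file friendly to `split` and `wc -l`
    output.write_text(''.join(f'{name}\n' for name in names))
    return names
```

Calling `write_video_list('path/to/videos', 'data/raw/all_videos.txt')` would produce the file consumed by the `split` command.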

- Divide and conquer strategy

In [1]:
!split -d -n l/10 data/raw/all_videos.txt data/raw/videos-

- Double check commands

In [3]:
# !head -n 1 data/raw/videos-0*
!wc -l data/raw/videos-0*

  1068 data/raw/videos-00
  1063 data/raw/videos-01
  1065 data/raw/videos-02
  1064 data/raw/videos-03
  1064 data/raw/videos-04
  1065 data/raw/videos-05
  1064 data/raw/videos-06
  1063 data/raw/videos-07
  1063 data/raw/videos-08
  1063 data/raw/videos-09
 10642 total
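Line counts alone do not show that every video landed in exactly one shard. A stricter check could look like the sketch below. The `check_shards` helper is hypothetical, and it assumes the shards match `data/raw/videos-*` as produced by the `split` call above.

```python
import glob


def check_shards(all_file, shard_wildcard):
    """Return True if the shards, read in sorted order, reproduce all_file."""
    with open(all_file) as f:
        expected = f.read().splitlines()
    found = []
    # split -d numbers the shards, so sorted order matches the original order
    for shard in sorted(glob.glob(shard_wildcard)):
        with open(shard) as f:
            found.extend(f.read().splitlines())
    return found == expected
```

For this setup the call would be `check_shards('data/raw/all_videos.txt', 'data/raw/videos-*')`.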


2) Double check that `adobe/extract_frames.[sh/condor]` are pointing to the appropriate folders

3) Launch job `condor_submit adobe/extract_frames.condor`

4) Monitor

In [1]:
import os, glob
dirname = 'data/interim/didemo/frame_extraction/'
log_wildcard = os.path.join(dirname, '*_300h.log')
csv_wildcard = os.path.join(dirname, '*_300h.csv')
num_jobs = len(list(glob.glob(log_wildcard)))
num_summary_files = len(list(glob.glob(csv_wildcard)))
print(f'Completed jobs [{num_summary_files}/{num_jobs}]')
!tail -n 2 $log_wildcard

Completed jobs [10/10]
==> data/interim/didemo/frame_extraction/0_300h.log <==
[Parallel(n_jobs=-1)]: Done 1068 out of 1068 | elapsed: 27.6min finished
2018-07-12 07:37:05,155 INFO     [batch_dump_frames.py:53]: Creating summary file...

==> data/interim/didemo/frame_extraction/1_300h.log <==
[Parallel(n_jobs=-1)]: Done 1063 out of 1063 | elapsed: 27.4min finished
2018-07-12 07:36:48,806 INFO     [batch_dump_frames.py:53]: Creating summary file...

==> data/interim/didemo/frame_extraction/2_300h.log <==
[Parallel(n_jobs=-1)]: Done 1065 out of 1065 | elapsed: 29.9min finished
2018-07-12 07:40:33,894 INFO     [batch_dump_frames.py:53]: Creating summary file...

==> data/interim/didemo/frame_extraction/3_300h.log <==
[Parallel(n_jobs=-1)]: Done 1064 out of 1064 | elapsed: 60.4min finished
2018-07-12 08:09:55,599 INFO     [batch_dump_frames.py:53]: Creating summary file...

==> data/interim/didemo/frame_extraction/4_300h.log <==
[Parallel(n_jobs=-1)]: Done 1064 out of 1064 | elapsed: 60.4m

Extracting DiDeMo frames at 5 FPS and 320x240 resolution took roughly an hour on ten multi-core machines.

5) Check if all videos were extracted correctly

In [2]:
import os, glob
import pandas as pd
dirname = 'data/interim/didemo/frame_extraction/'
csv_wildcard = os.path.join(dirname, '*_300h.csv')
df = []
for i in glob.glob(csv_wildcard):
    df.append(pd.read_csv(i, header=None))
df = pd.concat(df, axis=0, ignore_index=True)
print('Number of buggy videos', (df.loc[:, 1] == False).sum())
df.loc[df[1] == False, 0]

Number of buggy videos 0


Series([], Name: 0, dtype: object)

TODO: Apparently some videos appear twice in the dataset with different extensions. We double-checked that both copies correspond to the same content, so we ignore this error.

6) Merge all tar-files:

I used the file `adobe/merge_frames.sh`, which should do something close to this:

```bash
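set -x
# Positional arguments override the defaults
output_dir=${1:-/mnt/ssd/tmp/didemo_prep}
prefix=${2:-frames_300h}

# Start from a clean output directory
if [ -d "$output_dir" ]; then rm -rf "$output_dir"; fi
mkdir -p "$output_dir/all" &&
# Unpack every matching tar-file found in the home directory
for f in $(find ~/ -maxdepth 1 -name "$prefix*"); do
  tar -xf "$f" -C "$output_dir";
done  &&
# Move the contents of each extracted folder into all/
for f in $(find "$output_dir" -maxdepth 1 -name "$prefix*"); do
  mv "$f"/* "$output_dir/all/";
done  &&
cd "$output_dir"  &&
# Dry run: this only prints the rmdir commands
for f in $(find . -maxdepth 1 -name "$prefix*"); do
  echo rmdir "$f";
done  &&
mv all frames &&
# The final cleanup is also only echoed
tar -cf ~/didemo_"$prefix".tar frames/ && cd ~ && echo rm -rf "$output_dir"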
```

This took less than TBD minutes (one run took 20 minutes, but I have changed the code since then).

7) Sanity checks

__Note__: If not all the videos were dumped, it is important to generate a text file with the list of videos that still need processing.
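A minimal sketch of how such a list could be generated. It assumes that the frames of each video live in a folder named after the video file (an assumption, not something these notes confirm); the helper name and its arguments are hypothetical.

```python
from pathlib import Path


def videos_to_reprocess(video_list, frames_dir, output_file):
    """Write the videos whose frame folder is missing or empty to output_file."""
    frames_dir = Path(frames_dir)
    pending = []
    for name in Path(video_list).read_text().splitlines():
        if not name:
            continue  # skip blank lines
        folder = frames_dir / name
        # Assumption: frames for a video live in a folder named after the video
        if not folder.is_dir() or not any(folder.iterdir()):
            pending.append(name)
    Path(output_file).write_text(''.join(f'{name}\n' for name in pending))
    return pending
```

The resulting file can then be split and fed to the extraction job in the same way as the full list.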

- Making sure that frames are not empty

In [8]:
from pathlib import Path
import os
dirname = Path('/mnt/ssd/tmp/didemo/frames/frames/')
for i in dirname.iterdir():
    assert i.is_dir()  # call the method; the bare attribute is always truthy
    assert len(os.listdir(i)) > 0

[note] A short aside. Feel free to ignore this.

In [1]:
import pandas as pd
import os

filename = 'data/interim/didemo/frame_extraction/all_videos.txt'
df = pd.read_csv(filename, header=None)
print('Count videos taking care of extensions')
print(len(df), len(df[0].unique()), len(df) - len(df[0].unique()))
print('Count videos after removing extension')
df2 = df[0].apply(lambda x: os.path.splitext(x)[0])
print(len(df2), len(df2.unique()), len(df2) - len(df2.unique()))
# The same video appears with different extensions :S
# TODO: compare video lengths
# TODO: check video content manually

Count videos taking care of extensions
10642 10642 0
Count videos after removing extension
10642 10464 178
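To follow up on the TODOs, it helps to know which videos are duplicated, not only how many. Here is a sketch using only the standard library. `duplicated_stems` is a hypothetical helper that takes an iterable of video names, such as the lines of `all_videos.txt`.

```python
import os
from collections import defaultdict


def duplicated_stems(video_names):
    """Map each stem that occurs with several extensions to those extensions."""
    extensions = defaultdict(set)
    for name in video_names:
        stem, ext = os.path.splitext(name)
        extensions[stem].add(ext)
    # Keep only the stems seen with more than one extension
    return {stem: sorted(exts) for stem, exts in extensions.items() if len(exts) > 1}
```

The returned dictionary gives the pairs of files to compare by length or by eye.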


[conclusion] @escorcia decided to follow the crowd because the community considers this "an engineering practice". It would be nice to dig deeper into this, but we don't have the bandwidth. We decided to replicate Lisa's strategy.

# ResNet-152

1) Generate list of videos or images

a. If frame extraction succeeded for all the videos, you can reuse the same lists

```bash
ln -s $(pwd)/data/interim/didemo/frame_extraction/videos-* data/interim/didemo/resnet_extraction/
```

TODO: sample test of code

2) Double check that `adobe/extract_resnet.[sh/condor]` are pointing to the appropriate folders

3) Launch job `condor_submit adobe/extract_resnet.condor`

4) Monitor

- Check nodes

!grep "slot" log/[prefix]

- Check progress

In [1]:
!tail -n 2 data/interim/didemo/resnet_extraction/*.log

==> data/interim/didemo/resnet_extraction/0.log <==
2018-07-12 23:27:05,743:INFO:Namespace(batch_size=256, compression='gzip', compression_rate=9, csvfile='/mnt/ilcompf9d1/user/escorcia/moments-retrieval/data/interim/didemo/resnet_extraction/videos-00-img.csv', dataset_name='resnet152', dirname='/mnt/ssd/tmp/didemo/resnet152-0', filename='/mnt/ssd/tmp/didemo/resnet152-0.h5', loglevel='INFO')
2018-07-12 23:28:47,820:INFO:Successful execution

==> data/interim/didemo/resnet_extraction/1.log <==
2018-07-12 23:26:34,063:INFO:Namespace(batch_size=256, compression='gzip', compression_rate=9, csvfile='/mnt/ilcompf9d1/user/escorcia/moments-retrieval/data/interim/didemo/resnet_extraction/videos-01-img.csv', dataset_name='resnet152', dirname='/mnt/ssd/tmp/didemo/resnet152-1', filename='/mnt/ssd/tmp/didemo/resnet152-1.h5', loglevel='INFO')
2018-07-12 23:28:21,375:INFO:Successful execution

==> data/interim/didemo/resnet_extraction/2.log <==
2018-07-12 23:47:16,164:INFO:Namespace(batch_size=256, c
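The frame-extraction monitor counted summary files; for this step a comparable count can come from the logs themselves. This is a sketch, assuming that a finished job ends its log with the `Successful execution` line seen above. The helper name is hypothetical.

```python
import glob


def count_successful(log_wildcard, marker='Successful execution'):
    """Return (finished, total) for the logs matching log_wildcard."""
    logs = glob.glob(log_wildcard)
    finished = 0
    for log in logs:
        with open(log) as f:
            lines = f.read().splitlines()
        # A job counts as finished when its last line carries the marker
        if lines and marker in lines[-1]:
            finished += 1
    return finished, len(logs)
```

Here the call would be `count_successful('data/interim/didemo/resnet_extraction/*.log')`.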

5) Merging HDF5s

In [3]:
import glob
import h5py

filename = '/mnt/ssd/tmp/didemo/features/resnet152_5fps.h5'
persistent_file = 'data/interim/didemo/resnet152/320x240_5fps.h5'
wildcard = '/mnt/ilcompf9d1/user/escorcia/resnet152-*.h5'

with h5py.File(filename, 'w') as fid:
    for file_i in glob.glob(wildcard):
        with h5py.File(file_i, 'r') as fr:
            for _, source_group in fr.items():
                fr.copy(source_group, fid)

6) Coarse average pool per chunk

In [None]:
import h5py
import numpy as np

FPS = 5
CHUNK_SIZE = 5  # seconds
NUM_CHUNKS = 6
filename = '/mnt/ssd/tmp/didemo/features/resnet152_320x240_pooled.h5'
dense_file = f'/mnt/ssd/tmp/didemo/features/resnet152_320x240_{FPS}fps.h5'

with h5py.File(filename, 'w') as fw, h5py.File(dense_file, 'r') as fr:
    for video, group_src in fr.items():
        # ensure compatibility with MCN code
        # MCN code was written for a hdf5 per feature type
        # TODO: deprecate this
        assert len(list(group_src.keys())) == 1
        for name, v in group_src.items():            
            feat = v[:]
            pooled_feat = np.zeros((NUM_CHUNKS, feat.shape[1]), dtype=feat.dtype)
            for i in range(NUM_CHUNKS):
                start_ind = i * CHUNK_SIZE * FPS
                end_ind = min(start_ind + CHUNK_SIZE * FPS, len(feat))
                if start_ind >= len(feat):
                    break
                pooled_feat[i, :] = feat[start_ind:end_ind, :].mean(axis=0)
            fw.create_dataset(video, data=pooled_feat, chunks=True)
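With `FPS = 5` and `CHUNK_SIZE = 5`, each chunk averages 25 consecutive frames, so six chunks cover the first 150 frames (30 seconds). The index arithmetic of the cell above can be isolated in a small helper for inspection. `chunk_bounds` is a hypothetical name and not part of the original code.

```python
def chunk_bounds(num_frames, fps=5, chunk_size=5, num_chunks=6):
    """Return the (start, end) frame range averaged for each non-empty chunk."""
    frames_per_chunk = fps * chunk_size
    bounds = []
    for i in range(num_chunks):
        start = i * frames_per_chunk
        if start >= num_frames:
            break  # the remaining chunks stay zero-filled
        bounds.append((start, min(start + frames_per_chunk, num_frames)))
    return bounds
```

For a 60-frame video this yields `[(0, 25), (25, 50), (50, 60)]`: the last chunk is a partial average and the remaining three chunks stay at zero. For videos longer than 150 frames, the frames past index 150 are ignored.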

[debug] Making sure that copy is correct

In [8]:
import h5py
import numpy as np

file1 = '/mnt/ilcompf9d1/user/escorcia/resnet152-0.h5'
with h5py.File(filename, 'r') as f1, h5py.File(file1, 'r') as f2:
    f1_keys = sorted(list(f1.keys()))
    f2_keys = sorted(list(f2.keys()))
    assert f1_keys == f2_keys
    for i in f1.keys():
        f1_i_keys = sorted(list(f1[i].keys()))
        f2_i_keys = sorted(list(f2[i].keys()))
        assert f1_i_keys == f2_i_keys
        for j, v1 in f1[i].items():
            x1 = v1[:]
            x2 = f2[i][j][:]
            np.testing.assert_array_almost_equal(x1, x2)

In [12]:
import h5py

filename = '/mnt/ilcompf9d1/user/escorcia/resnet152-0.h5'
# Use a context manager so the file handle is closed after reading
with h5py.File(filename, 'r') as fid:
    for k, v in fid.items():
        print(k, v['resnet152'][:].mean())

12572907@N00_2920156258_34d144bf1e.avi 0.36933506
12644997@N04_4936603071_9a12b8cc5d.mp4 0.30536875
14284621@N06_3944006339_85416993a7.mov 0.3719792
16483298@N00_4331364236_f8e7cc40e8.avi 0.42610207
16483298@N00_4893184599_197570445d.mp4 0.3697194
26292851@N04_4497646769_c867658047.mp4 0.4500945
50072196@N00_8243844603_e9a8bf01fe.mov 0.39579678
51727341@N00_4913494887_25ba94c153.mp4 0.45705098
56424258@N03_8926842688_91c14724ee. 0.44437027
67211380@N00_2867483360_731aa9cab3.avi 0.33976325


# [legacy] VGG feature extraction

Debugging because the features were different from those provided by MCN

In [2]:
import h5py

file1 = '/mnt/ilcompf9d1/user/escorcia/localizing-moments/data/average_fc7.h5'
careful = {}
with h5py.File(file1, 'r') as f1:
    for k, v in f1.items():
        feat = v[:]
        if feat.sum() != 0:
            print(k)
            break

10015567@N08_3655084291_d8b58466fa.mov


Comparing extracted features with public features

In [1]:
import h5py

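# Earlier sample, kept for reference; the assignment below is the one in use
# filename = '/mnt/ilcompf9d1/user/escorcia/tmp_didemo/fc7_subsample10_stock_44971549@N06_8077235126_bc346362b8.mov.h5'
filename = '/mnt/ilcompf9d1/user/escorcia/tmp_didemo/fc7_subsample10_stock_10015567@N08_3655084291_d8b58466fa.mov.h5'
!ls -la $filename
# Open read-only explicitly; the default mode depends on the h5py version
with h5py.File(filename, 'r') as fid:
    feat = fid['features'][:]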
print(feat.shape)
print(feat.max(), feat.min())

-rw-r--r-- 1 escorcia 5001 4933094 Jul 11 01:52 /mnt/ilcompf9d1/user/escorcia/tmp_didemo/fc7_subsample10_stock_10015567@N08_3655084291_d8b58466fa.mov.h5
(150, 4096)
8.976858139038086 0.0


In [3]:
import h5py
import numpy as np

file1 = '/mnt/ilcompf9d1/user/escorcia/localizing-moments/data/average_fc7.h5'
file2 = '/mnt/ilcompf9d1/user/escorcia/tmp_didemo/average_fc7.h5'

with h5py.File(file1, 'r') as f1, h5py.File(file2, 'r') as f2:
    video_id = list(f2.keys())[0]
    feat2 = f2[video_id][:]
    feat1 = f1[video_id][:]
    np.testing.assert_array_almost_equal(feat1, feat2)

AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 76.025390625%)
 x: array([[0.080325, 0.      , 0.618881, ..., 0.057373, 0.698398, 1.784408],
       [0.781676, 0.014886, 0.20887 , ..., 0.      , 0.32405 , 0.338488],
       [1.154795, 0.534264, 0.447821, ..., 0.      , 0.521238, 0.16278 ],...
 y: array([[0.907936, 0.09756 , 0.31482 , ..., 0.022781, 0.594315, 0.259575],
       [1.023138, 0.467035, 0.362644, ..., 0.15047 , 1.121736, 0.08118 ],
       [0.813921, 0.272168, 0.244465, ..., 0.245621, 1.05744 , 0.053195],...