<figure>
  <img src="https://github.com/v-iashin/video_features/raw/master/docs/_assets/i3d.png" width="300" />
</figure>

The `video_features` library extracts features from raw videos
and can run in parallel across multiple GPUs.
It supports several extractors that capture visual appearance,
optical flow, and audio features. For more details, see the
[GitHub repository](https://github.com/v-iashin/video_features).

More feature extraction examples are available as Colab notebooks:
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Zd7r8uKGLGSxlil4PPnXk_4I3KOsjPpO?usp=sharing) – CLIP
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HUlYcOJf_dArOcAaR9jaQHuM5CAZiNZc?usp=sharing) – S3D
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LKoytZmNxtC-EuCp7pHDM6sFvK1XdwlW?usp=sharing) – I3D
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1csJgkVQ3E2qOyVlcOM-ACHGgPBBKwE2Y?usp=sharing) – R(2+1)D
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18I95Rn1B3a2ISfD9b-o4o93m3XuHbcIY?usp=sharing) – RAFT
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17VLdf4abQT2eoMjc6ziJ9UaRaOklTlP0?usp=sharing) – ResNet
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1r_8OnmwXKwmH0n4RxBfuICVBgpbJt_Fs?usp=sharing) – VGGish
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/16QEwNMqiwqlmBhJCJmNeeEP8gitom0I-?usp=sharing) – [timm](https://huggingface.co/timm) models

In [None]:
! git clone https://github.com/v-iashin/video_features.git
! pip install omegaconf==2.0.6

Cloning into 'video_features'...
remote: Enumerating objects: 1401, done.
remote: Counting objects: 100% (512/512), done.
remote: Compressing objects: 100% (222/222), done.
remote: Total 1401 (delta 311), reused 413 (delta 264), pack-reused 889
Receiving objects: 100% (1401/1401), 288.67 MiB | 14.05 MiB/s, done.
Resolving deltas: 100% (725/725), done.
Updating files: 100% (94/94), done.
Collecting omegaconf==2.0.6
  Downloading omegaconf-2.0.6-py3-none-any.whl (36 kB)
Installing collected packages: omegaconf
Successfully installed omegaconf-2.0.6


In [None]:
%cd video_features

/content/video_features


In [2]:
from models.i3d.extract_i3d import ExtractI3D
from utils.utils import build_cfg_path
from omegaconf import OmegaConf
import torch
import os
import glob
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'
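# Query the GPU name only when CUDA is available; the call raises on CPU-only machines
torch.cuda.get_device_name(0) if device == 'cuda' else 'cpu'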

'NVIDIA RTX A5000'

In [None]:
# Select the feature type
feature_type = 'i3d'

# Load and patch the config
args = OmegaConf.load(build_cfg_path(feature_type))
args.video_paths = ['./sample/v_GGSY1Qvo990.mp4']
# args.show_pred = True
# args.stack_size = 24
# args.step_size = 24
# args.extraction_fps = 25
args.flow_type = 'raft'
# args.streams = 'flow'

# Load the model
extractor = ExtractI3D(args)

# Extract features
for video_path in args.video_paths:
    print(f'Extracting for {video_path}')
    feature_dict = extractor.extract(video_path)
    # Print each entry's key, shape, and values
    for k, v in feature_dict.items():
        print(k)
        print(v.shape)
        print(v)

Extracting for ./sample/v_GGSY1Qvo990.mp4
rgb
(5, 1024)
[[0.08103889 0.21957842 0.05395147 ... 0.08913269 0.23047689 0.99085313]
 [0.0409274  0.24209626 0.06408894 ... 0.025497   0.29888856 0.7770645 ]
 [0.12468136 0.25410837 0.14176841 ... 0.16713181 0.18788038 0.68860632]
 [0.14245597 0.27374679 0.17478532 ... 0.06249548 0.1518133  0.22951038]
 [0.21149462 0.18290371 0.27646333 ... 0.14340422 0.24316043 0.07378176]]
flow
(5, 1024)
[[2.65111979e-02 3.36236879e-02 7.64859915e-02 ... 4.90684528e-03
  2.16035038e-01 1.64430283e-04]
 [4.73523773e-02 3.64455394e-02 3.65564190e-02 ... 9.21847001e-02
  1.54263407e-01 4.10271697e-02]
 [7.00380579e-02 3.25835533e-02 2.63346955e-02 ... 1.47759795e-01
  4.30818908e-02 2.66177085e-05]
 [5.56781664e-02 2.76811384e-02 4.36474644e-02 ... 2.31218822e-02
  5.60098467e-03 1.34896953e-02]
 [3.38178650e-02 4.64935936e-02 2.61394493e-02 ... 1.90966100e-01
  5.25541417e-02 6.30479259e-03]]
fps
()
19.62
timestamps_ms
(5,)
[ 3261.9775739   6523.95514781  978
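
The printed `timestamps_ms` values look consistent with each feature being stamped at the end of its input window: with a 64-frame window and a 64-frame step at 19.62 fps, the first window ends at 64 / 19.62 s ≈ 3261.98 ms. The sketch below reproduces that arithmetic. It is inferred from the output above rather than taken from the library's documentation, and `window_end_timestamps_ms` is a hypothetical helper, not part of `video_features`.

```python
def window_end_timestamps_ms(num_windows, stack_size, step_size, fps):
    """Hypothetical helper: end-of-window time in ms for each sliding window.

    Assumes window i covers frames [i * step_size, i * step_size + stack_size)
    and is stamped at the time of its last frame boundary.
    """
    return [(stack_size + i * step_size) / fps * 1000.0 for i in range(num_windows)]


# Values taken from the cell output above: 5 features at 19.62 fps.
# stack_size=64 and step_size=64 are assumed to be the i3d config defaults.
stamps = window_end_timestamps_ms(num_windows=5, stack_size=64, step_size=64, fps=19.62)
print([round(t, 2) for t in stamps])
```

If the assumption holds, the first two values match the `3261.9775739` and `6523.95514781` shown in the truncated output.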

In [None]:
! pip freeze

absl-py==1.4.0
aiohttp==3.9.1
aiosignal==1.3.1
alabaster==0.7.16
albumentations==1.3.1
altair==4.2.2
anyio==3.7.1
appdirs==1.4.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array-record==0.5.0
arviz==0.15.1
astropy==5.3.4
astunparse==1.6.3
async-timeout==4.0.3
atpublic==4.0
attrs==23.2.0
audioread==3.0.1
autograd==1.6.2
Babel==2.14.0
backcall==0.2.0
beautifulsoup4==4.11.2
bidict==0.22.1
bigframes==0.19.1
bleach==6.1.0
blinker==1.4
blis==0.7.11
blosc2==2.0.0
bokeh==3.3.3
bqplot==0.12.42
branca==0.7.0
build==1.0.3
CacheControl==0.13.1
cachetools==5.3.2
catalogue==2.0.10
certifi==2023.11.17
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
chex==0.1.7
click==8.1.7
click-plugins==1.1.1
cligj==0.7.2
cloudpickle==2.2.1
cmake==3.27.9
cmdstanpy==1.2.0
colorcet==3.0.1
colorlover==0.3.0
colour==0.1.5
community==1.0.0b1
confection==0.1.4
cons==0.4.6
contextlib2==21.6.0
contourpy==1.2.0
cryptography==41.0.7
cufflinks==0.17.3
cupy-cuda12x==12.2.0
cvxopt==1.3.2
cvxpy==1.3.2
cycler==0.12.1
c

In [3]:
# --- Configuration ---
VIDEO_FOLDER = './videos'  # Folder containing your .mp4 rally videos
OUTPUT_FOLDER = './rally_features_i3d_stride_4' # Where to save the .npy files
FEATURE_TYPE = 'i3d' # Keep this as 'i3d'

# --- Create Output Directory ---
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# --- Load and Configure Feature Extractor ---

# Load the base config for 'i3d'
args = OmegaConf.load(build_cfg_path(FEATURE_TYPE))

# --- Crucial Settings for ActionFormer Compatibility ---
# 1. stack_size: number of frames in each I3D *input* window. The default is 64,
#    which is standard for I3D Kinetics models. It is set to 16 here so that each
#    feature vector covers 16 frames; the output stride is controlled by step_size.
#    Moving away from the default might make the pre-trained model perform poorly.
args.stack_size = 16 # 16-frame input window (default: 64)

# 2. step_size: Set to 4 to get 1 feature vector per 4 frames.
args.step_size = 4 # CRITICAL for ActionFormer stride=4 requirement

# 3. streams: Use default (null) for two-stream (RGB + Flow) -> 2048D output
args.streams = None # Ensures both rgb and flow are extracted

# 4. extraction_fps: Use default (null) to keep original video FPS.
args.extraction_fps = None

# 5. flow_type: Default 'raft' is fine.
args.flow_type = 'raft'

# 6. on_extraction: Set to 'print' (or None if that works) so the script
#    *returns* the dictionary, allowing us to manually concatenate and save.
#    DO NOT use 'save_numpy' here, as it might only save one stream.
args.on_extraction = 'print' # Or try None if print is too verbose

# 7. device: Set your computation device
args.device = device

# 8. output_path: This argument might be used internally by the extractor
#    if 'on_extraction' was set to save, but we'll define our own saving path.
#    Set it anyway in case the extractor uses it for temporary files.
args.output_path = os.path.join(OUTPUT_FOLDER, "temp_extractor_out") # Or args.tmp_path

# --- Instantiate the Extractor ---
print("Initializing I3D Extractor...")
extractor = ExtractI3D(args)
print("Extractor Initialized.")

# --- Find Video Files ---
video_extensions = ["*.mp4", "*.avi", "*.mov", "*.mkv"] # Add other extensions if needed
video_paths = []
for ext in video_extensions:
    video_paths.extend(glob.glob(os.path.join(VIDEO_FOLDER, ext)))

if not video_paths:
    # Raise instead of calling exit(), which can stop the notebook kernel
    raise FileNotFoundError(f"No video files found in {VIDEO_FOLDER}")

print(f"Found {len(video_paths)} videos to process.")

# --- Process Each Video ---
for video_path in video_paths:
    print(f"\nProcessing: {video_path}")
    try:
        # Extract features - this returns a dictionary
        feature_dict = extractor.extract(video_path)

        # Check if extraction was successful and returned expected keys
        if feature_dict and 'rgb' in feature_dict and 'flow' in feature_dict:
            rgb_features = feature_dict['rgb']
            flow_features = feature_dict['flow']

            # --- Verification ---
            print(f"  RGB shape: {rgb_features.shape}")   # Should be (T, 1024)
            print(f"  Flow shape: {flow_features.shape}")  # Should be (T, 1024)
            if rgb_features.shape[0] != flow_features.shape[0]:
                print(f"  Warning: Mismatch in temporal length between RGB ({rgb_features.shape[0]}) and Flow ({flow_features.shape[0]}) for {video_path}. Skipping concatenation.")
                continue
            if rgb_features.shape[1] != 1024 or flow_features.shape[1] != 1024:
                print(f"  Warning: Unexpected feature dimension for {video_path}. RGB: {rgb_features.shape[1]}, Flow: {flow_features.shape[1]}. Expected 1024. Skipping.")
                continue

            # --- Concatenate RGB and Flow Features ---
            # Result shape: (T, 2048) - This is what ActionFormer needs
            combined_features = np.concatenate((rgb_features, flow_features), axis=1)
            print(f"  Combined shape: {combined_features.shape}") # Should be (T, 2048)

            # --- Save to .npy file ---
            video_basename = os.path.basename(video_path)
            video_filename_no_ext = os.path.splitext(video_basename)[0]
            output_npy_filename = f"{video_filename_no_ext}.npy"
            output_filepath = os.path.join(OUTPUT_FOLDER, output_npy_filename)

            np.save(output_filepath, combined_features)
            print(f"  Successfully saved features to: {output_filepath}")

        else:
            print(f"  Warning: Feature extraction failed or did not return 'rgb' and 'flow' keys for {video_path}.")

    except Exception as e:
        print(f"  Error processing {video_path}: {e}")

print("\nFeature extraction completed.")

Initializing I3D Extractor...
Extractor Initialized.
Found 315 videos to process.

Processing: ./videos/vid_766.mp4
  RGB shape: (62, 1024)
  Flow shape: (62, 1024)
  Combined shape: (62, 2048)
  Successfully saved features to: ./rally_features_i3d_stride_4/vid_766.npy

Processing: ./videos/vid_1020.mp4
  RGB shape: (61, 1024)
  Flow shape: (61, 1024)
  Combined shape: (61, 2048)
  Successfully saved features to: ./rally_features_i3d_stride_4/vid_1020.npy

Processing: ./videos/vid_1019.mp4
  RGB shape: (59, 1024)
  Flow shape: (59, 1024)
  Combined shape: (59, 2048)
  Successfully saved features to: ./rally_features_i3d_stride_4/vid_1019.npy

Processing: ./videos/vid_1018.mp4
  RGB shape: (58, 1024)
  Flow shape: (58, 1024)
  Combined shape: (58, 2048)
  Successfully saved features to: ./rally_features_i3d_stride_4/vid_1018.npy

Processing: ./videos/vid_1017.mp4
  RGB shape: (36, 1024)
  Flow shape: (36, 1024)
  Combined shape: (36, 2048)
  Successfully saved features to: ./rally_featu
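
Before passing the saved files to a downstream model, it can help to reload them and confirm that each one is a finite `(T, 2048)` array. The following is a minimal sketch of such a check; `check_feature_file` is a hypothetical helper, and the demo runs on a synthetic file so that it works without the real outputs.

```python
import os
import tempfile

import numpy as np


def check_feature_file(path, expected_dim=2048):
    """Hypothetical helper: load one .npy file and validate its shape and values."""
    feats = np.load(path)
    if feats.ndim != 2 or feats.shape[1] != expected_dim:
        raise ValueError(f"{path}: expected (T, {expected_dim}), got {feats.shape}")
    if not np.isfinite(feats).all():
        raise ValueError(f"{path}: contains NaN or Inf values")
    return feats.shape[0]  # number of temporal steps T


# Demo on a synthetic array shaped like the first saved file above, (62, 2048).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.npy")
    np.save(demo_path, np.zeros((62, 2048), dtype=np.float32))
    num_steps = check_feature_file(demo_path)
    print(num_steps)
```

To check the real outputs, the same function could be called on every path returned by `glob.glob(os.path.join(OUTPUT_FOLDER, "*.npy"))`.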