<figure>
  <img src="https://github.com/v-iashin/video_features/raw/master/docs/_assets/i3d.png" width="300" />
</figure>

The `video_features` library extracts features from raw videos
and can run in parallel across multiple GPUs.
It supports several extractors that capture visual appearance,
optical flow, and audio features. See the
[GitHub repository](https://github.com/v-iashin/video_features) for details.

More feature extraction examples are available as Colab notebooks:
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Zd7r8uKGLGSxlil4PPnXk_4I3KOsjPpO?usp=sharing) – CLIP
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HUlYcOJf_dArOcAaR9jaQHuM5CAZiNZc?usp=sharing) – S3D
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LKoytZmNxtC-EuCp7pHDM6sFvK1XdwlW?usp=sharing) – I3D
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1csJgkVQ3E2qOyVlcOM-ACHGgPBBKwE2Y?usp=sharing) – R(2+1)D
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18I95Rn1B3a2ISfD9b-o4o93m3XuHbcIY?usp=sharing) – RAFT
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17VLdf4abQT2eoMjc6ziJ9UaRaOklTlP0?usp=sharing) – ResNet
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1r_8OnmwXKwmH0n4RxBfuICVBgpbJt_Fs?usp=sharing) – VGGish

In [1]:
! git clone https://github.com/v-iashin/video_features.git
! pip install omegaconf==2.0.6

Cloning into 'video_features'...
remote: Enumerating objects: 1299, done.
remote: Counting objects: 100% (409/409), done.
remote: Compressing objects: 100% (187/187), done.
remote: Total 1299 (delta 254), reused 314 (delta 206), pack-reused 890
Receiving objects: 100% (1299/1299), 288.63 MiB | 17.72 MiB/s, done.
Resolving deltas: 100% (668/668), done.
Updating files: 100% (177/177), done.
Collecting omegaconf==2.0.6
  Downloading omegaconf-2.0.6-py3-none-any.whl (36 kB)
Installing collected packages: omegaconf
Successfully installed omegaconf-2.0.6


In [2]:
%cd video_features

/content/video_features


In [3]:
from models.i3d.extract_i3d import ExtractI3D
from utils.utils import build_cfg_path
from omegaconf import OmegaConf
import torch

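device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Show the GPU name when one is available; fall back to 'cpu' otherwise
torch.cuda.get_device_name(0) if device == 'cuda' else device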

'Tesla T4'

In [4]:
# Select the feature type
feature_type = 'i3d'

# Load and patch the config
args = OmegaConf.load(build_cfg_path(feature_type))
args.video_paths = ['./sample/v_GGSY1Qvo990.mp4']
# args.show_pred = True
# args.stack_size = 24
# args.step_size = 24
# args.extraction_fps = 25
args.flow_type = 'raft' # 'pwc' is not supported on Google Colab (cupy version mismatch)
# args.streams = 'flow'

# Load the model
extractor = ExtractI3D(args)

# Extract features
for video_path in args.video_paths:
    print(f'Extracting for {video_path}')
    feature_dict = extractor.extract(video_path)
    # Print each key with the shape and contents of its array
    for k, v in feature_dict.items():
        print(k)
        print(v.shape)
        print(v)

Extracting for ./sample/v_GGSY1Qvo990.mp4
rgb
(5, 1024)
[[0.081039   0.21957852 0.05395157 ... 0.08913279 0.23047704 0.99085295]
 [0.0409274  0.24209625 0.06408907 ... 0.02549688 0.29888833 0.77706397]
 [0.12468125 0.25410843 0.14176832 ... 0.16713159 0.18787999 0.68860656]
 [0.14245594 0.27374679 0.17478532 ... 0.0624956  0.15181327 0.2295101 ]
 [0.21149459 0.18290374 0.27646333 ... 0.1434042  0.2431604  0.0737819 ]]
flow
(5, 1024)
[[2.65220664e-02 3.38259302e-02 7.63518587e-02 ... 4.82968241e-03
  2.16032773e-01 1.81033640e-04]
 [4.73029651e-02 3.65159996e-02 3.64766978e-02 ... 9.22304541e-02
  1.53801143e-01 4.10896242e-02]
 [7.00272322e-02 3.24257798e-02 2.63161156e-02 ... 1.47356346e-01
  4.26828787e-02 2.54752940e-05]
 [5.56684062e-02 2.77553443e-02 4.36822437e-02 ... 2.31014378e-02
  5.64269954e-03 1.38580808e-02]
 [3.39145809e-02 4.61797379e-02 2.61285193e-02 ... 1.91081539e-01
  5.28680533e-02 6.34978013e-03]]
fps
()
19.62
timestamps_ms
(5,)
[ 3261.9775739   6523.95514781  9785.93272171 13047.91029562
 16309.88786952]
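
The returned dictionary holds one 1024-dimensional vector per frame stack for each stream. A minimal sketch of one possible next step, fusing the two streams into a single vector per stack, is shown below. The `feature_dict` here is a stand-in: the `rgb` and `flow` arrays are random values with the shapes printed above, while `fps` and `timestamps_ms` are copied from that output. Reading the timestamp spacing as a frame count is an assumption about how the extractor reports timestamps, not something the output states.

```python
import numpy as np

# Stand-in for the dictionary returned by extractor.extract():
# random features with the shapes shown above, fps/timestamps copied from the output.
rng = np.random.default_rng(0)
feature_dict = {
    'rgb': rng.random((5, 1024)),
    'flow': rng.random((5, 1024)),
    'fps': 19.62,
    'timestamps_ms': np.array([3261.9775739, 6523.95514781, 9785.93272171,
                               13047.91029562, 16309.88786952]),
}

# Concatenate the two streams along the feature axis: one 2048-d vector per stack
rgb_flow = np.concatenate([feature_dict['rgb'], feature_dict['flow']], axis=1)
print(rgb_flow.shape)  # (5, 2048)

# Spacing between consecutive timestamps, expressed in video frames
step_ms = np.diff(feature_dict['timestamps_ms'])
frames_per_step = step_ms * feature_dict['fps'] / 1000
print(np.round(frames_per_step))  # [64. 64. 64. 64.]
```

With these numbers, consecutive timestamps are 64 frames apart, so each feature row likely summarizes a 64-frame window of the video.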

In [6]:
!pip install librosa torch transformers moviepy

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [7]:
from moviepy.editor import VideoFileClip

video = VideoFileClip('./sample/v_GGSY1Qvo990.mp4')
video.audio.write_audiofile('./sample/v_G_audio.wav')

MoviePy - Writing audio in ./sample/v_G_audio.wav
MoviePy - Done.




In [19]:
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

audio_file = './sample/v_G_audio.wav'
# Load as mono and resample to 16 kHz, the rate wav2vec 2.0 was trained on
input_audio, sample_rate = librosa.load(audio_file, sr=16000)

model_name = "facebook/wav2vec2-large-xlsr-53"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# Normalize the waveform and wrap it in a batch of PyTorch tensors
inputs = feature_extractor(input_audio, return_tensors="pt", sampling_rate=sample_rate)
with torch.no_grad():
    outputs = model(inputs.input_values)
# Shape: (batch, num_frames, hidden_size)
audio_features = outputs.last_hidden_state

Some weights of the model checkpoint at facebook/wav2vec2-large-xlsr-53 were not used when initializing Wav2Vec2Model: ['project_hid.bias', 'project_q.weight', 'project_hid.weight', 'quantizer.weight_proj.weight', 'quantizer.codevectors', 'project_q.bias', 'quantizer.weight_proj.bias']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
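
The time axis of `audio_features` is much shorter than the raw waveform because the model's convolutional encoder downsamples the audio. The sketch below estimates the number of output frames for a given number of samples. The kernel sizes and strides are assumed to be the standard wav2vec 2.0 values; they can be checked against `model.config.conv_kernel` and `model.config.conv_stride`. The helper `num_output_frames` is a hypothetical name introduced here for illustration.

```python
# Kernel sizes and strides of the wav2vec 2.0 convolutional feature encoder
# (assumed defaults; compare with model.config.conv_kernel / conv_stride).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio
print(num_output_frames(16000))  # 49
```

Under these assumptions, the encoder produces roughly one frame every 20 ms (a total stride of 320 samples at 16 kHz).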


In [39]:
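# audio_features has shape (batch, num_frames, hidden_size)
sequence_length = audio_features.size(1)

# Calculate the number of audio features per video segment
# (one segment per row of the I3D features)
audio_features_per_segment = sequence_length // feature_dict['flow'].shape[0]

# Split the audio features into segments along the time axis.
# If sequence_length is not an exact multiple of the number of video segments,
# split() returns one extra, shorter segment holding the leftover frames.
audio_feature_segments = audio_features.split(audio_features_per_segment, dim=1)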
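
When the number of audio frames is not an exact multiple of the number of video segments, `split` with a fixed chunk size returns one extra segment. A minimal sketch of an alternative is shown below: `torch.tensor_split` always returns the requested number of chunks, and mean pooling then gives one audio vector per video segment. The tensor here is a random stand-in with an illustrative frame count (817), and mean pooling is only one possible way to summarize a segment, not the method used above.

```python
import torch

# Stand-ins with illustrative shapes: (batch, num_frames, hidden_size)
audio_features = torch.randn(1, 817, 1024)
num_video_segments = 5

# Fixed-size split leaves a short sixth chunk because 817 is not divisible by 5
fixed = audio_features.split(audio_features.size(1) // num_video_segments, dim=1)
print(len(fixed))  # 6

# tensor_split returns exactly num_video_segments chunks of near-equal length
segments = torch.tensor_split(audio_features, num_video_segments, dim=1)
print([seg.size(1) for seg in segments])  # [164, 164, 163, 163, 163]

# Mean-pool over time so that each video segment gets one audio vector
pooled = torch.stack([seg.mean(dim=1).squeeze(0) for seg in segments])
print(pooled.shape)  # torch.Size([5, 1024])
```

The pooled tensor has the same number of rows as the `rgb` and `flow` arrays, which makes it straightforward to pair audio and visual features per segment.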


In [41]:
audio_feature_segments

(tensor([[[-0.1380,  0.0635,  0.0506,  ...,  0.2234, -0.1313, -0.0087],
          [-0.0315,  0.2010,  0.0014,  ..., -0.0900,  0.0160, -0.0253],
          [-0.0315,  0.2080,  0.0015,  ..., -0.0890,  0.0154, -0.0257],
          ...,
          [-0.0303,  0.2098,  0.0009,  ..., -0.0877,  0.0145, -0.0254],
          [-0.0301,  0.2110,  0.0007,  ..., -0.0874,  0.0148, -0.0255],
          [-0.0304,  0.2114,  0.0010,  ..., -0.0878,  0.0145, -0.0252]]]),
 tensor([[[-0.0304,  0.2103,  0.0007,  ..., -0.0878,  0.0151, -0.0251],
          [-0.0303,  0.2104,  0.0010,  ..., -0.0881,  0.0148, -0.0252],
          [-0.0303,  0.2109,  0.0007,  ..., -0.0878,  0.0152, -0.0251],
          ...,
          [-0.0304,  0.2114,  0.0015,  ..., -0.0876,  0.0145, -0.0247],
          [-0.0305,  0.2102,  0.0015,  ..., -0.0876,  0.0150, -0.0248],
          [-0.0307,  0.2088,  0.0014,  ..., -0.0876,  0.0155, -0.0253]]]),
 tensor([[[-0.0309,  0.2083,  0.0014,  ..., -0.0878,  0.0151, -0.0246],
          [-0.0303,  0.2098,

In [None]:
! pip freeze

absl-py==1.2.0
aiohttp==3.8.1
aiosignal==1.2.0
alabaster==0.7.12
albumentations==1.2.1
altair==4.2.0
appdirs==1.4.4
arviz==0.12.1
astor==0.8.1
astropy==4.3.1
astunparse==1.6.3
async-timeout==4.0.2
asynctest==0.13.0
atari-py==0.2.9
atomicwrites==1.4.1
attrs==22.1.0
audioread==3.0.0
autograd==1.4
Babel==2.10.3
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==5.0.1
blis==0.7.8
bokeh==2.3.3
branca==0.5.0
bs4==0.0.1
CacheControl==0.12.11
cached-property==1.5.2
cachetools==4.2.4
catalogue==2.0.8
certifi==2022.6.15
cffi==1.15.1
cftime==1.6.1
chardet==3.0.4
charset-normalizer==2.1.0
click==7.1.2
clikit==0.6.2
cloudpickle==1.5.0
cmake==3.22.6
cmdstanpy==1.0.4
colorcet==3.0.0
colorlover==0.3.0
community==1.0.0b1
contextlib2==0.5.5
convertdate==2.4.0
crashtest==0.3.1
crcmod==1.7
cufflinks==0.17.3
cupy-cuda111==9.4.0
cvxopt==1.3.0
cvxpy==1.2.1
cycler==0.11.0
cymem==2.0.6
Cython==0.29.32
daft==0.0.4
dask==2022.2.0
datascience==0.17.5
debugpy==1.0.0
decorator==4.4.2
defusedxml==0.7.1
deprecat==2.1.1
de