Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing (To appear in ECCV 2020) [Paper]
Yapeng Tian, Dingzeyu Li, and Chenliang Xu
We define Audio-Visual Video Parsing as the task of grouping video segments and parsing a video into temporal audio, visual, and audio-visual events, each associated with a semantic label.
# LLP dataset annotations
cd data
AVVP_dataset_full.csv: full dataset with weak annotations
AVVP_train.csv: training set with weak annotations
AVVP_val_pd.csv: val set with weak annotations
AVVP_test_pd.csv: test set with weak annotations
AVVP_eval_audio.csv: audio event dense annotations for videos in val and test sets
AVVP_eval_visual.csv: visual event dense annotations for videos in val and test sets
Note that audio-visual events can be derived from audio and visual events.
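As a minimal sketch of that derivation, the snippet below marks an audio-visual event wherever the same label is active in both modalities during the same segment. The 10 one-second segments, the `(label, onset, offset)` tuple layout, and the helper names `rasterize` and `derive_audio_visual` are assumptions made for illustration; they are not taken from the annotation files or the evaluation code.

```python
import math

def rasterize(events, classes, num_segments=10):
    """Convert (label, onset, offset) events in seconds into per-class 0/1 segment lists.

    Assumption: each video is split into `num_segments` one-second segments.
    """
    grid = {c: [0] * num_segments for c in classes}
    for label, onset, offset in events:
        start = max(math.floor(onset), 0)
        end = min(math.ceil(offset), num_segments)
        for t in range(start, end):
            grid[label][t] = 1
    return grid

def derive_audio_visual(audio_grid, visual_grid):
    """An audio-visual event is active only where both modalities are active."""
    return {
        c: [a & v for a, v in zip(audio_grid[c], visual_grid[c])]
        for c in audio_grid
    }

# Hypothetical dense annotations for one video.
classes = ["Speech", "Dog"]
audio_events = [("Speech", 0, 6), ("Dog", 4, 10)]
visual_events = [("Dog", 2, 7)]

audio_grid = rasterize(audio_events, classes)
visual_grid = rasterize(visual_events, classes)
av_grid = derive_audio_visual(audio_grid, visual_grid)
print(av_grid["Dog"])  # segments where the dog is both heard and seen
```

In this toy example, "Speech" is audible but never visible, so it produces no audio-visual event, while "Dog" is audio-visual only in the segments where its audio and visual intervals overlap.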
We use VGGish, ResNet152, and ResNet (2+1)D to extract audio, 2D frame-level, and 3D snippet-level features, respectively. The audio and visual features of videos in the LLP dataset can be downloaded from this Google Drive link. The features are in the "feats" folder.
pip install -r requirements
Testing:
python main_avvp.py --mode test --audio_dir /xx/feats/vggish/ --video_dir /xx/feats/res152/ --st_dir /xx/feats/r2plus1d_18/
Training:
python main_avvp.py --mode train --audio_dir /xx/feats/vggish/ --video_dir /xx/feats/res152/ --st_dir /xx/feats/r2plus1d_18/
Download raw videos in the LLP dataset. The downloaded videos will be in the data/LLP_dataset/video folder. Pandas and FFmpeg are required.
python ./scripts/download_dataset.py
Extract audio waveforms from videos. The extracted audio will be in the data/LLP_dataset/audio folder. The moviepy library is used to read videos and extract audio.
python ./scripts/extract_audio.py
Extract video frames from videos. The extracted frames will be in the data/LLP_dataset/frame folder.
python ./scripts/extract_frames.py
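For readers who want to see what frame extraction involves, here is a rough sketch that builds an FFmpeg command for one video. It is not necessarily what `scripts/extract_frames.py` does: the sampling rate of 8 frames per second, the per-video output subfolder, the `%04d.jpg` naming pattern, and the helper name `build_frame_command` are all assumptions made for illustration.

```python
import shlex
from pathlib import Path

def build_frame_command(video_path, frame_dir, fps=8):
    """Build an FFmpeg command that samples frames from one video.

    Assumptions: `fps` and the output layout (one subfolder per video,
    frames named 0001.jpg, 0002.jpg, ...) are illustrative choices.
    """
    out_pattern = Path(frame_dir) / Path(video_path).stem / "%04d.jpg"
    return [
        "ffmpeg",
        "-i", str(video_path),   # input video
        "-vf", f"fps={fps}",     # resample to a fixed frame rate
        str(out_pattern),        # numbered output images
    ]

# Hypothetical file name; the output subfolder must exist before running FFmpeg.
cmd = build_frame_command("data/LLP_dataset/video/example.mp4",
                          "data/LLP_dataset/frame")
print(shlex.join(cmd))
```

The command is only printed here; it could be passed to `subprocess.run(cmd, check=True)` once the output folder has been created.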
The audio feature extractor can be found here.
2D visual features. The pretrainedmodels library is required.
python ./scripts/extract_rgb_feat.py
3D visual features.
python ./scripts/extract_3D_feat.py
If you find this work useful, please consider citing it.
@InProceedings{tian2020avvp,
author={Tian, Yapeng and Li, Dingzeyu and Xu, Chenliang},
title={Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing},
booktitle = {ECCV},
year = {2020}
}
This project is released under the GNU General Public License v3.0.