**EGO4D Audio-Visual Diarization Benchmark**
- This notebook is a quickstart for the [EGO4D Audio Visual Diarization](https://github.com/EGO4D/audio-visual/blob/main/diarization/audio-visual/README.md) and [Transcription](https://github.com/EGO4D/audio-visual/blob/main/transcription/README.md) baselines from the [EGO4D Audio Visual Diarization Benchmark](https://github.com/EGO4D/audio-visual).
- It runs the baselines on a subset of video clips from the EGO4D dataset, using the code in EGO4D's Audio-Visual repo.
- Set the hardware accelerator to a T4 GPU.
- The notebook clones a fork of the repo that contains some code changes for compatibility with Google Colab.

##Install Dependencies

In [None]:
!apt install -y ffmpeg python3-pip git
!pip install ego4d awscli numpy opencv-python pyqt5 opencv-contrib-python libtorch torchvision torchaudio
!sudo apt-get install -y libavcodec-dev libavformat-dev libswscale-dev libv4l-dev
!sudo apt-get install -y libxvidcore-dev libx264-dev
!sudo apt install -y libgtk2.0-dev liblcm-dev
!pip install pydub audiosegment
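
Later cells call `ffmpeg`, `git`, `aws`, and `ego4d` from the shell, so it may help to confirm they are on the `PATH` before continuing. The sketch below is one minimal way to check; the helper name `missing_tools` is hypothetical and not part of the EGO4D repo.

```python
import shutil

def missing_tools(tools):
    """Return the subset of command names that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Command-line tools that the later cells invoke directly.
required = ["ffmpeg", "git", "aws", "ego4d"]
print("Missing tools:", missing_tools(required) or "none")
```

If the list is not empty, re-running the install cell above is the first thing to try.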

##Clone Repository & download the videos

In [None]:
# Create the new egocentric directory
!mkdir -p /content/egocentric
%cd /content/egocentric

# Clone the audio-visual repo
!git clone https://github.com/ashneet1/audio-visual.git
%cd /content/egocentric/audio-visual
!mkdir data

# List the video uids to download
!touch video_uids.txt
!echo "0b4cacb1-970f-4ef0-85da-371d81f899e0" >> video_uids.txt
!echo "c2413391-7c1b-4fd6-8b1d-98ee7888b9f8" >> video_uids.txt
!echo "fe69a78e-7773-45d1-9e0f-bacee52dac83" >> video_uids.txt
!echo "3b79017c-4d42-40fc-a1bb-4a20bc8ebca7" >> video_uids.txt
!echo "6dbfc053-7899-40d8-9827-0ccd21f3ee0a" >> video_uids.txt
!echo "7e6dfd31-8544-4fad-9e49-0f05516cf8cf" >> video_uids.txt
!echo "56c5af79-f9d4-478d-96ef-6d71e0bbbdfe" >> video_uids.txt
!echo "d97bedc8-72df-43be-a55b-4da1ae42dfd1" >> video_uids.txt
!echo "f0cb79ef-c081-4049-85ef-2623e02c9589" >> video_uids.txt
!echo "08b0935e-6260-4bd6-86ca-f6fc54e388be" >> video_uids.txt
!echo "6b34c327-000c-42b6-b242-d3dca63a7508" >> video_uids.txt
!echo "076bdb81-5c75-4282-9f3a-a387624575f3" >> video_uids.txt

# Configure aws cli to be able to access the ego4d dataset
!aws configure

# Download the EGO4D AV models, annotations, and clips
!ego4d -y --output_directory ./data  --datasets av_models clips annotations --benchmarks av --video_uid_file video_uids.txt
!tar xf data/v2/av_models/pretrained_av_models.tar.gz
!mv data/v2/annotations/* utils/ground_truth
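
The `echo ... >> video_uids.txt` lines above append to the file, so re-running the cell duplicates every UID. A possible alternative, sketched here under the assumption that one UID per line is all the `ego4d` CLI needs, is to write the file from Python and overwrite it each time. The helper `write_uid_file` is hypothetical.

```python
import uuid
from pathlib import Path

def write_uid_file(uids, path):
    """Validate each UID as a UUID, drop duplicates, and overwrite the file."""
    unique = []
    for uid in uids:
        uuid.UUID(uid)  # raises ValueError on a malformed UID
        if uid not in unique:
            unique.append(uid)
    Path(path).write_text("\n".join(unique) + "\n")
    return unique

# First two UIDs from the cell above; extend the list with the remaining ones.
example_uids = [
    "0b4cacb1-970f-4ef0-85da-371d81f899e0",
    "c2413391-7c1b-4fd6-8b1d-98ee7888b9f8",
]
```

Calling `write_uid_file(example_uids, "video_uids.txt")` from the repo directory would then replace the `touch` and `echo` lines.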

In [None]:
# Install libtorch
%cd /content/egocentric
!wget "https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.1.0%2Bcu118.zip"
!unzip "libtorch-cxx11-abi-shared-with-deps-2.1.0+cu118.zip"
!rm -rf "libtorch-cxx11-abi-shared-with-deps-2.1.0+cu118.zip"

#EGO4D Audio-Visual Diarization Baseline

##Preprocess ground truth data

In [None]:
# Preprocess ground truth data
%cd /content/egocentric/audio-visual/utils/ground_truth
!bash init_dirs.sh
!python3 extract_clipnames_and_split_indices.py
!python3 extract_boxes_and_speakers.py
!python3 make_mot_ground_truth.py ../../data/v2/clips val
!mv tracking_evaluation/mot_challenge ../../tracking/tracking_evaluation/data/gt

In [None]:
#Run visualize_ground_truth.py (It downloads the output video to the current directory)
%cd /content/egocentric/audio-visual/utils/ground_truth/
!python3 visualize_ground_truth.py /content/egocentric/audio-visual/data/v2/clips 0b4cacb1-970f-4ef0-85da-371d81f899e0 # This is clip 389

##Localization & Tracking

###People Detection Setup

In [None]:
# People detection setup
# https://github.com/EGO4D/audio-visual/blob/main/tracking/README.md#people-detection
%cd /content/egocentric/audio-visual/tracking/people_detection

# Replace the lines in the makefile to use opencv4 instead of opencv3
!sed -i '44s/.*/LDFLAGS+= `pkg-config --libs opencv4` -lstdc++/' Makefile
!sed -i '45s/.*/COMMON+= `pkg-config --cflags opencv4`/' Makefile

# Set the GPU architecture to compute capability 7.5 (sm_75), which matches the T4
!sed -i '14s/.*/ARCH= -gencode arch=compute_75,code=sm_75/' Makefile

# Add missing headers required to build using opencv4
# https://stackoverflow.com/questions/64885148/error-iplimage-does-not-name-a-type-when-trying-to-build-darknet-with-opencv
!sed -i '3 i #include "opencv2/core/core_c.h"' src/image_opencv.cpp
!sed -i '3 i #include "opencv2/videoio/legacy/constants_c.h"' src/image_opencv.cpp
!sed -i '3 i #include "opencv2/highgui/highgui_c.h"' src/image_opencv.cpp

!sed -i '3 i #include "opencv2/core/core_c.h"' src/image_opencv.hpp
!sed -i '3 i #include "opencv2/videoio/legacy/constants_c.h"' src/image_opencv.hpp
!sed -i '3 i #include "opencv2/highgui/highgui_c.h"' src/image_opencv.hpp

!make -j
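
The `ARCH` line above is hard-coded for the T4, whose compute capability is 7.5. If the notebook runs on a different GPU, the flag would need to change. The sketch below shows how such a flag could be built from a `(major, minor)` capability tuple; `gencode_flag` is a hypothetical helper, not part of the repo or of darknet.

```python
def gencode_flag(capability):
    """Build the nvcc -gencode flag for a (major, minor) compute capability."""
    major, minor = capability
    sm = f"{major}{minor}"
    return f"-gencode arch=compute_{sm},code=sm_{sm}"

# A T4 has compute capability 7.5, which matches the ARCH value set above.
print(gencode_flag((7, 5)))
```

With PyTorch installed and a GPU attached, the tuple can be read from `torch.cuda.get_device_capability()`.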

###Short Term Tracking

In [None]:
# Short term tracking setup
# https://github.com/EGO4D/audio-visual/blob/main/tracking/README.md#short_term_tracking
%cd /content/egocentric/audio-visual/tracking/short_term_tracking

# Modify line 13 in the CMake file to include the opencv4 directory
# https://stackoverflow.com/questions/58478074/how-to-fix-fatal-error-opencv2-core-hpp-no-such-file-or-directory-for-opencv
!sed -i '13s,.*,include_directories( /usr/local/include /usr/local/cuda/include /usr/include/opencv4/ ),' CMakeLists.txt

# Modify line 17 in the CMake file to fix a compilation error
# https://github.com/pytorch/pytorch/issues/103371
!sed -i '17s,.*,set_property(TARGET short_term_tracker PROPERTY CXX_STANDARD 17),' CMakeLists.txt
!mkdir build
%cd build
!cmake -DCMAKE_PREFIX_PATH=/content/egocentric/libtorch ..
!make
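
The `sed` edits above address lines by number, so they silently overwrite the wrong line if the upstream `CMakeLists.txt` changes. A more defensive variant, sketched here with a hypothetical helper `replace_line`, replaces a line only when it still contains the text it is expected to contain.

```python
from pathlib import Path

def replace_line(path, line_number, expected_substring, new_line):
    """Replace a 1-indexed line only if it contains the expected text."""
    file_path = Path(path)
    lines = file_path.read_text().splitlines()
    current = lines[line_number - 1]
    if expected_substring not in current:
        raise ValueError(
            f"Line {line_number} of {path} is {current!r}; "
            f"expected it to contain {expected_substring!r}"
        )
    lines[line_number - 1] = new_line
    file_path.write_text("\n".join(lines) + "\n")
```

For example, assuming line 13 of the original file is the `include_directories(...)` call, `replace_line("CMakeLists.txt", 13, "include_directories", "include_directories( /usr/local/include /usr/local/cuda/include /usr/include/opencv4/ )")` would apply the first edit and raise an error otherwise.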

###Run Global People Tracking

In [None]:
#Global People Tracking
%cd /content/egocentric/audio-visual/tracking
!python3 single_run.py /content/egocentric/audio-visual/data/v2/clips 438

##Voice Activity Detection (VAD)

In [None]:
# Voice Activity Detection (VAD)
%cd /content/egocentric/audio-visual/active-speaker-detection/vad
!python3 extract_all_audio.py /content/egocentric/audio-visual/data/v2/clips
!python3 vad.py

##Active Speaker Detection (ASD)

####Mouth Region Classification (MRC)

In [None]:
# Active Speaker Detection (ASD)
# Mouth region classification (MRC)
%cd /content/egocentric/audio-visual/active-speaker-detection/active_speaker/mrc_active_speaker_detection/prediction

# Modify lines 15 and 17 in the CMake file to include the opencv4 directory
# https://stackoverflow.com/questions/58478074/how-to-fix-fatal-error-opencv2-core-hpp-no-such-file-or-directory-for-opencv
!sed -i '15s,.*,link_directories( /usr/local/lib /usr/local/cuda/lib64 /usr/include/opencv4/ ),' CMakeLists.txt
!sed -i '17s,.*,include_directories( /usr/local/include /usr/local/cuda/include /usr/local/cuda/targets/x86_64-linux/include /usr/include/opencv4/ ),' CMakeLists.txt

# Modify line 21 in the CMake file to fix a compilation error
# https://github.com/pytorch/pytorch/issues/103371
!sed -i '21s,.*,set_property(TARGET mrc PROPERTY CXX_STANDARD 17),' CMakeLists.txt
# Build MRC tracking code
!mkdir build
%cd build
!cmake -DCMAKE_PREFIX_PATH=/content/egocentric/libtorch ..
!make

In [None]:
#Running the MRC
%cd /content/egocentric/audio-visual/active-speaker-detection/active_speaker/mrc_active_speaker_detection/prediction
!python3 run_once.py /content/egocentric/audio-visual/data/v2/clips ego4d 389

##Audio Embedding

In [None]:
#Voice Embedding
%cd /content/egocentric/audio-visual/active-speaker-detection/audio_embedding/make_audio_embeddings
!python3 batch_audio_embedding.py /content/egocentric/audio-visual/data/v2/clips val

##Device wearer voice activity detection

####Energy Based Method

In [None]:
#Wearer: energy based method
%cd /content/egocentric/audio-visual/active-speaker-detection/wearer/energy_based
!python3 short_time_energy.py /content/egocentric/audio-visual/data/v2/clips val
!python3 match_wearer_audio.py val
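
As the script name suggests, this method relies on short-time energy: the signal is cut into frames and each frame's energy is measured, presumably because the wearer's voice is closest to the microphone and therefore loudest. The sketch below illustrates the general idea on a toy signal. It is not the implementation in `short_time_energy.py`, and the frame and hop lengths are arbitrary.

```python
def short_time_energy(samples, frame_length, hop_length):
    """Return the mean squared amplitude of each full frame of the signal."""
    energies = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        energies.append(sum(x * x for x in frame) / frame_length)
    return energies

# Toy signal: two quiet frames followed by two loud frames.
signal = [0.0] * 4 + [1.0] * 4
print(short_time_energy(signal, frame_length=2, hop_length=2))
```

Thresholding these per-frame values is one simple way to mark frames as speech or silence.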

##Surrounding people voice matching (MRC)

In [None]:
#Surrounding People Audio Matching: MRC
%cd /content/egocentric/audio-visual/active-speaker-detection/surrounding_people_audio_matching/mrc
!python3 match_audio.py /content/egocentric/audio-visual/active-speaker-detection/active_speaker/mrc_active_speaker_detection/prediction/results val

#Transcription
- Before running the cells below, move `av_test_unannotated.json`, `av_train.json`, and `av_val.json` from /content/egocentric/audio-visual/utils/ground_truth to /content/egocentric/audio-visual/data
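
The move can be done by hand in the Colab file browser or from a cell. The sketch below is one way to do it from Python; the helper `move_annotations` is hypothetical, and it skips any file that is not present in the source directory.

```python
import shutil
from pathlib import Path

ANNOTATION_FILES = ["av_test_unannotated.json", "av_train.json", "av_val.json"]

def move_annotations(src_dir, dst_dir, names=ANNOTATION_FILES):
    """Move each annotation file that exists in src_dir into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        source = Path(src_dir) / name
        if source.exists():
            shutil.move(str(source), str(dst / name))
            moved.append(name)
    return moved
```

With the paths used in this notebook, the call would be `move_annotations("/content/egocentric/audio-visual/utils/ground_truth", "/content/egocentric/audio-visual/data")`.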


In [None]:
#Install Miniconda
%cd /content/
#https://www.kaggle.com/code/alaajah/creating-virtual-environment-on-google-colab
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!conda install -q -y --prefix /usr/local python=3.8.10 ujson

In [None]:
#Install Sclite
%cd /content/
!git clone https://github.com/usnistgov/SCTK.git
%cd SCTK
! make config
! make all
! make check
! make install
! make doc

In [None]:
# Create the conda environment and install the extra packages
%cd /content/egocentric/audio-visual/transcription
!conda create -y --name transcription_env --file requirements_38_10.txt
!pip install soundfile
!pip install torch
!pip install espnet_model_zoo

In [None]:
#Extract 16kHz single channel audio files in wav format from videos
# Create the output directory that is passed to extract_wav.sh below
!mkdir -p /content/egocentric/audio-visual/data/v2/wavs_16000
%cd /content/egocentric/audio-visual/transcription
!chmod +x extract_wav.sh
!./extract_wav.sh /content/egocentric/audio-visual/data/v2/clips /content/egocentric/audio-visual/data/v2/wavs_16000
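
For reference, extracting 16 kHz single-channel WAV audio from one clip with `ffmpeg` can look like the command built below. This is a sketch of the general technique, not necessarily what `extract_wav.sh` does; `ffmpeg_wav_command` is a hypothetical helper and the file names are placeholders.

```python
def ffmpeg_wav_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg argument list for mono audio at the given sample rate."""
    return [
        "ffmpeg", "-y",            # overwrite the output if it exists
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # single channel
        "-ar", str(sample_rate),   # resample to 16 kHz by default
        str(wav_path),
    ]

print(" ".join(ffmpeg_wav_command("clip.mp4", "clip.wav")))
```

The returned list can be passed directly to `subprocess.run` to process a single file.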

In [None]:
#Extract transcriptions from the annotation files, decode audio and score the decoding output
%cd /content/egocentric/audio-visual/transcription
!pip install torchaudio
!chmod +x score_asr.sh
!./score_asr.sh /content/egocentric/audio-visual/transcription/output 1
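
Sclite reports word error rate (WER) among its scores. As a rough illustration of that metric, the sketch below computes WER as the word-level edit distance divided by the number of reference words. It is a simplified stand-in, not sclite's alignment or normalization, and `word_error_rate` is a hypothetical helper that expects a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("down") over 4 words.
print(word_error_rate("the cat sat down", "the cat sit"))
```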