# **VGG-SOUND Datasets**

The VGG-Sound dataset was developed by the Visual Geometry Group (VGG), Department of Engineering Science, University of Oxford, UK. It has set a benchmark for audio recognition with visuals. The dataset contains more than 210k videos, each with both a visual and an audio track, covering over 310 categories and 550 hours of video. Each video and audio segment is 10 seconds long. The dataset is available to download for commercial and research purposes.

## **Download Dataset**

The dataset is distributed as a CSV file that lists, for each clip, its YouTube ID, start time in seconds, label, and train/test split. Click [here](https://www.robots.ox.ac.uk/~vgg/data/vggsound/vggsound.csv) to download it to your computer.

Download Size: 8 MB

Now let's write some code to explore the dataset, its categories, and the ratio of training to test data.

In [None]:
!wget https://www.robots.ox.ac.uk/~vgg/data/vggsound/vggsound.csv
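
The first two columns of the CSV hold a YouTube video ID and a start time in seconds, not a ready-made link. As a rough sketch, assuming the usual `watch?v=` URL form with a `t` offset and the 10-second clip length mentioned earlier, a row can be turned into a link and a time window like this. The helper names `clip_url` and `clip_window` are hypothetical, and the ID is made up.

```python
def clip_url(youtube_id, start_seconds):
    """Build a watch URL that starts playback at the clip's start time."""
    return f"https://www.youtube.com/watch?v={youtube_id}&t={int(start_seconds)}s"


def clip_window(start_seconds, duration=10):
    """Return (start, end) in seconds, assuming every clip lasts `duration` seconds."""
    start = int(start_seconds)
    return start, start + duration


# Made-up ID, for illustration only
print(clip_url("aaaaaaaaaaa", 30))
print(clip_window(30))
```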

## **Visualization of Data**

Install the required libraries, then import pandas, matplotlib, and seaborn and load the dataset with the code below.

In [None]:

!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim tensorflow keras torch torchvision \
    tqdm scikit-image pillow librosa torchaudio apache_beam soundfile opencv-python --user -q --no-warn-script-location

# Restart the kernel so the newly installed packages can be imported
import IPython
IPython.Application.instance().kernel.do_shutdown(True)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
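# The CSV has no header row, so pandas uses the first record as column names.
# That is why the cells below refer to columns named 'people marching', '1' and 'test'.
df = pd.read_csv("vggsound.csv")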
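
Because the file has no header row, the default `read_csv` call promotes the first record to column names and drops it from the data. A minimal alternative sketch, assuming the four columns are YouTube ID, start second, label, and split, passes explicit names instead. The column names and the `load_vggsound` helper are my own choices, not part of the dataset, and the sample rows are made up.

```python
import io

import pandas as pd

# Hypothetical column names; the CSV itself does not define any.
COLUMNS = ["youtube_id", "start_seconds", "label", "split"]


def load_vggsound(path_or_buffer):
    """Read the CSV without treating the first record as a header."""
    return pd.read_csv(path_or_buffer, header=None, names=COLUMNS)


# Tiny made-up sample in the same four-column layout, for illustration only
sample = io.StringIO(
    "aaaaaaaaaaa,1,people marching,test\n"
    "bbbbbbbbbbb,30,dog barking,train\n"
    "ccccccccccc,70,dog barking,train\n"
)
demo = load_vggsound(sample)
print(demo.shape)
print(demo.head())
```

With the real file this would be `load_vggsound("vggsound.csv")`. The rest of this notebook keeps the default column names so that its cells run unchanged.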

The dataset contains more than 310 classes, which can be broadly grouped into:

1. People
2. Animals
3. Music
4. Sports
5. Nature
6. Vehicle
7. Tools
8. Instruments
9. Mammals
10. Others

A pie chart of every class is too crowded to read, as the following plot shows.

In [None]:
df.groupby('people marching').size().plot(kind='pie', autopct='%.2f')
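
One way to get a more readable picture is to show only the most frequent classes as a bar chart. The sketch below is an illustration, not part of the original analysis: `top_classes` and `plot_top_classes` are hypothetical helpers, and the demo labels are made up.

```python
import pandas as pd


def top_classes(labels, n=20):
    """Return the n most frequent labels with their clip counts."""
    return pd.Series(labels).value_counts().head(n)


def plot_top_classes(labels, n=20):
    """Draw a horizontal bar chart of the n most frequent labels."""
    counts = top_classes(labels, n).sort_values()
    ax = counts.plot(kind="barh", figsize=(8, 0.3 * len(counts) + 1))
    ax.set_xlabel("number of clips")
    return ax


# Made-up labels, for illustration only
demo_labels = ["dog barking"] * 3 + ["people marching"] * 2 + ["playing piano"]
print(top_classes(demo_labels, n=2))
```

With the DataFrame loaded earlier, `plot_top_classes(df['people marching'])` would chart the label column.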

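Now visualize the train and test data by plotting each clip's start time against its split.

In [None]:
# Column '1' is the clip start time in seconds and column 'test' is the train/test split
sns.catplot(x="1", y="test", data=df)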

Plot the ratio of the training and test sets:

In [None]:
# Group by the split column ('test'), not the label column, to get the train/test ratio
df.groupby('test').size().plot(kind='pie', autopct='%.2f')

The dataset contains 92.25 per cent training data and 7.75 per cent test data, as shown in the pie chart.
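
The same percentages can be read off numerically instead of from the chart. This is a small sketch with a hypothetical helper, `split_percentages`, run on made-up data. It is not meant to reproduce the figures above.

```python
import pandas as pd


def split_percentages(splits):
    """Return the share of each split value as a percentage, rounded to 2 places."""
    return (pd.Series(splits).value_counts(normalize=True) * 100).round(2)


# Made-up split column, for illustration only
demo_splits = ["train"] * 9 + ["test"]
print(split_percentages(demo_splits))
```

With the DataFrame loaded earlier, the call would be `split_percentages(df['test'])`.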

## **Implementation of VGG-Sound**

**Using PyTorch:**

In [None]:
import csv
import os

import numpy as np
import soundfile as sf
from scipy import signal
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


class GetAudioVideoDataset(Dataset):
    """Loads VGG-Sound audio clips and returns normalised log-spectrograms."""

    def __init__(self, args, mode='train', transforms=None):
        data2path = {}
        classes = []
        classes_ = []
        data = []
        data2class = {}
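        # stat.csv supplies the class names (first column of each row)
        with open(args.csv_path + 'stat.csv') as f:
            csv_reader = csv.reader(f)
            for row in csv_reader:
                classes.append(row[0])
        # args.test names the CSV of clips to load: file name in column 0, label in column 1
        with open(args.csv_path + args.test) as f:
            csv_reader = csv.reader(f)
            for item in csv_reader:
                # Keep only clips with a known class whose .wav file exists on disk
                if item[1] in classes and os.path.exists(args.data_path + item[0][:-3] + 'wav'):
                    data.append(item[0])
                    data2class[item[0]] = item[1]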
        self.audio_path = args.data_path 
        self.mode = mode
        self.transforms = transforms
        self.classes = sorted(classes)
        self.data2class = data2class
        # initialize audio transform
        self._init_atransform()
        #  Retrieve list of audio and video files
        self.video_files = []
        for item in data:
            self.video_files.append(item)
        print('# of audio files = %d ' % len(self.video_files))
        print('# of classes = %d' % len(self.classes))
    def _init_atransform(self):
        self.aid_transform = transforms.Compose([transforms.ToTensor()])
    def __len__(self):
        return len(self.video_files)  
    def __getitem__(self, idx):
        wav_file = self.video_files[idx]
        # Audio
        samples, samplerate = sf.read(self.audio_path + wav_file[:-3] + 'wav')
        # Repeat in case the audio is too short, then keep 160000 samples
        # (10 seconds if the audio is sampled at 16 kHz)
        resamples = np.tile(samples, 10)[:160000]
        # Clamp the waveform to [-1, 1]
        resamples[resamples > 1.] = 1.
        resamples[resamples < -1.] = -1.
        # Log-spectrogram, normalised to zero mean and unit variance
        frequencies, times, spectrogram = signal.spectrogram(resamples, samplerate, nperseg=512, noverlap=353)
        spectrogram = np.log(spectrogram + 1e-7)
        mean = np.mean(spectrogram)
        std = np.std(spectrogram)
        spectrogram = np.divide(spectrogram - mean, std + 1e-9)
        # Return the spectrogram, the waveform, the class index, and the file name
        return spectrogram, resamples, self.classes.index(self.data2class[wav_file]), wav_file
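
To see what `__getitem__` produces without downloading any audio, the preprocessing steps can be tried on a synthetic signal. This is a sketch under stated assumptions: the 16 kHz sample rate and the random-noise input are my own choices for illustration, and `log_spectrogram` is a hypothetical helper that mirrors the tile, clamp, spectrogram, log, and normalise steps of the class above.

```python
import numpy as np
from scipy import signal


def log_spectrogram(samples, samplerate, target_len=160000):
    """Mirror the dataset's audio preprocessing on a 1-D waveform."""
    # Repeat short clips, then cut to a fixed number of samples
    clip = np.tile(samples, 10)[:target_len]
    # Clamp the waveform to [-1, 1]
    clip = np.clip(clip, -1.0, 1.0)
    # Same window settings as in the dataset class
    _, _, spec = signal.spectrogram(clip, samplerate, nperseg=512, noverlap=353)
    spec = np.log(spec + 1e-7)
    # Normalise to zero mean and unit variance
    return (spec - spec.mean()) / (spec.std() + 1e-9)


samplerate = 16000  # assumed sample rate, not read from a real file
rng = np.random.default_rng(0)
noise = rng.uniform(-1.0, 1.0, size=3 * samplerate)  # 3 s of synthetic audio
spec = log_spectrogram(noise, samplerate)
print(spec.shape)  # (frequency bins, time frames)
```

A window of 512 samples gives 257 frequency bins, and a hop of 512 - 353 = 159 samples over 160000 samples gives 1004 frames, so each item is a fixed-size 257 x 1004 array regardless of how long the source clip was.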
