# **VoxCeleb Datasets**

Audio-visual datasets are used widely in industry. Voice assistants such as Alexa Voice Service rely on them for automatic speech recognition. In health care, speech-recognition systems are used for lip reading of patients. In the military, they support high-performance fighter aircraft and digital detection systems. Other applications include silent lip reading for people who are unable to speak, and second-language learning.

The VoxCeleb dataset was developed by the Visual Geometry Group (VGG), Department of Engineering Science, University of Oxford, UK. To browse the VGG datasets, click [here](https://www.robots.ox.ac.uk/~vgg/data/). The VGG researchers created the VoxCeleb dataset from YouTube videos.

## **Dataset**

To download the dataset, visit the [following page](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/). You need to submit a request form to get access; the dataset is available for research purposes only.

### **Using PyTorch:**

In [None]:

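# Upgrade pip, then install the packages used in this notebook.
# Note: the PyPI package for sklearn is named scikit-learn.
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim tensorflow keras torch torchvision \
    tqdm scikit-image pillow librosa --user -q --no-warn-script-location

# Restart the kernel so that the newly installed packages are picked up.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)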


In [None]:
import os

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

import utility as ut  # helper module that this notebook does not define


class AudioDataset(Dataset):
    """Audio dataset that returns (STFT tensor, label) pairs."""

    def __init__(self, csv_file, base_audio_path, stft_transform=None):
        self._base_audio_path = base_audio_path
        self._table = pd.read_csv(csv_file)
        self._audio_data = {}  # in-memory cache: wav path -> loaded audio
        self._stft_transform = stft_transform
        # Drop rows from the CSV whose *.wav files are not available on disk.
        indices_to_remove = []
        for idx in range(len(self._table)):
            wav_name = os.path.join(self._base_audio_path, self._table.wav_name[idx])
            if not os.path.exists(wav_name):
                indices_to_remove.append(idx)
        self._table = self._table.drop(indices_to_remove)
        # drop=True avoids keeping the old index as an extra column.
        self._table = self._table.reset_index(drop=True)

    def __len__(self):
        return len(self._table)

    def __getitem__(self, idx):
        wav_name = os.path.join(self._base_audio_path, self._table.wav_name[idx])
        # Load each file only once; later accesses reuse the cached audio.
        if wav_name in self._audio_data:
            wav_data = self._audio_data[wav_name]
        else:
            wav_data = ut.load_audio_sample(wav_name)
            self._audio_data[wav_name] = wav_data
        # Create the sample and compute its complex spectrum.
        wav_data = ut.create_audio_sample(wav_data)
        audio_stft = ut.extract_spectrum(wav_data)
        # Stack the real and imaginary parts along the first axis.
        audio_stft = np.vstack((audio_stft.real, audio_stft.imag))
        if self._stft_transform:
            audio_stft = self._stft_transform(audio_stft)
        # Add a leading channel dimension before converting to a tensor.
        audio_stft = audio_stft.reshape((1, *audio_stft.shape))
        audio_stft = torch.from_numpy(audio_stft.astype(dtype=np.float32))
        labels = torch.from_numpy(np.array(self._table.target[idx]).astype(dtype=np.float32))
        return (audio_stft, labels)
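
The `AudioDataset` class above reads a CSV file with a `wav_name` column (a path relative to `base_audio_path`) and a `target` column. The notebook does not show how that file is produced, so the following is only a minimal sketch. It assumes that the audio is stored as `<speaker_id>/<video_id>/<clip>.wav` under the base directory and that the target is an integer index per speaker. The function name `build_manifest` and both of these choices are hypothetical and not part of the original code.

```python
import csv
from pathlib import Path


def build_manifest(base_audio_path, csv_file):
    """Write a CSV with `wav_name` and `target` columns and return the row count.

    Assumption: the first directory level under base_audio_path names the
    speaker, and the target is that speaker's integer index.
    """
    base = Path(base_audio_path)
    wav_paths = sorted(base.rglob("*.wav"))

    # Map each speaker directory name to a stable integer label.
    speakers = sorted({p.relative_to(base).parts[0] for p in wav_paths})
    speaker_to_label = {name: i for i, name in enumerate(speakers)}

    with open(csv_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_name", "target"])
        for p in wav_paths:
            rel = p.relative_to(base)
            # Store paths relative to the base directory, with forward slashes.
            writer.writerow([rel.as_posix(), speaker_to_label[rel.parts[0]]])
    return len(wav_paths)
```

The resulting file could then be passed as `csv_file` to `AudioDataset`, with the same directory given as `base_audio_path`.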


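Running the cell above raised a `ModuleNotFoundError`, most likely because the `utility` module it imports is not included in this notebook.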
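
The `utility` module imported by `AudioDataset` is not shown in this notebook. As a rough illustration of what its three helpers might do, the sketch below implements stand-ins with NumPy and the standard-library `wave` module. Everything here is an assumption: 16 kHz 16-bit PCM input, a 3-second clip length, a 512-point Hann-windowed STFT with a hop of 160 samples, and the function bodies themselves. The original helpers may behave differently.

```python
import wave

import numpy as np

# All constants below are assumed values, not taken from the original code.
SAMPLE_RATE = 16000
SAMPLE_LENGTH = 3 * SAMPLE_RATE  # fixed clip length in samples
N_FFT = 512
HOP = 160


def load_audio_sample(wav_name):
    """Read a 16-bit PCM WAV file and return mono float32 samples in [-1, 1)."""
    with wave.open(wav_name, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM WAV files are supported")
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    data = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Average interleaved channels down to mono.
        data = data.reshape(-1, channels).mean(axis=1)
    return data


def create_audio_sample(wav_data, length=SAMPLE_LENGTH, rng=None):
    """Return a fixed-length clip: random crop if long enough, zero-pad if short."""
    if len(wav_data) >= length:
        if rng is None:
            rng = np.random.default_rng()
        start = rng.integers(0, len(wav_data) - length + 1)
        return wav_data[start:start + length]
    return np.pad(wav_data, (0, length - len(wav_data)))


def extract_spectrum(wav_data, n_fft=N_FFT, hop=HOP):
    """Return a complex STFT of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav_data) - n_fft) // hop
    frames = np.stack(
        [wav_data[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T
```

Saving these functions as `utility.py` next to the notebook would let `import utility as ut` resolve, though the output will only match the original helpers if their parameters happen to agree with the assumptions above.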

# **Related Articles:**

> * [VoxCeleb Datasets](https://analyticsindiamag.com/guide-to-voxceleb-datasets-for-visual-audio-of-human-speech/)

> * [FreeSound Datasets](https://analyticsindiamag.com/datasets-freesound-pytorch-research/)

> * [LibriSpeech Datasets](https://analyticsindiamag.com/librispeech-datasets/)

> * [Simple Transformers](https://analyticsindiamag.com/speech-classification-in-3-minutes/)

> * [Create your own Speech Classifier](https://analyticsindiamag.com/speech-classification-in-3-minutes/)