# Spoken Language Recognition Using Convolutional Neural Networks

_written by Joscha S. Rieber (Fraunhofer IAIS) in 2020_

## Dataset preparation

Please go to the [Mozilla Common Voice website](https://commonvoice.mozilla.org/) and download the full German, English, and Swedish datasets. The following cells thin out the datasets to make them easier to handle and then explore the data.
* Download the German, English, and Swedish datasets
* Extract them
* Define paths below

In [21]:
train = 'train'
test = 'test'

eng = 'english'
ger = 'german'
swe = 'swedish'

languages = [eng, ger, swe]
categories = [train, test]

original_dataset_paths = {}

original_dataset_paths[eng] = 'D:/voiceData/cv-corpus-10.0-2022-07-04/en/' # TODO: Adapt this folder!
original_dataset_paths[ger] = 'D:/voiceData/cv-corpus-10.0-2022-07-04/de/' # TODO: Adapt this folder!
original_dataset_paths[swe] = 'D:/voiceData/cv-corpus-10.0-2022-07-04/sv/' # TODO: Adapt this folder!

target_root_path = '../data/'

num_files_to_take_for_each_language = 2000 # 20000 is the maximum number of files for each language
train_rate = 0.8  # Use 80 % of the data for training and the rest for testing
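
The two settings above determine how many files end up in each split. The following is a minimal sketch of that arithmetic; the helper name `split_counts` is hypothetical and not part of the notebook.

```python
def split_counts(num_files, train_rate):
    """Return (train_count, test_count) for a total file count and a training fraction."""
    train_count = int(num_files * train_rate)
    # Derive the test count by subtraction so both counts always add up to num_files
    test_count = num_files - train_count
    return train_count, test_count


train_count, test_count = split_counts(2000, 0.8)
print(train_count, test_count)
```

Subtracting avoids a floating-point pitfall: `int(2000 * (1.0 - 0.8))` can evaluate to 399 rather than 400, which may be why the file-count check later in this notebook tolerates one missing file.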

In [22]:
target_root_path

'../data/'

In [23]:
import os

for lang in languages:
    # Fail with a descriptive error if a source dataset folder is missing
    if not os.path.isdir(original_dataset_paths[lang]):
        raise FileNotFoundError('Dataset folder not found: ' + original_dataset_paths[lang])
    # Create the target folders, e.g. ../data/train/english
    for category in categories:
        os.makedirs(target_root_path + category + '/' + lang, exist_ok=True)

### Check paths

If something goes wrong here, check the paths again and read the documentation of the GitHub repository to see how to set up your environment correctly.
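
When several paths are wrong at once, it can help to list all of them instead of stopping at the first failure. The sketch below assumes the layout used in this notebook (a `validated.tsv` file and a `clips` folder inside each language folder); the function name `find_missing_paths` and the example folder are hypothetical.

```python
from pathlib import Path


def find_missing_paths(dataset_paths):
    """Return every expected dataset file or folder that does not exist."""
    missing = []
    for lang, root in dataset_paths.items():
        root = Path(root)
        # Each language folder is expected to contain validated.tsv and clips/
        for required in (root, root / 'validated.tsv', root / 'clips'):
            if not required.exists():
                missing.append(str(required))
    return missing


# Example with a folder that does not exist (hypothetical path)
missing = find_missing_paths({'english': 'no/such/folder/en/'})
for path in missing:
    print('Missing:', path)
```

In the notebook, the call would be `find_missing_paths(original_dataset_paths)`.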

In [24]:
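import os

for lang in languages:
    if not os.path.isdir(original_dataset_paths[lang]):
        raise FileNotFoundError('Dataset folder not found: ' + original_dataset_paths[lang])
    for category in categories:
        target_folder = target_root_path + category + '/' + lang
        if not os.path.isdir(target_folder):
            raise FileNotFoundError('Target folder not found: ' + target_folder)

for lang in languages:
    if not os.path.isfile(original_dataset_paths[lang] + 'validated.tsv'):
        raise FileNotFoundError('validated.tsv not found in ' + original_dataset_paths[lang])
    if not os.path.isdir(original_dataset_paths[lang] + 'clips'):
        raise FileNotFoundError('clips folder not found in ' + original_dataset_paths[lang])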

Collect only `num_files_to_take_for_each_language` files whose duration is between 7.5 and 10 seconds.

<span style="color:red">Note that this process might take many hours!</span>
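
The duration criterion can be expressed independently of any audio library. The sketch below assumes a mono signal described by its sample count and sample rate; the helper name `duration_in_range` and the 48 kHz example rate are illustrative assumptions.

```python
def duration_in_range(num_samples, sample_rate, min_sec=7.5, max_sec=10.0):
    """Return True if the clip is strictly longer than min_sec and strictly shorter than max_sec."""
    duration_sec = num_samples / sample_rate
    return min_sec < duration_sec < max_sec


# An 8-second clip at an assumed 48 kHz sample rate is accepted
print(duration_in_range(8 * 48000, 48000))
# A clip of exactly 7.5 seconds is rejected because the bounds are exclusive
print(duration_in_range(int(7.5 * 48000), 48000))
```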

In [25]:
# If this goes wrong, check your environment and read the documentation

import librosa as lr
from glob import glob
from random import shuffle
from shutil import copy2
import numpy as np
import pandas as pd
import warnings
import soundfile as sf

In [None]:
def copy_audio_files_for_language(lang):
    
    print('')
    print('Copying files for language ' + lang + '...')
    print('')
    
    # Only take validated speech data
    df = pd.read_csv(original_dataset_paths[lang] + 'validated.tsv', sep='\t', low_memory=False)
    all_filenames = df['path'].tolist()
    shuffle(all_filenames)
    
    counter = 0
    
    category = train    
    
    # Process files
    for filename in all_filenames:
        file = original_dataset_paths[lang] + 'clips/' + filename
        try:
            # Load at the native sample rate and reject silent or multi-channel audio
            audio_segment, sample_rate = lr.load(file, sr=None)
            if np.count_nonzero(audio_segment) == 0:
                raise Exception('Audio is silent!')
            if audio_segment.ndim != 1:
                raise Exception('Audio signal has wrong number of dimensions: ' + str(audio_segment.ndim))
            # Pass the signal by keyword; newer librosa releases no longer accept it positionally
            duration_sec = lr.get_duration(y=audio_segment, sr=sample_rate)
        except Exception as e:
            print('WARNING! Error while loading file "' + file + '": ' + str(e) + ' - Skipping...')
            continue
        
        # Only copy audio files whose duration is between 7.5 and 10 seconds
        if 7.5 < duration_sec < 10.0:
            copy2(file, target_root_path + category + '/' + lang)
            counter += 1
        
        # Stop after collecting enough files
        if counter == int(num_files_to_take_for_each_language * train_rate):
            category = test
        if counter == num_files_to_take_for_each_language:
            break
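
The function above switches from the train to the test folder once enough files have been copied, and its random order changes on every run. A reproducible alternative is sketched below; it is a simplification rather than the notebook's method, because it splits a list of names up front and ignores the duration filter. The name `split_filenames` and the `seed` parameter are assumptions.

```python
import random


def split_filenames(filenames, train_rate, seed=0):
    """Shuffle a copy of filenames with a fixed seed and split it into (train, test)."""
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_rate)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical clip names for illustration
names = ['clip_' + str(i) + '.mp3' for i in range(10)]
train_files, test_files = split_filenames(names, 0.8)
print(len(train_files), len(test_files))
```

With a fixed seed, repeated runs select the same files, which makes later experiments easier to compare.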

Copy files to create the German language train and test datasets

In [64]:
warnings.simplefilter('ignore', UserWarning)

copy_audio_files_for_language(ger)

warnings.simplefilter('default', UserWarning)


Copying files for language german...



KeyboardInterrupt: 

Copy files to create the English language train and test datasets

In [6]:
warnings.simplefilter('ignore', UserWarning)

copy_audio_files_for_language(eng)

warnings.simplefilter('default', UserWarning)


Copying files for language english...



Copy files to create the Swedish language train and test datasets

In [None]:
warnings.simplefilter('ignore', UserWarning)

copy_audio_files_for_language(swe)

warnings.simplefilter('default', UserWarning)


Copying files for language swedish...


In [39]:
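# Try to read a single Swedish clip directly with soundfile.
# Older libsndfile builds cannot decode MP3, which leads to the error shown below;
# lr.load(file, sr=None), as used in copy_audio_files_for_language, is an alternative.
file = 'D:/voiceData/cv-corpus-10.0-2022-07-04/sv/clips/common_voice_sv-SE_18710562.mp3'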

audio_segment, sample_rate = sf.read(file)



RuntimeError: Error opening 'D:/voiceData/cv-corpus-10.0-2022-07-04/sv/clips/common_voice_sv-SE_18710562.mp3': File contains data in an unknown format.

### Check number of collected files

In [7]:
for category in categories:
    
    if category == train:
        num_files = int(num_files_to_take_for_each_language * train_rate)
    else:
        num_files = int(num_files_to_take_for_each_language * (1.0 - train_rate))
        
    for lang in languages:
        folder = target_root_path + category + '/' + lang + '/'
        all_files = glob(folder + '*.mp3')
        
        if len(all_files) < (num_files - 1):
            raise Exception('Folder \"' + folder + '\" only contains ' + str(len(all_files)) + ' files instead of ' + str(num_files) + '!')
            
print('Okay!')

Okay!


Now familiarize yourself with the dataset by listening to some of the files.

## Statistics

In [8]:
warnings.simplefilter('ignore', UserWarning)

for category in categories:
    for lang in languages:
        duration_sec = 0.0
        
        folder = target_root_path + category + '/' + lang + '/'
        all_files = glob(folder + '*.mp3')
        
        for file in all_files:
            # librosa >= 0.10 expects path=; older releases use filename= instead
            duration_sec += lr.get_duration(path=file)
            
        duration_h = duration_sec / 60.0 / 60.0
        print('Total duration of ' + lang + ' ' + category + ' is ' + str(round(duration_h, 1)) + ' h')
        
warnings.simplefilter('default', UserWarning)

Total duration of english train is 37.0 h
Total duration of german train is 37.0 h
Total duration of english test is 9.2 h
Total duration of german test is 9.3 h
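
Since every collected clip lasts between 7.5 and 10 seconds, the printed totals can be sanity-checked against the number of files. The sketch below computes the expected range in hours; the helper name `expected_hours_range` is hypothetical.

```python
def expected_hours_range(num_files, min_sec=7.5, max_sec=10.0):
    """Return (lowest, highest) possible total duration in hours for num_files clips."""
    return num_files * min_sec / 3600.0, num_files * max_sec / 3600.0


for num_files in (1600, 16000):
    low, high = expected_hours_range(num_files)
    print(num_files, 'files:', round(low, 1), 'to', round(high, 1), 'h')
```

By this estimate, a training total of about 37 h would fit roughly 16000 clips per language, so the output above was presumably produced with `num_files_to_take_for_each_language = 20000` rather than 2000. This is an inference from the arithmetic, not something the notebook states.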
