# Language-Recognition using ConvNets

_written by Joscha S. Rieber (Fraunhofer IAIS) in 2020_

## Dataset preparation

Please go to the [Mozilla Common Voice Website](https://voice.mozilla.org) and download the full German and English datasets. The scripts below thin out the datasets to make them more manageable and let us explore the data.
* Download German and English datasets
* Extract them
* Define paths below

In [1]:
train = 'train'
test = 'test'

eng = 'english'
ger = 'german'

languages = [eng, ger]
categories = [train, test]

original_dataset_paths = {}

original_dataset_paths[eng] = '/home/jrieber/Downloads/en/clips/' # TODO: Adapt this folder!
original_dataset_paths[ger] = '/home/jrieber/Downloads/de/clips/' # TODO: Adapt this folder!

target_root_path = '../data/'

num_files_to_take_for_each_language = 20000
train_rate = 0.8  # Use 80 % of the data for training and the rest for testing

### Check paths

If something goes wrong here, check the paths again and read the documentation of the GitHub repository to see how to set up your environment correctly.

In [2]:
import os

for lang in languages:
    if not os.path.isdir(original_dataset_paths[lang]):
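        # A bare `raise` outside an except block fails with an unhelpful RuntimeError
        raise FileNotFoundError('Source folder not found: ' + original_dataset_paths[lang])
    for category in categories:
        if not os.path.isdir(target_root_path + category + '/' + lang):
            raise FileNotFoundError('Target folder not found: ' + target_root_path + category + '/' + lang)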
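
The check above expects the target folders `../data/<category>/<language>` to exist already. If they are missing, a minimal sketch like the following could create them. The helper name `create_target_folders` is hypothetical and not part of the original notebook, and the setup instructions of the repository take precedence.

```python
import os

def create_target_folders(root, categories, languages):
    """Create root/<category>/<language> for every combination (hypothetical helper)."""
    created = []
    for category in categories:
        for lang in languages:
            folder = os.path.join(root, category, lang)
            os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
            created.append(folder)
    return created
```

With the variables defined earlier, this would be called as `create_target_folders(target_root_path, categories, languages)`.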

### Collect only num_files_to_take_for_each_language files whose duration is between 7.5 and 10 seconds

In [3]:
# If this goes wrong, check your environment and read the documentation

import librosa as lr
from glob import glob
from random import shuffle
from shutil import copy2
import numpy as np
import warnings

In [8]:
warnings.simplefilter('ignore', UserWarning)

for lang in languages:    
    print('')
    print('Copying files for language ' + lang + '...')
    print('')
    
    all_files = glob(original_dataset_paths[lang] + '*.mp3')
    shuffle(all_files)
    
    counter = 0
    
    category = train
    
    for file in all_files:
        try:
            audio_segment, sample_rate = lr.load(file)
            if np.count_nonzero(audio_segment) == 0:
                raise Exception('Audio is silent!')
            if audio_segment.ndim != 1:
                raise Exception('Audio signal has wrong number of dimensions: ' + str(audio_segment.ndim))
            # Pass the signal by keyword; recent librosa versions no longer accept it positionally
            duration_sec = lr.core.get_duration(y=audio_segment, sr=sample_rate)
        except Exception as e:
            print('WARNING! Error while loading file "' + file + '": ' + str(e) + ' - Skipping...')
            continue
        
        # Only copy audio files with a certain duration
        if 7.5 < duration_sec < 10.0:
            copy2(file, target_root_path + category + '/' + lang)
            counter += 1
        
        # Switch to the test set once enough training files have been collected
        if counter == int(num_files_to_take_for_each_language * train_rate):
            category = test
        # Stop after collecting enough files overall
        if counter == num_files_to_take_for_each_language:
            break
            
warnings.simplefilter('default', UserWarning)


Copying files for language german...
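
The loop above fills the training folder first and switches to the test folder once `int(num_files_to_take_for_each_language * train_rate)` files have been copied. The following sketch isolates that rule as a small function. `category_for_index` is a hypothetical name used only for illustration, and the index counts accepted files starting at zero.

```python
def category_for_index(index, num_files, train_rate):
    """Return 'train' or 'test' for the index-th accepted file (0-based)."""
    num_train = int(num_files * train_rate)
    return 'train' if index < num_train else 'test'

# Count how many of 20000 accepted files land in each category
counts = {'train': 0, 'test': 0}
for i in range(20000):
    counts[category_for_index(i, 20000, 0.8)] += 1

print(counts)  # {'train': 16000, 'test': 4000}
```

Because the files are shuffled before the loop, this positional split should still yield a random partition of the accepted files.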



### Check number of collected files

In [9]:
for category in categories:
    
    if category == train:
        num_files = int(num_files_to_take_for_each_language * train_rate)
    else:
        num_files = int(num_files_to_take_for_each_language * (1.0 - train_rate))
        
    for lang in languages:
        folder = target_root_path + category + '/' + lang + '/'
        all_files = glob(folder + '*.mp3')
        
        # Allow one file of slack: int() truncates the float product, so num_files can be off by one
        if len(all_files) < (num_files - 1):
            raise Exception('Folder "' + folder + '" only contains ' + str(len(all_files)) + ' files instead of ' + str(num_files) + '!')
            
print('Okay!')

Okay!
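
The check above tolerates one missing file (`num_files - 1`). A likely reason, stated here as an assumption rather than the author's documented intent, is floating-point truncation: `1.0 - train_rate` is slightly below 0.2, so `int(...)` yields 3999 instead of 4000 for the test set. The snippet below reproduces the arithmetic and shows an integer-only alternative. The variable names are chosen for illustration.

```python
num_files_total = 20000
train_rate = 0.8

num_train = int(num_files_total * train_rate)
num_test_float = int(num_files_total * (1.0 - train_rate))  # truncates 3999.99...
num_test_exact = num_files_total - num_train                # avoids the float error

print(num_train, num_test_float, num_test_exact)  # 16000 3999 4000
```

Deriving the test count by subtraction keeps both counts consistent with the total, so no slack would be needed in the comparison.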


Now familiarize yourself with the dataset by listening to some of the files.

## Statistics

In [10]:
warnings.simplefilter('ignore', UserWarning)

for category in categories:
    for lang in languages:
        duration_sec = 0.0
        
        folder = target_root_path + category + '/' + lang + '/'
        all_files = glob(folder + '*.mp3')
        
        for file in all_files:
            # Note: librosa 0.10 and later name this parameter `path` instead of `filename`
            duration_sec += lr.core.get_duration(filename=file)
            
        duration_h = duration_sec / 60.0 / 60.0
        print('Total duration of ' + lang + ' ' + category + ' is ' + str(round(duration_h, 1)) + ' h')
        
warnings.simplefilter('default', UserWarning)

Total duration of english train is 37.1 h
Total duration of german train is 37.1 h
Total duration of english test is 9.3 h
Total duration of german test is 9.3 h
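
As a plausibility check on the figures above, the mean clip length can be derived from the total duration and the file counts (16000 training and 4000 test files per language, which follows from the settings at the top). The values below are copied from the printed output. The result should fall inside the 7.5 to 10 second window used when copying.

```python
total_hours = {'train': 37.1, 'test': 9.3}   # taken from the output above
num_files = {'train': 16000, 'test': 4000}   # 80 % / 20 % of 20000 files

mean_sec = {}
for category in total_hours:
    # hours -> seconds, divided by the number of clips
    mean_sec[category] = total_hours[category] * 3600.0 / num_files[category]
    print(category, round(mean_sec[category], 2), 's per clip on average')
```

Both averages come out at roughly 8.3 to 8.4 seconds, which is consistent with the duration filter.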
