# Preprocess Audio Files
This notebook contains the code for preprocessing the audio files before training.

Ten-second WAV clips at a 48 kHz sample rate were chosen because that matches the audio used to train the L3 model, which I plan to use for feature embeddings.

The sample dataset only includes training data, so we will split it into training and testing sets.  If the model proves viable, we will download more data from xeno-canto.org.
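
The split itself is not part of this notebook. As a rough sketch of one way to do it, the helper below holds out a fraction of each species' files for testing. The function name `split_by_species`, the 20% test fraction, and the fixed seed are all hypothetical choices for illustration, and it assumes the parent directory of each file is its species label.

```python
import os
import random
from collections import defaultdict


def split_by_species(paths, test_fraction=0.2, seed=0):
    """Split file paths into (train, test) lists, stratified by parent directory."""
    by_species = defaultdict(list)
    for p in paths:
        # Assumption: the parent directory name is the species label.
        by_species[os.path.basename(os.path.dirname(p))].append(p)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for species in sorted(by_species):
        files = sorted(by_species[species])
        rng.shuffle(files)
        if len(files) < 2:
            # a single recording cannot be split, so keep it for training
            n_test = 0
        else:
            # hold out at least one file, but always leave one for training
            n_test = min(len(files) - 1, max(1, round(len(files) * test_fraction)))
        test.extend(files[:n_test])
        train.extend(files[n_test:])
    return train, test
```

Stratifying by species keeps rare species from ending up entirely in one of the two sets, which a plain random split over all files could do.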

In [1]:
import sox # must install sox locally if you want mp3 support
import os
import glob
import shutil
import multiprocessing
from joblib import Parallel, delayed
from tqdm import tqdm
import numpy as np
import pandas as pd

In [2]:
input_dir = "data/train_audio"
output_dir = 'data/train_10sec'

# create output dir (and any missing parent dirs) if it does not exist
os.makedirs(output_dir, exist_ok=True)

In [3]:
# get a list of all mp3 files
audio_files = glob.glob(os.path.join(input_dir, '*/*.mp3'))
audio_files[:5]

['data/train_audio/olsfly/XC386256.mp3',
 'data/train_audio/olsfly/XC484154.mp3',
 'data/train_audio/olsfly/XC239498.mp3',
 'data/train_audio/olsfly/XC368006.mp3',
 'data/train_audio/olsfly/XC156193.mp3']
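
The conversion step in this notebook writes every WAV into one flat output directory, so the species folder is no longer part of the output path. A minimal sketch of how the labels could be kept, assuming the folder name (for example `olsfly`) is the species code and that recording IDs are unique across folders; `label_table` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
import os


def label_table(paths):
    """Map each recording ID (file name without extension) to its folder name."""
    labels = {}
    for p in paths:
        rec_id = os.path.splitext(os.path.basename(p))[0]
        # Assumption: the parent directory name is the species code.
        labels[rec_id] = os.path.basename(os.path.dirname(p))
    return labels
```

Calling `label_table(audio_files)` before the conversion would give a dictionary that can be turned into a pandas DataFrame and saved next to the WAV files.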

# Convert files to 10 second WAV files at 48 kHz

The audio files are processed in two steps because padding an mp3 and then saving it as wav results in imprecise final durations.  We need files to be exactly 10 seconds long with a sample rate of 48 kHz, for a total of 48000 * 10 = 480000 samples.

The first pass resamples the audio to 48 kHz, cuts clips longer than 10 seconds down to 11 seconds, and saves them as wav files.

The second pass pads or crops the files to exactly 10 seconds long.
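
To make the arithmetic concrete, here is a small sketch of the decision the second pass has to make, expressed in samples rather than seconds. The constant names and `plan_second_pass` are hypothetical and only illustrate the logic; the notebook itself works with durations reported by sox.

```python
SAMPLE_RATE = 48000    # Hz, as stated above
TARGET_SECONDS = 10    # clip length, as stated above
TARGET_SAMPLES = SAMPLE_RATE * TARGET_SECONDS  # 480000 samples per clip


def plan_second_pass(num_samples):
    """Return the action and sample count needed to reach TARGET_SAMPLES."""
    diff = TARGET_SAMPLES - num_samples
    if diff > 0:
        return ('pad', diff)     # too short: append this many samples of silence
    if diff < 0:
        return ('trim', -diff)   # too long: drop this many samples from the end
    return ('keep', 0)           # already the right length
```

A clip that the first pass cut to 11 seconds has 528000 samples, so `plan_second_pass(528000)` reports that 48000 samples need to be trimmed.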

In [4]:
# get the number of cpu cores available
num_cores = multiprocessing.cpu_count()
print(num_cores)

4


In [5]:
def convert_to_wav(af):
    # name the output after the input file, swapping the extension for .wav
    wav_name = os.path.splitext(os.path.basename(af))[0] + '.wav'
    outfile = os.path.join(output_dir, wav_name)
    
    # skip files that were already converted in an earlier run
    if not os.path.exists(outfile):
        duration = sox.file_info.duration(af) # gets duration in seconds
        tfm = sox.Transformer()
        tfm.rate(48000, quality='v')

        if duration > 10.0:
            # crop to slightly longer than 10 seconds.
            # cannot crop precisely from mp3 so we will have
            # a second round of cropping
            tfm.trim(start_time=0.0, end_time=11.0)
        
        tfm.build(input_filepath=af, output_filepath=outfile)

In [6]:
# Process the conversion using all cores in parallel to save time
_ = Parallel(n_jobs=num_cores, verbose=1)(delayed(convert_to_wav)(i) for i in audio_files)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 123 tasks      | elapsed:    0.7s
[Parallel(n_jobs=4)]: Done 21375 out of 21375 | elapsed:    1.4s finished


In [7]:
# get a list of all the .wav files
wav_audio_files = glob.glob(os.path.join(output_dir, '*.wav'), recursive=False)
wav_audio_files[:5]

['data/train_10sec/XC172660.wav',
 'data/train_10sec/XC357464.wav',
 'data/train_10sec/XC494106.wav',
 'data/train_10sec/XC406464.wav',
 'data/train_10sec/XC317040.wav']

In [8]:
def crop_pad_audio(af):

    duration = sox.file_info.duration(af) # gets duration in seconds
    basename = os.path.basename(af)
    outfile = os.path.join(output_dir, basename)
    
    
    if duration < 10.0:
        # first move the working file to pwd
        os.rename(af, basename)
        
        tfm = sox.Transformer()
        tfm.pad(start_duration=0.0, end_duration=(10.0 - duration))
        tfm.build(input_filepath=basename, output_filepath=outfile)
        
        # remove the old file
        os.remove(basename)        
    
    elif duration > 10.0:
        # first move the working file to pwd
        os.rename(af, basename)
    
        tfm = sox.Transformer()
        tfm.trim(start_time=0.0, end_time=10.0)
        tfm.build(input_filepath=basename, output_filepath=outfile)
    
        # remove the old file
        os.remove(basename)
      

In [9]:
  
print(f'Starting to process {len(wav_audio_files)} files')
_ = Parallel(n_jobs=num_cores, verbose=1)(delayed(crop_pad_audio)(i) for i in wav_audio_files)

Starting to process 21375 files


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 128 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 728 tasks      | elapsed:    6.3s
[Parallel(n_jobs=4)]: Done 1728 tasks      | elapsed:   14.7s
[Parallel(n_jobs=4)]: Done 3128 tasks      | elapsed:   26.0s
[Parallel(n_jobs=4)]: Done 4928 tasks      | elapsed:   42.1s
[Parallel(n_jobs=4)]: Done 7128 tasks      | elapsed:   60.0s
[Parallel(n_jobs=4)]: Done 9728 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 12728 tasks      | elapsed:  1.8min
[Parallel(n_jobs=4)]: Done 16128 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done 19928 tasks      | elapsed:  3.2min
[Parallel(n_jobs=4)]: Done 21368 out of 21375 | elapsed:  3.5min remaining:    0.1s
[Parallel(n_jobs=4)]: Done 21375 out of 21375 | elapsed:  3.5min finished


In [13]:
# Sanity check
# Make sure all the files are exactly 10 seconds long

def check_length(pf):
    duration = sox.file_info.duration(pf)
    sample_rate = sox.file_info.sample_rate(pf)
    if duration != 10.0  or sample_rate != 48000:
        # reuse the sample rate read above instead of querying sox again
        return (duration, 
                sample_rate,
                sox.file_info.bitrate(pf),
                pf)
    return False

print(f'Starting to process {len(wav_audio_files)} files')
errors = Parallel(n_jobs=num_cores, verbose=1)(delayed(check_length)(i) for i in wav_audio_files)

Starting to process 21375 files


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  76 tasks      | elapsed:    1.6s
[Parallel(n_jobs=4)]: Done 376 tasks      | elapsed:    8.2s
[Parallel(n_jobs=4)]: Done 876 tasks      | elapsed:   17.6s
[Parallel(n_jobs=4)]: Done 1576 tasks      | elapsed:   29.0s
[Parallel(n_jobs=4)]: Done 2476 tasks      | elapsed:   44.8s
[Parallel(n_jobs=4)]: Done 3576 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done 4876 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 6376 tasks      | elapsed:  1.9min
[Parallel(n_jobs=4)]: Done 8076 tasks      | elapsed:  2.5min
[Parallel(n_jobs=4)]: Done 9976 tasks      | elapsed:  3.0min
[Parallel(n_jobs=4)]: Done 12076 tasks      | elapsed:  3.7min
[Parallel(n_jobs=4)]: Done 14376 tasks      | elapsed:  4.3min
[Parallel(n_jobs=4)]: Done 16876 tasks      | elapsed:  5.1min
[Parallel(n_jobs=4)]: Done 19576 tasks      | elapsed:  5.9min
[Parallel(n_jobs=4)]: Done 21375 out of 21375 | elapsed:

In [14]:
# count errors
filtered_errors = [x for x in errors if x]

if len(filtered_errors) > 0:
    print('There were errors')
    print(filtered_errors)
else:
    print('All files were 10.0 seconds long.  Ready to move on.')

All files were 10.0 seconds long.  Ready to move on.
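
The check above compares a floating-point duration against 10.0. As an alternative sketch, the frame count can be compared as an integer using Python's standard `wave` module. This assumes the files are plain PCM WAV, which is what `wave` can read; `wav_shape` and `is_exact_clip` are hypothetical helper names, not part of the notebook.

```python
import wave


def wav_shape(path):
    """Return (sample_rate, number_of_frames) read from a PCM WAV header."""
    with wave.open(path, 'rb') as wf:
        return wf.getframerate(), wf.getnframes()


def is_exact_clip(path, sample_rate=48000, seconds=10):
    """True if the file has exactly sample_rate * seconds frames at sample_rate."""
    rate, frames = wav_shape(path)
    return rate == sample_rate and frames == sample_rate * seconds
```

Counting frames avoids any rounding in the reported duration, since a 10 second clip at 48 kHz must have exactly 480000 frames per channel.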
