# Convert sph to wav

Purpose of this notebook:
1. Convert the TEDLIUM sph files to wav.
    - You can pick how much of each sph file to convert.
2. Use pyAudioAnalysis to sample the audio and extract audio features for all the wav files.
3. Use pickle to write a dictionary with the audio features for all the wav files.


The audio files come from TED-LIUM 2, a corpus created by the Laboratoire d'Informatique de l'Université du Maine from the audio talks and transcriptions available on the TED website in 2014.

Before running this notebook, go to the TEDLIUM II webpage, register, and download the dataset containing all the sph files.



Set segmentLengths to the number of seconds of each sph file to convert to wav. Or, run 

Currently, I'm extracting seconds 60 to 300 of each talk (240 seconds, starting one minute in).

The 240-second files are written to data/wav240.
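
The relation between a segment length, the seconds that get converted, and the output folder can be sketched as follows. `segment_plan` is a hypothetical helper, not part of this notebook; the 60-second start offset and the `data/wav<length>` naming mirror the values used in the cells below.

```python
def segment_plan(length_s, start_s=60, out_root='data/'):
    """Return (start, stop, output folder) for a segment of length_s seconds."""
    stop_s = start_s + length_s                  # conversion stops here (seconds)
    out_dir = out_root + 'wav' + str(length_s)   # e.g. data/wav240
    return start_s, stop_s, out_dir

print(segment_plan(240))  # (60, 300, 'data/wav240')
```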

In [2]:
import pandas as pd
import glob2 as glob
from sphfile import SPHFile
import aux_code.functions as mfc

In [5]:
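def convert_sph_wav(file, start, stop, suffix, newPath):
    """Write seconds start..stop of an sph file to <newPath>wav<suffix>/<name>.wav.

    Returns the sph header metadata with the talk name added as 'title'.
    """
    # talk name: everything after 'sph/' without the .sph extension
    name = file.split('sph/')[1].split('.sph')[0]
    sphObj = SPHFile(file)
    sphMeta = sphObj.format
    sphMeta['title'] = name
    # write out a wav file with content from start to stop seconds
    sphObj.write_wav(newPath + 'wav' + suffix + '/' + name + '.wav', start, stop)
    return sphMeta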

In [6]:
# I manually combined the sph files from the "TEDLIUM_release2/train/sph" and "TEDLIUM_release2/test/sph" folders
# and moved them into "TEDLIUM_release2/sph".
# Create a data/wav240 folder before converting; the wav files will be written there.

myPath = 'data/TEDLIUM_release2/sph/'
sphFiles = glob.glob(myPath + "*.sph")  # read all the sph file names into a list
print('Sample file name:',sphFiles[0].split('sph/')[0])
print('There are',len(sphFiles), 'TED talk audio files')

Sample file name: /Media/TEDLIUM_release2/
There are 1506 TED talk audio files
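
The manual step of merging the train and test sph folders could also be scripted. This is a minimal sketch, assuming the folder layout named in the comments above. `merge_sph_folders` is a hypothetical helper, and it copies the files rather than moving them.

```python
import glob
import os
import shutil

def merge_sph_folders(source_dirs, target_dir):
    """Copy every .sph file from source_dirs into target_dir; return the number copied."""
    os.makedirs(target_dir, exist_ok=True)
    copied = 0
    for source_dir in source_dirs:
        for sph_path in sorted(glob.glob(os.path.join(source_dir, '*.sph'))):
            target = os.path.join(target_dir, os.path.basename(sph_path))
            if not os.path.exists(target):  # leave files that are already there untouched
                shutil.copy2(sph_path, target)
                copied += 1
    return copied

# Example call (paths are assumptions based on the comments in this notebook):
# merge_sph_folders(['data/TEDLIUM_release2/train/sph',
#                    'data/TEDLIUM_release2/test/sph'],
#                   'data/TEDLIUM_release2/sph')
```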


In [9]:
# Convert sph to wav and read the metadata from each sph header (sampling rate, etc.)
# The converted files go in the corresponding wav folder

newPath = 'data/'

segmentLengths = [240]

for ii in segmentLengths:  # these are various lengths to be extracted
    print('ii=', ii)
    for ix, file in enumerate(sphFiles):
        #print(ix, 'Converting ', file.split('sph/')[1])
        suffix = str(ii)
        # Run load_waveform if you wish to convert the full audio length
        # Fs, x = mfc.load_waveform(wavName, path=wavPath)
        # length = len(x)/Fs
        meta = convert_sph_wav(file, 60, 60+ii, suffix, newPath)  # convert seconds 60 to 60+ii of each talk
        
#Beep to alert when finished
mfc.beep(4);    

ii= 240
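
The loop above keeps only the metadata of the last file, because `meta` is overwritten on every iteration. A minimal sketch of one way to keep all of it, keyed by talk name, follows. `collect_metadata` and `fake_convert` are hypothetical names; `fake_convert` only stands in for `convert_sph_wav` so that the sketch runs without audio files, and its values are made up.

```python
def collect_metadata(files, convert):
    """Call convert(file) for each file and keep every returned dict, keyed by its 'title'."""
    all_meta = {}
    for file in files:
        meta = convert(file)
        all_meta[meta['title']] = meta
    return all_meta

def fake_convert(file):
    """Stand-in for convert_sph_wav: returns a made-up header instead of reading audio."""
    title = file.split('/')[-1].replace('.sph', '')
    return {'title': title, 'sample_rate': 16000}

all_meta = collect_metadata(['sph/AaronHuey_2010X.sph', 'sph/AaronKoblin_2011.sph'], fake_convert)
print(sorted(all_meta))  # ['AaronHuey_2010X', 'AaronKoblin_2011']
```

In the notebook, the second argument could be something like `lambda f: convert_sph_wav(f, 60, 60+ii, suffix, newPath)`.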


# Use pyAudioAnalysis to get audio features for every available wav file.

Reads the wav files from data/wav<suffix>/ and, after each batch, writes features<suffix>.pickle, featureTime<suffix>.pickle, and skippedList<suffix>.pickle to data/.

Run once for each sampling length, renaming the previous folders first.

For a given sampling length, audio feature extraction only needs to be run once, so keep it commented out afterwards.

Feature extraction is slow, so avoid re-running it unnecessarily: an earlier sample run projected about 40 minutes for 500 talks, and the run logged below projects about 60 minutes for all 1506 talks.

Skip this section if the features can be loaded from a pickle file.

Run it once for each wav+suffix directory.
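
Checking for an existing pickle before extracting can be sketched as below. `load_pickle_if_present` is a hypothetical helper; the file name in the commented call follows the `'data/'+objName+suffix+'.pickle'` pattern used by the extraction cell.

```python
import os
import pickle

def load_pickle_if_present(path):
    """Return the unpickled object stored at path, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# In this notebook the call might look like:
# features = load_pickle_if_present('data/features240.pickle')
# if features is None, run the extraction cell; otherwise skip it.
```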

In [39]:
import timeit as ti
import time
import pickle

In [37]:
import os

# audio file names
suffix = '240'
wavPath = 'data/wav' + suffix + '/'
wavFiles = glob.glob(wavPath + '*.wav')
# talk name = first two underscore-separated parts of the file name, e.g. 'AaronHuey_2010X'
allWavs = ['_'.join(os.path.splitext(os.path.basename(wavFile))[0].split('_')[:2]) for wavFile in wavFiles]
print(len(allWavs))
allWavs[0:3]

1506


['911Mothers_2010W', 'AaronHuey_2010X', 'AaronKoblin_2011']

In [38]:
# This cell extracts audio features from all the talks in allWavs (allMatches can be used instead).
# allMatches is a list of wav filenames for talks that are also included in the metadata set.
# After each batch it writes (pickles) two dictionaries and a list to the data folder.

skippedList = []   # talks whose features could not be extracted
features = {}      # a dictionary of fileName and its audio features
featureTime = {}   # a dictionary of fileName and the time steps of its features

batchSize = 10
# round up so that the last, partial batch is processed too
numBatches = (len(allWavs) + batchSize - 1) // batchSize

print('suffix', suffix)
for ii in range(numBatches):  # process 10 talks at a time
    print('processing files', ii*10, '-', ii*10+10)
    namesBatch = allWavs[ii*10 : ii*10+10]
    #print(*namesBatch[0:3], sep='\n')
    startTimer = ti.default_timer()
    for wavName in namesBatch:
        try:
            # Determining segment length
            Fs, x = mfc.load_waveform(wavName, path=wavPath)
            length = len(x)/Fs
            windowSize = int(length*9/10)
            stepSize = length - windowSize
            # Extracting features
            features[wavName], featureTime[wavName] = mfc.calc_audio_features(wavName, wavPath, windowSize, stepSize)
        except (AttributeError, TypeError):
            print('                                   skipped', wavName)
            skippedList.append(wavName)
    # write to file after a completed batch
    print('pickling up to ii=', ii*10+10)
    for objName, obj in [('features', features), ('featureTime', featureTime), ('skippedList', skippedList)]:
        with open('data/'+objName+suffix+'.pickle', 'wb') as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    time.sleep(5)  #adding this break made process take less time
    stopTimer = ti.default_timer()
    # projected run time for all talks, extrapolated from this batch
    print('  ', round((stopTimer - startTimer)/len(namesBatch)*len(allWavs)/60, 1), ' minutes for', len(allWavs), ' talks')
mfc.beep(3); # Alert when code finishes running

# More details on calc_audio_features are available in aux_code/functions and in the pyAudioAnalysis files.
# For each audio segment length, windowSize + stepSize <= length.
# For example, for an audio segment of 10 seconds and a stepSize of 1 second,
# the largest window size is 9 seconds.
# Then two 9-second samplings happen, the first starting at t=0 sec and the second at t=1 sec.

# In each sampling, such as the 9-second sampling, there are 10 subsamples.
# The output in "features" is the average of the 10 subsamples.

# For a 720 s audio, a window of 240 and a step size of 120, I got 6 rows per talk.
# The features array for a talk has shape (34, z): 34 features for z time steps.
# The matching featureTime has length z.

# Sample output
# 2 39.83586265918954  minutes for 500 talks
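
The window and step arithmetic used in the extraction cell can be checked on its own with the sketch below. Both function names are hypothetical. `full_window_starts` counts only windows that fit entirely inside the clip; the feature extractor may treat the end of a clip differently, so the number of rows returned by `calc_audio_features` can differ.

```python
def window_plan(length_s):
    """Mirror the cell above: window = 90% of the clip (truncated to whole seconds), step = the rest."""
    window_s = int(length_s * 9 / 10)
    step_s = length_s - window_s
    return window_s, step_s

def full_window_starts(length_s, window_s, step_s):
    """Start times of the windows that fit entirely inside a clip of length_s seconds."""
    if step_s <= 0:
        raise ValueError('step_s must be positive')
    starts = []
    t = 0
    while t + window_s <= length_s:
        starts.append(t)
        t += step_s
    return starts

print(window_plan(240))                   # (216, 24)
print(full_window_starts(10, 9, 1))       # [0, 1]
print(full_window_starts(240, 216, 24))   # [0, 24]
```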

suffix 240
processing files 0 - 10
pickling up to ii= 0
   60.5  minutes for 1506  talks
processing files 10 - 20
pickling up to ii= 1
   63.6  minutes for 1506  talks
processing files 20 - 30
pickling up to ii= 2
   63.7  minutes for 1506  talks
processing files 30 - 40
pickling up to ii= 3
   61.2  minutes for 1506  talks
processing files 40 - 50
pickling up to ii= 4
   60.6  minutes for 1506  talks
processing files 50 - 60
pickling up to ii= 5
   60.5  minutes for 1506  talks
processing files 60 - 70
pickling up to ii= 6
   60.5  minutes for 1506  talks
processing files 70 - 80
pickling up to ii= 7
   60.8  minutes for 1506  talks
processing files 80 - 90
pickling up to ii= 8
   59.9  minutes for 1506  talks
processing files 90 - 100
pickling up to ii= 9
   59.1  minutes for 1506  talks
processing files 100 - 110
pickling up to ii= 10
   60.9  minutes for 1506  talks
processing files 110 - 120
pickling up to ii= 11
   60.8  minutes for 1506  talks
processing files 120 - 130
pickling

pickling up to ii= 92
   59.7  minutes for 1506  talks
processing files 930 - 940
pickling up to ii= 93
   61.6  minutes for 1506  talks
processing files 940 - 950
pickling up to ii= 94
   67.1  minutes for 1506  talks
processing files 950 - 960
pickling up to ii= 95
   62.5  minutes for 1506  talks
processing files 960 - 970
pickling up to ii= 96
   62.5  minutes for 1506  talks
processing files 970 - 980
pickling up to ii= 97
   62.8  minutes for 1506  talks
processing files 980 - 990
Error: file not found or other I/O error. (DECODING FAILED)
                                   skipped MurrayGellMann_Language
pickling up to ii= 98
   54.5  minutes for 1506  talks
processing files 990 - 1000
pickling up to ii= 99
   60.9  minutes for 1506  talks
processing files 1000 - 1010
pickling up to ii= 100
   61.2  minutes for 1506  talks
processing files 1010 - 1020
pickling up to ii= 101
   60.4  minutes for 1506  talks
processing files 1020 - 1030
pickling up to ii= 102
   60.5  minutes for 