# Data frame optimization

The heaviest task we want to perform on the input data frame is appending audio snippets.
This involves opening a `.wav` file for each row.

Some of these rows point to the same `.wav` file, so we'll make sure each file is opened only once.
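
One alternative way to avoid reopening a file is to memoize the loader. The following is only a minimal sketch, not what this notebook does further down: `cached_audio_from_key` and `load_calls` are hypothetical names, and the function body is a placeholder standing in for the real file read.

```python
from functools import lru_cache

load_calls = []  # records every real load, for illustration only

@lru_cache(maxsize=None)
def cached_audio_from_key(key):
    """Hypothetical stand-in for a loader: pretends to read the file for `key`."""
    load_calls.append(key)
    return ("samples for", key)  # placeholder for the decoded audio array

# Four rows, but only two distinct files
keys = [
    "/german1/5298",
    "/german1/5298",
    "/sambas1/SBS-20111031",
    "/sambas1/SBS-20111031",
]
audios = [cached_audio_from_key(key) for key in keys]

print(load_calls)  # ['/german1/5298', '/sambas1/SBS-20111031']
```

Note that an unbounded cache keeps every loaded file in memory, which may be a problem with long recordings; looping over the unique keys keeps only one file loaded at a time.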

## Input dataframe

In [18]:
from pandas import read_csv

In [19]:
df = read_csv("test_input.csv")
df.head()
df = df.head(4) # Work on a small subset for now
df

Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890


## Auxiliary functions

This adapter will help us convert our syntax (using keys) into librosa's syntax (using filenames).

In [20]:
def filename_from_key(key, data_folder = "data", ext = ".wav"):
    """ Takes the key, returns the filename """
    return data_folder + key + ext #TODO: consider improving this using os.path

### Examples

In [21]:
filename_from_key("/catalan1/ca_f02a_m05a_und")

'data/catalan1/ca_f02a_m05a_und.wav'
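
Regarding the TODO above, here is one possible sketch of a path-based variant. The name `filename_from_key_pathlib` is hypothetical, and it assumes that keys are always meant to be relative to the data folder. The leading slash has to be stripped first, because joining a path with an absolute path discards the left-hand side.

```python
from pathlib import PurePosixPath

def filename_from_key_pathlib(key, data_folder="data", ext=".wav"):
    """ Hypothetical variant of filename_from_key built on pathlib """
    # Keys start with "/", so make them relative before joining;
    # otherwise the data folder would be dropped from the result.
    relative_key = key.lstrip("/")
    return str(PurePosixPath(data_folder) / (relative_key + ext))

print(filename_from_key_pathlib("/catalan1/ca_f02a_m05a_und"))
# data/catalan1/ca_f02a_m05a_und.wav
```

`PurePosixPath` is used here so that the result keeps forward slashes on every platform, matching the output shown above.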

## Extract audio features

In [22]:
import librosa

In [23]:
def audio_from_key(key, sr = None, **kwargs):
    """ Equivalent to librosa.core.load, but works with keys instead of with filenames """
    audio, rate = librosa.core.load(filename_from_key(key), sr=sr, **kwargs)
    audio = audio.astype('float32')
    return audio # We'll ignore the rate in this function

def samplerate_from_key(key, **kwargs):
    """ Equivalent to librosa.get_samplerate, but works with keys instead of with filenames """
    return librosa.get_samplerate(filename_from_key(key), **kwargs)

def audio_from_keys(keys, **kwargs):
    return [audio_from_key(key, **kwargs) for key in keys]

def samplerate_from_keys(keys, **kwargs):
    return [samplerate_from_key(key, **kwargs) for key in keys]

def subset_audio(audio, start_time, end_time, rate):
    """ Takes the audio samples, returns those between start_time and end_time (in seconds) """
    return audio[int(start_time * rate) : int(end_time * rate)]

### Example of usage

In [24]:
audio_from_key("/catalan1/ca_f02a_m05a_und", sr = None)

array([-0.00175476, -0.00236511, -0.00218201, ...,  0.        ,
        0.        ,  0.        ], dtype=float32)
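
To illustrate how `subset_audio` turns times into sample indices, here is a small self-contained sketch on made-up data. A plain list stands in for the audio array (an assumption made to keep the example dependency-free; slicing a NumPy array works the same way), and each fake sample stores its own index so the result is easy to read.

```python
def subset_audio(audio, start_time, end_time, rate):
    return audio[int(start_time * rate) : int(end_time * rate)]

rate = 8000                    # samples per second
audio = list(range(2 * rate))  # 2 s of fake audio; sample i has value i

snippet = subset_audio(audio, start_time=0.5, end_time=0.75, rate=rate)

# 0.25 s at 8000 Hz is 2000 samples, starting at index 0.5 * 8000 = 4000
print(len(snippet), snippet[0], snippet[-1])  # 2000 4000 5999
```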

In [25]:
df['rate'] = samplerate_from_keys(df['key'])
df


Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid,rate
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960,8000
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870,8000
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100,96000
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890,96000


In [26]:
keys = df['key'].unique()
for key in keys:
    # Open the audio file only once per file (as opposed to once per row)
    audio = audio_from_key(key)

    # Extract and append the relevant audio snippet
    aux = df[df['key'] == key]
    for i, row in aux.iterrows():
        #print(row)
        audio_snippet = subset_audio(audio, row['start_time'], row['end_time'], row['rate'])
        print(key)
        print(audio)
        print(audio_snippet)
    
    #audio_snippets = [subset_audio(audio, row['start_time'], row['end_time'], row['rate']) for _, row in aux.iterrows()]

/german1/5298
[-0.00048828 -0.00024414 -0.00048828 ...  0.00097656 -0.00036621
 -0.00097656]
[-0.00012207 -0.00061035 -0.00085449 ...  0.00012207  0.00012207
  0.00012207]
/german1/5298
[-0.00048828 -0.00024414 -0.00048828 ...  0.00097656 -0.00036621
 -0.00097656]
[-0.00097656 -0.00085449 -0.00109863 ...  0.00274658  0.003479
 -0.00134277]
/sambas1/SBS-20111031
[-0.00515747 -0.0050354  -0.00457764 ...  0.00054932  0.00146484
  0.00146484]
[ 0.00845337  0.00848389  0.00723267 ... -0.02545166 -0.02685547
 -0.02685547]
/sambas1/SBS-20111031
[-0.00515747 -0.0050354  -0.00457764 ...  0.00054932  0.00146484
  0.00146484]
[-0.01296997 -0.01293945 -0.01101685 ...  0.0005188   0.00253296
  0.00265503]
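
The loop above only prints the snippets. One way to actually append them to the data frame is sketched below with `groupby`, under some loud assumptions: `fake_audio_from_key` is a hypothetical stand-in for `audio_from_key` so that the example runs without any `.wav` files, the small data frame is made up, and the column name `audio_snippet` is just a suggestion.

```python
import pandas as pd

load_calls = []  # records every load, for illustration only

def fake_audio_from_key(key):
    """ Hypothetical loader: returns 10 s of fake samples at 100 Hz """
    load_calls.append(key)
    return list(range(1000))

def subset_audio(audio, start_time, end_time, rate):
    return audio[int(start_time * rate) : int(end_time * rate)]

# Made-up rows: two of them share the same key
df = pd.DataFrame({
    "key": ["/a", "/a", "/b"],
    "start_time": [1.0, 2.0, 0.5],
    "end_time": [1.5, 2.25, 1.0],
    "rate": [100, 100, 100],
})

snippets = {}  # row index -> audio snippet
for key, group in df.groupby("key", sort=False):
    audio = fake_audio_from_key(key)  # one load per distinct key
    for row in group.itertuples():
        snippets[row.Index] = subset_audio(audio, row.start_time, row.end_time, row.rate)

# The Series is aligned on the row index, so each snippet lands on its own row
df["audio_snippet"] = pd.Series(snippets)

print(load_calls)                            # ['/a', '/b']
print(df["audio_snippet"].map(len).tolist()) # [50, 25, 50]
```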


## Listen to some files

In [27]:
df

Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid,rate
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960,8000
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870,8000
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100,96000
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890,96000


In [28]:
from IPython.display import Audio
key = "/german1/5298"
audio = audio_from_key(key)
#Audio(data = audio, rate = samplerate_from_key(key))

In [29]:
start_time = df['start_time'][0]
end_time = df['end_time'][0]
rate = df['rate'][0]

start_index = int(start_time * rate)
end_index = int(end_time * rate)

subset = subset_audio(audio, start_time, end_time, rate)

In [30]:
Audio(data = subset, rate = samplerate_from_key(key))