# Data frame optimization

The heaviest task we want to perform on the input data frame is appending audio snippets.
This involves opening a `.wav` file for each row.

Some of these rows point to the same `.wav` file, so we'll make sure each file is opened only once.
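
As a minimal sketch of that idea, the rows can be grouped by `key` so the loader runs once per group rather than once per row. The toy data frame and the `fake_load` function below are hypothetical stand-ins, not part of `corpusparser`.

```python
import pandas as pd

# Toy stand-in for the input data frame (made-up keys and times).
toy = pd.DataFrame({
    "key": ["/a", "/a", "/b", "/a"],
    "start_time": [0.0, 1.0, 0.5, 2.0],
})

load_calls = []

def fake_load(key):
    """Hypothetical loader: records each call instead of opening a .wav file."""
    load_calls.append(key)
    return f"audio for {key}"

# One iteration per distinct key, so the loader runs once per file.
for key, rows in toy.groupby("key", sort=False):
    audio = fake_load(key)
    for _, row in rows.iterrows():
        pass  # the per-row snippet extraction would go here

print(len(toy), "rows,", len(load_calls), "loads")  # 4 rows, 2 loads
```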

## Input data frame

In [1]:
from pandas import read_csv

In [None]:
df = read_csv("tests/test_input.csv")
#df = df.head(6) # Uncomment to work with a smaller subset
df

Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890
4,341.41,341.83,B,mhm,/german1/4123,german,german-008-097-341410
5,622.02,622.37,A,ja,/german1/4123,german,german-008-223-622020
6,220.343,220.682,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-091-220343
7,266.974,267.346,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-108-266974
8,145.13,145.82,tx@39,yeah,/arapaho1/25b,arapaho,arapaho-22-076-145130
9,417.9,418.31,tx@5,yeah,/arapaho1/25b,arapaho,arapaho-22-206-417900


Note that `start_time` and `end_time` are in seconds.
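
Because the times are in seconds, slicing a waveform requires converting them to sample indices using the file's sample rate. The sketch below illustrates that conversion on a synthetic signal. The helper `slice_by_seconds` is hypothetical and is not the `corpusparser` implementation.

```python
import numpy as np

def slice_by_seconds(audio, start_time, end_time, rate):
    """Hypothetical helper: return the samples between two times given in seconds."""
    start = int(round(start_time * rate))
    end = int(round(end_time * rate))
    return audio[start:end]

rate = 8000                                    # samples per second (assumed for this toy signal)
audio = np.zeros(10 * rate, dtype=np.float32)  # 10 s of silence
snippet = slice_by_seconds(audio, 1.0, 1.5, rate)
print(len(snippet))                            # 4000 samples, i.e. 0.5 s at 8000 Hz
```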

## Auxiliary functions

This adapter helps us convert our syntax (using keys) into librosa's syntax (using filenames).

In [3]:
from corpusparser.auxs import filename_from_key

### Examples

In [4]:
filename_from_key("/catalan1/ca_f02a_m05a_und")

'data/catalan1/ca_f02a_m05a_und.wav'
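
Judging only from the example above, the adapter appears to prepend a data directory and append the `.wav` extension. A hypothetical re-implementation, which may differ from the real `filename_from_key`, could look like this:

```python
def filename_from_key_sketch(key, root="data", extension=".wav"):
    """Hypothetical re-implementation: map a corpus key to a .wav path.

    The `root` and `extension` defaults are inferred from the example output,
    not taken from the corpusparser source.
    """
    return f"{root}{key}{extension}"

print(filename_from_key_sketch("/catalan1/ca_f02a_m05a_und"))
# data/catalan1/ca_f02a_m05a_und.wav
```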

## Extract audio features

In [5]:
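# Import only the names used below instead of a wildcard import
from corpusparser.parsers import audio_from_key, samplerate_from_keys, subset_audio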

### Example of usage

In [6]:
audio_from_key("/catalan1/ca_f02a_m05a_und")

array([-0.00175476, -0.00236511, -0.00218201, ...,  0.        ,
        0.        ,  0.        ], dtype=float32)

In [7]:
df['rate'] = samplerate_from_keys(df['key'])
df


Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid,rate
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960,8000
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870,8000
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100,96000
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890,96000
4,341.41,341.83,B,mhm,/german1/4123,german,german-008-097-341410,8000
5,622.02,622.37,A,ja,/german1/4123,german,german-008-223-622020,8000
6,220.343,220.682,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-091-220343,16000
7,266.974,267.346,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-108-266974,16000
8,145.13,145.82,tx@39,yeah,/arapaho1/25b,arapaho,arapaho-22-076-145130,44100
9,417.9,418.31,tx@5,yeah,/arapaho1/25b,arapaho,arapaho-22-206-417900,44100
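
The sample rate is a property of the file, not of the row, so it only needs to be read once per distinct file. The sketch below shows one way to do that with the standard-library `wave` module, which handles PCM `.wav` files only. The helpers `samplerate_of` and `samplerates_for` are hypothetical and do not describe how `samplerate_from_keys` is implemented.

```python
import os
import tempfile
import wave

def samplerate_of(path):
    """Read the sample rate from a .wav header without decoding the audio."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()

def samplerates_for(paths):
    """Hypothetical helper: one header read per distinct path, one result per input."""
    cache = {path: samplerate_of(path) for path in set(paths)}
    return [cache[path] for path in paths]

# Write a tiny 8 kHz mono file so the sketch runs without the corpus data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 80)

rates = samplerates_for([demo_path, demo_path])
print(rates)                  # [8000, 8000]
```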


In [8]:
keys = df['key'].unique()
for key in keys:
    # Open the audio file only once per file (as opposed to once per row)
    audio = audio_from_key(key)

    # Extract and print the audio snippet for each row that uses this file
    aux = df[df['key'] == key]
    for _, row in aux.iterrows():
        audio_snippet = subset_audio(audio, row['start_time'], row['end_time'], row['rate'])
        print(key)
        print(audio)
        print(audio_snippet)

/german1/5298
[-0.00048828 -0.00024414 -0.00048828 ...  0.00097656 -0.00036621
 -0.00097656]
[-0.00012207 -0.00061035 -0.00085449 ...  0.00012207  0.00012207
  0.00012207]
/german1/5298
[-0.00048828 -0.00024414 -0.00048828 ...  0.00097656 -0.00036621
 -0.00097656]
[-0.00097656 -0.00085449 -0.00109863 ...  0.00274658  0.003479
 -0.00134277]
/sambas1/SBS-20111031
[-0.00515747 -0.0050354  -0.00457764 ...  0.00054932  0.00146484
  0.00146484]
[ 0.00845337  0.00848389  0.00723267 ... -0.02545166 -0.02685547
 -0.02685547]
/sambas1/SBS-20111031
[-0.00515747 -0.0050354  -0.00457764 ...  0.00054932  0.00146484
  0.00146484]
[-0.01296997 -0.01293945 -0.01101685 ...  0.0005188   0.00253296
  0.00265503]
/german1/4123
[-0.00048828 -0.00048828 -0.00024414 ...  0.04699707  0.13659668
  0.11120605]
[-0.00201416 -0.00158691 -0.00146484 ... -0.00024414 -0.00085449
  0.00061035]
/german1/4123
[-0.00048828 -0.00048828 -0.00024414 ...  0.04699707  0.13659668
  0.11120605]
[ 0.00024414  0.          0.     

## Listen to the recovered audio

### Auxiliary functions

In [9]:
from corpusparser.listeners import listen_audio_from_key


In [10]:
df

Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid,rate
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960,8000
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870,8000
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100,96000
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890,96000
4,341.41,341.83,B,mhm,/german1/4123,german,german-008-097-341410,8000
5,622.02,622.37,A,ja,/german1/4123,german,german-008-223-622020,8000
6,220.343,220.682,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-091-220343,16000
7,266.974,267.346,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-108-266974,16000
8,145.13,145.82,tx@39,yeah,/arapaho1/25b,arapaho,arapaho-22-076-145130,44100
9,417.9,418.31,tx@5,yeah,/arapaho1/25b,arapaho,arapaho-22-206-417900,44100


In [11]:
keys = df.key.unique()
key = keys[3]
print(key)
listen_audio_from_key(df, key = key, row = 0)

/catalan1/ca_f37s_f38s_und
