# Data frame optimization

The most expensive operation we perform on the input data frame is appending an audio snippet to each row.
This involves opening the `.wav` file that each row refers to.

Some of these rows point to the same `.wav` file, so we make sure each file is opened only once.
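
One way to guarantee a single open per file is to memoize the loader. The sketch below uses `functools.lru_cache` from the standard library and a hypothetical `load_wav` function that stands in for the real reader (it returns placeholder samples instead of touching the disk). It illustrates the idea only and is not necessarily how `corpusparser` implements it.

```python
from functools import lru_cache

# Counts how many times each file is actually "opened" (illustration only).
open_counts = {}


@lru_cache(maxsize=None)
def load_wav(key):
    """Hypothetical stand-in for an expensive .wav reader.

    Returns a (samples, rate) tuple; both values are placeholders.
    """
    open_counts[key] = open_counts.get(key, 0) + 1
    return (0.0, 0.0, 0.0), 8000


# Two rows share the first key, so only two distinct loads happen.
row_keys = ["/german1/5298", "/german1/5298", "/german1/4123"]
loaded = [load_wav(k) for k in row_keys]
```

After the loop, `open_counts` holds a count of 1 for each distinct key, even though three rows were processed.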

## Input data frame

In [1]:
from pandas import read_csv

In [2]:
df = read_csv("tests/test_input.csv")
df

Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890
4,341.41,341.83,B,mhm,/german1/4123,german,german-008-097-341410
5,622.02,622.37,A,ja,/german1/4123,german,german-008-223-622020
6,220.343,220.682,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-091-220343
7,266.974,267.346,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-108-266974
8,145.13,145.82,tx@39,yeah,/arapaho1/25b,arapaho,arapaho-22-076-145130
9,417.9,418.31,tx@5,yeah,/arapaho1/25b,arapaho,arapaho-22-206-417900


Note that `start_time` and `end_time` are expressed in seconds.

## Auxiliary functions
This adapter helps us convert our syntax (using keys) into librosa's syntax (using filenames).

In [3]:
from corpusparser.auxs import filename_from_key
filename_from_key("/catalan1/ca_f02a_m05a_und")

'data/catalan1/ca_f02a_m05a_und.wav'
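
Judging only from the output above, the adapter prefixes the key with a data directory and appends the `.wav` extension. A hypothetical re-implementation might look like the sketch below; the name `sketch_filename_from_key` and the `data_dir` parameter are assumptions for illustration and are not part of `corpusparser`.

```python
def sketch_filename_from_key(key, data_dir="data"):
    """Map a corpus key such as '/catalan1/x' to 'data/catalan1/x.wav'.

    Hypothetical sketch; the real filename_from_key may differ.
    """
    # Keys start with '/', so strip it before joining with the data directory.
    return f"{data_dir}/{key.lstrip('/')}.wav"


print(sketch_filename_from_key("/catalan1/ca_f02a_m05a_und"))
# prints: data/catalan1/ca_f02a_m05a_und.wav
```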

## Extract audio features

In [4]:
from corpusparser.parsers import (
    audio_from_key,
    extend_dataframe,
    samplerate_from_key,
    subset_audio_from_key,
)

### Example of usage

### Extract all audio

In [5]:
audio_from_key("/catalan1/ca_f02a_m05a_und")

array([-0.00175476, -0.00236511, -0.00218201, ...,  0.        ,
        0.        ,  0.        ], dtype=float32)

### Extract sample rate

In [6]:
samplerate_from_key("/catalan1/ca_f02a_m05a_und")

16000

### Extract an audio snippet

In [7]:
key = "/catalan1/ca_f02a_m05a_und"
df[df["key"] == key].reset_index()

Unnamed: 0,index,start_time,end_time,participant,utterance,key,language,uid
0,10,318.486,318.89,f02lp,sí,/catalan1/ca_f02a_m05a_und,catalan,catalan-08-222-318486
1,11,84.358,84.546,m05lp,sí,/catalan1/ca_f02a_m05a_und,catalan,catalan-08-061-84358


In [8]:
snippet = subset_audio_from_key(df, key, row=0)
snippet

array([-0.00238037, -0.01153564, -0.00408936, ..., -0.01112366,
       -0.01062012, -0.01023865], dtype=float32)
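
Because the times are in seconds, cutting a snippet amounts to converting `start_time` and `end_time` into sample indices with the sample rate and then slicing. Here is a minimal sketch, assuming simple rounding to the nearest sample; the helper `slice_by_time` is hypothetical, and the real `subset_audio_from_key` may treat boundaries differently.

```python
def slice_by_time(audio, rate, start_time, end_time):
    """Return the samples between start_time and end_time (in seconds).

    Works on any sliceable sequence, including NumPy arrays.
    """
    start = int(round(start_time * rate))  # index of the first sample
    end = int(round(end_time * rate))      # one past the last sample
    return audio[start:end]


# Toy signal: 2 seconds at 10 samples per second.
toy_audio = list(range(20))
toy_snippet = slice_by_time(toy_audio, rate=10, start_time=0.5, end_time=1.0)
print(toy_snippet)  # prints: [5, 6, 7, 8, 9]
```

Under this scheme, row 0 above (318.486 s to 318.89 s at the 16000 Hz rate reported earlier) would yield about 6464 samples, since 0.404 s × 16000 samples/s = 6464.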

### Append all audio snippets to dataframe

In [9]:
df = extend_dataframe(df)

Running this cell emits deprecation warnings from librosa, triggered by the `librosa.core.load` and `librosa.get_samplerate` calls: audioread support is deprecated as of librosa 0.10.0 and will be removed in version 1.0.


In [10]:

df

Unnamed: 0,start_time,end_time,participant,utterance,key,language,uid,audio,rate
0,629.96,630.51,A,aha,/german1/5298,german,german-059-255-629960,"[-0.00012207031, -0.00061035156, -0.0008544922...",8000
1,398.87,399.33,A,aha,/german1/5298,german,german-059-151-398870,"[-0.0009765625, -0.0008544922, -0.0010986328, ...",8000
2,2009.1,2009.5,tx@ADUSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0883-2009100,"[0.008453369, 0.008483887, 0.007232666, 0.0072...",96000
3,1782.89,1783.4,tx@JEPSBS,aoq,/sambas1/SBS-20111031,sambas,sambas-24-0764-1782890,"[-0.012969971, -0.012939453, -0.011016846, -0....",96000
4,341.41,341.83,B,mhm,/german1/4123,german,german-008-097-341410,"[-0.0020141602, -0.0015869141, -0.0014648438, ...",8000
5,622.02,622.37,A,ja,/german1/4123,german,german-008-223-622020,"[0.00024414062, 0.0, 0.0, -0.00024414062, -0.0...",8000
6,220.343,220.682,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-091-220343,"[-0.0016174316, -0.0015258789, -0.0014343262, ...",16000
7,266.974,267.346,f37ln,sí,/catalan1/ca_f37s_f38s_und,catalan,catalan-12-108-266974,"[0.00019836426, -0.00044250488, -0.0014038086,...",16000
8,145.13,145.82,tx@39,yeah,/arapaho1/25b,arapaho,arapaho-22-076-145130,"[0.006210327, 0.0040740967, 0.0043640137, 0.00...",44100
9,417.9,418.31,tx@5,yeah,/arapaho1/25b,arapaho,arapaho-22-206-417900,"[0.023712158, 0.022613525, 0.025222778, 0.0259...",44100
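
The `audio` and `rate` columns above could be produced by a loop of the following shape. This is a sketch under stated assumptions, not the actual `extend_dataframe`: rows are plain dictionaries rather than a pandas data frame, and both `extend_rows` and `fake_loader` are hypothetical names, with the loader returning silence instead of reading a file.

```python
def extend_rows(rows, loader):
    """Return copies of rows with 'audio' and 'rate' fields added.

    loader(key) must return (samples, rate); it is called once per distinct key.
    """
    loaded = {}  # key -> (samples, rate), filled on first use
    extended = []
    for row in rows:
        key = row["key"]
        if key not in loaded:
            loaded[key] = loader(key)  # the only place a file would be opened
        samples, rate = loaded[key]
        start = int(round(row["start_time"] * rate))
        end = int(round(row["end_time"] * rate))
        extended.append({**row, "audio": samples[start:end], "rate": rate})
    return extended


calls = []  # records every key the loader is asked for


def fake_loader(key):
    """Hypothetical loader: 10 seconds of silence at 8000 Hz, no disk access."""
    calls.append(key)
    return [0.0] * 80000, 8000


rows = [
    {"key": "/german1/4123", "start_time": 1.0, "end_time": 1.5},
    {"key": "/german1/4123", "start_time": 2.0, "end_time": 2.25},
]
result = extend_rows(rows, fake_loader)
```

Both rows share a key, so `calls` ends up with a single entry while each row still receives its own snippet.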


## (Optional) Listen to the snippets

### From key

In [11]:
from corpusparser.listeners import *

key = "/catalan1/ca_f02a_m05a_und"
#listen_audio_from_key(df, key = key, row = 0)

### From data frame index

In [12]:
#listen_snippet_from_df(df, row = 0)