# Data Preprocessing

The dataset has been shared with us (The Earth Species Project) by our collaborators, [ElephantVoices](https://www.elephantvoices.org/).

Here we document the preprocessing steps that were applied to the raw data we received. The raw dataset can be accessed [here](https://console.cloud.google.com/storage/browser/ml-bioacoustics-datasets;tab=objects?forceOnBucketsSortingFiltering=false&authuser=1&project=okapi-274503&prefix=&forceOnObjectsSortingFiltering=false) **[Please note: this is not yet shared publicly, for internal use only]**. Please download the `elephants_raw.zip` file and put it in the root of this directory.

We preprocess the data to create two datasets: one with solo callers, ready for speaker identification (classification), and one containing all recordings without labels (useful for pretraining, semi-supervised training, audio generation, etc.).

The purpose of this preprocessing is to remove recordings that contain anything other than the elephant call. We take this step because, for some machine learning applications (the cocktail party problem, for instance), having access to just the call portion of the recording has distinct advantages.

This is a crude way of preprocessing the data, and it does not make full use of the raw dataset. It is, however, the best that can be achieved programmatically. An alternative would be to comb through the data, listen to each recording, and single out just the call portion (removing extraneous sounds, for instance). That would preserve more of the data, but at the cost of manual labor. We may still need to take that route, but it makes sense to begin with programmatic preprocessing and to apply data augmentation techniques first. This way, we will better understand whether manual intervention is required.

One of the criteria for programmatic preprocessing is the annotations. The example below lists records whose `Emma only caller?` annotation is present but is not exactly `yes`. These are the records that are removed.

In [14]:
solo_callers[(solo_callers['Emma only caller?'] != 'yes') & ~solo_callers['Emma only caller?'].isna() ][['Emma only caller?']]

Unnamed: 0,Emma only caller?
1,no
3,no
4,"yes, possibly baby grunt before"
5,"not sure, possibly two callers"
6,"not sure, possibly two callers"
11,no
12,no
13,no
15,no
16,no


Here we begin documenting all the steps undertaken for preprocessing.

In [1]:
!mkdir data
!cp elephants_raw.zip data/
!cd data && unzip -q elephants_raw.zip

In [9]:
ls data

[0m[01;34mEcho rumbles[0m/      [01;34mEmma rumbles[0m/  [01;34mEudora rumbles[0m/
[01;31melephants_raw.zip[0m  [01;34mEnid rumbles[0m/  [01;34mMultiple callers[0m/
[01;34mElla rumbles[0m/      [01;34mErin rumbles[0m/  [01;34mxlsx spreadsheets for ESP[0m/


In [4]:
import pandas as pd

In [5]:
xlses = []
for name in ['Echo', 'Emma', 'Erin', 'Ella', 'Enid', 'Eudora']:
    xls = pd.read_excel(f'data/xlsx spreadsheets for ESP/{name}.xlsx')
    xlses.append(xls)

In [6]:
solo_callers = pd.concat(xlses)

In [7]:
# remove entries without a file name, or whose file name field is marked 'poor quality'
solo_callers = solo_callers[~solo_callers.SndFile.isna()]
solo_callers = solo_callers[solo_callers.SndFile != 'poor quality']

In [8]:
solo_callers

Unnamed: 0,Year,Tape,ElapsedTime,Type,ContextType,QR1,AgeCaller,AgeClassCaller,SexCaller,Callers,...,Directed to,FieldNotes,SndFile,Emma only caller?,sound file found,Erin only caller?,Ella alone,Sound file found,Enid only caller?,Eudora only caller?
0,1999.0,3.0,1552.0,rumble,lets go,A,54.0,5,Female,Echo,...,,note that she give sharp ear flap first - this...,B0301552,,,,,,,
1,1999.0,9.0,2708.0,rumble,female chorus,A,54.0,5,Female,Echo,...,,as tested by Masaku; and another rumble here ...,B0902708,,,,,,,
2,1999.0,20.0,2134.0,rumble,cadenced,A,54.0,5,Female,Echo,...,,,B2002134,,,,,,,
3,1999.0,20.0,2222.0,rumble,cadenced,A,54.0,5,Female,Echo,...,,Echo is in the lead facing OTO,B2002222,,,,,,,
4,1999.0,20.0,2331.0,rumble,cadenced,A,54.0,5,Female,Echo,...,,,B2002331,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,2000.0,34.0,12842.0,rumble,cadenced,C,28.0,3,Female,Eudora,...,,,C3412837,,√,,,,,yes
91,2000.0,34.0,13051.0,rumble,little greeting,C,28.0,3,Female,Eudora,...,,as Ella join them,C3413051,,√,,,,,no
92,2000.0,34.0,13113.0,rumble,unknown,C,28.0,3,Female,Eudora,...,,,C3413113,,√,,,,,yes
93,2000.0,35.0,818.0,rumble,cadenced,C,28.0,3,Female,Eudora,...,,Eud & Enid try to go further west but then giv...,C3500818,,√,,,,,no


In [7]:
# keep a row only if each "only caller" annotation is either missing or exactly 'yes'
for col in ['Emma only caller?', 'Erin only caller?', 'Ella alone',
            'Enid only caller?', 'Eudora only caller?']:
    solo_callers = solo_callers[solo_callers[col].isna() | (solo_callers[col] == 'yes')]
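
The filtering rule above can be illustrated on a toy table. This is only a sketch: the helper name `keep_unflagged` and the toy rows are made up for illustration and are not part of the notebook. It shows that a row survives only when the annotation is missing or exactly `yes`, so a hedged answer such as "yes, possibly baby grunt before" is dropped as well.

```python
import numpy as np
import pandas as pd

def keep_unflagged(df, columns):
    """Keep rows whose value in every listed column is missing or exactly 'yes'.

    Hypothetical helper; uses a positional boolean mask so that a
    non-unique index (as produced by pd.concat) is not a problem.
    """
    mask = np.ones(len(df), dtype=bool)
    for col in columns:
        mask &= (df[col].isna() | (df[col] == 'yes')).to_numpy()
    return df[mask]

# Toy data (not from the real spreadsheets)
toy = pd.DataFrame({
    'SndFile': ['a', 'b', 'c', 'd', 'e'],
    'Emma only caller?': ['yes', 'no', None,
                          'not sure, possibly two callers',
                          'yes, possibly baby grunt before'],
})

kept = keep_unflagged(toy, ['Emma only caller?'])
print(kept.SndFile.tolist())
```

Whether near-`yes` answers should be kept is a judgment call; the exact-match rule errs on the side of discarding uncertain recordings.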

In [8]:
solo_callers.drop_duplicates('SndFile', inplace=True)
solo_callers.drop(columns=['Emma only caller?',
       'sound file found', 'Erin only caller?', 'Ella alone',
       'Sound file found', 'Enid only caller?', 'Eudora only caller?'], inplace=True)

In [9]:
import glob
from shutil import copyfile

In [10]:
mkdir data/audio_all

In [11]:
for path in glob.glob('data/**/**/*.wav') + glob.glob('data/**/*.wav'):
    copyfile(path, f'data/audio_all/{path.split("/")[-1]}')
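
Flattening all subdirectories into one folder silently overwrites files that share a base name. Whether the raw data contains such duplicates is not something this notebook checks, so the following is a hedged sketch of how one might look for them before copying. The helper `find_name_collisions` and the toy directory layout are assumptions for illustration only.

```python
from collections import defaultdict
from pathlib import Path
import tempfile

def find_name_collisions(paths):
    """Group paths by base name and return only names that occur more than once."""
    by_name = defaultdict(list)
    for p in paths:
        by_name[Path(p).name].append(str(p))
    return {name: ps for name, ps in by_name.items() if len(ps) > 1}

# Toy directory tree standing in for data/ (file names are made up)
with tempfile.TemporaryDirectory() as root:
    for rel in ['Echo rumbles/B0301552.wav',
                'Multiple callers/B0301552.wav',
                'Emma rumbles/X0000001.wav']:
        f = Path(root, rel)
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    collisions = find_name_collisions(Path(root).rglob('*.wav'))

for name, paths in collisions.items():
    print(name, 'appears', len(paths), 'times')
```

On the real data, one could pass the same list of paths that is given to `copyfile` and inspect any reported names by hand.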

In [12]:
mkdir data/audio

In [13]:
missing_files = []

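for fn in solo_callers.SndFile:
    try:
        copyfile(f'data/audio_all/{fn}.wav', f'data/audio/{fn}.wav')
    except FileNotFoundError:
        # no matching recording for this annotation; remember it so the row can be dropped
        missing_files.append(fn)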

In [14]:
solo_callers = solo_callers[~solo_callers.SndFile.isin(missing_files)]
solo_callers.reset_index(inplace=True, drop=True)
solo_callers.to_csv('data/annotations.csv', index=False)

In [15]:
solo_callers.Callers.value_counts()

Ella      104
Echo       84
Enid       62
Eudora     32
Erin       16
Emma       16
Name: Callers, dtype: int64

In [16]:
!cd data && zip -qr elephant_rumbles.zip annotations.csv audio

In [17]:
!rm -rf data/audio
!mv data/audio_all data/audio

In [18]:
fns = [path.split('/')[-1] for path in glob.glob('data/audio/*')]
anno = pd.DataFrame({'SndFile': fns})
anno.to_csv('data/annotations.csv', index=False)

In [19]:
!cd data && zip -qr elephant_rumbles_unlabeled.zip annotations.csv audio
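
As a final sanity check, one could confirm that every file listed in `annotations.csv` is actually present in an archive. The sketch below assumes the layout produced above (`annotations.csv` at the top level and recordings under `audio/`); the helper `missing_audio` and the toy archive are hypothetical. It accepts `SndFile` values with or without the `.wav` extension, since the labeled annotations store bare names while the unlabeled ones store full file names.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path

def missing_audio(zip_path):
    """Return SndFile entries from annotations.csv that have no audio/<name>.wav in the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
        with zf.open('annotations.csv') as f:
            rows = list(csv.DictReader(io.TextIOWrapper(f, encoding='utf-8')))
    missing = []
    for row in rows:
        fn = row['SndFile']
        if not fn.endswith('.wav'):
            fn += '.wav'
        if f'audio/{fn}' not in names:
            missing.append(row['SndFile'])
    return missing

# Toy archive (contents are made up; only the layout mirrors the notebook)
with tempfile.TemporaryDirectory() as tmp:
    toy_zip = Path(tmp, 'toy.zip')
    with zipfile.ZipFile(toy_zip, 'w') as zf:
        zf.writestr('annotations.csv', 'SndFile\nB0301552\nC3500818\n')
        zf.writestr('audio/B0301552.wav', b'')
    result = missing_audio(toy_zip)

print(result)
```

An empty list means the annotations and the audio folder are consistent.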