# Data Preprocessing

The dataset has been shared with us (The Earth Species Project) by our collaborators, [ElephantVoices](https://www.elephantvoices.org/).

Here we document the preprocessing steps that were applied to the raw data we received. The raw dataset can be accessed [here](https://console.cloud.google.com/storage/browser/ml-bioacoustics-datasets;tab=objects?forceOnBucketsSortingFiltering=false&authuser=1&project=okapi-274503&prefix=&forceOnObjectsSortingFiltering=false) **[Please note: this is not yet shared publicly, for internal use only]**. Please download the `elephants_raw.zip` file and put it in the root of this directory.

We preprocess the data to create two datasets: one containing only solo callers, ready for speaker identification (classification), and one containing all recordings without labels (useful for pretraining, semi-supervised training, audio generation, etc.).

In [1]:
!mkdir data
!cp elephants_raw.zip data/
!cd data && unzip -q elephants_raw.zip

In [2]:
ls data

[0m[01;34mEcho rumbles[0m/      [01;34mEmma rumbles[0m/  [01;34mEudora rumbles[0m/
[01;31melephants_raw.zip[0m  [01;34mEnid rumbles[0m/  [01;34mMultiple callers[0m/
[01;34mElla rumbles[0m/      [01;34mErin rumbles[0m/  [01;34mxlsx spreadsheets for ESP[0m/


In [3]:
import pandas as pd

In [4]:
xlses = []
for name in ['Echo', 'Emma', 'Erin', 'Ella', 'Enid', 'Eudora']:
    xls = pd.read_excel(f'data/xlsx spreadsheets for ESP/{name}.xlsx')
    xlses.append(xls)

In [5]:
solo_callers = pd.concat(xlses)

In [6]:
# remove all entries without a file name
solo_callers = solo_callers[~solo_callers.SndFile.isna()]
# remove entries whose file name field is marked 'poor quality'
solo_callers = solo_callers[solo_callers.SndFile != 'poor quality']

In [7]:
# keep a row only if each "only caller" flag is missing (NaN) or 'yes'
solo_callers = solo_callers[solo_callers['Emma only caller?'].isna() | (solo_callers['Emma only caller?'] == 'yes')]
solo_callers = solo_callers[solo_callers['Erin only caller?'].isna() | (solo_callers['Erin only caller?'] == 'yes')]
solo_callers = solo_callers[solo_callers['Ella alone'].isna() | (solo_callers['Ella alone'] == 'yes')]
solo_callers = solo_callers[solo_callers['Enid only caller?'].isna() | (solo_callers['Enid only caller?'] == 'yes')]
solo_callers = solo_callers[solo_callers['Eudora only caller?'].isna() | (solo_callers['Eudora only caller?'] == 'yes')]
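
The `isna()` branch in the filter above matters because, after `pd.concat`, rows that came from a spreadsheet without a given flag column hold NaN in that column. A minimal sketch of the same "missing or yes" rule on toy data follows; the helper name `keep_solo` and the rows are hypothetical and are not taken from the real spreadsheets.

```python
import numpy as np
import pandas as pd

def keep_solo(df, flag_columns):
    """Keep rows where every flag column is either missing (NaN) or 'yes'."""
    mask = pd.Series(True, index=df.index)
    for col in flag_columns:
        mask &= df[col].isna() | (df[col] == 'yes')
    return df[mask]

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    'SndFile': ['a', 'b', 'c', 'd'],
    'Callers': ['Emma', 'Emma', 'Erin', 'Echo'],
    'Emma only caller?': ['yes', 'no', np.nan, np.nan],
    'Erin only caller?': [np.nan, np.nan, 'yes', np.nan],
})

solo = keep_solo(toy, ['Emma only caller?', 'Erin only caller?'])
print(solo.SndFile.tolist())
```

Row `b` is dropped because its flag is `'no'`; row `d` is kept because it has no flag at all, which is how recordings from a spreadsheet without such a column pass through.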

In [8]:
solo_callers.drop_duplicates('SndFile', inplace=True)
solo_callers.drop(columns=['Emma only caller?',
       'sound file found', 'Erin only caller?', 'Ella alone',
       'Sound file found', 'Enid only caller?', 'Eudora only caller?'], inplace=True)

In [9]:
import glob
from shutil import copyfile

In [10]:
mkdir data/audio_all

In [11]:
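# Flatten every .wav file into data/audio_all. Without recursive=True, `**` behaves
# like `*`, so the two patterns match files in sub-subdirectories and in direct
# subdirectories of data/. Files that share a name overwrite each other.
for path in glob.glob('data/**/**/*.wav') + glob.glob('data/**/*.wav'):
    copyfile(path, f'data/audio_all/{path.split("/")[-1]}')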
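
Because the copy above keeps only the file name, two recordings with the same name in different folders would silently overwrite each other. The sketch below shows one way to check for that before copying; `find_basename_collisions` is a hypothetical helper and the paths are made up for illustration.

```python
import os
from collections import defaultdict

def find_basename_collisions(paths):
    """Group paths by file name and return only the names that occur more than once."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}

# Hypothetical paths for illustration only.
paths = [
    'data/Echo rumbles/r1.wav',
    'data/Multiple callers/r1.wav',
    'data/Emma rumbles/r2.wav',
]
collisions = find_basename_collisions(paths)
print(collisions)
```

In the notebook, the same helper could be called on the list produced by the two `glob.glob` calls; an empty result would mean the flattening step loses no files.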

In [12]:
mkdir data/audio

In [13]:
missing_files = []

# copy each annotated recording; record the names whose .wav file is not present
for fn in solo_callers.SndFile:
    try:
        copyfile(f'data/audio_all/{fn}.wav', f'data/audio/{fn}.wav')
    except FileNotFoundError:
        missing_files.append(fn)

In [14]:
solo_callers = solo_callers[~solo_callers.SndFile.isin(missing_files)]
solo_callers.reset_index(inplace=True, drop=True)
solo_callers.to_csv('data/annotations.csv', index=False)

In [15]:
solo_callers.Callers.value_counts()

Ella      104
Echo       84
Enid       62
Eudora     32
Erin       16
Emma       16
Name: Callers, dtype: int64

In [16]:
!cd data && zip -qr elephant_rumbles.zip annotations.csv audio
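
As an optional sanity check, one could confirm that every `SndFile` listed in `annotations.csv` has a matching `.wav` inside the archive. The sketch below is only an illustration under that assumption: `missing_audio` is a hypothetical helper, and the demo builds a tiny in-memory archive rather than reading the real `elephant_rumbles.zip`.

```python
import csv
import io
import zipfile

def missing_audio(zip_file, annotations_name='annotations.csv', audio_dir='audio'):
    """Return SndFile values from the annotations that have no matching .wav in the archive."""
    with zipfile.ZipFile(zip_file) as zf:
        members = set(zf.namelist())
        with zf.open(annotations_name) as f:
            rows = csv.DictReader(io.TextIOWrapper(f, encoding='utf-8'))
            return [row['SndFile'] for row in rows
                    if f"{audio_dir}/{row['SndFile']}.wav" not in members]

# Build a small in-memory archive with hypothetical contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('annotations.csv', 'SndFile,Callers\nr1,Echo\nr2,Emma\n')
    zf.writestr('audio/r1.wav', b'')
buffer.seek(0)

missing = missing_audio(buffer)
print(missing)
```

Run against the real archive (for example `missing_audio('data/elephant_rumbles.zip')`), an empty list would indicate that the annotations and audio files are consistent.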

In [17]:
!rm -rf data/audio
!mv data/audio_all data/audio

In [18]:
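# list every recording; SndFile here is the full file name, including the .wav extension
fns = [path.split('/')[-1] for path in glob.glob('data/audio/*')]
anno = pd.DataFrame({'SndFile': fns})
# this overwrites the labeled annotations.csv, which is already stored in elephant_rumbles.zip
anno.to_csv('data/annotations.csv', index=False)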

In [19]:
!cd data && zip -qr elephant_rumbles_unlabeled.zip annotations.csv audio