**Data License**: [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/)

**Please note:** This notebook documents the steps we took to preprocess the raw dataset. It is for informational purposes only. You do not need to run the code and can safely navigate to the notebook [01_Download_Dataset](01_Download_dataset.ipynb) to start.

# Data Preprocessing

The original dataset was shared as part of [Distributed acoustic cues for caller identity in macaque vocalization](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4806230/).

It can be downloaded from [Data Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.7f4p9).

In [1]:
!mkdir data

If you want to run the preprocessing steps outlined below, please download the dataset from https://datadryad.org/stash/dataset/doi:10.5061/dryad.7f4p9 and copy it into the `data` directory.

Otherwise, navigate to the 02_TBD notebook to use the preprocessed data that we provide.

In [2]:
!cd data && unzip -q doi_10.5061_dryad.7f4p9__v1.zip

In [3]:
!cd data && unzip -q Fukushima2015.zip
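
The two `unzip` calls can also be made from Python, which avoids relying on a shell `unzip` binary being installed. Below is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_archive` is hypothetical, and the demo runs on a throwaway archive built in a temporary directory rather than on the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Demo on a tiny archive (placeholder bytes, not real audio).
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / 'demo.zip'
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('dataupload/AL/example.wav', b'not real audio')

members = extract_archive(demo_zip, workdir / 'data')
```

For the real data, the equivalent would presumably be `extract_archive('data/doi_10.5061_dryad.7f4p9__v1.zip', 'data')` followed by `extract_archive('data/Fukushima2015.zip', 'data')`.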

The data is stored in the `dataupload` directory where the subdirectory names are the labels (they are codenames for the individuals in the study).

In [4]:
!ls data/dataupload

[0m[01;34mAL[0m/  [01;34mBE[0m/  [01;34mIO[0m/  [01;34mMU[0m/  [01;34mQU[0m/  [01;34mSN[0m/  [01;34mTH[0m/  [01;34mTW[0m/


We will extract the audio files, partition them into train and validation splits and create a csv file with annotations.

In [5]:
!mkdir data/train
!mkdir data/valid

In [6]:
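from pathlib import Path  # standard library on Python 3; pathlib2 is the Python 2 backport

paths, codenames = [], []

# Each subdirectory of data/dataupload is named after one individual (the class label).
for directory in Path('data/dataupload').iterdir():
    codename = directory.name
    for audio_path in directory.iterdir():
        paths.append(audio_path)
        codenames.append(codename)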

In [7]:
import pandas as pd
import numpy as np

annotations = pd.DataFrame(data={'path': paths, 'class': codenames})

Let us now partition the dataset into `train` and `valid` splits.

In [8]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True)
split = list(skf.split(np.arange(annotations.shape[0]), annotations['class']))[0]

We took the first of the five folds, which assigns 80% of the data to the train set and 20% to the validation set, stratified by class.

In [9]:
split[0].shape, split[1].shape

((5828,), (1457,))
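
These sizes match the intended ratio: 5828 + 1457 = 7285 files, and 1457 / 7285 = 0.2. Note that the `StratifiedKFold` call above does not set `random_state`, so rerunning the cell assigns different files to each split. The sketch below shows one way to make the split repeatable. It uses toy labels instead of the real annotations, and both the helper name `first_fold` and the seed value 42 are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the real annotations: 3 hypothetical classes, 50 files.
toy = pd.DataFrame({
    'path': [f'file_{i}.wav' for i in range(50)],
    'class': ['A'] * 25 + ['B'] * 15 + ['C'] * 10,
})


def first_fold(labels, seed):
    """Return (train_idx, valid_idx) for the first of 5 stratified folds."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Only the number of samples matters for X, so zeros are enough.
    return next(skf.split(np.zeros(len(labels)), labels))


train_idx, valid_idx = first_fold(toy['class'], seed=42)
train_again, valid_again = first_fold(toy['class'], seed=42)

# Per-class counts in the validation fold.
valid_counts = toy.loc[valid_idx, 'class'].value_counts()
```

With a fixed seed, the two calls return identical indices, so the annotations file can be regenerated exactly.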

In [10]:
annotations['split'] = 'train'
annotations.loc[split[1], 'split'] = 'valid'

Let's confirm that the split is stratified by class.

In [11]:
annotations.pivot_table(index='class', columns=['split'], aggfunc=np.count_nonzero)

| class | train | valid |
|-------|-------|-------|
| AL    | 799   | 200   |
| BE    | 382   | 96    |
| IO    | 801   | 201   |
| MU    | 814   | 203   |
| QU    | 780   | 195   |
| SN    | 801   | 200   |
| TH    | 1076  | 269   |
| TW    | 375   | 93    |
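
To make the check explicit, the per-class validation fraction can be computed from the counts above. This is a small sketch that re-enters the reported numbers by hand; a stratified 80/20 split should give a fraction close to 0.2 for every class.

```python
import pandas as pd

# Per-class counts copied from the pivot table output above.
counts = pd.DataFrame(
    {
        'train': [799, 382, 801, 814, 780, 801, 1076, 375],
        'valid': [200, 96, 201, 203, 195, 200, 269, 93],
    },
    index=['AL', 'BE', 'IO', 'MU', 'QU', 'SN', 'TH', 'TW'],
)

# Share of each class that ended up in the validation split.
valid_fraction = counts['valid'] / counts.sum(axis=1)
print(valid_fraction.round(3))
```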


We can now copy the audio files into the appropriate directories.

In [12]:
for _, (path, split) in annotations[['path', 'split']].iterrows():
    !cp {path} data/{split}
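
The loop above starts one `cp` process per file and reuses the name `split`, which shadows the fold indices computed earlier. A pure-Python alternative is sketched below using `shutil`. The helper name `copy_into_splits` is hypothetical, and the demo operates on placeholder files in a temporary directory, not on the real dataset.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd


def copy_into_splits(annotations, root):
    """Copy each file in annotations['path'] into root/<split>/."""
    root = Path(root)
    for path, split_name in zip(annotations['path'], annotations['split']):
        target_dir = root / split_name
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target_dir / Path(path).name)


# Demo with two placeholder files standing in for audio recordings.
workdir = Path(tempfile.mkdtemp())
source_dir = workdir / 'dataupload' / 'AL'
source_dir.mkdir(parents=True)
for name in ('a.wav', 'b.wav'):
    (source_dir / name).write_bytes(b'placeholder')

toy = pd.DataFrame({
    'path': [source_dir / 'a.wav', source_dir / 'b.wav'],
    'split': ['train', 'valid'],
})
copy_into_splits(toy, workdir)
```

Like `cp`, this leaves the original files in place. It assumes filenames are unique across individuals; if they are not, later copies overwrite earlier ones.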

The `path` column in the annotations no longer points to where the files live. Let's remove it and replace it with a `filename` column.

In [13]:
annotations['filename'] = [path.name for path in annotations.path.values]
annotations.drop(columns=['path'], inplace=True)

In [14]:
annotations.to_csv('data/annotations.csv', index=False)
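
Before archiving, it can be worth confirming that every row of the CSV points to a file that exists in the matching split directory. The sketch below is one way to do that. The helper name `missing_files` is hypothetical, and the demo uses a temporary directory with one deliberately missing file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def missing_files(csv_path, root):
    """Return filenames listed in the CSV that are absent from root/<split>/."""
    root = Path(root)
    table = pd.read_csv(csv_path)
    return [
        filename
        for split_name, filename in zip(table['split'], table['filename'])
        if not (root / split_name / filename).exists()
    ]


# Demo: a.wav is present in train/, b.wav is listed but never created.
workdir = Path(tempfile.mkdtemp())
(workdir / 'train').mkdir()
(workdir / 'valid').mkdir()
(workdir / 'train' / 'a.wav').write_bytes(b'placeholder')

pd.DataFrame({
    'class': ['AL', 'AL'],
    'split': ['train', 'valid'],
    'filename': ['a.wav', 'b.wav'],
}).to_csv(workdir / 'annotations.csv', index=False)

missing = missing_files(workdir / 'annotations.csv', workdir)
```

On the real data, `missing_files('data/annotations.csv', 'data')` would be expected to return an empty list.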

Let us now combine the `train` and `valid` directories, along with the annotations file, into a single archive.

In [15]:
!cd data && zip -qr macaques.zip annotations.csv train valid

The archive with the data has been uploaded to <location_of_upload>.

The [02_Speaker_Identification_With_Deep_Learning.ipynb](https://github.com/earthspecies/library/blob/macaques/macaques/02_Speaker_Identification_With_Deep_Learning.ipynb) notebook demonstrates how to download the data and construct a model that identifies the speaker.