# Data Preprocessing

I downloaded the data from [Kaggle](https://www.kaggle.com/c/rfcx-species-audio-detection/data) and extracted it into the `data` directory.

In [1]:
ls data/

[0m[01;31mrfcx-species-audio-detection.zip[0m  [01;34mtest[0m/       [01;34mtrain[0m/        train_tp.csv
sample_submission.csv             [01;34mtfrecords[0m/  train_fp.csv


We will not need the sample submission, the TFRecords, the original archive, or the test set anymore, so let's get rid of them.

In [2]:
rm -rf data/sample_submission.csv data/tfrecords data/rfcx-species-audio-detection.zip data/test

In [3]:
import pandas as pd

In [4]:
train_tp = pd.read_csv('data/train_tp.csv')
train_tp.head()

Unnamed: 0,recording_id,species_id,songtype_id,t_min,f_min,t_max,f_max
0,003bec244,14,1,44.544,2531.25,45.1307,5531.25
1,006ab765f,23,1,39.9615,7235.16,46.0452,11283.4
2,007f87ba2,12,1,39.136,562.5,42.272,3281.25
3,0099c367b,17,4,51.4206,1464.26,55.1996,4565.04
4,009b760e6,10,1,50.0854,947.461,52.5293,10852.7


Each species is identified by an integer id. The mapping from id to species name has been revealed [here](https://www.kaggle.com/c/rfcx-species-audio-detection/discussion/238216).

Let's capture this information.

In [5]:
id2species = {
    0: "Eleutherodactylus gryllus",
    1: "Eleutherodactylus brittoni",
    2: "Leptodactylus albilabris",
    3: "Eleutherodactylus coqui",
    4: "Eleutherodactylus hedricki",
    5: "Dendroica angelae",
    6: "Melanerpes portoricensis",
    7: "Coereba flaveola",
    8: "Eleutherodactylus locustus",
    9: "Margarops fuscatus",
    10: "Loxigilla portoricensis",
    11: "Vireo altiloquus",
    12: "Eleutherodactylus portoricensis",
    13: "Megascops nudipes",
    14: "Eleutherodactylus richmondi",
    15: "Patagioenas squamosa",
    16: "Eleutherodactylus antillensis",
    17: "Turdus plumbeus",
    18: "Eleutherodactylus unicolor",
    19: "Coccyzus vieilloti",
    20: "Todus mexicanus",
    21: "Eleutherodactylus wightmanae",
    22: "Nesospingus speculiferus",
    23: "Spindalis portoricensis",
}

The annotations come in two files: `train_tp.csv` holds the true positives (species that can be heard in a given chunk of audio) and `train_fp.csv` holds the false positives (species that the model is likely to detect but that actually do not occur in this particular segment of audio).

These annotations also carry the time and frequency bounds needed to draw a bounding box around the signal, which can be useful for certain types of models. For most applications, though, this is not the most convenient shape of data to work with.
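To make the bounding box idea concrete, here is a minimal sketch that derives the width and height of each box from the four bound columns. It uses two rows copied from the `train_tp.head()` output above, and it assumes that `t_min`/`t_max` are in seconds and `f_min`/`f_max` are in Hz. The column names `duration_s` and `bandwidth_hz` are my own, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the `train_tp.head()` output above.
sample = pd.DataFrame({
    'recording_id': ['003bec244', '007f87ba2'],
    'species_id': [14, 12],
    't_min': [44.544, 39.136],
    'f_min': [2531.25, 562.5],
    't_max': [45.1307, 42.272],
    'f_max': [5531.25, 3281.25],
})

# Width and height of each annotated box
# (assumed units: seconds for time, Hz for frequency).
sample['duration_s'] = sample['t_max'] - sample['t_min']
sample['bandwidth_hz'] = sample['f_max'] - sample['f_min']
```

The same two subtractions would apply to the full `train_tp` frame, since it has the same columns.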

Let's create a new annotation file, saved further down as `data/anno.pkl`, that maps each recording to the labels of the species that appear in it.

In [6]:
id2species = {k: '_'.join([vv.lower() for vv in v.split()]) for k, v in id2species.items()}

In [7]:
id2species

{0: 'eleutherodactylus_gryllus',
 1: 'eleutherodactylus_brittoni',
 2: 'leptodactylus_albilabris',
 3: 'eleutherodactylus_coqui',
 4: 'eleutherodactylus_hedricki',
 5: 'dendroica_angelae',
 6: 'melanerpes_portoricensis',
 7: 'coereba_flaveola',
 8: 'eleutherodactylus_locustus',
 9: 'margarops_fuscatus',
 10: 'loxigilla_portoricensis',
 11: 'vireo_altiloquus',
 12: 'eleutherodactylus_portoricensis',
 13: 'megascops_nudipes',
 14: 'eleutherodactylus_richmondi',
 15: 'patagioenas_squamosa',
 16: 'eleutherodactylus_antillensis',
 17: 'turdus_plumbeus',
 18: 'eleutherodactylus_unicolor',
 19: 'coccyzus_vieilloti',
 20: 'todus_mexicanus',
 21: 'eleutherodactylus_wightmanae',
 22: 'nesospingus_speculiferus',
 23: 'spindalis_portoricensis'}
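The renaming step boils down to lowercasing a name and joining its words with underscores. Here is a small standalone sketch of that transformation; the helper name `to_label` is hypothetical and does not appear in the notebook.

```python
def to_label(name):
    """Lowercase a species name and join its words with underscores."""
    return '_'.join(name.lower().split())


print(to_label("Eleutherodactylus coqui"))  # eleutherodactylus_coqui
```

Because `str.split()` with no arguments splits on any run of whitespace, stray leading, trailing, or doubled spaces in a name would not end up in the label.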

In [8]:
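from pathlib import Path

# One id per audio file in data/train (the file name without its extension).
recording_ids = [path.stem for path in Path('data/train').iterdir()]

# For each recording, collect the species ids of its true-positive annotations.
# Recordings that have no rows in train_tp.csv get an empty list.
species_idxs = []

for recording_id in recording_ids:
    species_idxs.append(train_tp[train_tp.recording_id == recording_id].species_id.tolist())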
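The loop above filters the whole table once per recording. An alternative that should give the same lists is to group the table once and then look each recording up. The sketch below shows the idea on a made-up toy frame; `toy_tp`, `toy_recording_ids`, and the ids in them are illustrative stand-ins, not real data.

```python
import pandas as pd

# Toy stand-ins for `train_tp` and `recording_ids`; the values are made up.
toy_tp = pd.DataFrame({
    'recording_id': ['a', 'a', 'b'],
    'species_id': [3, 7, 3],
})
toy_recording_ids = ['a', 'b', 'c']  # 'c' has no true-positive rows

# Group once, then look up each recording; missing recordings get [].
by_recording = toy_tp.groupby('recording_id')['species_id'].apply(list).to_dict()
toy_species_idxs = [by_recording.get(rid, []) for rid in toy_recording_ids]
```

For a dataset of this size either approach works; the grouped version mainly matters if the table grows much larger.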

In [12]:
species_lists = [[id2species[idx] for idx in idxs] for idxs in species_idxs]

In [15]:
anno = pd.DataFrame(data={'recording_id': recording_ids, 'species': species_lists})

Some recordings have been assigned the same label more than once. Let's normalize this information by removing duplicates.

In [23]:
anno.species = anno.species.apply(lambda species_list: list(set(species_list)))
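One thing to keep in mind is that a `set` does not guarantee any particular order, so `list(set(...))` may order the labels differently between runs. If a stable order matters, a possible variation (not what the cell above does) is to sort after deduplicating, as in this small sketch with a made-up label list:

```python
# A made-up label list with one duplicate.
labels = ['patagioenas_squamosa', 'coereba_flaveola', 'patagioenas_squamosa']

# Remove duplicates, then sort so the result is the same on every run.
deduped = sorted(set(labels))
print(deduped)  # ['coereba_flaveola', 'patagioenas_squamosa']
```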

In [24]:
anno.head()

Unnamed: 0,recording_id,species
0,4071f7aa7,[]
1,ee922d946,[]
2,939659635,[]
3,d58429096,"[coereba_flaveola, patagioenas_squamosa]"
4,0608aef17,[]


In [25]:
anno.to_pickle('data/anno.pkl')
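A pickle is a convenient format here because the `species` column holds Python lists. The sketch below, on a made-up two-row frame, illustrates the difference: a pickle round trip returns lists, while a CSV round trip returns their string representation. The frame `toy` and the temporary path are assumptions for illustration only.

```python
import io
import os
import tempfile

import pandas as pd

# A made-up frame with the same shape as `anno`.
toy = pd.DataFrame({'recording_id': ['a', 'b'], 'species': [['x', 'y'], []]})

# Round trip through a pickle in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'anno.pkl')
    toy.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# Round trip through CSV text held in memory.
from_csv = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(type(from_pickle['species'][0]))  # <class 'list'>
print(type(from_csv['species'][0]))     # <class 'str'>
```

The trade-off is that pickles are Python-specific and should only be loaded from sources you trust.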

Let's rename `train` to `audio` for consistency going forward.

In [18]:
mv data/train data/audio

OK, the data now looks the way we would like it to look. Let's save everything and upload it.

In [26]:
!cd data && zip -qr tropical_soundscape.zip *