**Dataset Name:** Egyptian Fruit Bat <br>
**Paper under which this dataset was shared**: [An annotated dataset of Egyptian fruit bat vocalizations across varying contexts and during vocal ontogeny](https://www.nature.com/articles/sdata2017143) <br>
**Lead Researcher**: [Yossi Yovel](http://www.yossiyovel.com/) <br>
**Dataset License**: [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/) <br>
**Code License**: to be linked to

**Please note**: This notebook documents the steps we took to preprocess the dataset. It is for informational purposes only. You do not need to run the code and can safely navigate to 01_Download_The_Data_And_Construct_Dataloaders.ipynb.

# Data Preprocessing - The Egyptian Fruit Bat Dataset

The original data can be found on [figshare](https://figshare.com/collections/An_annotated_dataset_of_Egyptian_fruit_bat_vocalizations_across_varying_contexts_and_during_vocal_ontogeny/3666502).

It is not provided as a single archive, but rather as a collection of links to 65 webpages from which each piece of the data can be downloaded.

Let us download the audio files along with the provided annotations.

In [18]:
mkdir data

In [19]:
download_urls = [
    'https://ndownloader.figshare.com/files/8879599',
    'https://ndownloader.figshare.com/files/8879602',
    'https://ndownloader.figshare.com/files/8879608',
    'https://ndownloader.figshare.com/files/8879611',
    'https://ndownloader.figshare.com/files/8879617',
    'https://ndownloader.figshare.com/files/8879623',
    'https://ndownloader.figshare.com/files/8879632',
    'https://ndownloader.figshare.com/files/8879641',
    'https://ndownloader.figshare.com/files/8879653',
    'https://ndownloader.figshare.com/files/8879659',
    'https://ndownloader.figshare.com/files/8879662',
    'https://ndownloader.figshare.com/files/8879674',
    'https://ndownloader.figshare.com/files/8879683',
    'https://ndownloader.figshare.com/files/8879179',
    'https://ndownloader.figshare.com/files/8879287',
    'https://ndownloader.figshare.com/files/8879338',
    'https://ndownloader.figshare.com/files/8879392',
    'https://ndownloader.figshare.com/files/8879404',
    'https://ndownloader.figshare.com/files/8879425',
    'https://ndownloader.figshare.com/files/8879428',
    'https://ndownloader.figshare.com/files/8879431',
    'https://ndownloader.figshare.com/files/8879521',
    'https://ndownloader.figshare.com/files/8879533',
    'https://ndownloader.figshare.com/files/8879536',
    'https://ndownloader.figshare.com/files/8879545',
    'https://ndownloader.figshare.com/files/8879548',
    'https://ndownloader.figshare.com/files/8879554',
    'https://ndownloader.figshare.com/files/8879572',
    'https://ndownloader.figshare.com/files/8879578',
    'https://ndownloader.figshare.com/files/8879596',
    'https://ndownloader.figshare.com/files/7379008'
]

In [20]:
for url in download_urls:
    !cd data && curl -O -J -L {url}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 2897M  100 2897M    0     0  5365k      0  0:09:12  0:09:12 --:--:-- 21.8M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 3498M  100 3498M    0     0  4469k      0  0:13:21  0:13:21 --:--:-- 21.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 2862M  100 2862M    0     0  5133k      0  0:09:30  0:09:30 --:--:-- 21.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 3067M  100 3067M    0     0  3852k      0  0:13:35  0:13:35 --:--:-- 9974k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 3028M  100 3028M    0     0  7784k      0  0:06:38  0:06:38 --:--:-- 6587k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 3078M  100 3078M    0     0  6766k      0  0:07:45  0:07:45 --:--:-- 21.5M

Now that we have downloaded the data, let us extract it.

In [9]:
mkdir extracted

In [1]:
!cd data && ls *.zip | parallel 7z e -o../extracted

Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,12 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
1 file, 1100400154 bytes (1050 MiB)

Extracting archive: files106.zip
--
Path = files106.zip
Type = zip
Physical Size = 1100400154

Everything is Ok

Files: 3436
Size:       2880702680
Compressed: 1100400154


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,12 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
1 file, 2959680169 bytes (2823 MiB)

Extracting archive: files213.zip
--
Path = files213.zip
Type = zip
Physical Size = 2959680169

Everything is Ok

Files: 10000
Size:       8724297248
Compressed: 2959680169

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,12 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
1 file, 2991098213 bytes (2853 MiB)

Extracting archive: files214.zip
--
Path = files214.zip
Type = zip
Physical Size = 2991098213

Everything is Ok

Files: 10000
Size:       8682386976
Compressed: 2991098213
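
If GNU Parallel and 7-Zip are not available, the same flat extraction can be approximated with the Python standard library. The following is only a minimal sketch, not the method used above: `extract_flat` is a hypothetical helper name, and it assumes that dropping directory names (as `7z e` does) never makes two files collide. In this sketch a later file with the same name silently overwrites an earlier one.

```python
import shutil
import zipfile
from pathlib import Path


def extract_flat(zip_path, out_dir):
    """Extract every file in zip_path into out_dir, dropping directory names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # directories are not recreated, similar to `7z e`
            target = out_dir / Path(member.filename).name
            # Stream the member to disk instead of reading it fully into memory.
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

To mirror the cell above, one would call `extract_flat(zip_path, "extracted")` for each path in `sorted(Path("data").glob("*.zip"))`. This runs sequentially, so it will likely be slower than the parallel 7-Zip invocation.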


Having extracted the data, let us look at the annotations.

In [2]:
import pandas as pd

In [3]:
anno = pd.read_csv('data/Annotations.csv')

anno.head()

Unnamed: 0,FileID,Emitter,Addressee,Context,Emitter pre-vocalization action,Addressee pre-vocalization action,Emitter post-vocalization action,Addressee post-vocalization action,Start sample,End sample
0,7,118,0,9,2,2,3,3,1,336720
1,11,0,0,11,0,0,0,0,1,787280
2,12,118,0,12,2,2,3,3,1,566096
3,15,0,0,12,0,0,0,0,1,402256
4,20,0,0,12,0,0,0,0,1,394064


To go from the `FileID` identifier to an actual filename, we will need the `FileInfo.csv` reference file.

In [11]:
!cd data && curl -O -J -L https://ndownloader.figshare.com/files/8900695

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 30.1M  100 30.1M    0     0   439k      0  0:01:10  0:01:10 --:--:--  475k


The `FileInfo.csv` file is not a proper CSV file, but the only information we need from it is the mapping from a `FileID` to a filename. Let us read this mapping in.

In [4]:
file_id2filename = {}

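with open('data/FileInfo.csv') as f:
    for line in f:
        try:
            # Field 0 is the FileID, field 2 the file name; the rest is ignored.
            file_id, _, file_name, *_ = line.split(',')
            file_id2filename[int(file_id)] = file_name
        except ValueError:
            # Skip the header and any line with too few fields or a non-numeric FileID.
            continue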
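
As an alternative to splitting on commas by hand, the standard `csv` module can read the file row by row and also copes with quoted fields. The sketch below is an assumption-laden illustration rather than the code we ran: `read_file_id_map` is a hypothetical helper, the sample rows are made up, and the only thing taken from the cell above is that field 0 holds the `FileID` and field 2 the file name.

```python
import csv
import io


def read_file_id_map(lines):
    """Map integer FileID (field 0) to file name (field 2), skipping other rows."""
    mapping = {}
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # too short to hold both the FileID and the file name
        try:
            mapping[int(row[0])] = row[2]
        except ValueError:
            continue  # header or any other row with a non-numeric first field
    return mapping


# Made-up sample in the assumed layout (these are not real rows from FileInfo.csv).
sample = io.StringIO(
    "FileID,Treatment ID,File name,Extra\n"
    "7,1,a.WAV,x,y,z\n"
    "bad row\n"
    "11,1,b.WAV\n"
)
print(read_file_id_map(sample))  # {7: 'a.WAV', 11: 'b.WAV'}
```

With the real file, the call would be `read_file_id_map(f)` inside `with open('data/FileInfo.csv', newline='') as f:`.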

In [5]:
import librosa
import glob

Before we move any further, let's confirm all the files indeed were recorded with the same sample rate.

In [51]:
%%time

srs = set()
for path in glob.glob('extracted/*'):
    _, sr = librosa.core.load(path, sr=None)
    srs.add(sr)

srs

CPU times: user 5min 11s, sys: 52.4 s, total: 6min 3s
Wall time: 6min 3s


{250000}

This confirms that all the files have been recorded with a sample rate of 250_000 Hz.
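
The check above decodes every recording in full, which is why it takes several minutes. If the files are uncompressed PCM WAV (an assumption we have not verified here), the sample rate can instead be read from the header alone with the standard `wave` module. A minimal sketch, with `wav_sample_rate` as a hypothetical helper name and a tiny synthetic file standing in for a real recording:

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sample rate stored in a WAV header without decoding the audio."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


# Demonstration on a synthetic file (ten silent 16-bit mono frames).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as demo:
        demo.setnchannels(1)
        demo.setsampwidth(2)
        demo.setframerate(250_000)
        demo.writeframes(b"\x00\x00" * 10)
    print(wav_sample_rate(demo_path))  # 250000
```

Applied to the dataset, this would be `{wav_sample_rate(p) for p in glob.glob('extracted/*')}`. The `wave` module raises `wave.Error` for formats it cannot parse, in which case the librosa-based check remains the fallback.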

To make the dataset easier to work with, we will iterate over the annotated examples, slice out the relevant span of each recording, and write it to a standalone WAV file.

The naming convention we will adopt is that the index of the row corresponding to a given example will form the stem of the file name.

In [7]:
mkdir audio

In [8]:
%%time

for idx, example in anno.iterrows():
    path = file_id2filename[example["FileID"]]
    x, _ = librosa.core.load(f'extracted/{path}', sr=None)
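    # Note: librosa.output.write_wav was removed in librosa 0.8; newer versions need soundfile.write instead.
    librosa.output.write_wav(f'audio/{idx}.wav', x[example['Start sample']:example['End sample']], 250_000)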

CPU times: user 2min 21s, sys: 2min 50s, total: 5min 11s
Wall time: 30min 11s
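
The loop above loads the source recording again for every annotation row. If several rows share a `FileID`, grouping the rows first lets each recording be decoded only once. The following is a sketch of that idea, not the code we ran: `slice_clips` and `fake_load` are hypothetical names, and a plain list stands in for the decoded audio so the example runs without any files.

```python
from collections import defaultdict


def slice_clips(rows, load):
    """Yield (row_index, clip) pairs, calling load() once per distinct file.

    rows: iterable of (row_index, file_id, start, end) tuples.
    load: callable taking a file_id and returning an indexable sample sequence.
    """
    by_file = defaultdict(list)
    for row_index, file_id, start, end in rows:
        by_file[file_id].append((row_index, start, end))
    for file_id, spans in by_file.items():
        samples = load(file_id)  # decoded once, reused for every span
        for row_index, start, end in spans:
            yield row_index, samples[start:end]  # same slicing as the loop above


calls = []


def fake_load(file_id):
    """Stand-in loader that records which files were requested."""
    calls.append(file_id)
    return list(range(10))


rows = [(0, 7, 1, 4), (1, 11, 0, 2), (2, 7, 5, 8)]
clips = dict(slice_clips(rows, fake_load))
print(clips[0], clips[1], clips[2])  # [1, 2, 3] [0, 1] [5, 6, 7]
print(calls)  # [7, 11]
```

In the notebook, `load` would wrap the librosa call together with the `file_id2filename` lookup, and each yielded clip would be written to `audio/{row_index}.wav`.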


Let us now amend the annotations to include the correct file names.

In [23]:
anno['File Name'] = [f'{idx}.wav' for idx in anno.index]

We can remove the columns that are no longer needed.

In [17]:
anno.drop(columns=['FileID', 'Start sample', 'End sample'], inplace=True)

In [24]:
anno.head()

Unnamed: 0,Emitter,Addressee,Context,Emitter pre-vocalization action,Addressee pre-vocalization action,Emitter post-vocalization action,Addressee post-vocalization action,File Name
0,118,0,9,2,2,3,3,0.wav
1,0,0,11,0,0,0,0,1.wav
2,118,0,12,2,2,3,3,2.wav
3,0,0,12,0,0,0,0,3.wav
4,0,0,12,0,0,0,0,4.wav


Let's save the new annotations file.

In [25]:
anno.to_csv('annotations.csv', index=False)
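
Before archiving, it may be worth checking that every row of the saved annotations points at a clip that actually exists. This is a small sketch of such a check using only the standard library. `missing_audio` is a hypothetical helper, and it assumes the `File Name` column and the `audio` directory created above.

```python
import csv
from pathlib import Path


def missing_audio(annotations_csv, audio_dir, column="File Name"):
    """Return the file names listed in the CSV that are absent from audio_dir."""
    audio_dir = Path(audio_dir)
    with open(annotations_csv, newline="") as f:
        return [
            row[column]
            for row in csv.DictReader(f)
            if not (audio_dir / row[column]).is_file()
        ]
```

Calling `missing_audio('annotations.csv', 'audio')` should return an empty list if every clip was written successfully.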

This dataset has multiple labels that can be used for training, and it is not apparent which split would work best. We therefore do not partition the dataset into training and validation sets.

We can now archive the data for distribution.

In [None]:
!zip -1qr egyptian_fruit_bats.zip annotations.csv audio