In [1]:
!wget https://ndownloader.figshare.com/articles/11905533/versions/1 -O data.zip

--2020-04-14 14:44:27--  https://ndownloader.figshare.com/articles/11905533/versions/1
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 52.51.133.64, 34.249.45.252, 52.212.2.22, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|52.51.133.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190890195 (182M) [application/zip]
Saving to: ‘data.zip’


2020-04-14 14:45:06 (4.67 MB/s) - ‘data.zip’ saved [190890195/190890195]



In [None]:
!unzip -o data.zip -d data

Archive:  data.zip

In [None]:
%%capture
!unzip data/AdultVocalizations.zip -d data/adult_vocalizations
!unzip data/ChickVocalizations.zip -d data/chick_vocalizations

In [101]:
from IPython.lib.display import Audio
import re
import pathlib
import pandas as pd
from fastcore.all import *

In [102]:
adult_paths = L(pathlib.Path('data/adult_vocalizations/').iterdir())
chick_paths = L(pathlib.Path('data/chick_vocalizations/').iterdir())

In [103]:
len(adult_paths), len(chick_paths), len(adult_paths) + len(chick_paths)

(2969, 464, 3433)

In [104]:
name_pattern = re.compile(r'(.*)_(.*)-(.*)-(.*)') # should extract name, date of recording, call type, rendition num

m = re.match(name_pattern, adult_paths[0].stem)
m.groups()

('WhiWhi1415', '110405', 'NestC', '06')

In [6]:
Audio(filename=adult_paths[2])

In [105]:
adult_paths[0].name

'WhiWhi1415_110405-NestC-06.wav'

In [106]:
paths = adult_paths + chick_paths

df = pd.DataFrame(data={'fn': [path.name for path in paths]})
df['adult'] = [path.parent.name == 'adult_vocalizations' for path in paths]

In [107]:
df.head()

                               fn  adult
0  WhiWhi1415_110405-NestC-06.wav   True
1     WhiLbl0010_110502-DC-23.wav   True
2   BluRas61dd_110406-TetC-13.wav   True
3   GraLbl0457_110429-Song-08.wav   True
4     BlaLbl8026_110429-DC-05.wav   True


Unfortunately, there is an issue with the naming, but it can easily be corrected. Quite a few file names have an underscore where a hyphen is expected: GreRas2400_110615_TetC-28.wav instead of GreRas2400_110615-TetC-28.wav.

We can work around this by modifying our regex pattern.

In [108]:
name_pattern = re.compile(r'(.*)_(.*)[-_](.*)-(.*)\.wav') # should extract name, date of recording, call type, rendition num
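
As a quick sanity check, the sketch below applies the updated pattern to the two spellings quoted above and confirms that they yield the same groups. The pattern is written as a raw string so that `\.` reaches the regex engine unchanged.

```python
import re

# Same pattern as above, written as a raw string.
name_pattern = re.compile(r'(.*)_(.*)[-_](.*)-(.*)\.wav')

# The two spellings of the same file name mentioned in the text.
with_underscore = 'GreRas2400_110615_TetC-28.wav'
with_hyphen = 'GreRas2400_110615-TetC-28.wav'

groups_underscore = name_pattern.match(with_underscore).groups()
groups_hyphen = name_pattern.match(with_hyphen).groups()

print(groups_underscore)
print(groups_underscore == groups_hyphen)
```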

In [109]:
name_date_type_num = L(re.match(name_pattern, path.name).groups() for path in paths)

In [110]:
name, date, call_type, rendition_num = [list(l) for l in zip(*name_date_type_num)]

In [111]:
name[:5], date[:5], call_type[:5], rendition_num[:5]

(['WhiWhi1415', 'WhiLbl0010', 'BluRas61dd', 'GraLbl0457', 'BlaLbl8026'],
 ['110405', '110502', '110406', '110429', '110429'],
 ['NestC', 'DC', 'TetC', 'Song', 'DC'],
 ['06', '23', '13', '08', '05'])

Some manual cleanup of infrequently occurring issues:

In [112]:
date[923], date[2798]

('110608', '110518')

In [113]:
date[923] = '110608'; date[2798] = '110608'

In [117]:
call_type[923] = 'Ne'; call_type[2798] = 'Ne'; call_type[869] = 'Ne'; call_type[1663] = 'Ne'

In [118]:
date[926], date[1209], date[1884] = [None] * 3
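
The indices patched above were presumably found by inspecting the data. One way such entries could be located is to try parsing every date string as `yymmdd` and collect the positions that fail. This is only a sketch: `find_bad_dates` is a hypothetical helper that is not part of this notebook, and the sample values are made up for illustration.

```python
from datetime import datetime

def find_bad_dates(dates, fmt='%y%m%d'):
    """Return (index, value) pairs for entries that do not parse with `fmt`."""
    bad = []
    for i, d in enumerate(dates):
        try:
            datetime.strptime(d, fmt)
        except (TypeError, ValueError):
            # TypeError covers missing values (None), ValueError covers malformed strings.
            bad.append((i, d))
    return bad

# Illustrative values only: two valid dates, one malformed string, one missing entry.
sample_dates = ['110405', '110502', 'unknown', None]
bad_dates = find_bad_dates(sample_dates)
print(bad_dates)
```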

In [119]:
df['name'] = name
df.name = df.name.str.lower()
df['date_recorded'] = [pd.to_datetime(d, yearfirst=True, errors='coerce') for d in date]
df['call_type'] = call_type
df['rendition_num'] = rendition_num

It also seems that some recordings were classified as coming from both chicks and adults: the number of rows is larger than the number of unique file names.

In [120]:
df.shape[0], df.fn.nunique()

(3433, 3405)

In [121]:
df[df.adult].name.nunique()

37

In [122]:
df[~df.adult].name.nunique()

18

Looking closer at the situation and comparing with the information in the paper (45 zebra finches were included in the study: 18 chicks and 27 adults), we can assume that some calls were erroneously assigned the adult label. To clean up the dataset a little more, let's remove them.

In [123]:
df = df.drop(index=df[df.fn.duplicated(keep=False) & (df.adult)].index)

In [124]:
df.reset_index(drop=True, inplace=True)
df.shape

(3405, 6)
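
To make the effect of the mask concrete, here is a toy version of the same filter. The file names are invented. `duplicated(keep=False)` flags every copy of a repeated `fn`, and combining it with `adult` means only the adult-labelled copy of each pair is dropped.

```python
import pandas as pd

# Toy data: 'b.wav' appears twice, once labelled adult and once labelled chick.
toy = pd.DataFrame({
    'fn': ['a.wav', 'b.wav', 'b.wav', 'c.wav'],
    'adult': [True, True, False, False],
})

# keep=False marks all rows whose fn occurs more than once, not just the later ones.
is_dup = toy.fn.duplicated(keep=False)

# Drop only the duplicated rows that carry the adult label.
cleaned = toy.drop(index=toy[is_dup & toy.adult].index).reset_index(drop=True)
print(cleaned)
```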

In [125]:
df.groupby('adult').nunique()['name']

adult
False    18
True     36
Name: name, dtype: int64

The situation is still not ideal: there are more adult zebra finches in the dataset than there should be according to the paper (36 versus 27). Also, some zebra finches are still classified as both chicks and adults (though they may have been recorded both when they were chicks and when they were adults).

In [126]:
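adult_names = set(df[df.adult].name)  # computed once instead of on every iteration
for bird in df[~df.adult].name.unique():  # `bird` avoids shadowing the `name` list
    if bird in adult_names: print(bird)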

lblras1800
lblblu2028
graras1500
lblblu1630
lblblu1729
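
The same check can also be phrased as a set intersection. A minimal sketch on a toy frame (the bird names are placeholders, not taken from the dataset):

```python
import pandas as pd

# Placeholder names: 'bird_a' has both adult and chick rows.
toy = pd.DataFrame({
    'name': ['bird_a', 'bird_a', 'bird_b', 'bird_c', 'bird_c'],
    'adult': [True, False, True, False, False],
})

adult_names = set(toy[toy.adult].name)
chick_names = set(toy[~toy.adult].name)

# Names that occur under both labels.
in_both = sorted(adult_names & chick_names)
print(in_both)
```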


In [127]:
sorted(df.call_type.unique())

['Ag',
 'AggC',
 'BeggSeq',
 'Beggseq',
 'DC',
 'DisC',
 'LTC',
 'Ne',
 'NeArkC',
 'NeKakleC',
 'NeSeq',
 'NekakleC',
 'NestC',
 'NestCSeq',
 'NestCseq',
 'NestSeq',
 'So',
 'Song',
 'Te',
 'Tet',
 'TetC',
 'ThuC',
 'ThuckC',
 'ThukC',
 'TukC',
 'WC',
 'Wh',
 'Whi',
 'WhiC',
 'WhiCNestC',
 'Whine',
 'WhineC',
 'WhineCSeq']
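
Several entries in this list differ only in capitalization. A small sketch for surfacing such case-only variants groups the labels by their lowercased form (the list is copied from the output above):

```python
from collections import defaultdict

labels = [
    'Ag', 'AggC', 'BeggSeq', 'Beggseq', 'DC', 'DisC', 'LTC', 'Ne', 'NeArkC',
    'NeKakleC', 'NeSeq', 'NekakleC', 'NestC', 'NestCSeq', 'NestCseq', 'NestSeq',
    'So', 'Song', 'Te', 'Tet', 'TetC', 'ThuC', 'ThuckC', 'ThukC', 'TukC', 'WC',
    'Wh', 'Whi', 'WhiC', 'WhiCNestC', 'Whine', 'WhineC', 'WhineCSeq',
]

# Group the labels by their lowercased spelling.
by_lower = defaultdict(list)
for lbl in labels:
    by_lower[lbl.lower()].append(lbl)

# Keep only the groups with more than one spelling.
case_variants = {k: v for k, v in by_lower.items() if len(v) > 1}
print(case_variants)
```

Case-only variants can be merged mechanically. The remaining inconsistencies need a judgment call.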

As mentioned in the library notes ('data/Library_notes.pdf') that came with the data, there are issues with the labels.

Some of the malformed labels are ambiguous. For instance, is 'ThuC' a 'Thuk' or a 'Tuck'? What about 'ThuckC'? Is 'DisC' a 'Distance' call or a 'Distress' call?

We have no way of telling unless we listen to the calls ourselves and make educated guesses. Let's do a preliminary, high-level cleanup as we work on unifying the labels. Later, once we have a trained model, we will be able to leverage it to identify any remaining issues.

But how do we go about the initial data cleanup? We could load recordings one by one, but that gets tedious quickly and makes it easy to slip up. To do this efficiently, it helps to have a tool that gives instantaneous access to calls of a particular type so that they can be compared easily. I hacked something like this together, and it can be accessed through this [notebook](application_for_playing_vocalizations_to_help_with_labeling.ipynb). Using this functionality, I will now attempt to merge the labels.
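
The linked notebook contains the actual tool. As a rough idea of the underlying lookup only, the sketch below groups file names by call type so that a few examples of each can be pulled up side by side. `examples_by_type` is a hypothetical helper and the rows are invented. In a notebook, each returned file could then be played with `Audio(filename=...)`.

```python
import pandas as pd

def examples_by_type(annotations, n=3):
    """Map each call type to at most `n` file names carrying that label."""
    first_n = annotations.groupby('call_type').head(n)
    return first_n.groupby('call_type')['fn'].apply(list).to_dict()

# Invented rows standing in for the annotations table.
toy = pd.DataFrame({
    'fn': ['a.wav', 'b.wav', 'c.wav', 'd.wav'],
    'call_type': ['ThuC', 'ThukC', 'ThuC', 'ThuC'],
})

examples = examples_by_type(toy, n=2)
print(examples)
```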

Let's move all the files into a single directory so that we have an easier time working with the data.

In [128]:
!mkdir -p data/vocalizations

!cp data/adult_vocalizations/* data/vocalizations/
!cp data/chick_vocalizations/* data/vocalizations/
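
One thing to keep in mind: 28 file names occur in both source directories (3433 files, 3405 unique names), so the second `cp` overwrites the adult copy of each of them. Whether the two copies are identical is not stated in the source. Below is a sketch of a copy step that behaves like `cp` but reports such collisions. `copy_reporting_collisions` is a hypothetical helper, demonstrated here on temporary directories with made-up files.

```python
import shutil
import tempfile
from pathlib import Path

def copy_reporting_collisions(src_dirs, dst_dir):
    """Copy all files into dst_dir; return names that were already present."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    collisions = []
    for src in src_dirs:
        for path in sorted(Path(src).iterdir()):
            target = dst / path.name
            if target.exists():
                collisions.append(path.name)
            shutil.copy2(path, target)  # like cp, the later copy wins
    return collisions

# Demonstration on throwaway directories with one shared file name.
root = Path(tempfile.mkdtemp())
for d, names in {'adult': ['a.wav', 'b.wav'], 'chick': ['b.wav', 'c.wav']}.items():
    (root / d).mkdir()
    for name in names:
        (root / d / name).write_text(d)

collisions = copy_reporting_collisions([root / 'adult', root / 'chick'], root / 'all')
print(collisions)
```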

In [130]:
type2name = {
    'Ag': 'Wsst',
    'Be': 'Begging',
    'DC': 'Distance',
    'Di': 'Distress',
    'LT': 'Long Tonal',
    'Ne': 'Nest',
    'So': 'Song',
    'Te': 'Tet',
    'Th': 'Thuk',
    'Tu': 'Tuck',
    'Wh': 'Whine'
}

In [131]:
df.to_csv('data/annotations.csv', index=False) # saving the annotations for use in the application for playing 
                                               # example sounds through a browser

In [132]:
type2labels = {
    'Ag': ['Ag', 'AggC'],
    'Be': ['BeggSeq', 'Beggseq'],
    'DC': ['DC'],
    'Di': ['DisC'],
    'LT': ['LTC'],
    'Ne': ['Ne','NeArkC', 'NeKakleC', 'NeSeq', 'NekakleC', 'NestC', 'NestCSeq', 'NestCseq', 'NestSeq'],
    'So': ['So', 'Song'],
    'Te': ['Te', 'Tet','TetC'],
    'Th': ['ThuC', 'ThuckC', 'ThukC'],
    'Tu': ['TukC'],
    'Wh': ['WC', 'Wh', 'Whi', 'WhiC', 'WhiCNestC', 'Whine', 'WhineC', 'WhineCSeq'],
}

In [133]:
labels2type = {}
for t, lbls in type2labels.items():
    for lbl in lbls:
        labels2type[lbl] = t
print(labels2type)

{'Ag': 'Ag', 'AggC': 'Ag', 'BeggSeq': 'Be', 'Beggseq': 'Be', 'DC': 'DC', 'DisC': 'Di', 'LTC': 'LT', 'Ne': 'Ne', 'NeArkC': 'Ne', 'NeKakleC': 'Ne', 'NeSeq': 'Ne', 'NekakleC': 'Ne', 'NestC': 'Ne', 'NestCSeq': 'Ne', 'NestCseq': 'Ne', 'NestSeq': 'Ne', 'So': 'So', 'Song': 'So', 'Te': 'Te', 'Tet': 'Te', 'TetC': 'Te', 'ThuC': 'Th', 'ThuckC': 'Th', 'ThukC': 'Th', 'TukC': 'Tu', 'WC': 'Wh', 'Wh': 'Wh', 'Whi': 'Wh', 'WhiC': 'Wh', 'WhiCNestC': 'Wh', 'Whine': 'Wh', 'WhineC': 'Wh', 'WhineCSeq': 'Wh'}
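
When inverting a mapping like this, it can be worth checking that no raw label is listed under two types, since the later entry would silently win. A sketch on a subset of `type2labels` that flattens the mapping into (label, type) pairs before building the inverse:

```python
# Subset of the mapping defined above.
type2labels = {
    'Di': ['DisC'],
    'Th': ['ThuC', 'ThuckC', 'ThukC'],
    'Tu': ['TukC'],
}

# Flatten to (label, type) pairs so duplicates can be detected before inverting.
pairs = [(lbl, t) for t, lbls in type2labels.items() for lbl in lbls]
all_labels = [lbl for lbl, _ in pairs]
assert len(all_labels) == len(set(all_labels)), 'a label is listed under two types'

labels2type = dict(pairs)
print(labels2type)
```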


In [134]:
labels2type['DisC']

'Di'

In [135]:
df.call_type = df.call_type.apply(lambda x: labels2type[x])

In [136]:
df.call_type = df.call_type.apply(lambda x: type2name[x])

In [137]:
df.call_type.value_counts()

Tet           613
Distance      607
Nest          581
Thuk          301
Begging       262
Tuck          239
Long Tonal    215
Wsst          200
Song          192
Whine         175
Distress       20
Name: call_type, dtype: int64
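
The two `apply` calls index the dictionaries directly, so an unexpected label raises a `KeyError`. An alternative sketch uses `Series.map`, which returns `NaN` for unmapped values, so they can be listed before the change is committed. The dictionaries are subsets of the ones above and the label `'Mystery'` is invented to show the behavior.

```python
import pandas as pd

labels2type = {'DC': 'DC', 'DisC': 'Di', 'Song': 'So'}
type2name = {'DC': 'Distance', 'Di': 'Distress', 'So': 'Song'}

raw = pd.Series(['DC', 'DisC', 'Song', 'Mystery'])

# Chain the two lookups; labels missing from labels2type become NaN.
mapped = raw.map(labels2type).map(type2name)

# Raw labels that did not survive the mapping.
unmapped = raw[mapped.isna()].tolist()
print(mapped.tolist())
print(unmapped)
```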

In [138]:
df.to_csv('data/annotations.csv', index=False) # saving the cleaned up csv now