In [1]:
!wget https://ndownloader.figshare.com/articles/11905533/versions/1 -O data.zip

--2020-04-13 15:49:17--  https://ndownloader.figshare.com/articles/11905533/versions/1
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 52.212.16.124, 34.249.48.57, 52.51.133.64, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|52.212.16.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190890195 (182M) [application/zip]
Saving to: ‘data.zip’


2020-04-13 15:49:27 (17.8 MB/s) - ‘data.zip’ saved [190890195/190890195]



In [2]:
!unzip data.zip -d data

Archive:  data.zip
 extracting: data/AdultVocalizations.zip  
 extracting: data/ChickVocalizations.zip  
 extracting: data/Library_notes.pdf  


In [3]:
%%capture
!unzip data/AdultVocalizations.zip -d data/adult_vocalizations
!unzip data/ChickVocalizations.zip -d data/chick_vocalizations

In [4]:
from IPython.lib.display import Audio
import re
import pathlib
import pandas as pd
from fastcore.all import *

In [5]:
adult_paths = L(pathlib.Path('data/adult_vocalizations/').iterdir())
chick_paths = L(pathlib.Path('data/chick_vocalizations/').iterdir())

In [6]:
len(adult_paths), len(chick_paths), len(adult_paths) + len(chick_paths)

(2969, 464, 3433)

In [7]:
name_pattern = re.compile(r'(.*)_(.*)-(.*)-(.*)') # should extract name, date of recording, call type, rendition num

m = re.match(name_pattern, adult_paths[0].stem)
m.groups()

('WhiRas44dd', '110815', 'ThuckC', '37')

In [8]:
Audio(filename=adult_paths[2])

In [9]:
adult_paths[0].name

'WhiRas44dd_110815-ThuckC-37.wav'

In [10]:
paths = adult_paths + chick_paths

df = pd.DataFrame(data={'fn': [path.name for path in paths]})
df['adult'] = [path.parent.name == 'adult_vocalizations' for path in paths]

In [11]:
df.head()

                                fn  adult
0  WhiRas44dd_110815-ThuckC-37.wav   True
1    WhiBlu5698_110304-TetC-13.wav   True
2    HPiHPi4748_110706-AggC-31.wav   True
3   GraGra0201_110623-NestC-29.wav   True
4   YelGre5275_110622-NestC-29.wav   True


Unfortunately, there is an issue with the file naming, but it can easily be corrected. Quite a few names have an underscore where a hyphen is expected: GreRas2400_110615_TetC-28.wav instead of GreRas2400_110615-TetC-28.wav.

We can work around this by modifying our regex pattern.

In [12]:
name_pattern = re.compile(r'(.*)_(.*)[-_](.*)-(.*)\.wav') # should extract name, date of recording, call type, rendition num
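As a quick sanity check, the sketch below runs the relaxed pattern against both spellings quoted above. The name `relaxed_pattern` is introduced here only for illustration, and the pattern is written as a raw string so that `\.` reaches the regex engine unchanged.

```python
import re

# Same pattern as above; `[-_]` accepts either separator after the date.
relaxed_pattern = re.compile(r'(.*)_(.*)[-_](.*)-(.*)\.wav')

for fn in ['GreRas2400_110615-TetC-28.wav', 'GreRas2400_110615_TetC-28.wav']:
    # Both spellings should yield the same four groups.
    print(fn, '->', relaxed_pattern.match(fn).groups())
```

Because the groups are greedy, the first group initially grabs everything up to the last underscore and only backs off when the rest of the pattern cannot match, which is what makes the extra underscore harmless here.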

In [13]:
name_date_type_num = L(re.match(name_pattern, path.name).groups() for path in paths)

In [14]:
name, date, call_type, rendition_num = [list(l) for l in zip(*name_date_type_num)]

In [15]:
name[:5], date[:5], call_type[:5], rendition_num[:5]

(['WhiRas44dd', 'WhiBlu5698', 'HPiHPi4748', 'GraGra0201', 'YelGre5275'],
 ['110815', '110304', '110706', '110623', '110622'],
 ['ThuckC', 'TetC', 'AggC', 'NestC', 'NestC'],
 ['37', '13', '31', '29', '29'])

Some manual cleanup of infrequently occurring issues:

In [16]:
date[923], date[2798]

('110706', '110421')

In [17]:
date[923] = '110608'; date[2798] = '110608'

In [18]:
call_type[923] = 'Ne'; call_type[2798] = 'Ne'

In [19]:
date[926], date[1209], date[1884] = [None ] * 3
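The notebook does not show how these indices were found. One way to surface candidates, sketched here under the assumption that every date should be a six-character `yymmdd` string, is to try parsing each value and collect the failures. The helper name `find_bad_dates` and the sample values are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime

def find_bad_dates(dates):
    """Return (index, value) pairs for entries that are not valid yymmdd strings."""
    bad = []
    for i, value in enumerate(dates):
        try:
            # Reject anything that is not exactly six characters long.
            if len(value) != 6:
                raise ValueError('expected six characters')
            datetime.strptime(value, '%y%m%d')
        except ValueError:
            bad.append((i, value))
    return bad

# Made-up sample values, not taken from the dataset.
sample_dates = ['110815', '111332', 'xx0608']
print(find_bad_dates(sample_dates))
```

A check like this only catches values that fail to parse; a date that is well formed but wrong still has to be found by inspection.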

In [20]:
df['name'] = name
df['date_recorded'] = [pd.to_datetime(d, yearfirst=True, errors='coerce') for d in date]
df['call_type'] = call_type
df['rendition_num'] = rendition_num

It also seems that some recordings were classified as coming from both chicks and adults.

In [21]:
df.fn.nunique()

3405

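The 3433 rows contain only 3405 distinct file names, so 28 file names occur twice, once under each label. Let's drop the repeated rows. Note that `duplicated()` keeps the first occurrence, so each of these recordings stays in the dataset once, with the adult label.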

In [22]:
df = df[~df.fn.duplicated()]
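To inspect which file names were listed under both labels, one option is a set intersection of the two directory listings. The sketch below uses two short, made-up lists (`adult_names`, `chick_names`) in place of the real listings; with the notebook's data they could be built from `adult_paths` and `chick_paths`.

```python
# Hypothetical file names standing in for the two directory listings.
adult_names = ['BirdA_110101-TetC-1.wav', 'BirdB_110102-NestC-2.wav', 'BirdC_110103-Song-3.wav']
chick_names = ['BirdC_110103-Song-3.wav', 'BirdD_110104-BeggSeq-4.wav']

# Names present in both listings were assigned both labels.
in_both = sorted(set(adult_names) & set(chick_names))
print(in_both)

# Each such name contributes one extra row, so the count equals the gap
# between the number of rows and the number of unique names.
n_rows = len(adult_names) + len(chick_names)
n_unique = len(set(adult_names + chick_names))
assert n_rows - n_unique == len(in_both)
```

Listing the affected names first makes it possible to listen to them and decide which label to keep, rather than defaulting to whichever copy happens to come first.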

In [23]:
df.reset_index(inplace=True, drop=True)

In [24]:
sorted(df.call_type.unique())

['Ag',
 'AggC',
 'BeggSeq',
 'Beggseq',
 'C',
 'DC',
 'DisC',
 'LTC',
 'Ne',
 'NeArkC',
 'NeKakleC',
 'NeSeq',
 'NekakleC',
 'NestC',
 'NestCSeq',
 'NestCseq',
 'NestSeq',
 'So',
 'Song',
 'Te',
 'Tet',
 'TetC',
 'ThuC',
 'ThuckC',
 'ThukC',
 'TukC',
 'WC',
 'Wh',
 'Whi',
 'WhiC',
 'WhiCNestC',
 'Whine',
 'WhineC',
 'WhineCSeq']

As mentioned in the library notes ('data/Library_notes.pdf') that came with the data, there are issues with the labels.

Some of the malformed labels are ambiguous. For instance, is 'ThuC' a 'Thuk' or a 'Tuck'? What about 'ThuckC'? Is 'DisC' a 'Distance' call or a 'Distress' call?
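Not every variant is ambiguous, though: several labels in the list above differ only in capitalisation. A minimal sketch for surfacing such pairs, using a hand-copied subset of the labels, groups them by their lower-cased form.

```python
from collections import defaultdict

# A subset of the labels listed above.
labels = ['BeggSeq', 'Beggseq', 'NeKakleC', 'NekakleC',
          'NestCSeq', 'NestCseq', 'TetC', 'Song']

# Group labels by their lower-cased form.
by_folded = defaultdict(list)
for label in labels:
    by_folded[label.lower()].append(label)

# Keep only the groups that contain more than one spelling.
case_variants = [group for group in by_folded.values() if len(group) > 1]
print(case_variants)
```

These groups can be merged without listening to any audio, which leaves the genuinely ambiguous labels for manual review.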

We have no way of telling unless we listen to the calls ourselves and make educated guesses. Let's do a preliminary, high-level cleanup as we work on unifying the labels. Later, once we have a trained model, we will be able to use it to identify any remaining issues.

But how do we go about the initial data cleanup? We could load recordings one by one, but that gets tedious quickly and makes mistakes easy. To do this efficiently, we want a tool that gives us instantaneous access to calls of a particular type, so that we can easily compare them.

Let's hack something like this together using ipywidgets.

First, let's move all the files into a single directory so that the data is easier to work with.

In [25]:
!mkdir -p data/vocalizations

!cp data/adult_vocalizations/* data/vocalizations/
!cp data/chick_vocalizations/* data/vocalizations/

In [26]:
import random  # used below to pick the initial button

from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [27]:
def create_expanded_button(description, button_style):
    return widgets.Button(
        description=description,
        button_style=button_style,
        layout=widgets.Layout(height='auto', width='auto')
    )

In [28]:
grid = widgets.GridspecLayout(5, 8)
call_types = sorted(df.call_type.unique())

out = widgets.Output(layout={'border': '1px solid black', 'margin': '20px 0px 0px 0px'})

@out.capture(clear_output=True)
def update_output(button):
    # Button descriptions look like 'TetC [123]'; keep only the label part.
    call_type = button.description.rsplit(' [', 1)[0]
    display(widgets.HTML(f'<p>Loaded call of type <strong>{call_type}</strong></p>'))
    fn = df[df.call_type == call_type].sample(1).fn.item()
    path = f'data/vocalizations/{fn}'

    with open(path, 'rb') as fd:
        contents = fd.read()
        audio = widgets.Audio(value=contents, autoplay=True, loop=False, controls=True)
    
    display(audio)

for i in range(5):
    for j in range(8):
        idx = i*8 + j
        if idx < len(call_types):
            n = df[df.call_type == call_types[idx]].shape[0]
            btn = create_expanded_button(f'{call_types[idx]} [{n}]', button_style='warning')
            btn.on_click(update_output)
            grid[i, j] = btn
    
initial_button = random.choice(grid.children) # start with a randomly chosen button
update_output(initial_button)
    
widgets.VBox([widgets.HTML('<h3>Press to load a random recording with a given label</h3>'), grid, out])


In [29]:
type2name = {
    'Ag': 'Wsst',
    'Be': 'Begging',
    'DC': 'Distance',
    'Di': 'Distress',
    'LT': 'Long Tonal',
    'Ne': 'Nest',
    'So': 'Song',
    'Te': 'Tet',
    'Th': 'Thuk',
    'Tu': 'Tuck',
    'Wh': 'Whine'
}
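The notebook does not show how this mapping is applied. Since every key is two characters long, one plausible reading is that it is meant to be looked up by the first two characters of a raw label. The sketch below follows that assumption; the helper name `canonical_name` is hypothetical.

```python
type2name = {
    'Ag': 'Wsst', 'Be': 'Begging', 'DC': 'Distance', 'Di': 'Distress',
    'LT': 'Long Tonal', 'Ne': 'Nest', 'So': 'Song', 'Te': 'Tet',
    'Th': 'Thuk', 'Tu': 'Tuck', 'Wh': 'Whine',
}

def canonical_name(label):
    """Look up a raw label by its two-character prefix (an assumption).

    Returns None when the prefix is not a key of type2name.
    """
    return type2name.get(label[:2])

for raw in ['AggC', 'BeggSeq', 'LTC', 'WC']:
    print(raw, '->', canonical_name(raw))
```

A prefix lookup like this would silently resolve labels that the text above calls ambiguous, such as 'ThuC' and 'DisC', and it leaves labels such as 'C' and 'WC' unmapped, so it is at best a starting point for the manual review.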

In [30]:
df.to_csv('data/annotations.csv', index=False)