In [1]:
#default_exp metrics

First of all, let us get all the data that we need. Through the magic of `nbdev`, we can reuse the functionality we defined in `01_gettin_started`.

In [2]:
from birdcall.data import *

In [3]:
len(train_ds), len(valid_ds)

(5280, 2640)

In [4]:
from fastai2.vision.all import *

In [5]:
BS = 128

dls = DataLoaders(
    DataLoader(dataset=train_ds, bs=BS, num_workers=NUM_WORKERS, shuffle=True),
    DataLoader(dataset=valid_ds, bs=BS, num_workers=NUM_WORKERS)
).cuda()

In [7]:
b = dls.train.one_batch()
b[0].shape, b[1].shape

(torch.Size([128, 160000]), torch.Size([128, 264]))

In [7]:
b[0].mean(), b[0].std()

(tensor(-0.0018, device='cuda:0'), tensor(1.3341, device='cuda:0'))

We need some sort of architecture to get started, and the one adapted from this [paper](https://www.groundai.com/project/end-to-end-environmental-sound-classification-using-a-1d-convolutional-neural-network/1) seems like a good place to begin. Bear in mind, though, that it was designed to consider the entire recording and with environmental sound classification in mind, so its performance on the task at hand might be subpar. Revisiting the architecture at a later time might be a good idea.

In [8]:
get_arch = lambda: nn.Sequential(*[
    Lambda(lambda x: x.unsqueeze(1)),
    ConvLayer(1, 16, ks=64, stride=2, ndim=1),
    ConvLayer(16, 16, ks=8, stride=8, ndim=1),
    ConvLayer(16, 32, ks=32, stride=2, ndim=1),
    ConvLayer(32, 32, ks=8, stride=8, ndim=1),
    ConvLayer(32, 64, ks=16, stride=2, ndim=1),
    ConvLayer(64, 128, ks=8, stride=2, ndim=1),
    ConvLayer(128, 256, ks=4, stride=2, ndim=1),
    ConvLayer(256, 256, ks=4, stride=4, ndim=1),
    Flatten(),
    LinBnDrop(5120, 512, p=0.25, act=nn.ReLU()),
    LinBnDrop(512, 512, p=0.25, act=nn.ReLU()),
    LinBnDrop(512, 256, p=0.25, act=nn.ReLU()),
    LinBnDrop(256, len(classes)),
    nn.Sigmoid()
])
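
The `5120` fed into the first `LinBnDrop` is not arbitrary: it is the channel count times the sequence length left after the eight convolutions. Here is a small sketch that traces the length in plain Python. It assumes each `ConvLayer` pads by `(ks - 1) // 2` on both sides with no dilation (which I believe is the fastai default when no padding is passed); `conv1d_out_len`, `LAYERS` and `flattened_size` are names made up for this illustration.

```python
# Sketch: trace the sequence length through the eight ConvLayer blocks.
# Assumption: every layer pads by (ks - 1) // 2 on both sides, no dilation.

def conv1d_out_len(length, ks, stride):
    """Standard 1D convolution output-length formula."""
    padding = (ks - 1) // 2
    return (length + 2 * padding - ks) // stride + 1

# (out_channels, kernel size, stride) for each ConvLayer in get_arch
LAYERS = [
    (16, 64, 2), (16, 8, 8), (32, 32, 2), (32, 8, 8),
    (64, 16, 2), (128, 8, 2), (256, 4, 2), (256, 4, 4),
]

def flattened_size(n_samples=160000):
    """Number of features Flatten() would hand to the first linear layer."""
    length, channels = n_samples, 1
    for channels, ks, stride in LAYERS:
        length = conv1d_out_len(length, ks, stride)
    return channels * length

print(flattened_size())  # 5120
```

If the input length changes (a different clip duration or sample rate), the first linear layer's input size has to change with it, and a helper like this gives the new value without instantiating the model.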

A couple of functions to help us calculate metrics for diagnostics:

In [9]:
#export

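def preds_to_tp_fp_fn(preds, targs):
    # Threshold the sigmoid outputs at 0.5 to get hard predictions
    positives = preds > 0.5
    # Predicted positive and labeled 1
    true_positives = positives[targs == 1]
    # Predicted positive but not labeled 1
    false_positives = positives[targs != 1]
    negatives = ~positives
    # Predicted negative but labeled 1
    false_negatives = negatives[targs == 1]
    return true_positives.sum(), false_positives.sum(), false_negatives.sum()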

def precision(preds, targs):
    tp, fp, fn = preds_to_tp_fp_fn(preds, targs)
    return (tp.float() / (tp + fp)).item()

def recall(preds, targs):
    tp, fp, fn = preds_to_tp_fp_fn(preds, targs)
    return (tp.float() / (tp + fn)).item()

def f1(preds, targs, eps=1e-8):
    prec = precision(preds, targs)
    rec = recall(preds, targs)
    return 2 * (prec * rec) / (prec + rec + eps)
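
To convince ourselves that the counting logic does what we expect, here is a plain-Python mirror of it on a tiny hand-made example. This is only a sketch: `counts` and `safe_precision` are hypothetical helpers, not part of the exported module. It also highlights one edge case: when nothing is predicted above the threshold, `tp + fp` is zero and the unguarded `precision` above divides zero by zero. The `eps` guard shown here is one possible way to handle that, not something the notebook does.

```python
# Plain-Python mirror of the tp / fp / fn counting, for illustration only.

def counts(preds, targs, threshold=0.5):
    pairs = list(zip(preds, targs))
    tp = sum(1 for p, t in pairs if p > threshold and t == 1)
    fp = sum(1 for p, t in pairs if p > threshold and t != 1)
    fn = sum(1 for p, t in pairs if p <= threshold and t == 1)
    return tp, fp, fn

def safe_precision(preds, targs, eps=1e-8):
    # eps in the denominator is an assumption added for this sketch;
    # it turns the 0 / 0 case into 0.0 instead of NaN.
    tp, fp, _ = counts(preds, targs)
    return tp / (tp + fp + eps)

preds = [0.9, 0.8, 0.2, 0.1, 0.7]
targs = [1, 0, 1, 0, 1]
print(counts(preds, targs))                   # (2, 1, 1)

# No prediction crosses the threshold: tp = fp = 0
print(counts([0.1, 0.2], [1, 0]))             # (0, 0, 1)
print(safe_precision([0.1, 0.2], [1, 0]))     # 0.0
```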

In [14]:
learn = Learner(
    dls,
    get_arch(),
    metrics=[AccumMetric(precision), AccumMetric(recall), AccumMetric(f1)],
    loss_func=BCELossFlat()
)

In [15]:
learn.fit(30, 1e-3)

epoch,train_loss,valid_loss,precision,recall,f1,time
0,0.668979,0.592719,0.012128,0.004167,0.006202,00:07
1,0.466117,0.205683,0.002481,0.000379,0.000657,00:09
2,0.24157,0.0605,0.0,0.0,0.0,00:12
3,0.12467,0.03594,0.025641,0.000379,0.000747,00:12
4,0.070555,0.029092,0.0,0.0,0.0,00:12
5,0.045886,0.027578,0.0,0.0,0.0,00:12
6,0.034517,0.025608,0.0,0.0,0.0,00:12
7,0.029238,0.025046,0.0,0.0,0.0,00:12
8,0.026782,0.024864,0.0,0.0,0.0,00:12
9,0.025567,0.024558,,0.0,,00:12


This doesn't look great! Maybe we have a bug somewhere in the training, or maybe our code for calculating metrics is buggy. It might also be that learning from raw audio is very hard, or that our architecture and the task are mismatched.

I would guess the issue lies somewhere between training on raw audio and the architecture choice. Nonetheless, this will not stop us! The first order of business is to create an end-to-end pipeline, all the way to a successful submission. Once we have this in place, we will be in a good position to start making improvements.
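
One quick diagnostic for the numbers in the training log is to compare the validation loss against a trivial baseline: a model that ignores the audio and outputs the same small probability for every class. The sketch below computes the binary cross-entropy of such a constant prediction. It assumes each clip carries a single positive label out of 264, which the recall column is consistent with (0.000379 is roughly 1/2640, one true positive across the validation set), but treat it as an assumption; `constant_bce` is a name made up for this illustration.

```python
import math

def constant_bce(p):
    """Mean binary cross-entropy when every output equals the positive rate p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

N_CLASSES = 264            # width of the target tensor in the batch above
POS_RATE = 1 / N_CLASSES   # assumption: exactly one positive label per clip

baseline = constant_bce(POS_RATE)
print(round(baseline, 4))
```

Under that assumption the baseline comes out at roughly 0.025, which is about where the validation loss settles. That would suggest the network has mostly learned the label frequency rather than anything about the recordings, which fits with precision and recall staying near zero.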

In [14]:
mkdir data/models

In [15]:
torch.save(learn.model.state_dict(), 'data/models/first_model.pth')